Ask HN: Why is it taking so long to build computer-controlling agents?

9 points by louisfialho 17 hours ago

I'm not a PhD, but I assume training computer-controlling agents is a straightforward problem: we can define clear tasks (e.g. schedule an appointment with details xyz, or buy product xyz) on real or generated websites, and just let the models figure out where to click (via a VLM) and learn through RL.

What am I missing, why isn't this a solved problem by now?
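The loop the question imagines — sketched here purely as an illustration, with a stubbed-out environment and a stubbed VLM policy (all names and the toy reward are hypothetical, not any real system) — would look something like:

```python
import random

class FakeWebEnv:
    """Toy stand-in for a real or generated website (hypothetical)."""
    def __init__(self, target=(40, 80)):
        self.target = target  # pixel coords of the "correct" button

    def screenshot(self):
        # A real env would return pixels; here, just a placeholder observation.
        return {"width": 100, "height": 100}

    def click(self, x, y):
        # Reward 1.0 only if the click lands exactly on the target button.
        return 1.0 if (x, y) == self.target else 0.0

def vlm_policy(observation, task):
    """Stub for a vision-language model proposing a click location."""
    return (random.randrange(observation["width"]),
            random.randrange(observation["height"]))

def run_episode(env, task):
    obs = env.screenshot()
    x, y = vlm_policy(obs, task)
    reward = env.click(x, y)
    return reward  # in RL training, this signal would update the policy

env = FakeWebEnv()
rewards = [run_episode(env, "schedule appointment xyz") for _ in range(1000)]
print(sum(rewards))  # nearly always ~0: almost every random click misses
```

The replies below in effect point out that every stub here hides a hard problem — parsing the screenshot, the policy itself, and verifying success for the reward — and that with rewards this sparse, naive RL learns extremely slowly.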

simne 5 hours ago

As people have already said, the computer-agent problem is a complex one; at the simplest level it needs to be broken down into a few simpler problems.

I will list some of those simpler problems:

1. Some sort of reliable screen reading, capable of handling all sorts of screen output (not just HTML-like or otherwise pre-structured markup).

2. Some sort of universal optimizer, capable of solving any task a human could solve in a simplified computer environment.

3. Some sort of reliable "Understanding Engine" that accepts queries in a simplified language easy for humans to use, which we could theoretically build in a few different ways (I list only the two best known):

3a. Some deep learning AI.

3b. Some huge implementation of semantic AI.
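For contrast, item 1 is trivially easy when the screen output *is* already structured — a toy sketch, using a made-up accessibility-tree format (the node shape and labels here are invented for illustration) — and that structure is exactly what raw pixels don't give you:

```python
# Toy "screen reader" over an already-structured UI tree (the easy case).
# Real screen output is pixels with no such tree attached; recovering
# this structure reliably from pixels is the hard part of item 1.
tree = {
    "role": "window",
    "children": [
        {"role": "text", "label": "Book an appointment"},
        {"role": "button", "label": "Confirm", "bounds": (120, 200, 80, 24)},
    ],
}

def find_clickable(node, label):
    """Depth-first search for a button by label in a structured UI tree."""
    if node.get("role") == "button" and node.get("label") == label:
        return node["bounds"]
    for child in node.get("children", []):
        found = find_clickable(child, label)
        if found:
            return found
    return None

print(find_clickable(tree, "Confirm"))  # (120, 200, 80, 24)
```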

muzani 11 hours ago

It has been done:

https://youtube.com/watch?v=shnW3VerkiM

https://youtube.com/watch?v=VQhS6Uh4-sI

The first one looks more impressive; the second is more reliable.

I think the real hard part is nobody wants to maintain these, and nobody really wants to pay to use them either. It's a lot of work and not something people do for free. It's no surprise these emerged (and won) in hackathons.

All the major operating-system vendors are dedicating their full efforts to this, so it doesn't make much sense to actually raise money and do it yourself.

jfbfkdnxbdkdb 13 hours ago

Because an LLM != a generally intelligent mind...

Whilst they are a massive step forward ... we still have a long way to go for that...

Why not try it yourself with Ollama, a large model, and some rented hardware ... You will get something ... but it will not be consistent...

  • jfbfkdnxbdkdb 13 hours ago

    Not to be a doubter in LLMs being powerful ... just that every time I try them ... they just don't do what I want....

    • mu53 13 hours ago

      Have you tried adding "please"? I found that it works wonders.

      • collingreen 11 hours ago

        I can't tell if this is serious or tongue in cheek and I find that both funny and deeply discouraging about the state of the world. For some reason it's giving me Rick and Morty butter robot vibes.

      • jfbfkdnxbdkdb 10 hours ago

        Tried that ... but competently writing Rust is just not a priority for the LLMs I chat with.

wavemode 14 hours ago

There are technical limitations, sure, (getting an AI to parse a screen and interact with it via mouse and keyboard is harder than it sounds - and it sounds hard to start with) but the main limitation is still economical. Does it really make sense to train a multi-billion-parameter AI to click buttons, if you could instead just make an API call?

There's an intersection between "high accuracy" and "low cost" that AI has not quite reached yet for this sort of task, when compared to simpler and cheaper alternatives.
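To make that comparison concrete — assuming a hypothetical appointment service that exposes both a web UI and an HTTP endpoint (the `api.example.com` URL and payload fields are invented) — the API route is a few lines of structured data, while the UI route needs an entire perception-action stack:

```python
import json
from urllib.request import Request

# Direct API route: one request, structured data in, structured data out.
# (The endpoint and payload are hypothetical; we build but don't send it.)
payload = {"service": "haircut", "time": "2024-05-01T10:00"}
req = Request(
    "https://api.example.com/appointments",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# UI-agent route: each step below needs a model call plus error handling,
# and any one of them can fail on a redesign, popup, or CAPTCHA.
ui_steps = [
    "screenshot the page",
    "locate the booking form (vision model)",
    "type the service name",
    "open the date picker and select the slot",
    "click 'Confirm' and verify success from pixels",
]
print(req.get_method(), req.full_url, "vs", len(ui_steps), "UI steps")
```

The asymmetry is the economic point: the API call costs microseconds and fractions of a cent, while each UI step above is a multi-billion-parameter model inference.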

  • hildolfr 13 hours ago

    People are using huge capable LLMs to answer things like "what's five percent of 250"; I don't see a big leap in using them to skip APIs.

    On the other side, a lot of user-facing access methods are more capable than their API equivalents; people already use tools like AutoHotkey to work around such limitations -- and if people are already working around things that way, that must indicate the presence of some sort of market.

tacostakohashi 16 hours ago

Because the websites want to serve ads to humans, upsell you, and get you to sign up for their credit card too, so their implementations are highly obfuscated and dynamic.

If they wanted to be easy to work with, they'd offer a simple API, or plain HTML form interface.

louisfialho 15 hours ago

Thanks for the answers. Even the unexpected patterns like pop-ups feel pretty structured to me - I would expect models to generalize and navigate any of them. I could see more websites blocking agents in the future, but it seems like we're so early that this is not a limiting factor yet.

louisfialho 17 hours ago

If someone is actively working on this and believes there is a path please reach out to me

42lux 16 hours ago

Because reality has a lot of details.

MattGaiser 16 hours ago

I have experience with a tiny part of this problem: accessing the various websites and figuring out where to click.

Presently, doing this requires a fair bit of continuous work.

Many websites don't want bots on them and actively use countermeasures that would block operators the same way they block scrapers. There is a ton of stuff a website can do to break those bots, and they do it. Some even feed back "phantom" data to make the process less reliable.

There are a lot of businesses out there where the business model breaks if someone else can see the whole board.