Ask HN: Why is it taking so long to build computer-controlling agents?

9 points by louisfialho 17 hours ago

I'm not a PhD, but I assume training computer-controlling agents is a straightforward problem: we can define clear tasks (e.g. schedule an appointment with details xyz, or buy product xyz) on real or generated websites, and just let the models figure out where to click (via a VLM) and learn through RL.

What am I missing, why isn't this a solved problem by now?
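The loop the question imagines — sketched here purely as an illustration, with a stubbed-out environment and a stubbed VLM policy (all names and the toy reward are hypothetical, not any real system) — would look something like:

```python
import random

class FakeWebEnv:
    """Toy stand-in for a real or generated website (hypothetical)."""
    def __init__(self, target=(40, 80)):
        self.target = target  # pixel coords of the "correct" button

    def screenshot(self):
        # A real env would return pixels; here, just a placeholder observation.
        return {"width": 100, "height": 100}

    def click(self, x, y):
        # Reward 1.0 only if the click lands exactly on the target button.
        return 1.0 if (x, y) == self.target else 0.0

def vlm_policy(observation, task):
    """Stub for a vision-language model proposing a click location."""
    return (random.randrange(observation["width"]),
            random.randrange(observation["height"]))

def run_episode(env, task):
    obs = env.screenshot()
    x, y = vlm_policy(obs, task)
    reward = env.click(x, y)
    return reward  # in RL training, this signal would update the policy

env = FakeWebEnv()
rewards = [run_episode(env, "schedule appointment xyz") for _ in range(1000)]
print(sum(rewards))  # nearly always ~0: almost every random click misses
```

The replies below in effect point out that every stub here hides a hard problem — parsing the screenshot, the policy itself, and verifying success for the reward — and that with rewards this sparse, naive RL learns extremely slowly.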

simne 5 hours ago

As people have already said, the computer-agent problem is a complex one; at the simplest level it needs to be broken down into a few simpler problems.

I will list some of those simpler problems:

1. Some sort of reliable screen reading, capable of handling all sorts of screen output (not just HTML-like or otherwise pre-structured markup).

2. Some sort of universal optimizer, capable of solving any task a human could solve in a simplified computer environment.

3. Some sort of reliable "Understanding Engine" that accepts queries in a simplified language easy for humans to use, which we could theoretically build in a few different ways (I list only the two best known):

3a. Some deep learning AI.

3b. Some huge implementation of semantic AI.
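For contrast, item 1 is trivially easy when the screen output *is* already structured — a toy sketch, using a made-up accessibility-tree format (the node shape and labels here are invented for illustration) — and that structure is exactly what raw pixels don't give you:

```python
# Toy "screen reader" over an already-structured UI tree (the easy case).
# Real screen output is pixels with no such tree attached; recovering
# this structure reliably from pixels is the hard part of item 1.
tree = {
    "role": "window",
    "children": [
        {"role": "text", "label": "Book an appointment"},
        {"role": "button", "label": "Confirm", "bounds": (120, 200, 80, 24)},
    ],
}

def find_clickable(node, label):
    """Depth-first search for a button by label in a structured UI tree."""
    if node.get("role") == "button" and node.get("label") == label:
        return node["bounds"]
    for child in node.get("children", []):
        found = find_clickable(child, label)
        if found:
            return found
    return None

print(find_clickable(tree, "Confirm"))  # (120, 200, 80, 24)
```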

muzani 11 hours ago

It has been done:

https://youtube.com/watch?v=shnW3VerkiM

https://youtube.com/watch?v=VQhS6Uh4-sI

The first one looks more impressive; the second is more reliable.

I think the real hard part is nobody wants to maintain these, and nobody really wants to pay to use them either. It's a lot of work and not something people do for free. It's no surprise these emerged (and won) in hackathons.

All the major operating-system vendors are dedicating their full efforts to this, so it doesn't make much sense to actually raise money and do it yourself.

jfbfkdnxbdkdb 13 hours ago

Because an LLM != a generally intelligent mind...

Whilst they are a massive step forward ... we still have a long way to go for that...

Why not try it yourself with Ollama, a large model, and some rented hardware ... You will get something ... but it will not be consistent...

  • jfbfkdnxbdkdb 13 hours ago

    Not to be a doubter in LLMs being powerful ... just that every time I try them ... they just don't do what I want....

    • mu53 13 hours ago

      Have you tried adding "please"? I found that it works wonders.

      • collingreen 11 hours ago

        I can't tell if this is serious or tongue in cheek and I find that both funny and deeply discouraging about the state of the world. For some reason it's giving me Rick and Morty butter robot vibes.

      • jfbfkdnxbdkdb 10 hours ago

        Tried that ... but competently writing Rust is just not a priority for the LLMs I chat with.

wavemode 14 hours ago

There are technical limitations, sure, (getting an AI to parse a screen and interact with it via mouse and keyboard is harder than it sounds - and it sounds hard to start with) but the main limitation is still economical. Does it really make sense to train a multi-billion-parameter AI to click buttons, if you could instead just make an API call?

There's an intersection between "high accuracy" and "low cost" that AI has not quite reached yet for this sort of task, when compared to simpler and cheaper alternatives.
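To make that comparison concrete — assuming a hypothetical appointment service that exposes both a web UI and an HTTP endpoint (the `api.example.com` URL and payload fields are invented) — the API route is a few lines of structured data, while the UI route needs an entire perception-action stack:

```python
import json
from urllib.request import Request

# Direct API route: one request, structured data in, structured data out.
# (The endpoint and payload are hypothetical; we build but don't send it.)
payload = {"service": "haircut", "time": "2024-05-01T10:00"}
req = Request(
    "https://api.example.com/appointments",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# UI-agent route: each step below needs a model call plus error handling,
# and any one of them can fail on a redesign, popup, or CAPTCHA.
ui_steps = [
    "screenshot the page",
    "locate the booking form (vision model)",
    "type the service name",
    "open the date picker and select the slot",
    "click 'Confirm' and verify success from pixels",
]
print(req.get_method(), req.full_url, "vs", len(ui_steps), "UI steps")
```

The asymmetry is the economic point: the API call costs microseconds and fractions of a cent, while each UI step above is a multi-billion-parameter model inference.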

  • hildolfr 13 hours ago

    People are using huge capable LLMs to answer things like "what's five percent of 250"; I don't see a big leap in using them to skip APIs.

    On the other side, a lot of user-facing access methods are more capable than their API equivalents; people already use tools like AutoHotkey to work around such limitations -- and if people are already working around things that way, that must indicate the presence of some sort of market.

tacostakohashi 16 hours ago

Because the websites want to serve ads to humans, upsell you, and get you to sign up for their credit card too, so their implementations are highly obfuscated and dynamic.

If they wanted to be easy to work with, they'd offer a simple API, or plain HTML form interface.

louisfialho 15 hours ago

Thanks for the answers. Even the unexpected patterns like pop-ups feel pretty structured to me - I would expect models to generalize and navigate any of them. I could see more websites blocking agents in the future, but it seems like we're so early that this is not a limiting factor yet.

louisfialho 17 hours ago

If someone is actively working on this and believes there is a path please reach out to me

42lux 16 hours ago

Because reality has a lot of details.

MattGaiser 16 hours ago

I have experience with a tiny part of this problem: accessing the various websites and figuring out where to click.

Presently, doing this requires a fair bit of continuous work.

Many websites don't want bots on them and actively use countermeasures that would block operators the same way they block scrapers. There is a ton of stuff a website can do to break those bots, and they do it. Some even feed back "phantom" data to make the process less reliable.

There are a lot of businesses out there where the business model breaks if someone else can see the whole board.