Meet AI agents with multiple personalities


Over the next few years, AI agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they are too error-prone to be of much use.

A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks such as using apps and manipulating files, which suggests that turning to different models in different situations may help agents advance.

“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different type of problem.”

In Simular’s approach, powerful general-purpose AI models such as OpenAI’s GPT-4o and Anthropic’s Claude 3.7 are used to reason about how best to complete the task at hand.

Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but are less good at recognizing the elements of a graphical user interface.

S2 is designed to learn from experience using an external memory module, which records actions and user feedback and uses those recordings to improve future actions.

On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.

For example, S2 can complete 34.5% of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32%. Similarly, S2 scores 50% on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46%.

Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that the big AI models of the future will incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.

“This will help agents navigate GUIs with much higher precision,” Zhong says. “I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.”

To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it seemed better than some of the open source agents I have tried, including AutoGen and vimGPT.

But even the smartest AI agents are still plagued by edge cases, it seems, and sometimes exhibit odd behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop, hopping between the project page and the login for OSWorld’s Discord.

OSWorld’s benchmarks show why agents remain more hype than reality for now. While humans can complete 72% of OSWorld tasks, agents are foiled 38% of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12% of the tasks.
