A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic — at a fraction of the cost.
OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry’s most rigorous test for evaluating AI agents that control computers.
That score is at least 22 percentage points above the leading models from well-funded competitors. OpenAI’s Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic’s Claude Computer Use achieves 56.3 percent.
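To put the three reported scores side by side, the short Python sketch below computes both the percentage-point gap and the relative improvement. The figures are the ones quoted above; the function name `gap_vs` is just an illustrative choice.

```python
# Success rates on Online-Mind2Web as reported in the article (percent).
scores = {"Lux": 83.6, "Operator": 61.3, "Claude Computer Use": 56.3}

def gap_vs(baseline: str, challenger: str = "Lux"):
    """Return (percentage-point gap, relative improvement in percent)."""
    points = scores[challenger] - scores[baseline]
    relative = (scores[challenger] / scores[baseline] - 1) * 100
    return round(points, 1), round(relative, 1)

for rival in ("Operator", "Claude Computer Use"):
    points, relative = gap_vs(rival)
    print(f"Lux vs {rival}: +{points} points, {relative}% relative")
```

On the reported numbers, Lux leads Operator by 22.3 points (about 36.4 percent in relative terms) and Claude Computer Use by 27.3 points (about 48.5 percent).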
“Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text,” Qin said in an exclusive interview with VentureBeat. “By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer.”
The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.
Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.
Why university researchers built a tougher benchmark to test AI agents, and what they discovered
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.
Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites — everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.
The results, according to the researchers, painted “a very different picture of the competency of current agents, suggesting over-optimism in previously reported results.”
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems — despite heavy investment and marketing fanfare — did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI’s Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.
“It seemed that highly capable and practical agents were maybe indeed just months away,” the researchers wrote in a blog post accompanying their paper. “However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict.”
The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI’s claimed performance advantage stems from what the company calls “Agentic Active Pre-training,” a training methodology that differs fundamentally from how most large language models learn.
Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but are not designed to take actions in graphical environments.
Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.
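The observe-then-act cycle described here can be sketched in a few lines of Python. This is a minimal illustration, not Lux's interface: the `Action` type, the `FakeForm` environment, and the hand-written `scripted_policy` are all hypothetical stand-ins, and a dictionary takes the place of real screenshot pixels so the example stays runnable.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # UI element name; a real agent would more likely emit coordinates
    text: str = ""

class FakeForm:
    """Hypothetical stand-in for a desktop app: one text field plus a submit button."""
    def __init__(self):
        self.field, self.submitted = "", False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch self-contained.
        return {"field": self.field, "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "type" and action.target == "field":
            self.field += action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True

def scripted_policy(goal: str, screen: dict) -> Action:
    """Hand-written rules playing the role of the trained model."""
    if screen["submitted"]:
        return Action("done")
    if screen["field"] != goal:
        return Action("type", "field", goal)
    return Action("click", "submit")

def run_agent(goal: str, env: FakeForm, policy, max_steps: int = 10) -> List[Action]:
    """Repeat: capture the screen, ask the policy for one action, execute it."""
    trace = []
    for _ in range(max_steps):
        action = policy(goal, env.screenshot())
        trace.append(action)
        if action.kind == "done":
            break
        env.apply(action)
    return trace
```

Calling `run_agent("hello", FakeForm(), scripted_policy)` yields a type action, a click, and then a final "done". In a system like the one Qin describes, the scripted rules would be replaced by a model that maps screenshots to actions.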
“The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training,” Qin told VentureBeat. “This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model.”
This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
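The feedback cycle can be illustrated with a toy simulation. The sketch below is not OpenAGI's method, whose details the company has not published. It substitutes a deliberately simple technique: the policy explores, only successful actions are kept, and the policy is refit from counts of those successes. The screens, button indices, and function names are all hypothetical.

```python
import random
from collections import Counter

NUM_ACTIONS = 5  # hypothetical: five clickable buttons per screen
TARGET = {"login": 2, "search": 0, "checkout": 4}  # hidden correct button per screen

def explore(policy, rng, episodes=200):
    """Act with the current policy and keep only (screen, action) pairs that succeeded."""
    successes = []
    for _ in range(episodes):
        screen = rng.choice(list(TARGET))
        action = rng.choices(range(NUM_ACTIONS), weights=policy[screen])[0]
        if action == TARGET[screen]:
            successes.append((screen, action))
    return successes

def train(dataset):
    """Refit the policy as action weights from success counts, with add-one smoothing."""
    counts = Counter(dataset)
    return {s: [counts[(s, a)] + 1 for a in range(NUM_ACTIONS)] for s in TARGET}

def success_rate(policy):
    """Average probability that the policy picks the correct button."""
    return sum(policy[s][TARGET[s]] / sum(policy[s]) for s in TARGET) / len(TARGET)

def self_improve(rounds=3, seed=0):
    """Alternate exploration and retraining; return the success rate after each round."""
    rng = random.Random(seed)
    policy = {s: [1] * NUM_ACTIONS for s in TARGET}  # start with uniform guessing
    dataset, history = [], [success_rate(policy)]
    for _ in range(rounds):
        dataset += explore(policy, rng)   # better policy -> more successful data
        policy = train(dataset)           # more data -> better policy
        history.append(success_rate(policy))
    return history
```

Starting from uniform guessing (a 20 percent success rate), each round of exploration adds successful examples, so the refit policy never gets worse in this toy setting. Real interfaces are far less forgiving, which is why the claim needs independent verification.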
OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
A critical distinction in OpenAGI’s announcement: Lux can control applications across an entire desktop operating system, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic’s Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications — spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit (SDK) for developers alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.
“We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model,” Qin said.
The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm — transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to “copy my bank details and paste it into a new Google doc,” Lux responded with an internal reasoning step: “The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action.” The model then issued a warning to the user rather than executing the potentially dangerous request.
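A refuse-and-warn gate of this kind can be sketched as follows. This is an illustration only: according to the company's example, Lux reasons about the request inside the model, whereas the code below swaps in a plain keyword match. The term list, the `Decision` type, and the `review_request` function are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy terms; the article does not describe Lux's actual rules.
SENSITIVE_TERMS = ("bank details", "password", "social security number")

@dataclass
class Decision:
    allowed: bool
    message: str

def review_request(request: str) -> Decision:
    """Refuse and warn when the request mentions a sensitive term; otherwise allow."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return Decision(False, f"Refused: request involves sensitive information ({term}).")
    return Decision(True, "OK to proceed.")
```

Passing the request from the company's example, "copy my bank details and paste it into a new Google doc", returns a refusal with a warning message, while an unrelated request is allowed through. A keyword filter like this is easy to evade, which is one reason adversarial testing matters.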
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent’s behavior. Whether Lux’s safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
The MIT researcher who built two of GitHub’s most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performing model could be trained from scratch for less than $100,000 — a fraction of the tens of millions typically required. The model outperformed Meta’s LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT’s Computer Science and Artificial Intelligence Laboratory.
His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated approximately 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.
Inside the billion-dollar race to build AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company’s Lux model and developer SDK are available beginning today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure performance only on the tasks they contain, and the distance between a controlled test and an eight-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup’s success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures, and that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen that story before. It rarely stays true for long.

