Can AI Truly Replace Humans? A Critical Look at Vals AI and the Limits of ChatGPT-like Models

Since the launch of ChatGPT in late 2022, fears have spread across industries worldwide that artificial intelligence (AI) will replace human jobs. Despite these concerns, tech companies have pressed ahead, rapidly advancing their AI models to enhance their capabilities.
In an effort to measure and validate these capabilities, AI developers routinely subject their models to internal testing, and those tests often yield impressive results. But these consistent successes raised the skepticism of Rayan Krishnan, who responded by founding Vals AI, a company dedicated to independently evaluating AI models.
Independent AI Evaluation: The Mission Behind Vals AI
Unlike corporate testing, Vals AI conducts rigorous and unbiased evaluations of AI models to assess their real-world accuracy and identify critical weaknesses. According to Krishnan, the lack of external review processes has resulted in overconfidence in AI performance.
In tests conducted by Vals AI on 22 popular large language models (LLMs) developed by OpenAI, Google, Anthropic, xAI, and others, none of the models exceeded a 50% accuracy rate on simple financial reasoning tasks—not complex scenarios, just standard financial queries. So what went wrong?
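At a high level, an accuracy figure like this comes from grading each model's answer against an expert-written reference and reporting the fraction it gets right. The sketch below is a hypothetical illustration of such a harness, not Vals AI's actual tooling; the query_model function, the grading rule, and the question format are all assumptions.

```python
# Hypothetical benchmark harness (illustrative only, not Vals AI's code).
# query_model is assumed to send one question to a model API and return its answer text.

def grade(answer: str, reference: str) -> bool:
    """Toy grader: normalized exact match. Real benchmarks rely on expert or rubric-based grading."""
    return answer.strip().lower() == reference.strip().lower()

def evaluate(model_name: str, questions: list[dict], query_model) -> float:
    """Return the share of questions a model answers correctly."""
    correct = 0
    for item in questions:
        answer = query_model(model_name, item["question"])  # one direct query per question
        if grade(answer, item["reference"]):
            correct += 1
    return correct / len(questions)

# Usage sketch: run the same question set against each model and report percentages.
# for model in ["model-a", "model-b"]:
#     print(model, f"{evaluate(model, questions, query_model):.1%}")
```

Under this kind of setup, a score below 50% simply means the model answered fewer than half of the expert-written questions correctly.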
A Disconnect Between Training Data and Real-World Use
Krishnan believes that current AI models are overly optimized for scientific literature and research papers, which do not necessarily reflect the practical, real-world challenges professionals face. These limitations highlight a fundamental issue: most models are evaluated using academic-style benchmarks that are widely available and not necessarily aligned with the day-to-day needs of industry experts.
To address this gap, Vals AI—working with a leading financial institution—designed a 500-question financial proficiency test specifically targeting real-world tasks. Unlike academic benchmarks, these questions focused on tools and skills used by financial analysts and reporters, such as navigating EDGAR, the U.S. Securities and Exchange Commission’s (SEC) corporate filing database.
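To give a sense of what "navigating EDGAR" involves in practice, the snippet below lists a company's most recent filings from the SEC's public submissions feed. It is a minimal sketch assuming the data.sec.gov JSON endpoint and its current field names; the CIK shown (Apple's) and the User-Agent string are placeholders.

```python
# Minimal sketch: list a company's recent SEC filings via EDGAR's public submissions feed.
# Assumes the data.sec.gov endpoint and its field names; adjust if the SEC's schema differs.
import requests

CIK = "0000320193"  # Apple's zero-padded CIK, used here purely as an example
url = f"https://data.sec.gov/submissions/CIK{CIK}.json"

# The SEC asks automated clients to identify themselves in the User-Agent header.
resp = requests.get(url, headers={"User-Agent": "example-analyst contact@example.com"})
resp.raise_for_status()

recent = resp.json()["filings"]["recent"]
for form, date, accession in list(zip(recent["form"], recent["filingDate"],
                                      recent["accessionNumber"]))[:10]:
    print(f"{date}  {form:10s}  {accession}")
```

Questions built around this kind of workflow reward models that can reason about actual filings rather than recall facts from research papers.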
Poor Performance Across the Board
According to The Washington Post, the results were underwhelming:
- OpenAI’s latest model (o3) achieved just 48.3% accuracy.
- Anthropic’s Claude Sonnet 3.7 scored 44.1%.
- Meta’s LLaMA models performed especially poorly, failing to break the 10% threshold.
These results highlight a serious shortcoming: the AI models failed basic real-world financial reasoning tasks despite being marketed as cutting-edge tools.
Unsurprisingly, the major companies involved, including OpenAI, declined to comment on or otherwise respond to the findings.
Why Did the Models Fail?
There are several likely causes:
- Misaligned training data – AI models were trained on theoretical, research-heavy datasets rather than practical, applied knowledge.
- Lack of contextual prompts – During the tests, the models were asked the questions directly, without being given pre-selected documents or supplemental context (see the sketch after this list).
- Overfitting to known benchmarks – Since many standard AI benchmarks are publicly available, companies optimize their models for those, leading to inflated performance results that don’t translate to real-world capability.
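The second point is easiest to see in the prompts themselves: asking a question cold is very different from handing the model the documents it would need. The sketch below is purely illustrative; the prompt wording and the retrieve_documents helper are hypothetical and not part of the Vals AI test.

```python
# Illustrative contrast between a bare query and a context-augmented one.
# retrieve_documents() is a hypothetical helper standing in for any retrieval step.

def build_direct_prompt(question: str) -> str:
    """How the models were reportedly tested: the question alone, with no supporting material."""
    return question

def build_contextual_prompt(question: str, retrieve_documents) -> str:
    """Alternative setup: prepend retrieved filings or excerpts before the question."""
    context = "\n\n".join(retrieve_documents(question))
    return f"Use the documents below to answer.\n\n{context}\n\nQuestion: {question}"
```

Testing with the first style measures what a model has internalized on its own, which is exactly what an analyst relying on it out of the box would experience.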
In contrast, Vals AI built its own question library based on feedback from experienced financial professionals, focusing on tasks relevant to working analysts and business journalists.
A New Trend in AI Accountability
Vals AI is one of a growing number of startups aiming to bring transparency and accountability to the AI space. With new models being released rapidly, the need for independent testing has become urgent.
Krishnan argues that third-party evaluations are essential to pushing the industry forward and ensuring that AI tools are truly useful to humans. Only by rigorously stress-testing these models can we build AI assistants that enhance productivity without risking critical errors.
“AI should not replace humans but rather assist them—if and only if its output is trustworthy and grounded in real-world accuracy,” said Krishnan.
Will AI Replace Human Jobs?
In February, Bill Gates suggested that AI could eventually replace human professionals such as doctors and teachers. His view was echoed by tech investor Victor Lazarte, who believes AI will go beyond enhancing human work and begin to fully automate it.
However, the findings from Vals AI’s recent tests cast doubt on such projections—at least for now. The models’ inability to handle basic financial queries suggests that AI still has a long way to go before it can function independently in critical decision-making roles.
Conclusion
The buzz around artificial intelligence continues to grow—but so do the questions about its real-world effectiveness. As companies race to build ever more powerful models, efforts like Vals AI remind us that accuracy, transparency, and accountability must come first. In the end, AI may not be here to replace humans, but to empower them—if it can prove itself worthy of that role.