Is AGI Imminent or Has Progress Plateaued?

Steve Shwartz


February 10, 2025

The release of OpenAI’s ChatGPT over two years ago ignited a surge of AI advancements, transforming industries and daily life:

  • Enhanced Internet Search: Tools like Perplexity retrieve and summarize search results, streamlining research.
  • Boosted Coding Productivity: Coding assistants like Cursor and GitHub Copilot help developers achieve greater efficiency.
  • AI-Powered Meeting Assistance: Meeting assistants generate summaries and action items for Zoom meetings.
  • Content Summarization: NotebookLM synthesizes key insights from scientific papers and YouTube videos.
  • Automated Writing Assistance: Tools like ChatGPT generate blog posts, emails, and speeches with ease.
  • System Control & Diagnostics: AI models now manage PCs and assist with troubleshooting.
  • Improved Predictions: AI enhances weather forecasts, flood predictions, and medical diagnoses.
  • And much more…

The influence of AI continues to expand, with rapid technological advancements emerging almost daily.

The Debate on AGI’s Proximity

Despite AI’s rapid progress, experts remain divided on the timeline for achieving Artificial General Intelligence (AGI). There have been many proposed definitions of AGI over the years. We’ll use one that is arguably the easiest to measure: In 2006, Nils Nilsson proposed that a machine has achieved human-level intelligence (i.e. AGI) when it is capable of doing all human jobs. And, because humanoid robots are still in their infancy, let’s limit the discussion to desk jobs.

As of late 2024, AI systems had not yet reached AGI. OpenAI CEO Sam Altman has suggested that AGI is imminent, comparing the expected leap from GPT-4 to GPT-5 to the improvement from GPT-3 to GPT-4. However, reports indicate that GPT-5’s 2024 training runs were expensive and underwhelming, casting doubt on AGI’s rapid arrival. Altman has since moderated his predictions, acknowledging the persistent challenges.

The Role of Reasoning in AGI

Altman has stated that the primary weakness of current LLMs lies in their reasoning abilities, and that they frequently make errors that even a six-year-old would easily avoid.

Reasoning is the cognitive process we use to plan, solve problems, and make decisions. It involves collecting information from our surroundings, our memories, and external sources like the internet. We then synthesize this data — sometimes following a structured, logical progression, but often through a mix of trial and error, intuition, reflection, and re-evaluation. This process necessitates evaluating alternatives, considering different perspectives, and leveraging past experiences to generate insights and solutions.

Reasoning is not solely the province of scientists or intellectuals. It is a fundamental trait of all conscious beings. Even animals exhibit rudimentary reasoning skills. A monkey, for instance, can be trained using classical conditioning to recognize and execute a deductive logic pattern for a task. A human, however, would not only grasp the reasoning steps but also detect the underlying intent behind the training. They might question the motives of the trainer, formulate hypotheses about the manipulation, and even devise strategies to counteract it. This capacity for self-awareness and adaptive thinking remains a distinguishing feature of human cognition.

Are LLMs merely learning structured reasoning patterns like a trained animal, or are they developing human-like thought processes?

Reasoning Benchmarks

LLMs excel in knowledge-based standardized tests, including the GRE and professional exams, and their improving performance on reasoning benchmarks suggests they are acquiring some structured reasoning skills.

For example, OpenAI’s o3 model, released in December 2024, achieved 87.5% on the ARC-AGI benchmark — a dramatic improvement from the previous 32% state-of-the-art. Developed by former Google AI researcher François Chollet, ARC-AGI had been difficult for LLMs, despite smart humans scoring 95% with no training.
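For readers who haven’t seen it, ARC-AGI tasks are small colored-grid puzzles: a handful of input/output examples demonstrate a transformation, and the solver must infer the rule and apply it to a new grid. The toy puzzle below is purely illustrative (it is not an actual ARC-AGI task); its hidden rule is simply “mirror each row left to right,” which a person infers at a glance.

```python
# Illustrative ARC-style puzzle (not a real ARC-AGI task): each grid is a
# small matrix of color codes, and the hidden rule is "mirror left to right".
train_examples = [
    {"input": [[1, 0, 0],
               [2, 2, 0]],
     "output": [[0, 0, 1],
                [0, 2, 2]]},
    {"input": [[3, 3, 0],
               [0, 4, 4]],
     "output": [[0, 3, 3],
                [4, 4, 0]]},
]

test_input = [[5, 0, 0],
              [0, 6, 6]]

def solve(grid):
    """Apply the rule a human infers from the examples: mirror each row."""
    return [list(reversed(row)) for row in grid]

print(solve(test_input))  # [[0, 0, 5], [6, 6, 0]]
```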

Chollet will soon release a second version of the benchmark with a new set of questions that are again easy for humans but pose significant difficulty for LLMs, including o3. OpenAI may then figure out how to build a new LLM that does well on the new benchmark.

The New York Times reported on a new benchmark named “Humanity’s Last Exam” on which the best AI systems achieve only 8% correct. The creator of that exam predicted that, by the end of 2025, many AI systems would do very well on it.

Does this mean that LLMs are reasoning more and more like humans?

Some people would argue that each new test just unfairly raises the bar. Perhaps. But it’s also likely that vendors of AI systems will design training data intended specifically to improve performance on this exam. So perhaps not.

Chollet asserts that neither ARC-AGI nor any of its successors is a test for AGI. He argues that we’ll know AGI has been achieved if it ever becomes impossible to create a benchmark that is easy for humans but difficult for the latest LLMs without training on the training set.

Chollet’s Theory

Chollet hypothesizes that LLMs function by mapping inputs into a high-dimensional vector space containing both knowledge (e.g. facts) and step-by-step reasoning programs. When asked to retrieve a fact, the model maps the natural language request into this space and either takes the closest fact or uses interpolation to produce a fact. Sometimes this interpolation gives the correct answer and sometimes it produces a hallucination.
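Here is a minimal sketch of that hypothesis, using made-up two-dimensional “embeddings” rather than anything a real LLM computes. A query is answered by the nearest stored fact, and by blending neighbors when nothing is close enough, which is where hallucinations could creep in:

```python
import numpy as np

# Toy illustration of Chollet's hypothesis (not how any real LLM is built):
# facts live at points in a vector space; a query is answered by the nearest
# fact, or by blending neighbors when no single fact is close enough.
facts = {
    "Paris is the capital of France":    np.array([0.9, 0.1]),
    "Berlin is the capital of Germany":  np.array([0.8, 0.3]),
    "Water boils at 100 C at sea level": np.array([0.1, 0.9]),
}

def answer(query_vec, threshold=0.15):
    names = list(facts)
    dists = [np.linalg.norm(query_vec - facts[n]) for n in names]
    order = np.argsort(dists)
    if dists[order[0]] < threshold:
        return names[order[0]]            # confident retrieval
    # Otherwise "interpolate" between the two nearest facts -- sometimes
    # right, sometimes a plausible-sounding hallucination.
    return f"blend of: {names[order[0]]} / {names[order[1]]}"

print(answer(np.array([0.88, 0.12])))  # lands near a stored fact
print(answer(np.array([0.5, 0.5])))    # far from everything -> blended guess
```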

A similar process occurs for reasoning programs. For example, one can instruct a language model to use step-by-step, aka Chain of Thought (CoT), reasoning to solve a word problem. Chollet hypothesizes that the steps in the chain of thought are the steps in the memorized program. During training, the language models see large numbers of these step-by-step programs and memorize them.
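For concreteness, here is what such a memorized step-by-step “program” looks like at the prompt level. The word problem and the response are invented for illustration:

```python
# A chain-of-thought style exchange (illustrative only). Under Chollet's
# hypothesis, the numbered steps follow a template the model absorbed from
# many similar worked examples in its training data.
prompt = ("Solve step by step: A train goes 60 miles in hour one and "
          "45 miles in hour two. What is the total distance?")

memorized_program_response = (
    "Step 1: Distance in hour one = 60 miles.\n"
    "Step 2: Distance in hour two = 45 miles.\n"
    "Step 3: Total = 60 + 45 = 105 miles.\n"
    "Answer: 105 miles."
)
print(memorized_program_response)
```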

Andrej Karpathy, who was a founding member of OpenAI and later the head of AI at Tesla, provides a great explanation of how these models learn these step-by-step programs. He explains in easy-to-understand terms how reinforcement learning is used to explore alternative step-by-step paths (programs) for each problem in the training set, select the best path, and learn to solve that problem, and similar problems, with that step-by-step path.
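A highly simplified sketch of that idea follows. The “model” here is just a weight per candidate program over a toy family of word problems; it is not OpenAI’s actual training procedure, but it shows the explore, score, and reinforce loop Karpathy describes:

```python
import random

# Toy reinforcement loop over step-by-step "programs" (illustrative only).
# Sampling explores candidate programs; programs whose final answer is
# correct get reinforced, so the model learns which program to apply.
programs = {
    "add":      lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
}
weights = {name: 1.0 for name in programs}

# Training problems of the form ((a, b), correct_answer); all are additions.
training_set = [((44, 58), 102), ((7, 5), 12), ((30, 12), 42)]

for _ in range(200):
    (a, b), correct = random.choice(training_set)
    # Sample a program in proportion to its current weight (exploration).
    name = random.choices(list(programs), weights=list(weights.values()))[0]
    if programs[name](a, b) == correct:
        weights[name] *= 1.1   # reinforce the successful step-by-step path
    else:
        weights[name] *= 0.95  # discourage paths that reach wrong answers

print(max(weights, key=weights.get))  # "add" dominates after training
```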

The high-dimensional vector space stores similar programs near one another. When the LLM encounters a problem that it didn’t encounter in training, it selects the closest program in the vector space. Sometimes this program will produce the right answer, especially if the model has seen many similar problems during training. And sometimes it will produce the wrong answer.

The notion of CoT programs stored in a high-dimensional space also explains why LLMs improve performance on benchmarks when there is a training set for that benchmark. They don’t need to be trained on the exact questions in the test set. Instead, they can be trained on similar questions that map to nearby locations in the high-dimensional space. The model can then apply the closest program.
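Continuing the toy picture from above (again with made-up embeddings), a benchmark question that resembles the training questions lands near the right program and gets solved; an out-of-distribution question lands near the wrong program and gets the wrong answer:

```python
import numpy as np

# Toy sketch: step-by-step "programs" stored at points in a vector space.
# A new problem is solved by retrieving the closest program and running it.
stored_programs = {
    "total of two quantities":      (np.array([0.9, 0.1]), lambda a, b: a + b),
    "difference of two quantities": (np.array([0.1, 0.9]), lambda a, b: a - b),
}

def solve(problem_vec, a, b):
    name = min(stored_programs,
               key=lambda n: np.linalg.norm(problem_vec - stored_programs[n][0]))
    return name, stored_programs[name][1](a, b)

# A question phrased like the "total" training questions maps nearby,
# so the right program is retrieved...
print(solve(np.array([0.85, 0.15]), 44, 58))  # ('total of two quantities', 102)
# ...but a question that lands closer to the wrong program gets the wrong one.
print(solve(np.array([0.4, 0.6]), 44, 58))    # ('difference of two quantities', -14)
```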

So, one explanation for o3’s success on reasoning benchmarks is that o3 has been trained on many more examples of math, coding, and visual reasoning problems than earlier, lower-performing LLMs. Some of these are specific training examples from the benchmark. Others are similar examples from the internet. From these examples, o3 has memorized the reasoning rules needed for these specific domains and is better at retrieving and applying them.

It’s not just o3. If one looks at progress on the most popular reasoning benchmarks, numerous models have scores that are approaching 100% accuracy. Does this mean they are approaching AGI? Or does it mean that they are training on more and better examples of questions that are similar to the ones on these reasoning benchmarks?

As Santa Fe Institute Professor Melanie Mitchell points out:

…if LLMs rely primarily on memorization and pattern-matching rather than true reasoning, then they will not be generalizable — we can’t trust them to perform well on “out of distribution” tasks, those that are not sufficiently similar to tasks they’ve seen in the training data.

If LLMs are just memorizing and pattern-matching, then they are just getting better at specific tasks and not making significant progress towards the generalized reasoning characteristic of AGI.

Apple Study

An October 2024 study by Apple researchers provides evidence that LLMs are just memorizing reasoning patterns. They tested LLMs on the GSM8K benchmark, which contains 8,500 grade-school math word problems.

When GSM8K was first released, the best model scored 35% correct. OpenAI’s o1 preview model (released September 2024) scored 78.2% correct and a preview of its o3 model (not yet released) scored 94.2% correct.

However, the Apple researchers found that minor variations in the word problems caused significant degradation in performance on 24 LLMs including the o1-preview model. One of their tests showed up to a 65% performance drop when they added an irrelevant clause to the word problem. For example,

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

The correct answer here is 44 + 58 + 88 = 190. The clause about the five smaller kiwis is irrelevant because our common sense tells us that, although they are smaller, they are still kiwis. However, the LLM answers 44 + 58 + 88 - 5 = 185. Presumably it has blindly applied a reasoning pattern without using the required commonsense reasoning.
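In code terms, the correct solution ignores the distractor clause entirely, while the failure mode corresponds to folding every number in the problem into the memorized template. This is a sketch of the arithmetic only, not of anything happening inside the model:

```python
# The kiwi problem from the Apple study, computed both ways.
friday, saturday = 44, 58
sunday = 2 * friday          # "double the number he did on Friday"
smaller_kiwis = 5            # irrelevant: smaller kiwis are still kiwis

correct = friday + saturday + sunday                          # 44 + 58 + 88 = 190
pattern_matched = friday + saturday + sunday - smaller_kiwis  # 185: the distractor
                                                              # number gets subtracted
                                                              # because it "looks like"
                                                              # it belongs in the template
print(correct, pattern_matched)  # 190 185
```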

Even just changing the numerical values in the word problems led to performance degradation.

The Apple researchers concluded that “…this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”

The Future of AGI: Breakthrough or Stagnation?

If LLMs rely primarily on memorization and pattern-matching, then AGI remains distant, and current architectures may not be capable of achieving it. Some AI researchers, like Meta’s Yann LeCun, argue that a fundamental breakthrough is required to move beyond scaling current models.

Will AGI emerge in our lifetimes? The coming years will be telling as researchers push the limits of hyper-scaling. One thing is certain: the debate over AI’s future is far from settled, and the journey toward AGI remains as intriguing as ever.
