Don’t Be Fooled By ChatGPT: It’s Not As Smart As You Think

Steve Shwartz
Jul 3, 2024


ChatGPT can write a sonnet that would make Shakespeare jealous, whip up code like a Silicon Valley prodigy, and hold a conversation that feels eerily close to talking to a friend. We marvel at the abilities of digital wordsmiths like ChatGPT and other large language models. But beneath the smooth surface of language, a crucial question simmers: are language models truly capable of reasoning, or are they simply masters of mimicry, their impressive outputs a clever illusion woven from vast stores of memorized data?

People use reasoning to plan, to solve problems, and to make decisions. Reasoning is a symphony of cognition. We gather information — the notes of our thoughts — from the world around us, the depths of our memories, and the vast orchestra of the internet. Then, we weave these notes together, sometimes in a logical, linear progression, but often in a whirlwind of leaps, backtracks, and intuitive flashes. We explore countless paths, weigh possibilities, consider multiple perspectives, and draw upon a lifetime of experiences, all in pursuit of solutions and insights.

Can language models, for all their linguistic prowess, truly replicate (or equal) this intricate dance of human thought? Can they grasp the nuances, the leaps of intuition, the messy yet beautiful complexity of true reasoning? And if not today, can we expect them to achieve this cognitive milestone in the near future? The answers to these questions hold profound implications for the future of AI and our relationship with these increasingly sophisticated machines.

Language models memorize information

Language models are trained on massive bodies of text to predict the next word. But what must they learn in order to perform that task?
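
To make the training objective concrete, here is a minimal sketch using nothing more than a toy bigram counter. Real language models learn billions of neural-network parameters over subword tokens, but the training signal is conceptually this simple: given what came before, guess what comes next.

```python
# A toy sketch of next-word prediction, not how production models work:
# they use neural networks over subword tokens, but the objective is the same.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each preceding word (a bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in training."""
    counts = next_word_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' follows 'the' most often in this tiny corpus
print(predict_next("sat"))  # 'on'
```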

Research by Nicholas Carlini and his colleagues at Google Brain and Berkeley demonstrated that language models have a strong tendency to memorize sentence fragments from the training data, especially those containing nouns and numbers. Remarkably, they were able to extract 10,000 strings of memorized training data from language models and estimated that it would be possible to extract ten times that amount. This tendency to memorize grows as the models scale up in parameters, computation, and training data. Other studies have found that the more frequently a piece of information appears in the training set, the more likely it is to be memorized.
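
As a rough illustration of what “verbatim memorization” means here, the following is a simplified sketch, assuming access to both a model’s output and its training corpus. Studies like Carlini’s use far more careful procedures (deduplicated data, many prompting strategies), but the core test is a substring match of this kind.

```python
# A simplified sketch of a verbatim-memorization check. The corpus and output
# below are invented; real studies compare against the full training set.

def is_memorized(model_output: str, training_corpus: str, min_chars: int = 50) -> bool:
    """Flag the output as memorized if a sufficiently long span of it
    appears verbatim in the training corpus."""
    for start in range(0, max(1, len(model_output) - min_chars + 1)):
        chunk = model_output[start:start + min_chars]
        if len(chunk) == min_chars and chunk in training_corpus:
            return True
    return False

corpus = ("Call me Ishmael. Some years ago - never mind how long precisely - "
          "having little or no money in my purse ...")
output = ("As the novel opens: Call me Ishmael. Some years ago - never mind "
          "how long precisely - having little or no money in my purse")
print(is_memorized(output, corpus, min_chars=30))  # True: a 30-character span matches
```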

Beyond verbatim memorization, language models appear to acquire a great deal of knowledge of the world. For example, ChatGPT can answer questions like this one:

User: Explain the moral and social issues with using AI to do your homework.

ChatGPT: There are a few potential moral and social issues with using AI to do homework. First, if students are using AI to do their homework for them, they may not be learning the material as well as they could be. This could lead to problems down the road when they are expected to know the material for exams or real-world applications. Additionally, using AI to do homework could lead to cheating. If students are able to get away with using AI to do their homework for them, it could create an unfair advantage over other students who are doing their homework on their own. Finally, using AI to do homework could lead to a reliance on technology that could be detrimental if the technology were to fail or become unavailable.

ChatGPT’s ability to craft well-formed responses raises a question: how does it access and utilize information?

While encountering similar discussions in its training data might play a role, ChatGPT appears to do more than simply parrot back memorized phrases. Instead, it seems to organize and store information in a way that allows it to generate novel responses, much like a chef combining familiar ingredients into a new dish.

For example, ChatGPT might have learned about homework aids in general, from CliffsNotes to online tutors. When asked about AI as a homework aid, it applies this existing knowledge to this new context, demonstrating a degree of flexibility beyond simple memorization.

The depth of language model knowledge becomes apparent when we look at performance on various benchmarks. Language models have demonstrated a surprising aptitude across a range of subjects, often rivaling or exceeding human capabilities:

Factual Recall: Language models like GPT-3, LLaMA 2, and PaLM 2 achieve impressive accuracy on datasets like Natural Questions and Web Questions, which test knowledge of simple facts. Imagine going head-to-head with a language model on a trivia night — you might be surprised at who comes out on top!

Academic Prowess: GPT-4 scores in the 80th percentile or higher on standardized tests like the GRE and SAT subject tests, covering a breadth of subjects from history and literature to biology and physics. It even holds its own on college entrance exams in China.

Professional Expertise: The capabilities of language models extend beyond the classroom. GPT-4 has demonstrated competency on professional exams like the Uniform Bar Exam (law) and medical licensing exams, even surpassing the performance of some medical students. It can even answer questions on the nuances of wine well enough to pass the Advanced Sommelier class!

While the inner workings of language models remain a mystery in many ways, their performance on these benchmarks provides compelling evidence that they are capable of acquiring, storing, and utilizing vast amounts of information across a wide range of disciplines.

The question remains: how far can this knowledge acquisition go, and will it ever truly mirror the depth and flexibility of human understanding?

Language models memorize reasoning patterns

Even the most advanced language models, despite their awe-inspiring capabilities, sometimes stumble in ways that reveal the limitations of their artificial intelligence.

Yejin Choi, a leading NLP researcher, aptly captured this duality in her 2023 TED Talk, “AI is incredibly smart and shockingly stupid.” She illustrated this point with a simple example: ask a language model to calculate how long it would take to dry thirty pieces of clothing if five pieces take five hours. Most people would apply their commonsense reasoning to recognize that the pieces dry concurrently, so the thirty pieces still need only five hours. However, the language model apparently didn’t use commonsense reasoning; it latched onto an incorrect reasoning pattern: if five pieces take five hours, that is one hour per piece, so thirty pieces should take thirty hours.

A similar issue arises with what researchers call the “Alice in Wonderland” problem. Ask a language model how many sisters Alice’s brother has if Alice has one brother and three sisters. Many language models, even sophisticated ones like GPT-4, stumble here. They fail to grasp that Alice is included in her brother’s count of sisters, often incorrectly answering “three.”
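
For concreteness, here is the arithmetic behind both puzzles as a short worked example. The code is trivial on purpose: the hard part is choosing the right model of the situation, not the calculation.

```python
# Both puzzles worked out. The commonsense step (drying happens concurrently,
# Alice is herself one of the sisters) is what the memorized "rate times
# quantity" or "count the list" patterns skip.

# Drying clothes: if 5 pieces dry concurrently in 5 hours, 30 pieces still
# take 5 hours, assuming there is room to hang them all at once.
hours_for_five = 5
correct_hours_for_thirty = hours_for_five            # concurrent drying
flawed_hours_for_thirty = (hours_for_five / 5) * 30  # "one hour per piece"
print(correct_hours_for_thirty, flawed_hours_for_thirty)  # 5 vs. 30.0

# Alice's brother's sisters: Alice has 1 brother and 3 sisters, so her brother
# has those same 3 sisters plus Alice herself.
alices_sisters = 3
brothers_sisters = alices_sisters + 1  # include Alice
print(brothers_sisters)  # 4, not 3
```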

(Note: If you try these examples now, you might get correct answers. Major language model vendors are quick to release updates that fix reported errors like these. While they don’t disclose their methods, it’s likely that they include additional training examples to address such issues.)

These errors, while amusing, underscore a crucial point: Language models excel at retrieving and applying memorized reasoning patterns, but they often struggle with the kind of flexible, common-sense reasoning that humans take for granted. They’re like students who’ve memorized formulas without understanding the underlying concepts — they might pass a test, but they’ll likely falter when faced with a real-world problem that requires genuine understanding.

The sensitivity of language models to the wording of prompts further highlights this brittleness. A slight change in phrasing can cause them to retrieve the wrong reasoning pattern and produce a nonsensical result. This suggests that their reasoning capabilities have more to do with surface-level associations than with deep, conceptual understanding.

Language model reasoning is simplistic compared to human reasoning

While large language models can appear remarkably intelligent, their reasoning capabilities are far inferior to those of humans. Language models are essentially masters of mimicry, relying on vast stores of memorized reasoning patterns gleaned from their training data. They try to match the pattern that seems most appropriate to the prompt, much like searching for the closest recipe in a cookbook.

This strategy works surprisingly well in many cases, especially for tasks that can be broken down into a linear series of steps, like following a cooking recipe. Imagine planning a dinner party:

1. Decide on the courses: Wine, salad, main course, dessert.
2. Choose specific dishes: Red wine, garden salad, tilapia, cheesecake.
3. Create a shopping list: List ingredients for each dish.
4. Go shopping: Buy all the necessary ingredients.
5. Cook: Prepare each dish according to the recipe.

This linear, step-by-step process, with no need for backtracking or revising the plan, plays to the strengths of language models. They excel when the solution path is clear and the necessary “sub-tasks” have been encountered in their training data.

However, human reasoning is far more flexible and adaptable. We can easily handle situations where the initial plan needs tweaking. Perhaps we decide that red wine might clash with fish, or remember that a guest has a seafood allergy. We seamlessly backtrack, adjust our choices, and continue planning.

This is where language models falter. Research has shown that they struggle with:

  • Complex Decomposition: Breaking down intricate problems into manageable sub-tasks that they haven’t seen before.
  • Novel Situations: Applying existing knowledge to new problems or domains.
  • Backtracking and Revision: Adjusting their approach when the initial plan proves inadequate.
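
To make backtracking and revision concrete, here is a toy sketch of the dinner-party plan as a backtracking search. The menu options and constraints are invented for illustration; the point is the control flow humans use effortlessly: try a choice, hit a conflict, back up, and revise.

```python
# A toy backtracking planner for the dinner-party example above. The menu
# options and constraints are made up; what matters is the ability to abandon
# a partial plan (red wine with fish, a guest's seafood allergy) and revise it.

COURSES = {
    "wine":    ["red wine", "white wine"],
    "main":    ["tilapia", "roast chicken"],
    "dessert": ["cheesecake", "fruit salad"],
}

def violates(plan):
    # Invented constraints for the sketch.
    if plan.get("wine") == "red wine" and plan.get("main") == "tilapia":
        return True                       # red wine clashes with fish
    if plan.get("main") == "tilapia":
        return True                       # a guest has a seafood allergy
    return False

def plan_menu(courses, plan=None):
    plan = plan or {}
    if violates(plan):
        return None                       # dead end: backtrack
    if len(plan) == len(courses):
        return plan                       # every course chosen consistently
    course = [c for c in courses if c not in plan][0]
    for option in courses[course]:
        result = plan_menu(courses, {**plan, course: option})
        if result is not None:
            return result
    return None

print(plan_menu(COURSES))
# {'wine': 'red wine', 'main': 'roast chicken', 'dessert': 'cheesecake'}
```

A language model can often describe this procedure fluently in words, yet still struggle to carry it out reliably when a partially built plan actually has to be revised mid-stream.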

Instead of human-like reasoning, language models rely on superficial pattern matching. They might solve a math problem by recognizing a familiar formula but fail when presented with the same problem phrased differently or requiring a slight variation in the solution method.

This reliance on memorized patterns is further evident in their sensitivity to prompt wording and their struggles with tasks that are rarely encountered in their training data. Even a simple task like continuing a sequence of letters in a non-standard alphabet, which is easy for humans, can stump a language model.
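
As an illustration of the kind of probe being described (the specific ordering below is invented), such a task states a non-standard letter ordering and asks the model to continue a sequence under it rather than under the familiar a-b-c one.

```python
# An invented example of a "non-standard alphabet" continuation task. The rule
# is stated explicitly, so a human simply follows it; a model leaning on the
# memorized a-b-c ordering often continues with the standard alphabet instead.

custom_alphabet = "qwertyuiopasdfghjklzxcvbnm"  # keyboard order instead of a-b-c

def continue_sequence(last_letter, alphabet, n=3):
    """Return the next n letters after `last_letter` under the given ordering."""
    start = alphabet.index(last_letter) + 1
    return [alphabet[(start + i) % len(alphabet)] for i in range(n)]

# Prompt: "In this alphabet, what comes after q, w, e?"
print(continue_sequence("e", custom_alphabet))  # ['r', 't', 'y'], not ['f', 'g', 'h']
```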

The limitations of language models become even more apparent when we consider their capacity for learning new skills. While humans can quickly grasp new concepts from just a few examples, language models often require massive amounts of data and struggle to generalize their knowledge to new situations.

François Chollet, a Google AI researcher and the creator of the Keras deep learning library, argues that true intelligence lies not just in mastering existing knowledge, but in the ability to efficiently learn new skills. To put this idea to the test, he developed ARC-AGI, a challenging benchmark designed to measure an AI’s capacity for acquiring new skills from limited data.

Imagine presenting an AI system with the following puzzle:

[Figure: an example ARC-AGI task, with two training input/output grid pairs on the left and a test example on the right. Source: Arc Prize]

The left side provides two training examples, showcasing a simple pattern transformation. Humans, even those with average intelligence, can quickly grasp the underlying pattern from the examples on the left side of the image and apply it to the test example on the right, achieving an accuracy rate above 84%.
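
To give a feel for the format (the grids and rule below are invented, not an actual ARC-AGI task), each task supplies a few input-to-output grid pairs; a solver must infer the transformation from those pairs and apply it to a fresh test input.

```python
# An invented ARC-style miniature. A candidate rule (mirror each row
# left-to-right) is checked against the training pairs and, if it reproduces
# them exactly, applied to the test grid.

train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]
test_input = [[5, 0, 0],
              [0, 0, 6]]

def candidate_rule(grid):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Accept the rule only if it reproduces every training output exactly.
if all(candidate_rule(inp) == out for inp, out in train_pairs):
    print(candidate_rule(test_input))  # [[0, 0, 5], [6, 0, 0]]
```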

For AI systems, however, this seemingly simple task has proven to be a significant hurdle. As of June 2024, even the most advanced AI systems have struggled to achieve accuracy above 34% on the hidden version of the ARC-AGI benchmark. While some systems have shown higher scores on the public version, this is likely due to overfitting — essentially, they’ve become too specialized in solving those specific examples without truly grasping the underlying principles.

The ARC-AGI benchmark highlights a fundamental difference between human and artificial intelligence. Humans possess a remarkable ability to learn new concepts from just a handful of examples, generalizing their knowledge to solve novel problems. AI systems, on the other hand, often require massive amounts of data and struggle to adapt to situations outside their training data.

Chollet’s work underscores the need for AI systems that can learn and adapt more like humans, moving beyond pattern recognition towards genuine understanding and flexible problem-solving. In 2024, Chollet and others announced a $1M+ prize for anyone who can develop an AI system that can achieve 85% accuracy on ARC-AGI.

Will AI take over the world?

The quest for artificial general intelligence (AGI) — AI that rivals the full spectrum of human intellect — is akin to climbing a mountain shrouded in mist. We’ve made significant progress, with large language models like ChatGPT showcasing remarkable, albeit narrow, capabilities. But the summit, where AI truly comprehends, learns, and reasons like humans, remains tantalizingly out of reach.

The question on everyone’s mind: will future AI models finally break through the fog?

Sam Altman, CEO of OpenAI, has hinted at dramatic improvements in upcoming models like GPT-5, potentially surpassing the leap from GPT-3 to GPT-4.

We can expect advancements in various areas:

  • Reduced Hallucinations: Fewer instances of AI confidently generating false information.
  • Enhanced Safety: Better safeguards against biased, harmful, or unethical outputs.
  • Expanded Capabilities: Better memory, multi-step reasoning, tool use, and even potential for robotic integration.

These advancements will be impressive and the resulting language models will likely perform even better on academic and professional tests.

But will they be enough to bridge the fundamental gap between artificial and human intelligence? Can we expect AI to evolve beyond linear, pattern-based thinking to the kind of flexible, adaptable reasoning that allows humans to solve novel problems and learn new skills efficiently?

There is some evidence that language model capabilities may have plateaued. Many competitors to GPT-4 have surfaced in the year-plus since GPT-4 was released, yet none of these models performs significantly better on benchmarks than GPT-4. Also, OpenAI’s CTO, Mira Murati, said last month that none of the models in OpenAI’s labs is significantly more advanced than what is publicly available.

If AI does evolve to the point of human-level reasoning and becomes capable of learning new skills efficiently, what does that mean for human intelligence?

One possible conclusion is that human reasoning is nothing more than pattern-matching to retrieve the optimal reasoning pattern from memory.

The answer also has profound implications for the future of human society. Some AI researchers warn of potential existential risks if AI surpasses human control. They argue that we’re rapidly approaching a point where AI could pose a serious threat to humanity. These researchers include Geoffrey Hinton, Yoshua Bengio, Dan Hendrycks, Demis Hassabis, Max Tegmark, Stuart Russell, and Shane Legg.

Other researchers argue that we are still far from creating AGI, and even if we succeed, AI will likely remain under human control. These researchers include Yann LeCun, Andrew Ng, Christopher Manning, François Chollet, Thomas G. Dietterich, Melanie Mitchell, Arvind Narayanan, Gary Marcus, Jeremy Howard, Kyunghyun Cho, Julian Togelius, Oren Etzioni, Grady Booch, Steven Pinker, and Blake Richards.

This debate is not merely academic. It has sparked a global conversation about the ethical implications of AI development. With massive investments pouring into AI research, the next few years will be crucial in determining whether AI capabilities will continue to advance at an exponential rate or if we’ve reached a plateau.

The stakes are high, and the world is watching. Will we witness the dawn of a new era of human-level AI, or will the mist continue to obscure the summit of true artificial intelligence? Only time will tell.


Written by Steve Shwartz

Author of “Evil Robots, Killer Computers, and Other Myths: The Truth About AI and the Future of Humanity,” published Feb 9, 2021 by Fast Company Press.
