Analysis: Recall or Reason - the $100B question
Seven years into the transformer era, the question of whether LLMs can reason or whether they merely recall known patterns remains hazy. Here's an update on what we know and why it matters a great deal.
Given the rate of progress, hype and capital pressure to put AI into production, it is no surprise that understanding of Generative AI technology significantly lags the investor narratives and vast valuations of AI companies.
It is fair to say that we do not yet understand current, transformer-based Generative AI technology well enough to say with any certainty whether lofty claims of “AGI” or “Reasoning” are supported, relegating them to “belief” rather than “science” or “engineering”.
Case in point: even the often-cited, seminal paper on “Emergent Properties in LLMs” is now under significant scrutiny, not just for using unclear definitions but also due to emerging counter-evidence questioning its conclusions.
A related open question about Large Language Models like GPT-4 is whether they primarily reproduce information and solutions they have memorized (“recall”) in response to problems, or whether they reach conclusions by applying concepts they have learned (“reasoning”).
Recall of Memorized Ideas
The kind of “fuzzy”, semantic matching and information retrieval the technology can perform, given the compression factor of the data involved, represents a powerful new capability in itself, but it is often inferior (due to hallucinations) to specialized data retrieval systems like databases. Recall is also only useful for known situations; it fails when the LLM confronts a genuinely new problem.
An entire branch of Generative AI research and development is currently dedicated to improving recall, primarily through Retrieval Augmented Generation (“RAG”).
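To ground the idea, here is a minimal, illustrative sketch of the RAG pattern: retrieve the snippets most relevant to a query first, then ask the model to answer strictly from them. The tiny corpus, the lexical scoring and the `call_llm` stub are placeholder assumptions standing in for a real vector store and a hosted model, not a production pipeline.

```python
# Minimal sketch of the RAG idea: retrieve relevant snippets, then ground
# the model's answer in them. Corpus, scoring and call_llm are placeholders.
from collections import Counter

DOCUMENTS = [
    "Invoice INV-1042 was paid on 2024-03-02 by ACME Corp.",
    "Our refund policy allows returns within 30 days of purchase.",
    "Support tickets are answered within one business day.",
]

def score(query: str, doc: str) -> int:
    """Crude lexical-overlap score; real systems typically use embeddings."""
    q_terms = Counter(query.lower().split())
    d_terms = Counter(doc.lower().split())
    return sum((q_terms & d_terms).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents that best match the query."""
    return sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for an actual model call (e.g. a hosted LLM API)."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    # Ground the generation in retrieved context to reduce hallucination.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer strictly from the context below. If the answer is not in "
        f"the context, say so.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

print(answer("When was invoice INV-1042 paid?"))
```

The design point is simply that the model is asked to reproduce information placed in its prompt rather than to rely on what it may or may not have memorized.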
Abstract Reasoning
Generalized, transferable reasoning, on the other hand, implies the ability to take learned concepts and apply them to new problems across domains. If and when LLMs are able to perform reasoning, their potential impact is vastly higher and opens the path to novel applications in many jobs currently performed by humans.
Current online discourse around LLMs, and the marketing pushed by large AI companies, heavily emphasizes “reasoning” over reproduction of learned knowledge, as it provides a powerful investor narrative.
It’s really hard to answer…
In practice, it is extremely hard for an external observer to tell whether a current model is recalling or reasoning, for several reasons:
- The training data of all top-end models is secret. An external observer therefore cannot establish whether the LLM has seen a problem before and is regurgitating the solution, or whether it is reasoning its way to one. Even creating a completely novel problem and exhaustively checking the internet for prior appearances cannot rule this out.
- All major vendors use the data generated when users interact with the model for future training. So even after creating a novel problem and running tests on the LLM, there is a good chance that future versions of the model will be aware of it.
- All major AI vendors have an extreme financial interest in maximizing hype and belief in broad capabilities.
- The way we benchmark AI performance is extremely immature.
- Accidental benchmark contamination and intentional manipulation are common, as evidenced by the strong variation in results whenever benchmarks are updated with novel problems.
Recent research has therefore focused on solving the benchmarking question. For example, in an extensive study, researchers from MIT and Boston University investigated reasoning capabilities by evaluating models on counterfactual variants of familiar tasks:
In short, if an LLM is performing abstract, transferable reasoning, variations to the problem should not materially affect its performance. Current research finds that the performance drop on such variations is substantial.
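To make the methodology concrete, here is a hedged, illustrative sketch of a counterfactual evaluation: the same underlying task (two-digit addition) is posed in its familiar base-10 form and in a rarely seen base-9 variant, and accuracy on both is compared. The specific task and the `query_model` stub are assumptions for illustration, not the exact protocol of the cited study.

```python
# Sketch of counterfactual evaluation: compare accuracy on a familiar task
# (base-10 addition) against a rarely seen variant (base-9 addition).
# A model that truly reasons should score similarly on both; a model that
# mostly recalls memorized patterns tends to drop sharply on the variant.
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base."""
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits)) or "0"

def make_problem(base: int) -> tuple[str, str]:
    """Build one addition prompt and its expected answer in the given base."""
    a, b = random.randint(10, 80), random.randint(10, 80)
    prompt = (
        f"All numbers below are written in base {base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}? Answer with the sum only."
    )
    return prompt, to_base(a + b, base)

def query_model(prompt: str) -> str:
    """Placeholder for the actual model under test."""
    return "0"

def accuracy(base: int, n: int = 100) -> float:
    correct = 0
    for _ in range(n):
        prompt, expected = make_problem(base)
        correct += query_model(prompt).strip() == expected
    return correct / n

print("default (base 10):      ", accuracy(10))
print("counterfactual (base 9):", accuracy(9))
```

A large gap between the two accuracies would point toward recall of memorized patterns rather than transferable reasoning.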
Current Research: More Recalling than Reasoning
These studies so far indicate that the majority of LLM performance can be explained by effective recall of memorized data rather than by reasoning. As the authors of the MIT/Boston University study put it:
Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
One way to interpret these findings is to treat the question of Recall versus Reasoning not as a binary, black-and-white distinction but as a gradient, or, more succinctly:
Sufficiently fuzzy pattern matching and recall is indistinguishable from reasoning
Why does it matter?
For businesses, what matters most is the ability to perform specific tasks. Most employees work on well-defined, repeating business problems for which ample data is available to train models. From that perspective, recalling such information is quite sufficient to drive significant automation potential for those jobs.
Or, as a corollary: the more variation in task details a job entails, the less likely it is that memorization-dependent technology with limited ability to generalize or transfer will displace it.
Given the extreme hype cycle AI is currently in, a growing consensus that its reasoning ability is limited would almost certainly have an impact on valuations, and on the lofty narrative of replacing entire professions.
More independent research is needed to understand how the technology works and to establish durable measurements of progress, in order to identify possible phase shifts between recall and reasoning in major models and how they relate to data, training and the resulting models.
TL;DR
- Current science attributes the larger share of model performance on problem solving to mechanisms more akin to recalling memorized information than to abstract, generalizable reasoning.
- Business decision makers, investors and career-minded individuals should pay attention to new, verifiable research and credible benchmarking on the question of “is it reasoning or memorizing”, as it provides insight into the trajectory, scaling and real-world business impact of the technology.
- If current trends hold, “high variation in daily workload and problem parameters” may turn out to correlate with resilience against AI displacement for jobs and businesses.
- It would also mean that the technology requires a much more constant influx of new data to stay relevant and operate in job domains, increasing the cost to deploy and operate it. The value of, and dependency on, “access to current data” would in this scenario remain elevated, shifting the balance of power in the industry away from model makers and toward data owners.