Greg Robison

The Great Debate: Do Language Models Reason?

“Reason is the slow and tortuous method by which those who do not know the truth discover it.” -- Blaise Pascal

We’ve been talking recently about all the cool things Large Language Models (LLMs) like ChatGPT can do, like impersonate people, create synthetic data, code, etc. They learn the statistical patterns and relationships between words, phrases, and sentences, allowing them to generate coherent and human-like text. But it’s time to set the record straight on one controversial topic – can LLMs actually reason? They seem to be able to make cogent arguments:

[Screenshot: Claude Opus providing a math proof]
Claude Opus provides a valid proof - is that reasoning?

Why does it even matter? Because reasoning abilities are crucial for AI systems to truly understand and interact with us and our world in a meaningful way. Reasoning involves the ability to think logically, draw inferences, solve problems, and make sound decisions based on available information. If LLMs truly can reason, it would mean they can go beyond pattern matching to have a deeper understanding of the world and the ability to think critically. An AI that accurately reasons can assist humans in complex decision-making processes, solve challenging problems that require logical thinking and inference, or provide intelligent recommendations. We could accelerate scientific discovery, enhance decision making in policy, improve efficiency and productivity, and get truly personalized services like education, healthcare, and entertainment. We would have truly intelligent tools.


Defining Reason

Reasoning is a fundamental cognitive process that uses logical thinking, problem-solving, decision-making, and inference to reach conclusions or decisions based on the information we have available. Reasoning has been at the heart of philosophy since at least Aristotle, who studied it systematically and distinguished between deductive and inductive reasoning. Deductive reasoning involves drawing specific conclusions from general principles or premises, while inductive reasoning involves generalizing from specific observations. Some argue that reasoning is purely logical, while others highlight the role of intuition, emotions, and context. Either way, reasoning is not specific to humans: studies show that many animals, including primates, crows, and even insects, can engage in basic forms of reasoning to solve problems and make decisions.


However, complex problem solving requires the ability to identify and define problems, generate and evaluate potential solutions, and successfully implement the best solution to achieve the goal. It often involves breaking down complex problems into smaller, more manageable parts and applying logical thinking and creativity to find solutions. Logical thinking involves the use of formal rules and principles to draw conclusions. It requires the ability to identify and analyze the structure of arguments, recognize logical fallacies, and use logical operations like "and" / "or" / "not". When we use these skills together, we can cure diseases, travel to the moon, tackle global challenges like climate change and poverty, unlock the secrets of the universe, and even put our friend’s face everywhere possible.


Decision-making is similar to problem-solving and usually involves evaluating alternatives and choosing the most appropriate course of action based on the goals and information available. Like deciding where your family is going to have dinner, it means identifying and evaluating the available options before choosing. Effective decision-making requires the ability to weigh the pros and cons of various options, consider the long-term consequences, and make trade-offs or concessions when necessary to get the job done. Induction and deduction both play critical roles in drawing accurate conclusions: induction lets us generalize from what we have observed to new knowledge, while deduction lets us draw sound conclusions from our premises. These processes aren’t just for students or scientists; we use them constantly in our everyday lives.


Arguments for LLMs Having Reasoning Abilities

Using these criteria for reasoning, do today’s LLMs genuinely reason? Some say “yes”. On some benchmarks that measure reasoning abilities, such as the General Language Understanding Evaluation (GLUE) benchmark, LLMs have surpassed human performance. On the similar but more challenging SuperGLUE benchmark, LLMs are also performing well, with some outperforming human baseline scores. GPT-4 achieves human-level scores of around 95% on the HellaSwag benchmark, which is designed to test common-sense reasoning abilities. High scores on these reasoning tests suggest that LLMs can effectively reason and draw logical conclusions based on the information they are given.


Current LLMs like Anthropic’s Claude Opus can generate coherent and logical responses to questions. If you ask an LLM to reason through a problem, it can generate well-reasoned answers that demonstrate a clear understanding of context and the ability to think through the implications. Smart LLMs can break down a problem into parts, provide relevant examples, and arrive at a proper conclusion. Here are examples of Claude Opus and GPT-4 reasoning through the same word problem and arguing logically for their answers, although they reach different conclusions. Which logic is correct?

[Screenshot: Claude Opus reasoning through the word problem]
Claude Opus reasons the word is “dim”, is that right?

[Screenshot: GPT-4 reasoning through the same word problem]
GPT-4 reasons that the word is “dog”, which logic is right?

LLMs also have the interesting ability to be good “few-shot” learners – that is, give them a couple of examples to learn from and they can then apply that knowledge to unique situations. For example, you can get the AI to speak in Yoda’s distinctive grammatical structures by giving it a few examples and asking it to apply what it learned to a new situation.

[Screenshot: GPT-4 explaining innovation in Yoda’s style]
GPT-4, It Was

The model can grasp the underlying patterns and principles from a small set of examples and then use these patterns in a new context. This ability to learn from limited data and generalize to new situations suggests that LLMs are not simply memorizing patterns but are inferring the appropriate response.
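
To make this concrete, here is a minimal sketch of how such a few-shot prompt might be assembled. The example pairs, the build_yoda_prompt helper, and the final print are purely illustrative assumptions, not any particular vendor’s API; the assembled string would simply be sent to whatever model you are using.

```python
# Minimal sketch: assembling a few-shot prompt for Yoda-style rewrites.
# The example pairs are hypothetical; the finished string goes to the LLM.

FEW_SHOT_EXAMPLES = [
    ("The project was a big success.", "A big success, the project was."),
    ("You must learn patience.", "Patience, you must learn."),
]

def build_yoda_prompt(new_sentence: str) -> str:
    """Combine an instruction, worked examples, and the new input into one prompt."""
    lines = ["Rewrite each sentence in Yoda's distinctive word order.", ""]
    for original, rewritten in FEW_SHOT_EXAMPLES:
        lines += [f"Input: {original}", f"Output: {rewritten}", ""]
    lines += [f"Input: {new_sentence}", "Output:"]
    return "\n".join(lines)

print(build_yoda_prompt("Innovation drives progress."))
```

The model is never given an explicit grammar rule; it has to infer the inversion pattern from the two examples and apply it to the unseen sentence.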


Another approach is to train reasoning into models. For example, “chain of thought” training appears to enhance reasoning capabilities by teaching LLMs to generate step-by-step explanations for their answers. By training the model on examples of problems with detailed reasoning steps, the model learns to emulate the reasoning process. Through this training, the model appears to break down complex problems into steps, making its thought process clearer and more interpretable. Benchmark scores suggest that this kind of training does improve reasoning abilities, and the model provides clear explanations of how it arrived at its conclusions. For example:


Let's go through this step-by-step:

1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.

Therefore, you now have 12 muffins.

Knowing you have an even dozen muffins - that’s reasoning, right?
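
For illustration, a chain-of-thought style prompt for the muffin problem might look like the sketch below. The wording of the instruction and the build_cot_prompt helper are my own assumptions, not a prescribed template; the closing assert just confirms that the arithmetic the model is narrating really does come out to 12.

```python
# Sketch: wrapping a word problem with a step-by-step instruction.

PROBLEM = (
    "You start with 15 muffins. You eat 2 and give 5 to your neighbor. "
    "Your partner buys 6 more and then eats 2. How many muffins do you have now?"
)

def build_cot_prompt(problem: str) -> str:
    """Ask the model to show its intermediate steps before the final answer."""
    return (
        f"{problem}\n\n"
        "Let's go through this step-by-step. Show the running count of muffins "
        "after each event, then state the final answer on its own line."
    )

print(build_cot_prompt(PROBLEM))

# The arithmetic being imitated is trivial to verify directly:
assert 15 - 2 - 5 + 6 - 2 == 12
```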


Arguments Against LLMs Having True Reasoning Abilities

A kind of reasoning, sure, but a very limited one (compared to what we use, at least). One limitation is that LLMs lack any genuine understanding of the information they are processing and generating. At their core, the models are based on pattern matching and statistical inference, learning to predict the likely next word or sequence of words based on the patterns they observed in the training data. They can generate coherent and seemingly meaningful responses, but that doesn’t mean they have a deep understanding of the concepts and ideas they are manipulating. They work by manipulating numerical representations of abstractions without any grounded knowledge or real-world understanding, which limits their ability to reason about the real world in a meaningful way. A good example is LLMs’ notoriously poor ability to do math – all a model understands about numbers is where they are likely to occur in a sentence. If it has seen enough examples that “17+5” is “22”, it might respond correctly, but it did not add 17 and 5 together. If you give it a more unusual, but still simple, question like 735205 x 934584, it won’t fare as well because it doesn’t understand math at any level. And if it doesn’t understand addition or multiplication, it certainly doesn’t understand linear algebra or calculus, or any math really.

[Screenshot: Claude Opus attempting the multiplication]
Claude Opus – you’re close, but not right, the correct answer is 687,110,829,720.
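
For contrast, the multiplication Claude stumbled on is exact and instantaneous with ordinary arbitrary-precision integer arithmetic, and involves no pattern matching at all; this tiny snippet is only here to show the gap, not to suggest any particular fix.

```python
# Exact integer arithmetic: no statistics, no guessing.
product = 735205 * 934584
print(f"{product:,}")  # 687,110,829,720
```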

Similarly, “reasoning from first principles” involves breaking complex problems down into their most fundamental, underlying truths, and then reasoning up from those basics. Aristotle was also a fan of this approach because it is a powerful way to understand and solve problems: it avoids assumptions and conventional wisdom and builds knowledge from the ground up. However, LLMs struggle with this type of reasoning because they predict next words; they do not dissect problems into foundational elements and build insights from scratch. Their “knowledge” is based on correlations in data, not on understanding or analyzing first principles. So LLMs may output text that looks like logical reasoning, but they can fail to generate logical or relevant outputs when genuine deduction is necessary.


We’ve seen that LLMs perform well in areas they have encountered in training, but struggle with unusual or unseen scenarios that require a deeper understanding of context and the ability to reason about the implications of the given information. An LLM might generate nonsensical or irrelevant outputs, as it lacks the adaptability and flexibility that come with actual reasoning abilities - a model that is trained to write Python code will have trouble writing creative prose. A model’s output is also limited to the context provided in the interaction, not the broader context. Without enough information, the output might be technically correct, but not necessarily relevant to the problem.


Finally, LLMs often struggle with long-term reasoning, causal reasoning, and counterfactual thinking, which are fundamental to human reasoning. Because LLMs only predict the next word, they cannot consider the entirety of an argument or its potential long-term effects. Causal reasoning involves understanding the cause-and-effect relationships between events and actions, but LLMs have no concept of a causal world, just a linguistic one. Counterfactual thinking involves considering different scenarios or outcomes that could occur under different conditions, which is difficult for models that capture only statistical probabilities rather than a deep understanding of cause and effect. For example, LLMs would have difficulty predicting all the potential effects, especially long-term effects, of introducing a new species into an ecosystem. What would happen if you introduced Maryland Blue Crabs to the San Francisco Bay?

[Screenshot: GPT-4 on introducing Blue Crabs to San Francisco Bay]
GPT-4 is short sighted, not considering subtle or detailed changes.

The model generates a generic response based on correlations in the training data but cannot accurately predict nuanced ecological consequences. The absence of causal reasoning and counterfactual thinking limits the ability of LLMs to engage in hypothetical reasoning, make predictions based on causal relationships, and consider alternative possibilities, which are crucial aspects of genuine reasoning abilities.

[Screenshot: GPT-4 on its own reasoning abilities]
GPT-4 says it only simulates aspects of reasoning. I buy that.

Conclusion

To get back to the original question, can LLMs reason? They have demonstrated impressive performance on various reasoning benchmarks and can often generate coherent and logical responses. But there are significant limitations that need to be addressed, such as a lack of genuine understanding, an inability to handle novel situations, the absence of reasoning from first principles, and difficulty with causal and counterfactual thinking. They do not possess the full range of reasoning abilities that humans have. But rapid advancements in AI will undoubtedly continue to build reasoning abilities into AI systems, which will eventually transform various aspects of society, from education, climate change, and healthcare to scientific discovery and decision-making. This entire discussion may soon be moot…


