WATCH THE REASONING GROW
i love summer in the garden. -- Sam Altman, CEO of OpenAI
If you’ve been following along with us, you know we’ve been skeptical of LLMs’ reasoning abilities, covering the debate here and discussing the lack of deliberate reasoning, like System 2 thinking, here. OpenAI just changed the debate with its latest o1 model, the start of a new series focused on advanced reasoning capabilities. Internally known as "Strawberry," the o1 model aims to revolutionize AI's ability to tackle complex problems, such as intricate math, coding challenges, and scientific reasoning. OpenAI suggests o1 can "think" through problems, refine its logic, and deliver more accurate responses. How does Strawberry work, and is it closer to true reasoning?
NOTE: For the first time, we are experimenting with an AI-generated podcast, created with Google’s NotebookLM, that summarizes this post. Listen here and let us know what you think:
Introduction to the o1 Model Family
OpenAI has been a consistent leader in AI over the past few years, from GPT-2’s ability to generate coherent text, to GPT-3’s coding abilities, to GPT-4’s multimodal capabilities for understanding and generating both text and images, letting it excel at many tasks. Now, with the release of the o1 model family, OpenAI is shifting its focus toward enhancing reasoning abilities, opening up many new use cases for solving more complex problems. The model, codenamed “Strawberry” during development, introduces a more iterative, multi-step reasoning process akin to our deliberate System 2 thinking. We developed reasoning through action, something LLMs lack – could Strawberry grow to be more like us?
Although we don’t know much about the specifics of the o1 series, the key innovation seems to lie in the combination of Chain of Thought (CoT) prompting, which has been shown to improve reasoning capabilities, and Reinforcement Learning (RL). This approach changes reasoning from a static, pattern-matching exercise into a dynamic, learnable skill. By expanding the reasoning process into many distinct steps, o1 can also explore various problem-solving pathways, akin to Tree of Thought (ToT), receiving rewards for better performance outcomes. This methodology is similar to the strategies employed in developing advanced game-playing AI systems like AlphaGo. However, o1 applies these principles to a much broader range of cognitive tasks, from mathematical problem-solving to scientific reasoning, where using multiple logical steps is essential for accuracy (see our example of CoT reasoning through the game Connections below). In a way, o1 throws more compute at the problem, doing 10-100 times as much work as a single response.
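To illustrate the general shape of this search-over-reasoning-paths idea, here is a minimal best-of-N sketch in Python. Everything in it is hypothetical – `generate_cot` and `score` stand in for a real LLM call and a real reward model, since OpenAI has not published o1's actual mechanism.

```python
import random

# Illustrative stand-ins only: OpenAI has not published o1's internals.
# generate_cot/score are hypothetical placeholders for real model calls.

def generate_cot(question: str, rng: random.Random) -> dict:
    """Placeholder for an LLM call that samples one reasoning trace + answer."""
    steps = [f"step {i}: ..." for i in range(rng.randint(2, 5))]
    return {"steps": steps, "answer": rng.choice(["A", "B", "C"])}

def score(trace: dict, rng: random.Random) -> float:
    """Placeholder for a verifier/reward model grading the reasoning quality."""
    return rng.random()

def best_of_n(question: str, n: int = 16, seed: int = 0) -> dict:
    """Sample n independent reasoning paths and keep the highest-scoring one:
    the '10-100x more work per response' trade-off in miniature."""
    rng = random.Random(seed)
    candidates = [generate_cot(question, rng) for _ in range(n)]
    return max(candidates, key=lambda t: score(t, rng))

print(best_of_n("Which answer is best supported?", n=32))
```

The design point is the trade-off itself: each extra candidate path costs a full model call, which is why reasoning this way multiplies compute per response.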
RL enhances o1’s reasoning capabilities during training: the model learns by interacting with its environment and receiving feedback in the form of rewards or penalties based on its performance, which improves its decision-making over time. The model is rewarded for outputs that demonstrate correct, thoughtful reasoning and penalized for flawed or incorrect logic. Through RL, o1 refines its ability to "think" through problems, learning how to break down complex tasks and navigate multi-step workflows more effectively. It can develop its own “style of thought,” with whatever leads to the best results being reinforced. This capacity to refine and adapt its thought processes makes it more robust and better suited to tasks that require deeper, more nuanced problem-solving. True reasoning is more than pattern matching, and CoT and RL are more action-oriented and iterative in nature. We are getting closer.
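To make the training side concrete, here is a toy REINFORCE-style update in Python – a single scalar "policy" rewarded for correct answers and penalized for wrong ones. This sketches only the general RL idea; OpenAI has not disclosed o1's training algorithm, and the `theta`, reward values, and update rule below are illustrative assumptions.

```python
import math
import random

# Toy "policy": one logit whose sigmoid is P(correct answer).
# A real setup would update LLM weights over full reasoning traces.
theta = 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

lr = 0.1
for episode in range(1000):
    p = sigmoid(theta)
    correct = random.random() < p          # sample an answer from the policy
    reward = 1.0 if correct else -1.0      # reward sound logic, penalize flawed
    # REINFORCE: gradient of log-prob of the sampled action, scaled by reward
    grad_logp = (1 - p) if correct else -p
    theta += lr * reward * grad_logp

print(f"P(correct) after training: {sigmoid(theta):.3f}")  # climbs toward 1.0
```

Both branches of the update push the policy toward correct answers, which is the essence of "whatever leads to the best results being reinforced."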
Key Features of the o1 Model
The o1 model series is designed for “reasoning first,” working through problems instead of simply predicting the next word based on probabilities. This shift allows o1 to engage in multi-step reasoning, similar to how we consider different angles of a problem before reaching a conclusion. Unlike other LLM systems, o1 can revisit its initial thought process, allowing it to refine and correct its responses dynamically. In theory, this type of system can catch incorrect initial assumptions and revise them with additional information. This type of “thinking” matters most where precision and accuracy are necessary, like coding or scientific problem-solving, where jumping to conclusions based on patterns can lead to errors. The model should be able to navigate and process complex tasks more thoughtfully, improving both accuracy and reliability.
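A minimal sketch of that revise-and-check loop, assuming hypothetical `draft`, `critique`, and `revise` model calls (o1 keeps its internal steps hidden, so this only illustrates the shape of the process):

```python
from typing import Optional

# Hypothetical self-refinement loop; draft, critique, and revise stand in
# for real model calls. This illustrates "revisit and correct" reasoning,
# not o1's actual (undisclosed) internals.

def draft(problem: str) -> str:
    return f"initial answer to: {problem}"

def critique(problem: str, answer: str) -> Optional[str]:
    """Return a description of a flaw, or None if the answer looks sound."""
    return None  # placeholder: a real critic would be another LLM call

def revise(problem: str, answer: str, flaw: str) -> str:
    return f"{answer} (revised to address: {flaw})"

def solve(problem: str, max_rounds: int = 3) -> str:
    """Draft an answer, then repeatedly check and revise until it passes."""
    answer = draft(problem)
    for _ in range(max_rounds):
        flaw = critique(problem, answer)
        if flaw is None:  # no remaining issues found; stop revising
            break
        answer = revise(problem, answer, flaw)
    return answer

print(solve("Show that the sum of two even numbers is even."))
```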
Benchmark data backs up these assumptions, with very strong scores in complex domains such as mathematics, coding, and the sciences. For example, on a qualifying exam for the International Mathematics Olympiad, o1 outperformed GPT-4o significantly: while GPT-4o solved only 13% of the problems, o1 achieved a remarkable 83% success rate, demonstrating its superior reasoning ability in complex math challenges. This reasoning-first approach allows the model to break down and solve problems in a way previous models couldn’t, giving it a substantial edge in competitive programming tasks as well.
What is the performance limit when scaling LLM inference? Sky's the limit. We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient. -- Denny Zhou, Google DeepMind
What this Means for AI Capabilities
Previous state-of-the-art models like GPT-4 excel at pattern recognition, but their limitations become evident when multi-step processing is required (pretty much anything truly complex). The o1 model addresses these shortcomings by thinking through a problem step by step, letting it tackle intricate tasks such as coding projects that require iteration across multiple stages or mathematical proofs that require many steps. The transition from simple next-word predictors, which lack any capacity to reconsider their initial outputs or refine their solutions, to mimicking human-like reasoning processes can lead to a more dynamic and flexible AI system.
The types of tasks where o1 is likely to outperform traditional LLMs include:
Complex Mathematical Problem Solving: In competitive scenarios like the International Mathematics Olympiad, o1 outperforms regular LLMs by reasoning through multi-step math problems. Its advanced problem-solving ability allows it to tackle difficult equations that require more than pattern recognition.
Scientific Research and Analysis: o1 can assist researchers by reasoning through scientific data, generating and analyzing complex formulas in fields like quantum physics or bioinformatics. Unlike regular LLMs, o1’s ability to handle abstract, multi-step reasoning makes it ideal for tasks requiring deep analytical thinking.
Debugging and Code Optimization: While regular LLMs can generate code, o1 excels at debugging and refining complex code. Its reasoning allows it to not just generate code, but critically evaluate and debug multi-layered issues in programming.
Medical Diagnostics: o1’s ability to reason through complex medical data sets enables it to assist in diagnosing conditions that require careful interpretation of multiple symptoms and variables. Regular LLMs often struggle with these nuanced, multi-step medical evaluations.
Legal Document Analysis: In legal settings, o1 can reason through complex contracts and legal documents, identifying potential issues or inconsistencies that might be overlooked by standard LLMs. It can also cross-reference laws and case studies to provide more accurate legal reasoning.
Quantum Computing Algorithms: Quantum computing involves complex algorithms and theoretical frameworks that require precise logical steps. o1’s reasoning capabilities make it better suited for understanding and constructing these algorithms compared to regular LLMs, which might miss subtle but critical logical nuances.
Personalized Education and Tutoring: o1 can provide more tailored educational guidance by reasoning through a student’s unique challenges and crafting adaptive learning paths. Unlike regular LLMs, it can assess where students need the most help and adjust its tutoring dynamically.
Supply Chain Optimization: In logistics, o1 can reason through multiple variables like inventory levels, transportation times, and demand forecasts to recommend optimized solutions. Regular LLMs might not handle the complex, multi-step analysis required to make these decisions effectively.
Financial Portfolio Management: For portfolio management, o1 can analyze market trends, asset performance, and risk variables to provide more strategic investment advice. Its ability to reason through multi-step processes gives it an edge over LLMs that might rely more on surface-level pattern matching.
Creative Writing with Logical Structure: While both o1 and regular LLMs are capable of creative writing, o1 is better suited for tasks that require logical progression in plot or argumentation, such as writing complex narratives or philosophical essays that require coherent, multi-step reasoning.
Imagine having thousands of Ph.D. students working on your problem at once – each reasoning, testing, theorizing, iterating, and finding optimal solutions.
o1’s reasoning-driven design allows it to contribute to solving some of the most challenging problems across industries, making it a valuable tool for experts looking to streamline and enhance their research. With these types of advancements, the role of AI in science and technology looks set to grow exponentially.
AI Safety & Governance
Because of the potential power of such a reasoning AI, OpenAI introduced new, robust safety measures aimed at ensuring responsible use. Prior to its public debut, extensive safety protocols were developed, including red-teaming exercises in which external experts probed the model for vulnerabilities. These assessments focus on important issues like content moderation, hallucinations, and bias, with the goal of mitigating risks before the public gets its hands on the model. The CoT processing approach enables o1 to follow safety rules in real time, making it more reliable and resistant to jailbreaking. However, this transparency into the reasoning process has downsides, and OpenAI decided to hide the explicit details of the CoT from users (also, perhaps, to prevent others from training on the underlying reasoning process, something that will get you banned).
During internal testing, o1 outperformed its predecessors with stronger adherence to safety guidelines. OpenAI has partnered with the US and UK AI Safety Institutes to further improve the security and ethical deployment of its models. These partnerships allow early access to research versions of the o1 model, giving governmental organizations an opportunity to stress-test and evaluate the model’s alignment before public release.
So How Smart is o1-preview?
We’ve put o1-preview and o1-mini through some initial tests, and we’ve come away very impressed! On our grAIg LLM benchmark, the Reasoning score is almost at ceiling at 0.96 out of 1.0 (o1-mini outperforms GPT-4o at 0.90), so we now need to come up with a new reasoning benchmark!
We also test language abilities using the NY Times game Connections – how well can AI correctly group 16 words into 4 distinct groups? In our informal testing, GPT-4 and Claude 3.5 Sonnet can consistently get two of the four groups correct but fail on the more complex groupings. That’s pretty good! But o1-preview correctly placed all 16 words into their groups – it only faltered in naming the hardest category (and this was a tough one, too).
By examining its reasoning process more closely, you can see it trying various options and evaluating how well each meets the goal. This deliberate, iterative process certainly tries to mimic our System 2 thinking. Is it “real” reasoning? No – it’s still a mimic, but one that’s getting closer.
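As an aside, a Connections-style test is easy to score programmatically. Here is a minimal checker in Python; the puzzle and groupings are made up for illustration, and `score_grouping` is a hypothetical helper, not our actual benchmark code.

```python
# Score a Connections-style guess: given a proposed 4x4 grouping, count
# how many of the true groups were recovered exactly. The puzzle below
# is invented for illustration, not an actual NYT puzzle.

answer_key = [
    frozenset({"lemon", "lime", "orange", "grapefruit"}),  # citrus
    frozenset({"bass", "pike", "sole", "perch"}),          # fish
    frozenset({"mercury", "venus", "mars", "saturn"}),     # planets
    frozenset({"oak", "elm", "pine", "birch"}),            # trees
]

def score_grouping(guess: list) -> int:
    """Number of the four groups the model matched exactly (0-4)."""
    key = set(answer_key)
    return sum(1 for group in guess if frozenset(group) in key)

model_guess = [
    {"lemon", "lime", "orange", "grapefruit"},
    {"bass", "pike", "sole", "perch"},
    {"mercury", "venus", "mars", "oak"},   # one word misplaced
    {"saturn", "elm", "pine", "birch"},
]
print(score_grouping(model_guess))  # -> 2, like GPT-4's typical showing
```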
What's Next: The Future of o1
So far, we’re getting a preview – it is literally called “o1-preview.” According to benchmark scores, the full o1 model is even smarter, but we will have to wait until its release to really test it out. OpenAI has released o1 as a barebones model, without web browsing, file uploads, or image processing; those capabilities will come later. They will allow o1 not only to reason through textual problems but also to process external information sources, enabling a richer knowledge base (not one of the o1 models’ strong points so far). Data integration will allow iterative data analysis that could put complex analyses in everyone’s hands.
The reason OpenAI abandoned the typical GPT name for this generation of models is that the o-series represents an evolutionary change. This new paradigm will continue toward even more human-like reasoning and problem-solving. With each update, whether based on new RL techniques or smarter models running in the background, the o1 series will continue to narrow the gap between human cognition and artificial intelligence, making it a more reliable partner in professional and academic environments. These new reasoning capabilities will transform industries like medicine, quantum computing, and education, and they have the potential to change how we approach complex tasks, leading to more efficient, accurate, and innovative solutions.
Conclusion
The release of the o1 model series is the most significant attempt to mimic our System 2 thinking to date, opening the door to advanced reasoning. These models can tackle complex, multi-step tasks across industries from healthcare to physics. By iterating through problems, o1 opens new possibilities for innovation and problem solving, moving AI closer to autonomous systems that can work alongside humans to drive progress. The long-term impact of reasoning-based AI could revolutionize fields like science, education, and technology, making human-AI collaboration more effective and productive than ever. We’re inching ever closer to human-like cognition – one that may soon surpass us in many respects.