OpenAI, the creator of ChatGPT, has unveiled its latest major product: a generative AI model codenamed Strawberry and officially named OpenAI o1. More accurately, o1 is a family of models, two of which launched Thursday in ChatGPT and through OpenAI’s API: o1-preview and o1-mini, a smaller, more efficient model tailored for code generation.
To use o1 within the ChatGPT client, users must subscribe to ChatGPT Plus or Team; Enterprise and educational users will get access starting next week. Note that the current o1 chatbot experience is fairly bare-bones. Unlike its predecessor GPT-4o, o1 can’t yet browse the web or analyze files. The model does have image-analysis features, but they’re disabled pending additional testing. And o1 is rate-limited: 30 messages per week for o1-preview, 50 for o1-mini.
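For developers, the API side is more conventional. Here is a minimal sketch of a request to o1-preview, assuming the openai Python SDK (v1.x), an OPENAI_API_KEY in the environment, and an account with access to the model; at launch, o1 did not support some familiar options such as streaming and system messages.

```python
# Minimal sketch of an o1-preview request via the openai Python SDK (v1.x).
# Assumes OPENAI_API_KEY is set in the environment and the account has
# access to the model; at launch, o1 did not support some familiar options
# (e.g., streaming and system messages).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
print(response.choices[0].message.content)
```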
Another consideration is pricing: o1 is significantly more expensive. In the API, o1-preview costs $15 per million input tokens and $60 per million output tokens, three times GPT-4o’s input price and four times its output price. (For context, tokens are chunks of raw data; one million tokens works out to roughly 750,000 words.)
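To make the price gap concrete, here’s a back-of-the-envelope sketch. The o1-preview rates are those quoted above; the GPT-4o figures are inferred from the three-times and four-times multiples, not quoted directly.

```python
# Rough cost comparison using the per-million-token rates quoted above.
# GPT-4o rates are inferred from the 3x/4x multiples, not quoted directly.
RATES = {  # USD per 1M tokens: (input, output)
    "o1-preview": (15.00, 60.00),
    "gpt-4o": (5.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single API request."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a request with 10,000 input tokens and 2,000 output tokens.
for model in RATES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.2f}")
# o1-preview: $0.27 vs. gpt-4o: $0.08 -- about 3.4x the cost for this mix.
```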
OpenAI has announced plans to eventually offer o1-mini to all free ChatGPT users, although a specific release date has not been established.
Enhanced Reasoning Capabilities
OpenAI o1 sidesteps some of the reasoning pitfalls that commonly trip up generative AI models because it can effectively fact-check itself by spending more time considering all parts of a question. According to OpenAI, what sets o1 apart from other generative AI models is its ability to “think” before responding, which improves the quality of its answers.
When given extra time to process, o1 can approach tasks holistically—strategizing and executing a series of actions over time to arrive at a solution. This makes o1 particularly adept at complex tasks requiring the synthesis of multiple subtasks, such as detecting privileged emails in legal settings or developing a marketing strategy.
On Thursday, Noam Brown, a research scientist at OpenAI, said in posts on X that o1 is trained with reinforcement learning, which teaches the system to “think” before responding via a private chain of thought: correct answers earn rewards, while incorrect ones draw penalties. He added that OpenAI used a new optimization algorithm and a training dataset containing reasoning data and scientific literature tailored to reasoning tasks, noting, “The longer [o1] thinks, the better it performs.”
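OpenAI hasn’t published its training details, but the outcome-based reward scheme Brown describes can be caricatured with a deliberately toy sketch: a stub “policy” either reasons carefully or guesses, earns +1 for a correct final answer and -1 otherwise, and a REINFORCE-style update nudges it toward whichever behavior produced the reward. Every name below is hypothetical; this illustrates the general idea, not OpenAI’s method.

```python
# Toy illustration (NOT OpenAI's actual method) of outcome-based RL over a
# private chain of thought: reward correct final answers, penalize wrong ones,
# and nudge the policy toward behavior that earned rewards.
import random

def sample_chain_and_answer(policy_bias: float, question: tuple[int, int]):
    """Stub 'model': emits a private reasoning chain plus a final answer.
    With probability policy_bias it reasons step by step (and gets it right);
    otherwise it guesses."""
    a, b = question
    if random.random() < policy_bias:
        chain = [f"add {a} and {b} step by step", f"the sum is {a + b}"]
        return chain, a + b
    return ["guess quickly"], random.randint(0, 20)

def train(steps: int = 2000, lr: float = 0.01) -> float:
    policy_bias = 0.5  # stand-in for model parameters
    for _ in range(steps):
        q = (random.randint(0, 9), random.randint(0, 9))
        chain, answer = sample_chain_and_answer(policy_bias, q)
        reward = 1.0 if answer == sum(q) else -1.0  # reward only the outcome
        # REINFORCE-style nudge: make the sampled behavior more likely if it
        # was rewarded, less likely if it was penalized.
        took_careful_path = len(chain) > 1
        direction = 1.0 if took_careful_path else -1.0
        policy_bias = min(max(policy_bias + lr * reward * direction, 0.0), 1.0)
    return policy_bias

print(f"learned preference for deliberate reasoning: {train():.2f}")  # -> ~1.0
```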
Performance Insights
Pablo Arredondo, VP at Thomson Reuters, reports that o1 surpasses earlier OpenAI models such as GPT-4o at tasks like analyzing legal briefs and solving the logic games on the LSAT.
According to Arredondo, “We noticed o1 tackling more substantive, multi-faceted analyses.” Automated testing indicated improvements across a variety of simpler tasks as well.
On a qualifying exam for the International Mathematical Olympiad (IMO), o1 solved 83% of problems correctly, compared with GPT-4o’s 13%. While this is noteworthy, Google DeepMind recently demonstrated an AI achieving silver-medal-level performance at the IMO itself. OpenAI also reported that o1 reached the 89th percentile in Codeforces programming challenges, ahead of DeepMind’s code-generation system AlphaCode 2.
Overall, OpenAI asserts that o1 excels at data analysis, scientific queries, and coding challenges. GitHub, which tested o1 with its AI assistant GitHub Copilot, noted that the model is adept at optimizing algorithms and application code. OpenAI’s benchmarks also indicate gains in o1’s multilingual performance, especially in languages like Arabic and Korean.
Ethan Mollick, a management professor at Wharton, detailed his experiences with o1 on his personal blog. On a tricky crossword puzzle, o1 performed admirably, getting all the answers correct, though it hallucinated an entirely new clue along the way.
Limitations of OpenAI o1
Despite its strengths, OpenAI o1 has drawbacks. The model can be slower to respond than other models; Arredondo notes that o1 takes over ten seconds to answer some questions. While it works, it keeps users apprised of its progress by displaying a label for the subtask it is currently performing.
Due to the unpredictable nature of generative AI, o1 likely has other limitations. Brown conceded that o1 occasionally stumbles in games like tic-tac-toe. Furthermore, OpenAI acknowledged anecdotal feedback from testers indicating that o1 tends to hallucinate—confidently generating incorrect information—more frequently than GPT-4o and is less likely to acknowledge when it lacks an answer.
“Errors and hallucinations still happen [with o1],” Mollick commented in his blog. “It still isn’t flawless.”
The Competitive Landscape
It’s essential to recognize that OpenAI is not alone in exploring reasoning techniques to improve model accuracy. Researchers at Google DeepMind recently published findings showing that simply giving a model more compute at inference time, and guiding it while it fulfills a request, can substantially improve performance without any additional training.
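One well-known flavor of this inference-time-compute idea is self-consistency voting: sample several candidate answers and keep the most common one. The sketch below, with a hypothetical sample_answer stub standing in for a stochastic model call, shows accuracy climbing as more samples, and hence more compute, are spent per query; it illustrates the general principle, not DeepMind’s specific method.

```python
# Minimal sketch of self-consistency voting, one published flavor of the
# "more inference-time compute -> better answers" idea. `sample_answer`
# is a hypothetical stand-in for a single stochastic model call.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for one noisy model call: right 60% of the time."""
    return "42" if random.random() < 0.6 else random.choice(["41", "43", "44"])

def vote(question: str, n_samples: int) -> str:
    """Spend more compute: sample n answers and majority-vote."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Accuracy rises with the number of samples, i.e. with compute spent.
for n in (1, 5, 25):
    trials = 2000
    correct = sum(vote("what is 6 * 7?", n) == "42" for _ in range(trials))
    print(f"{n:>2} samples/query -> {correct / trials:.0%} accurate")
```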
Reflecting the competitive atmosphere, OpenAI decided against disclosing o1’s raw “chains of thought” in ChatGPT, citing “competitive advantage.” Instead, the company opted to present “model-generated summaries” of these thought chains.
While OpenAI has taken an early lead with o1, the model’s lasting success depends on making it broadly available at a more affordable price. It also remains to be seen how quickly OpenAI can ship upgraded versions; the company says it aims to experiment with models that reason for days or even weeks at a time to push their capabilities further.