OpenAI has been wrapped in controversy throughout November. Between the abrupt firing and rehiring of CEO Sam Altman and the temporary pause on new ChatGPT Plus sign-ups, the company has dominated headlines in the artificial intelligence industry. Now AI enthusiasts are questioning whether GPT-4 is becoming “lazier” as it undergoes further training. Many users, particularly those leveraging the model for complex tasks, have taken to X (formerly Twitter) to voice their frustration over perceived declines in performance.
On X, Rohit Krishnan shared his experience with GPT-4, the model powering ChatGPT Plus. He noted that the chatbot frequently refused requests, or returned incomplete responses to queries it had previously handled well. He also observed the model reaching for the wrong tool, such as invoking DALL-E when a request called for the code interpreter. In a humorous aside, he quipped that the model’s “error analyzing” messages felt akin to it saying it was “AFK [away from keyboard], be back in a couple of hours.”
Matt Wensing reported a similar experience on X: he tasked ChatGPT Plus with generating a list of dates from now until May 5, 2024. Rather than completing the request, the chatbot first asked for further details, such as the number of weeks between the two dates.
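For perspective, the task itself is deterministic and takes only a few lines of Python, which is part of why the pushback surprised users. The sketch below is a hypothetical reconstruction, since Wensing’s exact prompt and start date are not public:

```python
from datetime import date, timedelta

# Hypothetical reconstruction of the request: list the weekly dates
# from a start date (assumed here) through May 5, 2024.
start = date(2023, 11, 28)  # assumed "now"; the actual date isn't public
end = date(2024, 5, 5)

current = start
while current <= end:
    print(current.isoformat())
    current += timedelta(weeks=1)

# Even the clarifying detail the chatbot asked for is a one-liner:
print((end - start).days // 7, "weeks between the two dates")
```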
Wharton professor Ethan Mollick compared his GPT-4 interactions from this July with recent exchanges, finding the model still knowledgeable but now inclined to explain how to fix code rather than making the corrections itself. In effect, he ended up doing the heavy lifting he had expected from the model. While Mollick did not set out to critique GPT-4, his observations echo what others have described as “back talk” from the AI.
ChatGPT is known for its propensity to “hallucinate” responses, but the recent errors go beyond typical missteps. Although GPT-4 debuted in March, reports of it becoming “less capable” surfaced as early as July. A joint study by researchers at Stanford University and UC Berkeley found that GPT-4’s accuracy on a basic math task, determining whether a given number is prime, plummeted from 97.6% in March to just 2.4% by June. Strikingly, while the paid GPT-4 model struggled, the free version running the older GPT-3.5 model improved over the same period, delivering correct answers with clear explanations.
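Part of what makes that result so stark is that the benchmark has an unambiguous ground truth that can be checked in a few lines of code. Below is a minimal primality test of the kind one could use to grade such answers; it is an illustration, not the study’s actual evaluation harness:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test: the ground truth for grading model answers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

# The study's prompts asked about specific integers like this one:
print(is_prime(17077))  # True
```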
During this period, OpenAI’s VP of Product, Peter Welinder, suggested the effect might be partly psychological: heavy users start noticing flaws they previously overlooked, and so perceive a drop in quality even as the model becomes more capable.
Mollick suggests the current issues could be temporary, possibly stemming from system overload or unannounced changes in prompt style. OpenAI previously cited system overload as the reason for pausing new ChatGPT Plus sign-ups after a surge in interest following its inaugural DevDay developers’ conference, which unveiled numerous new features for the paid tier. A waitlist for ChatGPT Plus remains active. Notably, Mollick mentioned that ChatGPT on mobile uses a different prompt style, yielding “shorter and more concise answers.”
User Yacine on X reported that the inconsistency of recent GPT-4 responses pushed them back to writing code by hand, and that they plan to set up a local code LLM to regain control over the model’s parameters (a sketch of what that involves appears below). Many users have signaled a similar shift toward open-source alternatives amid the performance concerns. Reddit user Mindless-Ad8595 added that the latest updates have made GPT-4 “too intelligent for its own good,” leaving it without a predefined “path” guiding its behavior, which boosts versatility but can also breed confusion.
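Running a code model locally, as Yacine describes, now has a fairly low barrier to entry. Here is a minimal sketch using Hugging Face’s transformers library; the model name is just one example of an openly licensed code model, and every generation parameter stays under the user’s control, which is precisely the appeal:

```python
# Minimal local code completion with Hugging Face transformers.
# The model choice is illustrative; swap in any openly licensed code model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="bigcode/starcoderbase-1b",  # small enough to experiment with on CPU
)

prompt = "def fibonacci(n: int) -> int:\n"
result = generator(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
print(result[0]["generated_text"])
```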
Mindless-Ad8595’s recommendation for getting better output is to create custom GPTs tailored to specific tasks or applications; beyond that, practical fixes for users who stay within OpenAI’s ecosystem remain limited.
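Custom GPTs are assembled in ChatGPT’s builder UI rather than in code, but the core idea, pinning task-specific instructions to every request, can be approximated through the API. The sketch below uses the openai Python package; the instruction text is our own illustration, not an official recommendation:

```python
# Approximating a task-specific "custom GPT" with a fixed system message.
# Illustrative only; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a code-completion assistant. Always return complete, runnable "
    "code. Never leave placeholders or TODO comments."
)

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the GPT-4 Turbo preview announced at DevDay
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Write a Python function that merges two sorted lists."},
    ],
)
print(response.choices[0].message.content)
```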
App developer Nick Dobos recounted a frustrating exchange in which he asked GPT-4 to write Pong in SwiftUI, only to find the output riddled with placeholders and “to-dos.” The chatbot ignored his explicit instructions to remove those placeholders, a complaint echoed by several other users on X. Dobos’ post caught the attention of an OpenAI employee, who promised to pass the feedback along to the development team for potential fixes and updates.
The reasons behind GPT-4’s current troubles remain unclear, though users have floated several theories. Some believe OpenAI is merging models or still grappling with server overload from running both GPT-4 and the newer GPT-4 Turbo, while others suspect the company is trying to cut costs by limiting output quality.
OpenAI’s operations are notoriously costly. In April 2023, researchers estimated that keeping ChatGPT up and running cost about $700,000 per day, or roughly 36 cents per query. Analysts also projected that OpenAI would need an additional 30,000 GPUs to sustain its commercial operations for the year, covering both ChatGPT itself and support for its partners.
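Taken at face value, those figures imply a staggering query volume. A quick back-of-the-envelope check, using only the numbers reported above, looks like this:

```python
# Rough arithmetic from the reported April 2023 estimates.
daily_cost = 700_000       # dollars per day
cost_per_query = 0.36      # dollars

print(f"Implied volume: {daily_cost / cost_per_query:,.0f} queries/day")  # ~1,944,444
print(f"Annualized cost: ${daily_cost * 365:,}")                          # $255,500,000
```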
In the meantime, users have taken to X to lighten the mood around GPT-4’s performance issues. “Next thing you know, it’ll be calling in sick,” quipped Southrye. MrGarnett poked fun at the model’s frequent “and you do the rest” replies, countering with “No, YOU do the rest.” The sheer volume of discussion around these problems is hard to ignore, and everyone is eager to see whether OpenAI can address the concerns in a future update.