Oh, Google. Will you ever release an AI product successfully on the first try?
Less than a month after launching Gemini, its highly anticipated ChatGPT competitor, Google faced substantial criticism for what were confirmed to be staged interactions in its promotional demo. Recent research indicates that the most advanced version available to consumers, Gemini Pro, lags behind OpenAI’s GPT-3.5 Turbo large language model (LLM) in most tasks.
The findings, presented by a team from Carnegie Mellon University and BerriAI in the paper “An In-depth Look at Gemini’s Language Abilities,” show that Gemini Pro performs slightly worse than GPT-3.5 Turbo across a range of tasks. The paper, published on arXiv.org and current as of December 19, 2023, finds that Gemini Pro's accuracy trails that of OpenAI's older model on most of the benchmarks tested.
Google’s spokesperson responded, asserting that internal research shows Gemini Pro surpasses GPT-3.5 and that a more powerful version, Gemini Ultra, is coming in early 2024, reportedly outperforming GPT-4 in internal tests. They stated, “Gemini Pro outperforms inference-optimized models like GPT-3.5 and performs comparably with other leading models.”
The researchers tested four LLMs: Google's Gemini Pro, OpenAI's GPT-3.5 Turbo and GPT-4 Turbo, and Mistral's Mixtral 8x7B. They assessed the models over four days through LiteLLM, an open-source tool that provides a unified interface for calling multiple providers' LLM APIs, using a variety of prompts, including multiple-choice questions spanning 57 subjects across STEM, the humanities, and the social sciences.
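For readers curious how such an evaluation might be wired up, here is a minimal sketch of querying several models through LiteLLM's unified `completion` interface. The model identifiers and the sample prompt are illustrative assumptions, not the paper's actual configuration or prompts.

```python
# Minimal sketch, assuming provider-prefixed model names supported by LiteLLM;
# the paper's exact prompts, shots, and decoding settings are not reproduced here.
from litellm import completion  # pip install litellm; API keys are read from environment variables

QUESTION = (
    "Which of the following is a prime number?\n"
    "A. 21  B. 27  C. 29  D. 33\n"
    "Answer with a single letter."
)

for model in ["gpt-3.5-turbo", "gpt-4-1106-preview", "gemini/gemini-pro"]:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0.0,  # deterministic decoding, typical for benchmarking
    )
    print(model, "->", response.choices[0].message.content)
```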
In their knowledge-based QA test, Gemini Pro scored 64.12/60.63, while GPT-3.5 Turbo achieved 67.75/70.07 and GPT-4 Turbo scored 80.48/78.95. Notably, Gemini consistently favored answer choice “D,” indicating a bias potentially due to insufficient instruction-tuning for multiple-choice formats. Furthermore, it struggled with specific categories such as human sexuality and formal logic due to safety response restrictions.
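The answer-letter skew is easy to picture: on a balanced multiple-choice benchmark, an unbiased model should pick each option roughly a quarter of the time. The sketch below is purely illustrative (it is not the paper's analysis code) and shows one simple way to tally a model's answer-choice distribution and spot that kind of bias.

```python
# Illustrative only: tally which answer letter a model picks across a benchmark.
# A model choosing "D" far more often than chance (25%) on balanced data
# is exhibiting the positional/format bias described above.
from collections import Counter

def answer_letter_distribution(predictions):
    """predictions: iterable of answer letters ('A'-'D') extracted from model output."""
    counts = Counter(p.strip().upper()[:1] for p in predictions)
    total = sum(counts.values()) or 1
    return {letter: counts.get(letter, 0) / total for letter in "ABCD"}

# Toy example with made-up predictions:
print(answer_letter_distribution(["D", "D", "B", "D", "A", "D", "D", "C"]))
```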
Gemini Pro did outperform GPT-3.5 Turbo in high school microeconomics and security questions; however, these gains were minimal. When testing longer or more complex queries, Gemini Pro showed decreased accuracy compared to both GPT models, although it excelled in word sorting and symbol manipulation tasks.
In programming capabilities, Gemini was again found lacking, performing worse than GPT-3.5 Turbo in completing Python code tasks. While Gemini Pro showed promise in language translation—outperforming GPT-3.5 Turbo and GPT-4 Turbo in several languages—it also exhibited a tendency to block responses across many language pairs due to content moderation.
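For context on how Python code-completion ability is typically scored (this is a generic sketch, not the paper's evaluation harness): the model's generated completion is executed against unit tests, and a problem only counts as solved if every test passes.

```python
# Generic sketch of functional-correctness scoring for code completions.
# Real harnesses sandbox execution; this toy version runs code directly.
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a candidate completion followed by its unit tests; True if nothing fails."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        exec(test_code, namespace)        # run assert-based tests against it
        return True
    except Exception:
        return False

# Toy example with a hypothetical generated solution and its tests.
solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(solution, tests))  # True if the completion satisfies the tests
```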
The implications of these findings are significant for Google’s AI ambitions. As the release of Gemini Ultra approaches, Google may continue to trail OpenAI in generative AI performance. Interestingly, the research also indicated that Mistral's Mixtral 8x7B performed worse than GPT-3.5 Turbo across most tasks, suggesting that while Gemini Pro is not the best, it still outperforms some emerging competitors.
Overall, the study reinforces the notion that OpenAI currently maintains its lead in the generative AI landscape. As noted by experts like University of Pennsylvania professor Ethan Mollick, for most individual applications, GPT-4 remains the superior choice — at least until Gemini Ultra is released next year.