The Battle of Big Models: The Controversial Launch of Llama3-V
Competition among large model developers is intensifying, and a recent episode has sparked considerable debate. A Stanford University team recently introduced Llama3-V, a multimodal large model that it claimed could rival established models such as GPT-4V, Gemini Ultra, and Claude Opus, all for just $500 in training costs. The authors, Siddharth Sharma and Aksh Garg, are undergraduate students in Stanford's computer science department who have previously published several machine learning papers and worked with major companies such as Tesla and SpaceX. Llama3-V quickly gained traction, even reaching the trending charts on Hugging Face, a pivotal platform in the machine learning community.
However, excitement around Llama3-V was short-lived. Users soon pointed out striking similarities between Llama3-V and MiniCPM-Llama3-V 2.5, a model released in May by the Tsinghua University-affiliated company Weibi Intelligence (面壁智能, known in English as ModelBest). Observers noted that the two models share nearly the same structure, code, and configuration files, differing mainly in variable names. Llama3-V's code appears to be a reformatted version of MiniCPM-Llama3-V 2.5's, and the two models behave alike even when their inputs are perturbed with different levels of noise. Notably, Llama3-V uses the MiniCPM-Llama3-V 2.5 tokenizer, and several of that model's special tokens also appear in Llama3-V. Reports suggest that simply renaming variables in Llama3-V was enough for it to run with MiniCPM-V's code, raising doubts about the originality of its development.
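The renaming observation can be made concrete with a toy check. The sketch below is only an illustration, not the procedure the reporting users followed: it treats two checkpoints as plain dictionaries, applies a hypothetical prefix mapping (`RENAMES`), and reports which parameters coincide once the names are aligned. Every parameter name and value here is invented for the example.

```python
# Toy illustration: do two checkpoints match once parameter names are mapped?
# All names and values below are invented for the example.

RENAMES = {"encoder_x.": "enc.", "decoder_x.": "dec."}  # hypothetical prefix map


def rename_key(key, renames):
    """Rewrite the first matching prefix of a parameter name."""
    for old, new in renames.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key


def compare_checkpoints(ckpt_a, ckpt_b, renames, tol=1e-6):
    """Return (matching, differing, missing) names after renaming ckpt_a's keys."""
    renamed = {rename_key(k, renames): v for k, v in ckpt_a.items()}
    matching, differing, missing = [], [], []
    for name, values in renamed.items():
        if name not in ckpt_b:
            missing.append(name)
        elif len(values) == len(ckpt_b[name]) and all(
            abs(x - y) <= tol for x, y in zip(values, ckpt_b[name])
        ):
            matching.append(name)
        else:
            differing.append(name)
    return sorted(matching), sorted(differing), sorted(missing)


ckpt_a = {
    "encoder_x.embed.weight": [0.1, 0.2, 0.3],
    "decoder_x.layer0.weight": [1.0, -1.0],
}
ckpt_b = {
    "enc.embed.weight": [0.1, 0.2, 0.3],
    "dec.layer0.weight": [1.0, -1.0],
}

matching, differing, missing = compare_checkpoints(ckpt_a, ckpt_b, RENAMES)
print("matching:", matching)
print("differing:", differing)
print("missing:", missing)
```

If every parameter lands in `matching` after nothing more than a rename, the two checkpoints are the same weights under different labels, which is the kind of evidence the observers described.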
On June 3, Weibi Intelligence's CEO, Li Dahai, voiced his concerns on social media, stating that Llama3-V, like MiniCPM-Llama3-V 2.5, was able to recognize the ancient characters of the Tsinghua Bamboo Slips and made identical errors on them, even though that dataset had never been made public. Li emphasized that his team's model acquired this recognition ability only after months of meticulously scanning and annotating the manuscripts. He added that, when the inputs were perturbed with strong Gaussian noise, the two models behaved strikingly alike in both their correct and their incorrect outputs.
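A noise-perturbation comparison of this kind can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the test Li's team ran: the "models" are stand-in functions over a flat list of pixel intensities, and the names `original`, `renamed_copy`, `independent`, and `agreement_rate` are all hypothetical. The idea is that two independently trained models tend to diverge on heavily corrupted inputs, while a copy keeps agreeing, errors included.

```python
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Return a copy of `pixels` with zero-mean Gaussian noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def agreement_rate(model_a, model_b, images, sigma, seed=0):
    """Fraction of noisy inputs on which the two models give the same output."""
    rng = random.Random(seed)
    same = 0
    for image in images:
        noisy = add_gaussian_noise(image, sigma, rng)
        if model_a(noisy) == model_b(noisy):
            same += 1
    return same / len(images)


# Stand-in "models" (hypothetical): simple rules over pixel intensities.
def original(pixels):
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


def renamed_copy(pixels):  # the same rule with different variable names
    mean_intensity = sum(pixels) / len(pixels)
    return "bright" if mean_intensity > 0.5 else "dark"


def independent(pixels):  # a genuinely different rule
    return "bright" if max(pixels) > 0.75 else "dark"


data_rng = random.Random(42)
images = [[data_rng.random() for _ in range(16)] for _ in range(200)]

for sigma in (0.1, 0.3, 0.6):
    copy_rate = agreement_rate(original, renamed_copy, images, sigma)
    indep_rate = agreement_rate(original, independent, images, sigma)
    print(f"sigma={sigma}: copy={copy_rate:.2f} independent={indep_rate:.2f}")
```

In a real comparison the stand-in functions would be replaced by calls to the two models on the same noisy images, and the outputs compared would be generated text rather than a single label.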
When asked how such incidents could be prevented, Li acknowledged that it is difficult, describing it as ultimately a matter of academic ethics. As the allegations spread, the Llama3-V team deleted comments accusing it of plagiarism, then withdrew the project from open-source platforms and issued an apology. Sharma and Garg said that they had not been directly responsible for the code; that work was handled by Mustafa Aljadery, a USC graduate, who had not yet released the training code.
The "repackaging" of large models has long been a pervasive issue in the industry. Some advocate making extensive use of open-source resources, while others argue that true innovation requires proprietary development. Modern large models trace their origins to the Transformer neural network architecture introduced by Google Brain in 2017. Building on this architecture, companies pre-train large models on vast datasets so that they generalize better and learn downstream tasks more quickly.
Essentially, the "core" of a large model is its neural network architecture together with pre-training, while the "shell" refers to fine-tuning: adjusting a pre-trained model for specific tasks. Fine-tuning is typically a supervised process that uses labeled data to direct the model's learning. AI analyst Zhang Yi noted that "repackaging" often means taking an open-source model, changing variable names at the fine-tuning stage, and adapting the result to specific scenarios.
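The core/shell split can be illustrated with a deliberately tiny example. The sketch below is an assumption-laden toy, not how any production model is fine-tuned: `pretrained_core` stands in for a frozen pre-trained model, and only a small logistic "head" is updated from labeled examples. The data, function names, and hyperparameters are all hypothetical.

```python
import math


def pretrained_core(x):
    """Stand-in for a frozen pre-trained model: maps a raw input to features."""
    return [x, x * x]


def predict(head, features):
    """Logistic head on top of the frozen features; returns a probability."""
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(head, labeled_data, lr=0.5, epochs=200):
    """Supervised fine-tuning: only the head's parameters are updated."""
    for _ in range(epochs):
        for x, label in labeled_data:
            features = pretrained_core(x)            # the core is never modified
            error = predict(head, features) - label  # log-loss gradient w.r.t. the logit
            head["weights"] = [
                w - lr * error * f for w, f in zip(head["weights"], features)
            ]
            head["bias"] -= lr * error
    return head


# Hypothetical labeled task data: the label is 1 when |x| is large.
labeled_data = [(-2.0, 1), (-1.5, 1), (-0.3, 0), (0.0, 0), (0.4, 0), (1.6, 1), (2.0, 1)]

head = fine_tune({"weights": [0.0, 0.0], "bias": 0.0}, labeled_data)
predictions = [round(predict(head, pretrained_core(x))) for x, _ in labeled_data]
print(predictions)
```

The labels drive every update, which is what makes the process supervised, and the frozen core is why fine-tuning costs far less than pre-training.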
Suki, a former designer at Yuque and co-founder of AI assistant Monica, outlined four phases of "repackaging":
1. Calling OpenAI's API directly and returning its responses as they are.
2. Constructing prompts that steer the underlying model and serve as the foundation of the product built on it.
3. Vectorizing (embedding) specific datasets to build a proprietary database, so the product can answer questions that ChatGPT alone cannot.
4. Fine-tuning the model on high-quality Q&A datasets to improve its task-specific understanding, which also consumes fewer tokens than the other approaches.
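Phases 2 and 3 can be sketched together in a few lines. The example below is a minimal illustration under stated assumptions: the "embedding" is a toy bag-of-words count vector standing in for a real embedding model, the documents are invented, and the helper names (`embed`, `retrieve`, `build_prompt`) are hypothetical. The resulting prompt would then be sent to a hosted model as in phase 1, a step omitted here.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, index, top_k=1):
    """Phase 3: return the top_k stored passages most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:top_k]]


def build_prompt(question, passages):
    """Phase 2: wrap the question and retrieved context in a fixed template."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Invented documents standing in for a proprietary dataset.
documents = [
    "Refunds are processed within 14 days of the return request.",
    "Support is available on weekdays from 9:00 to 17:00.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "proprietary database"

question = "How many days do refunds take?"
prompt = build_prompt(question, retrieve(question, index))
print(prompt)
```

Because the retrieved passage travels inside the prompt on every request, this approach spends tokens on context each time, which is the cost that fine-tuning in phase 4 avoids.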
In conclusion, this controversy highlights a contentious yet common practice in AI model development: continually adapting existing models to meet niche demands across diverse fields.