Large language models (LLMs) are typically pre-trained on extensive datasets that include both text and code. While code is crucial for models focused on programming tasks, its inclusion has become increasingly common even in LLMs designed for non-coding applications.
In a recent study, researchers at Cohere explored how the presence of code data in LLM pre-training impacts overall performance across various tasks beyond coding.
“While practitioners have anecdotally agreed that code data is vital for LLMs' performance, limited research has analyzed its precise impact on non-code tasks,” the researchers noted.
Their investigation confirms that code significantly enhances LLM performance across a variety of non-coding tasks, with implications for real-world training applications.
Investigating the Impact of Code
The researchers conducted a series of experiments to assess how code influences general LLM performance. Key factors included the proportion of code in the training data, the point at which code was introduced during training, the quality of the code, and model size.
Using a two-phase training approach, they performed "continued pre-training" on pre-trained models, incorporating different ratios of text and code over a fixed number of tokens. This was followed by a "cooldown" phase that emphasized higher-quality datasets during the final stages of training.
The baseline model was trained solely on text. Other models were pre-trained on balanced datasets of text and code, or on code-only data before transitioning to text.
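To make the setup concrete, here is a minimal sketch of how such mixture variants might be expressed in code. The dataset names, ratios, and token budgets below are illustrative assumptions, not the exact configuration reported in the study.

```python
# Illustrative data-mixture recipes for the pre-training variants described
# above. All names, proportions, and token budgets are placeholder values.
from dataclasses import dataclass, field

@dataclass
class MixturePhase:
    name: str
    proportions: dict   # dataset name -> fraction of tokens in this phase
    token_budget: int   # total tokens consumed in this phase

@dataclass
class TrainingRecipe:
    phases: list = field(default_factory=list)

# Baseline: continued pre-training on text only, then a text cooldown.
text_only = TrainingRecipe(phases=[
    MixturePhase("continued_pretraining", {"web_text": 1.0}, token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 1.0}, token_budget=40_000_000_000),
])

# Balanced: equal parts text and code during continued pre-training.
balanced = TrainingRecipe(phases=[
    MixturePhase("continued_pretraining", {"web_text": 0.5, "source_code": 0.5},
                 token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 0.8, "high_quality_code": 0.2},
                 token_budget=40_000_000_000),
])

# Code-initialized: pre-train on code only, then transition to text.
code_then_text = TrainingRecipe(phases=[
    MixturePhase("code_pretraining", {"source_code": 1.0}, token_budget=200_000_000_000),
    MixturePhase("continued_pretraining", {"web_text": 1.0}, token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 1.0}, token_budget=40_000_000_000),
])
```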
They evaluated models ranging from 470 million to 2.8 billion parameters across various benchmarks focused on world knowledge, natural language reasoning, and code performance.
The Benefits of Code for Non-Coding Tasks
The experiments demonstrated that code substantially improved LLM performance on non-coding tasks.
In natural language reasoning, models trained with code consistently outperformed text-only counterparts. Remarkably, pre-training exclusively on code yielded the highest performance in these benchmarks.
“This indicates that initializing from a pre-trained model with a mix of code positively influences natural language reasoning tasks,” the researchers explained.
For world knowledge tasks, a balanced dataset of code and text during pre-training produced the best results. The researchers suggested that “optimal performance on world knowledge tasks relies on a balanced data mix for initialization and a more significant proportion of text during continual pre-training.”
In generative tasks, both code-only and balanced models surpassed text-only models, indicating that incorporating code not only enhances reasoning but also improves generative quality.
Furthermore, the researchers noted that the benefits of adding code increased with model size, with the most substantial gains observed in world knowledge and code performance, followed by modest improvements in natural language reasoning.
“These results suggest that the trade-off between natural language tasks and code generation intensifies as model size grows,” they stated.
Although LLMs often show emergent behavior at larger scales, the researchers were unable to test very large models due to cost limitations. However, they remain optimistic that their findings will extend to larger scales.
“Given our results hold from 470M to 2.8B parameters, we believe they will apply to even larger models and token budgets,” they noted.
The study also revealed that incorporating high-quality synthetic code into pre-training data significantly enhances performance, addressing the limitations of available human-generated code.
“Our synthetic code was created from problem statements to produce verified Python solutions,” said Viraat Aryabumi, the lead author and Research Scholar at Cohere. “This opens up future potential, as leveraging a high-performing teacher model is essential for generating effective synthetic code.”
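The paper's generation pipeline isn't reproduced here, but a generate-then-verify loop of the kind Aryabumi describes might look like the sketch below. The `teacher_generate` callable and the test-based verification harness are assumptions for illustration, not the study's actual tooling.

```python
# Sketch of producing "verified Python solutions" from problem statements:
# sample candidate solutions from a teacher model, execute them against unit
# tests in a subprocess, and keep only the candidates whose tests pass.
import subprocess
import sys
import tempfile

def verify_solution(candidate: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the candidate solution together with its tests; return True only if all tests pass."""
    program = candidate + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_synthetic_dataset(problems, teacher_generate, samples_per_problem: int = 4):
    """Collect verified (problem statement, solution) pairs for use as pre-training data."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = teacher_generate(problem["statement"])  # hypothetical teacher-model call
            if verify_solution(candidate, problem["tests"]):
                dataset.append({"statement": problem["statement"], "solution": candidate})
                break  # one verified solution per problem is enough for this sketch
    return dataset
```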
Additionally, they found that integrating code-adjacent data, such as GitHub pull requests and commits, boosted reasoning capabilities.
Incorporating code into the cooldown phase led to further performance enhancements in non-coding tasks, offering valuable insights for enterprises looking to fine-tune models with their specific data instead of training from scratch.
“The cooldown phase aligns closely with fine-tuning regarding cost, data quality, and resource requirements, delivering substantial gains. We recommend including code throughout the training process,” Aryabumi emphasized. “Utilizing high-quality code—such as internal codebases and code-adjacent data—can also improve results during cooldown.”
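As a rough illustration of that recommendation, a cooldown mixture that upweights internal code and code-adjacent data alongside curated text could be expressed as follows. The dataset names and weights are illustrative assumptions only.

```python
import random

# Hypothetical cooldown mixture combining curated text with high-quality
# code sources, as suggested above. Weights are placeholders.
COOLDOWN_MIX = {
    "curated_text":      0.50,
    "internal_codebase": 0.25,  # e.g. an enterprise's own source code
    "code_adjacent":     0.15,  # e.g. pull requests and commit messages
    "synthetic_code":    0.10,  # verified synthetic solutions
}

def sample_source(mix=COOLDOWN_MIX) -> str:
    """Pick which dataset the next cooldown training document is drawn from."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]
```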
As Cohere focuses on developing LLMs for enterprise applications, these findings may influence future model and product deployments, potentially offering a variety of pre-trained models with different text and code mixtures tailored for specific tasks. Enterprises can then fine-tune these models on proprietary data for optimal performance.
“Our findings are highly relevant for developers and will likely lead to the release of more efficient models,” Aryabumi stated. “What’s surprising is how code enhances performance beyond coding-related tasks, and this informs our approach to developing state-of-the-art models.”