Large language models (LLMs) are typically pre-trained on extensive datasets that include both text and code. While code is crucial for models focused on programming tasks, its inclusion has become increasingly common even in LLMs designed for non-coding applications.
In a recent study, researchers at Cohere explored how the presence of code data in LLM pre-training impacts overall performance across various tasks beyond coding.
“While practitioners have anecdotally agreed that code data is vital for LLMs' performance, limited research has analyzed its precise impact on non-code tasks,” the researchers noted.
Their investigation confirms that code significantly enhances LLM performance across a variety of non-coding tasks, with implications for real-world training applications.
Investigating the Impact of Code
The researchers conducted a series of experiments to assess how code influences general LLM performance. Key factors included the proportion of code in the training data, the point at which code was introduced during training, the quality of the code, and model size.
Using a two-phase training approach, they performed "continued pre-training" on pre-trained models, incorporating different ratios of text and code over a fixed number of tokens. This was followed by a "cooldown" phase that emphasized higher-quality datasets during the final stages of training.
The baseline model was trained solely on text. Other models were pre-trained on balanced datasets of text and code, or on code-only data before transitioning to text.
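To make the setup concrete, here is a minimal sketch of how such mixture variants might be expressed in code. The dataset names, ratios, and token budgets below are illustrative assumptions, not the exact configuration reported in the study.

```python
# Illustrative data-mixture recipes for the pre-training variants described
# above. All names, proportions, and token budgets are placeholder values.
from dataclasses import dataclass, field

@dataclass
class MixturePhase:
    name: str
    proportions: dict   # dataset name -> fraction of tokens in this phase
    token_budget: int   # total tokens consumed in this phase

@dataclass
class TrainingRecipe:
    phases: list = field(default_factory=list)

# Baseline: continued pre-training on text only, then a text cooldown.
text_only = TrainingRecipe(phases=[
    MixturePhase("continued_pretraining", {"web_text": 1.0}, token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 1.0}, token_budget=40_000_000_000),
])

# Balanced: equal parts text and code during continued pre-training.
balanced = TrainingRecipe(phases=[
    MixturePhase("continued_pretraining", {"web_text": 0.5, "source_code": 0.5},
                 token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 0.8, "high_quality_code": 0.2},
                 token_budget=40_000_000_000),
])

# Code-initialized: pre-train on code only, then transition to text.
code_then_text = TrainingRecipe(phases=[
    MixturePhase("code_pretraining", {"source_code": 1.0}, token_budget=200_000_000_000),
    MixturePhase("continued_pretraining", {"web_text": 1.0}, token_budget=200_000_000_000),
    MixturePhase("cooldown", {"high_quality_text": 1.0}, token_budget=40_000_000_000),
])
```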
They evaluated models ranging from 470 million to 2.8 billion parameters across various benchmarks focused on world knowledge, natural language reasoning, and code performance.
The Benefits of Code for Non-Coding Tasks
The experiments demonstrated that code substantially improved LLM performance on non-coding tasks.
In natural language reasoning, models trained with code consistently outperformed text-only counterparts. Remarkably, pre-training exclusively on code yielded the highest performance in these benchmarks.
“This indicates that initializing from a pre-trained model with a mix of code positively influences natural language reasoning tasks,” the researchers explained.
For world knowledge tasks, a balanced dataset of code and text during pre-training produced the best results. The researchers suggested that “optimal performance on world knowledge tasks relies on a balanced data mix for initialization and a more significant proportion of text during continual pre-training.”
In generative tasks, both code-only and balanced models surpassed text-only models, indicating that incorporating code not only enhances reasoning but also improves generative quality.
Furthermore, the researchers noted that the benefits of adding code increased with model size, with the most substantial gains observed in world knowledge and code performance, followed by modest improvements in natural language reasoning.
“These results suggest that the trade-off between natural language tasks and code generation intensifies as model size grows,” they stated.
Although LLMs often show emergent behavior at larger scales, the researchers were unable to test very large models due to cost limitations. However, they remain optimistic that their findings will extend to larger scales.
“Given our results hold from 470M to 2.8B parameters, we believe they will apply to even larger models and token budgets,” they noted.
The study also revealed that incorporating high-quality synthetic code into pre-training data significantly enhances performance, addressing the limitations of available human-generated code.
“Our synthetic code was created from problem statements to produce verified Python solutions,” said Viraat Aryabumi, the lead author and Research Scholar at Cohere. “This opens up future potential, as leveraging a high-performing teacher model is essential for generating effective synthetic code.”
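The paper's generation pipeline isn't reproduced here, but a generate-then-verify loop of the kind Aryabumi describes might look like the sketch below. The `teacher_generate` callable and the test-based verification harness are assumptions for illustration, not the study's actual tooling.

```python
# Sketch of producing "verified Python solutions" from problem statements:
# sample candidate solutions from a teacher model, execute them against unit
# tests in a subprocess, and keep only the candidates whose tests pass.
import subprocess
import sys
import tempfile

def verify_solution(candidate: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the candidate solution together with its tests; return True only if all tests pass."""
    program = candidate + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_synthetic_dataset(problems, teacher_generate, samples_per_problem: int = 4):
    """Collect verified (problem statement, solution) pairs for use as pre-training data."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = teacher_generate(problem["statement"])  # hypothetical teacher-model call
            if verify_solution(candidate, problem["tests"]):
                dataset.append({"statement": problem["statement"], "solution": candidate})
                break  # one verified solution per problem is enough for this sketch
    return dataset
```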
Additionally, they found that integrating code-adjacent data, such as GitHub pull requests and commits, boosted reasoning capabilities.
Incorporating code into the cooldown phase led to further performance enhancements in non-coding tasks, offering valuable insights for enterprises looking to fine-tune models with their specific data instead of training from scratch.
“The cooldown phase aligns closely with fine-tuning regarding cost, data quality, and resource requirements, delivering substantial gains. We recommend including code throughout the training process,” Aryabumi emphasized. “Utilizing high-quality code—such as internal codebases and code-adjacent data—can also improve results during cooldown.”
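As a rough illustration of that recommendation, a cooldown mixture that upweights internal code and code-adjacent data alongside curated text could be expressed as follows. The dataset names and weights are illustrative assumptions only.

```python
import random

# Hypothetical cooldown mixture combining curated text with high-quality
# code sources, as suggested above. Weights are placeholders.
COOLDOWN_MIX = {
    "curated_text":      0.50,
    "internal_codebase": 0.25,  # e.g. an enterprise's own source code
    "code_adjacent":     0.15,  # e.g. pull requests and commit messages
    "synthetic_code":    0.10,  # verified synthetic solutions
}

def sample_source(mix=COOLDOWN_MIX) -> str:
    """Pick which dataset the next cooldown training document is drawn from."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]
```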
As Cohere focuses on developing LLMs for enterprise applications, these findings may influence future model and product deployments, potentially offering a variety of pre-trained models with different text and code mixtures tailored for specific tasks. Enterprises can then fine-tune these models on proprietary data for optimal performance.
“Our findings are highly relevant for developers and will likely lead to the release of more efficient models,” Aryabumi stated. “What’s surprising is how code enhances performance beyond coding-related tasks, and this informs our approach to developing state-of-the-art models.”