This Week in AI: Major Tech Companies Adopt Synthetic Data Solutions

This week in the AI landscape, synthetic data has taken center stage.

Last Thursday, OpenAI unveiled Canvas, a new interface for working with ChatGPT, its AI-powered chatbot platform. Canvas opens a dedicated workspace for writing and coding projects: users generate text and code in place, then highlight specific sections to have ChatGPT edit them.

Canvas is a clear improvement to ChatGPT's user experience. The most intriguing part of the feature, though, is the model powering it: OpenAI fine-tuned its GPT-4o model on synthetic data to enable the new interactions Canvas supports.

“We employed cutting-edge synthetic data generation techniques, like distilling outputs from OpenAI’s o1-preview, to refine the GPT-4o model for Canvas, allowing targeted edits and high-quality inline comments,” wrote Nick Turley, ChatGPT's head of product, in a post on X. “This innovative approach has enabled rapid model improvement and the introduction of novel user interactions, all without relying on human-generated data.”
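For a sense of what "distilling outputs" into a fine-tune looks like mechanically, here is a minimal sketch using OpenAI's public fine-tuning API. To be clear, this is an illustration of the general pattern, not OpenAI's internal pipeline: the teacher and student model names and the prompts are placeholder assumptions. A stronger teacher model answers task prompts, and those prompt/response pairs become the supervised training set for a smaller student model.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical editing prompts; a real run would cover the target task broadly.
prompts = [
    "Rewrite this paragraph to be more concise: ...",
    "Add inline comments to this Python function: ...",
]

# 1. Use a stronger "teacher" model to generate high-quality responses.
with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        teacher_out = client.chat.completions.create(
            model="o1-preview",  # teacher; placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        # 2. Store each prompt/response pair in the chat fine-tuning format.
        example = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant",
                 "content": teacher_out.choices[0].message.content},
            ]
        }
        f.write(json.dumps(example) + "\n")

# 3. Upload the synthetic dataset and fine-tune the "student" model on it.
training_file = client.files.create(
    file=open("synthetic_train.jsonl", "rb"), purpose="fine-tune"
)
client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # student; a fine-tunable GPT-4o snapshot
)
```

The appeal is speed: once the teacher's outputs pass quality checks, no human labeling is needed to produce the training set, which is exactly the advantage Turley describes.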

OpenAI isn’t alone; major tech companies are increasingly turning to synthetic data for model training. For instance, Meta’s development of Movie Gen, a suite of AI tools for video creation and editing, partially utilized synthetic captions generated by an extension of its Llama 3 models. Although a team of human annotators polished these captions, much of the foundational work was automated.
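As a loose illustration of that captioning-plus-review workflow (not Meta's actual Movie Gen pipeline; the model name, prompt, and review heuristic below are all assumptions for the sketch), an instruction-tuned Llama 3 model can draft captions while weak ones are queued for human annotators:

```python
# Sketch: model-generated captions with a human review queue.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed stand-in model
)

scene_notes = [
    "A dog chases a frisbee across a sunlit park.",
    "Slow pan over a rainy city street at night.",
]

needs_review = []
for note in scene_notes:
    out = generator(
        [{"role": "user",
          "content": f"Write a one-sentence video caption for: {note}"}],
        max_new_tokens=60,
    )
    caption = out[0]["generated_text"][-1]["content"].strip()
    # Cheap heuristic: queue short or truncated captions for human polish.
    if len(caption.split()) < 5 or not caption.endswith("."):
        needs_review.append((note, caption))

print(f"{len(needs_review)} captions queued for human annotators")
```

The division of labor mirrors what Meta describes: the model does the bulk of the drafting, and humans polish the output rather than write it from scratch.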

OpenAI CEO Sam Altman has said he believes AI will eventually produce synthetic data good enough to effectively train itself, a significant advantage for companies such as OpenAI that spend heavily on human annotators and data licensing.

Meta has also enhanced its Llama 3 models with synthetic data. Additionally, OpenAI is reportedly sourcing synthetic training data from o1 for its next-generation model, code-named Orion.

The synthetic-data-first approach comes with challenges, however. As researchers have noted, the models that generate synthetic data are prone to hallucinations (i.e., fabricated inaccuracies) and carry their own biases, and both flaws flow directly into the data they produce.

To use synthetic data safely, it must be curated and filtered as meticulously as human-generated data. Skipping that step risks model degradation, a failure mode in which successive models grow less creative and more biased, eventually compromising their usefulness.
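What might that curation look like in its simplest form? Here is a minimal sketch with two common filters, exact-duplicate removal and a length gate. Production pipelines layer classifier-based quality scoring, decontamination, and bias audits on top of basics like these:

```python
import hashlib

def filter_synthetic(examples, min_words=5, max_words=500):
    """Minimal curation pass over synthetic text examples:
    drop exact duplicates and out-of-range lengths."""
    seen_hashes = set()
    kept = []
    for text in examples:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact (normalized) duplicate
        n_words = len(normalized.split())
        if not (min_words <= n_words <= max_words):
            continue  # too short to be useful, or suspiciously long
        seen_hashes.add(digest)
        kept.append(text)
    return kept

synthetic = [
    "The model edits the highlighted span only.",
    "The model edits the highlighted span only.",  # duplicate: dropped
    "ok",                                          # too short: dropped
]
print(filter_synthetic(synthetic))  # keeps just the first example
```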

Navigating this task at scale isn’t easy. Nevertheless, with real-world training data becoming increasingly expensive and harder to obtain, AI vendors may view synthetic data as their only viable option. We can only hope they proceed with caution.
