Mostly AI is tackling a significant AI training bottleneck for enterprises. The Austrian company, known for its synthetic data generation platform, today launched its synthetic text functionality. The new feature lets businesses extract value from their proprietary text datasets while minimizing privacy risks.
The offering produces a synthetic version of an organization's proprietary data that is free of personally identifiable information (PII) and addresses diversity gaps in the original. This empowers teams to train and fine-tune large language models (LLMs) more effectively, supporting faster innovation and better decision-making.
Addressing AI Training Challenges
The launch comes at a critical moment when AI training is stagnating, prompting enterprises to seek alternatives to public data sources. With the rise of generative AI, synthetic data is becoming a vital resource. According to Gartner, by 2026, 75% of companies are expected to leverage generative AI for creating synthetic data, a significant increase from under 5% in 2023.
Understanding Synthetic Text
Synthetic data is often the preferred solution when real data is costly or unavailable. While enterprises have long utilized synthetic images, the generative AI boom is set to broaden synthetic data's use across other data types. However, synthetic data can sometimes lack crucial organization-specific context, hindering the performance of AI models.
To combat this challenge, Mostly AI provides a platform where enterprises can train their own AI generators to produce on-demand synthetic data. Initially focused on structured tabular datasets, which capture transaction nuances and customer journeys, the platform now extends its capabilities to text data.
Proprietary text datasets—such as emails, chatbot conversations, and support transcripts—pose challenges due to PII, diversity gaps, and varying levels of structure. With the new synthetic text feature, users can train a text generator on their proprietary data, producing a synthetic version that retains the nuances and insights of the original text while omitting PII and closing diversity gaps.
Users can also select from various language model options (including Mistral-7B and Viking-7B) to optimize their text generator. As CEO Tobias Hann explained, “The selected LLM is fine-tuned with the original text data in conjunction with structured data, enhancing the quality of the generated synthetic text.” Once fine-tuned, the platform creates synthetic text that can be downloaded or stored for further analysis.
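The fine-tuning input Hann describes pairs each piece of free text with its structured context. The following is a minimal sketch of what such a combined training record might look like; the field names, record format, and data are illustrative assumptions, not Mostly AI's actual schema.

```python
# Hypothetical illustration: serializing structured fields together with free
# text so an LLM can be fine-tuned on both at once. Field names and the
# record layout are assumptions for illustration only.
def build_training_record(row):
    """Flatten one customer-support row into a single training string."""
    context = ", ".join(f"{k}={v}" for k, v in row["structured"].items())
    return f"[CONTEXT] {context}\n[TEXT] {row['text']}"

rows = [
    {
        "structured": {"channel": "chat", "product": "savings_account"},
        "text": "Customer asked how to raise the daily transfer limit.",
    },
    {
        "structured": {"channel": "email", "product": "credit_card"},
        "text": "Customer disputed a duplicate charge from last week.",
    },
]

records = [build_training_record(r) for r in rows]
for rec in records:
    print(rec)
```

Conditioning the generator on structured columns in this way is one plausible route to the "in conjunction with structured data" behavior described above.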
Benefits for Enterprises
With the synthetic text generated from this platform, enterprises can enhance their analytics and generative AI applications. While no live deployments exist yet, the initial focus is on generating prompt-response pairs (such as question-answer pairs) commonly used to fine-tune LLMs for customer service.
This new capability allows enterprises to extract value from proprietary text without privacy concerns, making it an attractive option for enhancing AI training efforts. Mostly AI claims that training a text classifier on its synthetic text yielded a 35% performance boost over a classifier trained on data generated by prompting GPT-4o-mini.
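The comparison above rests on a standard evaluation pattern: train a classifier on one (synthetic) dataset and measure its accuracy on held-out real data. The toy sketch below illustrates only that pattern with a simple nearest-centroid bag-of-words model on made-up examples; it does not reproduce Mostly AI's methodology or its 35% figure.

```python
from collections import Counter

# Toy illustration of "train on synthetic, evaluate on real": fit a trivial
# nearest-centroid text classifier on synthetic examples, then score it on
# held-out real examples. All data and the model choice are illustrative.

def tokenize(text):
    return text.lower().split()

def train_centroids(examples):
    """Build one bag-of-words centroid (word-count profile) per label."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids

def predict(centroids, text):
    """Pick the label whose centroid shares the most word mass with the text."""
    words = Counter(tokenize(text))
    def overlap(centroid):
        return sum(min(words[w], centroid[w]) for w in words)
    return max(centroids, key=lambda label: overlap(centroids[label]))

synthetic_train = [
    ("card was charged twice for one purchase", "billing"),
    ("refund for a duplicate charge", "billing"),
    ("cannot log in to the mobile app", "access"),
    ("password reset link never arrives", "access"),
]
real_test = [
    ("duplicate charge on my card", "billing"),
    ("log in fails after password reset", "access"),
]

model = train_centroids(synthetic_train)
accuracy = sum(predict(model, t) == y for t, y in real_test) / len(real_test)
print(f"accuracy: {accuracy:.2f}")  # → accuracy: 1.00 on this toy data
```

In a real comparison, the same test set would score two classifiers, one trained on each candidate dataset, and the gap between their accuracies would quantify the benefit claimed.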
However, it’s important to note that this represents an early comparison, with no established benchmarks yet to measure Mostly AI’s synthetic text generator against other generators, such as Gretel.
Hann emphasized, “The Mostly AI platform has previously been benchmarked against competing solutions and has consistently shown superior performance in the quality and privacy of the generated synthetic data.”