Salesforce has unveiled a language model benchmark built specifically for businesses to evaluate AI models on customer relationship management (CRM) tasks. Benchmarks are standardized tests of a language model's performance, giving model owners a comparable measure of output quality on targeted tasks, much as the established MMLU benchmark does for general knowledge.
The benchmark focuses on core CRM applications, enabling organizations to measure AI systems in sales and service scenarios across four metrics: accuracy, cost, speed, and trust and safety. Developed by Salesforce's AI research team, it addresses a gap in existing evaluations, which often overlook business-relevant factors such as operational cost and trustworthiness.
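To illustrate how these four dimensions might be weighed against one another in practice, here is a minimal sketch of a composite score; the weights, scores, and function name are illustrative assumptions on our part, not Salesforce's published scoring methodology.

```python
# Hypothetical sketch: combining the four benchmark dimensions into one
# comparable number. Weights and scores are illustrative only; Salesforce
# has not published a composite-scoring formula.

def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalized metric scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

# Example: a team that values trust and cost over raw speed.
weights = {"accuracy": 0.35, "cost": 0.25, "speed": 0.10, "trust": 0.30}
candidate = {"accuracy": 0.82, "cost": 0.70, "speed": 0.65, "trust": 0.91}

print(f"Composite score: {composite_score(candidate, weights):.3f}")
```

In this framing, two models with identical accuracy can land in very different places once cost and trust are factored in, which is precisely the trade-off the benchmark is designed to surface.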
According to Clara Shih, CEO of Salesforce AI, "Business organizations are embracing AI to fuel growth, reduce expenses, and deliver personalized customer experiences—not for unrelated tasks like planning a birthday party or summarizing Shakespeare." She emphasizes that this benchmark transcends mere measurement; it serves as a dynamic and comprehensive framework that empowers companies to make informed decisions by effectively balancing accuracy, cost, speed, and trust.
Model owners can leverage Salesforce’s benchmark by comparing their results on a public leaderboard. Initially, this benchmark ranks OpenAI’s GPT-4 Turbo as the most accurate model for CRM tasks, while Anthropic's Claude 3 Haiku stands out as one of the most cost-effective options.
In the speed category, Mixtral 8x7B, developed by the French AI startup Mistral AI, leads as the fastest model. Notably, the top performers on speed are all small language models; the larger GPT-3.5 Turbo ranks lower on speed metrics.
When it comes to trust and safety, Google's Gemini 1.5 Pro earned the highest score at 91%. Close behind are Meta's Llama 3 models, with both the 8B and 70B configurations scoring 90%. OpenAI's GPT-4 Turbo and GPT-4o received trust scores of 89% and 85%, respectively. The least trustworthy model, OpenAI's GPT-3.5 Turbo, managed only a 60% safety score, pointing to significant shortcomings on the privacy and truthfulness assessments.
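For a side-by-side view, the scores quoted above can be lined up directly; in this short sketch the data structure and output formatting are ours, while the figures come from the benchmark's initial results.

```python
# Trust and safety scores as reported in the benchmark's initial results.
trust_scores = {
    "Gemini 1.5 Pro": 0.91,
    "Llama 3 70B": 0.90,
    "Llama 3 8B": 0.90,
    "GPT-4 Turbo": 0.89,
    "GPT-4o": 0.85,
    "GPT-3.5 Turbo": 0.60,
}

# Rank models from most to least trustworthy.
for model, score in sorted(trust_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<15} {score:.0%}")
```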
Looking ahead, Salesforce plans to expand the CRM benchmark with additional CRM use cases and support for fine-tuned models.
As Silvio Savarese, Executive Vice President and Chief Scientist at Salesforce AI Research, remarks, “As AI continues to advance, enterprise leaders recognize the necessity of finding the right blend of performance, accuracy, responsibility, and cost to fully harness the potential of generative AI for driving business growth.”
The introduction of Salesforce's LLM Benchmark for CRM is a pivotal development in how businesses evaluate their AI strategies, offering enhanced visibility into next-generation AI deployment and accelerating the time to value for CRM-specific applications.