AI Translation Model for Hindi Speakers Surpasses GPT-3.5 Turbo in Performance

Updated on October 24 2024

Large language models are renowned for their ability to support multiple languages, but many are primarily trained on English. Recently, a new model emerged specifically designed for Hindi-speaking users: OpenHathi-Hi-v0.1. Developed by Sarvam AI, an innovative Indian startup specializing in generative AI solutions, this model not only outperforms OpenAI's GPT-3.5 Turbo in numerous Hindi tasks but also retains high performance in English.

OpenHathi-Hi-v0.1 is built on the seven-billion-parameter version of Llama 2, a widely used open model from Meta. The Sarvam team expanded the tokenizer from Llama 2's 32,000-token vocabulary to 48,000 tokens, adding vocabulary for Hindi text. The model was trained on Hindi, English, and Hinglish (a popular fusion of Hindi and English), addressing the growing need for robust AI support in Indic languages, which are spoken by over 800 million people in countries such as India, Pakistan, Sri Lanka, and Bangladesh.

In their research, the Sarvam team evaluated several existing models by testing their ability to translate simple English sentences into Hindi. Although the models produced text in Devanāgarī, the script for Hindi, they often generated incorrect translations. For instance, when given the sentence "the price of petrol has been on a constant rise for a few years," one model incorrectly translated it as "There is a problem of too many workplace values."

Training a language model for Indic languages, particularly Hindi, presents unique challenges compared to English. Sarvam developed a specialized tokenizer for Hindi to improve the model's comprehension and efficiency. This involved training a SentencePiece tokenizer on a large Hindi text corpus and merging it with the base model's tokenizer. The team also made provisions for Romanized Hindi, which is often typed on English keyboards, so that the model can handle Hindi written in either script alongside English.
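The tokenizer merge described above can be illustrated with a toy sketch. This is not Sarvam's actual pipeline: the function names `merge_vocabularies` and `greedy_tokenize`, the tiny vocabularies, and the longest-match segmentation are all assumptions made for illustration, and a real SentencePiece model uses a learned segmentation rather than this greedy rule. The sketch only shows the general idea that new pieces are appended after the base vocabulary so existing token IDs stay unchanged, and that a Hindi word then needs fewer tokens.

```python
def merge_vocabularies(base_vocab, new_pieces):
    """Append pieces missing from base_vocab, keeping every base ID unchanged."""
    merged = dict(base_vocab)
    next_id = max(merged.values(), default=-1) + 1
    for piece in new_pieces:
        if piece not in merged:  # skip pieces the base tokenizer already has
            merged[piece] = next_id
            next_id += 1
    return merged


def greedy_tokenize(text, vocab):
    """Toy longest-match segmentation; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character as a fallback.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabularies, far smaller than any real tokenizer.
word = "पेट्रोल"  # "petrol" in Devanagari
base_vocab = {"price": 0, "petrol": 1, " ": 2}
hindi_pieces = [word, "petrol"]  # "petrol" is a duplicate and is skipped

merged_vocab = merge_vocabularies(base_vocab, hindi_pieces)

print(len(greedy_tokenize(word, base_vocab)), "tokens before merging")
print(len(greedy_tokenize(word, merged_vocab)), "tokens after merging")
```

In a real model, each newly added token would also need a new row in the embedding matrix, which is then learned during continued training.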

To supplement the limited available training data, Sarvam translated English content into Hindi, enriching their dataset. A collaboration with AI4Bharat, a research lab at the Indian Institute of Technology Madras, provided vital language resources and benchmarks for evaluating the model's performance. Additionally, the model underwent fine-tuning for specific tasks such as translation, content moderation, and text simplification.

The performance results are promising. OpenHathi-Hi-v0.1 excelled on the FLoRes-200 benchmark for translating Devanāgarī Hindi to English, outperforming both GPT-3.5 and GPT-4 models, although it did lag behind IndicTrans2 and Google Translate. Notably, OpenHathi displayed even greater accuracy in translating Romanized Hindi to English than in translating Devanāgarī Hindi, suggesting that the underlying English token embeddings within the Llama model effectively enhance performance across both languages.

Despite its impressive capabilities, Sarvam acknowledges that OpenHathi has limitations, particularly the risk of catastrophic forgetting. This phenomenon occurs when a model that was initially proficient in one language, here English, loses accuracy in that language while being trained on a new one, such as Hindi.

OpenHathi-Hi-v0.1 can be accessed on Hugging Face. It is designed as a base model, so users are encouraged to fine-tune it for their own applications. The model is released under the Llama 2 license, which permits commercial use but requires companies with more than 700 million monthly active users to obtain a separate license from Meta. Sarvam has indicated that enterprise-grade versions of this model will be launching soon, promising even broader applications.

For those interested in a comprehensive understanding of the model's evaluation processes, Sarvam has also shared a video explanation on their YouTube channel.
