Large language models are renowned for their ability to support multiple languages, but many are trained primarily on English. Recently, a new model designed specifically for Hindi-speaking users emerged: OpenHathi-Hi-v0.1. Developed by Sarvam AI, an Indian startup specializing in generative AI, the model not only outperforms OpenAI's GPT-3.5 Turbo on numerous Hindi tasks but also retains strong performance in English.
OpenHathi-Hi-v0.1 is built on the seven-billion-parameter version of Llama 2, a widely used open-source model from Meta. The Sarvam team expanded the tokenizer's vocabulary to 48,000 tokens (from the base model's 32,000), enabling it to represent Hindi text far more efficiently. The model was trained on Hindi, English, and Hinglish, a popular fusion of Hindi and English, addressing the growing need for robust AI support in Indic languages, which are spoken by over 800 million people in regions including India, Pakistan, Sri Lanka, and Bangladesh.
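To see what the expanded vocabulary means in practice, the sketch below inspects how the tokenizer splits a Hindi sentence. It assumes the model's public Hugging Face repository id is sarvamai/OpenHathi-7B-Hi-v0.1; verify the exact id on Hugging Face before running.

```python
# A minimal sketch: inspect the expanded tokenizer.
# The repo id "sarvamai/OpenHathi-7B-Hi-v0.1" is an assumption; check it first.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1")

# Vocabulary size should reflect the expansion beyond Llama 2's 32,000 tokens.
print(len(tokenizer))

# Devanagari text should map to far fewer tokens than under the base
# Llama 2 tokenizer, which falls back to byte-level pieces for this script.
hindi_sentence = "पेट्रोल की कीमत कुछ वर्षों से लगातार बढ़ रही है।"
print(tokenizer.tokenize(hindi_sentence))
```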
In their research, the Sarvam team evaluated several existing models by testing their ability to translate simple English sentences into Hindi. Although the models produced text in Devanāgarī, the script for Hindi, they often generated incorrect translations. For instance, when given the sentence "the price of petrol has been on a constant rise for a few years," one model incorrectly translated it as "There is a problem of too many workplace values."
Training a language model for Indic languages, particularly Hindi, presents unique challenges compared to English. Sarvam developed a specialized Hindi tokenizer to improve the model's comprehension and efficiency. This involved training a SentencePiece tokenizer on a large Hindi text corpus and merging it with the base model's tokenizer. The team also made provisions for Romanized Hindi, which is commonly typed on English keyboards, ensuring the model could handle both Hindi and English inputs effectively.
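Sarvam hasn't published its exact tokenizer pipeline, but the general recipe for this kind of vocabulary expansion, used by projects such as Chinese-LLaMA, can be sketched as follows. File names are placeholders, and the corpus is assumed to be raw Hindi text with one sentence per line.

```python
# A sketch of training a Hindi SentencePiece model and merging it into the
# base Llama 2 tokenizer. Not Sarvam's exact code; file paths are placeholders.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1. Train a SentencePiece model on a Hindi corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",
    model_prefix="hindi_sp",
    vocab_size=16000,          # new pieces to add on top of Llama 2's 32,000
    character_coverage=1.0,    # cover the full Devanagari character set
    model_type="bpe",
)

# 2. Merge the new pieces into the base Llama 2 tokenizer's model proto.
llama_proto = sp_pb2.ModelProto()
with open("llama2/tokenizer.model", "rb") as f:
    llama_proto.ParseFromString(f.read())

hindi_proto = sp_pb2.ModelProto()
with open("hindi_sp.model", "rb") as f:
    hindi_proto.ParseFromString(f.read())

existing = {p.piece for p in llama_proto.pieces}
for p in hindi_proto.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 3. Save the merged tokenizer. The model's embedding matrix must then be
#    resized to the new vocabulary size before continued pretraining.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
print(f"Merged vocabulary size: {len(llama_proto.pieces)}")
```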
To supplement the limited training data available, Sarvam translated English content into Hindi, enriching the dataset. A collaboration with AI4Bharat, a research lab at the Indian Institute of Technology Madras, provided vital language resources and benchmarks for evaluating the model's performance. The model also underwent fine-tuning for specific tasks such as translation, content moderation, and text simplification.
The performance results are promising. OpenHathi-Hi-v0.1 excelled on the FLORES-200 benchmark for translating Devanāgarī Hindi to English, outperforming both GPT-3.5 and GPT-4, though it lagged behind IndicTrans2 and Google Translate. Notably, OpenHathi was even more accurate at translating Romanized Hindi to English than Devanāgarī Hindi, suggesting that the base Llama model's English-centric token embeddings transfer especially well to Hindi written in the Latin script.
Despite its impressive capabilities, Sarvam acknowledges that OpenHathi has certain limitations, particularly catastrophic forgetting. This phenomenon occurs when a model that was initially proficient in one language, in this case English, loses accuracy in that language while being trained on a new one, such as Hindi.
OpenHathi-Hi-v0.1 is available on Hugging Face. It is a base model, so users are encouraged to fine-tune it for the specific applications they need. The model is released under the Llama 2 license, which permits free use except by the largest companies (those with more than 700 million monthly active users). Sarvam says enterprise-grade versions of the model are coming soon, promising even broader applications.
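As a rough illustration of getting started, the sketch below loads the model for plain text continuation, again assuming the repo id sarvamai/OpenHathi-7B-Hi-v0.1. Because this is a base model, it continues a prompt rather than following instructions.

```python
# A minimal generation sketch; the repo id is an assumption, verify it first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/OpenHathi-7B-Hi-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # fits a 7B model on a single ~24 GB GPU
    device_map="auto",           # requires the accelerate package
)

prompt = "पेट्रोल की कीमत"  # "the price of petrol"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```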
For those interested in a comprehensive understanding of the model's evaluation processes, Sarvam has also shared a video explanation on their YouTube channel.