
Tensoic AI Releases Kan-Llama: A 7B Llama-2 LoRA Pre-Trained and Fine-Tuned on ‘Kannada’ Tokens

Tensoic AI, a Mumbai-based software company, has released Kan-Llama, a 7B Llama-2 model pre-trained and fine-tuned on ‘Kannada’ tokens. The model aims to strengthen linguistic support for low-resource Indic languages, with a specific focus on Kannada. In this article, we explore the significance of the release and the technical choices that make Kan-Llama a powerful tool for language processing.

The Need for Language Models

Language models play a crucial role in many NLP applications, including machine translation, text generation, and sentiment analysis. These models are trained on large datasets and learn the statistical patterns and structures of language. However, building language models for non-English languages poses unique challenges, such as limited resources and a lack of comprehensive linguistic data.


Existing language models, like the Llama-2 model developed by Meta, have achieved impressive results in English and other widely spoken languages. However, the same level of support and accuracy is often lacking for languages with smaller user bases, such as Kannada. This is where Tensoic’s Kan-Llama comes into the picture.

Introducing Kan-Llama

Kan-Llama is a 7-billion-parameter Llama-2 model that has been pre-trained and fine-tuned on ‘Kannada’ tokens. It is designed to address the limitations of existing language models for non-English languages, particularly those with scarce linguistic resources. By focusing on Kannada, a Dravidian language spoken predominantly in the Indian state of Karnataka, Tensoic aims to extend the power of Llama-2 to less-resourced languages.

Technical Advancements in Kan-Llama

To optimize the performance and accuracy of Kan-Llama, the Tensoic team has implemented several technical advancements. Let’s take a closer look at these innovations:

Vocabulary Modification

Kan-Llama modifies the model’s vocabulary through a SentencePiece tokenizer trained specifically on a Kannada text corpus and merged with the existing Llama-2 tokenizer [1]. By adapting the vocabulary to the script and structure of Kannada, Kan-Llama can tokenize Kannada text more efficiently and process it more accurately.
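As an illustration, the sketch below trains a SentencePiece model on a Kannada corpus and adds its pieces to the Llama-2 tokenizer. The file names and vocabulary size are hypothetical placeholders, not Tensoic's actual settings.

    # Minimal sketch of tokenizer extension; kannada_corpus.txt and the
    # vocab size are illustrative assumptions.
    import sentencepiece as spm
    from transformers import LlamaTokenizer

    # 1. Train a SentencePiece model on raw Kannada text.
    spm.SentencePieceTrainer.train(
        input="kannada_corpus.txt",   # hypothetical corpus file
        model_prefix="kannada_sp",
        vocab_size=20000,             # illustrative size
        character_coverage=1.0,       # keep all Kannada characters
    )

    # 2. Load the base Llama-2 tokenizer and add the new Kannada pieces.
    tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    sp = spm.SentencePieceProcessor(model_file="kannada_sp.model")
    new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
    num_added = tokenizer.add_tokens(
        [p for p in new_pieces if p not in tokenizer.get_vocab()]
    )
    print(f"Added {num_added} Kannada tokens")

    # 3. The model's embedding matrix must then be resized to match:
    # model.resize_token_embeddings(len(tokenizer))

After this step, common Kannada words map to single tokens rather than long sequences of byte fallbacks, which shortens sequences and speeds up both training and inference.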

Low-Rank Adaptation (LoRA)

The Tensoic team leveraged Low-Rank Adaptation (LoRA) during the pre-training phase of Kan-Llama. LoRA freezes the weights of the previously trained model and injects small trainable low-rank matrices alongside them, sharply reducing the overall number of trainable parameters. This efficient training method makes it computationally feasible to continue training language models with billions of parameters, such as Kan-Llama [1].
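To make this concrete, here is a minimal sketch of attaching LoRA adapters to a Llama-2 base model with the Hugging Face peft library. The rank, scaling factor, and target modules below are illustrative defaults, not Tensoic's published hyperparameters.

    # Minimal LoRA sketch with Hugging Face peft; hyperparameters are
    # illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
    )

    lora_config = LoraConfig(
        r=16,                                 # rank of the low-rank update matrices
        lora_alpha=32,                        # scaling factor for the updates
        target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()
    # Only the adapter matrices train; the 7B base weights stay frozen,
    # so the trainable fraction is typically well under 1% of all parameters.

Because only the small adapter matrices receive gradients, optimizer state and memory use drop dramatically, which is what makes training runs on a handful of A100s practical.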

Scalability and Conversational Capabilities

To ensure scalability and enhance its conversational capabilities, Kan-Llama was fine-tuned on data organized into structured conversational records. This structure allows the model to be trained efficiently on large volumes of text and to generate more contextually relevant responses during conversational tasks [4].
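For illustration, here is one way such a record might be flattened into a single training prompt. The Alpaca-style schema below is an assumption for the sake of the example, not Tensoic's published format.

    # Hypothetical instruction-style record and prompt template
    # (Alpaca-like schema assumed for illustration).
    record = {
        "instruction": "Translate the following sentence to Kannada.",
        "input": "Good morning",
        "output": "ಶುಭೋದಯ",
    }

    def to_prompt(r: dict) -> str:
        """Flatten one record into a single training string."""
        return (
            f"### Instruction:\n{r['instruction']}\n\n"
            f"### Input:\n{r['input']}\n\n"
            f"### Response:\n{r['output']}"
        )

    print(to_prompt(record))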

Training Process and Resources

The training of Kan-Llama involved pre-training on approximately 600 million Kannada tokens from the CulturaX dataset. To handle the computational requirements, Tensoic used Nvidia A100 80GB instances, which offer high-performance computing capabilities. The run took approximately 50 hours and cost an estimated $170, roughly $3.40 per GPU-hour, a modest budget that underscores how affordable LoRA-based training can be [4].
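For reference, the Kannada split of CulturaX can be streamed directly from the Hugging Face Hub. The dataset identifier below reflects its published location, though the dataset is gated, so verify the access terms before use.

    # Streaming the Kannada ("kn") split of CulturaX; access terms apply.
    from datasets import load_dataset

    ds = load_dataset("uonlp/CulturaX", "kn", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["text"][:80])  # each record carries a "text" field
        if i == 2:
            break

    # Back-of-the-envelope cost check: 50 h * ~$3.40/h per A100 80GB ≈ $170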

Advancing Research and Collaboration

Tensoic’s release of Kan-Llama reflects its commitment to advancing research in the field of NLP and machine learning. By making this language model open-source, Tensoic encourages collaboration and contributions from the broader research community. This openness fosters innovation and allows researchers to explore new applications and possibilities using the Kan-Llama model.

Furthermore, Tensoic has actively collaborated with organizations such as Microsoft to make language models more accessible for research and public use. Such partnerships contribute to the development of state-of-the-art models and promote the democratization of language processing technologies.

Conclusion

Tensoic’s introduction of Kan-Llama, a 7B Llama-2 model pre-trained and fine-tuned on ‘Kannada’ tokens, marks a significant milestone in the advancement of language models for non-English languages. By focusing on Kannada, a less-resourced Indian language, Tensoic aims to address the limitations of existing models and expand the linguistic capabilities of Llama-2.

Through technical advancements in vocabulary modification, low-rank adaptation, and scalability, Kan-Llama demonstrates its power in processing Kannada-language text. The open-source nature of Kan-Llama promotes collaboration and innovation, fostering further advancements in the field of NLP.

As language models continue to evolve, Tensoic’s release of Kan-Llama paves the way for improved support for non-English languages. It contributes to the development of more inclusive and comprehensive language processing technologies.


Ritvik Vipra

Ritvik is a graduate of IIT Roorkee with significant experience in software engineering and product development for core machine learning, deep learning, and data-driven enterprise products using state-of-the-art NLP and AI.
