Tech Mahindra's Project Indus: A cultural odyssey to preserve & empower India's diverse languages through AI

Tech Mahindra's Project Indus is a major step towards linguistic inclusivity and technological empowerment in India. It is building the largest Indic large language model (LLM) to bridge linguistic gaps and create a more inclusive AI landscape in the country.

Tech Mahindra Unveils Project Indus

Highlights

  • Tech Mahindra's initiative to build India's largest Indic large language model (LLM)
  • Addressing linguistic disparities to make AI more accessible
  • Paving the way for a more inclusive and culturally aware AI landscape in India

In a remarkable endeavour, Tech Mahindra recently unveiled Project Indus, its open-source large language model (LLM), a groundbreaking initiative aimed at creating a colossal foundational model for Indian languages. Notably, the company has also previously trained 8,000 employees in AI skills to stay ahead in the ever-evolving tech landscape.

While LLMs like OpenAI's GPT have been transformative, their primary focus has largely been on English, leaving a substantial gap in comprehending and generating content in Indian languages. In its first testing phase, Project Indus by Tech Mahindra is on a mission to collect dialects from all corners of India.

This ambitious initiative seeks to build the largest language model for Indian languages. From the lyrical rhythm of Bengali to the melodic tones of Tamil, it aims to embrace India's linguistic diversity.

The project invites contributions from people across the nation through the "Bhasha Daan" portal, recognising that every dialect is a cultural treasure. While navigating this linguistic journey, Project Indus is committed to ensuring fairness and cultural sensitivity.

Make your contribution by providing dialects (Photo: Tech Mahindra)

Project Indus represents not just a technological endeavour but a celebration of India's rich linguistic diversity and a promise to empower and preserve its many languages.

The quest for India's biggest Indic LLM

Project Indus aspires to build a formidable 7-billion-parameter LLM, initially supporting 40 different Hindi dialects, with plans to incorporate more languages and dialects in the future. The overarching goal is to create a model that first excels at text continuation and eventually engages in dialogue.

First phase of Project Indus (Photo: Tech Mahindra)

Tech Mahindra envisions democratising AI for non-English speakers in India, ensuring cultural sensitivity and preserving underrepresented languages.

Benefits of an Indic LLM

An Indic LLM offers numerous advantages, including versatility for diverse tasks like Q&A, content generation, and more, making it invaluable for industries like healthcare, retail, and tourism. Moreover, it addresses the issue of higher token costs for Indic languages in existing models, providing a cost-effective solution.
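The token-cost point can be illustrated with a rough sketch. Many existing tokenizers work on UTF-8 bytes and have learned relatively few merges for Indic scripts, so Indic text tends to split into more tokens than English text of similar length. The snippet below only counts UTF-8 bytes per character as a proxy for that effect. It is not the tokenizer of any particular model, and the sample sentences and the bytes_per_char helper are illustrative assumptions.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in text."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample sentences (assumptions, not taken from any dataset).
english = "How are you?"
hindi = "आप कैसे हैं?"

for label, sample in (("English", english), ("Hindi", hindi)):
    n_chars = len(sample)
    n_bytes = len(sample.encode("utf-8"))
    print(f"{label}: {n_chars} code points, {n_bytes} bytes, "
          f"{bytes_per_char(sample):.2f} bytes per code point")
```

ASCII letters take one byte each in UTF-8, while Devanagari characters take three, so a byte-oriented tokenizer without script-specific vocabulary starts from a longer sequence for the Hindi sentence. A model whose vocabulary is trained on Indic text could, in principle, narrow that gap.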

Additionally, it aids the preservation of endangered languages, and Tech Mahindra's customers stand to benefit from techniques developed for the model.

Building datasets: A crucial challenge

A critical aspect of this ambitious project is sourcing high-quality datasets for Indic languages and dialects. Tech Mahindra is actively collaborating with leading universities and stakeholders while crowdsourcing contributions from native speakers.

Through their "Bhasha Daan" portal, they encourage individuals to share expressions, vocabulary, and conversations in their dialects, fostering a collective effort to create robust datasets.

Mitigating biases: Ensuring fairness

To ensure fairness in the model's output, Tech Mahindra acknowledges the importance of mitigating biases within datasets. They plan to employ both human annotation and automatic techniques to scrub the data for racial, ethnic, and gender biases, among others.

The success of Project Indus hinges on robust data collection, efficient model training, and the meticulous handling of linguistic nuances.

In short, Tech Mahindra's Project Indus represents a significant stride toward linguistic inclusivity and technological empowerment for India. By creating the largest Indic LLM and addressing linguistic disparities, this initiative not only bridges gaps but also paves the way for a more inclusive and culturally sensitive AI landscape in India.