AI In India’s Native Tongues? Not So Quick

To enable AI in regional languages, India needs to build datasets, incentives, and environments. Right now, that looks like a huge challenge.

AI enables digital access in Indian languages for education, healthcare, governance and services.

India has run into a unique hurdle on the artificial intelligence (AI) front: its rich tapestry of tongues.

Most AI models in the world have been developed from English content floating around on the internet. Chinese companies have huge amounts of data in Mandarin. But how does India, with 22 scheduled languages and hundreds of dialects, train its AI models in native languages?

Large Language Models (LLMs) are trained on vast amounts of text and speech data, and the performance of an AI system depends on the quality and availability of that data.

While AI systems are becoming more proficient in working with several languages, the harder part is building the data, incentives, and environments to enable regional-language innovation.

“AI enables you to work in various languages quite effectively,” said Amit Khanna, Partner and Automation Ecosystem Leader at Grant Thornton Bharat. “But are enough incentives available to businesses to create products for Tamil, Telugu, Bengali, Marathi, Gujarati, and other Indian language speakers?”

If AI continues to be English-based, a significant portion of the population may not reap the benefits of the new digital services. The stakes are getting higher with the emergence of AI use in healthcare, education, agriculture and public administration, which involves millions of citizens.

Why Language Matters in AI

The Census 2011 data reveals that Hindi is the mother tongue of 43.6% of Indians, followed by Bengali (8%), Marathi (6.9%), Telugu (6.7%), Tamil (5.7%) and Gujarati (4.6%). Millions more communicate in Kannada, Malayalam, Punjabi, Odia, and Urdu.

India has no single dominant language in the way China has Mandarin. The language used for administration, commerce, and technology differs from state to state. While Hindi is the main language in most parts of northern and central India, Tamil dominates in Tamil Nadu, Bengali in West Bengal, Gujarati in Gujarat, and Marathi in Maharashtra.

English continues to play an important role in business, higher education, and technology. English as a second language is used by over one-fifth of Indians, and as a third language by almost 7%, according to language data from the last census. But in their daily life, the majority of citizens still communicate in Indian languages.

With the rise of AI-powered chatbots and their integration into public services, language accessibility has become crucial. A farmer looking for agriculture-related data, a student accessing educational materials, and a citizen applying for government benefits may all need help from AI in their native tongues.

Dhrubabrata Ghosh Dastidar, Managing Director of Protiviti Member Firm for India, told The Secretariat, "The real opportunity lies in democratising access to AI so that startups, enterprises, government institutions, and citizens can all benefit from its capabilities rather than restricting its advantages to a handful of large technology players."

AI should be considered a capability that can serve society, not just a few users, he said.

Bridging India's Language Divide

Language accessibility will be a key factor in driving the adoption of AI across India, says Khanna.

"Today, everybody has a mobile phone. But everything in the central space comes only in English and Hindi, and maybe one state language. But if I am living in a rural area, I don't have any access. I should actually get the same thing in my language, in a small town where I have no other access," he said.

AI can help overcome some of the communication challenges that have long plagued governance and public services, Khanna said. Multilingual AI systems could give citizens access to government schemes, healthcare, and education materials in their native language, at both national and regional levels.

That capability could be transformative for a country striving to be digitally inclusive. Instead of making citizens adapt to technology, AI could adapt to the language realities of citizens.

The Data Challenge

Technology, however, may not be the greatest hurdle.

Many Indian languages are poorly represented online. Regional languages have smaller digital footprints than English and fewer high-quality datasets with which to train AI, and dialects and local variations add further complexity. With limited data in the target language, it is much harder to build a comprehensive AI model that can accurately process and interpret that language and produce a response.

Google's recent front-page campaigns promoting AI in Indian languages, for instance, signal growing momentum for vernacular AI. However, in many cases, these capabilities remain closer to advanced translation than truly native-language AI, as India still lacks sufficient datasets and mature LLMs trained directly on several regional languages.

Steps Taken So Far

The Indian government has taken steps to address the problem, including the launch of the IndiaAI Mission, which focuses on advancing the use of AI in the country, and the Bhashini initiative, a platform designed to encourage language technology and translation capabilities.

Indian startups are also working on creating local AI models with an Indian touch.

With the growing presence of AI in public life, language will shape who can access information, participate in digital services, and benefit from technological advancements. The success of AI in India won’t just be about how advanced its models are but also about whether those models can understand the voice of the people they are meant to serve.
