Charting India's AI Future: Overcoming Data Challenges for Responsible Innovation

Data is not scarce here; its quality is the concern, especially for regional languages. One way out is for MeitY to go beyond Bhashini AI to build sector- and region-specific datasets, and keep them open source.

Data serves as the bedrock of Artificial Intelligence (AI) systems, driving machine learning algorithms and facilitating intelligent applications. But India faces a myriad of data-related challenges that hinder AI development.

This piece delves into the key data-related barriers to AI development, examines the government's approach to these obstacles, and offers key recommendations. Importantly, while the government has some initiatives in the pipeline, it needs to implement them on priority.

This article is the second of a two-part series that aims to provide strategic and targeted insights to the new government on leveraging India’s strengths and charting its own path in AI and its development. The first piece examined the compute pillar of AI. 

Quality Of Data Crucial

Data quality and accessibility are some of the key challenges that hinder reliable AI system development. The quality of datasets directly impacts the performance and accuracy of an AI model. 

Unstructured data, inconsistent data formats, inadequate labelling, and lack of standardisation can adversely impact the performance of the model. Further, gaps in data completeness, accuracy, and representativeness can negatively impact inclusion, equality, and unbiased outcomes.

India doesn't have a data scarcity problem. With more than 820 million internet users and a rapidly growing mobile user base, it generates a colossal amount of data daily. The problem is the quality of the data available for training AI models. There is a dearth of well-annotated, feature-rich local datasets, which impedes AI development.

The quality gap is starkest in datasets for Indian languages. Most languages here fall within the medium-resource to extremely low-resource spectrum. This classification is based on the availability of digitised text data that can be used to train automated systems.

There is a need to prioritise access to high-quality and relevant indigenous data, which is required either to train domain-specific models or to fine-tune existing models for the Indian context. The IT Ministry (MeitY) has made efforts toward this goal by developing Bhashini AI, a multilingual translation model.

MeitY also plans to create the IndiaAI Innovation Centre (IAIC) under the IndiaAI mission, which will promote the development and deployment of foundational models, with a specific emphasis on indigenous LMMs and domain-specific models.

MeitY can address this gap in two ways. First, it should extend its efforts beyond Bhashini AI and focus on developing quality datasets with indigenous data in various contexts, such as agriculture (region-specific weather patterns and farming practices) and smart mobility (data about unique Indian driving conditions and traffic behaviour). Second, it should take an open-source approach to datasets for the foundational models it creates under the IAIC or otherwise.

For instance, Bhashini AI is open source, but the data it was trained on is not. This open-source approach to datasets will allow developers, researchers, and entrepreneurs to leverage them, creating a much wider array of innovative downstream applications compared to state-led efforts. This can foster a vibrant ecosystem for AI research and development in India.

From Isolation to Integration: Bridging Data Silos

Another key challenge is that the available data is fragmented across various organisations, with disparate schemas and metadata standards. This can be a bottleneck for developing AI systems at scale.

Recognising this, the government has proposed certain initiatives focused on enabling access to data and reducing silos, but implementation is lacking. It released the National Data Governance Framework Policy (NDGFP) for consultation in 2022, but the policy is yet to be implemented.

The NDGFP aims to promote data-driven governance by standardising the management of non-personal and anonymised data across government entities.

The India Datasets Platform (IDP), a planned unified national data-sharing exchange for anonymised personal data, has also been a long time in the making. The IDP is expected to lower barriers to access for startups, promoting domestic AI innovation and development.

Call for Action: Fostering India's Data Ecosystem

Time is of the essence here. MeitY should release the NDGFP and operationalise the IDP within the 100-day plan. It also needs to develop uniform data interoperability standards across diverse government datasets to assist with data management, sharing, governance, and algorithmic training.

The NDGFP was criticised for its failure to mandate adequate technical and organisational measures to prevent the risk of re-identification of individuals when processing non-personal and anonymised data. These safeguards need to be incorporated.

It is also important to follow data protection principles such as data minimisation and purpose limitation and to promote ethical best practices. 

Additionally, to fully unlock the potential of India's data ecosystem, the government must incentivise and facilitate the involvement of private companies.

This can be achieved through a multi-pronged approach, encompassing a supportive regulatory environment, financial grants for data-driven research and development, and the deployment of innovative solutions like regulatory sandboxes.

Commit To A Focused Execution

In navigating India's path towards AI leadership, addressing the foundational challenges of data quality and integration is paramount.

The abundance of data in India presents an opportunity, yet its uneven quality and fragmented storage hinder AI development. By prioritising high-quality, open-source, indigenous datasets, we can foster a vibrant AI and data ecosystem in India. 

Initiatives like the NDGFP and IDP hold promise but require swift implementation with strong data protection safeguards. Without immediate and effective action, the opportunities these initiatives present will be squandered. 

The Indian government needs to break from its pattern of bureaucratic delay and inconsistent policy execution, committing instead to focused and strategic execution to harness the potential of these initiatives.

(Rudraksh Lakra is an associate and Rutuja Pol leads the government affairs practice at Ikigai Law, a technology-focused law and policy firm. Views expressed are personal.)
