LDC-IL launches 16 datasets to drive AI research in Indian languages
Published 2 years ago

The Linguistic Data Consortium for Indian Languages (LDC-IL), which operates under a Ministry of Education scheme, creates digital corpora in various Indian languages. At the 8th Project Advisory Committee meeting at the Central Institute of Indian Languages (CIIL) in Mysuru, chaired by CIIL director Shailendra Mohan, LDC-IL released 16 new datasets in Indian languages. The initiative aims to advance research in Artificial Intelligence (AI) and Machine Learning (ML) by giving researchers curated language resources.
These datasets, described as the first of their kind, are designed to support the development of Indian-language technologies such as Automatic Speech Recognition and Live Voice Translation, and to improve the precision and efficacy of Indian-language tools. They cover 12 scheduled languages: Hindi, Bengali, Tamil, Marathi, Kannada, Malayalam, Odia, Assamese, Konkani, Maithili, Urdu, and Nepali. There are also two variants of Indian English: the Bengali variant and the Kannada variant.
In a notable move, the institute also released datasets for Chhattisgarhi, which has traditionally been grouped with Hindi. This reflects the government’s commitment to advancing education and technology for all mother tongues in India, in line with the recommendations of the National Education Policy-2020.
The datasets are available on the Data Distribution Portal of LDC-IL at https://data.ldcil.org, a significant contribution to linguistic research and to AI and ML development. LDC-IL, the largest repository of curated text and speech resources in Indian languages, now offers a total of 57 datasets covering 21 Indian languages.
These datasets consist of real-world data collected from verified sources rather than crowd-sourced, and they serve as crucial resources for training and benchmarking AI and Generative AI-based technologies. The applications built on these datasets are expected to promote and strengthen linguistic diversity in India.