IIT professor highlights BharatGen's challenges in building LLMs for 16 Indian languages
Professor Ganesh Ramakrishnan, principal investigator of BharatGen, India's first government-funded generative AI initiative

IIT Bombay professor Ganesh Ramakrishnan of BharatGen discusses the challenges of data limitations and linguistic complexity in building LLMs for Indian languages


Professor Ganesh Ramakrishnan of IIT Bombay is the principal investigator at BharatGen, described as “India's first government-funded generative AI initiative aimed at developing multilingual and multimodal foundational models”.

The project is a collaborative effort involving premier institutions like IIT Bombay, IIT Madras, IIT Hyderabad, IIT Kanpur, IIIT Hyderabad, IIM Indore, and IIT Mandi, and is funded by the department of science and technology through TIH IIT Bombay, a Section 8 company. At IIT Bombay, Ramakrishnan serves as the Bank of Baroda Chair professor in digital entrepreneurship at the department of computer science and engineering.

In this exclusive interview with The Federal, he talks of the work being done by BharatGen on building LLMs (Large Language Models) in 16 languages, viz., Assamese, Bengali, English, Gujarati, Hindi, Kannada, Maithili, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Sindhi, Tamil and Telugu. He also speaks about the challenges of working with Indian languages in the context of AI (artificial intelligence), and the future of Indian language LLMs.

Here are excerpts from the interview:

What is BharatGen doing with Indian languages? I believe you're building an LLM for Indian languages.

Yes, we've been building an LLM for Indian languages. We’re tokenising a large amount of data, but the tokenisation is such that it leverages a couple of things. One is what we call lexical similarity across Indian languages.

If you take a language like Tamil, there are several words shared in common with, say, Malayalam. It is all a gradation. It is not that all languages share the same vocabulary, but pairs of languages often do. There may be a difference of opinion on whether it is 40 per cent or 20 per cent shared, but there is sharing. There is a kind of gradual transition from one language to another, one dialect to another.


Lexical similarity also depends on what literature you are reading. Ancient literature could have more overlap and recent literature less, but there is overlap.
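
The idea of a shared subword vocabulary can be sketched with the open-source SentencePiece library: training one tokeniser on text pooled across languages lets cognate words map to common subword tokens, whose representations are then learned jointly. The corpus file, vocabulary size, and language mix below are illustrative assumptions, not BharatGen's published configuration.

```python
# Minimal sketch: one shared subword vocabulary trained across several
# Indian languages, so lexically similar words map to common tokens.
# The corpus file, vocabulary size, and language mix are illustrative
# assumptions, not BharatGen's actual tokeniser configuration.
import sentencepiece as spm

# mixed_corpus.txt is assumed to hold one sentence per line, pooled from
# Hindi, Marathi, Tamil, Malayalam, etc., each in its native script.
spm.SentencePieceTrainer.train(
    input="mixed_corpus.txt",
    model_prefix="indic_shared",
    vocab_size=64000,           # a single vocabulary shared by all languages
    model_type="bpe",
    character_coverage=1.0,     # keep every Indic script character
)

sp = spm.SentencePieceProcessor(model_file="indic_shared.model")

# Cognates in related languages tend to split into shared subword pieces,
# so the model learns their representations together.
for sentence in ("मैं आम खाता हूँ",   # Hindi: "I eat mangoes"
                 "मी आंबा खातो"):    # Marathi: "I eat mangoes"
    print(sp.encode(sentence, out_type=str))
```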

The second thing that is similar across Indian languages, which is even more similar than the lexical similarity, is the syntactic structure. If you take English, for example, “I eat mangoes” is subject, verb, object. If you take Tamil, it is naan mambazham sapadren, in Hindi, it’ll be main aam khata hoon, and in Marathi it is mi aam khato.

In all Indian languages, the pattern is: subject, object, verb. So, it is undeniable that in all Indian languages, there are similarities. But the degree of similarity can be debated.

Samanvaya is a book that several of our colleagues have written, including colleagues in BharatGen and our students. It is an interlingua for Indian languages. There we show that by using certain structures, such as groups of words, all Indian languages start looking very similar.

We looked at Malayalam, Kannada, Marathi, Hindi, Bengali, and Sanskrit, and we saw that when we do the grouping and look at the dependency relation, inspired by the Karaka framework of Panini, Indian languages have a lot of coherence.

The third thing is phonemes, or speech. At the phoneme level, you find that, again, there's a significant amount of similarity. If you look at the common label set (CLS), it brings multiple Indian languages very close together.
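
The intuition behind a common label set can be shown as a small mapping from script-specific characters to shared phonetic labels. The handful of mappings below is a hand-picked illustration, not the official CLS inventory.

```python
# Toy illustration of the common-label-set idea: the same phoneme, written
# in different Indic scripts, maps to one shared label. These few mappings
# are illustrative only, not the official CLS inventory.
COMMON_LABELS = {
    # phoneme /ka/
    "क": "ka",   # Devanagari (Hindi, Marathi, Sanskrit)
    "க": "ka",   # Tamil
    "ക": "ka",   # Malayalam
    "ক": "ka",   # Bengali, Assamese
    # phoneme /ma/
    "म": "ma",   # Devanagari
    "ம": "ma",   # Tamil
    "മ": "ma",   # Malayalam
    "ম": "ma",   # Bengali, Assamese
}

def to_common_labels(text: str) -> list[str]:
    """Map each character to its shared phonetic label where one exists."""
    return [COMMON_LABELS.get(ch, ch) for ch in text]

# The Devanagari and Tamil spellings of the same syllables collapse to the
# same labels, which is what pulls the languages close at the phoneme level.
print(to_common_labels("कम"))   # ['ka', 'ma']
print(to_common_labels("கம"))   # ['ka', 'ma']
```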

When building an LLM for Indian languages, the dataset is going to be far more limited than for, say, English. So, how do you overcome that? What work is being done on that?

So, the dataset is limited. Obviously, we are creating datasets. We have a platform called UDAAN, which we have been using for quite some time to generate high-fidelity translations of content, books, etc., into multiple Indian languages. Through optical character recognition (OCR), we are digitising a lot of copyright-free content and books. That work has been going on for the last 4-5 years.

There is also a lot of data available on the internet, but obviously, that also needs to be curated. It is significantly smaller than what is available for English, but we leverage that too. So, it is a combination of these three: OCR, publicly available datasets, and a lot of translated, high-quality data.
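
As a rough sketch of how such streams might be merged, the snippet below pools OCR output, curated web text, and translated data into one deduplicated corpus. The file names, length filter, and hash-based dedup are assumptions for illustration, not BharatGen's actual pipeline.

```python
# Illustrative sketch of combining the three data streams described above:
# OCR-digitised books, curated web text, and high-quality translations.
# File names, cleaning rules, and the dedup strategy are assumptions,
# not BharatGen's actual pipeline.
import hashlib

SOURCES = ["ocr_books.txt", "curated_web.txt", "udaan_translations.txt"]

def clean(line: str) -> str:
    """Very light normalisation: strip edges and collapse inner whitespace."""
    return " ".join(line.split())

def build_corpus(out_path: str = "training_corpus.txt") -> int:
    seen: set[str] = set()            # hashes of lines already kept
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in SOURCES:
            with open(path, encoding="utf-8") as f:
                for raw in f:
                    line = clean(raw)
                    if len(line) < 20:        # drop OCR fragments and noise
                        continue
                    h = hashlib.md5(line.encode("utf-8")).hexdigest()
                    if h in seen:             # exact-duplicate removal
                        continue
                    seen.add(h)
                    out.write(line + "\n")
                    kept += 1
    return kept

if __name__ == "__main__":
    print(f"kept {build_corpus()} deduplicated lines")
```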


How do you take into account differences in spoken and written language?

The change is largely vocabulary. It is not that they are totally different. There is a gradual shift.

The advantage with large language models, and especially when you train these models together, is that we use text from multiple Indian languages together and we use speech.

Some of it could even be rendered from the spoken to the written form, because there is a gradient, a continuum. When you exploit the continuum, the written form may have some overlap with the spoken form. And the written form in one language has some overlap with the written form in another language. This continuum is very helpful for India. I am not saying we have exploited it completely, but the more we are able to leverage it, the more we will be able to mitigate the challenge of data constraints. But yes, there is a long way to go.

All I am trying to say is that there are very genuine efforts being put in. That is why we also reach out to government departments for the data they have. Because we are a kind of sovereign government AI initiative, funded by the department of science and technology, we are also getting support to reach out to multiple government agencies for data.

All that is, of course, a slow process. That data is also trickling in.

Our team members have also gone to several libraries. We have a consortium: IIT Kanpur, IIT Madras, IIT Hyderabad, IIT Mandi and, of course, IIT Bombay, which is leading. And one of our very strong consortium partners is IIM Indore. They have been doing a fabulous job of contacting several local libraries in India. Many times, we do not pay attention to them, but many, many small libraries are more than willing to share data.

Are there any specific challenges or more of a challenge for certain Indian languages?

In general, Indian languages need more attention. That is all I can say. But it is too early to say whether there is any one language you have to single out in terms of challenges.


Are you also working on the north-eastern languages?

For speech, certainly. And for text, also, gradually.

Text is taking more time. We are building a mixture-of-experts foundational LLM from scratch for 16 Indian languages, which include some north-eastern languages, like Assamese. For the speech part, we have already gone to Bodo, but we do not have as much textual data.
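
A mixture-of-experts layer of the kind mentioned here routes each token to one of several expert feed-forward networks, so tokens from different languages can use different experts while sharing the rest of the model. The sketch below uses top-1 routing and illustrative dimensions; it is not BharatGen's actual architecture.

```python
# Minimal sketch of a mixture-of-experts (MoE) feed-forward layer.
# Dimensions, the number of experts, and top-1 routing are illustrative
# assumptions, not BharatGen's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 512, n_experts: int = 8, d_ff: int = 2048):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Send each token to its single best expert.
        gate = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        weight, idx = gate.max(dim=-1)                # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                            # run only the routed tokens
                out[mask] = weight[mask].unsqueeze(1) * expert(x[mask])
        return out

# Tokens from different languages can land on different experts while the
# embeddings and attention layers around this block stay shared.
layer = MoELayer()
print(layer(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
```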

Since you said that there is still a long way to go, what else needs to be done? Also, given the complexity of Indian languages, will your models be able to match English LLMs?

We hope and wish to do better, as I said, not just through individual languages but through the collective wisdom and collective similarity of those languages. That is a continuous effort for us. And of course, data is super critical. We are not saying one compensates completely for the other; there is some complementarity.

If you leverage language characteristics, it can complement the fact that we have less data. But we also appeal to anyone who's in possession of Indian language data to share it with BharatGen at [email protected].

We can facilitate any kind of agreement. We want to preserve many of those texts.
