Why fitting Tamil into AI and Large Language Models is a big challenge

Tamil doesn't lend itself easily to AI due to low levels of digital data, linguistic complexities like diglossia, and rich morphology, but this can be corrected

Deiva Sundaram Nainar
21 Jun 2025 6:30 AM IST (Updated: 2025-06-21 01:02:34)

Why doesn’t ChatGPT, or any other artificial intelligence system, handle Tamil as effectively as it does English?

While AI technologies have made extraordinary progress in recent years, particularly through tools powered by Large Language Models (LLMs), Tamil continues to face specific challenges in this rapidly evolving domain.

To understand the reasons, we need to examine both the technological and linguistic realities that shape the AI landscape for Tamil.

Digital language data

LLMs are the foundational engines behind today’s AI systems such as ChatGPT, Google Gemini, Meta AI, and Claude. These models are trained to perform a wide range of language tasks: answering questions, translating texts, summarising books and articles, and, when paired with image models, even generating images from text prompts.

However, these capabilities are built on a crucial resource – the vast quantities of digital language data available in each language.

LLMs learn patterns, meanings, and usage by analysing enormous corpora of books, websites, articles, and online conversations. In this context, the volume and diversity of training data determine how well an AI model performs in any given language.

Shortage of Tamil digital content

In the world of AI, languages are generally classified into three tiers: well-supported (such as English or Chinese), moderately supported (Tamil belongs here), and low-resource languages (including many tribal or endangered languages).

Tamil’s position as a moderately supported language does not reflect any linguistic deficiency; rather, it highlights the relatively limited presence of Tamil in the global digital data ecosystem.

Unlike English, which has trillions of words across countless digital platforms, Tamil has far fewer resources in digital form.

There is a noticeable shortage of Tamil textbooks in scientific, technological, and commercial domains. Formal or academic Tamil content is sparse on websites and social media, and even on platforms like YouTube or blogging sites, Tamil usage tends to be informal.

Professional, technical, and research-level Tamil content is minimal. This lack of comprehensive digital representation translates directly into weaker performance by AI models when processing Tamil.

Supervised learning

Most LLMs today are pre-trained on multilingual or high-resource language corpora. To improve their performance in Tamil, these models must be fine-tuned using Tamil-specific data.

Given the scarcity of such data, one effective strategy is to enrich the available content through linguistic annotation – that is, by tagging texts with grammatical and syntactic information that helps the model understand the language structure.

Training on such labelled data is a form of supervised learning. It allows the AI system to move beyond surface-level word patterns and grasp the deeper grammar of Tamil. When such linguistic knowledge is combined with data-driven training, we arrive at what is known as a hybrid model: one that benefits from both computational strength and linguistic insight.
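To make the idea of linguistic annotation concrete, here is a minimal sketch in Python of what one annotated record might look like. The class name, field names, and feature labels are assumptions chosen for illustration, not an established Tamil annotation scheme. The two words are forms of the root sel (to go), glossed in this article as 'he went' and 'he will go'.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedToken:
    """One word plus the grammatical labels an annotator attached to it."""
    surface: str   # the word as written (romanised here)
    lemma: str     # the root the word derives from
    pos: str       # part-of-speech tag
    features: dict = field(default_factory=dict)  # hypothetical feature labels

    def to_line(self) -> str:
        """Serialise to a tab-separated line: surface, lemma, POS, features."""
        feats = "|".join(f"{k}={v}" for k, v in sorted(self.features.items()))
        return "\t".join([self.surface, self.lemma, self.pos, feats or "_"])


# Illustrative annotations only; the label inventory is an assumption.
tokens = [
    AnnotatedToken("sendraan", "sel", "VERB",
                   {"Tense": "Past", "Gender": "Masc", "Number": "Sing"}),
    AnnotatedToken("selvaan", "sel", "VERB",
                   {"Tense": "Fut", "Gender": "Masc", "Number": "Sing"}),
]

for token in tokens:
    print(token.to_line())
```

Each line pairs the surface word with its root and grammatical labels, which is the kind of explicit signal a model cannot get from raw text alone.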

Staking its place in the AI world

Tamil is a language of great antiquity, with a continuous literary and cultural tradition. Long before the digital era, Tamil scholars had already laid the foundation for complex grammatical and literary systems capable of serving every function in society.

It is because of this distinguished legacy that Tamil has been formally recognised by the government of India as a classical language. In today’s AI-driven world, Tamil must again claim its rightful place – this time within the realm of intelligent software systems.

To achieve this, Tamil must be integrated meaningfully into all language-related functions across Tamil Nadu and beyond. It should be used in areas such as production, commerce, education, governance, and advanced research.

Such widespread and purposeful use is essential for Tamil to keep pace with technological advancements and retain its vitality in the AI era.

Diglossia deterrent

In addition to data scarcity, Tamil presents unique structural challenges for AI.

One of the most significant is diglossia – the coexistence of two distinct varieties of the language. Senthamizh, the formal literary Tamil used in writing, is vastly different from the colloquial Tamil used in daily speech. These varieties differ not just in vocabulary but also in syntax and grammatical structure.

If AI systems are trained on mixed data without distinguishing between these forms, the output is often stylistically inappropriate or semantically inaccurate – such as responding to a modern question with a poetic expression.
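One simple way to keep the two varieties apart during training is to label every example with its register. The Python sketch below is an assumption about how that could be done, not a description of any existing system: the tag format, the function names, and the two roughly romanised sentences (both meaning "I am going") are illustrative only.

```python
from collections import defaultdict

# Hypothetical examples: (register label, romanised text).
examples = [
    ("formal", "naan pogiren"),
    ("colloquial", "naan poren"),
]


def tag_with_register(register: str, text: str) -> str:
    """Prefix the text with a marker so a model can see which variety it is."""
    return f"<{register}> {text}"


def split_by_register(pairs):
    """Group texts by register so each variety can be sampled separately."""
    buckets = defaultdict(list)
    for register, text in pairs:
        buckets[register].append(text)
    return dict(buckets)


for register, text in examples:
    print(tag_with_register(register, text))
```

With labels like these in place, a fine-tuning set can be balanced across the two varieties, or a model can be asked for one register explicitly.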

Challenge of classical Tamil

Another complication arises from Tamil’s vast corpus of classical literature. These texts employ grammar, idioms, and vocabulary that differ significantly from contemporary usage.

Unless AI models are trained explicitly on these older forms of Tamil, they will struggle to process or generate meaningful responses in classical contexts, limiting their utility in fields such as literary scholarship and heritage preservation.

Furthermore, Tamil is a morphologically rich language. A single verb root can yield thousands of conjugated forms.

Take the root sel (to go), for example. It can appear as sendraan ('he went'), selgiraan ('he is going'), selvaan ('he will go'), sellamaattaan ('he won't go'), sellavarugiraan ('he is going to go') and so on, depending on the tense, number, gender, and modality.

Tamil nouns are equally complex, undergoing multiple inflections for case, number, and relational suffixes. These grammatical intricacies pose a significant challenge to AI models unless they are trained on large, well-annotated corpora that capture these variations accurately.
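A toy Python sketch, using only the five romanised forms of sel quoted above, hints at why this matters. It is not a morphological analyser, and the helper names are made up for illustration. To a model with a word-level vocabulary the five forms are five unrelated entries, yet they all share the ending "aan", while naive prefix matching recovers only "se" because the past-tense form begins with "sen" rather than "sel".

```python
import os

# The five forms quoted in this article, with their glosses.
FORMS = {
    "sendraan": "he went",
    "selgiraan": "he is going",
    "selvaan": "he will go",
    "sellamaattaan": "he won't go",
    "sellavarugiraan": "he is going to go",
}


def shared_prefix(words) -> str:
    """Longest string that every word starts with."""
    return os.path.commonprefix(list(words))


def shared_suffix(words) -> str:
    """Longest string that every word ends with (prefix of the reversals)."""
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]


suffix = shared_suffix(FORMS)   # ending common to all five forms
prefix = shared_prefix(FORMS)   # start common to all five forms
stems = [w[: -len(suffix)] for w in FORMS]

print(len(set(FORMS)), "distinct word-level entries")
print("shared ending:", suffix)
print("shared start:", prefix)
print("remainders:", stems)
```

The shared ending is easy to find, but the root is not, which is why simple string matching falls short and annotated data or a proper morphological analyser is needed.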

So, why is Tamil only ‘moderately’ supported in AI? The answer lies not in the nature of the language itself, but in the quantity, quality, and diversity of digital data available for model training.

What can be done?

Fortunately, there are concrete steps that can be taken to enhance AI performance in Tamil:

Fine-tune existing LLMs specifically for Tamil, incorporating both spoken and written varieties.
Develop more intelligent tokenizers that can effectively segment and interpret Tamil’s agglutinative structure, in which a single word strings together many elements that each express a meaning, rather than using a separate word for each meaning.
Promote the use of Tamil in technical, academic, and research writing across schools, colleges, universities, and digital platforms.
Encourage the digitisation of Tamil books, manuscripts, official documents, and historical records.
Create and share high-quality annotated datasets for core NLP tasks like grammar correction, spell checking, morphological analysis, and translation.

The objective is not merely to make AI function in Tamil but to ensure that Tamil itself becomes an active participant in the AI revolution – not only AI in Tamil, but Tamil in AI.

As a community, we must take the lead in producing, preserving, and promoting Tamil content in digital spaces. By doing so, we safeguard the relevance of our language for future generations while contributing meaningfully to global technological advancement.

Tamil lacks nothing in terms of history, depth, or expressive richness. What it needs now is a strong digital foundation – and a collective commitment – to ensure it thrives in the age of artificial intelligence.

Advancing Tamil in the realm of AI requires a united effort, one in which the Tamil people, language scholars, and both the Union and state governments join hands to ensure that Tamil secures its rightful place alongside global languages like English.

(The Federal seeks to present views and opinions from all sides of the spectrum. The information, ideas or opinions in the articles are of the author and do not necessarily reflect the views of The Federal.)

About the Author: Deiva Sundaram Nainar

Deiva Sundaram Nainar is a former professor and Head, Department of Tamil Language and Linguistic Studies Unit, University of Madras, Chennai.