Please suggest any other resources you may be aware of. Raise an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a short, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
Featured Resources
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpus in the public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- AI4Bharat IndicNLP Project: Text corpora, word embeddings, text classification datasets for Indian languages.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
Browse the entire catalog...
Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.
- Major Indic Language NLP Repositories
- Libraries
- Text Corpora
- Unicode Standard
- Monolingual Corpus
- Language Identification
- Lexical Resources
- NER Corpora
- Parallel Translation Corpus
- Parallel Transliteration Corpus
- Textual Entailment
- Paraphrase
- Sentiment, Sarcasm, Emotion Analysis
- Question Answering
- Dialog
- Discourse
- POS Tagged corpus
- Chunk Corpus
- Dependency Parse Corpus
- Models
- Speech Corpora
- OCR Corpora
- Multimodal Corpora
- Language Specific Catalogs
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- Linguistic Data Consortium For Indian Languages (LDCIL)
- University of Hyderabad - Sanskrit NLP
- Indic NLP Library: Python library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc.
- pyiwn: Python Interface to IndoWordNet
- Indic-OCR: OCR for Indic Scripts
- CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
- Wikipedia Dumps
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- WMT NEWS Crawl
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), with fastText skip-gram and CBOW word vectors for the same.
- CMU Romanized Hinglish Corpus: See this paper for details.
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.
- IndoWordNet
- IIIT-Hyderabad Word Similarity Database: 7 Indian languages
- Facebook Hindi Analogy Dataset
- MGAD Hindi Analogy dataset
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy)
- a-mma NER data
- Indian Language Corpora Initiative: Available on TDIL portal on request
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpus in the public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- OPUS corpus
- WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
- EILMT Corpus
- MTurk Indian Parallel Corpus
- TED Parallel Corpus
- Charles University English-Hindi Parallel Corpus
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus
- Charles University English-Urdu Religious Parallel Corpus
- PMIndia: Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India (paper).
- WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
- CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy.
- JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witnesses.
- IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
- FLORES dataset: English-Sinhala and English-Nepali corpora
- Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper.
- NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
- English-Tamil Wiki Titles: from statmt
- JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
- BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
- Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
- FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
- NEWS 2016 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
- NotAI-tech English-Telugu: Around 38k word pairs
- BBC news articles classification dataset: 14 class classification
- iNLTK News Headlines classification: Datasets for multiple Indian languages.
- AI4Bharat IndicNLP News Articles: Word embeddings for 10 Indian languages.
- XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).
- Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- ACTSA corpus for Telugu
- BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
- iNLTK Movie Reviews: Hindi sentiment analysis on movie reviews
- Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
- bAbI 1.2 dataset: Has a Hindi version of the bAbI tasks in romanized Hindi.
- MMQA dataset: Hindi QA dataset described in this paper.
- XQuAD: Test set for Hindi QA, human-translated from a subset of SQuAD v1.1. Described in this paper.
- EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
- Indian Language Corpora Initiative
- Universal Dependencies
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
- IIIT Hyderabad Hindi Treebank
- Universal Dependencies
- Universal Dependencies Hindi Treebank
- Universal Dependencies Urdu Treebank
- FastText CommonCrawl+Wikipedia
- FastText Wikipedia
- Polyglot
- AI4Bharat IndicNLP Project: Word embeddings for 10 Indian languages.
- BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
- iNLTK: ULMFiT and Transformer-XL pre-trained embeddings for many languages, trained on Wikipedia and some news articles.
- albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
- AI4Bharat IndicNLP Project: Unsupervised morphological analyzers for 10 Indian languages, learnt using Morfessor.
- Shata-Anuvaadak: 110 language pairs
- LTRC Vanee: Dependency based Statistical MT system from English to Hindi
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
- Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
- AccentDB: Database of Indian English accents from native speakers of Bangla, Malayalam, Telugu and Oriya.
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- Pratham ASER dataset: Dataset for research on reading level assessment.
- English-Hindi Visual Genome: Images captioned in both English and Hindi.
Pointers to language-specific NLP resource catalogs