Bangla

Project Vaiyākaraṇa

Bangla is the fifth most-spoken language globally, yet Grammatical Error Correction (GEC) in Bangla remains underdeveloped. In this work, we investigate how large language models (LLMs) can be leveraged to improve Bangla GEC. We first categorize errors in Bangla into 12 classes and survey native Bangla speakers to collect real-world errors. We then devise a rule-based noise injection method that generates grammatically incorrect sentences from correct ones. The resulting Vaiyākaraṇa dataset consists of 567,422 sentences, of which 227,119 are erroneous. We use this dataset to instruction-tune LLMs for Bangla GEC. Evaluations show that instruction-tuning with Vaiyākaraṇa improves the GEC performance of LLMs by 3-7 percentage points.
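As a rough illustration of what rule-based noise injection can look like, the sketch below swaps one commonly confused Bangla character for its counterpart to turn a correct sentence into an erroneous one. It is a minimal sketch under stated assumptions, not the method used to build Vaiyākaraṇa: the confusion table, the function names `inject_noise` and `make_pairs`, and the `error_rate` parameter are all hypothetical, and the rules shown are not the 12 error classes from the paper.

```python
import random

# Hypothetical confusion table (illustrative only, NOT the paper's error classes):
# vowel signs and consonants that are often mixed up in Bangla spelling.
CONFUSION_PAIRS = {
    "\u09bf": "\u09c0",  # short i-kar -> long i-kar
    "\u09c0": "\u09bf",  # long i-kar -> short i-kar
    "\u09c1": "\u09c2",  # short u-kar -> long u-kar
    "\u09c2": "\u09c1",  # long u-kar -> short u-kar
    "\u09a8": "\u09a3",  # dental na -> retroflex na
    "\u09a3": "\u09a8",  # retroflex na -> dental na
}


def inject_noise(sentence, rng, pairs=CONFUSION_PAIRS):
    """Return `sentence` with one confusable character swapped.

    If no character in the sentence matches a rule, the sentence is
    returned unchanged.
    """
    positions = [i for i, ch in enumerate(sentence) if ch in pairs]
    if not positions:
        return sentence
    i = rng.choice(positions)
    return sentence[:i] + pairs[sentence[i]] + sentence[i + 1:]


def make_pairs(correct_sentences, error_rate, seed=0):
    """Build (source, target) pairs; `target` is always the correct sentence.

    `error_rate` is the assumed probability that a source sentence is
    corrupted; otherwise the pair maps a correct sentence to itself.
    """
    rng = random.Random(seed)
    result = []
    for sentence in correct_sentences:
        if rng.random() < error_rate:
            result.append((inject_noise(sentence, rng), sentence))
        else:
            result.append((sentence, sentence))
    return result
```

Keeping uncorrupted sentences in the output mirrors the dataset's composition, where 227,119 of the 567,422 sentences (about 40%) are erroneous; an `error_rate` near 0.4 would approximate that mix in this toy setting.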

Read the Vaiyākaraṇa Benchmark Paper on arXiv.

Team

Prof. Arnab Bhattacharya

Professor, IIT Kanpur

Pramit Bhattacharyya

Ph.D., IIT Kanpur

Jeswaanth Gogula

M.Tech, IIT Kanpur