Bangla

Project Vacaspati

Our corpus, VĀCASPATI, is diverse along multiple dimensions, including type of composition, topic, author, time, and place. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, VĀC-FT, using FastText on VĀCASPATI, and trained a BERT model, VĀC-BERT, on the same corpus. VĀC-BERT has far fewer parameters and requires only a fraction of the resources of other state-of-the-art BERT models, yet performs better than or on par with them on various downstream tasks.

The VĀCASPATI dataset is 2.4 GB in size.
It contains over 115 million words.
There are more than 11 million sentences in the dataset.
The content has been contributed by over 200 authors.
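
As a rough illustration of how sentence and word counts like those above can be obtained for Bangla text, here is a minimal sketch. It rests on assumptions: sentences are split on the danda (।), question mark, and exclamation mark, and words are split on whitespace. The page does not state the counting rules actually used for VĀCASPATI, and the function name `corpus_stats` is hypothetical.

```python
import re
from typing import Tuple

# Assumed sentence terminators: the Bangla danda plus '?' and '!'.
# The real corpus may have been segmented with different rules.
_SENTENCE_END = re.compile(r"[।?!]+")


def corpus_stats(text: str) -> Tuple[int, int]:
    """Return (sentence_count, word_count) for a block of Bangla text."""
    # Split on terminators and drop empty fragments left by trailing marks.
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    # Words are counted as whitespace-separated tokens (an assumption).
    words = sum(len(s.split()) for s in sentences)
    return len(sentences), words


sample = "আমি বাংলায় গান গাই। তুমি কি আসবে? হ্যাঁ!"
n_sentences, n_words = corpus_stats(sample)
print(n_sentences, n_words)
```

Whitespace tokenization is the simplest possible choice; a different tokenizer would give different word counts for the same text.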

Download Vac-FT
Download Vac-BERT

Team

Prof. Arnab Bhattacharya

Professor, IIT Kanpur

Pramit Bhattacharyya

Ph.D., IIT Kanpur

Joydeep Mondal

Ph.D., IIT Kanpur

Subhadip Maji

Ph.D., IIT Kanpur