Project Vacaspati
Our corpus, VĀCASPATI, is varied along multiple dimensions, including type of composition, topic, author, time, and space. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, VĀC-FT, using FastText on VĀCASPATI, and trained a BERT model, VĀC-BERT, on the same corpus. VĀC-BERT has far fewer parameters and requires only a fraction of the resources of other state-of-the-art BERT models, yet performs better than or on par with them on various downstream tasks.
- The VĀCASPATI dataset is 2.4 GB in size.
- It contains over 115 million words.
- There are more than 11 million sentences in the dataset.
- The content has been contributed by over 200 authors.
Team

Prof. Arnab Bhattacharya
Professor, IIT Kanpur
Pramit Bhattacharyya
Ph.D., IIT Kanpur
Joydeep Mondal
Ph.D., IIT Kanpur