Project Vacaspati

Our corpus VĀCASPATI is varied along multiple dimensions, including type of composition, topic, author, time, and space. It contains more than 11 million sentences and 115 million words. From VĀCASPATI we built a word embedding model, VĀC-FT, using FastText, and trained a BERT model, VĀC-BERT, on the corpus. VĀC-BERT has far fewer parameters and requires only a fraction of the resources of other state-of-the-art BERT models, yet performs comparably or better on various downstream tasks.
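FastText, used to build VĀC-FT, represents each word by its character n-grams, which helps with the rich morphology of a language like Bangla. A minimal sketch of the n-gram extraction step (the boundary markers and the 3-to-6 n-gram range follow FastText's defaults; this is an illustration, not the VĀC-FT training code):

```python
def char_ngrams(word, nmin=3, nmax=6):
    # FastText wraps the word in boundary markers '<' and '>'
    # before extracting character n-grams.
    w = f"<{word}>"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# The word vector is the sum of the vectors of these n-grams,
# so unseen or inflected words still receive sensible embeddings.
print(char_ngrams("vac", 3, 3))  # ['<va', 'vac', 'ac>']
```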
Metric              Value
Corpus size         2.4 GB
No. of words        115M+
No. of sentences    11M+
No. of authors      200+
Download links for VĀC-FT and VĀC-BERT:
VĀC-FT (click here to download)
VĀC-BERT (click here to download)