Project Vacaspati
Our corpus VĀCASPATI is varied across multiple dimensions, including type of
composition, topic, author, time, and space. It contains more than 11
million sentences and 115 million words. We also built a word embedding
model, VĀC-FT, using FastText trained on VĀCASPATI, as well as a BERT
model, VĀC-BERT, trained on the same corpus. VĀC-BERT has far fewer
parameters and requires only a fraction of the resources of other
state-of-the-art BERT models, yet performs better than or comparably to them
on various downstream tasks.
| Metric | Value |
|---|---|
| Vacaspati size | 2.4 GB |
| No. of words | 115M+ |
| No. of sentences | 11M+ |
| No. of authors | 200+ |
Download links for Vac-FT and Vac-BERT.
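
Once downloaded, the models can be loaded with the standard FastText and Hugging Face Transformers APIs. The sketch below is illustrative only: the file name `vac-ft.bin` and the model path `path/to/vac-bert` are placeholders, not the official artifact names.

```python
# Minimal loading sketch; paths below are placeholders, not official names.
import fasttext
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the FastText word-embedding model (native .bin format).
ft_model = fasttext.load_model("vac-ft.bin")            # placeholder path
vec = ft_model.get_word_vector("বাংলা")                  # embedding for a Bengali token
print(vec.shape)

# Load the BERT model from a local directory or hub identifier.
tokenizer = AutoTokenizer.from_pretrained("path/to/vac-bert")       # placeholder
model = AutoModelForMaskedLM.from_pretrained("path/to/vac-bert")    # placeholder
```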