Project Vacaspati

Our corpus, VĀCASPATI, is diverse along multiple dimensions, including type of composition, topic, author, time, and space. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, VĀC-FT, using FastText on VĀCASPATI, and trained a BERT model, VĀC-BERT, on the same corpus. VĀC-BERT has far fewer parameters and requires only a fraction of the resources of other state-of-the-art BERT models, yet performs better than or comparably to them on various downstream tasks.

  • The Vacaspati dataset has a size of 2.4 GB.
  • It contains over 115 million words.
  • There are more than 11 million sentences in the dataset.
  • The content has been contributed by over 200 authors.
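
Sentence and word counts like those listed above could be checked on a local copy of a corpus with a sketch along these lines. It assumes one sentence per line and whitespace tokenization; both are illustrative assumptions, not necessarily how the reported figures were computed, and `corpus_stats` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def corpus_stats(sentences: Iterable[str]) -> Dict[str, float]:
    """Count sentences and whitespace-separated words.

    Assumption: each non-blank input string is exactly one sentence.
    """
    n_sentences = 0
    n_words = 0
    for line in sentences:
        tokens = line.split()
        if not tokens:  # skip blank lines
            continue
        n_sentences += 1
        n_words += len(tokens)
    avg = n_words / n_sentences if n_sentences else 0.0
    return {
        "sentences": n_sentences,
        "words": n_words,
        "words_per_sentence": avg,
    }


# Toy input standing in for a corpus file opened line by line.
sample = ["a b c", "", "d e"]
stats = corpus_stats(sample)
print(stats)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a multi-gigabyte file would not need to be loaded into memory at once.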

Team

Prof. Arnab Bhattacharya

Professor, IIT Kanpur

Pramit Bhattacharyya

Ph.D., IIT Kanpur

Joydeep Mondal

Ph.D., IIT Kanpur

Subhadip Maji

Ph.D., IIT Kanpur