Our corpus, VĀCASPATI,
varies along multiple dimensions, including type
of composition, topic, author, time, space, etc.
It contains more than 11 million sentences and
115 million words. Using VĀCASPATI, we also trained
a FastText word embedding model, VĀC-FT, and
a BERT model, VĀC-BERT. VĀC-BERT
has far fewer parameters and requires only a
fraction of the resources of other state-of-the-art
BERT models, yet performs better than or comparably to them on
various downstream tasks.
If you are interested in annotating Bangla data for lemmatization,
POS tagging, and NER, please contact us. This will be a paid position. Contact details
are available on the contact page.