HuggingFace Pipelines

This assignment is based on this.

Fill-Mask Task

Use HuggingFace pipelines to replicate Tables 3 and 4 from this paper. Repeat the replication with three different models, at least one of which should be BERT.
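
As a starting point, here is a minimal sketch of querying a fill-mask pipeline. The helper names load_unmasker and top_fills, the default checkpoint bert-base-uncased, and the example prompt are assumptions made for illustration; they are not taken from the paper, whose prompts you will need to substitute.

```python
def load_unmasker(model_name="bert-base-uncased"):
    """Build a fill-mask pipeline (downloads the model on first use).

    The default checkpoint is an assumption; any fill-mask model name works.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("fill-mask", model=model_name)


def top_fills(unmasker, template, k=5):
    """Return (token, score) pairs for the k best fills of the masked slot.

    `template` uses the literal placeholder [MASK]; it is swapped for the
    model's own mask token, which differs between model families.
    """
    text = template.replace("[MASK]", unmasker.tokenizer.mask_token)
    results = unmasker(text, top_k=k)
    # Some tokenizers return the token with a leading space, so strip it.
    return [(r["token_str"].strip(), r["score"]) for r in results]


# Hypothetical usage (the prompt is a placeholder, not one from the paper):
#   unmasker = load_unmasker()
#   for token, score in top_fills(unmasker, "She works as a [MASK]."):
#       print(f"{token}\t{score:.3f}")
```

Replacing the placeholder with the tokenizer's mask token lets the same prompts be reused across the three models, since not every model uses the string [MASK].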

Translation and Feature Extraction

  1. Write a program that takes a name as input, such as Noah Smith. Find three author IDs in Semantic Scholar that match that name, using methods like those we used for the first assignment. This time, sort the matching author IDs by citation count and use the top three.
  2. Find the top four papers for each author ID. If an author has more than four papers, use the four with the most citations.
  3. Get their titles and abstracts.
  4. Translate the titles and abstracts to Chinese and back to English.
  5. Repeat, but replace Chinese with French.
  6. Use the allenai/specter2_base model on HuggingFace to obtain vectors for these papers, with and without the translations.
  7. Plot the cosine similarities of these vectors, using both imshow and boxplot.
  8. Does translation do anything to the vectors? Does translation make the vectors better or worse (or does it do nothing at all)? Are the French translations better or worse than the Chinese translations for comparing papers? How would we make that question precise?
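
Steps 4 through 7 could be sketched roughly as follows. This is one possible outline, not a reference solution. The function names are hypothetical, the Helsinki-NLP/opus-mt checkpoints are an assumed choice of translation models, and papers are assumed to be dicts with "title" and "abstract" keys. Joining title and abstract with the separator token and taking the first-token vector follows common usage for SPECTER-style models.

```python
import numpy as np


def round_trip(texts, forward, backward):
    """Translate texts to a pivot language and back to English.

    `forward` and `backward` are translation pipelines (or any callables)
    mapping a list of strings to a list of {"translation_text": ...} dicts.
    """
    pivot = [d["translation_text"] for d in forward(texts)]
    return [d["translation_text"] for d in backward(pivot)]


def load_translators(pivot="zh"):
    """Load English->pivot and pivot->English pipelines (assumed checkpoints)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    forward = pipeline("translation", model=f"Helsinki-NLP/opus-mt-en-{pivot}")
    backward = pipeline("translation", model=f"Helsinki-NLP/opus-mt-{pivot}-en")
    return forward, backward


def embed(papers, model_name="allenai/specter2_base"):
    """Return one vector per paper, taken from the first-token position."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    texts = [p["title"] + tokenizer.sep_token + (p.get("abstract") or "")
             for p in papers]
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return output.last_hidden_state[:, 0, :].numpy()


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T


def plot_similarities(sims_by_condition):
    """Draw one imshow panel per condition plus a shared boxplot panel."""
    import matplotlib.pyplot as plt

    names = list(sims_by_condition)
    fig, axes = plt.subplots(1, len(names) + 1,
                             figsize=(4 * (len(names) + 1), 4))
    off_diagonal = []
    for ax, name in zip(axes, names):
        sims = sims_by_condition[name]
        image = ax.imshow(sims, vmin=-1, vmax=1)
        ax.set_title(name)
        fig.colorbar(image, ax=ax)
        # Keep each pair once and drop the trivial self-similarities.
        off_diagonal.append(sims[np.triu_indices_from(sims, k=1)])
    axes[-1].boxplot(off_diagonal)
    axes[-1].set_xticklabels(names)
    axes[-1].set_title("pairwise cosine similarity")
    return fig
```

A call might look like plot_similarities({"original": cosine_similarity_matrix(embed(papers)), "via zh": ..., "via fr": ...}), where the translated conditions embed papers whose titles and abstracts have been passed through round_trip. Long abstracts can exceed a translation model's input limit, so splitting them into sentences before translating may be necessary.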