Better Together

This assignment is based on this.

Please try this and this. Many names, such as David Madigan and Noah Smith, are ambiguous and therefore difficult for search engines such as Semantic Scholar. This exercise provides a visualization to help resolve some of these ambiguities.

Tasks

Write a program that inputs a name such as Noah Smith and outputs a visualization like this:

Note: Your picture will look different because you will be using different embeddings.

Please post code and output pictures on GitHub (or Colab), and share links to your code on Canvas.

Here is a colab link that includes some of the hints below: my colab.

Suggested steps:

  1. input name and output list of candidates and their papers
  2. input paper id and output embeddings
  3. compute pairwise similarities
  4. plot similarities

Suggested improvements:

  1. Sort candidate authors by citations.
  2. Sort candidate papers by citations.
  3. Limit candidate authors and candidate papers to n-best, for some reasonable value of n.
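The three improvements above could be sketched as follows. This is only an illustration: the helper names `n_best` and `prune_candidates` and the default cutoffs are hypothetical, and the code assumes each author and paper is a dict with a `citationCount` key, as requested in the Step 1 query.

```python
def n_best(items, n, key="citationCount"):
    """Return the n items with the largest `key`, highest first."""
    # Treat a missing or null count as zero so incomplete records sort last.
    ranked = sorted(items, key=lambda item: item.get(key) or 0, reverse=True)
    return ranked[:n]


def prune_candidates(authors, n_authors=5, n_papers=10):
    """Keep the most-cited authors, and the most-cited papers of each.

    The cutoffs are arbitrary defaults chosen for illustration.
    """
    pruned = []
    for author in n_best(authors, n_authors):
        kept = dict(author)  # shallow copy, so the input list is not modified
        kept["papers"] = n_best(author.get("papers") or [], n_papers)
        pruned.append(kept)
    return pruned
```

Pruning before requesting embeddings also keeps the number of API calls small, since each remaining paper needs one lookup in Step 2.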

Some Useful Python Packages

Hints: the following python packages may be useful:

import json
import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib
import matplotlib.pyplot as plt

Documentation of Background Material

You won't need most of these tutorials for this homework, but they are good things to know, and will be useful later in this class. Some of them are very open-ended.

  1. numpy: python tools for arrays
  2. requests: python tools to call URLs
  3. json: python tools to parse the JSON returned by API calls
  4. Semantic Scholar API: tools to request fields from Semantic Scholar; fields include papers, authors, embeddings and more
  5. sklearn: python tools for machine learning
  6. SciPy: python tools for scientific computing (statistics, optimization, linear algebra and more)
  7. NetworkX: Network (Graph) analysis in Python
  8. HuggingFace: a popular Hub for models, datasets and tutorials on deep nets for natural language (and more)
  9. matplotlib: python tools for plotting
  10. imshow: part of matplotlib
  11. GitHub: Tutorial on GitHub

Step 1: input name and output list of candidates and their papers

j = requests.get('https://api.semanticscholar.org/graph/v1/author/search?query=David Madigan&fields=name,citationCount,papers,papers.citationCount').json()
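The result `j` can then be flattened into one record per candidate author. The sketch below assumes that the response wraps the matching authors in a `data` list and that each paper carries a `paperId`; please check this against what the API actually returns for your query. The function name `list_candidates` is hypothetical.

```python
def list_candidates(j):
    """Turn an author-search response into one record per candidate author."""
    candidates = []
    # Assumption: matches are listed under the "data" key of the response.
    for author in j.get("data", []):
        papers = author.get("papers") or []
        candidates.append({
            "authorId": author.get("authorId"),
            "name": author.get("name"),
            "citationCount": author.get("citationCount"),
            # Keep only papers that actually have an id we can look up later.
            "paperIds": [p["paperId"] for p in papers if p.get("paperId")],
        })
    return candidates
```

For example, `for c in list_candidates(j): print(c["name"], c["citationCount"], len(c["paperIds"]))` prints one line per candidate, which is a quick way to see how many different people share the name.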

Step 2: input paper id and output embeddings

p = requests.get('https://api.semanticscholar.org/graph/v1/paper/00707ba45ffe6efa08a59693c47801211ca634d6?fields=title,embedding,citationCount').json()
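The embedding then has to be pulled out of `p` and turned into an array. The sketch below assumes the vector sits under `p['embedding']['vector']` and that some papers may come back without an embedding; both points are worth verifying on a real response. The helper names `embedding_vector` and `stack_embeddings` are hypothetical.

```python
import numpy as np


def embedding_vector(p):
    """Return the paper's embedding as a 1-D array, or None if it is missing."""
    # Assumption: the vector is stored under p["embedding"]["vector"].
    embedding = p.get("embedding") or {}
    vector = embedding.get("vector")
    if not vector:
        return None
    return np.asarray(vector, dtype=float)


def stack_embeddings(papers):
    """Return (paper ids, matrix) for the papers that have an embedding."""
    ids, rows = [], []
    for p in papers:
        v = embedding_vector(p)
        if v is not None:  # silently skip papers without an embedding
            ids.append(p.get("paperId"))
            rows.append(v)
    matrix = np.vstack(rows) if rows else np.empty((0, 0))
    return ids, matrix
```

Stacking the vectors into one matrix, with one row per paper, is the shape that `cosine_similarity` expects in Step 3.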

Step 3: compute pairwise similarities

See here
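One possible way to carry out this step uses `cosine_similarity` from the packages listed above. The sketch assumes the embeddings have already been collected into a matrix with one row per paper; the function name `pairwise_similarities` is hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_similarities(embeddings):
    """Return an (n, n) matrix of cosine similarities between the n rows."""
    X = np.asarray(embeddings, dtype=float)
    # With a single argument, cosine_similarity compares every row of X
    # with every other row of X.
    return cosine_similarity(X)


# Toy vectors standing in for real embeddings.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
sim = pairwise_similarities(toy)
```

Here rows 0 and 2 point in the same direction, so `sim[0, 2]` is 1, while rows 0 and 1 are orthogonal, so `sim[0, 1]` is 0. Cosine similarity ignores vector length, which is why the different magnitudes do not matter.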

Step 4: plot similarities

There are many tutorials on imshow such as this.
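A minimal sketch of the plot, assuming `sim` is the square similarity matrix from Step 3 and `labels` holds one short label per row (for example, an author name or paper id). The function name `plot_similarities`, the colormap and the output filename are illustrative choices, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_similarities(sim, labels, path="similarities.png"):
    """Draw the similarity matrix as a heatmap and save it to `path`."""
    sim = np.asarray(sim)
    fig, ax = plt.subplots(figsize=(6, 5))
    # Cosine similarity lies in [-1, 1], so fix the color scale to that range.
    image = ax.imshow(sim, cmap="viridis", vmin=-1, vmax=1)
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="cosine similarity")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)  # free the figure once it is on disk
    return path
```

If the rows are ordered so that papers by the same candidate author are adjacent, papers by the same person should tend to show up as bright blocks along the diagonal.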