Better Together
This assignment is based on this.
Please try this
and this.
Many names, such as David Madigan and Noah Smith, are ambiguous for search engines such as
Semantic Scholar: a single query can match several different authors.
This exercise provides a visualization to help resolve some of these ambiguities.
Tasks
Write a program that inputs a name such as Noah Smith
and outputs a visualization like this:
Note: Your picture will look different because you will be using different embeddings.
Please post code and output pictures on GitHub (or Colab), and share links to your code on Canvas.
Here is a colab link that includes some of the hints below: my colab.
Suggested steps:
- input name and output list of candidates and their papers
- input paper id and output embeddings
- compute pairwise similarities
- plot similarities
Suggested improvements:
- Sort candidate authors by citations.
- Sort candidate papers by citations.
- Limit candidate authors and candidate papers to n-best, for some reasonable value of n.
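The three improvements above amount to a sort followed by a slice. Here is a minimal sketch, assuming each candidate (author or paper) is a dict with a `citationCount` key, as you get when that field is requested. The helper name `n_best` and the default `n=5` are illustrative choices, not part of the assignment.

```python
def n_best(items, n=5, key="citationCount"):
    """Return the n items with the largest value under `key` (hypothetical helper).

    Missing or None counts are treated as 0, so incomplete records
    sort to the end instead of raising a TypeError.
    """
    ranked = sorted(items, key=lambda d: d.get(key) or 0, reverse=True)
    return ranked[:n]
```

The same helper can be applied twice: once to the list of candidate authors, and once to each author's list of papers.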
Some Useful Python Packages
Hints: the following python packages may be useful:
import json,requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib
import matplotlib.pyplot as plt
Documentation of Background Material
You won't need most of these tutorials for this homework, but they are good things to know and will be useful later in this class.
Some of them are very open-ended.
- numpy: python tools for arrays
- requests: python tools to call URLs
- json: python tools to parse the JSON returned by API calls
- Semantic Scholar API: tools to request fields from Semantic Scholar; fields include papers, authors, embeddings and more
- sklearn: python tools for machine learning
- SciPy: python tools for scientific computing (statistics, optimization, linear algebra and more)
- NetworkX: Network (Graph) analysis in Python
- HuggingFace: a popular Hub for models, datasets and tutorials on deep nets for natural language (and more)
- matplotlib: python tools for plotting
- imshow: part of matplotlib
- GitHub: Tutorial on GitHub
Step 1: Input name and output list of candidates and their papers
j = requests.get('https://api.semanticscholar.org/graph/v1/author/search?query=David Madigan&fields=name,citationCount,papers,papers.citationCount').json()
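The response `j` is a nested structure. The sketch below flattens it into one record per candidate author. It assumes the candidates sit in a list under the `data` key, and that each one carries `authorId`, `name`, `citationCount` and a `papers` list whose entries have a `paperId`. Check these assumptions against an actual response before relying on them. The function name `list_candidates` is made up for this example.

```python
def list_candidates(j):
    """Flatten an author-search response into simple records (hypothetical helper).

    Assumed layout: {"data": [{"authorId", "name", "citationCount",
    "papers": [{"paperId", "citationCount"}, ...]}, ...]}
    """
    candidates = []
    for author in j.get("data", []):
        papers = author.get("papers") or []
        candidates.append({
            "authorId": author.get("authorId"),
            "name": author.get("name"),
            "citationCount": author.get("citationCount") or 0,
            # Keep only papers that have an id to look up in Step 2.
            "paperIds": [p["paperId"] for p in papers if p.get("paperId")],
        })
    return candidates
```

Calling `list_candidates(j)` on the response above gives a list that is easy to print, sort, or truncate.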
Step 2: input paper id and output embeddings
p = requests.get('https://api.semanticscholar.org/graph/v1/paper/00707ba45ffe6efa08a59693c47801211ca634d6?fields=title,embedding,citationCount').json()
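Not every paper is guaranteed to come back with an embedding, so it helps to extract the vector defensively. This sketch assumes the response `p` stores the vector as a list of floats under `p['embedding']['vector']`; that layout is an assumption, so inspect `p` to confirm it. `embedding_vector` is a hypothetical helper name.

```python
import numpy as np

def embedding_vector(p):
    """Return the paper's embedding as a 1-D float array, or None if absent.

    Assumed layout: p["embedding"] == {"model": ..., "vector": [floats]}.
    """
    emb = p.get("embedding") or {}
    vec = emb.get("vector")
    if not vec:
        return None  # paper has no embedding; the caller should skip it
    return np.asarray(vec, dtype=float)
```

Papers for which this returns `None` can simply be left out of the similarity matrix.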
Step 3: compute pairwise similarities
See here
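For reference, the cosine similarity of two embeddings is their dot product divided by the product of their lengths. The NumPy-only sketch below computes it for a whole matrix of embeddings, one row per paper. `cosine_similarity` from sklearn, listed in the hints, should give the same matrix for non-zero rows. The function name `pairwise_cosine` is illustrative.

```python
import numpy as np

def pairwise_cosine(X):
    """Return the n-by-n matrix of cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid division by zero: an all-zero row gets similarity 0 with every row.
    norms[norms == 0] = 1.0
    unit = X / norms          # scale each row to unit length
    return unit @ unit.T      # dot products of unit vectors are cosines
```

The result is symmetric, with ones on the diagonal for every non-zero row.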
Step 4: plot similarities
There are many tutorials on imshow such as this.
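As a starting point, here is a minimal imshow sketch for a square similarity matrix. The function name `plot_similarities`, the figure size and the colormap are illustrative choices. The labels can be anything that identifies a paper, for example a short author name plus an index.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_similarities(sim, labels):
    """Draw a square similarity matrix as a heat map and return the figure."""
    sim = np.asarray(sim)
    fig, ax = plt.subplots(figsize=(6, 5))
    im = ax.imshow(sim, cmap="viridis")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    # One tick per paper, labelled on both axes.
    ticks = np.arange(len(labels))
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticklabels(labels)
    fig.tight_layout()
    return fig
```

If the papers are ordered so that each candidate author's papers are adjacent, papers by the same person tend to show up as bright blocks along the diagonal. The returned figure can be saved with `fig.savefig("similarities.png")`.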