This dataset is designed to demonstrate Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major source of errors is its inability to understand the meaning of research concepts in queries. To address this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections in the knowledge graph embedding. Experiments demonstrate that ESR improves Semantic Scholar’s online production system, especially on hard queries where word-based ranking fails.
If you find this dataset helpful in your work, please cite:
@inproceedings{xiong2017ESR,
title={Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding},
author={Xiong, Chenyan and Power, Russell and Callan, Jamie},
booktitle={Proceedings of the 26th International Conference on World Wide Web (WWW 2017)},
note={To appear},
year={2017},
organization={ACM}
}
In the zip file you will find the following files and folders:
s2_query.json contains the queries used in this paper. Each line is a JSON dictionary with the following format:
{
"qid": "the query id",
"query": "the query string",
"ana": {
the annotated entity id and frequency
}
}
The entities are from Freebase. Please refer to the final dump of Freebase to get more information about these entities.
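As a minimal sketch of reading this file, the Python snippet below parses one JSON dictionary per line and indexes the queries by qid. The function name load_queries is hypothetical, and the code assumes that "ana" maps each Freebase entity id to its frequency, as the format above suggests.

```python
import json
from typing import Dict, Iterable


def load_queries(lines: Iterable[str]) -> Dict[str, dict]:
    """Map each qid to its query string and entity annotations.

    `lines` can be an open file handle or any iterable of strings.
    """
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        queries[str(record["qid"])] = {
            "query": record["query"],
            # Assumption: "ana" is a dict of entity id -> frequency.
            "ana": record.get("ana", {}),
        }
    return queries
```

To use it on the released file, open s2_query.json with encoding="utf-8" and pass the file handle to load_queries.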
s2.trec contains ranking lists in TREC format. They come from SemanticScholar.org’s production search engine (as of summer 2016).
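A minimal sketch for reading the ranking file follows. It assumes the conventional six-column TREC run layout (query id, the literal Q0, document id, rank, score, run tag); if s2.trec deviates from that layout, the column indices need adjusting. The function name load_trec_run is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_trec_run(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Return, for each qid, a list of (docno, score) ordered by rank."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank lines and lines missing a column
        # Assumed columns: qid Q0 docno rank score run_tag
        qid, _q0, docno, rank, score = parts[:5]
        by_query[qid].append((int(rank), docno, float(score)))

    run = {}
    for qid, entries in by_query.items():
        entries.sort(key=lambda entry: entry[0])  # ascending rank
        run[qid] = [(docno, score) for _rank, docno, score in entries]
    return run
```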
s2_doc.json contains the candidate documents. Each line is a JSON dictionary. Its fields include:
docno: the doc id
title
keyPhrase: the automatically extracted key phrases for this paper.
paperAbstract: paper abstract
venue
numCitedBy: number of citations
numKeyCitations: number of key citations. A key citation means the citing paper considers this one a very important related work. It comes from Semantic Scholar's production system.
Ana: the annotations of the title, paperAbstract, and body fields.
Due to copyright restrictions, we are not allowed to release the body text. Please check https://api.semanticscholar.org/ to get the full corpus and more information about each document.
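The candidate documents can be indexed by docno with a short sketch like the one below. The field names come from the list above; the function name index_docs, the default values for missing fields, and the assumption that the citation counts are numeric are illustrative choices, not guarantees about the file.

```python
import json
from typing import Dict, Iterable


def index_docs(lines: Iterable[str]) -> Dict[str, dict]:
    """Build a docno -> metadata dictionary from JSON-lines input."""
    docs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        docs[str(record["docno"])] = {
            "title": record.get("title", ""),
            "paperAbstract": record.get("paperAbstract", ""),
            "venue": record.get("venue", ""),
            # Assumption: counts are numbers; missing or null becomes 0.
            "numCitedBy": int(record.get("numCitedBy") or 0),
            "numKeyCitations": int(record.get("numKeyCitations") or 0),
        }
    return docs
```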
s2.qrel contains the relevance judgments for these queries. They were labeled by the first two authors. Judging the relevance of computer science papers is very hard: we had to read many papers’ abstracts, or even introductions, ourselves before making any reasonable judgment. The current number of labels is limited. Check SemanticScholar.org for possible future benchmark releases.
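The sketch below reads the relevance judgments, assuming s2.qrel follows the standard four-column TREC qrels layout (query id, an unused iteration column, document id, integer relevance label). The function name load_qrels is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def load_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Return qid -> {docno: relevance label}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank lines and lines missing a column
        # Assumed columns: qid iteration docno relevance
        qid, _iteration, docno, relevance = parts[:4]
        qrels[qid][docno] = int(relevance)
    return dict(qrels)
```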
The ranking_res folder includes the ranking results of all baselines, developed methods, and alternative methods used in the experiments and analysis of this paper. Feel free to build future experiments on them.
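To compare a ranking against the relevance judgments, one option is NDCG at a cutoff k. The sketch below uses the common exponential gain (2^rel - 1) with a log2 rank discount; this variant and the default cutoff of 20 are assumptions made for illustration and may differ from the exact evaluation setup used in the paper. Documents without a label are treated as non-relevant.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_docnos: List[str], labels: Dict[str, int], k: int = 20) -> float:
    """NDCG@k for one query.

    ranked_docnos: document ids in ranked order (best first).
    labels: docno -> integer relevance label for this query.
    """
    # Unlabeled documents are assumed non-relevant (label 0).
    gains = [2 ** labels.get(docno, 0) - 1 for docno in ranked_docnos[:k]]
    ideal = sorted((2 ** rel - 1 for rel in labels.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging ndcg_at_k over all judged queries gives a single score per ranking file.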
The knowledge_graph_embedding folder contains the entity embeddings trained on our knowledge graph. They are in Google word2vec format.
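A minimal sketch for reading the embeddings and comparing two entities is shown below. It assumes the text variant of the word2vec format (a header line with vocabulary size and dimension, then one entity id followed by its vector per line); the binary variant needs a different reader. The function names load_word2vec_text and cosine are hypothetical, and cosine similarity is only one simple way to measure how closely two entities are connected in the embedding space.

```python
import math
from typing import Dict, Iterable, List


def load_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format into entity id -> vector."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

If the files are in the binary variant, a library reader such as gensim's KeyedVectors.load_word2vec_format with binary=True is a possible alternative.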
queryID query
------- ------------------------------------------------------------
1 deep learning
2 artificial intelligence
3 information retrieval
4 machine learning
5 question answering
6 noun phrases
7 penn treebank
8 speech recognition
9 data mining
10 computer vision
11 reinforcement learning
12 natural language
13 autoencoder
14 ontology
15 sentiment analysis
16 sap
17 lstm
18 natural language processing
19 semantic web
20 mooc
21 human computer interaction
22 eye movement clustering
23 semantic relations
24 efficient estimation of word representations in vector space
25 big data
26 audio visual fusion
27 object detection
28 gfdm
29 neural network
30 generalized extreme value
31 information geometry
32 image panorama video
33 data science
34 semantic parsing
35 augmented reality
36 imbalanced data
37 recommender system
38 inverse reinforcement learning mixture
39 transfer learning
40 cnn
41 dynamic programming segmentation
42 natural language interface
43 genetic algorithm
44 prolog
45 contact prediction
46 wifi malware
47 nsdi machine learning
48 forensics and machine learning
49 words to speech
50 information theory
51 morphology morphological
52 category theory
53 graph theory
54 smart thermostat
55 exploit vulnerability
56 reinforcement learning and video game
57 system health management
58 spatial multi agent systems
59 service composition
60 mobile payment
61 3 axis gantry
62 softmax categorization
63 cost aggregation
64 chinese dialect
65 depth camera
66 mobile tcp traffic analysis
67 collective learning
68 robust production planning
69 memory hierarchy
70 hashing
71 comparable corpora
72 knowledge graph
73 social media
74 deep learning surveillance
75 cryptography
76 parametric max flow
77 deep reinforcement learning
78 varying weight grasp
79 dirichlet process
80 word embedding
81 graph drawing
82 robust principal component analysis
83 differential evolution
84 seq2seq
85 document logical structure
86 duality
87 variable neighborhood search
88 urban public transportation systems
89 edx coursera
90 fdir
91 cryptography key management
92 ontology construction
93 go game
94 personality trait
95 sparse learning
96 directed hypergraph
97 inventory management
98 clojure
99 ontology semantic web
100 convolutional neural network time series