SciTail Dataset

File formats

We release the SciTail dataset in various formats for ease of use.

snli_format: JSONL format used by SNLI, with one JSON object per line, each corresponding to one entailment example.

tsv_format: Tab-separated format with three columns: premise, hypothesis, label.

dgem_format: Tab-separated format used by the DGEM model, with the columns: premise, hypothesis, label, hypothesis graph structure.
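As a rough illustration of how the first two formats might be read, here is a minimal Python sketch. It relies on several assumptions that the list above does not confirm: that tsv_format files have no header row, and that the snli_format files use the SNLI key names `sentence1`, `sentence2`, and `gold_label`. The function names and the sample sentences are made up for the example.

```python
import csv
import io
import json


def read_tsv_examples(handle):
    """Yield (premise, hypothesis, label) tuples from tsv_format text.

    Assumes three tab-separated columns and no header row.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        premise, hypothesis, label = row
        yield premise, hypothesis, label


def read_snli_examples(handle):
    """Yield (premise, hypothesis, label) tuples from snli_format text.

    Assumes the SNLI key names sentence1, sentence2, and gold_label.
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["sentence1"], record["sentence2"], record["gold_label"]


# Made-up sample data, not taken from the dataset.
sample_tsv = "Plants use sunlight to make food.\tPlants make food from sunlight.\tentails\n"
sample_jsonl = json.dumps({
    "sentence1": "Plants use sunlight to make food.",
    "sentence2": "Rocks make food from sunlight.",
    "gold_label": "neutral",
}) + "\n"

tsv_examples = list(read_tsv_examples(io.StringIO(sample_tsv)))
snli_examples = list(read_snli_examples(io.StringIO(sample_jsonl)))
```

With real files, the same functions could be passed a handle from `open(path, encoding="utf-8", newline="")` instead of the in-memory strings used here.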

Annotation Column Headers

Individual folders contain additional information about these formats. We also provide the complete list of annotations in all_annotations.tsv with the following columns:

Question: Original question

Answer choice: Correct answer choice

KB Sentence: Retrieved sentence used as the premise

Q+A as Sentence: Question and answer choice converted into a sentence. "???" is used if no blank (___) or wh-word is found in the question.

Question Source: Source of this question from {Pub4, Pub8, SciQ}{Train, Dev, Test}

Num. Support: Number of crowd-workers that annotated this sentence as supporting

Num. Partial: Number of crowd-workers that annotated this sentence as partially supporting

Num. None: Number of crowd-workers that annotated this sentence as unrelated

Total: Number of crowd-workers that annotated this sentence

IR Position: Position of this sentence in the retrieved sentences for this question

Label: Final entailment label, one of {entails, neutral}, for the pair with premise = KB Sentence and hypothesis = Q+A as Sentence
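The column list above could be mapped onto a typed record along the following lines. This is only a sketch under stated assumptions: that all_annotations.tsv stores the eleven columns in the order listed, that the four count columns and IR Position are integers, and that a header row, if present, should be skipped. The `Annotation` class, the `read_annotations` function, and the sample row are hypothetical and not part of the release.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Annotation:
    question: str
    answer_choice: str
    kb_sentence: str        # used as the premise
    qa_sentence: str        # used as the hypothesis
    question_source: str
    num_support: int
    num_partial: int
    num_none: int
    total: int
    ir_position: int
    label: str              # "entails" or "neutral"


def read_annotations(handle):
    """Yield Annotation records from tab-separated text.

    Assumes eleven columns in the documented order. Rows whose
    Num. Support field is not numeric (such as a header row) are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 11 or not row[5].strip().isdigit():
            continue
        yield Annotation(
            row[0], row[1], row[2], row[3], row[4],
            int(row[5]), int(row[6]), int(row[7]), int(row[8]), int(row[9]),
            row[10],
        )


# Made-up header and row for illustration only.
header = "\t".join([
    "Question", "Answer choice", "KB Sentence", "Q+A as Sentence",
    "Question Source", "Num. Support", "Num. Partial", "Num. None",
    "Total", "IR Position", "Label",
])
row = "\t".join([
    "What do plants use to make food?", "sunlight",
    "Plants use sunlight to make food.",
    "Plants use sunlight to make food.",
    "SciQTrain", "3", "1", "1", "5", "2", "entails",
])
annotations = list(read_annotations(io.StringIO(header + "\n" + row + "\n")))
```

Keeping the premise and hypothesis as named fields makes it straightforward to rebuild (premise, hypothesis, label) triples from the annotation file if needed.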