The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and its correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large corpus of web sentences and use these sentences as the premise P. We crowdsource the annotation of each premise-hypothesis pair as supports (entails) or not (neutral) to create the SciTail dataset. The dataset contains 27,026 examples: 10,101 with the entails label and 16,925 with the neutral label.
A limitation of mainstream entailment datasets is that they have been constructed in isolation from any end task. Moreover, in several cases, either the hypothesis or the premise has been synthesized specifically for creating the entailment dataset. In SciTail, both the premise and the hypothesis were authored independently of each other and independently of the entailment task. As a result, linguistic variations in the dataset are not limited by the coverage of rules or the creativity of crowd-workers. Further, the unfiltered web sentences used to create the premises tend to be highly diverse in various aspects (length, complexity, whether they are well-formed for a parser, etc.), adding to the linguistic challenge. Refer to our paper, "SciTail: A Textual Entailment Dataset from Science Question Answering", for additional information.
Question: Which of the following best explains how stems transport water to other parts of the plant?
Hypothesis: Stems transport water to other parts of the plant through a system of tubes.
Premise (entails): Water and other materials necessary for biological activity in trees are transported throughout the stem and branches in thin, hollow tubes in the xylem, or wood tissue.
Premise (neutral): Cut plant stems and insert stem into tubing while stem is submerged in a pan of water.
We release the SciTail dataset in various formats for ease of use.
snli_format: JSONL format used by SNLI, with one JSON object per line for each entailment example.
tsv_format: Tab-separated format with three columns: premise, hypothesis, label
dgem_format: Tab-separated format used by the DGEM model, with four columns: premise, hypothesis, label, hypothesis graph structure
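As a minimal sketch of reading the first two formats, the Python below parses one tsv_format line and one snli_format line into the same record type. The JSON field names sentence1, sentence2, and gold_label are the ones SNLI uses and are assumed here to carry over to snli_format; the Example type and both function names are illustrative and not part of the release.

```python
import json
from typing import NamedTuple


class Example(NamedTuple):
    premise: str
    hypothesis: str
    label: str  # "entails" or "neutral"


def parse_tsv_line(line: str) -> Example:
    """Parse one tsv_format line: premise, hypothesis, label."""
    premise, hypothesis, label = line.rstrip("\n").split("\t")
    return Example(premise, hypothesis, label)


def parse_snli_line(line: str) -> Example:
    """Parse one snli_format line (one JSON object per line).

    Assumption: SNLI's field names are used, with sentence1 as the
    premise, sentence2 as the hypothesis, and gold_label as the label.
    """
    obj = json.loads(line)
    return Example(obj["sentence1"], obj["sentence2"], obj["gold_label"])
```

Applying either function to each line of the corresponding file yields records with the same three fields, so downstream code does not need to know which format was loaded. A line that does not have exactly three tab-separated fields makes parse_tsv_line raise ValueError, which surfaces malformed input early.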
Individual folders contain additional information about these formats. We also provide the complete list of annotations in all_annotations.tsv with the following columns:
Question: Original question
Answer choice: Correct answer choice
KB Sentence: Retrieved sentence used as the premise
Q+A as Sentence: Question and answer choice converted into a sentence. “???” is used if no ___ (blank) or wh-word is found in the question.
Question Source: Source of this question, given as a source from {Pub4, Pub8, SciQ} paired with a split from {Train, Dev, Test}
Num. Support: Number of crowd-workers that annotated this sentence as supporting
Num. Partial: Number of crowd-workers that annotated this sentence as partially supporting
Num. None: Number of crowd-workers that annotated this sentence as unrelated
Total: Number of crowd-workers that annotated this sentence
IR Position: Position of this sentence in the retrieved sentences for this question
Label: Final entailment label, one of {entails, neutral}, for the pair with premise = KB Sentence and hypothesis = Q+A as Sentence
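The column list above maps naturally onto one record per row. The sketch below reads lines shaped like all_annotations.tsv with Python's csv module. It assumes the columns appear in the order listed and that the four count columns hold integers. Whether the file starts with a header row is not stated here, so the has_header flag is an assumption, as are the function names.

```python
import csv
from typing import Any, Dict, Iterable, Iterator

# Column names in the order listed above (their order in the file is assumed).
COLUMNS = [
    "Question", "Answer choice", "KB Sentence", "Q+A as Sentence",
    "Question Source", "Num. Support", "Num. Partial", "Num. None",
    "Total", "IR Position", "Label",
]
COUNT_COLUMNS = ["Num. Support", "Num. Partial", "Num. None", "Total"]


def read_annotations(
    lines: Iterable[str], has_header: bool = False
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per annotated premise-hypothesis pair."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for row in reader:
        record: Dict[str, Any] = dict(zip(COLUMNS, row))
        for name in COUNT_COLUMNS:
            record[name] = int(record[name])  # annotator counts as integers
        yield record


def support_fraction(record: Dict[str, Any]) -> float:
    """Share of crowd-workers who marked the sentence as supporting."""
    return record["Num. Support"] / record["Total"]
```

Passing an open file handle as lines streams the records without loading the whole file. The support_fraction helper is only a descriptive statistic over the counts; it is not a statement of how the final Label was decided, and it raises ZeroDivisionError for a row whose Total is 0.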