ARC Questions

Data Format

This data set consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2. These are text-only, English-language exam questions that span several grade levels, as indicated in the files. Each question has a multiple-choice structure (typically 4 answer options). The questions are sorted into a Challenge Set of 2,590 “hard” questions (those that both a retrieval method and a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions. Each set is pre-split into Train, Development, and Test sets as follows:

  • Challenge Train: 1,119
  • Challenge Dev: 299
  • Challenge Test: 1,172
  • Easy Train: 2,251
  • Easy Dev: 570
  • Easy Test: 2,376
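As a quick sanity check, the split sizes above can be summed and compared with the set totals quoted in the description. This is only a sketch: the SPLITS dictionary and the set_totals helper are names made up for this illustration and are not part of the data set.

```python
# Split sizes copied from the list above.
# SPLITS and set_totals are illustrative names, not part of the data set.
SPLITS = {
    "Challenge": {"Train": 1119, "Dev": 299, "Test": 1172},
    "Easy": {"Train": 2251, "Dev": 570, "Test": 2376},
}


def set_totals(splits):
    """Return the number of questions in each set, summed over its splits."""
    return {name: sum(parts.values()) for name, parts in splits.items()}


totals = set_totals(SPLITS)
grand_total = sum(totals.values())
print(totals, grand_total)
```

The sums should agree with the 2,590 Challenge questions, 5,197 Easy questions, and 7,787 questions overall stated above.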

Each set is provided in two formats, CSV and JSONL (one JSON object per line). The CSV files contain the full text of the question and its answer options in one cell. The JSONL files contain a split version of the question, where the question text has been separated from the answer options programmatically.
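The sketch below shows one way such a split could be done. It is not the procedure actually used to produce the files. It assumes that each answer option in the combined cell is introduced by a parenthesized marker such as "(A)" or "(1)", which this document does not state; the sample string is likewise constructed for illustration, and split_question and OPTION_MARKER are hypothetical names.

```python
import re

# Assumption: options are introduced by markers such as "(A)" or "(1)".
# The real cell layout is not described in this document.
OPTION_MARKER = re.compile(r"\(([A-E]|[1-5])\)\s*")


def split_question(cell):
    """Split a combined question cell into a stem and labeled choices."""
    # re.split with one capture group yields:
    # [stem, label1, text1, label2, text2, ...]
    parts = OPTION_MARKER.split(cell)
    stem = parts[0].strip()
    choices = [
        {"text": text.strip(), "label": label}
        for label, text in zip(parts[1::2], parts[2::2])
    ]
    return {"stem": stem, "choices": choices}


# Illustrative input only, not copied from the CSV files.
sample = (
    "Which technology was developed most recently? "
    "(A) cellular telephone (B) television (C) refrigerator (D) airplane"
)
result = split_question(sample)
print(result["stem"])
print([c["label"] for c in result["choices"]])
```

A stem that itself contains text such as "(A)" would be split incorrectly by this pattern, so real data may need a stricter rule.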

Please note: This data should not be distributed except by the Allen Institute for Artificial Intelligence (AI2). All parties interested in acquiring this data must download it from AI2 directly at data.allenai.org/arc. This data is to be used for non-commercial, research purposes only.

JSONL Structure

The JSONL files contain the same questions, one JSON object per line, split into the “stem” of the question (the question text) and the answer “choices” with their corresponding labels (A, B, C, D). The question ID is also included, in the id field.

{"id":"MCAS_2000_4_6","question":{"stem":"Which technology was developed most recently?","choices":[{"text":"cellular telephone","label":"A"},{"text":"television","label":"B"},{"text":"refrigerator","label":"C"},{"text":"airplane","label":"D"}]},"answerKey":"A"}
  • id - a unique identifier for the question (our own numbering)
  • question
    • stem - the question text
    • choices - the answer choices
      • label - the answer label ("A", "B", "C", "D")
      • text - the text associated with the answer label
  • answerKey - the label of the correct answer option
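A minimal sketch of reading records with this structure, using only Python's standard json module. The record is the example line shown above; read_jsonl and correct_choice_text are hypothetical helper names introduced here.

```python
import json

# The example record shown above, as it would appear on one line of a file.
LINE = (
    '{"id":"MCAS_2000_4_6","question":{"stem":"Which technology was '
    'developed most recently?","choices":[{"text":"cellular telephone",'
    '"label":"A"},{"text":"television","label":"B"},{"text":"refrigerator",'
    '"label":"C"},{"text":"airplane","label":"D"}]},"answerKey":"A"}'
)


def read_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def correct_choice_text(record):
    """Return the text of the choice whose label matches answerKey."""
    for choice in record["question"]["choices"]:
        if choice["label"] == record["answerKey"]:
            return choice["text"]
    return None  # no choice carries the answerKey label


records = read_jsonl([LINE])
print(records[0]["id"], "->", correct_choice_text(records[0]))
```

To read a real file, pass an open file handle to read_jsonl in place of the one-element list.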

CSV Structure

Comma-delimited (CSV) columns:

  • questionID - a unique identifier for the question (our own numbering)
  • originalQuestionID - the question number on the test
  • totalPossiblePoint - how many points the question is worth when scoring
  • AnswerKey - the correct answer option
  • isMultipleChoiceQuestion - 1 = multiple choice, 0 = other
  • includesDiagram - 1 = includes diagram, 0 = other
  • examName - the source of the exam
  • schoolGrade - grade level
  • year - publication year of the exam
  • question - the full text of the question, including its answer options
  • subject - the general question topic
  • category - Test, Train, or Dev
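
The columns above can be read with Python's standard csv module, as sketched below. The header row is built from the column names listed above, but every value in the sample row is a placeholder invented for this illustration, and the column order in the real files is an assumption. load_rows is a hypothetical helper name.

```python
import csv
import io

# Header built from the column list above; the data row is a placeholder,
# not real data, and the column order is assumed.
SAMPLE_CSV = (
    "questionID,originalQuestionID,totalPossiblePoint,AnswerKey,"
    "isMultipleChoiceQuestion,includesDiagram,examName,schoolGrade,year,"
    "question,subject,category\n"
    "EXAMPLE_1,1,1,A,1,0,Example Exam,4,2000,"
    '"Example question? (A) one (B) two (C) three (D) four",'
    "Example Subject,Train\n"
)


def load_rows(handle):
    """Read CSV rows as dicts, converting the 1/0 flag columns to booleans."""
    rows = []
    for row in csv.DictReader(handle):
        row["isMultipleChoiceQuestion"] = row["isMultipleChoiceQuestion"] == "1"
        row["includesDiagram"] = row["includesDiagram"] == "1"
        row["totalPossiblePoint"] = int(row["totalPossiblePoint"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
train_rows = [r for r in rows if r["category"] == "Train"]
print(len(train_rows), train_rows[0]["AnswerKey"])
```

For a real file, open it with newline="" as the csv module documentation recommends and pass the handle to load_rows.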