Dataset Structure

Data format

This download contains elementary-level and middle-school-level questions in multiple-choice format, both with and without associated diagrams. The questions come pre-split into Train, Development, and Test sets.

All questions are provided in CSV format containing the full text of the question, any applicable image reference(s), and its answer options in one cell. The accompanying images are provided in a separate directory.

All questions are additionally provided in JSONL format, containing a split version of the question where the question text has been separated from the answer options programmatically.

JSONL Structure

The JSONL files contain the same questions, split into the “stem” of the question (the question text) and the answer “choices” with their corresponding labels (A, B, C, D). The question ID is also included, as id. When an image is present, its file reference is inserted.

{"id":"89629","question":{"stem":"Which of the following groups of materials would most likely be used to build an electromagnet?","choices":[{"label":"A","text":"bare wire, plastic rod, battery"},{"label":"B","text":"bare wire, iron rod, light bulb"},{"label":"C","text":"insulated wire, iron rod, battery"},{"label":"D","text":"insulated wire, plastic rod, light bulb"}]},"answerKey":"C"}
  • id - a unique identifier for the question (our own numbering)
  • question
    • stem - the question text
    • choices - the answer choices
      • label - the answer label ("A", "B", "C", "D")
      • text - the text associated with the answer label
  • answerKey - the correct answer option
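
The field layout above can be read with a standard JSON parser, one record per line. The following is a minimal sketch, not an official loader: the function names parse_question and read_jsonl are hypothetical, and the sample record is the example line shown above.

```python
import json

# Sample record copied from the example line in this section.
SAMPLE_LINE = '{"id":"89629","question":{"stem":"Which of the following groups of materials would most likely be used to build an electromagnet?","choices":[{"label":"A","text":"bare wire, plastic rod, battery"},{"label":"B","text":"bare wire, iron rod, light bulb"},{"label":"C","text":"insulated wire, iron rod, battery"},{"label":"D","text":"insulated wire, plastic rod, light bulb"}]},"answerKey":"C"}'


def parse_question(line):
    """Parse one JSONL line into (id, stem, {label: text}, answer_key)."""
    record = json.loads(line)
    question = record["question"]
    # Map each answer label to its text, keeping the original order.
    choices = {c["label"]: c["text"] for c in question["choices"]}
    return record["id"], question["stem"], choices, record["answerKey"]


def read_jsonl(path):
    """Yield parsed questions from a JSONL file (hypothetical helper).

    Assumes one JSON object per line; blank lines are skipped.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_question(line)


qid, stem, choices, answer_key = parse_question(SAMPLE_LINE)
print(qid, answer_key, choices[answer_key])
```

Keeping the choices in a label-to-text mapping makes it straightforward to look up the text of the correct option through answerKey.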

CSV Structure

Comma-delimited (CSV) columns:

  • questionID - a unique identifier for the question (our own numbering)
  • originalQuestionID - the question number on the test
  • totalPossiblePoint - how many points the question is worth when scoring
  • AnswerKey - the correct answer option
  • isMultipleChoiceQuestion - 1 = multiple choice, 0 = other
  • includesDiagram - 1 = includes diagram, 0 = other
  • examName - the source of the exam
  • schoolGrade - grade level
  • year - publication year of the exam
  • question - the text of the question itself
  • subject - the general question topic
  • category - Test, Train, or Dev
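
The columns above can be read by name with Python's csv module. This is a minimal sketch under stated assumptions: it assumes the CSV has a header row with exactly these column names, the helper load_rows is hypothetical, and the two sample rows are made-up placeholders rather than real dataset entries.

```python
import csv
import io

# Made-up placeholder rows using the column names listed above.
# These are NOT real dataset entries.
SAMPLE_CSV = """questionID,originalQuestionID,totalPossiblePoint,AnswerKey,isMultipleChoiceQuestion,includesDiagram,examName,schoolGrade,year,question,subject,category
q1,1,1,A,1,0,Example Exam,4,2000,"Placeholder question text, with its answer options in the same cell",,Train
q2,2,1,C,1,1,Example Exam,8,2001,"Another placeholder question, also with options",,Test
"""


def load_rows(handle, category=None, with_diagram=None):
    """Return rows as dicts, optionally filtered by split and diagram flag.

    category: "Train", "Dev", or "Test" (None keeps all splits).
    with_diagram: True/False to filter on includesDiagram (None keeps all).
    """
    rows = []
    for row in csv.DictReader(handle):
        if category is not None and row["category"] != category:
            continue
        has_diagram = row["includesDiagram"] == "1"
        if with_diagram is not None and has_diagram != with_diagram:
            continue
        rows.append(row)
    return rows


train_rows = load_rows(io.StringIO(SAMPLE_CSV), category="Train")
diagram_rows = load_rows(io.StringIO(SAMPLE_CSV), with_diagram=True)
print(len(train_rows), len(diagram_rows))
```

For a file on disk, the same function could be called with a handle from open(path, newline="", encoding="utf-8"); the path and encoding are assumptions to adjust for the actual download.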