High-quality, engaging, relatable datasets are essential for data science education, and early research shows that a student’s choice of dataset has a substantial impact on their engagement. There are many datasets freely available online or ready to be provided by industry partners. However, those datasets may not be appropriate for a classroom audience, nor are they guaranteed to support the pedagogical outcomes required to teach introductory data science or statistical concepts in a relatable, engaging, and clear way. Additional work is required to select and clean datasets for use in data science classrooms or curricula, which can be a barrier for teachers and curriculum writers who want to create engaging and accessible data science lessons.
As a collaboration between Code.org, Bootstrap, and Data Science for Everyone, we've created a Datasets for Classroom Data Science Spec to give guidance on which types of datasets are most appropriate for classroom data science and most compatible with our tools for delivering instruction to students. It offers a pathway for individuals to find, clean, document, and upload datasets that can be used in data science tools (like Code.org’s App Lab) or curricula (like Bootstrap’s Data Science course), modeled after Bootstrap’s pilot with Brown University students.
This spec also codifies a requirement that datasets include a datasheet, adapted from the requirements listed in the research paper Datasheets for Datasets. These datasheets provide necessary context when considering the source and use of data, information about any normalizing or cleaning that was done to make a dataset compatible with the spec, as well as pedagogical considerations for educators and curriculum developers to best inform how these datasets can be used with intentionality within a lesson or curriculum.
- Read the Datasets for Classroom Data Science Spec
- Find a dataset that interests you or would be interesting to students
- Pre-process the data to make sure it aligns with the spec. This ensures the dataset will be accessible to students and work correctly with our tools.
- Create an Educator-Facing Datasheet. We've provided cloneable templates in Google Docs or as a README.md file in a public GitHub repository.
- Upload the dataset and datasheet to a public location (such as a shared Google Drive folder or a public GitHub repository).
- Email firstname.lastname@example.org to let us know about your dataset and we'll add it to App Lab! Our curriculum team may also incorporate it into lessons for students, or use it when developing new lessons or activities as our curricula grow.
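The pre-processing step above can be sketched in Python. This is a minimal illustration, not the spec's own checks: the particular clean-ups shown (trimming whitespace, dropping empty rows, flagging duplicate column names and rows of the wrong length) are assumptions about what commonly trips up classroom tools, and the function names `clean_header`, `clean_csv`, and `report_problems` are hypothetical. Consult the spec itself for the actual requirements.

```python
import csv
import io
import re


def clean_header(name):
    """Trim a column name and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", name.strip())


def clean_csv(text):
    """Return (header, rows) with tidy headers, trimmed cells, and no blank rows."""
    reader = csv.reader(io.StringIO(text))
    header = [clean_header(h) for h in next(reader)]
    rows = []
    for raw in reader:
        cells = [c.strip() for c in raw]
        if not any(cells):
            continue  # skip rows whose cells are all empty
        rows.append(cells)
    return header, rows


def report_problems(header, rows):
    """List issues a person should resolve by hand (assumed checks, not the spec's)."""
    problems = []
    if len(set(header)) != len(header):
        problems.append("duplicate column names")
    if any(h == "" for h in header):
        problems.append("blank column name")
    for i, row in enumerate(rows, start=1):
        if len(row) != len(header):
            problems.append(f"row {i} has {len(row)} cells, expected {len(header)}")
    return problems


# Small made-up sample with messy headers, an empty row, and a row that is too long.
sample = "  Name , Height  (cm)\nAda, 170 \n,\nGrace,165,extra\n"
header, rows = clean_csv(sample)
problems = report_problems(header, rows)
print(header)
print(rows)
print(problems)
```

The sketch only reports structural problems rather than silently fixing them, since decisions such as how to handle a row with an extra value belong in the datasheet's notes on cleaning.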
If you have any questions about submitting a dataset, feel free to email us at email@example.com!