The Data Nutrition Project

“A ‘nutrition label’ for datasets.

The Data Nutrition Project aims to create a standard label for interrogating datasets for measures that will ultimately drive the creation of better, more inclusive algorithms.

Our current prototype includes a highly generalizable, interactive data diagnostic label that allows for exploring any number of domain-specific aspects of datasets. Similar to a nutrition label on food, our Dataset Nutrition Label aims to highlight the key ingredients in a dataset, such as metadata and populations, as well as unique or anomalous features regarding distributions, missing data, and comparisons to other ‘ground truth’ datasets. We are currently testing our label on several datasets, with an eye towards open-sourcing this effort and gathering community feedback.

The design uses a ‘modular’ framework in which areas of investigation can be added or removed based on the domain of the dataset. For example, Dataset Nutrition Labels for data about people may include modules about the representation of race and gender, while Nutrition Labels for data about trees may not require those modules.

To learn more, check out our live prototype built on the Dollars for Docs dataset from ProPublica. A first draft of our paper can be found here….”
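
To make the modular idea in the description more concrete, here is a minimal sketch in Python. It is not the project's actual code: the function names (`metadata_module`, `missing_data_module`, `representation_module`, `build_label`) and the toy datasets are all hypothetical, chosen only to illustrate how a label might be assembled from modules that are included or left out depending on the dataset's domain.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical types for this sketch: a dataset is a list of rows,
# and a module is any function that summarizes those rows.
Row = Dict[str, Optional[str]]
Module = Callable[[List[Row]], dict]


def _columns(rows: List[Row]) -> List[str]:
    """Collect every column name that appears in any row."""
    return sorted({key for row in rows for key in row})


def metadata_module(rows: List[Row]) -> dict:
    """Basic 'ingredients': row count and column names."""
    return {"row_count": len(rows), "columns": _columns(rows)}


def missing_data_module(rows: List[Row]) -> dict:
    """Fraction of rows where each column is None or empty."""
    total = len(rows)
    if total == 0:
        return {}
    return {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / total
        for col in _columns(rows)
    }


def representation_module(column: str) -> Module:
    """Build a module that counts the values of one column (e.g. gender)."""
    def module(rows: List[Row]) -> dict:
        return dict(Counter(row[column] for row in rows if row.get(column)))
    module.__name__ = f"representation_{column}"
    return module


def build_label(rows: List[Row], modules: List[Module]) -> dict:
    """Run each selected module and key its output by the module's name."""
    return {module.__name__: module(rows) for module in modules}


# Toy data, invented for illustration only.
people = [
    {"name": "A", "gender": "F", "age": "34"},
    {"name": "B", "gender": "M", "age": None},
    {"name": "C", "gender": "F", "age": "51"},
]
trees = [
    {"species": "oak", "height_m": "12"},
    {"species": "pine", "height_m": ""},
]

# Data about people gets a representation module; data about trees does not.
people_label = build_label(
    people,
    [metadata_module, missing_data_module, representation_module("gender")],
)
trees_label = build_label(trees, [metadata_module, missing_data_module])
```

In this sketch, choosing which modules to pass to `build_label` plays the role of adding or removing areas of investigation: the label for the people dataset contains a `representation_gender` section, while the label for the trees dataset simply omits it.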