Data citations in context: We need disciplinary metadata to move forward

By Kathleen Gregory and Anton Ninkov

Data citations hold great promise for a variety of stakeholders. Unfortunately, due in part to a lack of metadata, e.g. about disciplinary domains, many of those promises remain out of reach. Metadata providers – repositories, publishers and researchers – play a key role in improving the current situation.

The potentials of data citations are many. From the research perspective, citations to data can help researchers discover existing datasets and understand or verify claims made in the academic literature. Citations are also seen as a way to give credit for producing, managing and sharing data, as well as to provide legal attribution. Researchers, funders and repository managers also hope that data citations can provide a mechanism for tracking and understanding the use and ‘impact’ of research data [1]. Bibliometricians, who study patterns in scholarly communication by tracing publications, citations and related metadata, are also interested in using data citations to understand engagements and relationships between data and other forms of research output. 

Figure 1. Perspectives about the potentials of data citation [2]

Realizing the potential of data citations relies on having complete, detailed and standardized metadata describing the who, what, when, where and how of data and their associated work. As we are discovering in the Meaningful Data Counts project, which brings together bibliometricians and members of the research data community as part of the broader Make Data Count initiative, the metadata needed to provide context for both data and data citations are often not provided in standardized ways, if they are provided at all.

As a first step in this project, we have been mapping the current state of metadata, shared data, and data citations available in the DataCite corpus. Our openly available Jupyter notebook pulls real-time metadata about data in DataCite [3] and demonstrates both the evolving nature of the corpus and the lack of available metadata. In particular, our work highlights the current lack of information about a critical metadata element for providing context about data citations – the disciplinary domain where data were created.

For example, we find that the amount of data available in DataCite increased by more than 1.5 million individual datasets over the seven-month period from January to July 2021, growing from 8,243,204 to 9,930,000 datasets. In January, only 5.7% of the available datasets had metadata describing their disciplinary domain according to the most commonly used subject classification system (see the treemap in Figure 2). In July, despite the increased number of datasets overall, the percentage with a disciplinary domain had dropped slightly to 5.63%.
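To make the scale of these figures concrete, the short Python calculation below turns them into approximate absolute counts. It is a sketch that uses only the numbers reported in the paragraph above; the derived counts are estimates, since the reported percentages are rounded.

```python
# Corpus sizes and shares as reported in the text (January and July 2021).
jan_total, jul_total = 8_243_204, 9_930_000
jan_share, jul_share = 0.057, 0.0563

# Growth of the corpus over the period.
growth = jul_total - jan_total

# Approximate number of datasets with a disciplinary domain
# (estimates only, because the reported percentages are rounded).
jan_classified = round(jan_total * jan_share)
jul_classified = round(jul_total * jul_share)

print(f"Corpus growth: {growth:,} datasets")
print(f"With a disciplinary domain in January: ~{jan_classified:,}")
print(f"With a disciplinary domain in July: ~{jul_classified:,}")
print(f"Without a disciplinary domain in July: ~{jul_total - jul_classified:,}")
```

Under these assumptions, the absolute number of datasets with a disciplinary domain grew by roughly 89,000, yet the share still fell because the corpus as a whole grew faster.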

Figure 2. Data with metadata describing disciplinary domain, according to the OECD Fields of Science classification, retrieved on July 9th, 2021. For an interactive version of this treemap, with the most current data, please see our Jupyter Notebook [3].

These low percentages reflect the fact that providing information about the subject or disciplinary domain of data is not a required field in the DataCite metadata schema. For the nearly 6% of data that do have subject information, the corpus contains multiple classification schemes of differing granularity levels, ranging from the more general to the more specific. DataCite currently works to automatically map these classifications to each other in order to improve disciplinary metadata. Organizations which submit their data to DataCite also have a role to play in improving these disciplinary descriptions, as this information underlies many of these mapping efforts.  
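As a rough illustration of how such a share might be computed, the Python sketch below counts the records that carry a subject entry in a given classification scheme. The record layout (a `subjects` list of dictionaries with `subject` and `subjectScheme` keys) and the scheme label are assumptions modelled on DataCite-style JSON, and the sample records are invented. This is not the code from the notebook [3].

```python
from typing import Iterable

# Assumed label for the OECD Fields of Science scheme; real records may differ.
FOS_SCHEME = "Fields of Science and Technology (FOS)"


def has_subject_in_scheme(record: dict, scheme: str = FOS_SCHEME) -> bool:
    """Return True if the record lists at least one subject in the given scheme."""
    subjects = record.get("subjects") or []
    return any(s.get("subjectScheme") == scheme for s in subjects)


def share_with_discipline(records: Iterable[dict], scheme: str = FOS_SCHEME) -> float:
    """Fraction of records with a subject in the given scheme (0.0 if there are none)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(has_subject_in_scheme(r, scheme) for r in records) / len(records)


# Invented sample records, for illustration only.
sample = [
    {"subjects": [{"subject": "FOS: Biological sciences", "subjectScheme": FOS_SCHEME}]},
    {"subjects": [{"subject": "ecology"}]},  # free-text keyword without a scheme
    {"subjects": []},
    {},  # subject is not a required field, so it may be missing entirely
]
print(f"{share_with_discipline(sample):.0%} of sample records have a disciplinary domain")
```

Note that the second sample record does have subject information, but only as a free-text keyword; it would not count towards the share unless it were mapped to the chosen scheme.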

Subject or disciplinary classifications for data are typically created using three methods:

  • Intellectually, where researchers, data creators or data curators use their expertise to assign a relevant subject classification. 
  • Automatically, where automated techniques are used to extract subject information from other data descriptions, e.g. the title or abstract (if available).
  • By proxy, where data are assigned the same subject classification as a related entity, e.g. when data are given the same subject classification as the repository where they are stored. This can be done either automatically or manually. 

Of these three methods, the intellectual method tends to be the most common; it is also the most accurate and the most time-consuming approach. This method is often carried out by those closest to the data, i.e. researchers/data creators or data curators, who have expert knowledge about the data’s subject or disciplinary context, which may be difficult to determine either automatically or by proxy.
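The three methods can be sketched as a simple fallback chain. The Python example below is a minimal illustration, not a description of how any repository or DataCite actually assigns subjects: the field names (`curated_subject`, `title`, `abstract`), the keyword map and the order of preference (intellectual, then automatic, then by proxy) are all hypothetical choices made for the sketch.

```python
from typing import Optional, Tuple

# Hypothetical keyword-to-discipline map; real automated techniques are far richer.
KEYWORDS = {
    "genome": "Biological sciences",
    "survey": "Sociology",
    "galaxy": "Physical sciences",
}


def classify(dataset: dict, repository_subject: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (subject, method), trying the three methods in an assumed order."""
    # 1. Intellectual: a subject assigned by a researcher, data creator or curator.
    if dataset.get("curated_subject"):
        return dataset["curated_subject"], "intellectual"
    # 2. Automatic: look for known keywords in the title or abstract, if available.
    text = f"{dataset.get('title', '')} {dataset.get('abstract', '')}".lower()
    for keyword, subject in KEYWORDS.items():
        if keyword in text:
            return subject, "automatic"
    # 3. By proxy: inherit the subject of the repository where the data are stored.
    if repository_subject:
        return repository_subject, "proxy"
    return None, "unclassified"


print(classify({"title": "Household survey 2019"}))
# ('Sociology', 'automatic')
print(classify({"title": "Untitled"}, repository_subject="Earth sciences"))
# ('Earth sciences', 'proxy')
```

The first example also hints at why the automatic method can be less accurate than expert judgement: a single keyword such as "survey" may appear in data from many different disciplines.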

While our work also exposes other examples of missing or incomplete metadata [4], we highlight here the current lack of information about disciplinary domains, as disciplinary information is important across all the perspectives shown in Figure 1. For example, disciplinary norms influence how data are shared, how they are made available, how they are understood and how they are reused. Information about disciplines is important for discovering data and is typically used by funders and research evaluators to place academic work in context. Disciplinary analyses are also a critical step in contextualizing citation practices in bibliometric studies, as citation behaviours have repeatedly been shown to follow discipline-specific patterns. Without disciplinary metadata, placing data citations into context will remain elusive and meaningful data metrics cannot be developed. 

In order to move forward with understanding data citations in context, we need better metadata – metadata about disciplinary domains, but also metadata describing other aspects of data creation and use. Metadata providers, from publishers to researchers to data repositories, can help to improve the current situation by working to create complete metadata records describing their data. Only with such metadata can the potentials of data citation be achieved. 


[1] These perspectives are visible, e.g. in the Joint Declaration of Data Citation Principles:

Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014

[2] Gregory, K. (2021, July). Bringing data in sight: Data citations in research. Presentation. Presented at Forum Bibliometrie 2021, Technical University of Munich, online. 

[3] Ninkov, A. (2021). antonninkov/ISSI2021: Datasets on DataCite – an Initial Bibliometric Investigation (1.0) [Computer software]. Zenodo.

[4] Ninkov, A., Gregory, K., Peters, I., Haustein, S. (2021). Datasets on DataCite – An initial bibliometric investigation. Proceedings of the 18th International Conference of the International Society for Scientometrics and Informetrics, Leuven, Belgium (virtual). Preprint: