A large dataset of scientific text reuse in Open-Access publications | Scientific Data

Abstract:  We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains 91 million cases of reused text passages found in 4.2 million unique open-access publications. Cases range from overlap of as few as eight words to near-duplicate publications and include a variety of reuse types, ranging from boilerplate text to verbatim copying to quotations and paraphrases. Featuring a high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous ones on scientific writing. The Webis-STEREO-21 does not indicate if a reuse case is legitimate or not, as its focus is on the general study of text reuse in science, which is legitimate in the vast majority of cases. It allows for tackling a wide range of research questions from different scientific backgrounds, facilitating both qualitative and quantitative analysis of the phenomenon as well as a first-time grounding on the base rate of text reuse in scientific publications.

 

EU list of specific high-value datasets and the arrangements for their publication and re-use

“(12) It is the objective of Directive (EU) 2019/1024 to promote the use of standard public licences available online for re-using public sector information. The Commission’s Guidelines on recommended standard licences, datasets and charging for the re-use of documents (5) identify Creative Commons (‘CC’) licences as an example of recommended standard public licences. CC licences are developed by a non-profit organisation and have become a leading licensing solution for public sector information, research results and cultural domain material across the world. It is therefore necessary to refer in this Implementing Regulation to the most recent version of the CC licence suite, namely CC 4.0. A licence equivalent to the CC licence suite may include additional arrangements, such as the obligation on the re-user to include updates provided by the data holder and to specify when the data were last updated, as long as they do not restrict the possibilities for re-using the data….”

Commission defines high-value datasets to be made available for re-use

Today, the Commission has published a list of high-value datasets that public sector bodies will have to make available for re-use, free of charge, within 16 months.

 

Certain public sector data, such as meteorological or air quality data, are particularly interesting for creators of value-added services and applications and have important benefits for society, the environment and the economy – which is why they should be made available to the public.

 

TIER2

“Enhancing Trust, Integrity and Efficiency in Research through next-level Reproducibility…

TIER2 aims to boost knowledge on reproducibility, create tools, engage communities, implement interventions and policy across different contexts to increase re-use and overall quality of research results….”

Rescuing NASA’s Historical Airborne and Field Data for Open Access and Reuse – NASA Technical Reports Server (NTRS)

“Data rescue: the act of finding and transferring data from a non-public place to a publicly accessible and supported repository. May require data transformation from analog to digital format(s).

Data recovery: the process of restoring data that has been lost, accidentally deleted, corrupted or made inaccessible….”

Kenyon | The Journal Article as a Means to Share Data: a Content Analysis of Supplementary Materials from Two Disciplines | Journal of Librarianship and Scholarly Communication

Abstract:  INTRODUCTION The practice of publishing supplementary materials with journal articles is becoming increasingly prevalent across the sciences. We sought to understand better the content of these materials by investigating the differences between the supplementary materials published by authors in the geosciences and plant sciences. METHODS We conducted a random stratified sampling of four articles from each of 30 journals published in 2013. In total, we examined 297 supplementary data files for a range of different factors. RESULTS We identified many similarities between the practices of authors in the two fields, including the formats used (Word documents, Excel spreadsheets, PDFs) and the small size of the files. There were differences identified in the content of the supplementary materials: the geology materials contained more maps and machine-readable data; the plant science materials included much more tabular data and multimedia content. DISCUSSION Our results suggest that the data shared through supplementary files in these fields may not lend itself to reuse. Code and related scripts are not often shared, nor is much ‘raw’ data. Instead, the files often contain summary data, modified for human reading and use. CONCLUSION Given these and other differences, our results suggest implications for publishers, librarians, and authors, and may require shifts in behavior if effective data sharing is to be realized.
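
The sampling design described in the METHODS (four articles drawn at random from each of 30 journals) can be illustrated with a minimal sketch. This is not the authors' procedure or code: the function name `stratified_sample`, the journal and article identifiers, the assumed 50 articles per journal, and the fixed seed are all hypothetical and chosen only to make the example reproducible.

```python
import random


def stratified_sample(articles_by_journal, per_journal=4, seed=0):
    """Draw `per_journal` articles at random from each journal (stratum).

    `articles_by_journal` maps a journal name to a list of article IDs.
    Both the mapping and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    sample = {}
    # Iterate in sorted order so the result is reproducible for a given seed.
    for journal, articles in sorted(articles_by_journal.items()):
        if len(articles) < per_journal:
            raise ValueError(f"{journal} has fewer than {per_journal} articles")
        # Sampling without replacement inside each stratum.
        sample[journal] = rng.sample(articles, per_journal)
    return sample


# Hypothetical catalog: 30 journals with 50 articles each.
catalog = {
    f"journal_{j:02d}": [f"journal_{j:02d}/article_{a:03d}" for a in range(50)]
    for j in range(30)
}

picked = stratified_sample(catalog)
total = sum(len(articles) for articles in picked.values())
print(total)  # 4 articles x 30 journals = 120
```

Sampling within each journal separately guarantees that every journal contributes the same number of articles, which a simple random draw over the pooled articles would not.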

 

Over €4.4 million granted to four new projects to enhance the common European data space for cultural heritage | Europeana Pro

“The Europeana Initiative is at the heart of the common European data space for cultural heritage, a flagship initiative of the European Union to support the digital transformation of the cultural heritage sector. Discover the projects funded under the initiative….

We are delighted to announce that the European Commission has funded four projects under their new flagship initiative for deployment of the common European data space for cultural heritage. The call for these projects, launched in spring 2022, aimed at seizing the opportunities of advanced technologies for the digital transformation of the cultural heritage sector. This included a focus on 3D, artificial intelligence or machine learning for increasing the quality, sustainability, use and reuse of data, which we are excited to see the projects explore in the coming months….”

Open Science in Developmental Science | Annual Review of Developmental Psychology

Abstract:  Open science policies have proliferated in the social and behavioral sciences in recent years, including practices around sharing study designs, protocols, and data and preregistering hypotheses. Developmental research has moved more slowly than some other disciplines in adopting open science practices, in part because developmental science is often descriptive and does not always strictly adhere to a confirmatory approach. We assess the state of open science practices in developmental science and offer a broader definition of open science that includes replication, reproducibility, data reuse, and global reach.

 

Measuring the impact of health research data in terms of data citations by scientific publications | SpringerLink

From the abstract: “In order to figure out disease-specific data sharing and reuse level, we took the number of data records and their citations in the scientific literature in the Data Citation Index platform as approximate indicators. The results indicated that only a small percentage (7.5%) of health data records had received documented citations by scientific publications. We find the level of data sharing and reuse varies across diseases. Our study suggested that the more socioeconomic burden and the more research funding, the more likely scientific data for diseases will be produced and made available. But such a correlation could not be observed for the activity of data reuse. Secondary reuse of scientific data is a complex behavior.”

The connection of open science practices and the methodological approach of researchers | SpringerLink

Steinhardt, I., Bauer, M., Wünsche, H. et al. The connection of open science practices and the methodological approach of researchers. Qual Quant (2022). https://doi.org/10.1007/s11135-022-01524-4

Abstract: The Open Science movement is gaining tremendous popularity and tries to initiate changes in science, for example the sharing and reuse of data. The new requirements that come with Open Science pose several challenges for researchers. While most of these challenges have already been addressed in several studies, little attention has been paid so far to the underlying Open Science practices (OSP). An exploratory study was conducted focusing on the OSP relating to sharing and using data. 13 researchers from the Weizenbaum Institute were interviewed. The Weizenbaum Institute is an interdisciplinary research institute in Germany that was founded in 2017. To reconstruct OSP, a grounded theory methodology (Strauss in Qualitative Analysis for Social Scientists, Cambridge University Press, Cambridge, 1987) was used, and OSP were classified into open production, open distribution and open consumption (Smith in Openness as social praxis. First Monday, 2017). The research shows that apart from the disciplinary background and research environment, the methodological approach and the type of research data play a major role in the context of OSP. The interviewees’ self-attributions related to the types of data they work with: qualitative, quantitative, social media and source code. With regard to the methodological approach and type of data, it was uncovered that uncertainties and missing knowledge, data protection, competitive disadvantages, vulnerability and costs are the main reasons for the lack of openness. The analyses further revealed that knowledge and established data infrastructures as well as competitive advantages act as drivers for openness. Because of the link between research data and OSP, the authors of this paper argue that in order to promote OSP, the methodological approach and the type of research data must also be considered.

 

Evaluating the (in)accessibility of data behind papers in astronomy

Abstract:  This paper presents results of a survey of authors of journal articles published over several decades in astronomy. The study focuses on determining the characteristics and accessibility of data behind papers, referring to the spectrum of raw and derived data that would be needed to validate the results of a particular published article as a capsule of scientific knowledge. Curating the data behind papers can arguably lead to new discoveries through reuse. However, as shown through related research and confirmed by the results of the present study, a fully accessible portrait of the data behind papers is often unavailable. These findings have implications for reusability efforts and are presented alongside a discussion of open science.

Hello World, From Wikimedia Enterprise | 21 Jun 2022

“We launched Wikimedia Enterprise last year with a goal of making it easy to programmatically access data from across the Wikimedia Foundation projects. Since then, we have been busy building a product that can serve the needs of commercial users of any size. Today, we are thrilled to share some of the first customers using this product, in addition to new features that make it easy for anyone to start using Wikimedia Enterprise.

Today, we are excited to announce that:

Google has become the very first customer of Wikimedia Enterprise.

The Internet Archive will receive full access to Enterprise’s feature set, at no cost, for use in furthering their mission of archiving the Web.

Self-service trial accounts are available to anyone to try out Wikimedia Enterprise for their own use. Trial accounts include unlimited free access to a monthly snapshot of the entire Wikimedia Enterprise project archive and 10,000 free requests from our On-Demand API.

New product and pricing details are now available, including a pricing calculator to estimate usage cost after a trial, as well as comprehensive product documentation, and a customer service portal with detailed FAQs.

We have also added a news page (you are reading it!) to better communicate updates and announcements to current and potential customers….”

A survey of researchers’ code sharing and code reuse practices, and assessment of interactive notebook prototypes [PeerJ]

Abstract:  This research aimed to understand the needs and habits of researchers in relation to code sharing and reuse; gather feedback on prototype code notebooks created by NeuroLibre; and help determine strategies that publishers could use to increase code sharing. We surveyed 188 researchers in computational biology. Respondents were asked about how often and why they look at code, which methods of accessing code they find useful and why, what aspects of code sharing are important to them, and how satisfied they are with their ability to complete these tasks. Respondents were asked to look at a prototype code notebook and give feedback on its features. Respondents were also asked how much time they spent preparing code and if they would be willing to increase this to use a code sharing tool, such as a notebook. As a reader of research articles the most common reason (70%) for looking at code was to gain a better understanding of the article. The most commonly encountered method for code sharing–linking articles to a code repository–was also the most useful method of accessing code from the reader’s perspective. As authors, the respondents were largely satisfied with their ability to carry out tasks related to code sharing. The most important of these tasks were ensuring that the code was running in the correct environment, and sharing code with good documentation. The average researcher, according to our results, is unwilling to incur additional costs (in time, effort or expenditure) that are currently needed to use code sharing tools alongside a publication. We infer this means we need different models for funding and producing interactive or executable research outputs if they are to reach a large number of researchers. For the purpose of increasing the amount of code shared by authors, PLOS Computational Biology is, as a result, focusing on policy rather than tools.

 

Study on EU copyright and related rights and access to and reuse of data – Publications Office of the EU

European Commission, Directorate-General for Research and Innovation, Senftleben, M., Study on EU copyright and related rights and access to and reuse of data, Publications Office of the European Union, 2022, https://data.europa.eu/doi/10.2777/78973

EU legislation in the field of copyright, related rights and sui generis database rights can have a deep impact on access to data resources for scientific research and the availability of data resulting from publicly funded research. To establish a copyright and related rights framework that offers appropriate data access and reuse opportunities for scientific research, it is necessary to identify potential barriers and challenges that may arise from EU copyright and related rights legislation and corresponding rights management. This study analyses the interaction between copyright and related rights law and data access and reuse for scientific research purposes. It proposes legislative and non-legislative measures to improve the current EU regulatory framework.