Data Management Plans: Implications for Automated Analyses

Abstract:  Data management plans (DMPs) are an essential part of planning data-driven research projects and ensuring long-term access and use of research data and digital objects; however, as text-based documents, DMPs must be analyzed manually for conformance to funder requirements. This study presents a comparison of DMP evaluations for 21 funded projects using 1) an automated means of analysis to identify elements that align with best practices in support of open research initiatives and 2) a manually applied scorecard measuring these same elements. The automated analysis revealed that terms related to availability (90% of DMPs), metadata (86% of DMPs), and sharing (81% of DMPs) were reliably supplied. Manual analysis revealed that 86% (n = 18) of funded DMPs were adequate, with strong discussions of data management personnel (average score: 2 out of 2), data sharing (average score: 1.83 out of 2), and limitations to data sharing (average score: 1.65 out of 2). This study reveals that the automated approach to DMP assessment yields results that are less granular than, but broadly consistent with, manual assessment, while being far more efficient to produce. Additional observations and recommendations are also presented to make data management planning exercises and automated analysis even more useful going forward.
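At its core, the automated analysis described above amounts to scanning DMP text for terms associated with best practices and reporting what fraction of plans mention each. A minimal sketch of that idea (the term lists and sample plans here are illustrative, not the study's actual lexicon or corpus):

```python
# Illustrative keyword-based DMP scanning. Term groups below are
# assumptions for demonstration, not the study's actual vocabulary.
TERM_GROUPS = {
    "availability": ["available", "availability", "access"],
    "metadata": ["metadata", "data dictionary"],
    "sharing": ["share", "sharing", "deposit"],
}

def scan_dmps(dmps):
    """Return, for each term group, the fraction of DMPs mentioning any term."""
    counts = {group: 0 for group in TERM_GROUPS}
    for text in dmps:
        lowered = text.lower()
        for group, terms in TERM_GROUPS.items():
            if any(term in lowered for term in terms):
                counts[group] += 1
    n = len(dmps)
    return {group: count / n for group, count in counts.items()}

# Hypothetical mini-corpus of three plans.
dmps = [
    "Data will be made available via a public repository with rich metadata.",
    "We will share de-identified data; metadata follows DataCite.",
    "No plans to deposit data.",
]
print(scan_dmps(dmps))
```

A real system would need stemming, phrase matching, and negation handling ("no plans to deposit" above is counted as a sharing mention), which is one reason such automated results are coarser than a human-scored rubric.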


AI, Instructional Design, and OER – improving learning

“In other words, as far as the US Copyright Office is concerned, output from programs like ChatGPT or Stable Diffusion are not eligible for copyright protection. Now, that could change if Congress gets involved (are they actually capable of doing anything?), or if the Supreme Court turned its collective back on decades of precedent (which, admittedly, has been happening recently). But unless something rather dramatic happens along these lines, the outputs of generative AI programs will continue to pass immediately into the public domain. Consequently, they will be open educational resources under the common definition….

Generative AI tools could have an incredible impact on the breadth and diversity of OER that exist, since they will dramatically decrease the cost and time necessary to create the informational resources that underlie these learning materials. Current funding for the creation of OER (when it’s available at all) typically focuses on the courses enrolling the largest number of students. This makes sense when your theory of philanthropy is that the money you spend should benefit the most people possible. But that theory of philanthropy also means that the “long tail” of courses that each enroll relatively few students are unlikely to ever receive funding. LLMs will radically alter the economics of creating OER, and should make it possible for OER to come to these courses as well. (And while LLMs will have a significant impact on the economics of creating OER, they may not have as dramatic an impact on the sustainability, maintenance, and upkeep of OER over time.)…”

Surprise machines | John Benjamins

“Although “the humanities so far has focused on literary texts, historical text records, and spatial data,” as stated by Lev Manovich in Cultural Analytics (Manovich, 2020, p. 10), the recent advancements in artificial intelligence are driving more attention to other media. For example, disciplines such as digital humanities now embrace more diverse types of corpora (Champion, 2016). Yet this shift of attention is also visible in museums, which recently took a step forward by establishing the field of experimental museology (Kenderdine et al., 2021).

This article illustrates the visualization of an extensive image collection through digital means. Following a growing interest in the digital mapping of images – proved by the various scientific articles published on the subject (Bludau et al., 2021; Crockett, 2019; Seguin, 2018), Ph.D. theses (Kräutli, 2016; Vane, 2019), software (American Museum of Natural History, 2020/2022; Diagne et al., 2018; Pietsch, 2018/2022), and presentations (Benedetti, 2022; Klinke, 2021) – this text describes an interdisciplinary experiment at the intersection of information design, experimental museology, and cultural analytics.

Surprise Machines is a data visualization that maps more than 200,000 digital images of the Harvard Art Museums (HAM) and a digital installation for museum visitors to understand the collection’s vastness. Part of a temporary exhibition organized by metaLAB (at) Harvard and entitled Curatorial A(i)gents, Surprise Machines is enriched by a choreographic interface that allows visitors to interact with the visualization through a camera capturing body gestures. The project is unique for its interdisciplinarity, looking at the prestigious collection of Harvard University through cutting-edge techniques of AI….”

Opinion: Why we’re becoming a Digital Public Good — and why we aren’t | Devex

“A few months ago, Medtronic LABS made the decision to open source our digital health platform SPICE, and pursue certification as a Digital Public Good. DPGs are defined by the Digital Public Good Alliance as: “Open-source software, open data, open AI models, open standards, and open content that adhere to privacy and other applicable laws and best practices, do no harm by design, and help attain the Sustainable Development Goals.” The growing momentum around DPGs in global health is relatively new, coinciding with the launch of the U.N. Secretary General’s Roadmap for Digital Cooperation in 2020. The movement aims to put governments in the driver’s seat, promote better collaboration among development partners, and reduce barriers to the digitization of health systems.”

Can artificial intelligence assess the quality of academic journal articles in the next REF? | Impact of Social Sciences

“For journal article prediction, there is no knowledge base related to quality that could be leveraged to predict REF scores across disciplines, so only the machine learning AI approach is possible. All previous attempts to produce related predictions have used machine learning (or statistical regression, which is also a form of pattern matching). Thus, we decided to build machine learning systems to predict journal article scores. As inputs, based on an extensive literature review of related prior work, we chose: field and year normalised citation rate; authorship team size, diversity, productivity, and field and year normalised average citation impact; journal names and citation rates (similar to the Journal Impact Factor); article length and abstract readability; and words and phrases in the title, keywords and abstract. We used provisional REF2021 scores for journal articles with these inputs and asked the AI to spot patterns that would allow it to accurately predict REF scores….”
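The approach described in the excerpt is regression from bibliometric features to a quality score. A toy sketch of that pattern-matching setup, using a ridge regression on synthetic stand-ins for the listed features (citation rate, team size, journal citation rate, readability); the data and weights are invented for illustration, not REF data:

```python
import numpy as np

# Synthetic stand-ins for features like those listed above.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.lognormal(0.0, 1.0, n),   # field/year-normalised citation rate
    rng.integers(1, 10, n),       # authorship team size
    rng.lognormal(0.0, 0.5, n),   # journal citation rate (JIF-like)
    rng.normal(50, 10, n),        # abstract readability score
])
true_w = np.array([0.8, 0.05, 0.4, 0.01])  # invented "true" weights
y = X @ true_w + rng.normal(0, 0.2, n)     # synthetic provisional scores

# Ridge regression fit via the regularised normal equations.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
pred = X @ w
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))
```

The study's actual systems are machine-learning models trained on provisional REF2021 scores with text features as well; this sketch only shows the shape of the inputs-to-score regression it describes.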

Frontiers | Open Science, Open Data, and Open Scholarship: European Policies to Make Science Fit for the Twenty-First Century

“Open science will make science more efficient, reliable, and responsive to societal challenges. The European Commission has sought to advance open science policy from its inception in a holistic and integrated way, covering all aspects of the research cycle from scientific discovery and review to sharing knowledge, publishing, and outreach. We present the steps taken with a forward-looking perspective on the challenges lying ahead, in particular the necessary change of the rewards and incentives system for researchers (for which various actors are co-responsible and which goes beyond the mandate of the European Commission). Finally, we discuss the role of artificial intelligence (AI) within an open science perspective.”

Thoughts on AI’s Impact on Scholarly Communications? An Interview with ChatGPT – The Scholarly Kitchen

“As 2022 drew to a close, a great deal of popular attention was drawn to the latest artificial intelligence chatbot, ChatGPT, which was released in November 2022 by OpenAI. (As an aside, the company has had a very interesting background and funding, which has produced a number of important AI advances). Machine learning, natural language processing and textual creation have made significant advances over the past decade. When the first auto-generation of content from structured data became commercially viable about a decade ago, it was reasonably easy to discern machine-generated content. This is increasingly no longer the case. Content distributors and assessors of content, be that for scholarly peer review, for academic credentialling, or simply those who consume content, should be aware of the capabilities of these tools and should not be dismissive of them.

Given the range of questions in scholarly communications around the application of AI, I thought it might be interesting to see what ChatGPT’s responses would be to some of these, along with other literary/tech questions, and to share them with you. You can review for yourself whether you think the responses are good ones or not and, if you didn’t know the source of the responses, whether you could tell that they were derived from a machine. Copied below are the questions and responses. I have not edited the responses in any way from what was output by ChatGPT. It seems we’ve moved well beyond the “Turing Test” for assessing the intelligence of a machine. You may notice there is something formulaic to some of the responses, but it’s only discernible after looking over several of them. Though it is important to reflect that the machine doesn’t “know” whether the answers are correct or not, only that they are statistically valid responses to the questions posed….”

GitHub is Sued, and We May Learn Something About Creative Commons Licensing – The Scholarly Kitchen

“I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
Does the attribution need to be included in the data set at every stage?
Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated….”

Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers | bioRxiv

Abstract:  Background Large language models such as ChatGPT can produce increasingly realistic text, with unknown information on the accuracy and integrity of using these models in scientific writing.

Methods We gathered ten research abstracts from each of five high-impact-factor medical journals (n = 50) and asked ChatGPT to generate research abstracts based on their titles and journals. We evaluated the abstracts using an artificial intelligence (AI) output detector and a plagiarism detector, and had blinded human reviewers try to distinguish whether abstracts were original or generated.

Results All ChatGPT-generated abstracts were written clearly but only 8% correctly followed the specific journal’s formatting requirements. Most generated abstracts were detected using the AI output detector, with a median [interquartile range] score (higher meaning more likely to be generated) of 99.98% [12.73, 99.98], compared with a very low probability of AI-generated output in the original abstracts of 0.02% [0.02, 0.09]. The AUROC of the AI output detector was 0.94. Generated abstracts scored very high on originality using the plagiarism detector (100% [100, 100] originality). Generated abstracts had a similar patient cohort size as original abstracts, though the exact numbers were fabricated. When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, but that the generated abstracts were vaguer and had a formulaic feel to the writing.

Conclusion ChatGPT writes believable scientific abstracts, though with completely generated data. These are original without any plagiarism detected but are often identifiable using an AI output detector and skeptical human reviewers. Abstract evaluation for journals and medical conferences must adapt policy and practice to maintain rigorous scientific standards; we suggest inclusion of AI output detectors in the editorial process and clear disclosure if these technologies are used. The boundaries of ethical and acceptable use of large language models to help scientific writing remain to be determined.
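The AUROC reported in the results above measures how well detector scores separate generated from original abstracts: it is the probability that a randomly chosen generated abstract outscores a randomly chosen original one. A minimal sketch with synthetic scores (not the study's data):

```python
def auroc(scores_pos, scores_neg):
    """Probability a random positive outscores a random negative (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Synthetic detector scores: generated abstracts mostly score high,
# originals mostly score low, with some overlap as in the study.
generated = [99.98, 85.0, 99.9, 12.73, 70.0]
original = [0.02, 0.09, 0.02, 30.0, 0.05]
print(round(auroc(generated, original), 2))  # → 0.96
```

An AUROC of 1.0 would mean perfect separation; the study's 0.94 reflects the minority of generated abstracts (like the 12.73% score above) that fall into the range of originals.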

Open access database for artificial intelligence research – Gastrointestinal Endoscopy

Abstract:  Artificial intelligence (AI) for gastrointestinal endoscopy is an important and rapidly developing area of research. In particular, AI research in colonoscopy has attracted significant attention, with several AI medical devices already on the market in the United States and worldwide. With the help of AI models based on machine learning, endoscopists can appreciate improved and operator-independent colonoscopy quality, such as AI-driven detection and characterization of colorectal polyps. 

Over €4.4 million granted to four new projects to enhance the common European data space for cultural heritage | Europeana Pro

“The Europeana Initiative is at the heart of the common European data space for cultural heritage, a flagship initiative of the European Union to support the digital transformation of the cultural heritage sector. Discover the projects funded under the initiative….

We are delighted to announce that the European Commission has funded four projects under their new flagship initiative for deployment of the common European data space for cultural heritage. The call for these projects, launched in spring 2022, aimed at seizing the opportunities of advanced technologies for the digital transformation of the cultural heritage sector. This included a focus on 3D, artificial intelligence or machine learning for increasing the quality, sustainability, use and reuse of data, which we are excited to see the projects explore in the coming months….”

Confidence at Scale: Using Technology to Assess Research Credibility

“After multiple years of data collection, the Research team at the Center for Open Science (COS) is preparing for the end of its participation in DARPA’s Systematizing Confidence in Open Research and Evidence (SCORE) program and the transition to the work that follows. SCORE has been a significant undertaking spanning multiple research teams and thousands of researchers and participants throughout the world – all collaborating to answer a single question: Can we create rapid, scalable, and valid methods for assessing confidence in research claims?…

How do you believe the objectives of SCORE contribute to Open Science more generally?

KU: The scale and rigor behind SCORE are unlike anything achieved so far in the open science research space. Each step of the project has been carefully planned to ensure that we can reach decisive conclusions regarding the current state of social and behavioral science research, which makes me confident in SCORE’s ability to transform our understanding of the state of the field….”