“Over the last several years, Duke, like many other institutions, has made a significant investment in computational research, recognizing that such research techniques can have wide-ranging benefits. From translational research in the biomedical sciences to the digital humanities, this work can be, and has been, transformative. Much of this work relies on researchers being able to engage in text and data-mining (TDM) to produce the data-sets necessary for large-scale computational analysis. For the sciences, this can range from compiling research data across a whole series of research projects to collecting large numbers of research articles for computer-aided systematic reviews. For the humanities, it may mean assembling a corpus of digitized books, DVDs, music, or images for analysis of how language, literary themes, or depictions have changed over time….
The techniques and tools for text and data-mining have advanced rapidly, but one constant for TDM researchers has been a fear of legal risk. For data-sets composed of copyrighted works, the risk of liability can seem staggering. With copyright’s statutory damages set as high as $150,000 per work infringed, a corpus of several hundred works can cause real concern.
However, the risks of just avoiding copyrighted works are also high….”