“I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)
I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.
In the TDM context, this raises interesting questions:
Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
Does the attribution need to be included in the data set at every stage?
Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?
While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated….”