“An alternative has been to use proprietary data collected for market research (e.g., Comscore, Nielsen), with methods that are sometimes opaque and a pricing structure that is prohibitive to most researchers.
We believe that this approach is no longer acceptable as the mainstay of CSS, as pragmatic as it might seem in light of the apparent abundance of such data and limited resources available to a research community in its infancy. We have two broad concerns about data availability and access.
First, many companies have been steadily cutting back the data that can be pulled from their platforms (5). This is sometimes for good reasons, such as regulatory mandates (e.g., the European Union General Data Protection Regulation) or corporate scandal (Cambridge Analytica and Facebook); however, a side effect is often to shut down avenues of potentially valuable research. The susceptibility of data availability to arbitrary and unpredictable changes by private actors, whose cooperation with scientists is strictly voluntary, renders this system intrinsically unreliable and potentially biased in the science it produces.
Second, data generated by consumer products and platforms are imperfectly suited for research purposes (6). Users of online platforms and services may be unrepresentative of the general population, and their behavior may be biased in unknown ways. Because the platforms were never designed to answer research questions, the data of greatest relevance may not have been collected (e.g., researchers interested in information diffusion count retweets because that is what is recorded), or may be collected in a way that is confounded by other elements of the system (e.g., inferences about user preferences are confounded by the influence of the company’s ranking and recommendation algorithms). The design, features, data recording, and data access strategy of platforms may change at any time because platform owners are not incentivized to maintain instrumentation consistency for the benefit of research.
For these reasons, research derived from such “found” data is inevitably subject to concerns about its internal and external validity, and platform-based data, in particular, may suffer from rapid depreciation as those platforms change (7). Moreover, the raw data are often unavailable to the research community owing to privacy and intellectual property concerns, or may become unavailable in the future, thereby impeding the reproducibility and replication of results….
Despite the limitations noted above, data collected by private companies are too important, too expensive to collect by any other means, and too pervasive to remain inaccessible to the public and unavailable for publicly funded research (8). Rather than eschewing collaboration with industry, the research community should develop enforceable guidelines around research ethics, transparency, researcher autonomy, and replicability. We anticipate that many approaches will emerge in coming years that will be incentive compatible for involved stakeholders….
Privacy-preserving, shared data infrastructures, designed to support scientific research on societally important challenges, could collect scientifically motivated digital traces from diverse populations in their natural environments, as well as enroll massive panels of individuals to participate in designed experiments in large-scale virtual labs. These infrastructures could be driven by citizen contributions of their data and/or their time to support the public good, or in exchange for explicit compensation. These infrastructures should use state-of-the-art security, with an escalation checklist of security measures depending on the sensitivity of the data. These efforts need to occur at both the university and cross-university levels. Finally, these infrastructures should capture and document the metadata that describe the data collection process and incorporate sound ethical principles for data collection and use….”