Symposium: Epistemic Strategies for the Integration of Big Data (September 2017)

7 September 2017

Forum auditorium A, 14:30–16:30
Streatham Campus, University of Exeter

Organiser

Sabina Leonelli

Background

Big data have become central to the practices and discourse of many strands of contemporary science, for a variety of reasons ranging from techno-scientific factors (such as the emergence of high-throughput tools for data generation) to the political and economic expectations raised by this terminology (such as the idea that automated analysis of big data may facilitate processes of discovery as well as the translation of research insights into social goods). The sheer quantity of data that are becoming available for dissemination and scrutiny is certainly making a difference to research methods and results, in ways that are hard to capture by recourse to traditional philosophical literature on induction and theory-ladenness (Edwards 2009, Woodward 2010, Rheinberger 2011, Callebaut 2012, Ratti 2015, Pietsch 2015, Leonelli 2014, 2015). Even more provocative to philosophers is the increasing set of strategies, technologies and skills devoted to the formatting, assemblage, visualisation and interpretation of data and related evidential reasoning produced by widely different research groups, whose work focuses on different phenomena, commits to different assumptions and utilizes different methods (O’Malley and Soyer 2012, Leonelli 2016). Scientific data can vary greatly in their format and availability; in the ways in which they have been produced and the materials from which they have been extracted; in the geographical sites, temporal scales and epistemic goals of the scientists generating them; and, most trivially perhaps, in the objects and processes that they can be taken to document. Integrative research efforts need to bridge across these multiple dimensions, by bringing together data obtained in a variety of different settings so that they can be analyzed together and brought to bear on common questions. Data integration thus requires extensive scientific labor, including the development of apposite infrastructures, analytic tools, standards, methods and models.

How can data produced from diverse sources and techniques be integrated and visualized? What role does technology (in the form of experimental instruments, modeling software and digital databases) play in such efforts? How do the challenges and opportunities offered by data integration and amalgamation strategies affect the development and content of scientific knowledge claims? And how can the integrity and validity of evidential claims produced in this manner be evaluated? This symposium approaches these questions by bringing together philosophical studies grounded in the empirical examination of large-scale data integration practices within and across biomedicine, environmental science and biochemistry. We discuss the epistemological challenges involved in bringing together diverse datasets pertaining to different phenomena, target systems and research environments, and in some cases collected by different types of experts working on widely disparate materials across several locations around the globe. We place particular emphasis on documenting concerns relating to convoluted and non-linear methods of inference, sampling, modelling and data processing which are often employed in complex data integration exercises, with implications for the extent to which data can be triangulated, reproduced, reused, validated and replicated. We also consider the epistemic advantages involved in integration efforts, particularly the potential to cluster data in the absence of formal, unifying theories and related opportunities to bridge across diverging research perspectives and conceptions of science and its uses.

Data cannot be stored and circulated without organising principles; this basic requirement is ever more pressing when posting big data online, given the complexity of technological arrangements that are needed to exploit the potential for large storage capability, immediate dissemination and wide reach provided by the web. Data stored in digital databases need to be standardised, ordered and visualised, so that scientists can retrieve them from databases in ways that help them in their own research. These processes constitute an important form of data integration, which involves significant amounts of labour and expertise, including the ability to conceptually order data, format them to fit specific programmes, and develop adequate software and models. Indeed, making big data widely available through databases often requires a sophisticated understanding of what data might be used for, as well as extensive work on the classification and modelling of datasets so that they become compatible with each other, retrievable and re-usable by the wider scientific community. Paying attention to the differences and interplay between modes and strategies of integration illuminates the mechanics and challenges of making data not only accessible but also usable to the scientific community; the large amount of conceptual and material scaffolding needed to transform big data into new scientific knowledge; and the different forms of knowledge that may result from processes of data integration, depending on which communities, infrastructures and institutions are involved in scientific research.

The papers articulate concerns of interest to different parts of the philosophy of science, which have hitherto not been put in dialogue with each other, despite their common interest in questions of visualization, inductive and causal reasoning, empirical methods and experimental practices. In particular, the symposium aims to facilitate a comparative discussion of the analysis of data practices and related reasoning between the philosophy of chemistry (Woody and Tibbetts), the philosophy of biology and environmental sciences (Leonelli and Tempini) and the philosophy of medical reasoning and evidential practices (including both evidence used in clinical studies, as in the paper by Clarke, Illari and Russo, and the formal study of inference and amalgamation techniques in the evidence-based approach, as exemplified by Osimani). Furthermore, all the papers presented in this symposium have two characteristics in common: they are grounded in the empirical study of contemporary scientific practice, and they emerged out of collaborations with practicing scientists. This session thus aims both to explore the epistemic concerns and implications of current scientific efforts to integrate and analyse big data, and to reflect on the various roles that philosophers can play alongside such efforts, ranging from friendly support to outright critique.

Programme

Chair: Rachel Ankeny (University of Adelaide)

14:30 Sabina Leonelli & Niccolò Tempini: Where Health and Environment Meet: Geolocation as Invariance Strategy for Integrating Diverse Data Sources.

14:55 Andrea Woody & Katharine Moore Tibbetts: Smart Search through Complex Landscapes.

15:20 Brendan Clarke, Phyllis Illari & Federica Russo: Datified Evidence in Theory and in Clinical Practice.

15:45 Barbara Osimani: Exact Replication or Varied Evidence? Reliability, Robustness and the Reproducibility Problem.

16:10 Commentary by Rachel Ankeny

16:15 General discussion
