A new publication from Britain’s Royal Society – the oldest scientific society in the world, whose fellows have included Isaac Newton and Stephen Hawking – discusses how artificial intelligence might affect how science is done.
This is quite a hot topic at the moment. A recent paper from Microsoft (Microsoft Research AI4Science, 2023), using OpenAI’s GPT-4 LLM, looked at AI’s potential to analyse scientific literature, help researchers visualise large datasets, uncover trends in complex data, create code from text, and even develop novel hypotheses. The Royal Society report goes further, looking at the nature and methods of scientific inquiry, and exploring how notions of research integrity, skills and ethics are inevitably changing with AI – and what the implications are for the future of science and scientists.
Although geology and geoscience aren’t mentioned specifically in the Royal Society (2024) report, at least one of its detailed case studies – on using AI in the investigation and treatment of rare disease – describes challenges very similar to those of geoscience and its so-called ‘long tail data’ (e.g. Sinha et al. 2013, Stephenson et al. 2020).
According to Sinha et al. (2013), geoscience broadly has two kinds of data: data generated by sensor technologies, and data generated in more variable ways. Sensor data (for example in seismology and volcanology) usually reside in well-designed and curated data centres. By contrast, the many small datasets of the deep-time long tail (stratigraphy, palaeontology, palaeogeography, tectonics) are more unstructured and more heterogeneous. It turns out that rare disease data is similar to data in the geoscience long tail, in that its availability is limited, which makes it hard to apply to disease research. So it’s worth comparing the two problems.
A rare disease is defined as a condition that affects fewer than 1 in 2,000 people, and of the more than 7,000 rare diseases worldwide, only 5% have a treatment. This is partly because patient data is fragmented, which makes it hard to see big trends and correlations. So rare diseases are ripe for the use of AI.
According to the Royal Society report, rare disease data is siloed, scattered, behind paywalls or commercially owned, and the resulting scarcity of accessible data can make it difficult to train accurate and robust AI models. There is also a lack of channels to coordinate across labs and institutions to integrate and cross-reference datasets. Sometimes data is heterogeneous, noisy, incomplete, or incorrectly labelled. It may come from different sources, such as clinical records, genetic testing, and patient surveys, with different formats, quality standards, and levels of detail. As the Royal Society report says: ‘…integrating and harmonising such heterogeneous data can be a significant challenge…’.
Much of geoscience faces precisely the same challenge. In biostratigraphic studies of big sedimentary basins, for example, biotic trends could be identified and linked with environmental change if only biostratigraphic data were standardised, of consistent quality, and recorded at the same level of detail. Even within oil companies that routinely use large amounts of biostratigraphic data, data are often not held in a consistent format. The inconsistencies may be even larger between companies, and between individual academics and academic research groups. This ‘data fragmentation’ doesn’t just apply to biostratigraphy, but also to stratigraphy, palaeogeography, tectonics and many other areas of geoscience, particularly in the deep-time realm.
Work is being done to try to bring together and link geoscience data, even in the long tail. Tackling the fragmentation of long tail geoscience data has been, and still is, one of the aims of the Deep-time Digital Earth (DDE) project, which sets out to: ‘…transform Earth science by connecting and harmonising long tail deep-time data islands to support broad-based scientific studies relevant to the entire Earth system’ (Stephenson et al. 2020). Over the last five or so years, DDE has made massive strides in establishing the standards and building the tools to allow data to be more accessible and linkable, and in creating a comprehensive platform for this kind of work to be done. If you’re interested in DDE, have a look at the DDE website (https://www.ddeworld.org/) and the DDE platform website (https://deep-time.org/). Get involved!
Mike Stephenson
References
Microsoft Research AI4Science 2023. The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4. arXiv (non-peer-reviewed pre-print). https://arxiv.org/abs/2311.07361
Royal Society 2024. Science in the age of AI: How artificial intelligence is changing the nature and method of scientific research. Issued: May 2024. DES8836_1. ISBN: 978-1-78252-712-1
Sinha, A K, Thessen, A E, Barnes, C G. 2013. Geoinformatics: Toward an integrative view of Earth as a system. In: Bickford, M E (Ed.), The Web of Geological Sciences: Advances, Impacts, and Interactions. Geological Society of America Special Paper 500: 591–604. doi:10.1130/2013.2500(19)
Stephenson, M H, Cheng, Q, Wang, D, Fan, J, Oberhänsli, R. 2020. Progress towards the establishment of the IUGS Deep-time Digital Earth (DDE) programme. Episodes 43(4): 1057–1062