Dealing with the ‘long tail’: bringing data out of the box and into cyberspace

Mike Stephenson, Stephenson Geoscience Consulting Ltd

Alessandro Carniti, State Key Laboratory of Critical Earth Material Recycling and Mineral Deposits, School of Earth Sciences and Engineering, Nanjing University, Nanjing 210023, China

Jiaxi Yang, Zhejiang Lab, Hangzhou 311100, China

A vast amount of data generated by geoscientists and institutions is still not accessible to other geoscientists or to artificial intelligence tools. This is a problem for geoscience and other observational sciences, which often rely on mixed qualitative and quantitative data that is part of the so-called ‘long tail’ (e.g. Sinha et al. 2013): the unstructured and heterogeneous datasets that sit in geological surveys, university research groups, and on individual scientists’ computers. DDE and GeoGPT were created to free up data and make it available to scientists and to artificial intelligence tools such as large language models (LLMs). This article is about one of these datasets and about the challenge of making it suitable for LLM development. It’s a lesson in dealing with data from the long tail.

The database in question is the Jansonius and Hills catalogue (JHC; 1976 and subsequent updates) of fossil spore and pollen genera. The JHC was created as a series of cards, originally one card per genus, and contains around 4000 spore (fungi, plants and algae) and pollen genera from the Phanerozoic. This monumental database by Calgary palynologists J Jansonius and L V Hills provides descriptions and diagnoses for these genera, the source publications and authors, descriptions of genus type species, and often subsequent genus descriptions, including formal emendations. Taken as a whole, the catalogue is an enormously useful resource that has no equivalent in palynology.

This is why a team from GeoGPT and DDE decided to ask permission to build a large language model that could assimilate information across the roughly 4000 genera of the JHC and use it to deliver a taxonomic determination method (e.g. a taxonomic key). The JHC is a large document in the form of a PDF. The first 78 pages consist of an introduction, genus lists, corrigenda and addenda, and pp. 79-5676 are the genera files. As part of the preparation for LLM development, the team successfully extracted 4322 pages in total that contained diagnoses or descriptions, and after data cleaning (removing material that couldn’t be used) around 3800 genera were left.

An LLM-augmented taxonomic key (LATK) has already been developed for a much smaller dataset (around 70 spore species from the Carboniferous-Permian of the Arabian Plate) where ‘chain of thought’ logical steps are embedded in the system (Stephenson et al. 2024; Stephenson et al. in preparation) through careful structuring of the learning material (for example by providing hierarchical stages in the process). This functions relatively well. However, the decision was made to develop a taxonomic key for the JHC, mainly because its comprehensive and authoritative coverage would be of interest to a wider group of palynologists. Because of the difficulty of pre-structuring such a large dataset (as was done with the Carboniferous-Permian LATK), it was agreed to proceed with a Retrieval-Augmented Generation (RAG) technique that searches for the best matches to users’ descriptions.
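The retrieval step of such a RAG approach can be illustrated with a minimal sketch. It is not the team’s implementation: it swaps the embedding model for a simple word-count cosine similarity, and the genus names and descriptions in GENUS_TEXT, as well as the function names, are hypothetical placeholders rather than content from the JHC.

```python
import math
import re
from collections import Counter

# Hypothetical, heavily shortened genus descriptions for illustration only;
# they are not quotations from the JHC.
GENUS_TEXT = {
    "Genus A": "trilete spore with verrucate ornament and a thick equatorial zona",
    "Genus B": "monolete spore, bean-shaped, smooth exine",
    "Genus C": "bisaccate pollen with two sacci and a distal sulcus",
}

def tokens(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_genera(query, corpus=GENUS_TEXT, top_k=2):
    """Return the top_k (genus, score) pairs most similar to the query."""
    q = Counter(tokens(query))
    scored = [(name, cosine(q, Counter(tokens(text)))) for name, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# A user description is scored against every genus and the best matches returned.
candidates = rank_genera("smooth monolete spore")
```

In a real system the word counts would be replaced by vectors from an embedding model, and the top candidates would be passed to the LLM together with the user’s description.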

The first stage was to gain permission to use the database for LLM training. The JHC is openly accessible as an online, searchable PDF in the digital collection of the Library of the University of Calgary. Permission was sought from the library, which referred the team to the descendants of J Jansonius (sadly deceased). Following correspondence, the Jansonius family readily agreed to allow the team to develop the JHC as learning material for AI applications.

This was perhaps the easiest part. Careful study of the database revealed many problems affecting the usability of the dataset.

The first is the quality of the PDF. The text was originally typed manually onto cards in the 1970s and 80s and some of the typed letters are not clear; for example, the software that reads the scanned text often confuses the letter O with the digit 0, with the result that the names of genera and their characteristics are sometimes obscured. Similar mistakes occur with non-standard symbols and letters, such as letters with umlauts.
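One way to approach this kind of error is a rule-based clean-up pass, sketched below. This is an illustration under stated assumptions, not the cleaning procedure the team used: the confusion table (0 for o, 1 for l) and the ‘mostly letters’ rule are assumptions, and the function names are hypothetical.

```python
import re

# Digit-to-letter confusions assumed for illustration; the real set of
# confusions in the scanned cards would need to be established by inspection.
CONFUSIONS = str.maketrans({"0": "o", "1": "l"})

def repair_token(token):
    """Swap look-alike digits for letters only when the token is mostly letters."""
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    if digits == 0 or letters <= digits:
        return token  # pure numbers (measurements, years) are left alone
    return token.translate(CONFUSIONS)

def repair_text(text):
    """Apply repair_token to every run of ASCII letters and digits in the text."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: repair_token(m.group()), text)

cleaned = repair_text("Bellisp0res, 1976, 40 um")
```

The sketch always substitutes lower-case letters, so a misread capital at the start of a genus name would still need a check against the genus lists at the front of the catalogue.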

The quality of descriptions and diagnoses of genera is very variable. There are some excellent comprehensive descriptions and diagnoses of genera erected in the later years of the rapid expansion of palynology in the 1960s and 1970s, particularly when the science was being taken up by the oil industry. However, many early genus descriptions and diagnoses are very short and sketchy, partly because at the time of their composition palynology was a young science, few other genera had been recognised, and low levels of detail in descriptions were the norm. Jansonius and Hills also correctly recognised that some genera had been invalidly published (see for example Gravendyck et al. 2021), so some names appearing in the JHC are superfluous. Jansonius and Hills often dealt with descriptions and diagnoses in foreign languages by translating them (many are translated from Russian).

Inconsistent use of terminology is another problem. This inconsistency occurs in several forms, for example synonyms for the same ornament element type (verrucae/warts), or more commonly multiple slightly different versions of descriptive terms (spines or spinae; bacula or baculae). Palynological terminology has also developed near-parallel systems (e.g. exoexine/intexine, nexine/sexine; cavate/camerate), often because of different usage in the different international ‘schools’ of palynology, and different terminology has developed for different time periods (e.g. the Palaeozoic and Cenozoic).
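A simple way to picture how variant terms could be reconciled is a lookup table that maps each variant to one preferred form, as in the sketch below. The variant pairs come from the examples above, but the choice of which term is ‘preferred’ is arbitrary here, and the names PREFERRED and normalise_terms are hypothetical. The near-parallel systems are deliberately left out, because they are near parallel rather than strictly equivalent and would need expert judgement.

```python
import re

# Variant -> preferred term. The pairs come from the examples in the text;
# which member of each pair is "preferred" is an arbitrary choice made here.
PREFERRED = {
    "warts": "verrucae",
    "spines": "spinae",
    "baculae": "bacula",
}

def normalise_terms(description, mapping=PREFERRED):
    """Replace whole-word variants with the preferred term, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mapping[m.group(1).lower()], description)

normalised = normalise_terms("Exine with Warts and short spines")
```

Applying the same mapping to both the catalogue text and the user’s description would let the two be compared on equal terms.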

The most important part of the training that deals with many of these problems is the creation of ‘question-and-answer (QA) pairs’, because they help the LLM to learn how to generate relevant, structured, and context-aware responses. For the JHC, two kinds of QA pairs were developed. The first are structured QA pairs directly anchored in the JHC text; in other words, the answers to the questions can be found directly within the text of the JHC. A set of QA pairs for the spore genus Bellispores is shown in Fig. 1. Most questions are simple (‘Question: What kind of germination mark does Bellispores have?’, ‘Answer: Bellispores has a trilete mark’) and the answers can be found directly in the corresponding card (Fig. 1). The QA pairs also had to include similarly anchored compound questions (requiring integration of multiple pieces of information) and counterintuitive questions (testing the model’s error-correction ability, e.g. ‘Is Bellispores monolete?’). The simple, compound and counterintuitive QA pairs are used for supervised fine-tuning of the embedding model, expanding the word lists for the genera. When users type in descriptions similar to those of the QA pairs, the RAG system recognizes these descriptions and gives the corresponding genera higher scores when recommending potential candidates.

Fig. 1. A series of QA pairs for the spore genus Bellispores. Q&A pairs on the left; JHC card on the right.
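The anchored QA pairs described above could be represented as small structured records, as in the following sketch. The QAPair fields and the mark_pairs helper are hypothetical, and the wording of the corrective answer to the counterintuitive question is an assumption; only the Bellispores questions and the simple answer are taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QAPair:
    """One question-and-answer pair anchored in a single genus card."""
    genus: str
    kind: str  # "simple", "compound" or "counterintuitive"
    question: str
    answer: str

def mark_pairs(genus, mark, wrong_mark):
    """Build a simple and a counterintuitive pair about the germination mark."""
    simple = QAPair(
        genus,
        "simple",
        f"What kind of germination mark does {genus} have?",
        f"{genus} has a {mark} mark",
    )
    # The wording of the corrective answer is an assumption.
    counter = QAPair(
        genus,
        "counterintuitive",
        f"Is {genus} {wrong_mark}?",
        f"No, {genus} has a {mark} mark",
    )
    return [simple, counter]

pairs = mark_pairs("Bellispores", "trilete", "monolete")
```

Keeping the pair type as an explicit field makes it straightforward to check that simple, compound and counterintuitive questions are all represented for each genus.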

The second type of QA pairs are known as ‘incorporate’; these cover areas of palynology outside the immediate learning materials, i.e. outside the JHC. They include questions like ‘What are the characteristics of zonate spores?’ and ‘Describe all the features of monolete spores’. The incorporate QA pairs are similarly used for supervised fine-tuning, in this case to give the LLM broader knowledge of palynology.

Hundreds of QA pairs were generated for the JHC, partly by specialists (authors MHS and AC). Synthetic QA pairs were also generated by the LLM, based on the QA pairs written by the specialists. Many of these preparations are complete and the LATK is now in advanced development and will be ready for testing soon.

In many ways this is a good example of data and knowledge that, although very important to a developing science, is currently accessible only in an unsophisticated form. Many palynologists will know of the JHC and will have used it. But it isn’t easy to use, and there are many trends in the data that are hidden from view because in its unprocessed form it cannot easily be assimilated into any artificial intelligence platform.

The conversion of the JHC into a LATK is typical of the challenges of making long tail data suitable for AI development. Geoscience has enormous amounts of similarly inaccessible data and knowledge: in the boxes of reports in geological surveys across the world, in the paywalled papers of geoscience journals, and in the unstructured data on individual geoscientists’ computers.

DDE and GeoGPT are trying to improve the processes of long tail access, and this pilot project to build a LATK from the Jansonius and Hills catalogue could be a blueprint for similar projects to bring data ‘out of the box’ and into cyberspace.

References

Gravendyck J, Fensome RA, Head MJ, Herendeen PS, Riding JB, Bachelier JB, Turland NJ. 2021. Taxonomy and nomenclature in palaeopalynology: basic principles, current challenges and future perspectives. Palynology. 45(4): 717–743.

Jansonius J, Hills LV. 1976. Genera file of fossil spores and pollen. Canada: Special Publication, Department of Geology, University of Calgary.

Sinha AK, Thessen AE, Barnes CG. 2013. Geoinformatics: toward an integrative view of Earth as a system. In: Bickford ME, editor. The Web of Geological Sciences: Advances, Impacts, and Interactions. Geological Society of America Special Paper 500; p. 591–604. doi:10.1130/2013.2500(19)

Stephenson MH, Shen C, Xiao Z, Mao T. 2024. Large Language Models in palynological taxonomy. Permophiles. 77: 27–30. https://permian.stratigraphy.org/files/permophiles/Permophiles%2077.pdf