Data Science and Machine Learning for Planet Earth

Earth-Centric AI: What is it?

AI and machine learning form by far the largest part of our research remit, within what we call the Earth-Centric AI team. Chances are you have never heard the term Earth-Centric AI. That’s normal: we coined it in 2022. Earth-Centric AI best encapsulates our scientific approach: we apply Data-Centric AI approaches to solve problems in the Earth and planetary sciences.

Empire House. The new Digital Environment Research Institute, the home of our group, is located on the Whitechapel campus of Queen Mary University of London. We sit in Empire House, an Art Deco building completely renovated in 2019 for DERI.

A well-known but little-acknowledged fact is that most scientific projects do not generate Big Data: for many scientific applications we have a few hundred to a few thousand data points. This is far fewer than the millions of records typically used in business applications of machine learning. This situation leads to a Small Dataset challenge in AI: most modern neural networks tend to overfit small datasets. Yet many interesting scientific and engineering questions require working with small datasets, and generating more data is not possible, whether because of the high cost involved, the impossibility of repeating an experiment, or both.
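
To give a flavour of how we handle this in practice, here is a minimal, self-contained sketch (the data and model are placeholders, not one of our real projects): with only a few hundred samples, k-fold cross-validation gives a far more honest performance estimate than a single train/test split.

    # Sketch: honest performance estimation on a small dataset with k-fold
    # cross-validation (illustrative only; data and model are placeholders).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # A stand-in for a typical "small" scientific dataset: ~300 samples.
    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)

    # 5-fold cross-validation gives a more reliable estimate than a single
    # train/test split when every data point counts.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")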

Being Data-Centric means we care a lot about the quality, robustness, and relevance of our data. As a matter of fact, we still acquire some of this data ourselves, in the field, at sea, or in the lab. We place equal weight on the choice of model and model parameters, and on the way our data is selected, prepared, and balanced. When needed, we resort to synthetic data generation to overcome the small data issue. This ensures that we can use all our data and train models that generalize well, letting us tackle novel scientific problems.
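
As one illustration of balancing and synthetic data generation on tabular data, the sketch below oversamples a minority class with SMOTE. It assumes the third-party imbalanced-learn package; the dataset is a synthetic placeholder.

    # Sketch: rebalancing a skewed dataset with synthetic minority samples
    # (SMOTE), assuming the imbalanced-learn package; data is illustrative.
    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # An imbalanced toy dataset: 90% of one class, 10% of the other.
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
    print("before:", Counter(y))

    # SMOTE interpolates new minority-class samples between real neighbours,
    # one simple form of synthetic data generation.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_bal))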

Computer Vision for Earth and Planetary Data

Our lab has extensive experience in computer vision, notably using Convolutional Neural Networks (CNNs) and other deep neural architectures to extract information from unstructured visual data. For this, we usually write Python code and use deep-learning libraries such as TensorFlow or PyTorch.
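
For readers curious about what such a model looks like, here is a deliberately tiny CNN classifier in PyTorch. The layer sizes and the number of classes are illustrative, and far smaller than anything we would use in a real project.

    # Sketch: a minimal CNN image classifier in PyTorch; layer sizes and the
    # number of classes (4) are illustrative placeholders.
    import torch
    import torch.nn as nn

    class SmallCNN(nn.Module):
        def __init__(self, n_classes=4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, n_classes),
            )

        def forward(self, x):
            return self.head(self.features(x))

    model = SmallCNN()
    dummy = torch.randn(1, 3, 128, 128)   # one fake 128x128 RGB image
    print(model(dummy).shape)             # -> torch.Size([1, 4])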

One of our topics of research is automated core analysis. In a paper published in 2022, we showed that the Dunham texture of carbonates can be interpreted by CNNs combined with transfer learning.
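
The transfer-learning recipe itself can be sketched in a few lines of PyTorch: take an ImageNet-pretrained backbone, freeze it, and retrain only a new classification head on the small labelled dataset. The choice of ResNet-18 and the class count below are illustrative, not the published setup.

    # Sketch of transfer learning for texture classification: reuse an
    # ImageNet-pretrained ResNet and retrain only the final layer.
    import torch.nn as nn
    from torchvision import models

    n_textures = 5  # illustrative number of texture classes

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in backbone.parameters():
        param.requires_grad = False        # freeze the pretrained features

    backbone.fc = nn.Linear(backbone.fc.in_features, n_textures)  # new head

    # Only backbone.fc.parameters() would now be passed to the optimiser
    # and trained on the (small) labelled core-photograph dataset.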

Earth Science data often consists of images. This is notably the case for core photographs, or thin sections. We have been working since 2017 on automatic lithology and grain classification in both carbonate and clastic rocks. For this, we can leverage the power of the convolutional neural network architecture and learn rock textures from images. We have demonstrated that this approach results in a faster, more reliable description of rock textures. We also used YOLO to automatically recognize individual grain elements in carbonate rocks. Work is ongoing in this theme, with multiple PhD projects focused on computer vision for geological materials.
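
As an indication of how little code such detection takes today, here is a sketch using the ultralytics package as a modern YOLO implementation; the weights file and input image are placeholders, and a real application would first fine-tune the detector on labelled grain images.

    # Sketch: object detection on a rock image with a YOLO model, using the
    # ultralytics package as a modern stand-in (paths are illustrative).
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")            # small pretrained detector
    results = model("thin_section.jpg")   # hypothetical input image

    for box in results[0].boxes:
        cls_id = int(box.cls)             # predicted class index
        conf = float(box.conf)            # detection confidence
        print(results[0].names[cls_id], f"{conf:.2f}", box.xyxy.tolist())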

Another big area of growth for us is the realm of satellite image analysis. For instance, we use Earth satellite imagery to automatically detect coral bleaching. The hope is to develop tools capable of monitoring and predicting coral bleaching ahead of an event. We have also used satellite images of Earth to predict wind direction, and satellite imagery of Mars to predict the type of terrain present. The work on Mars is continuing, with an effort to use hyperspectral imaging to detect potential resources of interest for a human settlement.
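
A recurring first step in all of these satellite projects is tiling a large scene into patches that a classifier can digest. The sketch below shows this step, assuming the rasterio package for reading GeoTIFFs; the file name and patch size are invented.

    # Sketch: tiling a satellite scene into patches for per-patch
    # classification (e.g. bleached vs healthy reef). Assumes the rasterio
    # package; the file name and patch size are placeholders.
    import rasterio

    PATCH = 64  # patch edge length in pixels

    with rasterio.open("reef_scene.tif") as src:   # hypothetical scene
        scene = src.read()                         # (bands, height, width)

    bands, h, w = scene.shape
    patches = [
        scene[:, i:i + PATCH, j:j + PATCH]
        for i in range(0, h - PATCH + 1, PATCH)
        for j in range(0, w - PATCH + 1, PATCH)
    ]
    print(f"{len(patches)} patches of shape {patches[0].shape}")
    # Each patch would then be fed to a trained CNN classifier.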

Size matters. The diagram on the right shows the results of our investigation into the impact of dataset size. Spoiler alert: the best network architecture to use depends on the size of the dataset available to you. There is no one-size-fits-all.

Finally, computer vision is also used to interpret seismic volumes. Seismic data has been the backbone of subsurface analysis since the 1960s. Modern seismic is of high quality, and contains a wealth of information about the geological history of a region and the presence of important resources in the subsurface. But interpreting large volumes of seismic data requires a lot of time. Our research utilises state-of-the-art machine learning techniques to automatically interpret seismic volumes. Examples of research projects in this domain include automatic detection of faults, and a framework for automated sequence stratigraphic interpretation.
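
Fault detection, for example, can be framed as voxel-wise binary segmentation: a 3D CNN assigns each voxel of the seismic cube a fault probability. The sketch below shows the idea with a placeholder architecture and random data.

    # Sketch: fault detection framed as voxel-wise binary segmentation;
    # the architecture and data are illustrative placeholders.
    import torch
    import torch.nn as nn

    fault_net = nn.Sequential(
        nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
        nn.Conv3d(8, 8, 3, padding=1), nn.ReLU(),
        nn.Conv3d(8, 1, 1),            # one logit per voxel
    )

    cube = torch.randn(1, 1, 64, 64, 64)         # fake seismic sub-volume
    fault_prob = torch.sigmoid(fault_net(cube))  # fault probability per voxel
    fault_mask = fault_prob > 0.5                # binary fault mask
    print(fault_mask.float().mean())             # fraction flagged as fault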

We also use computer vision to handle seismic data. In this example project, we attempt to improve the quality of older seismic data by predicting and subtracting noise.
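
The underlying idea can be sketched as residual learning: train a network to predict the noise component of a section, then subtract it from the input. The tiny model and random "seismic" below are placeholders.

    # Sketch of the noise-subtraction idea: a network predicts the noise in
    # a seismic section, which is then subtracted ("residual learning").
    import torch
    import torch.nn as nn

    # A tiny convolutional noise predictor (illustrative architecture).
    denoiser = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 3, padding=1),
    )

    section = torch.randn(1, 1, 256, 256)    # fake noisy seismic section
    predicted_noise = denoiser(section)      # network estimates the noise
    cleaned = section - predicted_noise      # subtract to enhance the signal
    print(cleaned.shape)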

Generative Models in Earth-Centric AI

Generative models are all over the internet these days: who has not heard of image diffusion models such as OpenAI’s DALL-E, or of ChatGPT (an AI chatbot)? But did you know that these generative approaches also have their place in Earth Science?

For instance, we have been working for a few years on generating realistic and accurate images of carbonate rocks from subsurface measurements of resistivity (an FMS image). This allows geologists to interpret the subsurface in a more natural way, looking at rocks they are familiar with rather than at hard-to-interpret resistivity images. For this, we use a technique known as a Generative Adversarial Network (GAN) to transform the resistivity image into a pseudo-core image.
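
Conceptually, this is image-to-image translation in the style of pix2pix: a generator maps the FMS image to a pseudo-core image, and a discriminator learns to tell generated cores from real ones. The sketch below shows one adversarial training step with placeholder architectures and random data.

    # Sketch of the image-to-image GAN idea: generator maps FMS -> pseudo-
    # core, discriminator judges real vs generated cores. All placeholders.
    import torch
    import torch.nn as nn

    generator = nn.Sequential(               # FMS image -> pseudo-core image
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
    )
    discriminator = nn.Sequential(           # core image -> real/fake logit
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Flatten(), nn.LazyLinear(1),
    )

    fms = torch.randn(4, 1, 64, 64)          # batch of fake FMS patches
    real_core = torch.randn(4, 3, 64, 64)    # batch of fake core photos

    fake_core = generator(fms)
    bce = nn.BCEWithLogitsLoss()

    # Discriminator: real cores -> 1, generated cores -> 0.
    d_loss = bce(discriminator(real_core), torch.ones(4, 1)) + \
             bce(discriminator(fake_core.detach()), torch.zeros(4, 1))

    # Generator: fool the discriminator into scoring its output as real.
    g_loss = bce(discriminator(fake_core), torch.ones(4, 1))
    print(d_loss.item(), g_loss.item())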

We use generative adversarial networks (GANs) to generate pseudo-core images from downhole FMS images.

Generative models also play a role in increasing the resolution of our data. We have attempted to improve the resolution of vintage seismic data by applying a GAN-based approach. On Mars, we intend to improve both the spatial and the spectral resolution of satellite imagery through the use of generative models.
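
The core of such a super-resolution generator can be sketched with a sub-pixel convolution (PixelShuffle) upsampler; in a full GAN it would be paired with a discriminator as in the previous example. The scale factor and layer sizes are illustrative.

    # Sketch: the upsampling core of a super-resolution network, using
    # sub-pixel convolution (PixelShuffle). Sizes are placeholders.
    import torch
    import torch.nn as nn

    scale = 4
    sr_generator = nn.Sequential(
        nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, scale * scale, 3, padding=1),
        nn.PixelShuffle(scale),    # (B, s*s, H, W) -> (B, 1, s*H, s*W)
    )

    low_res = torch.randn(1, 1, 64, 64)         # fake low-resolution section
    high_res = sr_generator(low_res)
    print(low_res.shape, "->", high_res.shape)  # -> (1, 1, 256, 256)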

And of course, generative models are a potential solution to the small dataset problem we highlighted above. By capturing the statistical variability of our data through sound data-centric approaches, we can then hope to generate more examples of the same kind of data. This in turn allows us to use more advanced inference techniques while limiting the risk of overfitting. Hence, generative models have a key role to play in Earth-Centric AI and are a large focus of our research efforts.
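
The idea works even with very simple generative models. In the sketch below, a Gaussian mixture is fitted to a small (placeholder) tabular dataset and then sampled to produce extra synthetic examples; a GAN or diffusion model could play the same role for images.

    # Sketch: a simple generative model for data augmentation. A Gaussian
    # mixture captures the data's variability, then is sampled from.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X_small = rng.normal(size=(200, 5))    # stand-in for real measurements

    gmm = GaussianMixture(n_components=3, random_state=0).fit(X_small)
    X_synth, _ = gmm.sample(1000)          # 1000 new synthetic samples
    print(X_synth.shape)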

Natural Language Processing and Explainable AI

The approaches mentioned above are where most of our research efforts are focused at the moment. However, we also have a keen interest in Natural Language Processing. Our interest stems from the fact that each scientific domain possesses its own corpus, and that to extract information from the literature, one needs an encoding that mirrors as closely as possible the relationships between the technical words of this corpus. Thus, our research in natural language processing tends to focus on domain-specific language, and on extracting meaning and closeness from scientific documents. We use models trained from scratch in Python, or leverage transfer learning from pre-trained models such as GPT-3 and BERT, often through the Hugging Face ecosystem.
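
A small example of what "closeness" means in practice: embed two sentences with a pretrained BERT encoder and compare them with cosine similarity. The sketch below uses the generic bert-base-uncased model from the Hugging Face transformers library; in our work the encoder would be adapted to a domain-specific corpus.

    # Sketch: semantic similarity between geoscience sentences using a
    # pretrained BERT encoder (generic model; not our domain-tuned one).
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = AutoModel.from_pretrained("bert-base-uncased")

    sentences = [
        "The grainstone shows well-sorted ooids.",
        "This carbonate texture is grain-supported.",
    ]
    batch = tok(sentences, padding=True, return_tensors="pt")

    with torch.no_grad():
        out = enc(**batch).last_hidden_state   # (2, tokens, 768)

    # Mean-pool token vectors into one embedding per sentence, then compare.
    emb = out.mean(dim=1)
    sim = torch.cosine_similarity(emb[0], emb[1], dim=0)
    print(f"cosine similarity: {sim:.2f}")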

In a recently published book chapter, we show that conventional machine learning (or “statistical machine learning”) can be used to predict production in U.S. gas shales. This is important both to move away from more polluting fossil fuels (such as petroleum) and to maintain a secure supply of energy.

We also use statistical machine learning extensively for tabular data. For instance, we have conducted a study using random forests and other algorithms to predict the production of gas shales in the USA. Gradient-boosted trees are also of great interest, because of their proven track record in predicting from tabular data. This statistical machine learning research sits at the intersection of our research in carbonate clumped isotopes (where we generate a lot of tabular data) and our interest in the topic of Explainable AI (XAI). We believe that the future of AI will have to be explainable, as more and more decisions are taken automatically by algorithms. This raises a number of ethical questions, including in environmental science and Earth science. Social and environmental justice needs explainable and transparent AI.
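
To make the XAI angle concrete, here is a sketch of a random-forest regressor on tabular data combined with permutation importance, one simple way to ask which inputs drive a prediction. The feature names and data are invented placeholders, not our gas-shale dataset.

    # Sketch: random-forest regression on tabular data with permutation
    # importance as a basic explainability check (all data is synthetic).
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=400, n_features=6, random_state=0)
    names = ["porosity", "thickness", "TOC", "depth", "pressure", "spacing"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

    # Permutation importance: how much does shuffling each feature hurt
    # the model's score on held-out data?
    imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
    for name, score in sorted(zip(names, imp.importances_mean), key=lambda t: -t[1]):
        print(f"{name:10s} {score:.3f}")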