Data Science and Machine Learning for Planet Earth

Teaching the Next Generation of Geoscientists

Synopsis

This intensive, two-week block-taught course introduces graduate students and professionals to data science and machine learning principles, with a focus on applications relevant to Earth and environmental sciences. The course balances theory with practical exercises and places a strong emphasis on live coding sessions, so that students gain hands-on experience. The module covers the entire machine learning workflow, from data wrangling and preparation to model evaluation and optimization. Students learn to use Python with Pandas, NumPy, Scikit-learn, XGBoost, and other libraries.

The course is structured around key topics that will be covered through a combination of lectures and live coding sessions:

Overview of the Machine Learning Workflow: Understanding the steps in an end-to-end machine learning project.
Data Wrangling: Cleaning and preparing raw datasets for machine learning applications.
Data Preparation Pipelines: Using Scikit-learn to handle categorical data, perform feature engineering, and normalize/scale data.
Regression Algorithms: Implementing linear regression, decision trees, support vector machines, and more.
Classification Algorithms: Exploring algorithms such as support vector classifiers (SVC), random forests, and k-nearest neighbors.
Clustering and Dimensionality Reduction: Using algorithms like k-means, DBSCAN, PCA, and t-SNE.
Model Evaluation: Understanding performance metrics and monitoring algorithm performance.
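
To give a flavor of the data preparation pipelines topic listed above, here is a minimal sketch of a Scikit-learn preprocessing pipeline that imputes missing values, scales numeric columns, and one-hot encodes a categorical column. The toy DataFrame, its column names (depth_m, temperature_c, rock_type), and its values are hypothetical and are not taken from the course material.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset: names and values are invented for illustration.
samples = pd.DataFrame({
    "depth_m": [10.0, 250.0, np.nan, 1200.0],
    "temperature_c": [18.2, 9.5, 4.1, np.nan],
    "rock_type": ["basalt", "granite", "basalt", "shale"],
})

numeric_cols = ["depth_m", "temperature_c"]
categorical_cols = ["rock_type"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply different preparation steps to different column groups.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

prepared = preprocess.fit_transform(samples)
print(prepared.shape)  # two scaled numeric columns plus one column per rock type
```

Wrapping the steps in a single object means the same fitted transformations can later be applied to unseen data with preprocess.transform, which helps avoid leaking information from test data into training.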

By the end of the course, students will have a solid understanding of key machine learning algorithms and their applications to real-world problems, as well as the ability to code and deploy end-to-end machine learning pipelines.

Learning Outcomes

Gain an understanding of the end-to-end workflow of machine learning projects.
Learn to use open-source libraries (e.g., Scikit-learn) in a notebook environment for data wrangling and preparation.
Build data preparation pipelines, including handling categorical data, feature engineering, and scaling.
Explore and implement classical machine learning algorithms for regression, classification, and clustering.
Learn to evaluate model performance using various metrics such as RMSE, precision, recall, confusion matrix, and AUC.
Develop proficiency in professional coding practices for machine learning.
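
As a small illustration of the evaluation metrics named in the outcomes above, the sketch below computes RMSE for a regression example and the confusion matrix, precision, recall, and AUC for a binary classification example. All labels, scores, and the 0.5 decision threshold are made-up values chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Regression: root mean squared error computed directly from its definition.
y_reg_true = np.array([2.0, 4.0, 6.0])
y_reg_pred = np.array([2.5, 3.5, 7.0])
rmse = np.sqrt(np.mean((y_reg_true - y_reg_pred) ** 2))

# Classification: hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.3, 0.9, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # assumed threshold of 0.5

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)         # uses scores, not thresholded labels

print(f"RMSE={rmse:.3f}")
print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} AUC={auc:.2f}")
```

Note that precision and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.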

Course Plan

Week 1 – Focus on Data Science

Day 1: Data Preprocessing
- Topic: How to clean, prepare, and preprocess data for machine learning. Learning Outcome: Understand data cleaning techniques, handle missing data, and prepare datasets for analysis.
Day 2: Performance Metrics
- Topic: How to evaluate and monitor the performance of machine learning algorithms. Learning Outcome: Gain an understanding of key metrics such as RMSE, precision, recall, the confusion matrix, and AUC.
Day 3: Optimization
- Topic: Understanding how machine learning algorithms optimize their performance. Learning Outcome: Learn about optimization techniques such as gradient descent and hyperparameter tuning.
Day 4: Deep Dive into Data Modeling (🦈)
- Topic: Fundamental principles of machine learning and data modeling. Learning Outcome: Build a solid foundation in understanding how machine learning models are structured and trained.
Day 5: ML Workflow
- Topic: Efficient and professional coding practices for machine learning workflows. Learning Outcome: Learn to build reproducible and efficient machine learning pipelines.
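
The optimization topic on Day 3 mentions gradient descent. As a rough sketch of the idea, the example below fits a straight line to synthetic data by repeatedly stepping against the gradient of the mean squared error. The data, the true slope and intercept (3.0 and 0.5), the learning rate, and the number of iterations are all assumptions made for this illustration.

```python
import numpy as np

# Synthetic data: y = 3.0 * x + 0.5 plus a little noise (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.5 + rng.normal(scale=0.05, size=200)

w, b = 0.0, 0.0        # initial guess for slope and intercept
learning_rate = 0.1    # step size (a hyperparameter)

for _ in range(2000):
    residual = w * x + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(residual * x)
    grad_b = 2.0 * np.mean(residual)
    # Move a small step in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

mse = np.mean((w * x + b - y) ** 2)
print(f"w={w:.3f} b={b:.3f} mse={mse:.5f}")
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, which is one reason step size is treated as a tunable hyperparameter.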

Week 2 – Focus on Machine Learning Algorithms

Day 6: Model Tuning
- Topic: Finding the best hyperparameters for your machine learning models. Learning Outcome: Understand techniques for hyperparameter optimization and their impact on model performance.
Day 7: Ensemble Learning
- Topic: Exploring powerful ensemble learning techniques such as random forests and gradient boosting. Learning Outcome: Learn how ensemble methods improve model performance by combining predictions from multiple models.
Day 8: Natural Language Processing
- Topic: Preparing and using text data for machine learning applications. Learning Outcome: Understand the basics of text preprocessing and NLP techniques.
Day 9: Unsupervised Learning
- Topic: Machine learning techniques for unlabeled data. Learning Outcome: Explore clustering techniques such as k-means, DBSCAN, and hierarchical clustering.
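
To illustrate how the model tuning and ensemble learning topics of Week 2 fit together, the following minimal sketch runs a cross-validated grid search over two hyperparameters of a random forest. The synthetic dataset and the small parameter grid are assumptions picked to keep the example fast; they are not the course's own exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate hyperparameter values (an illustrative, deliberately tiny grid).
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],
}

# Every combination is scored with 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The held-out test set is only used once, after tuning is finished.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test set out of the search gives a less optimistic estimate of how the tuned model behaves on new data.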

Algorithms Covered

Week 1 – Data Science Focus

Linear Regression
Logistic Regression
K-Nearest Neighbors
SGD Regressor and SGD Classifier

Week 2 – Machine Learning Algorithms

Supervised Learning Algorithms

Support Vector Machine (SVM)
Decision Trees
Random Forest
Gradient Boosting Trees
AdaBoost
Naive Bayes Classifier
Multi-Layer Perceptron (Neural Network)

Unsupervised Learning Algorithms

Principal Component Analysis (PCA)
K-Means Clustering
DBSCAN and HDBSCAN
Isolation Forest (iForest)
One-Class SVM
UMAP and t-SNE (optional)
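
As a brief illustration of how two of the unsupervised algorithms listed above can be combined, the sketch below standardizes synthetic data, projects it to two dimensions with PCA, and clusters the result with k-means. The synthetic blobs and the choice of three clusters are assumptions for demonstration; with real data the number of clusters is usually unknown and has to be explored.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic unlabeled data: 300 points in 5 dimensions (labels are discarded).
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Standardize first so that no single feature dominates the projection.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to two principal components, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Group the projected points into an assumed number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape, pca.explained_variance_ratio_.sum())
```

The explained variance ratio indicates how much of the original spread the two retained components capture, which is worth checking before interpreting clusters in the reduced space.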

Teaching Methods

The course combines lectures, live coding sessions, and practical exercises. Each day starts with a theoretical session to introduce key concepts, followed by hands-on practice using real-world datasets. The live coding approach ensures that students can follow along and apply the techniques as they learn.