Teaching the Next Generation of Geoscientists

Data Science and Machine Learning for Planet Earth

Synopsis

This intensive two-week block-taught course introduces graduate students and professionals to data science and machine learning principles, with a focus on applications in Earth and environmental sciences. The course balances theory with practical exercises and places strong emphasis on live coding sessions so that students gain hands-on experience. It covers the entire machine learning workflow, from data wrangling and preparation to model evaluation and optimization. Students learn to use Python with Pandas, NumPy, Scikit-learn, XGBoost, and other libraries.

The course is structured around key topics that will be covered through a combination of lectures and live coding sessions:

  • Overview of the Machine Learning Workflow: Understanding the steps in an end-to-end machine learning project.
  • Data Wrangling: Cleaning and preparing raw datasets for machine learning applications.
  • Data Preparation Pipelines: Using Scikit-learn to handle categorical data, perform feature engineering, and normalize/scale data.
  • Regression Algorithms: Implementing linear regression, decision trees, support vector machines, and more.
  • Classification Algorithms: Exploring algorithms such as SVC, random forests, and k-nearest neighbors.
  • Clustering and Dimensionality Reduction: Using algorithms like k-means, DBSCAN, PCA, and t-SNE.
  • Model Evaluation: Understanding performance metrics and monitoring algorithm performance.

By the end of the course, students will have a solid understanding of key machine learning algorithms and their applications to real-world problems, as well as the ability to code and deploy end-to-end machine learning pipelines.
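
As a rough illustration of what such an end-to-end pipeline can look like, the sketch below chains imputation, scaling, one-hot encoding, and a linear regressor in Scikit-learn. It is not course material: the dataset is synthetic and the column names (`depth_m`, `porosity`, `rock_type`, `temperature_c`) are hypothetical, chosen only to give the example an Earth-science flavor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "depth_m": rng.uniform(0, 1000, n),
    "porosity": rng.uniform(0.05, 0.35, n),
    "rock_type": rng.choice(["sandstone", "shale", "limestone"], n),
})
# Target: linear in depth plus Gaussian noise (an assumption for the demo).
df["temperature_c"] = 10 + 0.03 * df["depth_m"] + rng.normal(0, 1, n)
# Blank out a few values so the imputer has something to do.
df.loc[df.index[:10], "porosity"] = np.nan

X = df.drop(columns="temperature_c")
y = df["temperature_c"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric columns: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical column: one-hot encode, ignoring categories unseen in training.
preprocess = ColumnTransformer([
    ("num", numeric, ["depth_m", "porosity"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["rock_type"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

model.fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"Test RMSE: {rmse:.2f}")
```

Keeping preprocessing and the estimator in a single `Pipeline` means the imputer and scaler are fitted on the training split only, which helps avoid leaking test-set statistics into the model.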

Learning Outcomes

  • Gain an understanding of the end-to-end workflow of machine learning projects.
  • Learn to use open-source libraries (e.g., Scikit-learn) in a notebook environment for data wrangling and preparation.
  • Build data preparation pipelines, including handling categorical data, feature engineering, and scaling.
  • Explore and implement classical machine learning algorithms for regression, classification, and clustering.
  • Learn to evaluate model performance using metrics such as RMSE, precision, recall, the confusion matrix, and AUC.
  • Develop proficiency in professional coding practices for machine learning.
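
For reference, the metrics named in these outcomes can be computed with `sklearn.metrics` roughly as follows. The arrays are toy values invented for this sketch, and the 0.5 decision threshold is an assumption rather than a recommendation.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Regression: RMSE is the square root of the mean squared error.
y_true_reg = np.array([2.0, 4.0, 6.0, 8.0])
y_pred_reg = np.array([2.5, 3.5, 6.5, 7.5])
rmse = float(np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))

# Binary classification: predicted scores are thresholded at 0.5 (assumed).
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.35, 0.8, 0.9])
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the thresholded labels

print(rmse, precision, recall, auc)
print(cm)
```

Note that AUC is computed from the continuous scores, so it does not depend on the threshold, whereas precision, recall, and the confusion matrix do.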

Course Plan

Week 1 – Focus on Data Science

  • Day 1: Data Preprocessing
    • Topic: How to clean, prepare, and preprocess data for machine learning. Learning Outcome: Understand data cleaning techniques, handle missing data, and prepare datasets for analysis.
  • Day 2: Performance Metrics
    • Topic: How to evaluate and monitor the performance of machine learning algorithms. Learning Outcome: Gain an understanding of key metrics such as RMSE, precision, recall, the confusion matrix, and AUC.
  • Day 3: Optimization
    • Topic: Understanding how machine learning algorithms optimize their performance. Learning Outcome: Learn about optimization techniques such as gradient descent and hyperparameter tuning.
  • Day 4: Deep Dive into Data Modeling (🦈)
    • Topic: Fundamental principles of machine learning and data modeling. Learning Outcome: Build a solid foundation in understanding how machine learning models are structured and trained.
  • Day 5: ML Workflow
    • Topic: Efficient and professional coding practices for machine learning workflows. Learning Outcome: Learn to build reproducible and efficient machine learning pipelines.
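
To make the Day 3 topic more concrete, here is a minimal sketch of batch gradient descent for linear regression, written with NumPy only. The function name `gradient_descent`, the learning rate, and the iteration count are illustrative assumptions; the data are synthetic.

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.1, n_iterations=500):
    """Fit y ~ X @ w + b by minimizing mean squared error with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iterations):
        residual = X @ w + b - y                       # prediction error per sample
        grad_w = (2.0 / n_samples) * (X.T @ residual)  # d(MSE)/dw
        grad_b = 2.0 * residual.mean()                 # d(MSE)/db
        w -= learning_rate * grad_w                    # step against the gradient
        b -= learning_rate * grad_b
    return w, b


# Synthetic data generated from y = 3x + 1 plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)

w, b = gradient_descent(X, y)
print(w, b)
```

With standardized features a fixed learning rate of this size converges quickly; on unscaled data the same step size may diverge, which is one reason scaling is usually part of the preparation pipeline.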

Week 2 – Focus on Machine Learning Algorithms

  • Day 6: Model Tuning
    • Topic: Finding the best hyperparameters for your machine learning models. Learning Outcome: Understand techniques for hyperparameter optimization and their impact on model performance.
  • Day 7: Ensemble Learning
    • Topic: Exploring powerful ensemble learning techniques such as random forests and gradient boosting. Learning Outcome: Learn how ensemble methods improve model performance by combining predictions from multiple models.
  • Day 8: Natural Language Processing
    • Topic: Preparing and using text data for machine learning applications. Learning Outcome: Understand the basics of text preprocessing and NLP techniques.
  • Day 9: Unsupervised Learning
    • Topic: Machine learning techniques for unlabeled data. Learning Outcome: Explore clustering techniques such as k-means, DBSCAN, and hierarchical clustering.
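
The model tuning and ensemble learning topics can be combined in a short sketch: a cross-validated grid search over two random forest hyperparameters. The parameter grid, the synthetic dataset from `make_classification`, and the choice of accuracy as the score are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A deliberately small grid; real searches usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The held-out test split is used only once, after the search has finished.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Selecting hyperparameters by cross-validation on the training data, and reporting performance on a separate test split, keeps the final estimate from being biased by the search itself.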

Algorithms Covered

Week 1 – Data Science Focus

  • Linear Regression
  • Logistic Regression
  • K-Nearest Neighbors
  • SGD Regressor and SGD Classifier

Week 2 – Machine Learning Algorithms

Supervised Learning Algorithms

  • Support Vector Machine (SVM)
  • Decision Trees
  • Random Forest
  • Gradient Boosting Trees
  • AdaBoost
  • Naive Bayes Classifier
  • Multi-Layer Perceptron (Neural Network)
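
As one hedged example tying the text preprocessing from Day 8 to the Naive Bayes classifier in this list, the sketch below vectorizes short documents with TF-IDF and fits a multinomial Naive Bayes model. The six-sentence corpus and its labels are made up for the demo and far too small for any real conclusion.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; labels are hypothetical categories.
texts = [
    "magnitude five earthquake recorded along the fault",
    "seismic waves detected after the earthquake",
    "aftershocks followed the main seismic event",
    "heavy rainfall caused river flooding",
    "flood warning issued after days of rainfall",
    "river levels rising with continued rainfall",
]
labels = ["seismic", "seismic", "seismic", "flood", "flood", "flood"]

# Lowercase, drop English stop words, weight terms by TF-IDF, then classify.
classifier = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
classifier.fit(texts, labels)

predictions = classifier.predict([
    "earthquake aftershocks near the fault",
    "rainfall raises flood risk",
])
print(list(predictions))
```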

Unsupervised Learning Algorithms

  • Principal Component Analysis (PCA)
  • K-Means Clustering
  • DBSCAN and HDBSCAN
  • Isolation Forest (iForest)
  • One-Class SVM
  • UMAP and t-SNE (optional)
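
Several of the unsupervised methods listed above are often used together. The sketch below, on synthetic blobs from `make_blobs`, scales the data, projects it to two dimensions with PCA, and then clusters the projection with k-means and DBSCAN. The values of `n_clusters`, `eps`, and `min_samples` are assumptions tuned to this toy dataset, not general defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups in five dimensions.
X, _ = make_blobs(
    n_samples=300, centers=3, n_features=5, cluster_std=0.5, random_state=0
)

# Standardize first so PCA is not dominated by large-scale features.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# k-means needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# DBSCAN infers the number of clusters from density; label -1 marks noise.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_2d)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print(X_2d.shape, len(set(kmeans_labels)), n_dbscan_clusters)
```

The contrast is the point of the example: k-means always returns exactly the requested number of clusters, while the DBSCAN result depends on how `eps` relates to the density of the data.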

Teaching Methods

The course combines lectures, live coding sessions, and practical exercises. Each day starts with a theoretical session to introduce key concepts, followed by hands-on practice using real-world datasets. The live coding approach ensures that students can follow along and apply the techniques as they learn.