Data Science and Machine Learning for Planet Earth
Synopsis
This intensive two-week block course introduces graduate students and professionals to data science and machine learning principles, with a focus on applications relevant to Earth and environmental sciences. The course balances theory and practical exercises, with a strong emphasis on live coding sessions so that students gain hands-on experience. The module covers the entire machine learning workflow, from data wrangling and preparation to model evaluation and optimization. Students learn to use Python with Pandas, NumPy, Scikit-learn, XGBoost, and other libraries.
The course is structured around key topics that will be covered through a combination of lectures and live coding sessions:
- Model Evaluation: Understanding performance metrics and monitoring algorithm performance.
- Overview of the Machine Learning Workflow: Understanding the steps in an end-to-end machine learning project.
- Data Wrangling: Cleaning and preparing raw datasets for machine learning applications.
- Data Preparation Pipelines: Using Scikit-learn to handle categorical data, perform feature engineering, and normalize/scale data.
- Regression Algorithms: Implementing linear regression, decision trees, support vector machines, and more.
- Classification Algorithms: Exploring algorithms such as SVC, random forests, and k-nearest neighbors.
- Clustering and Dimensionality Reduction: Using algorithms like k-means, DBSCAN, PCA, and t-SNE.
By the end of the course, students will have a solid understanding of key machine learning algorithms and their applications to real-world problems, as well as the ability to code and deploy end-to-end machine learning pipelines.
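As an illustration of the data preparation pipelines mentioned above, the following is a minimal sketch rather than course material. It assumes a small, hypothetical table of station readings (the column names and values are invented for the example) and chains imputation, scaling, one-hot encoding, and a linear regression in a single Scikit-learn pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset (invented values), with one missing elevation.
df = pd.DataFrame({
    "elevation_m": [12.0, 340.0, np.nan, 1250.0, 800.0, 45.0],
    "land_cover": ["urban", "forest", "forest", "alpine", "forest", "urban"],
    "mean_temp_c": [15.2, 11.8, 12.1, 4.3, 8.0, 14.9],
})
X = df[["elevation_m", "land_cover"]]
y = df["mean_temp_c"]

# Numeric columns: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen at fit time.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["elevation_m"]),
    ("cat", categorical, ["land_cover"]),
])

# Preprocessing and the estimator are fitted together as one object.
model = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied at training and prediction time, which helps avoid leakage when the pipeline is later cross-validated.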
Learning Outcomes
- Gain an understanding of the end-to-end workflow of machine learning projects.
- Learn to use open-source libraries (e.g., Scikit-learn) in a notebook environment for data wrangling and preparation.
- Build data preparation pipelines, including handling categorical data, feature engineering, and scaling.
- Explore and implement classical machine learning algorithms for regression, classification, and clustering.
- Learn to evaluate model performance using various metrics such as RMSE, precision, recall, confusion matrix, and AUC.
- Develop proficiency in professional coding practices for machine learning.
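The metrics named in the outcomes above can be computed with Scikit-learn as in the sketch below. The labels, scores, and the 0.5 decision threshold are invented for illustration and are not taken from the course.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Regression: RMSE is the square root of the mean squared error.
y_true_reg = np.array([2.0, 4.0, 6.0, 8.0])
y_pred_reg = np.array([2.5, 3.5, 6.0, 9.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # about 0.612

# Classification: made-up true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.6, 0.8, 0.3, 0.9, 0.7, 0.2])
y_pred = (scores >= 0.5).astype(int)  # assumed threshold of 0.5

cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
auc = roc_auc_score(y_true, scores)          # 14 of 16 pairs ranked correctly = 0.875
```

Precision and recall depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds.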
Course Plan
Week 1 – Focus on Data Science
- Day 1: Data Preprocessing
- Topic: How to clean, prepare, and preprocess data for machine learning. Learning Outcome: Understand data cleaning techniques, handle missing data, and prepare datasets for analysis.
- Day 2: Performance Metrics
- Topic: How to evaluate and monitor the performance of machine learning algorithms. Learning Outcome: Gain an understanding of key metrics such as RMSE, precision, recall, confusion matrix, and AUC.
- Day 3: Optimization
- Topic: Understanding how machine learning algorithms optimize their performance. Learning Outcome: Learn about optimization techniques such as gradient descent and hyperparameter tuning.
- Day 4: Deep Dive into Data Modelling (🦈)
- Topic: Fundamental principles of machine learning and data modelling. Learning Outcome: Build a solid foundation in understanding how machine learning models are structured and trained.
- Day 5: ML Workflow
- Topic: Efficient and professional coding practices for machine learning workflows. Learning Outcome: Learn to build reproducible and efficient machine learning pipelines.
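To make the Day 3 topic concrete, here is a minimal sketch of batch gradient descent fitting a straight line by minimizing mean squared error. The synthetic data, learning rate, and iteration count are assumptions chosen for the example, not values prescribed by the course.

```python
import numpy as np

# Synthetic data (assumed): y = 3x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.05, size=200)

w, b = 0.0, 0.0      # initial slope and intercept
learning_rate = 0.5  # assumed step size; too large a value would diverge

for _ in range(2000):
    error = w * x + b - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

mse = np.mean((w * x + b - y) ** 2)
```

After training, `w` and `b` should be close to the slope and intercept used to generate the data. Scikit-learn's `SGDRegressor` applies the same idea with stochastic updates on individual samples.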
Week 2 – Focus on Machine Learning Algorithms
- Day 6: Model Tuning
- Topic: Finding the best hyperparameters for your machine learning models. Learning Outcome: Understand techniques for hyperparameter optimization and their impact on model performance.
- Day 7: Ensemble Learning
- Topic: Exploring powerful ensemble learning techniques such as random forests and gradient boosting. Learning Outcome: Learn how ensemble methods improve model performance by combining predictions from multiple models.
- Day 8: Natural Language Processing
- Topic: Preparing and using text data for machine learning applications. Learning Outcome: Understand the basics of text preprocessing and NLP techniques.
- Day 9: Unsupervised Learning
- Topic: Machine learning techniques for unlabeled data. Learning Outcome: Explore clustering techniques such as k-means, DBSCAN, and hierarchical clustering.
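The Day 6 and Day 7 topics can be combined in one short sketch: a grid search with cross-validation over two random forest hyperparameters. The synthetic dataset and the candidate values in `param_grid` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification problem (assumed stand-in for real data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate hyperparameters; every combination is scored by 3-fold CV.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# The held-out test set is used only once, after tuning is finished.
test_accuracy = search.score(X_test, y_test)
```

Tuning on cross-validation folds and reporting on a separate test set keeps the final accuracy estimate from being biased by the search itself.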
Algorithms Covered
Week 1 – Data Science Focus
- Linear Regression
- Logistic Regression
- K-Nearest Neighbors
- SGD Regressor and SGD Classifier
Week 2 – Machine Learning Algorithms
Supervised Learning Algorithms
- Support Vector Machine (SVM)
- Decision Trees
- Random Forest
- Gradient Boosting Trees
- AdaBoost
- Naive Bayes Classifier
- Multi-Layer Perceptron (Neural Network)
Unsupervised Learning Algorithms
- Principal Component Analysis (PCA)
- K-Means Clustering
- DBSCAN and HDBSCAN
- Isolation Forest (iForest)
- One-Class SVM
- UMAP and t-SNE (optional)
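Two of the unsupervised methods listed above, PCA and k-means, are often used together. The sketch below assumes synthetic blob data in five dimensions (not a course dataset), projects it to two principal components, and clusters the result. The true labels are used only to check the clustering afterwards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): three Gaussian blobs in five dimensions.
X, true_labels = make_blobs(
    n_samples=300, centers=3, n_features=5, cluster_std=1.0, random_state=0
)

# Standardize, then keep the two directions of largest variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Cluster in the reduced space; n_clusters must be chosen by the user.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)

# Agreement with the generating labels (1.0 means a perfect match).
ari = adjusted_rand_score(true_labels, cluster_labels)
explained = pca.explained_variance_ratio_.sum()
```

On real unlabeled data no ground truth is available, so internal measures such as the silhouette score would replace the adjusted Rand index used here.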
Teaching Methods
The course combines lectures, live coding sessions, and practical exercises. Each day starts with a theoretical session to introduce key concepts, followed by hands-on practice using real-world datasets. The live coding approach ensures that students can follow along and apply the techniques as they learn.