Scalable Machine Learning with Apache Spark
Contact us to book this course
Learning Track
Machine Learning and AI
Delivery methods
On-Site, Virtual
Duration
2 days
This course teaches you how to scale ML pipelines with Apache Spark™, covering distributed training, hyperparameter tuning and inference. You’ll build and tune ML models with SparkML and use MLflow to track, version and manage them. We’ll cover the latest ML features in Apache Spark, such as pandas UDFs, pandas function APIs and the pandas API on Spark, as well as the latest ML product offerings such as Feature Store and AutoML.
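As a taste of the pandas UDF feature mentioned above, here is a minimal sketch of the vectorized per-batch logic such a UDF wraps. In Spark you would register the function with `pyspark.sql.functions.pandas_udf` and apply it to a DataFrame column; the snippet below uses plain pandas so it runs without a cluster, and the function name and values are illustrative.

```python
import pandas as pd

# The heart of a Spark pandas UDF is a vectorized function that maps a
# pandas Series (one Arrow batch of a column) to a pandas Series of the
# same length. In Spark you would decorate it with
# pandas_udf("double") and Spark would call it once per batch on each
# executor; locally it works on any pandas Series.
def fahrenheit_to_celsius(temps_f: pd.Series) -> pd.Series:
    return (temps_f - 32) * 5.0 / 9.0

batch = pd.Series([32.0, 212.0])
print(fahrenheit_to_celsius(batch).tolist())  # [0.0, 100.0]
```

Because the function operates on whole Series rather than one row at a time, Spark can exchange data with it efficiently via Arrow, which is what makes pandas UDFs much faster than row-at-a-time Python UDFs.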
This course will prepare you to take the Databricks Certified Machine Learning Associate exam.
Objectives
- Perform scalable exploratory data analysis (EDA) with Spark
- Build and tune machine learning models with SparkML
- Track, version and deploy models with MLflow
- Perform distributed hyperparameter tuning with Hyperopt
- Use the Databricks Machine Learning workspace to create Feature Store tables and run AutoML experiments
- Leverage the pandas API on Spark to scale your pandas code
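The last objective, scaling pandas code with the pandas API on Spark, typically amounts to swapping one import while keeping familiar pandas syntax. The sketch below runs with plain pandas so it needs no Spark cluster; the column names and values are illustrative.

```python
import pandas as pd

# With the pandas API on Spark you would instead write
#   import pyspark.pandas as ps
#   df = ps.DataFrame({...})
# and keep the rest of the code unchanged, letting Spark distribute the
# computation. Here we use plain pandas so the snippet is self-contained.
df = pd.DataFrame({"bedrooms": [1, 2, 2, 3],
                   "price": [100, 150, 160, 220]})

# Familiar pandas groupby/aggregation syntax:
avg_price = df.groupby("bedrooms")["price"].mean()
print(avg_price.to_dict())  # {1: 100.0, 2: 155.0, 3: 220.0}
```

The appeal of this API is exactly that the second half of the snippet is unchanged: existing pandas analyses can scale to datasets that no longer fit on one machine.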
Prerequisites
- Intermediate experience with Python
- Experience building machine learning models
- Familiarity with the PySpark DataFrame API
Course outline
- Spark/ML Overview
- Exploratory Data Analysis (EDA) and Feature Engineering with Spark
- Linear Regression with SparkML: Transformers, Estimators, Pipelines and Evaluators
- MLflow Tracking and Model Registry
- Tree-Based Models: Hyperparameter Tuning and Parallelism
- Hyperopt for Distributed Hyperparameter Tuning
- Databricks AutoML and Feature Store
- Integrating Third-Party Packages (Distributed XGBoost)
- Distributed Inference of scikit-learn Models with pandas UDFs
- Distributed Training with pandas Function APIs
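The last outline item, distributed inference of scikit-learn models with pandas UDFs, follows a batch-scoring pattern: a trained model is shipped to executors, and Spark feeds each one Arrow batches of rows as pandas objects to score. The sketch below stands in a trivial linear function for a fitted scikit-learn estimator and plain chunking for Spark's partitioning, so it runs with pandas alone; all names are illustrative.

```python
import pandas as pd

# In Spark this function would run inside a pandas UDF, e.g. decorated
# with pandas_udf("double"), calling model.predict(batch) per batch on
# each executor. Here a trivial linear function stands in for a fitted
# scikit-learn model.
def predict_batch(features: pd.Series) -> pd.Series:
    return 2.0 * features + 1.0  # stand-in for model.predict

df = pd.DataFrame({"x": range(6)})

# Emulate Spark splitting the column into per-partition batches of 2 rows:
chunks = [df["x"].iloc[i:i + 2].astype(float) for i in range(0, len(df), 2)]
preds = pd.concat(predict_batch(c) for c in chunks)
print(preds.tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
```

Because each batch is scored independently, the same single-node model can score arbitrarily large datasets in parallel without being retrained in Spark.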