Scalable Machine Learning with Apache Spark
Contact us to book this course
Learning Track
Machine Learning and AI
Delivery methods
On-Site, Virtual
Duration
2 days
This course teaches you how to scale ML pipelines with Apache Spark™, covering distributed training, hyperparameter tuning and inference. You’ll build and tune ML models with SparkML and use MLflow to track, version and manage them. We’ll cover the latest ML features in Apache Spark, such as pandas UDFs, pandas function APIs and the pandas API on Spark, as well as the latest ML product offerings such as Feature Store and AutoML.
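As a taste of the pandas UDF feature mentioned above, here is a minimal sketch of the vectorized per-batch logic such a UDF wraps. In Spark you would register the function with `pyspark.sql.functions.pandas_udf` and apply it to a DataFrame column; the snippet below uses plain pandas so it runs without a cluster, and the function name and values are illustrative.

```python
import pandas as pd

# The heart of a Spark pandas UDF is a vectorized function that maps a
# pandas Series (one Arrow batch of a column) to a pandas Series of the
# same length. In Spark you would decorate it with
# pandas_udf("double") and Spark would call it once per batch on each
# executor; locally it works on any pandas Series.
def fahrenheit_to_celsius(temps_f: pd.Series) -> pd.Series:
    return (temps_f - 32) * 5.0 / 9.0

batch = pd.Series([32.0, 212.0])
print(fahrenheit_to_celsius(batch).tolist())  # [0.0, 100.0]
```

Because the function operates on whole Series rather than one row at a time, Spark can exchange data with it efficiently via Arrow, which is what makes pandas UDFs much faster than row-at-a-time Python UDFs.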
This course will prepare you to take the Databricks Certified Machine Learning Associate exam.
Objectives
- Perform scalable exploratory data analysis (EDA) with Spark
- Build and tune machine learning models with SparkML
- Track, version and deploy models with MLflow
- Perform distributed hyperparameter tuning with Hyperopt
- Use the Databricks Machine Learning workspace to create Feature Store tables and run AutoML experiments
- Leverage the pandas API on Spark to scale your pandas code
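The last objective, scaling pandas code with the pandas API on Spark, typically amounts to swapping one import while keeping familiar pandas syntax. The sketch below runs with plain pandas so it needs no Spark cluster; the column names and values are illustrative.

```python
import pandas as pd

# With the pandas API on Spark you would instead write
#   import pyspark.pandas as ps
#   df = ps.DataFrame({...})
# and keep the rest of the code unchanged, letting Spark distribute the
# computation. Here we use plain pandas so the snippet is self-contained.
df = pd.DataFrame({"bedrooms": [1, 2, 2, 3],
                   "price": [100, 150, 160, 220]})

# Familiar pandas groupby/aggregation syntax:
avg_price = df.groupby("bedrooms")["price"].mean()
print(avg_price.to_dict())  # {1: 100.0, 2: 155.0, 3: 220.0}
```

The appeal of this API is exactly that the second half of the snippet is unchanged: existing pandas analyses can scale to datasets that no longer fit on one machine.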
Prerequisites
- Intermediate experience with Python
- Experience building machine learning models
- Familiarity with the PySpark DataFrame API
Course outline
- Spark/ML Overview
- Exploratory Data Analysis (EDA) and Feature Engineering with Spark
- Linear Regression with SparkML: Transformers, Estimators, Pipelines and Evaluators
- MLflow Tracking and Model Registry
- Tree-Based Models: Hyperparameter Tuning and Parallelism
- Hyperopt for Distributed Hyperparameter Tuning
- Databricks AutoML and Feature Store
- Integrating Third-Party Packages (Distributed XGBoost)
- Distributed Inference of scikit-learn Models with pandas UDFs
- Distributed Training with pandas Function APIs
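The last outline item, distributed inference of scikit-learn models with pandas UDFs, follows a batch-scoring pattern: a trained model is shipped to executors, and Spark feeds each one Arrow batches of rows as pandas objects to score. The sketch below stands in a trivial linear function for a fitted scikit-learn estimator and plain chunking for Spark's partitioning, so it runs with pandas alone; all names are illustrative.

```python
import pandas as pd

# In Spark this function would run inside a pandas UDF, e.g. decorated
# with pandas_udf("double"), calling model.predict(batch) per batch on
# each executor. Here a trivial linear function stands in for a fitted
# scikit-learn model.
def predict_batch(features: pd.Series) -> pd.Series:
    return 2.0 * features + 1.0  # stand-in for model.predict

df = pd.DataFrame({"x": range(6)})

# Emulate Spark splitting the column into per-partition batches of 2 rows:
chunks = [df["x"].iloc[i:i + 2].astype(float) for i in range(0, len(df), 2)]
preds = pd.concat(predict_batch(c) for c in chunks)
print(preds.tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
```

Because each batch is scored independently, the same single-node model can score arbitrarily large datasets in parallel without being retrained in Spark.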