Apache Spark Programming with Databricks

Learning Track

Programming and Apache Spark

Delivery methods

On-Site, Virtual

Duration

2 days

In this course, you will explore the fundamentals of Apache Spark and Delta Lake on Databricks. You will learn the architectural components of Spark, the DataFrame and Structured Streaming APIs, and how Delta Lake can improve your data pipelines. Finally, you will run streaming queries to process streaming data and learn the advantages of using Delta Lake.

Objectives

Upon completion of the course, you will be able to:

Define Spark’s architectural components
Describe how DataFrames are transformed, executed, and optimized in Spark
Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
Apply the Structured Streaming API to perform analysis on streaming data
Use Delta Lake to improve the quality and performance of data pipelines

Prerequisites

Familiarity with Python and basic programming concepts, including data types, lists, dictionaries, variables, functions, loops, conditional statements, exception handling, accessing classes, and using third-party libraries
Basic knowledge of SQL, including writing queries using SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, and JOIN
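
As a rough self-check for the two prerequisites above, the following sketch combines basic Python (functions, lists, tuples) with a query that uses SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, and JOIN. It is only an illustration, not course material: it uses Python's built-in sqlite3 module instead of Spark, and the table names, column names, and sample rows are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    # Hypothetical sample data, for illustration only.
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Linus")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 20.0), (2, 1, 35.0), (3, 2, 50.0), (4, 3, 5.0), (5, 2, 15.0)],
    )
    return conn


def top_customers(conn, min_amount, limit):
    """Total order amount per customer, counting only orders >= min_amount."""
    query = """
        SELECT c.name, SUM(o.amount) AS total
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        WHERE o.amount >= ?
        GROUP BY c.name
        ORDER BY total DESC
        LIMIT ?
    """
    return conn.execute(query, (min_amount, limit)).fetchall()


conn = build_demo_db()
result = top_customers(conn, 10, 2)
for name, total in result:
    print(name, total)
```

If each clause in the query and each Python construct in the helper functions reads comfortably, the prerequisites are likely met.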

Course outline

Day 1

Spark overview
Databricks platform overview
Spark SQL
DataFrame reader, writer, transformation, and aggregation
Datetimes
Complex types

Day 2

User-defined functions (UDFs) and vectorized UDFs
Spark internals
Query optimization
Partitioning
Streaming API
Delta Lake
