Apache Spark Programming with Databricks
Learning track: Programming and Apache Spark
Delivery methods: On-Site, Virtual
Duration: 2 days
In this course, you will explore the fundamentals of Apache Spark and Delta Lake on Databricks. You will learn the architectural components of Spark, the DataFrame and Structured Streaming APIs, and how Delta Lake can improve your data pipelines. Finally, you will run streaming queries to process data as it arrives and see the advantages of using Delta Lake.
Objectives
Upon completion of the course, you will be able to:
- Define Spark’s architectural components
- Describe how DataFrames are transformed, executed, and optimized in Spark
- Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
- Apply the Structured Streaming API to perform analysis on streaming data
- Use Delta Lake to improve the quality and performance of data pipelines
Prerequisites
- Familiarity with Python and basic programming concepts, including data types, lists, dictionaries, variables, functions, loops, conditional statements, exception handling, accessing classes, and using third-party libraries
- Basic knowledge of SQL, including writing queries using SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, and JOIN
Course outline
- Spark overview
- Databricks platform overview
- Spark SQL
- DataFrame reader, writer, transformation, and aggregation
- Datetimes
- Complex types
- User-defined functions (UDFs) and vectorized UDFs
- Spark internals
- Query optimization
- Partitioning
- Streaming API
- Delta Lake