PySpark Essentials
Big Data and Machine Learning
On-Site, Virtual
3 days
Apache Spark builds on the success of Apache Hadoop by executing MapReduce-style applications far faster, in some cases up to two orders of magnitude. Spark supports not only batch-oriented MapReduce but also real-time data streaming, graph processing, and machine learning. Spark applications can be developed incrementally using interactive shells for Python, R, SQL, and Scala, with full application support in Java as well. This course focuses on PySpark, Spark's Python API, to harness the power of big data processing.
Learning Objectives
After successfully completing this course, students will be able to:
- Describe the Spark architecture, including cluster management and file systems
- Explain the components of a Spark application
- Implement a Spark application based on Resilient Distributed Datasets (RDDs) using PySpark
- Interact with Spark using Jupyter Notebooks
- Utilize SQL as an API for MapReduce applications
- Create and manipulate relational tables using Spark SQL and DataFrames
- Perform real-time data processing with Spark Streaming
- Implement distributed machine learning applications with Spark MLlib
Who Should Attend
This course is intended for data engineers, data scientists, software developers, and IT professionals who want to learn how to implement big data applications. It is a hands-on course designed to give participants practical experience developing Spark applications in Python.
Prerequisites
Participants should have completed a foundational course on big data, such as ROI’s “Big Data: Understanding Hadoop and Its Ecosystem,” or possess equivalent experience. Although this course focuses on PySpark, familiarity with Python, SQL, and basic programming concepts is required. Participants will be expected to adapt code examples to solve the problems presented in the exercises.
Course Outline
- Spark Architecture and RDD Basics
- Interactive Development with PySpark Shell
- Exercise: Getting Started with PySpark and Word Count
- RDD Lineage and Partitioning
- Programming with RDDs in Python
- Hands-on Exercise: Airline Data Analysis Using RDDs
- Introduction to Jupyter Notebooks
- The Importance of Spark SQL
- Introduction to DataFrames in PySpark
- Exercise: Querying JSON Data with Spark SQL
- DataFrame Operations in Python
- Getting Started with PySpark
- Exercise: Building a Log Processing Pipeline
- Developing Applications in Python
- Comparing RDDs and DataFrames in Python
- Unified DataFrames and UDFs (User-Defined Functions)
- Exercise: Analyzing Stock Market Data with DataFrames
- Persisting and Checkpointing in Spark
- Introduction to Spark Streaming and its Architecture
- Lambda Architecture
- Hands-on Exercise: Twitter Sentiment Analysis with Spark Streaming
- Introduction to Spark MLlib
- Hands-on Exercise: Building a Predictive Model for Customer Churn