• ROI Training

PySpark Essentials

Curriculum

Big Data and Machine Learning

Delivery methods

On-Site, Virtual

Duration

3 days

Apache Spark builds on the success of Apache Hadoop by executing MapReduce-style applications up to two orders of magnitude faster. Beyond batch-oriented MapReduce, Spark directly supports real-time data streaming, graph processing, and machine learning. Applications can be developed incrementally using Spark's interactive shells for Python, Scala, R, and SQL, with full APIs also available in Java. This course focuses on using PySpark, Spark's Python API, to harness the power of big data processing.

Learning Objectives

After successfully completing this course, students will be able to:

  • Describe the Spark architecture, including cluster management and file systems
  • Explain the components of a Spark application
  • Implement a Spark application based on Resilient Distributed Datasets (RDDs) using PySpark
  • Interact with Spark using Jupyter Notebooks
  • Utilize SQL as an API for MapReduce applications
  • Create and manipulate relational tables using Spark SQL and DataFrames
  • Perform real-time data processing with Spark Streaming
  • Implement distributed machine learning applications with Spark MLlib

Who Should Attend

This course is intended for data engineers, data scientists, software developers, and IT professionals who want to understand how to implement big data applications. This is a hands-on course designed to provide participants with practical experience in developing Spark applications using Python.

Prerequisites

Participants should have completed a foundational course on big data, such as ROI's "Big Data: Understanding Hadoop and Its Ecosystem," or possess equivalent experience. This course focuses on PySpark; however, familiarity with Python, SQL, and basic programming concepts is required. Participants will be expected to adapt code examples to solve problems presented in the exercises.

Course Outline

  • Spark Architecture and RDD Basics
  • Interactive Development with PySpark Shell
  • Exercise: Getting Started with PySpark and Word Count
  • RDD Lineage and Partitioning
  • Programming with RDDs in Python
  • Hands-on Exercise: Airline Data Analysis Using RDDs
  • Introduction to Jupyter Notebooks
  • The Importance of Spark SQL
  • Introduction to DataFrames in PySpark
  • Exercise: Querying JSON Data with Spark SQL
  • DataFrame Operations in Python
  • Getting Started with PySpark
  • Exercise: Building a Log Processing Pipeline
  • Developing Applications in Python
  • Comparing RDDs and DataFrames in Python
  • Unified DataFrames and UDFs (User-Defined Functions)
  • Exercise: Analyzing Stock Market Data with DataFrames
  • Persisting and Checkpointing in Spark
  • Introduction to Spark Streaming and its Architecture
  • Lambda Architecture
  • Hands-on Exercise: Twitter Sentiment Analysis with Spark Streaming
  • Introduction to Spark MLlib
  • Hands-on Exercise: Building a Predictive Model for Customer Churn
