Apache Spark
=====

-----
* Course Id : BIGD-ASPK
* Duration : 40 Hours

Overview
-----
* Using Apache Spark, participants can build complete, unified big data applications that combine batch, streaming, and interactive analytics on all their data
* Using Spark, developers can write sophisticated parallel applications that support faster decisions and real-time actions
* Spark-based solutions can be applied to a wide variety of industries, architectures, and use cases

Pre-Requisites
-----
All attendees should be familiar with:
* Linux environment
* Python language

Objectives
-----
All attendees will:
* Understand the features of Spark’s Resilient Distributed Datasets
* Work with the Spark shell for interactive data analysis
* Understand how Spark runs on a cluster
* Write Spark applications

Course Structure
-----
* Our technical courses emphasize hands-on work (typically 80% hands-on, 20% theory)
* Students learn to apply the material to real-world problems

Materials Provided
-----
* PDF of slides and hands-on exercises
* Access to an instance with the lab environment

Software Requirements
-----
Any one of the following:
* Any current internet browser
* VNC client
* RDP client

Hardware Requirements
-----
* Processor: 1.2 GHz
* RAM: 512 MB
* Disk space: 1 GB
* Network connection to the Internet with low latency (< 250 ms)

## Day-wise Course Outline
-----
## Day 1
-----
* Unit 1 : Introduction to Spark
* Unit 2 : Spark Architecture
* Unit 3 : Introduction to RDDs
* Unit 4 : Deep dive into Spark RDDs

## Day 2
-----
* Unit 5 : Aggregating Data with Pair RDDs
* Unit 6 : Developing and Deploying Spark Applications
* Unit 7 : Spark List
* Unit 8 : Parallel Processing

## Day 3
-----
* Unit 9 : Spark RDD Persistence
* Unit 10 : Spark API
* Unit 11 : Spark SQL
* Unit 12 : Spark Internals

## Day 4
-----
* Unit 13 : Spark Streaming
* Unit 14 : Machine Learning
* Unit 15 : MLlib Overview
* Unit 16 : Scheduling/Partitioning

## Day 5
-----
* Unit 17 : Performance Characteristics and Tuning
* Unit 18 : Spark GraphX Overview
* Unit 19 : Graph Processing with GraphFrames
* Unit 20 : Spark with AWS EMR

## Detailed Outline
-----

Unit 1 : Introduction to Spark
-----

* Overview, Spark Systems
* Spark Ecosystem
* Spark vs. Hadoop
* Spark installation guide
* Spark configuration
* The Spark Shell

Unit 2 : Spark Architecture
-----

* Memory management
* Executor memory vs. driver memory
* Working with Spark Shell
* The concept of Resilient Distributed Datasets (RDD)
* Functional programming in Spark
* The architecture of Spark

Unit 3 : Introduction to RDDs
-----

* Spark RDD, Lifecycle, Lazy Evaluation
* Creating RDDs
* RDD partitioning
* Operations and transformation in RDD
* Caching – Storage Type, Guidelines
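
The lazy-evaluation topic in this unit can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: the hypothetical `LazyDataset` class only records transformations and runs them when an action such as `collect` is called, which is roughly how RDD transformations and actions relate. All names here are assumptions made for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function) steps

    def map(self, func):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def filter(self, pred):
        # Transformation: also recorded only, nothing runs yet.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replays the recorded lineage over the source data.
        items = list(self._source)
        for kind, func in self._lineage:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items

    def count(self):
        # Action: built on collect for simplicity.
        return len(self.collect())


calls = []  # records every element the mapped function actually sees

def traced_square(x):
    calls.append(x)
    return x * x

dataset = LazyDataset(range(5)).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: no action has run yet
even_squares = dataset.collect()   # the action triggers the whole lineage
```

Because every transformation returns a new object and keeps the earlier steps, the recorded tuple also gives a rough picture of what RDD lineage means.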

Unit 4 : Deep dive into Spark RDDs
-----

* Key-Value Pairs – Definition
* The RDD general operations
* RDD actions: collect, count, collectAsMap, saveAsTextFile
* Pair RDD functions

Unit 5 : Aggregating Data with Pair RDDs
-----

* Key-value pairs in RDDs: creation and operations
* How Spark makes MapReduce operations faster
* MapReduce interactive operations
* Spark stack

Unit 6 : Developing and Deploying Spark Applications
-----

* Spark Applications vs. Spark Shell
* Creating a Spark application
* Configuring Spark Properties
* Deploying a Spark application
* Building and Running a Spark Application
* Logging and Debugging

Unit 7 : Spark List
-----

* Creating a mutable list
* Sets and set operations
* Lists
* Tuples
* Concatenating lists

Unit 8 : Parallel Processing
-----

* Learning about Spark parallel processing
* Deploying on a cluster
* Introduction to Spark partitions
* File-based partitioning of RDDs
* Understanding HDFS and data locality
* Mastering the technique of parallel operations

Unit 9 : Spark RDD Persistence
-----

* The execution flow in Spark
* RDD persistence overview
* Spark terminology
* RDD limitations
* Spark shell arguments
* Distributed persistence
* RDD lineage
* Key/value pair operations: countByKey, reduceByKey, sortByKey, aggregateByKey
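
The by-key operations listed above can be pictured with a small plain-Python sketch. This is not Spark code: the helper functions and the two-partition sample data are hypothetical and only mimic, on one machine, what the corresponding pair RDD operations compute.

```python
from collections import defaultdict
from operator import add

# Hypothetical sample data: two "partitions" of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]

def count_by_key(parts):
    # Number of records per key, ignoring the values.
    counts = defaultdict(int)
    for part in parts:
        for key, _ in part:
            counts[key] += 1
    return dict(counts)

def reduce_by_key(parts, func):
    # Merge all values of a key with one associative function.
    merged = {}
    for part in parts:
        for key, value in part:
            merged[key] = func(merged[key], value) if key in merged else value
    return merged

def aggregate_by_key(parts, zero, seq_op, comb_op):
    # seq_op folds values inside a partition, starting from `zero`
    # (assumed immutable here); comb_op merges per-partition results.
    result = {}
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = seq_op(local.get(key, zero), value)
        for key, acc in local.items():
            result[key] = comb_op(result[key], acc) if key in result else acc
    return result

def sort_by_key(parts):
    # All pairs ordered by key (stable for equal keys).
    return sorted((pair for part in parts for pair in part), key=lambda kv: kv[0])

sums = reduce_by_key(partitions, add)
# (sum, count) per key, a common first step toward a per-key average.
sum_and_count = aggregate_by_key(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
```

The split between `seq_op` and `comb_op` is the point of the aggregate variant: the accumulator type (a tuple here) can differ from the value type (an integer).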

Unit 10 : Spark API
-----

* Overview, Basic Driver Code, SparkConf
* Creating and Using a SparkContext
* RDD API
* Application Lifecycle
* Cluster Managers

Unit 11 : Spark SQL
-----

* Introduction and Usage
* DataFrames and SQLContext
* Working with JSON
* Querying – The DataFrame DSL and SQL
* Data Formats

Unit 12 : Spark Internals
-----

* Cluster Architecture
* The Catalyst query optimizer
* The Tungsten in-memory data format
* How the Spark scheduler works to execute jobs and tasks
* Shuffling, shuffle files
* How Spark handles data reads and writes

Unit 13 : Spark Streaming
-----

* Spark Streaming Architecture
* Writing streaming programs
* Processing Spark streams
* Processing Spark Discretized Streams (DStreams)
* The Spark Streaming context (StreamingContext)
* Streaming transformations
* Spark Streaming with Flume
* Request count and DStream

Unit 14 : Machine Learning
-----

* Spark MLlib Pipeline API
* Built-in Featurizing and Algorithms
* Cross-Validation and Grid Search for Hyperparameter Tuning
* Evaluation Metrics
* Data Partitioning Strategies

Unit 15 : MLlib Overview
-----

* Introduction
* Feature Vectors
* Introduction to K-Means
* Shared variables in Spark
* Broadcast variables
* Accumulators

Unit 16 : Scheduling/Partitioning
-----

* Scheduling and partitioning in Spark
* Hash partitioning
* Range partitioning
* Scheduling within and across applications
* Static partitioning
* Dynamic sharing
* Fair scheduling
* mapPartitionsWithIndex
* groupByKey
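
As a rough illustration of the difference between hash and range partitioning, the plain-Python sketch below assigns keys to partition numbers in both ways. It is not Spark's implementation: `zlib.crc32` is an assumed stand-in for the hash function, and the function names and boundary values are hypothetical.

```python
import bisect
import zlib

def hash_partition(key, num_partitions):
    # Hash partitioning: partition = hash(key) mod number of partitions.
    # crc32 is used here only because it is deterministic across runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def range_partition(key, boundaries):
    # Range partitioning: `boundaries` is a sorted list of upper bounds.
    # Keys <= boundaries[0] go to partition 0, the next range to 1, and so on;
    # keys above the last boundary land in the final partition.
    return bisect.bisect_left(boundaries, key)

keys = [3, 10, 15, 20, 42]
by_range = [range_partition(k, [10, 20]) for k in keys]
by_hash = [hash_partition(k, 4) for k in keys]
```

Range partitioning keeps neighbouring keys together, which suits sorted output, while hash partitioning spreads keys without regard to order.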

Unit 17 : Performance Characteristics and Tuning
-----

* The Spark UI
* Narrow vs. Wide Dependencies
* Minimizing Data Processing
* Using Caching
* Using Broadcast Variables and Accumulators

Unit 18 : Spark GraphX Overview
-----

* Introduction
* Constructing Simple Graphs
* GraphX API
* Shortest Path Example

Unit 19 : Graph Processing with GraphFrames
-----

* Basic Graph Analysis
* GraphFrames API
* Transforming DataFrames into a graph
* Perform graph analysis
* PageRank

Unit 20 : Spark with AWS EMR
-----

* Understanding Hadoop on AWS EMR
* Relationship of Spark to Hadoop on EMR
* Importing data from S3
* Setting up a new EMR cluster
* Building a new cluster and initializing the data with EMR steps
* Visualizing and prototyping Spark with Zeppelin
