Process Data in real-time with Spark And Python

Big Data Processing with Spark And Python
=====

—–
* Course Id : BIGD-SKPY
* Duration : 32 Hours

Overview
—–
* Participants will learn Spark with Python
* This course teaches the basics of Python, then shows how to use Spark DataFrames with the latest Spark 2.0 syntax
* We also cover how to use the MLlib machine learning library with the DataFrame syntax
* Spark SQL, Spark Streaming, and advanced models such as Gradient Boosted Trees are also covered
* Attendees gain the ability to analyze very large data sets

Training Objectives
—–
All attendees will :
* Learn Spark with Python
* Apply new skills to solve a real-world big data problem
* Feel comfortable putting Spark and PySpark on the resume
* Analyze Big Data by using Python and Spark together
* Learn Spark 2.0 DataFrame Syntax
* Learn how to use Logistic Regression
* Use Spark with Random Forests for Classification
* Work with Spark’s Gradient Boosted Trees
* Create Powerful Machine Learning Models using Spark’s MLlib
* Learn about the Databricks Platform!

Pre-Requisites
—–
Basic knowledge of:
* Python programming
* Big Data concepts

Course Structure
—–
* Our technical courses emphasize hands-on work (typically 80% hands-on, 20% theory)
* Students learn to apply the material to real-world problems

Materials Provided
—–
All participants receive:
* PDF of slides
* PDF of hands-on material
* Access to an instance with the lab environment

Software Requirements
—–
Any of the following:
* Any current internet browser
* VNC client
* RDP client

Hardware Requirements
—–
* Processor: 1.2 GHz+
* RAM: 512 MB+
* Disk space: 1 GB+
* Network Connection with low latency (<250ms) to Internet

## Day-wise Course Outline
—–
## Day 1
—–
* Unit 1 : Introduction to Big Data Hadoop and Spark
* Unit 2 : Spark Architecture
* Unit 3 : Quick Introduction to Python
* Unit 4 : Python for Apache Spark (PySpark)

## Day 2
—–
* Unit 5 : Spark RDDs
* Unit 6 : Aggregating Data with Pair RDDs
* Unit 7 : Developing and Deploying Spark+Python Applications
* Unit 8 : Spark SQL

## Day 3
—–
* Unit 9 : Machine Learning
* Unit 10 : MLlib Overview
* Unit 11 : Deep Dive into Spark MLlib

## Day 4
—–
* Basics of Apache Kafka and Apache Flume – Generating Streams
* Unit 12 : Apache Spark Streaming – Processing Multiple Batches
* Unit 13 : Performance Characteristics and Tuning

## Detailed Big Data Processing with Spark And Python Outline
—–

Unit 1 : Introduction to Big Data Hadoop and Spark
——
* What is Big Data?
* Big Data Customer Scenarios
* Limitations of Existing Data Analytics Architecture
* What is Hadoop?
* Key Characteristics of Hadoop
* Hadoop Core Components
* Why is Spark needed?
* What is Spark?
* How does Spark differ from other frameworks?
* Spark vs. Hadoop
* Spark Ecosystem
* Spark installation guide
* Spark configuration
* The Spark Shell
* Writing your first Spark Job Using SBT
* Submitting Spark Job

Unit 2 : Spark Architecture
——
* Spark Components & its Architecture
* Spark Deployment Modes
* Memory management
* Executor memory vs. driver memory
* Working with Spark Shell
* Resilient Distributed Datasets (RDD)
* Functional programming in Spark
* Spark Web UI
* Data Ingestion using Sqoop
* Overview, Basic Driver Code, SparkConf
* Creating and Using a SparkContext
* RDD API
* Application Lifecycle
* Cluster Managers

Unit 3 : Quick Introduction to Python
——
* Introduction to Python
* Jupyter Notebook Overview
* Python Basics
* Python Flow Control
* Working with Dictionaries
* Error Handling
* Python Exercises

Unit 4 : Python for Apache Spark (PySpark)
——
* Functions
* Function Parameters
* Global Variables
* Variable Scope and Returning Values
* Lambda Functions
* Object-Oriented Concepts
* Standard Libraries
* Modules Used in Python
* The Import Statements
* Module Search Path
* Package Installation Ways
* Spark Components & its Architecture
* Spark Deployment Modes
* Introduction to PySpark Shell
* Submitting PySpark Job
* Spark Web UI
* Use Jupyter Notebook to write your first PySpark Job
* Data Ingestion using Sqoop
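
As a taste of the Python topics in this unit, here is a minimal sketch of lambda functions passed to `map`, `filter`, and `reduce`. It uses only the Python standard library, not Spark; the sample word list is made up for illustration. The same style of passing small functions around is what PySpark transformations rely on.

```python
from functools import reduce

# Hypothetical sample data, not taken from the course labs.
words = ["spark", "python", "spark", "rdd", "python", "spark"]

# A lambda is an anonymous function limited to a single expression.
to_pair = lambda w: (w, 1)

# map and filter accept a function as their first argument.
pairs = list(map(to_pair, words))
long_words = list(filter(lambda w: len(w) > 3, words))

# reduce folds a sequence into one value using a two-argument function.
total_chars = reduce(lambda a, b: a + b, map(len, words))

print(pairs[:2])
print(long_words)
print(total_chars)
```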

Unit 5 : Spark RDDs
——
* Spark RDD, Lifecycle, Lazy Evaluation
* Creating RDDs
* RDD partitioning
* Operations and transformation in RDD
* Caching – Storage Type, Guidelines
* Key-Value Pairs – Definition
* The RDD general operations
* RDD actions: collect, count, collectAsMap, saveAsTextFile
* Pair RDD functions

Unit 6 : Pair RDDs and Persistence
——
* Key-value pairs in RDDs: creation and operations
* How Spark makes MapReduce operations faster
* MapReduce interactive operations
* Spark stack
* The execution flow in Spark
* Understanding the RDD persistence overview
* Spark execution flow and Spark terminology
* RDD limitations
* Spark shell arguments
* Distributed persistence
* RDD lineage
* Key-value pair operations: countByKey, reduceByKey, sortByKey, aggregateByKey
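
To give a feel for the key-value operations named in this unit, the sketch below mimics their semantics in plain Python. It is not the Spark API: the helper functions, the sample `sales` pairs, and the two-partition split are all assumptions made for illustration, and a real cluster would run these steps in parallel across executors.

```python
from collections import defaultdict
from functools import reduce


def count_by_key(pairs):
    """Count how many values appear under each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


def reduce_by_key(func, pairs):
    """Merge all values of each key with a two-argument function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def sort_by_key(pairs, ascending=True):
    """Return the pairs ordered by key."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)


def aggregate_by_key(zero, seq_func, comb_func, partitions):
    """Fold values per key inside each partition, then merge the partial results.

    `zero` must be immutable here (for example a tuple), because it is reused.
    """
    merged = {}
    for part in partitions:
        partial = {}
        for key, value in part:
            partial[key] = seq_func(partial.get(key, zero), value)
        for key, acc in partial.items():
            merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged


# Hypothetical (key, value) pairs, split into two simulated partitions.
sales = [("a", 3), ("b", 5), ("a", 7), ("c", 1), ("b", 2)]
partitions = [sales[:3], sales[3:]]

counts = count_by_key(sales)
totals = reduce_by_key(lambda x, y: x + y, sales)
ordered = sort_by_key(sales)
sum_and_count = aggregate_by_key(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # within a partition
    lambda x, y: (x[0] + y[0], x[1] + y[1]),   # across partitions
    partitions,
)
print(counts, totals, ordered, sum_and_count)
```

The separate within-partition and across-partition functions are what distinguish an aggregateByKey-style operation from a plain reduce: the accumulator type, here a (sum, count) tuple, can differ from the value type.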

Unit 7 : Developing and Deploying Spark+Python Applications
——
* Spark Applications vs. Spark Shell
* Creating a Spark application
* Configuring Spark Properties
* Deploying a Spark application
* Building and Running a Spark Application
* Logging and Debugging
* Learning about Spark parallel processing
* Deploying on a cluster
* Introduction to Spark partitions
* File-based partitioning of RDDs
* Understanding of HDFS and data locality
* Mastering the technique of parallel operations

Unit 8 : Spark SQL
——
* Introduction and Usage
* Spark SQL Architecture
* SQLContext, DataFrames and DataSets
* Working with JSON
* Querying – The DataFrame DSL and SQL
* Spark DataFrame Basic Operations
* Groupby and Aggregate Operations
* Missing Data
* Data Formats
* User Defined Functions
* Interoperating with RDDs
* JSON and Parquet File Formats
* Loading Data through Different Sources
* Spark – Hive Integration
* Spark Internals
* Cluster Architecture
* The Catalyst query optimizer
* The Tungsten in-memory data format
* How the Spark scheduler works to execute jobs and tasks
* Shuffling, shuffle files
* How Spark handles data reads and writes

Unit 9 : Machine Learning
——
* Spark MLlib Pipeline API
* Built-in Featurizing and Algorithms
* Cross-Validation and Grid Search for Hyperparameter Tuning
* Evaluation Metrics
* Data Partitioning Strategies

Unit 10 : MLlib Overview
——
* Introduction
* Feature Vectors
* Introduction to K-Means
* Shared variables in Spark
* Broadcast variables
* Accumulators

Unit 11 : Deep Dive into Spark MLlib
——
* MLlib Algorithms
* Linear Regression
* Logistic Regression
* Decision Tree
* Random Forest
* Working with K-Means Clustering in MLlib
* Analysis on Airline Flight Data using MLlib (K-Means)
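
For readers new to K-Means, the idea behind the clustering topics above can be sketched in a few lines of plain Python. This is a toy version of the standard assign-then-update loop, not MLlib's implementation; the points and the starting centroids are invented for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: mean of each cluster; keep the old centroid if empty.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, [points[0], points[3]])
print(centroids)
print(assignments)
```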

Unit 12 : Apache Spark Streaming – Processing Multiple Batches
——
* Spark Streaming Architecture
* Writing a streaming program
* Processing Spark streams
* Processing Spark Discretized Stream (DStream)
* The context of Spark Streaming
* Streaming transformation
* Flume Spark streaming
* Request count and DStream
* Drawbacks in Existing Computing Methods
* Why is Streaming Necessary?
* What is Spark Streaming?
* Spark Streaming Features
* Spark Streaming Workflow
* Streaming Context & DStreams
* Transformations on DStreams
* Why Windowed Operators are Useful
* Important Windowed Operators
* Slice, Window and ReduceByWindow Operators
* Stateful Operators
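
The windowed operators in this unit combine several micro-batches at a time. The sketch below illustrates that idea in plain Python, with the window length and slide interval counted in batches rather than seconds. It is a simplified model, not the Spark Streaming API: the function name and the sample batches are assumptions, and the exact timing of the first windows in a real stream may differ.

```python
from functools import reduce


def reduce_by_window(batches, func, window_length, slide_interval):
    """Every `slide_interval` batches, reduce the values of the
    most recent `window_length` batches into a single result."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        values = [v for batch in window for v in batch]
        if values:
            results.append(reduce(func, values))
    return results


# Hypothetical micro-batches, e.g. request counts arriving per interval.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]

# Sum over the last 3 batches, recomputed every 2 batches.
window_sums = reduce_by_window(batches, lambda a, b: a + b, 3, 2)
print(window_sums)
```

Because the window is longer than the slide interval, consecutive windows overlap, so the batch `[3]` contributes to both results.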

Unit 13 : Performance Characteristics and Tuning
——
* Scheduling and partitioning in Spark
* Hash partition
* Range partition
* Scheduling within and around applications
* Static partitioning
* Dynamic sharing
* Fair scheduling
* mapPartitionsWithIndex
* GroupByKey
* The Spark UI
* Narrow vs. Wide Dependencies
* Minimizing Data Processing
* Using Caching
* Using Broadcast Variables and Accumulators
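
The difference between hash and range partitioning, both listed in this unit, can be shown with a short plain-Python sketch. The function names, the CRC32-based hash, and the boundary values are assumptions chosen for illustration; Spark uses its own hash function and samples the data to pick range boundaries.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Pick a partition from a stable hash of the key.
    CRC32 is used here only because it is deterministic across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, boundaries):
    """Pick a partition by locating the key among sorted upper bounds.
    Keys <= boundaries[0] go to partition 0, and so on."""
    return bisect.bisect_left(boundaries, key)


# Hypothetical keys and boundaries.
keys = [3, 10, 11, 20, 25]
boundaries = [10, 20]  # three partitions: <=10, 11..20, >20

by_range = [range_partition(k, boundaries) for k in keys]
by_hash = [hash_partition(k, 3) for k in keys]
print(by_range)
print(by_hash)
```

Range partitioning keeps neighbouring keys together, which suits sorted output, while hash partitioning scatters keys and tends to spread load more evenly when key values are clustered.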