Apache Spark And Scala Training

Working With Spark And Scala
=====

—–
* Course Id : BIGD-SSLA
* Duration : 32 Hours

Overview
—–
* Using Apache Spark, participants can build complete, unified big data applications that combine batch, streaming, and interactive analytics
* Spark helps developers write sophisticated parallel applications that support faster decisions and real-time actions
* Spark based solutions can be applied to a wide variety of industries, architectures and use cases

Pre-Requisites
—–
All attendees should be familiar with :
* The Linux environment
* Basic Big Data concepts

Objectives
—–
All attendees will :
* Understand Big Data and its components
* Learn about the difference between batch processing and real-time processing
* Understand the features of Spark’s Resilient Distributed Datasets
* Work with the Spark shell for interactive data analysis
* Learn how to use Scala with Spark for faster data processing
* Learn about Scala constructs
* Understand how Spark runs on a cluster
* Write Spark applications

Course Structure
—–
* Our technical courses emphasize hands-on work (typically 80% hands-on, 20% theory)
* Students learn to apply the material to real-world problems

Materials Provided
—–
* PDF of slides and hands-on exercises
* Access to instance with pre-configured lab environment

Software Requirements
—–
Any of the following:
* Any current internet browser
* VNC client
* RDP client

Hardware Requirements
—–
* Processor: 1.2 GHz
* RAM: 512 MB
* Disk space: 1 GB
* Network Connection with low latency (<250ms) to Internet

## Daywise Apache Spark And Scala Course Outline
—–
## Day 1
—–
* Unit 1 : Introduction to Big Data, Hadoop, and Spark
* Unit 2 : Spark Architecture
* Unit 3 : Introduction to Scala for Apache Spark

## Day 2
—–
* Unit 4 : Functional Programming in Scala
* Unit 5 : Spark RDDs
* Unit 6 : Aggregating Data with Pair RDDs

## Day 3
—–
* Unit 7 : Developing and Deploying Spark+Scala Applications
* Unit 8 : Spark SQL
* Unit 9 : Machine Learning

## Day 4
—–
* Unit 10 : MLlib Overview
* Unit 11 : Deep Dive into Spark MLlib
* Unit 12 : Basics of Apache Kafka and Apache Flume – Generating Streams

## Day 5
—–
* Unit 13 : Apache Spark Streaming – Processing Multiple Batches
* Unit 14 : Performance Characteristics and Tuning

## Detailed Apache Spark And Scala Outline
—–

Unit 1 : Introduction to Big Data, Hadoop, and Spark
——
* What is Big Data?
* Big Data Customer Scenarios
* Limitations of Existing Data Analytics Architecture
* What is Hadoop?
* Key Characteristics of Hadoop
* Hadoop Core Components
* Why is Spark needed?
* What is Spark?
* How does Spark differ from other frameworks?
* Spark vs. Hadoop
* Spark Ecosystem
* Spark installation guide
* Spark configuration
* The Spark Shell
* Writing Your First Spark Job Using sbt
* Submitting a Spark Job

Unit 2 : Spark Architecture
——
* Spark Components and Architecture
* Spark Deployment Modes
* Memory management
* Executor memory vs. driver memory
* Working with Spark Shell
* Resilient Distributed Datasets (RDD)
* Functional programming in Spark
* Spark Web UI
* Data Ingestion using Sqoop
* Overview, Basic Driver Code, SparkConf
* Creating and Using a SparkContext
* RDD API
* Application Lifecycle
* Cluster Managers

Unit 3 : Introduction to Scala for Apache Spark
——
* What is Scala?
* Why Scala for Spark?
* Scala in other Frameworks
* Introduction to Scala REPL
* Basic Scala Operations
* Variable Types in Scala
* Control Structures in Scala
* Foreach loop, Functions and Procedures
* Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more
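
As a rough illustration of the topics listed for this unit, the sketch below touches variables, control structures, a function, a procedure, and the common collection types. It is not taken from the course materials; the object name `ScalaBasicsSketch` and all sample values are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical example, not part of the official course material.
object ScalaBasicsSketch {
  // A function returns a value; if/else is itself an expression.
  def parity(n: Int): String = if (n % 2 == 0) "even" else "odd"

  // A procedure-style method returns Unit and is called for its side effect.
  def report(label: String, value: Any): Unit = println(s"$label: $value")

  // A mutable variable (var) updated inside a for loop.
  def sum(xs: Array[Int]): Int = {
    var total = 0
    for (x <- xs) total += x
    total
  }

  def main(args: Array[String]): Unit = {
    val fixed = Array(1, 2, 3)                        // fixed-size array
    val growing = ArrayBuffer(1, 2)                   // resizable buffer
    growing += 3
    val modes = Map("batch" -> 1, "streaming" -> 2)   // immutable map
    val pair = ("partitions", 8)                      // tuple
    val topics = List("Scala", "RDDs", "Spark SQL")   // immutable list

    topics.foreach(t => report("topic", t))           // foreach with an anonymous function
    report("sum", sum(fixed))
    report("parity", parity(sum(fixed)))
    report("buffer size", growing.length)
    report("streaming id", modes("streaming"))
    report(pair._1, pair._2)
  }
}
```

The same statements can be typed line by line into the Scala REPL to inspect each value as it is created.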

Unit 4 : Functional Programming in Scala
——
* Functional Programming
* Higher Order Functions
* Anonymous Functions
* Classes in Scala
* Getters and Setters
* Custom Getters and Setters
* Properties with only Getters
* Auxiliary Constructor and Primary Constructor
* Singletons
* Extending a Class
* Overriding Methods
* Traits as Interfaces and Layered Traits
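
To give a sense of how several of these topics fit together, here is a minimal sketch combining a higher-order function, an anonymous function, a singleton object, a trait used as an interface, and layered traits. All names (`Logger`, `ConsoleLogger`, `UpperCaseLogger`, `PrefixLogger`, `Functions`, `applyTwice`) are hypothetical and chosen only for illustration.

```scala
// Hypothetical example, not part of the official course material.

// A trait used as an interface: one abstract method.
trait Logger {
  def log(msg: String): String
}

// A class implementing the interface.
class ConsoleLogger extends Logger {
  def log(msg: String): String = msg
}

// Layered traits: each one transforms the message, then delegates to super.log.
trait UpperCaseLogger extends Logger {
  abstract override def log(msg: String): String = super.log(msg.toUpperCase)
}

trait PrefixLogger extends Logger {
  abstract override def log(msg: String): String = super.log("[info] " + msg)
}

// A singleton object holding a higher-order function.
object Functions {
  // Takes a function f as an argument and applies it twice.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
}

object TraitsSketch {
  def main(args: Array[String]): Unit = {
    // An anonymous function passed to a higher-order function: (2 * 3) * 3.
    println(Functions.applyTwice(n => n * 3, 2)) // prints 18

    // The rightmost trait runs first: PrefixLogger, then UpperCaseLogger, then ConsoleLogger.
    val logger = new ConsoleLogger with UpperCaseLogger with PrefixLogger
    println(logger.log("job started")) // prints [INFO] JOB STARTED
  }
}
```

Swapping the order of the two traits in the `new ... with ... with ...` expression changes the result, which is a quick way to explore how layered traits are linearized.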

Unit 5 : Spark RDDs
——
* Spark RDD, Lifecycle, Lazy Evaluation
* Creating RDDs
* RDD partitioning
* Operations and transformations on RDDs
* Caching – Storage Type, Guidelines
* Key-Value Pairs – Definition
* The RDD general operations
* RDD actions: collect, count, collectAsMap, and saveAsTextFile
* Pair RDD functions

Unit 6 : Pair RDDs and Persistence
——
* Key-value pairs in RDDs: creation and operations
* How Spark makes MapReduce operations faster
* MapReduce interactive operations
* Spark stack
* The execution flow in Spark
* RDD persistence overview
* Spark execution flow and Spark terminology
* RDD limitations
* Spark shell arguments
* Distributed persistence
* RDD lineage
* Key-value pair operations: countByKey, reduceByKey, sortByKey, and aggregateByKey
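
The key-value operations named in this unit can be sketched as follows. This is a minimal example, assuming `spark-core` is on the classpath and Spark runs in local mode; the object name `PairRddSketch`, the helper names, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example, not part of the official course material.
object PairRddSketch {
  // reduceByKey combines the values of each key; sortByKey then orders the pairs by key.
  def totals(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _).sortByKey()

  // aggregateByKey lets the result type differ from the value type:
  // here each key maps to a (sum, count) tuple.
  def sumAndCount(pairs: RDD[(String, Int)]): RDD[(String, (Int, Int))] =
    pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // fold one value into a partition-local accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators from different partitions
    )

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; the application name is arbitrary.
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sales = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 4), ("plums", 1)))

      // countByKey is an action: it returns a Map of key -> record count to the driver.
      println(sales.countByKey())

      totals(sales).collect().foreach(println)
      sumAndCount(sales).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that `countByKey` returns its result to the driver immediately, while `reduceByKey`, `sortByKey`, and `aggregateByKey` are lazy transformations that run only when an action such as `collect` is called.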

Unit 7 : Developing and Deploying Spark+Scala Applications
——
* Spark Applications vs. Spark Shell
* Creating a Spark application
* Configuring Spark Properties
* Deploying a Spark application
* Building and Running a Spark Application
* Logging and Debugging
* Spark parallel processing
* Deploying on a cluster
* Introduction to Spark partitions
* File-based partitioning of RDDs
* HDFS and data locality
* Techniques for parallel operations

Unit 8 : Spark SQL
——
* Introduction and Usage
* Spark SQL Architecture
* SQLContext, DataFrames, and Datasets
* Working with JSON
* Querying – The DataFrame DSL and SQL
* Data Formats
* User Defined Functions
* Interoperating with RDDs
* JSON and Parquet File Formats
* Loading Data through Different Sources
* Spark – Hive Integration
* Spark Internals
* Cluster Architecture
* The Catalyst query optimizer
* The Tungsten in-memory data format
* How the Spark scheduler works to execute jobs and tasks
* Shuffling, shuffle files
* How Spark handles data reads and writes

Unit 9 : Machine Learning
——
* Spark MLlib Pipeline API
* Built-in Featurizing and Algorithms
* Cross-Validation and Grid Search for Hyperparameter Tuning
* Evaluation Metrics
* Data Partitioning Strategies

Unit 10 : MLlib Overview
——
* Introduction
* Feature Vectors
* Introduction to K-Means
* Shared variables in Spark
* Broadcast variables
* Accumulators

Unit 11 : Deep Dive into Spark MLlib
——
* MLlib Algorithms
* Linear Regression
* Logistic Regression
* Decision Tree
* Random Forest
* Working with K-Means Clustering in MLlib
* Analysis on Airline Flight Data using MLlib (K-Means)

Unit 12 : Data Ingestion with Apache Kafka and Apache Flume
——
* Need for Kafka
* What is Kafka?
* Core Concepts of Kafka
* Kafka Architecture
* Where is Kafka Used?
* Understanding the Components of Kafka Cluster
* Configuring Kafka Cluster
* Kafka Producer and Consumer Java API
* Need for Apache Flume
* What is Apache Flume?
* Basic Flume Architecture
* Flume Sources
* Flume Sinks
* Flume Channels
* Flume Configuration
* Integrating Apache Flume and Apache Kafka

Unit 13 : Apache Spark Streaming – Processing Multiple Batches
——
* Spark Streaming Architecture
* Writing streaming programs
* Processing Spark streams
* Processing Spark Discretized Stream (DStream)
* The context of Spark Streaming
* Streaming transformation
* Spark Streaming with Flume
* Request count and DStream
* Drawbacks in Existing Computing Methods
* Why is Streaming Necessary?
* What is Spark Streaming?
* Spark Streaming Features
* Spark Streaming Workflow
* Streaming Context & DStreams
* Transformations on DStreams
* How Windowed Operators Are Useful
* Important Windowed Operators
* The slice, window, and reduceByWindow Operators
* Stateful Operators

Unit 14 : Performance Characteristics and Tuning
——
* Scheduling and partitioning in Spark
* Hash partition
* Range partition
* Scheduling within and around applications
* Static partitioning
* Dynamic sharing
* Fair scheduling
* Mapping partitions with their index (mapPartitionsWithIndex)
* groupByKey
* The Spark UI
* Narrow vs. Wide Dependencies
* Minimizing Data Processing
* Using Caching
* Using Broadcast Variables and Accumulators