Apache Spark and Scala

About Course

This course is designed to prepare you for the Cloudera CCA Spark and Hadoop Developer certification exam (CCA175). It covers Apache Spark and the Spark ecosystem, which includes Spark RDD, Spark SQL, Spark MLlib, and Spark Streaming. You will also learn the Scala programming language, HDFS, Spark GraphX, and messaging systems such as Kafka.

Course Description:

Learn the latest Big Data technology, Spark, and learn to use it to analyze huge data sets. This course is specifically designed to bring you up to speed on one of the best technologies for this task, Apache Spark. Top technology companies and organizations such as Google, Facebook, Netflix, Amazon, and NASA use Spark to solve their big data problems.

Spark can run some in-memory workloads up to 100x faster than Hadoop MapReduce.

This course teaches the basics of Scala and then moves on to using Spark DataFrames with the Spark 2.x syntax. After that, we go through how to use the MLlib machine learning library with the DataFrame syntax.

We also cover the latest Spark technologies, such as Spark SQL, Spark Streaming, and Spark ML.

Course Objectives / What You Will Learn

HDFS Commands
Scala Fundamentals
Core Spark – Transformations and Actions (RDD)
Spark SQL and Data Frames
Spark Streaming analytics using Kafka and Flume
Spark ML

Pre-Requisites/Requirements:

Basic Programming Skills

Target Audience:

Developers and Architects
BI/ETL/DW Professionals
Senior IT Professionals
Testing Professionals and Mainframe Professionals
Freshers and Big Data Enthusiasts

1. Introduction to Big Data Hadoop and Spark

Learning Objectives: Understand Big Data and its components such as HDFS. You will learn about the Hadoop Cluster Architecture. You will also get an introduction to Spark and the difference between batch processing and real-time processing.

Topics:

What is Big Data?
Big Data Customer Scenarios
What is Hadoop?
Hadoop’s Key Characteristics
Hadoop Ecosystem and HDFS
Hadoop Core Components
Rack Awareness and Block Replication
YARN and its Advantages
Hadoop Cluster and its Architecture
Hadoop: Different Cluster Modes
Big Data Analytics with Batch & Real-time Processing
Why is Spark needed?
What is Spark?
How does Spark differ from other frameworks?

Hands-on: Scala REPL Detailed Demo.

2. Introduction to Scala

Learning Objectives: Learn the basics of Scala that are required for programming Spark applications. Also learn about the basic constructs of Scala such as variable types, control structures, collections such as Array, ArrayBuffer, Map, Lists, and many more.

Topics:

What is Scala?
Why Scala for Spark?
Scala in other Frameworks
Introduction to Scala REPL
Basic Scala Operations
Variable Types in Scala
Control Structures in Scala
Foreach loop, Functions and Procedures
Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more

Hands-on: Scala REPL Detailed Demo

3. Object Oriented Scala and Functional Programming Concepts

Learning Objectives: Learn about object-oriented programming and functional programming techniques in Scala.

Topics

Variables in Scala
Methods, classes, and objects in Scala
Packages and package objects
Traits and trait linearization
Java Interoperability

Introduction to functional programming
Functional Scala for data scientists
Why are functional programming and Scala important for learning Spark?
Pure functions and higher-order functions
Using higher-order functions
Error handling in functional Scala
Functional programming and data mutability

Hands-on: OOP Concepts and Functional Programming

4. Collection APIs

Learning Objectives: Learn about the Scala collection APIs, types and hierarchies. Also, learn about performance characteristics.

Topics

Scala collection APIs
Types and hierarchies
Performance characteristics
Java interoperability
Using Scala implicits

5. Introduction to Spark

Learning Objectives: Understand Apache Spark and learn how to develop Spark applications.

Topics:

Introduction to data analytics
Introduction to big data
Distributed computing using Apache Hadoop
Introducing Apache Spark
Apache Spark installation
Spark Applications
The backbone of Spark – RDD
Loading Data
What is a Lambda?
Using the Spark shell
Actions and Transformations
Associative Property
Implant on Data
Persistence
Caching
Loading and Saving data

Hands-on:

Building and Running Spark Applications
Spark Application Web UI
Configuring Spark Properties

6. Operations of RDD

Learning Objectives: Gain insight into Spark RDDs and RDD-related manipulations for implementing business logic (transformations, actions, and functions performed on RDDs).

Topics

Challenges in Existing Computing Methods
Probable Solution & How RDD Solves the Problem
What is an RDD? Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs

Key-Value Pair RDDs
Other Pair RDDs, Two Pair RDDs
RDD Lineage
RDD Persistence
WordCount Program Using RDD Concepts
RDD Partitioning & How It Helps Achieve Parallelization
Passing Functions to Spark

Hands-on:

Loading data in RDD
Saving data through RDDs
RDD Transformations
RDD Actions and Functions
RDD Partitions
WordCount through RDDs

7. DataFrames and Spark SQL

Learning Objectives: Learn about Spark SQL, which is used to process structured data with SQL queries. Learn about DataFrames and Datasets in Spark SQL, along with the different kinds of SQL operations performed on DataFrames. Also, learn about Spark and Hive integration.

Topics

Need for Spark SQL
What is Spark SQL?
Spark SQL Architecture
SQL Context in Spark SQL
User Defined Functions
Data Frames & Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources
Spark – Hive Integration

Hands-on:

Spark SQL – Creating Data Frames
Loading and Transforming Data through Different Sources
Spark-Hive Integration

8. Machine learning using MLlib

Learning Objectives: Learn why machine learning is needed, the different machine learning techniques and algorithms, and Spark MLlib.

Topics

Why Machine Learning?
What is Machine Learning?
Where is Machine Learning Used?
Different Types of Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML algorithms supported by MLlib
Optimization Techniques

9. Using Spark MLlib

Learning Objectives: Implement various algorithms supported by MLlib, such as Linear Regression, Decision Tree, and Random Forest.

Topics

Supervised Learning – Linear Regression, Logistic Regression, Decision Tree, Random Forest
Unsupervised Learning – K-Means Clustering

Hands-on:

Machine Learning MLlib
K-Means Clustering
Linear Regression
Logistic Regression
Decision Tree
Random Forest

10. Streaming with Kafka and Flume

Learning Objectives: Understand Kafka and its architecture. Also, learn about the Kafka cluster and how to configure different types of Kafka clusters. Get introduced to Apache Flume, its architecture, and how it is integrated with Apache Kafka for event processing. Finally, learn how to ingest streaming data using Flume.

Topics

Need for Kafka
What is Kafka?
Core Concepts of Kafka
Kafka Architecture
Where is Kafka Used?
Understanding the Components of Kafka Cluster
Configuring Kafka Cluster
Kafka Producer and Consumer Java API
Need for Apache Flume
What is Apache Flume?
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration
Integrating Apache Flume and Apache Kafka

Hands-on:

Configuring Single Node Single Broker Cluster
Configuring Single Node Multi Broker Cluster
Producing and consuming messages
Flume Commands
Setting up Flume Agent

11. Apache Spark Streaming

Learning Objectives: Learn about the different streaming data sources such as Kafka and Flume. Also, learn to create a Spark streaming application.

Topics

Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources

Hands-on:

Perform Twitter Sentiment Analysis Using Spark Streaming

12. Spark GraphX Programming

Learning Objectives: Learn the key concepts of Spark GraphX programming and operations along with different GraphX algorithms and their implementations.

Topics

A brief introduction to graph theory
GraphX
VertexRDD and EdgeRDD
Graph operators
Pregel API
PageRank

Performance Tuning In Spark
