Course Overview

This is an intermediate-level course in the “Machine Learning and Artificial Intelligence” learning path. It is designed to give participants exposure to Scalable Machine Learning. The course covers Spark Core, Spark SQL, Spark Streaming and Spark ML in detail, and provides a gentle introduction to Deep Learning.

Who should attend

This program is designed for those who aspire to Data/ML/AI roles:

Data Engineers
Data Scientists
Machine Learning Engineers
Data Integration Engineers
Data Architects

Course Objectives

Understand the role of Spark in Machine Learning
Gain hands-on experience in Data Acquisition, Processing, Analysis and Modeling using the Cloudera distribution of Hadoop and Spark
Work with common types of data (e.g. CSV, XML, JSON, Social Media data) for pre-processing and/or building Machine Learning Models using Spark
Understand how Deep Cognition helps in performing Deep Learning
Get exposure to Deep Learning using Deep Cognition Studio
Build Deep Learning Models using Deep Cognition Studio even without knowledge of Statistics

Outline: Scalable Machine Learning and Deep Learning (SMLDL)

Understanding the Big Picture

Artificial Intelligence (AI) Overview
AI vs ML vs Data Science
The relationship between Deep Learning (DL) and Machine Learning
Practical Use cases
Concepts and Terms
Tools/Platforms for Scalable ML, DL, and AI
How Big Data and Cloud fit into the Ecosystem

Introduction to Scalable Machine Learning

What is Scalable Machine Learning?
Why is it required?
Key platforms for performing Scalable Machine Learning
Scalable Machine Learning Project End to End Pipeline
Spark Introduction
Why Spark for Scalable Machine Learning?
Databricks Platform Demo
Approaches for scaling scikit-learn code
Hands-on Exercise(s): Experiencing the first notebook

Why Spark for Scalable Machine Learning (SML)?

Quick Recap/Introduction to Hadoop
Logical View of Cloudera Distribution
Big Data Analytics Pipelines
Components in Cloudera Distribution for performing SML
Hands-on Exercise(s)

Scalable Machine Learning on Enterprise Platform

Data Acquisition at Scale

Acquiring Structured content from Relational Databases
Acquiring Semi-structured content from Log Files
Acquiring Unstructured content from other key sources like Web
Tools for Performing Data acquisition at Scale
Introduction to Sqoop, Flume and Kafka: use cases and architectures
Hands-on Exercise(s)

Data Pre-Processing for Modeling

Using the Spark Shell
Resilient Distributed Datasets (RDDs)
Functional Programming with Spark
RDD Operations
Key-Value Pair RDDs
MapReduce and Pair RDD Operations
Building and Running a Spark Application
Performing Data Validation
Data De-Duplication
Detecting Outliers
Hands-on Exercise(s)

Working with Iterative Algorithms

Dealing with RDD Infinite Lineages
Caching Overview
Distributed Persistence
Checkpointing of an Iterative Machine Learning Algorithm
Hands-on Exercise(s)

Spark SQL

Introduction
DataFrame API
Performing ad-hoc query analysis using Spark SQL
Hands-on Exercise(s)

Spark Machine Learning using MLlib

Spark ML vs Spark MLlib
Data types and key terms
Feature Extraction
Linear Regression using Spark MLlib
Hands-on Exercise(s)

Spark Machine Learning using ML

Spark ML Overview
Transformers and Estimators
Pipelines
Implementing Decision Trees
K-Means Clustering using Spark ML
Hands-on Exercise(s)

Natural Language Processing

What is Natural Language Processing?
The NLTK package
Preparing text for analysis
Text summarisation
Sentiment analysis
Naïve Bayes technique
Text classification
Topic Modelling
Hands-on Exercise(s)

Model Evaluation, Optimization and Deployment

Model Evaluation
Optimizing a Model
Deploying a Model
Best Practices

Decision Trees and Random Forest

Types – Classification and Regression trees
Gini Index, Entropy and Information Gain
Building Decision Trees
Pruning the trees
Prediction using Trees
Ensemble Models
Bagging and Boosting
Advantages of using Random Forest
Working with Random Forest
Ensemble Learning
How ensemble learning works
Building models using Bagging
Random Forest algorithm
Random Forest model building
Fine tuning hyper-parameters
Hands-on Exercise(s)
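
As an illustration of the impurity measures named in this module (Gini Index, Entropy and Information Gain), a minimal sketch in plain Python follows. It is not Spark code and is not taken from the course material; the function names and the toy "yes"/"no" labels are assumptions chosen for illustration only.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy data: a split that separates the two classes perfectly.
parent = ["yes", "yes", "no", "no"]
left = ["yes", "yes"]
right = ["no", "no"]

print("Gini of parent:", gini(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

A decision tree learner evaluates candidate splits with a measure of this kind and prefers the split that reduces impurity the most; library implementations differ in detail.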

Real-time Analytics

Real-time data acquisition using Kafka
Salient Features of Kafka
Kafka Use cases
Comparing Kafka with other Key tools
End to End Data Pipeline using Kafka
Integrating Kafka with Spark Streaming
Hands-on Exercise(s)

Introduction to Deep Learning

What is Deep Learning?
Deep Learning Architecture
Deep Learning Frameworks
The relationship between Deep Learning and Machine Learning
Deep Learning Use cases
Concepts and Terms
How to implement Deep Learning?

Working with Deep Cognition Studio (DCS)

Deep Cognition Introduction
Why Deep Cognition Studio?
Walkthrough of Deep Learning Studio
Multilayer Perceptron in Deep Cognition
How does a single artificial neuron work?
Computation Graph
Activation Functions
Importance of non-linear activation
Data encoding for deep neural networks
Hands-on Exercise(s)
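
As an illustration of the topic "How does a single artificial neuron work?", a minimal sketch in plain Python follows. It is independent of Deep Cognition Studio and is not taken from the course material; the function names, weights and inputs are assumptions chosen for illustration only.

```python
from math import exp


def sigmoid(z):
    """Non-linear activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def relu(z):
    """Non-linear activation that passes positive values and zeroes out negatives."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical inputs and weights, for illustration only.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]

print("sigmoid neuron:", neuron(inputs, weights, 0.0, sigmoid))
print("relu neuron:", neuron(inputs, weights, -1.0, relu))
```

Without a non-linear activation, stacking such neurons in layers would still compute only a linear function of the inputs, which is why non-linear activations matter in a multilayer perceptron.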

Building Convolutional Neural Networks in DCS

Convolutional Neural Networks
Components of CNN
Data augmentation
Transfer learning for using pre-trained networks
Hands-on Exercise(s)

Hands-on Exercises:

Spark
- Running an application on YARN
- Interactive Data Exploration using Spark
- Working with Pair RDDs
- Dealing with XML files in Spark
- Processing JSON data in Spark
- Processing Log file data in Spark
- Caching in Spark
- Data Deduplication
- Using Broadcast Variables
- Using Accumulators
- Working with DataFrame API
- Spark SQL – Multiple exercises
- Spark Streaming: Part 1
- Spark Streaming: Part 2
- Integrating Kafka and Spark Streaming
Spark ML
- Vector
- StringIndexer and OneHotEncoder
- SQLTransformer
- Pipeline
- Imputer
- Sparkml_pca
- Decision tree classification
- VectorAssembler
- K-Means to analyze hacking attacks
- End-to-end Spark ML pipeline using a decision tree
- Cross-validation
- Naive Bayes classification
- NLP and NLTK basics
- Decision tree classification example
- Lab sentiment_analysis
- Logistic regression
- RFormula
- Support vector machine
- Linear regression
- Predict customer churn
- Random forest classification
Deep Cognition
- Experiencing the first Deep Neural Network
- Emotion Analysis
