Hadoop
The Big Data Hadoop course is designed to impart in-depth knowledge of Big Data processing using Hadoop. It is packed with real-life projects and case studies.
  • Next batch: 28 Apr, 2018
  • 80 hrs
  • 8 weekends (Sat & Sun)
About the Course
Hadoop Introduction
Hadoop Architecture and Components
Hadoop Cluster Setup
HDFS
YARN
MapReduce
MapReduce Algorithms
MapReduce Best Practices
Managing Hadoop Cluster
Pig
Hive
HBase
Sqoop
Oozie
Flume
ZooKeeper
Hadoop Ecosystem components
Analysing unstructured data (movie reviews, food reviews & Twitter) for trends, sentiment analysis, and topic modelling
Motivation for Hadoop and NoSQL (Big Data, Scalability, Problems with Traditional Systems, Distributed Systems)
An Overview of Hadoop
Comparison with SQL Databases / Data Warehouses / Other Distributed Systems
Hadoop Distributed File System
MapReduce Programming model
YARN – Resource Management
Hadoop Common Utilities
Hadoop 1.0 vs. Hadoop 2.0
Hadoop Ecosystem Components
Hadoop Architecture (1.0 and 2.0)
NameNode (NN) | DataNode (DN)
JobTracker (JT)
TaskTracker (TT)
Secondary NameNode (SNN)
Backup Node (BN)
Checkpoint Node (CN)
ResourceManager (RM)
NodeManager (NM)
ApplicationMaster (AM)
JobHistory Server (JHS) | Timeline Server
Apache Hadoop Distribution (Prerequisite Software Installation, Configuration details, Local mode, Pseudo-distributed mode, Fully distributed mode)
Cloudera Distribution - CDH (Prerequisite Software Installation, Cloudera Manager Installation, Creating CDH cluster (Parcels, Packages))
Hortonworks Distribution - HDP (Prerequisite Software Installation, Apache Ambari Installation, Creating HDP cluster)
Planning of Hadoop Clusters (Hardware Details, Dev clusters, Testing Clusters, Production Clusters)
Overview of HDFS
HDFS Architecture & Internals
HDFS Data Organization
Basic File System Operations
HDFS Commands [Admin + User]
HDFS Java Client API
Data Integrity, Compression, Data Archival
High Availability, Federation, Encryption
Data Backups, Short Circuit Reads, ACLs, Quotas
Upgrades, Storage Policies, Data Balancing, Snapshots
Web Interface, WebHDFS, HttpFS
HDFS Compatible File Systems
HDFS Metrics
Overview of YARN
YARN vs. MapReduce (Hadoop 1.0)
YARN Architecture and Internals
Resource Schedulers
YARN High Availability
YARN Commands
YARN Web Interface, YARN REST API
YARN Metrics
YARN Applications
Overview of MapReduce
Hadoop MapReduce Architecture and Internals
Difference between MR1 & MR2
Hadoop MapReduce API Concepts
Mapper, Reducer, Partitioner, Shuffle, Combiner, Sorting, Counters
Hadoop MapReduce Data Flow
Hadoop MapReduce Job Template
Hadoop Data Types
Hadoop Serialization
Distributed Cache, Speculative Execution, Data Locality
Hadoop File Formats - SequenceFile, MapFile, Avro, etc.
Hadoop Streaming – Non-Java MapReduce Programming
Custom Data Types, Partitioners, Input/Output Formats
MapReduce Application Master
MapReduce Job History Server
MapReduce Commands
MapReduce Joins
MapReduce Hands On
NoSQL CAP theorem description
Cassandra Internals - columnar data store, high throughput, heavy loads, concurrent writes
Use cases on Spark – Cassandra
Developing MapReduce Programs
Integration with Eclipse IDE
Monitoring MapReduce jobs
Configuration Tuning
Debugging MapReduce Jobs
Task Profiling
Performance tuning
Sending Job specific parameters
Unit Testing with MRUnit
Provisioning and Monitoring Cluster
Configuration Management
Cluster Health Management
Cluster Metrics
Security
Commissioning - Decommissioning
Overview of Pig
Installation
Architecture and components
Pig Engine
Grunt
Pig Latin (Operators – Functions (UDFs) – sds – Macros – Data types – Storage types – Language constructs – Parameter substitution – Pig Commands – Pig unit testing)
Pig administration
Best practices
Overview of Hive
Installation
Architecture and components
Hive Query Language [HQL] (DDL – DML – DQL – DCL)
Functions (UDFs)
Views
Joins
Partitioning
Bucketing
Indexing
PL/HQL
Parameter Substitution
Hive Commands
Storage handlers
File Formats and SerDe
Hive over Tez and Spark
HCatalog | Hive security
Overview of HBase
Architecture and components
Installation
Data model (Conceptual view – Physical view)
DB operations
HBase clients (real-time & batch)
HBase Query Language (DDL – DML – DQL – DCL)
HBase tools (hbck – compaction – region splits and merge – WAL – Snapshots – Replication – Backups)
Hotspotting
HBase administration
HBase security
HBase integrations (Phoenix – Spark – MapReduce)
Overview of Sqoop
Installation
Commands (Import – Export – Job – Merge – Metastore – Eval – Codegen)
Overview of Oozie
Architecture and components
Installation
Oozie internals (Bundle engine – Coordinator engine – Workflow engine)
Workflow (Control nodes – Action nodes)
Oozie language – hPDL with a few examples
Overview of Flume
Architecture and components
Data flow
Events
Agents (Sources – Channels – Sinks)
Interceptors
ZooKeeper (1 hour)
Ambari
Avro
Accumulo
Spark
Flink
Mahout
DataFu
Kafka/Chukwa/Falcon
Drill/Impala
Lipstick
Sentry/Knox
Tajo/Presto
Giraph
Kudu
Thrift
MADlib
Hue
ORC
Gora
Cassandra
Tez
Lucene/Nutch
Parquet
Singa
Trainer details
Naga Mallikarjun Ineni
Open source contributor. 6+ years in Hadoop ecosystem projects. Delivered training to 20+ batches / 600+ people since 2012.