2013: year of real-time access to Big Data?

              Geoffrey Hendrey
               @geoffhendrey
                @vertascale
Agenda

•   Hadoop MapReduce basics
•   Hadoop stack & data formats
•   File access times and mechanics
•   Key-based indexing systems (HBase)
•   MapReduce, Hive/Pig
•   MPP approaches & alternatives
A very bad* diagram




*This diagram makes it appear that data flows through the master node.
A better picture
Job Configuration
http://wiki.apache.org/hadoop/WordCount
Map and Reduce Java Code
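The code image from the original slide is not reproduced here. As a substitute, here is a minimal sketch of WordCount’s map and reduce methods in the spirit of the wiki example linked above, written against the org.apache.hadoop.mapreduce API:

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
       private final static IntWritable one = new IntWritable(1);
       private final Text word = new Text();
       protected void map(LongWritable key, Text value, Context ctx)
             throws IOException, InterruptedException {
          // emit (token, 1) for every word in the input line
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
             word.set(tok.nextToken());
             ctx.write(word, one);
          }
       }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
       protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
             throws IOException, InterruptedException {
          int sum = 0;                 // sum the 1s grouped under this word
          for (IntWritable val : values) {
             sum += val.get();
          }
          ctx.write(key, new IntWritable(sum));
       }
    }
 }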
Reduce
Reducer Group Iterators
• Reducer groups values together by key
• Your code iterates over the values and emits the
  reduced result

              Bear: [1,1]   →   Bear: 2


• Hadoop reducer value iterators return THE SAME
  OBJECT on each call to next(). The object is “reused” to
  reduce garbage-collection load
• Beware of “reused” objects (this is a VERY common
  cause of long and confusing debugging sessions)
• Cause for concern: you are emitting an object with
  non-primitive fields. STALE “reused object” state from a
  previous value can leak into your output (see the
  sketch below)
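A minimal sketch of the pitfall (class and value types hypothetical): a reducer that buffers its values must copy them, because every iteration hands back the same reused instance.

 import java.util.ArrayList;
 import java.util.List;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Reducer;

 public class BufferingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) {
       List<IntWritable> broken = new ArrayList<IntWritable>();
       List<IntWritable> correct = new ArrayList<IntWritable>();
       for (IntWritable val : values) {
          // BROKEN: Hadoop hands back the same IntWritable instance on
          // every iteration, so this list fills with references to ONE
          // object that ends up holding only the last value
          broken.add(val);
          // CORRECT: copy the current state out of the reused object
          correct.add(new IntWritable(val.get()));
       }
    }
 }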
Hadoop Writables
• Values in Hadoop are transmitted (shuffled, emitted) in a
  binary format
• Hadoop ships with basic Writable types: IntWritable, Text,
  LongWritable, etc.
• You must implement the Writable interface for custom objects:
   import java.io.*;
   import org.apache.hadoop.io.Writable;
   // a complete (hypothetical) custom value class around the
   // write/readFields pair from the original slide
   public class CustomValue implements Writable {
      private String string;
      private byte column;
      public void write(DataOutput d) throws IOException {
         d.writeUTF(this.string);     // serialize fields in a fixed order
         d.writeByte(this.column);
      }
      public void readFields(DataInput di) throws IOException {
         this.string = di.readUTF();  // read back in exactly that order
         this.column = di.readByte();
      }
   }
Hadoop Keys (WritableComparable)
• Be very careful to implement equals() and hashCode()
  consistently with compareTo()
• compareTo() controls the sort order of keys
  arriving at the reducer
• Hadoop also lets you write a custom partitioner:
 // route each document to a reducer by id (assumes non-negative ids)
 public int getPartition(Document doc,
                        Text v, int numReducers) {
     return doc.getDocId() % numReducers;
 }
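For illustration, a sketch of a key class consistent with the partitioner above. Only getDocId() and the name Document come from the slide; the rest is assumed. It keeps compareTo(), equals(), and hashCode() in agreement, as the slide advises.

 import java.io.*;
 import org.apache.hadoop.io.WritableComparable;

 public class Document implements WritableComparable<Document> {
    private int docId;
    public int getDocId() { return docId; }
    public void write(DataOutput out) throws IOException { out.writeInt(docId); }
    public void readFields(DataInput in) throws IOException { docId = in.readInt(); }
    // compareTo() controls the sort order of keys arriving at the reducer
    public int compareTo(Document other) {
       return Integer.compare(docId, other.docId);
    }
    // equals() and hashCode() must agree with compareTo()
    public boolean equals(Object o) {
       return (o instanceof Document) && ((Document) o).docId == docId;
    }
    public int hashCode() { return docId; }
 }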
Typical Hadoop File Formats
Hadoop Stack Review
Distributed File System
HDFS performance characteristics
• HDFS was designed for high throughput, not low
  seek latency
• Best-case configurations have shown HDFS performing
  roughly 92,000 random reads per second
  [http://hadoopblog.blogspot.com/]

• Personal experience: HDFS very robust. Fault
  tolerance is “real”. I’ve unplugged machines
  and never lost data.
Motivation for Real-time Hadoop
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• MapReduce jobs are hard to debug
Survey of real-time capabilities

• Real-time, in-situ, self-service access is the
  “Holy Grail” for the business analyst
• A spectrum of real-time capabilities exists
  on Hadoop

 Easy ─────────────────────────────────── Hard
 HDFS             HBase             Drill
 (available in Hadoop)       (proprietary)
Real-time spectrum on Hadoop
Use Case                                       Support         Real-time

Seek to a particular byte in a distributed     HDFS            YES
file

Seek to a particular value in a distributed    HBase           YES
file, by key (1-dimensional indexing)

Answer complex questions expressible in        MapReduce       NO
code (e.g. matching users to music             (Hive, Pig)
albums). Data science.

Ad-hoc query for scattered records given       MPP             YES
simple constraints                             architectures
(field[4] == “music” && field[9] == “dvd”)
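The first row of the table can be made concrete: HDFS serves random byte access through FSDataInputStream.seek(). A minimal sketch, with a hypothetical path:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class SeekDemo {
    public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       FSDataInputStream in = fs.open(new Path("/data/big.log"));
       in.seek(1024L * 1024L * 1024L);  // jump 1GB into the file, no scan
       byte[] buf = new byte[4096];
       in.readFully(buf);               // read 4KB starting at that offset
       in.close();
    }
 }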
Hadoop Underpinned By HDFS
•   Hadoop Distributed File System (HDFS)
•   Inspired by the Google File System (GFS)
•   Underpins every piece of data in “Hadoop”
•   The Hadoop FileSystem API is pluggable
•   HDFS can be replaced with another suitable
    distributed filesystem
    – S3
    – Kosmos (KFS)
    – etc.
Amazon S3
MapFile for real-time access?
  – Index file must be loaded by the client (slow)
  – Index file must fit in client RAM by default
  – Each lookup scans an average of 50% of the
    sampling interval
  – Large records make that scanning intolerable
  – Not a viable “real world” solution for random
    access (see the sketch below)
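For context, a sketch of the MapFile lookup path being critiqued (path and key type hypothetical); note that opening the reader is what loads the whole index into client RAM:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.Text;

 public class MapFileLookup {
    public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       FileSystem fs = FileSystem.get(conf);
       // opening the reader pulls the whole index file into client RAM
       MapFile.Reader reader = new MapFile.Reader(fs, "/data/mymapfile", conf);
       Text value = new Text();
       // get() binary-searches the in-memory index, then scans the
       // data file forward from the nearest indexed key
       reader.get(new IntWritable(42), value);
       reader.close();
    }
 }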
Apache HBase

• Clone of Google’s Bigtable
• Key-based access mechanism
• Designed to hold billions of rows
• “Tables” stored in HDFS
• Supports MapReduce over tables and into
  tables
• Requires you to think hard and commit
  to a key design (see the sketch below)
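A minimal sketch of HBase’s key-based access using the client API of that era (table, row key, and column names hypothetical):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.util.Bytes;

 public class HBaseGetDemo {
    public static void main(String[] args) throws Exception {
       Configuration conf = HBaseConfiguration.create();
       HTable table = new HTable(conf, "mytable");
       // a Get is a single-row lookup by key: HBase's
       // "real-time" 1-dimensional access path
       Result result = table.get(new Get(Bytes.toBytes("row-12345")));
       byte[] cell = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
       table.close();
    }
 }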
HBase Architecture
HBase random read performance
http://hstack.org/hbase-performance-testing/
• 7 servers, each with
   • 8 cores
   • 32GB DDR3 RAM
   • 24 x 146GB SAS 2.0 10K RPM disks
• HBase table
   • 3 billion records
   • 6,600 regions
   • 128–256 bytes per row, spread
     across 1 to 5 columns
Zoomed-in “Get” time histogram




      http://hstack.org/hbase-performance-testing/
MapReduce
• “MapReduce is a framework for processing
  parallelizable problems across huge datasets
  using a large number of computers” (Wikipedia)
• MapReduce is strongly tied to HDFS in Hadoop
• Systems built on HDFS (e.g. HBase) leverage this
  common foundation for integration with the MR
  paradigm
MapReduce and Data Science
• Many complex algorithms can be expressed in
  the MapReduce paradigm
  – NLP
  – Graph processing
  – Image codecs
• The more complex the algorithm, the more the Map
  and Reduce phases become complex programs in
  their own right
• Often, multiple MR jobs are cascaded in succession
  (see the sketch below)
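A sketch of that cascading pattern, with two jobs wired so the second reads the first’s output (paths hypothetical; mapper/reducer setup elided, so as written each pass is an identity job):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();

       // pass 1 writes its output where pass 2 will read it
       Job pass1 = new Job(conf, "pass1");
       FileInputFormat.addInputPath(pass1, new Path("/data/input"));
       FileOutputFormat.setOutputPath(pass1, new Path("/data/intermediate"));
       if (!pass1.waitForCompletion(true)) System.exit(1);

       // pass 2 only starts after pass 1 completes
       Job pass2 = new Job(conf, "pass2");
       FileInputFormat.addInputPath(pass2, new Path("/data/intermediate"));
       FileOutputFormat.setOutputPath(pass2, new Path("/data/final"));
       System.exit(pass2.waitForCompletion(true) ? 0 : 1);
    }
 }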
Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies
  that are hard to improve
  – Copy
  – Shuffle, sort
  – Iterate
• Runtime depends on both the size of the input
  data and the number of processors available
• In a nutshell, it’s a “batch process” and isn’t
  “real-time”
Hive and Pig
• Run on top of MapReduce
• Provide a “table” metaphor familiar to SQL users
• Provide SQL-like syntax (in Hive’s case, very close
  to actual SQL)
• Store a “schema” in a database, mapping tables
  to HDFS files
• Translate “queries” to MapReduce jobs
• No more real-time than MapReduce
MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
Examples:
• Spark – a general-purpose data science framework,
  sort of like real-time MapReduce
• Dremel – a columnar approach, geared toward
  answering SQL-like aggregations and BI-style
  questions
Spark

• Originally designed for iterative machine
  learning problems at Berkeley
• MapReduce does not do a great job on iterative
  workloads
• Spark makes more explicit use of memory
  caches than Hadoop
• Spark can load data from any Hadoop input
  source (see the sketch below)
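A minimal sketch using Spark’s Java API of the era (path and master hypothetical), showing the cache-then-reuse pattern that distinguishes Spark from plain MapReduce on iterative workloads:

 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class SparkCacheDemo {
    public static void main(String[] args) {
       JavaSparkContext sc = new JavaSparkContext("local", "cache-demo");
       // any Hadoop input source works; here, a text file in HDFS
       JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");
       lines.cache();               // pin the RDD in memory across jobs
       long first = lines.count();  // first pass reads from HDFS
       long again = lines.count();  // subsequent passes hit the cache
       sc.stop();
    }
 }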
Effect of Memory Caching in Spark
Is Spark Real-time?
• If data fits in memory, execution time for most
  algorithms still depends on
  – amount of data to be processed
  – number of processors
• So, it still “depends”
• …but definitely more focused on fast time-to-
  answer
• Interactive Scala and Python shells
Dremel MPP architecture
• MPP architecture for ad-hoc query on nested
  data
• Apache Drill is an open-source clone of Dremel
• Dremel was originally developed at Google
• Features “in situ” data analysis
• “Dremel is not intended as a replacement for
  MR and is often used in conjunction with it to
  analyze outputs of MR pipelines or rapidly
  prototype larger computations.” (Dremel:
  Interactive Analysis of Web-Scale Datasets)
In Situ Analysis

• Moving Big Data is a nightmare
• In situ: the ability to access data in
  place
  – in HDFS
  – in Bigtable
Uses For Dremel At Google
• Analysis of crawled web documents
• Tracking install data for applications on Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google’s distributed build system
• Etc., etc.
Why so many uses for Dremel?
• On any Big Data problem or application, the dev
  team faces these problems:
  – “I don’t know what I don’t know” about data
  – Debugging often requires finding and correlating
    specific needles in the haystack
  – Support and marketing often require segmentation
    analysis (identify and characterize wide swaths of
    data)
• Every developer/analyst wants
  – Faster time to answer
  – Fewer trips around the mulberry bush
Column-Oriented Approach
Dremel MPP query execution tree
Is Dremel real-time?
Alternative approaches?
• Both MapReduce and MPP query architectures
  take a “throw hardware at the problem”
  approach
• Alternatives?
  – Use MapReduce to build distributed indexes on the
    data (see the mapper sketch below)
  – Combine columnar storage and inverted indexes to
    create columnar inverted indexes
  – Aim for the sweet spot for data scientists and
    engineers: ad-hoc queries with results returned in
    seconds on a single processing node
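As a sketch of the first alternative: a mapper that inverts (docId, text) records into (term, docId) pairs, so a reducer can concatenate each term’s postings into an index entry. The assumption that the input key carries a document id is hypothetical.

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // map phase of a distributed inverted index: emit (term, docId)
 public class InvertMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text term = new Text();
    protected void map(LongWritable docId, Text doc, Context ctx)
          throws IOException, InterruptedException {
       StringTokenizer tok = new StringTokenizer(doc.toString());
       while (tok.hasMoreTokens()) {
          term.set(tok.nextToken());
          ctx.write(term, docId);   // reducer gathers all docIds per term
       }
    }
 }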
Contact Info
               Email:
               geoff@vertascale.com

               Twitter:
               @geoffhendrey
               @vertascale

               www:
               http://vertascale.com
References
• http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
• http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
• http://yoyoclouds.wordpress.com/tag/hdfs/
• http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• http://hadoopblog.blogspot.com/
• http://lunarium.info/arc/images/Hbase.gif
• http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon-s3_growth_2012_q1_1.png
• http://hstack.org/hbase-performance-testing/
• http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg
• http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
• Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al., VLDB 2010)
