2013: year of real-time access to Big Data?

              Geoffrey Hendrey
               @geoffhendrey
                @vertascale
Agenda

•   Hadoop MapReduce basics
•   Hadoop stack & data formats
•   File access times and mechanics
•   Key-based indexing systems (HBase)
•   MapReduce, Hive/Pig
•   MPP approaches & alternatives
A very bad* diagram




*This diagram makes it appear that data flows through the master node.
A better picture
Job Configuration
http://wiki.apache.org/hadoop/WordCount
Map and Reduce Java Code
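The code image from the original slide is not reproduced here. As a substitute, here is a minimal sketch of WordCount’s map and reduce methods in the spirit of the wiki example linked above, written against the org.apache.hadoop.mapreduce API:

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
       private final static IntWritable one = new IntWritable(1);
       private final Text word = new Text();
       protected void map(LongWritable key, Text value, Context ctx)
             throws IOException, InterruptedException {
          // emit (token, 1) for every word in the input line
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
             word.set(tok.nextToken());
             ctx.write(word, one);
          }
       }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
       protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
             throws IOException, InterruptedException {
          int sum = 0;                 // sum the 1s grouped under this word
          for (IntWritable val : values) {
             sum += val.get();
          }
          ctx.write(key, new IntWritable(sum));
       }
    }
 }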
Reduce
Reducer Group Iterators
• Reducer groups values together by key
• Your code iterates over the values and emits the
  reduced result

              Bear: [1,1]   →   Bear: 2


• Hadoop reducer value iterators return THE SAME
  OBJECT on each call to next(). The object is “reused” to
  reduce garbage-collection load
• Beware of “reused” objects (this is a VERY common
  cause of long and confusing debugging sessions)
• Cause for concern: you are emitting an object with
  non-primitive fields. STALE “reused object” state from a
  previous value can leak into your output (see the
  sketch below)
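A minimal sketch of the pitfall (class and value types hypothetical): a reducer that buffers its values must copy them, because every iteration hands back the same reused instance.

 import java.util.ArrayList;
 import java.util.List;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Reducer;

 public class BufferingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) {
       List<IntWritable> broken = new ArrayList<IntWritable>();
       List<IntWritable> correct = new ArrayList<IntWritable>();
       for (IntWritable val : values) {
          // BROKEN: Hadoop hands back the same IntWritable instance on
          // every iteration, so this list fills with references to ONE
          // object that ends up holding only the last value
          broken.add(val);
          // CORRECT: copy the current state out of the reused object
          correct.add(new IntWritable(val.get()));
       }
    }
 }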
Hadoop Writables
• Values in Hadoop are transmitted (shuffled, emitted) in a
  binary format
• Hadoop ships with basic Writable types: IntWritable, Text,
  LongWritable, etc.
• You must implement the Writable interface for custom objects:
   import java.io.*;
   import org.apache.hadoop.io.Writable;
   // a complete (hypothetical) custom value class around the
   // write/readFields pair from the original slide
   public class CustomValue implements Writable {
      private String string;
      private byte column;
      public void write(DataOutput d) throws IOException {
         d.writeUTF(this.string);     // serialize fields in a fixed order
         d.writeByte(this.column);
      }
      public void readFields(DataInput di) throws IOException {
         this.string = di.readUTF();  // read back in exactly that order
         this.column = di.readByte();
      }
   }
Hadoop Keys (WritableComparable)
• Be very careful to implement equals() and hashCode()
  consistently with compareTo()
• compareTo() controls the sort order of keys
  arriving at the reducer
• Hadoop also lets you write a custom partitioner:
 // route each document to a reducer by id (assumes non-negative ids)
 public int getPartition(Document doc,
                        Text v, int numReducers) {
     return doc.getDocId() % numReducers;
 }
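For illustration, a sketch of a key class consistent with the partitioner above. Only getDocId() and the name Document come from the slide; the rest is assumed. It keeps compareTo(), equals(), and hashCode() in agreement, as the slide advises.

 import java.io.*;
 import org.apache.hadoop.io.WritableComparable;

 public class Document implements WritableComparable<Document> {
    private int docId;
    public int getDocId() { return docId; }
    public void write(DataOutput out) throws IOException { out.writeInt(docId); }
    public void readFields(DataInput in) throws IOException { docId = in.readInt(); }
    // compareTo() controls the sort order of keys arriving at the reducer
    public int compareTo(Document other) {
       return Integer.compare(docId, other.docId);
    }
    // equals() and hashCode() must agree with compareTo()
    public boolean equals(Object o) {
       return (o instanceof Document) && ((Document) o).docId == docId;
    }
    public int hashCode() { return docId; }
 }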
Typical Hadoop File Formats
Hadoop Stack Review
Distributed File System
HDFS performance characteristics
• HDFS was designed for high throughput, not low
  seek latency
• Best-case configurations have shown HDFS performing
  roughly 92,000 random reads per second
  [http://hadoopblog.blogspot.com/]

• Personal experience: HDFS very robust. Fault
  tolerance is “real”. I’ve unplugged machines
  and never lost data.
Motivation for Real-time Hadoop
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• MapReduce jobs are hard to debug
Survey of real-time capabilities

• Real-time, in-situ, self-service access is the
  “Holy Grail” for the business analyst
• A spectrum of real-time capabilities exists
  on Hadoop

 Easy ─────────────────────────────────── Hard
 HDFS             HBase             Drill
 (available in Hadoop)       (proprietary)
Real-time spectrum on Hadoop
Use Case                                       Support         Real-time

Seek to a particular byte in a distributed     HDFS            YES
file

Seek to a particular value in a distributed    HBase           YES
file, by key (1-dimensional indexing)

Answer complex questions expressible in        MapReduce       NO
code (e.g. matching users to music             (Hive, Pig)
albums). Data science.

Ad-hoc query for scattered records given       MPP             YES
simple constraints                             architectures
(field[4] == “music” && field[9] == “dvd”)
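The first row of the table can be made concrete: HDFS serves random byte access through FSDataInputStream.seek(). A minimal sketch, with a hypothetical path:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class SeekDemo {
    public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       FSDataInputStream in = fs.open(new Path("/data/big.log"));
       in.seek(1024L * 1024L * 1024L);  // jump 1GB into the file, no scan
       byte[] buf = new byte[4096];
       in.readFully(buf);               // read 4KB starting at that offset
       in.close();
    }
 }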
Hadoop Underpinned By HDFS
•   Hadoop Distributed File System (HDFS)
•   Inspired by the Google File System (GFS)
•   Underpins every piece of data in “Hadoop”
•   The Hadoop FileSystem API is pluggable
•   HDFS can be replaced with another suitable
    distributed filesystem
    – S3
    – Kosmos (KFS)
    – etc.
Amazon S3
MapFile for real-time access?
  – Index file must be loaded by the client (slow)
  – Index file must fit in client RAM by default
  – Each lookup scans an average of 50% of the
    sampling interval
  – Large records make that scanning intolerable
  – Not a viable “real world” solution for random
    access (see the sketch below)
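For context, a sketch of the MapFile lookup path being critiqued (path and key type hypothetical); note that opening the reader is what loads the whole index into client RAM:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.Text;

 public class MapFileLookup {
    public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       FileSystem fs = FileSystem.get(conf);
       // opening the reader pulls the whole index file into client RAM
       MapFile.Reader reader = new MapFile.Reader(fs, "/data/mymapfile", conf);
       Text value = new Text();
       // get() binary-searches the in-memory index, then scans the
       // data file forward from the nearest indexed key
       reader.get(new IntWritable(42), value);
       reader.close();
    }
 }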
Apache HBase

• Clone of Google’s Bigtable
• Key-based access mechanism
• Designed to hold billions of rows
• “Tables” stored in HDFS
• Supports MapReduce over tables and into
  tables
• Requires you to think hard and commit
  to a key design (see the sketch below)
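A minimal sketch of HBase’s key-based access using the client API of that era (table, row key, and column names hypothetical):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Get;
 import org.apache.hadoop.hbase.client.HTable;
 import org.apache.hadoop.hbase.client.Result;
 import org.apache.hadoop.hbase.util.Bytes;

 public class HBaseGetDemo {
    public static void main(String[] args) throws Exception {
       Configuration conf = HBaseConfiguration.create();
       HTable table = new HTable(conf, "mytable");
       // a Get is a single-row lookup by key: HBase's
       // "real-time" 1-dimensional access path
       Result result = table.get(new Get(Bytes.toBytes("row-12345")));
       byte[] cell = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
       table.close();
    }
 }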
HBase Architecture
HBase random read performance
http://hstack.org/hbase-performance-testing/
• 7 servers, each with
   • 8 cores
   • 32GB DDR3 RAM
   • 24 x 146GB SAS 2.0 10K RPM disks
• HBase table
   • 3 billion records
   • 6,600 regions
   • 128–256 bytes per row, spread
     across 1 to 5 columns
Zoomed-in “Get” time histogram




      http://hstack.org/hbase-performance-testing/
MapReduce
• “MapReduce is a framework for processing
  parallelizable problems across huge datasets
  using a large number of computers” (Wikipedia)
• MapReduce is strongly tied to HDFS in Hadoop
• Systems built on HDFS (e.g. HBase) leverage this
  common foundation for integration with the MR
  paradigm
MapReduce and Data Science
• Many complex algorithms can be expressed in
  the MapReduce paradigm
  – NLP
  – Graph processing
  – Image codecs
• The more complex the algorithm, the more the Map
  and Reduce phases become complex programs in
  their own right
• Often, multiple MR jobs are cascaded in succession
  (see the sketch below)
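A sketch of that cascading pattern, with two jobs wired so the second reads the first’s output (paths hypothetical; mapper/reducer setup elided, so as written each pass is an identity job):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();

       // pass 1 writes its output where pass 2 will read it
       Job pass1 = new Job(conf, "pass1");
       FileInputFormat.addInputPath(pass1, new Path("/data/input"));
       FileOutputFormat.setOutputPath(pass1, new Path("/data/intermediate"));
       if (!pass1.waitForCompletion(true)) System.exit(1);

       // pass 2 only starts after pass 1 completes
       Job pass2 = new Job(conf, "pass2");
       FileInputFormat.addInputPath(pass2, new Path("/data/intermediate"));
       FileOutputFormat.setOutputPath(pass2, new Path("/data/final"));
       System.exit(pass2.waitForCompletion(true) ? 0 : 1);
    }
 }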
Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies
  that are hard to improve
  – Copy
  – Shuffle, sort
  – Iterate
• Runtime depends on both the size of the input
  data and the number of processors available
• In a nutshell, it’s a “batch process” and isn’t
  “real-time”
Hive and Pig
• Run on top of MapReduce
• Provide a “table” metaphor familiar to SQL users
• Provide SQL-like syntax (in Hive’s case, very close
  to actual SQL)
• Store a “schema” in a database, mapping tables
  to HDFS files
• Translate “queries” to MapReduce jobs
• No more real-time than MapReduce
MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
Examples:
• Spark – a general-purpose data science framework,
  sort of like real-time MapReduce
• Dremel – a columnar approach, geared toward
  answering SQL-like aggregations and BI-style
  questions
Spark

• Originally designed for iterative machine
  learning problems at Berkeley
• MapReduce does not do a great job on iterative
  workloads
• Spark makes more explicit use of memory
  caches than Hadoop
• Spark can load data from any Hadoop input
  source (see the sketch below)
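A minimal sketch using Spark’s Java API of the era (path and master hypothetical), showing the cache-then-reuse pattern that distinguishes Spark from plain MapReduce on iterative workloads:

 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class SparkCacheDemo {
    public static void main(String[] args) {
       JavaSparkContext sc = new JavaSparkContext("local", "cache-demo");
       // any Hadoop input source works; here, a text file in HDFS
       JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");
       lines.cache();               // pin the RDD in memory across jobs
       long first = lines.count();  // first pass reads from HDFS
       long again = lines.count();  // subsequent passes hit the cache
       sc.stop();
    }
 }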
Effect of Memory Caching in Spark
Is Spark Real-time?
• If data fits in memory, execution time for most
  algorithms still depends on
  – amount of data to be processed
  – number of processors
• So, it still “depends”
• …but definitely more focused on fast time-to-
  answer
• Interactive Scala and Python shells
Dremel MPP architecture
• MPP architecture for ad-hoc query on nested
  data
• Apache Drill is an open-source clone of Dremel
• Dremel was originally developed at Google
• Features “in situ” data analysis
• “Dremel is not intended as a replacement for
  MR and is often used in conjunction with it to
  analyze outputs of MR pipelines or rapidly
  prototype larger computations.” (Dremel:
  Interactive Analysis of Web-Scale Datasets)
In Situ Analysis

• Moving Big Data is a nightmare
• In situ: the ability to access data in
  place
  – in HDFS
  – in Bigtable
Uses For Dremel At Google
• Analysis of crawled web documents
• Tracking install data for applications on Android Market
• Crash reporting for Google products
• OCR results from Google Books
• Spam analysis
• Debugging of map tiles on Google Maps
• Tablet migrations in managed Bigtable instances
• Results of tests run on Google’s distributed build system
• Etc., etc.
Why so many uses for Dremel?
• On any Big Data problem or application, the dev
  team faces these problems:
  – “I don’t know what I don’t know” about data
  – Debugging often requires finding and correlating
    specific needles in the haystack
  – Support and marketing often require segmentation
    analysis (identify and characterize wide swaths of
    data)
• Every developer/analyst wants
  – Faster time to answer
  – Fewer trips around the mulberry bush
Column-Oriented Approach
Dremel MPP query execution tree
Is Dremel real-time?
Alternative approaches?
• Both MapReduce and MPP query architectures
  take a “throw hardware at the problem”
  approach
• Alternatives?
  – Use MapReduce to build distributed indexes on the
    data (see the mapper sketch below)
  – Combine columnar storage and inverted indexes to
    create columnar inverted indexes
  – Aim for the sweet spot for data scientists and
    engineers: ad-hoc queries with results returned in
    seconds on a single processing node
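As a sketch of the first alternative: a mapper that inverts (docId, text) records into (term, docId) pairs, so a reducer can concatenate each term’s postings into an index entry. The assumption that the input key carries a document id is hypothetical.

 import java.io.IOException;
 import java.util.StringTokenizer;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // map phase of a distributed inverted index: emit (term, docId)
 public class InvertMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final Text term = new Text();
    protected void map(LongWritable docId, Text doc, Context ctx)
          throws IOException, InterruptedException {
       StringTokenizer tok = new StringTokenizer(doc.toString());
       while (tok.hasMoreTokens()) {
          term.set(tok.nextToken());
          ctx.write(term, docId);   // reducer gathers all docIds per term
       }
    }
 }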
Contact Info
               Email:
               geoff@vertascale.com

               Twitter:
               @geoffhendrey
               @vertascale

               www:
               http://vertascale.com
References
• http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
• http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
• http://yoyoclouds.wordpress.com/tag/hdfs/
• http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• http://hadoopblog.blogspot.com/
• http://lunarium.info/arc/images/Hbase.gif
• http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon-s3_growth_2012_q1_1.png
• http://hstack.org/hbase-performance-testing/
• http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg
• http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
• Dremel: Interactive Analysis of Web-Scale Datasets (Melnik et al., VLDB 2010)
