Introduction to AWS Big Data
Omid Vahdaty, Big Data Ninja
When the data outgrows your ability to process it
● Volume
● Velocity
● Variety
EMR
● Basic: the simplest of all Hadoop distributions.
● Not as good as Cloudera.
Collect
● Firehose
● Snowball
● SQS
● EC2
Store
● S3
● Kinesis
● RDS
● DynamoDB
● CloudSearch
● IoT
Process + Analyze
● Lambda
● EMR
● Redshift
● Machine learning
● Elasticsearch
● Data Pipeline
● Athena
Visualize
● QuickSight
● Elasticsearch Service
History of tools
HDFS - from Google
Cassandra - from Facebook; columnar-store NoSQL, materialized views, secondary indexes
Kafka - from LinkedIn
HBase - NoSQL, part of the Hadoop ecosystem
EMR Hadoop ecosystem
● Spark - the recommended option; can do everything below…
● Hive
● Oozie
● Mahout - machine learning library (MLlib, which runs on Spark, is better)
● Presto - better than Hive? More generic than Hive.
● Pig - scripting for big data.
● Impala - from Cloudera, also part of EMR.
ETL Tools
● Attunity
● Splunk
● Semarchy
● Informatica
● Tibco
● Clarit
Architecture
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: up to 3 technologies in a data center, up to 7 in the cloud
○ Going beyond that means more maintenance
Architecture considerations
● Unstructured? Structured? Semi-structured?
● Latency?
● Throughput?
● Concurrency?
● Access patterns?
● PaaS? Max 7 technologies
● IaaS? Max 4 technologies
EcoSystem
● Redshift = analytics
● Aurora = OLTP
● DynamoDB = NoSQL, like MongoDB
● Lambda
● SQS
● CloudSearch
● Elasticsearch
● Data Pipeline
● Beanstalk - “I have a JAR, install it for me…”
● AWS Machine Learning
Data ingestion architecture challenges
● Durability
● HA
● Ingestion types:
○ Batch
○ Stream
● Transactions: OLTP, NoSQL.
Ingestion options
● Kinesis
● Flume
● Kafka
● S3DistCp - copies from:
○ S3 to HDFS
○ S3 to S3
○ Cross account
○ Supports compression.
Transfer
● VPN
● Direct Connect
● S3 multipart upload (see the sketch after this list)
● Snowball
● IoT
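For large single objects, multipart upload is the piece most people script. Below is a minimal boto3 sketch; the bucket and key names are hypothetical, and the thresholds are illustrative only.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Force multipart behaviour and parallel part uploads above ~100 MB.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file("backup.tar.gz", "my-data-lake-bucket", "raw/backup.tar.gz", Config=config)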
Streaming
● Streaming
● Batch
● Collect
○ Kinesis
○ DynamoDB streams
○ SQS (pull)
○ SNS (push)
○ Kafka (most recommended)
Comparison
Streaming
● Low latency
● Message delivery
● Lambda architecture implementation
● State management
● Time or count based windowing support
● Fault tolerant
Stream processor comparison
Stream collection options
● Kinesis Client Library (KCL)
● AWS Lambda
● EMR
● Third party:
○ Spark Streaming (minimum latency ~1 sec) - near real time, with lots of libraries
○ Storm - most real-time (sub-millisecond), Java code based
○ Flink - similar to Spark
Kinesis
● Streams - collect at the source and near-real-time processing (see the sketch after this list)
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to: S3, Redshift, DynamoDB
○ Ingress 1 MB/s, egress 2 MB/s, up to 1,000 transactions per second (per shard)
● Analytics - in flight analytics.
● Firehose - parks your data at the destination.
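Writing to a Kinesis stream is a single call. A minimal boto3 sketch, assuming a hypothetical stream named clickstream; the partition key decides which shard a record lands on.
import json
import boto3

kinesis = boto3.client("kinesis")

# One record; the partition key spreads traffic across shards.
kinesis.put_record(
    StreamName="clickstream",                 # hypothetical stream name
    Data=json.dumps({"user": "u123", "action": "page_view"}).encode("utf-8"),
    PartitionKey="u123",
)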
Kinesis analytics example
CREATE STREAM "INTERMEDIATE_STREAM" (
hostname VARCHAR(1024),
logname VARCHAR(1024),
username VARCHAR(1024),
requesttime VARCHAR(1024),
request VARCHAR(1024),
status VARCHAR(32),
responsesize VARCHAR(32)
);
-- Data Pump: take incoming data from SOURCE_SQL_STREAM_001 and insert it into INTERMEDIATE_STREAM
-- (assumes the source stream exposes columns with the same names)
CREATE OR REPLACE PUMP "INTERMEDIATE_PUMP" AS
INSERT INTO "INTERMEDIATE_STREAM"
SELECT STREAM hostname, logname, username, requesttime, request, status, responsesize
FROM "SOURCE_SQL_STREAM_001";
KCL
● Read from the stream using the Get APIs (see the sketch after this list)
● Build applications with the KCL
● Leverage the Kinesis spout for Storm
● Leverage the EMR connector
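A rough sketch of the raw Get API path (the KCL wraps this, plus checkpointing and load balancing, for you); the stream name is hypothetical and only the first shard is read.
import boto3

kinesis = boto3.client("kinesis")

# Find a shard, open an iterator at the oldest record, and fetch one batch.
shards = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["PartitionKey"], record["Data"])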
Firehose - for parking
● Not for the fast lane - no in-flight analytics
● Capture, transform, and load to:
○ Kinesis
○ S3
○ Redshift
○ Elasticsearch
● Managed Service
● Producer - your input to the delivery stream (see the sketch after this list)
● Buffer size (MB)
● Buffer interval (seconds)
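Producing to Firehose is also one call. A minimal boto3 sketch with a hypothetical delivery stream that buffers and then parks the records at the configured destination.
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers by size/time and then delivers to the configured destination.
firehose.put_record(
    DeliveryStreamName="apache-logs-to-s3",   # hypothetical delivery stream
    Record={"Data": (json.dumps({"status": "200", "responsesize": "512"}) + "\n").encode("utf-8")},
)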
Comparison of Kinesis products
● Stream
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (Redshift, S3, Elasticsearch, etc.)
○ Latency 60 sec minimum.
○ For larger “events”
DynamoDB
● Fully managed NoSQL document / key-value store
○ Tables have no fixed schema
● High performance
○ Single-digit millisecond latency
○ Runs on solid-state drives
● Durable
○ Multi-AZ
○ Fault tolerant, replicated across 3 AZs.
Durability
● Read
○ Eventually consistent
○ Strongly consistent
● Write
○ Quorum ack
○ 3 replicas - always - we can’t change it.
○ Persistence to disk.
Indexing and partitioning
● Indexing
○ LSI - local secondary index, a kind of alternate range key
○ GSI - global secondary index - “pivot charts” for your table, a kind of projection (as in Vertica)
● Partitioning
○ Automatic
○ Hash key spreads data across partitions
DynamoDB Items and Attributes
● Partition key
● Sort key (optional)
● LSI
● Attributes (see the sketch after this list)
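A small boto3 sketch of the key model, assuming a hypothetical user_events table with user_id as partition key and event_time as sort key; non-key attributes are schemaless.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_events")   # hypothetical table

# Write one item; anything beyond the two key attributes is free-form.
table.put_item(Item={"user_id": "u123", "event_time": "2017-06-01T12:00:00Z", "action": "login"})

# Query one partition, newest first - uses the sort key, no scan needed.
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("u123"),
    ScanIndexForward=False,
)
print(resp["Items"])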
AWS Titan(on DynamoDB) - Graph DB
● Vertex - nodes
● Edge - relationship between nodes.
● Good when you need to investigate relationships more than 2 layers deep (a join of 4 tables)
● Based on TinkerPop (open source)
● Full-text search with Lucene, Solr, or Elasticsearch
● HA using multi-master replication (Cassandra-backed)
● Scale using the DynamoDB backend.
● Use cases: cyber, social networks, risk management
Elasticache: MemCache / Redis
● Sub-millisecond caching.
● In memory
● No disk I/O when querying
● High throughput
● High availability
● Fully managed
Redshift
● Petabyte-scale database for analytics
○ Columnar
○ MPP
● Complex SQL
RDS
● Flavours
○ MySQL
○ Aurora
○ PostgreSQL
○ Oracle
○ MSSQL
● Multi AZ, HA
● Managed service
Data Processing
● Batch
● Ad hoc queries
● Message
● Stream
● Machine learning
● (By the way: use Parquet, not ORC - it is more commonly used in the ecosystem)
Athena
● Presto under the hood (see the query sketch after this list)
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
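Queries are just SQL against data in S3. A minimal boto3 sketch, where the logs database, access_logs table, partition columns, and result bucket are all hypothetical.
import boto3

athena = boto3.client("athena")

# Fire an asynchronous query; results land in the S3 output location.
athena.start_query_execution(
    QueryString="SELECT status, count(*) AS hits "
                "FROM access_logs WHERE year = '2017' AND month = '06' "
                "GROUP BY status",
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)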
EMR
● Pig and Hive can work on top of Spark, but not yet in EMR
● Tez is not working; Hortonworks gave up on it.
● Can run Presto on Hive tables
Best practice
● Bzip2 - splittable compression
○ You don't have to read the whole file; you can decompress a single bzip2 block and work in parallel
● Snappy - very fast encoding/decoding, but does not compress as well.
● Partitions
● Ephemeral EMR
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper
● Zeppelin
● For research - public data sets exist on AWS...
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage)
● Task nodes - (unlike regular Hadoop) extend compute only
● Default replication factor:
○ 1-3 nodes ⇒ replication factor 1
○ 4-9 nodes ⇒ replication factor 2
○ 10+ nodes ⇒ replication factor 3
○ Not relevant for external tables
● Does not have a standby master node
● Best for transient clusters (spun up and torn down every night)
EMR
● Use Scala
● If you use Spark SQL, even Python is OK
● For code - Python is full of bugs, but connects well to R
● Scala is better - but does not connect easily to R for data science
● Use CloudFormation when you are ready, to deploy fast
● Check instance I…
● Dense-storage instances are a good architecture
● Use spot instances - for the task nodes
● Use tags
● Constantly upgrade the AMI version
● Don't use Tez
● Make sure you choose network-optimized instances
● Resizing the cluster is not recommended
● Bootstrap actions to automate cluster setup on provisioning (see the sketch after this list)
● Steps to automate jobs on a running cluster
● Use RDS to share the Hive metastore (the metastore is MySQL-based)
● Install R, Kafka, Impala, and many others via bootstrap actions
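A trimmed boto3 sketch of a transient cluster that follows several of the points above (spot task nodes, a bootstrap action, tags); the instance types, bid price, script path, and release label are illustrative only, not a recommendation.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-5.7.0",                      # illustrative release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "r4.2xlarge", "InstanceCount": 3,
             "Market": "ON_DEMAND"},
            # Spot for the task nodes only - they carry no HDFS data.
            {"InstanceRole": "TASK", "InstanceType": "r4.2xlarge", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.30"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,      # transient: terminate after the steps finish
    },
    BootstrapActions=[{"Name": "install-extras",
                       "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install.sh"}}],
    Tags=[{"Key": "team", "Value": "bigdata"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)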
CPU features for instances
Sending work to EMR
● Steps:
○ Can be added while the cluster is running
○ Can be added from the UI / CLI
○ FIFO scheduler by default
● EMR API (see the sketch after this list)
● Ganglia: jvm.JvmMetrics.MemHeapUsedM
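Adding a step to a running cluster is one API call. A minimal boto3 sketch where the cluster id and the Spark script path are hypothetical; command-runner.jar is the standard EMR step launcher.
import boto3

emr = boto3.client("emr")

# Steps queue up FIFO on the cluster and run one after another by default.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                   # hypothetical cluster id
    Steps=[{
        "Name": "daily-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/aggregate.py"],
        },
    }],
)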
Landscape
Hive
● SQL over Hadoop.
● Engines: Spark, Tez, MR
● JDBC / ODBC
● Not good when you need to shuffle.
● Connects well with DynamoDB
● SerDes: JSON, Parquet, regex, etc.
Hbase
● NoSQL, like Cassandra
● Sits below YARN
● An HBase agent on each data node, like Impala.
● Writes to HDFS
● Multi-level index.
● (AWS avoids talking about it because of DynamoDB; used only when you want to save money)
● Driver for Hive (work from Hive on top of HBase)
Presto
● Like Hive, also from Facebook.
● Not always good for joins of 2 large tables.
● Limited by memory
● Not fault tolerant like Hive.
● Optimized for ad hoc queries
Pig
● Distributed shell scripting
● Generates SQL-like operations.
● Engines: MR, Tez
● S3, DynamoDB access
● Use case: for data scientists who don't know SQL, for systems people, and for those who want to avoid Java/Scala
● A fair fight with Hive in terms of performance only
● Good for unstructured-file ETL: file to file, together with Sqoop.
Spark
Mahout
● Machine learning library for Hadoop - not used
● Spark MLlib is used instead.
R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developers :) - for statisticians
● R is single-threaded - use SparkR to distribute. Not everything works perfectly.
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● Works on top of Hive
● Inside EMR.
● Gives more feedback to let you know where you are
Hue
● Hadoop User Experience
● Real-time logs and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate the metastore
● Job Browser
● Query editor
● HBase browser
● Sqoop editor, Oozie editor, Pig editor
Spark
● In memory
● 10x to 100x faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (MLlib)
● Spark GraphX (graph processing)
● SparkR
Spark
● RDD
○ An array (data set)
○ Read-only distributed objects cached in memory across the cluster
○ Allows apps to keep the working set in memory for reuse
○ Fault tolerant
○ Ops (see the sketch after this list):
■ Transformations: map / filter / groupBy / join
■ Actions: count / reduce / collect / save / persist
○ Object-oriented ops
○ High-level expressions like lambdas, functions, map
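A short PySpark sketch of the transformation/action split described above; the S3 input and output paths are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("s3://my-bucket/logs/*.gz")             # hypothetical input
errors = lines.filter(lambda line: " 500 " in line)          # transformation (lazy)
by_host = errors.map(lambda line: (line.split()[0], 1)) \
                .reduceByKey(lambda a, b: a + b)             # transformation (lazy)

by_host.persist()                                            # keep the working set in memory
print(by_host.count())                                       # action - triggers the work
by_host.saveAsTextFile("s3://my-bucket/output/errors")       # action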
Data Frames
● A dataset organized into named columns
● Columns can also be accessed by ordinal position
● Backed by the Tungsten execution engine - a framework for managing distributed memory (open source, part of Spark)
● Abstraction for selecting, filtering, aggregating, and plotting structured data.
● Run SQL on it (see the sketch after this list).
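A small PySpark sketch of the same idea: named columns, select/filter/aggregate, and plain SQL over a registered view; the Parquet path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

df = spark.read.parquet("s3://my-bucket/access_logs/")       # hypothetical input

# Column-level operations instead of hand-written map/reduce.
df.filter(df.status == "200") \
  .groupBy("status") \
  .agg(F.avg("responsesize").alias("avg_size")) \
  .show()

# Or register the DataFrame and query it with SQL.
df.createOrReplaceTempView("logs")
spark.sql("SELECT status, count(*) FROM logs GROUP BY status").show()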
Streaming
● Near real time (~1 sec latency), like batches in 1-second windows
● Same Spark as before
● Streaming jobs with the same API (see the sketch after this list)
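A minimal Spark Streaming (DStream) sketch with 1-second micro-batches; the socket source on localhost:9999 is just a stand-in for a real source such as Kinesis or Kafka.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # stand-in source
counts = lines.flatMap(lambda l: l.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                  # print each batch's result

ssc.start()
ssc.awaitTermination()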
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: Java, Scala, Python, SparkR (see the sketch after this list)
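A tiny spark.ml classification sketch; the four-row dataset and the two feature columns are made up purely for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Hypothetical labeled data: binary label plus two numeric features.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.4), (1.0, 3.1, 2.2), (0.0, 0.8, 0.1), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()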
Spark GraphX
● Works with graphs for parallel computation
● Access to a library of algorithms:
○ PageRank
○ Connected components
○ Label propagation
○ SVD++
○ Strongly connected components
Hive on Spark
● Will replace Tez.
Spark flavours
● Standalone (own cluster)
● On YARN
● On Mesos
Downside
● Compute intensive
● Performance gain over MapReduce is not guaranteed.
● Stream processing is actually batch with a very small window.
Redshift
● OLAP, not OLTP → analytics, not transactions
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create a slow queue for long-running queries.
● Do NOT use it for transformations.
EMR vs Redshift
● How much data is loaded and unloaded?
● Which operations need to be performed?
● Recycling data? → EMR
● History to be analyzed again and again? → EMR
● Where does the data need to end up? BI?
● Use Spectrum in some use cases.
● Raw data? S3.
Hive VS. Redshift
● Amount of concurrency? Low → Hive, high → Redshift
● Access for customers? Redshift
● Transformations, unstructured data, batch, ETL → Hive
● Peta scale? Redshift
● Complex joins → Redshift
Presto VS redshift
● Not a true DW, but can be used as one.
● Requires S3 or HDFS
● Netflix uses Presto for analytics.
Redshift
● Leader node
○ Metadata
○ Execution plan
○ SQL endpoint
● Data nodes
● Distribution key
● Sort key
● Normalize…
● Don't be afraid to duplicate data with a different sort key (see the sketch after this list)
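A minimal sketch of the distribution/sort key idea, using psycopg2 against a hypothetical cluster endpoint and table; choosing the join key as DISTKEY and the time column as SORTKEY is illustrative, not a universal rule.
import psycopg2

# Hypothetical endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn.cursor() as cur:
    # Distribute on the column used in joins, sort on the column used in range filters.
    cur.execute("""
        CREATE TABLE page_views (
            user_id   BIGINT,
            view_time TIMESTAMP,
            url       VARCHAR(1024)
        )
        DISTKEY (user_id)
        SORTKEY (view_time);
    """)
conn.commit()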
Redshift Spectrum
● External tables over S3
● Additional compute resources beyond the Redshift cluster.
● Not good for all use cases
Cost consideration
● Region
● Data out
● Spot instances
○ 50% - 80% cost reduction.
○ Limit your bid
○ Works well with EMR; use spot instances mostly for task nodes. For dev - use spot.
○ May be killed in the middle :)
● Reserved instances
Kinesis pricing
● Streams
○ Shard hour
○ PUT payload unit, 25 KB
● Firehose
○ Volume ingested
○ Billed in 5 KB increments
● Analytics
○ Pay per processing unit (1 vCPU, 4 GB) = 11 cents per hour
Dynamo
● Most expensive: pay for throughput
○ 1 KB write units
○ 4 KB read units
○ Eventually consistent reads are cheaper
● Be careful… from $400 to $10,000 simply because you change the block (item) size.
● You pay for storage: the first 25 GB is free, the rest is 25 cents per GB
Optimize cost
● Alerts for cost mistakes:
○ Unused machines, etc.
○ At 95% of expected cost → something is wrong…
● No need to buy RIs for 3 years; 1 year is better.
Visualizing - QuickSight
● Cheap
● Connect to everything
○ S3
○ Dynamo
○ Redshift
● SPICE
● Use storyboards to create slides of graphs, like PPT
Tableau
● Connect to Redshift
● Good look and feel
● ~$1,000 per user
● Not the best in terms of performance; QuickSight is faster.
Other visualizers
● Tibco Spotfire
○ Has many connectors
○ Has a recommendations feature
● Jaspersoft
○ Limited
● ZoomData
● Hunk
○ Uses EMR and S3
○ Schema on the fly
○ Available in the Marketplace
○ Expensive.
○ Non-SQL; has its own language.
Visualize - which DB to choose?
● Hive is not recommended
● Presto is a bit faster.
● Athena is OK
● Impala is better (has agent per machine, caching)
● Redshift is best.
Orchestration
● Oozie
○ Open-source workflow engine
■ Workflow: a graph of actions
■ Coordinator: schedules jobs
○ Supports: Hive, Sqoop, Spark, etc.
● Data pipeline
○ Moves data from on-prem to the cloud
○ Distributed
○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift
○ Like ETL: input, data manipulation, output
○ Not trivial, but nicer than Oozie
● Other: AirFlow, Knime, Luigi, Azkaban
Security
● Shared security model
○ Customer:
■ OS, platform, identity, access management, roles, permissions
■ Network
■ Firewall
○ AWS:
■ Compute, storage, databases, networking, LB
■ Regions, AZs, edge locations
■ Compliance
EMR secured
● IAM
● VPC
● Private subnet
● VPC endpoint to S3
● MFA to login
● STS + SAML - token-based login; a complex solution
● Kerberos authentication of Hadoop nodes
● Encryption
○ SSH to master
○ TLS between nodes
IAM
● User
● Best practice :
○ MFA
○ Don't use the root account
● Group
● Role
● Policy best practice
○ Allow X
○ Disallow all the rest
● Identity federation via LDAP
Kinesis Security best practice
● Create an IAM admin entity for administration
● Create an IAM entity for resharding the stream
● Create an IAM entity for producers to write
● Create an IAM entity for consumers to read
● Allow specific source IPs
● Enforce the aws:SecureTransport condition key for every API call
● Use temporary credentials: IAM roles
Dynamo Security
● Access
● Policies
● Roles
● Fine-grained access control at the database level - row level (item) / column level (attribute)
● STS for web identity federation.
● Limit the number of rows (items) a user can see
● Use SSL
● All requests can be signed via SHA-256
● PCI, SOC 3, HIPAA, ISO 27001, etc.
● CloudTrail - for audit logs
Redshift Security
● SSL in transit
● Encryption
○ KMS, AES-256
○ HSM encryption (hardware)
● VPC
● CloudTrail
● All the usual regulatory certifications.
● Security groups
Big Data patterns
● Interactive query → mostly EMR, Athena
● Batch processing → reporting → Redshift + EMR
● Stream processing → Kinesis, Kinesis Client Library, Lambda, EMR
● Real-time prediction → mobile, DynamoDB → Lambda + ML
● Batch prediction → ML and Redshift
● Long-running cluster → S3 → EMR
● Log aggregation → S3 → EMR → Redshift
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
