Introduction to AWS Big Data
Omid Vahdaty, Big Data Ninja
When the data outgrows your ability to process it
● Volume
● Velocity
● Variety
EMR
● Basic: the simplest of all Hadoop distributions.
● Not as good as Cloudera.
Collect
● Firehose
● Snowball
● SQS
● EC2
Store
● S3
● Kinesis
● RDS
● DynamoDB
● CloudSearch
● IoT
Process + Analyze
● Lambda
● EMR
● Redshift
● Machine learning
● Elasticsearch
● Data Pipeline
● Athena
Visualize
● QuickSight
● Elasticsearch Service
History of tools
HDFS - from Google
Cassandra - from Facebook; columnar-store NoSQL, materialized views, secondary indexes
Kafka - from LinkedIn
HBase - NoSQL, part of the Hadoop ecosystem
EMR Hadoop ecosystem
● Spark - the recommended option; can do everything below…
● Hive
● Oozie
● Mahout - machine learning library (MLlib, which runs on Spark, is better)
● Presto - better than Hive? More generic than Hive.
● Pig - scripting for big data.
● Impala - from Cloudera, also part of EMR.
ETL Tools
● Attunity
● Splunk
● Semarchy
● Informatica
● Tibco
● Clarit
Architecture
● Decouple :
○ Store
○ Process
○ Store
○ Process
○ insight...
● Rule of thumb: up to 3 technologies in a data center, up to 7 in the cloud
○ Going beyond that means more maintenance
Architecture considerations
● Unstructured? Structured? Semi-structured?
● Latency?
● Throughput?
● Concurrency?
● Access patterns?
● PaaS? Max 7 technologies
● IaaS? Max 4 technologies
EcoSystem
● Redshift = analytics
● Aurora = OLTP
● DynamoDB = NoSQL, like MongoDB
● Lambda
● SQS
● CloudSearch
● Elasticsearch
● Data Pipeline
● Beanstalk - “I have a JAR, install it for me…”
● AWS Machine Learning
Data ingestion architecture challenges
● Durability
● HA
● Ingestion types:
○ Batch
○ Stream
● Transactions: OLTP, NoSQL.
Ingestion options
● Kinesis
● Flume
● Kafka
● S3DistCp - copies from:
○ S3 to HDFS
○ S3 to S3
○ Cross account
○ Supports compression.
Transfer
● VPN
● Direct Connect
● S3 multipart upload (see the sketch after this list)
● Snowball
● IoT
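For large single objects, multipart upload is the piece most people script. Below is a minimal boto3 sketch; the bucket and key names are hypothetical, and the thresholds are illustrative only.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Force multipart behaviour and parallel part uploads above ~100 MB.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file("backup.tar.gz", "my-data-lake-bucket", "raw/backup.tar.gz", Config=config)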
Streaming
● Streaming
● Batch
● Collect
○ Kinesis
○ DynamoDB streams
○ SQS (pull)
○ SNS (push)
○ Kafka (most recommended)
Comparison
Streaming
● Low latency
● Message delivery
● Lambda architecture implementation
● State management
● Time or count based windowing support
● Fault tolerant
Stream processor comparison
Stream collection options
● Kinesis Client Library (KCL)
● AWS Lambda
● EMR
● Third party:
○ Spark Streaming (minimum latency ~1 sec) - near real time, with lots of libraries
○ Storm - most real-time (sub-millisecond), Java code based
○ Flink - similar to Spark
Kinesis
● Streams - collect at the source and near-real-time processing (see the sketch after this list)
○ Near real time
○ High throughput
○ Low cost
○ Easy administration - set desired level of capacity
○ Delivery to: S3, Redshift, DynamoDB
○ Ingress 1 MB/s, egress 2 MB/s, up to 1,000 transactions per second (per shard)
● Analytics - in flight analytics.
● Firehose - parks your data at the destination.
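Writing to a Kinesis stream is a single call. A minimal boto3 sketch, assuming a hypothetical stream named clickstream; the partition key decides which shard a record lands on.
import json
import boto3

kinesis = boto3.client("kinesis")

# One record; the partition key spreads traffic across shards.
kinesis.put_record(
    StreamName="clickstream",                 # hypothetical stream name
    Data=json.dumps({"user": "u123", "action": "page_view"}).encode("utf-8"),
    PartitionKey="u123",
)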
Kinesis analytics example
CREATE STREAM "INTERMEDIATE_STREAM" (
hostname VARCHAR(1024),
logname VARCHAR(1024),
username VARCHAR(1024),
requesttime VARCHAR(1024),
request VARCHAR(1024),
status VARCHAR(32),
responsesize VARCHAR(32)
);
-- Data Pump: take incoming data from SOURCE_SQL_STREAM_001 and insert it into INTERMEDIATE_STREAM
-- (assumes the source stream exposes columns with the same names)
CREATE OR REPLACE PUMP "INTERMEDIATE_PUMP" AS
INSERT INTO "INTERMEDIATE_STREAM"
SELECT STREAM hostname, logname, username, requesttime, request, status, responsesize
FROM "SOURCE_SQL_STREAM_001";
KCL
● Read from the stream using the Get APIs (see the sketch after this list)
● Build applications with the KCL
● Leverage the Kinesis spout for Storm
● Leverage the EMR connector
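A rough sketch of the raw Get API path (the KCL wraps this, plus checkpointing and load balancing, for you); the stream name is hypothetical and only the first shard is read.
import boto3

kinesis = boto3.client("kinesis")

# Find a shard, open an iterator at the oldest record, and fetch one batch.
shards = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["PartitionKey"], record["Data"])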
Firehose - for parking
● Not for the fast lane - no in-flight analytics
● Capture, transform, and load to:
○ Kinesis
○ S3
○ Redshift
○ Elasticsearch
● Managed Service
● Producer - your input to the delivery stream (see the sketch after this list)
● Buffer size (MB)
● Buffer interval (seconds)
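Producing to Firehose is also one call. A minimal boto3 sketch with a hypothetical delivery stream that buffers and then parks the records at the configured destination.
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers by size/time and then delivers to the configured destination.
firehose.put_record(
    DeliveryStreamName="apache-logs-to-s3",   # hypothetical delivery stream
    Record={"Data": (json.dumps({"status": "200", "responsesize": "512"}) + "\n").encode("utf-8")},
)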
Comparison of Kinesis products
● Stream
○ Sub 1 sec processing latency
○ Choice of stream processor (generic)
○ For smaller events
● Firehose
○ Zero admin
○ 4 targets built in (Redshift, S3, Elasticsearch, etc.)
○ Latency 60 sec minimum.
○ For larger “events”
DynamoDB
● Fully managed NoSQL document / key-value store
○ Tables have no fixed schema
● High performance
○ Single-digit millisecond latency
○ Runs on solid-state drives
● Durable
○ Multi-AZ
○ Fault tolerant, replicated across 3 AZs.
Durability
● Read
○ Eventually consistent
○ Strongly consistent
● Write
○ Quorum ack
○ 3 replicas - always - we can’t change it.
○ Persistence to disk.
Indexing and partitioning
● Indexing
○ LSI - local secondary index, a kind of alternate range key
○ GSI - global secondary index - “pivot charts” for your table, a kind of projection (as in Vertica)
● Partitioning
○ Automatic
○ Hash key spreads data across partitions
DynamoDB Items and Attributes
● Partition key
● Sort key (optional)
● LSI
● Attributes (see the sketch after this list)
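A small boto3 sketch of the key model, assuming a hypothetical user_events table with user_id as partition key and event_time as sort key; non-key attributes are schemaless.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_events")   # hypothetical table

# Write one item; anything beyond the two key attributes is free-form.
table.put_item(Item={"user_id": "u123", "event_time": "2017-06-01T12:00:00Z", "action": "login"})

# Query one partition, newest first - uses the sort key, no scan needed.
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("u123"),
    ScanIndexForward=False,
)
print(resp["Items"])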
AWS Titan(on DynamoDB) - Graph DB
● Vertex - nodes
● Edge - relationship between nodes.
● Good when you need to investigate relationships more than 2 layers deep (a join of 4 tables)
● Based on TinkerPop (open source)
● Full-text search with Lucene, Solr, or Elasticsearch
● HA using multi-master replication (Cassandra-backed)
● Scale using the DynamoDB backend.
● Use cases: cyber, social networks, risk management
Elasticache: MemCache / Redis
● Sub-millisecond caching.
● In memory
● No disk I/O when querying
● High throughput
● High availability
● Fully managed
Redshift
● Petabyte-scale database for analytics
○ Columnar
○ MPP
● Complex SQL
RDS
● Flavours
○ MySQL
○ Aurora
○ PostgreSQL
○ Oracle
○ MSSQL
● Multi AZ, HA
● Managed service
Data Processing
● Batch
● Ad hoc queries
● Message
● Stream
● Machine learning
● (By the way: use Parquet, not ORC - it is more commonly used in the ecosystem)
Athena
● Presto under the hood (see the query sketch after this list)
● In memory
● Hive metastore for DDL functionality
○ Complex data types
○ Multiple formats
○ Partitions
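Queries are just SQL against data in S3. A minimal boto3 sketch, where the logs database, access_logs table, partition columns, and result bucket are all hypothetical.
import boto3

athena = boto3.client("athena")

# Fire an asynchronous query; results land in the S3 output location.
athena.start_query_execution(
    QueryString="SELECT status, count(*) AS hits "
                "FROM access_logs WHERE year = '2017' AND month = '06' "
                "GROUP BY status",
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)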
EMR
● Pig and Hive can work on top of Spark, but not yet in EMR
● Tez is not working; Hortonworks gave up on it.
● Can run Presto on Hive tables
Best practice
● Bzip2 - splittable compression
○ You don't have to read the whole file; you can decompress a single bzip2 block and work in parallel
● Snappy - very fast encoding/decoding, but does not compress as well.
● Partitions
● Ephemeral EMR
EMR ecosystem
● Hive
● Pig
● Hue
● Spark
● Oozie
● Presto
● Ganglia
● Zookeeper
● Zeppelin
● For research - public data sets exist on AWS...
EMR Architecture
● Master node
● Core nodes - like data nodes (with storage)
● Task nodes - (unlike regular Hadoop) extend compute only
● Default replication factor:
○ 1-3 nodes ⇒ replication factor 1
○ 4-9 nodes ⇒ replication factor 2
○ 10+ nodes ⇒ replication factor 3
○ Not relevant for external tables
● Does not have a standby master node
● Best for transient clusters (spun up and torn down every night)
EMR
● Use Scala
● If you use Spark SQL, even Python is OK
● For code - Python is full of bugs, but connects well to R
● Scala is better - but does not connect easily to R for data science
● Use CloudFormation when you are ready, to deploy fast
● Check instance I…
● Dense-storage instances are a good architecture
● Use spot instances - for the task nodes
● Use tags
● Constantly upgrade the AMI version
● Don't use Tez
● Make sure you choose network-optimized instances
● Resizing the cluster is not recommended
● Bootstrap actions to automate cluster setup on provisioning (see the sketch after this list)
● Steps to automate jobs on a running cluster
● Use RDS to share the Hive metastore (the metastore is MySQL-based)
● Install R, Kafka, Impala, and many others via bootstrap actions
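A trimmed boto3 sketch of a transient cluster that follows several of the points above (spot task nodes, a bootstrap action, tags); the instance types, bid price, script path, and release label are illustrative only, not a recommendation.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-5.7.0",                      # illustrative release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "r4.2xlarge", "InstanceCount": 3,
             "Market": "ON_DEMAND"},
            # Spot for the task nodes only - they carry no HDFS data.
            {"InstanceRole": "TASK", "InstanceType": "r4.2xlarge", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.30"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,      # transient: terminate after the steps finish
    },
    BootstrapActions=[{"Name": "install-extras",
                       "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install.sh"}}],
    Tags=[{"Key": "team", "Value": "bigdata"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)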
CPU features for instances
Sending work to EMR
● Steps:
○ Can be added while the cluster is running
○ Can be added from the UI / CLI
○ FIFO scheduler by default
● EMR API (see the sketch after this list)
● Ganglia: jvm.JvmMetrics.MemHeapUsedM
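Adding a step to a running cluster is one API call. A minimal boto3 sketch where the cluster id and the Spark script path are hypothetical; command-runner.jar is the standard EMR step launcher.
import boto3

emr = boto3.client("emr")

# Steps queue up FIFO on the cluster and run one after another by default.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                   # hypothetical cluster id
    Steps=[{
        "Name": "daily-aggregation",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/aggregate.py"],
        },
    }],
)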
Landscape
Hive
● SQL over Hadoop.
● Engines: Spark, Tez, MR
● JDBC / ODBC
● Not good when you need to shuffle.
● Connects well with DynamoDB
● SerDes: JSON, Parquet, regex, etc.
Hbase
● NoSQL, like Cassandra
● Sits below YARN
● An HBase agent on each data node, like Impala.
● Writes to HDFS
● Multi-level index.
● (AWS avoids talking about it because of DynamoDB; used only when you want to save money)
● Driver for Hive (work from Hive on top of HBase)
Presto
● Like Hive, also from Facebook.
● Not always good for joins of 2 large tables.
● Limited by memory
● Not fault tolerant like Hive.
● Optimized for ad hoc queries
Pig
● Distributed shell scripting
● Generates SQL-like operations.
● Engines: MR, Tez
● S3, DynamoDB access
● Use case: for data scientists who don't know SQL, for systems people, and for those who want to avoid Java/Scala
● A fair fight with Hive in terms of performance only
● Good for unstructured-file ETL: file to file, together with Sqoop.
Spark
Mahout
● Machine learning library for Hadoop - not used
● Spark MLlib is used instead.
R
● Open source package for statistical computing.
● Works with EMR
● “Matlab” equivalent
● Works with spark
● Not for developers :) - for statisticians
● R is single-threaded - use SparkR to distribute. Not everything works perfectly.
Apache Zeppelin
● Notebook - visualizer
● Built in spark integration
● Interactive data analytics
● Easy collaboration.
● Uses SQL
● Works on top of Hive
● Inside EMR.
● Gives more feedback to let you know where you are
Hue
● Hadoop User Experience
● Real-time logs and failures.
● Multiple users
● Native access to S3.
● File browser to HDFS.
● Manipulate the metastore
● Job Browser
● Query editor
● HBase browser
● Sqoop editor, Oozie editor, Pig editor
Spark
● In memory
● 10x to 100x faster
● Good optimizer for distribution
● Rich API
● Spark SQL
● Spark Streaming
● Spark ML (MLlib)
● Spark GraphX (graph processing)
● SparkR
Spark
● RDD
○ An array (data set)
○ Read-only distributed objects cached in memory across the cluster
○ Allows apps to keep the working set in memory for reuse
○ Fault tolerant
○ Ops (see the sketch after this list):
■ Transformations: map / filter / groupBy / join
■ Actions: count / reduce / collect / save / persist
○ Object-oriented ops
○ High-level expressions like lambdas, functions, map
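A short PySpark sketch of the transformation/action split described above; the S3 input and output paths are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("s3://my-bucket/logs/*.gz")             # hypothetical input
errors = lines.filter(lambda line: " 500 " in line)          # transformation (lazy)
by_host = errors.map(lambda line: (line.split()[0], 1)) \
                .reduceByKey(lambda a, b: a + b)             # transformation (lazy)

by_host.persist()                                            # keep the working set in memory
print(by_host.count())                                       # action - triggers the work
by_host.saveAsTextFile("s3://my-bucket/output/errors")       # action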
Data Frames
● A dataset organized into named columns
● Columns can also be accessed by ordinal position
● Backed by the Tungsten execution engine - a framework for managing distributed memory (open source, part of Spark)
● Abstraction for selecting, filtering, aggregating, and plotting structured data.
● Run SQL on it (see the sketch after this list).
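A small PySpark sketch of the same idea: named columns, select/filter/aggregate, and plain SQL over a registered view; the Parquet path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

df = spark.read.parquet("s3://my-bucket/access_logs/")       # hypothetical input

# Column-level operations instead of hand-written map/reduce.
df.filter(df.status == "200") \
  .groupBy("status") \
  .agg(F.avg("responsesize").alias("avg_size")) \
  .show()

# Or register the DataFrame and query it with SQL.
df.createOrReplaceTempView("logs")
spark.sql("SELECT status, count(*) FROM logs GROUP BY status").show()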
Streaming
● Near real time (~1 sec latency), like batches in 1-second windows
● Same Spark as before
● Streaming jobs with the same API (see the sketch after this list)
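A minimal Spark Streaming (DStream) sketch with 1-second micro-batches; the socket source on localhost:9999 is just a stand-in for a real source such as Kinesis or Kafka.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=1)      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # stand-in source
counts = lines.flatMap(lambda l: l.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                  # print each batch's result

ssc.start()
ssc.awaitTermination()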
Spark ML
● Classification
● Regression
● Collaborative filtering
● Clustering
● Decomposition
● Code: Java, Scala, Python, SparkR (see the sketch after this list)
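A tiny spark.ml classification sketch; the four-row dataset and the two feature columns are made up purely for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Hypothetical labeled data: binary label plus two numeric features.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.4), (1.0, 3.1, 2.2), (0.0, 0.8, 0.1), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()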
Spark GraphX
● Works with graphs for parallel computation
● Access to a library of algorithms:
○ PageRank
○ Connected components
○ Label propagation
○ SVD++
○ Strongly connected components
Hive on Spark
● Will replace Tez.
Spark flavours
● Standalone (own cluster)
● On YARN
● On Mesos
Downside
● Compute intensive
● Performance gain over MapReduce is not guaranteed.
● Stream processing is actually batch with a very small window.
Redshift
● OLAP, not OLTP → analytics, not transactions
● Fully SQL
● Fully ACID
● No indexing
● Fully managed
● Petabyte Scale
● MPP
● Can create a slow queue for long-running queries.
● Do NOT use it for transformations.
EMR vs Redshift
● How much data is loaded and unloaded?
● Which operations need to be performed?
● Recycling data? → EMR
● History to be analyzed again and again? → EMR
● Where does the data need to end up? BI?
● Use Spectrum in some use cases.
● Raw data? S3.
Hive VS. Redshift
● Amount of concurrency? Low → Hive, high → Redshift
● Access for customers? Redshift
● Transformations, unstructured data, batch, ETL → Hive
● Peta scale? Redshift
● Complex joins → Redshift
Presto VS redshift
● Not a true DW, but can be used as one.
● Requires S3 or HDFS
● Netflix uses Presto for analytics.
Redshift
● Leader node
○ Metadata
○ Execution plan
○ SQL endpoint
● Data nodes
● Distribution key
● Sort key
● Normalize…
● Don't be afraid to duplicate data with a different sort key (see the sketch after this list)
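A minimal sketch of the distribution/sort key idea, using psycopg2 against a hypothetical cluster endpoint and table; choosing the join key as DISTKEY and the time column as SORTKEY is illustrative, not a universal rule.
import psycopg2

# Hypothetical endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn.cursor() as cur:
    # Distribute on the column used in joins, sort on the column used in range filters.
    cur.execute("""
        CREATE TABLE page_views (
            user_id   BIGINT,
            view_time TIMESTAMP,
            url       VARCHAR(1024)
        )
        DISTKEY (user_id)
        SORTKEY (view_time);
    """)
conn.commit()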
Redshift Spectrum
● External tables over S3
● Additional compute resources beyond the Redshift cluster.
● Not good for all use cases
Cost consideration
● Region
● Data out
● Spot instances
○ 50% - 80% cost reduction.
○ Limit your bid
○ Works well with EMR; use spot instances mostly for task nodes. For dev - use spot.
○ May be killed in the middle :)
● Reserved instances
Kinesis pricing
● Streams
○ Shard hour
○ PUT payload unit, 25 KB
● Firehose
○ Volume ingested
○ Billed in 5 KB increments
● Analytics
○ Pay per processing unit (1 vCPU, 4 GB) = 11 cents per hour
Dynamo
● Most expensive: pay for throughput
○ 1 KB write units
○ 4 KB read units
○ Eventually consistent reads are cheaper
● Be careful… from $400 to $10,000 simply because you change the block (item) size.
● You pay for storage: the first 25 GB is free, the rest is 25 cents per GB
Optimize cost
● Alerts for cost mistakes:
○ Unused machines, etc.
○ At 95% of expected cost → something is wrong…
● No need to buy RIs for 3 years; 1 year is better.
Visualizing - QuickSight
● Cheap
● Connect to everything
○ S3
○ Dynamo
○ Redshift
● SPICE
● Use storyboards to create slides of graphs, like PPT
Tableau
● Connect to Redshift
● Good look and feel
● ~$1,000 per user
● Not the best in terms of performance; QuickSight is faster.
Other visualizers
● Tibco Spotfire
○ Has many connectors
○ Has a recommendations feature
● Jaspersoft
○ Limited
● ZoomData
● Hunk
○ Uses EMR and S3
○ Schema on the fly
○ Available in the Marketplace
○ Expensive.
○ Non-SQL; has its own language.
Visualize - which DB to choose?
● Hive is not recommended
● Presto is a bit faster.
● Athena is OK
● Impala is better (has agent per machine, caching)
● Redshift is best.
Orchestration
● Oozie
○ Open-source workflow engine
■ Workflow: a graph of actions
■ Coordinator: schedules jobs
○ Supports: Hive, Sqoop, Spark, etc.
● Data pipeline
○ Moves data from on-prem to the cloud
○ Distributed
○ Integrates well with S3, DynamoDB, RDS, EMR, EC2, Redshift
○ Like ETL: input, data manipulation, output
○ Not trivial, but nicer than Oozie
● Other: AirFlow, Knime, Luigi, Azkaban
Security
● Shared security model
○ Customer:
■ OS, platform, identity, access management, roles, permissions
■ Network
■ Firewall
○ AWS:
■ Compute, storage, databases, networking, LB
■ Regions, AZs, edge locations
■ Compliance
EMR secured
● IAM
● VPC
● Private subnet
● VPC endpoint to S3
● MFA to login
● STS + SAML - token-based login; a complex solution
● Kerberos authentication of Hadoop nodes
● Encryption
○ SSH to master
○ TLS between nodes
IAM
● User
● Best practice :
○ MFA
○ Don't use the root account
● Group
● Role
● Policy best practice
○ Allow X
○ Disallow all the rest
● Identity federation via LDAP
Kinesis Security best practice
● Create an IAM admin entity for administration
● Create an IAM entity for resharding the stream
● Create an IAM entity for producers to write
● Create an IAM entity for consumers to read
● Allow specific source IPs
● Enforce the aws:SecureTransport condition key for every API call
● Use temporary credentials: IAM roles
Dynamo Security
● Access
● Policies
● Roles
● Fine-grained access control at the database level - row level (item) / column level (attribute)
● STS for web identity federation.
● Limit the number of rows (items) a user can see
● Use SSL
● All requests can be signed via SHA-256
● PCI, SOC 3, HIPAA, ISO 27001, etc.
● CloudTrail - for audit logs
Redshift Security
● SSL in transit
● Encryption
○ KMS, AES-256
○ HSM encryption (hardware)
● VPC
● CloudTrail
● All the usual regulatory certifications.
● Security groups
Big Data patterns
● Interactive query → mostly EMR, Athena
● Batch processing → reporting → Redshift + EMR
● Stream processing → Kinesis, Kinesis Client Library, Lambda, EMR
● Real-time prediction → mobile, DynamoDB → Lambda + ML
● Batch prediction → ML and Redshift
● Long-running cluster → S3 → EMR
● Log aggregation → S3 → EMR → Redshift
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
