Scaling micro-services architecture on AWS
Boyan Dimitrov,
Senior Systems Engineer at Hailo
@nathariel
Outline
• Intro to the Hailo world
• Our cloud journey and architecture evolution
• Platform design patterns and challenges
• Tooling
AWS User Group UK 2014
The world’s highest-rated taxi app – almost 20,000 five-star reviews
To date, Hailo has carried more than 11 million passengers
Hailo has over 50,000 registered taxi drivers worldwide
November 2011: Hailo 1.0 Launch
Users: 1
Regions: eu-west-1
[Architecture diagram: eu-west-1 running Java, PHP and MySQL]
Architecture specifics
• Monolithic PHP and Java applications
• Built and supported by 3-4 backend engineers
• City-specific environments
• MySQL master-master replication for resilience
• Multi-AZ since day 1
AWS specifics
Route 53 ELB S3
Challenges
• Hard to develop new features
• Painful to push code changes and to support many independent city-specific environments
• Adding new instances and more capacity is a very slow and expensive process
• Unreliable and slow failover procedures
• Single points of failure (SPOF)
December 2013: Hailo 2.0
Users: 1 000 000+
Regions: eu-west-1, us-east-1, ap-northeast-1
Architecture specifics
• Micro-services architecture based on Go and Java
• Seamless service discovery, service-to-service communication, monitoring and instrumentation
• Everything is automated
• Ability to scale services up and down based on demand
AWS specifics
Route 53 ELB S3 Auto Scaling CloudFront Redshift
[Architecture diagram: three regions (eu-west-1, us-east-1, ap-northeast-1). Each region runs a Proxy Layer, Go Services, Java Services, a Message Bus+, a Distributed Queue+ and C*.]
Challenges
• Hard to develop new features
    Completing new features in days, not months
• Painful to push code changes
    Seamless service deployment and ability to run multiple versions of a service
• Adding new instances and adding more capacity is slow
    Our servers scale up and down based on demand
• Unreliable and slow failover procedures
    Automated reaping of misbehaving services and AZ failover
• SPOF
    Fault-tolerant distributed services architecture
Infrastructure operating cost – a very important KPI
Platform design patterns and challenges
Orchestration Layer Overview
• External orchestration services responsible for all environments
• Internal orchestration services responsible for the local environment only
External Orchestration Layer under the hood
• The external orchestration layer is built on the same platform and shares its distributed design, scalability and resiliency
• Each external orchestration service instance has a “global” view of our infrastructure
• Relies heavily on STS to operate across different accounts and regions
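A minimal sketch in Go of how an orchestration service might build that "global" view by assuming one role per account. This is an illustration under assumptions, not Hailo's implementation: the `RoleAssumer` interface stands in for the real STS AssumeRole call, and the environment names, account IDs, role name and `fakeSTS` type are all hypothetical. Only the IAM role ARN format is taken from AWS.

```go
package main

import (
	"errors"
	"fmt"
)

// Environment is a hypothetical description of one deployment target.
type Environment struct{ Name, AccountID, Region string }

// Credentials models the temporary credentials an assume-role call returns.
type Credentials struct {
	AccessKeyID  string
	SessionToken string
}

// RoleAssumer abstracts the STS AssumeRole operation (assumed interface,
// so the sketch runs without the AWS SDK or network access).
type RoleAssumer interface {
	AssumeRole(roleARN, sessionName string) (Credentials, error)
}

// RoleARN builds an IAM role ARN for the given account.
func RoleARN(accountID, roleName string) string {
	return fmt.Sprintf("arn:aws:iam::%s:role/%s", accountID, roleName)
}

// GlobalView assumes the same role in every environment and returns the
// temporary credentials keyed by environment name.
func GlobalView(sts RoleAssumer, envs []Environment, roleName string) (map[string]Credentials, error) {
	view := make(map[string]Credentials, len(envs))
	for _, env := range envs {
		creds, err := sts.AssumeRole(RoleARN(env.AccountID, roleName), "orchestrator-"+env.Name)
		if err != nil {
			return nil, fmt.Errorf("assume role in %s (%s): %w", env.Name, env.Region, err)
		}
		view[env.Name] = creds
	}
	return view, nil
}

// fakeSTS is an in-memory stand-in used only to exercise the sketch.
type fakeSTS struct{ denied map[string]bool }

func (f fakeSTS) AssumeRole(roleARN, sessionName string) (Credentials, error) {
	if f.denied[roleARN] {
		return Credentials{}, errors.New("access denied")
	}
	return Credentials{AccessKeyID: "FAKE-" + sessionName, SessionToken: "token-for-" + roleARN}, nil
}

func main() {
	envs := []Environment{
		{"london", "111111111111", "eu-west-1"},
		{"new-york", "222222222222", "us-east-1"},
	}
	view, err := GlobalView(fakeSTS{}, envs, "orchestrator")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(view), "environments reachable")
}
```

Keeping the assume-role call behind an interface means the same orchestration logic can be tested without credentials and pointed at the real service later.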
Inside an environment: Auto Scaling and service provisioning
Challenges
• Increased operational and deployment complexity – requires constant service resource utilization monitoring and manual shuffling
• Risk of performance impact due to “noisy neighbours”
• Suboptimal resource management
Micro-services + Containers + Scheduling
Challenges
• Increased operational and deployment complexity – requires constant service resource utilization monitoring and manual shuffling
    On-demand infrastructure resources and services provisioning based on SLA
• Risk of performance impact due to “noisy neighbours”
    Each service is isolated from the rest
• Suboptimal resource management
    Services are grouped together in the most optimal way. We expect to cut the operational cost of our worker services by up to 30% once we roll out this solution
Micro-services + Containers + Scheduling on AWS will be a dominant architecture pattern in the next few years
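One way to picture the grouping of services onto hosts is as a bin-packing problem. The Go sketch below uses the first-fit decreasing heuristic, which is a swapped-in textbook technique and not necessarily what Hailo's scheduler does; it approximates a good packing and does not guarantee the optimal one. The `Service` and `Host` types, the single CPU dimension and the capacity figures are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical containerised service with one resource request.
type Service struct {
	Name string
	CPU  int // requested CPU units (illustrative unit)
}

// Host tracks remaining capacity and the services placed on it.
type Host struct {
	Free     int
	Services []string
}

// Pack places services on hosts using first-fit decreasing: the largest
// requests go first, each onto the first host with enough free capacity.
func Pack(services []Service, hostCapacity int) ([]Host, error) {
	sorted := append([]Service(nil), services...) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].CPU > sorted[j].CPU })

	var hosts []Host
	for _, s := range sorted {
		if s.CPU > hostCapacity {
			return nil, fmt.Errorf("service %s needs %d units, host capacity is %d", s.Name, s.CPU, hostCapacity)
		}
		placed := false
		for i := range hosts {
			if hosts[i].Free >= s.CPU {
				hosts[i].Free -= s.CPU
				hosts[i].Services = append(hosts[i].Services, s.Name)
				placed = true
				break
			}
		}
		if !placed {
			// No existing host fits, so provision a new one on demand.
			hosts = append(hosts, Host{Free: hostCapacity - s.CPU, Services: []string{s.Name}})
		}
	}
	return hosts, nil
}

func main() {
	services := []Service{{"a", 2}, {"b", 6}, {"c", 4}, {"d", 5}, {"e", 3}}
	hosts, err := Pack(services, 10)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, h := range hosts {
		fmt.Printf("host %d: %v (free %d)\n", i+1, h.Services, h.Free)
	}
}
```

A real scheduler would also weigh memory, SLA and placement across availability zones; the sketch covers only the cost side, where fewer, fuller hosts mean less idle capacity.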
Tooling
Because all resources are ephemeral and will fail…
A holistic view of the platform
Service level health checks
Reliable and repeatable service provisioning
Everything is an event stream
Platform events count as well!
Still “things” will fail in mysterious ways
Circuit breakers and graceful degradation when things go wrong
Thank you, any questions?
@nathariel
boyan@hailocab.com

Editor's Notes

  • #4  Seamless user experience
  • #17 This solution can operate across business boundaries and is vendor agnostic