Bizo Dev Blog Java/Scala and Highly Scalable Systems on AWS http://dev.bizo.com Setting up Discourse on AWS alexkuang Mon, 30 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/discourse-on-aws.html http://dev.bizo.com/2014/06/discourse-on-aws.html <p>The Bizo dev team recently decided to experiment with using <a href="http://www.discourse.org/">discourse</a> as a discussion forum. Discourse has published an <a href="https://github.com/discourse/discourse/blob/master/docs/INSTALL-digital-ocean.md">installation guide</a> for a single-machine setup using DigitalOcean, but we decided to deploy on AWS instead.</p> <h2>Why AWS?</h2> <p>While the default installation is fast and easy to implement, it does leave the web server running in the same box as all the important data: postgres, redis, file uploads. This means that if the server location were, hypothetically, invaded by a troop of angry hammer-wielding monkeys, there would be much downtime and data loss involved.</p> <p>As proper monkey-fearing developers, we prefer to use <a href="http://martinfowler.com/bliki/ImmutableServer.html">immutable servers</a> whenever possible. By using immutable servers and keeping data in dedicated AWS services, we can treat them as disposable and essentially expect them to be torn down and rebuilt on a regular basis. This provides incentive to keep configuration simple, which makes automation easier, and in turn leads to less time fussing over things like server state and maintaining exact run-time configuration.</p> <p>We prefer sticking to this approach even for experimental apps like discourse. 
In this case, we used EC2 for the webserver, RDS for the postgres database, ElastiCache for the redis store, and S3 for file uploads.</p> <h2>Setting up AWS Services</h2> <p>Setting up AWS services is generally pretty straighforward, but there are a few things to keep in mind.</p> <h3>Security Groups</h3> <p>Security groups are <em>always</em> a good idea, but they are required for RDS and ElastiCache to communicate with EC2 instances. Make sure to allow ssh (port 22) and http (ports 80 + 443), and give it a memorable name like <code>discourse-prod</code>.</p> <h3>RDS, ElastiCache</h3> <p>Again, the only catch here is to make sure that the security group attached to these instances is the same one that the EC2 instance is attached to. Otherwise, the web server won&#39;t be able to talk to either of them. Make sure to note the hostnames, auth credentials, etc here for use in configuring discourse.</p> <h3>EC2</h3> <p>The only officially supported install method for discourse is via Docker, which requires choosing specific versions of Ubuntu for the EC2 instance. See the <a href="http://docs.docker.com/installation/ubuntulinux/">docker documentation</a> for more details.</p> <h2>Discourse Config</h2> <p>Since we&#39;re not doing a standalone deploy, we based our discourse docker config on the <a href="https://github.com/discourse/discourse_docker/blob/master/samples/web_only.yml">web-only</a> example that discourse provides. 
The final config looked something like this:</p> <div class="highlight"><pre><code class="yaml language-yaml" data-lang="yaml"><span class="l-Scalar-Plain">templates</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="s">&quot;templates/sshd.template.yml&quot;</span> <span class="p-Indicator">-</span> <span class="s">&quot;templates/web.template.yml&quot;</span> <span class="l-Scalar-Plain">expose</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="s">&quot;80:80&quot;</span> <span class="p-Indicator">-</span> <span class="s">&quot;2222:22&quot;</span> <span class="l-Scalar-Plain">params</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">version</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">HEAD</span> <span class="l-Scalar-Plain">env</span><span class="p-Indicator">:</span> <span class="c1"># Creating an account with a developer email will automatically give it </span> <span class="c1"># admin access to the site for setup</span> <span class="l-Scalar-Plain">DISCOURSE_DEVELOPER_EMAILS</span><span class="p-Indicator">:</span> <span class="s">&#39;email_address@company.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_HOSTNAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DOMAIN_FOR_DISCOURSE_SITE.com&#39;</span> <span class="c1"># Enter info from RDS here</span> <span class="l-Scalar-Plain">DISCOURSE_DB_SOCKET</span><span class="p-Indicator">:</span> <span class="s">&#39;&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_HOST</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_INSTANCE_ID.REGION.rds.amazonaws.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_PORT</span><span class="p-Indicator">:</span> <span class="s">&#39;5432&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_USERNAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_USER&#39;</span> <span 
class="l-Scalar-Plain">DISCOURSE_DB_PASSWORD</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_PASSWORD&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_NAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_NAME&#39;</span> <span class="c1"># Enter info from elasticache here</span> <span class="l-Scalar-Plain">DISCOURSE_REDIS_HOST</span><span class="p-Indicator">:</span> <span class="s">&#39;REDIS_INSTANCE.cache.amazonaws.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_REDIS_PORT</span><span class="p-Indicator">:</span> <span class="s">&#39;6379&#39;</span> <span class="c1"># Amazon SES can be used for SMTP, or even gmail for lower volumes</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_ADDRESS</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_SERVER</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_PORT</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_PORT</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_USER_NAME</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_USER</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_PASSWORD</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_PASSWORD</span> <span class="l-Scalar-Plain">volumes</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">volume</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">host</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">/var/docker/shared</span> <span class="l-Scalar-Plain">guest</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">/shared</span> <span class="c1"># you may use the docker manager to upgrade and monitor your docker image</span> <span class="c1"># UI will be visible at http://yoursite.com/admin/docker</span> <span class="l-Scalar-Plain">hooks</span><span class="p-Indicator">:</span> <span class="c1"># you may import your key 
using launchpad if needed</span> <span class="c1">#after_sshd:</span> <span class="c1"># - exec: ssh-import-id some-user</span> <span class="l-Scalar-Plain">after_code</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">exec</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">cd</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">$home/plugins</span> <span class="l-Scalar-Plain">cmd</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">mkdir -p plugins</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">git clone https://github.com/discourse/docker_manager.git</span> </code></pre></div> <h2>Bootstrapping EC2 instances</h2> <p>We have existing infrastructure that will spin up new instances with applicable settings such as security groups, auto scaling group, and load balancers. A simple shell script sets up the internals on the instance itself. First, installing docker and the docker container infrastructure for discourse:</p> <div class="highlight"><pre><code class="bash language-bash" data-lang="bash"><span class="c"># This is taken almost verbatim from the discourse installation guide</span> <span class="o">(</span>wget -qO- https://get.docker.io/ <span class="p">|</span> bash<span class="o">)</span> &gt; docker_install.log 2&gt;<span class="p">&amp;</span>1 install -g docker -m 2775 -d /var/docker git clone https://github.com/discourse/discourse_docker.git /var/docker </code></pre></div> <p>At the time of this writing, some of the provided install scripts do not play very nicely with Ubuntu&#39;s default <code>dash</code> so <code>bash</code> is used explicitly.</p> <p>Next, we customize the deploy. In addition to dropping in our custom config, we also generate an ad-hoc ssh key to feed into the docker container. 
This is because the discourse app runs entirely inside the container, so any sort of direct interaction with it requires the user to interact with the container itself, rather than just the server running the container. Conveniently, discourse sets up a sshd from <code>templates/ssh.template.yml</code> and will automatically load the current user&#39;s key from <code>~/.ssh/id_rsa</code> into the container. This part of the script makes sure that for each deploy there is a fresh key that is completely separate from any other (potentially important/sensitive) keys that might be living in the server.</p> <div class="highlight"><pre><code class="bash language-bash" data-lang="bash">cp config/bizo-discourse.yml /var/docker/containers/app.yml <span class="c"># Make sure to back up anything that might be in the existing id_rsa! Ideally</span> <span class="c"># the install is running as its own special user.</span> <span class="o">(</span>mkdir -p ssh-key <span class="o">&amp;&amp;</span> <span class="nb">cd </span>ssh-key <span class="o">&amp;&amp;</span> ssh-keygen -f id_rsa -t rsa -N <span class="s1">&#39;&#39;</span><span class="o">)</span> <span class="o">(</span>mkdir -p ~/.ssh <span class="o">&amp;&amp;</span> cp ssh-key/* ~/.ssh<span class="o">)</span> </code></pre></div> <p>And finally, to bootstrap the container and start the actual discourse app inside it:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">bash /var/docker/launcher bootstrap app &gt; discourse_bootstrap.log 2&gt;&amp;1 bash /var/docker/launcher start app &gt; discourse_run.log 2&gt;&amp;1 &amp; </code></pre></div> <h2>S3 Uploads</h2> <p>The final step after getting the app set up is to get file uploads on S3, which is explained in <a href="https://meta.discourse.org/t/setting-up-file-and-image-uploads-to-s3/7229">this guide</a>. However, this might break existing user avatars, which the server still expects to find in their old location. 
(Note that this also applies to gravatars since discourse will download and cache them by default.) The fix is simple: ssh into the container and run <code>bundle exec rake avatars:refresh</code>. This will re-download avatars and update the users in the database accordingly.</p> <h2>Maintenance</h2> <p>With this setup, maintenance becomes extremely easy. Server crashes and software upgrades can be handled by spinning up new instances running the bootstrap shell script. Scaling is also as easy as spinning up another instance and putting everything behind a load balancer. Forum backups and data recovery can be handled easily with Amazon&#39;s native tools. At the end of the day, all this adds up to less time wasted on maintaining forum software and more time working on business-critical apps.</p> Why Scrum is great, but Kanban makes more sense to us matthieu Fri, 27 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/why-scrum-is-great.html http://dev.bizo.com/2014/06/why-scrum-is-great.html <p>In a previous blog post, <a href="http://localhost:4000/2014/04/extreme-programming-for-modern-start-ups.html">Extreme Programming For Modern Start-ups</a>, Pat explained from his extensive experience with XP agile methodology why XP isn&#39;t always flexible enough for our work organization here at Bizo. My own past experience is with Scrum, the other popular Agile methodology. In this post I want to show where Scrum shines, and why Kanban finally wins for the kind of work we do.</p> <p><em>- Scrum is great but you don&#39;t use it at Bizo. Why not?</em></p> <p>I joined Bizo six months ago after working for 5 years for a software company which uses Scrum for all its projects. This means my previous work experience with Scrum is still quite fresh while I now have a good idea of how things work here at Bizo. Both Scrum and Kanban focus on almost orthogonal aspects of Agile. 
If you search the Internet for scrumban or scrum-ban, you will find people who thought about mixing the two. But generally you would do one or the other, mainly because of a difference in spirit between these two methodologies.</p> <p>Scrum focuses on making people work well together towards building a relevant product. Scrum is human to its core. It takes our inherent social weaknesses and strengths into account. The planning meeting and the demo at the end of a sprint provide focus on short term deadlines. The stand-ups and retrospective keep the communication channels always open. The Product Owner is the point of truth for what goes in the product, and the Scrum Master watches and organizes the backlog. Both of these people protect the team against external influences by always keeping the focus on the current sprint. In my experience Scrum is amazingly good at making people work well together. And it&#39;s also a very satisfying way to work, within some constraints.</p> <p>To be a good fit for Scrum, your project must:</p> <ul> <li><strong>have a 5 to 7 person team.</strong> Fewer people, and all this communication will feel like a useless overhead. More people, and the communication will take too long, with team members losing interest.</li> <li><strong>target one application or system.</strong> You shouldn&#39;t have a team of people working on different subjects. 
The tight communication channels in the team (planning meetings, standups, demos, retrospective) become pointless if each team member can&#39;t relate to what the other members do.</li> <li><strong>have work that can be done in parallel for everyone in the team.</strong> You don&#39;t want to have idle people waiting for each other.</li> <li><strong>be able to protect the team members from interruptions.</strong> Other projects members and stakeholders might want to interact with the team members for their own interest.</li> </ul> <p>In my previous company, I had the privilege to work with Scrum in such an environment. When the context fits the constraints, the Scrum team creates interactions that fit perfectly with the way our human brain works and socializes. But even in the company I worked for, this context didn&#39;t happen that often. When you take into account reality and modern software development, you actually need more flexibility.</p> <p>At Bizo, we have close to 400 git repos maintained by an engineering team of around 25 people. All of these projects contain code executed somewhere in our services hosted in the AWS cloud. By choice, there is no Operations Team at Bizo. This means we, developers, care for our app from fleshing out the design, to its deployment and monitoring. We have great tooling to help making deployment and monitoring a small amount of our time. But these tools help dealing with interruptions, they don&#39;t avoid them. So you might have guessed, Scrum is not a good fit for the projects we have here at Bizo.</p> <p><em>- Ok, so you guys at Bizo use Kanban. Why Kanban?</em></p> <p>Kanban focuses on having tasks flow from inception to completion in the most efficient fashion. Human interaction is completely left to the team&#39;s culture (quick plug for our <a href="http://dev.bizo.com/culture/index.html">culture guide</a>). 
Kanban focuses on monitoring the task flow while trusting the organization to find the best way to optimize this flow. Optimizing the flow can mean adding a certain skill to the team, automating one particular slow task, or telling your marketing team to change priorities. Anything really. And that&#39;s what&#39;s great about Kanban, you&#39;re put in front of your own responsibilities. Kanban shows you that tasks don&#39;t flow? Your job is to find a way to make it work.</p> <p>At Bizo, we have an architecture of many loosely coupled services and applications. We generally have small projects involving 1 to 7 people, with deployments that could happen from several times a week to once in a blue moon. Allowing for such variations in projects is great for many reasons, but it can create potential bottlenecks, or endless work in progress. To avoid this, you need monitoring, smart scheduling and flexible organizations. You also need visibility across all the small projects. Kanban gives us the visibility and doesn&#39;t get in our way when we need to change our organization.</p> <p><em>- So to sum up, Scrum for the perfect world, Kanban for the win.</em></p> <p>Scrum gives you a full framework to do your project. From the size of your teams to all the meetings you&#39;ll need, you get a full fledged target for your project organization. In my experience, if you can reach the target, or at least get close to it, it works amazingly well. Now Kanban doesn&#39;t tell you what to put in your task, or how you should perform them. It just tells how you should schedule and monitor these tasks, you then have to find the organization that works best for your project. 
This makes Kanban fit for a much wider range of projects.</p> <p>If you&#39;re interested in learning more about Scrum or Kanban:</p> <ul> <li><a href="http://www.mountaingoatsoftware.com/presentations/an-introduction-to-scrum">An introduction to Scrum by Mike Cohn</a> (then click the &quot;View Presentation&quot; button)</li> <li><a href="https://help.rallydev.com/what-is-kanban">A nice write-up of Kanban</a></li> </ul> Executors.newCachedThreadPool() considered harmful boisvert Tue, 24 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html <p>This is more of a public-service announcement blog-post...</p> <p>That’s right, <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool%28%29">Executors.newCachedThreadPool()</a> isn&#39;t a great choice for server code that&#39;s servicing multiple clients and concurrent requests.</p> <p>Why? There are basically two (related) problems with it:</p> <p>1) It&#39;s <em>unbounded</em>, which means that you&#39;re opening the door for anyone to cripple your JVM by simply injecting more work into the service (DoS attack). Threads consume a non-negligible amount of memory and also increase memory consumption based on their work-in-progress, so it&#39;s quite easy to topple a server this way (unless you have other circuit-breakers in place).</p> <p>2) The unbounded problem is exacerbated by the fact that the Executor is fronted by a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/SynchronousQueue.html">SynchronousQueue</a> which means there&#39;s a direct handoff between the task-giver and the thread pool. Each new task will create a new thread if all existing threads are busy. This is generally a bad strategy for server code. When the CPU gets saturated, existing tasks take longer to finish. 
Yet more tasks are being submitted and more threads created, so tasks take longer and longer to complete. When the CPU is saturated, more threads is definitely not what the server needs.</p> <p>Here are my recommendations:</p> <ul> <li><p>Use a fixed-size thread pool (<a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newFixedThreadPool%28int,%20java.util.concurrent.ThreadFactory%29">Executors.newFixedThreadPool</a>) or a <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html">ThreadPoolExecutor</a> with a set maximum number of threads;</p></li> <li><p>Figure out how many threads you need to keep each CPU core saturated (but not too much). A good rule of thumb is to start around 10 threads per core for a workload that involves disk and network I/O, and see if it’s sufficient to keep all CPUs busy. Increase or decrease the number of maximum threads based on results obtained from load testing. Multiply this factor by <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#availableProcessors%28%29">Runtime.getRuntime.availableProcessors</a> to obtain the total size of the thread pool.</p></li> <li><p>I generally don&#39;t recommend queueing within a server&#39;s main request-processing loop. If you need queuing behavior, use a fixed-size <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ArrayBlockingQueue.html">ArrayBlockingQueue</a> -- it&#39;s more efficient than <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/LinkedBlockingQueue.html">LinkedBlockingQueue</a>. Use only if you absolutely need it. 
A system with queuing is more difficult to understand and tune.</p></li> <li><p>The default <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/RejectedExecutionHandler.html">RejectedExecutionHandler</a> is <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.AbortPolicy.html">AbortPolicy</a>, which throws a runtime <code>RejectedExecutionException</code> when the queue (if any) or the thread pool reach maximum capacity. This is not a bad default but I generally want to guarantee execution in spite of this, so I advise using the <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.CallerRunsPolicy.html">CallerRunsPolicy</a> instead. The <code>CallerRunsPolicy</code> also helps to avoid deadlocks (many threads blocking on each other) in the case where tasks may have internal dependencies.</p></li> <li><p>Always provide your own <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadFactory.html">ThreadFactory</a> so all your threads are named appropriately and have the daemon flag set. Your thread pool shouldn&#39;t keep the application alive. 
That&#39;s the responsibility of the application itself (i.e., main thread).</p></li> </ul> <p>The result of all this should look similar to this (in Scala syntax):</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"> <span class="k">object</span> <span class="nc">ThreadPoolHelpers</span> <span class="o">{</span> <span class="k">private</span> <span class="k">def</span> <span class="n">cpus</span> <span class="k">=</span> <span class="nc">Runtime</span><span class="o">.</span><span class="n">getRuntime</span><span class="o">.</span><span class="n">availableProcessors</span> <span class="k">def</span> <span class="n">daemonThreadFactory</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ThreadFactory</span> <span class="o">{</span> <span class="k">private</span> <span class="k">val</span> <span class="n">count</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">AtomicInteger</span><span class="o">()</span> <span class="k">override</span> <span class="k">def</span> <span class="n">newThread</span><span class="o">(</span><span class="n">r</span><span class="k">:</span> <span class="kt">Runnable</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">thread</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(</span><span class="n">r</span><span class="o">)</span> <span class="n">thread</span><span class="o">.</span><span class="n">setName</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;$name-${count.incrementAndGet}&quot;</span><span class="o">)</span> <span class="n">thread</span><span class="o">.</span><span class="n">setDaemon</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> 
<span class="n">thread</span> <span class="o">}</span> <span class="o">}</span> <span class="cm">/** Core thread pool to be used for concurrent request-processing */</span> <span class="k">def</span> <span class="n">newCoreThreadPool</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="c1">// use direct handoff (SynchronousQueue) + CallerRunsPolicy to avoid deadlocks</span> <span class="c1">// since tasks may have internal dependencies.</span> <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ThreadPoolExecutor</span><span class="o">(</span> <span class="cm">/* core size */</span> <span class="o">(</span><span class="mi">5</span> <span class="o">*</span> <span class="n">cpus</span><span class="o">),</span> <span class="cm">/* max size */</span> <span class="o">(</span><span class="mi">15</span> <span class="o">*</span> <span class="n">cpus</span><span class="o">),</span> <span class="cm">/* idle timeout */</span> <span class="mi">60</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="nc">SECONDS</span><span class="o">,</span> <span class="k">new</span> <span class="nc">SynchronousQueue</span><span class="o">[</span><span class="kt">Runnable</span><span class="o">](),</span> <span class="n">daemonThreadFactory</span><span class="o">(</span><span class="n">name</span><span class="o">)</span> <span class="o">)</span> <span class="n">pool</span><span class="o">.</span><span class="n">setRejectedExecutionHandler</span><span class="o">(</span><span class="k">new</span> <span class="nc">ThreadPoolExecutor</span><span class="o">.</span><span class="nc">CallerRunsPolicy</span><span class="o">)</span> <span class="n">pool</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <h2>About 
that ScheduledExecutorService ...</h2> <p>While we&#39;re at it, one last piece of advice is to separate request-processing threads from scheduled task execution. It’s often tempting to share a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledExecutorService.html">ScheduledExecutorService</a> for both request-processing and scheduled tasks but I recommend using separate thread pools for two reasons:</p> <p>1) separation allows you to balance the relative processing capacity of your request-processing workload vs scheduled tasks. In particular, if you use scheduled tasks to act as timeouts for incoming requests (e.g. if you are using Futures that get queued into your main thread pool), the separation can have a substantial impact on the timeliness of your timeouts.</p> <p>2) <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html">ThreadPoolExecutor</a> is more efficient for request-processing since it does not have to order incoming tasks with respect to other scheduled tasks. This is especially true if you are using a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/SynchronousQueue.html">SynchronousQueue</a> and requests are dispatched to threads in a straight-through fashion (no queuing).</p> <p>I hope this advice is relevant to you and your server code. If not, please drop me a line!</p> Exporting Formatted Code Snippets from Emacs rayo Fri, 13 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/exporting-formatted-code-snippets-from-emacs.html http://dev.bizo.com/2014/06/exporting-formatted-code-snippets-from-emacs.html <p>I wanted a way to create formatted code snippets (syntax highlighting, etc.) for pasting into GMail or Google Slides. The catch was I didn&#39;t want to use a web service (e.g. <a href="http://pastie.org/">pastie</a>, <a href="http://hilite.me/">hilite.me</a>) or switch from Emacs to another text editor (e.g. 
<a href="http://www.sublimetext.com/">Sublime</a> with the <a href="https://github.com/n1k0/SublimeHighlight">SublimeHighlight plugin</a>).</p> <h2>Learning from Markdown-Mode</h2> <p>I remembered that Emacs <a href="http://www.emacswiki.org/emacs/MarkdownMode">markdown-mode</a> lets you preview, in a browser, the Markdown text rendered as HTML.</p> <p>Basically, when you install markdown-mode, you also install a Markdown executable like <a href="http://pythonhosted.org/Markdown/">Python-Markdown</a>. When you invoke <code>markdown-preview</code> in Emacs, markdown-mode runs the Markdown executable in a shell. The input is the Markdown text in the current Emacs region, and the HTML output goes to a temporary Emacs buffer. The buffer&#39;s contents, in turn, are opened in a browser by the Emacs command <code>browse-url-of-buffer</code>.</p> <h2>Pygments to the Rescue</h2> <p>We could use a similar strategy to markdown-mode&#39;s for exporting formatted code snippets. We just need a command-line tool that will receive code from <em>stdin</em> and send it formatted to <em>stdout</em>.</p> <p><a href="http://pygments.org/">Pygments</a> fits the bill. Pygments is a syntax higlighter written in Python, with many lexers and formatting options. It can be installed using <code>pip</code> (e.g. <code>pip install pygments</code>).</p> <h2>Lisp Isn&#39;t So Painful</h2> <p>After installing Pygments, you get a script called <code>pygmentize</code>. Of course to call it from Emacs, you&#39;ll need some Lisp. 
Here&#39;s a helper that constructs the desired invocation:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">26</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html-command</span> (<span style="color: #aa0000">beginning-line-number</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">27</span> (<span style="color: #0000aa">let</span> ((<span style="color: #aa0000">lexer</span> (<span style="color: #00aaaa">gethash</span> <span style="color: #aa0000">major-mode</span> <span style="color: #aa0000">pygmentize-lexers</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">28</span> (<span style="color: #aa0000">linenostart</span> (<span style="color: #aa0000">number-to-string</span> <span style="color: #aa0000">beginning-line-number</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">29</span> (<span style="color: #0000aa">if</span> (<span style="color: #00aaaa">not</span> <span style="color: #aa0000">lexer</span>) (<span style="color: #00aaaa">error</span> (<span style="color: #aa0000">concat</span> <span style="color: #aa5500">&quot;error: no lexer for &quot;</span> (<span style="color: #00aaaa">symbol-name</span> <span style="color: #aa0000">major-mode</span>)))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">30</span> (<span style="color: #aa0000">concat</span> <span style="color: #aa5500">&quot;pygmentize -f html&quot;</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">31</span> <span style="color: #aa5500">&quot; -l &quot;</span> <span style="color: #aa0000">lexer</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">32</span> <span style="color: #aa5500">&quot; -O style=autumn,linenos=inline,noclasses=true,linenostart=&quot;</span> <span style="color: #aa0000">linenostart</span>))) </pre></div> <p>A lookup table maps Emacs 
major-modes to their lexer argument for <code>pygmentize</code>:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">16</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">pygmentize-lexers</span> (<span style="color: #00aaaa">make-hash-table</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">17</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;emacs-lisp-mode</span> <span style="color: #aa5500">&quot;common-lisp&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">18</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;scala-mode</span> <span style="color: #aa5500">&quot;scala&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">19</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;java-mode</span> <span style="color: #aa5500">&quot;java&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">20</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;ruby-mode</span> <span style="color: #aa5500">&quot;rb&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">21</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;python-mode</span> <span style="color: #aa5500">&quot;py&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">22</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;sh-mode</span> <span style="color: #aa5500">&quot;sh&quot;</span> <span 
style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">23</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;diff-mode</span> <span style="color: #aa5500">&quot;diff&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) </pre></div> <p>The Emacs clip region is passed via <em>stdin</em> to <code>pygmentize</code>, with <em>stdout</em> redirected to a temporary buffer:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">38</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html</span> (<span style="color: #0000aa">&amp;optional</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">39</span> (<span style="color: #aa0000">interactive</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">40</span> (<span style="color: #aa0000">save-window-excursion</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">41</span> (<span style="color: #00aaaa">unless</span> <span style="color: #aa0000">output-buffer-name</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">42</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">output-buffer-name</span> <span style="color: #aa0000">pygmentize-html-output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">43</span> (<span style="color: #0000aa">let*</span> ((<span style="color: #aa0000">beginning-line-number</span> (<span style="color: #aa0000">line-number-at-pos</span> (<span style="color: #aa0000">region-beginning</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">44</span> (<span style="color: #aa0000">shell-command</span> (<span style="color: #aa0000">pygmentize-html-command</span> <span style="color: 
#aa0000">beginning-line-number</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">45</span> (<span style="color: #aa0000">shell-command-on-region</span> (<span style="color: #aa0000">region-beginning</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">46</span> (<span style="color: #aa0000">region-end</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">47</span> <span style="color: #aa0000">shell-command</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">48</span> <span style="color: #aa0000">output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">49</span> (<span style="color: #aa0000">switch-to-buffer</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">50</span> (<span style="color: #aa0000">clipboard-kill-ring-save</span> (<span style="color: #aa0000">point-min</span>) (<span style="color: #aa0000">point-max</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">51</span> <span style="color: #aa0000">output-buffer-name</span>)) </pre></div> <p>The temporary buffer&#39;s syntax-highlighted, HTML-formatted contents will be opened in a browser, using Emacs&#39;s <code>browse-url-of-buffer</code> function:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">53</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html-preview</span> (<span style="color: #0000aa">&amp;optional</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">54</span> (<span style="color: #aa0000">interactive</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">55</span> (<span style="color: #00aaaa">unless</span> <span style="color: #aa0000">output-buffer-name</span> <span 
style="background-color: #f0f0f0; padding: 0 5px 0 5px">56</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">output-buffer-name</span> <span style="color: #aa0000">pygmentize-html-output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">57</span> (<span style="color: #aa0000">browse-url-of-buffer</span> (<span style="color: #aa0000">pygmentize-html</span> <span style="color: #aa0000">output-buffer-name</span>))) </pre></div> <p>Gotta have a keybinding:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">59</span> (<span style="color: #aa0000">global-set-key</span> (<span style="color: #aa0000">kbd</span> <span style="color: #aa5500">&quot;C-c h p&quot;</span>) <span style="color: #0000aa">&#39;pygmentize-html-preview</span>) </pre></div> <h2>Exporting the Snippets</h2> <p><a href="https://gist.github.com/rayortigas/7c5ed449fcfafcf6851d">This gist</a> contains the full Emacs-Lisp code. 
(And it easily embeds into a <code>.emacs</code> file.)</p> <p>To export a snippet, highlight some code, and invoke <code>pygmentize-html-preview</code>:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/emacs.png" alt="emacs"></p> <p>In the launched browser window, select all and copy:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/copy.png" alt="copy"></p> <p>Paste into GMail, Google Slides, or anything that will accept formatted HTML:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/paste.png" alt="paste"></p> Bulk Editing Jenkins Configurations dietz Wed, 23 Apr 2014 00:00:00 +0000 http://dev.bizo.com/2014/04/bulk-editing-jenkins-configurations.html http://dev.bizo.com/2014/04/bulk-editing-jenkins-configurations.html <p>During our <a href="http://dev.bizo.com/2013/08/scm-migration.html">SCM migration</a>, we had to update the configurations of hundreds of jenkins jobs on three different instances. Rather than do that manually, we wrote this <a href="https://github.com/mdietz198/bulk-edit-jenkins-config">ruby snippet</a> (edit_config.rb):</p> <div class="highlight"><pre><code class="ruby language-ruby" data-lang="ruby"><span class="nb">require</span> <span class="s1">&#39;fileutils&#39;</span> <span class="nb">require</span> <span class="s1">&#39;optparse&#39;</span> <span class="nb">require</span> <span class="s1">&#39;nokogiri&#39;</span> <span class="c1"># Accepts a block that takes an Nokogiri XML doc as a parameter</span> <span class="c1"># and modifies the XML doc as desired.</span> <span class="k">def</span> <span class="nf">apply_to_configs</span> <span class="p">(</span><span class="n">root_directory</span><span class="p">,</span> <span class="n">orig_file_extension</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">block</span><span class="p">)</span> <span class="n">config_files</span> <span class="o">=</span> <span class="no">Dir</span><span 
class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">#{</span><span class="n">root_directory</span><span class="si">}</span><span class="s2">/**/config.xml&quot;</span><span class="p">)</span> <span class="n">config_files</span><span class="o">.</span><span class="n">each</span> <span class="p">{</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">doc</span> <span class="o">=</span> <span class="no">Nokogiri</span><span class="o">::</span><span class="no">XML</span><span class="o">::</span><span class="no">Document</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">config</span><span class="p">))</span> <span class="k">do</span> <span class="o">|</span><span class="n">options</span><span class="o">|</span> <span class="n">options</span><span class="o">.</span><span class="n">default_xml</span><span class="o">.</span><span class="n">noblanks</span> <span class="c1"># ignore initial whitespace</span> <span class="k">end</span> <span class="n">project</span> <span class="o">=</span> <span class="n">doc</span><span class="o">.</span><span class="n">at_xpath</span><span class="p">(</span><span class="s2">&quot;//project&quot;</span><span class="p">)</span> <span class="k">next</span> <span class="k">if</span> <span class="n">project</span><span class="o">.</span><span class="n">nil?</span> <span class="no">FileUtils</span><span class="o">.</span><span class="n">cp</span> <span class="n">config</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">#{</span><span class="n">config</span><span class="si">}</span><span class="s2">.</span><span class="si">#{</span><span class="n">orig_file_extension</span><span class="si">}</span><span class="s2">&quot;</span> <span class="k">case</span> <span 
class="n">block</span><span class="o">.</span><span class="n">arity</span> <span class="k">when</span> <span class="mi">1</span> <span class="n">new_doc</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">doc</span> <span class="k">when</span> <span class="mi">2</span> <span class="n">new_doc</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">doc</span><span class="p">,</span> <span class="n">config</span> <span class="k">end</span> <span class="no">File</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">#{</span><span class="n">config</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="p">{</span><span class="o">|</span><span class="n">f</span><span class="o">|</span> <span class="n">doc</span><span class="o">.</span><span class="n">write_xml_to</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="ss">:indent</span> <span class="o">=&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="p">}</span> <span class="p">}</span> <span class="k">end</span> </code></pre></div> <p>This function updates each config.xml by following these steps:</p> <ol> <li>Find all config.xml files in any subdirectory of a root directory</li> <li>Make a backup copy of the file with an additional extension specified by the <code>orig_file_extension</code> parameter</li> <li>Pass the contents of the config.xml as a nokogiri document object to the passed-in block</li> <li>The block modifies the document object in whatever way desired</li> <li>Overwrite config.xml with the updated contents of the XML document.</li> </ol> <p>Here&#39;s an example using <code>apply_to_configs</code> to set the checkbox &quot;Send separate e-mails to individuals who broke the build&quot; on every job with a post-build step to send failure emails.</p>
<div class="highlight"><pre><code class="ruby language-ruby" data-lang="ruby"><span class="nb">require</span> <span class="s1">&#39;nokogiri&#39;</span> <span class="n">require_relative</span> <span class="s1">&#39;./edit_config&#39;</span> <span class="n">apply_to_configs</span><span class="p">(</span><span class="vg">$local_jobs_root</span><span class="p">,</span> <span class="vg">$orig_file_extension</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config_doc</span><span class="p">,</span> <span class="n">config_path</span><span class="o">|</span> <span class="n">job_name</span> <span class="o">=</span> <span class="n">config_path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&quot;/&quot;</span><span class="p">)</span><span class="o">[-</span><span class="mi">2</span><span class="o">]</span> <span class="n">send_to_individuals</span> <span class="o">=</span> <span class="n">config_doc</span><span class="o">.</span><span class="n">at_xpath</span><span class="p">(</span><span class="s2">&quot;//hudson.tasks.Mailer/sendToIndividuals&quot;</span><span class="p">)</span> <span class="nb">puts</span> <span class="s2">&quot;Set </span><span class="si">#{</span><span class="n">send_to_individuals</span><span class="si">}</span><span class="s2"> for </span><span class="si">#{</span><span class="n">job_name</span><span class="si">}</span><span class="s2">&quot;</span> <span class="n">send_to_individuals</span><span class="o">.</span><span class="n">content</span><span class="o">=</span><span class="s2">&quot;true&quot;</span> <span class="k">unless</span> <span class="n">send_to_individuals</span><span class="o">.</span><span class="n">nil?</span> <span class="k">end</span> </code></pre></div> <p>Now that we have the automation in place, we&#39;ve been able to use this several times to modify all our jenkins config setups.</p> Extreme Programming For Modern Start-Ups 
gannon Fri, 11 Apr 2014 00:00:00 +0000 http://dev.bizo.com/2014/04/extreme-programming-for-modern-start-ups.html http://dev.bizo.com/2014/04/extreme-programming-for-modern-start-ups.html <p>Extreme Programming (XP) can be a very effective way to build software, but out of the box, it is poorly suited to many teams. It requires that the team is small, co-located, and working on a single product at any given time. It also assumes that suitable designs can be arrived at by working in micro-increments (eg. TDD cycles) without up-front design. In this post, I&#39;ll discuss how XP can be adapted to suit modern start-ups.</p> <h2>XP is rad</h2> <p>XP works very well for quickly building a new software system and adapting it to meet customer needs. Iterative development, customer value, automated testing, bug-free code and energized work are all a great fit for any innovative team. It is simple to get started (assuming your team is prepared to dive right in), simple to manage, and is designed to be adapted over time.</p> <h2>What about Kanban?</h2> <p>Since the Kanban methodology doesn&#39;t have XP&#39;s constraints, it seems to be a popular choice in recent years. Kanban, however, is more complicated to get started with (coming up with phases and WIP limits) and more complex to manage (adjusting classes of service and SLAs). Despite the complexity, it doesn&#39;t cover technical practices at all. Additionally, the intense focus that is fostered by using iterations is lost.</p> <p>Also, Kanban (as originally designed) seems to eschew estimation, which doesn&#39;t work well for fine-grained project tracking. All items (stories or tasks) in a Kanban system are treated as equivalent in size, but on the teams I&#39;ve worked on, tasks/stories usually vary in size quite a bit. While splitting large stories is a good practice with either methodology, it doesn&#39;t completely solve the problem. 
Significant functionality can&#39;t really be broken into pieces equivalent in size to a small task (like re-arranging widgets on a page or fixing a small bug).</p> <p>The variance in item size means that the predictability of cycle times (the primary metric for setting expectations) is only accurate to the extent that the mix of task sizes stays consistent over time. On most teams, product work seems to come in waves, so the cycle time for a task during construction of a major new product is going to be significantly longer than when the team is working on maintenance and nominal enhancements.</p> <p>Although some teams use T-shirt size estimation with Kanban (where each size has its own cycle time calculation), the mix of item sizes being worked on by a team will influence the cycle times of all items. In other words, if at one point 5 XL items and 2 Small items are being worked on, the cycle time for all items is likely to be higher than at a time when 2 XL items and 5 Small items are being worked on. Accordingly, the cycle time per size will still fluctuate significantly, although there is probably some mitigating effect compared to not using estimation at all. (We just started doing this, and haven&#39;t analyzed the results yet.)</p> <h2>Times have changed</h2> <h3>The boom</h3> <p>These days, start-ups (especially in SF/Silicon Valley) are operating in a different environment than when XP flourished a decade ago. We are in a boom, which makes it very difficult to hire qualified engineers in any start-up hub. This has motivated start-ups to hire qualified engineers anywhere they can find them. That can be in parts of the US that are not start-up hubs, or even abroad.
This means many teams at modern start-ups are geographically distributed, and often in different time zones, as is the case here at Bizo.</p> <h3>Big Data</h3> <p>Also, the number of people active on the web is much larger than it used to be, and the amount of data required to build compelling applications is exploding. Accommodating that volume of data, especially in an era where users expect applications to respond instantaneously, requires carefully choosing efficient algorithms and data structures (eg. HyperLogLog, bloom filters, P-Square, etc.), selecting appropriate data stores and choosing the right model of concurrency. Designing software using the typical approach to TDD won&#39;t lead smoothly to a design that performs well under these conditions.</p> <p>In an era where distributed teams and Big Data are the new norm, we need a refresh of XP to suit our needs.</p> <h2>Distributed teams</h2> <p>XP insists on co-located teams because high-bandwidth communication happens almost automatically. Also, XP espouses Pair Programming, which has traditionally been done in person. Let&#39;s look at some alternatives for distributed teams.</p> <h3>Prolonged stand-ups as a stand-in for in-person communication</h3> <p>The next best thing to in-person communication seems to be video chat. At Bizo, we hold our stand-ups using Google+ Hangouts. XP encourages very quick (~5 minute) stand-up meetings (hence the name) where each team member says just 3 things: what they did yesterday, what they&#39;re going to do today, and what they&#39;re blocked on, if anything. We have naturally tended toward longer stand-ups (15-20m) on our team, with team members disseminating all kinds of information and asking general questions to the group. When I was on a previous, co-located team, I would try to encourage team members to keep it brief, and go &#39;offline&#39; with any additional discussion.
I am coming to the conclusion that on a distributed team, however, a slightly longer stand-up is actually good: the additional time spent is a small price to pay for high-value ad hoc communication that is similar to what happens throughout the day on co-located teams. Inviting the Project Owner to the daily stand-up is a great way to maintain XP&#39;s focus on Real Customer Involvement. Having regular demos with the Project Owner (eg. the Product Manager, the client or whoever is the project sponsor) is also a good idea.</p> <h3>Informative workspace</h3> <p>An informative workspace can be accomplished by having a monitor in all company offices (with engineers) that constantly displays a dashboard and/or an up-to-date view of the project/task tracking app. (Such views should also be readily accessible to remote employees on demand.) Pivotal Tracker is an awesome web app for XP-style project tracking - the UI is well-suited to just such a purpose.</p> <h3>Pair Programming and real-time-ish code reviews</h3> <p>Although Pair Programming is normally done in person, some great tools have been developed that make it much more feasible to do remotely. Screen Hero is one example: it is designed for highly performant screen-sharing where both parties can type and mouse around, and it has built-in voice chat. (Although it&#39;s awesome, we don&#39;t use it very often because it doesn&#39;t have Linux support - just Mac and Windows.)</p> <p>For teams where members have significantly different schedules (eg. due to time zone differences), Pair Programming isn&#39;t feasible. In those cases, a near real-time code review process is a decent substitute. Personally, I think code reviews should have a single person who is responsible for giving a thumbs up/down on a change set. (Additional reviewers are best included only as an FYI.)
The author can email a specific person asking them to review the code at their earliest convenience (or email the team asking for a volunteer). In my opinion, the ideal is for the reviewer to call the author and keep them on the phone while reviewing the code. Code reviews can sometimes involve an in-depth back-and-forth, and it&#39;s best if those can be done in real time. The key here is turn-around time, so that integration and deployment can be done nearly continuously.</p> <h2>Large teams</h2> <h3>Kanban-XP interop</h3> <p>XP is only designed for teams of up to 10 people, all working on the same project at any given time. The obvious way to scale is to break teams up as they grow, so no single team ever exceeds that limit. That&#39;s also a great way to ensure that each team is only working on a single project, yet the organization can progress on several projects simultaneously. XP isn&#39;t really suited to managing the flow of work across teams and projects, so having an over-arching Kanban process with XP implemented within individual teams may offer the best of both worlds.</p> <h3>Pods</h3> <p>Having teams work on a single project enhances the team&#39;s focus and synergy. However, since an organization&#39;s project portfolio changes over time, that requires continual re-organization of teams. Having teams be completely ephemeral necessitates standardization of tools and processes throughout the engineering department to avoid excessive re-training, which can lead to bureaucracy over time.
One way to mitigate this is to limit such standardization to pods of related ephemeral teams, where team members generally stay within a pod for the duration of several projects.</p> <h2>Scalable systems</h2> <p>Although comprehensive up-front design is a very straightforward way to build scalable systems, there are techniques for taking a more agile approach.</p> <h3>Spike Solutions</h3> <p>First, spike solutions can be used to validate basic performance characteristics. Some very limited up-front design (back-of-the-envelope performance calculations for various potential infrastructure pieces) can yield a prospective stack for a system. That stack can be used in a prototype solution (with little to no business logic) that validates whether the stack can stand up to the expected load. The most effective technique for validating scalability at this stage (when feasible) is performance tests that use a unit testing framework (eg. JUnit or RSpec) and invoke the application using in-process calls. The logistics are easier than true end-to-end load testing, although the results are less conclusive.</p> <p>If the spike is successful in verifying adequate performance, subsequent stories can be implemented with a normal TDD cycle utilizing the tested stack. Maintain the performance tests so that they can be run against the real system as it develops, although you probably do not want to run them as part of your unit test suite (which should be very fast).</p> <h3>Leveraging the cloud</h3> <p>It&#39;s important to note that the initial solution may not be all that efficient in terms of infrastructure cost. If you&#39;re hosting your system in the cloud (and not provisioning capacity for it ahead of time), that&#39;s often ok, as long as the system can scale horizontally.
Keep an eye on infrastructure costs, and schedule performance stories to improve efficiency over time.</p> <h3>Load testing and related hackery</h3> <p>Prior to fully launching, it may be wise to do a true end-to-end load test. The logistics of these tests are often difficult (and potentially expensive).</p> <p>If your system is already running at scale, but you want to load test a new piece of infrastructure (or algorithm, etc.), one hack to consider is embedding an isolated load test within your running system. The key to this hack is taking care to limit the impact of the test on your production system. At a minimum, use short timeouts around the experimental code and ensure that errors are caught. Also, if you can&#39;t roll back painlessly, you may want to build in some kind of kill switch for the test.</p> <h2>Everything else is the same</h2> <p>With the tweaks mentioned here, the remaining XP practices and principles are feasible without significant modification. Those would be as follows (as outlined in <em>The Art of Agile Development</em> by James Shore, 2007): Vision, Release Planning, Iteration Planning, Test-Driven Development, Energized Work, Root-Cause Analysis, Retrospectives, Ubiquitous Language, Coding Standards, Reporting, Slack, Stories, Estimating, Risk Management, Customer Tests, Refactoring, Incremental Design and Architecture, Simple Design, Spike Solutions, Exploratory Testing.</p> <h2>Conclusion</h2> <p>Much of what is described here was gleaned from experience, but some of it is conjecture, so YMMV. I&#39;m interested to hear feedback from those who have tried to scale XP.</p> The Mocks Are Alright rayo Thu, 20 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/the-mocks-are-alright.html http://dev.bizo.com/2014/03/the-mocks-are-alright.html <p>We deploy applications on AWS, and we run jobs that check on them.
Like any other code, the jobs need tests.</p> <p>Consider a simple reaper, which terminates Elastic MapReduce (EMR) clusters whose job flows have taken way too long:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">Reaper</span> <span class="o">{</span> <span class="k">def</span> <span class="n">terminateLongJobFlows</span><span class="o">(</span><span class="n">minNormalizedInstanceHours</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">elasticMapReduce</span><span class="o">()</span> <span class="cm">/* filter jobs from emr.describeJobFlows()... */</span> <span class="cm">/* run emr.terminateJobFlows() on filtered jobs... */</span> <span class="o">}</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">}</span> <span class="k">object</span> <span class="nc">Reaper</span> <span class="k">extends</span> <span class="nc">Reaper</span> <span class="o">{</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">AmazonElasticMapReduceClient</span><span class="o">()</span> <span class="o">}</span> </code></pre></div> <p>If you wanted to write a test/spec for <code>Reaper</code>, you&#39;d need some sort of test double for its collaborator <code>AmazonElasticMapReduce</code>. 
What kind of double you use is up to you...</p> <h2>Mock Trial</h2> <p>For this post, consider a <a href="http://www.scalatest.org/">ScalaTest</a> <code>WordSpec</code> using mocks.</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="nd">@RunWith</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">JUnitRunner</span><span class="o">])</span> <span class="k">class</span> <span class="nc">ReaperSpec</span> <span class="k">extends</span> <span class="nc">WordSpec</span> <span class="k">with</span> <span class="nc">Matchers</span> <span class="k">with</span> <span class="nc">MockitoSugar</span> <span class="o">{</span> <span class="s">&quot;Reaper.terminateLongJobFlows()&quot;</span> <span class="n">should</span> <span class="o">{</span> <span class="s">&quot;terminate jobs taking &gt;= a given # of normalized instance hours&quot;</span> <span class="n">in</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">mockEmr</span><span class="o">(</span><span class="nc">Seq</span><span class="o">((</span><span class="s">&quot;a&quot;</span><span class="o">,</span> <span class="mi">40</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;b&quot;</span><span class="o">,</span> <span class="mi">60</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;c&quot;</span><span class="o">,</span> <span class="mi">20</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;d&quot;</span><span class="o">,</span> <span class="mi">80</span><span class="o">)))</span> <span class="k">val</span> <span class="n">reaper</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Reaper</span><span class="o">()</span> <span class="o">{</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span 
class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="n">emr</span> <span class="o">}</span> <span class="n">reaper</span><span class="o">.</span><span class="n">terminateLongJobFlows</span><span class="o">(</span><span class="mi">50</span><span class="o">)</span> <span class="k">val</span> <span class="n">argCaptor</span> <span class="k">=</span> <span class="nc">ArgumentCaptor</span><span class="o">.</span><span class="n">forClass</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">TerminateJobFlowsRequest</span><span class="o">])</span> <span class="n">verify</span><span class="o">(</span><span class="n">emr</span><span class="o">).</span><span class="n">terminateJobFlows</span><span class="o">(</span><span class="n">argCaptor</span><span class="o">.</span><span class="n">capture</span><span class="o">)</span> <span class="k">val</span> <span class="n">expectedJobFlowIds</span> <span class="k">=</span> <span class="nc">Set</span><span class="o">(</span><span class="s">&quot;b&quot;</span><span class="o">,</span> <span class="s">&quot;d&quot;</span><span class="o">)</span> <span class="k">val</span> <span class="n">actualJobFlowIds</span> <span class="k">=</span> <span class="n">argCaptor</span><span class="o">.</span><span class="n">getValue</span><span class="o">.</span><span class="n">getJobFlowIds</span><span class="o">.</span><span class="n">asScala</span><span class="o">.</span><span class="n">toSet</span> <span class="n">actualJobFlowIds</span> <span class="n">should</span> <span class="n">equal</span> <span class="o">(</span><span class="n">expectedJobFlowIds</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="cm">/* ... */</span> <span class="o">}</span> </code></pre></div> <p>What makes this test mockist (?) 
is that it spells out how the system under test (SUT), our <code>Reaper</code>, interacts with its collaborator, <code>AmazonElasticMapReduce</code>. The test will</p> <ol> <li><p>verify our <code>Reaper</code> invoked <code>AmazonElasticMapReduce.terminateJobFlows()</code>, and</p></li> <li><p>check that the supplied argument, a <code>TerminateJobFlowsRequest</code>, contains the IDs of the job flows that should be terminated.</p></li> </ol> <p>(1) and (2) are handled by a mock framework like <a href="https://code.google.com/p/mockito/">Mockito</a>.</p> <p>Contrast this with a stubbist (?) approach (<a href="http://martinfowler.com/articles/mocksArentStubs.html">obligatory link</a>). You&#39;re generally not concerned about the interaction but rather the state of things at the end. <em>&quot;I terminated some job flows. Are they there anymore?&quot;</em></p> <h2>Deep Mockery</h2> <p>There was a time when it would&#39;ve been painful to write a helper method like <code>mockEmr</code>, which creates the mock <code>AmazonElasticMapReduce</code>, plus all of its contained information.</p> <p>Consider that many AWS API calls return <code>*Result</code> value objects with graphs of other value objects, sometimes going three or four levels deep (e.g. <code>describeJobFlows()</code> -&gt; <code>getJobFlows()</code> -&gt; <code>getExecutionStatusDetail()</code> -&gt; <code>getState()</code>). So mocking chained API calls would require creating mocks for each call in the chain--tedious, to say the least.</p> <p>With Mockito, however, you can more easily mock deeply by passing to the mock constructor an additional argument, <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/Mockito.html#RETURNS_DEEP_STUBS"><code>RETURNS_DEEP_STUBS</code></a>. 
Mockito handles the deep mock implementation (basically doing what you would&#39;ve done), and returns you a mock that lets you apply <code>when</code>/<code>thenReturn</code> to chained API calls, and <code>verify</code> where appropriate.</p> <p>Subsequently, mock creation gets more concise:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">def</span> <span class="n">mockEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">mock</span><span class="o">[</span><span class="kt">AmazonElasticMapReduce</span><span class="o">](</span><span class="nc">Mockito</span><span class="o">.</span><span class="nc">RETURNS_DEEP_STUBS</span><span class="o">)</span> <span class="k">val</span> <span class="n">jobFlows</span> <span class="k">=</span> <span class="n">idsAndHours</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">mockJobFlow</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">,</span> <span class="nc">JobFlowExecutionState</span><span class="o">.</span><span class="nc">RUNNING</span><span class="o">)</span> <span class="o">}.</span><span class="n">asJava</span> <span class="n">when</span><span class="o">(</span><span class="n">emr</span><span class="o">.</span><span class="n">describeJobFlows</span><span class="o">().</span><span 
class="n">getJobFlows</span><span class="o">())</span> <span class="n">thenReturn</span> <span class="o">(</span><span class="n">jobFlows</span><span class="o">)</span> <span class="n">emr</span> <span class="o">}</span> <span class="k">def</span> <span class="n">mockJobFlow</span><span class="o">(</span> <span class="n">jobFlowId</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">normalizedInstanceHours</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">state</span><span class="k">:</span> <span class="kt">JobFlowExecutionState</span><span class="o">)</span><span class="k">:</span> <span class="kt">JobFlowDetail</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">jobFlow</span> <span class="k">=</span> <span class="n">mock</span><span class="o">[</span><span class="kt">JobFlowDetail</span><span class="o">](</span><span class="nc">Mockito</span><span class="o">.</span><span class="nc">RETURNS_DEEP_STUBS</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getJobFlowId</span><span class="o">())</span> <span class="n">thenReturn</span><span class="o">(</span><span class="n">jobFlowId</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getInstances</span><span class="o">().</span><span class="n">getNormalizedInstanceHours</span><span class="o">())</span> <span class="n">thenReturn</span><span class="o">(</span><span class="n">normalizedInstanceHours</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getExecutionStatusDetail</span><span class="o">().</span><span class="n">getState</span><span class="o">())</span> <span 
class="n">thenReturn</span><span class="o">(</span><span class="n">state</span><span class="o">.</span><span class="n">name</span><span class="o">)</span> <span class="n">jobFlow</span> <span class="o">}</span> </code></pre></div> <h2>Sharing is Caring</h2> <p>Because mocks tend to be written more specifically to the tests they support, it&#39;s easy to forget about sharing them. But that doesn&#39;t mean they couldn&#39;t be generalized or shared. ScalaTest has advice on how to <a href="http://www.scalatest.org/user_guide/sharing_fixtures">share fixtures</a>. For example, you could extract the mock creation methods into their own trait:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">SharedFixtures</span> <span class="k">extends</span> <span class="nc">MockitoSugar</span> <span class="o">{</span> <span class="k">this:</span> <span class="kt">Suite</span> <span class="o">=&gt;</span> <span class="k">def</span> <span class="n">mockEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="cm">/* ... 
*/</span> <span class="o">}</span> <span class="k">def</span> <span class="n">mockJobFlow</span><span class="o">(</span><span class="n">jobFlowId</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">state</span><span class="k">:</span> <span class="kt">JobFlowExecutionState</span><span class="o">,</span> <span class="n">terminationProtected</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span><span class="o">)</span><span class="k">:</span> <span class="kt">JobFlowDetail</span> <span class="o">=</span> <span class="o">{</span> <span class="cm">/* ... */</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Then, any spec can just extend that trait:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">ReaperSpec</span> <span class="k">extends</span> <span class="nc">WordSpec</span> <span class="k">with</span> <span class="nc">Matchers</span> <span class="k">with</span> <span class="nc">SharedFixtures</span> </code></pre></div> <h2>You Reap What You Sow</h2> <p>I wrote my test and mocks focused on the AWS API calls that the reaper makes. I&#39;m trading a stronger coupling to the reaper&#39;s implementation, for the ability to confirm, without a full integration test, that I&#39;m making the calls that will terminate the correct job flows.</p> <p>But what if the API calls change? Then I&#39;ll have to change the reaper <em>and</em> the test mocks.</p> <p>Sometimes this is self-inflicted. There are a couple of ways I can use a collaborator object; I switch from one approach to the other; and then I curse myself for the mocks I wrote to support the original implementation.</p> <p>Other times you have no choice. 
As of this writing, the <code>AmazonElasticMapReduce</code> job flow view is <a href="http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html#describeJobFlows%28%29">deprecated</a>, to be superseded by a <a href="http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html#listClusters%28%29">cluster view</a>. The two views provide similar information, but not the same; when the job flow view is removed, I&#39;ll have to change my tests. But that is a problem for another day, to be handled if it happens.</p> <h2>Mock Verdict</h2> <p>In this particular case, a framework like Mockito combined with ScalaTest makes writing tests quick and easy. My SUT gets data from a third-party API which fronts a web service, and I want to verify how it uses the API--calls and arguments--as it reacts to that data. To help me write my tests in a focused and concise manner:</p> <ul> <li><p>The <code>when/thenReturn</code> construct lets me concisely mock the calls to get data.</p></li> <li><p>The <code>verify</code> construct lets me easily, more fully check the calls that act on the data.</p></li> <li><p>The deep stubbing provided by Mockito lets me construct deep mocks because I can apply <code>when/thenReturn</code> and <code>verify</code> on them where needed.</p></li> </ul> <h2>Mock It Stub It Spy It Test It</h2> <p>It sure would be nice to have an <code>AmazonElasticMapReduce</code> already written with a simple implementation of a job flow container, that could support job flow and cluster API views. 
<a href="https://github.com/bizo/aws-java-sdk-stubs/blob/fe16fc533d3eab78fccc8149625ac06b723f5684/src/main/java/com/bizo/awsstubs/services/elasticmapreduce/AmazonElasticMapReduceStub.java">I suppose I should get around to that.</a></p> <p>Furthermore, with Mockito, you can add behavior verification to any object--even a stub!--by <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/Mockito.html#spy%28T%29">spying</a> on it.</p> <p>Implement <code>AmazonElasticMapReduceStub</code>:</p> <div class="highlight"><pre><code class="java language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">AmazonElasticMapReduceStub</span> <span class="kd">implements</span> <span class="n">AmazonElasticMapReduce</span> <span class="o">{</span> <span class="kd">private</span> <span class="kd">final</span> <span class="n">List</span><span class="o">&lt;</span><span class="n">JobFlowDetail</span><span class="o">&gt;</span> <span class="n">jobFlows</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">JobFlowDetail</span><span class="o">&gt;();</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addJobFlow</span><span class="o">(</span><span class="n">String</span> <span class="n">jobFlowId</span><span class="o">,</span> <span class="kt">int</span> <span class="n">normalizedInstanceHours</span><span class="o">,</span> <span class="n">JobFlowExecutionState</span> <span class="n">state</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// Add to jobFlows...</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">DescribeJobFlowsResult</span> <span class="nf">describeJobFlows</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="k">new</span> <span class="nf">DescribeJobFlowsResult</span><span class="o">().</span><span class="na">withJobFlows</span><span class="o">(</span><span class="n">jobFlows</span><span class="o">);</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">ListClustersResult</span> <span class="nf">listClusters</span><span class="o">(</span><span class="kd">final</span> <span class="n">ListClustersRequest</span> <span class="n">listClustersRequest</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">AmazonServiceException</span><span class="o">,</span> <span class="n">AmazonClientException</span> <span class="o">{</span> <span class="c1">// Something to transform jobFlows from List&lt;JobFlowDetail&gt; to List&lt;ClusterSummary&gt;...</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">terminateJobFlows</span><span class="o">(</span><span class="n">TerminateJobFlowsRequest</span> <span class="n">request</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// Remove from jobFlows...</span> <span class="o">}</span> <span class="cm">/* ...
*/</span> <span class="o">}</span> </code></pre></div> <p>Spy it:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">def</span> <span class="n">spyEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">spy</span><span class="o">(</span><span class="k">new</span> <span class="nc">AmazonElasticMapReduceStub</span><span class="o">())</span> <span class="k">for</span> <span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">)</span> <span class="k">&lt;-</span> <span class="n">idsAndHours</span><span class="o">)</span> <span class="n">emr</span><span class="o">.</span><span class="n">addJobFlow</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">,</span> <span class="nc">JobFlowExecutionState</span><span class="o">.</span><span class="nc">RUNNING</span><span class="o">)</span> <span class="n">emr</span> <span class="o">}</span> </code></pre></div> <p>Now use the spy instead of the mock.
Both <code>verify()</code> and the argument checking will work as before.</p> <p><em>Also:</em></p> <ul> <li><p><a href="http://www.draconianoverlord.com/2013/04/13/services-should-come-with-stubs.html">Services Should Come with Stubs</a> by Stephen @ Bizo</p></li> <li><p><a href="http://www.draconianoverlord.com/2010/07/09/why-i-dont-like-mocks.html">Why I Don&#39;t Like Mocks</a> by Stephen @ Bizo</p></li> <li><p>My teammates are more conscientious than I and have provided a few <a href="https://github.com/bizo/aws-java-sdk-stubs">stubs for AWS</a>. Check out our <a href="https://github.com/bizo/aws-java-sdk-stubs/blob/master/src/main/java/com/bizo/awsstubs/services/s3/AmazonS3Stub.java">S3 stub implementation</a>, or our <a href="https://github.com/bizo/aws-java-sdk-stubs/blob/master/src/main/java/com/bizo/awsstubs/services/kinesis/AmazonKinesisClientStub.java">Kinesis stub implementation</a> if you are cutting edge like that.</p></li> </ul> Statistics With Spark Josh Fri, 07 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/statistics-with-spark.html http://dev.bizo.com/2014/03/statistics-with-spark.html <p>Lately I&#39;ve been writing a lot of Spark jobs that perform statistical analysis on datasets. One of the things I didn&#39;t realize right away is that RDDs have built-in support for basic statistical functions like mean, variance, sample variance, and standard deviation.</p> <p>These operations are available on RDDs of <code>Double</code> via <a href="http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.DoubleRDDFunctions">DoubleRDDFunctions</a>.
You can access these functions like so:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"> <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> <span class="c1">// implicit conversions in here</span> <span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="n">newRDD</span><span class="o">().</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">toDouble</span> <span class="o">}</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">mean</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">sampleVariance</span> <span class="c1">// divides by n-1</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">sampleStdDev</span> <span class="c1">// divides by n-1 </span> </code></pre></div> <h2>Getting It All At Once</h2> <p>If you&#39;re interested in calling multiple stats functions at the same time, it&#39;s a better idea to compute them all in a single pass. Spark provides the <code>stats</code> method in DoubleRDDFunctions for that; it also returns the total count of the RDD. </p> <h2>Histograms</h2> <p>Means and standard deviations are a decent starting point when you&#39;re looking at a new dataset, but be careful: measures of central tendency hide the distribution from you. If you&#39;re looking at something like response latency, there could very well be dragons lurking there.
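</p> <p>To make that concrete, here are two hypothetical latency samples (the numbers are invented purely for illustration) that share a mean of 100 ms while having completely different shapes; the mean alone can never tell them apart.</p>

```scala
// Two made-up latency samples (ms): one steady, one hiding a slow outlier.
val steady = Seq(95.0, 100.0, 105.0, 95.0, 100.0, 105.0)
val spiky  = Seq(10.0, 10.0, 10.0, 10.0, 10.0, 550.0)

def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Both means are exactly 100.0, yet the second sample conceals a 550 ms spike.
val steadyMean = mean(steady)
val spikyMean  = mean(spiky)
```

<p>Only a view of the full distribution, rather than a single summary number, exposes the difference.</p> <p>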
</p> <p>Fortunately Spark also includes a nifty little histogram method you can use: </p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="n">newRDD</span><span class="o">().</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">toDouble</span> <span class="o">}</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">histogram</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="c1">// 10 evenly spaced buckets, between myRDD.min -&gt; myRDD.max </span> <span class="n">myRDD</span><span class="o">.</span><span class="n">histogram</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="mf">0.0</span><span class="o">,</span> <span class="mf">10.0</span><span class="o">,</span> <span class="mf">20.0</span><span class="o">,</span> <span class="mf">30.0</span><span class="o">))</span> <span class="c1">// manually specify the buckets</span> </code></pre></div> <h2>Beyond The Box</h2> <p>Spark provides a very basic but useful starting point. If you want access to more advanced statistical methods like classification or regression, check out <a href="http://spark.apache.org/docs/0.9.0/mllib-guide.html">MLlib</a>. However, at the time of writing it&#39;s still a very young project and you might have to implement things on your own.</p> <p>In our case, we ended up implementing some basic z &amp; chi-squared tests for some of our bidding algorithms, like:</p> <ul> <li>Comparing binomial proportions of two samples </li> <li>Comparing means of two samples</li> <li>Comparing two distributions drawn from different samples</li> </ul> <p>These are still very much in their infancy, but it&#39;s possible we might open source them at a later date.
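</p> <p>To give a feel for the first bullet, a two-sample binomial-proportion comparison boils down to a handful of arithmetic operations. Here is a rough plain-Scala sketch (our own illustrative version with invented numbers, not the unreleased Bizo code) of the pooled two-proportion z statistic.</p>

```scala
// Illustrative two-proportion z-test: compares the conversion rates of two
// samples under the null hypothesis that the true rates are equal.
def twoProportionZ(conv1: Long, n1: Long, conv2: Long, n2: Long): Double = {
  val p1 = conv1.toDouble / n1
  val p2 = conv2.toDouble / n2
  val pooled = (conv1 + conv2).toDouble / (n1 + n2)  // pooled rate under H0
  val se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
  (p1 - p2) / se
}

// Hypothetical numbers: 120/1000 conversions vs. 90/1000.
val z = twoProportionZ(120, 1000, 90, 1000)
// |z| > 1.96 would reject equal rates at the 5% significance level.
```

<p>The chi-squared and two-sample mean comparisons follow the same pattern: a few aggregates per sample, then a closed-form statistic.</p> <p>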
</p> <p>For now, when implementing things yourself it&#39;s nice to keep operations as general as possible, so I&#39;d recommend keeping all your hand-rolled stats methods as operations on RDDs of primitives (or matrices). </p> Required Reading Team Geek donnie flood Fri, 07 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/read-team-geek.html http://dev.bizo.com/2014/03/read-team-geek.html <p><img src="/images/posts/teamgeek.jpg" class="pull-left" style="height:425px; margin-right: 20px; margin-bottom: 20px;" /> </p> <p>From the beginning (2008), the engineering team at Bizo has focused on <a href="http://dev.bizo.com/2011/03/on-building-a-kick-ass-engineering-team----part-1.html">building a great culture</a> around a few key attributes like quality, discipline, and communication. A couple of years ago our director of engineering, Larry, happened upon a book, <a href="http://shop.oreilly.com/product/0636920018025.do">Team Geek</a>, that does an amazing job of describing the ideal set of cultural tenets and personal behaviors that we are trying to build our culture around. </p> <p>After reading it in one evening (the book is a short 194 pages), I immediately bought copies for the entire engineering team and set up a few &quot;book club&quot; meetings where we all discussed the contents of the book and how they related to our team.
And now, every new engineer who joins is required to read the book.</p> <p>Here is a list of the chapters (I should do some future chapter highlight posts):</p> <ul> <li>Chapter 1: The Myth of the Genius Programmer</li> <li>Chapter 2: Building an Awesome Team Culture</li> <li>Chapter 3: Every Boat Needs a Captain</li> <li>Chapter 4: Dealing with Poisonous People</li> <li>Chapter 5: The Art of Organizational Manipulation</li> <li>Chapter 6: Users Are People, Too</li> </ul> <p>Among other things, &quot;Team Geek&quot; does a great job of describing what makes a great team member with the acronym HRT (pronounced like &quot;heart&quot;), which stands for Humility, Respect, and Trust. We&#39;ve adopted this term internally as part of our cultural nomenclature. </p> <p>Overall, I think any and all engineering professionals should read this book and take to heart some of its lessons on how to define a great engineering culture and, more importantly, how to become someone who not only solves hard technical problems but is a pleasure to work with. Even non-engineers can take a lot of these lessons to heart; Humility, Respect, and Trust are definitely characteristics that make good team members no matter the part of the organization.</p> args4j-helpers larry ogrodnek Wed, 26 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/args4j-helpers.html http://dev.bizo.com/2014/02/args4j-helpers.html <p>I really like using <a href="http://args4j.kohsuke.org/">args4j</a> for command-line parsing in both Java and Scala, but I found myself writing the same boilerplate code to parse options, deal with help, deal with parsing issues, etc.
<a href="https://github.com/ogrodnek/args4j-helpers">args4j-helpers</a> is a project that simplifies parsing with args4j.</p> <p>It provides typical option parsing error handling:</p> <ul> <li>If help (provided via the <code>OptionsWithHelp</code> base trait bound to <code>--help</code> and <code>-h</code>) is requested, print usage information to stderr, exit with code <code>0</code>.</li> <li>If a required option is missing (<code>required=true</code>), print usage information to stderr, exit with code <code>1</code> (unless help was requested).</li> </ul> <p>This typical parsing code:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">Options</span> <span class="o">{</span> <span class="k">import</span> <span class="nn">org.kohsuke.args4j.Option</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--help&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-h&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;show this message&quot;</span><span class="o">)</span> <span class="k">var</span> <span class="n">help</span> <span class="k">=</span> <span class="kc">false</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--blah&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-b&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;some val&quot;</span><span class="o">,</span> <span class="n">metaVar</span><span class="o">=</span><span class="s">&quot;BLAH&quot;</span><span
class="o">)</span> <span class="k">var</span> <span class="n">blah</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">}</span> <span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span> <span class="o">{</span> <span class="k">val</span> <span class="n">options</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Options</span> <span class="k">val</span> <span class="n">parser</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">CmdLineParser</span><span class="o">(</span><span class="n">options</span><span class="o">)</span> <span class="k">try</span> <span class="o">{</span> <span class="n">parser</span><span class="o">.</span><span class="n">parseArgument</span><span class="o">(</span><span class="n">args</span> <span class="k">:</span> <span class="k">_</span><span class="kt">*</span><span class="o">)</span> <span class="k">if</span> <span class="o">(</span><span class="n">options</span><span class="o">.</span><span class="n">help</span><span class="o">)</span> <span class="o">{</span> <span class="n">parser</span><span class="o">.</span><span class="n">printUsage</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">err</span><span class="o">)</span> <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="k">catch</span> <span class="o">{</span> <span class="k">case</span> <span class="n">e</span><span class="k">:</span> <span class="kt">CmdLineException</span> <span class="o">=&gt;</span> <span class="o">{</span> <span class="nc">System</span><span 
class="o">.</span><span class="n">err</span><span class="o">.</span><span class="n">println</span><span class="o">(</span><span class="n">e</span><span class="o">.</span><span class="n">getMessage</span><span class="o">)</span> <span class="n">parser</span><span class="o">.</span><span class="n">printUsage</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">err</span><span class="o">)</span> <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>can be simplified as:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">Options</span> <span class="k">extends</span> <span class="nc">OptionsWithHelp</span> <span class="o">{</span> <span class="k">import</span> <span class="nn">org.kohsuke.args4j.Option</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--blah&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-b&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;some val&quot;</span><span class="o">,</span> <span class="n">metaVar</span><span class="o">=</span><span class="s">&quot;BLAH&quot;</span><span class="o">)</span> <span class="k">var</span> <span class="n">blah</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">}</span> <span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">String</span><span class="o">])</span> <span class="o">{</span> <span class="k">val</span> <span class="n">options</span> <span class="k">=</span> <span class="n">optionsOrExit</span><span class="o">(</span><span class="n">args</span><span class="o">,</span> <span class="k">new</span> <span class="nc">Options</span><span class="o">)</span> <span class="o">}</span> </code></pre></div> <p>Additionally, the helper class handles the case where help is requested while required arguments are missing (a case the typical boilerplate above doesn&#39;t handle).</p> <p>The code is in Scala, and <a href="https://github.com/ogrodnek/args4j-helpers">available on GitHub</a>.</p> <p>In the future, I&#39;d like to extend it to add better support for more Scala-ish types (<code>Option</code>, <code>Seq</code>, etc.), which should be mostly possible by implementing additional args4j OptionHandlers.</p> Beware Java Enums in Spark Josh Sun, 23 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/beware-enums-in-spark.html http://dev.bizo.com/2014/02/beware-enums-in-spark.html <p>A few days back I wrote a <a href="https://spark.incubator.apache.org/">Spark</a> job that runs an A/B test to compare the conversion rates between two groups of website visitors on one of our client&#39;s websites. </p> <p>In this case, visitors were placed into the test/control group based on a deterministic method using their UUID. At the time of writing we had multiple projects wanting to use our general A/B test code. However, some projects were using Scala 2.9.x and others were on Scala 2.10. We wanted to share code, but we didn&#39;t want the hassle of maintaining separate artifacts for each Scala version, so we packaged the group labels and the method for determining who belongs to which group in a Java project.
It looked something like this: </p> <div class="highlight"><pre><code class="java"> <span class="kd">public</span> <span class="kd">enum</span> <span class="n">TestControl</span> <span class="o">{</span> <span class="n">TEST</span><span class="o">,</span> <span class="n">CONTROL</span><span class="o">;</span> <span class="kd">public</span> <span class="kd">static</span> <span class="n">TestControl</span> <span class="nf">getGroup</span><span class="o">(</span><span class="n">UUID</span> <span class="n">id</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// return appropriate group</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <h2>The Spark Code</h2> <p>The report was a fairly standard Spark report for us. Basically it involved loading &amp; massaging our log files into the appropriate RDD format, then mapping them into a format conducive to aggregation. Spark supports aggregation of key/value pairs using the <code>reduceByKey</code> method via an implicit conversion on an <code>RDD[Tuple2]</code> (see <a href="http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions">PairRDDFunctions</a>).
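</p> <p>The key/value aggregation itself is easy to picture outside Spark. This plain-Scala stand-in (a local sketch for intuition, not Spark&#39;s distributed implementation) does for an in-memory <code>Seq</code> what <code>reduceByKey</code> does for an RDD.</p>

```scala
// Local stand-in for Spark's reduceByKey: group (key, value) pairs by key,
// then combine each key's values with the supplied function.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(combine) }

val grouped = reduceByKeyLocal(Seq(("test", 1), ("control", 1), ("test", 1)))(_ + _)
// grouped == Map("test" -> 2, "control" -> 1)
```

<p>The distributed version has the same shape, but the grouping happens across partitions, so it depends on the keys hashing and comparing consistently everywhere.</p> <p>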
</p> <p>Since I was interested in comparing the average conversion rate between the two groups, it was a simple matter of mapping the log files into the appropriate key/value pairs and then calling reduceByKey like so: </p> <div class="highlight"><pre><code class="scala"><span class="c1">// aggregation object</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Stats</span><span class="o">(</span><span class="n">impressions</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span> <span class="n">conversions</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span> <span class="k">lazy</span> <span class="k">val</span> <span class="n">conversionRate</span><span class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> <span class="n">conversions</span> <span class="o">/</span> <span class="n">impressions</span><span class="o">.</span><span class="n">toDouble</span> <span class="k">def</span> <span class="o">+(</span><span class="n">other</span><span class="k">:</span> <span class="kt">Stats</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Stats</span><span class="o">(</span> <span class="n">impressions</span> <span class="o">+</span> <span class="n">other</span><span class="o">.</span><span class="n">impressions</span><span class="o">,</span> <span class="n">conversions</span> <span class="o">+</span> <span class="n">other</span><span class="o">.</span><span class="n">conversions</span> <span class="o">)</span> <span class="o">}</span> <span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="c1">// load/map log files from S3 to case class with methods for each field</span> <span class="k">val</span> <span class="n">readyForAggregation</span> <span
class="k">=</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="n">line</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">isImp</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="nc">TestControl</span><span class="o">.</span><span class="n">getGroup</span><span class="o">(</span><span class="n">line</span><span class="o">.</span><span class="n">uuID</span><span class="o">),</span> <span class="nc">Stats</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">0</span><span class="o">))</span> <span class="k">case</span> <span class="n">line</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">isConv</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="nc">TestControl</span><span class="o">.</span><span class="n">getGroup</span><span class="o">(</span><span class="n">line</span><span class="o">.</span><span class="n">uuID</span><span class="o">),</span> <span class="nc">Stats</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="k">val</span> <span class="n">results</span> <span class="k">=</span> <span class="n">readyForAggregation</span><span class="o">.</span><span class="n">reduceByKey</span> <span class="o">{</span> <span class="k">_</span> <span class="o">+</span> <span class="k">_</span> <span class="o">}</span> </code></pre></div> <p>At this point we&#39;d expect the resulting RDD to contain just two elements, one for the test group and another for the control (as our key was a <code>TestControl</code> value and <code>getGroup</code> only produces 2 values).
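</p> <p>The shape of that aggregation can be sketched without Spark, using plain Scala collections on a single JVM (<code>reduceByKey</code> is morally a <code>groupBy</code> followed by a per-key reduce; the String keys and counts below are made-up stand-ins):</p>

```scala
// Simplified stand-in for the Stats aggregation object from the post.
case class Stats(impressions: Long = 0, conversions: Long = 0) {
  def +(other: Stats): Stats =
    Stats(impressions + other.impressions, conversions + other.conversions)
}

// Pre-keyed events, as produced by the map step.
val events = Seq(
  ("TEST", Stats(1, 0)), ("TEST", Stats(0, 1)),
  ("CONTROL", Stats(1, 0)), ("TEST", Stats(1, 0))
)

// Local equivalent of reduceByKey: group by key, then sum each group.
val results: Map[String, Stats] =
  events.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

assert(results.size == 2) // one entry per distinct key
```

<p>With only two possible keys going in, exactly two entries come out.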
</p> <h2>Testing</h2> <p>It might seem obvious that we should only get 2 items in our results RDD, but it&#39;s always good to check with a unit test. Running a local unit test confirmed that this was indeed the case: </p> <div class="highlight"><pre><code class="scala"> <span class="nd">@Test</span> <span class="k">def</span> <span class="n">resultsShouldHaveJust2Elements</span><span class="o">()</span> <span class="o">{</span> <span class="k">val</span> <span class="n">lines</span> <span class="k">=</span> <span class="c1">// make some stub log lines (Scala case classes)</span> <span class="k">val</span> <span class="n">results</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ABTestReport</span><span class="o">(</span><span class="n">lines</span><span class="o">).</span><span class="n">run</span> <span class="c1">// runs aggregation discussed above</span> <span class="n">results</span><span class="o">.</span><span class="n">size</span> <span class="n">shouldBe</span> <span class="mi">2</span> <span class="o">}</span> </code></pre></div> <p>Sweet, all our tests pass, we&#39;re done! But wait! Not so fast: running this on a real Spark cluster yielded very different results... </p> <div class="highlight"><pre><code class="scala"> <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">results</span><span class="o">.</span><span class="n">collect</span> <span class="c1">// pulls RDD onto master as an Array</span> <span class="n">println</span><span class="o">(</span><span class="n">data</span><span class="o">.</span><span class="n">size</span><span class="o">)</span> <span class="c1">// 52 -&gt; WTF?!!!</span> <span class="n">println</span><span class="o">(</span><span class="n">data</span><span class="o">)</span> <span class="c1">// Array( (CONTROL, Stats), (TEST, Stats), (CONTROL, Stats), (TEST, Stats)...) </span> </code></pre></div> <h2>WTF Spark?!</h2> <p>So what&#39;s going on here?
We have a local unit test that shows that <code>reduceByKey</code> works as expected when using our Enum for a key, but our Spark cluster seems to incorrectly reduce our results - we wind up with way too many keys.</p> <p>Well, after some digging with a few other engineers, it turns out that by default Spark will map your data items to partitions using a <a href="https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L72-L86">HashPartitioner</a>. HashPartitioner uses the hashCode of an object to determine which partition it will live in. Ok so far, that seems completely sensible.</p> <h2>The Devil is in the Enum</h2> <p>But wait: the hashCode method on Java&#39;s enum type is based on the memory address of the object. So while yes, we&#39;re guaranteed that the same enum value has a stable hashCode inside a particular JVM (since the enum will be a static object), we don&#39;t have this guarantee when comparing hashCodes of Java enums with identical values living in different JVMs. <strong>They will very likely have different hashCode values</strong>.</p> <p>Our local unit test passed because it executed on a single JVM: the enum&#39;s hashCode remained consistent when HashPartitioner asked for it. Our real cluster failed because HashPartitioner was getting different hashCodes for the same enum values, since each slave has its own machine/JVM.</p> <h2>What to do instead</h2> <p>At this point it should be pretty clear that we should not use Java enums as keys for RDDs we&#39;d like to aggregate. Fortunately, there are two easy alternatives: </p> <h3>The omg I&#39;m lazy way</h3> <p>You can simply <code>toString()</code> your enums prior to calling <code>reduceByKey</code>, since String&#39;s <code>hashCode</code> method does not rely on the memory address of the object.
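</p> <p>As a quick sanity check, <code>String.hashCode</code> is specified in the Java docs as a pure function of the characters (s(0)*31^(n-1) + s(1)*31^(n-2) + ... + s(n-1)), so every JVM computes the same value for the same contents. A sketch that recomputes it by hand:</p>

```scala
// Recompute String.hashCode from its documented definition: a left fold
// that multiplies the accumulator by 31 and adds each character.
def stringHash(s: String): Int = s.foldLeft(0)((h, c) => 31 * h + c)

// Matches the built-in hashCode, demonstrating it depends only on the
// contents, never on the object's memory address.
assert(stringHash("TEST") == "TEST".hashCode)
assert(stringHash("CONTROL") == "CONTROL".hashCode)
```

<p>So a key like <code>&quot;CONTROL&quot;</code> hashes identically on every slave, and HashPartitioner routes it to the same partition everywhere.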
</p> <h3>A better way</h3> <p>Use a sealed trait with case objects instead: a case object&#39;s <code>hashCode</code> is computed from its name rather than its memory location, so it is stable across JVMs: </p> <div class="highlight"><pre><code class="scala"><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Group</span> <span class="k">object</span> <span class="nc">TestControl</span> <span class="o">{</span> <span class="k">case</span> <span class="k">object</span> <span class="nc">Test</span> <span class="k">extends</span> <span class="nc">Group</span> <span class="k">case</span> <span class="k">object</span> <span class="nc">Control</span> <span class="k">extends</span> <span class="nc">Group</span> <span class="o">}</span> </code></pre></div> <h2>Wrapping Up</h2> <p>I think this goes to show you that while local unit tests on a single JVM are always a good idea, you should also check your results carefully on clustered systems. When you start doing parallel computing, you often run into bugs that only show up on real clusters. Happy Spark-ing!</p> Why we chose not to git fast forward merge Matthieu Fri, 14 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/why-we-chose-not-to-git-fast-forward-merge.html http://dev.bizo.com/2014/02/why-we-chose-not-to-git-fast-forward-merge.html <p>This post comes from an interesting email thread we had.
In the thread, <a href="https://github.com/stephenh">Stephen</a> was answering various questions from the team about why it is so important to NOT use fast-forward merges.</p> <p><strong>Question:</strong> Can you give a quick explanation of what --no-ff and --log do and why we want them?</p> <p><strong>Answer:</strong></p> <p>Sure.</p> <p>Out of the box, if you call &quot;git merge&quot;, it is not guaranteed to create a merge commit.</p> <p>Technically, merge commits are only required when both branches have new commits, e.g.:</p> <ul> <li>master: C1 -&gt; C2</li> <li>feature: C1 -&gt; C3</li> </ul> <p>If you do:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master
git merge feature </code></pre></div> <p>It will be forced to make a new merge commit, C4, that ties together C2 and C3:</p> <ul> <li>master: C1 -&gt; C2 -&gt; C4</li> <li>feature: C1 -&gt; C3 /</li> </ul> <p>(If the ASCII art lines up, C4 has both C2 and C3 pointing at it.)</p> <p>However, if feature had been created off of C2 instead of C1, and so it looked like this:</p> <ul> <li>master: C1 -&gt; C2</li> <li>feature: C2 -&gt; C3</li> </ul> <p>And you do:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master
git merge feature </code></pre></div> <p>Git will say &quot;huh, technically I could make master the same as feature by just moving master to now be C3&quot;, and so master would look like:</p> <ul> <li>master: C1 -&gt; C2 -&gt; C3</li> </ul> <p>Which makes sense: git has incorporated the work of feature, and without rewriting history. In git parlance, this is called a fast forward.</p> <p>The downside is that you&#39;ve lost the notion that the feature branch ever existed.
Which is not necessarily crucial, but if you want your DAG to always look a certain way, aesthetically it&#39;s nice to force the merge commit anyway, so:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git merge --no-ff feature </code></pre></div> <p>Will force a new merge commit, C4, whose parents are C2 and C3.</p> <p>The general assertion is that, if you&#39;re running &quot;git merge&quot; by hand in the first place, you are probably merging across &quot;real&quot; branches and so forcing an actual merge commit is probably a good idea.</p> <p>The other flag, --log, just means that the merge commit&#39;s generated commit message will include a brief description of what was merged, e.g. instead of just: <pre> Merge feature-a into master. </pre> It&#39;ll be: <pre> Merge feature-a into master.

Commits:
* Second commit on feature-a
* First commit on feature-a </pre></p> <p>Which is handy for when you&#39;re scanning merge commits in git log/etc.</p> <h3>Try it yourself!</h3> <p>To see for yourself, the instructions below will help you set up a dummy repo.</p> <p>Pre-requisite: git is installed.
Should work on any shell or Windows command prompt.</p> <p>Enter the following to create the repo and add the first commit:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git init test ; cd test ; echo &quot;content&quot; &gt; file.txt ; git add * ; git commit -m &quot;init&quot; </code></pre></div> <p>Create a branch, modify your file and commit (done twice here to get a better gitk view later):</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout -b branch1 ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch1 modif&quot; ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch1 modif&quot; </code></pre></div> <p>Merge your branch without fast-forward and optionally --log:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master ; git merge --no-ff --log --no-edit branch1 </code></pre></div> <p>Again, create a branch, modify your file and commit (again x2 for the gitk view):</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout -b branch2 ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch2 modif&quot; ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch2 modif&quot; </code></pre></div> <p>Merge your branch with fast-forward:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master ; git merge --ff-only branch2 </code></pre></div> <p>Running &#39;gitk&#39; in your repository path should show something like:</p> <p><img src="/images/posts/git-noff-gitk-screenshot.png" style="width: 50%; height: 50%"></p> <p>We can see on the gitk screenshot that “branch2” becomes invisible in the history.
</p> <p>If, for example, you have a one-branch-per-feature policy, this means you lose the ability to tell which commits correspond to that feature.</p> <p><strong>Question:</strong> This seems to be similar in spirit but opposite in effect to &quot;git pull --rebase&quot;, which prevents git pull from creating merge commits and uses rebase instead.</p> <p><strong>Answer:</strong></p> <p>Yes, that is insightful.</p> <p>When pulling, we need to tell git &quot;no, really, don&#39;t make merges&quot;.</p> <p>And when merging, we need to tell git &quot;no, really, make a merge&quot;.</p> <p>I don&#39;t have a good explanation for why this is, other than it&#39;s just the whole &quot;the git CLI is unintuitive/not well designed&quot; thing.</p> <p>Well, and, to give them a little credit, git is not opinionated in saying &quot;your workflow must be <em>this</em>&quot;, which has the unfortunate side effect that the default behavior does not itself adhere to one or another workflow.</p> <p><strong>Question:</strong> Is there a community convention over this, or do most projects just accept whatever git does by default?</p> <p><strong>Answer:</strong></p> <p>I don&#39;t know for sure, but my feeling is that there is a lot of community convention around both &quot;pull --rebase&quot; and &quot;merge --no-ff&quot;.</p> <p>In particular, &quot;git pull --rebase&quot; seems really common, given the number of blog posts about it, and the fact that it eventually got its own config setting (&quot;pull.rebase=true&quot;) for the user to change their default &quot;git pull&quot; behavior.</p> <p>Same thing with &quot;merge --no-ff&quot; and &quot;--log&quot;: they both also got config settings.</p> <p>To me, a flag graduating to a config setting says a lot of people are using it as convention.
(I would go further and assert that getting a config setting almost means the behavior is the correct/preferred way, but the default can&#39;t be changed for backwards compatibility purposes.)</p> Crucible Survivor - a code review dashboard larry ogrodnek Fri, 14 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/crucible-survivor-code-review-dashboard.html http://dev.bizo.com/2014/02/crucible-survivor-code-review-dashboard.html <p><img src="https://raw.github.com/ogrodnek/crucible-survivor/master/docs/crucible-survivor-small.png" alt="Crucible Survivor Dashboard"></p> <p>We&#39;ve talked a lot about the importance of <a href="http://dev.bizo.com/2012/03/on-code-reviews-and-developer-feedback.html">code reviews and developer feedback</a>.</p> <p><a href="https://github.com/ogrodnek/crucible-survivor">Crucible Survivor</a> is a Hall of Fame / Hall of Shame dashboard that helps to encourage completing your reviews.</p> <!--more--> <p>It integrates with <a href="https://www.atlassian.com/software/crucible/overview">Crucible</a> (obviously) for code review stats.</p> <h2>Scoring</h2> <p>The scoring is pretty simple right now. Each reviewer gets a <em>Fame</em> point for completing a review, and a <em>Shame</em> point for each review they have not yet completed.</p> <p>We&#39;ve had this running for a few months now, and the general thinking is it would be better to have open reviews be more shameful the longer they have been kept open. Maybe a future improvement.</p> <h2>Design / Credits</h2> <p>The design is taken from <a href="http://blog.gengo.com/jira-survivor/">Jira Survivor</a>, which itself was forked from <a href="http://99designs.com/tech-blog/blog/2013/01/05/github-survivor/">Github Survivor</a>.</p> <h2>Code and Application Architecture</h2> <p>The <a href="https://github.com/ogrodnek/crucible-survivor">code</a> is a large departure from the original projects.
The original projects use MongoDB to store data scraped from the github/jira APIs, and a python web app to serve the site.</p> <p>Crucible Survivor is an angular app that is mostly static. The review stats are included in the app as a JS include. The app is hosted as a static website in S3, with the contents of the stats JS generated periodically by a jenkins server.</p> <p>This was a hack-day experiment in &#39;static&#39; dashboard apps, and I&#39;m really happy with how it came out. I really like how gathering the content to display is decoupled from the display. Serving the site requires no server infrastructure, and the update process can be very flexible. Finally, it&#39;s incredibly easy to test/run locally -- just generate a fake stats file and open the site.</p> <h3>Grab the code, get it running!</h3> <p>The code is <a href="https://github.com/ogrodnek/crucible-survivor">open sourced, and available on github</a> along with instructions on configuring, running, and developing.</p> <p>I&#39;d love to hear any feedback or comments.</p> SCM Migration Mark Dietz Mon, 26 Aug 2013 00:00:00 +0000 http://dev.bizo.com/2013/08/scm-migration.html http://dev.bizo.com/2013/08/scm-migration.html <p>We happily used Atlassian’s hosted OnDemand service for source code management with the following setup:</p> <ul> <li><p>Subversion: source control management</p></li> <li><p>FishEye: source code browsing</p></li> <li><p>Crucible: code reviews</p></li> <li><p><a href="http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html">Jenkins</a> (hosted on EC2): continuous integration and periodic jobs </p></li> </ul> <p>However, Atlassian is ending their OnDemand offering for source code management in October, so it was time for a change. The good news: we had been wanting to migrate to git anyway.
The bad news: we had hundreds of projects in our subversion repository and needed to break them up into separate git repositories. We switched on a Thursday morning with minimal developer interruption; now we&#39;re on a new setup:</p> <ul> <li><p>Bitbucket: source control management and code browsing</p></li> <li><p>Crucible (hosted on EC2): code reviews</p></li> <li><p>Jenkins (hosted on EC2): continuous integration and periodic jobs</p></li> </ul> <p>How&#39;d we do it? Read on, my friend.</p> <h3>Problem</h3> <p>Move hundreds of projects (some with differing branching structures) to an equivalent number of git repositories. And change hundreds of Jenkins job configurations from pulling code out of subversion to pulling code from git. And set up a new Crucible instance for code reviews for the hundreds of repos. All without disrupting the dev team&#39;s work. For subversion, this meant moving the code, including branches, and commit history from subversion into Bitbucket. For Jenkins, it meant changing the job configs to point at the equivalent git repository with the same code and branch as the old subversion configuration. This blog post focuses on the subversion to git migration. Fixing the Jenkins configs will be covered in a later blog post.</p> <h3>Subversion to Git</h3> <p>Converting a single repository from subversion to git is fortunately straightforward due to the terrific tool git-svn (https://www.kernel.org/pub/software/scm/git/docs/git-svn.html). The challenging part was determining how each project configured branches. In subversion, branches are just another subdirectory in the repository. Basically any level of the directory hierarchy can support branches. You can pretty much put them anywhere. Git, however, only supports branches at the root of the repository.
Git-svn allows you to tell git what directory the branches are in, but first you have to find what directory that is.</p> <p>Our subversion repositories followed two primary branching structures: branch at the module level or branch at the project level.</p> <p>The first layout I&#39;ll call &quot;module level&quot;. Module level projects had a separate branch point for each module in the project. These projects were usually several loosely connected modules that could be deployed separately, or libraries that were related but could be imported independently. Module level projects looked like this:</p> <ul> <li><code>svn/&lt;project&gt;/trunk/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/trunk/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module1&gt;/&lt;branch1_for_module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module1&gt;/&lt;branch2_for_module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module2&gt;/&lt;branch1_for_module2&gt;</code></li> </ul> <p>&quot;Module level&quot; projects mapped into a separate git repo for each module using this git-svn command:</p> <p><code>git svn clone &lt;svn_root&gt; --trunk &lt;project&gt;/trunk/&lt;module&gt; --branches &lt;project&gt;/branches/&lt;module&gt; --tags &lt;project&gt;/branches/&lt;tags&gt; &lt;module&gt;</code></p> <p>The other branching structure I’ll call &quot;project level&quot;. These projects also had multiple modules, but the branches were defined such that each branch contained the entire project. These projects were usually separate modules for the domain layer, application layer, and web layer, or closely related applications that use the same database. Parts could perhaps be deployed separately, but they often needed to be deployed at the same time, such as when the database schema changed.
Project level projects looked like this:</p> <ul> <li><code>svn/&lt;project&gt;/trunk/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/trunk/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch1&gt;/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch1&gt;/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch2&gt;/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch2&gt;/&lt;module2&gt;</code></li> </ul> <p>&quot;Project level&quot; projects mapped into a single git repo containing all modules using this git-svn command:</p> <p><code>git svn clone &lt;svn_root&gt; --trunk &lt;project&gt;/trunk --branches &lt;project&gt;/branches --tags &lt;project&gt;/branches &lt;project&gt;</code></p> <p>To automate the git-svn clones, I wrote a ruby script that used &quot;svn ls&quot; to find the list of all projects. Each project was assumed to be &quot;module level&quot; unless it was in a hard-coded list of known &quot;project level&quot; projects. It was important for this to be fully automated, as the list of &quot;project level&quot; projects was not complete until near the end of the migration. It took several tries to make sure the migration was correct. Some projects unfortunately used both branching structures, which is not supported by git-svn. Some of these branches were abandoned anyway, but others were moved using &quot;svn mv&quot; to fit that project&#39;s standard branch structure.</p> <h3>Local Git to Bitbucket</h3> <p>Atlassian provided a <a href="https://go-dvcs.atlassian.com/display/aod/Migrating+from+Subversion+to+Git+on+Bitbucket">jar</a> to push a git-svn repository up to Bitbucket. The jar can also create an authors file from the subversion repository to map a subversion user to the values git needs for a committer - first name, last name and email address. This made scripting the Bitbucket upload for each repository straightforward.
The jar also handles syncs to an existing Bitbucket repository so developers could continue committing to their svn projects and Bitbucket would automatically get updated. Note this only does fast-forward syncs, so the incremental sync stops working once commits are made directly to Bitbucket.</p> <h3>Crucible</h3> <p>Crucible is a tool to facilitate code reviews. It imports commits from your SCM tool, allows inline comments on the diffs and manages the code review life cycle of assigning reviewers, tracking who has approved the changes, and closing the review once approved. Crucible setup is fairly straightforward with a couple of caveats.</p> <p>Crucible needs to access your repositories to pull in the commit history. There is no native support for pointing Crucible at a Bitbucket team account and having Crucible automatically import each repository. There is a free <a href="https://marketplace.atlassian.com/plugins/com.atlassian.fecru.reposync.reposync">add-on</a> that works for an initial import, but initially it did not bring in new repositories that are added to the team account after the initial import. It turns out the update did not work because I was using a Bitbucket user that could not access the User list from the Bitbucket API. Changing the Bitbucket user to one with access to this API endpoint solved this problem. Incremental updates to the repository list are now working.</p> <p>While Crucible supports ssh access to git repositories in general, I ran into the problem described at https://answers.atlassian.com/questions/34283/how-to-connect-to-bitbucket-from-fisheye. Basically, Crucible does not support Bitbucket&#39;s ssh URL format. Instead of using ssh, I had to use https to connect to the Bitbucket repositories.
This means each repository configuration requires the Bitbucket username and password to be specified separately, which is not ideal.</p> <h3>Testing</h3> <p>After running git-svn clone on a few projects, I went ahead and pulled all the projects down with git-svn. The distributed nature of git helped testing because the entire repository could be represented locally without needing to upload it to any server to test the initial clones. However, cloning all the repositories took about 24 hours. During this time there was minimal CPU and I/O load, so I multithreaded the cloning jobs using 16 threads. This improved the time to just 1.5 hours on only a dual-core machine.</p> <p>I was initially hesitant to upload all the repositories to Bitbucket because I did not want to have to manually delete the repos if there was a problem. However, I found the Bitbucket <a href="https://confluence.atlassian.com/display/BITBUCKET/Use+the+Bitbucket+REST+APIs">REST API</a>. It is pretty well put together and was easy to use because it generally follows REST conventions. I&#39;ve yet to find anything that can be done in the UI that cannot be done in the API, which has been outstanding for adding additional niceties like commit hooks to push changes to Crucible for each repository. For the purposes of migration, the best feature was deleting repositories. Knowing I could automatically clean up any mistakes provided the confidence to just let it rip. I actually ended up using this to clean up two false-start migrations:</p> <ul> <li><p>git-svn has a &quot;show-ignore&quot; command to translate files ignored by subversion into a .gitignore file. I initially added .gitignore to the git repositories. However, this meant every repository had a commit in Bitbucket and so would no longer accept changes from subversion. This was resolved by adding .gitignore to subversion before the conversion.</p></li> <li><p>The first authors file I created was missing a few users.
This was discovered by noticing that the Bitbucket commit history did not look as nice as it should. It was nice to be able to just wipe it all out with a single command, fix the authors file, and redo the upload with a single command.</p></li> </ul> <h3>Post-migration</h3> <p>The time following the initial migration was when the automation really came in handy. A couple of developers were out of the office during the cutover. They were able to make commits of their local work to subversion, and then I could re-sync just those repositories even after other developers had begun working on other repositories in Bitbucket. This went very smoothly with no hand-wringing or diff patching required to make sure local work was not lost.</p> <h3>Wrap-up</h3> <p>Overall the migration went off with no hiccups. We&#39;re still tweaking our preferred settings for git pushes and pulls to get to our ideal workflow, but we&#39;re happy to be using Bitbucket. Crucible does not integrate with Bitbucket as nicely as it did with subversion in our old setup. Hopefully Atlassian will continue to make improvements to this integration, as we really like the Crucible code review workflow. I&#39;m always impressed by how automation begets automation. Once you&#39;ve taken the step of automating part of the process, it is so much easier to see the next step. We are already seeing some benefits from the time spent interacting with the Bitbucket API, as we&#39;re now able to add and modify commit hooks on all the repositories easily.</p> Using AWS Custom SSL Domain Names for CloudFront Donnie Thu, 20 Jun 2013 00:00:00 +0000 http://dev.bizo.com/2013/06/using-aws-custom-ssl-domain-names-for.html http://dev.bizo.com/2013/06/using-aws-custom-ssl-domain-names-for.html <p>AWS recently announced the limited availability of <a href="http://aws.typepad.com/aws/2013/06/custom-ssl-domain-names-root-domain-hosting-for-amazon-cloudfront.html">Custom SSL Domain Names for CloudFront</a>.
You have to <a href="http://aws.typepad.com/aws/2013/06/custom-ssl-domain-names-root-domain-hosting-for-amazon-cloudfront.html">request an invitation</a> in order to start using it, but I am guessing it won&#39;t be long until it has been rolled out to all customers.</p> <p>We&#39;ve been asking/waiting for Custom SSL on CloudFront for years and were excited when it finally came out. The sign-up was easy and we were approved a day or two later.</p> <h3>Existing Setup</h3> <p>Our main use case for Custom SSL on CloudFront involves replacing a service that proxies secure requests to our non-secure CloudFront distro. We proxy secure requests because we didn&#39;t want the secure CloudFront domain leaking out to our customers for various reasons including:</p> <ul> <li>We wanted to be able to point the domain elsewhere if we needed to</li> <li>We wanted to keep our branding consistent on domains. </li> </ul> <p>It basically looks like the following diagram:</p> <p><a href="http://dev.bizo.com/images/posts/proxy.jpg"><img src="http://dev.bizo.com/images/posts/proxy.jpg" alt=""></a></p> <p>The problem with having a proxy is twofold:</p> <ul> <li>We have to operate that proxy, which goes against our general rule to &quot;never operate services when AWS can do it for you&quot;</li> <li>We get subpar performance since requests are no longer served from a distributed, geo-located CDN.</li> </ul> <p>But we needed the flexibility and branding mentioned above, so we dealt with it. Not anymore...</p> <h3>Migrating to Custom SSL Domain Names for CloudFront</h3> <p>Once we got approval for custom SSL, the migration was pretty straightforward.
I am not going to regurgitate the <a href="http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/SecureConnections.html#CNAMEsAndHTTPS">detailed documentation</a> but will summarize the process.</p> <ul> <li><p>Upload your SSL cert and make sure the path starts with &quot;/cloudfront&quot; (this was annoying because we couldn&#39;t reuse the existing certificates that we were already using for ELBs)</p></li> <li><p>Update your CF distro (I did so via the AWS Console):</p> <ul> <li>add the domain name you want to support (e.g. secure-example.bizographics.com from above)</li> <li>choose the SSL cert that you uploaded in the first step</li> <li>Save</li> </ul></li> <li><p>Wait for the CF distro to redeploy the configuration change</p></li> <li><p>Update your Route53 DNS to point at the CF CNAME rather than the ELB endpoint</p></li> <li><p>Wait for DNS to update</p></li> <li><p>Shut down the proxy&#39;s ELB</p></li> </ul> <p>As you can see, this was pretty easy. Most of the time was spent waiting for the CF distro to re-deploy (10s of minutes max) and DNS to update (which can take several days). </p> <p>All-in-all, the minor annoyance of having two copies of the same SSL cert was worth the win of not having to operate the proxy and getting better performance for our customers. Check out the graph below showing the improved performance:</p> <p><a href="http://dev.bizo.com/images/posts/response-time.jpg"><img src="http://dev.bizo.com/images/posts/response-time.jpg" alt=""></a></p> <h3>Note on Cost</h3> <p>The <a href="http://aws.amazon.com/cloudfront/pricing/">cost of custom SSL on CF</a> seems OK but could be better, and the wording is not totally clear: &quot;You pay $600 per month for each custom SSL certificate associated with one or more CloudFront distributions.&quot; We have the same cert set up for multiple CF distros, but I am not sure if we will be charged $600 for each distro using the cert or $600 for each cert regardless of how many distros are using it.
(Will try to get clarification...) AWS claims the pricing is comparable to other similar offerings. That doesn&#39;t seem to jibe with their usual practice of driving costs much lower, but it is livable for now. </p> Scala Command-Line Hacks Alex Boisvert Mon, 22 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/scala-command-line-hacks.html http://dev.bizo.com/2013/04/scala-command-line-hacks.html <p>Do you like command-line scripting and one-liners with <a href="http://www.unixguide.net/unix/perl_oneliners.shtml">Perl</a>, <a href="http://reference.jumpingmonkey.org/programming_languages/ruby/ruby-one-liners.html">Ruby</a> and the like?</p> <p>For instance, here&#39;s a Ruby one-liner that uppercases the input:</p> <div class="highlight"><pre><code class="ruby">% echo matz | ruby -p -e &#39;$_.tr! &quot;a-z&quot;, &quot;A-Z&quot;&#39;
MATZ
</code></pre></div> <p>You like that kind of stuff? Yes? Excellent! Then I offer you a hacking idea for Scala.</p> <p>As you may know, Scala offers a similar capability with the <code>-e</code> command-line option, but it&#39;s fairly limited in its basic form because of the necessary boilerplate code to set up iteration over the standard input...
it just begs for a simple HACK!</p> <p>Using a simple bash wrapper,</p> <div class="highlight"><pre><code class="bash">#!/bin/bash
#
# Usage: scala-map MAP_CODE
#
code=$(cat &lt;&lt;END
scala.io.Source.stdin.getLines map { $@ } foreach println
END
)
scala -e &quot;$code&quot;
</code></pre></div> <p>then we can express similar one-liners using Scala code and the standard library:</p> <div class="highlight"><pre><code class="bash">% ls | scala-map _.toUpperCase
# FOO
# BAR
# BAZ
# ...
% echo &quot;foo bar baz&quot; | scala-map &#39;_.split(&quot; &quot;).mkString(&quot;-&quot;)&#39;
# foo-bar-baz
</code></pre></div> <p>Nifty, right?
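</p> <p>For comparison, the same pipe-style per-line mapping can be written JVM-side in plain Java. This is just a sketch: the class name and the hard-coded transformation are illustrative, not part of any wrapper we ship:</p> <div class="highlight"><pre><code class="java">import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Rough Java analogue of the scala-map wrapper: read stdin line by line,
// apply a transformation, and print each result.
public class LineMap {
    // The per-line transformation; swap in whatever one-liner logic you need.
    static String transform(String line) {
        return line.toUpperCase();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(transform(line));
        }
    }
}
</code></pre></div> <p>Of course, the whole point of the bash wrapper is that you don&#39;t have to write a class at all.</p> <p>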
Here&#39;s another script template to fold over the standard input,</p> <div class="highlight"><pre><code class="bash">#!/bin/bash
#
# Usage: scala-fold INITIAL_VALUE FOLD_CODE
#
# where the following val&#39;s can be used in FOLD_CODE:
#
# `acc` is bound to the accumulator value
# `line` is bound to the current line
#
code=$(cat &lt;&lt;END
println(scala.io.Source.stdin.getLines.foldLeft($1) { case (acc, line) =&gt; $2 })
END
)
scala -e &quot;$code&quot;
</code></pre></div> <p>Now if you wanted to calculate the sum of the second column of space-separated input, you&#39;d write:</p> <div class="highlight"><pre><code class="bash">$ cat | scala-fold 0 &#39;acc + (line.split(&quot; &quot;)(1).toInt)&#39;
foo 1
bar 2
baz 3
(CTRL-D)
6
</code></pre></div> <p>You get the idea ... hopefully this inspires you to try a few things with Scala scripting templates!</p> Efficiency & Scalability Alex Boisvert Fri, 19 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/efficiency-scalability.html http://dev.bizo.com/2013/04/efficiency-scalability.html <p>Software engineers know that distributed systems are often hard to scale and many can intuitively point to reasons why this is the case by bringing up points of contention, bottlenecks and latency-inducing operations.
Indeed, there exists a plethora of reasons and explanations as to why most distributed systems are inherently hard to scale, from the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> to the scarcity of certain resources, e.g., RAM, network bandwidth ...</p> <p>It&#39;s said that good engineers know how to identify resources that may not appear to be relevant to scaling initially but will become more significant as particular kinds of demand grow. If that’s the case, then great engineers know that system architecture is often the determining factor in system scalability — that a system’s own architecture may be its worst enemy — so they define and structure systems in order to avoid fundamental flaws.</p> <p>In this post, I want to explore the relationship between system efficiency and scalability in distributed systems; they are to some extent two sides of the same coin. We’ll consider specifically two common system architecture traits: replication and routing. Some of this may seem obvious to some of you, but it’s always good to back intuition with some additional reasoning.</p> <p>Before we go any further, it’s helpful to formulate a definition of efficiency applicable to our context:</p> <p><strong>efficiency</strong> is the extent to which work is performed relative to the total work and/or cost incurred.</p> <p>We’ll also use the following definition of scalability,</p> <p><strong>scalability</strong> is the ability of a system to accommodate an increased workload by repeatedly applying a cost-effective strategy for extending a system’s capacity.</p> <p>So, scalability and efficiency are both determined by cost-effectiveness, with the distinction that scalability is a measure of marginal gain. Stated differently, if efficiency decreases significantly as a system grows, then a system is said to be <strong>non-scalable</strong>.</p> <p>Enough rambling, let’s get our thinking caps on!
Since we’re talking about distributed systems, it’s practically inevitable to compare against traditional single-computer systems, so we’ll start with a narrow definition of system efficiency:</p> <div class="highlight"><pre><code class="bash">             average work for processing a request on a single computer
Efficiency = ——————————————————————————————————————————————————————————
             average work for processing a request in distributed system
</code></pre></div> <p>This definition is a useful starting point for our exploration because it abstracts out the nature of the processing that’s happening within the system; it’s overly simple but it allows us to focus our attention on the big picture.</p> <p>More succinctly, we&#39;ll write:</p> <div class="highlight"><pre><code class="bash">(1) Efficiency = Wsingle / Wcluster
</code></pre></div> <h2>Replication Cost</h2> <p>Many distributed systems replicate some or all of the data they process across different processing nodes (to increase reliability, availability or read performance), so we can model:</p> <div class="highlight"><pre><code class="bash">(2) Wcluster = Wsingle + (r x Wreplication)
</code></pre></div> <p>where <code>r</code> is the number of replicas in the system and <code>Wreplication</code> is the work required to replicate the data to other nodes. <code>Wreplication</code> is typically lower than <code>Wsingle</code>, though realistically they have different cost models (e.g., <code>Wsingle</code> may be CPU-intensive whereas <code>Wreplication</code> may be I/O-intensive).
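</p> <p>To make the cost model in (1) and (2) concrete, here is a small numeric sketch; the class name and sample numbers are made up for illustration:</p> <div class="highlight"><pre><code class="java">// Numeric sketch of equations (1) and (2): total cluster work under
// replication, and the resulting efficiency.
public class ReplicationCost {
    // Wcluster = Wsingle + (r x Wreplication)   -- equation (2)
    static double clusterWork(double wSingle, int r, double wReplication) {
        return wSingle + r * wReplication;
    }

    // Efficiency = Wsingle / Wcluster           -- equation (1)
    static double efficiency(double wSingle, int r, double wReplication) {
        return wSingle / clusterWork(wSingle, r, wReplication);
    }

    public static void main(String[] args) {
        // 3 replicas, each costing 10% of a single-node request:
        // Wcluster = 1.3 units of work per request, Efficiency = 1/1.3 = 76.9%.
        System.out.println(efficiency(1.0, 3, 0.10));
    }
}
</code></pre></div> <p>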
If <code>n</code> is the number of nodes in the system, then <code>r</code> may be as large as <code>(n-1)</code>, meaning replicating to all other nodes, though most systems will only replicate to 2 or 3 other nodes — for good reason, as we’ll discover later.</p> <p>We’ll now define the replication coefficient, which expresses the relative cost of replication compared to the cost of processing the request on a single node:</p> <div class="highlight"><pre><code class="bash">(3) Qreplication = Wreplication / Wsingle
</code></pre></div> <p>Solving (3) for <code>Wreplication</code>, we get:</p> <div class="highlight"><pre><code class="bash">(4) Wreplication = Qreplication x Wsingle
</code></pre></div> <p>If we substitute <code>Wreplication</code> in (2) by the equation formulated in (4), we obtain:</p> <div class="highlight"><pre><code class="bash">(5) Wcluster = Wsingle + (r x Qreplication x Wsingle)
</code></pre></div> <p>We now factor out <code>Wsingle</code> on the right-hand side:</p> <div class="highlight"><pre><code class="bash">(6) Wcluster = Wsingle x [ 1 + (r x Qreplication) ]
</code></pre></div> <p>Taking the efficiency equation (1) and substituting from (6), the equation becomes:</p> <div class="highlight"><pre><code class="bash">(7) Efficiency = Wsingle / [ Wsingle x ( 1 + (r x Qreplication) ) ]
</code></pre></div> <p>We then cancel <code>Wsingle</code> to obtain the final efficiency for a replicating distributed system:</p> <div class="highlight"><pre><code class="bash">(8) Efficiency (replication) = 1 / [ 1 + (r x Qreplication) ]
</code></pre></div> <p>As expected, both <code>r</code> and <code>Qreplication</code> are critical factors determining efficiency.</p> <p>Interpreting this last equation and assuming <code>Qreplication</code> is a constant inherent to the system’s processing, our two takeaways are:</p> <ul> <li>If the system replicates to all other nodes (i.e., <code>r = n - 1</code>), it becomes clear that the efficiency of the system will degrade as more nodes are added and will approach zero as <code>n</code> becomes sufficiently large.</li> </ul> <p>To illustrate this, let&#39;s assume <code>Qreplication = 10%</code>: </p> <ul> <li>Efficiency (r = 1, n = 2) = 91%</li> <li>Efficiency (r = 2, n = 3) = 83%</li> <li>Efficiency (r = 3, n = 4) = 76%</li> <li>Efficiency (r = 4, n = 5) = 71%</li> <li>Efficiency (r = 5, n = 6) = 67%</li> <li>...</li> </ul> <p>In other words, <strong>fully-replicated distributed systems don&#39;t scale.</strong></p> <ul> <li>For a system to scale, the replication factor should be a (small) constant.</li> </ul> <p>Let&#39;s illustrate this with <code>Qreplication</code> fixed at 10% and using a replication factor of 3:</p> <ul> <li>Efficiency (r = 3, n = 4) = 76%</li> <li>Efficiency (r = 3, n = 5) = 76%</li> <li>Efficiency (r = 3, n = 6) = 76%</li> <li>Efficiency (r = 3, n = 7) = 76%</li> <li>Efficiency (r = 3, n = 8) = 76%</li> <li>...</li> </ul> <p>As we can see, <strong>fixed-replication-factor distributed systems scale</strong> — although, as you might expect, they do not exhibit the same
efficiency as a single-node system. At worst, the efficiency will be <code>1/r</code> — as you would intuitively expect.</p> <h2>Routing Cost</h2> <p>When a distributed system routes requests to nodes holding the relevant information (e.g., a partially replicated system, <code>r &lt; n</code>), its working model may be defined as:</p> <div class="highlight"><pre><code class="bash">(9) Wcluster = (r/n) x Wsingle + ((n-r)/n) x (Wrouting + Wsingle)
</code></pre></div> <p>The above equation represents the fact that <code>r</code> out of <code>n</code> requests are processed locally whereas the remainder of the requests are routed to and processed on a different node.</p> <p>Let’s define the routing coefficient to be:</p> <div class="highlight"><pre><code class="bash">(10) Qrouting = Wrouting / Wsingle
</code></pre></div> <p>We substitute <code>Wrouting</code> from (10) into (9) to obtain:</p> <div class="highlight"><pre><code class="bash">(11) Wcluster = (r/n) x Wsingle + ((n-r)/n) x [ (Qrouting x Wsingle) + Wsingle ]
</code></pre></div> <p>and taking the efficiency equation (1), substituting from (11), the simplified equation becomes:</p> <div class="highlight"><pre><code class="bash">(12) Efficiency (routing) = n /
[ n + ((n - r) x Qrouting) ]
</code></pre></div> <p>Looking at this last equation, we can infer that:</p> <ul> <li><p>As the system grows and <code>n</code> goes towards infinity, the efficiency of the system approaches 1 / (1 + Qrouting). The efficiency is not dependent on the actual number of nodes within the system; therefore, routing-based systems generally scale. (But you knew that already.)</p></li> <li><p>If the number of nodes is large compared to the replication factor (n &gt;&gt; r) and Qrouting is significant (1.0, i.e., the same cost as Wsingle), then the efficiency is ½, or 50%. This matches the intuition that the system is routing practically all requests and therefore spending half of its efforts on routing. The system is scaling linearly, but it’s costing twice as much to operate (for every node) compared to a single-node system.</p></li> <li><p>If the cost of routing is insignificant (Qrouting = 0), the efficiency is 100%. That’s right, if it doesn’t cost anything to route the request to a node that can process it, the efficiency is the same as a single-node system.</p></li> </ul> <p>Let’s consider a practical distributed system with 10 nodes (n = 10), a replication factor of 3 (r = 3), and a relative routing cost of 10% (Qrouting = 0.10). This system would have an efficiency of 10 / [ 10 + (7 x 0.10) ] ≈ 93.46%. As you can see, routing-based distributed systems can be pretty efficient if <code>Qrouting</code> is relatively small.</p> <h3>Where To Now?</h3> <p>Well, this was a fun exploration of system scalability in the abstract. We came up with interesting equations to describe the scalability of both data-replicating and request-routing architectures. With some tinkering, these can serve as a good basis for reasoning about some of your distributed systems.</p> <p>In real life, however, there are many other aspects to consider when scaling systems.
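</p> <p>As a quick sanity check on the arithmetic, the routing-efficiency formula can be evaluated directly; this snippet (names illustrative) reproduces the worked example and the limiting behaviour:</p> <div class="highlight"><pre><code class="java">// Efficiency (routing) = n / [ n + ((n - r) x Qrouting) ]
public class RoutingCost {
    static double efficiency(int n, int r, double qRouting) {
        return n / (n + (n - r) * qRouting);
    }

    public static void main(String[] args) {
        // Worked example: n = 10, r = 3, Qrouting = 0.10 gives about 93.46%.
        System.out.println(efficiency(10, 3, 0.10));
        // As n grows with r fixed, efficiency approaches 1 / (1 + Qrouting).
        System.out.println(efficiency(1000000, 3, 0.10)); // about 90.9%
    }
}
</code></pre></div> <p>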
In fact, it often feels like a whack-a-mole hunt; you never know where the next performance non-linearity is going to rear its ugly head. But if you use either (or both) of the data-replicating and request-routing style architectures with reasonable replication factors, and you manage to keep your replication/routing costs well below your single-node processing costs, you may find some comfort in knowing that at least you haven’t introduced a fundamental scaling limitation into your system.</p> Sensible Defaults for Apache HttpClient Mark Dietz Mon, 15 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/sensible-defaults-for-apache-httpclient.html http://dev.bizo.com/2013/04/sensible-defaults-for-apache-httpclient.html <p>Before coming to Bizo, I wrote a web service client that retrieved daily XML reports over HTTP using the Apache <a href="http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/DefaultHttpClient.html">DefaultHttpClient</a>. Everything went fine until one day the connection simply hung forever. We found this odd because we had set the connection timeout. It turned out we also needed to set the socket timeout (<code>HttpConnectionParams.SO_TIMEOUT</code>). The default for both the connection timeout (max time to wait for a connection) and the socket timeout (max time to wait between consecutive data packets) is infinity. The server was accepting the connection but then not sending any data, so our client hung forever without even reporting any errors. Rookie mistake, but everyone is a rookie at least once. Even if you are an expert with HttpClient, chances are there will be someone maintaining your code in the future who is not.</p> <p>Another problem with defaults using HttpClient is with <a href="http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/conn/PoolingClientConnectionManager.html">PoolingClientConnectionManager</a>.
PoolingClientConnectionManager has two attributes: MaxTotal and MaxPerRoute. MaxTotal is the maximum total number of connections in the pool. MaxPerRoute is the maximum number of connections to a particular host. If the client attempts to make a request and either of these maximums has been reached, then by default the client will block until a connection is free. Unfortunately, the default for MaxTotal is 20 and the default MaxPerRoute is only 2. In a SOA, it is common to have many connections from a client to a particular host. The limit of 2 (or even 1) connections per host makes sense for a polite web crawler, but in a SOA, you are likely going to need a lot more. Even the 20 maximum total connections in the pool is likely much lower than desired.</p> <p>If the client does reach the MaxPerRoute or the MaxTotal connections, it will block until the connection manager timeout (<code>ClientPNames.CONN_MANAGER_TIMEOUT</code>) is reached. This timeout controls how long the client will wait for a connection from the connection manager. Fortunately, if this timeout is not set directly, it will default to the connection timeout if that is set, which will prevent the client from queuing up requests indefinitely.</p> <h3>What would a better set of defaults be?</h3> <p>A good default is something that is &quot;safe&quot;. A safe default for a connection timeout is long enough to not give up waiting when things are working normally, but short enough to not cause system instability when the service is down. Unfortunately, safe is context-dependent. Safe for a daily data sync process and safe for an in-thread service request handler are very different. Safe for a request that is critical to the correct functioning of the program is different than safe for some ancillary logging that is ok to miss 1% of the time.
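</p> <p>For reference, explicitly overriding the dangerous defaults discussed above looks roughly like this with the 4.2-era API; the numeric values here are placeholders to be tuned per use case, not recommendations:</p> <div class="highlight"><pre><code class="java">import org.apache.http.client.params.ClientPNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.PoolingClientConnectionManager;
import org.apache.http.params.HttpConnectionParams;
import org.apache.http.params.HttpParams;

// Build a client with every dangerous default overridden explicitly.
public class SafeClientFactory {
    public static DefaultHttpClient newClient() {
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(200);           // default is only 20
        cm.setDefaultMaxPerRoute(100); // default is only 2

        DefaultHttpClient client = new DefaultHttpClient(cm);
        HttpParams params = client.getParams();
        HttpConnectionParams.setConnectionTimeout(params, 10000); // ms; default is infinite
        HttpConnectionParams.setSoTimeout(params, 10000);         // ms; default is infinite
        params.setLongParameter(ClientPNames.CONN_MANAGER_TIMEOUT, 10000L);
        return client;
    }
}
</code></pre></div> <p>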
A default for timeouts that is safe in all cases is not really possible.</p> <p>Safe defaults for PoolingClientConnectionManager&#39;s MaxTotal and MaxPerRoute should be big enough that they won’t be hit unless there is a bug. New in version 4.2 is the <a href="http://hc.apache.org/httpcomponents-client-ga/fluent-hc/index.html">fluent-hc</a> API for making http requests. This uses a PoolingClientConnectionManager with defaults of 200 MaxTotal and 100 MaxPerRoute. We are using these same defaults for all our configurations.</p> <p>Note that the fluent-hc API is very nice, but requires setting the connection timeouts on each request. This is perfect if you need to tune the settings for each request, but it does not provide a safety check against accidentally leaving the timeout infinite.</p> <h3>How can you help out a new dev implementing a new HTTP client?</h3> <p>If you can&#39;t have a safe default and the existing defaults are decidedly not safe, then it is best to require a configuration. We created a wrapper for PoolingClientConnectionManager that requires the developer to choose a configuration instead of letting the defaults silently take effect. One way to require a configuration is to force passing in the timeout values. However, it can be hard to know the right values, especially when stepping into a new environment. To help a developer implementing a new client at Bizo, we created some canonical configurations in the wrapper based on our experience working in our production environment on AWS.
The configurations are:</p> <table class="table"> <thead> <tr> <th>Configuration</th> <th>Connection timeout</th> <th>Socket timeout</th> <th>MaxTotal</th> <th>MaxPerRoute</th> </tr> </thead> <tbody> <tr> <td>SameRegion</td> <td>125 ms</td> <td>125 ms</td> <td>200</td> <td>100</td> </tr> <tr> <td>SameRegionWithUSEastFailover</td> <td>1 second</td> <td>1 second</td> <td>200</td> <td>100</td> </tr> <tr> <td>CrossRegion</td> <td>10 seconds</td> <td>10 seconds</td> <td>200</td> <td>100</td> </tr> <tr> <td>MaxTimeout</td> <td>1 minute</td> <td>5 minutes</td> <td>200</td> <td>100</td> </tr> </tbody> </table> <p>Clients with critical latency requirements can use the SameRegion configuration and need to make sure they are connecting to a service in the same AWS region. Back-end processes that can tolerate latency can use the MaxTimeout configuration. Now when a developer is implementing a new client, the timeouts used by other services are readily available without having to hunt through other code bases. The developer can compare these with the current use case and choose an appropriate configuration. Additionally, if we learn that some of these configurations need to be tweaked, then we can easily modify all affected code.</p> <p>Commonly, the socket timeout will need to be adjusted for a specific service. After a connection is established, a service will not typically start sending its response until it has finished whatever calculation was requested. This can vary greatly even for different parameters on the same service endpoint. The socket timeout will need to be set based on the expected response times of the service.</p> <p>It is easy to miss a particular setting even if you know it is there. At Bizo, we are always looking for ways to solve a problem in one place.
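</p> <p>In sketch form, the canonical-configuration idea looks like the following; the enum name is hypothetical (not our actual wrapper), but the values mirror the table above:</p> <div class="highlight"><pre><code class="java">// Force callers to pick a named profile instead of inheriting unsafe defaults.
public enum HttpClientConfig {
    SAME_REGION(125, 125),
    SAME_REGION_WITH_US_EAST_FAILOVER(1000, 1000),
    CROSS_REGION(10000, 10000),
    MAX_TIMEOUT(60000, 300000);

    // Pool sizes are the same across profiles, per the table.
    public static final int MAX_TOTAL = 200;
    public static final int MAX_PER_ROUTE = 100;

    public final int connectionTimeoutMs;
    public final int socketTimeoutMs;

    HttpClientConfig(int connectionTimeoutMs, int socketTimeoutMs) {
        this.connectionTimeoutMs = connectionTimeoutMs;
        this.socketTimeoutMs = socketTimeoutMs;
    }

    public static void main(String[] args) {
        for (HttpClientConfig c : values()) {
            System.out.println(c + ": connect=" + c.connectionTimeoutMs
                    + "ms, socket=" + c.socketTimeoutMs + "ms");
        }
    }
}
</code></pre></div> <p>An actual wrapper would feed the chosen profile into the connection manager and client params at construction time.</p> <p>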
We are hopeful that this will eliminate any issues we have had with bad defaults in our HttpClients.</p> Map-side aggregations in Apache Hive Darren Lee Mon, 18 Feb 2013 00:00:00 +0000 http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html <p>When running large-scale Hive reports, one error we occasionally run into is the following:</p> <h4>Possible error:</h4> <p>Out of memory due to hash maps used in map-side aggregation.</p> <h4>Solution:</h4> <p>Currently <code>hive.map.aggr.hash.percentmemory</code> is set to 0.5. Try setting it to a lower value, i.e. <code>&#39;set hive.map.aggr.hash.percentmemory = 0.25;&#39;</code> </p> <p>What&#39;s going on is that Hive is trying to optimize the query by performing a map-side aggregation: a partial aggregation inside the mapper, which results in the mapper outputting fewer rows. In turn, this reduces the amount of information that Hadoop needs to sort and distribute to the reducers.</p> <p>Let&#39;s think about what the Hadoop job looks like with the canonical word count example.</p> <p>In the word count example, the naive approach is for the mapper to tokenize each row of input and output the key-value pair (#{token}, 1). The Hadoop framework will sort these pairs by the tokens, and the reducer sums the values to produce the total counts for each token.</p> <p>Using a map-side aggregation, the mappers would instead tokenize each row and store partial counts in an in-memory hash map. (More precisely, the mappers are storing each key with the corresponding partial aggregation, which is just a count in this case.) Periodically, the mappers will output the pairs (#{token}, #{token_count}). The Hadoop framework again sorts these pairs and the reducers sum the values to produce the total counts for each token.
In this case, the mappers will each output one row for each token every time the map is flushed, instead of one row for each occurrence of each token. The tradeoff is that they need to keep a map of all tokens in memory.</p> <p>By default, Hive will try to use the map-side aggregation optimization, but it falls back to the standard approach if the hash map is not producing enough of a memory savings. After processing 100,000 rows (modifiable via <code>hive.groupby.mapaggr.checkinterval</code>), Hive will check the number of items in the hash map. If it exceeds 50% (modifiable via <code>hive.map.aggr.hash.min.reduction</code>) of the number of rows read, the map-side aggregation will be aborted.</p> <p>Hive will also estimate the amount of memory needed for each entry in the hash map and flush the map to the reducers whenever the size of the map exceeds 50% of the available mapper memory (modifiable via <code>hive.map.aggr.hash.percentmemory</code>). This, however, is an estimate based on the number of rows and the expected size of each row, so if the memory usage per row is unexpectedly high, the mappers may run out of memory before the hash map is flushed to the reducers.</p> <p>In particular, if a query uses a count distinct aggregation, the partial aggregations actually contain a list of all values seen. As more distinct values are seen, the amount of memory used by the map will increase without necessarily increasing the number of rows of the map, which is what Hive uses to determine when to flush the partial aggregations to the reducers.</p> <p>Whenever a mapper runs out of memory, a group by clause is present, and map-side aggregation is turned on, Hive will helpfully suggest that you reduce the flush threshold to avoid running out of memory.
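</p> <p>The row-count tradeoff is easy to see in a toy model of the mapper&#39;s hash aggregation; this sketch (names and thresholds illustrative) flushes on entry count, whereas Hive&#39;s actual check is based on estimated memory:</p> <div class="highlight"><pre><code class="java">import java.util.HashMap;

// Toy model of map-side aggregation for word count: keep partial counts in a
// hash map and periodically flush (emit) the partial aggregates to reducers.
public class MapSideAggregation {
    // Returns the number of rows the mapper emits, flushing the hash map
    // every checkInterval input rows.
    static int emittedRows(String[] tokens, int checkInterval) {
        HashMap partial = new HashMap();
        int emitted = 0;
        int seen = 0;
        for (String token : tokens) {
            Integer count = (Integer) partial.get(token);
            partial.put(token, count == null ? 1 : count + 1);
            if (++seen % checkInterval == 0) { // periodic flush to reducers
                emitted += partial.size();
                partial.clear();
            }
        }
        return emitted + partial.size();       // final flush
    }

    public static void main(String[] args) {
        String[] input = { "a", "b", "a", "a", "c", "b" };
        // A naive mapper emits 6 rows (one per occurrence); with aggregation
        // and no intermediate flush, only 3 rows (one per distinct token).
        System.out.println(emittedRows(input, 100));
    }
}
</code></pre></div> <p>More frequent flushes emit more rows but cap the map&#39;s memory footprint, which is exactly the dial the Hive settings above control.</p> <p>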
This will lower the threshold (in rows) at which Hive will automatically flush the map, but it may not help if the map size (in bytes) is growing independently of the number of rows.</p> <p>Some alternate solutions include simply turning off map-side aggregations (<code>set hive.map.aggr = false</code>), allocating more memory to your mappers via the Hadoop configuration, or restructuring the query so that Hive will pick a different query plan.</p> <p>For example, a simple</p> <div class="highlight"><pre><code class="sql">select count(distinct v) from tbl
</code></pre></div> <p>can be rewritten as</p> <div class="highlight"><pre><code class="sql">select count(1) from (select v from tbl group by v) t
</code></pre></div> <p>This latter query avoids the count distinct aggregation and may be more efficient for some queries.</p> Reader Driven Development larry ogrodnek Fri, 15 Feb 2013 00:00:00 +0000 http://dev.bizo.com/2013/02/reader-driven-development.html http://dev.bizo.com/2013/02/reader-driven-development.html <p>In this talk on <a href="http://vimeo.com/14313378">Effective ML</a>, <a href="http://cufp.org/users/yminsky">Yaron Minsky</a> talks about Reader Driven Development. That is, writing your code with the reader in mind.
Making decisions that will make the code more easily read and understood by other developers down the line.</p> <blockquote> <p>The interest of the reader always pushes in the direction of clarity, simplicity, and the ability to change the code later. In most real projects, code is read and changed many more times than it is written. The reader&#39;s interests are paramount in that regard.</p> </blockquote> <p>When writing code, the interests of the reader and writer may be at odds, and when faced with a decision, always err in the direction of the reader. The reader is always right. Regardless of team size, it&#39;s helpful to program this way. Even code you&#39;ve written yourself may not be as clear 6 months or a year later otherwise. Great perspective, and I think it fits in nicely with previous posts here on <a href="http://dev.bizo.com/2012/06/the-golden-rule-of-programming-style.html">programming style</a> and <a href="http://dev.bizo.com/2012/03/on-code-reviews-and-developer-feedback.html">code reviews</a> (tend to agree with your reviewers, they are the audience!).</p> Asanban: Lean Development with Asana and Kanban Pat Gannon Wed, 23 Jan 2013 00:00:00 +0000 http://dev.bizo.com/2013/01/asanban-lean-development-with-asana-and.html http://dev.bizo.com/2013/01/asanban-lean-development-with-asana-and.html <p>On Bizo&#39;s External Apps team (aka &#39;xapps&#39;), we&#39;ve been using a Kanban system to manage our work. All of Bizo Engineering uses Asana to track tasks, which isn&#39;t specifically designed for Kanban. We&#39;ve settled on a set of conventions that we use in Asana which enable our Kanban system. These conventions also help us to track metrics like the average lead time from month to month.</p> <h3>Background</h3> <p><a href="http://en.wikipedia.org/wiki/Kanban">Kanban</a> is a second-generation Agile software development methodology. The focus is on finding and fixing bottlenecks, as well as removing waste by limiting work-in-progress.
(The &quot;WIP&quot; limits referenced in this post are the number of work items that are allowed to be in a particular stage of the system at one time.) Adopting a Kanban system has made things easier for engineers, increased our efficiency, and proven very popular with our Product Management folks as well. We are now focused on delivering value incrementally rather than specifying and implementing larger chunks of work. If you&#39;re interested in adopting Kanban, I recommend reading David Anderson&#39;s seminal book on the topic: <a href="http://amzn.com/0984521402">Kanban: Successful Evolutionary Change for Your Technology Business</a>.</p> <h3>Conventions</h3> <p>Each stage of work in our value chain is a priority heading in Asana. The name of the priority heading follows the convention &quot;{STEP NAME} ({WIP}):&quot;, e.g. &quot;Dev Ready (10):&quot;. The steps that are earliest in the value chain are at the bottom of our Asana project, with tasks moving upwards through each stage until they reach &quot;Production (15):&quot; at the top when the functionality described by the task has been delivered to production. Once Product Management has verified that the functionality described by a task is functioning correctly in production, they mark the task as complete. We use tags to represent work item types, although this is fairly limited at present.</p> <p><a href="http://dev.bizo.com/images/posts/Screen+Shot+2013-01-23+at+2.59.18+PM.png"><img src="http://dev.bizo.com/images/posts/Screen+Shot+2013-01-23+at+2.59.18+PM.png" alt=""></a></p> <h3>Metrics</h3> <p>One of the most basic metrics to track in a Kanban system is the average amount of lead time (the time it takes from when a task gets added to the input queue until value is delivered). I have created some tooling that allows us to track this systematically. I&#39;ll first describe what it does, and then how you can use it.</p> <p>The first piece of the tooling bulk loads task data from the Asana API into MongoDB. 
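The lead-time figure itself is simple arithmetic once the timestamps are in hand. Here is a hypothetical Scala sketch of the calculation, not asanban's actual code; the names `TaskTimes`, `leadTimeDays`, and `averageLeadTimeDays` are illustrative assumptions:

```scala
import java.time.{Duration, Instant}

// Hypothetical record of when a task entered the input queue and when it shipped.
case class TaskTimes(createdAt: Instant, completedAt: Instant)

object LeadTime {
  // Lead time for one task, in (possibly fractional) days.
  def leadTimeDays(t: TaskTimes): Double =
    Duration.between(t.createdAt, t.completedAt).toMillis / 86400000.0

  // Average lead time across a batch of completed tasks.
  def averageLeadTimeDays(tasks: Seq[TaskTimes]): Double =
    tasks.map(leadTimeDays).sum / tasks.size
}
```

A task created on Jan 1 and delivered on Jan 4 contributes a lead time of 3 days; averaging such figures per calendar month gives the month-to-month trend.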
The API returns JSON and I just store JSON as-is in MongoDB, which works out well since MongoDB speaks JSON natively. One hiccup is that once tasks are archived in Asana, you can no longer obtain information about them through the API. Accordingly, the bulk load needs to be scheduled to run on a regular basis (in our case, every night) so that we don&#39;t lose information about archived tasks. Furthermore, we have a policy that tasks should not be archived until they have been completed for at least 24 hours, so that the bulk loader will always run at least once after a task has been completed before it gets archived. After loading the task data, the bulk loader will create data describing how much time each task spent in each state, as well as how long each task took (in days) to complete from start to finish (lead time).</p> <p>The other piece is a Sinatra web service that runs a map-reduce against the lead time data created by the bulk loader and serves lead times by month as JSON. It can also aggregate by year or day (but I don&#39;t think aggregating by day is useful).</p> <p>I have packaged up both of those pieces into a gem called &quot;asanban&quot;, which you can use. The source code and instructions for installation and usage are here: <a href="https://github.com/patgannon/asanban">https://github.com/patgannon/asanban</a></p> <h3>Pain Points</h3> <p>There are a couple of problems I&#39;ve run into using Asana with a Kanban system. The first is that there&#39;s no way to enforce WIP limits. Users just have to be mindful of the limits shown in the priority headers. I have been thinking about writing a nightly report that uses the data created by the bulk loader to find violated WIP limits and send out emails, but I haven&#39;t gotten to it yet. (This tooling is essentially a hack day project at this point.) 
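Such a check is mostly bookkeeping: parse the limit out of each priority heading (the &quot;{STEP NAME} ({WIP}):&quot; convention described above) and compare it to the current task count per stage. A hypothetical sketch, with `WipCheck` and its method names being made-up:

```scala
object WipCheck {
  // Parses a priority heading like "Dev Ready (10):" into (stage name, WIP limit).
  private val Heading = """(.+) \((\d+)\):""".r

  def parseHeading(h: String): Option[(String, Int)] = h match {
    case Heading(stage, limit) => Some(stage -> limit.toInt)
    case _                     => None
  }

  // Returns the stages whose current task count exceeds their WIP limit.
  def violations(headings: Seq[String], counts: Map[String, Int]): Seq[String] =
    headings.flatMap(parseHeading).collect {
      case (stage, limit) if counts.getOrElse(stage, 0) > limit => stage
    }
}
```

With headings `Seq("Dev Ready (10):", "Production (15):")` and 12 tasks currently in &quot;Dev Ready&quot;, the report would flag only that stage.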
There is also no functionality to facilitate different classes of service (SLAs and WIP break-downs), but maybe those could be supported using the same kind of nightly report.</p> <p>Another problem I&#39;ve run into is that task sizes can be all over the place, which reduces the meaningfulness of the metrics. Some Kanban practitioners use hierarchical work items to address this kind of variability in size. Stories can be grouped into epics and/or broken down into &quot;grains&quot;. Asana does support sub-tasks, so I may recommend that we use those to break down large work items in the future, at which point the bulk loader would be modified to track metrics by sub-task (for tasks that have them, which would be assumed to be epics).</p> <h3>Next Steps</h3> <p>As we fine-tune our Kanban process, we&#39;ll use these lead time metrics to verify that, when we&#39;ve made an adjustment (changing a WIP limit or adding a buffer, for example), our performance improves. I&#39;d like to have more metrics so that we can have even better insight into our system in the future. For example, I&#39;d like to see the average time tasks spend in particular steps, the average amount of total WIP, as well as WIP in each step (shown over time) and failure load.</p> <p>The first order of business moving forward on this will probably be an improved charting interface on top of the existing metrics. Also, it would be nice if the bulk loader used a scheduling library so that folks don&#39;t have to manually schedule it in cron. It also could use some automated tests!!! As I mentioned previously, I&#39;ve just been working on this on hack days, so if there&#39;s something you&#39;d like to see done soon, well... pull requests will be gladly accepted! 
:)</p> What Makes Spark Exciting Stephen Haberman Mon, 21 Jan 2013 00:00:00 +0000 http://dev.bizo.com/2013/01/what-makes-spark-exciting.html http://dev.bizo.com/2013/01/what-makes-spark-exciting.html <p>At <a href="http://www.bizo.com">Bizo</a>, we&#39;re currently evaluating/prototyping <a href="http://www.spark-project.org">Spark</a> as a replacement for <a href="http://hive.apache.org/">Hive</a> for our batch reports. </p> <p>As a brief intro, Spark is an alternative to Hadoop. It provides a cluster computing framework for running distributed jobs. Similar to Hadoop, you provide Spark with jobs to run, and it handles splitting up the job into small tasks, assigning those tasks to machines (optionally with Hadoop-style data locality), issuing retries if tasks fail transiently, etc. </p> <p>In our case, these jobs are processing a non-trivial amount of data (log files) on a regular basis, for which we currently use Hive. </p> <h3>Why Replace Hive?</h3> <p>Admittedly, Hive has served us well for quite a while now. (One of our engineers even built a custom &quot;Hadoop on demand&quot; framework for running periodic on-demand Hadoop/Hive jobs in EC2 several months before <a href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic Map Reduce</a> came out.) </p> <p>Without Hive, it would have been hard for us to provide the same functionality, probably at all, let alone in the same time frame. That said, it has gotten to the point where Hive is more frequently invoked in negative contexts (&quot;damn it, Hive&quot;) than positive. </p> <p>Personally, I admit I even try to avoid tasks that involve working with Hive. I find it to be frustrating and, well, just not a lot of fun. Why? Two primary reasons: </p> <h4>1. Hive jobs are hard to test</h4> <p>Bizo has a culture of excellence, and for engineering one of the things this means is testing. We really like tests, especially unit tests, which are quick to run and enable a fast TDD cycle. 
</p> <p>Unfortunately, Hive makes unit testing basically impossible, for several reasons: </p> <ul> <li>Hive scripts must be run in a local Hadoop/Hive installation. </li> </ul> <p>Ironically, very few developers at Bizo have local Hadoop installations. We are admittedly spoiled by Elastic Map Reduce, such that most of us (myself anyway) wouldn&#39;t even know how to set up Hadoop off the top of our heads. We just fire up an EMR cluster.</p> <ul> <li>Hive scripts have production locations embedded in them. </li> </ul> <p>Both our log files and report output are stored in S3, so our Hive scripts end up with lots of &quot;s3://&quot; paths scattered throughout them. While we do run dev versions of reports with &quot;-dev&quot; S3 buckets, still relying on S3 and raw log files (that are usually in a compressed/binary-ish format) is not conducive to setting up lots of really small, simplified scenarios to unit test each boundary case.</p> <ul> <li>Hive scripts do not provide any abstraction - they are just one big HiveQL file. This means it&#39;s hard to break up a large report into small, individually testable steps.</li> </ul> <p>Despite these limitations, about a year ago we had a developer dedicate some effort to prototyping an approach that would run Hive scripts within our CI workflow. In the end, while his prototype worked, the workflow was wonky enough that we never adopted it for production projects. </p> <p>The result? Our Hive reports are basically untested. This sucks. </p> <h4>2. Hive is hard to extend</h4> <p>Extending Hive via custom functions (UDFs and UDAFs) is possible, and we do it all the time - but it&#39;s a pain in the ass. Perhaps this is not Hive&#39;s fault, and it&#39;s some Hadoop internals leaking into Hive, but the various <a href="http://hive.apache.org/docs/r0.5.0/api/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector.html">ObjectInspector</a> hoops, to me, always seemed annoying to deal with. 
</p> <p>Given these shortcomings, Bizo has been looking for a Hive successor for a while, even going so far as to prototype <a href="https://github.com/aboisvert/revolute">revolute</a>, a Scala DSL on top of <a href="http://www.cascading.org/">Cascading</a>, but we had not yet found something we were really excited about. </p> <h2>Enter Spark!</h2> <p>We had heard about Spark, but did not start trying it until we were so impressed by the Spark presentation at AWS re:Invent (the talk received <a href="https://amplab.cs.berkeley.edu/news/sparkshark-a-big-hit-at-aws-reinvent/">the highest rating of all non-keynote sessions</a>) that we wanted to learn more. </p> <p>One of Spark&#39;s touted strengths is being able to load and keep data in memory, so your queries aren&#39;t always I/O bound. </p> <p>That is great, but the exciting aspect for us at Bizo is how Spark, either intentionally or serendipitously, addresses both of Hive&#39;s primary shortcomings, and turns them into huge strengths. Specifically: </p> <h4>1. 
Spark jobs are amazingly easy to test</h4> <p>Writing a test in Spark is as easy as: </p> <div class="highlight"><pre><code class="scala">class SparkTest {
  @Test def test() {
    // this is real code...
    val sc = new SparkContext(&quot;local&quot;, &quot;MyUnitTest&quot;)
    // and now some pseudo code...
    val output = runYourCodeThatUsesSpark(sc)
    assertAgainst(output)
  }
}
</code></pre></div> <p>(I will go into more detail about runYourCodeThatUsesSpark in a future post.) </p> <p>This one-liner starts up a new <a href="http://spark-project.org/docs/latest/api/core/index.html#spark.SparkContext">SparkContext</a>, which is all your program needs to execute Spark jobs. There is no local installation required (just have the Spark jar on your classpath, e.g. via Maven or Ivy), no local server to start/stop. It just works. 
</p> <p>As a technical aside, this &quot;local&quot; mode starts up an in-process Spark instance, backed by a thread-pool, and actually opens up a few ports and temp directories, because it&#39;s a real, live Spark instance. </p> <p>Granted, this is usually more work than you want done in a unit test (which ideally would not hit any file or network I/O), but the redeeming quality is that it&#39;s fast. Tests run in ~2 seconds. </p> <p>Okay, yes, this is slow compared to pure, traditional unit tests, but it is such a huge revolution compared to Hive that we&#39;ll gladly take it. </p> <h4>2. Spark is easy to extend</h4> <p>Spark&#39;s primary API is a Scala DSL, oriented around what they call an <a href="http://www.spark-project.org/docs/0.6.0/api/core/#spark.RDD">RDD</a>, or Resilient Distributed Dataset, which is basically a collection that only supports bulk/aggregate transforms (so methods like map, filter, and groupBy, which can be seen as transforming the entire collection, but no methods like get or take which assume in-memory/random access). 
</p> <p>Some really short, made-up example code is: </p> <div class="highlight"><pre><code class="scala">// RDD[String] is like a collection of lines
val in: RDD[String] = sc.textFile(&quot;s3://bucket/path/&quot;)

// perform some operation on each line
val suffixed = in.map { line =&gt; line + &quot;some suffix&quot; }

// now save the new lines back out
suffixed.saveAsTextFile(&quot;s3://bucket/path2&quot;)
</code></pre></div> <p>Spark&#39;s job is to package up your map closure, and run it against that extra-large text file across your cluster. 
And it does so by, after shuffling the code and data around, actually calling your closure (i.e. there is no <a href="http://msdn.microsoft.com/en-us/library/vstudio/bb397926.aspx">LINQ</a>-like introspection of the closure&#39;s AST). </p> <p>This may seem minor, but it&#39;s huge, because it means there is no framework code or APIs standing between your running closure and any custom functions you&#39;d want to run. Let&#39;s say you want to use SomeUtilityClass (or the venerable <a href="http://commons.apache.org/lang/api-2.5/org/apache/commons/lang/StringUtils.html">StringUtils</a>), just do: </p> <div class="highlight"><pre><code class="scala"> <span class="k">import</span> <span class="nn">com.company.SomeUtilityClass</span> <span class="k">val</span> <span class="n">in</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">&quot;s3://bucket/path/&quot;</span><span class="o">)</span> <span class="k">val</span> <span class="n">processed</span> <span class="k">=</span> <span class="n">in</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="n">line</span> <span class="k">=&gt;</span> <span class="c1">// just call it, it&#39;s a normal method call</span> <span class="nc">SomeUtilityClass</span><span class="o">.</span><span class="n">process</span><span class="o">(</span><span class="n">line</span><span class="o">)</span> <span class="o">}</span> <span class="n">processed</span><span class="o">.</span><span class="n">saveAsTextFile</span><span class="o">(</span><span class="s">&quot;s3://bucket/path2&quot;</span><span class="o">)</span> </code></pre></div> <p>Notice how <code>SomeUtilityClass</code> doesn&#39;t have to know it&#39;s running within a Spark RDD in the cluster. 
It just takes a String. Done. </p> <p>Similarly, Spark doesn&#39;t need to know anything about the code you use within the closure; it just needs to be available on the classpath of each machine in the cluster (which is easy to do as part of your cluster/job setup; you just copy some jars around). </p> <p>This seamless hop between the RDD and custom Java/Scala code is very nice, and means your Spark jobs end up reading just like regular, normal Scala code (which to us is a good thing!). </p> <h3>Is Spark Perfect?</h3> <p>As full disclosure, we&#39;re still in the early stages of testing Spark, so we can&#39;t yet say whether Spark will be a wholesale replacement for Hive within Bizo. We haven&#39;t gotten to any serious performance comparisons or written large, complex reports to see if Spark can take whatever we throw at it. </p> <p>Personally, I am also admittedly somewhat infatuated with Spark at this point, so that could be clouding my judgement about the pros/cons and the tradeoffs with Hive. </p> <p>One Spark con so far is that Spark is pre-1.0, and it can show. I&#39;ve seen some stack traces that shouldn&#39;t happen, and some usability warts that hopefully will be cleared up by 1.0. (That said, even as a newbie I find the codebase small and very easy to read, such that I&#39;ve had <a href="https://github.com/mesos/spark/pull/352">several</a> <a href="https://github.com/mesos/spark/pull/351">small</a> <a href="https://github.com/mesos/spark/pull/362">pull requests</a> accepted already - which is a nice consolation compared to the daunting codebases of Hadoop and Hive.) </p> <p>We have also seen that, for our first Spark job, moving from &quot;Spark job written&quot; to &quot;Spark job running in production&quot; is taking longer than expected. But given that Spark is a new tool to us, we expect this to be a one-time cost. 
</p> <h3>More to Come</h3> <p>I have a few more posts coming up which explain our approach to Spark in more detail, for example: </p> <ul> <li><p>Testing best practices</p></li> <li><p>Running Spark in EMR</p></li> <li><p>Accessing partitioned S3 logs</p></li> </ul> <p>To see those when they come out, make sure to subscribe to the blog, or, better yet, <a href="http://bizo.theresumator.com/">come work at Bizo</a> and help us out!</p> Grouping pageviews into visits: a Scala code kata Darren Lee Wed, 26 Sep 2012 00:00:00 +0000 http://dev.bizo.com/2012/09/grouping-pageviews-into-visits-scala.html http://dev.bizo.com/2012/09/grouping-pageviews-into-visits-scala.html <p>The basic units of any website traffic analysis are pageviews, visits, and unique visitors. Tracking pageviews is simply a matter of counting requests to the server. Calculating unique visitors usually relies on cookies and unique identifiers. Visits, however, require a bit more work. For our purposes, a single visit is defined as a sequence of pageviews where the interval between pageviews is less than a fixed length like 15 minutes.</p> <p>I thought that the problem of grouping pageviews into visits would make an interesting <a href="http://codekata.pragprog.com/">code kata</a>. 
Here’s the statement of the problem that I worked from:</p> <blockquote> <p>Given a non-empty sequence of timestamps (as milliseconds since the epoch), write a function that would return a sequence of visits, where each visit is itself a sequence of timestamps where each pair of consecutive timestamps is no more than N milliseconds apart.</p> </blockquote> <h3>Procedural</h3> <p>As a starting point, I decided to take a straightforward procedural approach:</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">doingItIteratively</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">iterator</span> <span class="k">=</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span><span class="o">.</span><span class="n">iterator</span> <span class="k">val</span> <span class="n">visits</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">[</span><span class="kt">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="k">var</span> <span class="n">previousPV</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="n">iterator</span><span class="o">.</span><span class="n">next</span> <span class="k">var</span> <span class="n">currentVisit</span><span class="k">:</span> <span class="kt">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">(</span><span 
class="n">previousPV</span><span class="o">)</span> <span class="k">for</span> <span class="o">(</span><span class="n">currentPV</span> <span class="k">&lt;-</span> <span class="n">iterator</span><span class="o">)</span> <span class="o">{</span> <span class="k">if</span> <span class="o">(</span><span class="n">currentPV</span> <span class="o">-</span> <span class="n">previousPV</span> <span class="o">&gt;</span> <span class="n">N</span><span class="o">)</span> <span class="o">{</span> <span class="n">visits</span> <span class="o">+=</span> <span class="n">currentVisit</span> <span class="n">currentVisit</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]()</span> <span class="o">}</span> <span class="n">currentVisit</span> <span class="o">+=</span> <span class="n">currentPV</span> <span class="n">previousPV</span> <span class="k">=</span> <span class="n">currentPV</span> <span class="o">}</span> <span class="n">visits</span> <span class="o">+=</span> <span class="n">currentVisit</span> <span class="n">visits</span> <span class="n">map</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toSeq</span><span class="o">)</span> <span class="n">toSeq</span> <span class="o">}</span> </code></pre></div> <p>So, we simply iterate through the (sorted) events tracking both the current visit and the previous pageview. If the current pageview represents a new visit, push the previous visit into the list of all visits and start a new one. Then push the current pageview into the (potentially new) visit.</p> <h3>Folding</h3> <p>It actually felt a bit odd to write procedural code like this and ignore the functional parts of Scala. 
Using a fold cleans the code up a bit and gets rid of the mutable state.</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">doingItByFolds</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">sortedPVs</span> <span class="k">=</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span> <span class="o">(</span><span class="nc">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="o">/:</span> <span class="n">sortedPVs</span><span class="o">)</span> <span class="o">{</span> <span class="o">(</span><span class="n">visits</span><span class="o">,</span> <span class="n">pv</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">val</span> <span class="n">isNewVisit</span> <span class="k">=</span> <span class="n">visits</span><span class="o">.</span><span class="n">lastOption</span> <span class="n">flatMap</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">lastOption</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="n">prevPV</span> <span class="k">=&gt;</span> <span class="n">pv</span> <span class="o">-</span> <span class="n">prevPV</span> <span class="o">&gt;</span> <span class="n">N</span> <span class="o">}</span> <span class="n">getOrElse</span> <span class="kc">true</span> <span class="k">if</span> <span class="o">(</span><span class="n">isNewVisit</span><span class="o">)</span> 
<span class="o">{</span> <span class="n">visits</span> <span class="o">:+</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">pv</span><span class="o">)</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="n">visits</span><span class="o">.</span><span class="n">init</span> <span class="o">:+</span> <span class="o">(</span><span class="n">visits</span><span class="o">.</span><span class="n">last</span> <span class="o">:+</span> <span class="n">pv</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Here, we’re starting with an empty list of visits and folding it over the sorted pageviews. At each pageview, we decide if we need to start a new visit. If so, we append a new visit containing the pageview to the accumulated visits. If not, we pop off the last visit, append the pageview, and put the last visit back on the tail of the accumulated visits.</p> <h3>Folding With Intervals</h3> <p>One part that’s still a bit messy is comparing the current timestamp to the previous one. 
We can improve that by iterating through the intervals between pageviews instead of the actual pageviews.</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">slidingThroughIt</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">intervals</span> <span class="k">=</span> <span class="o">(</span><span class="mi">0L</span> <span class="o">+:</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span><span class="o">).</span><span class="n">sliding</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span> <span class="o">(</span><span class="nc">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="o">/:</span> <span class="n">intervals</span><span class="o">)</span> <span class="o">{</span> <span class="o">(</span><span class="n">visits</span><span class="o">,</span> <span class="n">interval</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">if</span> <span class="o">(</span><span class="n">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span> <span class="o">-</span> <span class="n">interval</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">&gt;</span> <span class="n">N</span><span class="o">)</span> <span class="o">{</span> <span class="n">visits</span> <span class="o">:+</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">interval</span><span 
class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="n">visits</span><span class="o">.</span><span class="n">init</span> <span class="o">:+</span> <span class="o">(</span><span class="n">visits</span><span class="o">.</span><span class="n">last</span> <span class="o">:+</span> <span class="n">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Here, we’re prepending a “0L” timestamp (and assuming that none of the pageviews happened in the early 70s) and using the “sliding” method to pair each timestamp with the previous one.</p> <h3>With Case Class</h3> <p>So far, we’ve been using a sequence of pageviews as a visit. What happens if we add an explicit Visit type? This lets us convert all pageviews into Visits at the start, then focus on merging overlapping Visits. 
One nice benefit is that this is a map-reduce algorithm that can be easily parallelized, instead of one that must sequentially iterate over the pageviews (either explicitly or with a fold).</p> <div class="highlight"><pre><code class="scala">import scala.math.{min, max}

case class Visit(start: Long, end: Long, pageviews: Seq[Long]) {
  def +(other: Visit): Visit =
    Visit(min(start, other.start), max(end, other.end), (pageviews ++ other.pageviews).sorted)
}

// N is the visit window defined earlier in the post.
def doingItMapReduceStyle(pageviews: Seq[Long]): Seq[Visit] = {
  pageviews
    .par
    .map { pv =&gt; Seq(Visit(pv, pv + N, Seq(pv))) }
    .reduce { (visits1, visits2) =&gt;
      val sortedVisits = (visits1 ++ visits2) sortBy (_.start)
      (Seq[Visit]() /: sortedVisits) { (visits, next) =&gt;
        if (visits.lastOption map (_.end &gt;= next.start) getOrElse false) {
          visits.init :+ (visits.last + next)
        } else {
          visits :+ next
        }
      }
    }
}
</code></pre></div> <p>The map-reduce solution is fun, but in a production system, I’d probably stick with the sliding variation and add a bit more flexibility to track actual pageview objects instead of just timestamps.</p> Using GROUP BYs or multiple INSERTs with complex data types in Hive. Darren Lee Wed, 19 Sep 2012 00:00:00 +0000 http://dev.bizo.com/2012/09/using-group-bys-or-multiple-inserts.html http://dev.bizo.com/2012/09/using-group-bys-or-multiple-inserts.html <p>In any sort of ad hoc data analysis, the first step is often to extract a specific subset of log lines from our files. For example, when looking at a single partner’s web traffic, I often use an initial query to copy that partner’s data into a new table. In addition to segregating out only the data relevant to my analysis, I use this to copy the data from S3 into HDFS, which will make later queries more efficient.
(Using maps as our log lines is how we support <a href="http://dev.bizo.com/2011/02/quot;dynamicquot;-columns-in-hive.html">dynamic columns</a>.)</p> <div class="highlight"><pre><code class="sql">create external table if not exists original_logs(fields map&lt;string,string&gt;)
location '...' ;

create table if not exists extracted_logs(fields map&lt;string,string&gt;) ;

insert overwrite table extracted_logs
select * from original_logs
where fields['partnerId'] = 123 ;
</code></pre></div> <p>If I’m doing this for multiple partners, it’s tempting to use a multiple-insert so Hadoop only needs to make one pass over the original data.</p> <div
class="highlight"><pre><code class="sql">create external table if not exists original_logs(fields map&lt;string,string&gt;)
location '...' ;

create table if not exists extracted_logs(fields map&lt;string,string&gt;)
partitioned by (partnerId int) ;

from original_logs
insert overwrite table extracted_logs partition (partnerId = 123)
  select * where fields['partnerId'] = 123
insert overwrite table extracted_logs partition (partnerId = 234)
  select * where fields['partnerId'] = 234 ;
</code></pre></div> <p>Unfortunately, in Hive 0.7.x, this query fails with the error message “Hash code on complex types not supported yet.” A multiple-insert statement uses an implicit group by, and Hive 0.7.x <a href="https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L500">does not support grouping by complex types</a>. This bug was partially addressed in 0.8, which added support for arrays and maps, but structs and unions are <a href="https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L500">still not supported</a>.</p> <p>At first glance, it does look like adding this support should be straightforward.
This could be a good candidate for our next <a href="http://dev.bizo.com/2012/04/dev-days:-hacking,-open-source-and-docs.html">open source day</a>.</p> mdadm: device or resource busy Alex Boisvert Sat, 07 Jul 2012 00:00:00 +0000 http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html <p>I just spent a few hours tracking down an issue with <a href="http://en.wikipedia.org/wiki/Mdadm">mdadm</a> (the Linux utility used to manage software RAID devices) and figured I&#39;d write a quick blog post to share the solution so others don&#39;t have to waste time on the same.</p> <p>As a short background, we use mdadm to create RAID-0 striped devices for our Sugarcube analytics (OLAP) servers using Amazon EBS volumes.</p> <p>The issue manifested itself as a random failure during device creation:</p> <div class="highlight"><pre><code class="bash">$ mdadm --create /dev/md0 --level=0 --chunk 256 --raid-devices=4 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
mdadm: Defaulting to version 1.2 metadata
mdadm: ADD_NEW_DISK for /dev/xvdh3 failed: Device or resource busy
</code></pre></div> <p>I searched and searched the interwebs and tried every trick I found, to no avail. We don&#39;t have <a href="http://www.linuxmanpages.com/man8/dmraid.8.php">dmraid</a> installed on our Linux images (Ubuntu 12.04 LTS / Alestic cloud image), so there&#39;s no possible conflict there. All devices were clean, as they were freshly created EBS volumes and I knew none of them were in use.
</p> <p>Before running <code>mdadm --create</code>, <code>/proc/mdstat</code> was clean:</p> <div class="highlight"><pre><code class="bash">$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: &lt;none&gt;
</code></pre></div> <p>And yet after running it, the component devices had been assigned to two different md devices instead of just /dev/md0:</p> <div class="highlight"><pre><code class="bash">$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive xvdh4[3](S) xvdh3[2](S)
      1048573952 blocks super 1.2
md0 : inactive xvdh2[1](S) xvdh1[0](S)
      1048573952 blocks super 1.2
unused devices: &lt;none&gt;
</code></pre></div> <p>Looking into <code>dmesg</code> didn&#39;t reveal anything interesting either:</p> <div class="highlight"><pre><code class="bash">$ dmesg
...
<span class="o">[</span>3963010.552493<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh1&gt; <span class="o">[</span>3963010.553011<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh2&gt; <span class="o">[</span>3963010.553040<span class="o">]</span> md: could not open unknown-block<span class="o">(</span>202,115<span class="o">)</span>. <span class="o">[</span>3963010.553052<span class="o">]</span> md: md_import_device returned -16 <span class="o">[</span>3963010.566543<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh3&gt; <span class="o">[</span>3963010.731009<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh4&gt; </code></pre></div> <p>And strangely, the creation or assembly would sometime work and sometime not:</p> <div class="highlight"><pre><code class="bash"><span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: /dev/md0 has been started with 4 drives. <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: cannot open device /dev/xvdh3: Device or resource busy <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: cannot open device /dev/xvdh1: Device or resource busy mdadm: /dev/xvdh1 has no superblock - assembly aborted <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$sudo</span> mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: /dev/md0 has been started with 4 drives. 
</code></pre></div> <p>I started suspecting I was facing some kind of underlying race condition where the devices would get assigned/locked during the device creation process. So I started googling for &quot;mdadm create race&quot; and I finally found a <a href="http://permalink.gmane.org/gmane.linux.raid/34027">post</a> that tipped me off. While it didn&#39;t provide the solution, the post put me on the right track by mentioning <a href="http://en.wikipedia.org/wiki/Udev">udev</a> and it took only a few more minutes to narrow down on the solution: <strong>disabling udev events during device creation to avoid contention on device handles.</strong></p> <p>So now our script goes something like:</p> <div class="highlight"><pre><code class="bash"><span class="nv">$ </span>udevadm control --stop-exec-queue <span class="nv">$ </span>mdadm --create /dev/md0 --run --level<span class="o">=</span>0 --raid-devices<span class="o">=</span>4 ... <span class="nv">$ </span>udevadm control --start-exec-queue </code></pre></div> <p>And we now have consistent reliable device creation.Hopefully this blog post will help other passers-by with a similar problem. Good luck!</p>
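<p>One refinement worth considering for the stop/create/start sequence above: if <code>mdadm --create</code> fails partway, a straight-line script leaves udev's exec queue stopped. A minimal bash sketch of a more defensive wrapper (the <code>create_raid0</code> helper name is my own invention; the devices and chunk size mirror the example earlier in the post) that uses a <code>trap</code> so udev is restarted no matter how <code>mdadm</code> exits:</p>

```shell
#!/usr/bin/env bash
# Hedged sketch, not our exact production script: pause udev event
# processing, build the RAID-0 array, and guarantee the queue is
# restarted even when mdadm fails.
set -euo pipefail

create_raid0() {
  local md_dev=$1
  shift
  # Stop udev from grabbing the component devices mid-creation.
  udevadm control --stop-exec-queue
  # Restart udev event processing no matter how mdadm exits.
  trap 'udevadm control --start-exec-queue' EXIT
  mdadm --create "$md_dev" --run --level=0 --chunk=256 \
        --raid-devices="$#" "$@"
}

# Only attempt the real creation when run somewhere the tools and
# devices actually exist (e.g. as root on the EC2 instance).
if command -v mdadm >/dev/null 2>&1 && [[ -b /dev/xvdh1 ]]; then
  create_raid0 /dev/md0 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
fi
```

<p>The <code>trap ... EXIT</code> is the important part: without it, an error between the two <code>udevadm</code> calls would leave the machine with udev events queued indefinitely.</p>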