Bizo Dev Blog Java/Scala and Highly Scalable Systems on AWS http://dev.bizo.com Setting up Discourse on AWS alexkuang Mon, 30 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/discourse-on-aws.html http://dev.bizo.com/2014/06/discourse-on-aws.html <p>The Bizo dev team recently decided to experiment with using <a href="http://www.discourse.org/">discourse</a> as a discussion forum. Discourse has published an <a href="https://github.com/discourse/discourse/blob/master/docs/INSTALL-digital-ocean.md">installation guide</a> for a single-machine setup using DigitalOcean, but we decided to deploy on AWS instead.</p> <h2>Why AWS?</h2> <p>While the default installation is fast and easy to implement, it does leave the web server running in the same box as all the important data: postgres, redis, file uploads. This means that if the server location were, hypothetically, invaded by a troop of angry hammer-wielding monkeys, there would be much downtime and data loss involved.</p> <p>As proper monkey-fearing developers, we prefer to use <a href="http://martinfowler.com/bliki/ImmutableServer.html">immutable servers</a> whenever possible. By using immutable servers and keeping data in dedicated AWS services, we can treat them as disposable and essentially expect them to be torn down and rebuilt on a regular basis. This provides incentive to keep configuration simple, which makes automation easier, and in turn leads to less time fussing over things like server state and maintaining exact run-time configuration.</p> <p>We prefer sticking to this approach even for experimental apps like discourse. 
In this case, we used EC2 for the webserver, RDS for the postgres database, ElastiCache for the redis store, and S3 for file uploads.</p> <h2>Setting up AWS Services</h2> <p>Setting up AWS services is generally pretty straighforward, but there are a few things to keep in mind.</p> <h3>Security Groups</h3> <p>Security groups are <em>always</em> a good idea, but they are required for RDS and ElastiCache to communicate with EC2 instances. Make sure to allow ssh (port 22) and http (ports 80 + 443), and give it a memorable name like <code>discourse-prod</code>.</p> <h3>RDS, ElastiCache</h3> <p>Again, the only catch here is to make sure that the security group attached to these instances is the same one that the EC2 instance is attached to. Otherwise, the web server won&#39;t be able to talk to either of them. Make sure to note the hostnames, auth credentials, etc here for use in configuring discourse.</p> <h3>EC2</h3> <p>The only officially supported install method for discourse is via Docker, which requires choosing specific versions of Ubuntu for the EC2 instance. See the <a href="http://docs.docker.com/installation/ubuntulinux/">docker documentation</a> for more details.</p> <h2>Discourse Config</h2> <p>Since we&#39;re not doing a standalone deploy, we based our discourse docker config on the <a href="https://github.com/discourse/discourse_docker/blob/master/samples/web_only.yml">web-only</a> example that discourse provides. 
The final config looked something like this:</p> <div class="highlight"><pre><code class="yaml language-yaml" data-lang="yaml"><span class="l-Scalar-Plain">templates</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="s">&quot;templates/sshd.template.yml&quot;</span> <span class="p-Indicator">-</span> <span class="s">&quot;templates/web.template.yml&quot;</span> <span class="l-Scalar-Plain">expose</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="s">&quot;80:80&quot;</span> <span class="p-Indicator">-</span> <span class="s">&quot;2222:22&quot;</span> <span class="l-Scalar-Plain">params</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">version</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">HEAD</span> <span class="l-Scalar-Plain">env</span><span class="p-Indicator">:</span> <span class="c1"># Creating an account with a developer email will automatically give it </span> <span class="c1"># admin access to the site for setup</span> <span class="l-Scalar-Plain">DISCOURSE_DEVELOPER_EMAILS</span><span class="p-Indicator">:</span> <span class="s">&#39;email_address@company.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_HOSTNAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DOMAIN_FOR_DISCOURSE_SITE.com&#39;</span> <span class="c1"># Enter info from RDS here</span> <span class="l-Scalar-Plain">DISCOURSE_DB_SOCKET</span><span class="p-Indicator">:</span> <span class="s">&#39;&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_HOST</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_INSTANCE_ID.REGION.rds.amazonaws.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_PORT</span><span class="p-Indicator">:</span> <span class="s">&#39;5432&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_USERNAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_USER&#39;</span> <span 
class="l-Scalar-Plain">DISCOURSE_DB_PASSWORD</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_PASSWORD&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_DB_NAME</span><span class="p-Indicator">:</span> <span class="s">&#39;DB_NAME&#39;</span> <span class="c1"># Enter info from elasticache here</span> <span class="l-Scalar-Plain">DISCOURSE_REDIS_HOST</span><span class="p-Indicator">:</span> <span class="s">&#39;REDIS_INSTANCE.cache.amazonaws.com&#39;</span> <span class="l-Scalar-Plain">DISCOURSE_REDIS_PORT</span><span class="p-Indicator">:</span> <span class="s">&#39;6379&#39;</span> <span class="c1"># Amazon SES can be used for SMTP, or even gmail for lower volumes</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_ADDRESS</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_SERVER</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_PORT</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_PORT</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_USER_NAME</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_USER</span> <span class="l-Scalar-Plain">DISCOURSE_SMTP_PASSWORD</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">SMTP_PASSWORD</span> <span class="l-Scalar-Plain">volumes</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">volume</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">host</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">/var/docker/shared</span> <span class="l-Scalar-Plain">guest</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">/shared</span> <span class="c1"># you may use the docker manager to upgrade and monitor your docker image</span> <span class="c1"># UI will be visible at http://yoursite.com/admin/docker</span> <span class="l-Scalar-Plain">hooks</span><span class="p-Indicator">:</span> <span class="c1"># you may import your key 
using launchpad if needed</span> <span class="c1">#after_sshd:</span> <span class="c1"># - exec: ssh-import-id some-user</span> <span class="l-Scalar-Plain">after_code</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">exec</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">cd</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">$home/plugins</span> <span class="l-Scalar-Plain">cmd</span><span class="p-Indicator">:</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">mkdir -p plugins</span> <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">git clone https://github.com/discourse/docker_manager.git</span> </code></pre></div> <h2>Bootstrapping EC2 instances</h2> <p>We have existing infrastructure that will spin up new instances with applicable settings such as security groups, auto scaling group, and load balancers. A simple shell script sets up the internals on the instance itself. First, installing docker and the docker container infrastructure for discourse:</p> <div class="highlight"><pre><code class="bash language-bash" data-lang="bash"><span class="c"># This is taken almost verbatim from the discourse installation guide</span> <span class="o">(</span>wget -qO- https://get.docker.io/ <span class="p">|</span> bash<span class="o">)</span> &gt; docker_install.log 2&gt;<span class="p">&amp;</span>1 install -g docker -m 2775 -d /var/docker git clone https://github.com/discourse/discourse_docker.git /var/docker </code></pre></div> <p>At the time of this writing, some of the provided install scripts do not play very nicely with Ubuntu&#39;s default <code>dash</code> so <code>bash</code> is used explicitly.</p> <p>Next, we customize the deploy. In addition to dropping in our custom config, we also generate an ad-hoc ssh key to feed into the docker container. 
This is because the discourse app runs entirely inside the container, so any sort of direct interaction with it requires the user to interact with the container itself, rather than just the server running the container. Conveniently, discourse sets up a sshd from <code>templates/ssh.template.yml</code> and will automatically load the current user&#39;s key from <code>~/.ssh/id_rsa</code> into the container. This part of the script makes sure that for each deploy there is a fresh key that is completely separate from any other (potentially important/sensitive) keys that might be living in the server.</p> <div class="highlight"><pre><code class="bash language-bash" data-lang="bash">cp config/bizo-discourse.yml /var/docker/containers/app.yml <span class="c"># Make sure to back up anything that might be in the existing id_rsa! Ideally</span> <span class="c"># the install is running as its own special user.</span> <span class="o">(</span>mkdir -p ssh-key <span class="o">&amp;&amp;</span> <span class="nb">cd </span>ssh-key <span class="o">&amp;&amp;</span> ssh-keygen -f id_rsa -t rsa -N <span class="s1">&#39;&#39;</span><span class="o">)</span> <span class="o">(</span>mkdir -p ~/.ssh <span class="o">&amp;&amp;</span> cp ssh-key/* ~/.ssh<span class="o">)</span> </code></pre></div> <p>And finally, to bootstrap the container and start the actual discourse app inside it:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">bash /var/docker/launcher bootstrap app &gt; discourse_bootstrap.log 2&gt;&amp;1 bash /var/docker/launcher start app &gt; discourse_run.log 2&gt;&amp;1 &amp; </code></pre></div> <h2>S3 Uploads</h2> <p>The final step after getting the app set up is to get file uploads on S3, which is explained in <a href="https://meta.discourse.org/t/setting-up-file-and-image-uploads-to-s3/7229">this guide</a>. However, this might break existing user avatars, which the server still expects to find in their old location. 
(Note that this also applies to gravatars since discourse will download and cache them by default.) The fix is simple: ssh into the container and run <code>bundle exec rake avatars:refresh</code>. This will re-download avatars and update the users in the database accordingly.</p> <h2>Maintenance</h2> <p>With this setup, maintenance becomes extremely easy. Server crashes and software upgrades can be handled by spinning up new instances running the bootstrap shell script. Scaling is also as easy as spinning up another instance and putting everything behind a load balancer. Forum backups and data recovery can be handled easily with Amazon&#39;s native tools. At the end of the day, all this adds up to less time wasted on maintaining forum software and more time working on business-critical apps.</p> Why Scrum is great, but Kanban makes more sense to us matthieu Fri, 27 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/why-scrum-is-great.html http://dev.bizo.com/2014/06/why-scrum-is-great.html <p>In a previous blog post, <a href="http://localhost:4000/2014/04/extreme-programming-for-modern-start-ups.html">Extreme Programming For Modern Start-ups</a>, Pat explained from his extensive experience with XP agile methodology why XP isn&#39;t always flexible enough for our work organization here at Bizo. My own past experience is with Scrum, the other popular Agile methodology. In this post I want to show where Scrum shines, and why Kanban finally wins for the kind of work we do.</p> <p><em>- Scrum is great but you don&#39;t use it at Bizo. Why not?</em></p> <p>I joined Bizo six months ago after working for 5 years for a software company which uses Scrum for all its projects. This means my previous work experience with Scrum is still quite fresh while I now have a good idea of how things work here at Bizo. Both Scrum and Kanban focus on almost orthogonal aspects of Agile. 
If you search the Internet for scrumban or scrum-ban, you will find people who thought about mixing the two. But generally you would do one or the other, mainly because of a difference in spirit between these two methodologies.</p> <p>Scrum focuses on making people work well together towards building a relevant product. Scrum is human to its core. It takes our inherent social weaknesses and strengths into account. The planning meeting and the demo at the end of a sprint provide focus on short term deadlines. The stand-ups and retrospective keep the communication channels always open. The Product Owner is the point of truth for what goes in the product, and the Scrum Master watches and organizes the backlog. Both of these people protect the team against external influences by always keeping the focus on the current sprint. In my experience Scrum is amazingly good at making people work well together. And it&#39;s also a very satisfying way to work, within some constraints.</p> <p>To be a good fit for Scrum, your project must:</p> <ul> <li><strong>have a 5 to 7 person team.</strong> Fewer people, and all this communication will feel like a useless overhead. More people, and the communication will take too long, with team members losing interest.</li> <li><strong>target one application or system.</strong> You shouldn&#39;t have a team of people working on different subjects. 
The tight communication channels in the team (planning meetings, standups, demos, retrospective) become pointless if each team member can&#39;t relate to what the other members do.</li> <li><strong>have work that can be done in parallel for everyone in the team.</strong> You don&#39;t want to have idle people waiting for each other.</li> <li><strong>be able to protect the team members from interruptions.</strong> Other projects members and stakeholders might want to interact with the team members for their own interest.</li> </ul> <p>In my previous company, I had the privilege to work with Scrum in such an environment. When the context fits the constraints, the Scrum team creates interactions that fit perfectly with the way our human brain works and socializes. But even in the company I worked for, this context didn&#39;t happen that often. When you take into account reality and modern software development, you actually need more flexibility.</p> <p>At Bizo, we have close to 400 git repos maintained by an engineering team of around 25 people. All of these projects contain code executed somewhere in our services hosted in the AWS cloud. By choice, there is no Operations Team at Bizo. This means we, developers, care for our app from fleshing out the design, to its deployment and monitoring. We have great tooling to help making deployment and monitoring a small amount of our time. But these tools help dealing with interruptions, they don&#39;t avoid them. So you might have guessed, Scrum is not a good fit for the projects we have here at Bizo.</p> <p><em>- Ok, so you guys at Bizo use Kanban. Why Kanban?</em></p> <p>Kanban focuses on having tasks flow from inception to completion in the most efficient fashion. Human interaction is completely left to the team&#39;s culture (quick plug for our <a href="http://dev.bizo.com/culture/index.html">culture guide</a>). 
Kanban focuses on monitoring the task flow while trusting the organization to find the best way to optimize this flow. Optimizing the flow can mean adding a certain skill to the team, automating one particular slow task, or telling your marketing team to change priorities. Anything really. And that&#39;s what&#39;s great about Kanban, you&#39;re put in front of your own responsibilities. Kanban shows you that tasks don&#39;t flow? Your job is to find a way to make it work.</p> <p>At Bizo, we have an architecture of many loosely coupled services and applications. We generally have small projects involving 1 to 7 people, with deployments that could happen from several times a week to once in a blue moon. Allowing for such variations in projects is great for many reasons, but it can create potential bottlenecks, or endless work in progress. To avoid this, you need monitoring, smart scheduling and flexible organizations. You also need visibility across all the small projects. Kanban gives us the visibility and doesn&#39;t get in our way when we need to change our organization.</p> <p><em>- So to sum up, Scrum for the perfect world, Kanban for the win.</em></p> <p>Scrum gives you a full framework to do your project. From the size of your teams to all the meetings you&#39;ll need, you get a full fledged target for your project organization. In my experience, if you can reach the target, or at least get close to it, it works amazingly well. Now Kanban doesn&#39;t tell you what to put in your task, or how you should perform them. It just tells how you should schedule and monitor these tasks, you then have to find the organization that works best for your project. 
This makes Kanban fit for a much wider range of projects.</p> <p>If you&#39;re interested in learning more about Scrum or Kanban:</p> <ul> <li><a href="http://www.mountaingoatsoftware.com/presentations/an-introduction-to-scrum">An introduction to Scrum by Mike Cohn</a> (then click the &quot;View Presentation&quot; button)</li> <li><a href="https://help.rallydev.com/what-is-kanban">A nice write-up of Kanban</a></li> </ul> Executors.newCachedThreadPool() considered harmful boisvert Tue, 24 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html http://dev.bizo.com/2014/06/cached-thread-pool-considered-harmlful.html <p>This is more of a public-service announcement blog-post...</p> <p>That’s right, <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool%28%29">Executors.newCachedThreadPool()</a> isn&#39;t a great choice for server code that&#39;s servicing multiple clients and concurrent requests.</p> <p>Why? There are basically two (related) problems with it:</p> <p>1) It&#39;s <em>unbounded</em>, which means that you&#39;re opening the door for anyone to cripple your JVM by simply injecting more work into the service (DoS attack). Threads consume a non-negligible amount of memory and also increase memory consumption based on their work-in-progress, so it&#39;s quite easy to topple a server this way (unless you have other circuit-breakers in place).</p> <p>2) The unbounded problem is exacerbated by the fact that the Executor is fronted by a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/SynchronousQueue.html">SynchronousQueue</a> which means there&#39;s a direct handoff between the task-giver and the thread pool. Each new task will create a new thread if all existing threads are busy. This is generally a bad strategy for server code. When the CPU gets saturated, existing tasks take longer to finish. 
Yet more tasks are being submitted and more threads created, so tasks take longer and longer to complete. When the CPU is saturated, more threads is definitely not what the server needs.</p> <p>Here are my recommendations:</p> <ul> <li><p>Use a fixed-size thread pool (<a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newFixedThreadPool%28int,%20java.util.concurrent.ThreadFactory%29">Executors.newFixedThreadPool</a>) or a <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html">ThreadPoolExecutor</a> with a set maximum number of threads;</p></li> <li><p>Figure out how many threads you need to keep each CPU core saturated (but not too much). A good rule of thumb is to start around 10 threads per core for a workload that involves disk and network I/O, and see if it’s sufficient to keep all CPUs busy. Increase or decrease the number of maximum threads based on results obtained from load testing. Multiply this factor by <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#availableProcessors%28%29">Runtime.getRuntime.availableProcessors</a> to obtain the total size of the thread pool.</p></li> <li><p>I generally don&#39;t recommend queueing within a server&#39;s main request-processing loop. If you need queuing behavior, use a fixed-size <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ArrayBlockingQueue.html">ArrayBlockingQueue</a> -- it&#39;s more efficient than <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/LinkedBlockingQueue.html">LinkedBlockingQueue</a>. Use only if you absolutely need it. 
A system with queuing is more difficult to understand and tune.</p></li> <li><p>The default <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/RejectedExecutionHandler.html">RejectedExecutionHandler</a> is <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.AbortPolicy.html">AbortPolicy</a>, which throws a runtime <code>RejectedExecutionException</code> when the queue (if any) or the thread pool reach maximum capacity. This is not a bad default but I generally want to guarantee execution in spite of this, so I advise using the <a href="http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.CallerRunsPolicy.html">CallerRunsPolicy</a> instead. The <code>CallerRunsPolicy</code> also helps to avoid deadlocks (many threads blocking on each other) in the case where tasks may have internal dependencies.</p></li> <li><p>Always provide your own <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadFactory.html">ThreadFactory</a> so all your threads are named appropriately and have the daemon flag set. Your thread pool shouldn&#39;t keep the application alive. 
That&#39;s the responsibility of the application itself (i.e., main thread).</p></li> </ul> <p>The result of all this should look similar to this (in Scala syntax):</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"> <span class="k">object</span> <span class="nc">ThreadPoolHelpers</span> <span class="o">{</span> <span class="k">private</span> <span class="k">def</span> <span class="n">cpus</span> <span class="k">=</span> <span class="nc">Runtime</span><span class="o">.</span><span class="n">getRuntime</span><span class="o">.</span><span class="n">availableProcessors</span> <span class="k">def</span> <span class="n">daemonThreadFactory</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ThreadFactory</span> <span class="o">{</span> <span class="k">private</span> <span class="k">val</span> <span class="n">count</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">AtomicInteger</span><span class="o">()</span> <span class="k">override</span> <span class="k">def</span> <span class="n">newThread</span><span class="o">(</span><span class="n">r</span><span class="k">:</span> <span class="kt">Runnable</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">thread</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(</span><span class="n">r</span><span class="o">)</span> <span class="n">thread</span><span class="o">.</span><span class="n">setName</span><span class="o">(</span><span class="n">s</span><span class="s">&quot;$name-${count.incrementAndGet}&quot;</span><span class="o">)</span> <span class="n">thread</span><span class="o">.</span><span class="n">setDaemon</span><span class="o">(</span><span class="kc">true</span><span class="o">)</span> 
<span class="n">thread</span> <span class="o">}</span> <span class="o">}</span> <span class="cm">/** Core thread pool to be used for concurrent request-processing */</span> <span class="k">def</span> <span class="n">newCoreThreadPool</span><span class="o">(</span><span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="c1">// use direct handoff (SynchronousQueue) + CallerRunsPolicy to avoid deadlocks</span> <span class="c1">// since tasks may have internal dependencies.</span> <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ThreadPoolExecutor</span><span class="o">(</span> <span class="cm">/* core size */</span> <span class="o">(</span><span class="mi">5</span> <span class="o">*</span> <span class="n">cpus</span><span class="o">),</span> <span class="cm">/* max size */</span> <span class="o">(</span><span class="mi">15</span> <span class="o">*</span> <span class="n">cpus</span><span class="o">),</span> <span class="cm">/* idle timeout */</span> <span class="mi">60</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="nc">SECONDS</span><span class="o">,</span> <span class="k">new</span> <span class="nc">SynchronousQueue</span><span class="o">[</span><span class="kt">Runnable</span><span class="o">](),</span> <span class="n">daemonThreadFactory</span><span class="o">(</span><span class="n">name</span><span class="o">)</span> <span class="o">)</span> <span class="n">pool</span><span class="o">.</span><span class="n">setRejectedExecutionHandler</span><span class="o">(</span><span class="k">new</span> <span class="nc">ThreadPoolExecutor</span><span class="o">.</span><span class="nc">CallerRunsPolicy</span><span class="o">)</span> <span class="n">pool</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <h2>About 
that ScheduledExecutorService ...</h2> <p>While we&#39;re at it, one last piece of advice is to separate request-processing threads from scheduled task execution. It’s often tempting to share a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledExecutorService.html">ScheduledExecutorService</a> for both request-processing and scheduled tasks but I recommend using separate thread pools for two reasons:</p> <p>1) separation allows you to balance the relative processing capacity of your request-processing workload vs scheduled tasks. In particular, if you use scheduled tasks to act as timeouts for incoming requests (e.g. if you are using Futures that get queued into your main thread pool), the separation can have a substantial impact on the timeliness of your timeouts.</p> <p>2) <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html">ThreadPoolExecutor</a> is more efficient for request-processing since it does not have to order incoming tasks with respect to other scheduled tasks. This is especially true if you are using a <a href="http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/SynchronousQueue.html">SynchronousQueue</a> and requests are dispatched to threads in a straight-through fashion (no queuing).</p> <p>I hope this advice is relevant to you and your server code. If not, please drop me a line!</p> Exporting Formatted Code Snippets from Emacs rayo Fri, 13 Jun 2014 00:00:00 +0000 http://dev.bizo.com/2014/06/exporting-formatted-code-snippets-from-emacs.html http://dev.bizo.com/2014/06/exporting-formatted-code-snippets-from-emacs.html <p>I wanted a way to create formatted code snippets (syntax highlighting, etc.) for pasting into GMail or Google Slides. The catch was I didn&#39;t want to use a web service (e.g. <a href="http://pastie.org/">pastie</a>, <a href="http://hilite.me/">hilite.me</a>) or switch from Emacs to another text editor (e.g. 
<a href="http://www.sublimetext.com/">Sublime</a> with the <a href="https://github.com/n1k0/SublimeHighlight">SublimeHighlight plugin</a>).</p> <h2>Learning from Markdown-Mode</h2> <p>I remembered that Emacs <a href="http://www.emacswiki.org/emacs/MarkdownMode">markdown-mode</a> lets you preview, in a browser, the Markdown text rendered as HTML.</p> <p>Basically, when you install markdown-mode, you also install a Markdown executable like <a href="http://pythonhosted.org/Markdown/">Python-Markdown</a>. When you invoke <code>markdown-preview</code> in Emacs, markdown-mode runs the Markdown executable in a shell. The input is the Markdown text in the current Emacs region, and the HTML output goes to a temporary Emacs buffer. The buffer&#39;s contents, in turn, are opened in a browser by the Emacs command <code>browse-url-of-buffer</code>.</p> <h2>Pygments to the Rescue</h2> <p>We could use a similar strategy to markdown-mode&#39;s for exporting formatted code snippets. We just need a command-line tool that will receive code from <em>stdin</em> and send it formatted to <em>stdout</em>.</p> <p><a href="http://pygments.org/">Pygments</a> fits the bill. Pygments is a syntax higlighter written in Python, with many lexers and formatting options. It can be installed using <code>pip</code> (e.g. <code>pip install pygments</code>).</p> <h2>Lisp Isn&#39;t So Painful</h2> <p>After installing Pygments, you get a script called <code>pygmentize</code>. Of course to call it from Emacs, you&#39;ll need some Lisp. 
Here&#39;s a helper that constructs the desired invocation:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">26</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html-command</span> (<span style="color: #aa0000">beginning-line-number</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">27</span> (<span style="color: #0000aa">let</span> ((<span style="color: #aa0000">lexer</span> (<span style="color: #00aaaa">gethash</span> <span style="color: #aa0000">major-mode</span> <span style="color: #aa0000">pygmentize-lexers</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">28</span> (<span style="color: #aa0000">linenostart</span> (<span style="color: #aa0000">number-to-string</span> <span style="color: #aa0000">beginning-line-number</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">29</span> (<span style="color: #0000aa">if</span> (<span style="color: #00aaaa">not</span> <span style="color: #aa0000">lexer</span>) (<span style="color: #00aaaa">error</span> (<span style="color: #aa0000">concat</span> <span style="color: #aa5500">&quot;error: no lexer for &quot;</span> (<span style="color: #00aaaa">symbol-name</span> <span style="color: #aa0000">major-mode</span>)))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">30</span> (<span style="color: #aa0000">concat</span> <span style="color: #aa5500">&quot;pygmentize -f html&quot;</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">31</span> <span style="color: #aa5500">&quot; -l &quot;</span> <span style="color: #aa0000">lexer</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">32</span> <span style="color: #aa5500">&quot; -O style=autumn,linenos=inline,noclasses=true,linenostart=&quot;</span> <span style="color: #aa0000">linenostart</span>))) </pre></div> <p>A lookup table maps Emacs 
major-modes to their lexer argument for <code>pygmentize</code>:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">16</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">pygmentize-lexers</span> (<span style="color: #00aaaa">make-hash-table</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">17</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;emacs-lisp-mode</span> <span style="color: #aa5500">&quot;common-lisp&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">18</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;scala-mode</span> <span style="color: #aa5500">&quot;scala&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">19</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;java-mode</span> <span style="color: #aa5500">&quot;java&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">20</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;ruby-mode</span> <span style="color: #aa5500">&quot;rb&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">21</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;python-mode</span> <span style="color: #aa5500">&quot;py&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">22</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;sh-mode</span> <span style="color: #aa5500">&quot;sh&quot;</span> <span 
style="color: #aa0000">pygmentize-lexers</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">23</span> (<span style="color: #aa0000">puthash</span> <span style="color: #0000aa">&#39;diff-mode</span> <span style="color: #aa5500">&quot;diff&quot;</span> <span style="color: #aa0000">pygmentize-lexers</span>) </pre></div> <p>The Emacs clip region is passed via <em>stdin</em> to <code>pygmentize</code>, with <em>stdout</em> redirected to a temporary buffer:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">38</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html</span> (<span style="color: #0000aa">&amp;optional</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">39</span> (<span style="color: #aa0000">interactive</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">40</span> (<span style="color: #aa0000">save-window-excursion</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">41</span> (<span style="color: #00aaaa">unless</span> <span style="color: #aa0000">output-buffer-name</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">42</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">output-buffer-name</span> <span style="color: #aa0000">pygmentize-html-output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">43</span> (<span style="color: #0000aa">let*</span> ((<span style="color: #aa0000">beginning-line-number</span> (<span style="color: #aa0000">line-number-at-pos</span> (<span style="color: #aa0000">region-beginning</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">44</span> (<span style="color: #aa0000">shell-command</span> (<span style="color: #aa0000">pygmentize-html-command</span> <span style="color: 
#aa0000">beginning-line-number</span>))) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">45</span> (<span style="color: #aa0000">shell-command-on-region</span> (<span style="color: #aa0000">region-beginning</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">46</span> (<span style="color: #aa0000">region-end</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">47</span> <span style="color: #aa0000">shell-command</span> <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">48</span> <span style="color: #aa0000">output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">49</span> (<span style="color: #aa0000">switch-to-buffer</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">50</span> (<span style="color: #aa0000">clipboard-kill-ring-save</span> (<span style="color: #aa0000">point-min</span>) (<span style="color: #aa0000">point-max</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">51</span> <span style="color: #aa0000">output-buffer-name</span>)) </pre></div> <p>The temporary buffer&#39;s syntax-highlighted, HTML-formatted contents will be opened in a browser, using Emacs&#39;s <code>browse-url-of-buffer</code> function:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">53</span> (<span style="color: #00aaaa">defun</span> <span style="color: #aa0000">pygmentize-html-preview</span> (<span style="color: #0000aa">&amp;optional</span> <span style="color: #aa0000">output-buffer-name</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">54</span> (<span style="color: #aa0000">interactive</span>) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">55</span> (<span style="color: #00aaaa">unless</span> <span style="color: #aa0000">output-buffer-name</span> <span 
style="background-color: #f0f0f0; padding: 0 5px 0 5px">56</span> (<span style="color: #0000aa">setq</span> <span style="color: #aa0000">output-buffer-name</span> <span style="color: #aa0000">pygmentize-html-output-buffer-name</span>)) <span style="background-color: #f0f0f0; padding: 0 5px 0 5px">57</span> (<span style="color: #aa0000">browse-url-of-buffer</span> (<span style="color: #aa0000">pygmentize-html</span> <span style="color: #aa0000">output-buffer-name</span>))) </pre></div> <p>Gotta have a keybinding:</p> <div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="background-color: #f0f0f0; padding: 0 5px 0 5px">59</span> (<span style="color: #aa0000">global-set-key</span> (<span style="color: #aa0000">kbd</span> <span style="color: #aa5500">&quot;C-c h p&quot;</span>) <span style="color: #0000aa">&#39;pygmentize-html-preview</span>) </pre></div> <h2>Exporting the Snippets</h2> <p><a href="https://gist.github.com/rayortigas/7c5ed449fcfafcf6851d">This gist</a> contains the full Emacs-Lisp code. 
(And it easily embeds into a <code>.emacs</code> file.)</p> <p>To export a snippet, highlight some code, and invoke <code>pygmentize-html-preview</code>:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/emacs.png" alt="emacs"></p> <p>In the launched browser window, select all and copy:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/copy.png" alt="copy"></p> <p>Paste into GMail, Google Slides, or anything that will accept formatted HTML:</p> <p><img src="/images/posts/exporting-formatted-code-snippets-from-emacs/paste.png" alt="paste"></p> Bulk Editing Jenkins Configurations dietz Wed, 23 Apr 2014 00:00:00 +0000 http://dev.bizo.com/2014/04/bulk-editing-jenkins-configurations.html http://dev.bizo.com/2014/04/bulk-editing-jenkins-configurations.html <p>During our <a href="http://dev.bizo.com/2013/08/scm-migration.html">SCM migration</a>, we had to update the configurations of hundreds of jenkins jobs on three different instances. Rather than do that manually, we wrote this <a href="https://github.com/mdietz198/bulk-edit-jenkins-config">ruby snippet</a> (edit_config.rb):</p> <div class="highlight"><pre><code class="ruby language-ruby" data-lang="ruby"><span class="nb">require</span> <span class="s1">&#39;fileutils&#39;</span> <span class="nb">require</span> <span class="s1">&#39;optparse&#39;</span> <span class="nb">require</span> <span class="s1">&#39;nokogiri&#39;</span> <span class="c1"># Accepts a block that takes an Nokogiri XML doc as a parameter</span> <span class="c1"># and modifies the XML doc as desired.</span> <span class="k">def</span> <span class="nf">apply_to_configs</span> <span class="p">(</span><span class="n">root_directory</span><span class="p">,</span> <span class="n">orig_file_extension</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">block</span><span class="p">)</span> <span class="n">config_files</span> <span class="o">=</span> <span class="no">Dir</span><span 
class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">#{</span><span class="n">root_directory</span><span class="si">}</span><span class="s2">/**/config.xml&quot;</span><span class="p">)</span> <span class="n">config_files</span><span class="o">.</span><span class="n">each</span> <span class="p">{</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span> <span class="n">doc</span> <span class="o">=</span> <span class="no">Nokogiri</span><span class="o">::</span><span class="no">XML</span><span class="o">::</span><span class="no">Document</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">config</span><span class="p">))</span> <span class="k">do</span> <span class="o">|</span><span class="n">options</span><span class="o">|</span> <span class="n">options</span><span class="o">.</span><span class="n">default_xml</span><span class="o">.</span><span class="n">noblanks</span> <span class="c1"># ignore initial whitespace</span> <span class="k">end</span> <span class="n">project</span> <span class="o">=</span> <span class="n">doc</span><span class="o">.</span><span class="n">at_xpath</span><span class="p">(</span><span class="s2">&quot;//project&quot;</span><span class="p">)</span> <span class="k">next</span> <span class="k">if</span> <span class="n">project</span><span class="o">.</span><span class="n">nil?</span> <span class="no">FileUtils</span><span class="o">.</span><span class="n">cp</span> <span class="n">config</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">#{</span><span class="n">config</span><span class="si">}</span><span class="s2">.</span><span class="si">#{</span><span class="n">orig_file_extension</span><span class="si">}</span><span class="s2">&quot;</span> <span class="k">case</span> <span 
class="n">block</span><span class="o">.</span><span class="n">arity</span> <span class="k">when</span> <span class="mi">1</span> <span class="n">new_doc</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">doc</span> <span class="k">when</span> <span class="mi">2</span> <span class="n">new_doc</span> <span class="o">=</span> <span class="k">yield</span> <span class="n">doc</span><span class="p">,</span> <span class="n">config</span> <span class="k">end</span> <span class="no">File</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">#{</span><span class="n">config</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="p">{</span><span class="o">|</span><span class="n">f</span><span class="o">|</span> <span class="n">doc</span><span class="o">.</span><span class="n">write_xml_to</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="ss">:indent</span> <span class="o">=&gt;</span> <span class="mi">2</span><span class="p">)</span> <span class="p">}</span> <span class="p">}</span> <span class="k">end</span> </code></pre></div> <p>This function updates each config.xml by following these steps:</p> <ol> <li>Find all config.xml files in any subdirectory of a root directory</li> <li>Make a backup copy of the file with an additional extension specified by the <code>orig_file_extension</code> parameter</li> <li>Pass the contents of the config.xml as a nokogiri document object to the passed-in block</li> <li>The block modifies the document object in whatever way desired</li> <li>Overwrite config.xml with the updated contents of the XML document.</li> </ol> <p>Here&#39;s an example using <code>apply_to_configs</code> to set the checkbox &quot;Send separate e-mails to individuals who broke the build&quot; on every job with a post-build step to send failure emails.</p>
<div class="highlight"><pre><code class="ruby language-ruby" data-lang="ruby"><span class="nb">require</span> <span class="s1">&#39;nokogiri&#39;</span> <span class="n">require_relative</span> <span class="s1">&#39;./edit_config&#39;</span> <span class="n">apply_to_configs</span><span class="p">(</span><span class="vg">$local_jobs_root</span><span class="p">,</span> <span class="vg">$orig_file_extension</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config_doc</span><span class="p">,</span> <span class="n">config_path</span><span class="o">|</span> <span class="n">job_name</span> <span class="o">=</span> <span class="n">config_path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">&quot;/&quot;</span><span class="p">)</span><span class="o">[-</span><span class="mi">2</span><span class="o">]</span> <span class="n">send_to_individuals</span> <span class="o">=</span> <span class="n">config_doc</span><span class="o">.</span><span class="n">at_xpath</span><span class="p">(</span><span class="s2">&quot;//hudson.tasks.Mailer/sendToIndividuals&quot;</span><span class="p">)</span> <span class="nb">puts</span> <span class="s2">&quot;Set </span><span class="si">#{</span><span class="n">send_to_individuals</span><span class="si">}</span><span class="s2"> for </span><span class="si">#{</span><span class="n">job_name</span><span class="si">}</span><span class="s2">&quot;</span> <span class="n">send_to_individuals</span><span class="o">.</span><span class="n">content</span><span class="o">=</span><span class="s2">&quot;true&quot;</span> <span class="k">unless</span> <span class="n">send_to_individuals</span><span class="o">.</span><span class="n">nil?</span> <span class="k">end</span> </code></pre></div> <p>Now that we have the automation in place, we&#39;ve been able to use this several times to modify all our jenkins config setups.</p> Extreme Programming For Modern Start-Ups 
gannon Fri, 11 Apr 2014 00:00:00 +0000 http://dev.bizo.com/2014/04/extreme-programming-for-modern-start-ups.html http://dev.bizo.com/2014/04/extreme-programming-for-modern-start-ups.html <p>Extreme Programming (XP) can be a very effective way to build software, but out of the box, it is poorly suited to many teams. It requires that the team is small, co-located, and working on a single product at any given time. It also assumes that suitable designs can be arrived at by working in micro-increments (eg. TDD cycles) without up-front design. In this post, I&#39;ll discuss how XP can be adapted to suit modern start-ups.</p> <h2>XP is rad</h2> <p>XP works very well for quickly building a new software system and adapting it to meet customer needs. Iterative development, customer value, automated testing, bug-free code and energized work are all a great fit for any innovative team. It is simple to get started (assuming your team is prepared to dive right in), simple to manage, and is designed to be adapted over time.</p> <h2>What about Kanban?</h2> <p>Since the Kanban methodology doesn&#39;t have XP&#39;s constraints, it seems to be a popular choice in recent years. Kanban, however, is more complicated to get started with (coming up with phases and WIP limits) and more complex to manage (adjusting classes of service and SLAs). Despite the complexity, it doesn&#39;t cover technical practices at all. Additionally, the intense focus that is fostered by using iterations is lost.</p> <p>Also, Kanban (as originally designed) seems to eschew estimation, which doesn&#39;t work well for fine-grained project tracking. All items (stories or tasks) in a Kanban system are treated as equivalent in size, but on the teams I&#39;ve worked on, tasks/stories usually vary in size quite a bit. While splitting large stories is a good practice with either methodology, it doesn&#39;t completely solve the problem. 
Significant functionality can&#39;t really be broken into pieces equivalent in size to a small task (like re-arranging widgets on a page or fixing a small bug).</p> <p>The variance in item size means that the predictability of cycle times (the primary metric for setting expectations) is only accurate to the extent that the mix of task sizes stays consistent over time. On most teams, product work seems to come in waves, so the cycle time for a task during construction of a major new product is going to be significantly longer than when the team is working on maintenance and nominal enhancements.</p> <p>Although some teams use T-shirt size estimation with Kanban (where each size has its own cycle time calculation), the mix of item sizes being worked on by a team will influence the cycle times of all items. In other words, if at one point 5 XL items and 2 Small items are being worked on, the cycle time for all items is likely to be higher than at a time when 2 XL items and 5 Small items are being worked on. Accordingly, the cycle time per size will still fluctuate significantly, although there is probably some mitigating effect compared to not using estimation at all. (We just started doing this, and haven&#39;t analyzed the results yet.)</p> <h2>Times have changed</h2> <h3>The boom</h3> <p>These days, start-ups (especially in SF/Silicon Valley) are operating in a different environment than when XP flourished a decade ago. We are in a boom, which makes it very difficult to hire qualified engineers in any start-up hub. This has motivated start-ups to hire qualified engineers anywhere they can find them. That can be in parts of the US that are not start-up hubs, or even abroad.
This means many teams at modern start-ups are geographically distributed, and often in different time zones, as is the case here at Bizo.</p> <h3>Big Data</h3> <p>Also, the number of people active on the web is much larger than it used to be, and the amount of data required to build compelling applications is exploding. Accommodating that volume of data, especially in an era where users expect applications to respond instantaneously, requires carefully choosing efficient algorithms and data structures (eg. HyperLogLog, bloom filters, P-Square, etc.), selecting appropriate data stores and choosing the right model of concurrency. Designing software using the typical approach to TDD won&#39;t lead smoothly to a design that performs well under these conditions.</p> <p>In an era where distributed teams and Big Data are the new norm, we need a refresh of XP to suit our needs.</p> <h2>Distributed teams</h2> <p>XP insists on co-located teams because high-bandwidth communication happens almost automatically. Also, XP espouses Pair Programming, which has traditionally been done in person. Let&#39;s look at some alternatives for distributed teams.</p> <h3>Prolonged stand-ups as a stand-in for in-person communication</h3> <p>The next best thing to in-person communication seems to be video chat. At Bizo, we hold our stand-ups using Google+ Hangouts. XP encourages very quick (~5 minute) stand-up meetings (hence the name) where each team member says just 3 things: what they did yesterday, what they&#39;re going to do today, and what they&#39;re blocked on, if anything. We have naturally tended toward longer stand-ups (15-20m) on our team, with team members disseminating all kinds of information and asking general questions to the group. When I was on a previous, co-located team, I would try to encourage team members to keep it brief, and go &#39;offline&#39; with any additional discussion.
I am coming to the conclusion that on a distributed team, however, a slightly longer stand-up is actually good: the additional time spent is a small price to pay for high-value ad hoc communication that is similar to what happens throughout the day on co-located teams. Inviting the Project Owner to the daily stand-up is a great way to maintain XP&#39;s focus on Real Customer Involvement. Having regular demos with the Project Owner (eg. the Product Manager, the client or whoever is the project sponsor) is also a good idea.</p> <h3>Informative workspace</h3> <p>An informative workspace can be accomplished by having a monitor in all company offices (with engineers) that constantly displays a dashboard and/or an up-to-date view of the project/task tracking app. (Such views should also be readily accessible to remote employees on demand.) Pivotal Tracker is an awesome web app for XP-style project tracking - the UI is well-suited to just such a purpose.</p> <h3>Pair Programming and real-time-ish code reviews</h3> <p>Although Pair Programming is normally done in person, some great tools have been developed that make it much more feasible to do remotely. Screen Hero is one example: it is designed for highly performant screen-sharing where both parties can type and mouse around, and it has built-in voice chat. (Although it&#39;s awesome, we don&#39;t use it very often because it doesn&#39;t have Linux support - just Mac and Windows.)</p> <p>For teams where members have significantly different schedules (eg. due to time zone differences), Pair Programming isn&#39;t feasible. In those cases, a near real-time code review process is a decent substitute. Personally, I think code reviews should have a single person who is responsible for giving a thumbs up/down on a change set. (Additional reviewers are best included only as an FYI.)
The author can email a specific person asking them to review the code at their earliest convenience (or email the team asking for a volunteer). In my opinion, the ideal is for the reviewer to call the author and keep them on the phone while reviewing the code. Code reviews can sometimes involve an in-depth back-and-forth, and it&#39;s best if those can be done in real time. The key here is turn-around time, so that integration and deployment can be done nearly continuously.</p> <h2>Large teams</h2> <h3>Kanban-XP interop</h3> <p>XP is only designed for teams of up to 10 people, all working on the same project at any given time. The obvious way to scale is to break teams up as they grow, so no single team ever exceeds that limit. That&#39;s also a great way to ensure that each team is only working on a single project, yet the organization can progress on several projects simultaneously. XP isn&#39;t really suited to managing the flow of work across teams and projects, so having an over-arching Kanban process with XP implemented within individual teams may offer the best of both worlds.</p> <h3>Pods</h3> <p>Having teams work on a single project enhances the team&#39;s focus and synergy. However, since an organization&#39;s project portfolio changes over time, that requires continual re-organization of teams. Having teams be completely ephemeral necessitates standardization of tools and processes throughout the engineering department to avoid excessive re-training, which can lead to bureaucracy over time.
One way to mitigate this is to limit such standardization to pods of related ephemeral teams, where team members generally stay within a pod for the duration of several projects.</p> <h2>Scalable systems</h2> <p>Although comprehensive up-front design is a very straightforward way to build scalable systems, there are techniques for taking a more agile approach.</p> <h3>Spike Solutions</h3> <p>First, spike solutions can be used to validate basic performance characteristics. Some very limited up-front design (back-of-the-envelope performance calculations for various potential infrastructure pieces) can yield a prospective stack for a system. That stack can be used in a prototype solution (with little to no business logic) that validates whether the stack can stand up to the expected load. The most effective technique for validating scalability at this stage (when feasible) is performance tests that use a unit testing framework (eg. JUnit or RSpec) and invoke the application using in-process calls. The logistics are easier than true end-to-end load testing, although the results are less conclusive.</p> <p>If the spike is successful in verifying adequate performance, subsequent stories can be implemented with a normal TDD cycle utilizing the tested stack. Maintain the performance tests so that they can be run against the real system as it develops, although you probably do not want to run them as part of your unit test suite (which should be very fast).</p> <h3>Leveraging the cloud</h3> <p>It&#39;s important to note that the initial solution may not be all that efficient in terms of infrastructure cost. If you&#39;re hosting your system in the cloud (and not provisioning capacity for it ahead of time), that&#39;s often ok, as long as the system can scale horizontally.
Keep an eye on infrastructure costs, and schedule performance stories to improve efficiency over time.</p> <h3>Load testing and related hackery</h3> <p>Prior to fully launching, it may be wise to do a true end-to-end load test. The logistics of these tests are often difficult (and potentially expensive).</p> <p>If your system is already running at scale, but you want to load test a new piece of infrastructure (or algorithm, etc.), one hack to consider is embedding an isolated load test within your running system. The key to this hack is taking care to limit the impact of the test on your production system. At a minimum, use short timeouts around the experimental code and ensure that errors are caught. Also, if you can&#39;t roll back painlessly, you may want to build in some kind of kill switch for the test.</p> <h2>Everything else is the same</h2> <p>With the tweaks mentioned here, the remaining XP practices and principles are feasible without significant modification. Those would be as follows (as outlined in <em>The Art of Agile Development</em> by James Shore, 2007): Vision, Release Planning, Iteration Planning, Test-Driven Development, Energized Work, Root-Cause Analysis, Retrospectives, Ubiquitous Language, Coding Standards, Reporting, Slack, Stories, Estimating, Risk Management, Customer Tests, Refactoring, Incremental Design and Architecture, Simple Design, Spike Solutions, Exploratory Testing.</p> <h2>Conclusion</h2> <p>Much of what is described here was gleaned from experience, but some of it is conjecture, so YMMV. I&#39;m interested to hear feedback from those who have tried to scale XP.</p> The Mocks Are Alright rayo Thu, 20 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/the-mocks-are-alright.html http://dev.bizo.com/2014/03/the-mocks-are-alright.html <p>We deploy applications on AWS, and we run jobs that check on them.
Like any other code, the jobs need tests.</p> <p>Consider a simple reaper, which terminates Elastic MapReduce (EMR) clusters whose job flows have taken way too long:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">Reaper</span> <span class="o">{</span> <span class="k">def</span> <span class="n">terminateLongJobFlows</span><span class="o">(</span><span class="n">minNormalizedInstanceHours</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">elasticMapReduce</span><span class="o">()</span> <span class="cm">/* filter jobs from emr.describeJobFlows()... */</span> <span class="cm">/* run emr.terminateJobFlows() on filtered jobs... */</span> <span class="o">}</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">}</span> <span class="k">object</span> <span class="nc">Reaper</span> <span class="k">extends</span> <span class="nc">Reaper</span> <span class="o">{</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">AmazonElasticMapReduceClient</span><span class="o">()</span> <span class="o">}</span> </code></pre></div> <p>If you wanted to write a test/spec for <code>Reaper</code>, you&#39;d need some sort of test double for its collaborator <code>AmazonElasticMapReduce</code>. 
What kind of double you use is up to you...</p> <h2>Mock Trial</h2> <p>For this post, consider a <a href="http://www.scalatest.org/">ScalaTest</a> <code>WordSpec</code> using mocks.</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="nd">@RunWith</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">JUnitRunner</span><span class="o">])</span> <span class="k">class</span> <span class="nc">ReaperSpec</span> <span class="k">extends</span> <span class="nc">WordSpec</span> <span class="k">with</span> <span class="nc">Matchers</span> <span class="k">with</span> <span class="nc">MockitoSugar</span> <span class="o">{</span> <span class="s">&quot;Reaper.terminateLongJobFlows()&quot;</span> <span class="n">should</span> <span class="o">{</span> <span class="s">&quot;terminate jobs taking &gt;= a given # of normalized instance hours&quot;</span> <span class="n">in</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">mockEmr</span><span class="o">(</span><span class="nc">Seq</span><span class="o">((</span><span class="s">&quot;a&quot;</span><span class="o">,</span> <span class="mi">40</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;b&quot;</span><span class="o">,</span> <span class="mi">60</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;c&quot;</span><span class="o">,</span> <span class="mi">20</span><span class="o">),</span> <span class="o">(</span><span class="s">&quot;d&quot;</span><span class="o">,</span> <span class="mi">80</span><span class="o">)))</span> <span class="k">val</span> <span class="n">reaper</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Reaper</span><span class="o">()</span> <span class="o">{</span> <span class="k">def</span> <span class="n">elasticMapReduce</span><span class="o">()</span><span 
class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="n">emr</span> <span class="o">}</span> <span class="n">reaper</span><span class="o">.</span><span class="n">terminateLongJobFlows</span><span class="o">(</span><span class="mi">50</span><span class="o">)</span> <span class="k">val</span> <span class="n">argCaptor</span> <span class="k">=</span> <span class="nc">ArgumentCaptor</span><span class="o">.</span><span class="n">forClass</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">TerminateJobFlowsRequest</span><span class="o">])</span> <span class="n">verify</span><span class="o">(</span><span class="n">emr</span><span class="o">).</span><span class="n">terminateJobFlows</span><span class="o">(</span><span class="n">argCaptor</span><span class="o">.</span><span class="n">capture</span><span class="o">)</span> <span class="k">val</span> <span class="n">expectedJobFlowIds</span> <span class="k">=</span> <span class="nc">Set</span><span class="o">(</span><span class="s">&quot;b&quot;</span><span class="o">,</span> <span class="s">&quot;d&quot;</span><span class="o">)</span> <span class="k">val</span> <span class="n">actualJobFlowIds</span> <span class="k">=</span> <span class="n">argCaptor</span><span class="o">.</span><span class="n">getValue</span><span class="o">.</span><span class="n">getJobFlowIds</span><span class="o">.</span><span class="n">asScala</span><span class="o">.</span><span class="n">toSet</span> <span class="n">actualJobFlowIds</span> <span class="n">should</span> <span class="n">equal</span> <span class="o">(</span><span class="n">expectedJobFlowIds</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="cm">/* ... */</span> <span class="o">}</span> </code></pre></div> <p>What makes this test mockist (?) 
is that it spells out how the system under test (SUT), our <code>Reaper</code>, interacts with its collaborator, <code>AmazonElasticMapReduce</code>. The test will</p> <ol> <li><p>verify our <code>Reaper</code> invoked <code>AmazonElasticMapReduce.terminateJobFlows()</code>, and</p></li> <li><p>check that the supplied argument, a <code>TerminateJobFlowsRequest</code>, contains the IDs of the job flows that should be terminated.</p></li> </ol> <p>(1) and (2) are handled by a mock framework like <a href="https://code.google.com/p/mockito/">Mockito</a>.</p> <p>Contrast this with a stubbist (?) approach (<a href="http://martinfowler.com/articles/mocksArentStubs.html">obligatory link</a>). You&#39;re generally not concerned about the interaction but rather the state of things at the end. <em>&quot;I terminated some job flows. Are they there anymore?&quot;</em></p> <h2>Deep Mockery</h2> <p>There was a time when it would&#39;ve been painful to write a helper method like <code>mockEmr</code>, which creates the mock <code>AmazonElasticMapReduce</code>, plus all of its contained information.</p> <p>Consider that many AWS API calls return <code>*Result</code> value objects with graphs of other value objects, sometimes going three or four levels deep (e.g. <code>describeJobFlows()</code> -&gt; <code>getJobFlows()</code> -&gt; <code>getExecutionStatusDetail()</code> -&gt; <code>getState()</code>). So mocking chained API calls would require creating mocks for each call in the chain--tedious, to say the least.</p> <p>With Mockito, however, you can more easily mock deeply by passing to the mock constructor an additional argument, <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/Mockito.html#RETURNS_DEEP_STUBS"><code>RETURNS_DEEP_STUBS</code></a>. 
Mockito handles the deep mock implementation (basically doing what you would&#39;ve done), and returns you a mock that lets you apply <code>when</code>/<code>thenReturn</code> to chained API calls, and <code>verify</code> where appropriate.</p> <p>Subsequently, mock creation gets more concise:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">def</span> <span class="n">mockEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">mock</span><span class="o">[</span><span class="kt">AmazonElasticMapReduce</span><span class="o">](</span><span class="nc">Mockito</span><span class="o">.</span><span class="nc">RETURNS_DEEP_STUBS</span><span class="o">)</span> <span class="k">val</span> <span class="n">jobFlows</span> <span class="k">=</span> <span class="n">idsAndHours</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">mockJobFlow</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">,</span> <span class="nc">JobFlowExecutionState</span><span class="o">.</span><span class="nc">RUNNING</span><span class="o">)</span> <span class="o">}.</span><span class="n">asJava</span> <span class="n">when</span><span class="o">(</span><span class="n">emr</span><span class="o">.</span><span class="n">describeJobFlows</span><span class="o">().</span><span 
class="n">getJobFlows</span><span class="o">())</span> <span class="n">thenReturn</span> <span class="o">(</span><span class="n">jobFlows</span><span class="o">)</span> <span class="n">emr</span> <span class="o">}</span> <span class="k">def</span> <span class="n">mockJobFlow</span><span class="o">(</span> <span class="n">jobFlowId</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">normalizedInstanceHours</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">state</span><span class="k">:</span> <span class="kt">JobFlowExecutionState</span><span class="o">)</span><span class="k">:</span> <span class="kt">JobFlowDetail</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">jobFlow</span> <span class="k">=</span> <span class="n">mock</span><span class="o">[</span><span class="kt">JobFlowDetail</span><span class="o">](</span><span class="nc">Mockito</span><span class="o">.</span><span class="nc">RETURNS_DEEP_STUBS</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getJobFlowId</span><span class="o">())</span> <span class="n">thenReturn</span><span class="o">(</span><span class="n">jobFlowId</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getInstances</span><span class="o">().</span><span class="n">getNormalizedInstanceHours</span><span class="o">())</span> <span class="n">thenReturn</span><span class="o">(</span><span class="n">normalizedInstanceHours</span><span class="o">)</span> <span class="n">when</span><span class="o">(</span><span class="n">jobFlow</span><span class="o">.</span><span class="n">getExecutionStatusDetail</span><span class="o">().</span><span class="n">getState</span><span class="o">())</span> <span 
class="n">thenReturn</span><span class="o">(</span><span class="n">state</span><span class="o">.</span><span class="n">name</span><span class="o">)</span> <span class="n">jobFlow</span> <span class="o">}</span> </code></pre></div> <h2>Sharing is Caring</h2> <p>Because mocks tend to be written more specifically to the tests they support, it&#39;s easy to forget about sharing them. But that doesn&#39;t mean they couldn&#39;t be generalized or shared. ScalaTest has advice on how to <a href="http://www.scalatest.org/user_guide/sharing_fixtures">share fixtures</a>. For example, you could extract the mock creation methods into their own trait:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">SharedFixtures</span> <span class="k">extends</span> <span class="nc">MockitoSugar</span> <span class="o">{</span> <span class="k">this:</span> <span class="kt">Suite</span> <span class="o">=&gt;</span> <span class="k">def</span> <span class="n">mockEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="cm">/* ... 
*/</span> <span class="o">}</span> <span class="k">def</span> <span class="n">mockJobFlow</span><span class="o">(</span><span class="n">jobFlowId</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">state</span><span class="k">:</span> <span class="kt">JobFlowExecutionState</span><span class="o">,</span> <span class="n">terminationProtected</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span><span class="o">)</span><span class="k">:</span> <span class="kt">JobFlowDetail</span> <span class="o">=</span> <span class="o">{</span> <span class="cm">/* ... */</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Then, any spec can just extend that trait:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">ReaperSpec</span> <span class="k">extends</span> <span class="nc">WordSpec</span> <span class="k">with</span> <span class="nc">Matchers</span> <span class="k">with</span> <span class="nc">SharedFixtures</span> </code></pre></div> <h2>You Reap What You Sow</h2> <p>I wrote my test and mocks focused on the AWS API calls that the reaper makes. I&#39;m trading a stronger coupling to the reaper&#39;s implementation, for the ability to confirm, without a full integration test, that I&#39;m making the calls that will terminate the correct job flows.</p> <p>But what if the API calls change? Then I&#39;ll have to change the reaper <em>and</em> the test mocks.</p> <p>Sometimes this is self-inflicted. There are a couple of ways I can use a collaborator object; I switch from one approach to the other; and then I curse myself for the mocks I wrote to support the original implementation.</p> <p>Other times you have no choice. 
As of this writing, the <code>AmazonElasticMapReduce</code> job flow view is <a href="http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html#describeJobFlows%28%29">deprecated</a>, to be superseded by a <a href="http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html#listClusters%28%29">cluster view</a>. The two views provide similar information, but not the same; when the job flow view is removed, I&#39;ll have to change my tests. But that is a problem for another day, to be handled if it happens.</p> <h2>Mock Verdict</h2> <p>In this particular case, a framework like Mockito combined with ScalaTest makes writing tests quick and easy. My SUT gets data from a third-party API which fronts a web service, and I want to verify how it uses the API--calls and arguments--as it reacts to that data. To help me write my tests in a focused and concise manner:</p> <ul> <li><p>The <code>when/thenReturn</code> construct lets me concisely mock the calls to get data.</p></li> <li><p>The <code>verify</code> construct lets me easily, more fully check the calls that act on the data.</p></li> <li><p>The deep stubbing provided by Mockito lets me construct deep mocks because I can apply <code>when/thenReturn</code> and <code>verify</code> on them where needed.</p></li> </ul> <h2>Mock It Stub It Spy It Test It</h2> <p>It sure would be nice to have an <code>AmazonElasticMapReduce</code> already written with a simple implementation of a job flow container, that could support job flow and cluster API views. 
<a href="https://github.com/bizo/aws-java-sdk-stubs/blob/fe16fc533d3eab78fccc8149625ac06b723f5684/src/main/java/com/bizo/awsstubs/services/elasticmapreduce/AmazonElasticMapReduceStub.java">I suppose I should get around to that.</a></p> <p>Furthermore, with Mockito, you can add behavior verification to any object--even a stub!--by <a href="http://docs.mockito.googlecode.com/hg/latest/org/mockito/Mockito.html#spy%28T%29">spying</a> on it.</p> <p>Implement <code>AmazonElasticMapReduceStub</code>:</p> <div class="highlight"><pre><code class="java language-java" data-lang="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">AmazonElasticMapReduceStub</span> <span class="kd">implements</span> <span class="n">AmazonElasticMapReduce</span> <span class="o">{</span> <span class="kd">private</span> <span class="kd">final</span> <span class="n">List</span><span class="o">&lt;</span><span class="n">JobFlowDetail</span><span class="o">&gt;</span> <span class="n">jobFlows</span> <span class="o">=</span> <span class="k">new</span> <span class="n">ArrayList</span><span class="o">&lt;</span><span class="n">JobFlowDetail</span><span class="o">&gt;();</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">addJobFlow</span><span class="o">(</span><span class="n">String</span> <span class="n">jobFlowId</span><span class="o">,</span> <span class="kt">int</span> <span class="n">normalizedInstanceHours</span><span class="o">,</span> <span class="n">JobFlowExecutionState</span> <span class="n">state</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// Add to jobFlows...</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">DescribeJobFlowsResult</span> <span class="nf">describeJobFlows</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="k">new</span> <span class="nf">DescribeJobFlowsResult</span><span class="o">().</span><span class="na">withJobFlows</span><span class="o">(</span><span class="n">jobFlows</span><span class="o">);</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="n">ListClustersResult</span> <span class="nf">listClusters</span><span class="o">(</span><span class="kd">final</span> <span class="n">ListClustersRequest</span> <span class="n">listClustersRequest</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">AmazonServiceException</span><span class="o">,</span> <span class="n">AmazonClientException</span> <span class="o">{</span> <span class="c1">// Something to transform jobFlows from List&lt;JobFlowDetail&gt; to List&lt;ClusterSummary&gt;...</span> <span class="o">}</span> <span class="nd">@Override</span> <span class="kd">public</span> <span class="kt">void</span> <span class="nf">terminateJobFlows</span><span class="o">(</span><span class="n">TerminateJobFlowsRequest</span> <span class="n">request</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// Remove from jobFlows...</span> <span class="o">}</span> <span class="cm">/* ...
*/</span> <span class="o">}</span> </code></pre></div> <p>Spy it:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">def</span> <span class="n">spyEmr</span><span class="o">(</span><span class="n">idsAndHours</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">Int</span><span class="o">)])</span><span class="k">:</span> <span class="kt">AmazonElasticMapReduce</span> <span class="o">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">emr</span> <span class="k">=</span> <span class="n">spy</span><span class="o">(</span><span class="k">new</span> <span class="nc">AmazonElasticMapReduceStub</span><span class="o">())</span> <span class="k">for</span> <span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">)</span> <span class="k">&lt;-</span> <span class="n">idsAndHours</span><span class="o">)</span> <span class="n">emr</span><span class="o">.</span><span class="n">addJobFlow</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">hours</span><span class="o">,</span> <span class="nc">JobFlowExecutionState</span><span class="o">.</span><span class="nc">RUNNING</span><span class="o">)</span> <span class="n">emr</span> <span class="o">}</span> </code></pre></div> <p>Now use the spy instead of the mock.
Both <code>verify()</code> and the argument checking will work as before.</p> <p><em>Also:</em></p> <ul> <li><p><a href="http://www.draconianoverlord.com/2013/04/13/services-should-come-with-stubs.html">Services Should Come with Stubs</a> by Stephen @ Bizo</p></li> <li><p><a href="http://www.draconianoverlord.com/2010/07/09/why-i-dont-like-mocks.html">Why I Don&#39;t Like Mocks</a> by Stephen @ Bizo</p></li> <li><p>My teammates are more conscientious than I and have provided a few <a href="https://github.com/bizo/aws-java-sdk-stubs">stubs for AWS</a>. Check out our <a href="https://github.com/bizo/aws-java-sdk-stubs/blob/master/src/main/java/com/bizo/awsstubs/services/s3/AmazonS3Stub.java">S3 stub implementation</a>, or our <a href="https://github.com/bizo/aws-java-sdk-stubs/blob/master/src/main/java/com/bizo/awsstubs/services/kinesis/AmazonKinesisClientStub.java">Kinesis stub implementation</a> if you are cutting edge like that.</p></li> </ul> Statistics With Spark Josh Fri, 07 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/statistics-with-spark.html http://dev.bizo.com/2014/03/statistics-with-spark.html <p>Lately I&#39;ve been writing a lot of Spark jobs that perform statistical analysis on datasets. One of the things I didn&#39;t realize right away is that RDDs have built-in support for basic statistical functions like mean, variance, sample variance, and standard deviation.</p> <p>These operations are available on RDDs of <code>Double</code> via <a href="http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.DoubleRDDFunctions">DoubleRDDFunctions</a>.
You can access these functions like so:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"> <span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span> <span class="c1">// implicit conversions in here</span> <span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="n">newRDD</span><span class="o">().</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">toDouble</span> <span class="o">}</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">mean</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">sampleVariance</span> <span class="c1">// divides by n-1</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">sampleStdDev</span> <span class="c1">// divides by n-1 </span> </code></pre></div> <h2>Getting It All At Once</h2> <p>If you&#39;re interested in calling multiple stats functions at the same time, it&#39;s a better idea to compute them all in a single pass. Spark provides the <code>stats</code> method in DoubleRDDFunctions for that; it also returns the total count of the RDD. </p> <h2>Histograms</h2> <p>Means and standard deviations are a decent starting point when you&#39;re looking at a new dataset, but be careful: measures of central tendency hide the distribution from you. If you&#39;re looking at something like response latency, there could very well be dragons lurking there.
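</p> <p>To make that concrete, here are two hypothetical latency samples (the numbers are invented purely for illustration) that share a mean of 100 ms while having completely different shapes; the mean alone can never tell them apart.</p>

```scala
// Two made-up latency samples (ms): one steady, one hiding a slow outlier.
val steady = Seq(95.0, 100.0, 105.0, 95.0, 100.0, 105.0)
val spiky  = Seq(10.0, 10.0, 10.0, 10.0, 10.0, 550.0)

def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Both means are exactly 100.0, yet the second sample conceals a 550 ms spike.
val steadyMean = mean(steady)
val spikyMean  = mean(spiky)
```

<p>Only a view of the full distribution, rather than a single summary number, exposes the difference.</p> <p>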
</p> <p>Fortunately Spark also includes a nifty little histogram method you can use: </p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="n">newRDD</span><span class="o">().</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">toDouble</span> <span class="o">}</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">histogram</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="c1">// 10 evenly spaced buckets, between myRDD.min -&gt; myRDD.max </span> <span class="n">myRDD</span><span class="o">.</span><span class="n">histogram</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="mf">0.0</span><span class="o">,</span> <span class="mf">10.0</span><span class="o">,</span> <span class="mf">20.0</span><span class="o">,</span> <span class="mf">30.0</span><span class="o">))</span> <span class="c1">// manually specify the buckets</span> </code></pre></div> <h2>Beyond The Box</h2> <p>Spark provides a very basic but useful starting point. If you want access to more advanced statistical methods like classification or regression, check out <a href="http://spark.apache.org/docs/0.9.0/mllib-guide.html">MLlib</a>. However, at the time of writing it&#39;s still a very young project and you might have to implement things on your own.</p> <p>In our case, we ended up implementing some basic z &amp; chi-squared tests for some of our bidding algorithms, like:</p> <ul> <li>Comparing binomial proportions of two samples </li> <li>Comparing means of two samples</li> <li>Comparing two distributions drawn from different samples</li> </ul> <p>These are still very much in their infancy, but it&#39;s possible we might open source them at a later date.
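</p> <p>To give a feel for the first bullet, a two-sample binomial-proportion comparison boils down to a handful of arithmetic operations. Here is a rough plain-Scala sketch (our own illustrative version with invented numbers, not the unreleased Bizo code) of the pooled two-proportion z statistic.</p>

```scala
// Illustrative two-proportion z-test: compares the conversion rates of two
// samples under the null hypothesis that the true rates are equal.
def twoProportionZ(conv1: Long, n1: Long, conv2: Long, n2: Long): Double = {
  val p1 = conv1.toDouble / n1
  val p2 = conv2.toDouble / n2
  val pooled = (conv1 + conv2).toDouble / (n1 + n2)  // pooled rate under H0
  val se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
  (p1 - p2) / se
}

// Hypothetical numbers: 120/1000 conversions vs. 90/1000.
val z = twoProportionZ(120, 1000, 90, 1000)
// |z| > 1.96 would reject equal rates at the 5% significance level.
```

<p>The chi-squared and two-sample mean comparisons follow the same pattern: a few aggregates per sample, then a closed-form statistic.</p> <p>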
</p> <p>For now, when implementing things yourself it&#39;s nice to keep operations as general as possible, so I&#39;d recommend keeping all your hand-rolled stats methods as operations on RDDs of primitives (or matrices). </p> Required Reading Team Geek donnie flood Fri, 07 Mar 2014 00:00:00 +0000 http://dev.bizo.com/2014/03/read-team-geek.html http://dev.bizo.com/2014/03/read-team-geek.html <p><img src="/images/posts/teamgeek.jpg" class="pull-left" style="height:425px; margin-right: 20px; margin-bottom: 20px;" /> </p> <p>From the beginning (2008), the engineering team at Bizo has focused on <a href="http://dev.bizo.com/2011/03/on-building-a-kick-ass-engineering-team----part-1.html">building a great culture</a> around a few key attributes like quality, discipline, and communication. A couple of years ago our director of engineering, Larry, happened upon a book, <a href="http://shop.oreilly.com/product/0636920018025.do">Team Geek</a>, that does an amazing job of describing the ideal set of cultural tenets and personal behaviors that we are trying to build our culture around. </p> <p>After reading it in one evening (the book is a short 194 pages), I immediately bought copies for the entire engineering team and set up a few &quot;book club&quot; meetings where we all discussed the contents of the book and how they related to our team.
And now, every new engineer who joins is required to read the book.</p> <p>Here is a list of the chapters (I should do some future chapter highlight posts):</p> <ul> <li>Chapter 1: The Myth of the Genius Programmer</li> <li>Chapter 2: Building an Awesome Team Culture</li> <li>Chapter 3: Every Boat Needs a Captain</li> <li>Chapter 4: Dealing with Poisonous People</li> <li>Chapter 5: The Art of Organizational Manipulation</li> <li>Chapter 6: Users Are People, Too</li> </ul> <p>Among other things, &quot;Team Geek&quot; does a great job of describing what makes a great team member with the acronym HRT (pronounced like &quot;heart&quot;), which stands for Humility, Respect, and Trust. We&#39;ve adopted this term internally as part of our cultural nomenclature. </p> <p>Overall, I think any and all engineering professionals should read this book and take to heart some of its lessons on how to define a great engineering culture and, more importantly, how to become someone who not only solves hard technical problems but is a pleasure to work with. Even non-engineers can take a lot of these lessons to heart; Humility, Respect, and Trust are definitely characteristics that make good team members no matter the part of the organization.</p> args4j-helpers larry ogrodnek Wed, 26 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/args4j-helpers.html http://dev.bizo.com/2014/02/args4j-helpers.html <p>I really like using <a href="http://args4j.kohsuke.org/">args4j</a> for command-line parsing in both Java and Scala, but I found myself writing the same boilerplate code to parse options, deal with help, deal with parsing issues, etc.
<a href="https://github.com/ogrodnek/args4j-helpers">args4j-helpers</a> is a project that simplifies parsing with args4j.</p> <p>It provides typical option parsing error handling:</p> <ul> <li>If help (provided via the <code>OptionsWithHelp</code> base trait bound to <code>--help</code> and <code>-h</code>) is requested, print usage information to stderr, exit with code <code>0</code>.</li> <li>If a required option is missing (<code>required=true</code>), print usage information to stderr, exit with code <code>1</code> (unless help was requested).</li> </ul> <p>This typical parsing code:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">Options</span> <span class="o">{</span> <span class="k">import</span> <span class="nn">org.kohsuke.args4j.Option</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--help&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-h&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;show this message&quot;</span><span class="o">)</span> <span class="k">var</span> <span class="n">help</span> <span class="k">=</span> <span class="kc">false</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--blah&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-b&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;some val&quot;</span><span class="o">,</span> <span class="n">metaVar</span><span class="o">=</span><span class="s">&quot;BLAH&quot;</span><span
class="o">)</span> <span class="k">var</span> <span class="n">blah</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">}</span> <span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span> <span class="o">{</span> <span class="k">val</span> <span class="n">options</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Options</span> <span class="k">val</span> <span class="n">parser</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">CmdLineParser</span><span class="o">(</span><span class="n">options</span><span class="o">)</span> <span class="k">try</span> <span class="o">{</span> <span class="n">parser</span><span class="o">.</span><span class="n">parseArgument</span><span class="o">(</span><span class="n">args</span> <span class="k">:</span> <span class="k">_</span><span class="kt">*</span><span class="o">)</span> <span class="k">if</span> <span class="o">(</span><span class="n">options</span><span class="o">.</span><span class="n">help</span><span class="o">)</span> <span class="o">{</span> <span class="n">parser</span><span class="o">.</span><span class="n">printUsage</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">err</span><span class="o">)</span> <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="k">catch</span> <span class="o">{</span> <span class="k">case</span> <span class="n">e</span><span class="k">:</span> <span class="kt">CmdLineException</span> <span class="o">=&gt;</span> <span class="o">{</span> <span class="nc">System</span><span 
class="o">.</span><span class="n">err</span><span class="o">.</span><span class="n">println</span><span class="o">(</span><span class="n">e</span><span class="o">.</span><span class="n">getMessage</span><span class="o">)</span> <span class="n">parser</span><span class="o">.</span><span class="n">printUsage</span><span class="o">(</span><span class="nc">System</span><span class="o">.</span><span class="n">err</span><span class="o">)</span> <span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>can be simplified as:</p> <div class="highlight"><pre><code class="scala language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">Options</span> <span class="k">extends</span> <span class="nc">OptionsWithHelp</span> <span class="o">{</span> <span class="k">import</span> <span class="nn">org.kohsuke.args4j.Option</span> <span class="nd">@Option</span><span class="o">(</span><span class="n">name</span><span class="o">=</span><span class="s">&quot;--blah&quot;</span><span class="o">,</span> <span class="n">aliases</span><span class="k">=</span><span class="nc">Array</span><span class="o">(</span><span class="s">&quot;-b&quot;</span><span class="o">),</span> <span class="n">usage</span><span class="o">=</span><span class="s">&quot;some val&quot;</span><span class="o">,</span> <span class="n">metaVar</span><span class="o">=</span><span class="s">&quot;BLAH&quot;</span><span class="o">)</span> <span class="k">var</span> <span class="n">blah</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span> <span class="o">}</span> <span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span 
class="kt">String</span><span class="o">])</span> <span class="o">{</span> <span class="k">val</span> <span class="n">options</span> <span class="k">=</span> <span class="n">optionsOrExit</span><span class="o">(</span><span class="n">args</span><span class="o">,</span> <span class="k">new</span> <span class="nc">Options</span><span class="o">)</span> <span class="o">}</span> </code></pre></div> <p>Additionally, the helper class handles the case where help is requested while required arguments are missing (a case the typical boilerplate above doesn&#39;t handle).</p> <p>The code is in Scala, and <a href="https://github.com/ogrodnek/args4j-helpers">available on GitHub</a>.</p> <p>In the future, I&#39;d like to extend it to add better support for more Scala-ish types (<code>Option</code>, <code>Seq</code>, etc.), which should be mostly possible by implementing additional args4j OptionHandlers.</p> Beware Java Enums in Spark Josh Sun, 23 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/beware-enums-in-spark.html http://dev.bizo.com/2014/02/beware-enums-in-spark.html <p>A few days back I wrote a <a href="https://spark.incubator.apache.org/">Spark</a> job that runs an A/B test to compare the conversion rates between two groups of website visitors on one of our client&#39;s websites. </p> <p>In this case, visitors were placed into the test/control group based on a deterministic method using their UUID. At the time of writing we had multiple projects wanting to use our general A/B test code. However, some projects were using Scala 2.9.x and others were on Scala 2.10. We wanted to share code, but we didn&#39;t want the hassle of maintaining separate artifacts for each Scala version, so we packaged the group labels and the method for determining who belongs to which group in a Java project.
It looked something like this: </p> <div class="highlight"><pre><code class="java"> <span class="kd">public</span> <span class="kd">enum</span> <span class="n">TestControl</span> <span class="o">{</span> <span class="n">TEST</span><span class="o">,</span> <span class="n">CONTROL</span><span class="o">;</span> <span class="kd">public</span> <span class="kd">static</span> <span class="n">TestControl</span> <span class="nf">getGroup</span><span class="o">(</span><span class="n">UUID</span> <span class="n">id</span><span class="o">)</span> <span class="o">{</span> <span class="c1">// return appropriate group</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <h2>The Spark Code</h2> <p>The report was a fairly standard Spark report for us. Basically it involved loading &amp; massaging our log files into the appropriate RDD format, then mapping them into a format conducive to aggregation. Spark supports aggregation of key/value pairs using the <code>reduceByKey</code> method via an implicit conversion on an <code>RDD[Tuple2]</code> (see <a href="http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions">PairRDDFunctions</a>).
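</p> <p>The key/value aggregation itself is easy to picture outside Spark. This plain-Scala stand-in (a local sketch for intuition, not Spark&#39;s distributed implementation) does for an in-memory <code>Seq</code> what <code>reduceByKey</code> does for an RDD.</p>

```scala
// Local stand-in for Spark's reduceByKey: group (key, value) pairs by key,
// then combine each key's values with the supplied function.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(combine) }

val grouped = reduceByKeyLocal(Seq(("test", 1), ("control", 1), ("test", 1)))(_ + _)
// grouped == Map("test" -> 2, "control" -> 1)
```

<p>The distributed version has the same shape, but the grouping happens across partitions, so it depends on the keys hashing and comparing consistently everywhere.</p> <p>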
</p> <p>Since I was interested in comparing the average conversion rate between the two groups, it was a simple matter of mapping the log files into the appropriate key/value pairs and then calling reduceByKey like so: </p> <div class="highlight"><pre><code class="scala"><span class="c1">// aggregation object</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Stats</span><span class="o">(</span><span class="n">impressions</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">0</span><span class="o">,</span> <span class="n">conversions</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span> <span class="k">lazy</span> <span class="k">val</span> <span class="n">conversionRate</span><span class="k">:</span> <span class="kt">Double</span> <span class="o">=</span> <span class="n">conversions</span> <span class="o">/</span> <span class="n">impressions</span><span class="o">.</span><span class="n">toDouble</span> <span class="k">def</span> <span class="o">+(</span><span class="n">other</span><span class="k">:</span> <span class="kt">Stats</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Stats</span><span class="o">(</span> <span class="n">impressions</span> <span class="o">+</span> <span class="n">other</span><span class="o">.</span><span class="n">impressions</span><span class="o">,</span> <span class="n">conversions</span> <span class="o">+</span> <span class="n">other</span><span class="o">.</span><span class="n">conversions</span> <span class="o">)</span> <span class="o">}</span> <span class="k">val</span> <span class="n">myRDD</span> <span class="k">=</span> <span class="c1">// load/map log files from S3 to case class with methods for each field</span> <span class="k">val</span> <span class="n">readyForAggregation</span> <span
class="k">=</span> <span class="n">myRDD</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="n">line</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">isImp</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="nc">TestControl</span><span class="o">.</span><span class="n">getGroup</span><span class="o">(</span><span class="n">line</span><span class="o">.</span><span class="n">uuID</span><span class="o">),</span> <span class="nc">Stats</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">0</span><span class="o">))</span> <span class="k">case</span> <span class="n">line</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">isConv</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="nc">TestControl</span><span class="o">.</span><span class="n">getGroup</span><span class="o">(</span><span class="n">line</span><span class="o">.</span><span class="n">uuID</span><span class="o">),</span> <span class="nc">Stats</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="k">val</span> <span class="n">results</span> <span class="k">=</span> <span class="n">readyForAggregation</span><span class="o">.</span><span class="n">reduceByKey</span> <span class="o">{</span> <span class="k">_</span> <span class="o">+</span> <span class="k">_</span> <span class="o">}</span> </code></pre></div> <p>At this point we&#39;d expect the resulting RDD to contain just two elements, one for the test group and another for the control (as our key was a <code>TestControl</code> value and <code>getGroup</code> only produces 2 values).
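</p> <p>The shape of that aggregation can be sketched without Spark, using plain Scala collections on a single JVM (<code>reduceByKey</code> is morally a <code>groupBy</code> followed by a per-key reduce; the String keys and counts below are made-up stand-ins):</p>

```scala
// Simplified stand-in for the Stats aggregation object from the post.
case class Stats(impressions: Long = 0, conversions: Long = 0) {
  def +(other: Stats): Stats =
    Stats(impressions + other.impressions, conversions + other.conversions)
}

// Pre-keyed events, as produced by the map step.
val events = Seq(
  ("TEST", Stats(1, 0)), ("TEST", Stats(0, 1)),
  ("CONTROL", Stats(1, 0)), ("TEST", Stats(1, 0))
)

// Local equivalent of reduceByKey: group by key, then sum each group.
val results: Map[String, Stats] =
  events.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

assert(results.size == 2) // one entry per distinct key
```

<p>With only two possible keys going in, exactly two entries come out.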
</p> <h2>Testing</h2> <p>It might seem obvious that we should only get 2 items in our results RDD, but it&#39;s always good to check with a unit test. Running a local unit test confirmed that this was indeed the case: </p> <div class="highlight"><pre><code class="scala"> <span class="nd">@Test</span> <span class="k">def</span> <span class="n">resultsShouldHaveJust2Elements</span><span class="o">()</span> <span class="o">{</span> <span class="k">val</span> <span class="n">lines</span> <span class="k">=</span> <span class="c1">// make some stub log lines (Scala case classes)</span> <span class="k">val</span> <span class="n">results</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">ABTestReport</span><span class="o">(</span><span class="n">lines</span><span class="o">).</span><span class="n">run</span> <span class="c1">// runs aggregation discussed above</span> <span class="n">results</span><span class="o">.</span><span class="n">size</span> <span class="n">shouldBe</span> <span class="mi">2</span> <span class="o">}</span> </code></pre></div> <p>Sweet, all our tests pass, we&#39;re done! But wait! Not so fast: running this on a real Spark cluster yielded very different results... </p> <div class="highlight"><pre><code class="scala"> <span class="k">val</span> <span class="n">data</span> <span class="k">=</span> <span class="n">results</span><span class="o">.</span><span class="n">collect</span> <span class="c1">// pulls RDD onto master as an Array</span> <span class="n">println</span><span class="o">(</span><span class="n">data</span><span class="o">.</span><span class="n">size</span><span class="o">)</span> <span class="c1">// 52 -&gt; WTF?!!!</span> <span class="n">println</span><span class="o">(</span><span class="n">data</span><span class="o">)</span> <span class="c1">// Array( (CONTROL, Stats), (TEST, Stats), (CONTROL, Stats), (TEST, Stats)...) </span> </code></pre></div> <h2>WTF Spark?!</h2> <p>So what&#39;s going on here?
We have a local unit test that shows that <code>reduceByKey</code> works as expected when using our Enum for a key, but our Spark cluster seems to incorrectly reduce our results - we wind up with way too many keys.</p> <p>Well, after some digging with a few other engineers, it turns out that by default Spark will map your data items to partitions using a <a href="https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L72-L86">HashPartitioner</a>. HashPartitioner uses the hashCode of an object to determine which partition it will live in. Ok so far, that seems completely sensible.</p> <h2>The Devil is in the Enum</h2> <p>But wait: the hashCode method on Java&#39;s enum type is based on the memory address of the object. So while yes, we&#39;re guaranteed that the same enum value has a stable hashCode inside a particular JVM (since the enum will be a static object), we don&#39;t have this guarantee when comparing hashCodes of Java enums with identical values living in different JVMs. <strong>They will very likely have different hashCode values</strong>.</p> <p>Our local unit test passed because it executed on a single JVM: the enum&#39;s hashCode remained consistent when HashPartitioner asked for it. Our real cluster failed because HashPartitioner was getting different hashCodes for the same enum values, since each slave has its own machine/JVM.</p> <h2>What to do instead</h2> <p>At this point it should be pretty clear that we should not use Java enums as keys for RDDs we&#39;d like to aggregate. Fortunately, there are two easy alternatives: </p> <h3>The omg I&#39;m lazy way</h3> <p>You can simply <code>toString()</code> your enums prior to calling <code>reduceByKey</code>, since String&#39;s <code>hashCode</code> method does not rely on the memory address of the object.
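</p> <p>As a quick sanity check, <code>String.hashCode</code> is specified in the Java docs as a pure function of the characters (s(0)*31^(n-1) + s(1)*31^(n-2) + ... + s(n-1)), so every JVM computes the same value for the same contents. A sketch that recomputes it by hand:</p>

```scala
// Recompute String.hashCode from its documented definition: a left fold
// that multiplies the accumulator by 31 and adds each character.
def stringHash(s: String): Int = s.foldLeft(0)((h, c) => 31 * h + c)

// Matches the built-in hashCode, demonstrating it depends only on the
// contents, never on the object's memory address.
assert(stringHash("TEST") == "TEST".hashCode)
assert(stringHash("CONTROL") == "CONTROL".hashCode)
```

<p>So a key like <code>&quot;CONTROL&quot;</code> hashes identically on every slave, and HashPartitioner routes it to the same partition everywhere.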
</p> <h3>A better way</h3> <p>Use a sealed trait with case objects instead: a case object&#39;s <code>hashCode</code> is computed from its name rather than its memory location, so it is stable across JVMs: </p> <div class="highlight"><pre><code class="scala"><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Group</span> <span class="k">object</span> <span class="nc">TestControl</span> <span class="o">{</span> <span class="k">case</span> <span class="k">object</span> <span class="nc">Test</span> <span class="k">extends</span> <span class="nc">Group</span> <span class="k">case</span> <span class="k">object</span> <span class="nc">Control</span> <span class="k">extends</span> <span class="nc">Group</span> <span class="o">}</span> </code></pre></div> <h2>Wrapping Up</h2> <p>I think this goes to show you that while local unit tests on a single JVM are always a good idea, you should also check your results carefully on clustered systems. When you start doing parallel computing, you often run into bugs that only show up on real clusters. Happy Spark-ing!</p> Why we chose not to git fast forward merge Matthieu Fri, 14 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/why-we-chose-not-to-git-fast-forward-merge.html http://dev.bizo.com/2014/02/why-we-chose-not-to-git-fast-forward-merge.html <p>This post comes from an interesting email thread we had.
In the thread, <a href="https://github.com/stephenh">Stephen</a> was answering various questions from the team about why it is so important to NOT use fast-forward merges.</p> <p><strong>Question:</strong> Can you give a quick explanation of what --no-ff and --log do and why we want them?</p> <p><strong>Answer:</strong></p> <p>Sure.</p> <p>Out of the box, if you call &quot;git merge&quot;, it is not guaranteed to create a merge commit.</p> <p>Technically, merge commits are only required when both branches have new commits, e.g.:</p> <ul> <li>master: C1 -&gt; C2</li> <li>feature: C1 -&gt; C3</li> </ul> <p>If you do:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master
git merge feature </code></pre></div> <p>It will be forced to make a new merge commit, C4, that ties together C2 and C3:</p> <ul> <li>master: C1 -&gt; C2 -&gt; C4</li> <li>feature: C1 -&gt; C3 /</li> </ul> <p>(If the ASCII art lines up, C4 has both C2 and C3 pointing at it.)</p> <p>However, if feature had been created off of C2 instead of C1, and so it looked like this:</p> <ul> <li>master: C1 -&gt; C2</li> <li>feature: C2 -&gt; C3</li> </ul> <p>And you do:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master
git merge feature </code></pre></div> <p>Git will say &quot;huh, technically I could make master the same as feature by just moving master to now be C3&quot;, and so master would look like:</p> <ul> <li>master: C1 -&gt; C2 -&gt; C3</li> </ul> <p>Which makes sense: git has incorporated the work of feature, and without rewriting history. In git parlance, this is called a fast forward.</p> <p>The downside is that you&#39;ve lost the notion that the feature branch ever existed.
Which is not necessarily crucial, but if you want your DAG to always look a certain way, aesthetically it&#39;s nice to force the merge commit anyway, so:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git merge --no-ff feature </code></pre></div> <p>Will force a new merge commit, C4, whose parents are C2 and C3.</p> <p>The general assertion is that, if you&#39;re running &quot;git merge&quot; by hand in the first place, you are probably merging across &quot;real&quot; branches and so forcing an actual merge commit is probably a good idea.</p> <p>The other flag, --log, just means that the merge commit&#39;s generated commit message will include a brief description of what was merged, e.g. instead of just: <pre> Merge feature-a into master. </pre> It&#39;ll be: <pre> Merge feature-a into master.

Commits:
* Second commit on feature-a
* First commit on feature-a </pre></p> <p>Which is handy for when you&#39;re scanning merge commits in git log/etc.</p> <h3>Try it yourself!</h3> <p>To see for yourself, the instructions below will help you set up a dummy repo.</p> <p>Pre-requisite: git is installed.
Should work on any shell or Windows command prompt.</p> <p>Enter the following to create the repo and add the first commit:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git init test ; cd test ; echo &quot;content&quot; &gt; file.txt ; git add * ; git commit -m &quot;init&quot; </code></pre></div> <p>Create a branch, modify your file and commit (done twice here to get a better gitk view later):</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout -b branch1 ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch1 modif&quot; ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch1 modif&quot; </code></pre></div> <p>Merge your branch without fast-forward and optionally --log:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master ; git merge --no-ff --log --no-edit branch1 </code></pre></div> <p>Again, create a branch, modify your file and commit (again x2 for the gitk view):</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout -b branch2 ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch2 modif&quot; ; echo &quot;modif&quot; &gt;&gt; file.txt ; git commit -am &quot;branch2 modif&quot; </code></pre></div> <p>Merge your branch with fast-forward:</p> <div class="highlight"><pre><code class="text language-text" data-lang="text">git checkout master ; git merge --ff-only branch2 </code></pre></div> <p>Running &#39;gitk&#39; in your repository path should show something like:</p> <p><img src="/images/posts/git-noff-gitk-screenshot.png" style="width: 50%; height: 50%"></p> <p>We can see on the gitk screenshot that “branch2” becomes invisible in the history.
</p> <p>If, for example, you have a one-branch-per-feature policy, this means you lose the ability to tell which commits correspond to that feature.</p> <p><strong>Question:</strong> This seems to be similar in spirit but opposite in effect to &quot;git pull --rebase&quot;, which prevents git pull from creating merge commits and uses rebase instead.</p> <p><strong>Answer:</strong></p> <p>Yes, that is insightful.</p> <p>When pulling, we need to tell git &quot;no, really, don&#39;t make merges&quot;.</p> <p>And when merging, we need to tell git &quot;no, really, make a merge&quot;.</p> <p>I don&#39;t have a good explanation for why this is, other than it&#39;s just the whole &quot;the git CLI is unintuitive/not well designed&quot; thing.</p> <p>Well, and, to give them a little credit, git is not opinionated in saying &quot;your workflow must be <em>this</em>&quot;, which has the unfortunate side effect that the default behavior does not itself adhere to one or another workflow.</p> <p><strong>Question:</strong> Is there a community convention over this, or do most projects just accept whatever git does by default?</p> <p><strong>Answer:</strong></p> <p>I don&#39;t know for sure, but my feeling is that there is a lot of community convention around both &quot;pull --rebase&quot; and &quot;merge --no-ff&quot;.</p> <p>In particular, &quot;git pull --rebase&quot; seems really common, given the number of blog posts about it, and the fact that it eventually got its own config setting (&quot;pull.rebase=true&quot;) for the user to change their default &quot;git pull&quot; behavior.</p> <p>Same thing with &quot;merge --no-ff&quot; and &quot;--log&quot;: they both also got config settings.</p> <p>To me, a flag graduating to a config setting says a lot of people are using it as convention.
(I would go further and assert that getting a config setting almost means the behavior is the correct/preferred way, but the default can&#39;t be changed for backwards compatibility purposes.)</p> Crucible Survivor - a code review dashboard larry ogrodnek Fri, 14 Feb 2014 00:00:00 +0000 http://dev.bizo.com/2014/02/crucible-survivor-code-review-dashboard.html http://dev.bizo.com/2014/02/crucible-survivor-code-review-dashboard.html <p><img src="https://raw.github.com/ogrodnek/crucible-survivor/master/docs/crucible-survivor-small.png" alt="Crucible Survivor Dashboard"></p> <p>We&#39;ve talked a lot about the importance of <a href="http://dev.bizo.com/2012/03/on-code-reviews-and-developer-feedback.html">code reviews and developer feedback</a>.</p> <p><a href="https://github.com/ogrodnek/crucible-survivor">Crucible Survivor</a> is a Hall of Fame / Hall of Shame dashboard that helps to encourage completing your reviews.</p> <!--more--> <p>It integrates with <a href="https://www.atlassian.com/software/crucible/overview">Crucible</a> (obviously) for code review stats.</p> <h2>Scoring</h2> <p>The scoring is pretty simple right now. Each reviewer gets a <em>Fame</em> point for completing a review, and a <em>Shame</em> point for each review they have not yet completed.</p> <p>We&#39;ve had this running for a few months now, and the general thinking is it would be better to have open reviews be more shameful the longer they have been kept open. Maybe a future improvement.</p> <h2>Design / Credits</h2> <p>The design is taken from <a href="http://blog.gengo.com/jira-survivor/">Jira Survivor</a>, which itself was forked from <a href="http://99designs.com/tech-blog/blog/2013/01/05/github-survivor/">Github Survivor</a>.</p> <h2>Code and Application Architecture</h2> <p>The <a href="https://github.com/ogrodnek/crucible-survivor">code</a> is a large departure from the original projects.
The original projects use MongoDB to store data scraped from the github/jira APIs, and a python web app to serve the site.</p> <p>Crucible Survivor is an angular app that is mostly static. The review stats are included in the app as a JS include. The app is hosted as a static website in S3, with the contents of the stats JS generated periodically by a jenkins server.</p> <p>This was a hack-day experiment in &#39;static&#39; dashboard apps, and I&#39;m really happy with how it came out. I really like how gathering the content to display is decoupled from the display. Serving the site requires no server infrastructure, and the update process can be very flexible. Finally, it&#39;s incredibly easy to test/run locally -- just generate a fake stats file and open the site.</p> <h3>Grab the code, get it running!</h3> <p>The code is <a href="https://github.com/ogrodnek/crucible-survivor">open sourced, and available on github</a> along with instructions on configuring, running, and developing.</p> <p>I&#39;d love to hear any feedback or comments.</p> SCM Migration Mark Dietz Mon, 26 Aug 2013 00:00:00 +0000 http://dev.bizo.com/2013/08/scm-migration.html http://dev.bizo.com/2013/08/scm-migration.html <p>We happily used Atlassian’s hosted OnDemand service for source code management with the following setup:</p> <ul> <li><p>Subversion: source control management</p></li> <li><p>FishEye: source code browsing</p></li> <li><p>Crucible: code reviews</p></li> <li><p><a href="http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html">Jenkins</a> (hosted on EC2): continuous integration and periodic jobs </p></li> </ul> <p>However, Atlassian is ending their OnDemand offering for source code management in October, so it was time for a change. The good news: we had been wanting to migrate to git anyway.
The bad news: we had hundreds of projects in our subversion repository and needed to break them up into separate git repositories. We switched on a Thursday morning with minimal developer interruption; now we&#39;re on a new setup:</p> <ul> <li><p>Bitbucket: source control management and code browsing</p></li> <li><p>Crucible (hosted on EC2): code reviews</p></li> <li><p>Jenkins (hosted on EC2): continuous integration and periodic jobs</p></li> </ul> <p>How&#39;d we do it? Read on, my friend.</p> <h3>Problem</h3> <p>Move hundreds of projects (some with differing branching structures) to an equivalent number of git repositories. And change hundreds of Jenkins job configurations from pulling code out of subversion to pulling code from git. And set up a new Crucible instance for code reviews for the hundreds of repos. All without disrupting the dev team&#39;s work. For subversion, this meant moving the code, including branches, and commit history from subversion into Bitbucket. For Jenkins, it meant changing the job configs to point at the equivalent git repository with the same code and branch as the old subversion configuration. This blog post focuses on the subversion to git migration. Fixing the Jenkins configs will be covered in a later blog post.</p> <h3>Subversion to Git</h3> <p>Converting a single repository from subversion to git is fortunately straightforward due to the terrific tool git-svn (https://www.kernel.org/pub/software/scm/git/docs/git-svn.html). The challenging part was determining how each project configured branches. In subversion, branches are just another subdirectory in the repository. Basically any level of the directory hierarchy can support branches. You can pretty much put them anywhere. Git, however, only supports branches at the root of the repository.
Git-svn allows you to tell git what directory the branches are in, but first you have to find what directory that is.</p> <p>Our subversion repositories followed two primary branching structures: branch at the module level or branch at the project level.</p> <p>The first layout I&#39;ll call &quot;module level&quot;. Module level projects had a separate branch point for each module in the project. These projects were usually several loosely connected modules that could be deployed separately, or libraries that were related but could be imported independently. Module level projects looked like this:</p> <ul> <li><code>svn/&lt;project&gt;/trunk/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/trunk/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module1&gt;/&lt;branch1_for_module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module1&gt;/&lt;branch2_for_module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;module2&gt;/&lt;branch1_for_module2&gt;</code></li> </ul> <p>&quot;Module level&quot; projects mapped into a separate git repo for each module using this git-svn command:</p> <p><code>git svn clone &lt;svn_root&gt; --trunk &lt;project&gt;/trunk/&lt;module&gt; --branches &lt;project&gt;/branches/&lt;module&gt; --tags &lt;project&gt;/branches/&lt;tags&gt; &lt;module&gt;</code></p> <p>The other branching structure I’ll call &quot;project level&quot;. These projects also had multiple modules, but the branches were defined such that each branch contained the entire project. These projects were usually separate modules for the domain layer, application layer, and web layer, or closely related applications that use the same database. Parts could perhaps be deployed separately, but they often needed to be deployed at the same time, such as when the database schema changed.
Project level projects looked like this:</p> <ul> <li><code>svn/&lt;project&gt;/trunk/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/trunk/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch1&gt;/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch1&gt;/&lt;module2&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch2&gt;/&lt;module1&gt;</code></li> <li><code>svn/&lt;project&gt;/branches/&lt;branch2&gt;/&lt;module2&gt;</code></li> </ul> <p>&quot;Project level&quot; projects mapped into a single git repo containing all modules using this git-svn command:</p> <p><code>git svn clone &lt;svn_root&gt; --trunk &lt;project&gt;/trunk --branches &lt;project&gt;/branches --tags &lt;project&gt;/branches &lt;project&gt;</code></p> <p>To automate the git-svn clones, I wrote a ruby script that used &quot;svn ls&quot; to find the list of all projects. Each project was assumed to be &quot;module level&quot; unless it was in a hard-coded list of known &quot;project level&quot; projects. It was important for this to be fully automated, as the list of &quot;project level&quot; projects was not complete until near the end of the migration. It took several tries to make sure the migration was correct. Some projects unfortunately used both branching structures, which is not supported by git-svn. Some of these branches were abandoned anyway, but others were moved using &quot;svn mv&quot; to fit that project&#39;s standard branch structure.</p> <h3>Local Git to Bitbucket</h3> <p>Atlassian provided a <a href="https://go-dvcs.atlassian.com/display/aod/Migrating+from+Subversion+to+Git+on+Bitbucket">jar</a> to push a git-svn repository up to Bitbucket. The jar can also create an authors file from the subversion repository to map a subversion user to the values git needs for a committer - first name, last name and email address. This made scripting the Bitbucket upload for each repository straightforward.
The jar also handles syncs to an existing Bitbucket repository so developers could continue committing to their svn projects and Bitbucket would automatically get updated. Note this only does fast-forward syncs, so the incremental sync stops working once commits are made directly to Bitbucket.</p> <h3>Crucible</h3> <p>Crucible is a tool to facilitate code reviews. It imports commits from your SCM tool, allows inline comments on the diffs and manages the code review life cycle of assigning reviewers, tracking who has approved the changes, and closing the review once approved. Crucible setup is fairly straightforward with a couple of caveats.</p> <p>Crucible needs to access your repositories to pull in the commit history. There is no native support for pointing Crucible at a Bitbucket team account and having Crucible automatically import each repository. There is a free <a href="https://marketplace.atlassian.com/plugins/com.atlassian.fecru.reposync.reposync">add-on</a> that works for an initial import, but initially it did not bring in new repositories that are added to the team account after the initial import. It turns out the update did not work because I was using a Bitbucket user that could not access the User list from the Bitbucket API. Changing the Bitbucket user to one with access to this API endpoint solved this problem. Incremental updates to the repository list are now working.</p> <p>While Crucible supports ssh access to git repositories in general, I ran into the problem described at https://answers.atlassian.com/questions/34283/how-to-connect-to-bitbucket-from-fisheye. Basically, Crucible does not support Bitbucket&#39;s ssh URL format. Instead of using ssh, I had to use https to connect to the Bitbucket repositories.
This means each repository configuration requires the Bitbucket username and password to be specified separately, which is not ideal.</p> <h3>Testing</h3> <p>After running git-svn clone on a few projects, I went ahead and pulled all the projects down with git-svn. The distributed nature of git helped testing because the entire repository could be represented locally without needing to upload it to any server to test the initial clones. However, cloning all the repositories took about 24 hours. During this time there was minimal CPU and I/O load, so I multithreaded the cloning jobs using 16 threads. This improved the time to just 1.5 hours on only a dual-core machine.</p> <p>I was initially hesitant to upload all the repositories to Bitbucket because I did not want to have to manually delete the repos if there was a problem. However, I found the Bitbucket <a href="https://confluence.atlassian.com/display/BITBUCKET/Use+the+Bitbucket+REST+APIs">REST API</a>. It is pretty well put together and was easy to use because it generally follows REST conventions. I&#39;ve yet to find anything that can be done in the UI that cannot be done in the API, which has been outstanding for adding additional niceties like commit hooks to push changes to Crucible for each repository. For the purposes of migration, the best feature was deleting repositories. Knowing I could automatically clean up any mistakes provided the confidence to just let it rip. I actually ended up using this to clean up two false-start migrations:</p> <ul> <li><p>git-svn has a &quot;show-ignore&quot; command to translate files ignored by subversion into a .gitignore file. I initially added .gitignore to the git repositories. However, this meant every repository had a commit in Bitbucket and so would no longer accept changes from subversion. This was resolved by adding .gitignore to subversion before the conversion.</p></li> <li><p>The first authors file I created was missing a few users.
This was discovered by noticing that the Bitbucket commit history did not look as nice as it should. It was nice to be able to just wipe it all out with a single command, fix the authors file, and redo the upload with a single command.</p></li> </ul> <h3>Post-migration</h3> <p>The time following the initial migration was when the automation really came in handy. A couple of developers were out of the office during the cutover. They were able to make commits of their local work to subversion, and then I could re-sync just those repositories even after other developers had begun working on other repositories in Bitbucket. This went very smoothly with no hand-wringing or diff patching required to make sure local work was not lost.</p> <h3>Wrap-up</h3> <p>Overall the migration went off with no hiccups. We&#39;re still tweaking our preferred settings for git pushes and pulls to get to our ideal workflow, but we&#39;re happy to be using Bitbucket. Crucible does not integrate with Bitbucket as nicely as it did with subversion in our old setup. Hopefully Atlassian will continue to make improvements to this integration, as we really like the Crucible code review workflow. I&#39;m always impressed by how automation begets automation. Once you&#39;ve taken the step of automating part of the process, it is so much easier to see the next step. We are already seeing some benefits from the time spent interacting with the Bitbucket API, as we&#39;re now able to add and modify commit hooks on all the repositories easily.</p> Using AWS Custom SSL Domain Names for CloudFront Donnie Thu, 20 Jun 2013 00:00:00 +0000 http://dev.bizo.com/2013/06/using-aws-custom-ssl-domain-names-for.html http://dev.bizo.com/2013/06/using-aws-custom-ssl-domain-names-for.html <p>AWS recently announced the limited availability of <a href="http://aws.typepad.com/aws/2013/06/custom-ssl-domain-names-root-domain-hosting-for-amazon-cloudfront.html">Custom SSL Domain Names for CloudFront</a>.
You have to <a href="http://aws.typepad.com/aws/2013/06/custom-ssl-domain-names-root-domain-hosting-for-amazon-cloudfront.html">request an invitation</a> in order to start using it, but I am guessing it won&#39;t be long until it has been rolled out to all customers.</p> <p>We&#39;ve been asking/waiting for Custom SSL on CloudFront for years and were excited when it finally came out. The sign-up was easy and we were approved a day or two later.</p> <h3>Existing Setup</h3> <p>Our main use case for Custom SSL on CloudFront involves replacing a service that proxies secure requests to our non-secure CloudFront distro. We proxy secure requests because we didn&#39;t want the secure CloudFront domain leaking out to our customers for various reasons including:</p> <ul> <li>We wanted to be able to point the domain elsewhere if we needed to</li> <li>We wanted to keep our branding consistent on domains. </li> </ul> <p>It basically looks like the following diagram:</p> <p><a href="http://dev.bizo.com/images/posts/proxy.jpg"><img src="http://dev.bizo.com/images/posts/proxy.jpg" alt=""></a></p> <p>The problem with having a proxy is twofold:</p> <ul> <li>We have to operate that proxy, which goes against our general rule to &quot;never operate services when AWS can do it for you&quot;</li> <li>We get subpar performance since requests are no longer served from a distributed, geo-located CDN.</li> </ul> <p>But we needed the flexibility and branding mentioned above, so we dealt with it. Not anymore...</p> <h3>Migrating to Custom SSL Domain Names for CloudFront</h3> <p>Once we got approval for custom SSL, the migration was pretty straightforward.
I am not going to regurgitate the <a href="http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/SecureConnections.html#CNAMEsAndHTTPS">detailed documentation</a> but will summarize the process.</p> <ul> <li><p>Upload your SSL cert and make sure the path starts with &quot;/cloudfront&quot; (this was annoying because we couldn&#39;t reuse the existing certificates that we were already using for ELBs)</p></li> <li><p>Update your CF distro (I did so via the AWS Console):</p> <ul> <li>add the domain name you want to support (e.g. secure-example.bizographics.com from above)</li> <li>choose the SSL cert that you uploaded in the first step</li> <li>Save</li> </ul></li> <li><p>Wait for the CF distro to redeploy the configuration change</p></li> <li><p>Update your Route53 DNS to point at the CF CNAME rather than the ELB endpoint</p></li> <li><p>Wait for DNS to update</p></li> <li><p>Shut down the proxy&#39;s ELB</p></li> </ul> <p>As you can see, this was pretty easy. Most of the time was spent waiting for the CF distro to re-deploy (10s of minutes max) and DNS to update (which can take several days). </p> <p>All-in-all, the minor annoyance of having two copies of the same SSL cert was worth the win of not having to operate the proxy and getting better performance for our customers. Check out the graph below showing the improved performance:</p> <p><a href="http://dev.bizo.com/images/posts/response-time.jpg"><img src="http://dev.bizo.com/images/posts/response-time.jpg" alt=""></a></p> <h3>Note on Cost</h3> <p>The <a href="http://aws.amazon.com/cloudfront/pricing/">cost of custom SSL on CF</a> seems OK but could be better, and the wording is not totally clear: &quot;You pay $600 per month for each custom SSL certificate associated with one or more CloudFront distributions.&quot; We have the same cert set up for multiple CF distros, but I am not sure if we will be charged $600 for each distro using the cert or $600 for each cert regardless of how many distros are using it.
(Will try to get clarification...) AWS claims the pricing is comparable to other similar offerings. That doesn&#39;t seem to jibe with their usual practice of driving costs much lower, but it is livable for now. </p> Scala Command-Line Hacks Alex Boisvert Mon, 22 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/scala-command-line-hacks.html http://dev.bizo.com/2013/04/scala-command-line-hacks.html <p>Do you like command-line scripting and one-liners with <a href="http://www.unixguide.net/unix/perl_oneliners.shtml">Perl</a>, <a href="http://reference.jumpingmonkey.org/programming_languages/ruby/ruby-one-liners.html">Ruby</a> and the like?</p> <p>For instance, here&#39;s a Ruby one-liner that uppercases the input:</p> <div class="highlight"><pre><code class="ruby">% echo matz | ruby -p -e &#39;$_.tr! &quot;a-z&quot;, &quot;A-Z&quot;&#39;
MATZ
</code></pre></div> <p>You like that kind of stuff? Yes? Excellent! Then I offer you a hacking idea for Scala.</p> <p>As you may know, Scala offers a similar capability with the <code>-e</code> command-line option, but it&#39;s fairly limited in its basic form because of the necessary boilerplate code to set up iteration over the standard input...
it just begs for a simple HACK!</p> <p>Using a simple bash wrapper,</p> <div class="highlight"><pre><code class="bash">#!/bin/bash
#
# Usage: scala-map MAP_CODE
#
code=$(cat &lt;&lt;END
scala.io.Source.stdin.getLines map { $@ } foreach println
END
)
scala -e &quot;$code&quot;
</code></pre></div> <p>then we can express similar one-liners using Scala code and the standard library:</p> <div class="highlight"><pre><code class="bash">% ls | scala-map _.toUpperCase
# FOO
# BAR
# BAZ
# ...
% echo &quot;foo bar baz&quot; | scala-map &#39;_.split(&quot; &quot;).mkString(&quot;-&quot;)&#39;
# foo-bar-baz
</code></pre></div> <p>Nifty, right?
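</p> <p>For comparison, the same pipe-style per-line mapping can be written JVM-side in plain Java. This is just a sketch: the class name and the hard-coded transformation are illustrative, not part of any wrapper we ship:</p> <div class="highlight"><pre><code class="java">import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Rough Java analogue of the scala-map wrapper: read stdin line by line,
// apply a transformation, and print each result.
public class LineMap {
    // The per-line transformation; swap in whatever one-liner logic you need.
    static String transform(String line) {
        return line.toUpperCase();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(transform(line));
        }
    }
}
</code></pre></div> <p>Of course, the whole point of the bash wrapper is that you don&#39;t have to write a class at all.</p> <p>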
Here&#39;s another script template to fold over the standard input,</p> <div class="highlight"><pre><code class="bash">#!/bin/bash
#
# Usage: scala-fold INITIAL_VALUE FOLD_CODE
#
# where the following val&#39;s can be used in FOLD_CODE:
#
# `acc` is bound to the accumulator value
# `line` is bound to the current line
#
code=$(cat &lt;&lt;END
println(scala.io.Source.stdin.getLines.foldLeft($1) { case (acc, line) =&gt; $2 })
END
)
scala -e &quot;$code&quot;
</code></pre></div> <p>Now if you wanted to calculate the sum of the second column of space-separated input, you&#39;d write:</p> <div class="highlight"><pre><code class="bash">$ cat | scala-fold 0 &#39;acc + (line.split(&quot; &quot;)(1).toInt)&#39;
foo 1
bar 2
baz 3
(CTRL-D)
6
</code></pre></div> <p>You get the idea ... hopefully this inspires you to try a few things with Scala scripting templates!</p> Efficiency & Scalability Alex Boisvert Fri, 19 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/efficiency-scalability.html http://dev.bizo.com/2013/04/efficiency-scalability.html <p>Software engineers know that distributed systems are often hard to scale and many can intuitively point to reasons why this is the case by bringing up points of contention, bottlenecks and latency-inducing operations.
Indeed, there exists a plethora of reasons and explanations as to why most distributed systems are inherently hard to scale, from the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> to the scarcity of certain resources, e.g., RAM, network bandwidth ...</p> <p>It&#39;s said that good engineers know how to identify resources that may not appear to be relevant to scaling initially but will become more significant as particular kinds of demand grow. If that’s the case, then great engineers know that system architecture is often the determining factor in system scalability — that a system’s own architecture may be its worst enemy — so they define and structure systems in order to avoid fundamental flaws.</p> <p>In this post, I want to explore the relationship between system efficiency and scalability in distributed systems; they are to some extent two sides of the same coin. We’ll consider specifically two common system architecture traits: replication and routing. Some of this may seem obvious to some of you, but it’s always good to back intuition with some additional reasoning.</p> <p>Before we go any further, it’s helpful to formulate a definition of efficiency applicable to our context:</p> <p><strong>efficiency</strong> is the extent to which work is performed relative to the total work and/or cost incurred.</p> <p>We’ll also use the following definition of scalability,</p> <p><strong>scalability</strong> is the ability of a system to accommodate an increased workload by repeatedly applying a cost-effective strategy for extending a system’s capacity.</p> <p>So, scalability and efficiency are both determined by cost-effectiveness, with the distinction that scalability is a measure of marginal gain. Stated differently, if efficiency decreases significantly as a system grows, then a system is said to be <strong>non-scalable</strong>.</p> <p>Enough rambling, let’s get our thinking caps on!
Since we’re talking about distributed systems, it’s practically inevitable to compare against traditional single-computer systems, so we’ll start with a narrow definition of system efficiency:</p> <div class="highlight"><pre><code class="bash">             average work for processing a request on a single computer
Efficiency = ——————————————————————————————————————————————————————————
             average work for processing a request in distributed system
</code></pre></div> <p>This definition is a useful starting point for our exploration because it abstracts out the nature of the processing that’s happening within the system; it’s overly simple but it allows us to focus our attention on the big picture.</p> <p>More succinctly, we&#39;ll write:</p> <div class="highlight"><pre><code class="bash">(1) Efficiency = Wsingle / Wcluster
</code></pre></div> <h2>Replication Cost</h2> <p>Many distributed systems replicate some or all of the data they process across different processing nodes (to increase reliability, availability or read performance), so we can model:</p> <div class="highlight"><pre><code class="bash">(2) Wcluster = Wsingle + (r x Wreplication)
</code></pre></div> <p>where <code>r</code> is the number of replicas in the system and <code>Wreplication</code> is the work required to replicate the data to other nodes. <code>Wreplication</code> is typically lower than <code>Wsingle</code>, though realistically they have different cost models (e.g., <code>Wsingle</code> may be CPU-intensive whereas <code>Wreplication</code> may be I/O-intensive).
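</p> <p>To make the cost model in (1) and (2) concrete, here is a small numeric sketch; the class name and sample numbers are made up for illustration:</p> <div class="highlight"><pre><code class="java">// Numeric sketch of equations (1) and (2): total cluster work under
// replication, and the resulting efficiency.
public class ReplicationCost {
    // Wcluster = Wsingle + (r x Wreplication)   -- equation (2)
    static double clusterWork(double wSingle, int r, double wReplication) {
        return wSingle + r * wReplication;
    }

    // Efficiency = Wsingle / Wcluster           -- equation (1)
    static double efficiency(double wSingle, int r, double wReplication) {
        return wSingle / clusterWork(wSingle, r, wReplication);
    }

    public static void main(String[] args) {
        // 3 replicas, each costing 10% of a single-node request:
        // Wcluster = 1.3 units of work per request, Efficiency = 1/1.3 = 76.9%.
        System.out.println(efficiency(1.0, 3, 0.10));
    }
}
</code></pre></div> <p>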
If <code>n</code> is the number of nodes in the system, then <code>r</code> may be as large as <code>(n-1)</code>, meaning replicating to all other nodes, though most systems will only replicate to 2 or 3 other nodes — for good reason, as we’ll discover later.</p> <p>We’ll now define the replication coefficient, which expresses the relative cost of replication compared to the cost of processing the request on a single node:</p> <div class="highlight"><pre><code class="bash">(3) Qreplication = Wreplication / Wsingle
</code></pre></div> <p>Solving (3) for <code>Wreplication</code>, we get:</p> <div class="highlight"><pre><code class="bash">(4) Wreplication = Qreplication x Wsingle
</code></pre></div> <p>If we substitute <code>Wreplication</code> in (2) by the equation formulated in (4), we obtain:</p> <div class="highlight"><pre><code class="bash">(5) Wcluster = Wsingle + (r x Qreplication x Wsingle)
</code></pre></div> <p>We now factor out <code>Wsingle</code> on the right-hand side:</p> <div class="highlight"><pre><code class="bash">(6) Wcluster = Wsingle x [ 1 + (r x Qreplication) ]
</code></pre></div> <p>Taking the efficiency equation (1) and substituting from (6), the equation becomes:</p> <div class="highlight"><pre><code class="bash">(7) Efficiency = Wsingle / [ Wsingle x ( 1 + (r x Qreplication) ) ]
</code></pre></div> <p>We then cancel <code>Wsingle</code> to obtain the final efficiency for a replicating distributed system:</p> <div class="highlight"><pre><code class="bash">(8) Efficiency (replication) = 1 / [ 1 + (r x Qreplication) ]
</code></pre></div> <p>As expected, both <code>r</code> and <code>Qreplication</code> are critical factors determining efficiency.</p> <p>Interpreting this last equation and assuming <code>Qreplication</code> is a constant inherent to the system’s processing, our two takeaways are:</p> <ul> <li>If the system replicates to all other nodes (i.e., <code>r = n - 1</code>), it becomes clear that the efficiency of the system will degrade as more nodes are added and will approach zero as <code>n</code> becomes sufficiently large.</li> </ul> <p>To illustrate this, let&#39;s assume <code>Qreplication = 10%</code>: </p> <ul> <li>Efficiency (r = 1, n = 2) = 91%</li> <li>Efficiency (r = 2, n = 3) = 83%</li> <li>Efficiency (r = 3, n = 4) = 76%</li> <li>Efficiency (r = 4, n = 5) = 71%</li> <li>Efficiency (r = 5, n = 6) = 67%</li> <li>...</li> </ul> <p>In other words, <strong>fully-replicated distributed systems don&#39;t scale.</strong></p> <ul> <li>For a system to scale, the replication factor should be a (small) constant.</li> </ul> <p>Let&#39;s illustrate this with <code>Qreplication</code> fixed at 10% and using a replication factor of 3:</p> <ul> <li>Efficiency (r = 3, n = 4) = 76%</li> <li>Efficiency (r = 3, n = 5) = 76%</li> <li>Efficiency (r = 3, n = 6) = 76%</li> <li>Efficiency (r = 3, n = 7) = 76%</li> <li>Efficiency (r = 3, n = 8) = 76%</li> <li>...</li> </ul> <p>As we can see, <strong>fixed-replication-factor distributed systems scale</strong> — although, as you might expect, they do not exhibit the same
efficiency as a single-node system. At worst, the efficiency will be <code>1/r</code> — as you would intuitively expect.</p> <h2>Routing Cost</h2> <p>When a distributed system routes requests to nodes holding the relevant information (e.g., a partially replicated system, <code>r &lt; n</code>), its working model may be defined as:</p> <div class="highlight"><pre><code class="bash">(9) Wcluster = (r/n) x Wsingle + ((n-r)/n) x (Wrouting + Wsingle)
</code></pre></div> <p>The above equation represents the fact that <code>r</code> out of <code>n</code> requests are processed locally whereas the remainder of the requests are routed to and processed on a different node.</p> <p>Let’s define the routing coefficient to be:</p> <div class="highlight"><pre><code class="bash">(10) Qrouting = Wrouting / Wsingle
</code></pre></div> <p>We substitute <code>Wrouting</code> from (10) into (9) to obtain:</p> <div class="highlight"><pre><code class="bash">(11) Wcluster = (r/n) x Wsingle + ((n-r)/n) x [ (Qrouting x Wsingle) + Wsingle ]
</code></pre></div> <p>and taking the efficiency equation (1), substituting from (11), the simplified equation becomes:</p> <div class="highlight"><pre><code class="bash">(12) Efficiency (routing) = n /
[ n + ((n - r) x Qrouting) ]
</code></pre></div> <p>Looking at this last equation, we can infer that:</p> <ul> <li><p>As the system grows and <code>n</code> goes towards infinity, the efficiency of the system approaches 1 / (1 + Qrouting). The efficiency is not dependent on the actual number of nodes within the system; therefore, routing-based systems generally scale. (But you knew that already.)</p></li> <li><p>If the number of nodes is large compared to the replication factor (n &gt;&gt; r) and Qrouting is significant (1.0, i.e., the same cost as Wsingle), then the efficiency is ½, or 50%. This matches the intuition that the system is routing practically all requests and therefore spending half of its efforts on routing. The system is scaling linearly, but it’s costing twice as much to operate (for every node) compared to a single-node system.</p></li> <li><p>If the cost of routing is insignificant (Qrouting = 0), the efficiency is 100%. That’s right, if it doesn’t cost anything to route the request to a node that can process it, the efficiency is the same as a single-node system.</p></li> </ul> <p>Let’s consider a practical distributed system with 10 nodes (n = 10), a replication factor of 3 (r = 3), and a relative routing cost of 10% (Qrouting = 0.10). This system would have an efficiency of 10 / [ 10 + (7 x 0.10) ] ≈ 93.46%. As you can see, routing-based distributed systems can be pretty efficient if <code>Qrouting</code> is relatively small.</p> <h3>Where To Now?</h3> <p>Well, this was a fun exploration of system scalability in the abstract. We came up with interesting equations to describe the scalability of both data-replicating and request-routing architectures. With some tinkering, these can serve as a good basis for reasoning about some of your distributed systems.</p> <p>In real life, however, there are many other aspects to consider when scaling systems.
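</p> <p>As a quick sanity check on the arithmetic, the routing-efficiency formula can be evaluated directly; this snippet (names illustrative) reproduces the worked example and the limiting behaviour:</p> <div class="highlight"><pre><code class="java">// Efficiency (routing) = n / [ n + ((n - r) x Qrouting) ]
public class RoutingCost {
    static double efficiency(int n, int r, double qRouting) {
        return n / (n + (n - r) * qRouting);
    }

    public static void main(String[] args) {
        // Worked example: n = 10, r = 3, Qrouting = 0.10 gives about 93.46%.
        System.out.println(efficiency(10, 3, 0.10));
        // As n grows with r fixed, efficiency approaches 1 / (1 + Qrouting).
        System.out.println(efficiency(1000000, 3, 0.10)); // about 90.9%
    }
}
</code></pre></div> <p>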
In fact, it often feels like a whack-a-mole hunt; you never know where the next performance non-linearity is going to rear its ugly head. But if you use either (or both) of the data-replicating and request-routing style architectures with reasonable replication factors, and you manage to keep your replication/routing costs well below your single-node processing costs, you may find some comfort in knowing that at least you haven’t introduced a fundamental scaling limitation into your system.</p> Sensible Defaults for Apache HttpClient Mark Dietz Mon, 15 Apr 2013 00:00:00 +0000 http://dev.bizo.com/2013/04/sensible-defaults-for-apache-httpclient.html http://dev.bizo.com/2013/04/sensible-defaults-for-apache-httpclient.html <p>Before coming to Bizo, I wrote a web service client that retrieved daily XML reports over HTTP using the Apache <a href="http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/client/DefaultHttpClient.html">DefaultHttpClient</a>. Everything went fine until one day the connection simply hung forever. We found this odd because we had set the connection timeout. It turned out we also needed to set the socket timeout (<code>HttpConnectionParams.SO_TIMEOUT</code>). The default for both the connection timeout (max time to wait for a connection) and the socket timeout (max time to wait between consecutive data packets) is infinity. The server was accepting the connection but then not sending any data, so our client hung forever without even reporting any errors. Rookie mistake, but everyone is a rookie at least once. Even if you are an expert with HttpClient, chances are there will be someone maintaining your code in the future who is not.</p> <p>Another problem with defaults using HttpClient is with <a href="http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/impl/conn/PoolingClientConnectionManager.html">PoolingClientConnectionManager</a>.
PoolingClientConnectionManager has two attributes: MaxTotal and MaxPerRoute. MaxTotal is the maximum total number of connections in the pool. MaxPerRoute is the maximum number of connections to a particular host. If the client attempts to make a request and either of these maximums has been reached, then by default the client will block until a connection is free. Unfortunately, the default for MaxTotal is 20 and the default MaxPerRoute is only 2. In a SOA, it is common to have many connections from a client to a particular host. The limit of 2 (or even 1) connections per host makes sense for a polite web crawler, but in a SOA, you are likely going to need a lot more. Even the 20 maximum total connections in the pool is likely much lower than desired.</p> <p>If the client does reach the MaxPerRoute or the MaxTotal connections, it will block until the connection manager timeout (<code>ClientPNames.CONN_MANAGER_TIMEOUT</code>) is reached. This timeout controls how long the client will wait for a connection from the connection manager. Fortunately, if this timeout is not set directly, it will default to the connection timeout if that is set, which will prevent the client from queuing up requests indefinitely.</p> <h3>What would a better set of defaults be?</h3> <p>A good default is something that is &quot;safe&quot;. A safe default for a connection timeout is long enough to not give up waiting when things are working normally, but short enough to not cause system instability when the service is down. Unfortunately, safe is context-dependent. Safe for a daily data sync process and safe for an in-thread service request handler are very different. Safe for a request that is critical to the correct functioning of the program is different than safe for some ancillary logging that is ok to miss 1% of the time.
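</p> <p>For reference, explicitly overriding the dangerous defaults discussed above looks roughly like this with the 4.2-era API; the numeric values here are placeholders to be tuned per use case, not recommendations:</p> <div class="highlight"><pre><code class="java">import org.apache.http.client.params.ClientPNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.PoolingClientConnectionManager;
import org.apache.http.params.HttpConnectionParams;
import org.apache.http.params.HttpParams;

// Build a client with every dangerous default overridden explicitly.
public class SafeClientFactory {
    public static DefaultHttpClient newClient() {
        PoolingClientConnectionManager cm = new PoolingClientConnectionManager();
        cm.setMaxTotal(200);           // default is only 20
        cm.setDefaultMaxPerRoute(100); // default is only 2

        DefaultHttpClient client = new DefaultHttpClient(cm);
        HttpParams params = client.getParams();
        HttpConnectionParams.setConnectionTimeout(params, 10000); // ms; default is infinite
        HttpConnectionParams.setSoTimeout(params, 10000);         // ms; default is infinite
        params.setLongParameter(ClientPNames.CONN_MANAGER_TIMEOUT, 10000L);
        return client;
    }
}
</code></pre></div> <p>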
A default for timeouts that is safe in all cases is not really possible.</p> <p>Safe defaults for PoolingClientConnectionManager&#39;s MaxTotal and MaxPerRoute should be big enough that they won’t be hit unless there is a bug. New in version 4.2 is the <a href="http://hc.apache.org/httpcomponents-client-ga/fluent-hc/index.html">fluent-hc</a> API for making http requests. This uses a PoolingClientConnectionManager with defaults of 200 MaxTotal and 100 MaxPerRoute. We are using these same defaults for all our configurations.</p> <p>Note that the fluent-hc API is very nice, but requires setting the connection timeouts on each request. This is perfect if you need to tune the settings for each request, but it does not provide a safety check against accidentally leaving the timeout infinite.</p> <h3>How can you help out a new dev implementing a new HTTP client?</h3> <p>If you can&#39;t have a safe default and the existing defaults are decidedly not safe, then it is best to require a configuration. We created a wrapper for PoolingClientConnectionManager that requires the developer to choose a configuration instead of letting the defaults silently take effect. One way to require a configuration is to force passing in the timeout values. However, it can be hard to know the right values, especially when stepping into a new environment. To help a developer implementing a new client at Bizo, we created some canonical configurations in the wrapper based on our experience working in our production environment on AWS.
The configurations are:</p> <table class="table"> <thead> <tr> <th>Configuration</th> <th>Connection timeout</th> <th>Socket timeout</th> <th>MaxTotal</th> <th>MaxPerRoute</th> </tr> </thead> <tbody> <tr> <td>SameRegion</td> <td>125 ms</td> <td>125 ms</td> <td>200</td> <td>100</td> </tr> <tr> <td>SameRegionWithUSEastFailover</td> <td>1 second</td> <td>1 second</td> <td>200</td> <td>100</td> </tr> <tr> <td>CrossRegion</td> <td>10 seconds</td> <td>10 seconds</td> <td>200</td> <td>100</td> </tr> <tr> <td>MaxTimeout</td> <td>1 minute</td> <td>5 minutes</td> <td>200</td> <td>100</td> </tr> </tbody> </table> <p>Clients with critical latency requirements can use the SameRegion configuration and need to make sure they are connecting to a service in the same AWS region. Back-end processes that can tolerate latency can use the MaxTimeout configuration. Now when a developer is implementing a new client, the timeouts used by other services are readily available without having to hunt through other code bases. The developer can compare these with the current use case and choose an appropriate configuration. Additionally, if we learn that some of these configurations need to be tweaked, then we can easily modify all affected code.</p> <p>Commonly, the socket timeout will need to be adjusted for a specific service. After a connection is established, a service will not typically start sending its response until it has finished whatever calculation was requested. This can vary greatly even for different parameters on the same service endpoint. The socket timeout will need to be set based on the expected response times of the service.</p> <p>It is easy to miss a particular setting even if you know it is there. At Bizo, we are always looking for ways to solve a problem in one place.
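</p> <p>In sketch form, the canonical-configuration idea looks like the following; the enum name is hypothetical (not our actual wrapper), but the values mirror the table above:</p> <div class="highlight"><pre><code class="java">// Force callers to pick a named profile instead of inheriting unsafe defaults.
public enum HttpClientConfig {
    SAME_REGION(125, 125),
    SAME_REGION_WITH_US_EAST_FAILOVER(1000, 1000),
    CROSS_REGION(10000, 10000),
    MAX_TIMEOUT(60000, 300000);

    // Pool sizes are the same across profiles, per the table.
    public static final int MAX_TOTAL = 200;
    public static final int MAX_PER_ROUTE = 100;

    public final int connectionTimeoutMs;
    public final int socketTimeoutMs;

    HttpClientConfig(int connectionTimeoutMs, int socketTimeoutMs) {
        this.connectionTimeoutMs = connectionTimeoutMs;
        this.socketTimeoutMs = socketTimeoutMs;
    }

    public static void main(String[] args) {
        for (HttpClientConfig c : values()) {
            System.out.println(c + ": connect=" + c.connectionTimeoutMs
                    + "ms, socket=" + c.socketTimeoutMs + "ms");
        }
    }
}
</code></pre></div> <p>An actual wrapper would feed the chosen profile into the connection manager and client params at construction time.</p> <p>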
We are hopeful that this will eliminate any issues we have had with bad defaults in our HttpClients.</p> Map-side aggregations in Apache Hive Darren Lee Mon, 18 Feb 2013 00:00:00 +0000 http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html http://dev.bizo.com/2013/02/map-side-aggregations-in-apache-hive.html <p>When running large-scale Hive reports, one error we occasionally run into is the following:</p> <h4>Possible error:</h4> <p>Out of memory due to hash maps used in map-side aggregation.</p> <h4>Solution:</h4> <p>Currently <code>hive.map.aggr.hash.percentmemory</code> is set to 0.5. Try setting it to a lower value, i.e. <code>&#39;set hive.map.aggr.hash.percentmemory = 0.25;&#39;</code> </p> <p>What&#39;s going on is that Hive is trying to optimize the query by performing a map-side aggregation: a partial aggregation inside the mapper, which results in the mapper outputting fewer rows. In turn, this reduces the amount of information that Hadoop needs to sort and distribute to the reducers.</p> <p>Let&#39;s think about what the Hadoop job looks like with the canonical word count example.</p> <p>In the word count example, the naive approach is for the mapper to tokenize each row of input and output the key-value pair (#{token}, 1). The Hadoop framework will sort these pairs by the tokens, and the reducer sums the values to produce the total counts for each token.</p> <p>Using a map-side aggregation, the mappers would instead tokenize each row and store partial counts in an in-memory hash map. (More precisely, the mappers are storing each key with the corresponding partial aggregation, which is just a count in this case.) Periodically, the mappers will output the pairs (#{token}, #{token_count}). The Hadoop framework again sorts these pairs and the reducers sum the values to produce the total counts for each token.
In this case, the mappers will each output one row for each token every time the map is flushed, instead of one row for each occurrence of each token. The tradeoff is that they need to keep a map of all tokens in memory.</p> <p>By default, Hive will try to use the map-side aggregation optimization, but it falls back to the standard approach if the hash map is not producing enough of a memory savings. After processing 100,000 rows (modifiable via <code>hive.groupby.mapaggr.checkinterval</code>), Hive will check the number of items in the hash map. If it exceeds 50% (modifiable via <code>hive.map.aggr.hash.min.reduction</code>) of the number of rows read, the map-side aggregation will be aborted.</p> <p>Hive will also estimate the amount of memory needed for each entry in the hash map and flush the map to the reducers whenever the size of the map exceeds 50% of the available mapper memory (modifiable via <code>hive.map.aggr.hash.percentmemory</code>). This, however, is an estimate based on the number of rows and the expected size of each row, so if the memory usage per row is unexpectedly high, the mappers may run out of memory before the hash map is flushed to the reducers.</p> <p>In particular, if a query uses a count distinct aggregation, the partial aggregations actually contain a list of all values seen. As more distinct values are seen, the amount of memory used by the map will increase without necessarily increasing the number of rows of the map, which is what Hive uses to determine when to flush the partial aggregations to the reducers.</p> <p>Whenever a mapper runs out of memory, a group by clause is present, and map-side aggregation is turned on, Hive will helpfully suggest that you reduce the flush threshold to avoid running out of memory.
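</p> <p>The row-count tradeoff is easy to see in a toy model of the mapper&#39;s hash aggregation; this sketch (names and thresholds illustrative) flushes on entry count, whereas Hive&#39;s actual check is based on estimated memory:</p> <div class="highlight"><pre><code class="java">import java.util.HashMap;

// Toy model of map-side aggregation for word count: keep partial counts in a
// hash map and periodically flush (emit) the partial aggregates to reducers.
public class MapSideAggregation {
    // Returns the number of rows the mapper emits, flushing the hash map
    // every checkInterval input rows.
    static int emittedRows(String[] tokens, int checkInterval) {
        HashMap partial = new HashMap();
        int emitted = 0;
        int seen = 0;
        for (String token : tokens) {
            Integer count = (Integer) partial.get(token);
            partial.put(token, count == null ? 1 : count + 1);
            if (++seen % checkInterval == 0) { // periodic flush to reducers
                emitted += partial.size();
                partial.clear();
            }
        }
        return emitted + partial.size();       // final flush
    }

    public static void main(String[] args) {
        String[] input = { "a", "b", "a", "a", "c", "b" };
        // A naive mapper emits 6 rows (one per occurrence); with aggregation
        // and no intermediate flush, only 3 rows (one per distinct token).
        System.out.println(emittedRows(input, 100));
    }
}
</code></pre></div> <p>More frequent flushes emit more rows but cap the map&#39;s memory footprint, which is exactly the dial the Hive settings above control.</p> <p>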
This will lower the threshold (in rows) at which Hive will automatically flush the map, but it may not help if the map size (in bytes) is growing independently of the number of rows.</p> <p>Some alternate solutions include simply turning off map-side aggregations (<code>set hive.map.aggr = false</code>), allocating more memory to your mappers via the Hadoop configuration, or restructuring the query so that Hive will pick a different query plan.</p> <p>For example, a simple</p> <div class="highlight"><pre><code class="sql">select count(distinct v) from tbl
</code></pre></div> <p>can be rewritten as</p> <div class="highlight"><pre><code class="sql">select count(1) from (select v from tbl group by v) t
</code></pre></div> <p>This latter query avoids the count distinct aggregation and may be more efficient for some queries.</p> Reader Driven Development larry ogrodnek Fri, 15 Feb 2013 00:00:00 +0000 http://dev.bizo.com/2013/02/reader-driven-development.html http://dev.bizo.com/2013/02/reader-driven-development.html <p>In this talk on <a href="http://vimeo.com/14313378">Effective ML</a>, <a href="http://cufp.org/users/yminsky">Yaron Minsky</a> talks about Reader Driven Development. That is, writing your code with the reader in mind.
Making decisions that will make the code more easily read and understood by other developers down the line.</p> <blockquote> <p>The interest of the reader always pushes in the direction of clarity, simplicity, and the ability to change the code later. In most real projects, code is read and changed many more times than it is written. The reader&#39;s interests are paramount in that regard.</p> </blockquote> <p>When writing code, the interests of the reader and writer may be at odds, and when faced with a decision, always err in the direction of the reader. The reader is always right. Regardless of team size, it&#39;s helpful to program this way. Even code you&#39;ve written yourself may not be as clear 6 months or a year later otherwise. Great perspective, and I think it fits in nicely with previous posts here on <a href="http://dev.bizo.com/2012/06/the-golden-rule-of-programming-style.html">programming style</a> and <a href="http://dev.bizo.com/2012/03/on-code-reviews-and-developer-feedback.html">code reviews</a> (tend to agree with your reviewers, they are the audience!).</p> Asanban: Lean Development with Asana and Kanban Pat Gannon Wed, 23 Jan 2013 00:00:00 +0000 http://dev.bizo.com/2013/01/asanban-lean-development-with-asana-and.html http://dev.bizo.com/2013/01/asanban-lean-development-with-asana-and.html <p>On Bizo&#39;s External Apps team (aka &#39;xapps&#39;), we&#39;ve been using a Kanban system to manage our work. All of Bizo Engineering uses Asana to track tasks, which isn&#39;t specifically designed for Kanban. We&#39;ve settled on a set of conventions that we use in Asana which enable our Kanban system. These conventions also help us to track metrics like the average lead time from month to month.</p> <h3>Background</h3> <p><a href="http://en.wikipedia.org/wiki/Kanban">Kanban</a> is a second-generation Agile software development methodology. The focus is on finding and fixing bottlenecks, as well as removing waste by limiting work-in-progress.
(The &quot;WIP&quot; limits referenced in this post are the number of work items that are allowed to be in a particular stage of the system at one time.) Adopting a Kanban system has made things easier for engineers, increased our efficiency, and proven very popular with our Product Management folks as well. We are now focused on delivering value incrementally rather than specifying and implementing larger chunks of work. If you&#39;re interested in adopting Kanban, I recommend reading David Anderson&#39;s seminal book on the topic: <a href="http://amzn.com/0984521402">Kanban: Successful Evolutionary Change for Your Technology Business</a>.</p> <h3>Conventions</h3> <p>Each stage of work in our value chain is a priority heading in Asana. The name of the priority heading follows the convention &quot;{STEP NAME} ({WIP}):&quot;, e.g. &quot;Dev Ready (10):&quot;. The steps that are earliest in the value chain are at the bottom of our Asana project, with tasks moving upwards through each stage until they reach &quot;Production (15):&quot; at the top when the functionality described by the task has been delivered to production. Once Product Management has verified that the functionality described by a task is functioning correctly in production, they mark the task as complete. We use tags to represent work item types, although this is fairly limited at present.</p> <p><a href="http://dev.bizo.com/images/posts/Screen+Shot+2013-01-23+at+2.59.18+PM.png"><img src="http://dev.bizo.com/images/posts/Screen+Shot+2013-01-23+at+2.59.18+PM.png" alt=""></a></p> <h3>Metrics</h3> <p>One of the most basic metrics to track in a Kanban system is the average amount of lead time (the time it takes from when a task gets added to the input queue until value is delivered). I have created some tooling that allows us to track this systematically. I&#39;ll first describe what it does, and then how you can use it.</p> <p>The first piece of the tooling bulk loads task data from the Asana API into MongoDB. 
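The lead-time figure itself is simple arithmetic once the timestamps are in hand. Here is a hypothetical Scala sketch of the calculation, not asanban's actual code; the names `TaskTimes`, `leadTimeDays`, and `averageLeadTimeDays` are illustrative assumptions:

```scala
import java.time.{Duration, Instant}

// Hypothetical record of when a task entered the input queue and when it shipped.
case class TaskTimes(createdAt: Instant, completedAt: Instant)

object LeadTime {
  // Lead time for one task, in (possibly fractional) days.
  def leadTimeDays(t: TaskTimes): Double =
    Duration.between(t.createdAt, t.completedAt).toMillis / 86400000.0

  // Average lead time across a batch of completed tasks.
  def averageLeadTimeDays(tasks: Seq[TaskTimes]): Double =
    tasks.map(leadTimeDays).sum / tasks.size
}
```

A task created on Jan 1 and delivered on Jan 4 contributes a lead time of 3 days; averaging such figures per calendar month gives the month-to-month trend.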
The API returns JSON and I just store JSON as-is in MongoDB, which works out well since MongoDB speaks JSON natively. One hiccup is that once tasks are archived in Asana, you can no longer obtain information about them through the API. Accordingly, the bulk load needs to be scheduled to run on a regular basis (in our case, every night) so that we don&#39;t lose information about archived tasks. Furthermore, we have a policy that tasks should not be archived until they have been completed for at least 24 hours, so that the bulk loader will always run at least once after a task has been completed before it gets archived. After loading the task data, the bulk loader will create data describing how much time each task spent in each state, as well as how long each task took (in days) to complete from start to finish (lead time).</p> <p>The other piece is a Sinatra web service that runs a map-reduce against the lead time data created by the bulk loader and serves lead times by month as JSON. It can also aggregate by year or day (but I don&#39;t think aggregating by day is useful).</p> <p>I have packaged up both of those pieces into a gem called &quot;asanban&quot;, which you can use. The source code and instructions for installation and usage are here: <a href="https://github.com/patgannon/asanban">https://github.com/patgannon/asanban</a></p> <h3>Pain Points</h3> <p>There are a couple of problems I&#39;ve run into using Asana with a Kanban system. The first is that there&#39;s no way to enforce WIP limits. Users just have to be mindful of the limits shown in the priority headers. I have been thinking about writing a nightly report that uses the data created by the bulk loader to find violated WIP limits and send out emails, but I haven&#39;t gotten to it yet. (This tooling is essentially a hack day project at this point.) 
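Such a check is mostly bookkeeping: parse the limit out of each priority heading (the &quot;{STEP NAME} ({WIP}):&quot; convention described above) and compare it to the current task count per stage. A hypothetical sketch, with `WipCheck` and its method names being made-up:

```scala
object WipCheck {
  // Parses a priority heading like "Dev Ready (10):" into (stage name, WIP limit).
  private val Heading = """(.+) \((\d+)\):""".r

  def parseHeading(h: String): Option[(String, Int)] = h match {
    case Heading(stage, limit) => Some(stage -> limit.toInt)
    case _                     => None
  }

  // Returns the stages whose current task count exceeds their WIP limit.
  def violations(headings: Seq[String], counts: Map[String, Int]): Seq[String] =
    headings.flatMap(parseHeading).collect {
      case (stage, limit) if counts.getOrElse(stage, 0) > limit => stage
    }
}
```

With headings `Seq("Dev Ready (10):", "Production (15):")` and 12 tasks currently in &quot;Dev Ready&quot;, the report would flag only that stage.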
There is also no functionality to facilitate different classes of service (SLAs and WIP break-downs), but maybe those could be supported using the same kind of nightly report.</p> <p>Another problem I&#39;ve run into is that task sizes can be all over the place, which reduces the meaningfulness of the metrics. Some Kanban practitioners use hierarchical work items to address this kind of variability in size. Stories can be grouped into epics and/or broken down into &quot;grains&quot;. Asana does support sub-tasks, so I may recommend that we use those to break down large work items in the future, at which point the bulk loader would be modified to track metrics by sub-task (for tasks that have them, which would be assumed to be epics).</p> <h3>Next Steps</h3> <p>As we fine-tune our Kanban process, we&#39;ll use these lead time metrics to verify that, when we&#39;ve made an adjustment (changing a WIP limit or adding a buffer, for example), our performance improves. I&#39;d like to have more metrics so that we can have even better insight into our system in the future. For example, I&#39;d like to see the average time tasks spend in particular steps, the average amount of total WIP, as well as WIP in each step (shown over time) and failure load.</p> <p>The first order of business moving forward on this will probably be an improved charting interface on top of the existing metrics. Also, it would be nice if the bulk loader used a scheduling library so that folks don&#39;t have to manually schedule it in cron. It also could use some automated tests!!! As I mentioned previously, I&#39;ve just been working on this on hack days, so if there&#39;s something you&#39;d like to see done soon, well... pull requests will be gladly accepted! 
:)</p> What Makes Spark Exciting Stephen Haberman Mon, 21 Jan 2013 00:00:00 +0000 http://dev.bizo.com/2013/01/what-makes-spark-exciting.html http://dev.bizo.com/2013/01/what-makes-spark-exciting.html <p>At <a href="http://www.bizo.com">Bizo</a>, we&#39;re currently evaluating/prototyping <a href="http://www.spark-project.org">Spark</a> as a replacement for <a href="http://hive.apache.org/">Hive</a> for our batch reports. </p> <p>As a brief intro, Spark is an alternative to Hadoop. It provides a cluster computing framework for running distributed jobs. Similar to Hadoop, you provide Spark with jobs to run, and it handles splitting up the job into small tasks, assigning those tasks to machines (optionally with Hadoop-style data locality), issuing retries if tasks fail transiently, etc. </p> <p>In our case, these jobs are processing a non-trivial amount of data (log files) on a regular basis, for which we currently use Hive. </p> <h3>Why Replace Hive?</h3> <p>Admittedly, Hive has served us well for quite a while now. (One of our engineers even built a custom &quot;Hadoop on demand&quot; framework for running periodic on-demand Hadoop/Hive jobs in EC2 several months before <a href="http://aws.amazon.com/elasticmapreduce/">Amazon Elastic Map Reduce</a> came out.) </p> <p>Without Hive, it would have been hard for us to provide the same functionality, probably at all, let alone in the same time frame. That said, it has gotten to the point where Hive is more frequently invoked in negative contexts (&quot;damn it, Hive&quot;) than positive. </p> <p>Personally, I admit I even try to avoid tasks that involve working with Hive. I find it to be frustrating and, well, just not a lot of fun. Why? Two primary reasons: </p> <h4>1. Hive jobs are hard to test</h4> <p>Bizo has a culture of excellence, and for engineering one of the things this means is testing. We really like tests, especially unit tests, which are quick to run and enable a fast TDD cycle. 
</p> <p>Unfortunately, Hive makes unit testing basically impossible, for several reasons: </p> <ul> <li>Hive scripts must be run in a local Hadoop/Hive installation. </li> </ul> <p>Ironically, very few developers at Bizo have local Hadoop installations. We are admittedly spoiled by Elastic Map Reduce, such that most of us (myself anyway) wouldn&#39;t even know how to set up Hadoop off the top of our heads. We just fire up an EMR cluster.</p> <ul> <li>Hive scripts have production locations embedded in them. </li> </ul> <p>Both our log files and report output are stored in S3, so our Hive scripts end up with lots of &quot;s3://&quot; paths scattered throughout them. While we do run dev versions of reports with &quot;-dev&quot; S3 buckets, still relying on S3 and raw log files (that are usually in a compressed/binary-ish format) is not conducive to setting up lots of really small, simplified scenarios to unit test each boundary case.</p> <ul> <li>Hive scripts do not provide any abstraction - they are just one big HiveQL file. This means it&#39;s hard to break up a large report into small, individually testable steps.</li> </ul> <p>Despite these limitations, about a year ago we had a developer dedicate some effort to prototyping an approach that would run Hive scripts within our CI workflow. In the end, while his prototype worked, the workflow was wonky enough that we never adopted it for production projects. </p> <p>The result? Our Hive reports are basically untested. This sucks. </p> <h4>2. Hive is hard to extend</h4> <p>Extending Hive via custom functions (UDFs and UDAFs) is possible, and we do it all the time - but it&#39;s a pain in the ass. Perhaps this is not Hive&#39;s fault, and it&#39;s some Hadoop internals leaking into Hive, but the various <a href="http://hive.apache.org/docs/r0.5.0/api/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector.html">ObjectInspector</a> hoops, to me, always seemed annoying to deal with. 
</p> <p>Given these shortcomings, Bizo has been looking for a Hive successor for a while, even going so far as to prototype <a href="https://github.com/aboisvert/revolute">revolute</a>, a Scala DSL on top of <a href="http://www.cascading.org/">Cascading</a>, but we had not yet found something we were really excited about. </p> <h2>Enter Spark!</h2> <p>We had heard about Spark, but did not start trying it until we were so impressed by the Spark presentation at AWS re:Invent (the talk received <a href="https://amplab.cs.berkeley.edu/news/sparkshark-a-big-hit-at-aws-reinvent/">the highest rating of all non-keynote sessions</a>) that we wanted to learn more. </p> <p>One of Spark&#39;s touted strengths is being able to load and keep data in memory, so your queries aren&#39;t always I/O bound. </p> <p>That is great, but the exciting aspect for us at Bizo is how Spark, either intentionally or serendipitously, addresses both of Hive&#39;s primary shortcomings, and turns them into huge strengths. Specifically: </p> <h4>1. 
Spark jobs are amazingly easy to test</h4> <p>Writing a test in Spark is as easy as: </p> <div class="highlight"><pre><code class="scala">class SparkTest {
  @Test def test() {
    // this is real code...
    val sc = new SparkContext(&quot;local&quot;, &quot;MyUnitTest&quot;)
    // and now some pseudo code...
    val output = runYourCodeThatUsesSpark(sc)
    assertAgainst(output)
  }
}
</code></pre></div> <p>(I will go into more detail about runYourCodeThatUsesSpark in a future post.) </p> <p>This one-liner starts up a new <a href="http://spark-project.org/docs/latest/api/core/index.html#spark.SparkContext">SparkContext</a>, which is all your program needs to execute Spark jobs. There is no local installation required (just have the Spark jar on your classpath, e.g. via Maven or Ivy), no local server to start/stop. It just works. 
</p> <p>As a technical aside, this &quot;local&quot; mode starts up an in-process Spark instance, backed by a thread-pool, and actually opens up a few ports and temp directories, because it&#39;s a real, live Spark instance. </p> <p>Granted, this is usually more work than you want done in a unit test (which ideally would not hit any file or network I/O), but the redeeming quality is that it&#39;s fast. Tests run in ~2 seconds. </p> <p>Okay, yes, this is slow compared to pure, traditional unit tests, but it is such a huge revolution compared to Hive that we&#39;ll gladly take it. </p> <h4>2. Spark is easy to extend</h4> <p>Spark&#39;s primary API is a Scala DSL, oriented around what they call an <a href="http://www.spark-project.org/docs/0.6.0/api/core/#spark.RDD">RDD</a>, or Resilient Distributed Dataset, which is basically a collection that only supports bulk/aggregate transforms (so methods like map, filter, and groupBy, which can be seen as transforming the entire collection, but no methods like get or take which assume in-memory/random access). 
</p> <p>Some really short, made-up example code is: </p> <div class="highlight"><pre><code class="scala">// RDD[String] is like a collection of lines
val in: RDD[String] = sc.textFile(&quot;s3://bucket/path/&quot;)

// perform some operation on each line
val suffixed = in.map { line =&gt; line + &quot;some suffix&quot; }

// now save the new lines back out
suffixed.saveAsTextFile(&quot;s3://bucket/path2&quot;)
</code></pre></div> <p>Spark&#39;s job is to package up your map closure, and run it against that extra-large text file across your cluster. 
And it does so by, after shuffling the code and data around, actually calling your closure (i.e. there is no <a href="http://msdn.microsoft.com/en-us/library/vstudio/bb397926.aspx">LINQ</a>-like introspection of the closure&#39;s AST). </p> <p>This may seem minor, but it&#39;s huge, because it means there is no framework code or APIs standing between your running closure and any custom functions you&#39;d want to run. Let&#39;s say you want to use SomeUtilityClass (or the venerable <a href="http://commons.apache.org/lang/api-2.5/org/apache/commons/lang/StringUtils.html">StringUtils</a>), just do: </p> <div class="highlight"><pre><code class="scala"> <span class="k">import</span> <span class="nn">com.company.SomeUtilityClass</span> <span class="k">val</span> <span class="n">in</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">textFile</span><span class="o">(</span><span class="s">&quot;s3://bucket/path/&quot;</span><span class="o">)</span> <span class="k">val</span> <span class="n">processed</span> <span class="k">=</span> <span class="n">in</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="n">line</span> <span class="k">=&gt;</span> <span class="c1">// just call it, it&#39;s a normal method call</span> <span class="nc">SomeUtilityClass</span><span class="o">.</span><span class="n">process</span><span class="o">(</span><span class="n">line</span><span class="o">)</span> <span class="o">}</span> <span class="n">processed</span><span class="o">.</span><span class="n">saveAsTextFile</span><span class="o">(</span><span class="s">&quot;s3://bucket/path2&quot;</span><span class="o">)</span> </code></pre></div> <p>Notice how <code>SomeUtilityClass</code> doesn&#39;t have to know it&#39;s running within a Spark RDD in the cluster. 
It just takes a String. Done. </p> <p>Similarly, Spark doesn&#39;t need to know anything about the code you use within the closure; it just needs to be available on the classpath of each machine in the cluster (which is easy to do as part of your cluster/job setup; you just copy some jars around). </p> <p>This seamless hop between the RDD and custom Java/Scala code is very nice, and means your Spark jobs end up reading just like regular, normal Scala code (which to us is a good thing!). </p> <h3>Is Spark Perfect?</h3> <p>As full disclosure, we&#39;re still in the early stages of testing Spark, so we can&#39;t yet say whether Spark will be a wholesale replacement for Hive within Bizo. We haven&#39;t gotten to any serious performance comparisons or written large, complex reports to see if Spark can take whatever we throw at it. </p> <p>Personally, I am also admittedly somewhat infatuated with Spark at this point, so that could be clouding my judgement about the pros/cons and the tradeoffs with Hive. </p> <p>One Spark con so far is that Spark is pre-1.0, and it can show. I&#39;ve seen some stack traces that shouldn&#39;t happen, and some usability warts that hopefully will be cleared up by 1.0. (That said, even as a newbie I find the codebase small and very easy to read, such that I&#39;ve had <a href="https://github.com/mesos/spark/pull/352">several</a> <a href="https://github.com/mesos/spark/pull/351">small</a> <a href="https://github.com/mesos/spark/pull/362">pull requests</a> accepted already - which is a nice consolation compared to the daunting codebases of Hadoop and Hive.) </p> <p>We have also seen that, for our first Spark job, moving from &quot;Spark job written&quot; to &quot;Spark job running in production&quot; is taking longer than expected. But given that Spark is a new tool to us, we expect this to be a one-time cost. 
</p> <h3>More to Come</h3> <p>I have a few more posts coming up which explain our approach to Spark in more detail, for example: </p> <ul> <li><p>Testing best practices</p></li> <li><p>Running Spark in EMR</p></li> <li><p>Accessing partitioned S3 logs</p></li> </ul> <p>To see those when they come out, make sure to subscribe to the blog, or, better yet, <a href="http://bizo.theresumator.com/">come work at Bizo</a> and help us out!</p> Grouping pageviews into visits: a Scala code kata Darren Lee Wed, 26 Sep 2012 00:00:00 +0000 http://dev.bizo.com/2012/09/grouping-pageviews-into-visits-scala.html http://dev.bizo.com/2012/09/grouping-pageviews-into-visits-scala.html <p>The basic units of any website traffic analysis are pageviews, visits, and unique visitors. Tracking pageviews is simply a matter of counting requests to the server. Calculating unique visitors usually relies on cookies and unique identifiers. Visits, however, require a bit more work. For our purposes, a single visit is defined as a sequence of pageviews where the interval between pageviews is less than a fixed length like 15 minutes.</p> <p>I thought that the problem of grouping pageviews into visits would make an interesting <a href="http://codekata.pragprog.com/">code kata</a>. 
Here’s the statement of the problem that I worked from:</p> <blockquote> <p>Given a non-empty sequence of timestamps (as milliseconds since the epoch), write a function that would return a sequence of visits, where each visit is itself a sequence of timestamps where each pair of consecutive timestamps is no more than N milliseconds apart.</p> </blockquote> <h3>Procedural</h3> <p>As a starting point, I decided to take a straightforward procedural approach:</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">doingItIteratively</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">iterator</span> <span class="k">=</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span><span class="o">.</span><span class="n">iterator</span> <span class="k">val</span> <span class="n">visits</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">[</span><span class="kt">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="k">var</span> <span class="n">previousPV</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="n">iterator</span><span class="o">.</span><span class="n">next</span> <span class="k">var</span> <span class="n">currentVisit</span><span class="k">:</span> <span class="kt">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">(</span><span 
class="n">previousPV</span><span class="o">)</span> <span class="k">for</span> <span class="o">(</span><span class="n">currentPV</span> <span class="k">&lt;-</span> <span class="n">iterator</span><span class="o">)</span> <span class="o">{</span> <span class="k">if</span> <span class="o">(</span><span class="n">currentPV</span> <span class="o">-</span> <span class="n">previousPV</span> <span class="o">&gt;</span> <span class="n">N</span><span class="o">)</span> <span class="o">{</span> <span class="n">visits</span> <span class="o">+=</span> <span class="n">currentVisit</span> <span class="n">currentVisit</span> <span class="k">=</span> <span class="nc">ListBuffer</span><span class="o">[</span><span class="kt">Long</span><span class="o">]()</span> <span class="o">}</span> <span class="n">currentVisit</span> <span class="o">+=</span> <span class="n">currentPV</span> <span class="n">previousPV</span> <span class="k">=</span> <span class="n">currentPV</span> <span class="o">}</span> <span class="n">visits</span> <span class="o">+=</span> <span class="n">currentVisit</span> <span class="n">visits</span> <span class="n">map</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toSeq</span><span class="o">)</span> <span class="n">toSeq</span> <span class="o">}</span> </code></pre></div> <p>So, we simply iterate through the (sorted) events tracking both the current visit and the previous pageview. If the current pageview represents a new visit, push the previous visit into the list of all visits and start a new one. Then push the current pageview into the (potentially new) visit.</p> <h3>Folding</h3> <p>It actually felt a bit odd to write procedural code like this and ignore the functional parts of Scala. 
Using a fold cleans the code up a bit and gets rid of the mutable state.</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">doingItByFolds</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">sortedPVs</span> <span class="k">=</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span> <span class="o">(</span><span class="nc">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="o">/:</span> <span class="n">sortedPVs</span><span class="o">)</span> <span class="o">{</span> <span class="o">(</span><span class="n">visits</span><span class="o">,</span> <span class="n">pv</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">val</span> <span class="n">isNewVisit</span> <span class="k">=</span> <span class="n">visits</span><span class="o">.</span><span class="n">lastOption</span> <span class="n">flatMap</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">lastOption</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="n">prevPV</span> <span class="k">=&gt;</span> <span class="n">pv</span> <span class="o">-</span> <span class="n">prevPV</span> <span class="o">&gt;</span> <span class="n">N</span> <span class="o">}</span> <span class="n">getOrElse</span> <span class="kc">true</span> <span class="k">if</span> <span class="o">(</span><span class="n">isNewVisit</span><span class="o">)</span> 
<span class="o">{</span> <span class="n">visits</span> <span class="o">:+</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">pv</span><span class="o">)</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="n">visits</span><span class="o">.</span><span class="n">init</span> <span class="o">:+</span> <span class="o">(</span><span class="n">visits</span><span class="o">.</span><span class="n">last</span> <span class="o">:+</span> <span class="n">pv</span><span class="o">)</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Here, we’re starting with an empty list of visits and folding it over the sorted pageviews. At each pageview, we decide if we need to start a new visit. If so, we append a new visit containing the pageview to the accumulated visits. If not, we pop off the last visit, append the pageview, and put the last visit back on the tail of the accumulated visits.</p> <h3>Folding With Intervals</h3> <p>One part that’s still a bit messy is comparing the current timestamp to the previous one. 
We can improve that by iterating through the intervals between pageviews instead of the actual pageviews.</p> <div class="highlight"><pre><code class="scala"><span class="k">def</span> <span class="n">slidingThroughIt</span><span class="o">(</span><span class="n">pageviews</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span> <span class="k">val</span> <span class="n">intervals</span> <span class="k">=</span> <span class="o">(</span><span class="mi">0L</span> <span class="o">+:</span> <span class="n">pageviews</span><span class="o">.</span><span class="n">sorted</span><span class="o">).</span><span class="n">sliding</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span> <span class="o">(</span><span class="nc">Seq</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">Long</span><span class="o">]]()</span> <span class="o">/:</span> <span class="n">intervals</span><span class="o">)</span> <span class="o">{</span> <span class="o">(</span><span class="n">visits</span><span class="o">,</span> <span class="n">interval</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">if</span> <span class="o">(</span><span class="n">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span> <span class="o">-</span> <span class="n">interval</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span> <span class="o">&gt;</span> <span class="n">N</span><span class="o">)</span> <span class="o">{</span> <span class="n">visits</span> <span class="o">:+</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">interval</span><span 
class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="k">else</span> <span class="o">{</span> <span class="n">visits</span><span class="o">.</span><span class="n">init</span> <span class="o">:+</span> <span class="o">(</span><span class="n">visits</span><span class="o">.</span><span class="n">last</span> <span class="o">:+</span> <span class="n">interval</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span> <span class="o">}</span> <span class="o">}</span> <span class="o">}</span> </code></pre></div> <p>Here, we’re prepending a “0L” timestamp (and assuming that none of the pageviews happened in the early 70s) and using the “sliding” method to pair each timestamp with the previous one.</p> <h3>With Case Class</h3> <p>So far, we’ve been using a sequence of pageviews as a visit. What happens if we add an explicit Visit type? This lets us convert all pageviews into Visits at the start, then focus on merging overlapping Visits. 
One nice benefit is that this is a map-reduce algorithm that can be easily parallelized, instead of one that must sequentially iterate over the pageviews (either explicitly or with a fold).</p> <div class="highlight"><pre><code class="scala">import scala.math.{min, max}

case class Visit(start: Long, end: Long, pageviews: Seq[Long]) {
  def +(other: Visit): Visit =
    Visit(min(start, other.start), max(end, other.end), (pageviews ++ other.pageviews).sorted)
}

// N is the visit window defined earlier in the post.
def doingItMapReduceStyle(pageviews: Seq[Long]): Seq[Visit] = {
  pageviews
    .par
    .map { pv =&gt; Seq(Visit(pv, pv + N, Seq(pv))) }
    .reduce { (visits1, visits2) =&gt;
      val sortedVisits = (visits1 ++ visits2) sortBy (_.start)
      (Seq[Visit]() /: sortedVisits) { (visits, next) =&gt;
        if (visits.lastOption map (_.end &gt;= next.start) getOrElse false) {
          visits.init :+ (visits.last + next)
        } else {
          visits :+ next
        }
      }
    }
}
</code></pre></div> <p>The map-reduce solution is fun, but in a production system, I’d probably stick with the sliding variation and add a bit more flexibility to track actual pageview objects instead of just timestamps.</p> Using GROUP BYs or multiple INSERTs with complex data types in Hive. Darren Lee Wed, 19 Sep 2012 00:00:00 +0000 http://dev.bizo.com/2012/09/using-group-bys-or-multiple-inserts.html http://dev.bizo.com/2012/09/using-group-bys-or-multiple-inserts.html <p>In any sort of ad hoc data analysis, the first step is often to extract a specific subset of log lines from our files. For example, when looking at a single partner’s web traffic, I often use an initial query to copy that partner’s data into a new table. In addition to segregating out only the data relevant to my analysis, I use this to copy the data from S3 into HDFS, which will make later queries more efficient.
(Using maps as our log lines is how we support <a href="http://dev.bizo.com/2011/02/quot;dynamicquot;-columns-in-hive.html">dynamic columns</a>.)</p> <div class="highlight"><pre><code class="sql">create external table if not exists original_logs(fields map&lt;string,string&gt;)
location '...' ;

create table if not exists extracted_logs(fields map&lt;string,string&gt;) ;

insert overwrite table extracted_logs
select * from original_logs
where fields['partnerId'] = 123 ;
</code></pre></div> <p>If I’m doing this for multiple partners, it’s tempting to use a multiple-insert so Hadoop only needs to make one pass over the original data.</p> <div
class="highlight"><pre><code class="sql">create external table if not exists original_logs(fields map&lt;string,string&gt;)
location '...' ;

create table if not exists extracted_logs(fields map&lt;string,string&gt;)
partitioned by (partnerId int) ;

from original_logs
insert overwrite table extracted_logs partition (partnerId = 123)
  select * where fields['partnerId'] = 123
insert overwrite table extracted_logs partition (partnerId = 234)
  select * where fields['partnerId'] = 234 ;
</code></pre></div> <p>Unfortunately, in Hive 0.7.x, this query fails with the error message “Hash code on complex types not supported yet.” A multiple-insert statement uses an implicit group by, and Hive 0.7.x <a href="https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L500">does not support grouping by complex types</a>. This bug was partially addressed in 0.8, which added support for arrays and maps, but structs and unions are <a href="https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L500">still not supported</a>.</p> <p>At first glance, it does look like adding this support should be straightforward.
This could be a good candidate for our next <a href="http://dev.bizo.com/2012/04/dev-days:-hacking,-open-source-and-docs.html">open source day</a>.</p> mdadm: device or resource busy Alex Boisvert Sat, 07 Jul 2012 00:00:00 +0000 http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html http://dev.bizo.com/2012/07/mdadm-device-or-resource-busy.html <p>I just spent a few hours tracking down an issue with <a href="http://en.wikipedia.org/wiki/Mdadm">mdadm</a> (the Linux utility used to manage software RAID devices) and figured I&#39;d write a quick blog post to share the solution so others don&#39;t have to waste time on the same.</p> <p>As a short background, we use mdadm to create RAID-0 striped devices for our Sugarcube analytics (OLAP) servers using Amazon EBS volumes.</p> <p>The issue manifested itself as a random failure during device creation:</p> <div class="highlight"><pre><code class="bash">$ mdadm --create /dev/md0 --level=0 --chunk 256 --raid-devices=4 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
mdadm: Defaulting to version 1.2 metadata
mdadm: ADD_NEW_DISK for /dev/xvdh3 failed: Device or resource busy
</code></pre></div> <p>I searched and searched the interwebs and tried every trick I found, to no avail. We don&#39;t have <a href="http://www.linuxmanpages.com/man8/dmraid.8.php">dmraid</a> installed on our Linux images (Ubuntu 12.04 LTS / Alestic cloud image), so there&#39;s no possible conflict there. All devices were clean, as they were freshly created EBS volumes and I knew none of them were in use.
</p> <p>Before running <code>mdadm --create</code>, <code>/proc/mdstat</code> was clean:</p> <div class="highlight"><pre><code class="bash">$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: &lt;none&gt;
</code></pre></div> <p>And yet after running it, the component devices had been assigned to two different md devices instead of just /dev/md0:</p> <div class="highlight"><pre><code class="bash">$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive xvdh4[3](S) xvdh3[2](S)
      1048573952 blocks super 1.2
md0 : inactive xvdh2[1](S) xvdh1[0](S)
      1048573952 blocks super 1.2
unused devices: &lt;none&gt;
</code></pre></div> <p>Looking into <code>dmesg</code> didn&#39;t reveal anything interesting either:</p> <div class="highlight"><pre><code class="bash">$ dmesg
...
<span class="o">[</span>3963010.552493<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh1&gt; <span class="o">[</span>3963010.553011<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh2&gt; <span class="o">[</span>3963010.553040<span class="o">]</span> md: could not open unknown-block<span class="o">(</span>202,115<span class="o">)</span>. <span class="o">[</span>3963010.553052<span class="o">]</span> md: md_import_device returned -16 <span class="o">[</span>3963010.566543<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh3&gt; <span class="o">[</span>3963010.731009<span class="o">]</span> md: <span class="nb">bind</span>&lt;xvdh4&gt; </code></pre></div> <p>And strangely, the creation or assembly would sometime work and sometime not:</p> <div class="highlight"><pre><code class="bash"><span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: /dev/md0 has been started with 4 drives. <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: cannot open device /dev/xvdh3: Device or resource busy <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$ </span>sudo mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: cannot open device /dev/xvdh1: Device or resource busy mdadm: /dev/xvdh1 has no superblock - assembly aborted <span class="nv">$ </span>mdadm --manage /dev/md0 --stop mdadm: stopped /dev/md0 <span class="nv">$sudo</span> mdadm --assemble --force /dev/md0 /dev/xvdh<span class="o">[</span>1234<span class="o">]</span> mdadm: /dev/md0 has been started with 4 drives. 
</code></pre></div> <p>I started suspecting I was facing some kind of underlying race condition where the devices would get assigned/locked during the device creation process. So I started googling for &quot;mdadm create race&quot; and I finally found a <a href="http://permalink.gmane.org/gmane.linux.raid/34027">post</a> that tipped me off. While it didn&#39;t provide the solution, the post put me on the right track by mentioning <a href="http://en.wikipedia.org/wiki/Udev">udev</a> and it took only a few more minutes to narrow down on the solution: <strong>disabling udev events during device creation to avoid contention on device handles.</strong></p> <p>So now our script goes something like:</p> <div class="highlight"><pre><code class="bash"><span class="nv">$ </span>udevadm control --stop-exec-queue <span class="nv">$ </span>mdadm --create /dev/md0 --run --level<span class="o">=</span>0 --raid-devices<span class="o">=</span>4 ... <span class="nv">$ </span>udevadm control --start-exec-queue </code></pre></div> <p>And we now have consistent reliable device creation.Hopefully this blog post will help other passers-by with a similar problem. Good luck!</p>
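<p>One refinement worth considering for the stop/create/start sequence above: if <code>mdadm --create</code> fails partway, a straight-line script leaves udev's exec queue stopped. A minimal bash sketch of a more defensive wrapper (the <code>create_raid0</code> helper name is my own invention; the devices and chunk size mirror the example earlier in the post) that uses a <code>trap</code> so udev is restarted no matter how <code>mdadm</code> exits:</p>

```shell
#!/usr/bin/env bash
# Hedged sketch, not our exact production script: pause udev event
# processing, build the RAID-0 array, and guarantee the queue is
# restarted even when mdadm fails.
set -euo pipefail

create_raid0() {
  local md_dev=$1
  shift
  # Stop udev from grabbing the component devices mid-creation.
  udevadm control --stop-exec-queue
  # Restart udev event processing no matter how mdadm exits.
  trap 'udevadm control --start-exec-queue' EXIT
  mdadm --create "$md_dev" --run --level=0 --chunk=256 \
        --raid-devices="$#" "$@"
}

# Only attempt the real creation when run somewhere the tools and
# devices actually exist (e.g. as root on the EC2 instance).
if command -v mdadm >/dev/null 2>&1 && [[ -b /dev/xvdh1 ]]; then
  create_raid0 /dev/md0 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
fi
```

<p>The <code>trap ... EXIT</code> is the important part: without it, an error between the two <code>udevadm</code> calls would leave the machine with udev events queued indefinitely.</p>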