The specific technical issue in this case was due to our misconfiguration of
Redis instances.


<img src="/img/171101-devops-cd-you/devops-cd-you.025.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'Root cause?'">
-
We know the particular technical failure was due to our Redis mishandling,
but how do we look past the specific bit and get to a broader understanding
of the processes that caused the issue?


<img src="/img/171101-devops-cd-you/devops-cd-you.026.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Billing incident response from Twilio developer evangelist.">
-
Let's take a look at the resolution of the situation and then learn about
the concepts and tools that could prevent future problems.

@@ -262,61 +258,53 @@ own environments.


<img src="/img/171101-devops-cd-you/devops-cd-you.027.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Twilio status page.">
-
Twilio became more transparent about the status of services, especially by
showing partial failures and outages.


<img src="/img/171101-devops-cd-you/devops-cd-you.028.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Twilio number of production deployments.">
-
Twilio was also deliberate in avoiding the accumulation of manual processes
and controls that other organizations often put in place after failures. We
doubled down on resiliency through automation to increase our ability to
deploy to production.


<img src="/img/171101-devops-cd-you/devops-cd-you.029.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'tools and concepts'.">
-
What are some of the tools and concepts we use at Twilio to prevent future
failure scenarios?


<img src="/img/171101-devops-cd-you/devops-cd-you.030.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Eventually you ship code into production that breaks your application.">
-
If you do not have the right tools and processes in place, eventually you
end up with a broken production environment after shipping code. What is
one tool we can use to be confident that the code going into production is
not broken?


<img src="/img/171101-devops-cd-you/devops-cd-you.031.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'automated testing' with example code coverage in the background.">
-
-Automated testing, in its many forms, such as unit testing, integration
-testing, security testing and performance testing, helps to ensure the
-integrity of the code. You need to automate because manual testing is too
-slow.
+Automated [testing](/testing.html), in its many forms, such as unit testing,
+integration testing, security testing and performance testing, helps to
+ensure the integrity of the code. You need to automate because manual
+testing is too slow.

Other important tools that fall into the automated testing bucket but are
not traditionally thought of as a "test case" include code coverage and
-code metrics (such as Cyclomatic Complexity).
+[code metrics](/code-metrics.html) (such as Cyclomatic Complexity).
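
To make the code metrics idea concrete, here is a minimal Python sketch that approximates Cyclomatic Complexity by counting decision points with the standard library `ast` module. The `approximate_complexity` function, the `BRANCH_NODES` rule set and the `SAMPLE` source are all assumptions made up for illustration; real metrics tools apply more complete rules.

```python
import ast

# Node types that add a decision point. This is a simplified, illustrative
# rule set, not the exact definition used by any particular metrics tool.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in the source."""
    tree = ast.parse(source)
    complexity = 1  # a single path exists even when there are no branches
    for node in ast.walk(tree):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds one extra path per additional operand
            complexity += len(node.values) - 1
    return complexity


# Hypothetical function to measure: two ifs, one for loop, one "and".
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(approximate_complexity(SAMPLE))  # prints 5
```

A continuous build could fail when a number like this crosses a threshold the team agrees on, which turns the metric into something closer to a test case.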


<img src="/img/171101-devops-cd-you/devops-cd-you.032.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Automated tests in dev only deploy to production when they are successful.">
-
Awesome, now you only deploy to production when a big batch of automated
test cases ensure the integrity of your code. All good, right?


<img src="/img/171101-devops-cd-you/devops-cd-you.033.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Bugs can still occur in production.">
-
Err, well no. Stuff can still break in production, especially in environments
where for various reasons you do not have the same exact data in test
that you do in production. Your automated tests and code metrics will
simply not catch every last scenario that could go wrong in production.


<img src="/img/171101-devops-cd-you/devops-cd-you.034.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'monitoring and alerting' with New Relic dashboard in the background.">
-
When something goes wrong with your application, you need monitoring to
know what the problem is, and alerting to tell the right folks. Traditionally,
the "right" people were in operations. But over time many organizations
@@ -325,7 +313,6 @@ developers who wrote the code that had the problem.


<img src="/img/171101-devops-cd-you/devops-cd-you.035.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="When something breaks in prod, your developers know about it and can fix the problem.">
-
A critical piece of DevOps is ensuring the appropriate developers
are carrying the pagers. It sucks to carry the pager and get woken up in the
middle of the night, but it's a heck of a lot easier to debug the code that
@@ -339,14 +326,12 @@ something will blow up on you later on at a less convenient time.


<img src="/img/171101-devops-cd-you/devops-cd-you.036.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="When production is running smoothly with many tests, does that increase the chance of black swan-type events?">
-
Typically you find, though, that there are still plenty of production errors
even when you have defensive code in place with a huge swath of the most
important parts of your codebase being constantly tested.


<img src="/img/171101-devops-cd-you/devops-cd-you.037.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'Chaos engineering' with the chaos engineering monkey logo in the background.">
-
That's where a concept known as "chaos engineering" can come in. Chaos
engineering breaks parts of your production environment on a scheduled and
even unscheduled basis. This is a very advanced technique: you are not going
@@ -355,121 +340,136 @@ or appropriate controls in place.


<img src="/img/171101-devops-cd-you/devops-cd-you.038.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Chaos engineering introduces intentional failures in your infrastructure both on a scheduled and unscheduled basis.">
-
By deliberately introducing failures, especially during the day when your
well-caffeinated team can address the issues and put further safeguards in
place, you make your production environment more resilient.
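
As a rough illustration of the "during the day" safeguard, here is a small Python sketch that only selects something to break on weekdays inside a staffed window. The `pick_victim` function, the 10:00 to 16:00 window and the instance names are assumptions invented for this example, and the code only chooses a target; it does not terminate anything.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence

# Assumed staffed window; pick whatever matches your own team's hours.
WORKDAY_START = time(10, 0)
WORKDAY_END = time(16, 0)


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Choose one instance to break, or None outside staffed hours."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORKDAY_START <= now.time() < WORKDAY_END
    if not (is_weekday and in_hours) or not instances:
        return None
    return rng.choice(list(instances))


# Hypothetical instance names, checked on a Wednesday morning.
names = ["web-1", "web-2", "worker-1"]
victim = pick_victim(names, datetime(2017, 11, 1, 11, 30), random.Random())
print(victim)
```

Passing in the clock and the random source keeps the selection logic easy to test, which matters for a tool whose whole job is to cause failures.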


<img src="/img/171101-devops-cd-you/devops-cd-you.039.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads '1. other peoples money' with money in the background.">
-
We talked about the failure in Twilio's payments infrastructure several years
ago that led us to ultimately become more resilient to failure by putting
appropriate automation in place.


<img src="/img/171101-devops-cd-you/devops-cd-you.040.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads '2. other peoples lives' with people in the background.">
-
Screwing with other people's money is really bad, and so is messing with
people's lives.


<img src="/img/171101-devops-cd-you/devops-cd-you.041.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'War on Terror' with an exploded vehicle in the background.">
-
Let's discuss a scenario where human lives were at stake.

To be explicit about this next scenario, I'm only going to talk about public
information, so my cleared folks in the audience can relax.


<img src="/img/171101-devops-cd-you/devops-cd-you.042.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="U.S. military and civilian casualties in Iraq.">
-
During the height of U.S. forces' Iraq surge in 2007, more improvised explosive
devices were killing and maiming soldiers and civilians than ever before. It
was an incredible tragedy that contributed to the uncertainty of the time in
One major challenge with the project was a terrible manual build process that
+literally involved clicking buttons in an integrated
+[development environment](/development-environments.html) to create the
+application artifacts. The process was too manual and the end result was that
+the latest version of the software took far too long to get into production.


<img src="/img/171101-devops-cd-you/devops-cd-you.045.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="The situation did not have reasonable deployments to dev or to production.">
-
-...
+We did not have automated deployments to a development environment, staging
+or production.


<img src="/img/171101-devops-cd-you/devops-cd-you.046.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Start somewhere, automate your deployments to dev environment.">
+Our team had to start somewhere, but with a lack of approved tools, all we
+had available to us were shell scripts. But shell scripts were a start. We were
+able to make a very brittle but repeatable, automated deployment process to
+a development environment.

-...
+There is still a huge glaring issue though: until the code is actually
+deployed to production it does not provide any value for the users.


<img src="/img/171101-devops-cd-you/devops-cd-you.047.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Some environments have tricky issues with automated prod deployments like disconnected networks.">
+In this case, we could never fully automate the deployment because we had to
+burn the application artifacts to a CD before moving them to a physically
+different computer network. The team could automate just about everything
+else though, and that really mattered for iteration and speed to deployment.

-...
+You do the best you can with the tools at your disposal.


<img src="/img/171101-devops-cd-you/devops-cd-you.048.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'Tools and concepts'.">
-
-...
+What are the tools and concepts behind automating deployments?


<img src="/img/171101-devops-cd-you/devops-cd-you.049.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Several development teams commit to a Git repository.">
-
-...
+Source code is stored in a
+[source control (or version control)](/source-control.html) repository.
+Source control is the start of the automation process, but what do we need
+to get the code into various environments using a repeatable, automated
+process?


<img src="/img/171101-devops-cd-you/devops-cd-you.050.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'continuous integration' with a screenshot of Jenkins dashboard in the background.">
-
-...
+This is where [continuous integration](/continuous-integration.html) comes
+in. Continuous integration takes your code from the version control system,
+builds it, tests it and calculates the appropriate code metrics before the
+code is deployed to an environment.
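
The ordering described here can be sketched in a few lines of Python. This is not how Jenkins or any other continuous integration server is implemented; `run_pipeline` and the stand-in stage functions are hypothetical names used only to show that a failed stage stops everything after it, including the deployment.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            print(f"{name} failed, skipping the remaining stages")
            break
        completed.append(name)
    return completed


def succeed() -> bool:
    """Stand-in for a real step such as a checkout or a build."""
    return True


def fail() -> bool:
    """Stand-in for a step that breaks, such as a failing test run."""
    return False


# Hypothetical run where the tests fail, so nothing gets deployed.
broken_run = [
    ("checkout", succeed),
    ("build", succeed),
    ("test", fail),
    ("metrics", succeed),
    ("deploy", succeed),
]
print(run_pipeline(broken_run))  # prints ['checkout', 'build']
```

In a real setup each callable would shell out to your build tool, test runner or metrics tool and report its exit status.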


<img src="/img/171101-devops-cd-you/devops-cd-you.051.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Add a continuous integration server to build the code that is committed to your source control repository.">
-
-...
+Now we have a continuous integration server hooked up to source control, but
+this picture still looks odd.


<img src="/img/171101-devops-cd-you/devops-cd-you.052.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="How do we automate the building of these environments and the deployments themselves?">
-
-...
+Technically, continuous integration does not handle the details of the build
+and how to configure individual execution environments.


<img src="/img/171101-devops-cd-you/devops-cd-you.053.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Text that reads 'configuration management' with a screenshot of Ansible AWX in the background.">
-
-...
+[Configuration management](/configuration-management.html) tools handle the
+setup of application code and environments.
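
One idea behind these tools is describing a desired state and applying only the difference, so running the setup twice is harmless. The Python sketch below illustrates that idea under simplifying assumptions: `plan_changes`, `apply_changes` and the settings shown are hypothetical, and it does not use Ansible or any other real tool's API.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List the actions needed to bring an environment to the desired state."""
    actions: List[str] = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            actions.append(f"set {key}={value}")
        elif actual[key] != value:
            actions.append(f"change {key}: {actual[key]} -> {value}")
    return actions


def apply_changes(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return the environment state after the desired settings are applied."""
    updated = dict(actual)
    updated.update(desired)
    return updated


# Hypothetical settings for a development environment.
desired = {"python": "3.6", "redis_maxmemory": "256mb"}
actual = {"python": "2.7"}

print(plan_changes(desired, actual))
actual = apply_changes(desired, actual)
print(plan_changes(desired, actual))  # prints [] because nothing is left to do
```

The empty second plan is the property that makes an automated environment setup repeatable.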


<img src="/img/171101-devops-cd-you/devops-cd-you.054.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Agile sprints deliver code to a development environment and then automate the deployment into production.">
-
-...
+Those two scenarios provided some context for why DevOps and Continuous
+Delivery matter to organizations in varying industries. When you have high
+performing teams working via the Agile development methodology, you will
+encounter a set of problems that are not solvable by doing Agile "better". You
+need the tools and concepts we talked about today as well as a slew of other
+engineering practices to get that new code into production.


<img src="/img/171101-devops-cd-you/devops-cd-you.055.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Review list of continuous delivery tools.">
<img src="/img/171101-devops-cd-you/devops-cd-you.056.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="A list of more concepts and tools for continuous delivery.">
-
-...
+There are many other practices you will need as you continue your journey.
+You can learn about
+[all of them on Full Stack Python](/table-of-contents.html).


<img src="/img/171101-devops-cd-you/devops-cd-you.057.jpg" width="100%" class="technical-diagram img-rounded" style="border: 1px solid #aaa" alt="Thank you slide.">

That's all for today. My name is [Matt Makai](/about-author.html)
and I'm a software developer at [Twilio](/twilio.html) and the
-author of [Full Stack Python](https://www.fullstackpython.com/),
-thank you very much.
+author of [Full Stack Python](https://www.fullstackpython.com/).