<h1 id="reproducible-pdfs">Reproducible PDFs</h1>
<p><em>2022-07-19</em></p>
<p>For the last few weeks, I’ve been putting the final touches on a research report, intended to be published both as a print (like, <em>dead-tree</em>) publication, and as a digital artifact (a PDF, including the sources needed to generate it).</p>
<p>It’s been a lot of back-and-forth, but we’re finally at the point of production. As I happily regenerated the PDF to send to the printer “one last time” (<code class="language-plaintext highlighter-rouge">book-final-edited-fix2-AGAIN-reallyfinal-ugh.pdf</code>), I noticed something odd. I kept getting git conflicts with the PDF, even when the source material wasn’t changing. (I don’t usually check generated files into git, but this particular PDF, being the main output of the project in question, seemed like a reasonable exception to my rule).</p>
<h2 id="down-the-rabbit-hole">Down The Rabbit Hole</h2>
<p>In hunting for the source of these git conflicts, I unwittingly fell down a reproducibility rabbit hole with my PDF generation: <strong>Can I reproducibly generate a PDF from an unchanging source document?</strong></p>
<p>To be more precise, I mean: starting from <em>exactly</em> the same input (a set of markdown documents), can I get the same PDF out? That can’t possibly be a hard problem, can it?</p>
<p>Short answer: <em><strong>it’s a lot harder than I thought</strong></em> (and metadata is to blame).</p>
<h2 id="our-pipeline">Our Pipeline</h2>
<p>We generate our book PDF from a set of markdown source files using <a href="https://pandoc.org/">pandoc</a>. It’s academic writing, so there’s a mix of filters in there: LaTeX for equations, <a href="https://github.com/lierdakil/pandoc-crossref">pandoc-crossref</a> for intra-document references, and Citeproc (BibTeX) for citations and references. It’s also multilingual, so we throw <a href="https://tug.org/xetex/">XeTeX</a> into the mix to handle Unicode. It may seem like a crazy way to do it, but the result is an easy-to-edit, easy-to-diff document that can be maintained (<em>and</em> viewed) in GitHub by a wide variety of people (even those for whom LaTeX isn’t their first language).</p>
<p>The result is shockingly easy to convert to other document formats, so when a client asks for, say, a Word document to send for translation, we can easily do the conversion and expect that everything will render properly.</p>
<h2 id="our-problem">Our Problem</h2>
<p>I couldn’t seem to generate the same PDF twice. Witness here:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> md5 document.pdf
MD5 <span class="o">(</span>document.pdf<span class="o">)</span> <span class="o">=</span> fdfeefe8eb0df92162342271ad4cacc2
<span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> md5 document.pdf
MD5 <span class="o">(</span>document.pdf<span class="o">)</span> <span class="o">=</span> 90360b00c4f1ef08e57135e6b866e392
</code></pre></div></div>
<p>Basically, every time the PDF is generated, the hash is different. That’s a little embarrassing for a guy who does reproducibility research. I need to fix this in our generation pipeline.</p>
<p>Pro-tip number 1: <strong>make sure you’re solving the right problem</strong>. There’s no guarantee the PDF is the culprit, so before digging in that grave, I should check the generation upstream. Is the source material actually unchanging?</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make document.tex <span class="o">&&</span> <span class="nb">cp </span>document.tex orig.tex
<span class="o">>>></span> make clean <span class="o">&&</span> make document.tex <span class="o">&&</span> <span class="nb">cp </span>document.tex next.tex
<span class="o">>>></span> diff orig.txt next.txt
<span class="o">(</span>nothing<span class="o">)</span>
</code></pre></div></div>
<p>As I hoped: no output, so the generated TeX is the same. A good start.</p>
<p>Next, let’s figure out how different these files actually are, starting with my favourite hash function: file size.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv </span>document.pdf orig.pdf
<span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv </span>document.pdf next.pdf
<span class="o">>>></span> <span class="nb">ls</span> <span class="nt">-la</span> <span class="k">*</span>.pdf
<span class="nt">-rw-r--r--</span> 1 hackalog staff 6709688 19 Jul 15:35 next.pdf
<span class="nt">-rw-r--r--</span> 1 hackalog staff 6709688 19 Jul 15:34 orig.pdf
</code></pre></div></div>
<p>Since the upstream contents are the same, and the resulting PDFs are the same size, I’m going to assume the bulk of the files are identical and look for some kind of metadata difference.</p>
<h2 id="the-fix">The Fix</h2>
<p>Lo and behold, it’s metadata. Google and <a href="https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va">Stack Exchange confirm</a> that these three fields are to blame:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/CreationDate</code></li>
<li><code class="language-plaintext highlighter-rouge">/ModDate</code></li>
<li><code class="language-plaintext highlighter-rouge">/ID</code></li>
</ul>
<p>According to that thread, two of these are easy to fix: hard-code something reasonable into a <code class="language-plaintext highlighter-rouge">SOURCE_DATE_EPOCH</code> environment variable before running <code class="language-plaintext highlighter-rouge">pandoc</code> (like a fixed output of <code class="language-plaintext highlighter-rouge">date +%s</code>). I can generate a date once, set the variable, and give it a try.</p>
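<p>If you drive pandoc from a script instead of a Makefile, the same trick looks something like this minimal sketch (the timestamp value and the pandoc arguments are placeholders for whatever your pipeline actually uses):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import subprocess

# Pin SOURCE_DATE_EPOCH to a fixed timestamp so /CreationDate and
# /ModDate stop changing between builds. The value is arbitrary;
# 1658188800 is 2022-07-19 00:00:00 UTC.
env = dict(os.environ, SOURCE_DATE_EPOCH="1658188800")
subprocess.run(["pandoc", "document.md", "-o", "document.pdf"],
               env=env, check=True)
</code></pre></div></div>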
<p>Sure enough, according to <a href="https://exiftool.org/">exiftool</a>, the creation and modification dates now match. Unfortunately, the hashes <em>still</em> don’t match.</p>
<p>What about that third one? ID? Annoyingly, <a href="https://exiftool.org/">exiftool</a> doesn’t let me view the <code class="language-plaintext highlighter-rouge">ID</code> field directly. Time to get dirty. (I’m actually impressed I made it this far without a <a href="https://github.com/vim/vim/blob/master/src/xxd/xxd.c">hex dump</a>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> diff <(xxd document-1.pdf) <(xxd document.pdf)
418936,418940c418936,418940
< 00664770: 662f 4944 5b3c 3935 3361 3763 3266 6531 f/ID[<953a7c2fe1
< 00664780: 3363 3431 3139 3231 6236 3265 6635 3065 3c411921b62ef50e
< 00664790: 3962 6334 3134 3e3c 3935 3361 3763 3266 9bc414><953a7c2f
< 006647a0: 6531 3363 3431 3139 3231 6236 3265 6635 e13c411921b62ef5
< 006647b0: 3065 3962 6334 3134 3e5d 2f52 6f6f 740a 0e9bc414>]/Root.
---
> 00664770: 662f 4944 5b3c 6130 6665 6131 3762 3361 f/ID[<a0fea17b3a
> 00664780: 3039 3436 3330 6561 3536 6364 3366 6539 094630ea56cd3fe9
> 00664790: 6363 3734 3434 3e3c 6130 6665 6131 3762 cc7444><a0fea17b
> 006647a0: 3361 3039 3436 3330 6561 3536 6364 3366 3a094630ea56cd3f
> 006647b0: 6539 6363 3734 3434 3e5d 2f52 6f6f 740a e9cc7444>]/Root.
</code></pre></div></div>
<p>There it is, and sure enough, the ID changes every time. According to the <a href="https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va">aforelinked Stack Exchange article</a>, there is a solution, but it depends on which PDF backend is doing the actual compiling for pandoc. I suppose I can patch the <a href="https://github.com/Wandmalfarbe/pandoc-latex-template">eisvogel.tex</a> template I’m using to generate the book, and add a blurb to the TeX header. Technically, I only use XeTeX, but so that I never have to look this up again, I’ll put <strong>all</strong> the backends in and cross my fingers in case I ever need a different one:</p>
<div class="language-tex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\ifnum</span> 0<span class="k">\ifxetex</span> 1<span class="k">\fi\ifluatex</span> 1<span class="k">\fi</span>=0 <span class="c">% if pdftexe</span>
<span class="k">\pdfinfoomitdate</span>=1
<span class="k">\pdftrailerid</span><span class="p">{}</span>
<span class="k">\else</span> <span class="c">% if not pdftex</span>
<span class="k">\ifxetex</span>
<span class="k">\special</span><span class="p">{</span>pdf:trailerid [
<00112233445566778899aabbccddeeff>
<00112233445566778899aabbccddeeff>
]<span class="p">}</span>
<span class="k">\fi</span>
<span class="k">\ifluatex</span>
<span class="k">\pdfvariable</span> suppressoptionalinfo <span class="k">\numexpr</span>32+64+512<span class="k">\relax</span>
<span class="k">\fi</span>
<span class="k">\fi</span>
</code></pre></div></div>
<h2 id="the-result">The Result</h2>
<p>A few fistfuls of hair, a few hours, a hex dump, and much googling later, and…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make <span class="sb">`</span>clean<span class="sb">`</span> <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv</span> <span class="sb">`</span>document.pdf orig.pdf<span class="sb">`</span>
<span class="o">>>></span> make <span class="sb">`</span>clean<span class="sb">`</span> <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv</span> <span class="sb">`</span>document.pdf next.pdf<span class="sb">`</span>
<span class="o">>>></span> <span class="sb">`</span>md5 <span class="k">*</span>.pdf MD5<span class="sb">`</span>
<span class="o">(</span><span class="sb">`</span>next.pdf<span class="sb">`</span><span class="o">)</span> <span class="o">=</span> <span class="sb">`</span>c3bf99530a35eab6f9adafb08c24acbd MD5<span class="sb">`</span>
<span class="o">(</span><span class="sb">`</span>orig.pdf<span class="sb">`</span><span class="o">)</span> <span class="o">=</span> <span class="sb">`</span>c3bf99530a35eab6f9adafb08c24acbd<span class="sb">`</span>
</code></pre></div></div>
<p>At last my PDF generation is reproducible. That wasn’t so hard, was it?</p>
<h1 id="parenting-sabbatical">How I spent my Parenting Sabbatical</h1>
<p><em>2021-06-11</em></p>
<p>For 6 weeks now, I’ve been on a parenting sabbatical; that is, I split my 9-month Parental Leave into two parts, separated by a 6-week “return to work”. On Monday, my work is done, and I go back on Parental Leave.</p>
<p>It took some serious logistics (a certain pandemic wiped out my original childcare plans), but I think I got some really great work done. Not only did I finish up old work, I feel like I’ve been able to at least dip my toe into the current research problems, which will make coming back in November all that much easier.</p>
<h2 id="6-weeks-in-review">6 Weeks in Review</h2>
<p>It’s been an intense 6 weeks.</p>
<p>Back in February, Amy and I presented our summary of Reproducibility research in 2020. In that talk, we also set out our roadmap for where we want to take the reproducibility project (Easydata) in 2021. My plan for my 6-week return was to get a good start on this roadmap, readying Easydata for the next set of projects and workshops to be thrown at it (later this summer, and early this fall).</p>
<p><img src="images/easydata2022/edreview-2021-goals.png" alt="Our 2021 Easydata Roadmap" /></p>
<p>In the last 6 weeks, I focused heavily on implementing the “Streamline Workgroup Sharing” improvements outlined in that talk. Particularly:</p>
<ul>
<li>Improving the <a href="https://hackalog.github.io/git-friendly-catalog">Catalog</a> object: Implementing a more git-friendly catalog format</li>
<li>Implementing <strong>notebook-as-transformer</strong>: i.e. the ability to use notebooks as nodes in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>. This allows an analyst to create a <code class="language-plaintext highlighter-rouge">Dataset</code> in a jupyter notebook (complete with all the storytelling that comes along with that format), and have that notebook be used automatically to regenerate the Dataset as part of the usual <code class="language-plaintext highlighter-rouge">Dataset.load()</code> dependency traversal mechanism (i.e. as a <strong>transformer</strong> in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>).</li>
</ul>
<p>This has set up some good opportunities to use the improved framework both immediately, and in the upcoming months:</p>
<ul>
<li>Amy’s preparing a set of tutorial notebooks for the <a href="https://github.com/acwooding/vectorizers_playground">Vectorizers Playground</a>.</li>
<li>Amy’s presenting an <a href="https://www.youtube.com/watch?v=KrIRTPvzLHM">Easydata tutorial</a>, and I’m giving a talk on the <a href="https://github.com/hackalog/make_better_defaults/blob/main/README.md">Easydata Makefile workflow</a> at this year’s PyData Global.</li>
<li>Easydata will be driving the git repos for a number of upcoming workshops and research events (details to come).</li>
</ul>
<h3 id="we-released-easydata-20">We released Easydata 2.0</h3>
<p>Easydata 2.0 consists of two new features (<a href="https://hackalog.github.io/git-friendly-catalog">new catalog format</a> and notebook-as-transformer), and a massive API cleanup. Because we removed almost as much code as we added (+1300 lines, -900 lines), we cranked the major version number to warn the user that they may want to review the documentation (or at least the blog post) before proceeding.</p>
<h3 id="we-reimplemented-catalogs">We reimplemented Catalogs</h3>
<p>A <code class="language-plaintext highlighter-rouge">Catalog</code> object is a serializable, disk-backed git-friendly dict-like object for storing a data catalog.</p>
<ul>
<li><strong>serializable</strong> means anything stored in the catalog must be serializable to/from JSON.</li>
<li><strong>disk-backed</strong> means all changes are reflected immediately in the on-disk serialization.</li>
<li><strong>git-friendly</strong> means this on-disk format can be easily maintained in a git repo (with minimal
issues around merge conflicts), and</li>
<li><strong>dict-like</strong> means programmatically, it acts like a Python <code class="language-plaintext highlighter-rouge">dict</code>.</li>
</ul>
<p>The new Catalog replaces the monolithic “catalog-as-json-files” that Easydata used previously. The main problem with these files was that, when several users were using the same git repo (like, say, in a workshop), these catalogs were a rich source of git merge conflicts.</p>
<p>My favourite thing about the new Catalog format is that it’s almost completely transparent to the code. Internally, it just acts like a dict. The serialization almost comes for free. Implementing catalogs in this fashion let us remove a whole pile of special-case code for dealing with the various catalogs.</p>
<p>For details, read my <a href="https://hackalog.github.io/git-friendly-catalog">blog post</a>.</p>
<h3 id="we-implemented-notebook-as-transformer">We implemented Notebook-as-transformer</h3>
<p>Internally, Easydata maintains a dependency hypergraph called the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>. Nodes in this graph are <code class="language-plaintext highlighter-rouge">Dataset</code> objects. Edges are composable “transformer functions” which take in 0 or more Datasets, and emit 1 or more Datasets. For more details, see my blog post on <a href="/transformers-and-datasets">transformers and datasets</a>.</p>
<p>The DatasetGraph is the magic that lets <code class="language-plaintext highlighter-rouge">Dataset.load()</code> just magically work. If the Dataset is present on-disk, it’s loaded from there. If not, it’s generated by walking its dependency list and building (or loading) the relevant Datasets before running a transformer function.</p>
<p>Writing transformer functions was never hard, but it was one place where we spent a bunch of time coaching users. So, to make Easydata easier to use, we’ve eliminated the need to put everything in a single function. Now a user can specify a <em>jupyter notebook</em> as a transformer function. So long as the notebook writes the desired dataset to disk, the process will just magically work.</p>
<p>Allowing <em>notebook-as-transformer</em> greatly improves the storytelling possible with Easydata, as <code class="language-plaintext highlighter-rouge">Dataset</code> preparation (and all the narration that goes along with it) can be stored in the main flow of jupyter notebooks, instead of hidden in a transformer function inside the project’s <code class="language-plaintext highlighter-rouge">src</code> module.</p>
<h3 id="we-made-a-whole-bunch-of-other-api-changes">We made a whole bunch of other API changes</h3>
<p>Since we were already breaking a bunch of API with the <code class="language-plaintext highlighter-rouge">Catalog</code> change, we took the opportunity to clean (or remove) a lot of the more troublesome (or confusing) parts of the Easydata API. These were design decisions which we knew had issues (and for which we had usually developed workarounds), but we were keeping for purposes of backwards compatibility. There were a lot of small changes here (see my <a href="https://hackalog.github.io/api-changes">api-changes</a> blog post for details), but a lot of those changes can be described as follows:</p>
<ul>
<li>Names have semantic baggage. <strong>Good</strong> (variable, method, parameter) <strong>names are important</strong>.</li>
<li>Good API design comes from watching users actually <em>use</em> your framework</li>
<li>Any day you can delete a bunch of code by introducing a new API is a good day.</li>
</ul>
<h3 id="and-now-back-to-parenting">And now: Back to Parenting</h3>
<p>So that’s it for a few months. I’m off on Parental leave until November. @acwooding’s still around, however, so feel free to direct your reproducibility questions her way in the meantime.</p>
<p>See you in the fall!</p>
<h1 id="api-ch-ch-changes">API Ch-ch-changes</h1>
<p><em>2021-06-02</em></p>
<p>As mentioned in the <a href="/git-friendly-catalog/">last post</a>, the upcoming Easydata 2.0 release is all about the API and UX lessons we learned in the last year of using (and developing) the Easydata framework.</p>
<p>Since there are probably a few existing Easydata users out there, here’s a quick guide to migrating to the new API.</p>
<h2 id="on-disk-catalog-format">On-disk Catalog Format</h2>
<p>We completely changed the on-disk <a href="/git-friendly-catalog">catalog format</a>. But you knew that, because we wrote a whole blog post about it :)</p>
<h3 id="loading-a-catalog">Loading a Catalog</h3>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">load_catalog(catalog_name)</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">Catalog.load(catalog_name)</code></li>
</ul>
<p>Previously, we had defined some helpful (partial) functions to load these; i.e.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataset_catalog = partial(load_catalog, catalog_file='datasets.json') # Old way</code></li>
<li><code class="language-plaintext highlighter-rouge">transformer_catalog = partial(load_catalog, catalog_file='transformers.json') # Old way</code></li>
<li><code class="language-plaintext highlighter-rouge">datasource_catalog = partial(load_catalog, catalog_file='datasources.json') # Old way</code></li>
</ul>
<p>We’ve deprecated them, because the new form is just as clear:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('datasets')</code></li>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('transformers')</code></li>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('datasources')</code></li>
</ul>
<h3 id="deleting-a-key">Deleting a key</h3>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">del_from_catalog(key, catalog_file=foo)</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">c = Catalog.load(foo); del c[key]</code></li>
</ul>
<p>Basically, treat the catalog as a dict, and changes will be serialized to disk automatically.</p>
<h3 id="available-catalog-entries">Available catalog entries</h3>
<p>We used to have functions like <code class="language-plaintext highlighter-rouge">available_datasets()</code>, <code class="language-plaintext highlighter-rouge">available_transformers()</code>, <code class="language-plaintext highlighter-rouge">available_datasources()</code> but again, we now simply treat these as a dict, so</p>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">if 'foo' in available_datasets() ...</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">c = Catalog.load('datsets'); if 'foo' in c ...</code></li>
</ul>
<p>Basically, treat the catalog as a dict.</p>
<h2 id="building-transformers">Building Transformers</h2>
<p>One of our favourite new features is the ability to use a Jupyter Notebook in place of a transformer function.
It’s as easy as writing a Dataset to disk inside your notebook and then doing a:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dsdict = notebook_as_transformer(notebook_name='my_notebook.ipynb',
input_datasets=[ds_in],
output_datasets=[ds_out],
overwrite_catalog=True)
</code></pre></div></div>
<h2 id="eliminating-the-workflow-module">Eliminating the workflow module</h2>
<p>The purpose of <code class="language-plaintext highlighter-rouge">src.workflow</code> has changed several times. In the end, we ended up using it as a place to test
out new API ideas without exposing the details to the user. By the time we cut our Easydata 2 beta, this file was effectively empty, so it has returned to its original purpose (handling commands like “make datasets” and “make datasources”).</p>
<p>For the rest of the functions that used to be there, import what you need from the relevant Easydata submodule directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from src.data import Catalog, Dataset
from src.helpers import (dataset_from_csv_manual_download,
                         dataset_from_metadata,
dataset_from_single_function)
</code></pre></div></div>
<h3 id="adding-datasetdatasource-to-catalog">Adding Dataset/Datasource to Catalog.</h3>
<p>To add a dataset or datasource to its respective catalog, use the <code class="language-plaintext highlighter-rouge">update_catalog()</code> method of the
Dataset / DataSource object respectively; e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c = Catalog.load('datsets')
ds = Dataset('new_dataset_name')
ds.update_catalog()
</code></pre></div></div>
<p>The same works for DataSource objects.</p>
<h3 id="renamed-api-calls">Renamed API Calls</h3>
<ul>
<li><code class="language-plaintext highlighter-rouge">TransformerGraph</code> is now <code class="language-plaintext highlighter-rouge">DatasetGraph</code></li>
<li><code class="language-plaintext highlighter-rouge">create_transformer_pipeline</code> is now <code class="language-plaintext highlighter-rouge">serialize_transformer_pipeline</code></li>
</ul>
<h3 id="new-exceptions">New Exceptions</h3>
<p>We introduced some Easydata-specific exceptions. We had previously been using generic ones.</p>
<ul>
<li>EasydataError: base for all other exceptions</li>
<li>ValidationError: hash check failed</li>
<li>ObjectCollision: object already exists in object store (more general than a FileExistsError)</li>
<li>NotFoundError: object not found in object store (more general than a FileNotFoundError)</li>
</ul>
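<p>In code, that hierarchy looks roughly like this (a sketch based only on the list above; the docstrings are mine):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class EasydataError(Exception):
    """Base class for all Easydata-specific exceptions."""

class ValidationError(EasydataError):
    """Raised when a hash check fails."""

class ObjectCollision(EasydataError):
    """Raised when an object already exists in the object store."""

class NotFoundError(EasydataError):
    """Raised when an object is not found in the object store."""
</code></pre></div></div>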
<h3 id="force-flags-and-other-misnamed-options">“force” flags and other misnamed options</h3>
<p><code class="language-plaintext highlighter-rouge">force</code> was a terrible name for an option flag, as it meant something slightly different
to every function, leading to some odd bugs. It has been replaced in most cases with a clearer name:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Dataset.dump(force=True)</code> -> <code class="language-plaintext highlighter-rouge">Dataset.dump(exists_ok=True)</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.traverse()</code>: <code class="language-plaintext highlighter-rouge">force</code> -> <code class="language-plaintext highlighter-rouge">exhaustive</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.generate()</code> <code class="language-plaintext highlighter-rouge">force</code>-><code class="language-plaintext highlighter-rouge">exhaustive</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.add_source()</code>: <code class="language-plaintext highlighter-rouge">force</code>->overwrite_catalog</li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.add_edge()</code>: <code class="language-plaintext highlighter-rouge">force</code>->overwrite_catalog</li>
</ul>
<p>We also cleaned up some other misnamed options:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.generate()</code>: <code class="language-plaintext highlighter-rouge">write_catalog</code>-><code class="language-plaintext highlighter-rouge">write_dataset</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.process_edge()</code>: <code class="language-plaintext highlighter-rouge">write_catalog</code>-><code class="language-plaintext highlighter-rouge">write_datsets</code></li>
</ul>
<h3 id="adding-datasets">Adding Datasets</h3>
<p><code class="language-plaintext highlighter-rouge">src.data.add_dataset()</code> is deprecated. It had two forms:</p>
<ul>
<li>From the dataset itself: now handled by <code class="language-plaintext highlighter-rouge">dataset.update_catalog()</code></li>
<li>Using the <code class="language-plaintext highlighter-rouge">from_datasource</code> option: Now <code class="language-plaintext highlighter-rouge">Dataset.from_datasource()</code></li>
</ul>
<h3 id="changing-the-log-level">Changing the log level</h3>
<p><code class="language-plaintext highlighter-rouge">src.log.debug</code> is now gone (it did not work correctly anyway). Set the <code class="language-plaintext highlighter-rouge">LOGLEVEL</code> environment variable instead.</p>
<p>I’m sure there are many other changes I forgot about, but these should get you going (and the <a href="https://cookiecutter-easydata.readthedocs.io">Easydata documentation</a> and docstrings should get you the rest of the way!)</p>
<h1 id="git-friendly-catalog-format">Making a git-friendly Catalog Format</h1>
<p><em>2021-05-25</em></p>
<p>TL;DR: API lessons learned from a year of building (and using) Easydata.</p>
<p>After a year of using it, I’d say we got a lot of things right in <a href="https://github.com/hackalog/easydata">Easydata</a>. We made it to our 1.0 release last summer (introducing the <code class="language-plaintext highlighter-rouge">Dataset.load()</code> API), and, over the course of several workshops, hammered out a set of changes for working with large datasets (remote data and the EXTRA API), and private data.</p>
<p>That said, we also got a few things wrong, and now it’s time to go ahead and fix one of those things: making the catalog format more git-friendly. This is a breaking change, so this change will form the start of what will become Easydata 2.0, the rest of which will be documented in my <a href="/api-changes">next post</a>.</p>
<h2 id="on-disk-catalog-format">On-disk Catalog Format</h2>
<p>When we first picked an on-disk catalog format, we hadn’t thought about designing for minimizing potential git conflicts. Since a git workflow is a fairly core piece of <a href="https://github.com/hackalog/easydata">Easydata</a>, we’re going to right that wrong.</p>
<p>Following in the “implement the obvious thing first” philosophy, our initial serialization format for the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> hypergraph was a pair of JSON files: one for the datasets (nodes), and one for the transformers (edges).</p>
<p>While this works in practice, it has a downside when we use Easydata in a busy workshop: it’s a ripe source of <strong>git conflicts</strong>. How? Well, when two participants both make changes to their respective data catalogs, there’s a strong possibility of a git conflict when someone goes to merge those changes.</p>
<p>For 2.0, we’re going to try a straightforward format change. Instead of a catalog consisting of multiple JSON files with one entry per node/edge, let’s make the catalog consist of <strong>multiple directories</strong>, and have one <em>file</em> per node/edge. Dataset names are necessarily unique (as are transformer names, though it’s <em>much</em> less common to refer to them by name), so it seems natural. In other words, instead of</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">catalog/datasets.json</code></li>
<li><code class="language-plaintext highlighter-rouge">catalog/transformers.json</code></li>
</ul>
<p>we now have</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">catalog/datasets/*.json</code></li>
<li><code class="language-plaintext highlighter-rouge">catalog/transformers/*.json</code></li>
</ul>
<p>As an added bonus, the Catalog class can be used wherever we need a catalog of serializable objects; e.g. for our <code class="language-plaintext highlighter-rouge">DataSource</code> objects as well.</p>
<h2 id="the-catalog-class">The Catalog class</h2>
<p>A <code class="language-plaintext highlighter-rouge">Catalog</code> object is a serializable, disk-backed git-friendly dict-like object for storing a data catalog.</p>
<ul>
<li><strong>serializable</strong> means anything stored in the catalog must be serializable to/from JSON.</li>
<li><strong>disk-backed</strong> means all changes are reflected immediately in the on-disk serialization.</li>
<li><strong>git-friendly</strong> means this on-disk format can be easily maintained in a git repo (with minimal
issues around merge conflicts), and</li>
<li><strong>dict-like</strong> means programmatically, it acts like a Python <code class="language-plaintext highlighter-rouge">dict</code>.</li>
</ul>
<p>On disk, a Catalog is stored as a directory of JSON files, one file per object. The stem of the filename (e.g. <code class="language-plaintext highlighter-rouge">stem.json</code>) is the key (name) of the catalog entry in the dictionary, so <code class="language-plaintext highlighter-rouge">catalog/key.json</code> is accessible through the API as <code class="language-plaintext highlighter-rouge">catalog['key']</code>.</p>
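<p>In practice, a Catalog session looks something like this (a sketch; the entry contents are made up):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> from src.data import Catalog
>>> c = Catalog.load('datasets')            # reads catalog/datasets/*.json
>>> c['my_dataset'] = {"hash": "sha1:..."}  # hypothetical entry; the assignment
...                                         # immediately writes catalog/datasets/my_dataset.json
>>> 'my_dataset' in c
True
</code></pre></div></div>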
<p>Making this change let us deprecate a whole lot of arbitrary methods in our API, which got the ball rolling on a massive <a href="/api-changes">API cleanup</a>. More about that soon.</p>
<h1 id="cache-is-magic-post">Cache is Magic</h1>
<p><em>2020-05-06</em></p>
<p>TL;DR: Caching is finicky, but magical when you get it right.</p>
<h2 id="cache-is-magic">Cache is Magic</h2>
<p>My self-declared milestone for an <a href="https://github.com/hackalog/easydata">Easydata</a> 1.0 release is being able to do this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> ds = Dataset.load('dataset_name')
</code></pre></div></div>
<p>And having it “just work” regardless of whether the <code class="language-plaintext highlighter-rouge">Dataset</code> is already on disk, or if it needs to be regenerated by traversing the <a href="/dataset-graph">DatasetGraph</a> and regenerating some or all of the intermediate <code class="language-plaintext highlighter-rouge">Dataset</code> objects (including raw data fetches, if necessary).</p>
<p>A magical component of this generation is the <strong>caching</strong>: if a <code class="language-plaintext highlighter-rouge">Dataset</code> is on disk and matches the hashes recorded in the <code class="language-plaintext highlighter-rouge">Dataset</code> catalog, the generation step is skipped. Seems easy enough, but as with most things in software, “do what I mean” turns out to be much, much harder than I secretly hoped. The good news is, the implementation is starting to just work.</p>
<p>After much usability wrangling, here’s how we cache <code class="language-plaintext highlighter-rouge">Dataset</code> objects in <a href="https://github.com/hackalog/easydata">Easydata</a>.</p>
<h3 id="datasets-and-metadata">Datasets and Metadata</h3>
<p>Recall, a <code class="language-plaintext highlighter-rouge">Dataset</code> is a set of binary blobs with standard names like <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.target</code>, along with its associated metadata.</p>
<p>Metadata is not an afterthought. It’s an essential component of the <code class="language-plaintext highlighter-rouge">Dataset</code>. Metadata can be anything that is JSON-serializable (in fact, under the hood, it’s just a dict), but usually contains:</p>
<ul>
<li>the <code class="language-plaintext highlighter-rouge">.DESCR</code> (readme) text, describing what this dataset is all about.</li>
<li>the <code class="language-plaintext highlighter-rouge">.LICENSE</code>, listing the conditions under which this data can be used.</li>
<li><code class="language-plaintext highlighter-rouge">.HASHES</code>: hash values for each of the binary attributes like data and target (essential for data provenance)</li>
<li>Any other information that you want to keep with the data itself, and preserve through the <code class="language-plaintext highlighter-rouge">Dataset</code> transformation process.</li>
</ul>
<p>Though under the hood it’s implemented as a dict, we steal a great idea from the sklearn <a href="https://github.com/adrinjalali/scikit-learn/blob/bea2e2414f93fdf4558f1288377d2aa0351727b4/sklearn/utils/__init__.py#L60-L80">Bunch</a> object and tweak it a bit to make metadata access easier. In addition to the standard dictionary-style access, metadata is accessible by referring to <strong>uppercase</strong> property names; e.g. <code class="language-plaintext highlighter-rouge">ds.LICENSE</code> returns the metadata stored at <code class="language-plaintext highlighter-rouge">ds.metadata['license']</code>.</p>
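<p>Here’s a minimal sketch of that Bunch-style trick (illustrative only, not the actual Easydata implementation):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Bunchlike(dict):
    """Illustrative sketch: expose dict keys as UPPERCASE attributes,
    so md.LICENSE returns md['license']."""
    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails
        if name.isupper():
            try:
                return self[name.lower()]
            except KeyError as err:
                raise AttributeError(name) from err
        raise AttributeError(name)
</code></pre></div></div>

<p>With that in place, <code class="language-plaintext highlighter-rouge">md.LICENSE</code> and <code class="language-plaintext highlighter-rouge">md['license']</code> refer to the same entry.</p>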
<p>It’s important (as you’ll see in a second) that this metadata is both hashable and JSON-serializable.</p>
<h3 id="how-caching-works-in-easydata">How caching works (in <a href="https://github.com/hackalog/easydata">Easydata</a>)</h3>
<p>The global <code class="language-plaintext highlighter-rouge">Dataset</code> catalog is a dictionary of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{dataset_name: str, dataset_metadata:dict}
</code></pre></div></div>
<p>Caching works by hashing the metadata dictionary (which includes the data hashes) and using this hash as a filename for the cached copy of the dataset. Caches are stored in <code class="language-plaintext highlighter-rouge">paths['cache_path']</code>, and consist of a pair of files: <code class="language-plaintext highlighter-rouge">dataset_name.dataset</code> and <code class="language-plaintext highlighter-rouge">dataset_name.metadata</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.metadata
-rw-r--r-- 1 hackalog staff 301636175 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.dataset
-rw-r--r-- 1 hackalog staff 474 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.metadata
</code></pre></div></div>
<p>The .dataset file is a joblib serialization of the <code class="language-plaintext highlighter-rouge">Dataset</code> object. The .metadata file is a JSON file containing just the metadata dictionary, useful if we don’t want to spend the time to load the whole dataset just to get at its hashes, say.</p>
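<p>Conceptually, the cache filename computation is just this (a sketch; Easydata’s actual serialization details may differ):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
import json

def cache_stem(metadata):
    """Hash the (JSON-serializable) metadata dict to get the cache filename stem."""
    blob = json.dumps(metadata, sort_keys=True).encode('utf-8')
    return hashlib.sha1(blob).hexdigest()

# cache hit if both cache_stem(md) + '.dataset' and
# cache_stem(md) + '.metadata' already exist on disk
</code></pre></div></div>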
<p>Once in a while, a <code class="language-plaintext highlighter-rouge">Dataset</code> is in a polished enough form that we dump it directly to a named <code class="language-plaintext highlighter-rouge">Dataset</code> in the <code class="language-plaintext highlighter-rouge">paths['processed_data_path']</code> directory. We often do this at the end of a data cleaning session, or after an analysis. The idea being that we can blow away the <code class="language-plaintext highlighter-rouge">paths['interim_data_path']</code> or <code class="language-plaintext highlighter-rouge">paths['cache_path']</code> directory to get back disk space, and still have our generated <code class="language-plaintext highlighter-rouge">Dataset</code> objects available.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 beer_review_all.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 beer_review_all.metadata
</code></pre></div></div>
<p>Note, these are exactly the same as their associated cache files: <code class="language-plaintext highlighter-rouge">1b1adb100d8088955878a9d7b3d071710c2db3bf.{dataset|metadata}</code></p>
<p>The end result is that we can accumulate multiple versions of a <code class="language-plaintext highlighter-rouge">Dataset</code> in the cache directory, and continue to use them so long as we have the disk space.</p>
<p>At some point, we’d love for this cache to be shared within a workgroup, but that’s a feature for another day.</p>
<h1 id="implementing-the-datasetgraph">Implementing the DatasetGraph</h1>
<p><em>2020-05-04</em></p>
<p>TL;DR: How the Dataset DAG became a hypergraph became the DatasetGraph.</p>
<h2 id="datasetgraph-as-a-top-level-object">DatasetGraph as a top-level object.</h2>
<p>Recall from a <a href="/transformers-and-datasets">few weeks ago</a>, I described a bipartite graph (or Hypergraph), now called a <code class="language-plaintext highlighter-rouge">DatasetGraph</code>, which describes how <code class="language-plaintext highlighter-rouge">Dataset</code> objects are generated from other <code class="language-plaintext highlighter-rouge">Dataset</code> objects. I originally named it a <code class="language-plaintext highlighter-rouge">TransformerGraph</code>, because that’s how the directionality of the edges works out in the bipartite representation, but that turns out to be a little more confusing for the user. In the hypergraph, the <code class="language-plaintext highlighter-rouge">Dataset</code> objects are the nodes, so <code class="language-plaintext highlighter-rouge">DatasetGraph</code> it is.</p>
<p>One of the unintended consequences of introducing a <code class="language-plaintext highlighter-rouge">DatasetGraph</code> class in <a href="https://github.com/hackalog/easydata">Easydata</a> is that it turns out to be the right place to do a lot of things. That’s why we ended up exposing it to the user, instead of just using it internally to the <code class="language-plaintext highlighter-rouge">Dataset</code>.</p>
<p>Before we created the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>, we used to have a top-level function <code class="language-plaintext highlighter-rouge">add_transformer()</code> to add a dataset transformation to the global catalog, but it turns out a much more natural place to put it is in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> class directly.</p>
<p>Sticking with the “edges are functions, nodes are datasets” hypergraph terminology, the API becomes something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> dag = DatasetGraph()
>>> xp = create_transformer_pipeline([list, of, transformer, functions, or, partials, ...])
>>> dag.add_source(datasource_name="dsrc_name", datasource_opts={}, output_dataset="dset_name")
>>> dag.add_edge(input_dataset=None, input_datasets=(),
output_dataset=None, output_datasets=(),
transformer_pipeline=xp, **kwargs)
>>> dataset = dag.generate('node_name')
</code></pre></div></div>
<p>This gives us a clean separation between adding a source node to the graph, and adding an edge. Both technically add edges, but the idea of a “source edge” in this hypergraph just feels weird, so the details are hidden by this API. This is perhaps why describing the dependency graph as a bipartite graph is less troublesome. See <a href="/transformers-and-datasets">my last post</a> for more on that.</p>
<h1 id="building-transformers-and-datasets">Building Transformers and Datasets</h1>
<p><em>2020-04-13</em></p>
<p>TL;DR: Easydata’s Dataset dependency hypergraph, described.</p>
<h3 id="hypergraph-or-bipartite-graph">Hypergraph or Bipartite Graph?</h3>
<p>For this post, I’m still talking about the hypergraph of data dependencies that I mentioned <a href="/dataset-dag">last time</a>, however for this discussion, I’ll switch from a hypergraph-based description to a bipartite graph-based description of the dependencies.</p>
<p>Why? For starters, there’s not necessarily a commonly accepted notion of a <strong>directed hypergraph</strong>.
When I use the term, I mean a hypergraph where the vertices of an edge are partitioned into two sets: the <strong>head-set</strong> and <strong>tail-set</strong> of the edge.</p>
<p>It’s perhaps interesting (and often surprising) to note the constructs that appear when trying to describe data flow as a directed hypergraph. In our case, we often end up with a hypergraph where data originates from a transformer function (like when we have synthetic, or downloaded data). This leads to a directed hyperedge with <strong>no input nodes</strong>, only output nodes; i.e. the head-set is empty, but the tail-set is not. What does one even call that? A <strong>source edge</strong>?</p>
<p>Anyway, to avoid some of these rabbit holes, we can switch to a <strong>bipartite graph</strong> representation of this construct. These representations (hypergraph, bipartite graph) are interchangeable. To construct this bipartite graph, list the transformers (the hyper “edges”) down one side of the page, Datasets (the hyper “nodes”) down the other, and join them with directed edges to indicate data dependencies (<strong>inbound edges</strong> to a transformer are <strong>input datasets</strong>, <strong>outbound edges are output datasets</strong>).</p>
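<p>As a purely illustrative sketch of that construction (the dataset and transformer names here are invented for the example), using plain Python dicts:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Transformers on one side of the bipartite graph, Datasets on the other;
# directed edges (here, just lists) express the data dependencies.
transformers = {
    "train_test_split": {                                   # hypothetical edge
        "input_datasets": ["reviews"],                      # inbound edges
        "output_datasets": ["reviews.train", "reviews.test"],  # outbound edges
    },
}
datasets = {"reviews", "reviews.train", "reviews.test"}
</code></pre></div></div>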
<h3 id="more-on-the-dataset-graph">More on the Dataset Graph</h3>
<p>A <code class="language-plaintext highlighter-rouge">Dataset</code> is an on-disk object representing a point-in-time snapshot (a cached copy) of data and its associated metadata. The <code class="language-plaintext highlighter-rouge">Dataset</code> objects themselves are serialized to <code class="language-plaintext highlighter-rouge">data/processed</code>. Metadata about these objects are serialized to <code class="language-plaintext highlighter-rouge">catalog/datasets.json</code>.</p>
<p>A <code class="language-plaintext highlighter-rouge">Transformer</code> is a function that takes in <strong>zero or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects, and produces <strong>one or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects. While the functions themselves are stored in the source module (by default in <code class="language-plaintext highlighter-rouge">src/user/transformers.py</code>), metadata describing these functions and their inputs/outputs <code class="language-plaintext highlighter-rouge">Dataset</code> objects are serialized to the catalog file <code class="language-plaintext highlighter-rouge">catalog/transformers.json</code>.</p>
<p>We’ll define the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> as the bipartite graph formed by the two distinct sets of vertices above: <code class="language-plaintext highlighter-rouge">Dataset</code> objects, and <code class="language-plaintext highlighter-rouge">Transformer</code> functions. The edges of this graph are directed, indicating the direction of dependency from the perspective of the <code class="language-plaintext highlighter-rouge">Transformer</code>; i.e. since <code class="language-plaintext highlighter-rouge">output_datasets</code> depend on <code class="language-plaintext highlighter-rouge">input_datasets</code>, arrows are directed from input <code class="language-plaintext highlighter-rouge">Dataset</code> objects to <code class="language-plaintext highlighter-rouge">Transformer</code> functions, and from <code class="language-plaintext highlighter-rouge">Transformer</code> functions to output <code class="language-plaintext highlighter-rouge">Dataset</code> objects.</p>
<p>The whole goal of this exercise is to capture the information about the data transformations from raw data to processed data, <strong>in a way that can be serialized to disk</strong>, and committed as if it was code. These instructions are stored in the data catalog in JSON format. There is some trickiness here, as function objects don’t serialize in a platform-independent way, so we just make some assumptions about namespaces (we set up a standard location in the python module for user-generated functions: <code class="language-plaintext highlighter-rouge">src.user.*</code>), and use Python introspection to map the serialization to function objects when the pipeline is loaded.</p>
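<p>A minimal sketch of that introspection step, assuming serialized names are fully qualified (e.g. <code class="language-plaintext highlighter-rouge">src.user.my_func</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import importlib

def resolve_function(qualname):
    """Map a serialized function name back to a function object;
    e.g. 'src.user.my_func' -> the my_func function object."""
    module_name, func_name = qualname.rsplit('.', 1)
    return getattr(importlib.import_module(module_name), func_name)
</code></pre></div></div>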
<h3 id="transformer-serialization">Transformer Serialization</h3>
<p>Note that <strong>transformers can take zero datasets as input</strong> (but must produce at least one output). This special case occurs in one of two ways:</p>
<ul>
<li><strong>Synthetic Data</strong>: The data is synthetic, and the transformer actually generates a <code class="language-plaintext highlighter-rouge">Dataset</code> object from scratch. The JSON in this case looks like:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "synthetic_source": {
"output_dataset: "ds_name",
"transformations": [
(synthetic_generator, kwargs_dict),
(optional_function_2, kwargs_dict_2 ),
...
],
}
</code></pre></div> </div>
</li>
<li><strong>Data Conversion</strong>: The data originates from something that isn’t a <code class="language-plaintext highlighter-rouge">Dataset</code> (e.g. a DataSource object), and the transformer converts it to a <code class="language-plaintext highlighter-rouge">Dataset</code>. This is really no different than the synthetic data case, except we supply a <code class="language-plaintext highlighter-rouge">dataset_from_datasource()</code> wrapper so the user doesn’t have to constantly reimplement it:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "datasource_edge": {
"output_dataset: "ds_name",
"transformations": [
        (dataset_from_datasource, {"datasource_name": datasource_name, **datasource_opts}),
(optional_function_2, kwargs_dict_2 ),
...
],
}
</code></pre></div> </div>
</li>
</ul>
<p>In all other cases, a transformer consumes one or more <code class="language-plaintext highlighter-rouge">Dataset</code> objects as input, and emits one or more as output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "hyperedge": {
"input_datasets":[in_dset_1, in_dset_2],
"output_datasets":[out_dset_1, out_dset_2],
"transformations": [
(function_1, kwargs_dict_1 ),
(function_2, kwargs_dict_2 ),
...
],
"suppress_output": False, # defaults to True
},
</code></pre></div> </div>
<h3 id="dataset-serialization">Dataset Serialization</h3>
<p>A complete <code class="language-plaintext highlighter-rouge">Dataset</code> object contains both the data itself and an associated metadata dictionary. On disk, this is serialized to two files, typically located in <code class="language-plaintext highlighter-rouge">paths['processed_data_path']</code>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataset_name.dataset</code>: The complete <code class="language-plaintext highlighter-rouge">Dataset</code> Object</li>
<li><code class="language-plaintext highlighter-rouge">dataset_name.metadata</code>: A copy of the metadata portion of the <code class="language-plaintext highlighter-rouge">Dataset</code>. As the <code class="language-plaintext highlighter-rouge">Dataset</code> can be quite large, metadata-only operations save time and memory by reading this file instead. If the <code class="language-plaintext highlighter-rouge">Dataset</code> has been reproducibly generated, this metadata should match whatever is serialized into the dataset catalog.</li>
</ul>
<p>One of the design goals of <a href="https://github.com/hackalog/easydata">Easydata</a> is that this processed dataset can be deleted at any time and (reproducibly and deterministically) recreated when needed.</p>
<h3 id="dataset-metadata">Dataset Metadata</h3>
<p>The master copy of the generated metadata is stored in the catalog file: <code class="language-plaintext highlighter-rouge">catalog/datasets.json</code>.</p>
<p><code class="language-plaintext highlighter-rouge">Dataset</code> metadata is fairly freeform. It is based on scikit-learn’s <a href="https://github.com/adrinjalali/scikit-learn/blob/bea2e2414f93fdf4558f1288377d2aa0351727b4/sklearn/utils/__init__.py#L60-L80">Bunch</a> object (basically a dictionary where the keys can be accessed as attributes). This object typically contains 4 attributes: <code class="language-plaintext highlighter-rouge">.data</code>, <code class="language-plaintext highlighter-rouge">.target</code> (which is often None for unsupervised learning problems), <code class="language-plaintext highlighter-rouge">.metadata</code>, and <code class="language-plaintext highlighter-rouge">.hashes</code>. The latter contains a hash of all the non-<code class="language-plaintext highlighter-rouge">metadata</code> attributes of the <code class="language-plaintext highlighter-rouge">Dataset</code>; e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "hashes": {
"data":"sha1:d1d5ac9a5872e09b3a88618177dccc481df022d1",
"target":"sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b",
},
</code></pre></div></div>
<p>where data and target are whatever data type makes sense for the problem at hand (e.g. matrix, pandas DataFrame, NumPy array, etc.)</p>
<h1 id="dataset-dependency-graph">Building a Dataset Dependency Graph for Easydata</h1>
<p><em>2020-03-30</em></p>
<p>TL;DR: We thought we were building a graph of dependencies. Turns out we had a hypergraph.</p>
<h2 id="building-a-dataset-dependency-graph-for-easydata">Building a Dataset Dependency Graph for Easydata</h2>
<p>One of our design goals for Easydata is to be able to start an analysis like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> ds = Dataset.load(f"covid-19-nlp-{date}", date="2020-04-01")
</code></pre></div></div>
<p>This would do a bunch of magic under the hood:</p>
<ul>
<li>(<strong>Caching</strong>) it would check if a cached version of the dataset already exists, (returning this cached copy if so). Otherwise</li>
<li>(<strong>Dataset Generation</strong>) it would generate any intermediate files needed to generate this dataset (all the way back to the raw data, if need be), then apply a sequence of <strong>transformer functions</strong> to turn the raw data into a processed dataset, then</li>
<li>(<strong>Check Hashes</strong>) it would hash and check the generated datasets to ensure that, if this command had been previously executed, nothing about my generated dataset has changed</li>
</ul>
<p>Until recently, I referred to this process as the <strong>Dataset Dependency DAG</strong>, assuming that it would be implemented as a directed acyclic graph (DAG); i.e. edges would be transformer functions that look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(input_dset: Dataset, **kwarg) -> Dataset
</code></pre></div></div>
<p>and nodes would be <code class="language-plaintext highlighter-rouge">Dataset</code> objects. These could be easily pipelined together, as all transformers consumed and generated the same data type, and the remaining kwargs could be serialized to a json blob for the <a href="https://github.com/hackalog/easydata">Easydata</a> catalog, so Bob’s your uncle.</p>
<h3 id="well-duh">Well, DUH</h3>
<p>Unfortunately, when we started looking at our collection of real-world examples of transformer functions (see <a href="https://github.com/acwooding/reproallthethings">reproallthethings</a>), we came to the conclusion that what we had wasn’t a <strong>directed graph</strong> of data dependencies, it was a <strong>directed hypergraph</strong>, as our real-world collection of data transformations includes such functions as:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">train-test-split</code>: takes as input a single dataset and produces two children: <strong>one parent, two children</strong>.</li>
<li><code class="language-plaintext highlighter-rouge">augment-dataset</code>: takes as input two datasets and joins them along various axes: <strong>two parents, one child</strong></li>
<li><code class="language-plaintext highlighter-rouge">subsample-dataset</code>: Takes as input a dataset and produces a smaller dataset by subsampling the rows: <strong>one parent, one child</strong>.</li>
</ul>
<p>In its most general form, therefore, a dataset transformer function <em>takes in an arbitrary number of datasets, and produces an arbitrary number of datasets</em>; i.e. a transformer function is a <strong>hyperedge</strong>, not an edge, and so my data dependencies are best described by a “Directed Acyclic Hypergraph” (DAH).</p>
<p>Unfortunately, a “DAH” doesn’t have the same ring to it as DAG. I complained about this online, and a colleague fixed this problem for me:</p>
<blockquote>
<p>Obviously acyclic generalizes to a different concept in hypergraphs than what you have. The correct term for the lack of cycles in your hypergraph is “uncyclic”, so, um … DUH</p>
</blockquote>
<p>With the <a href="https://martinfowler.com/bliki/TwoHardThings.html">hardest computer science problem</a> already out of the way, we came to the next hurdle. I don’t have a handy mental model for how to implement this <code class="language-plaintext highlighter-rouge">Dataset</code> hypergraph (DUH) in python: the actual data structures and algorithms I’ll use to issue the sequence of data transformation calls, or the APIs I’ll need to be able to chain these transformer functions together in a pipeline.</p>
<p>Whereas before, I could insist that a transformer be a function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(input_dset: Dataset, **kwargs) -> Dataset
</code></pre></div></div>
<p>and just chain these together, now I have to something a little more… hyper.
Here’s my current thinking:</p>
<h3 id="serializing-the-dataset-hypergraph">Serializing The Dataset Hypergraph</h3>
<p>A <em>transformer function</em> takes in <em>input_datasets</em> and produces <em>output_datasets</em>.</p>
<p>Edges can be thought of as directed (parent to child), indicating a dependency. e.g. <em>output_datasets</em> depend on <em>input_datasets</em>, with an edge from one set to the other.</p>
<p>This will be serialized in the dataset catalog as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"hyperedge_1": {
"input_datasets":[],
"output_datasets":[],
"transformations": [
(function_1, kwargs_dict_1 ),
(function_2, kwargs_dict_2 ),
...
],
"suppress_output": False, # defaults to True
},
"source_edge_1": {
"datasource_name": "ds_name",
"datasource_opts": {},
"output_dataset: "ds_name",
}
}
</code></pre></div></div>
<p>Notice that source nodes are actually just 1-1 edges (one datasource in, one dataset out). This is convenient from an implementation perspective.</p>
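<p>One way to cash in on that convenience (a sketch of my thinking, not actual Easydata code) is to normalize every catalog entry into hyperedge form on load: a source edge becomes a zero-input, one-output hyperedge whose only transformation loads the datasource. The <code class="language-plaintext highlighter-rouge">load_datasource</code> name is a hypothetical placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict

def normalize_catalog(catalog: Dict[str, dict]) -> Dict[str, dict]:
    """Rewrite source edges as hyperedges so every entry has one shape.

    Field names follow the serialization sketch above; 'load_datasource'
    is a hypothetical transformer that wraps a raw datasource.
    """
    normalized = {}
    for name, entry in catalog.items():
        if "datasource_name" in entry:  # a source edge
            normalized[name] = {
                "input_datasets": [],
                "output_datasets": [entry["output_dataset"]],
                "transformations": [
                    ("load_datasource", {
                        "datasource_name": entry["datasource_name"],
                        **entry.get("datasource_opts", {}),
                    }),
                ],
            }
        else:  # already a hyperedge
            normalized[name] = entry
    return normalized
</code></pre></div></div>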
<h3 id="the-transformer-api">The Transformer API</h3>
<p>Putting all this together, then, <strong>transformer functions</strong> are functions that take in <strong>zero or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects and produce <strong>one or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects, with the API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(dsdict: Dict(str,Dataset), **kwargs) -> Dict(str,Dataset)
</code></pre></div></div>
<p>Here we use kwargs to map function variables to key names in the dsdict as needed. This string-based approach is necessary because the kwargs dict must be serializable to disk (as a JSON dump) to be used in the <a href="https://github.com/hackalog/easydata">Easydata</a> catalog.</p>
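<p>As a concrete (and hedged) example, a train/test split under this API might look like the following sketch. The key names, defaults, and <code class="language-plaintext highlighter-rouge">Dataset</code> stub are my assumptions, not the real Easydata interface:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict, List

class Dataset:
    """Hypothetical stub, as in the earlier sketches."""
    def __init__(self, data: List):
        self.data = data

def train_test_split(dsdict: Dict[str, Dataset],
                     source_key: str = "raw",
                     train_key: str = "train",
                     test_key: str = "test",
                     test_fraction: float = 0.25) -> Dict[str, Dataset]:
    # Dataset references are string keys, so every kwarg here is a plain
    # JSON-serializable value, as the catalog requires.
    dset = dsdict[source_key]
    n_test = int(len(dset.data) * test_fraction)
    return {
        train_key: Dataset(dset.data[n_test:]),
        test_key: Dataset(dset.data[:n_test]),
    }

# Usage: one dsdict in, one dsdict out, ready for the next hyperedge.
out = train_test_split({"raw": Dataset(list(range(100)))}, test_fraction=0.2)
</code></pre></div></div>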
<h3 id="this-is-a-work-in-progress">This is a Work In Progress</h3>
<p>Of course, there are a few outstanding items from this little brainstorm:</p>
<ul>
<li>Does the generalization of the transformer API actually work? Can they be chained together in the way I intend? It works in my head, but my head isn’t Turing complete.</li>
<li>What’s the hypergraph traversal algorithm? I.e., I want the list of transformers (hyperedges) traversed from the sources to any named node in the graph. What’s the directed hypergraph equivalent of a depth-first or breadth-first search here? Just do it on the induced bipartite graph and stop when my list of nodes has been covered? (See the sketch after this list.)</li>
</ul>
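<p>And here’s one hedged guess at that traversal, assuming the catalog has been normalized into hyperedge form as sketched earlier: treat it as the bipartite (dataset, hyperedge) graph it induces, and do a depth-first postorder from the target back toward the sources:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict, List, Set

def edges_to_build(catalog: Dict[str, dict], target: str) -> List[str]:
    """Return hyperedge names, in executable order, needed to build target.

    A depth-first postorder on the bipartite (dataset, hyperedge) graph.
    Assumes the catalog is uncyclic (DUH) and that each dataset is the
    output of at most one hyperedge. (A sketch, not the Easydata code.)
    """
    # Invert the catalog: which hyperedge produces each dataset?
    producer: Dict[str, str] = {}
    for edge_name, edge in catalog.items():
        for out in edge["output_datasets"]:
            producer[out] = edge_name

    ordered: List[str] = []
    seen: Set[str] = set()

    def visit(dset_name: str) -> None:
        edge_name = producer.get(dset_name)
        if edge_name is None or edge_name in seen:
            return  # a raw source, or an edge we've already scheduled
        seen.add(edge_name)
        for parent in catalog[edge_name]["input_datasets"]:
            visit(parent)
        ordered.append(edge_name)

    visit(target)
    return ordered
</code></pre></div></div>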
<p>Let’s <a href="/transformers-and-datasets">implement it and see</a>.</p>HackalogWe thought we were building a graph of dependencies. Turns out we had a hypergraph.The LLF Guide to Remote Work2020-03-08T00:00:00+00:002020-03-08T00:00:00+00:00http://hackalog.github.io/remote-work<p>A colleague of ours recently asked us about the challenges of doing
“remote work”. Obviously, in the current health environment, a lot of
organizations are considering doing remote work for the first
time. For those of us fortunate enough to be knowledge workers,
working from home is pretty accessible. At <a href="https://learnleapfly.com">Learn Leap Fly</a>, we’ve been
doing remote work since the company started in 2015. We built our
model by borrowing heavily from companies far more experienced
than us (e.g. <a href="https://stephyiu.com/2019/02/17/behind-the-scenes-culture-and-tools-of-remote-work-at-automattic/">Automattic</a>, <a href="https://www.youtube.com/watch?v=83fk3RT8318">Mozilla</a>, <a href="https://basecamp.com/guides/how-we-communicate">Basecamp</a>), and then iterating on
those practices based on our own experiences.</p>
<p>Here are some of the things we’ve learned, both as remote workers, and as remote team managers,
over the last 5 years.</p>
<h2 id="on-culture">On Culture</h2>
<p>Be intentional and deliberate about <strong>fostering your team
culture</strong>. One of the big liabilities of remote work is that it’s easy
to lose the human side of interactions. To keep your culture in a
remote work environment, you have to be intentional and actively
foster it.</p>
<p><strong>Culture is what you do</strong>, not what you say you do. Culture is made up of
the shared practices that you have in common, and what they reflect
about your values as a team. Do you silently tolerate an in-crowd or
do you actively practice and foster open communication channels that
include everyone on your team? Do you favor extroverts, or do you give
a voice to everyone? Remote work is a great opportunity to <strong>reflect on
different ways of working and interacting</strong>. Whatever practices you
adopt will inevitably reflect and determine the culture of your team.</p>
<p>We recommend you use the remote work opportunity to be deliberate
about what you value, to document these values, and develop processes
that nurture and reflect the culture that you wish to have in your
organization.</p>
<h2 id="on-synchronous-vs-asynchronous-communication">On Synchronous vs. Asynchronous Communication</h2>
<p>I’d go out on a limb and say that every successful remote-work
organization has an <strong>asynchronous most-of-the-time</strong> communications
culture. That is, if asynchronous means will get the job done, <strong>do it
asynchronously</strong>. Practically speaking:</p>
<p><strong>Don’t hold a meeting if you can do it another way.</strong> This is a <a href="http://www.paulgraham.com/makersschedule.html">general rule of thumb for
productive work, period</a>. Use remote work as an opportunity to <strong>break your meeting
habit</strong>, and to develop more productive, asynchronous communications
habits.</p>
<p><strong>Recognize the true cost of meetings</strong>. In addition to being hard to schedule (especially across timezones), synchronous meetings are far more costly than anyone remembers to factor in. <a href="https://basecamp.com/guides/how-we-communicate">As Basecamp puts it</a>: “five people in a room for an hour isn’t a one hour meeting, it’s a <strong>five hour</strong> meeting”.</p>
<p>That said, <strong>face-time is important</strong>. Learn Leap Fly has one regular
meeting: the weekly standup. It’s easy to start
feeling lonely and isolated without laying eyes on your coworkers
once in a while.</p>
<h2 id="on-meetings">On Meetings</h2>
<p>If you do want to hold synchronous meetings, here’s our advice:</p>
<ul>
<li><strong>Share the agenda in advance</strong>. For regular meetings with a set
agenda (e.g. Sprint Rollovers) it’s sufficient to share the agenda
once. We keep them on our wiki, and evolve them over time as we
need to.</li>
<li>Use the <strong>highest quality audio and video platform</strong> you can
afford. We use zoom (which has easily the best multi-party video
quality).</li>
<li>Make sure <strong>everyone has good quality headphones and mics</strong>. Don’t
skimp out here. Buy them for your employees if you can.</li>
<li><strong>Connect from someplace quiet</strong>. No coffee shops. In a pinch, use a
car, or even a closet. (But please, don’t use a bathroom. That’s
just… gross.)</li>
<li><strong>Use one screen per person</strong>. Even if more than one person happens
to be working from the same space, require everyone to have an
individual connection to the video chat. This puts everyone on the
same footing. There’s nothing worse than watching an off-camera,
hard-to-hear conversation from across a bad quality
video-feed. One screen per person levels the playing field,
letting everybody <strong>feel like an equal contributor</strong>, regardless
of whether they are remote or local.</li>
<li>If you can’t find a regular meeting time that fits everyone, <strong>alternate meeting times</strong>.
That way, it’s not always the same people who are being left out.</li>
<li><strong>Take and post meeting notes</strong>. Rotate this responsibility to
ensure everyone has a chance to participate at some point.</li>
</ul>
<h2 id="on-email-blogs-and-wikis">On Email, Blogs, and Wikis</h2>
<p><strong>Don’t use email</strong> for internal business communication. Just
don’t. Use it to communicate with those outside your business if you
must, but use an <strong>asynchronous messaging tool</strong> (e.g. Slack, Skype
for Business, Basecamp) for conversations, and a <strong>documentation
platform</strong> (e.g. blogs, wikis) for more permanent team
communications.</p>
<p><strong>Write things down</strong>. <a href="https://basecamp.com/guides/how-we-communicate">Basecamp likes to say</a>:
“Speaking only helps who’s in the room, writing helps everyone.”
Think about the people who couldn’t make it to a meeting, future
employees or contractors, and even <strong>future you</strong>.</p>
<p><strong>Pick the tools that work for you</strong>. Internal blogs are great for
more detailed ongoing posts about your work. A Wiki can be great for
long-form, archived information. Automattic uses a wordpress theme
called P2 to merge their chat, checkins, and blog posts into a
single interface. Atlassian has confluence. Learn Leap Fly uses
MediaWiki and Notion. There are lots of options.</p>
<p><strong>Record daily check-ins</strong>. It’s really easy to lose the serendipitous
advantages that come from running into each other at the office and
chatting about what you’re currently working on. The informal and
unplanned sharing of information is key to productivity and
creativity of teams. Whether this is jotting down a few notes on the
corporate wiki every day, or using the wonderful “automatic
check-in” features of products like Basecamp, set aside a few
minutes at the end of each work day to share what you have been
working on with your colleagues.</p>
<h2 id="digital-watercoolers">Digital Watercoolers.</h2>
<p>Work isn’t always just about working. <strong>Have places where people can
interact informally</strong> and let off steam. At LLF, we have a #feeds channel in slack
so people can share interesting things they’re reading
online. <a href="https://revelry.co/watercooler-channel/">Some companies</a> have a #watercooler or #random
channel for off-topic chat. The point is, people need to interact
about things that aren’t mainline work (and this is a good thing).</p>
<h2 id="the-arc-of-work">The Arc of Work</h2>
<p><strong>Work in sprints</strong>, with a specific goal and specific end date. Ours
are either 2 or 3 weeks long, and we identify specific success
criteria to make sure our sprints aren’t overly
ambitious. Whenever we hit these goals, we have a mini-celebration
at the sprint rollover.</p>
<p><strong>Document your successes, and failures</strong>. We have everyone write up
a sprint report as part of our sprint rollovers. This is a little
post that answers the following questions:</p>
<ul>
<li>What did you set out to do?</li>
<li>What did you actually do?</li>
<li>What’s blocking your progress?</li>
<li>Are there any process changes that would help you?</li>
<li>What’s your morale (1-10)?</li>
</ul>
<p>Finally, organize sprints into <strong>larger arcs</strong>. Ours are roughly 3 months
long, after which we prepare a more detailed summary of what we
accomplished and learned. Basecamp calls these checkins “heartbeats.”
The act of reflecting on a larger arc is really, really useful to keep
you from losing the forest in all those daily trees.</p>
<h2 id="for-the-remote-worker">For the Remote Worker</h2>
<p>Have a <strong>dedicated personal work space</strong>. Home offices are amazing for
productivity. If you don’t have room for an office, create a space somewhere in your house
that you <strong>only use for working</strong>.</p>
<p>Think about <strong>ergonomics</strong>. Invest in a properly set-up desk, a great
chair, monitor stands, and a good keyboard. Companies like Automattic
give stipends for home-office setup costs. This is a great way to help
people build a productive and ergonomic home office.</p>
<p><strong>Get dressed for work</strong>. We don’t mean “dress up.” We mean, “get
changed out of your pajamas”. Having a transition from your “home day”
to your “work day” is important. We’ve heard of people that will go to
a coffee shop first thing, read the paper, and then come back to their house
to start their work day. Whatever works for you, try and establish a
routine around starting, and stopping work for the day.</p>
<p><strong>Have dedicated work hours</strong>. This is as much for other people as it is
for you. Plan when you are going to start, when you are going to stop,
and communicate these times with everyone who needs to know them. At
Learn Leap Fly, we use a shared Google calendar for this.</p>
<p><strong>Speak up!</strong> One of the drawbacks of remote work is that no one can see
you beavering away. Share what you’re doing on the group chat. Post
your daily checkins and weekly updates.</p>
<p><strong>Stay logged in to the group chat</strong> whenever you are working. For
some reason, seeing that little green dot that tells you other
people are online—even if you’re not actively talking to them—is
super comforting when remote working. Stay present, but don’t
constantly check your messages, and get sucked into side
conversations if you’re trying to do deep work. Most messaging tools
let you turn on a “Do not disturb” knob that silences notifications
for a while. Use it.</p>
<h2 id="for-the-remote-team-manager">For the Remote Team Manager</h2>
<p><strong>Trust your team members</strong>. One of the first questions we get
whenever we talk about remote work arrangements is “what do you do if
someone starts slacking off?” They don’t. The whole magic of a
flexible work arrangement is that so long as you are meeting your
objectives, we really don’t need to know how you’re doing
it. Presumably, you already have mechanisms to review work, with
performance reviews and the like. Trust them. If the performance
reviews are broken, fix them. In the meantime, <strong>trust your team members</strong>.</p>
<p><strong>Don’t let your people work too much</strong>. Ironically, with all the
questions around remote workers slacking off, it’s working too much
that often ends up being the real danger. It’s really easy to get
sucked in to working too much when you live in your workspace. Keep
an eye on your workers. Make it a cultural badge of honour to <em>not</em>
work more than 40 hours in a week.</p>
<p><strong>Don’t require a fixed work schedule</strong>. Let people define the hours
that best work for them. Trust them to do the work the way that suits
them best, and you’ll be amazed at the results.</p>
<p><strong>Don’t try and replicate the in-person office experience</strong> remotely. In
fact, you should use the <strong>remote</strong> work experience to <strong>improve your
in-person office work environment</strong>. There are a lot of unique
advantages to remote work. Take advantage of them. Get
your team used to them, and use the change of setting to apply them
back to the office setting. One of our favorites is opening up the
decision-making process and letting more people in to see how
decisions are made in real time. Distributed tools allow everyone to
be in the room, not just “management.”</p>
<h2 id="tools-we-use">Tools We Use</h2>
<p>No talk of remote work would be complete without mentioning the tools
we use. Likely, every remote work scenario will use tools to implement
at least the following functions.</p>
<ul>
<li>Real-time team chat (e.g. Slack, Skype for Business, Mattermost)</li>
<li>Shared calendar (Basecamp, Google Calendar)</li>
<li>Videoconferencing (Zoom, Google Hangouts, Skype)</li>
<li>Information Repository (MediaWiki, Notion, Confluence)</li>
<li>Team blogging platform (Confluence, Wordpress)</li>
</ul>
<p>If you’re a technology shop, you’ll likely also need these:</p>
<ul>
<li>Shared Kanban / Sprint boards (Trello)</li>
<li>Brainstorming Tools (Miro, Mural)</li>
<li>Code Repo (GitHub, Gitlab, BitBucket) and CI</li>
</ul>
<p>Here are some of our favourite tool combinations:</p>
<ul>
<li>The <strong>Free Tier</strong>: slack + mediawiki + zoom + trello + miro + google cal + notion + github/bitbucket</li>
<li>The <strong>All-in-one(ish)</strong>: Basecamp + zoom + slack + github</li>
<li>The <strong>Atlassian</strong>: Confluence + Bitbucket + Trello + slack + zoom.</li>
<li>The <strong>Self-Hoster</strong>: Mattermost + wordpress/P2 + MediaWiki + GitLab + zoom</li>
</ul>
<h2 id="tools-we-use-but-dont-want-to-talk-about-here">Tools we use but don’t want to talk about here</h2>
<ul>
<li><strong>Shared Todo Managers</strong>. Actually, we use them all the time, but
this level of personal productivity tends to be very
personal. We’d recommend you leave this part of the stack up to
the individual. (We currently use Nozbe, though we’ve tried Asana
and OmniFocus as well.)</li>
<li><strong>Time tracking and Reporting</strong>. We use toggl, if it matters.</li>
<li><strong>Customer Relations Management (CRM) tools</strong>. We use Mailchimp.</li>
</ul>
<h2 id="tools-we-dont-use">Tools we don’t use</h2>
<ul>
<li><strong>Ticketing</strong> (e.g. Jira, Zendesk). We’re simply not in that
business. Besides, that’s more of a business function than a
remote-work enabler.</li>
<li><strong>Single Sign-on</strong>. We use a password manager (1Password) and generate unique random, strong passwords on every platform or website we use.</li>
<li><strong>Corporate email</strong>. Hopefully you’re sold on the virtues of <strong>not</strong> using email for team communications.</li>
</ul>
<h2 id="good-remote-work-references">Good Remote Work References</h2>
<p>Don’t take our word for it. Here is some good reading on the various topics covered in this post.</p>
<ul>
<li><a href="https://basecamp.com/books/remote">Remote: Office Not Required</a></li>
<li>Basecamp’s <a href="https://basecamp.com/guides/how-we-communicate">How We Communicate</a></li>
<li><a href="https://www.youtube.com/watch?v=83fk3RT8318">Mozilla Best Practices</a></li>
<li><a href="https://ma.tt/2020/03/coronavirus-remote-work/">Remote Work and the Coronavirus</a></li>
<li><a href="podcasts.apple.com/us/podcast/distributed-with-matt-mullenweg/id1463243282">Distributed with Matt Mullenweg</a> - Great podcast with Automattic’s Founder</li>
<li><a href="https://stephyiu.com/2019/02/17/behind-the-scenes-culture-and-tools-of-remote-work-at-automattic/">Behind the scenes: culture and tools of remote work at Automattic</a></li>
<li><a href="https://revelry.co/watercooler-channel/">Building Remote Office Culture with a Watercooler Channel</a></li>
<li>Extreme Remote Work: <a href="https://www.wired.com/story/what-do-i-do-all-day-livestreamed-technology-ceoing/">Stephen Wolfram’s CEO Livestream</a></li>
<li>Paul Graham on <a href="http://www.paulgraham.com/makersschedule.html">The Maker’s Schedule</a></li>
<li><a href="https://basecamp.com/features/checkins">Auto-checkins in Basecamp</a></li>
</ul>HackalogSome of the things we've learned, both as remote workers, and as remote team managers, in 5 years of remote working.Reproducible Data Science2020-02-20T00:00:00+00:002020-02-20T00:00:00+00:00http://hackalog.github.io/reproducibility<h2 id="missing-pieces">Missing Pieces</h2>
<p>About 2 years ago, <a href="https://github.com/acwooding/">acwooding</a> and I attended a workshop on text analysis,
where a lot of people did some really nice work embedding text into
vector spaces under a variety of algorithms. What we were
working on was trying to establish some stability results;
i.e. whether repeated embeddings under the various algorithms were stable,
or whether the results were all over the place because, for example,
the algorithm was randomized and we had just gotten lucky.</p>
<p>When we sat down to write up the analysis, we discovered really
quickly that we had a problem. Though we still had a collection of
<code class="language-plaintext highlighter-rouge">jupyter</code> notebooks and the associated data blobs, we had <em>no idea</em> how
our collaborators had pre-processed their data to insert into the
process in the first place. We had lost the information about the
preparation of the data, and hence, we’d lost the ability to generate
a consistent set of analyses across all of our data. Our workshop results
weren’t reproducible, and we were going to have to do a bunch of work
over from scratch if we wanted to publish anything.</p>
<p>If you were to survey your average data scientist on how much time they
spend in a given phase of the operation, you’d probably get something
that looks like this:</p>
<p><img src="https://raw.githubusercontent.com/hackalog/bus_number/master/notebooks/references/charts/munge-supervised.png" alt="Supervised Learning" /></p>
<p>In supervised learning, around 2/3 of the time is spent munging the data in
the first place, before you finally get around to doing your analysis.</p>
<p><img src="https://raw.githubusercontent.com/hackalog/bus_number/master/notebooks/references/charts/munge-unsupervised.png" alt="Unsupervised Learning" /></p>
<p>In unsupervised learning problems, it’s more like 90%.</p>
<p>Admittedly, like all statistics, these actual numbers are made up, but
they illustrate a real phenomenon. A vast amount of effort we are
performing as data scientists is happening before we ever get around
to the analysis part.</p>
<h2 id="but-what-about-the-environment">But What About the Environment?</h2>
<p>What we wanted to be able to do was capture that data munging history,
and turn that process into something that is sharable and
reproducible. We started looking at our own past analyses and set out
to create and adopt a more standard workflow that would make it easy to
preserve (and share) the whole process of data science, including the
data munging.</p>
<p>At PyData NYC 2018, we ran a tutorial called “<a href="https://pydata.org/nyc2018/schedule/presentation/46/">Up your Bus
Number</a>: A Reproducible Data Science Workflow.” At that workshop we
were intending to talk a great deal about the munging of data, and the
wonderful and clever APIs that we had settled on to help simplify that
process. When we actually ran the tutorial, it turned out that about
80% of our time was spent before we even got to data munging. It was
spent trying to set up consistent, reproducible environments on a wide
variety of systems. The hard parts of getting to a reproducible data
science pipeline (installing and maintaining your environment), for
most of the people we were encountering, didn’t even show up in the
survey that we did about where your time is spent. Most people knew of
(or even used) tools like anaconda or virtualenv, but not in a way
that let them easily maintain and share these environments, or
reproduce the environments of others.</p>
<p>Even if we could reproduce the data munging, we couldn’t reproduce the
<em>development environment</em>. We have all these fancy tools: anaconda,
virtualenv, the now-deceased pipenv, and any number of wrappers and
scripts that are designed around making it easier to build a custom
python environment that’s tuned for your problem at hand. But actually
using those in a consistent manner is not trivial.</p>
<p>When we talk to people about reproducible data science, everyone wants
it, but almost nobody wants to <em>do</em> it. Most people think we’re
talking about reproducing an analysis, because that’s the easy
part. If we dig in a little further, some will grudgingly speak about
reproducing their data munging. Almost nobody talks about solving the
challenges of reproducible environments, assuming that tools like
conda have already solved that. Yet when we sit down to do the work,
environment and data munging issues dominate the effort.</p>
<h2 id="recognizing-the-hard-parts">Recognizing the Hard Parts</h2>
<p>One of our stated goals is to help make data scientists more
productive. How can we do this? Give them the ability to do their job
with less futzing around with their environments, and make it easy for
them to share their work—including the data munging. The primary means
by which data scientists exchange data science lore is by passing
around <code class="language-plaintext highlighter-rouge">jupyter</code> notebooks. But there’s so much that goes in under the
hood before that <code class="language-plaintext highlighter-rouge">jupyter</code> notebook ever even gets run, that if we don’t
take steps to that additional information—including information about
the environment, the data munging, the metadata associated with the
data sources—then data scientist productivity is lost, and
reproducibility goes right out the window.</p>
<p><img src="https://github.com/alan-turing-institute/the-turing-way/raw/master/book/content/figures/reproducibility/ReproducibleMatrix.jpg" alt="The Reproducibility Matrix" />
<em>Source: <a href="https://the-turing-way.netlify.com/reproducibility/03/definitions.html">The Turing Way</a>. (<a href="https://creativecommons.org/licenses/by/4.0/">CC-BY-4.0</a>)</em></p>
<p>Our challenge is this: if we want reproducible data science—and
that covers the entire spectrum of reproducibility, replicability,
generalizability, and robustness—then the hardest thing we have to
do is <strong>identify what the hard parts are</strong>. The only way to do that is
to repeatedly sit down with people and walk through their
pipelines. As many people as we can. Take their work and attempt to
reproduce it, and in doing so, learn where those barriers to
reproduction actually live: the technical barriers, the UX barriers,
and the psychological barriers. Then, and only then, put in the hard work
of building a toolkit that also solves the psychological and
user-interface problems, and the workflow and API issues that prevent
reproducibility in the first place.</p>HackalogWhat's the hardest part about reproducible data science? Recognizing the hard parts.