<h1 id="reproducible-pdfs">Reproducible PDFs</h1>
<p><em>2022-07-19</em></p>
<p>For the last few weeks, I’ve been putting the final touches on a research report, intended to be published both as a print (like, <em>dead-tree</em>) publication, and as a digital artifact (a PDF, including the sources needed to generate it).</p>
<p>It’s been a lot of back-and-forth, but we’re finally at the point of production. As I happily regenerated the PDF to send to the printer “one last time” (<code class="language-plaintext highlighter-rouge">book-final-edited-fix2-AGAIN-reallyfinal-ugh.pdf</code>), I noticed something odd. I kept getting git conflicts with the PDF, even when the source material wasn’t changing. (I don’t usually check generated files into git, but this particular PDF, being the main output of the project in question, seemed like a reasonable exception to my rule).</p>
<h2 id="down-the-rabbit-hole">Down The Rabbit Hole</h2>
<p>In hunting for the source of these git conflicts, I unwittingly fell down a reproducibility rabbit hole with my PDF generation: <strong>Can I reproducibly generate a PDF from an unchanging source document?</strong></p>
<p>To be more precise, I mean: starting from <em>exactly</em> the same input (a set of markdown documents), can I get the same PDF out? That can’t possibly be a hard problem, can it?</p>
<p>Short answer: <em><strong>it’s a lot harder than I thought</strong></em> (and metadata is to blame).</p>
<h2 id="our-pipeline">Our Pipeline</h2>
<p>We generate our book PDF from a set of markdown source files using <a href="https://pandoc.org/">pandoc</a>. It’s academic writing, so there’s a mix of filters in there: LaTeX for equations, <a href="https://github.com/lierdakil/pandoc-crossref">pandoc-crossref</a> for intra-document references, and Citeproc (BibTeX) for citations and references. It’s also multilingual, so we throw <a href="https://tug.org/xetex/">XeTeX</a> into the mix to handle Unicode. It may seem like a crazy way to do it, but the result is an easy-to-edit, easy-to-diff document that can be maintained (<em>and</em> viewed) in GitHub by a wide variety of people (even those for whom LaTeX isn’t their first language).</p>
<p>The result is shockingly easy to convert to other document formats, so when a client asks for, say, a Word document to send for translation, we can easily do the conversion and expect that everything will render properly.</p>
<h2 id="our-problem">Our Problem</h2>
<p>I couldn’t seem to generate the same PDF twice. Witness here:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> md5 document.pdf
MD5 <span class="o">(</span>document.pdf<span class="o">)</span> <span class="o">=</span> fdfeefe8eb0df92162342271ad4cacc2
<span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> md5 document.pdf
MD5 <span class="o">(</span>document.pdf<span class="o">)</span> <span class="o">=</span> 90360b00c4f1ef08e57135e6b866e392
</code></pre></div></div>
<p>Basically, every time the PDF is generated, the hash is different. That’s a little embarrassing for a guy who does reproducibility research. I need to fix this in our generation pipeline.</p>
<p>Pro-tip number 1: <strong>make sure you’re solving the right problem</strong>. There’s no guarantee the PDF is the culprit, so before digging in that grave, I should check the generation upstream. Is the source material actually unchanging?</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make document.tex <span class="o">&&</span> <span class="nb">cp </span>document.tex orig.tex
<span class="o">>>></span> make clean <span class="o">&&</span> make document.tex <span class="o">&&</span> <span class="nb">cp </span>document.tex next.tex
<span class="o">>>></span> diff orig.txt next.txt
<span class="o">(</span>nothing<span class="o">)</span>
</code></pre></div></div>
<p>As I hoped: no output, so the generated TeX is the same. A good start.</p>
<p>Next, let’s figure out how different these files actually are, starting with my favourite hash function: file size.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv </span>document.pdf orig.pdf
<span class="o">>>></span> make clean <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv </span>document.pdf next.pdf
<span class="o">>>></span> <span class="nb">ls</span> <span class="nt">-la</span> <span class="k">*</span>.pdf
<span class="nt">-rw-r--r--</span> 1 hackalog staff 6709688 19 Jul 15:35 next.pdf
<span class="nt">-rw-r--r--</span> 1 hackalog staff 6709688 19 Jul 15:34 orig.pdf
</code></pre></div></div>
<p>Since the upstream contents are the same, and the resulting PDFs are the same size, I’m going to assume the bulk of the files are identical and look for some kind of metadata difference.</p>
<h2 id="the-fix">The Fix</h2>
<p>Lo and behold, it’s metadata. Google and <a href="https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va">Stack Exchange confirm</a> that these three fields are to blame:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">/CreationDate</code></li>
<li><code class="language-plaintext highlighter-rouge">/ModDate</code></li>
<li><code class="language-plaintext highlighter-rouge">/ID</code></li>
</ul>
<p>According to that thread, two of these are easy to fix: hard-code something reasonable into a <code class="language-plaintext highlighter-rouge">SOURCE_DATE_EPOCH</code> environment variable before running <code class="language-plaintext highlighter-rouge">pandoc</code> (like a fixed output of <code class="language-plaintext highlighter-rouge">date +%s</code>). I can generate a date once, set the variable, and give it a try.</p>
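<p>If you drive pandoc from a script instead of a Makefile, the same trick looks something like this minimal sketch (the timestamp value and the pandoc arguments are placeholders for whatever your pipeline actually uses):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import subprocess

# Pin SOURCE_DATE_EPOCH to a fixed timestamp so /CreationDate and
# /ModDate stop changing between builds. The value is arbitrary;
# 1658188800 is 2022-07-19 00:00:00 UTC.
env = dict(os.environ, SOURCE_DATE_EPOCH="1658188800")
subprocess.run(["pandoc", "document.md", "-o", "document.pdf"],
               env=env, check=True)
</code></pre></div></div>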
<p>Sure enough, according to <a href="https://exiftool.org/">exiftool</a>, the creation and modification dates now match. Unfortunately, the hashes <em>still</em> don’t match.</p>
<p>What about that third one? ID? Annoyingly, <a href="https://exiftool.org/">exiftool</a> doesn’t let me view the <code class="language-plaintext highlighter-rouge">ID</code> field directly. Time to get dirty. (I’m actually impressed I made it this far without a <a href="https://github.com/vim/vim/blob/master/src/xxd/xxd.c">hex dump</a>).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> diff <(xxd document-1.pdf) <(xxd document.pdf)
418936,418940c418936,418940
< 00664770: 662f 4944 5b3c 3935 3361 3763 3266 6531 f/ID[<953a7c2fe1
< 00664780: 3363 3431 3139 3231 6236 3265 6635 3065 3c411921b62ef50e
< 00664790: 3962 6334 3134 3e3c 3935 3361 3763 3266 9bc414><953a7c2f
< 006647a0: 6531 3363 3431 3139 3231 6236 3265 6635 e13c411921b62ef5
< 006647b0: 3065 3962 6334 3134 3e5d 2f52 6f6f 740a 0e9bc414>]/Root.
---
> 00664770: 662f 4944 5b3c 6130 6665 6131 3762 3361 f/ID[<a0fea17b3a
> 00664780: 3039 3436 3330 6561 3536 6364 3366 6539 094630ea56cd3fe9
> 00664790: 6363 3734 3434 3e3c 6130 6665 6131 3762 cc7444><a0fea17b
> 006647a0: 3361 3039 3436 3330 6561 3536 6364 3366 3a094630ea56cd3f
> 006647b0: 6539 6363 3734 3434 3e5d 2f52 6f6f 740a e9cc7444>]/Root.
</code></pre></div></div>
<p>There it is, and sure enough, the ID changes every time. According to the <a href="https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va">aforelinked Stack Exchange article</a>, there is a solution, but it depends on which PDF backend is doing the actual compiling for pandoc. I suppose I can patch the <a href="https://github.com/Wandmalfarbe/pandoc-latex-template">eisvogel.tex</a> template I’m using to generate the book, and add a blurb to the TeX header. Technically, I only use XeTeX, but so that I never have to look this up again, I’ll put <strong>all</strong> the backends in and cross my fingers in case I ever need a different one:</p>
<div class="language-tex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">\ifnum</span> 0<span class="k">\ifxetex</span> 1<span class="k">\fi\ifluatex</span> 1<span class="k">\fi</span>=0 <span class="c">% if pdftexe</span>
<span class="k">\pdfinfoomitdate</span>=1
<span class="k">\pdftrailerid</span><span class="p">{}</span>
<span class="k">\else</span> <span class="c">% if not pdftex</span>
<span class="k">\ifxetex</span>
<span class="k">\special</span><span class="p">{</span>pdf:trailerid [
<00112233445566778899aabbccddeeff>
<00112233445566778899aabbccddeeff>
]<span class="p">}</span>
<span class="k">\fi</span>
<span class="k">\ifluatex</span>
<span class="k">\pdfvariable</span> suppressoptionalinfo <span class="k">\numexpr</span>32+64+512<span class="k">\relax</span>
<span class="k">\fi</span>
<span class="k">\fi</span>
</code></pre></div></div>
<h2 id="the-result">The Result</h2>
<p>A few fistfuls of hair, a few hours, a hex dump, and much googling later, and…</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> make <span class="sb">`</span>clean<span class="sb">`</span> <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv</span> <span class="sb">`</span>document.pdf orig.pdf<span class="sb">`</span>
<span class="o">>>></span> make <span class="sb">`</span>clean<span class="sb">`</span> <span class="o">&&</span> make <span class="o">&&</span> <span class="nb">mv</span> <span class="sb">`</span>document.pdf next.pdf<span class="sb">`</span>
<span class="o">>>></span> <span class="sb">`</span>md5 <span class="k">*</span>.pdf MD5<span class="sb">`</span>
<span class="o">(</span><span class="sb">`</span>next.pdf<span class="sb">`</span><span class="o">)</span> <span class="o">=</span> <span class="sb">`</span>c3bf99530a35eab6f9adafb08c24acbd MD5<span class="sb">`</span>
<span class="o">(</span><span class="sb">`</span>orig.pdf<span class="sb">`</span><span class="o">)</span> <span class="o">=</span> <span class="sb">`</span>c3bf99530a35eab6f9adafb08c24acbd<span class="sb">`</span>
</code></pre></div></div>
<p>At last my PDF generation is reproducible. That wasn’t so hard, was it?</p>
<h1 id="parenting-sabbatical">How I spent my Parenting Sabbatical</h1>
<p><em>2021-06-11</em></p>
<p>For 6 weeks now, I’ve been on a parenting sabbatical; that is, I split my 9-month Parental Leave into two parts, separated by a 6-week “return to work”. On Monday, my work is done, and I go back on Parental Leave.</p>
<p>It took some serious logistics (a certain pandemic wiped out my original childcare plans), but I think I got some really great work done. Not only did I finish up old work, I feel like I’ve been able to at least dip my toe into the current research problems, which will make coming back in November all that much easier.</p>
<h2 id="6-weeks-in-review">6 Weeks in Review</h2>
<p>It’s been an intense 6 weeks.</p>
<p>Back in February, Amy and I presented our summary of Reproducibility research in 2020. In that talk, we also set out our roadmap for where we want to take the reproducibility project (Easydata) in 2021. My plan for my 6-week return was to get a good start on this roadmap, readying Easydata for the next set of projects and workshops to be thrown at it (later this summer, and early this fall).</p>
<p><img src="images/easydata2022/edreview-2021-goals.png" alt="Our 2021 Easydata Roadmap" /></p>
<p>In the last 6 weeks, I focused heavily on implementing the “Streamline Workgroup Sharing” improvements outlined in that talk. Particularly:</p>
<ul>
<li>Improving the <a href="https://hackalog.github.io/git-friendly-catalog">Catalog</a> object: Implementing a more git-friendly catalog format</li>
<li>Implementing <strong>notebook-as-transformer</strong>: i.e. the ability to use notebooks as nodes in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>. This allows an analyst to create a <code class="language-plaintext highlighter-rouge">Dataset</code> in a jupyter notebook (complete with all the storytelling that comes along with that format), and have that notebook be used automatically to regenerate the Dataset as part of the usual <code class="language-plaintext highlighter-rouge">Dataset.load()</code> dependency traversal mechanism (i.e. as a <strong>transformer</strong> in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>).</li>
</ul>
<p>This has set up some good opportunities to use the improved framework both immediately, and in the upcoming months:</p>
<ul>
<li>Amy’s preparing a set of tutorial notebooks for the <a href="https://github.com/acwooding/vectorizers_playground">Vectorizers Playground</a>.</li>
<li>Amy’s presenting an <a href="https://www.youtube.com/watch?v=KrIRTPvzLHM">Easydata tutorial</a>, and I’m giving a talk on the <a href="https://github.com/hackalog/make_better_defaults/blob/main/README.md">Easydata Makefile workflow</a> at this year’s PyData Global.</li>
<li>Easydata will be driving the git repos for a number of upcoming workshops and research events (details to come).</li>
</ul>
<h3 id="we-released-easydata-20">We released Easydata 2.0</h3>
<p>Easydata 2.0 consists of two new features (<a href="https://hackalog.github.io/git-friendly-catalog">new catalog format</a> and notebook-as-transformer), and a massive API cleanup. Because we removed almost as much code as we added (+1300 lines, -900 lines), we cranked the major version number to warn the user that they may want to review the documentation (or at least the blog post) before proceeding.</p>
<h3 id="we-reimplemented-catalogs">We reimplemented Catalogs</h3>
<p>A <code class="language-plaintext highlighter-rouge">Catalog</code> object is a serializable, disk-backed git-friendly dict-like object for storing a data catalog.</p>
<ul>
<li><strong>serializable</strong> means anything stored in the catalog must be serializable to/from JSON.</li>
<li><strong>disk-backed</strong> means all changes are reflected immediately in the on-disk serialization.</li>
<li><strong>git-friendly</strong> means this on-disk format can be easily maintained in a git repo (with minimal
issues around merge conflicts), and</li>
<li><strong>dict-like</strong> means programmatically, it acts like a Python <code class="language-plaintext highlighter-rouge">dict</code>.</li>
</ul>
<p>The new Catalog replaces the monolithic “catalog-as-json-files” that Easydata used previously. The main problem with these files was that, when several users were using the same git repo (like, say, in a workshop), these catalogs were a rich source of git merge conflicts.</p>
<p>My favourite thing about the new Catalog format is that it’s almost completely transparent to the code. Internally, it just acts like a dict. The serialization almost comes for free. Implementing catalogs in this fashion let us remove a whole pile of special-case code for dealing with the various catalogs.</p>
<p>For details, read my <a href="https://hackalog.github.io/git-friendly-catalog">blog post</a>.</p>
<h3 id="we-implemented-notebook-as-transformer">We implemented Notebook-as-transformer</h3>
<p>Internally, Easydata maintains a dependency hypergraph called the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>. Nodes in this graph are <code class="language-plaintext highlighter-rouge">Dataset</code> objects. Edges are composable “transformer functions” which take in 0 or more Datasets, and emit 1 or more Datasets. For more details, see my blog post on <a href="/transformers-and-datasets">transformers and datasets</a>.</p>
<p>The DatasetGraph is the magic that lets <code class="language-plaintext highlighter-rouge">Dataset.load()</code> just magically work. If the Dataset is present on-disk, it’s loaded from there. If not, it’s generated by walking its dependency list and building (or loading) the relevant Datasets before running a transformer function.</p>
<p>Writing transformer functions was never hard, but it was one place where we spent a bunch of time coaching users. So, to make Easydata easier to use, we’ve eliminated the need to put everything in a single function. Now a user can specify a <em>jupyter notebook</em> as a transformer function. So long as the notebook writes the desired dataset to disk, the process will just magically work.</p>
<p>Allowing <em>notebook-as-transformer</em> greatly improves the storytelling possible with Easydata, as <code class="language-plaintext highlighter-rouge">Dataset</code> preparation (and all the narration that goes along with it) can be stored in the main flow of jupyter notebooks, instead of hidden in a transformer function inside the project’s <code class="language-plaintext highlighter-rouge">src</code> module.</p>
<h3 id="we-made-a-whole-bunch-of-other-api-changes">We made a whole bunch of other API changes</h3>
<p>Since we were already breaking a bunch of API with the <code class="language-plaintext highlighter-rouge">Catalog</code> change, we took the opportunity to clean (or remove) a lot of the more troublesome (or confusing) parts of the Easydata API. These were design decisions which we knew had issues (and for which we had usually developed workarounds), but we were keeping for purposes of backwards compatibility. There were a lot of small changes here (see my <a href="https://hackalog.github.io/api-changes">api-changes</a> blog post for details), but a lot of those changes can be described as follows:</p>
<ul>
<li>Names have semantic baggage. <strong>Good</strong> (variable, method, parameter) <strong>names are important</strong>.</li>
<li>Good API design comes from watching users actually <em>use</em> your framework</li>
<li>Any day you can delete a bunch of code by introducing a new API is a good day.</li>
</ul>
<h3 id="and-now-back-to-parenting">And now: Back to Parenting</h3>
<p>So that’s it for a few months. I’m off on Parental leave until November. @acwooding’s still around, however, so feel free to direct your reproducibility questions her way in the meantime.</p>
<p>See you in the fall!</p>
<h1 id="api-ch-ch-changes">API Ch-ch-changes</h1>
<p><em>2021-06-02</em></p>
<p>As mentioned in the <a href="/git-friendly-catalog/">last post</a>, the upcoming Easydata 2.0 release is all about the API and UX lessons we learned in the last year of using (and developing) the Easydata framework.</p>
<p>Since there are probably a few existing Easydata users out there, here’s a quick guide to migrating to the new API.</p>
<h2 id="on-disk-catalog-format">On-disk Catalog Format</h2>
<p>We completely changed the on-disk <a href="/git-friendly-catalog">catalog format</a>. But you knew that, because we wrote a whole blog post about it :)</p>
<h3 id="loading-a-catalog">Loading a Catalog</h3>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">load_catalog(catalog_name)</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">Catalog.load(catalog_name)</code></li>
</ul>
<p>Previously, we had defined some helpful (partial) functions to load these; i.e.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataset_catalog = partial(load_catalog, catalog_file='datasets.json') # Old way</code></li>
<li><code class="language-plaintext highlighter-rouge">transformer_catalog = partial(load_catalog, catalog_file='transformers.json') # Old way</code></li>
<li><code class="language-plaintext highlighter-rouge">datasource_catalog = partial(load_catalog, catalog_file='datasources.json') # Old way</code></li>
</ul>
<p>We’ve deprecated them, because the new form is just as clear:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('datasets')</code></li>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('transformers')</code></li>
<li><code class="language-plaintext highlighter-rouge">Catalog.load('datasources')</code></li>
</ul>
<h3 id="deleting-a-key">Deleting a key</h3>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">del_from_catalog(key, catalog_file=foo)</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">c = Catalog.load(foo); del c[key]</code></li>
</ul>
<p>Basically, treat the catalog as a dict, and changes will be serialized to disk automatically.</p>
<h3 id="available-catalog-entries">Available catalog entries</h3>
<p>We used to have functions like <code class="language-plaintext highlighter-rouge">available_datasets()</code>, <code class="language-plaintext highlighter-rouge">available_transformers()</code>, <code class="language-plaintext highlighter-rouge">available_datasources()</code> but again, we now simply treat these as a dict, so</p>
<ul>
<li>Old: <code class="language-plaintext highlighter-rouge">if 'foo' in available_datasets() ...</code></li>
<li>New: <code class="language-plaintext highlighter-rouge">c = Catalog.load('datsets'); if 'foo' in c ...</code></li>
</ul>
<p>Basically, treat the catalog as a dict.</p>
<h2 id="building-transformers">Building Transformers</h2>
<p>One of our favourite new features is the ability to use a Jupyter Notebook in place of a transformer function.
It’s as easy as writing a Dataset to disk inside your notebook and then doing a:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dsdict = notebook_as_transformer(notebook_name='my_notebook.ipynb',
input_datasets=[ds_in],
output_datasets=[ds_out],
overwrite_catalog=True)
</code></pre></div></div>
<h2 id="eliminating-the-workflow-module">Eliminating the workflow module</h2>
<p>The purpose of <code class="language-plaintext highlighter-rouge">src.workflow</code> has changed several times. In the end, we ended up using it as a place to test
out new API ideas without exposing the details to the user. By the time we cut our Easydata 2 beta, this file was effectively empty, so it has returned to its original purpose (handling commands like “make datasets” and “make datasources”).</p>
<p>For the rest of the functions that used to be there, import what you need from the relevant Easydata submodule directly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from src.data import Catalog, Dataset
from src.helpers import (dataset_from_csv_manual_download,
                         dataset_from_metadata,
dataset_from_single_function)
</code></pre></div></div>
<h3 id="adding-datasetdatasource-to-catalog">Adding Dataset/Datasource to Catalog.</h3>
<p>To add a dataset or datasource to its respective catalog, use the <code class="language-plaintext highlighter-rouge">update_catalog()</code> method of the
Dataset / DataSource object respectively; e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c = Catalog.load('datsets')
ds = Dataset('new_dataset_name')
ds.update_catalog()
</code></pre></div></div>
<p>The same works for DataSource objects.</p>
<h3 id="renamed-api-calls">Renamed API Calls</h3>
<ul>
<li><code class="language-plaintext highlighter-rouge">TransformerGraph</code> is now <code class="language-plaintext highlighter-rouge">DatasetGraph</code></li>
<li><code class="language-plaintext highlighter-rouge">create_transformer_pipeline</code> is now <code class="language-plaintext highlighter-rouge">serialize_transformer_pipeline</code></li>
</ul>
<h3 id="new-exceptions">New Exceptions</h3>
<p>We introduced some Easydata-specific exceptions. We had previously been using generic ones.</p>
<ul>
<li>EasydataError: base for all other exceptions</li>
<li>ValidationError: hash check failed</li>
<li>ObjectCollision: object already exists in object store (more general than a FileExistsError)</li>
<li>NotFoundError: object not found in object store (more general than a FileNotFoundError)</li>
</ul>
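<p>In code, that hierarchy looks roughly like this (a sketch based only on the list above; the docstrings are mine):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class EasydataError(Exception):
    """Base class for all Easydata-specific exceptions."""

class ValidationError(EasydataError):
    """Raised when a hash check fails."""

class ObjectCollision(EasydataError):
    """Raised when an object already exists in the object store."""

class NotFoundError(EasydataError):
    """Raised when an object is not found in the object store."""
</code></pre></div></div>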
<h3 id="force-flags-and-other-misnamed-options">“force” flags and other misnamed options</h3>
<p><code class="language-plaintext highlighter-rouge">force</code> was a terrible name for an option flag, as it meant something slightly different
to every function, leading to some odd bugs. It has been replaced in most cases with a clearer name:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">Dataset.dump(force=True)</code> -> <code class="language-plaintext highlighter-rouge">Dataset.dump(exists_ok=True)</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.traverse()</code>: <code class="language-plaintext highlighter-rouge">force</code> -> <code class="language-plaintext highlighter-rouge">exhaustive</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.generate()</code> <code class="language-plaintext highlighter-rouge">force</code>-><code class="language-plaintext highlighter-rouge">exhaustive</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.add_source()</code>: <code class="language-plaintext highlighter-rouge">force</code>->overwrite_catalog</li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.add_edge()</code>: <code class="language-plaintext highlighter-rouge">force</code>->overwrite_catalog</li>
</ul>
<p>We also cleaned up some other misnamed options:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.generate()</code>: <code class="language-plaintext highlighter-rouge">write_catalog</code>-><code class="language-plaintext highlighter-rouge">write_dataset</code></li>
<li><code class="language-plaintext highlighter-rouge">DatasetGraph.process_edge()</code>: <code class="language-plaintext highlighter-rouge">write_catalog</code>-><code class="language-plaintext highlighter-rouge">write_datsets</code></li>
</ul>
<h3 id="adding-datasets">Adding Datasets</h3>
<p><code class="language-plaintext highlighter-rouge">src.data.add_dataset()</code> is deprecated. It had two forms:</p>
<ul>
<li>From the dataset itself: now handled by <code class="language-plaintext highlighter-rouge">dataset.update_catalog()</code></li>
<li>Using the <code class="language-plaintext highlighter-rouge">from_datasource</code> option: Now <code class="language-plaintext highlighter-rouge">Dataset.from_datasource()</code></li>
</ul>
<h3 id="changing-the-log-level">Changing the log level</h3>
<p><code class="language-plaintext highlighter-rouge">src.log.debug</code> is now gone (it did not work correctly anyway). Set the <code class="language-plaintext highlighter-rouge">LOGLEVEL</code> environment variable instead.</p>
<p>I’m sure there are many other changes I forgot about, but these should get you going (and the <a href="https://cookiecutter-easydata.readthedocs.io">Easydata documentation</a> and docstrings should get you the rest of the way!)</p>
<h1 id="git-friendly-catalog-format">Making a git-friendly Catalog Format</h1>
<p><em>2021-05-25</em></p>
<p>TL;DR: API lessons learned from a year of building (and using) Easydata.</p>
<p>After a year of using it, I’d say we got a lot of things right in <a href="https://github.com/hackalog/easydata">Easydata</a>. We made it to our 1.0 release last summer (introducing the <code class="language-plaintext highlighter-rouge">Dataset.load()</code> API), and, over the course of several workshops, hammered out a set of changes for working with large datasets (remote data and the EXTRA API), and private data.</p>
<p>That said, we also got a few things wrong, and now it’s time to go ahead and fix one of those things: making the catalog format more git-friendly. This is a breaking change, so this change will form the start of what will become Easydata 2.0, the rest of which will be documented in my <a href="/api-changes">next post</a>.</p>
<h2 id="on-disk-catalog-format">On-disk Catalog Format</h2>
<p>When we first picked an on-disk catalog format, we hadn’t thought about designing for minimizing potential git conflicts. Since a git workflow is a fairly core piece of <a href="https://github.com/hackalog/easydata">Easydata</a>, we’re going to right that wrong.</p>
<p>Following in the “implement the obvious thing first” philosophy, our initial serialization format for the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> hypergraph was a pair of JSON files: one for the datasets (nodes), and one for the transformers (edges).</p>
<p>While this works in practice, it has a downside when we use Easydata in a busy workshop: it’s a ripe source of <strong>git conflicts</strong>. How? Well, when two participants both make changes to their respective data catalogs, there’s a strong possibility of a git conflict when someone goes to merge those changes.</p>
<p>For 2.0, we’re going to try a straightforward format change. Instead of a catalog consisting of multiple JSON files with one entry per node/edge, let’s make the catalog consist of <strong>multiple directories</strong>, and have one <em>file</em> per node/edge. Dataset names are necessarily unique (as are transformer names, though it’s <em>much</em> less common to refer to them by name), so it seems natural. In other words, instead of</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">catalog/datasets.json</code></li>
<li><code class="language-plaintext highlighter-rouge">catalog/transformers.json</code></li>
</ul>
<p>we now have</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">catalog/datasets/*.json</code></li>
<li><code class="language-plaintext highlighter-rouge">catalog/transformers/*.json</code></li>
</ul>
<p>As an added bonus, the Catalog class can be used wherever we need a catalog of serializable objects; e.g. for our <code class="language-plaintext highlighter-rouge">DataSource</code> objects as well.</p>
<h2 id="the-catalog-class">The Catalog class</h2>
<p>A <code class="language-plaintext highlighter-rouge">Catalog</code> object is a serializable, disk-backed git-friendly dict-like object for storing a data catalog.</p>
<ul>
<li><strong>serializable</strong> means anything stored in the catalog must be serializable to/from JSON.</li>
<li><strong>disk-backed</strong> means all changes are reflected immediately in the on-disk serialization.</li>
<li><strong>git-friendly</strong> means this on-disk format can be easily maintained in a git repo (with minimal
issues around merge conflicts), and</li>
<li><strong>dict-like</strong> means programmatically, it acts like a Python <code class="language-plaintext highlighter-rouge">dict</code>.</li>
</ul>
<p>On disk, a Catalog is stored as a directory of JSON files, one file per object. The stem of the filename (e.g. <code class="language-plaintext highlighter-rouge">stem.json</code>) is the key (name) of the catalog entry in the dictionary, so <code class="language-plaintext highlighter-rouge">catalog/key.json</code> is accessible through the API as <code class="language-plaintext highlighter-rouge">catalog['key']</code>.</p>
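<p>In practice, a Catalog session looks something like this (a sketch; the entry contents are made up):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> from src.data import Catalog
>>> c = Catalog.load('datasets')            # reads catalog/datasets/*.json
>>> c['my_dataset'] = {"hash": "sha1:..."}  # hypothetical entry; the assignment
...                                         # immediately writes catalog/datasets/my_dataset.json
>>> 'my_dataset' in c
True
</code></pre></div></div>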
<p>Making this change let us deprecate a whole lot of arbitrary methods in our API, which got the ball rolling on a massive <a href="/api-changes">API cleanup</a>. More about that soon.</p>
<h1 id="cache-is-magic-post">Cache is Magic</h1>
<p><em>2020-05-06</em></p>
<p>TL;DR: Caching is finicky, but magical when you get it right.</p>
<h2 id="cache-is-magic">Cache is Magic</h2>
<p>My self-declared milestone for an <a href="https://github.com/hackalog/easydata">Easydata</a> 1.0 release is being able to do this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> ds = Dataset.load('dataset_name')
</code></pre></div></div>
<p>And having it “just work” regardless of whether the <code class="language-plaintext highlighter-rouge">Dataset</code> is already on disk, or if it needs to be regenerated by traversing the <a href="/dataset-graph">DatasetGraph</a> and regenerating some or all of the intermediate <code class="language-plaintext highlighter-rouge">Dataset</code> objects (including raw data fetches, if necessary).</p>
<p>A magical component of this generation is the <strong>caching</strong>: if a <code class="language-plaintext highlighter-rouge">Dataset</code> is on disk and matches the hashes recorded in the <code class="language-plaintext highlighter-rouge">Dataset</code> catalog, the generation step is skipped. Seems easy enough, but as with most things in software, “do what I mean” turns out to be much, much harder than I secretly hoped. The good news is, the implementation is starting to just work.</p>
<p>After much usability wrangling, here’s how we cache <code class="language-plaintext highlighter-rouge">Dataset</code> objects in <a href="https://github.com/hackalog/easydata">Easydata</a>.</p>
<h3 id="datasets-and-metadata">Datasets and Metadata</h3>
<p>Recall, a <code class="language-plaintext highlighter-rouge">Dataset</code> is a set of binary blobs with standard names like <code class="language-plaintext highlighter-rouge">.data</code> and <code class="language-plaintext highlighter-rouge">.target</code>, along with its associated metadata.</p>
<p>Metadata is not an afterthought. It’s an essential component of the <code class="language-plaintext highlighter-rouge">Dataset</code>. Metadata can be anything that is JSON-serializable (in fact, under the hood, it’s just a dict), but usually contains:</p>
<ul>
<li>the <code class="language-plaintext highlighter-rouge">.DESCR</code> (readme) text, describing what this dataset is all about.</li>
<li>the <code class="language-plaintext highlighter-rouge">.LICENSE</code>, listing the conditions under which this data can be used.</li>
<li><code class="language-plaintext highlighter-rouge">.HASHES</code>: hash values for each of the binary attributes like data and target (essential for data provenance)</li>
<li>Any other information that you want to keep with the data itself, and preserve through the <code class="language-plaintext highlighter-rouge">Dataset</code> transformation process.</li>
</ul>
<p>Though under the hood it’s implemented as a dict, we steal a great idea from the sklearn <a href="https://github.com/adrinjalali/scikit-learn/blob/bea2e2414f93fdf4558f1288377d2aa0351727b4/sklearn/utils/__init__.py#L60-L80">Bunch</a> object and tweak it a bit to make metadata access easier. In addition to the standard dictionary-style access, metadata is accessible by referring to <strong>uppercase</strong> property names; e.g. <code class="language-plaintext highlighter-rouge">ds.LICENSE</code> returns the metadata stored at <code class="language-plaintext highlighter-rouge">ds.metadata['license']</code>.</p>
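<p>Here’s a minimal sketch of that Bunch-style trick (illustrative only, not the actual Easydata implementation):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Bunchlike(dict):
    """Illustrative sketch: expose dict keys as UPPERCASE attributes,
    so md.LICENSE returns md['license']."""
    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails
        if name.isupper():
            try:
                return self[name.lower()]
            except KeyError as err:
                raise AttributeError(name) from err
        raise AttributeError(name)
</code></pre></div></div>

<p>With that in place, <code class="language-plaintext highlighter-rouge">md.LICENSE</code> and <code class="language-plaintext highlighter-rouge">md['license']</code> refer to the same entry.</p>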
<p>It’s important (as you’ll see in a second) that this metadata is both hashable and JSON-serializable.</p>
<h3 id="how-caching-works-in-easydata">How caching works (in <a href="https://github.com/hackalog/easydata">Easydata</a>)</h3>
<p>The global <code class="language-plaintext highlighter-rouge">Dataset</code> catalog is a dictionary of the form:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{dataset_name: str, dataset_metadata:dict}
</code></pre></div></div>
<p>Caching works by hashing the metadata dictionary (which includes the data hashes) and using this hash as a filename for the cached copy of the dataset. Caches are stored in <code class="language-plaintext highlighter-rouge">paths['cache_path']</code>, and consist of a pair of files: <code class="language-plaintext highlighter-rouge">dataset_name.dataset</code> and <code class="language-plaintext highlighter-rouge">dataset_name.metadata</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.metadata
-rw-r--r-- 1 hackalog staff 301636175 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.dataset
-rw-r--r-- 1 hackalog staff 474 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.metadata
</code></pre></div></div>
<p>The .dataset file is a joblib serialization of the <code class="language-plaintext highlighter-rouge">Dataset</code> object. The .metadata file is a JSON file containing just the metadata dictionary, useful if we don’t want to spend the time to load the whole dataset just to get at its hashes, say.</p>
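<p>Conceptually, the cache filename computation is just this (a sketch; Easydata’s actual serialization details may differ):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib
import json

def cache_stem(metadata):
    """Hash the (JSON-serializable) metadata dict to get the cache filename stem."""
    blob = json.dumps(metadata, sort_keys=True).encode('utf-8')
    return hashlib.sha1(blob).hexdigest()

# cache hit if both cache_stem(md) + '.dataset' and
# cache_stem(md) + '.metadata' already exist on disk
</code></pre></div></div>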
<p>Once in a while, a <code class="language-plaintext highlighter-rouge">Dataset</code> is in a polished enough form that we dump it directly to a named <code class="language-plaintext highlighter-rouge">Dataset</code> in the <code class="language-plaintext highlighter-rouge">paths['processed_data_path']</code> directory. We often do this at the end of a data cleaning session, or after an analysis. The idea being that we can blow away the <code class="language-plaintext highlighter-rouge">paths['interim_data_path']</code> or <code class="language-plaintext highlighter-rouge">paths['cache_path']</code> directory to get back disk space, and still have our generated <code class="language-plaintext highlighter-rouge">Dataset</code> objects available.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 beer_review_all.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 beer_review_all.metadata
</code></pre></div></div>
<p>Note, these are exactly the same as their associated cache files: <code class="language-plaintext highlighter-rouge">1b1adb100d8088955878a9d7b3d071710c2db3bf.{dataset|metadata}</code></p>
<p>The end result is that we can accumulate multiple versions of a <code class="language-plaintext highlighter-rouge">Dataset</code> in the cache directory, and continue to use them so long as we have the disk space.</p>
<p>At some point, we’d love for this cache to be shared within a workgroup, but that’s a feature for another day.</p>
<h1 id="implementing-the-datasetgraph">Implementing the DatasetGraph</h1>
<p><em>2020-05-04</em></p>
<p>TL;DR: How the Dataset DAG became a hypergraph became the DatasetGraph.</p>
<h2 id="datasetgraph-as-a-top-level-object">DatasetGraph as a top-level object.</h2>
<p>Recall from a <a href="/transformers-and-datasets">few weeks ago</a>, I described a bipartite graph (or Hypergraph), now called a <code class="language-plaintext highlighter-rouge">DatasetGraph</code>, which describes how <code class="language-plaintext highlighter-rouge">Dataset</code> objects are generated from other <code class="language-plaintext highlighter-rouge">Dataset</code> objects. I originally named it a <code class="language-plaintext highlighter-rouge">TransformerGraph</code>, because that’s how the directionality of the edges works out in the bipartite representation, but that turns out to be a little more confusing for the user. In the hypergraph, the <code class="language-plaintext highlighter-rouge">Dataset</code> objects are the nodes, so <code class="language-plaintext highlighter-rouge">DatasetGraph</code> it is.</p>
<p>One of the unintended consequences of introducing a <code class="language-plaintext highlighter-rouge">DatasetGraph</code> class in <a href="https://github.com/hackalog/easydata">Easydata</a> is that it turns out to be the right place to do a lot of things. That’s why we ended up exposing it to the user, instead of just using it internally to the <code class="language-plaintext highlighter-rouge">Dataset</code>.</p>
<p>Before we created the <code class="language-plaintext highlighter-rouge">DatasetGraph</code>, we used to have a top-level function <code class="language-plaintext highlighter-rouge">add_transformer()</code> to add a dataset transformation to the global catalog, but it turns out a much more natural place to put it is in the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> class directly.</p>
<p>Sticking with the “edges are functions, nodes are datasets” hypergraph terminology, the API becomes something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> dag = DatasetGraph()
>>> xp = create_transformer_pipeline([list, of, transformer, functions, or, partials, ...])
>>> dag.add_source(datasource_name="dsrc_name", datasource_opts={}, output_dataset="dset_name")
>>> dag.add_edge(input_dataset=None, input_datasets=(),
output_dataset=None, output_datasets=(),
transformer_pipeline=xp, **kwargs)
>>> dataset = dag.generate('node_name')
</code></pre></div></div>
<p>This gives us a clean separation between adding a source node to the graph, and adding an edge. Both technically add edges, but the idea of a “source edge” in this hypergraph just feels weird, so the details are hidden by this API. This is perhaps why describing the dependency graph as a bipartite graph is less troublesome. See <a href="/transformers-and-datasets">my last post</a> for more on that.</p>
<h1 id="building-transformers-and-datasets">Building Transformers and Datasets</h1>
<p><em>2020-04-13</em></p>
<p>TL;DR: Easydata’s Dataset dependency hypergraph, described.</p>
<h3 id="hypergraph-or-bipartite-graph">Hypergraph or Bipartite Graph?</h3>
<p>For this post, I’m still talking about the hypergraph of data dependencies that I mentioned <a href="/dataset-dag">last time</a>, however for this discussion, I’ll switch from a hypergraph-based description to a bipartite graph-based description of the dependencies.</p>
<p>Why? For starters, there’s not necessarily a commonly accepted notion of a <strong>directed hypergraph</strong>.
When I use the term, I mean a hypergraph where the vertices of an edge are partitioned into two sets: the <strong>head-set</strong> and <strong>tail-set</strong> of the edge.</p>
<p>It’s perhaps interesting (and often surprising) to note the constructs that appear when trying to describe data flow as a directed hypergraph. In our case, we often end up with a hypergraph where data originates from a transformer function (like when we have synthetic, or downloaded data). This leads to a directed hyperedge with <strong>no input nodes</strong>, only output nodes; i.e. the head-set is empty, but the tail-set is not. What does one even call that? A <strong>source edge</strong>?</p>
<p>Anyway, to avoid some of these rabbit holes, we can switch to a <strong>bipartite graph</strong> representation of this construct. These representations (hypergraph, bipartite graph) are interchangeable. To construct this bipartite graph, list the transformers (the hyper “edges”) down one side of the page, Datasets (the hyper “nodes”) down the other, and join them with directed edges to indicate data dependencies (<strong>inbound edges</strong> to a transformer are <strong>input datasets</strong>, <strong>outbound edges are output datasets</strong>).</p>
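<p>As a purely illustrative sketch of that construction (the dataset and transformer names here are invented for the example), using plain Python dicts:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Transformers on one side of the bipartite graph, Datasets on the other;
# directed edges (here, just lists) express the data dependencies.
transformers = {
    "train_test_split": {                                   # hypothetical edge
        "input_datasets": ["reviews"],                      # inbound edges
        "output_datasets": ["reviews.train", "reviews.test"],  # outbound edges
    },
}
datasets = {"reviews", "reviews.train", "reviews.test"}
</code></pre></div></div>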
<h3 id="more-on-the-dataset-graph">More on the Dataset Graph</h3>
<p>A <code class="language-plaintext highlighter-rouge">Dataset</code> is an on-disk object representing a point-in-time snapshot (a cached copy) of data and its associated metadata. The <code class="language-plaintext highlighter-rouge">Dataset</code> objects themselves are serialized to <code class="language-plaintext highlighter-rouge">data/processed</code>. Metadata about these objects are serialized to <code class="language-plaintext highlighter-rouge">catalog/datasets.json</code>.</p>
<p>A <code class="language-plaintext highlighter-rouge">Transformer</code> is a function that takes in <strong>zero or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects, and produces <strong>one or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects. While the functions themselves are stored in the source module (by default in <code class="language-plaintext highlighter-rouge">src/user/transformers.py</code>), metadata describing these functions and their inputs/outputs <code class="language-plaintext highlighter-rouge">Dataset</code> objects are serialized to the catalog file <code class="language-plaintext highlighter-rouge">catalog/transformers.json</code>.</p>
<p>We’ll define the <code class="language-plaintext highlighter-rouge">DatasetGraph</code> as the bipartite graph formed by the two distinct sets of vertices above: <code class="language-plaintext highlighter-rouge">Dataset</code> objects, and <code class="language-plaintext highlighter-rouge">Transformer</code> functions. The edges of this graph are directed, indicating the direction of dependency from the perspective of the <code class="language-plaintext highlighter-rouge">Transformer</code>; i.e. since <code class="language-plaintext highlighter-rouge">output_datasets</code> depend on <code class="language-plaintext highlighter-rouge">input_datasets</code>, arrows are directed from input <code class="language-plaintext highlighter-rouge">Dataset</code> objects to <code class="language-plaintext highlighter-rouge">Transformer</code> functions, and from <code class="language-plaintext highlighter-rouge">Transformer</code> functions to output <code class="language-plaintext highlighter-rouge">Dataset</code> objects.</p>
<p>The whole goal of this exercise is to capture the information about the data transformations from raw data to processed data, <strong>in a way that can be serialized to disk</strong>, and committed as if it was code. These instructions are stored in the data catalog in JSON format. There is some trickiness here, as function objects don’t serialize in a platform-independent way, so we just make some assumptions about namespaces (we set up a standard location in the python module for user-generated functions: <code class="language-plaintext highlighter-rouge">src.user.*</code>), and use Python introspection to map the serialization to function objects when the pipeline is loaded.</p>
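<p>A minimal sketch of that introspection step, assuming serialized names are fully qualified (e.g. <code class="language-plaintext highlighter-rouge">src.user.my_func</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import importlib

def resolve_function(qualname):
    """Map a serialized function name back to a function object;
    e.g. 'src.user.my_func' -> the my_func function object."""
    module_name, func_name = qualname.rsplit('.', 1)
    return getattr(importlib.import_module(module_name), func_name)
</code></pre></div></div>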
<h3 id="transformer-serialization">Transformer Serialization</h3>
<p>Note that <strong>transformers can take zero datasets as input</strong> (but must produce at least one output). This special case occurs in one of two ways:</p>
<ul>
<li><strong>Synthetic Data</strong>: The data is synthetic, and the transformer actually generates a <code class="language-plaintext highlighter-rouge">Dataset</code> object from scratch. The JSON in this case looks like:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "synthetic_source": {
"output_dataset: "ds_name",
"transformations": [
(synthetic_generator, kwargs_dict),
(optional_function_2, kwargs_dict_2 ),
...
],
}
</code></pre></div> </div>
</li>
<li><strong>Data Conversion</strong>: The data originates from something that isn’t a <code class="language-plaintext highlighter-rouge">Dataset</code> (e.g. a DataSource object), and the transformer converts it to a <code class="language-plaintext highlighter-rouge">Dataset</code>. This is really no different than the synthetic data case, except we supply a <code class="language-plaintext highlighter-rouge">dataset_from_datasource()</code> wrapper so the user doesn’t have to constantly reimplement it:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "datasource_edge": {
"output_dataset: "ds_name",
"transformations": [
        (dataset_from_datasource, {"datasource_name": datasource_name, **datasource_opts}),
(optional_function_2, kwargs_dict_2 ),
...
],
}
</code></pre></div> </div>
</li>
</ul>
<p>In all other cases, a transformer consumes one or more <code class="language-plaintext highlighter-rouge">Dataset</code> objects as input, and emits one or more as output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "hyperedge": {
"input_datasets":[in_dset_1, in_dset_2],
"output_datasets":[out_dset_1, out_dset_2],
"transformations": [
(function_1, kwargs_dict_1 ),
(function_2, kwargs_dict_2 ),
...
],
"suppress_output": False, # defaults to True
},
</code></pre></div> </div>
<h3 id="dataset-serialization">Dataset Serialization</h3>
<p>A complete <code class="language-plaintext highlighter-rouge">Dataset</code> object contains both the data itself and an associated metadata dictionary. On disk, this is serialized to two files, typically located in <code class="language-plaintext highlighter-rouge">paths['processed_data_path']</code>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">dataset_name.dataset</code>: The complete <code class="language-plaintext highlighter-rouge">Dataset</code> Object</li>
<li><code class="language-plaintext highlighter-rouge">dataset_name.metadata</code>: A copy of the metadata portion of the <code class="language-plaintext highlighter-rouge">Dataset</code>. As the <code class="language-plaintext highlighter-rouge">Dataset</code> can be quite large, metadata-only operations save time and memory by reading this file instead. If the <code class="language-plaintext highlighter-rouge">Dataset</code> has been reproducibly generated, this metadata should match whatever is serialized into the dataset catalog.</li>
</ul>
<p>One of the design goals of <a href="https://github.com/hackalog/easydata">Easydata</a> is that this processed dataset can be deleted at any time and (reproducibly and deterministically) recreated when needed.</p>
<h3 id="dataset-metadata">Dataset Metadata</h3>
<p>The master copy of the generated metadata is stored in the catalog file: <code class="language-plaintext highlighter-rouge">catalog/datasets.json</code>.</p>
<p><code class="language-plaintext highlighter-rouge">Dataset</code> metadata is fairly freeform. It is based on scikit-learn’s <a href="https://github.com/adrinjalali/scikit-learn/blob/bea2e2414f93fdf4558f1288377d2aa0351727b4/sklearn/utils/__init__.py#L60-L80">Bunch</a> object (basically a dictionary where the keys can be accessed as attributes). This object typically contains 4 attributes: <code class="language-plaintext highlighter-rouge">.data</code>, <code class="language-plaintext highlighter-rouge">.target</code> (which is often None for unsupervised learning problems), <code class="language-plaintext highlighter-rouge">.metadata</code>, and <code class="language-plaintext highlighter-rouge">.hashes</code>. The latter contains a hash of all the non-<code class="language-plaintext highlighter-rouge">metadata</code> attributes of the <code class="language-plaintext highlighter-rouge">Dataset</code>; e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> "hashes": {
"data":"sha1:d1d5ac9a5872e09b3a88618177dccc481df022d1",
"target":"sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b",
},
</code></pre></div></div>
<p>where data and target are whatever data type makes sense for the problem at hand (e.g. matrix, pandas DataFrame, NumPy array, etc.)</p>
<h1 id="dataset-dependency-graph">Building a Dataset Dependency Graph for Easydata</h1>
<p><em>2020-03-30</em></p>
<p>TL;DR: We thought we were building a graph of dependencies. Turns out we had a hypergraph.</p>
<h2 id="building-a-dataset-dependency-graph-for-easydata">Building a Dataset Dependency Graph for Easydata</h2>
<p>One of our design goals for Easydata is to be able to start an analysis like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> ds = Dataset.load(f"covid-19-nlp-{date}", date="2020-04-01")
</code></pre></div></div>
<p>This would do a bunch of magic under the hood:</p>
<ul>
<li>(<strong>Caching</strong>) it would check if a cached version of the dataset already exists, (returning this cached copy if so). Otherwise</li>
<li>(<strong>Dataset Generation</strong>) it would generate any intermediate files needed to generate this dataset (all the way back to the raw data, if need be), then apply a sequence of <strong>transformer functions</strong> to turn the raw data into a processed dataset, then</li>
<li>(<strong>Check Hashes</strong>) it would hash and check the generated datasets to ensure that, if this command had been previously executed, nothing about my generated dataset has changed</li>
</ul>
<p>Until recently, I referred to this process as the <strong>Dataset Dependency DAG</strong>, assuming that it would be implemented as a directed acyclic graph (DAG); i.e. edges would be transformer functions that look like</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(input_dset: Dataset, **kwarg) -> Dataset
</code></pre></div></div>
<p>and nodes would be <code class="language-plaintext highlighter-rouge">Dataset</code> objects. These could be easily pipelined together, as all transformers consumed and generated the same data type, and the remaining kwargs could be serialized to a json blob for the <a href="https://github.com/hackalog/easydata">Easydata</a> catalog, so Bob’s your uncle.</p>
<h3 id="well-duh">Well, DUH</h3>
<p>Unfortunately, when we started looking at our collection of real-world examples of transformer functions (see <a href="https://github.com/acwooding/reproallthethings">reproallthethings</a>), we came to the conclusion that what we had wasn’t a <strong>directed graph</strong> of data dependencies, it was a <strong>directed hypergraph</strong>, as our real-world collection of data transformations includes such functions as:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">train-test-split</code>: takes as input a single dataset and produces two children: <strong>one parent, two children</strong>.</li>
<li><code class="language-plaintext highlighter-rouge">augment-dataset</code>: takes as input two datasets and joins them along various axes: <strong>two parents, one child</strong></li>
<li><code class="language-plaintext highlighter-rouge">subsample-dataset</code>: Takes as input a dataset and produces a smaller dataset by subsampling the rows: <strong>one parent, one child</strong>.</li>
</ul>
<p>In its most general form, therefore, a dataset transformer function <em>takes in an arbitrary number of datasets, and produces an arbitrary number of datasets</em>; i.e. a transformer function is a <strong>hyperedge</strong>, not an edge, and so my data dependencies are best described by a “Directed Acyclic Hypergraph” (DAH).</p>
<p>Unfortunately, a “DAH” doesn’t have the same ring to it as DAG. I complained about this online, and a colleague fixed this problem for me:</p>
<blockquote>
<p>Obviously acyclic generalizes to a different concept in hypergraphs than what you have. The correct term for the lack of cycles in your hypergraph is “uncyclic”, so, um … DUH</p>
</blockquote>
<p>With the <a href="https://martinfowler.com/bliki/TwoHardThings.html">hardest computer science problem</a> already out of the way, we came to the next hurdle. I don’t have a handy mental model for how to implement this <code class="language-plaintext highlighter-rouge">Dataset</code> hypergraph (DUH) in python: the actual data structures and algorithms I’ll use to issue the sequence of data transformation calls, or the APIs I’ll need to be able to chain these transformer functions together in a pipeline.</p>
<p>Whereas before, I could insist that a transformer be a function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(input_dset: Dataset, **kwargs) -> Dataset
</code></pre></div></div>
<p>and just chain these together, now I have to something a little more… hyper.
Here’s my current thinking:</p>
<h3 id="serializing-the-dataset-hypergraph">Serializing The Dataset Hypergraph</h3>
<p>A <em>transformer function</em> takes in <em>input_datasets</em> and produces <em>output_datasets</em>.</p>
<p>Edges can be thought of as directed (parent to child), indicating a dependency. e.g. <em>output_datasets</em> depend on <em>input_datasets</em>, with an edge from one set to the other.</p>
<p>This will be serialized in the dataset catalog as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"hyperedge_1": {
"input_datasets":[],
"output_datasets":[],
"transformations": [
(function_1, kwargs_dict_1 ),
(function_2, kwargs_dict_2 ),
...
],
"suppress_output": False, # defaults to True
},
"source_edge_1": {
"datasource_name": "ds_name",
"datasource_opts": {},
"output_dataset: "ds_name",
}
}
</code></pre></div></div>
<p>Notice that source nodes are actually just 1-1 edges (one datasource in, one dataset out). This is convenient from an implementation perspective.</p>
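<p>One way to cash in on that convenience (a sketch of my thinking, not actual Easydata code) is to normalize every catalog entry into hyperedge form on load: a source edge becomes a zero-input, one-output hyperedge whose only transformation loads the datasource. The <code class="language-plaintext highlighter-rouge">load_datasource</code> name is a hypothetical placeholder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict

def normalize_catalog(catalog: Dict[str, dict]) -> Dict[str, dict]:
    """Rewrite source edges as hyperedges so every entry has one shape.

    Field names follow the serialization sketch above; 'load_datasource'
    is a hypothetical transformer that wraps a raw datasource.
    """
    normalized = {}
    for name, entry in catalog.items():
        if "datasource_name" in entry:  # a source edge
            normalized[name] = {
                "input_datasets": [],
                "output_datasets": [entry["output_dataset"]],
                "transformations": [
                    ("load_datasource", {
                        "datasource_name": entry["datasource_name"],
                        **entry.get("datasource_opts", {}),
                    }),
                ],
            }
        else:  # already a hyperedge
            normalized[name] = entry
    return normalized
</code></pre></div></div>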
<h3 id="the-transformer-api">The Transformer API</h3>
<p>Putting all this together, then, <strong>transformer functions</strong> are functions that take in <strong>zero or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects and produce <strong>one or more</strong> <code class="language-plaintext highlighter-rouge">Dataset</code> objects, with the API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>>> def transformer(dsdict: Dict(str,Dataset), **kwargs) -> Dict(str,Dataset)
</code></pre></div></div>
<p>Here we use kwargs to map function variables to key names in the dsdict as needed. This string-based approach is necessary because the kwargs dict must be serializable to disk (as a JSON dump) to be used in the <a href="https://github.com/hackalog/easydata">Easydata</a> catalog.</p>
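<p>As a concrete (and hedged) example, a train/test split under this API might look like the following sketch. The key names, defaults, and <code class="language-plaintext highlighter-rouge">Dataset</code> stub are my assumptions, not the real Easydata interface:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict, List

class Dataset:
    """Hypothetical stub, as in the earlier sketches."""
    def __init__(self, data: List):
        self.data = data

def train_test_split(dsdict: Dict[str, Dataset],
                     source_key: str = "raw",
                     train_key: str = "train",
                     test_key: str = "test",
                     test_fraction: float = 0.25) -> Dict[str, Dataset]:
    # Dataset references are string keys, so every kwarg here is a plain
    # JSON-serializable value, as the catalog requires.
    dset = dsdict[source_key]
    n_test = int(len(dset.data) * test_fraction)
    return {
        train_key: Dataset(dset.data[n_test:]),
        test_key: Dataset(dset.data[:n_test]),
    }

# Usage: one dsdict in, one dsdict out, ready for the next hyperedge.
out = train_test_split({"raw": Dataset(list(range(100)))}, test_fraction=0.2)
</code></pre></div></div>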
<h3 id="this-is-a-work-in-progress">This is a Work In Progress</h3>
<p>Of course, there are a few outstanding items from this little brainstorm:</p>
<ul>
<li>Does the generalization of the transformer API actually work? Can they be chained together in the way I intend? It works in my head, but my head isn’t Turing complete.</li>
<li>What’s the hypergraph traversal algorithm? I.e., I want the list of transformers (hyperedges) traversed from the sources to any named node in the graph. What’s the directed hypergraph equivalent of a depth-first or breadth-first search here? Just do it on the induced bipartite graph and stop when my list of nodes has been covered? (See the sketch after this list.)</li>
</ul>
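<p>And here’s one hedged guess at that traversal, assuming the catalog has been normalized into hyperedge form as sketched earlier: treat it as the bipartite (dataset, hyperedge) graph it induces, and do a depth-first postorder from the target back toward the sources:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict, List, Set

def edges_to_build(catalog: Dict[str, dict], target: str) -> List[str]:
    """Return hyperedge names, in executable order, needed to build target.

    A depth-first postorder on the bipartite (dataset, hyperedge) graph.
    Assumes the catalog is uncyclic (DUH) and that each dataset is the
    output of at most one hyperedge. (A sketch, not the Easydata code.)
    """
    # Invert the catalog: which hyperedge produces each dataset?
    producer: Dict[str, str] = {}
    for edge_name, edge in catalog.items():
        for out in edge["output_datasets"]:
            producer[out] = edge_name

    ordered: List[str] = []
    seen: Set[str] = set()

    def visit(dset_name: str) -> None:
        edge_name = producer.get(dset_name)
        if edge_name is None or edge_name in seen:
            return  # a raw source, or an edge we've already scheduled
        seen.add(edge_name)
        for parent in catalog[edge_name]["input_datasets"]:
            visit(parent)
        ordered.append(edge_name)

    visit(target)
    return ordered
</code></pre></div></div>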
<p>Let’s <a href="/transformers-and-datasets">implement it and see</a>.</p>HackalogWe thought we were building a graph of dependencies. Turns out we had a hypergraph.The LLF Guide to Remote Work2020-03-08T00:00:00+00:002020-03-08T00:00:00+00:00http://hackalog.github.io/remote-work<p>A colleague of ours recently asked us about the challenges of doing
“remote work”. Obviously, in the current health environment, a lot of
organizations are considering doing remote work for the first
time. For those of us fortunate enough to be knowledge workers,
working from home is pretty accessible. At <a href="https://learnleapfly.com">Learn Leap Fly</a>, we’ve been
doing remote work since the company started in 2015. We built our
model by borrowing heavily from companies far more experienced
than us (e.g. <a href="https://stephyiu.com/2019/02/17/behind-the-scenes-culture-and-tools-of-remote-work-at-automattic/">Automattic</a>, <a href="https://www.youtube.com/watch?v=83fk3RT8318">Mozilla</a>, <a href="https://basecamp.com/guides/how-we-communicate">Basecamp</a>), and then iterating on
those practices based on our own experiences.</p>
<p>Here are some of the things we’ve learned, both as remote workers, and as remote team managers,
over the last 5 years.</p>
<h2 id="on-culture">On Culture</h2>
<p>Be intentional and deliberate about <strong>fostering your team
culture</strong>. One of the big liabilities of remote work is that it’s easy
to lose the human side of interactions. To keep your culture in a
remote work environment, you have to be intentional and actively
foster it.</p>
<p><strong>Culture is what you do</strong>, not what you say you do. Culture is made up of
the shared practices that you have in common, and what they reflect
about your values as a team. Do you silently tolerate an in-crowd or
do you actively practice and foster open communication channels that
include everyone on your team? Do you favor extroverts, or do you give
a voice to everyone? Remote work is a great opportunity to <strong>reflect on
different ways of working and interacting</strong>. Whatever practices you
adopt will inevitably reflect and determine the culture of your team.</p>
<p>We recommend you use the remote work opportunity to be deliberate
about what you value, to document these values, and develop processes
that nurture and reflect the culture that you wish to have in your
organization.</p>
<h2 id="on-synchronous-vs-asynchronous-communication">On Synchronous vs. Asynchronous Communication</h2>
<p>I’d go out on a limb and say that every successful remote-work
organization has an <strong>asynchronous most-of-the-time</strong> communications
culture. That is, if asynchronous means will get the job done, <strong>do it
asynchronously</strong>. Practically speaking:</p>
<p><strong>Don’t hold a meeting if you can do it another way.</strong> This is a <a href="http://www.paulgraham.com/makersschedule.html">general rule of thumb for
productive work, period</a>. Use remote work as an opportunity to <strong>break your meeting
habit</strong>, and to develop more productive, asynchronous communications
habits.</p>
<p><strong>Recognize the true cost of meetings</strong>. In addition to being hard to schedule (especially across timezones), synchronous meetings are far more costly than anyone remembers to factor in. <a href="https://basecamp.com/guides/how-we-communicate">As Basecamp puts it</a>: “five people in a room for an hour isn’t a one hour meeting, it’s a <strong>five hour</strong> meeting”.</p>
<p>That said, <strong>face-time is important</strong>. Learn Leap Fly has one regular
meeting: the weekly standup. It’s easy to start
feeling lonely and isolated without laying eyes on your coworkers
once in a while.</p>
<h2 id="on-meetings">On Meetings</h2>
<p>If you do want to hold synchronous meetings, here’s our advice:</p>
<ul>
<li><strong>Share the agenda in advance</strong>. For regular meetings with a set
agenda (e.g. Sprint Rollovers) it’s sufficient to share the agenda
once. We keep them on our wiki, and evolve them over time as we
need to.</li>
<li>Use the <strong>highest quality audio and video platform</strong> you can
afford. We use zoom (which has easily the best multi-party video
quality).</li>
<li>Make sure <strong>everyone has good quality headphones and mics</strong>. Don’t
skimp out here. Buy them for your employees if you can.</li>
<li><strong>Connect from someplace quiet</strong>. No coffee shops. In a pinch, use a
car, or even a closet. (But please, don’t use a bathroom. That’s
just… gross.)</li>
<li><strong>Use one screen per person</strong>. Even if more than one person happens
to be working from the same space, require everyone to have an
individual connection to the video chat. This puts everyone on the
same footing. There’s nothing worse than watching an off-camera,
hard-to-hear conversation from across a bad quality
video-feed. One screen per person levels the playing field,
letting everybody <strong>feel like an equal contributor</strong>, regardless
of whether they are remote or local.</li>
<li>If you can’t find a regular meeting time that fits everyone, <strong>alternate meeting times</strong>.
That way, it’s not always the same people who are being left out.</li>
<li><strong>Take and post meeting notes</strong>. Rotate this responsibility to
ensure everyone has a chance to participate at some point.</li>
</ul>
<h2 id="on-email-blogs-and-wikis">On Email, Blogs, and Wikis</h2>
<p><strong>Don’t use email</strong> for internal business communication. Just
don’t. Use it to communicate with those outside your business if you
must, but use an <strong>asynchronous messaging tool</strong> (e.g. Slack, Skype
for Business, Basecamp) for conversations, and a <strong>documentation
platform</strong> (e.g. blogs, wikis) for more permanent team
communications.</p>
<p><strong>Write things down</strong>. <a href="https://basecamp.com/guides/how-we-communicate">Basecamp likes to say</a>:
“Speaking only helps who’s in the room, writing helps everyone.”
Think about the people who couldn’t make it to a meeting, future
employees or contractors, and even <strong>future you</strong>.</p>
<p><strong>Pick the tools that work for you</strong>. Internal blogs are great for
more detailed ongoing posts about your work. A Wiki can be great for
long-form, archived information. Automattic uses a wordpress theme
called P2 to merge their chat, checkins, and blog posts into a
single interface. Atlassian has confluence. Learn Leap Fly uses
MediaWiki and Notion. There are lots of options.</p>
<p><strong>Record daily check-ins</strong>. It’s really easy to lose the serendipitous
advantages that come from running into each other at the office and
chatting about what you’re currently working on. The informal and
unplanned sharing of information is key to productivity and
creativity of teams. Whether this is jotting down a few notes on the
corporate wiki every day, or using the wonderful “automatic
check-in” features of products like Basecamp, set aside a few
minutes at the end of each work day to share what you have been
working on with your colleagues.</p>
<h2 id="digital-watercoolers">Digital Watercoolers.</h2>
<p>Work isn’t always just about working. <strong>Have places where people can
interact informally</strong> and let off steam. At LLF, we have a #feeds channel in slack
so people can share interesting things they’re reading
online. <a href="https://revelry.co/watercooler-channel/">Some companies</a> have a #watercooler or #random
channel for off-topic chat. The point is, people need to interact
about things that aren’t mainline work (and this is a good thing).</p>
<h2 id="the-arc-of-work">The Arc of Work</h2>
<p><strong>Work in sprints</strong>, with a specific goal and specific end date. Ours
are either 2 or 3 weeks long, and we identify specific success
criteria to make sure our sprints aren’t overly
ambitious. Whenever we hit these goals, we have a mini-celebration
at the sprint rollover.</p>
<p><strong>Document your successes, and failures</strong>. We have everyone write up
a sprint report as part of our sprint rollovers. This is a little
post that answers the following questions:</p>
<ul>
<li>What did you set out to do?</li>
<li>What did you actually do?</li>
<li>What’s blocking your progress?</li>
<li>Are there any process changes that would help you?</li>
<li>What’s your morale (1-10)?</li>
</ul>
<p>Finally, organize sprints into <strong>larger arcs</strong>. Ours are roughly 3 months
long, after which we prepare a more detailed summary of what we
accomplished and learned. Basecamp calls these checkins “heartbeats.”
The act of reflecting on a larger arc is really, really useful to keep
you from losing the forest in all those daily trees.</p>
<h2 id="for-the-remote-worker">For the Remote Worker</h2>
<p>Have a <strong>dedicated personal work space</strong>. Home offices are amazing for
productivity. If you don’t have room for an office, create a space somewhere in your house
that you <strong>only use for working</strong>.</p>
<p>Think about <strong>ergonomics</strong>. Invest in a properly set-up desk, a great
chair, monitor stands, and a good keyboard. Companies like Automattic
give stipends for home-office setup costs. This is a great way to help
people build a productive and ergonomic home office.</p>
<p><strong>Get dressed for work</strong>. We don’t mean “dress up.” We mean, “get
changed out of your pajamas”. Having a transition from your “home day”
to your “work day” is important. We’ve heard of people that will go to
a coffee shop first thing, read the paper, and then come back to their house
to start their work day. Whatever works for you, try and establish a
routine around starting, and stopping work for the day.</p>
<p><strong>Have dedicated work hours</strong>. This is as much for other people as it is
for you. Plan when you are going to start, when you are going to stop,
and communicate these times with everyone who needs to know them. At
Learn Leap Fly, we use a shared Google calendar for this.</p>
<p><strong>Speak up!</strong> One of the drawbacks of remote work is that no one can see
you beavering away. Share what you’re doing on the group chat. Post
your daily checkins and weekly updates.</p>
<p><strong>Stay logged in to the group chat</strong> whenever you are working. For
some reason, seeing that little green dot that tells you other
people are online—even if you’re not actively talking to them—is
super comforting when remote working. Stay present, but don’t
constantly check your messages, and get sucked into side
conversations if you’re trying to do deep work. Most messaging tools
let you turn on a “Do not disturb” knob that silences notifications
for a while. Use it.</p>
<h2 id="for-the-remote-team-manager">For the Remote Team Manager</h2>
<p><strong>Trust your team members</strong>. One of the first questions we get
whenever we talk about remote work arrangements is “what do you do if
someone starts slacking off?” They don’t. The whole magic of a
flexible work arrangement is that so long as you are meeting your
objectives, we really don’t need to know how you’re doing
it. Presumably, you already have mechanisms to review work, with
performance reviews and the like. Trust them. If the performance
reviews are broken, fix them. In the meantime, <strong>trust your team members</strong>.</p>
<p><strong>Don’t let your people work too much</strong>. Ironically, with all the
questions around remote workers slacking off, it’s working too much
that often ends up being the real danger. It’s really easy to get
sucked in to working too much when you live in your workspace. Keep
an eye on your workers. Make it a cultural badge of honour to <em>not</em>
work more than 40 hours in a week.</p>
<p><strong>Don’t require a fixed work schedule</strong>. Let people define the hours
that best work for them. Trust them to do the work the way that suits
them best, and you’ll be amazed at the results.</p>
<p><strong>Don’t try and replicate the in-person office experience</strong> remotely. In
fact, you should use the <strong>remote</strong> work experience to <strong>improve your
in-person office work environment</strong>. There are a lot of unique
advantages to remote work. Take advantage of them. Get
your team used to them, and use the change of setting to apply them
back to the office setting. One of our favorites is opening up the
decision-making process and letting more people in to see how
decisions are made in real time. Distributed tools allow everyone to
be in the room, not just “management.”</p>
<h2 id="tools-we-use">Tools We Use</h2>
<p>No talk of remote work would be complete without mentioning the tools
we use. Likely, every remote work scenario will use tools to implement
at least the following functions.</p>
<ul>
<li>Real-time team chat (e.g. Slack, Skype for Business, Mattermost)</li>
<li>Shared calendar (Basecamp, Google Calendar)</li>
<li>Videoconferencing (Zoom, Google Hangouts, Skype)</li>
<li>Information Repository (MediaWiki, Notion, Confluence)</li>
<li>Team blogging platform (Confluence, Wordpress)</li>
</ul>
<p>If you’re a technology shop, you’ll likely also need these:</p>
<ul>
<li>Shared Kanban / Sprint boards (Trello)</li>
<li>Brainstorming Tools (Miro, Mural)</li>
<li>Code Repo (GitHub, Gitlab, BitBucket) and CI</li>
</ul>
<p>Here are some of our favourite tool combinations:</p>
<ul>
<li>The <strong>Free Tier</strong>: slack + mediawiki + zoom + trello + miro + google cal + notion + github/bitbucket</li>
<li>The <strong>All-in-one(ish)</strong>: Basecamp + zoom + slack + github</li>
<li>The <strong>Atlassian</strong>: Confluence + Bitbucket + Trello + slack + zoom.</li>
<li>The <strong>Self-Hoster</strong>: Mattermost + wordpress/P2 + MediaWiki + GitLab + zoom</li>
</ul>
<h2 id="tools-we-use-but-dont-want-to-talk-about-here">Tools we use but don’t want to talk about here</h2>
<ul>
<li><strong>Shared Todo Managers</strong>. Actually, we use them all the time, but
this level of personal productivity tends to be very
personal. We’d recommend you leave this part of the stack up to
the individual. (We currently use Nozbe, though we’ve tried Asana
and OmniFocus as well.)</li>
<li><strong>Time tracking and Reporting</strong>. We use toggl, if it matters.</li>
<li><strong>Customer Relations Management (CRM) tools</strong>. We use Mailchimp.</li>
</ul>
<h2 id="tools-we-dont-use">Tools we don’t use</h2>
<ul>
<li><strong>Ticketing</strong> (e.g. Jira, Zendesk). We’re simply not in that
business. Besides, that’s more of a business function than a
remote-work enabler.</li>
<li><strong>Single Sign-on</strong>. We use a password manager (1Password) and generate unique random, strong passwords on every platform or website we use.</li>
<li><strong>Corporate email</strong>. Hopefully you’re sold on the virtues of <strong>not</strong> using email for team communications.</li>
</ul>
<h2 id="good-remote-work-references">Good Remote Work References</h2>
<p>Don’t take our word for it. Here is some good reading on the various topics covered in this post.</p>
<ul>
<li><a href="https://basecamp.com/books/remote">Remote: Office Not Required</a></li>
<li>Basecamp’s <a href="https://basecamp.com/guides/how-we-communicate">How We Communicate</a></li>
<li><a href="https://www.youtube.com/watch?v=83fk3RT8318">Mozilla Best Practices</a></li>
<li><a href="https://ma.tt/2020/03/coronavirus-remote-work/">Remote Work and the Coronavirus</a></li>
<li><a href="podcasts.apple.com/us/podcast/distributed-with-matt-mullenweg/id1463243282">Distributed with Matt Mullenweg</a> - Great podcast with Automattic’s Founder</li>
<li><a href="https://stephyiu.com/2019/02/17/behind-the-scenes-culture-and-tools-of-remote-work-at-automattic/">Behind the scenes: culture and tools of remote work at Automattic</a></li>
<li><a href="https://revelry.co/watercooler-channel/">Building Remote Office Culture with a Watercooler Channel</a></li>
<li>Extreme Remote Work: <a href="https://www.wired.com/story/what-do-i-do-all-day-livestreamed-technology-ceoing/">Stephen Wolfram’s CEO Livestream</a></li>
<li>Paul Graham on <a href="http://www.paulgraham.com/makersschedule.html">The Maker’s Schedule</a></li>
<li><a href="https://basecamp.com/features/checkins">Auto-checkins in Basecamp</a></li>
</ul>HackalogSome of the things we've learned, both as remote workers, and as remote team managers, in 5 years of remote working.Reproducible Data Science2020-02-20T00:00:00+00:002020-02-20T00:00:00+00:00http://hackalog.github.io/reproducibility<h2 id="missing-pieces">Missing Pieces</h2>
<p>About 2 years ago, <a href="https://github.com/acwooding/">acwooding</a> and I attended a workshop on text analysis,
where a lot of people did some really nice work embedding text into
vector spaces under a variety of algorithms. What we were
working on was trying to establish some stability results;
i.e. whether repeated embeddings under the various algorithms were stable,
or whether the results were all over the place because, for example,
the algorithm was randomized and we had just gotten lucky.</p>
<p>When we sat down to write up the analysis, we discovered really
quickly that we had a problem. Though we still had a collection of
<code class="language-plaintext highlighter-rouge">jupyter</code> notebooks and the associated data blobs, we had <em>no idea</em> how
our collaborators had pre-processed their data to insert into the
process in the first place. We had lost the information about the
preparation of the data, and hence, we’d lost the ability to generate
a consistent set of analyses across all of our data. Our workshop results
weren’t reproducible, and we were going to have to do a bunch of work
over from scratch if we wanted to publish anything.</p>
<p>If you were to survey your average data scientist on how much time they
spend in a given phase of the operation, you’d probably get something
that looks like this:</p>
<p><img src="https://raw.githubusercontent.com/hackalog/bus_number/master/notebooks/references/charts/munge-supervised.png" alt="Supervised Learning" /></p>
<p>In supervised learning, around 2/3 of the time is spent munging the data in
the first place, before you finally get around to doing your analysis.</p>
<p><img src="https://raw.githubusercontent.com/hackalog/bus_number/master/notebooks/references/charts/munge-unsupervised.png" alt="Unsupervised Learning" /></p>
<p>In unsupervised learning problems, it’s more like 90%.</p>
<p>Admittedly, like all statistics, these actual numbers are made up, but
they illustrate a real phenomenon. A vast amount of effort we are
performing as data scientists is happening before we ever get around
to the analysis part.</p>
<h2 id="but-what-about-the-environment">But What About the Environment?</h2>
<p>What we wanted to be able to do was capture that data munging history,
and turn that process into something that is sharable and
reproducible. We started looking at our own past analyses and set out
to create and adopt a more standard workflow that would make it easy to
preserve (and share) the whole process of data science, including the
data munging.</p>
<p>At PyData NYC 2018, we ran a tutorial called “<a href="https://pydata.org/nyc2018/schedule/presentation/46/">Up your Bus
Number</a>: A Reproducible Data Science Workflow.” At that workshop we
were intending to talk a great deal about the munging of data, and the
wonderful and clever APIs that we had settled on to help simplify that
process. When we actually ran the tutorial, it turned out that about
80% of our time was spent before we even got to data munging. It was
spent trying to set up consistent, reproducible environments on a wide
variety of systems. The hard parts of getting to a reproducible data
science pipeline (installing and maintaining your environment), for
most of the people we were encountering, didn’t even show up in the
survey that we did about where your time is spent. Most people knew of
(or even used) tools like anaconda or virtualenv, but not in a way
that let them easily maintain and share these environments, or
reproduce the environments of others.</p>
<p>Even if we could reproduce the data munging, we couldn’t reproduce the
<em>development environment</em>. We have all these fancy tools: anaconda,
virtualenv, the now-deceased pipenv, and any number of wrappers and
scripts that are designed around making it easier to build a custom
python environment that’s tuned for your problem at hand. But actually
using those in a consistent manner is not trivial.</p>
<p>When we talk to people about reproducible data science, everyone wants
it, but almost nobody wants to <em>do</em> it. Most people think we’re
talking about reproducing an analysis, because that’s the easy
part. If we dig in a little further, some will grudgingly speak about
reproducing their data munging. Almost nobody talks about solving the
challenges of reproducible environments, assuming that tools like
conda have already solved that. Yet when we sit down to do the work,
environment and data munging issues dominate the effort.</p>
<h2 id="recognizing-the-hard-parts">Recognizing the Hard Parts</h2>
<p>One of our stated goals is to help make data scientists more
productive. How can we do this? Give them the ability to do their job
with less futzing around with their environments, and make it easy for
them to share their work—including the data munging. The primary means
by which data scientists exchange data science lore is by passing
around <code class="language-plaintext highlighter-rouge">jupyter</code> notebooks. But there’s so much that goes in under the
hood before that <code class="language-plaintext highlighter-rouge">jupyter</code> notebook ever even gets run, that if we don’t
take steps to that additional information—including information about
the environment, the data munging, the metadata associated with the
data sources—then data scientist productivity is lost, and
reproducibility goes right out the window.</p>
<p><img src="https://github.com/alan-turing-institute/the-turing-way/raw/master/book/content/figures/reproducibility/ReproducibleMatrix.jpg" alt="The Reproducibility Matrix" />
<em>Source: <a href="https://the-turing-way.netlify.com/reproducibility/03/definitions.html">The Turing Way</a>. (<a href="https://creativecommons.org/licenses/by/4.0/">CC-BY-4.0</a>)</em></p>
<p>Our challenge is this: if we want reproducible data science—and
that covers the entire spectrum of reproducibility, replicability,
generalizability, and robustness—then the hardest thing we have to
do is <strong>identify what the hard parts are</strong>. The only way to do that is
to repeatedly sit down with people and walk through their
pipelines. As many people as we can. Take their work and attempt to
reproduce it, and in doing so, learn where those barriers to
reproduction actually live: the technical barriers, the UX barriers,
and the psychological barriers. Then, and only then, put in the hard work
of building a toolkit that also solves the psychological and
user-interface problems, and the workflow and API issues that prevent
reproducibility in the first place.</p>HackalogWhat's the hardest part about reproducible data science? Recognizing the hard parts.