Brilliantly Wrong — Alex Rogozhnikov’s blog about math, machine learning, programming, physics and biology. https://arogozhnikov.github.io/ State of Wall in Protein Language Models in 2026 <p>Last year Pascal Notin wrote a great post summarizing important observations about AI + proteins: <a href="https://pascalnotin.substack.com/p/have-we-hit-the-scaling-wall-for">Have we hit the scaling wall for protein language models?</a>. (Spoiler: the answer is ‘yes’)</p> <p>Briefest summary if you didn’t read it:</p> <ul> <li>PLMs’ performance on fitness prediction (‘transferability’ of skills) plateaus after 1B and declines after 5B parameters. This holds for multiple PLM families</li> <li>leading approaches combine MSAs and 3D structure. Even very simple methods that combine these sources of information outperform billion-parameter models</li> <li>training on genetic sequences (that’s quite a lot of additional signal!) doesn’t help — Evo and Evo-2 are near the bottom of the leaderboard</li> </ul> <blockquote> <p><strong>Remark:</strong> I’ll focus on sequence-based models, and declare folding and inverse folding out-of-scope for this post.</p> </blockquote> <p>New models have appeared on the ProteinGym leaderboard since Pascal’s post, but the conclusions hold. A later analysis from another group corroborates this: <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11601519/">Medium-sized PLMs perform well at transfer learning on realistic datasets</a>. Folding models keep using embeddings from the (very old) ESM-2.</p> <p>We’re in a weird position where we have a lot of sequencing data (and computing power), but we can’t put it to work. Let’s take a tour across recent literature and see if there are any signs of going beyond this scaling wall.</p> <blockquote> <p><strong>Remark:</strong> for comparison, widely used structure models (AlphaFold2 / AlphaFold3 / proteinMPNN) have even fewer than 1B parameters. This could be explained by the smaller size of the PDB compared to UniProt, or maybe it’s just a common trait of molecular biology.</p> </blockquote> <h2 id="amplify-is-scaling-necessary">AMPLIFY: is scaling necessary?</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2024.09.23.614603v2">preprint</a></p> <p>Interestingly, the authors explicitly start by noting that the premise that “scale leads to performance” is likely false in PLMs, and then use recent LLM pretraining techniques to achieve better perplexity than ESM-2 with a cheaper and smaller model.</p> <p><img src="/images/protein_lms/amplify_perplexity.png" alt="perplexity of AMPLIFY" /></p> <p>They explore removing UniProt clustering (used by most models) to increase the size/diversity of the training data. Their main argument: clustering adds too much weight to non-realistic sequences.</p> <p>Validation, interestingly, is a subset of the human proteome — the choice matters because final perplexity rankings are strongly affected by how similar the validation distribution is to the training data.</p>
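<p>To make the metric concrete: for masked PLMs, “perplexity” is usually computed as pseudo-perplexity, masking one position at a time. Below is a minimal sketch, assuming the HuggingFace <code class="language-plaintext highlighter-rouge">transformers</code> masked-LM interface and the public <code class="language-plaintext highlighter-rouge">facebook/esm2_t33_650M_UR50D</code> checkpoint; it is an illustration, not the evaluation code of any paper discussed here.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Pseudo-perplexity of one sequence under a masked protein LM.
# Assumes the HuggingFace transformers interface and a public ESM-2 checkpoint;
# this is an illustrative sketch, not the evaluation code used by the papers above.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def pseudo_perplexity(sequence):
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]  # [1, L+2], with BOS/EOS
    log_probs = []
    with torch.no_grad():
        for pos in range(1, ids.shape[1] - 1):  # skip special tokens
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            logits = model(masked).logits  # [1, L+2, vocab]
            logp = torch.log_softmax(logits[0, pos], dim=-1)[ids[0, pos]]
            log_probs.append(logp.item())
    # exponent of the negative mean per-residue log-likelihood
    return float(torch.exp(-torch.tensor(log_probs).mean()))

print(pseudo_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
</code></pre></div></div>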
<p>Turns out, the quality of sequencing data matters a lot — significant improvements correlate with the largest “clean-ups” in UniProt.</p> <p>Other interesting bits:</p> <ul> <li>AF2 can’t distinguish between non-proteins and disordered proteins (PLMs, of course, can)</li> <li>sequence recovery is very good (a lot of analysis in the supplements)</li> <li>analysis of performance on downstream tasks (like protein properties) is lacking, but this was covered in other papers.</li> </ul> <p>Overall: yes, we can significantly improve perplexity/recovery, and model size isn’t crucial.</p> <h2 id="structure-alignment-of-esm2-and-amplify">Structure-alignment of ESM2 and AMPLIFY</h2> <p><a href="https://arxiv.org/pdf/2505.16896v2">preprint</a></p> <p>Multiple works in this list sprinkle structure tokens into training (and sometimes inference). This work instead uses a CLIP-like contrastive alignment step between PLM token embeddings and protein GNN (GearNet) structure embeddings. A second loss directly predicts structure tokens.</p> <p>This delivers good improvements on contact prediction, fold and secondary structure, but interestingly not so much on downstream tasks (notably, in Table 8 / Figure 10 SaAMPLIFY isn’t better than plain AMPLIFY).</p> <p>SaESM-2 (aligned ESM-2) transfers to downstream tasks better than SaAMPLIFY — again confirming the very poor correlation between perplexity and transferability.</p> <p><img src="/images/protein_lms/sa_proteins_transfer.png" alt="SaESM / SaAMPLIFY transferability" /></p> <h2 id="prosst-quantized-structure-tokens">ProSST: quantized structure tokens</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2024.04.15.589672v3.full.pdf">preprint</a></p> <p>ProSST heads the ProteinGym leaderboard; let’s look at the recipe:</p> <ol> <li>structure tokens are introduced by encoding 40 neighbors</li> <li>attention separately encodes sequence, structure tokens and relative position (the ablation against plain attention shows an unrealistically large improvement; could they have forgotten relpos?)</li> <li>pre-trained on AFDB (18.8M structures selected) using an ESM-style MLM objective</li> </ol> <p>The result is SOTA generalization to downstream tasks. Peak performance is reached at ~110M parameters, and then goes down.</p> <p>The model requires knowing the protein structure at prediction time, which is somewhat limiting. A huge structural database was used, and perplexity still improves with size, but downstream performance does not.</p> <p><img src="/images/protein_lms/proSST_trasnfer.png" alt="proSST transfer" /></p> <h2 id="vespag">VespaG</h2> <p><a href="https://academic.oup.com/bioinformatics/article/40/11/btae621/7907184">paper</a></p> <p>VespaG is a tiny projection on top of ESM-2 embeddings, and achieves SOTA performance among sequence-only models. The trick is to “align” the token embeddings produced by ESM-2 (or another PLM) to MSA-based statistics computed by GEMME.</p>
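<p>The core idea fits in a few lines. Here is a minimal sketch of such a head (dimensions, architecture and loss are my own illustrative assumptions, not VespaG’s published implementation): a small trainable module maps frozen per-residue PLM embeddings to 20 per-substitution scores and is fitted to MSA-derived targets such as GEMME scores.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a VespaG-style head: frozen PLM embeddings mapped to per-residue substitution scores.
# Dimensions, architecture and loss are illustrative assumptions, not the published model.
import torch
import torch.nn as nn

class SubstitutionScoreHead(nn.Module):
    def __init__(self, embed_dim=1280, n_amino_acids=20, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_amino_acids),  # one score per possible substitution
        )

    def forward(self, residue_embeddings):  # [batch, length, embed_dim]
        return self.net(residue_embeddings)  # [batch, length, 20]

head = SubstitutionScoreHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

def training_step(plm_embeddings, gemme_scores):
    # plm_embeddings: precomputed frozen ESM-2 embeddings, [batch, length, 1280]
    # gemme_scores:   MSA-based target scores,             [batch, length, 20]
    pred = head(plm_embeddings)
    loss = nn.functional.mse_loss(pred, gemme_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre></div></div>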
<p>From their analysis, again, the highest performance is reached with the 650M ESM-2 and then goes down — mirroring the results of the plain ESM family, with some additional boost in quality.</p> <h2 id="scaling-and-data-saturation-in-protein-language-models">Scaling and Data Saturation in Protein Language Models</h2> <p><a href="https://arxiv.org/pdf/2507.22210">paper</a></p> <p>The paper starts with a nice <a href="https://arxiv.org/abs/2507.00885">reference</a>: in the LLM world, the relation between scaling laws and downstream performance is not direct (likely even less so with RL finetuning strategies).</p> <p>The authors then show how this observation translates to the world of proteins by training a number of AMPLIFY models:</p> <ul> <li>chunk every sequence: training on more chunks from the <em>same</em> sequences consistently improves performance, while adding newer sequences can hurt it</li> <li>when stratifying by MSA depth, proteins with larger MSAs (as measured by Neff/L) tended to show improved prediction performance with later model training years, unlike those with smaller MSAs</li> <li>“when partitioning by functional assay type, proteins evaluated using Organismal Fitness as the readout exhibited the most consistent improvement over time, whereas other categories showed more variable or flat trajectories” — this is reasonable; after all, nature crafts sequences only by fitness</li> </ul> <p>Finally, an experiment with one specific family shows that a supervised dataset can replace a decade of collecting protein data in the wild, so … just collecting sequences in the wild is still useful, but inefficient.</p> <h2 id="training-compute-optimal-protein-language-models">Training Compute-Optimal Protein Language Models</h2> <p><a href="https://proceedings.neurips.cc/paper_files/paper/2024/file/8066ae1446b2bbccb5159587cc3b3bcc-Paper-Conference.pdf">neurips proceedings</a></p> <p>Metagenomic sequences are diverse and abundant, likely a good complement to UniProt — so the authors add ColabFoldDB to the training data.</p> <p>The paper builds a good contrast between MLMs and causal LMs (CLMs): MLMs are efficient but easy to overfit, the opposite of CLMs.</p> <p>They claim that the optimal training recipe is to start with the CLM objective, then switch the loss to MLM; surprisingly, training on both losses at the same time isn’t better. The authors argue that FLOPs-optimal scaling favors larger models (and they train up to 10B parameters). Results are mixed:</p> <ul> <li>transfer to downstream tasks isn’t impressive</li> <li>contact prediction: minor fine-tuning of a ~1B model achieves higher quality than larger models</li> </ul> <p>Interesting observation: BERT’s 15% masking ratio (used in ESMs) is still a good choice for protein MLMs.</p> <h2 id="ankh3-combining-sequence-denoising-and-completion">Ankh3: combining sequence denoising and completion</h2> <p><a href="https://arxiv.org/pdf/2505.20052">preprint</a></p> <p>This paper stands out because (1) it shows a good improvement in contact prediction and (2) the 6B model is overall better than the 2B model.</p> <p>The model is jointly optimized on two objectives: encoder-decoder protein completion and MLM denoising (with 15%, 20% or 50% masking probability; apparently short spans were masked, not individual tokens). Both points contradict the previous paper in this list — this could be a result of the encoder-decoder architecture.</p>
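<p>For reference, the difference between per-token masking and short-span masking is easy to state in code. The sketch below is a generic illustration with made-up parameters (mask symbol, span length), not Ankh3’s exact corruption scheme.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Per-token masking vs. short-span masking on a protein sequence.
# A generic illustration with made-up parameters, not Ankh3's exact corruption scheme.
import random

MASK = "#"  # stand-in for the mask token

def mask_tokens(seq, mask_prob=0.15):
    # BERT/ESM-style: each residue is masked independently
    return "".join(MASK if random.random() &lt; mask_prob else aa for aa in seq)

def mask_spans(seq, mask_prob=0.15, span_len=3):
    # span-style: choose span starts so that roughly mask_prob of residues get masked
    out = list(seq)
    n_spans = max(1, int(len(seq) * mask_prob / span_len))
    for _ in range(n_spans):
        start = random.randrange(0, max(1, len(seq) - span_len))
        for i in range(start, start + span_len):
            out[i] = MASK
    return "".join(out)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(mask_tokens(seq))
print(mask_spans(seq))
</code></pre></div></div>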
<p>The preprint leaves many questions unanswered:</p> <ul> <li>the model is deep (72 layers), so it could simply be inefficient</li> <li>evaluation is limited to datasets without an easy ‘leaderboard’ to estimate downstream performance.</li> <li>I’m a bit concerned that ESM-2 and Ankh results were “sourced from the Ankh paper” instead of being reproduced.</li> </ul> <h2 id="progen3-scaling-unlocks-broader-generation-and-deeper-functional-understanding-of-proteins">ProGen3: Scaling Unlocks Broader Generation and Deeper Functional Understanding of Proteins</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2025.04.15.649055v1">preprint</a></p> <ol> <li>Employs a huge curated dataset (PPA-1) that combines genomic and metagenomic sources and excludes fragments.</li> <li>The model is trained on left-to-right, right-to-left and span-infilling objectives (finally!), then aligned on downstream tasks using IRPO, a modification of DPO.</li> </ol> <p>Results: non-aligned performance frequently peaks at ~3B, while aligned performance usually still improves. Larger models can generate proteins from more clusters, with tiny improvements in expression.</p> <p>Exact numbers on ProteinGym aren’t impressive, but the overall dynamics after alignment look encouraging.</p> <h2 id="dplm-1--dplm-2--esm-3"><a href="https://arxiv.org/abs/2402.18567">DPLM-1</a> / <a href="https://arxiv.org/abs/2410.13782">DPLM-2</a> / <a href="https://www.science.org/doi/10.1126/science.ads0018">ESM-3</a></h2> <p>These models were trained with a sufficient amount of structural information in the form of structure tokens.</p> <p>DPLM-1 achieves better downstream performance on multiple tasks with its 3B model (no larger model was analyzed), but DPLM-2 (with a primary focus on structure tokens based on LFQ) reports only a 650M model — I treat this as an implicit signal of a scaling boundary. Interestingly, DPLM-2 shows worse downstream performance, and the authors link this to the missing PLM pretraining in DPLM-2.</p> <p>A combination of scaling + PLM pretraining + better structure tokens would be very interesting, but this hasn’t happened yet with DPLMs (or it happened and the result wasn’t good enough for publication).</p> <p>ESM-3 is somewhat close, but the authors don’t report any actually translatable properties of the model; performance reported on ProteinGym isn’t impressive, and ESM-C 300M performs similarly to ESM-C 600M.</p> <h2 id="msa-as-a-context-for-plms">MSA as a context for PLMs</h2> <p>MSA-based models (like MsaPairformer) show better transferability than PLMs (while being smaller).</p> <p>The PoET model started a direction in PLMs where homologous sequences are passed as context while the architecture remains a classical transformer.</p> <p>This direction inherits the weak sides of both PLMs and MSA-based models: (1) one still has to retrieve MSAs, (2) alignment has to be done by the model implicitly, (3) there are more weights than in MSA-based models, and (4) long+deep MSAs are expensive because of quadratic attention.</p> <p>One paper from this family (<a href="https://www.biorxiv.org/content/10.1101/2025.11.12.688125v1.abstract">Profluent E1</a>, also trained on PPA-1) claims good performance on ProteinGym and contact prediction (better than MsaPairformer and other PLMs) and shows positive scaling … up to 600M. From the plots I’d expect further improvement on contact prediction, but not on downstream tasks. Given the cost of training, it isn’t surprising that the largest model is only 600M.</p>
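<p>MSA depth came up twice above: the Neff/L stratification in the data-saturation paper, and the cost of passing long, deep MSAs as context. For concreteness, here is one common way to compute Neff/L, sketched for a toy aligned MSA with an 80% identity threshold; exact thresholds and weighting schemes differ between papers.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Effective number of sequences (Neff) of an MSA, normalized by length (Neff/L).
# One common definition (neighbor weighting at 80% identity); details vary between papers.
def neff_per_length(msa, identity_threshold=0.8):
    # msa: list of equal-length aligned sequences, gaps as '-'
    length = len(msa[0])
    weights = []
    for seq_i in msa:
        n_neighbors = 0
        for seq_j in msa:
            matches = sum(a == b for a, b in zip(seq_i, seq_j))
            if matches / length &gt;= identity_threshold:
                n_neighbors += 1  # includes the sequence itself
        weights.append(1.0 / n_neighbors)
    return sum(weights) / length

toy_msa = [
    "MKTAYIAK-RQIS",
    "MKTAYIAKQRQIS",
    "MRTAYLAKQRELS",
]
print(neff_per_length(toy_msa))
</code></pre></div></div>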
<h1 id="final-thoughts--directions">Final thoughts / directions</h1> <p>Multiple years of research on PLMs have not produced a recognized recipe for utilizing the vast sequencing data. Recent literature contains some interesting hints, but no strong hypotheses for how to do this. PLMs increasingly incorporate structural or MSA features, which pushes performance up; model size still mostly doesn’t matter.</p> <p>PLMs started from the assumption that better perplexity means an overall better understanding of protein sequences, as it worked in NLP. This assumption is wrong, and likely it isn’t true in NLP either: longer training on natural language worked because pretty much any reasonable problem was already discussed, with examples, in the training data. Later progress in NLP was driven by numerous problem-oriented curated datasets; scaling only helped with storing knowledge/patterns in the model.</p> <p>If, in addition to protein sequences, the training data contained various tokens related to expression, function, interaction, biophysical properties, etc., then all those metrics would go up. Protein sequences alone don’t provide enough training signal. Can correlation with other genes from the same organism provide a more useful context? Can a functional description form a better prompt? Some teams are working on this, so we’ll see soon.</p> <p>Is there a double descent in biology? Given the size of ESM-3, I’ll take this hypothesis off the table.</p> <p>Are we memorizing phylogenetic noise? Almost surely yes. Larger models can generate proteins from more families (as shown by E1), while the best property prediction is still provided by analysis of MSAs (within the same family).</p> <p>Maybe nature does not care much about <em>our</em> downstream tasks. Maybe not much memory is needed to store everything useful in biology (we’re far from optimal performance, so probably not).</p> <p>A simple but likely more fruitful direction at this point would be to curate a large dataset with diverse downstream properties.</p> <p><strong>Confounding factors?</strong> We don’t accept assay results at face value, but we generally assume that protein sequences are free of confounding effects (except for phylogenetic noise). In <a href="https://arxiv.org/pdf/2512.20924">“Clever Hans in Chemistry”</a> the authors show that models can guess the author of a molecule; knowing the author, they can guess the activity without looking at the molecule itself. <em>Could similar cues appear in non-frequent sequences?</em> Like the sequencing technology, or the assembly method? This is yet another hypothesis for why we don’t see generalization.</p> Sun, 01 Feb 2026 12:00:00 +0000 https://arogozhnikov.github.io/2026/02/01/protein-lms.html protein language models deep learning Fastest Autograd in the West <p>Who needs fast autograd? Seemingly everyone these days!</p> <p>And once upon a time I needed an autograd that is <strong>actually fast</strong>. 
Leaving project details aside, here are the requirements:</p> <ul> <li>we test many computation graphs (graph is changing constantly)</li> <li>many-many scalar operations with roughly <strong>10k—100k nodes</strong> in each graph</li> <li>every graph should be compiled and ran around <strong>10k times</strong> both forward and backward</li> <li>this should be done <strong>wicked fast</strong>, and with a convenient pythonic interface</li> </ul> <p>Path that awaits us ahead:</p> <ol> <li>autograd in torch</li> <li>autograd in jax</li> <li>autograd in python</li> <li>autograd in rust</li> <li>autograd in C</li> <li>autograd in assembly</li> </ol> <p>Plus a significant amount of sloppy code and timings on M1 macbook.</p> <h3 id="lets-autograd-in-pytorch">Let’s autograd in pytorch</h3> <p>We start our journey with pytorch — the default autograd engine in research. We’ll create a graph with many nodes, and to keep things simple our benchmark has only several kinds of operations: unary (softplus), binary (multiplication), n-ary (sum) and n-to-n (softmax).</p> <p>This allows using just a few operations, but resembles a realistic load. All benchmarks in this post will reimplement the same logic as below.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">initial_variables</span><span class="p">]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">])</span> <span 
class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:],</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">return</span> <span class="n">nodes</span> <span class="k">def</span> <span class="nf">run_benchmark_pytorch</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">):</span> <span class="n">init_vars</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="n">run_graph</span><span class="p">(</span> <span class="n">initial_variables</span><span class="o">=</span><span class="n">init_vars</span><span class="p">,</span> <span class="n">n_operations</span><span class="o">=</span><span class="n">n_operations</span><span class="p">,</span> <span class="p">)</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">backward</span><span class="p">()</span> </code></pre></div></div> <p>Run-time for 10k ops x 100 iterations: 11.3 seconds <br />Run-time for 10k ops x 10k iterations: <strong>1130 seconds</strong> (estimate)</p> <p>Given we created 100M python objects, it’s actually quite fast. And yes, that’s not going to deliver an interactive experience.</p> <p>Let’s also discuss <code class="language-plaintext highlighter-rouge">torch.compile</code>, a major innovation in pytorch 2.0.</p> <p>At 100 operations torch.compile takes 4.5 seconds. Execution gets faster: for 100 operations and 10k iterations it takes 4.52 seconds with torch.compile and 10.4 seconds without. Compilation + execution are still in the same ballpark. For bigger graphs (1k operations) <code class="language-plaintext highlighter-rouge">torch.compile</code> crashes.</p> <h3 id="lets-autograd-in-jax">Let’s autograd in jax</h3> <p>Jax is the new cool kid… well, not that new anymore. But in some aspects it is very interesting. 
Jax’s focus on JIT-compiling static graphs is very suitable for the problem at hand.</p> <p>Implementation for benchmark is similar to pytorch:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">jax</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="k">def</span> <span class="nf">run_graph_jax</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">initial_variables</span><span class="p">]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">])</span> <span class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:]),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">return</span> <span class="n">nodes</span><span class="p">[</span><span 
class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="n">run_graph_and_grad</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">value_and_grad</span><span class="p">(</span><span class="n">run_graph_jax</span><span class="p">)</span> <span class="c1"># or </span><span class="n">run_graph_and_grad</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">jit</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">value_and_grad</span><span class="p">(</span><span class="n">run_graph_jax</span><span class="p">))</span> </code></pre></div></div> <p>Without jit computations are extremely slow: <br /> 1k ops x 10 iterations =&gt; 15.9 seconds <br /> 10k ops x 10k iterations =&gt; 159,000 seconds (estimate)</p> <p>That’s a bit longer than forever! But whole point of jax is to JIT-compile stuff. So let’s do it.</p> <p>jit: compilation of 1k ops = 47 seconds <br /> jit: run-time for 1k ops x 10k iterations = 0.66 seconds <br /> jit: 10k ops x 10k iterations (compilation + run-time) =&gt; <strong>470 seconds</strong> (estimate)</p> <p>Speed up in execution time is more than impressive, but we spend &gt;99% of time compiling.</p> <h4 id="tensorflow">Tensorflow</h4> <p>Someone will mention TF anyway. I’ll leave this as an exercise for you, TF fans.</p> <h3 id="lets-autograd-in-python">Let’s autograd in python</h3> <p>Done with baselines, time to see if we can speed things up.</p> <p>Let’s create a simplistic pseudo-framework and see how it competes with previous candidates. We’ll implement a tape-like autograd where operations order is explicitly tracked in a tape.</p> <details> <summary class="code-summary">show autograd engine in plain python </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NaiveVar</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">val</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span> <span class="o">=</span> <span class="n">val</span> <span class="bp">self</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="mf">0.</span> <span class="k">class</span> <span class="nc">NaiveTape</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_values</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">val</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span 
class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'sum'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">prod</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="n">var1</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">var2</span><span class="p">.</span><span class="n">val</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'prod'</span><span class="p">,</span> <span class="p">[</span><span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">],</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span><span class="p">.</span><span class="n">val</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">]</span> <span class="n">maxval</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="o">-</span> <span class="n">maxval</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">)</span> <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="n">NaiveVar</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">/</span> <span class="n">denom</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'softmax'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">denom</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softplus</span><span 
class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">var</span><span class="p">.</span><span class="n">val</span><span class="p">)))</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'splus'</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="k">assert</span> <span class="n">var</span><span class="p">.</span><span class="n">grad</span> <span class="o">==</span> <span class="mi">0</span> <span class="n">var</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="mi">1</span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="n">out</span> <span class="o">=</span> <span class="n">outputs</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">out</span> <span class="o">=</span> <span class="n">outputs</span> <span class="n">in1</span><span class="p">,</span> <span class="n">in2</span> <span class="o">=</span> <span class="n">inputs</span> <span class="n">in1</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">in2</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">in2</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">in1</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">inputs</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span 
class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">inputs</span><span class="p">.</span><span class="n">val</span><span class="p">))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="k">pass</span> <span class="c1"># skip for now </span> <span class="n">case</span> <span class="n">_</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span> </code></pre></div> </div> </details> <p>and reimplement reference task using our new pseudo-framework:</p> <details> <summary class="code-summary">show benchmarking code </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph_python_and_backward</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="n">NaiveVar</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">initial_variables</span><span class="p">]</span> <span class="n">tape</span> <span class="o">=</span> <span class="n">NaiveTape</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">prod</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">],</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">3</span><span 
class="p">:</span> <span class="c1"># softmax </span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:]))</span> <span class="n">tape</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="k">return</span> <span class="n">tape</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops and 10k iterations: <strong>312 seconds</strong>.</p> <p>Expectably not fast. But compared to previous candidates, that’s actually quite competitive!</p> <h3 id="lets-autograd-in-python-again">Let’s autograd in python, again</h3> <p>This time we move all values into tape instead of keeping in variables. Additionally tape will keep a ‘static graph’ of computations by recording indices of variables participating in every operation.</p> <details> <summary class="code-summary">show code for autograd in plain python </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="nn">math</span> <span class="k">class</span> <span class="nc">VarInd</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">index</span> <span class="c1"># variable is just a unique index in tape </span> <span class="k">class</span> <span class="nc">TapeInd</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span> <span class="o">=</span> <span class="p">[]</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># flat memory with values </span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># flat memory with gradients </span> <span class="k">def</span> <span class="nf">make_var</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mf">0.</span><span class="p">)</span> <span class="k">return</span> <span class="n">VarInd</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span 
class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="k">def</span> <span class="nf">val</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">v</span><span class="p">:</span> <span class="n">VarInd</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">[</span><span class="n">v</span><span class="p">.</span><span class="n">index</span><span class="p">]</span> <span class="k">def</span> <span class="nf">add_op</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">kls</span><span class="p">,</span> <span class="n">input_vars</span><span class="p">,</span> <span class="n">output_vars</span><span class="p">):</span> <span class="c1"># translate variable to indices. self.ops keeps only indices </span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">kls</span><span class="p">,</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">index</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">input_vars</span><span class="p">],</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">index</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">output_vars</span><span class="p">]))</span> <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'sum'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">prod</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var1</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var2</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span 
class="s">'prod'</span><span class="p">,</span> <span class="p">[</span><span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">],</span> <span class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">]</span> <span class="n">maxval</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="o">-</span> <span class="n">maxval</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">)</span> <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">/</span> <span class="n">denom</span> <span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">res</span><span class="p">)</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softplus</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var</span><span class="p">))</span> <span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'splus'</span><span class="p">,</span> <span class="p">[</span><span class="n">var</span><span class="p">],</span> <span 
class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">forward_backward_external</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_var</span><span class="p">:</span> <span class="n">VarInd</span><span class="p">):</span> <span class="k">return</span> <span class="n">forward_backward_optimal</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">,</span> <span class="n">grad_var_index</span><span class="o">=</span><span class="n">grad_var</span><span class="p">.</span><span class="n">index</span><span class="p">)</span> <span class="k">def</span> <span class="nf">forward_backward_external</span><span class="p">(</span> <span class="n">vals</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">grads</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">ops</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]]],</span> <span class="n">grad_var_index</span><span class="p">:</span> <span class="nb">int</span> <span class="p">):</span> <span class="n">v</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span> <span class="n">g</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">grads</span> <span class="c1"># forward pass </span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">ins</span><span class="p">,</span> <span class="n">outs</span> <span class="ow">in</span> <span class="n">ops</span><span class="p">:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">)</span> <span class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span 
class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span> <span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="p">))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="n">maximal</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">)</span> <span class="n">exps</span> <span class="o">=</span> <span class="p">[</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">maximal</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">outs</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">exp</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">outs</span><span class="p">,</span> <span class="n">exps</span><span class="p">):</span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span> <span class="o">/</span> <span class="n">denom</span> <span class="n">g</span><span class="p">[</span><span class="n">grad_var_index</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="c1"># backward pass </span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">ins</span><span class="p">,</span> <span class="n">outs</span> <span class="ow">in</span> <span class="n">ops</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">:</span> <span class="n">g</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">g</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span 
class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">out</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">in1</span><span class="p">,</span> <span class="n">in2</span> <span class="o">=</span> <span class="n">ins</span> <span class="n">g</span><span class="p">[</span><span class="n">in1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">in2</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">out</span><span class="p">]</span> <span class="n">g</span><span class="p">[</span><span class="n">in2</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">in1</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">out</span><span class="p">]</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">g</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">g</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="n">avg_grad</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">outs</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">ins</span><span class="p">,</span> <span class="n">outs</span><span class="p">):</span> <span class="n">g</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">g</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">)</span> </code></pre></div> </div> <p>and corresponding launching code</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph_python_and_backward</span><span class="p">(</span><span class="n">n_operations</span><span class="p">,</span> <span class="n">n_iterations</span><span class="p">):</span> 
<span class="n">tape</span> <span class="o">=</span> <span class="n">TapeInd</span><span class="p">()</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="n">tape</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">prod</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">],</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">tape</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:])</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">):</span> <span class="n">tape</span><span class="p">.</span><span class="n">forward_backward</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> 
</code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>94 seconds</strong></p> <p>As we see, moving all values into the tape and switching to operating on indices is quite an efficient strategy. We still use python, but are now ~5-10 fold faster than <code class="language-plaintext highlighter-rouge">pytorch</code> or <code class="language-plaintext highlighter-rouge">jax</code>.</p> <p>At this point, I want to mention one more experiment: the code above is organized to be <code class="language-plaintext highlighter-rouge">numba</code>-friendly. <a href="https://numba.readthedocs.io/en/stable/">Numba</a> is famous for speeding up number crunching in python with minimal changes by providing just-in-time compilation. The recent addition of <code class="language-plaintext highlighter-rouge">numba.typed.List</code> makes it possible to efficiently handle lists of lists.</p> <p>Run-time with numba, 10k ops x 10k iterations: <strong>41 seconds</strong>. <br /> At this point we’re &gt;10-fold faster than jax/pytorch (and still writing code in python).</p> <h3 id="lets-autograd-in-rust">Let’s autograd in rust</h3> <p>Once graph tracking is moved to the tape, we can use something fast to run the computations for us. For instance, rust. For rust↔python interop I’ve used a small wrapper around <a href="https://github.com/mityax/rustimport">rustimport</a>. <code class="language-plaintext highlighter-rouge">Rustimport</code> makes it possible to conveniently “import” a single rust file without creating a full-fledged rust project.</p> <p>Some optimization remarks:</p> <ul> <li><code class="language-plaintext highlighter-rouge">softmax</code> was a bottleneck, so I switched to creating temporary arrays on the stack instead of Vecs, which required specializing on input sizes</li> <li>I followed the rust-y approach with iterators to reduce the number of bounds checks</li> <li>I wondered whether a match with multiple options checked one-by-one is slow. In synthetic tests it seemed relatively fast, but I wish jump table optimization were implemented here (e.g. 
it is supported for <a href="https://users.rust-lang.org/t/match-statement-efficiency/4488">enums</a> in rust, and clang <a href="https://stackoverflow.com/questions/60109992/why-is-a-switch-not-optimized-the-same-way-as-chained-if-else-in-c-c">uses</a> this optimization in C for switch-case)</li> </ul> <details> <summary class="code-summary">show rust code for minimal autograd </summary> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// rustimport:pyo3</span> <span class="k">use</span> <span class="nn">pyo3</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span> <span class="c1">// slower softmax version for larger number of inputs</span> <span class="k">fn</span> <span class="nf">softmax_varlength</span><span class="p">(</span><span class="n">vals</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">],</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">])</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">max</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20_f32</span><span class="p">;</span> <span class="k">let</span> <span class="n">loc_vals</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="p">{</span> <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">];</span> <span class="n">max</span> <span class="o">=</span> <span class="n">max</span><span class="nf">.max</span><span class="p">(</span><span class="n">x</span><span class="p">);</span> <span class="n">x</span><span class="p">}</span> <span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.0_f32</span><span class="p">;</span> <span class="k">let</span> <span class="n">exps</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">loc_vals</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">v</span><span class="p">|</span> <span class="p">{</span><span class="k">let</span> <span class="n">_exp</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="o">*</span><span class="n">v</span> <span class="o">-</span> <span class="n">max</span><span class="p">);</span> <span 
class="n">sum</span> <span class="o">+=</span> <span class="n">_exp</span><span class="p">;</span> <span class="n">_exp</span><span class="p">})</span><span class="nf">.collect</span><span class="p">();</span> <span class="n">outs</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.zip</span><span class="p">(</span><span class="n">exps</span><span class="nf">.iter</span><span class="p">())</span><span class="nf">.for_each</span><span class="p">(|(</span><span class="n">j</span><span class="p">,</span> <span class="n">exp</span><span class="p">)|</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span> <span class="o">/</span> <span class="n">sum</span> <span class="p">);</span> <span class="p">}</span> <span class="c1">// vecs are slow! so allocate slices on stack, and explicit grouping of computations also helps</span> <span class="k">fn</span> <span class="n">softmax</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">N</span><span class="p">:</span> <span class="nb">usize</span><span class="o">&gt;</span><span class="p">(</span><span class="n">vals</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">],</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">])</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">loc_vals</span><span class="p">:</span> <span class="p">[</span><span class="nb">f32</span><span class="p">;</span> <span class="n">N</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0_f32</span><span class="p">;</span> <span class="n">N</span><span class="p">];</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">exps</span><span class="p">:</span> <span class="p">[</span><span class="nb">f32</span><span class="p">;</span> <span class="n">N</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0_f32</span><span class="p">;</span> <span class="n">N</span><span class="p">];</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">max</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20_f32</span><span class="p">;</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="k">let</span> <span class="n">v</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">];</span> <span 
class="n">loc_vals</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="p">;</span> <span class="n">max</span> <span class="o">=</span> <span class="n">max</span><span class="nf">.max</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="p">}</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">_i</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="k">let</span> <span class="n">exp</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="n">loc_vals</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">-</span> <span class="n">max</span><span class="p">);</span> <span class="n">exps</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span><span class="p">;</span> <span class="n">sum</span> <span class="o">+=</span> <span class="n">exp</span><span class="p">;</span> <span class="p">}</span> <span class="k">let</span> <span class="n">invsum</span> <span class="o">=</span> <span class="mf">1.0_f32</span> <span class="o">/</span> <span class="n">sum</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="k">in</span> <span class="n">outs</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">exps</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">*</span> <span class="n">invsum</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">fn</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">f32</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f32</span> <span class="p">{</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span><span class="nf">.exp</span><span class="p">())</span> <span class="p">}</span> <span class="nd">#[pyfunction]</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">autograd</span><span class="p">(</span> <span class="n">vals_input</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ops</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">i32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">Vec</span><span 
class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;&gt;</span><span class="p">,</span> <span class="n">output_ids</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;&gt;</span><span class="p">,</span> <span class="n">backward_node_id</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">n_iteration</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span> <span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">vals</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">vals_input</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">grad</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">vals_input</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">_</span><span class="p">|</span> <span class="mf">0.0_f32</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">n_iteration</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">i_op</span><span class="p">,</span> <span class="n">op</span><span class="p">)</span> <span class="k">in</span> <span class="n">ops</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">(){</span> <span class="k">let</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">input_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">let</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">output_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">match</span> <span class="n">op</span> <span class="p">{</span> <span class="mi">0</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softplus</span> <span class="k">let</span> 
<span class="n">x</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="k">let</span> <span class="n">max</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">max</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span> <span class="k">let</span> <span class="n">min</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">min</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">max</span> <span class="o">+</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">ln_1p</span><span class="p">(</span><span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="n">min</span> <span class="o">-</span> <span class="n">max</span><span class="p">));</span> <span class="p">}</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// sum</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="n">vals</span><span class="nf">.get_unchecked</span><span class="p">(</span><span class="o">*</span><span class="n">i</span><span class="p">))</span><span class="nf">.sum</span><span class="p">();</span> <span class="p">}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// prod</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="p">}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softmax. 
we will need switch-case resolution here for most common cases</span> <span class="k">match</span> <span class="n">ins</span><span class="nf">.len</span><span class="p">()</span> <span class="p">{</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">2</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">3</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">4</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">4</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">5</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">5</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nf">softmax_varlength</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="nd">panic!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">grad</span><span class="p">[</span><span class="n">backward_node_id</span><span 
class="p">]</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">i_op</span><span class="p">,</span> <span class="n">op</span><span class="p">)</span> <span class="k">in</span> <span class="n">ops</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">(){</span> <span class="k">let</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">input_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">let</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">output_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">match</span> <span class="n">op</span> <span class="p">{</span> <span class="mi">0</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softplus</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="p">}</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// sum</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="p">}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// prod</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span 
class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="p">}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softmax</span> <span class="k">let</span> <span class="n">avg_grad</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="n">outs</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">j</span><span class="p">|</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="p">)</span><span class="nf">.sum</span><span class="p">();</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.zip</span><span class="p">(</span><span class="n">outs</span><span class="nf">.iter</span><span class="p">())</span> <span class="p">{</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="nd">panic!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="p">(</span><span class="n">vals</span><span class="p">,</span> <span class="n">grad</span><span class="p">)</span> <span class="p">}</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>1.4 seconds</strong></p> <p>Success: we are in the realm of interactive experiences. <br /> Recall we started from &gt;1000 seconds. But should we stop here?</p> <h3 id="lets-autograd-in-c">Let’s autograd in C</h3> <p>Time to implement autograd logic in C. For interop with python I use <a href="https://cffi.readthedocs.io/en/stable/index.html">python-cffi</a>.</p> <p>I went bananas on optimization:</p> <ul> <li>I used the fact that output nodes are placed consequentially in memory, so we pass only index of the first output</li> <li>number of inputs is limited to 8, and those are baked into struct as <code class="language-plaintext highlighter-rouge">int[8]</code>, not <code class="language-plaintext highlighter-rouge">int *</code> to avoid jumps in memory</li> <li>dynamic stack allocations of variable size (compared to rust, those are straightforward in C)</li> <li><code class="language-plaintext highlighter-rouge">-O3</code>, and unsafe math: <code class="language-plaintext highlighter-rouge">-ffast-math</code>. 
Even experimented memory alignment and restrict-ing pointers, but no luck</li> </ul> <details> <summary class="code-summary">show me some code in C </summary> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;math.h&gt;</span><span class="cp"> </span> <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">opcode</span><span class="p">;</span> <span class="kt">size_t</span> <span class="n">n_arguments</span><span class="p">;</span> <span class="c1">// used for softmax and sum</span> <span class="kt">int</span> <span class="n">ins</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span> <span class="c1">// at most 8 inputs</span> <span class="kt">int</span> <span class="n">out</span><span class="p">;</span> <span class="c1">// points to the first output variable</span> <span class="p">}</span> <span class="n">MyOperation</span><span class="p">;</span> <span class="n">MyOperation</span> <span class="o">*</span> <span class="n">allocate_memory</span><span class="p">(</span><span class="kt">int</span> <span class="n">n_elements</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="p">(</span><span class="n">MyOperation</span> <span class="o">*</span><span class="p">)</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">MyOperation</span><span class="p">)</span> <span class="o">*</span> <span class="n">n_elements</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// stable implementation</span> <span class="kt">double</span> <span class="n">logaddexp</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">,</span> <span class="kt">double</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">log1p</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">x</span><span class="p">));</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="k">return</span> <span class="n">y</span> <span class="o">+</span> <span class="n">log1p</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="p">));</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">sigmoid</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">));</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">run_multiple_passes</span><span class="p">(</span> <span class="kt">int</span> <span 
class="n">n_operations</span><span class="p">,</span> <span class="n">MyOperation</span> <span class="o">*</span><span class="n">ops</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">values</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">grads</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n_iterations</span> <span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">iteration</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">iteration</span> <span class="o">&lt;</span> <span class="n">n_iterations</span><span class="p">;</span> <span class="n">iteration</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">operation</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">operation</span> <span class="o">&lt;</span> <span class="n">n_operations</span><span class="p">;</span> <span class="n">operation</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">MyOperation</span> <span class="n">op</span> <span class="o">=</span> <span class="n">ops</span><span class="p">[</span><span class="n">operation</span><span class="p">];</span> <span class="k">switch</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">opcode</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">logaddexp</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">out</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">out</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span> <span class="p">}</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">out</span><span class="p">;</span> <span class="p">}</span> <span 
class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">4</span><span class="p">:</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">maximal</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20</span><span class="p">;</span> <span class="kt">size_t</span> <span class="n">n_arg</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">maximal</span> <span class="o">=</span> <span class="n">fmax</span><span class="p">(</span><span class="n">maximal</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]);</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">exps</span><span class="p">[</span><span class="n">n_arg</span><span class="p">];</span> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">maximal</span><span class="p">);</span> <span class="n">sum</span> <span class="o">+=</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span 
class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">/</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// end forward</span> <span class="c1">// TODO set grad for target variable.</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">operation</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">operation</span> <span class="o">&lt;</span> <span class="n">n_operations</span><span class="p">;</span> <span class="n">operation</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">MyOperation</span> <span class="n">op</span> <span class="o">=</span> <span class="n">ops</span><span class="p">[</span><span class="n">n_operations</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">operation</span><span class="p">];</span> <span class="k">switch</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">opcode</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> 
<span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">4</span><span class="p">:</span> <span class="p">{</span> <span class="kt">size_t</span> <span class="n">n_arg</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="kt">double</span> <span class="n">avg_grad</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">avg_grad</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span 
class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// end backward</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>0.99 second</strong></p> <p>I liked ergonomics of rust better, but achieving high speed in C is way easier. Rust’s interop with python is also way more convenient.</p> <h3 id="lets-autograd-in-c-again">Let’s autograd in C (again)</h3> <p>Another approach I’ve taken is to ‘compile’ traced graph to C. So python produces a long C file where operations are called one-by-one with explicit indices, something like</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span> <span class="n">vals</span><span class="p">[</span><span class="mi">215</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">195</span><span class="p">]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="mi">205</span><span class="p">];</span> <span class="n">vals</span><span class="p">[</span><span class="mi">216</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">196</span><span class="p">]</span> <span class="o">+</span> <span class="n">vals</span><span class="p">[</span><span class="mi">201</span><span class="p">]</span> <span class="o">+</span> <span class="n">vals</span><span class="p">[</span><span class="mi">204</span><span class="p">];</span> <span class="p">...</span> <span class="c1">// etcetc, and then backward steps are also written the same way</span> </code></pre></div></div> <p>Source code is lengthy, outputs are enormous, and to speed up compilation we can set <code class="language-plaintext highlighter-rouge">-O0</code> in clang. Using <code class="language-plaintext highlighter-rouge">-O0</code> produces slower binaries, but interestingly <em>did not</em> speed up compilation. Best results I got are around 1 minute for compilation and 1 second for a full run. Surprisingly, eliminating switch/case and memory lookups for arguments did not result in faster execution.</p> <p>Given that recompilation is needed any time the graph is changed, real time experienced by user is 1 minute. That’s a no go.</p> <h3 id="assembly">Assembly</h3> <p>In this endeavor to get maximal speed, I decided to go down to assembly. Otherwise it feels like an incomplete journey. We can map a computational graph to just a set of low-level instruction, and avoid “costly” compilation. These days x86/64 is not a king anymore, but neither armv7/armv8 is — and writing assembly for several architectures is totally unreasonable.</p> <p>So … how about using webassembly? It is low-level, fast to compile, and still cross-platform. 
Projects like <code class="language-plaintext highlighter-rouge">wasmer</code>/<code class="language-plaintext highlighter-rouge">wasmtime</code> allow interacting with wasm code from other languages. That’s my first encounter with WASM, and I’ve got quite positive impression: WASM mixes lisp-style syntax (for efficient streaming parsing) and execution model of stack machine. Unlike canonical stack machines, and unlike canonical assembly, WASM allows grouping expressions, e.g.</p> <div class="language-lisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; canonical stack-machine way to compute a * b + c</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$a</span><span class="p">)</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$b</span><span class="p">)</span> <span class="nv">f32.mul</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$c</span><span class="p">)</span> <span class="nv">f32.add</span> <span class="c1">;; another way to say write the same, also perfectly legal in wasm</span> <span class="p">(</span><span class="nv">f32.add</span> <span class="p">(</span><span class="nv">f32.mul</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$a</span><span class="p">)</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$b</span><span class="p">))</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$c</span><span class="p">)</span> <span class="p">)</span> </code></pre></div></div> <p>This convenience allows writing significantly more readable code in WASM compared to ye-olde-assembly. Level of abstraction looks just right to me — low-level instructions, but no need to manage register allocations.</p> <p>Webassembly is still very close to assembly in terms of instructions, i.e. there is no <code class="language-plaintext highlighter-rouge">exp</code>, <code class="language-plaintext highlighter-rouge">log</code>, let alone <code class="language-plaintext highlighter-rouge">log1p</code> and alike. Fortunately, there is a WASM <a href="https://gist.github.com/going-digital/02e46c44d89237c07bc99cd440ebfa43">implementation</a> of <code class="language-plaintext highlighter-rouge">exp2</code>/<code class="language-plaintext highlighter-rouge">log2</code> by Peter Knight.</p> <p>My major question was if speed of exponentiation is going to be sufficient, as <code class="language-plaintext highlighter-rouge">exp</code> consumes significant time in C implementation. Alas, in a simple benchmark computing just exponents in wasm takes ~1.9 seconds, leaving it behind rust/C. For reference, javascript computes the same number of exponents in 0.7 seconds. Hence, I take WASM branding of ‘near-native speed’ with a grain of salt, at least in the context of number crunching. Hopefully this will improve, but for now WASM is out of competition.</p> <h2 id="summary">Summary</h2> <p>So, we achieved a <strong>1000X speed up</strong> compared to leading libraries.</p> <p>I don’t find this surprising — major usecase for autograd system is manipulating large ndarrays. Memory management, copy elimination, device synchronization, parallelization of computations — these things are the main focus, and throughput of 1 million ops per second is totally reasonable for the vast majority of scenarios and users.</p> <p>Not for me though. 
My scenario is totally different in terms of numbers and setup, and tensor-focused autograds are too slow. For the problem at hand departing from the common autograd systems was the right and the only possible choice. Exploring different options was quite fun, and my expectations were challenged several times along this exploration.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> Thu, 28 Dec 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/12/28/fastest-autograd.html https://arogozhnikov.github.io/2023/12/28/fastest-autograd.html autograd optimization Optical pooled screens of cells (overview of emerging biotechnology) <p><em>This month brought two preprints describing optical pooled CRISPR screens. What’s this new technology, what it can be used for, and why I’ve been waiting for it? I’ll make a small comparison of approaches and critically review the papers.</em></p> <p><em>Best of all — I am not affiliated with either team, and this is likely the most unbiased review you’ll find</em> 😅</p> <h2 id="papers-discussed">Papers discussed:</h2> <ul> <li><strong>PERISCOPE</strong> <br /> aka <em>Perturbation Effect Readout In situ with Single Cell Optical Phenotyping</em> from <a href="https://www.biorxiv.org/content/10.1101/2023.08.06.552164v1.full">A genome-wide atlas of human cell morphology</a> (Broad Institute)</li> <li><strong>CP-POSH</strong> <br /> aka <em>Cell Painting Pooled Optical Screening in Human cells</em> from <a href="https://www.biorxiv.org/content/10.1101/2023.08.13.553051v2.full.pdf">A Pooled Cell Painting CRISPR Screening Platform Enables de novo Inference of Gene Function by Self-supervised Deep Learning</a> (Insitro Inc.)</li> </ul> <p>In the next parts I discuss some details from these preprints.</p> <h2 id="preface">Preface</h2> <p>To drive experiments in biological systems you need two components:</p> <ol> <li> <p><strong>intervention:</strong> change something in cell (or organoid, or organism). <!--- Fine-grained interventions allow precise verification of hypotheses. ---></p> <p>For a broad understanding of biological system you want to have detailed control of all of its parts. CRISPR solves this by individually acting on any selected gene. This makes CRISPR-driven experiment more interpretable and ensures high coverage of biological processes.</p> </li> <li> <p><strong>readout:</strong> detect change in some characteristic. Better characterization of system would involve high-dimensional description. E.g. just measuring cell size, cell death and pH provides little insight into what’s happening.</p> <p>Several sequencing-based assays provide rich description, and many of them provide single-cell readouts. <a href="https://www.nature.com/articles/nprot.2016.105">Cell painting</a> stands out: it is much cheaper, microscopy-based, and still captures a lot of biologically-relevant information.</p> </li> </ol> <p>Effectiveness of the system for unbiased discovery, roughly, <em>is a product of these two dimensions</em>: how well you control the biology and how well you can describe results of intervention.</p> <p>Pooled CRISPR screens with scRNAseq/scATAC stand out in both dimensions. <br /> They combine 1. complete control via CRISPR with 2. very high-dimensional interpretable readout. Sounds awesome (and it is!), but we need to introduce one more factor to the equation:</p> <ol start="3"> <li> <p><strong>price per experiment.</strong> The more observations you have the merrier. 
We already found there are a ton of things happening in our biology, and to find at least a majority of them in an unbiased manner, a number of attempts is required.</p> <p>Pooled screens are very efficient in experiment material: every cell is turned into a tiny individual experiment. Still, with all multiplexing/overloading tricks, a <em>cost-per-cell</em> in scRNAseq is comparable to <em>cost-per-well</em> in cell painting. Quite a difference!</p> </li> </ol> <p>Optical pooled CRISPR screening, a focus of this post, replaces expensive sequencing with cheap microscopy, and drops price-per-cell &gt;200 fold (PERISCOPE reports price-per-cell ~$0.001). Compared to <em>arrayed</em> optical screens, lower requirements for automation can be expected as all conditions share the well.</p> <p>Overall, technology opens an opportunity for massive experimentation.</p> <h2 id="why-do-we-need-an-even-more-scalable-assay-">Why do we need an even more scalable assay? 🤔</h2> <p>Great question! A number of whole-genome pooled screens have been conducted, arrayed whole-genome screens were run with cell painting. Recursion, who pioneered adoption of Cell Painting, <a href="https://www.recursion.com/operating-system">scaled it</a> to 2 million wells a week. Why would you wish for <em>even more</em>?</p> <p><em>Gene perturbation can be more nuanced</em> than just knockout. CRISPR tiling, an approach to scan for important positions in genome, requires a lot of experiments.</p> <p>Space of interventions also goes <em>beyond single-gene</em> at a time. If e.g. two proteins can perform similar function (“alternative pathways”), downregulating just one of them won’t have as much effect (periscope paper accidentally needs double KO of M6PR and IGF2R). These cases, when the effect in combination is different from combination of effects, are of high interest and give a more direct hint at underlying biology than just similarity of images. At the same time such cases are (likely) sparse, and should be found across 20k x 20k = 400m combinations…</p> <p>Sometimes you need to interact with more than two genes at a time, for instance to create iPSCs. Recall that iPSC creation relies on simultaneous expression of 4 <a href="https://en.wikipedia.org/wiki/Induced_pluripotent_stem_cell#Production">Yamanaka factors</a>. For reference, the original <a href="https://www.cell.com/cell/fulltext/S0092-8674(06)00976-7?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867406009767%3Fshowall%3Dtrue">Yamanaka paper</a> screened 24 candidate genes. To improve upon this “recipe”, a large number of combinations should be tried. Scanning just combinations of 4 factors out of 100 <a href="https://en.wikipedia.org/wiki/Transcription_factor">TFs</a> already takes around 4 million attempts.</p> <p>Combinatorial space stays almost unexplored. Dropping price even more still won’t make it possible to check all possible combinations, and this exploration should be driven by ML. ML-friendliness thus becomes a requirement.</p> <!--- <div style="float: right; width: 200px; margin: 20px;" > <img src="/images/opticalscreen/peptides.png" height="200" /><br /> <small markdown="True"><a href="https://pubmed.ncbi.nlm.nih.gov/23316341/">J. Thundimadathil, 2012</a> </small> </div> --> <p>There are non-genetic perturbations that are of high interest: cell environment, additions of chemicals or biologics. 
Unfortunately, usually there is no way to ‘massively multiplex’ these conditions, and microwell stays the minimal possible unit of experiment. Notable exception are <strong>peptides</strong>, as those similarly can be barcoded and participate in a pooled screen. Peptides can be used both as discovery tool (e.g. to block some interaction or activate receptor) and <a href="https://en.wikipedia.org/wiki/Peptide_therapeutics">as a therapeutic</a>.</p> <h2 id="challenges-needed-to-be-solved">Challenges needed to be solved</h2> <p><img src="/images/opticalscreen/cp_posh_imaging_pipeline.png" width="700" /> <small> Cell Painting (left, 5 channels + composite) and base calling in ISS (right) have significant overlap in channels. <br /> Image from CP-POSH preprint. </small></p> <p>Interventions are encoded with <a href="https://en.wikipedia.org/wiki/Guide_RNA">sgRNA</a> barcodes. In situ sequencing (ISS) is used to read the barcode back.</p> <ul> <li> <p><strong>Main issue is merging ISS with cell painting</strong>. There is a spectral overlap between channels used for cell painting and ISS, and thus ISS becomes non-reliable.</p> </li> <li> <p>Cell painting degrades RNA and <strong>destroys barcode</strong>. Both teams addressed this by running reverse transcription and RCA (rolling cycle amplification) of DNA before cell painting. ISS imaging is quite destructive (multiple cycles) and happens after cell painting step.</p> </li> </ul> <h3 id="how-periscope-solves-spectral-overlap">How PERISCOPE solves spectral overlap</h3> <p><img src="/images/opticalscreen/periscope_linker.png" style="float: right; width: 400px;" /> Periscope team replaced two dyes in cell painting with fluorescent labels attached to probes with disulfide linker (see image). Linker is cleaved right after “phenotypic” (cell painting) imaging, and these two channels could be used for ISS. Floating fluorescent labels are partially washed and remaining (uniform) signal is cancelled out by image processing pipeline.</p> <p>More specifically, membrane label Concanavalin-A was SS-conjugated to fluorophore directly, while mitochondria stain mitotracker was replaced with anti-TOMM20 Ab + secondary Ab SS-linked to fluorophore. <!-- TODO (can this place be optimized to remove secondary?). --> Original cell painting avoided antibodies to make the process cheaper and more reproducible.</p> <p>As expected, perturbation of TOMM20 distorts the signal from this channel — something to keep in mind.</p> <h3 id="how-cp-posh-solves-spectral-overlap">How CP-POSH solves spectral overlap</h3> <div style="float: right; width: 400px; padding-left: 20px;"> <img src="/images/opticalscreen/mitotracker_correlation.png" style="width: 400px;" /> <small>Correlation of mitoprobe with TOMM20 and Hoechst</small> </div> <p>Mitotracker was replaced with Mitoprobe — a novel RNA-based label for mitochondria, linked to Cy5 fluorophore. Interestingly, they optimized a sequence to have high correlation with TOMM20 <strong>and</strong> low correlation with Hoechst (nuclei).</p> <p>Resulting image (on the right) shows optimization was successful.</p> <p>RNA sequences were taken from the ribosome after search for fragments that would bind to 12S rRNA and 16S rRNA (two different locations), then tested 8 of them and left two: one for 12s and one for 16s in proportion 1:1. 
This is an interesting solution and seems to overcome the issues seen in PERISCOPE approach, and likely to work in other species too.</p> <p>This replacement of mitotracker with mitoprobe <em>does not</em> remove spectral overlap (there is overlap with base A), but makes it non-essential because RNA is degraded during cell-painting. Two additional spectral overlaps (WGA &lt;&gt; base G) and (phalloidin &lt;&gt; base T) are also solved by degrading, and additional steps in the protocol were necessary. These overlaps still seem to play negative role in ISS step (see later).</p> <p>CP-POSH has an additional channel that can be utilized for one study-specific marker, which is later featured in one of experiments. (They use deep red — good choice, as shorter wavelengths can be used by phenotyping!)</p> <!-- I am curious if something similar to mitoprobe can be developed for F-actin (i.e. RNA-based label). This could make ethanol unnecessary. --> <p>In total both protocols are not straightforward.</p> <h3 id="in-situ-sequencing-iss"><em>In situ</em> sequencing (ISS)</h3> <p><img src="/images/opticalscreen/in_situ_sequencing.png" /> <small>Source: <a href="https://www.cell.com/cell/pdf/S0092-8674(19)31067-0.pdf">Feldman</a> et al., 2019</small></p> <p>ISS reads the barcode to determine perturbed gene. This part is very similar, as both groups:</p> <ul> <li>use Illumina’s miseq kit for ISS (sequence-by-synthesis), and both groups used lower resolution (10X) for imaging.</li> <li>use padlock with gap to amplify barcode to get reliable signal during sequencing</li> <li>finally, barcodes used in both cases are not an additional genetic sequences, but sgRNAs themselves. <br /> No barcodes — no problems!</li> </ul> <p>CP-POSH additionally uses tiny <em>image-to-image convnet to improve calling</em> to get +18% correct calls. Such a model can be trained on the screen data itself: almost-correctly called barcodes (with simpler pipeline) are used for training the model.</p> <!--- Absence of separate barcodes, while very reliable, has its demerits too: cells that replicate from the same transfected cells, are not ‘true independent observations’, as e.g. they can carry the same mutation introduced during transfection. Additional barcodes could tell apart independent transfections and help in lineage tracking. Optical pooling has partial remedy to this problem: cells coming from the same origin usually colocalize within a well. It could be an interesting analysis if ‘families’ of cells carry any additional visual signature that is not shared by other cells with the same sgRNA. ---> <h3 id="sgrnas">sgRNAs</h3> <p>Quality of ISS quickly drops with sequence length, so instead of sequencing all ~20 bases of sgRNA, the guides are selected so that reading only first 12-13 bases is enough to guess which sgRNA is expressed in the cell. Groups start from existing pools of sgRNAs to guide Cas9, with minor differences in selection procedure:</p> <ul> <li>Periscope uses 12 cycles and minimal Levenshtein distance ≥ 2, which means they detect if barcode contains one error (and discard the barcode).</li> <li> <p>CP-POSH uses 13 cycles and Levenshtein distance ≥ 3, and allows up to 1 error correction. Most cells have more than one amplicon, which makes barcode calling even more reliable. 
Error correction adds +80% of barcoded cells in their largest screen.</p> <p>I hypothesize high error rate (despite CNN filtering) is connected to spectral overlaps.</p> </li> </ul> <p>Scope of experiments is different: Periscope covers 20k genes with 4 guides per gene, while the largest experiment in CP-POSH targets druggable genome — 1.6k genes with 10 guides per gene.</p> <h2 id="phenotypic-pipeline-and-analysis">Phenotypic pipeline and analysis</h2> <p>Both teams avoid training the system on known labels. I’ve also been avoiding training with supervision for a while, for a couple of reasons:</p> <ol> <li>no need to drop any data from analysis (no labels → no cross-validation)</li> <li>by providing labels you already bias model into what <em>you believe</em> is important. Correspondingly model works to ignore all “irrelevant” information, and the same model can’t be used (reliably) for studying orthogonal questions (e.g. well-to-well variations)</li> <li>should there be any confounder, it is less likely to be picked</li> </ol> <p>It’s actually <strong>impressive how little prior knowledge is required to get a decent grasp of biology just from looking at static cells</strong>. We only need to know all genes of the organism to run CRISPR, neural networks don’t need even this piece of information.</p> <p>PERISCOPE relies on <a href="https://cellprofiler.org/">Cell Profiler</a>, and does not train any specific pipeline. After averaging morphological profiles across the cells for the same gene, a matrix of gene similarities is computed.</p> <p>CP-POSH relies on <a href="https://github.com/mouseland/cellpose">CellPose</a> for segmentation, and either uses CellProfiler-like pipeline (dubbed CellStats) or self-supervised <a href="https://arxiv.org/abs/2104.14294">DINO-ViT</a> from FAIR. Unsurprisingly, DINO-ViT demonstrates better quality, which improves with higher diversity of interventions provided during training. Pre-training on cells not ImageNet works much better, as you’d expect (Insitro-ers for some reason like Imagenet-pretrained models as baseline). DINO-ViT also uses patches 8x8, more relevant to the scale of cell.</p> <p>A nice detail: they use a well-level compensation. That’s possible thanks to pooling!</p> <p><img src="/images/opticalscreen/diffexp_visual_features.png" style="width: 400px; float: right;" /> Both papers delve into ‘differential expression’ of hand-crafted morphological features to provide arguments that readout is valid. For instance, periscope shows that most important features to detect interventions connected to common pathways point to the right cell compartment.</p> <p>On the picture from PERISCOPE you see that disturbing a pathway results in some enrichment of important features (‘differentially expressed‘ features) from the corresponding cell compartment.</p> <div style="clear: both;"></div> <h2 id="verification--discovery">Verification &amp; Discovery</h2> <p>“Method papers” are a special genre of literature: 1) focus of author is technology 2) focus of editor is novel biology 3) authors must provide convincing validation which no one wants to dive in.</p> <p>This rarely converts into a consistent story for screens, and this time is no exception.</p> <p>PERISCOPE compares two different medias, running whole-genome screens in each of them — an interesting experiment with unclear interpretation: there are genes that “land in different clusters” depending on the media — but unclear what to do with this information. 
As I understand, the goal was to demonstrate that running screen in a more physiologically relevant media would yield better insights, but it is unclear if differences (Ext Fig.8) indeed show superiority of either media.</p> <p>Another interesting shot is the TMEM251 investigation with significant additional research beyond PERISCOPE. If the TMEM251 story really matters, I’d prefer to see it published separately and better verified (using available info from other pooled screens as well), Periscope in this story was needed only for initial guess based on GSEA — but this guess could come from other public screens as well.</p> <p>Speaking of GSEA… — usage of GSEA in paper (e.g. fig. 6a) makes no sense 😞. GSEA’s power is combining signal from multiple genes with low expression. This problem <em>does not exist</em> in optical screens — as no expression is measured. Preranked GSEA (erroneously) relies on zero correlation between genes, but correlation in optical screens is very high. In fact, this high correlation is a subject of several plots in the paper. To compare pathways, just define another direction in embedding space for each pathway, as you do for single genes. Direction is a (weighted) average of directions for individual genes + measure separation of distributions along direction (e.g. ROC AUC).</p> <p><img src="/images/opticalscreen/umap_leiden_from_cellposh.png" width="700" /> <small>Example UMAP from CP-POSH for one of screens</small></p> <p>CP-POSH focuses on druggable genome (1640 genes) with a couple of smaller screens. Each version of pipeline (data + phenotyping model) is compared against <a href="https://string-db.org/">StringDB</a>, providing a quantifiable comparison, so they can e.g. demonstrate that targeting more genes is slightly better. They also confirm that trained models generalize to new experiments.</p> <p>Different versions of screen are presented in a uniform way with UMAP+Leiden clustering applied to genes with a clear morphological signature (see example above).</p> <p>I was confused by notable divergence between models trained on 300 and 1640 genes, figure 5a. In particular their lists of significant genes (AUC &gt; 0.55) should markedly diverge across models. Also, 0.55 may sound small — however, bear in mind this is a cell-level classification, and combining multiple cells will result in strong discrimination.</p> <p>Both ViT and CellStats “nominate the potential role of TUT1 in cell cycle regulation”. (No research made to confirm). Interestingly, sgRNA consistency failed for several genes, and half of genes have at least one ‘outlier’ sgRNA (out of 10).</p> <p>In my opinion, CP-POSH has a consistent storyline and more ‘standardized’ analysis. It looks more like a validation of approach/platform, and less like a bunch of interesting observations (though CP-POSH has these too). PERISCOPE presentation is more aligned to “get published in AAA journal”.</p> <p>Neither paper discusses cell cycle, a well-known confounder in single-cell studies, how so? 🤷 Optical screens previously characterized full images, not individual cells, and thus did not have to deal with this issue (as there are other cells to get signal from). Since neither team used supervision, pipelines likely cluster dividing cells together, preferring this characteristic over perturbation. Cancelling this in optical screen is an interesting challenge.</p> <h2 id="so-which-one-to-choose">So which one to choose?</h2> <p>Great question, fortunately we have papers to help us! 
So here is my insight: I don’t know. <strong>I can’t meaningfully compare performance of two systems after reading preprints.</strong> Performance, I guess, is similar — but that’s only a guess. If some lab wants to select which one to go with, this becomes a matter of trust — not how science is supposed to work. (ok-ok, one additional channel can actually make this choice).</p> <p>Main selling points of optical pooled screens are simple scalability and fewer confounders, which ultimately means hypothesis-free or hypothesis-light research. I doubt that interpretable morphological features are important for practitioners.</p> <p>Papers lack “power analysis” on how many cells are needed to reconstruct perturbation profile. Very little said about cost ($0.001 per cell — estimate from PERISCOPE, no cost estimates from CP-POSH). These two factors determine if pooled strategy pays out.</p> <p>Speaking of potential, it is unclear if two sgRNAs per cell can be confidently called with either approach.</p> <h2 id="can-we-do-better">Can we do better?</h2> <p><strong>Screen validation should become a benchmark.</strong> It’s about time we had a benchmark of reproduction of gene networks/gene ontology with some predefined procedure. Community would benefit from comparing across the screens rather than “rediscovering” mTOR in every screen paper.</p> <p>Number one question is — can screen discover culture-specific biology? When comparing several cell lines, are gene similarities in optical screen and scRNAseq similar for the same cell line?</p> <p>It would be of high interest to highlight which pathways are detectable in scRNAseq but hardly noticeable in optical pooled screening (and vice versa). It is of value to know if there are pathways that can be seen in an optical screen or in scrnaseq — and can help in choosing the right instrument for the problem.</p> <p><strong>Compare screen to screen, not screen to “common knowledge”.</strong> Common pathways are a very rough sanity check. Single UMAP with gene grouped by their similarity is descriptive enough. GSEA is a poor argument: it is embarrassingly easy to find something pleasing with GSEA and throw a bunch of impressively small (incorrect) p-values at readers.</p> <p>Comparison screen-to-screen can detect more subtle biology, specific to the biology of culture, and can actually bring interesting insight.</p> <p><strong>Discoveries are usually irrelevant for the story and should not be demanded by journals.</strong> Method papers are demanded to “show novel biology”, and most of “byproduct discoveries” have no value for readers or authors — otherwise those would be a separate paper.</p> <p><em>Faster, cheaper, easier to scale, more reliable, easier to implement</em> are <strong>great</strong> arguments for technology. If whole smartphone industry can’t deliver “a killer feature” every year, how that can be a requirement for every method? 🤷</p> <h2 id="where-would-this-go">Where would this go?</h2> <p>Back to point. Pooled optical screening is an exciting technology, and it has a number of immediate applications. And it is super valuable to understand its current limits.</p> <p>For instance, I have the following questions on my mind:</p> <ul> <li>does it transfer? When two labs experiment with same cell line, would they get similar results? 
In theory, yes, but how about practice?</li> <li>similarity and difference with arrayed screens: shared media means studied processed are limited to a single cell, because cell interactions are not restricted to cells with the same perturbation. This has both pros (clearer signal) and cons (if cell interactions/collective behavior are of interest).</li> <li>is it suitable to automatically find ‘interesting’ combinations of genes? Can we train RL to discover those for us?</li> <li>can it handle tissue slices? Can we pool-screen <a href="https://www.frontiersin.org/articles/10.3389/fragi.2021.714926/full">whole mouse</a>?</li> <li>can vision pipeline handle neurons? Is DINO a good choice for that?</li> </ul> <p>Hopefully more research will come and we’ll get answers to these and other questions soon.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Kevan Shah and Tatiana Dvorkina for proofreading and comments. Thanks to CP-POSH team (Ci Chu, Max Salick) and PERISCOPE team (Meraj Ramezani, Paul C. Blainey) for answering questions.</p> <h4 id="comments">Comments</h4> <p>Paul C. Blainey provided some pointers to prior works of his lab, relevant to the questions I discuss in the post:</p> <blockquote> <p>… a couple of comments that you may find interesting:</p> <ul> <li>In Figure S2 of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886477/">Feldman et al., 2019</a> we showed efficient detection of 2 guides per cell (in ~80% of cells)</li> <li>In <a href="https://www.pnas.org/doi/10.1073/pnas.2210623120">Carlson et al, 2023</a> we use a different and simple strategy to overlap IHC and SBS in the same channels which is to titrate down the IHC reagents</li> <li>Both of these works demonstrate a potentially standardizable validation approach to do a follow-up (“secondary”) screen in an independent experiment with higher replication (more cells and/or guides per gene). The hit ranks or feature scores can be compared gene-wise or guide-wise across the primary and secondary to check reproducibility of the results. This can be for technical validation (same assay and guides) or biological validation (new assay and/or new biological model system).<br /> So far we’re seeing impressive reproducibility which supports some of the more challenging and informative use cases you suggest.</li> <li><a href="https://www.biorxiv.org/content/10.1101/2021.11.28.470116v1.full">Funk et al, 2022</a> demostrated that cell cycle can be treated more explicitly, we added 24-hour live imaging of cells prior to fixation</li> </ul> </blockquote> <!-- My comment: for some processes like mitosis / cell movement, live imaging can be done together with pooled screen and used as a functional validation to provide "arbitrage" between different screens. This still requires compared approaches to be implemented in the same lab, or, at least, with the same culture. --> <!-- # Cell painting channels: Original cell paingting from the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/ Phenotypic images were acquired using a 20X 0.75 NA CFI Plan Apo Lambda objective (Nikon MRD00205) and the following Semrock filters for each phenotypic probe: Nucleus (DAPI) dual-band emission 408/473, dichroic. Actin (phalloidin) emission ET530/30 nm, dichroic 495 nm. Mitochondria (TOMM20) emission 615/24 nm, dichroic 565 nm. Endoplasmic reticulum (Concanavalin A) emission 680/42 nm, dichroic 660 nm. 
Golgi and plasma membrane (WGA) emission 820/110 nm, dichroic 765 nm. ISS cycles were imaged using a 10X 0.45 NA CFl Plan Apo Lambda objective (Nikon) with the following Semrock filters for each base: Miseq G emission 575/30 nm, dichroic 555 nm. excitation 543/4 nm, Miseq T emission 615/24 nm, dichroic 565 nm. Miseq A emission 680/42 nm, dichroic 660 nm. Miseq C emission 732/68 nm, dichroic 660 nm. 575 (-30) - 732 (+ 68) TOMM20 intersects with T ConA intersects with miseq A Same for cell painting -POSH Stain Target Imaging Type Stain Laser Source Laser (nm) Emission Filter (nm) Objective Exposure time (ms) Nucleus Phenotyping Hoechst Celesta Light Source, Lumencor, 90-10525 405 Pentacube , 441x30 20x 0.75 NA, OFN25 DIC N2 Cellular Membranes/ endoplasmic reticulum Phenotyping ConA Celesta Light Source, Lumencor, 90-10525 488 Pentacube, 511x26 20x 0.75 NA, OFN25 DIC N2 Cellular membrane/ Golgi/ER Phenotyping Wheat Germ Agglutinin Celesta Light Source, Lumencor, 90-10525 545 567/15nm Filter, Semrock, FF01-567/ 15-25 20x 0.75 NA, OFN25 DIC N2 Cytoskeleton/ F-actin Phenotyping Phalloidin Celesta Light Source, Lumencor, 90-10525 545 624/40nm Filter, Semrock, FF01-624/ 40-25 20x 0.75 NA, OFN25 DIC N2 Mitochondria Phenotyping Mitoprobe Celesta Light Source, Lumencor, 90-10525 637 Pentacube 684x34 20x 0.75 NA, OFN25 DIC N2 ribosomal protein Phenotyping pS6 primary and secondary antibody Celesta Light Source, Lumencor, 90-10525 748 Pentacube 817x66 20x 0.75 NA, OFN25 DIC N2 G 545 -> 567/15nm <> WGA T 545 -> 624/40nm <> Phalloidin one-to-one - degraded by ethanol A 637 -> 676/29nm <> Mitoprobe C 637 -> 732/68nm --> Sun, 20 Aug 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/08/20/optical-pooled-screens.html https://arogozhnikov.github.io/2023/08/20/optical-pooled-screens.html biology Einops, retrospective of 5 years <p>Einops is soon-to-turn 5 years. Right time to have a look back.</p> <p>Some intro: einops is widely used — around 4 million downloads a month (for calibration - pytorch is 10 million) on pypi and is used in thousands of projects on github.</p> <p>In a number of ways einops is unique:</p> <ul> <li>bends tensors for a number of very different frameworks. AFAIK all other efforts to make something truly multi-framework either died too soon or avoided touching internals of models</li> <li>never pulled back released features. At the same time einops lived much longer than any major version of tensorflow or pytorch. Some backends it originally supported (mxnet, chainer) are dead by now</li> <li>bug tracker was empty for years, compared to usual hundreds in projects of similar scope. Now it reports several hardly fixable inconsistencies that appeared as frameworks introduced more features</li> <li>einops adoption happens mostly through the code sharing between teams/projects, and not by hype-waving. Several mentions in twitter brought waves of likes but almost none were converted to users at that point. Paper appeared only after einops circulated for three years in the wild nature of github, when it was pristine clear that idea “clicks”.</li> <li>“magical” universal dispatching, so users could write <code class="language-plaintext highlighter-rouge">rearrange(x, 'b c h w -&gt; b h w c')</code> and not care about <code class="language-plaintext highlighter-rouge">x</code>’s framework/device/dtype/C-ordering. While this is more of a ‘fancy’ functionality, it was important during initial adoption. 
<!-- Magical is not a great description for technology, but einops was many times described as "magic" with a positive vibe in this word. --></li> <li>no dependencies (except Python). Everything else is optional, even numpy</li> <li>there is no corporation/university behind einops, it is mostly a single-person effort</li> </ul> <h2 id="tough-place">Tough place?</h2> <p>A while ago Stephan H. asked <em>what is challenging about einops</em> as a project.</p> <p>I don’t think I gave a great answer back then. And probably couldn’t anyway, because the question assumes there is a specific “tough place”, but the assumption is wrong.</p> <p>Also, “tough place” is very subjective, and after working on any project for some time, if you’re successful, there will be no “tough” place, because you focus on those parts that are “tough” and get them better, either by decomposing their complexity or by just learning to live with it.</p> <h2 id="unique-technical-challenges">Unique technical challenges</h2> <p>I decided to dedicate some time to write a better answer for this question. The first prototype was built in a couple of hours, but the project itself took months, so clearly there were non-trivial parts. Einops as a project has a number of (conflicting) technical restrictions that create significant pressure:</p> <ul> <li> <p>frameworks. Einops supports a dozen of them, and that’s unique. Worse, each framework has its specifics, and this creates significant internal tension within the project, which I’ll discuss a lot in the next points</p> </li> <li> <p>even worse, frameworks have multiple regimes of work within the same framework (i.e. torch alone has torch.compile, tracing, scripting, ‘plain run’, torch.fx, cuda graph capturing, and maybe more). They all have different behaviors</p> </li> <li> <p>the landscape is not steady: frameworks appear and go; even worse, they sometimes change their API, and sometimes break the existing API (looking at you, keras and TF). Their dependencies may contradict each other (stares at protobuf)</p> </li> <li> <p>support for eager computations.</p> <p>That’s how code usually runs these pytorchy days. In this case, the hot path should be <em>really</em> fast, and have absolutely minimal overhead. Einops deals with this with a number of caches that make usual loopy computations super-efficient. Shape checks (usually skipped by lazy folks) are conducted only once per shape.</p> </li> <li> <p>support for symbolic computations and traceability.</p> <p>Two little-known facts first: 1. einops can deal with symbolic tensors (i.e. can operate on tensors with unknown size of one or several axes, which may sound slightly impossible at first) and 2. einops “disappears” during tracing and provides models that contain an equivalent set of framework-native operations; moreover, traced operations correctly work for inputs of different shape.</p> <p>As a result, the execution flow has to rely only on traceable operations over shape’s elements, and e.g. one can’t just compute the correct result shape in cpp/rust</p> </li> <li> <p>shape checks for symbolic tensors.</p> <p>For example <code class="language-plaintext highlighter-rouge">rearrange(x, '(h h2) (w w2) -&gt; (h w) h2 w2', h2=4, w2=w2)</code> demands that the first axis is divisible by 4, and the second axis is divisible by <code class="language-plaintext highlighter-rouge">w2</code>, while the dimensions of tensors are unknown. An additional restriction: einops can’t use built-in graph asserts like tf.Asserts because of their framework-specificity.
Clever organization of computations in ops ensures that code fails for wrong inputs without introducing additional elements into the static graph.</p> </li> <li> <p>support for scripting: this requirement dramatically narrows the subset of Python that can be used, and in some cases demands specifying wrong type hints for internal functions because correct types like <code class="language-plaintext highlighter-rouge">tuple[str, ...]</code> are not supported by <code class="language-plaintext highlighter-rouge">torchscript</code></p> </li> <li> <p>support for tensor-rank polymorphism, that is, the same operation with ellipsis can handle inputs with different numbers of dimensions. Initially this was done by a clever trick that pre-packed ‘ellipsis axes’ into one, but recent changes in frameworks (see next point) required developing a new approach</p> </li> <li> <p>special axes. Frameworks try to extend the concept of tensor = ndarray, which worked so well. Examples are sharding axes in distributed tensors and jagged arrays. This clearly was outside the initial design and, as I mentioned, required a significant redesign of einops.</p> </li> <li> <p>framework divergences: differences in the names/interfaces of operations, missing operations like logsumexp, inconsistencies in support of einsum.</p> </li> <li> <p>layer definitions are quite different across frameworks, and especially <code class="language-plaintext highlighter-rouge">flax</code> required a special approach.</p> </li> <li> <p>view semantics. Einops tries to provide a view to the input if possible, making the operation itself very cheap, as no real computation happens.</p> </li> <li> <p>an additional pressure is my perfectionism and trying to keep the bar very high. These days I don’t think extreme reliability should be assumed from side/personal projects.</p> </li> </ul> <!-- - python's typing does not know how to exclude lists --> <p>Problems that appear with new features like <code class="language-plaintext highlighter-rouge">torch.fx</code> may be interpreted as <em>einops giving cracks</em>; in reality, einops as a notation and approach is just fine. It is enjoyed by many, and the community wants to use the notation with new framework features. And the notation fits that. But the terrible foundation that tensor manipulation is built upon (i.e. reshape/view/transpose and similar) gives cracks, more and more visible, and building a layer of cement on top of it is … not wise. As I discussed several times, einops’ core operation should be available at the lowest level of graph representation — but I don’t expect this advice to be heard.</p> <p>Support for a large zoo of frameworks is (retrospectively) a questionable investment. Examples: cupy and chainer were almost never used, but also were trivial to maintain and develop. Mxnet/gluon, in contrast, required very special treatment. Supporting multiple frameworks to me was insurance that frameworks would not try to create “their very own version of einops”, and would not create non-compatible extensions (as they did for numpy).</p> <p>These days projects that don’t use einops still use its core ideas by writing parts of einops patterns: <code class="language-plaintext highlighter-rouge">(b h) t c</code>, <code class="language-plaintext highlighter-rouge">b*h t c</code> and similar. Because that’s the best way to communicate the internal structure of a tensor (… when you agree on C-ordering, of course; the construct relies on it significantly).</p>
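<p>For illustration, a toy sketch (not from any particular project) of the difference between the pattern living in a comment and the pattern being the operation itself:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy example: x has shape (batch, heads, time, channels)
import numpy as np
from einops import rearrange

b, h, t, c = 2, 4, 16, 32
x = np.random.rand(b, h, t, c)

# pattern as a comment next to a raw reshape: the reader has to trust the comment
y1 = x.reshape(b * h, t, c)  # (b h) t c

# pattern as the operation itself: checked at runtime, readable without extra context
y2 = rearrange(x, 'b h t c -&gt; (b h) t c')

assert np.array_equal(y1, y2)
</code></pre></div></div>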
<h2 id="unique-conceptual-challenges">Unique conceptual challenges</h2> <!-- It is easy to think about einops as a python package, but it is more of **approach** to write a readable, reliable and efficient code, that was conveniently provided to python users. --> <p>Einops is more of an approach to writing code than a package, but the package is a necessary tool to bring those ideas into practice. At the approach level there are a number of hurdles too.</p> <p>It turns out the design of operations is very challenging: einops received a long list of suggestions and ideas, and very few were accepted. Folks just introduced to einops think “einops are helpful, so let’s invent something similar”, but <em>similar</em> does not imply <em>helpful</em>.</p> <p>Let’s take the story of <code class="language-plaintext highlighter-rouge">einops.pack</code> and <code class="language-plaintext highlighter-rouge">einops.unpack</code> as a demonstration of this point: concatenation of different-shape tensors was of interest (for me) even before the first public release. My design at that time was universal enough, similar to the rest of einops, but too verbose and inconvenient:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">rechunk</span><span class="p">([</span><span class="n">rgb</span><span class="p">],</span> <span class="s">'b h w [r+g+b] -&gt; b h w [r, g, b]'</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">g</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div> </div> <p>… thus it was not included.
Later it was minimized by restricting transpositions:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># this one poorly works with type hinting </span><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">rechunk</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="s">'r+g+b -&gt; [r, g, b]'</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">g</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <p>until I finally realized that this operation better to be totally different from <code class="language-plaintext highlighter-rouge">rearrange</code> and should not have any names for the concatenated/split axes:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">unpack</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> </code></pre></div></div> <p>which was soon generalized into unpacking with arbitrary shapes.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">unpack</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="p">[[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> </code></pre></div></div> <p>Original design of operation could not support arbitrary shapes. Ok, technically it could, but that would be ugly and miserable. New design solved another issue — memorizing axes that were composed, another common request for einops.</p> <p>I’ve come up with a final design (which I still find optimal) only <em>two years later</em>. 
A number of suggestions popped up that were similar to the original version.</p> <p>To see whether an operation ‘clicks’, <strong>a whole research effort is needed</strong>:</p> <ul> <li>collect use-cases (and this requires a broad view of SOTA and how it may change over the next years)</li> <li>convert use-cases to code examples, and prepare baseline implementations without the new operation</li> <li>implement with your suggestion, and in most cases, conclude that it doesn’t look good enough</li> </ul> <p>There are more complicated parts, like “is it easy to read?”, “is this code confusing?” and finally “how to make this all efficient given all the restrictions above?”.</p> <p>Allocating time for these (mostly unsuccessful) attempts is tough.</p> <!-- Python. Python stands in a way sometimes. Julia's line-level macros maybe would be a more convenient syntax, and e.g. writing something like ```python x_out['b h w c'] = x['b c h w'] ``` --> <p>Additional challenge: “fewer, but more universal operations”.</p> <p>There is a gap between “I find this helpful” and “this will be actively used”. It is easy to come up with a long list of operations that will be helpful in <em>some</em> cases, but how would users figure this out? I don’t think anyone checks einops’ docs regularly, so an operation will never pop up in one’s mind. See, <em>usefulness of an operation strongly depends on its universality</em>, i.e. the ability to cover many cases, and einops is good at this because it was one of the requirements.</p> <h2 id="adoption-challenges-management-challenges">Adoption challenges, management challenges</h2> <p>Einops adoption was very slow. If it were a commercial project, it would likely have run out of money before getting sufficient traction.</p> <p>But the project was designed to be resilient. It was somewhat of an internal requirement: the project should be usable for at least a couple of years even in the worst scenario: no maintenance at all, while the deep learning landscape changes even faster than before.</p> <p>From the very beginning maintenance debt was minimized — that means a very restricted design and fewer features. I assessed very carefully which things can be broken. Once I was asked during an interview: why might it stop working? I said — only if the API of core operations changes. Time has shown this was the correct answer.</p> <p>Another issue is <em>extremely low adoption of layers</em>. I have no good explanation for it; they are very useful.</p> <h2 id="reasons-for-slow-adoption">Reasons for slow adoption?</h2> <p><strong>No hyping</strong>. In part, because I am bad at it, and in part, because I am not that interested in answering basic questions from folks attracted by new shiny things. As a byproduct, early adopters of einops are mostly very advanced folks who knew what to expect from the tool and cared more about the quality of their code than the rest of the ML community.</p> <p>Consequently, einops has <em>no dedicated community</em> (discord server or so). In the long run I think no community is better than an abandoned community (which happens in many projects). There are a number of ein-tools around github addressing specific cases; maybe a somewhat centralized community could help with initial adoption.</p> <p>Another important factor is <strong>a significant prejudice against string-templated operations</strong>, which is for three reasons: 1. einsum was historically slow 2. einsum is the only operation of this kind in the frameworks 3. everyone knows parsing is slow, and the idea of ‘parse once’ rarely crosses the mind.</p>
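<p>For illustration only, a minimal sketch of the ‘parse once’ idea (this is a toy, not einops’ actual internals): parsing is paid once per distinct pattern string, and every later use is a cache lookup.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy sketch of 'parse once, cache forever' (not einops' real parser)
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_pattern(pattern: str):
    # pretend-parsing: split 'b c h w -&gt; b h w c' into input/output axis lists
    left, right = pattern.split('-&gt;')
    return tuple(left.split()), tuple(right.split())

# the first call parses, any later call with the same string is a dict lookup
parse_pattern('b c h w -&gt; b h w c')
parse_pattern('b c h w -&gt; b h w c')
print(parse_pattern.cache_info())  # hits=1, misses=1
</code></pre></div></div>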
<p>Einops <em>caches results of pattern parsing</em>. But even repeating this many times in the paper/documentation will not overcome the prejudice — because if you’re already convinced it is slow, why would you read the paper?</p> <p>A couple of speed issues were reported to the einops repo, while those were not even related to einops — a vivid demonstration of this bias.</p> <p><strong>No critical case</strong>. A tool becomes an immediate hit only if it addresses an existing case that is very poorly covered by previous tools. Or, rarely, because of hype.</p> <p>Not that you can’t bend tensors without einops. And not that adding a single <code class="language-plaintext highlighter-rouge">rearrange</code> magically makes your code better. Einops is an approach — and an approach still requires investment to build a habit of writing and reading a new kind of code. Real conversion happens only after one needs to read someone else’s code and finds out that reading einopsy code is significantly easier.</p> <h1 id="concluding-thought">Concluding thought</h1> <p>Einops, as said, is one of a kind, and its development trajectory deviates significantly from ‘normal’ development.</p> <p>What would you call a system that is shaped by hard constraints? I’d call this “engineering art”.</p> Thu, 13 Jul 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/07/13/retrospective-thoughts-on-einops.html https://arogozhnikov.github.io/2023/07/13/retrospective-thoughts-on-einops.html einops tensor manipulations Schema migration should be a responsibility of DB <p>A great achievement of the past decade in programming is a shift in paradigm from <em>transition</em>-focused to <em>state</em>-focused.</p> <p>This shift is clearly seen in front-end (user interfaces): in react/preact/vue and other frontend frameworks a component has a state and defines how the state should be represented (rendered) in html. The aim of a framework is to ‘migrate the DOM’ to the desired html representation with minimal overhead.</p> <p>This shift is clearly seen in management of cloud resources. In AWS CDK, pulumi, terraform and other <a href="https://en.wikipedia.org/wiki/Infrastructure_as_code">IaC</a> tools the user defines the desired state of infrastructure, and it is the responsibility of the tool to produce a correct ‘migration of infrastructure’.</p> <p>This shift is visible in dependency management: dependency management relies on the expected state (which packages/libraries are required) and less on imperative instructions that dictate the order of installation. Imperative glue here is still very common — e.g. dockerfiles, but tools like nix/nixos eliminate the glue as well.</p> <!-- Streamlit (tool used by data/ml folks) uses state (kept on client-side) to define the contents of the page. Every user action changes the state, and triggers computation of a new content with (mostly) preserved state. --> <p>In databases, in particular in ORMs, this shift (only partially) happened around two decades ago. The user changes ORM classes, and the framework produces migrations.</p> <p>Generally speaking, in all these cases we define the desired state of the system, <em>not</em> the necessary changes. Movement to state-focused programming dramatically simplified management of complex systems.
It’s like laying out a street plan while the question of moving all the belongings/walls is solved for you.</p> <h2 id="whats-wrong-with-migrations-in-rdbms">What’s wrong with migrations in RDBMS?</h2> <p>Switching to auto-migration tools helps to focus on what’s important - e.g. the current relations in the RDBMS - and not on how we ended up with this set of relations. Plus, coherence between the DB and the code (ORMs or schema-definition tools) is now a given.</p> <p>Adoption of auto-migration tools is still very low (even compared to ORMs), and in my opinion that is because of <strong>how this process is organized</strong>.</p> <p>We have dozens of relational DBMS, and yes, they look similar, but there are tons of nuances that make them all different.</p> <p>And we have a number of tools to produce migrations: sqlalchemy+alembic in python, entity framework in .net, a dozen tools for Hibernate in Java, and every community/ecosystem tries to develop a solution that can migrate a large number of deviating databases in a uniform way.</p> <p>No big surprise all of them have very limited success, given that the scope of the project is unlimited.</p> <p>Auto-migration tools like alembic are also tough to develop and maintain:</p> <ul> <li>they need to understand schema definition in a language (in python, in this case)</li> <li>they need to introspect the current schema of the database</li> <li>they need to compute a ’diff’ based on matching these two schema definitions, neither of which was created with automated schema migration in mind</li> <li>they need to deal with all peculiarities of dialects in schema definition and schema migration</li> <li>for all operations alembic creates counterparts in python code, which is like introducing +1 language</li> </ul> <p>The same problems don’t hurt frontend frameworks as much, because there are currently ~2.5 browser engines, and a ton of work is done by standardization committees around js, and … after ditching react/vue you still have to deal with discrepancies, this time yourself. The same problems are faced by IaC tools, and this eventually will become one more (significant) barrier for migration between clouds.</p> <p><img src="/images/migrations/migration-db.png" width="800" /> <small> Comparison of existing solutions (python’s alembic is taken as example), and comparison to this proposal. Note that on the left there are multiple steps that cross the boundary of ORM/migrator or migrator/DB. </small></p> <h2 id="solution">Solution</h2> <ul> <li>schema migration is generated by the database</li> <li>the tool only declares the desired state</li> </ul> <p>This will move responsibility for db-specific migrations to db developers, and that’s for the better.</p> <h3 id="where-to-start">Where to start?</h3> <p>In a minimal implementation, the DB provides a function. The function is given two db <code class="language-plaintext highlighter-rouge">schemas</code> (think of postgresql/oracle/sql server schemas, or individual databases in mysql) and compares them to produce a migration from the observed difference. A migration tool would create a temporary schema with the desired state and call this procedure to produce the migration.</p> <p>That’s not something unseen: pgAdmin has ‘Schema Diff’, SQL Server Data Tools has ‘Schema Compare’. So tools do exist, but they are not part of the database, and they don’t have a uniform interface.</p>
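<p>A rough sketch of what such a uniform interface could look like from the client side (everything below is hypothetical: <code class="language-plaintext highlighter-rouge">generate_migration</code> is not a real function in any DBMS today, it is the function I wish existed):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: the tool declares the desired state, the DB computes the diff.
# 'generate_migration' does not exist in any DBMS; it only illustrates the proposed division of labor.
import psycopg2

desired_ddl = """
    CREATE TABLE person (
        id        bigint PRIMARY KEY,
        full_name text NOT NULL
    );
"""

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # the tool only declares the desired state in a scratch schema ...
    cur.execute("CREATE SCHEMA desired")
    cur.execute("SET search_path TO desired")
    cur.execute(desired_ddl)
    # ... and the database (hypothetically) figures out how to get there
    cur.execute("SELECT generate_migration('public', 'desired')")
    migration_sql = cur.fetchone()[0]
    print(migration_sql)  # review, then apply to the 'public' schema
</code></pre></div></div>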
<h3 id="consequences">Consequences</h3> <p>When we push migrations to database developers…</p> <ul> <li>migrations would be almost immediately available in any programming language</li> <li>in the longer run, we should expect improvements in SDLs (schema definition languages) to account for common migration scenarios.</li> </ul> <details> <summary> Example of these changes </summary> <div> <p>For example, if you start from something like</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">name</span><span class="p">:</span> <span class="n">string</span> </code></pre></div> </div> <p>and migrate it to</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">full_name</span><span class="p">:</span> <span class="n">string</span> </code></pre></div> </div> <p>From the point of view of a migration tool it is not clear that you just renamed a field rather than deleted ‘name’ and created ‘full_name’. Thus an additional technical identifier is necessary, for instance:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">name</span><span class="p">:</span> <span class="n">string</span><span class="p">,</span> <span class="n">oid</span><span class="o">=</span><span class="err">‘</span><span class="mi">7</span><span class="n">dsd8</span><span class="err">’</span> </code></pre></div> </div> <p>to</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">full_name</span><span class="p">:</span> <span class="n">string</span><span class="p">,</span> <span class="n">oid</span><span class="o">=</span><span class="err">‘</span><span class="mi">7</span><span class="n">dsd8</span><span class="err">’</span> </code></pre></div> </div> <p>Now it is clear that a renaming happened. There are a number of other ways to have smoother support for migrations.</p> <p>However, this will remain just an idea as long as DB developers don’t have to think about migrations.</p> </div> </details> <ul> <li>there are cases when the db just does not provide tools to produce migrations. Like the postgresql enum that just can’t be migrated safely by alembic, so <a href="https://github.com/sqlalchemy/alembic/issues/278">this issue</a> has been unresolved for years, and that’s not on alembic’s side.</li> </ul> <p><br /></p> <p>Well… we can just implement improvements as a stand-alone solution, e.g. within an ORM, right?</p> <p>No, we can’t. As I described, to make it somewhat useful, you need to support numerous dialects, and creating such migration tools is a big job (comparable to creating a new database). Creating such tools for multiple languages is probably more work than just creating a db from scratch.</p> <p><br /></p> <p><br /></p> <p>That’s the main feature I expect from my next db: declarative SDL with schema migrations handled by the DB.
I know that EdgeDB already provides such functionality, but if you know other tools that have this implemented - drop me a letter.</p> Sun, 29 Jan 2023 01:00:00 +0000 https://arogozhnikov.github.io/2023/01/29/migrations.html https://arogozhnikov.github.io/2023/01/29/migrations.html schema migrations databases Delimiter-first code <style> .alex-boxes { display: flex; justify-content: space-around; } .lvl1 { color: darkred; } .lvl2 { color: darkgreen; } .lvl3 { color: darkblue; } .lvl1, .lvl2, .lvl3 { padding-right: 2px; } .lvl1:before, .lvl2:before, .lvl3:before { content: "<lvl"; } .lvl1:after, .lvl2:after, .lvl3:after { content: ">"; } cmnt { /* comments */ display: inline; color: #7f9f7f; } strn { /* string literals */ display: inline; color: #cc9393; } pnct { /* punctuation */ display: inline; color: #41706f; } kwrg { /* kwarg */ display: inline; color: #eee; } hngr { /* hanging elements - bracket / parenthesis / start of multiline */ display: inline; color: #d8f; } caret { display: inline; } caret:after { content: "Ꮖ"; color: #AAA; } .precode { background-color: #2b2b2b; color: #dcdccc; overflow-x: visible; } caret:after { animation: blink-animation 1.5s infinite; } @keyframes blink-animation { 0% { opacity: 0.8; } 10% { opacity: 0.4; } 40% { opacity: 0.4; } 50% { opacity: 0.8; } } </style> <h2 id="summary">Summary</h2> <p>I argue for wider usage of delimiter-first in the code</p> <ul> <li><code class="language-plaintext highlighter-rouge">three friends [tic, tac, toe]</code> becomes <code class="language-plaintext highlighter-rouge">three friends ・tic ・tac ・toe</code>.</li> </ul> <p>A new top-level syntax for programming languages is proposed to show advantages of this method. New syntax is arguably as simple, but more consistent, better preserves visual structure and solves some issues in code formatting.</p> <h2 id="related-comma-first-formatting">Related: comma-first formatting</h2> <p>A well-known proposal is to write commas first in languages like javascript, JSON or SQL, which don’t have trailing commas (JS has these days, but not the other two):</p> <div class="alex-boxes"> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">-- trailing commas </span> <span class="k">SELECT</span> <span class="n">employee_name</span><span class="p">,</span> <span class="n">company_name</span><span class="p">,</span> <span class="n">salary</span><span class="p">,</span> <span class="n">state_code</span><span class="p">,</span> <span class="n">city</span> <span class="k">FROM</span> <span class="nv">`employees`</span> </code></pre></div> </div> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">-- leading commas </span> <span class="k">SELECT</span> <span class="n">employee_name</span> <span class="p">,</span> <span class="n">company_name</span> <span class="p">,</span> <span class="n">salary</span> <span class="p">,</span> <span class="n">state_code</span> <span class="p">,</span> <span class="n">city</span> <span class="k">FROM</span> <span class="nv">`employees`</span> </code></pre></div> </div> </div> <p>While it is <strong>not what I am discussing here</strong>, there is a large overlap. 
This style wasn’t widely adopted, and it is interesting to ask why.</p> <p>All criticism essentially comes down to: 1) tools can solve the common issues this notation solves, and 2) it is not natural / you don’t write text like this.</p> <p>Argument 1) is irrelevant, since tools can handle any notation, even one completely unreadable for humans. Argument 2) is weak; still, similarity to known things drastically simplifies adoption.</p> <p>Over time, however, code culture diverged in multiple ways from ‘usual writing’: we enumerate from zero, write identifiers with underscores, don’t follow the usual rules for quotes, and indent code instead of writing in paragraphs. Once some tools have shown that the alternative way works, further adoption happens more easily.</p> <p>More importantly, argument 2) is really broken:</p> <div class="alex-boxes"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ・this version ・is far more ・natural </code></pre></div> </div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> than this version・ with a delimiter・ after </code></pre></div> </div> </div> <p>so when it comes to enumerating in a visually distinctive way, ‘usual writing’ uses delimiter-first.</p> <p>I want to point out the source of this controversy with one more example:</p> <pre> You need eggs, cheese, bread. <span style="color: #484"># ok</span> You need ,eggs ,cheese ,bread. <span style="color: #844"># sucks</span> You need a) eggs b) cheese c) bread. <span style="color: #484"># ok</span> You need 1. eggs 2. cheese 3. bread. <span style="color: #484"># ok</span> You need ・eggs ・cheese ・bread. <span style="color: #484"># ok</span> </pre> <p>So the complaints are not that delimiter-first looks wrong - in fact, it is common. They are about commas being used as <em>leading</em> elements, not trailing - a lesson to remember.</p> <p>Both arguments 1) and 2) pinpoint the reasons <em>why things are the way they are</em>: habit and tools. But various code examples (<a href="https://hoffa.medium.com/winning-arguments-with-data-leading-with-commas-in-sql-672b3b81eac9">SQL examples</a> by Felipe Hoffa and <a href="https://gist.github.com/isaacs/357981">JS examples</a> by Isaac Z. Schlueter) show the benefits of delimiter-first.</p> <p>I expected to find in the discussions some code examples where delimiter-last is better, but I didn’t.</p> <p><em>Later addition:</em> the haskell community <a href="https://github.com/tibbe/haskell-style-guide/blob/master/haskell-style.md">adopted</a> leading commas in many projects because trailing commas were not supported at first. Later haskell got support for trailing commas, but the majority now <a href="https://www.reddit.com/r/haskell/comments/hr5c2n/comment/fy25hpm/?utm_source=share&amp;utm_medium=web2x&amp;context=3">votes</a> for the advantages of leading commas.</p> <h2 id="is-delimiter-a-right-word">Is ‘delimiter’ the right word?</h2> <p>A delimiter (just like a separator) separates items, though there is <a href="https://stackoverflow.com/questions/9118769/when-to-use-the-terms-delimiter-terminator-and-separator">no consensus</a> about the terminology.</p> <p>E.g. in <code class="language-plaintext highlighter-rouge">[ 1, 2, 3 ]</code> we have a sequence of tokens:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>start item delimiter item delimiter item end [ 1 , 2 , 3 ] </code></pre></div></div> <p>So what I’m arguing for is having a start-of-item token. 
Like this: <code class="language-plaintext highlighter-rouge">・1 ・2 ・3</code>. Do we need to point an end of last token? As we’ll see next, that’s usually not the case.</p> <p>We have a special word for end-of-item token: terminator, but no startinator or any similar word. I see some irony in this.<br /> <em>(update: find some interesting thoughts I received about this in the comments section)</em></p> <p>Meanwhile, I keep using the word ‘delimiter’ (albeit it’s maybe incorrect)</p> <h2 id="collections-in-html">Collections in HTML</h2> <p>Different markup languages give some food for thought, as they commonly deal with collections.</p> <p>E.g. html allows using start-of-item (<code class="language-plaintext highlighter-rouge">&lt;li&gt;</code>) and skipping end-of-item (<code class="language-plaintext highlighter-rouge">&lt;/li&gt;</code>)</p> <div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;ul&gt;</span> <span class="nt">&lt;li&gt;</span> first item <span class="nt">&lt;li&gt;</span> second item <span class="nt">&lt;/ul&gt;</span> </code></pre></div></div> <h2 id="collections-in-yaml">Collections in YAML</h2> <p>Yaml, which focuses on a hierarchy of collections, also uses a delimiter-first approach.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="s">point </span><span class="m">1</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.1</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.2</span> <span class="pi">-</span> <span class="s">point 1.2.1</span> <span class="pi">-</span> <span class="s">point 1.2.2</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.3</span> <span class="pi">-</span> <span class="s">point 2</span> </code></pre></div></div> <p>Let me reinterpret this example. <strong>This reinterpretation is important in further discussion</strong>.</p> <p>There are 3 delimiters: <code class="language-plaintext highlighter-rouge">\n-</code>, <code class="language-plaintext highlighter-rouge">\n__-</code> and <code class="language-plaintext highlighter-rouge">\n____-</code> (underscore = whitespace). All three delimiters are distinct, and the whole structure now reads as</p> <pre> <span class="lvl1">1</span>point 1 <span class="lvl2">2</span>point 1.1 <span class="lvl2">2</span>point 1.2 <span class="lvl3">3</span>point 1.2.1 <span class="lvl3">3</span>point 1.2.2 <span class="lvl2">2</span>point 1.3 <span class="lvl1">1</span>point 2 </pre> <p>No end token needed in yaml: the last item ends when a collection ends, i.e. at a delimiter of higher level. 
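<p>A toy sketch of this reading (my own illustration, not part of any yaml tooling): the tree can be recovered from the leading delimiters alone, assuming two-space indents.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Recover the structure purely from leading delimiters (indentation + '-'),
# without parsing anything inside the items.
doc = """\
- point 1
  - point 1.1
  - point 1.2
    - point 1.2.1
    - point 1.2.2
  - point 1.3
- point 2"""

for line in doc.splitlines():
    stripped = line.lstrip(" ")
    level = (len(line) - len(stripped)) // 2 + 1  # 0 spaces: lvl1, 2 spaces: lvl2, ...
    print(f"&lt;lvl{level}&gt; {stripped.lstrip('- ')}")
</code></pre></div></div>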
There is no need to know or parse anything about an internal structure between two <lvl1> tokens.</lvl1></p> <p>Correspondingly, the only expectation we have from contents enclosed between <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;</code> is that there are no tokens <code class="language-plaintext highlighter-rouge">&lt;lvl1&gt;</code> or <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;</code> and that’s it.</p> <p>Intermediate conclusion: delimiter-first is very common, and in markup languages it is even standard (but not in programing languages!)</p> <h2 id="line-should-start-from-n-not-end-with-it">Line should start from <code class="language-plaintext highlighter-rouge">\n</code>, not end with it</h2> <p>This sounds mad (after many years of programming it just should), but see for yourself:</p> <div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Let's assume I've had some very long text ending here. Chapter 2. Let's learn about belonging of indentation elements to logical elements. </code></pre></div></div> <p>Pay attention to the blank line between last line of previous chapter and a header of new line. Undoubtedly, blank line is a part of ‘Chapter 2’ logical element, because empty line focuses our attention on ‘Chapter 2’ label. It is not because we need to end the paragraph.</p> <p>For the same reason, in html additional margins ‘belong’ to headers, not preceding elements.</p> <p>Same for lines: <em>we highlight a beginning of a new line</em>, not an end of previous one. Ironically, that’s in the name: it is newline, not endline.</p> <p>When we turn to code, the same thought is seen with this small snippet, where I compare normal <code class="language-plaintext highlighter-rouge">print</code> with a hypothetical <code class="language-plaintext highlighter-rouge">print</code> that outputs newline before the output:</p> <div class="alex-boxes"> <div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'step1. downloading'</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">download</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">end</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">()</span> <span class="c1"># to keep steps on separate lines </span> <span class="k">print</span><span class="p">(</span><span class="s">'step2. 
processing'</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">process</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">end</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">()</span> <span class="c1"># to keep steps on separate lines </span></code></pre></div> </div> <center> Code with \n auto-printed after the arguments </center> </div> <div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'step1. downloading'</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">download</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">'step2. processing'</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">process</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> </code></pre></div> </div> <center> Code with \n auto-printed before the arguments </center> </div> </div> <p>result:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step1. downloading......... step2. processing......... </code></pre></div></div> <p>Version of code with leading <code class="language-plaintext highlighter-rouge">\n</code> is more straightforward.</p> <p>If things were the opposite way:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.......step1. downloaded .......step2. processed </code></pre></div></div> <p>then <code class="language-plaintext highlighter-rouge">\n</code> in the end would be more optimal, but this order is not natural. Normally we first describe the collection, then enumerate items, not vice versa.</p> <h2 id="unixs-newline-in-the-end-of-line">Unix’s newline in the end of line</h2> <p>Unix does not use <code class="language-plaintext highlighter-rouge">\n</code> as a delimiter of lines. Instead, it is more of line-terminator, because file with text <em>should</em> end with <code class="language-plaintext highlighter-rouge">\n</code>. 
Not doing so would break the simplicity of unix tools and the simplicity of definitions, see <a href="https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline">this SO thread</a>.</p> <p>For the layman, here is why the newline is required in unix (printf, unlike plain echo, interprets the <code class="language-plaintext highlighter-rouge">\n</code>):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf 'good file with newline in the end\n' &amp;&amp; printf 'another good file with newline in the end\n'
good file with newline in the end
another good file with newline in the end
</code></pre></div></div> <p>A missing newline in the first file:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf 'bad file without newline in the end' &amp;&amp; printf 'another good file with newline in the end\n'
bad file without newline in the endanother good file with newline in the end
</code></pre></div></div> <p>The problem is in the first file, but it is the second one that gets printed the wrong way. There is no such misattribution issue with newline-first.</p> <p>If it is ok to end each file with <code class="language-plaintext highlighter-rouge">\n</code>, then it is ok to start it with <code class="language-plaintext highlighter-rouge">\n</code>.</p> <p>Having lines start with <code class="language-plaintext highlighter-rouge">\n</code> maintains the simplicity of unix utilities, and is a bit simpler to visualize in an editor.</p> <p>Imagine that in a parallel universe text and binary files differed in the very first character. What science fiction we could live in!</p> <p><strong>Do I really want to change all files to newline-first?</strong> Of course not. But I have to point out that if, in the course of history, files had been newline-first from the start, that would have been a better system.</p> <p>I hypothesize that newline-last comes from unix mainframes: once a line in the shell is entered, it can be passed to the mainframe for processing. I can’t confirm this, but it sounds plausible. If so, time has shown it to be the wrong choice: all the messengers these days make a distinction between starting a new line and sending the message (enter vs. shift+enter). Jupyter knows that, IDEs know that, messengers know that. Terminals still don’t know that.</p> <h2 id="using-indentation-to-structure-code">Using indentation to structure code</h2> <p>Code indentation is available in all major languages, but python (and scala 3, F#, nim, haskell, …) relies on indentation to define logical structure.</p> <p>And that works very well. 
Let’s see how we can re-interpret the python code the way we did with yaml</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MyClass</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">pass</span> <span class="k">def</span> <span class="nf">some_method</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">pass</span> </code></pre></div></div> <p>now we reinterpret the structure with <code class="language-plaintext highlighter-rouge">&lt;lvl1&gt;=\n</code>, <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;=\n____</code>, <code class="language-plaintext highlighter-rouge">&lt;lvl3&gt;=\n________</code>.</p> <pre> <span class="lvl1">1</span>class MyClass <span class="lvl2">2</span>def __init__(self) <span class="lvl3">3</span>pass <span class="lvl2">2</span> <span class="lvl2">2</span>def some_method(self): <span class="lvl3">3</span>pass </pre> <p>so, we see very basic organization of code is available just by looking at sequence of start tokens (which simply mirrors indentation).</p> <h2 id="some-problems-with-multiline-strings">Some problems with multiline strings</h2> <p>There are places where python allows code to ‘escape’ indentation: continuation of previous line (explicit with \ or implicit with different brackets) and multiline strings.</p> <p>Continuations are ‘solvable’ with code formatting tools, but not multiline literals:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">True</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">""" This is python's multiline string """</span><span class="p">)</span> </code></pre></div></div> <p>Output (###### just shows where the line ends):</p> <pre class="precode"> <cmnt>######</cmnt> This is python's<cmnt>######</cmnt> multiline string<cmnt>######</cmnt> <cmnt>######</cmnt> </pre> <p>To get proper output we need to break visual alignment:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">True</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">"""This is python's multiline string """</span><span class="p">)</span> <span class="c1"># takes effort to realize that the same block of code continues here </span> <span class="k">return</span> <span class="bp">False</span> </code></pre></div></div> <p>There are problems with multiline: first line, last line and indentation. Multilines in javascript/go face all the same issues, so it is a generic problem.</p> <p>I think there is a way to solve this issue too, and it will be discussed.</p> <h2 id="delimiter-first-pseudo-python">Delimiter-first pseudo-python</h2> <p>To better demostrate how all these ideas come together, I’ll imagine a new language (pseudo-python). To focus only on syntax changes, I’ll keep all other aspects of the language the same.</p> <p>I will consider an artificially complicated example. 
It includes different arguments, list, empty list, string, multiline string, method chaining, multiline logical arithmetics, few or no arguments</p> <p>Goal is to demonstrate that any wild mix is representable and does not produce mess.</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prepare_message</span><span class="p">(</span> <span class="n">title</span><span class="o">=</span><span class="s">"Hey {}, ready for Christmas?"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user_name</span><span class="p">),</span> <span class="n">email</span><span class="o">=</span><span class="n">email</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="sa">f</span><span class="s">"""Reminder: please clean your chimneys! Oh, and prepare "Santa Landing Spot" on your roof Thank you </span><span class="si">{</span><span class="n">user_name</span><span class="si">}</span><span class="s"> for cooperation,</span><span class="se">\n</span><span class="s">Santa Corp. """</span><span class="p">,</span> <span class="n">additional_sections</span><span class="o">=</span><span class="p">[</span> <span class="n">get_current_promotions</span><span class="p">(</span><span class="n">n_promotions</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span> <span class="n">get_recent_news</span><span class="p">(),</span> <span class="p">],</span> <span class="n">unsubscribe_link</span><span class="o">=</span><span class="n">generate_unsubscribe_link</span><span class="p">(</span> <span class="n">email</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="n">message</span><span class="p">,</span> <span class="o">**</span><span class="n">unsubscribe_settings</span><span class="p">,</span> <span class="p">),</span> <span class="n">attachments</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">).</span><span class="n">schedule_for_submission</span><span class="p">(</span> <span class="n">holidays_queue</span><span class="p">,</span> <span class="n">important</span><span class="o">=</span><span class="n">user_is_santa</span> <span class="o">|</span> <span class="n">user_is_deer</span> \ <span class="o">|</span> <span class="n">user_previously_had_issues_with_christmas_delivery</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <pre class="precode"> prepare_message<hngr>(</hngr> <pnct>,</pnct> <kwrg>title=</kwrg><strn>"Hey {}, ready for Christmas?"</strn>.format(user_name) <pnct>,</pnct> <kwrg>email=</kwrg>email <pnct>,</pnct> <kwrg>body=</kwrg><hngr>f"""</hngr> <strn>"Reminder: please clean your chimneys! 
</strn> <strn>" </strn> <strn>"Oh, and prepare "Santa Landing Spot" on your roof </strn> <strn>" </strn> <strn>"Thank you {<kwrg>user_name</kwrg>} for cooperation,\nSanta Corp.</strn> <pnct>,</pnct> additional_sections=<hngr>[</hngr> <pnct>,</pnct> get_current_promotions(n_promotions=4) <pnct>,</pnct> get_recent_news() <hngr>]</hngr> <pnct>,</pnct> unsubscribe_link=generate_unsubscribe_link<hngr>(</hngr> <pnct>,</pnct> email <pnct>,</pnct> message=message <pnct>,</pnct> **unsubscribe_settings <hngr>)</hngr> <pnct>,</pnct> attachments = [] <hngr>)</hngr> <pnct>\</pnct>.schedule_for_submission<hngr>(</hngr> <pnct>,</pnct> holidays_queue <pnct>,</pnct> important=user_is_santa | user_is_deer \| user_previously_had_issues_with_christmas_delivery <hngr>)</hngr> </pre> </div> <p>I welcome you to study this example for a minute. Structure overall did not change much. Note differences in line breaks <code class="language-plaintext highlighter-rouge">\</code> and multiline strings.</p> <p>An important distinction: leading commas get the same role as hyphens in yaml: they define structure, their position is not arbitrary.</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># normal python # this is legal code </span><span class="k">print</span><span class="p">(</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># proposed # this is incorrect code </span><span class="k">print</span><span class="p">(</span> <span class="p">,</span> <span class="mi">1</span> <span class="p">,</span> <span class="mi">2</span> <span class="p">)</span> </code></pre></div> </div> </div> <p>In new code there is no need in closing brackets (see that yourself by staring at the code more!). <br /> So let’s remove closing elements:</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prepare_message</span><span class="p">(</span> <span class="n">title</span><span class="o">=</span><span class="s">"Hey {}, ready for Christmas?"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user_name</span><span class="p">),</span> <span class="n">email</span><span class="o">=</span><span class="n">email</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="sa">f</span><span class="s">"""Reminder: please clean your chimneys! Oh, and prepare "Santa Landing Spot" on your roof Thank you </span><span class="si">{</span><span class="n">user_name</span><span class="si">}</span><span class="s"> for cooperation,</span><span class="se">\n</span><span class="s">Santa Corp. 
"""</span><span class="p">,</span> <span class="n">additional_sections</span><span class="o">=</span><span class="p">[</span> <span class="n">get_current_promotions</span><span class="p">(</span><span class="n">n_promotions</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span> <span class="n">get_recent_news</span><span class="p">(),</span> <span class="p">],</span> <span class="n">unsubscribe_link</span><span class="o">=</span><span class="n">generate_unsubscribe_link</span><span class="p">(</span> <span class="n">email</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="n">message</span><span class="p">,</span> <span class="o">**</span><span class="n">unsubscribe_settings</span><span class="p">,</span> <span class="p">),</span> <span class="n">attachments</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">).</span><span class="n">schedule_for_submission</span><span class="p">(</span> <span class="n">holidays_queue</span><span class="p">,</span> <span class="n">important</span><span class="o">=</span><span class="n">user_is_santa</span> <span class="o">|</span> <span class="n">user_is_deer</span> \ <span class="o">|</span> <span class="n">user_previously_had_issues_with_christmas_delivery</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <pre class="precode"> prepare_message<hngr>(</hngr> <pnct>,</pnct> <kwrg>title=</kwrg><strn>"Hey {}, ready for Christmas?"</strn>.format(user_name) <pnct>,</pnct> <kwrg>email=</kwrg>email <pnct>,</pnct> <kwrg>body=</kwrg><hngr>f"""</hngr> <strn>"Reminder: please clean your chimneys! </strn> <strn>" </strn> <strn>"Oh, and prepare "Santa Landing Spot" on your roof </strn> <strn>" </strn> <strn>"Thank you {<kwrg>user_name</kwrg>} for cooperation,\nSanta Corp. </strn> <pnct>,</pnct> additional_sections=<hngr>[</hngr> <pnct>,</pnct> get_current_promotions(n_promotions=4) <pnct>,</pnct> get_recent_news() <pnct>,</pnct> unsubscribe_link=generate_unsubscribe_link<hngr>(</hngr> <pnct>,</pnct> email <pnct>,</pnct> message=message <pnct>,</pnct> **unsubscribe_settings <pnct>,</pnct> attachments = [] <pnct>\</pnct>.schedule_for_submission<hngr>(</hngr> <pnct>,</pnct> holidays_queue <pnct>,</pnct> important=user_is_santa | user_is_deer \| user_previously_had_issues_with_christmas_delivery </pre> </div> <p>Don’t pay much attention to number of lines - denser code is a byproduct, not a goal.</p> <p>Further I’ll discuss several advantages of this syntax.</p> <h2 id="new-multiline-strings">New multiline strings</h2> <pre class="precode" style="overflow-x: scroll;"> print<hngr>(f"""</hngr> <strn>"This is new</strn> <strn>"multiline string</strn> </pre> <p>output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is new multiline string </code></pre></div></div> <p>Everything looks perfect, multiple issues are solved in one shot. But … with a minor catch: that’s how output looks like in raw form: <code><cmnt>\n</cmnt>This is new<cmnt>\n</cmnt>multiline string</code> (i.e. it is newline-first). Technically, one can produce newline-last outputs, but that’s artificial. See the elegance of match between delimiter-first and newline-first approach: delimiter just gets replaced with newline. 
That’s an operation that one can visually imagine by shifting all lines to the left.</p> <p>One more example:</p> <pre class="precode"> print<hngr>(f"""</hngr> <strn>"you can place anything here: ' '' ''' " "" """ f""" etc etc.</strn> <cmnt># and you can put comments in the middle of multiline</cmnt> <strn>"multiline string can't be broken or terminated by any sequence within a line </strn> </pre> <p>Now, python literals do not work like that.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">''' """ and '''</span> <span class="n">should</span> <span class="n">be</span> <span class="n">escaped</span> <span class="p">(</span><span class="n">otherwise</span> <span class="n">interpreted</span> <span class="k">as</span> <span class="n">literal</span> <span class="n">terminator</span><span class="p">)</span> <span class="s">''' '''''</span> <span class="s">''' # this trick (available in markdown) does not work in python '''''</span> </code></pre></div></div> <h2 id="new-parsing">New parsing</h2> <p>In contrast to normal python, line alone does not inform if the instruction is complete, or it should be continued on the next line. Parsing one more line is required to confirm that current code section is complete (only prefix of next line should be parsed, to be more precise).</p> <p>In this approach top-level parsing is quite ignorant to language details, and it relies on the same visual cues as we humans do: parser does not need to analyze line in detail to figure out if the instruction continues or not.</p> <p>Let me ‘parse’ this example:</p> <div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Delimiter Token class Rest of line <span class="nt">&lt;lvl1-instr</span> <span class="nt">&gt;</span>prepare_message( , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>title="Hey {}, ready for Christmas?".format(user_name) , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>email=email , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>body= f""" " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Reminder: please clean your chimneys! " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span> " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Oh, and prepare "Santa Landing Spot" on your roof " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span> " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Thank you {user_name} for cooperation,\nSanta Corp. 
, <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>additional_sections=[ , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>get_current_promotions(n_promotions=4) , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>get_recent_news() , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>unsubscribe_link=generate_unsubscribe_link( , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>email , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>message=message , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>**unsubscribe_settings , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>attachments = [] \ <span class="nt">&lt;lvl1-continue&gt;</span>.schedule_for_submission( , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>holidays_queue , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>important=user_is_santa | user_is_deer \| <span class="nt">&lt;lvl2-continue&gt;</span>| user_previously_had_issues_with_christmas_delivery </code></pre></div></div> <p>By looking only at the sequence of delimiters (there are several subtypes of them), one can deduct limits of every code block / call / literal, i.e. derive top-level structure of the program. Parser now deals with a simpler task of checking that elements fit this pre-defined structure, and can point places where ‘structure’ does not match ‘content’.</p> <p>Good bye old times when one deleted bracket caused complete rebuild of AST and numerous errors.</p> <h2 id="new-code-suggestions">New code suggestions</h2> <p><em>This paragraph was added later, to unwrap the point that was missed by many readers.</em></p> <p>Parsing of correct code is not a problem since 1960s or so. Real challenge is on-the-fly parsing of partially incorrect and quickly-changing code in the process of editing.</p> <p>Say I’m a complete novice and typed something wrong:</p> <div class="alex-boxes"> <pre class="precode"> def myfunction( var1 = 'some default value', var2 = (1, (2, 3), ) var3 = "variable number 3" var4 = """ Simple unfinished multiline string """ + \ var<caret></caret> var5 = ()) </pre> </div> <p>what should be autosuggested? var1/2/3/4? or nothing? Which would be more helpful?</p> <p>How to inform user which places should be fixed? VS Code blames bracket on first line saying it is not closed (while it is closed!) and last line for missing colon (no, I don’t want colon there). Pycharm’s diagnostic messages are slightly better, but it blames line with var3 (which is completely ok).</p> <p>Now, in pseudo-python there is no way to ‘escape’ indentation and thus code analysis can rely on indentation. And it is immediately deducible that lines with var2 and var5 have problem, and indent of var3 is incorrect (since colon is missing on previous line).</p> <p>Autosuggestion even in code with multiple unfinished places would be still useful (in similar scenario in pseudo-python it still can suggest var3/var4, and depending on tolerance additionally var1/var2). Currently tools don’t suggest anything.</p> <p>As I mentioned, AST undergoes small changes during editing, thus providing highly effecient autosuggestion, code analysis, and highlighting for such language would be simpler, much simpler.</p> <h2 id="new-editing">New editing</h2> <div class="alex-boxes"> <div> <p>Normal python. 
<br /> suppose you want to start a list of arguments</p> <pre class="precode"> print(<caret></caret>) </pre> <p>after you hit enter in IDE:</p> <pre class="precode"> print( <caret></caret> ) </pre> <p>then you type argument and comma. <br /> Ready to proceed</p> <pre class="precode"> print( 42, <caret></caret> ) </pre> <p>Done? Arrow down + enter</p> <pre class="precode"> print( 42, 43, ) <caret></caret> </pre> <p>Forgot something? <br /> Double arrow up, <br /> move cursor to end of line,<br /> enter</p> <pre class="precode"> print( 42, 43, <caret></caret> ) </pre> </div> &nbsp; <div> <p>Delimiter-first pseudo-python. <br /> suppose you want to start a list of arguments</p> <pre class="precode"> print(<caret></caret>) </pre> <p>after you hit enter in IDE comma is auto-added:</p> <pre class="precode"> print( , <caret></caret> </pre> <p>you type only argument. <br /> Ready to preceed</p> <pre class="precode"> print( , 42 , <caret></caret> </pre> <p>Done? Enter + shift-tab</p> <pre class="precode"> print( , 42 , 43 <caret></caret> </pre> <p>Forgot something? Tab</p> <pre class="precode"> print( , 42 , 43 , <caret></caret> </pre> </div> </div> <p>The process of editing such structures was polished with hierarchical lists in word and other text processors.</p> <p>Below is an animated example from workflowy (taken from <a href="https://www.process.st/take-better-notes/">post</a> by B. Brandall): <img src="https://www.process.st/wp-content/uploads/2016/01/ezgif.com-crop-1.gif" /></p> <p>Even minimalist note-taking apps these days recognize the importance of hierarchical organization. Their interface focuses on effectively traversing and modifying this structure.</p> <p>But with code - this extremely structured and standardized pieces of linked information - we continue the game of imitation: ‘hey, that’s just text files, you can use notepad here!’.</p> <h2 id="new-versioning">New versioning</h2> <p>Missing trailing commas make diffs a bit annoying because of including an additional line.</p> <p>New syntax has this solved. In other aspects versioning should work the same.</p> <h2 id="new-formatting">New formatting</h2> <p>The goal of formatting is to produce a visual code structure that is easy to read, as if you already see all main components without reading anything.</p> <p>New syntax enforces this, and leaves fewer degrees of freedom. Writing something non-readable would be challenging… I suppose.</p> <p>Role of formatters thus would be minor, or they can be skipped.</p> <h2 id="limitations">Limitations</h2> <p>First, I did not try to solve following perceptual problems:</p> <ul> <li>commas are leading, and I’ve mentioned that this was a problem for comma-first formatting</li> <li>open brackets without a matching pair create visual discomfort. Also my eyes already trained to focus on closing brackets, but proper color scheme seems to solve this</li> </ul> <p>This post is already long, and leaving things closer to python simplifies example. I think both points can be improved, and feel free to post your ideas on this.</p> <p>Second, I intentionally focused only on improving multi-line constructs, but single-line collections were left untouched. That does not mean delimiter-first does not work there, but scale of necessary changes is just too high to justify gains. 
At least for now.</p> <h2 id="if-you-made-it-this-far">If you made it this far</h2> <p>Wow, thank you!</p> <p>I hope an adventure was interesting and slightly mind blowing.</p> <p>Don’t be too surprised if this proposal evokes “hey this looks wrong, just plain wrong” reaction. <br /> After all, ideas we enjoy these days: enumeration from zero, using registers in names, structural programming, mandatory formatting, and even python’s approach to defining code blocks with indentation — every single one of them were met with a storm of criticism.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> <h3 id="comments-">Comments 💬</h3> <ul> <li> <p>I received and collected a number of links for using delimiter-first in different contexts (lisp/scheme, formulas, translatable languages), will organize that material when I get time.</p> </li> <li> <p>Isaac Z. Schlueter advised there is a term ‘initiator’, used in <em>“… specification discussion threads, where it’s common to dig deep into the particulars of parsing semantics. Very much a ‘deep in the weeds’ kind of technical term.”</em> <br /><br /> In the context of parsing I found the word ‘initiator’ in several papers, and only one mention on stackoverflow, so I’ll stick to using word ‘delimiter’.</p> </li> <li> <p>Other options mentioned in discussions: introducer, starter</p> </li> <li> <p>Peter Hilton noticed that <em>“… startinators in prose usually called bullets. Some English-language style guides even treat the following punctuation as equivalent.</em></p> <p>Brilliantly Wrong — Alex Rogozhnikov’s blog about math, machine learning, programming, physics and biology.*</p> <p>Brilliantly Wrong — Alex Rogozhnikov’s blog about:</p> <ul> <li>math</li> <li>machine learning</li> <li>programming</li> <li>physics</li> <li>biology.</li> </ul> <p><em>Note the bullet list’s trailing full stop (period). It’s still one punctuated sentence.”</em></p> <p>Indeed, name ‘bullet’ sounds very appropriate when discussing code written in delimiter-first style. From parsing side, I don’t feel it’s a good partner to word ‘terminator’. <br /><br /></p> </li> <li> <p>Thanks to Alexander Molchanov for proofreading, improving text, and leaving comments.</p> </li> <li> <p>Question: “Who did you write this for?”</p> <p>I believe that’s a better way to structure code (for readability, editing, and better language tools). Based on what I’ve learnt so far, I am sceptical about integration of additional syntax to existing languages: two notations side-by-side are worse for users than one. From the perspetive of language maintainers, all tooling would need to deal with two dialects, which is also a downgrade.</p> <p>So main audience are <em>authors of new programming languages.</em> However, it is not only authors - to get adopted, any new feature should get at least minimal support from community. That’s where this page can help. 
So more generally, I target people <em>interested in experimenting around new programming languages</em>, and interested in challenging status-quo.</p> </li> <li> <p>Question: “But how will you represent a couple of multiline lists next to each other?”</p> <p>This case is handled normally:</p> <div class="alex-boxes"> <p></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> f([ a, b, ], [ c, d, ]) </code></pre></div> </div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> f([ , a , b \,[ , c , d </code></pre></div> </div> <p></p> </div> <p>For the record, I’d prefer to introduce variables in any case.</p> </li> <li> <p>Question “Don’t you think that current tools have already solved the issues solved by delimiter-first?”</p> <p>I developed a simple 4-line code with missed comma that is compeletely fine for flake8 and ruff. And black formatter considers it well-formatter. It took me less than a minute to develop this example, and if you start thinking, I’m sure you’ll find a handful of similar cases. Authors of one utitity that is supposed to mark these cases <a href="https://blog.devgenius.io/5-of-666-python-repos-had-comma-typos-including-tensorflow-and-pytorch-sentry-and-v8-7bc3ad9a1bb7">claim</a> that ‘5% of 666 Python repos had comma typos (including Tensorflow, and PyTorch, Sentry, and V8)’.</p> <p>We can continue patching problems with even more tools and more special cases, but I’d better have it solved by design. Core point is - <em>delimiter-last is flawed</em>. Main visual cues (indentation) is on the left, while there are still control sequences that can override indentation, and they are on the right. For this reason <code class="language-plaintext highlighter-rouge">\</code> in the end of line is a bad choice.</p> </li> </ul> <!--- TODO mention differences in code suggestions TODO jtree allows conversion between syntaxes https://jtree.treenotation.org/designer/#hakon-readme lisp version of syntax https://gist.github.com/armstnp/bb2a88bcb053d2195f42c60a0cf15a65 lisp proposals, more (Via Nikishkin) https://srfi.schemers.org/srfi-49/ https://srfi.schemers.org/srfi-110/ one more version of lisp: http://calcit-lang.org/ elm https://elm-lang.org/docs/style-guide ocaml https://github.com/ocaml-ppx/ocamlformat/blob/main/test/failing/gen/gen.ml Ruby has no-delimiter lists (not so interesting) Coffeescript and Civet https://github.com/DanielXMoore/Civet "Coffeescript for typescript" http://www.rebol.com/pre-view.html leslie lamport and formulas https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-119.pdf --> Tue, 29 Nov 2022 01:00:00 +0000 https://arogozhnikov.github.io/2022/11/29/delimiter-comes-first.html https://arogozhnikov.github.io/2022/11/29/delimiter-comes-first.html delimiter separator Things I wish someone told me about microscopy <h2 id="if-you-want-to-learn-some-culprits-of-microscopy">If you want to learn some culprits of microscopy</h2> <p>… you’d better watch this video by microbehunter, because rest of the post is view of ML person on things you should (not) expect from lab microscopy during experiment design.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/Ir9TGt6zljI" frameborder="0" allow="clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""> </iframe> <p><strong>Warning:</strong><br /> This post contains reflections and is not meant to be an easy reading.<br /> This post assumes that you understand wave mechanics.</p> 
<p>I have a nice general background in physics, however just that was clearly insufficient — a lot of specific knowledge that is hard to deduce from first principles.</p> <h2 id="general-remarks">General remarks</h2> <ul> <li>there are myriads of different microscopes from trivial ones for mid-schools to EM (electron microscopes) and light-sheets <ul> <li>Ranges of prices from hundreds of dollars to millions. In some applications 100x cheaper microscope can still be more useful</li> <li>Manual and automated. Terribly expensive still may be non-automated</li> </ul> </li> <li>microscopes are typically designed to be modular, many parts are interchangeable; there is still vendor- and format- specificity</li> <li>when a microscope is automated, that typically means that it can at least move its specimen (yes, specimen is moved, microscope’s camera and light path are usually steady) <ul> <li>it may or may not be able to switch excitation / emission filters automatically, so ‘automated’ is not a descriptive word. Ask about what is automated</li> </ul> </li> <li>while typically microscopes are just ‘make a photo with light’ devices, software for microscopes is a tough topic. <ul> <li>manufacturers desire to provide a visual interface with windows and buttons, and mapping all countless scenarios to a sequence of buttons is … challenging</li> <li>as a result both API and interface are far from satisfactory</li> </ul> </li> <li>light source is not moved with specimen, but instead aligned and fixed relative to camera. <ul> <li>You can’t image with different shifts but ‘same light position’</li> </ul> </li> <li>immersion is quite critical when going to higher resolutions (above 20x)</li> <li>objective on a microscope has everything aligned and focusing depth can be adjusted or changed. (objectives are also pretty expensive). That’s not your smartphone’s refocusing camera. So 40x on your microscope means that object of size n<em>m in focusing plane (which is fixed) literally projects in 40n</em>40m on detector plane. To complete arithmetics you only need physical size of pixel in a camera - and voila - you have ‘size of specimen pixel’.</li> <li>for a long time I was surprised that biologists are so limited by the number of fluorescent channels they can image simultaneously (emission spectra overlap, so you want them to be separable). <ul> <li>At the same time they don’t switch to quantum dots (which have much narrower emission spectra). Permeability may be an issue here</li> <li>And they don’t try to go significantly outside of visible spectrum. <ul> <li><em>probably</em> this is due to objectives - correcting aberrations for wide spectrum range is tough</li> </ul> </li> <li>Another factor is penetration depths variability (even within water) for different wavelengths</li> <li>You can take images in IR, but going to deep IR is ultra-rare</li> </ul> </li> <li>there is an uncountable amount of imaging techniques. <br /> Dozens of them with all their variations, with all covering only some part of information. 
<ul> <li>Very hard to combine many in the same system (while some useful combinations exist)</li> <li>Dream of machine learner - having different imaging systems for the same specimen - can be implemented only in specific cases</li> </ul> </li> <li>more powerful microscope requires identical efforts on sample/environment side <ul> <li>Higher magnification requires better compensation of motion</li> <li>More sensitive to optical properties means you’ll see more artifacts from anything in your system. Or maybe plates or slides. <ul> <li>E.g. if method can detect birefringence, any plastic labware is likely to add some birefringence patterns</li> </ul> </li> </ul> </li> <li>well edges introduce significant effects, plate edges also introduce some effects for imaging (both also affect biological processes)</li> <li><a href="https://www.youtube.com/user/iBioEducation">ibiology</a> provides an amazing combination of theory and practice of imaging. It was incredibly helpful</li> <li>imaging protocols are hardly readable. Too many things and parameters, no deduplication. <ul> <li>They remind completely unwrapped low-level code for execution by machine, not ‘settings’.</li> <li>I’ve told about software being tough here, right? There are issues with interfaces on all levels</li> </ul> </li> <li>imaging time is a real issue <ul> <li>“oh, we can just increase stack size” is correct solution to many questions in theory, but not in practice</li> </ul> </li> <li>reproducible focusing may be an issue</li> <li>richest sources of information are available only for ex-vivo cells and tissues</li> <li>anything that produces nice high-resolution images will be called by biologist “confocal” no matter if confocality is actually used there :)</li> <li>believe data, always believe data. If you think something is misaligned - it almost surely is.</li> </ul> <h2 id="contrasting-methods">Contrasting methods</h2> <iframe width="560" height="315" src="https://www.youtube.com/embed/FUa1GTc69y4" f="" rameborder="0" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""></iframe> <p>The main way to achieve contrast is by using monochromatic (i.e. laser) light, and achieve shift in phase between “rays” started from the same source. Shift in phase affected by specimen provides a contrast visible by a simple detector.</p> <ul> <li>Simplest example is <a href="https://www.olympus-lifescience.com/en/microscope-resource/primer/techniques/dic/dicconfiguration/">DIC</a> (differential interference contrast) - light is split in two parts, which come through neighboring positions in slide</li> <li>Another example is polarization contrast, where light comes though the same specimen but due to <a href="https://en.wikipedia.org/wiki/Birefringence">birefringence</a> of some materials different polarizations come with different speed, which produces retardation of one polarization</li> <li><a href="https://www.microscopyu.com/tutorials/comparison-of-phase-contrast-and-dic-microscopy">Phase contrast</a> organizes interference between scattered and passed through waves. Phase delay adds phase to scattered light. Simplest to setup of these three.</li> </ul> <p>An important property of contrasting optical paths is that optical path lengths for light arriving to the same location should be identical (unless sample perturbations prevent this). 
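<p>(For reference: the optical path length along a ray is the refractive index integrated over geometric length, OPL = ∫ n ds, which is simply the travel time multiplied by c. Equal optical paths through the instrument mean the waves accumulate the same phase, so any residual phase difference comes from the specimen itself.)</p>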
Optical path is not distance, but time taken by light to travel along a trajectory.</p> <p>That’s a simple thought and sounds like a natural, but when you look at optical system with all its lenses, you should realize it’s non-trivial behavior.</p> <h2 id="amazing-variability-of-imaging-techniques">Amazing variability of imaging techniques</h2> <p>Microscopy world is very limited within one lab (even optical lab) but whole large world of microscopy is so rich and interesting out there.</p> <ul> <li>Multi-photon imaging <ul> <li>deliver energy required for excitation with several photon simultaneously</li> <li>requires an expensive laser, but imaging is simple</li> <li>can go quite deep into tissue</li> <li>can’t guarantee narrow emission spectra because different number of ph</li> </ul> </li> <li>Electron microscopy <ul> <li>super precise (it’s completely different part of spectra)</li> <li>ex-vivo samples only</li> <li>requires isolated rooms and strong movement compensation</li> <li>not something you will simply hold in a lab, but provides extremely detailed image</li> </ul> </li> <li>LSM: light-sheet microscopy is a demonstration that light source does not have to be on the same axis, while it sounds like an axiom after lab scopes <ul> <li>LLSM is times cooler</li> </ul> </li> <li> <p>TIRF (total internal reflection) microscopy when combined with photo-activable fluorescent proteins (PALM/STORM) can get to tracking trajectories of individual proteins (while still using visible range spectrum).</p> </li> <li> <p>Another interesting idea is FRET - allows detecting interaction between single molecules if those have appropriate fluorescent tags. <br /> Photons emitted by one antibody are absorbed by the second one if molecules are in proximity of each other.</p> </li> <li><a href="https://www.youtube.com/watch?v=HJnNJIUPm4s">optical coherence tomography</a> OCT <ul> <li>has nothing to do with tomography and even works based on reflected light</li> <li>widely used for retina scanning</li> </ul> </li> <li><a href="https://www.youtube.com/watch?v=tTHvVCPaeWQ">Ghost imaging</a>. Not-yet-there, but idea is mind-blowing <ul> <li>entangle two photons</li> <li>the first one hits the target, while the second goes to detector</li> <li>entanglement allows partially reconstructing properties of a photon that hit the target</li> <li>there are classical variations as well</li> </ul> </li> <li>Structured illumination (SIM) <ul> <li>Moir patterns + a bit of computational magic allows you going slightly above optical resolution limit</li> </ul> </li> </ul> <p>You may want to check this video to orient yourself a bit and get a sense of what sounds appropriate for your case.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/01v2kR8dlnQ" frameborder="0" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""></iframe> Sun, 01 Nov 2020 12:00:00 +0000 https://arogozhnikov.github.io/2020/11/01/microscopy.html https://arogozhnikov.github.io/2020/11/01/microscopy.html Microscopy Don't write command-line interfaces (generate them) <p style="color: #666677"> (a friendly reminder that reading post before commenting is a great idea. 
Some people see this as an argument for GUI, but it is completely misleading) </p> <p>A favourite activity of fresh github-bers is writing CLI (command-line interfaces) for anything.</p> <p>Every programmer uses CLI <strong>(true)</strong>, so writing CLI makes you more professional <strong>(false)</strong>.</p> <p>CLIs are required in everyday maintenance, env/pipeline/db management, and checking this and that. It is a glue to keep different subsystems together, but hardly CLI is a reliable programming interface. Progress in software engineering left bash calls far behind in terms of reliability and flexibility.</p> <h3 id="whats-wrong-with-writing-cli-as-an-interface">What’s wrong with writing CLI as an ‘interface’?</h3> <ul> <li>CLI support is an additional logic in your program that makes <strong>no real work</strong></li> <li>While typically being dumb, CLI logic is frequently <strong>filled with <a href="https://github.com/search?q=bug+command+line&amp;type=Issues">mistakes</a></strong>; thus it requires constant maintenance and an additional testing</li> <li><strong>Error (exception) handling</strong> with CLI is very poor. Another layer of (bad faulty) code is required to make it possible</li> <li><strong>Scaling/extending</strong> is not as easy compared to programming language APIs (see example in the end)</li> <li>CLIs are detached from essential code, which in most cases is a disadvantage. <details> <summary>more on this</summary> <p>Forcing users to use CLI means: stay away from my code, you’d better not work with it. Maybe that’s ok — but if users can code a bit (otherwise why do they use CLI?), that’s not an optimal way — if something went wrong, do you want to directly see the code+calls that failed or do you want to add several minutes/hours walking thru command args parsing machinery someone else wrote? <br /> While being questionable in small projects, a virtual fence becomes more and more obvious when parsing logic (validation, transformation, routing) grows.</p> </details> </li> </ul> <h3 id="writing-command-line-interfaces-the-right-way">Writing command-line interfaces the right way</h3> <ul> <li>write functions</li> <li>leave CLI-fication to a special package</li> </ul> <h3 id="which-tool-to-use-for-writing-command-line-interfaces-in-python">Which tool to use for writing command-line interfaces in Python?</h3> <p>Here are the options that you should consider …</p> <ul> <li><a href="https://docs.python.org/3/library/argparse.html">argparse</a> (or ancient optparse)</li> <li><a href="https://click.palletsprojects.com/en/7.x/">click</a></li> <li><a href="http://docopt.org/">docopt</a></li> <li><a href="https://github.com/google/python-fire">python-fire</a></li> </ul> <p>… <strong>deprecated</strong>. Yes, consider them deprecated.</p> <p>Prefer <a href="https://hugapi.github.io/hug/">hug</a> and <a href="https://github.com/tiangolo/typer">typer</a>. 
Example for the latter:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import typer
from pathlib import Path

app = typer.Typer()

@app.command()
def find_dragon(name: str, path: Path, min_age_years: int = 200):
    ...  # implementation goes here

@app.command()
def feed_dragon(dragon_name: str, n_humans: int = 3):
    ...  # implementation goes here

if __name__ == "__main__":
    app()
</code></pre></div></div> <p>Now it’s ready to be invoked from the shell:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python example.py find_dragon 'Drake' --path /on/my/planet
</code></pre></div></div> <p>That’s it! Types are parsed, checked and converted. Defaults and descriptions are picked up from the function itself. It even provides bash completions you can install. The best part: you wrote no CLI code for that!</p> <h3 id="-i-need-to-invoke-my-code-from-bash-with-complex-parameterization">— I need to invoke my code from bash with complex parameterization</h3> <p>Exact wording of this question may also mention job schedulers, calls on remote machines and docker run/exec — common reasons that force people to write CLIs.</p> <p>The previous recipe may not work in this case; you have two options.</p> <p><strong>Option A.</strong></p> <p>Read the documentation for the <em>deprecated</em> packages, write a ton of code for conversion, validation, testing and mocking. Add documentation, make presentations about CLI logic and neat uses of bash, get promoted to Senior CLI Architect, give talks and interviews. 
Some junior in your company discovers <em>option B</em> and ruins your career.</p> <p><strong>Option B.</strong></p> <p>When there is a lot to configure, don’t try to build a large parsing machinery to handle all the cases; just <strong>use code</strong> to parameterize calls:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -c "
from mymodule import set_dragon_feeding_schedule, Creatures, Date
set_dragon_feeding_schedule(
    feeding_times=['10:00', '14:00', '18:00'],
    dishes={Creatures.Tiger: 2, Creatures.Human: 1},
    start_day=Date('1020-03-01'),
)
"
</code></pre></div></div> <p>Instead of</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m mymodule \
    set_dragon_feeding_schedule \
    --feeding-times ['10:00','14:00','18:00'] # hopefully this way it gets recognized \
    # how will you define parsing a dict with enum to integer mapping?
    --dishes=Creatures.Tiger:2 \
    --dishes=Creatures.Human:1 \
    --start-day=1020-03-21 # BTW bash allows no comments in multiline calls
</code></pre></div></div> <ul> <li>How many lines of code do you need to cover the parsing logic in the previous example? <ul> <li>Try to be reasonable, not optimistic. Don’t forget documentation.</li> <li>Add testing, mocking, … have you <em>ever</em> seen that part done properly for CLIs?</li> </ul> </li> <li>Is there anything you win by writing explicit CLI parsing? A saved double quote, maybe?</li> <li>Exception handling — simple to add in one case, very tough in the other</li> </ul> <h3 id="-never-realized-that-cli-command-can-be-replaced-by-python-command">— Never realized that a CLI command can be replaced by a python command</h3> <p>You’re welcome! This can save you weeks of time and sleepless nights.</p> <p>Here is the definitive guide:</p> <ol> <li>Don’t write yet-another-parser — python can parse all you need</li> <li>Don’t reinvent representing lists, dicts, enums, objects, etc. in text — every programming language has this already solved</li> <li>Don’t create new <em>types</em> of interfaces — functions <em>are</em> interfaces</li> <li>Don’t write parsing logic/validation — check parameters instead</li> </ol> <p>Focus on writing a useful and friendly functional interface, not a CLI.</p> <h3 id="-how-about-an-example-for-dealing-with-more-complex-parameterization">— How about an example for dealing with more complex parameterization?</h3> <p>Sure! 
Here is an example from machine learning.</p> <p>A common headache is supporting multiple optimization algorithms (each with its own set of parameters) while allowing a number of architectures (each also with different parameters).</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -c "
from yourpackage import ResidualNetwork, AdamOptimizer, train, activations
train(
    optimizer=AdamOptimizer(lr=0.0001, some_param=42, converge=True),
    model=ResidualNetwork(n_layers_in_each_group=[3,4,5,6], act=activations.ReLU, n_classes=1234),
    save_path='/research/my_experiment_number9999',
)
"
</code></pre></div></div> <p>Compare this piece of clarity and versatility to the parsing nightmare happening in some popular packages.</p> <p>Why does it become such a nightmare? That’s a great question!</p> <ul> <li>parameters depend on each other in a non-trivial way. Different model → different parameters. Added a model? Update the CLI.</li> <li>there should be a way to associate parameters with the entity they come from <ul> <li>is this parameter for an architecture? for an optimizer? for a dataset?</li> <li>entities that appear naturally in programming interfaces are foreign to the style of bash calls</li> </ul> </li> <li>at some point a second model appears (hi, GANs!), and possibly a second optimizer and several types of datasets… now you need to support all of that in the CLI and avoid flag collisions <ul> <li>you are unlikely to want to frequently drop the previous interface, so backward compatibility will multiply your problems</li> </ul> </li> <li>validation logic that is capable of handling all these scenarios would be huge, buggy and not helpful at all</li> </ul> <p><strong>CLIs don’t scale up well</strong>.<br /> They work well only when you can decompose things into simpler components, ‘each doing one job’. Before writing a CLI, it is thus important to know what functionality your project provides and how it may change in a year or two. It is very easy to add a CLI when the project is in its initial stage, but as functionality grows, you’ll find it exponentially harder to fit all the knobs into the CLI.</p> <p>Other programming interfaces survive growth quite easily.</p> <h2 id="looking-forward">Looking forward</h2> <p>In the bright future of programming there will be more natural bridges between different languages. With growing capabilities for <a href="https://en.wikipedia.org/wiki/Reflection_(computer_programming)">reflection</a>, it will be easier to invoke particular functions from other languages without intermediate bash calls. <a href="https://pyo3.rs/">Python&lt;&gt;rust</a> is a good example of going in this direction.</p> <p>By not writing CLI logic and focusing on the programming interface, you make your code future-proof. <a href="https://fastapi.tiangolo.com/">Different</a> <a href="https://fastapi.tiangolo.com/alternatives/">utilities</a> can already convert functions to a REST API (we may later use other network APIs like gRPC, and you’ll be able to add them with a couple of lines). More to come; maybe we should expect utilities that auto-wrap your functions for calling from other languages/hosts/universes.</p> 
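<p>To make that concrete, here is a minimal sketch (not from the original toolchain; the endpoint path and return value are illustrative) of how the same kind of annotated function from the dragon example above could be exposed as a REST endpoint with <a href="https://fastapi.tiangolo.com/">FastAPI</a>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fastapi import FastAPI

app = FastAPI()

@app.get("/dragons/{name}")
def find_dragon(name: str, min_age_years: int = 200):
    # same annotated signature as in the typer example;
    # path/query parameters, validation and docs are derived from it
    return {"name": name, "min_age_years": min_age_years}
</code></pre></div></div> <p>Run it with <code class="language-plaintext highlighter-rouge">uvicorn example:app</code> (assuming the file is called <code class="language-plaintext highlighter-rouge">example.py</code>) and interactive docs appear at <code class="language-plaintext highlighter-rouge">/docs</code>; again, no interface code was written by hand.</p>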
<p>Code should be designed to be used by other code first. Convenience ‘temporary’ command-line utilities sooner or later become part of bigger automated pipelines if no other API is proposed.</p> <h2 id="tldr">TL;DR</h2> <ul> <li>simple CLIs should be auto-generated today; don’t write them yourself <ul> <li>other types of APIs can be auto-generated as well</li> </ul> </li> <li>complex CLIs are a problem; think twice (better, five times) before trying to replace a programming API with a CLI <ul> <li>convenient command-line calls are available without writing a single line of CLI code</li> </ul> </li> </ul> <p><br /></p> <p><br /></p> <details> <summary> <span style="font-size: 1.5em;"> Additional comments </span> </summary> <ul> <li>I use python as an example because 1) I need to show some code, 2) it is popular, 3) I know it well enough. <br /> However, the points made should be valid for all modern languages (C++ is not a modern language, just in case).</li> <li>Itamar Turner-Trauring has an article on a related topic called <a href="https://pythonspeed.com/articles/shell-scripts/">please stop writing shell scripts</a>. Itamar provides numerous helpful recommendations and tips in his blog, and this one is no exception.</li> </ul> </details> <details> <summary> <span style="font-size: 1.5em;"> Possible objections </span> </summary> <ul> <li>A CLI allows abstracting away from the implementation <ul> <li>Exposed functions can equally be detached from the actual implementation</li> </ul> </li> <li>The user may not know the programming language I use <ul> <li>An import and a function call are unlikely to mislead anyone. By hiding details you leave the user clueless when something doesn’t work</li> <li>The actual choice is whether the user should learn a bit of your language or yet another CLI system. It is hard to find an argument for the latter</li> <li>If your tool requires detailed configuration, you shouldn’t be afraid to say: you need to write several lines of code, here is an example</li> </ul> </li> <li>My application heavily uses bash/shell features: pipes, process substitutions and filename expansions <ul> <li>This is the case when you do want to keep using and supporting a CLI</li> </ul> </li> </ul> </details> <details> <summary> <span style="font-size: 1.5em;"> Comments on packages </span> </summary> <p><strong>What’s wrong with <code class="language-plaintext highlighter-rouge">python-fire</code>?</strong></p> <p>While it builds a CLI on top of exposed functions/methods, <code class="language-plaintext highlighter-rouge">fire</code> ignores annotations and tries to guess types from the input.</p> <p>An example from the official documentation to confirm:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python example.py 10
int
$ python example.py "10"
int
$ python example.py '"10"'
str
</code></pre></div></div> <p>So: 1) no types are guaranteed, 2) the logic is convoluted, 3) to make sure an argument is not converted to int, you wrap it in both single and double quotes. Now wrap that in a bash call (e.g. while building a docker image). Have fun escaping quotes for every string argument.</p> <p><strong><code class="language-plaintext highlighter-rouge">Hug</code> has poor support for CLIs (as of now)</strong></p> <p>Be warned: it ignores flag names. Still, it has the right direction of thought and directly supports <code class="language-plaintext highlighter-rouge">marshmallow</code> types. 
But in the meantime (Oct 2020) <code class="language-plaintext highlighter-rouge">typer</code> is a safer choice.</p> <p>The interface package of my dreams is not released yet — it should support both CLI and web APIs and include some elements from python-fire. However, this should not stop you, as switching between these packages is almost painless as long as you write no custom logic.</p> </details> <details> <summary> <span style="font-size: 1.5em;"> Acknowledgements </span> </summary> <p>Thanks to <a href="https://github.com/tlikhomanenko">Tatiana</a> for proof-reading an initial version of this post.</p> </details> <!-- maybe mention TAP https://github.com/swansonk14/typed-argument-parser --> Thu, 01 Oct 2020 12:00:00 +0000 https://arogozhnikov.github.io/2020/10/01/dont-write-cli.html https://arogozhnikov.github.io/2020/10/01/dont-write-cli.html Programming Python Command-line interfaces Twin training: trick for better model comparisons <p>Abstract: <em>Frequently comparing deep learning models?<br /> A simple way to improve comparisons is discussed here; this trick becomes especially handy when comparing segmentation models.</em></p> <p>Reliable comparison of models is important for DL “theorists” (to evaluate new approaches) as well as for practitioners/engineers (to select an approach for the particular task at hand). Comparison is a time-consuming process, frequently with noisy results.</p> <p>The usual setting involves a fixed dataset split into train/val/test and a fixed metric of choice. Next, independent runs are conducted for all models under comparison, and the achieved quality is recorded.</p> <p>As a result,</p> <ul> <li>There is significant noise in the comparison (it is rare to rerun each model several times, especially in applications),</li> <li>Validation can be done only on the whole dataset,</li> <li>you need to remember which version of the code was used to generate a particular number, as you can accidentally compare things that are not ‘comparable’ because of e.g. changed augmentations or updates to the dataset <ul> <li>yes, practitioners have to deal with frequent dataset updates</li> </ul> </li> <li>you can’t use augmentations while testing, since it is hard to guarantee that exactly the same augmentations were applied. Sometimes it is handy to evaluate on several batches as a fast intermediate check, and augmentations at test time allow a ‘broader’ check.</li> </ul> <h2 id="what-is-suggested-twin-training">What is suggested: twin training</h2> <p>Models can be trained <strong>side-by-side within the same process</strong>, with the training process kept as similar as possible. Same batches, same augmentations, and of course the same datasets.</p> <ul> <li>If the models, say, have identical architecture, their initial weights should be identical (easy to achieve in any DL framework; see the short sketch after this list). <ul> <li>As we know, the initial state influences optimization, in some cases drastically (that’s not desirable, but it happens).</li> </ul> </li> <li>During training, the exact same batches with the exact same augmentations should be used to optimize the models. <ul> <li>That’s right: you need to augment only once, so the CPU is not a bottleneck.</li> <li>Similarly, one should always compare on the same batches. To achieve smooth monitoring rather than validating once in a while, take one batch at a time and compute metrics on that batch.</li> </ul> </li> </ul> 
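<p>A minimal PyTorch-flavoured sketch of the first point (the tiny architecture below is made up purely for illustration): one freshly initialized model can simply be cloned, so both copies start from identical weights.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import copy
import torch.nn as nn

# any architecture works; this one is just a stand-in
model_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
model_b = copy.deepcopy(model_a)  # same architecture, identical initial weights

# for two already-constructed instances of the same architecture:
# model_b.load_state_dict(model_a.state_dict())
</code></pre></div></div>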
<p>Pseudo-code may look like (fragment):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for batch in train_data:
    batch = augment(batch)
    for model in models:
        # make an optimization step for each model using the same batch
        ...
</code></pre></div></div> <p>Things usually tuned (architecture, loss, augmentations, parameters, optimizers, learning schedules, etc.) can all be compared more efficiently this way.</p> <h2 id="example">Example:</h2> <p><img src="/images/model_comparison/tensorboard1.png" width="700" /></p> <p>There are three models trained in parallel in this screenshot from tensorboard. One can tell when one of the models has a lower loss and estimate the level of ‘noise’. It is also clear that most jumps and falls in the learning curves are due to the batches sampled, and are not model-specific behavior. In other words, you can better see the difference between <strong>models</strong>, not the difference between <strong>runs</strong>.</p> <p>This demonstrates a typical comparison: the things compared are extremely similar and there is little practical difference. The models’ responses to the same training input are close to identical. It’s not easy to reach the same conclusion by looking at final scores alone. That’s a good argument for including learning curves in a paper.</p> <h2 id="bonus-simpler-comparison-of-segmentation-models">Bonus: simpler comparison of segmentation models</h2> <p>When training models for image segmentation (such as instance segmentation or class segmentation), lack of memory becomes a critical factor. Batch sizes become very small, and it is almost impossible to train several segmentation models at once on a single GPU.</p> <p>During segmentation training each sample contributes a lot, since it provides a lot of labels (one per pixel!).<br /> It is also unlikely that you have thousands of well-labelled high-resolution segmentation images.</p> <p>However, when you train several models inside a single script/notebook, there are no such problems, <em>because you never keep intermediate activations for more than one model at a time</em>. The weights of all models still have to be kept in (GPU) memory, but that’s a small fraction of the space taken by activations.</p> <h2 id="bonus-simple-organization-of-experiments-in-tensorboard">Bonus: simple organization of experiments in tensorboard</h2> <p><img src="/images/model_comparison/folder_organization.png" height="200" /></p> <p>Tensorboard recursively scans subfolders for logs, so you can keep each ‘comparison’ in a separate folder, and each compared option saves its logs to a corresponding subfolder, as in the sketch below.</p> 
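<p>A minimal sketch of that layout (assuming PyTorch’s bundled <code class="language-plaintext highlighter-rouge">SummaryWriter</code>; tensorboardX works the same way, and the folder names are just an example):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from torch.utils.tensorboard import SummaryWriter

# one subfolder per compared option, all under the same 'comparison' folder
writers = {name: SummaryWriter(log_dir=f'logs/unet_vs_fpn/{name}') for name in ('unet', 'fpn')}

for step in range(3):  # stand-in for the real training loop
    for name, writer in writers.items():
        writer.add_scalar('train/loss', 1.0 / (step + 1), global_step=step)
</code></pre></div></div> <p>Pointing tensorboard at <code class="language-plaintext highlighter-rouge">logs/</code> then shows every comparison as its own group of runs.</p>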
<h2 id="alternative-fix-random-seed">Alternative: fix random seed?</h2> <p>I don’t think a fixed random seed is reliable enough to be considered an alternative way to achieve similarity in training.</p> <p>There are many different RNGs provided by different modules, and RNGs are used in too many places, so you would need to precisely control the RNG flow in your program. If some of your functions use global RNGs like <code class="language-plaintext highlighter-rouge">random</code> or <code class="language-plaintext highlighter-rouge">np.random</code> directly, then <em>any</em> side call to those from anywhere in your program changes all subsequently sampled numbers. Any ‘interruption’ in the sequence breaks it. Random numbers on the GPU are a whole other story.</p> <p>So you would have to look through all the augmentations, samplers, dropouts (basically, everything) to verify they don’t use global RNGs (and find that some of them actually do).</p> <p>Long story short, if you <em>have</em> to rely on random seeds in DL, at least log some control sums to verify that the sequence was not broken by an unexpected call from somewhere else.</p> <p>You can still use a random seed to achieve reproducible training of the same model.</p> Tue, 01 Jan 2019 12:00:00 +0000 https://arogozhnikov.github.io/2019/01/01/trick-for-model-comparison.html https://arogozhnikov.github.io/2019/01/01/trick-for-model-comparison.html Machine Learning Engineering Code improvements Einops — a new style of deep learning code <p>Recently I’ve open-sourced <a href="https://github.com/arogozhnikov/einops">einops</a> — a new (and better) way to write deep learning code.</p> <p>Einops introduces a new notation and new operations.</p> <video controls="" autoplay=""> <source src="http://arogozhnikov.github.io/images/einops/einops_video.mp4" type="video/mp4" /> <img src="http://arogozhnikov.github.io/images/einops/einops_video.gif" alt="einops package examples" /> </video> <p>It perfectly complements existing frameworks (pytorch, tensorflow, gluon, chainer, numpy and others), allowing you to write better deep learning code (see <a href="http://arogozhnikov.github.io/einops/pytorch-examples.html">examples for pytorch</a>).</p> <p><a href="https://github.com/arogozhnikov/einops">Einops at Github</a></p> <p>Tutorials: <a href="https://github.com/arogozhnikov/einops/blob/master/docs/1-einops-basics.ipynb">part 1</a> and <a href="https://github.com/arogozhnikov/einops/blob/master/docs/2-einops-for-deep-learning.ipynb">part 2</a>.</p>
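<p>A tiny taste of the notation (a minimal numpy sketch; the tensor shapes are arbitrary):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from einops import rearrange, reduce

images = np.random.rand(16, 30, 40, 3)  # a batch in (batch, height, width, channel) layout

# reorder axes, e.g. for a channels-first framework
chw = rearrange(images, 'b h w c -&gt; b c h w')

# 2x2 mean-pooling written as a reduction over split axes
pooled = reduce(images, 'b (h h2) (w w2) c -&gt; b h w c', 'mean', h2=2, w2=2)
</code></pre></div></div>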