Brilliantly Wrong — Alex Rogozhnikov’s blog about math, machine learning, programming, physics and biology. https://arogozhnikov.github.io/ State of Wall in Protein Language Models in 2026 <p>Last year Pascal Notin wrote a great post summarizing important observations about AI + proteins: <a href="https://pascalnotin.substack.com/p/have-we-hit-the-scaling-wall-for">Have we hit the scaling wall for protein language models?</a>. (Spoiler: the answer is ‘yes’)</p> <p>Briefest summary if you didn’t read it:</p> <ul> <li>PLMs’ performance on fitness prediction (‘transferability’ of skills) plateaus after 1B and declines after 5B parameters. This holds for multiple PLM families</li> <li>leading approaches combine MSAs and 3D structure. Even very simple methods that combine these sources of information outperform billion-parameter models</li> <li>training on genetic sequences (that’s quite a lot of additional signal!) doesn’t help — Evo and Evo-2 are near the bottom of the leaderboard</li> </ul> <blockquote> <p><strong>Remark:</strong> I’ll focus on sequence-based models, and declare folding and inverse folding out-of-scope for this post.</p> </blockquote> <p>New models have appeared on the ProteinGym leaderboard since Pascal’s post, but the conclusions hold. A later analysis from another group corroborates this: <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11601519/">Medium-sized PLMs perform well at transfer learning on realistic datasets</a>. Folding models keep using embeddings from the (very old) ESM-2.</p> <p>We’re in a weird position where we have a lot of sequencing data (and computing power), but we can’t put it to work. Let’s take a tour across recent literature and see if there are any signs of going beyond this scaling wall.</p> <blockquote> <p><strong>Remark:</strong> for comparison, widely used structure models (AlphaFold2 / AlphaFold3 / proteinMPNN) have even fewer than 1B parameters. This could be explained by the smaller size of the PDB compared to UniProt, or maybe it’s just a common trait of molecular biology.</p> </blockquote> <h2 id="amplify-is-scaling-necessary">AMPLIFY: is scaling necessary?</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2024.09.23.614603v2">preprint</a></p> <p>Interestingly, the authors explicitly start by noting that the premise that “scale leads to performance” is likely false in PLMs, and then use recent LLM pretraining techniques to achieve better perplexity than ESM-2 with a cheaper and smaller model.</p> <p><img src="/images/protein_lms/amplify_perplexity.png" alt="perplexity of AMPLIFY" /></p> <p>They explore removing UniProt clustering (used by most models) to increase the size/diversity of the training data. Their main argument: clustering adds too much weight to non-realistic sequences.</p> <p>Validation, interestingly, is a subset of the human proteome — the choice matters because final perplexity rankings are strongly affected by how similar the validation distribution is to the training data.</p>
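<p>To make the metric concrete: for masked PLMs, “perplexity” is usually computed as pseudo-perplexity, masking one position at a time. Below is a minimal sketch, assuming the HuggingFace <code class="language-plaintext highlighter-rouge">transformers</code> masked-LM interface and the public <code class="language-plaintext highlighter-rouge">facebook/esm2_t33_650M_UR50D</code> checkpoint; it is an illustration, not the evaluation code of any paper discussed here.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Pseudo-perplexity of one sequence under a masked protein LM.
# Assumes the HuggingFace transformers interface and a public ESM-2 checkpoint;
# this is an illustrative sketch, not the evaluation code used by the papers above.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def pseudo_perplexity(sequence):
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]  # [1, L+2], with BOS/EOS
    log_probs = []
    with torch.no_grad():
        for pos in range(1, ids.shape[1] - 1):  # skip special tokens
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            logits = model(masked).logits  # [1, L+2, vocab]
            logp = torch.log_softmax(logits[0, pos], dim=-1)[ids[0, pos]]
            log_probs.append(logp.item())
    # exponent of the negative mean per-residue log-likelihood
    return float(torch.exp(-torch.tensor(log_probs).mean()))

print(pseudo_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
</code></pre></div></div>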
<p>Turns out, the quality of sequencing data matters a lot — significant improvements correlate with the largest “clean-ups” in UniProt.</p> <p>Other interesting bits:</p> <ul> <li>AF2 can’t distinguish between non-proteins and disordered proteins (PLMs, of course, can)</li> <li>sequence recovery is very good (a lot of analysis in the supplements)</li> <li>analysis of performance on downstream tasks (like protein properties) is lacking, but this was covered in other papers.</li> </ul> <p>Overall: yes, we can significantly improve perplexity/recovery, and model size isn’t crucial.</p> <h2 id="structure-alignment-of-esm2-and-amplify">Structure-alignment of ESM2 and AMPLIFY</h2> <p><a href="https://arxiv.org/pdf/2505.16896v2">preprint</a></p> <p>Multiple works in this list sprinkle structure tokens into training (and sometimes inference). This work instead uses a CLIP-like contrastive alignment step between PLM token embeddings and protein GNN (GearNet) structure embeddings. A second loss directly predicts structure tokens.</p> <p>This delivers good improvements on contact prediction, fold and secondary structure, but interestingly not so much on downstream tasks (notably, in Table 8 / Figure 10 SaAMPLIFY isn’t better than plain AMPLIFY).</p> <p>SaESM-2 (aligned ESM-2) transfers to downstream tasks better than SaAMPLIFY — again confirming the very poor correlation between perplexity and transferability.</p> <p><img src="/images/protein_lms/sa_proteins_transfer.png" alt="SaESM / SaAMPLIFY transferability" /></p> <h2 id="prosst-quantized-structure-tokens">ProSST: quantized structure tokens</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2024.04.15.589672v3.full.pdf">preprint</a></p> <p>ProSST heads the ProteinGym leaderboard; let’s look at the recipe:</p> <ol> <li>structure tokens are introduced by encoding 40 neighbors</li> <li>attention separately encodes sequence, structure tokens and relative position (the ablation against plain attention shows an unrealistically large improvement; could they have forgotten relpos?)</li> <li>pre-trained on AFDB (18.8M structures selected) using an ESM-style MLM objective</li> </ol> <p>The result is SOTA generalization to downstream tasks. Peak performance is reached at ~110M parameters, and then goes down.</p> <p>The model requires knowing the protein structure at prediction time, which is somewhat limiting. A huge structural database was used, and perplexity still improves with size, but downstream performance does not.</p> <p><img src="/images/protein_lms/proSST_trasnfer.png" alt="proSST transfer" /></p> <h2 id="vespag">VespaG</h2> <p><a href="https://academic.oup.com/bioinformatics/article/40/11/btae621/7907184">paper</a></p> <p>VespaG is a tiny projection on top of ESM-2 embeddings, and achieves SOTA performance among sequence-only models. The trick is to “align” the token embeddings produced by ESM-2 (or another PLM) to MSA-based statistics computed by GEMME.</p>
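<p>The core idea fits in a few lines. Here is a minimal sketch of such a head (dimensions, architecture and loss are my own illustrative assumptions, not VespaG’s published implementation): a small trainable module maps frozen per-residue PLM embeddings to 20 per-substitution scores and is fitted to MSA-derived targets such as GEMME scores.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of a VespaG-style head: frozen PLM embeddings mapped to per-residue substitution scores.
# Dimensions, architecture and loss are illustrative assumptions, not the published model.
import torch
import torch.nn as nn

class SubstitutionScoreHead(nn.Module):
    def __init__(self, embed_dim=1280, n_amino_acids=20, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_amino_acids),  # one score per possible substitution
        )

    def forward(self, residue_embeddings):  # [batch, length, embed_dim]
        return self.net(residue_embeddings)  # [batch, length, 20]

head = SubstitutionScoreHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

def training_step(plm_embeddings, gemme_scores):
    # plm_embeddings: precomputed frozen ESM-2 embeddings, [batch, length, 1280]
    # gemme_scores:   MSA-based target scores,             [batch, length, 20]
    pred = head(plm_embeddings)
    loss = nn.functional.mse_loss(pred, gemme_scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre></div></div>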
<p>From their analysis, again, the highest performance is reached with the 650M ESM-2 and then goes down — mirroring the results of the plain ESM family, with some additional boost in quality.</p> <h2 id="scaling-and-data-saturation-in-protein-language-models">Scaling and Data Saturation in Protein Language Models</h2> <p><a href="https://arxiv.org/pdf/2507.22210">paper</a></p> <p>The paper starts with a nice <a href="https://arxiv.org/abs/2507.00885">reference</a>: in the LLM world, the relation between scaling laws and downstream performance is not direct (likely even less so with RL finetuning strategies).</p> <p>The authors then show how this observation translates to the world of proteins by training a number of AMPLIFY models:</p> <ul> <li>chunk every sequence: training on more chunks from the <em>same</em> sequences consistently improves performance, while adding newer sequences can hurt it</li> <li>when stratifying by MSA depth, proteins with larger MSAs (as measured by Neff/L) tended to show improved prediction performance with later model training years, unlike those with smaller MSAs</li> <li>“when partitioning by functional assay type, proteins evaluated using Organismal Fitness as the readout exhibited the most consistent improvement over time, whereas other categories showed more variable or flat trajectories” — this is reasonable; after all, nature crafts sequences only by fitness</li> </ul> <p>Finally, an experiment with one specific family shows that a supervised dataset can replace a decade of collecting protein data in the wild, so … just collecting sequences in the wild is still useful, but inefficient.</p> <h2 id="training-compute-optimal-protein-language-models">Training Compute-Optimal Protein Language Models</h2> <p><a href="https://proceedings.neurips.cc/paper_files/paper/2024/file/8066ae1446b2bbccb5159587cc3b3bcc-Paper-Conference.pdf">neurips proceedings</a></p> <p>Metagenomic sequences are diverse and abundant, likely a good complement to UniProt — so the authors add ColabFoldDB to the training data.</p> <p>The paper builds a good contrast between MLMs and causal LMs (CLMs): MLMs are efficient but easy to overfit, the opposite of CLMs.</p> <p>They claim that the optimal training recipe is to start with the CLM objective, then switch the loss to MLM; surprisingly, training on both losses at the same time isn’t better. The authors argue that FLOPs-optimal scaling favors larger models (and they train up to 10B parameters). Results are mixed:</p> <ul> <li>transfer to downstream tasks isn’t impressive</li> <li>contact prediction: minor fine-tuning of a ~1B model achieves higher quality than larger models</li> </ul> <p>Interesting observation: BERT’s 15% masking ratio (used in ESMs) is still a good choice for protein MLMs.</p> <h2 id="ankh3-combining-sequence-denoising-and-completion">Ankh3: combining sequence denoising and completion</h2> <p><a href="https://arxiv.org/pdf/2505.20052">preprint</a></p> <p>This paper stands out because (1) it shows a good improvement in contact prediction and (2) the 6B model is overall better than the 2B model.</p> <p>The model is jointly optimized on two objectives: encoder-decoder protein completion and MLM denoising (with 15%, 20% or 50% masking probability; apparently short spans were masked, not individual tokens). Both points contradict the previous paper in this list — this could be a result of the encoder-decoder architecture.</p>
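<p>For reference, the difference between per-token masking and short-span masking is easy to state in code. The sketch below is a generic illustration with made-up parameters (mask symbol, span length), not Ankh3’s exact corruption scheme.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Per-token masking vs. short-span masking on a protein sequence.
# A generic illustration with made-up parameters, not Ankh3's exact corruption scheme.
import random

MASK = "#"  # stand-in for the mask token

def mask_tokens(seq, mask_prob=0.15):
    # BERT/ESM-style: each residue is masked independently
    return "".join(MASK if random.random() &lt; mask_prob else aa for aa in seq)

def mask_spans(seq, mask_prob=0.15, span_len=3):
    # span-style: choose span starts so that roughly mask_prob of residues get masked
    out = list(seq)
    n_spans = max(1, int(len(seq) * mask_prob / span_len))
    for _ in range(n_spans):
        start = random.randrange(0, max(1, len(seq) - span_len))
        for i in range(start, start + span_len):
            out[i] = MASK
    return "".join(out)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(mask_tokens(seq))
print(mask_spans(seq))
</code></pre></div></div>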
<p>The preprint leaves many questions unanswered:</p> <ul> <li>the model is deep (72 layers), so it could simply be inefficient</li> <li>evaluation is limited to datasets without an easy ‘leaderboard’ to estimate downstream performance.</li> <li>I’m a bit concerned that ESM-2 and Ankh results were “sourced from the Ankh paper” instead of being reproduced.</li> </ul> <h2 id="progen3-scaling-unlocks-broader-generation-and-deeper-functional-understanding-of-proteins">ProGen3: Scaling Unlocks Broader Generation and Deeper Functional Understanding of Proteins</h2> <p><a href="https://www.biorxiv.org/content/10.1101/2025.04.15.649055v1">preprint</a></p> <ol> <li>Employs a huge curated dataset (PPA-1) that combines genomic and metagenomic sources and excludes fragments.</li> <li>The model is trained on left-to-right, right-to-left and span-infilling objectives (finally!), then aligned on downstream tasks using IRPO, a modification of DPO.</li> </ol> <p>Results: non-aligned performance frequently peaks at ~3B, while aligned performance usually still improves. Larger models can generate proteins from more clusters, with tiny improvements in expression.</p> <p>Exact numbers on ProteinGym aren’t impressive, but the overall dynamics after alignment look encouraging.</p> <h2 id="dplm-1--dplm-2--esm-3"><a href="https://arxiv.org/abs/2402.18567">DPLM-1</a> / <a href="https://arxiv.org/abs/2410.13782">DPLM-2</a> / <a href="https://www.science.org/doi/10.1126/science.ads0018">ESM-3</a></h2> <p>These models were trained with a sufficient amount of structural information in the form of structure tokens.</p> <p>DPLM-1 achieves better downstream performance on multiple tasks with its 3B model (no larger model was analyzed), but DPLM-2 (with a primary focus on structure tokens based on LFQ) reports only a 650M model — I treat this as an implicit signal of a scaling boundary. Interestingly, DPLM-2 shows worse downstream performance, and the authors link this to the missing PLM pretraining in DPLM-2.</p> <p>A combination of scaling + PLM pretraining + better structure tokens would be very interesting, but this hasn’t happened yet with DPLMs (or it happened and the result wasn’t good enough for publication).</p> <p>ESM-3 is somewhat close, but the authors don’t report any actually translatable properties of the model; performance reported on ProteinGym isn’t impressive, and ESM-C 300M performs similarly to ESM-C 600M.</p> <h2 id="msa-as-a-context-for-plms">MSA as a context for PLMs</h2> <p>MSA-based models (like MsaPairformer) show better transferability than PLMs (while being smaller).</p> <p>The PoET model started a direction in PLMs where homologous sequences are passed as context while the architecture remains a classical transformer.</p> <p>This direction inherits the weak sides of both PLMs and MSA-based models: (1) one still has to retrieve MSAs, (2) alignment has to be done by the model implicitly, (3) there are more weights than in MSA-based models, and (4) long+deep MSAs are expensive because of quadratic attention.</p> <p>One paper from this family (<a href="https://www.biorxiv.org/content/10.1101/2025.11.12.688125v1.abstract">Profluent E1</a>, also trained on PPA-1) claims good performance on ProteinGym and contact prediction (better than MsaPairformer and other PLMs) and shows positive scaling … up to 600M. From the plots I’d expect further improvement on contact prediction, but not on downstream tasks. Given the cost of training, it isn’t surprising that the largest model is only 600M.</p>
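<p>MSA depth came up twice above: the Neff/L stratification in the data-saturation paper, and the cost of passing long, deep MSAs as context. For concreteness, here is one common way to compute Neff/L, sketched for a toy aligned MSA with an 80% identity threshold; exact thresholds and weighting schemes differ between papers.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Effective number of sequences (Neff) of an MSA, normalized by length (Neff/L).
# One common definition (neighbor weighting at 80% identity); details vary between papers.
def neff_per_length(msa, identity_threshold=0.8):
    # msa: list of equal-length aligned sequences, gaps as '-'
    length = len(msa[0])
    weights = []
    for seq_i in msa:
        n_neighbors = 0
        for seq_j in msa:
            matches = sum(a == b for a, b in zip(seq_i, seq_j))
            if matches / length &gt;= identity_threshold:
                n_neighbors += 1  # includes the sequence itself
        weights.append(1.0 / n_neighbors)
    return sum(weights) / length

toy_msa = [
    "MKTAYIAK-RQIS",
    "MKTAYIAKQRQIS",
    "MRTAYLAKQRELS",
]
print(neff_per_length(toy_msa))
</code></pre></div></div>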
<h1 id="final-thoughts--directions">Final thoughts / directions</h1> <p>Multiple years of research on PLMs have not produced a recognized recipe for utilizing the vast sequencing data. Recent literature contains some interesting hints, but no strong hypotheses for how to do this. PLMs increasingly incorporate structural or MSA features, which pushes performance up; model size still mostly doesn’t matter.</p> <p>PLMs started from the assumption that better perplexity means an overall better understanding of protein sequences, as it worked in NLP. This assumption is wrong, and likely it isn’t true in NLP either: longer training on natural language worked because pretty much any reasonable problem was already discussed, with examples, in the training data. Later progress in NLP was driven by numerous problem-oriented curated datasets; scaling only helped with storing knowledge/patterns in the model.</p> <p>If, in addition to protein sequences, the training data contained various tokens related to expression, function, interaction, biophysical properties, etc., then all those metrics would go up. Protein sequences alone don’t provide enough training signal. Can correlation with other genes from the same organism provide a more useful context? Can a functional description form a better prompt? Some teams are working on this, so we’ll see soon.</p> <p>Is there a double descent in biology? Given the size of ESM-3, I’ll take this hypothesis off the table.</p> <p>Are we memorizing phylogenetic noise? Almost surely yes. Larger models can generate proteins from more families (as shown by E1), while the best property prediction is still provided by analysis of MSAs (within the same family).</p> <p>Maybe nature does not care much about <em>our</em> downstream tasks. Maybe not much memory is needed to store everything useful in biology (we’re far from optimal performance, so probably not).</p> <p>A simple but likely more fruitful direction at this point would be to curate a large dataset with diverse downstream properties.</p> <p><strong>Confounding factors?</strong> We don’t accept assay results at face value, but we generally assume that protein sequences are free of confounding effects (except for phylogenetic noise). In <a href="https://arxiv.org/pdf/2512.20924">“Clever Hans in Chemistry”</a> the authors show that models can guess the author of a molecule; knowing the author, they can guess the activity without looking at the molecule itself. <em>Could similar cues appear in non-frequent sequences?</em> Like the sequencing technology, or the assembly method? This is yet another hypothesis for why we don’t see generalization.</p> Sun, 01 Feb 2026 12:00:00 +0000 https://arogozhnikov.github.io/2026/02/01/protein-lms.html protein language models deep learning Fastest Autograd in the West <p>Who needs fast autograd? Seemingly everyone these days!</p> <p>And once upon a time I needed an autograd that is <strong>actually fast</strong>. 
Leaving project details aside, here are the requirements:</p> <ul> <li>we test many computation graphs (graph is changing constantly)</li> <li>many-many scalar operations with roughly <strong>10k—100k nodes</strong> in each graph</li> <li>every graph should be compiled and ran around <strong>10k times</strong> both forward and backward</li> <li>this should be done <strong>wicked fast</strong>, and with a convenient pythonic interface</li> </ul> <p>Path that awaits us ahead:</p> <ol> <li>autograd in torch</li> <li>autograd in jax</li> <li>autograd in python</li> <li>autograd in rust</li> <li>autograd in C</li> <li>autograd in assembly</li> </ol> <p>Plus a significant amount of sloppy code and timings on M1 macbook.</p> <h3 id="lets-autograd-in-pytorch">Let’s autograd in pytorch</h3> <p>We start our journey with pytorch — the default autograd engine in research. We’ll create a graph with many nodes, and to keep things simple our benchmark has only several kinds of operations: unary (softplus), binary (multiplication), n-ary (sum) and n-to-n (softmax).</p> <p>This allows using just a few operations, but resembles a realistic load. All benchmarks in this post will reimplement the same logic as below.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">initial_variables</span><span class="p">]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">F</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">])</span> <span 
class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:],</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">return</span> <span class="n">nodes</span> <span class="k">def</span> <span class="nf">run_benchmark_pytorch</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">):</span> <span class="n">init_vars</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="n">run_graph</span><span class="p">(</span> <span class="n">initial_variables</span><span class="o">=</span><span class="n">init_vars</span><span class="p">,</span> <span class="n">n_operations</span><span class="o">=</span><span class="n">n_operations</span><span class="p">,</span> <span class="p">)</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">backward</span><span class="p">()</span> </code></pre></div></div> <p>Run-time for 10k ops x 100 iterations: 11.3 seconds <br />Run-time for 10k ops x 10k iterations: <strong>1130 seconds</strong> (estimate)</p> <p>Given we created 100M python objects, it’s actually quite fast. And yes, that’s not going to deliver an interactive experience.</p> <p>Let’s also discuss <code class="language-plaintext highlighter-rouge">torch.compile</code>, a major innovation in pytorch 2.0.</p> <p>At 100 operations torch.compile takes 4.5 seconds. Execution gets faster: for 100 operations and 10k iterations it takes 4.52 seconds with torch.compile and 10.4 seconds without. Compilation + execution are still in the same ballpark. For bigger graphs (1k operations) <code class="language-plaintext highlighter-rouge">torch.compile</code> crashes.</p> <h3 id="lets-autograd-in-jax">Let’s autograd in jax</h3> <p>Jax is the new cool kid… well, not that new anymore. But in some aspects it is very interesting. 
Jax’s focus on JIT-compiling static graphs is very suitable for the problem at hand.</p> <p>Implementation for benchmark is similar to pytorch:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">jax</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="k">def</span> <span class="nf">run_graph_jax</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="o">*</span><span class="n">initial_variables</span><span class="p">]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">]</span> <span class="o">*</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">])</span> <span class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">numpy</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:]),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">return</span> <span class="n">nodes</span><span class="p">[</span><span 
class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="n">run_graph_and_grad</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">value_and_grad</span><span class="p">(</span><span class="n">run_graph_jax</span><span class="p">)</span> <span class="c1"># or </span><span class="n">run_graph_and_grad</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">jit</span><span class="p">(</span><span class="n">jax</span><span class="p">.</span><span class="n">value_and_grad</span><span class="p">(</span><span class="n">run_graph_jax</span><span class="p">))</span> </code></pre></div></div> <p>Without jit computations are extremely slow: <br /> 1k ops x 10 iterations =&gt; 15.9 seconds <br /> 10k ops x 10k iterations =&gt; 159,000 seconds (estimate)</p> <p>That’s a bit longer than forever! But whole point of jax is to JIT-compile stuff. So let’s do it.</p> <p>jit: compilation of 1k ops = 47 seconds <br /> jit: run-time for 1k ops x 10k iterations = 0.66 seconds <br /> jit: 10k ops x 10k iterations (compilation + run-time) =&gt; <strong>470 seconds</strong> (estimate)</p> <p>Speed up in execution time is more than impressive, but we spend &gt;99% of time compiling.</p> <h4 id="tensorflow">Tensorflow</h4> <p>Someone will mention TF anyway. I’ll leave this as an exercise for you, TF fans.</p> <h3 id="lets-autograd-in-python">Let’s autograd in python</h3> <p>Done with baselines, time to see if we can speed things up.</p> <p>Let’s create a simplistic pseudo-framework and see how it competes with previous candidates. We’ll implement a tape-like autograd where operations order is explicitly tracked in a tape.</p> <details> <summary class="code-summary">show autograd engine in plain python </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">NaiveVar</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">val</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span> <span class="o">=</span> <span class="n">val</span> <span class="bp">self</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="mf">0.</span> <span class="k">class</span> <span class="nc">NaiveTape</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_values</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">val</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span 
class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'sum'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">prod</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="n">var1</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">var2</span><span class="p">.</span><span class="n">val</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'prod'</span><span class="p">,</span> <span class="p">[</span><span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">],</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span><span class="p">.</span><span class="n">val</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">]</span> <span class="n">maxval</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="o">-</span> <span class="n">maxval</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">)</span> <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="n">NaiveVar</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">/</span> <span class="n">denom</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'softmax'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">denom</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softplus</span><span 
class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="n">NaiveVar</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">var</span><span class="p">.</span><span class="n">val</span><span class="p">)))</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="s">'splus'</span><span class="p">,</span> <span class="n">var</span><span class="p">,</span> <span class="n">res</span><span class="p">))</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="k">assert</span> <span class="n">var</span><span class="p">.</span><span class="n">grad</span> <span class="o">==</span> <span class="mi">0</span> <span class="n">var</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="mi">1</span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="n">out</span> <span class="o">=</span> <span class="n">outputs</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">out</span> <span class="o">=</span> <span class="n">outputs</span> <span class="n">in1</span><span class="p">,</span> <span class="n">in2</span> <span class="o">=</span> <span class="n">inputs</span> <span class="n">in1</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">in2</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">in2</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">in1</span><span class="p">.</span><span class="n">val</span> <span class="o">*</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">inputs</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span 
class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">inputs</span><span class="p">.</span><span class="n">val</span><span class="p">))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="k">pass</span> <span class="c1"># skip for now </span> <span class="n">case</span> <span class="n">_</span><span class="p">:</span> <span class="k">raise</span> <span class="nb">NotImplementedError</span><span class="p">()</span> </code></pre></div> </div> </details> <p>and reimplement reference task using our new pseudo-framework:</p> <details> <summary class="code-summary">show benchmarking code </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph_python_and_backward</span><span class="p">(</span><span class="n">initial_variables</span><span class="p">,</span> <span class="n">n_operations</span><span class="p">):</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="n">NaiveVar</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">initial_variables</span><span class="p">]</span> <span class="n">tape</span> <span class="o">=</span> <span class="n">NaiveTape</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">prod</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">],</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">3</span><span 
class="p">:</span> <span class="c1"># softmax </span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:]))</span> <span class="n">tape</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="k">return</span> <span class="n">tape</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops and 10k iterations: <strong>312 seconds</strong>.</p> <p>Expectably not fast. But compared to previous candidates, that’s actually quite competitive!</p> <h3 id="lets-autograd-in-python-again">Let’s autograd in python, again</h3> <p>This time we move all values into tape instead of keeping in variables. Additionally tape will keep a ‘static graph’ of computations by recording indices of variables participating in every operation.</p> <details> <summary class="code-summary">show code for autograd in plain python </summary> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="nn">math</span> <span class="k">class</span> <span class="nc">VarInd</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">index</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="n">index</span> <span class="c1"># variable is just a unique index in tape </span> <span class="k">class</span> <span class="nc">TapeInd</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span> <span class="o">=</span> <span class="p">[]</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># flat memory with values </span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># flat memory with gradients </span> <span class="k">def</span> <span class="nf">make_var</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="mf">0.</span><span class="p">)</span> <span class="k">return</span> <span class="n">VarInd</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span 
class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="k">def</span> <span class="nf">val</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">v</span><span class="p">:</span> <span class="n">VarInd</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">[</span><span class="n">v</span><span class="p">.</span><span class="n">index</span><span class="p">]</span> <span class="k">def</span> <span class="nf">add_op</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">kls</span><span class="p">,</span> <span class="n">input_vars</span><span class="p">,</span> <span class="n">output_vars</span><span class="p">):</span> <span class="c1"># translate variable to indices. self.ops keeps only indices </span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">kls</span><span class="p">,</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">index</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">input_vars</span><span class="p">],</span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">index</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">output_vars</span><span class="p">]))</span> <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="nb">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'sum'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">prod</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var1</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var2</span><span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span 
class="s">'prod'</span><span class="p">,</span> <span class="p">[</span><span class="n">var1</span><span class="p">,</span> <span class="n">var2</span><span class="p">],</span> <span class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="nb">vars</span><span class="p">):</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">vars</span><span class="p">]</span> <span class="n">maxval</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">vals</span><span class="p">)</span> <span class="n">vals</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span> <span class="o">-</span> <span class="n">maxval</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">)</span> <span class="n">res</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">/</span> <span class="n">denom</span> <span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vals</span><span class="p">]</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">,</span> <span class="nb">vars</span><span class="p">,</span> <span class="n">res</span><span class="p">)</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">softplus</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">var</span><span class="p">):</span> <span class="n">res</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">val</span><span class="p">(</span><span class="n">var</span><span class="p">))</span> <span class="p">))</span> <span class="bp">self</span><span class="p">.</span><span class="n">add_op</span><span class="p">(</span><span class="s">'splus'</span><span class="p">,</span> <span class="p">[</span><span class="n">var</span><span class="p">],</span> <span 
class="p">[</span><span class="n">res</span><span class="p">])</span> <span class="k">return</span> <span class="n">res</span> <span class="k">def</span> <span class="nf">forward_backward_external</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad_var</span><span class="p">:</span> <span class="n">VarInd</span><span class="p">):</span> <span class="k">return</span> <span class="n">forward_backward_optimal</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">vals</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">grads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">ops</span><span class="p">,</span> <span class="n">grad_var_index</span><span class="o">=</span><span class="n">grad_var</span><span class="p">.</span><span class="n">index</span><span class="p">)</span> <span class="k">def</span> <span class="nf">forward_backward_external</span><span class="p">(</span> <span class="n">vals</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">grads</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">ops</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]]],</span> <span class="n">grad_var_index</span><span class="p">:</span> <span class="nb">int</span> <span class="p">):</span> <span class="n">v</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span> <span class="n">g</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">grads</span> <span class="c1"># forward pass </span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">ins</span><span class="p">,</span> <span class="n">outs</span> <span class="ow">in</span> <span class="n">ops</span><span class="p">:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">)</span> <span class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span 
class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log1p</span><span class="p">(</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span> <span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="p">))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="n">maximal</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">)</span> <span class="n">exps</span> <span class="o">=</span> <span class="p">[</span><span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">maximal</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">]</span> <span class="n">denom</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">outs</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">exp</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">outs</span><span class="p">,</span> <span class="n">exps</span><span class="p">):</span> <span class="n">v</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span> <span class="o">/</span> <span class="n">denom</span> <span class="n">g</span><span class="p">[</span><span class="n">grad_var_index</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> <span class="c1"># backward pass </span> <span class="k">for</span> <span class="n">op</span><span class="p">,</span> <span class="n">ins</span><span class="p">,</span> <span class="n">outs</span> <span class="ow">in</span> <span class="n">ops</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span> <span class="n">match</span> <span class="n">op</span><span class="p">:</span> <span class="n">case</span> <span class="s">'sum'</span><span class="p">:</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ins</span><span class="p">:</span> <span class="n">g</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">g</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span 
class="n">case</span> <span class="s">'prod'</span><span class="p">:</span> <span class="n">out</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">in1</span><span class="p">,</span> <span class="n">in2</span> <span class="o">=</span> <span class="n">ins</span> <span class="n">g</span><span class="p">[</span><span class="n">in1</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">in2</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">out</span><span class="p">]</span> <span class="n">g</span><span class="p">[</span><span class="n">in2</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">in1</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">out</span><span class="p">]</span> <span class="n">case</span> <span class="s">'splus'</span><span class="p">:</span> <span class="n">g</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">g</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">v</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]))</span> <span class="n">case</span> <span class="s">'softmax'</span><span class="p">:</span> <span class="n">avg_grad</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">g</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">outs</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">ins</span><span class="p">,</span> <span class="n">outs</span><span class="p">):</span> <span class="n">g</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">v</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">g</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">)</span> </code></pre></div> </div> <p>and corresponding launching code</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_graph_python_and_backward</span><span class="p">(</span><span class="n">n_operations</span><span class="p">,</span> <span class="n">n_iterations</span><span class="p">):</span> 
<span class="n">tape</span> <span class="o">=</span> <span class="n">TapeInd</span><span class="p">()</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="n">tape</span><span class="p">.</span><span class="n">make_var</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)]</span> <span class="k">for</span> <span class="n">op</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_operations</span><span class="p">):</span> <span class="n">match</span> <span class="n">op</span> <span class="o">%</span> <span class="mi">4</span><span class="p">:</span> <span class="n">case</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># softplus </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">softplus</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">1</span><span class="p">:</span> <span class="c1"># sum </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">30</span><span class="p">:</span><span class="o">-</span><span class="mi">10</span><span class="p">:</span><span class="mi">5</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">2</span><span class="p">:</span> <span class="c1"># prod </span> <span class="n">nodes</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tape</span><span class="p">.</span><span class="n">prod</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">20</span><span class="p">],</span> <span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">10</span><span class="p">]))</span> <span class="n">case</span> <span class="mi">3</span><span class="p">:</span> <span class="c1"># softmax </span> <span class="n">softmaxes</span> <span class="o">=</span> <span class="n">tape</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="o">*</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">4</span><span class="p">:])</span> <span class="n">nodes</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">softmaxes</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iterations</span><span class="p">):</span> <span class="n">tape</span><span class="p">.</span><span class="n">forward_backward</span><span class="p">(</span><span class="n">nodes</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> 
</code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>94 seconds</strong></p> <p>As we see, moving all values into the tape and switching to operating on indices is quite an efficient strategy. We still use python, but are now ~5-10 fold faster than <code class="language-plaintext highlighter-rouge">pytorch</code> or <code class="language-plaintext highlighter-rouge">jax</code>.</p> <p>At this point, I want to mention one more experiment: the code above is organized to be <code class="language-plaintext highlighter-rouge">numba</code>-friendly. <a href="https://numba.readthedocs.io/en/stable/">Numba</a> is famous for speeding up number crunching in python with minimal changes by providing just-in-time compilation. The recent addition of <code class="language-plaintext highlighter-rouge">numba.typed.List</code> makes it possible to efficiently handle lists of lists.</p> <p>Run-time with numba, 10k ops x 10k iterations: <strong>41 seconds</strong>. <br /> At this point we’re &gt;10-fold faster than jax/pytorch (and still writing code in python).</p> <h3 id="lets-autograd-in-rust">Let’s autograd in rust</h3> <p>Once graph tracking is moved to the tape, we can use something fast to run the computations for us. For instance, rust. For rust↔python interop I’ve used a small wrapper around <a href="https://github.com/mityax/rustimport">rustimport</a>. <code class="language-plaintext highlighter-rouge">Rustimport</code> makes it possible to conveniently “import” a single rust file without creating a full-fledged rust project.</p> <p>Some optimization remarks:</p> <ul> <li><code class="language-plaintext highlighter-rouge">softmax</code> was a bottleneck, so I switched to creating temporary arrays on the stack instead of Vecs, which required specializing on input sizes</li> <li>I followed the rust-y approach with iterators to reduce the number of bounds checks</li> <li>I wondered whether a match with multiple options checked one-by-one is slow. In synthetic tests it seemed relatively fast, but I wish jump table optimization were implemented here (e.g. 
it is supported for <a href="https://users.rust-lang.org/t/match-statement-efficiency/4488">enums</a> in rust, and clang <a href="https://stackoverflow.com/questions/60109992/why-is-a-switch-not-optimized-the-same-way-as-chained-if-else-in-c-c">uses</a> this optimization in C for switch-case)</li> </ul> <details> <summary class="code-summary">show rust code for minimal autograd </summary> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// rustimport:pyo3</span> <span class="k">use</span> <span class="nn">pyo3</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span> <span class="c1">// slower softmax version for larger number of inputs</span> <span class="k">fn</span> <span class="nf">softmax_varlength</span><span class="p">(</span><span class="n">vals</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">],</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">])</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">max</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20_f32</span><span class="p">;</span> <span class="k">let</span> <span class="n">loc_vals</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="p">{</span> <span class="k">let</span> <span class="n">x</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">];</span> <span class="n">max</span> <span class="o">=</span> <span class="n">max</span><span class="nf">.max</span><span class="p">(</span><span class="n">x</span><span class="p">);</span> <span class="n">x</span><span class="p">}</span> <span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.0_f32</span><span class="p">;</span> <span class="k">let</span> <span class="n">exps</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">loc_vals</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">v</span><span class="p">|</span> <span class="p">{</span><span class="k">let</span> <span class="n">_exp</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="o">*</span><span class="n">v</span> <span class="o">-</span> <span class="n">max</span><span class="p">);</span> <span 
class="n">sum</span> <span class="o">+=</span> <span class="n">_exp</span><span class="p">;</span> <span class="n">_exp</span><span class="p">})</span><span class="nf">.collect</span><span class="p">();</span> <span class="n">outs</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.zip</span><span class="p">(</span><span class="n">exps</span><span class="nf">.iter</span><span class="p">())</span><span class="nf">.for_each</span><span class="p">(|(</span><span class="n">j</span><span class="p">,</span> <span class="n">exp</span><span class="p">)|</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span> <span class="o">/</span> <span class="n">sum</span> <span class="p">);</span> <span class="p">}</span> <span class="c1">// vecs are slow! so allocate slices on stack, and explicit grouping of computations also helps</span> <span class="k">fn</span> <span class="n">softmax</span><span class="o">&lt;</span><span class="k">const</span> <span class="n">N</span><span class="p">:</span> <span class="nb">usize</span><span class="o">&gt;</span><span class="p">(</span><span class="n">vals</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">],</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="nb">usize</span><span class="p">])</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">loc_vals</span><span class="p">:</span> <span class="p">[</span><span class="nb">f32</span><span class="p">;</span> <span class="n">N</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0_f32</span><span class="p">;</span> <span class="n">N</span><span class="p">];</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">exps</span><span class="p">:</span> <span class="p">[</span><span class="nb">f32</span><span class="p">;</span> <span class="n">N</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0_f32</span><span class="p">;</span> <span class="n">N</span><span class="p">];</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">max</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20_f32</span><span class="p">;</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">sum</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="k">let</span> <span class="n">v</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">];</span> <span 
class="n">loc_vals</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">v</span><span class="p">;</span> <span class="n">max</span> <span class="o">=</span> <span class="n">max</span><span class="nf">.max</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="p">}</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">_i</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="k">let</span> <span class="n">exp</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="n">loc_vals</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">-</span> <span class="n">max</span><span class="p">);</span> <span class="n">exps</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span><span class="p">;</span> <span class="n">sum</span> <span class="o">+=</span> <span class="n">exp</span><span class="p">;</span> <span class="p">}</span> <span class="k">let</span> <span class="n">invsum</span> <span class="o">=</span> <span class="mf">1.0_f32</span> <span class="o">/</span> <span class="n">sum</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="k">in</span> <span class="n">outs</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">()</span> <span class="p">{</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">exps</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">*</span> <span class="n">invsum</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">fn</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">f32</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f32</span> <span class="p">{</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span><span class="nf">.exp</span><span class="p">())</span> <span class="p">}</span> <span class="nd">#[pyfunction]</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">autograd</span><span class="p">(</span> <span class="n">vals_input</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">ops</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">i32</span><span class="o">&gt;</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">Vec</span><span 
class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;&gt;</span><span class="p">,</span> <span class="n">output_ids</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;&gt;</span><span class="p">,</span> <span class="n">backward_node_id</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">n_iteration</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span> <span class="p">)</span> <span class="k">-&gt;</span> <span class="p">(</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">,</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span><span class="p">)</span> <span class="p">{</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">vals</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">vals_input</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">x</span><span class="p">|</span> <span class="o">*</span><span class="n">x</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">let</span> <span class="k">mut</span> <span class="n">grad</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">f32</span><span class="o">&gt;</span> <span class="o">=</span> <span class="n">vals_input</span><span class="nf">.into_iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">_</span><span class="p">|</span> <span class="mf">0.0_f32</span><span class="p">)</span><span class="nf">.collect</span><span class="p">();</span> <span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">n_iteration</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="n">i_op</span><span class="p">,</span> <span class="n">op</span><span class="p">)</span> <span class="k">in</span> <span class="n">ops</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">(){</span> <span class="k">let</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">input_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">let</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">output_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">match</span> <span class="n">op</span> <span class="p">{</span> <span class="mi">0</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softplus</span> <span class="k">let</span> 
<span class="n">x</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="k">let</span> <span class="n">max</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">max</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span> <span class="k">let</span> <span class="n">min</span> <span class="o">=</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">min</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">max</span> <span class="o">+</span> <span class="nn">f32</span><span class="p">::</span><span class="nf">ln_1p</span><span class="p">(</span><span class="nn">f32</span><span class="p">::</span><span class="nf">exp</span><span class="p">(</span><span class="n">min</span> <span class="o">-</span> <span class="n">max</span><span class="p">));</span> <span class="p">}</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// sum</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="n">vals</span><span class="nf">.get_unchecked</span><span class="p">(</span><span class="o">*</span><span class="n">i</span><span class="p">))</span><span class="nf">.sum</span><span class="p">();</span> <span class="p">}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// prod</span> <span class="n">vals</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="p">}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softmax. 
we will need switch-case resolution here for most common cases</span> <span class="k">match</span> <span class="n">ins</span><span class="nf">.len</span><span class="p">()</span> <span class="p">{</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">2</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">3</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">4</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">4</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="mi">5</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nn">softmax</span><span class="p">::</span><span class="o">&lt;</span><span class="mi">5</span><span class="o">&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span><span class="nf">softmax_varlength</span><span class="p">(</span><span class="o">&amp;</span><span class="k">mut</span> <span class="n">vals</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ins</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">outs</span><span class="p">)}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="nd">panic!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="n">grad</span><span class="p">[</span><span class="n">backward_node_id</span><span 
class="p">]</span> <span class="o">=</span> <span class="mf">1.</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">i_op</span><span class="p">,</span> <span class="n">op</span><span class="p">)</span> <span class="k">in</span> <span class="n">ops</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.enumerate</span><span class="p">(){</span> <span class="k">let</span> <span class="n">ins</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">input_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">let</span> <span class="n">outs</span><span class="p">:</span> <span class="o">&amp;</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">usize</span><span class="o">&gt;</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">output_ids</span><span class="p">[</span><span class="n">i_op</span><span class="p">];</span> <span class="k">match</span> <span class="n">op</span> <span class="p">{</span> <span class="mi">0</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softplus</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="p">}</span> <span class="mi">1</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// sum</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.for_each</span><span class="p">(|</span><span class="n">i</span><span class="p">|</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="p">}</span> <span class="mi">2</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// prod</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="n">grad</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grad</span><span class="p">[</span><span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span 
class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="p">}</span> <span class="mi">3</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// softmax</span> <span class="k">let</span> <span class="n">avg_grad</span><span class="p">:</span> <span class="nb">f32</span> <span class="o">=</span> <span class="n">outs</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.map</span><span class="p">(|</span><span class="n">j</span><span class="p">|</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="p">)</span><span class="nf">.sum</span><span class="p">();</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">)</span> <span class="k">in</span> <span class="n">ins</span><span class="nf">.iter</span><span class="p">()</span><span class="nf">.zip</span><span class="p">(</span><span class="n">outs</span><span class="nf">.iter</span><span class="p">())</span> <span class="p">{</span> <span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">vals</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">grad</span><span class="p">[</span><span class="o">*</span><span class="n">j</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="n">_</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="nd">panic!</span><span class="p">(</span><span class="s">""</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="p">(</span><span class="n">vals</span><span class="p">,</span> <span class="n">grad</span><span class="p">)</span> <span class="p">}</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>1.4 seconds</strong></p> <p>Success: we are in the realm of interactive experiences. <br /> Recall we started from &gt;1000 seconds. But should we stop here?</p> <h3 id="lets-autograd-in-c">Let’s autograd in C</h3> <p>Time to implement autograd logic in C. For interop with python I use <a href="https://cffi.readthedocs.io/en/stable/index.html">python-cffi</a>.</p> <p>I went bananas on optimization:</p> <ul> <li>I used the fact that output nodes are placed consequentially in memory, so we pass only index of the first output</li> <li>number of inputs is limited to 8, and those are baked into struct as <code class="language-plaintext highlighter-rouge">int[8]</code>, not <code class="language-plaintext highlighter-rouge">int *</code> to avoid jumps in memory</li> <li>dynamic stack allocations of variable size (compared to rust, those are straightforward in C)</li> <li><code class="language-plaintext highlighter-rouge">-O3</code>, and unsafe math: <code class="language-plaintext highlighter-rouge">-ffast-math</code>. 
Even experimented memory alignment and restrict-ing pointers, but no luck</li> </ul> <details> <summary class="code-summary">show me some code in C </summary> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;math.h&gt;</span><span class="cp"> </span> <span class="k">typedef</span> <span class="k">struct</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">opcode</span><span class="p">;</span> <span class="kt">size_t</span> <span class="n">n_arguments</span><span class="p">;</span> <span class="c1">// used for softmax and sum</span> <span class="kt">int</span> <span class="n">ins</span><span class="p">[</span><span class="mi">8</span><span class="p">];</span> <span class="c1">// at most 8 inputs</span> <span class="kt">int</span> <span class="n">out</span><span class="p">;</span> <span class="c1">// points to the first output variable</span> <span class="p">}</span> <span class="n">MyOperation</span><span class="p">;</span> <span class="n">MyOperation</span> <span class="o">*</span> <span class="n">allocate_memory</span><span class="p">(</span><span class="kt">int</span> <span class="n">n_elements</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="p">(</span><span class="n">MyOperation</span> <span class="o">*</span><span class="p">)</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">MyOperation</span><span class="p">)</span> <span class="o">*</span> <span class="n">n_elements</span><span class="p">);</span> <span class="p">}</span> <span class="c1">// stable implementation</span> <span class="kt">double</span> <span class="n">logaddexp</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">,</span> <span class="kt">double</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">log1p</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">x</span><span class="p">));</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="k">return</span> <span class="n">y</span> <span class="o">+</span> <span class="n">log1p</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="p">));</span> <span class="p">}</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">sigmoid</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">));</span> <span class="p">}</span> <span class="kt">void</span> <span class="n">run_multiple_passes</span><span class="p">(</span> <span class="kt">int</span> <span 
class="n">n_operations</span><span class="p">,</span> <span class="n">MyOperation</span> <span class="o">*</span><span class="n">ops</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">values</span><span class="p">,</span> <span class="kt">double</span> <span class="o">*</span><span class="n">grads</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n_iterations</span> <span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">iteration</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">iteration</span> <span class="o">&lt;</span> <span class="n">n_iterations</span><span class="p">;</span> <span class="n">iteration</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">operation</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">operation</span> <span class="o">&lt;</span> <span class="n">n_operations</span><span class="p">;</span> <span class="n">operation</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">MyOperation</span> <span class="n">op</span> <span class="o">=</span> <span class="n">ops</span><span class="p">[</span><span class="n">operation</span><span class="p">];</span> <span class="k">switch</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">opcode</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">logaddexp</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">out</span> <span class="o">=</span> <span class="mf">0.</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">out</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span> <span class="p">}</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">out</span><span class="p">;</span> <span class="p">}</span> <span 
class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">4</span><span class="p">:</span> <span class="p">{</span> <span class="kt">double</span> <span class="n">maximal</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1e20</span><span class="p">;</span> <span class="kt">size_t</span> <span class="n">n_arg</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">maximal</span> <span class="o">=</span> <span class="n">fmax</span><span class="p">(</span><span class="n">maximal</span><span class="p">,</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]);</span> <span class="p">}</span> <span class="kt">double</span> <span class="n">exps</span><span class="p">[</span><span class="n">n_arg</span><span class="p">];</span> <span class="kt">double</span> <span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exp</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">maximal</span><span class="p">);</span> <span class="n">sum</span> <span class="o">+=</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span 
class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">exps</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">/</span> <span class="n">sum</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// end forward</span> <span class="c1">// TODO set grad for target variable.</span> <span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">operation</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">operation</span> <span class="o">&lt;</span> <span class="n">n_operations</span><span class="p">;</span> <span class="n">operation</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">MyOperation</span> <span class="n">op</span> <span class="o">=</span> <span class="n">ops</span><span class="p">[</span><span class="n">n_operations</span> <span class="o">-</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">operation</span><span class="p">];</span> <span class="k">switch</span><span class="p">(</span><span class="n">op</span><span class="p">.</span><span class="n">opcode</span><span class="p">)</span> <span class="p">{</span> <span class="k">case</span> <span class="mi">1</span><span class="p">:</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">sigmoid</span><span class="p">(</span><span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]);</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">2</span><span class="p">:</span> <span class="p">{</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">];</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> 
<span class="k">case</span> <span class="mi">3</span><span class="p">:</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]];</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">1</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span><span class="p">]</span> <span class="o">*</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="mi">0</span><span class="p">]];</span> <span class="k">break</span><span class="p">;</span> <span class="k">case</span> <span class="mi">4</span><span class="p">:</span> <span class="p">{</span> <span class="kt">size_t</span> <span class="n">n_arg</span> <span class="o">=</span> <span class="p">(</span><span class="kt">size_t</span><span class="p">)</span> <span class="n">op</span><span class="p">.</span><span class="n">n_arguments</span><span class="p">;</span> <span class="kt">double</span> <span class="n">avg_grad</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">;</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">avg_grad</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">];</span> <span class="p">}</span> <span class="k">for</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n_arg</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">ins</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">+=</span> <span class="n">values</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span 
class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">grads</span><span class="p">[</span><span class="n">op</span><span class="p">.</span><span class="n">out</span> <span class="o">+</span> <span class="n">i</span><span class="p">]</span> <span class="o">-</span> <span class="n">avg_grad</span><span class="p">);</span> <span class="p">}</span> <span class="p">}</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="c1">// end backward</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div> </div> </details> <p>Run-time for 10k ops x 10k iterations: <strong>0.99 second</strong></p> <p>I liked ergonomics of rust better, but achieving high speed in C is way easier. Rust’s interop with python is also way more convenient.</p> <h3 id="lets-autograd-in-c-again">Let’s autograd in C (again)</h3> <p>Another approach I’ve taken is to ‘compile’ traced graph to C. So python produces a long C file where operations are called one-by-one with explicit indices, something like</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span> <span class="n">vals</span><span class="p">[</span><span class="mi">215</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">195</span><span class="p">]</span> <span class="o">*</span> <span class="n">vals</span><span class="p">[</span><span class="mi">205</span><span class="p">];</span> <span class="n">vals</span><span class="p">[</span><span class="mi">216</span><span class="p">]</span> <span class="o">=</span> <span class="n">vals</span><span class="p">[</span><span class="mi">196</span><span class="p">]</span> <span class="o">+</span> <span class="n">vals</span><span class="p">[</span><span class="mi">201</span><span class="p">]</span> <span class="o">+</span> <span class="n">vals</span><span class="p">[</span><span class="mi">204</span><span class="p">];</span> <span class="p">...</span> <span class="c1">// etcetc, and then backward steps are also written the same way</span> </code></pre></div></div> <p>Source code is lengthy, outputs are enormous, and to speed up compilation we can set <code class="language-plaintext highlighter-rouge">-O0</code> in clang. Using <code class="language-plaintext highlighter-rouge">-O0</code> produces slower binaries, but interestingly <em>did not</em> speed up compilation. Best results I got are around 1 minute for compilation and 1 second for a full run. Surprisingly, eliminating switch/case and memory lookups for arguments did not result in faster execution.</p> <p>Given that recompilation is needed any time the graph is changed, real time experienced by user is 1 minute. That’s a no go.</p> <h3 id="assembly">Assembly</h3> <p>In this endeavor to get maximal speed, I decided to go down to assembly. Otherwise it feels like an incomplete journey. We can map a computational graph to just a set of low-level instruction, and avoid “costly” compilation. These days x86/64 is not a king anymore, but neither armv7/armv8 is — and writing assembly for several architectures is totally unreasonable.</p> <p>So … how about using webassembly? It is low-level, fast to compile, and still cross-platform. 
Projects like <code class="language-plaintext highlighter-rouge">wasmer</code>/<code class="language-plaintext highlighter-rouge">wasmtime</code> allow interacting with wasm code from other languages. That’s my first encounter with WASM, and I’ve got quite positive impression: WASM mixes lisp-style syntax (for efficient streaming parsing) and execution model of stack machine. Unlike canonical stack machines, and unlike canonical assembly, WASM allows grouping expressions, e.g.</p> <div class="language-lisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">;; canonical stack-machine way to compute a * b + c</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$a</span><span class="p">)</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$b</span><span class="p">)</span> <span class="nv">f32.mul</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$c</span><span class="p">)</span> <span class="nv">f32.add</span> <span class="c1">;; another way to say write the same, also perfectly legal in wasm</span> <span class="p">(</span><span class="nv">f32.add</span> <span class="p">(</span><span class="nv">f32.mul</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$a</span><span class="p">)</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$b</span><span class="p">))</span> <span class="p">(</span><span class="nv">local.get</span> <span class="nv">$c</span><span class="p">)</span> <span class="p">)</span> </code></pre></div></div> <p>This convenience allows writing significantly more readable code in WASM compared to ye-olde-assembly. Level of abstraction looks just right to me — low-level instructions, but no need to manage register allocations.</p> <p>Webassembly is still very close to assembly in terms of instructions, i.e. there is no <code class="language-plaintext highlighter-rouge">exp</code>, <code class="language-plaintext highlighter-rouge">log</code>, let alone <code class="language-plaintext highlighter-rouge">log1p</code> and alike. Fortunately, there is a WASM <a href="https://gist.github.com/going-digital/02e46c44d89237c07bc99cd440ebfa43">implementation</a> of <code class="language-plaintext highlighter-rouge">exp2</code>/<code class="language-plaintext highlighter-rouge">log2</code> by Peter Knight.</p> <p>My major question was if speed of exponentiation is going to be sufficient, as <code class="language-plaintext highlighter-rouge">exp</code> consumes significant time in C implementation. Alas, in a simple benchmark computing just exponents in wasm takes ~1.9 seconds, leaving it behind rust/C. For reference, javascript computes the same number of exponents in 0.7 seconds. Hence, I take WASM branding of ‘near-native speed’ with a grain of salt, at least in the context of number crunching. Hopefully this will improve, but for now WASM is out of competition.</p> <h2 id="summary">Summary</h2> <p>So, we achieved a <strong>1000X speed up</strong> compared to leading libraries.</p> <p>I don’t find this surprising — major usecase for autograd system is manipulating large ndarrays. Memory management, copy elimination, device synchronization, parallelization of computations — these things are the main focus, and throughput of 1 million ops per second is totally reasonable for the vast majority of scenarios and users.</p> <p>Not for me though. 
My scenario is totally different in terms of numbers and setup, and tensor-focused autograds are too slow. For the problem at hand departing from the common autograd systems was the right and the only possible choice. Exploring different options was quite fun, and my expectations were challenged several times along this exploration.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> Thu, 28 Dec 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/12/28/fastest-autograd.html https://arogozhnikov.github.io/2023/12/28/fastest-autograd.html autograd optimization Optical pooled screens of cells (overview of emerging biotechnology) <p><em>This month brought two preprints describing optical pooled CRISPR screens. What’s this new technology, what it can be used for, and why I’ve been waiting for it? I’ll make a small comparison of approaches and critically review the papers.</em></p> <p><em>Best of all — I am not affiliated with either team, and this is likely the most unbiased review you’ll find</em> 😅</p> <h2 id="papers-discussed">Papers discussed:</h2> <ul> <li><strong>PERISCOPE</strong> <br /> aka <em>Perturbation Effect Readout In situ with Single Cell Optical Phenotyping</em> from <a href="https://www.biorxiv.org/content/10.1101/2023.08.06.552164v1.full">A genome-wide atlas of human cell morphology</a> (Broad Institute)</li> <li><strong>CP-POSH</strong> <br /> aka <em>Cell Painting Pooled Optical Screening in Human cells</em> from <a href="https://www.biorxiv.org/content/10.1101/2023.08.13.553051v2.full.pdf">A Pooled Cell Painting CRISPR Screening Platform Enables de novo Inference of Gene Function by Self-supervised Deep Learning</a> (Insitro Inc.)</li> </ul> <p>In the next parts I discuss some details from these preprints.</p> <h2 id="preface">Preface</h2> <p>To drive experiments in biological systems you need two components:</p> <ol> <li> <p><strong>intervention:</strong> change something in cell (or organoid, or organism). <!--- Fine-grained interventions allow precise verification of hypotheses. ---></p> <p>For a broad understanding of biological system you want to have detailed control of all of its parts. CRISPR solves this by individually acting on any selected gene. This makes CRISPR-driven experiment more interpretable and ensures high coverage of biological processes.</p> </li> <li> <p><strong>readout:</strong> detect change in some characteristic. Better characterization of system would involve high-dimensional description. E.g. just measuring cell size, cell death and pH provides little insight into what’s happening.</p> <p>Several sequencing-based assays provide rich description, and many of them provide single-cell readouts. <a href="https://www.nature.com/articles/nprot.2016.105">Cell painting</a> stands out: it is much cheaper, microscopy-based, and still captures a lot of biologically-relevant information.</p> </li> </ol> <p>Effectiveness of the system for unbiased discovery, roughly, <em>is a product of these two dimensions</em>: how well you control the biology and how well you can describe results of intervention.</p> <p>Pooled CRISPR screens with scRNAseq/scATAC stand out in both dimensions. <br /> They combine 1. complete control via CRISPR with 2. very high-dimensional interpretable readout. Sounds awesome (and it is!), but we need to introduce one more factor to the equation:</p> <ol start="3"> <li> <p><strong>price per experiment.</strong> The more observations you have the merrier. 
We already found there are a ton of things happening in our biology, and to find at least a majority of them in an unbiased manner, a number of attempts is required.</p> <p>Pooled screens are very efficient in experiment material: every cell is turned into a tiny individual experiment. Still, with all multiplexing/overloading tricks, a <em>cost-per-cell</em> in scRNAseq is comparable to <em>cost-per-well</em> in cell painting. Quite a difference!</p> </li> </ol> <p>Optical pooled CRISPR screening, a focus of this post, replaces expensive sequencing with cheap microscopy, and drops price-per-cell &gt;200 fold (PERISCOPE reports price-per-cell ~$0.001). Compared to <em>arrayed</em> optical screens, lower requirements for automation can be expected as all conditions share the well.</p> <p>Overall, technology opens an opportunity for massive experimentation.</p> <h2 id="why-do-we-need-an-even-more-scalable-assay-">Why do we need an even more scalable assay? 🤔</h2> <p>Great question! A number of whole-genome pooled screens have been conducted, arrayed whole-genome screens were run with cell painting. Recursion, who pioneered adoption of Cell Painting, <a href="https://www.recursion.com/operating-system">scaled it</a> to 2 million wells a week. Why would you wish for <em>even more</em>?</p> <p><em>Gene perturbation can be more nuanced</em> than just knockout. CRISPR tiling, an approach to scan for important positions in genome, requires a lot of experiments.</p> <p>Space of interventions also goes <em>beyond single-gene</em> at a time. If e.g. two proteins can perform similar function (“alternative pathways”), downregulating just one of them won’t have as much effect (periscope paper accidentally needs double KO of M6PR and IGF2R). These cases, when the effect in combination is different from combination of effects, are of high interest and give a more direct hint at underlying biology than just similarity of images. At the same time such cases are (likely) sparse, and should be found across 20k x 20k = 400m combinations…</p> <p>Sometimes you need to interact with more than two genes at a time, for instance to create iPSCs. Recall that iPSC creation relies on simultaneous expression of 4 <a href="https://en.wikipedia.org/wiki/Induced_pluripotent_stem_cell#Production">Yamanaka factors</a>. For reference, the original <a href="https://www.cell.com/cell/fulltext/S0092-8674(06)00976-7?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867406009767%3Fshowall%3Dtrue">Yamanaka paper</a> screened 24 candidate genes. To improve upon this “recipe”, a large number of combinations should be tried. Scanning just combinations of 4 factors out of 100 <a href="https://en.wikipedia.org/wiki/Transcription_factor">TFs</a> already takes around 4 million attempts.</p> <p>Combinatorial space stays almost unexplored. Dropping price even more still won’t make it possible to check all possible combinations, and this exploration should be driven by ML. ML-friendliness thus becomes a requirement.</p> <!--- <div style="float: right; width: 200px; margin: 20px;" > <img src="/images/opticalscreen/peptides.png" height="200" /><br /> <small markdown="True"><a href="https://pubmed.ncbi.nlm.nih.gov/23316341/">J. Thundimadathil, 2012</a> </small> </div> --> <p>There are non-genetic perturbations that are of high interest: cell environment, additions of chemicals or biologics. 
Unfortunately, usually there is no way to ‘massively multiplex’ these conditions, and microwell stays the minimal possible unit of experiment. Notable exception are <strong>peptides</strong>, as those similarly can be barcoded and participate in a pooled screen. Peptides can be used both as discovery tool (e.g. to block some interaction or activate receptor) and <a href="https://en.wikipedia.org/wiki/Peptide_therapeutics">as a therapeutic</a>.</p> <h2 id="challenges-needed-to-be-solved">Challenges needed to be solved</h2> <p><img src="/images/opticalscreen/cp_posh_imaging_pipeline.png" width="700" /> <small> Cell Painting (left, 5 channels + composite) and base calling in ISS (right) have significant overlap in channels. <br /> Image from CP-POSH preprint. </small></p> <p>Interventions are encoded with <a href="https://en.wikipedia.org/wiki/Guide_RNA">sgRNA</a> barcodes. In situ sequencing (ISS) is used to read the barcode back.</p> <ul> <li> <p><strong>Main issue is merging ISS with cell painting</strong>. There is a spectral overlap between channels used for cell painting and ISS, and thus ISS becomes non-reliable.</p> </li> <li> <p>Cell painting degrades RNA and <strong>destroys barcode</strong>. Both teams addressed this by running reverse transcription and RCA (rolling cycle amplification) of DNA before cell painting. ISS imaging is quite destructive (multiple cycles) and happens after cell painting step.</p> </li> </ul> <h3 id="how-periscope-solves-spectral-overlap">How PERISCOPE solves spectral overlap</h3> <p><img src="/images/opticalscreen/periscope_linker.png" style="float: right; width: 400px;" /> Periscope team replaced two dyes in cell painting with fluorescent labels attached to probes with disulfide linker (see image). Linker is cleaved right after “phenotypic” (cell painting) imaging, and these two channels could be used for ISS. Floating fluorescent labels are partially washed and remaining (uniform) signal is cancelled out by image processing pipeline.</p> <p>More specifically, membrane label Concanavalin-A was SS-conjugated to fluorophore directly, while mitochondria stain mitotracker was replaced with anti-TOMM20 Ab + secondary Ab SS-linked to fluorophore. <!-- TODO (can this place be optimized to remove secondary?). --> Original cell painting avoided antibodies to make the process cheaper and more reproducible.</p> <p>As expected, perturbation of TOMM20 distorts the signal from this channel — something to keep in mind.</p> <h3 id="how-cp-posh-solves-spectral-overlap">How CP-POSH solves spectral overlap</h3> <div style="float: right; width: 400px; padding-left: 20px;"> <img src="/images/opticalscreen/mitotracker_correlation.png" style="width: 400px;" /> <small>Correlation of mitoprobe with TOMM20 and Hoechst</small> </div> <p>Mitotracker was replaced with Mitoprobe — a novel RNA-based label for mitochondria, linked to Cy5 fluorophore. Interestingly, they optimized a sequence to have high correlation with TOMM20 <strong>and</strong> low correlation with Hoechst (nuclei).</p> <p>Resulting image (on the right) shows optimization was successful.</p> <p>RNA sequences were taken from the ribosome after search for fragments that would bind to 12S rRNA and 16S rRNA (two different locations), then tested 8 of them and left two: one for 12s and one for 16s in proportion 1:1. 
This is an interesting solution and seems to overcome the issues seen in PERISCOPE approach, and likely to work in other species too.</p> <p>This replacement of mitotracker with mitoprobe <em>does not</em> remove spectral overlap (there is overlap with base A), but makes it non-essential because RNA is degraded during cell-painting. Two additional spectral overlaps (WGA &lt;&gt; base G) and (phalloidin &lt;&gt; base T) are also solved by degrading, and additional steps in the protocol were necessary. These overlaps still seem to play negative role in ISS step (see later).</p> <p>CP-POSH has an additional channel that can be utilized for one study-specific marker, which is later featured in one of experiments. (They use deep red — good choice, as shorter wavelengths can be used by phenotyping!)</p> <!-- I am curious if something similar to mitoprobe can be developed for F-actin (i.e. RNA-based label). This could make ethanol unnecessary. --> <p>In total both protocols are not straightforward.</p> <h3 id="in-situ-sequencing-iss"><em>In situ</em> sequencing (ISS)</h3> <p><img src="/images/opticalscreen/in_situ_sequencing.png" /> <small>Source: <a href="https://www.cell.com/cell/pdf/S0092-8674(19)31067-0.pdf">Feldman</a> et al., 2019</small></p> <p>ISS reads the barcode to determine perturbed gene. This part is very similar, as both groups:</p> <ul> <li>use Illumina’s miseq kit for ISS (sequence-by-synthesis), and both groups used lower resolution (10X) for imaging.</li> <li>use padlock with gap to amplify barcode to get reliable signal during sequencing</li> <li>finally, barcodes used in both cases are not an additional genetic sequences, but sgRNAs themselves. <br /> No barcodes — no problems!</li> </ul> <p>CP-POSH additionally uses tiny <em>image-to-image convnet to improve calling</em> to get +18% correct calls. Such a model can be trained on the screen data itself: almost-correctly called barcodes (with simpler pipeline) are used for training the model.</p> <!--- Absence of separate barcodes, while very reliable, has its demerits too: cells that replicate from the same transfected cells, are not ‘true independent observations’, as e.g. they can carry the same mutation introduced during transfection. Additional barcodes could tell apart independent transfections and help in lineage tracking. Optical pooling has partial remedy to this problem: cells coming from the same origin usually colocalize within a well. It could be an interesting analysis if ‘families’ of cells carry any additional visual signature that is not shared by other cells with the same sgRNA. ---> <h3 id="sgrnas">sgRNAs</h3> <p>Quality of ISS quickly drops with sequence length, so instead of sequencing all ~20 bases of sgRNA, the guides are selected so that reading only first 12-13 bases is enough to guess which sgRNA is expressed in the cell. Groups start from existing pools of sgRNAs to guide Cas9, with minor differences in selection procedure:</p> <ul> <li>Periscope uses 12 cycles and minimal Levenshtein distance ≥ 2, which means they detect if barcode contains one error (and discard the barcode).</li> <li> <p>CP-POSH uses 13 cycles and Levenshtein distance ≥ 3, and allows up to 1 error correction. Most cells have more than one amplicon, which makes barcode calling even more reliable. 
Error correction adds +80% of barcoded cells in their largest screen.</p> <p>I hypothesize high error rate (despite CNN filtering) is connected to spectral overlaps.</p> </li> </ul> <p>Scope of experiments is different: Periscope covers 20k genes with 4 guides per gene, while the largest experiment in CP-POSH targets druggable genome — 1.6k genes with 10 guides per gene.</p> <h2 id="phenotypic-pipeline-and-analysis">Phenotypic pipeline and analysis</h2> <p>Both teams avoid training the system on known labels. I’ve also been avoiding training with supervision for a while, for a couple of reasons:</p> <ol> <li>no need to drop any data from analysis (no labels → no cross-validation)</li> <li>by providing labels you already bias model into what <em>you believe</em> is important. Correspondingly model works to ignore all “irrelevant” information, and the same model can’t be used (reliably) for studying orthogonal questions (e.g. well-to-well variations)</li> <li>should there be any confounder, it is less likely to be picked</li> </ol> <p>It’s actually <strong>impressive how little prior knowledge is required to get a decent grasp of biology just from looking at static cells</strong>. We only need to know all genes of the organism to run CRISPR, neural networks don’t need even this piece of information.</p> <p>PERISCOPE relies on <a href="https://cellprofiler.org/">Cell Profiler</a>, and does not train any specific pipeline. After averaging morphological profiles across the cells for the same gene, a matrix of gene similarities is computed.</p> <p>CP-POSH relies on <a href="https://github.com/mouseland/cellpose">CellPose</a> for segmentation, and either uses CellProfiler-like pipeline (dubbed CellStats) or self-supervised <a href="https://arxiv.org/abs/2104.14294">DINO-ViT</a> from FAIR. Unsurprisingly, DINO-ViT demonstrates better quality, which improves with higher diversity of interventions provided during training. Pre-training on cells not ImageNet works much better, as you’d expect (Insitro-ers for some reason like Imagenet-pretrained models as baseline). DINO-ViT also uses patches 8x8, more relevant to the scale of cell.</p> <p>A nice detail: they use a well-level compensation. That’s possible thanks to pooling!</p> <p><img src="/images/opticalscreen/diffexp_visual_features.png" style="width: 400px; float: right;" /> Both papers delve into ‘differential expression’ of hand-crafted morphological features to provide arguments that readout is valid. For instance, periscope shows that most important features to detect interventions connected to common pathways point to the right cell compartment.</p> <p>On the picture from PERISCOPE you see that disturbing a pathway results in some enrichment of important features (‘differentially expressed‘ features) from the corresponding cell compartment.</p> <div style="clear: both;"></div> <h2 id="verification--discovery">Verification &amp; Discovery</h2> <p>“Method papers” are a special genre of literature: 1) focus of author is technology 2) focus of editor is novel biology 3) authors must provide convincing validation which no one wants to dive in.</p> <p>This rarely converts into a consistent story for screens, and this time is no exception.</p> <p>PERISCOPE compares two different medias, running whole-genome screens in each of them — an interesting experiment with unclear interpretation: there are genes that “land in different clusters” depending on the media — but unclear what to do with this information. 
As I understand, the goal was to demonstrate that running screen in a more physiologically relevant media would yield better insights, but it is unclear if differences (Ext Fig.8) indeed show superiority of either media.</p> <p>Another interesting shot is the TMEM251 investigation with significant additional research beyond PERISCOPE. If the TMEM251 story really matters, I’d prefer to see it published separately and better verified (using available info from other pooled screens as well), Periscope in this story was needed only for initial guess based on GSEA — but this guess could come from other public screens as well.</p> <p>Speaking of GSEA… — usage of GSEA in paper (e.g. fig. 6a) makes no sense 😞. GSEA’s power is combining signal from multiple genes with low expression. This problem <em>does not exist</em> in optical screens — as no expression is measured. Preranked GSEA (erroneously) relies on zero correlation between genes, but correlation in optical screens is very high. In fact, this high correlation is a subject of several plots in the paper. To compare pathways, just define another direction in embedding space for each pathway, as you do for single genes. Direction is a (weighted) average of directions for individual genes + measure separation of distributions along direction (e.g. ROC AUC).</p> <p><img src="/images/opticalscreen/umap_leiden_from_cellposh.png" width="700" /> <small>Example UMAP from CP-POSH for one of screens</small></p> <p>CP-POSH focuses on druggable genome (1640 genes) with a couple of smaller screens. Each version of pipeline (data + phenotyping model) is compared against <a href="https://string-db.org/">StringDB</a>, providing a quantifiable comparison, so they can e.g. demonstrate that targeting more genes is slightly better. They also confirm that trained models generalize to new experiments.</p> <p>Different versions of screen are presented in a uniform way with UMAP+Leiden clustering applied to genes with a clear morphological signature (see example above).</p> <p>I was confused by notable divergence between models trained on 300 and 1640 genes, figure 5a. In particular their lists of significant genes (AUC &gt; 0.55) should markedly diverge across models. Also, 0.55 may sound small — however, bear in mind this is a cell-level classification, and combining multiple cells will result in strong discrimination.</p> <p>Both ViT and CellStats “nominate the potential role of TUT1 in cell cycle regulation”. (No research made to confirm). Interestingly, sgRNA consistency failed for several genes, and half of genes have at least one ‘outlier’ sgRNA (out of 10).</p> <p>In my opinion, CP-POSH has a consistent storyline and more ‘standardized’ analysis. It looks more like a validation of approach/platform, and less like a bunch of interesting observations (though CP-POSH has these too). PERISCOPE presentation is more aligned to “get published in AAA journal”.</p> <p>Neither paper discusses cell cycle, a well-known confounder in single-cell studies, how so? 🤷 Optical screens previously characterized full images, not individual cells, and thus did not have to deal with this issue (as there are other cells to get signal from). Since neither team used supervision, pipelines likely cluster dividing cells together, preferring this characteristic over perturbation. Cancelling this in optical screen is an interesting challenge.</p> <h2 id="so-which-one-to-choose">So which one to choose?</h2> <p>Great question, fortunately we have papers to help us! 
So here is my insight: I don’t know. <strong>I can’t meaningfully compare performance of two systems after reading preprints.</strong> Performance, I guess, is similar — but that’s only a guess. If some lab wants to select which one to go with, this becomes a matter of trust — not how science is supposed to work. (ok-ok, one additional channel can actually make this choice).</p> <p>Main selling points of optical pooled screens are simple scalability and fewer confounders, which ultimately means hypothesis-free or hypothesis-light research. I doubt that interpretable morphological features are important for practitioners.</p> <p>Papers lack “power analysis” on how many cells are needed to reconstruct perturbation profile. Very little said about cost ($0.001 per cell — estimate from PERISCOPE, no cost estimates from CP-POSH). These two factors determine if pooled strategy pays out.</p> <p>Speaking of potential, it is unclear if two sgRNAs per cell can be confidently called with either approach.</p> <h2 id="can-we-do-better">Can we do better?</h2> <p><strong>Screen validation should become a benchmark.</strong> It’s about time we had a benchmark of reproduction of gene networks/gene ontology with some predefined procedure. Community would benefit from comparing across the screens rather than “rediscovering” mTOR in every screen paper.</p> <p>Number one question is — can screen discover culture-specific biology? When comparing several cell lines, are gene similarities in optical screen and scRNAseq similar for the same cell line?</p> <p>It would be of high interest to highlight which pathways are detectable in scRNAseq but hardly noticeable in optical pooled screening (and vice versa). It is of value to know if there are pathways that can be seen in an optical screen or in scrnaseq — and can help in choosing the right instrument for the problem.</p> <p><strong>Compare screen to screen, not screen to “common knowledge”.</strong> Common pathways are a very rough sanity check. Single UMAP with gene grouped by their similarity is descriptive enough. GSEA is a poor argument: it is embarrassingly easy to find something pleasing with GSEA and throw a bunch of impressively small (incorrect) p-values at readers.</p> <p>Comparison screen-to-screen can detect more subtle biology, specific to the biology of culture, and can actually bring interesting insight.</p> <p><strong>Discoveries are usually irrelevant for the story and should not be demanded by journals.</strong> Method papers are demanded to “show novel biology”, and most of “byproduct discoveries” have no value for readers or authors — otherwise those would be a separate paper.</p> <p><em>Faster, cheaper, easier to scale, more reliable, easier to implement</em> are <strong>great</strong> arguments for technology. If whole smartphone industry can’t deliver “a killer feature” every year, how that can be a requirement for every method? 🤷</p> <h2 id="where-would-this-go">Where would this go?</h2> <p>Back to point. Pooled optical screening is an exciting technology, and it has a number of immediate applications. And it is super valuable to understand its current limits.</p> <p>For instance, I have the following questions on my mind:</p> <ul> <li>does it transfer? When two labs experiment with same cell line, would they get similar results? 
In theory, yes, but how about practice?</li> <li>similarity and difference with arrayed screens: shared media means studied processed are limited to a single cell, because cell interactions are not restricted to cells with the same perturbation. This has both pros (clearer signal) and cons (if cell interactions/collective behavior are of interest).</li> <li>is it suitable to automatically find ‘interesting’ combinations of genes? Can we train RL to discover those for us?</li> <li>can it handle tissue slices? Can we pool-screen <a href="https://www.frontiersin.org/articles/10.3389/fragi.2021.714926/full">whole mouse</a>?</li> <li>can vision pipeline handle neurons? Is DINO a good choice for that?</li> </ul> <p>Hopefully more research will come and we’ll get answers to these and other questions soon.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> <h4 id="acknowledgments">Acknowledgments</h4> <p>Thanks to Kevan Shah and Tatiana Dvorkina for proofreading and comments. Thanks to CP-POSH team (Ci Chu, Max Salick) and PERISCOPE team (Meraj Ramezani, Paul C. Blainey) for answering questions.</p> <h4 id="comments">Comments</h4> <p>Paul C. Blainey provided some pointers to prior works of his lab, relevant to the questions I discuss in the post:</p> <blockquote> <p>… a couple of comments that you may find interesting:</p> <ul> <li>In Figure S2 of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6886477/">Feldman et al., 2019</a> we showed efficient detection of 2 guides per cell (in ~80% of cells)</li> <li>In <a href="https://www.pnas.org/doi/10.1073/pnas.2210623120">Carlson et al, 2023</a> we use a different and simple strategy to overlap IHC and SBS in the same channels which is to titrate down the IHC reagents</li> <li>Both of these works demonstrate a potentially standardizable validation approach to do a follow-up (“secondary”) screen in an independent experiment with higher replication (more cells and/or guides per gene). The hit ranks or feature scores can be compared gene-wise or guide-wise across the primary and secondary to check reproducibility of the results. This can be for technical validation (same assay and guides) or biological validation (new assay and/or new biological model system).<br /> So far we’re seeing impressive reproducibility which supports some of the more challenging and informative use cases you suggest.</li> <li><a href="https://www.biorxiv.org/content/10.1101/2021.11.28.470116v1.full">Funk et al, 2022</a> demostrated that cell cycle can be treated more explicitly, we added 24-hour live imaging of cells prior to fixation</li> </ul> </blockquote> <!-- My comment: for some processes like mitosis / cell movement, live imaging can be done together with pooled screen and used as a functional validation to provide "arbitrage" between different screens. This still requires compared approaches to be implemented in the same lab, or, at least, with the same culture. --> <!-- # Cell painting channels: Original cell paingting from the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5223290/ Phenotypic images were acquired using a 20X 0.75 NA CFI Plan Apo Lambda objective (Nikon MRD00205) and the following Semrock filters for each phenotypic probe: Nucleus (DAPI) dual-band emission 408/473, dichroic. Actin (phalloidin) emission ET530/30 nm, dichroic 495 nm. Mitochondria (TOMM20) emission 615/24 nm, dichroic 565 nm. Endoplasmic reticulum (Concanavalin A) emission 680/42 nm, dichroic 660 nm. 
Golgi and plasma membrane (WGA) emission 820/110 nm, dichroic 765 nm. ISS cycles were imaged using a 10X 0.45 NA CFl Plan Apo Lambda objective (Nikon) with the following Semrock filters for each base: Miseq G emission 575/30 nm, dichroic 555 nm. excitation 543/4 nm, Miseq T emission 615/24 nm, dichroic 565 nm. Miseq A emission 680/42 nm, dichroic 660 nm. Miseq C emission 732/68 nm, dichroic 660 nm. 575 (-30) - 732 (+ 68) TOMM20 intersects with T ConA intersects with miseq A Same for cell painting -POSH Stain Target Imaging Type Stain Laser Source Laser (nm) Emission Filter (nm) Objective Exposure time (ms) Nucleus Phenotyping Hoechst Celesta Light Source, Lumencor, 90-10525 405 Pentacube , 441x30 20x 0.75 NA, OFN25 DIC N2 Cellular Membranes/ endoplasmic reticulum Phenotyping ConA Celesta Light Source, Lumencor, 90-10525 488 Pentacube, 511x26 20x 0.75 NA, OFN25 DIC N2 Cellular membrane/ Golgi/ER Phenotyping Wheat Germ Agglutinin Celesta Light Source, Lumencor, 90-10525 545 567/15nm Filter, Semrock, FF01-567/ 15-25 20x 0.75 NA, OFN25 DIC N2 Cytoskeleton/ F-actin Phenotyping Phalloidin Celesta Light Source, Lumencor, 90-10525 545 624/40nm Filter, Semrock, FF01-624/ 40-25 20x 0.75 NA, OFN25 DIC N2 Mitochondria Phenotyping Mitoprobe Celesta Light Source, Lumencor, 90-10525 637 Pentacube 684x34 20x 0.75 NA, OFN25 DIC N2 ribosomal protein Phenotyping pS6 primary and secondary antibody Celesta Light Source, Lumencor, 90-10525 748 Pentacube 817x66 20x 0.75 NA, OFN25 DIC N2 G 545 -> 567/15nm <> WGA T 545 -> 624/40nm <> Phalloidin one-to-one - degraded by ethanol A 637 -> 676/29nm <> Mitoprobe C 637 -> 732/68nm --> Sun, 20 Aug 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/08/20/optical-pooled-screens.html https://arogozhnikov.github.io/2023/08/20/optical-pooled-screens.html biology Einops, retrospective of 5 years <p>Einops is soon-to-turn 5 years. Right time to have a look back.</p> <p>Some intro: einops is widely used — around 4 million downloads a month (for calibration - pytorch is 10 million) on pypi and is used in thousands of projects on github.</p> <p>In a number of ways einops is unique:</p> <ul> <li>bends tensors for a number of very different frameworks. AFAIK all other efforts to make something truly multi-framework either died too soon or avoided touching internals of models</li> <li>never pulled back released features. At the same time einops lived much longer than any major version of tensorflow or pytorch. Some backends it originally supported (mxnet, chainer) are dead by now</li> <li>bug tracker was empty for years, compared to usual hundreds in projects of similar scope. Now it reports several hardly fixable inconsistencies that appeared as frameworks introduced more features</li> <li>einops adoption happens mostly through the code sharing between teams/projects, and not by hype-waving. Several mentions in twitter brought waves of likes but almost none were converted to users at that point. Paper appeared only after einops circulated for three years in the wild nature of github, when it was pristine clear that idea “clicks”.</li> <li>“magical” universal dispatching, so users could write <code class="language-plaintext highlighter-rouge">rearrange(x, 'b c h w -&gt; b h w c')</code> and not care about <code class="language-plaintext highlighter-rouge">x</code>’s framework/device/dtype/C-ordering. While this is more of a ‘fancy’ functionality, it was important during initial adoption. 
<!-- Magical is not a great description for technology, but einops was many times described as "magic" with a positive vibe in this word. --></li> <li>no dependencies (except Python). Everything else is optional, even numpy</li> <li>there is no corporation/university behind einops, it is mostly a single-person effort</li> </ul> <h2 id="tough-place">Tough place?</h2> <p>A while ago Stephan H. asked <em>what is challenging about einops</em> as a project.</p> <p>I don’t think I gave a great answer back then. And probably couldn’t anyway, because the question assumes there is a specific “tough place”, but the assumption is wrong.</p> <p>Also, “tough place” is very subjective, and after working on any project for some time, if you’re successful, there will be no “tough” place, because you focus on those parts that are “tough” and get them better, either by decomposing their complexity or by just learning to live with it.</p> <h2 id="unique-technical-challenges">Unique technical challenges</h2> <p>I decided to dedicate some time to write a better answer for this question. The first prototype was built in a couple of hours, but the project itself took months, so clearly there were non-trivial parts. Einops as a project has a number of (conflicting) technical restrictions that create significant pressure:</p> <ul> <li> <p>frameworks. Einops supports a dozen of them, and that’s unique. Worse, each framework has its specifics, and this creates significant internal tension within the project, which I’ll discuss a lot in the next points</p> </li> <li> <p>even worse, frameworks have multiple regimes of work within the same framework (i.e. torch alone has torch.compile, tracing, scripting, ‘plain run’, torch.fx, cuda graph capturing, and maybe more). They all have different behaviors</p> </li> <li> <p>the landscape is not steady: frameworks appear and go; even worse, they sometimes change their API, and sometimes break the existing API (looking at you, keras and TF). Their dependencies may contradict each other (stares at protobuf)</p> </li> <li> <p>support for eager computations.</p> <p>That’s how code usually runs these pytorchy days. In this case, the hot path should be <em>really</em> fast, and have absolutely minimal overhead. Einops deals with this with a number of caches that make usual loopy computations super-efficient. Shape checks (usually skipped by lazy folks) are conducted only once per shape.</p> </li> <li> <p>support for symbolic computations and traceability.</p> <p>Two little-known facts first: 1. einops can deal with symbolic tensors (i.e. can operate on tensors with unknown size of one or several axes, which may sound slightly impossible at first) and 2. einops “disappears” during tracing and provides models that contain an equivalent set of framework-native operations; moreover, traced operations correctly work for inputs of different shape.</p> <p>As a result, the execution flow has to rely only on traceable operations over shape’s elements, and e.g. one can’t just compute the correct result shape in cpp/rust</p> </li> <li> <p>shape checks for symbolic tensors.</p> <p>For example <code class="language-plaintext highlighter-rouge">rearrange(x, '(h h2) (w w2) -&gt; (h w) h2 w2', h2=4, w2=w2)</code> demands that the first axis is divisible by 4, and the second axis is divisible by <code class="language-plaintext highlighter-rouge">w2</code>, while the dimensions of tensors are unknown. An additional restriction: einops can’t use built-in graph asserts like tf.Asserts because of their framework-specificity.
Clever organization of computations in ops ensures that code fails for wrong inputs without introducing additional elements into the static graph.</p> </li> <li> <p>support for scripting: this requirement dramatically narrows the subset of Python that can be used, and in some cases demands specifying wrong type hints for internal functions because correct types like <code class="language-plaintext highlighter-rouge">tuple[str, ...]</code> are not supported by <code class="language-plaintext highlighter-rouge">torchscript</code></p> </li> <li> <p>support for tensor-rank polymorphism, that is, the same operation with ellipsis can handle inputs with different numbers of dimensions. Initially this was done by a clever trick that pre-packed ‘ellipsis axes’ into one, but recent changes in frameworks (see next point) required developing a new approach</p> </li> <li> <p>special axes. Frameworks try to extend the concept of tensor = ndarray, which worked so well. Examples are sharding axes in distributed tensors and jagged arrays. This clearly was outside the initial design and, as I mentioned, required a significant redesign of einops.</p> </li> <li> <p>framework divergences: differences in the names/interfaces of operations, missing operations like logsumexp, inconsistencies in support of einsum.</p> </li> <li> <p>layer definitions are quite different across frameworks, and especially <code class="language-plaintext highlighter-rouge">flax</code> required a special approach.</p> </li> <li> <p>view semantics. Einops tries to provide a view to the input if possible, making the operation itself very cheap, as no real computation happens.</p> </li> <li> <p>an additional pressure is my perfectionism and trying to keep the bar very high. These days I don’t think extreme reliability should be assumed from side/personal projects.</p> </li> </ul> <!-- - python's typing does not know how to exclude lists --> <p>Problems that appear with new features like <code class="language-plaintext highlighter-rouge">torch.fx</code> may be interpreted as <em>einops giving cracks</em>; in reality, einops as a notation and approach is just fine. It is enjoyed by many, and the community wants to use the notation with new framework features. And the notation fits that. But the terrible foundation that tensor manipulation is built upon (i.e. reshape/view/transpose and similar) gives cracks, more and more visible, and building a layer of cement on top of it is … not wise. As I discussed several times, einops’ core operation should be available at the lowest level of graph representation — but I don’t expect this advice to be heard.</p> <p>Support for a large zoo of frameworks is (retrospectively) a questionable investment. Examples: cupy and chainer were almost never used, but also were trivial to maintain and develop. Mxnet/gluon, in contrast, required very special treatment. Supporting multiple frameworks to me was insurance that frameworks would not try to create “their very own version of einops”, and would not create non-compatible extensions (as they did for numpy).</p> <p>These days projects that don’t use einops still use its core ideas by writing parts of einops patterns: <code class="language-plaintext highlighter-rouge">(b h) t c</code>, <code class="language-plaintext highlighter-rouge">b*h t c</code> and similar. Because that’s the best way to communicate the internal structure of a tensor (… when you agree on C-ordering, of course; the construct relies on it significantly).</p>
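<p>For illustration, a toy sketch (not from any particular project) of the difference between the pattern living in a comment and the pattern being the operation itself:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy example: x has shape (batch, heads, time, channels)
import numpy as np
from einops import rearrange

b, h, t, c = 2, 4, 16, 32
x = np.random.rand(b, h, t, c)

# pattern as a comment next to a raw reshape: the reader has to trust the comment
y1 = x.reshape(b * h, t, c)  # (b h) t c

# pattern as the operation itself: checked at runtime, readable without extra context
y2 = rearrange(x, 'b h t c -&gt; (b h) t c')

assert np.array_equal(y1, y2)
</code></pre></div></div>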
<h2 id="unique-conceptual-challenges">Unique conceptual challenges</h2> <!-- It is easy to think about einops as a python package, but it is more of **approach** to write a readable, reliable and efficient code, that was conveniently provided to python users. --> <p>Einops is more of an approach to writing code than a package, but the package is a necessary tool to bring those ideas into practice. At the approach level there are a number of hurdles too.</p> <p>It turns out the design of operations is very challenging: einops received a long list of suggestions and ideas, and very few were accepted. Folks just introduced to einops think “einops are helpful, so let’s invent something similar”, but <em>similar</em> does not imply <em>helpful</em>.</p> <p>Let’s take the story of <code class="language-plaintext highlighter-rouge">einops.pack</code> and <code class="language-plaintext highlighter-rouge">einops.unpack</code> as a demonstration of this point: concatenation of different-shape tensors was of interest (for me) even before the first public release. My design at that time was universal enough, similar to the rest of einops, but too verbose and inconvenient:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">rechunk</span><span class="p">([</span><span class="n">rgb</span><span class="p">],</span> <span class="s">'b h w [r+g+b] -&gt; b h w [r, g, b]'</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">g</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div> </div> <p>… thus it was not included.
Later it was minimized by restricting transpositions:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># this one poorly works with type hinting </span><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">rechunk</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="s">'r+g+b -&gt; [r, g, b]'</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">g</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> </code></pre></div></div> <p>until I finally realized that this operation better to be totally different from <code class="language-plaintext highlighter-rouge">rearrange</code> and should not have any names for the concatenated/split axes:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">unpack</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span> </code></pre></div></div> <p>which was soon generalized into unpacking with arbitrary shapes.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">]</span> <span class="o">=</span> <span class="n">unpack</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="s">'b h w *'</span><span class="p">,</span> <span class="p">[[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">]])</span> </code></pre></div></div> <p>Original design of operation could not support arbitrary shapes. Ok, technically it could, but that would be ugly and miserable. New design solved another issue — memorizing axes that were composed, another common request for einops.</p> <p>I’ve come up with a final design (which I still find optimal) only <em>two years later</em>. 
A number of suggestions popped up that were similar to the original version.</p> <p>To see whether an operation ‘clicks’, <strong>a whole research effort is needed</strong>:</p> <ul> <li>collect use-cases (and this requires a broad view of SOTA and how it may change over the next years)</li> <li>convert use-cases to code examples, and prepare baseline implementations without the new operation</li> <li>implement with your suggestion, and in most cases, conclude that it doesn’t look good enough</li> </ul> <p>There are more complicated parts, like “is it easy to read?”, “is this code confusing?” and finally “how to make this all efficient given all the restrictions above?”.</p> <p>Allocating time for these (mostly unsuccessful) attempts is tough.</p> <!-- Python. Python stands in a way sometimes. Julia's line-level macros maybe would be a more convenient syntax, and e.g. writing something like ```python x_out['b h w c'] = x['b c h w'] ``` --> <p>Additional challenge: “fewer, but more universal operations”.</p> <p>There is a gap between “I find this helpful” and “this will be actively used”. It is easy to come up with a long list of operations that will be helpful in <em>some</em> cases, but how would users figure this out? I don’t think anyone checks einops’ docs regularly, so an operation will never pop up in one’s mind. See, <em>usefulness of an operation strongly depends on its universality</em>, i.e. the ability to cover many cases, and einops is good at this because it was one of the requirements.</p> <h2 id="adoption-challenges-management-challenges">Adoption challenges, management challenges</h2> <p>Einops adoption was very slow. If it were a commercial project, it would likely have run out of money before getting sufficient traction.</p> <p>But the project was designed to be resilient. It was somewhat of an internal requirement: the project should be usable for at least a couple of years even in the worst scenario: no maintenance at all, while the deep learning landscape changes even faster than before.</p> <p>From the very beginning maintenance debt was minimized — that means a very restricted design and fewer features. I assessed very carefully which things can be broken. Once I was asked during an interview: why might it stop working? I said — only if the API of core operations changes. Time has shown this was the correct answer.</p> <p>Another issue is <em>extremely low adoption of layers</em>. I have no good explanation for it; they are very useful.</p> <h2 id="reasons-for-slow-adoption">Reasons for slow adoption?</h2> <p><strong>No hyping</strong>. In part, because I am bad at it, and in part, because I am not that interested in answering basic questions from folks attracted by new shiny things. As a byproduct, early adopters of einops are mostly very advanced folks who knew what to expect from the tool and cared more about the quality of their code than the rest of the ML community.</p> <p>Consequently, einops has <em>no dedicated community</em> (discord server or so). In the long run I think no community is better than an abandoned community (which happens in many projects). There are a number of ein-tools around github addressing specific cases; maybe a somewhat centralized community could help with initial adoption.</p> <p>Another important factor is <strong>a significant prejudice against string-templated operations</strong>, which is for three reasons: 1. einsum was historically slow 2. einsum is the only operation of this kind in the frameworks 3. everyone knows parsing is slow, and the idea of ‘parse once’ rarely crosses the mind.</p>
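<p>For illustration only, a minimal sketch of the ‘parse once’ idea (this is a toy, not einops’ actual internals): parsing is paid once per distinct pattern string, and every later use is a cache lookup.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy sketch of 'parse once, cache forever' (not einops' real parser)
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_pattern(pattern: str):
    # pretend-parsing: split 'b c h w -&gt; b h w c' into input/output axis lists
    left, right = pattern.split('-&gt;')
    return tuple(left.split()), tuple(right.split())

# the first call parses, any later call with the same string is a dict lookup
parse_pattern('b c h w -&gt; b h w c')
parse_pattern('b c h w -&gt; b h w c')
print(parse_pattern.cache_info())  # hits=1, misses=1
</code></pre></div></div>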
<p>Einops <em>caches results of pattern parsing</em>. But even repeating this many times in the paper/documentation will not overcome the prejudice — because if you’re already convinced it is slow, why would you read the paper?</p> <p>A couple of speed issues were reported to the einops repo, while those were not even related to einops — a vivid demonstration of this bias.</p> <p><strong>No critical case</strong>. A tool becomes an immediate hit only if it addresses an existing case that is very poorly covered by previous tools. Or, rarely, because of hype.</p> <p>Not that you can’t bend tensors without einops. And not that adding a single <code class="language-plaintext highlighter-rouge">rearrange</code> magically makes your code better. Einops is an approach — and an approach still requires investment to build a habit of writing and reading a new kind of code. Real conversion happens only after one needs to read someone else’s code and finds out that reading einopsy code is significantly easier.</p> <h1 id="concluding-thought">Concluding thought</h1> <p>Einops, as said, is one of a kind, and its development trajectory deviates significantly from ‘normal’ development.</p> <p>What would you call a system that is shaped by hard constraints? I’d call this “engineering art”.</p> Thu, 13 Jul 2023 12:00:00 +0000 https://arogozhnikov.github.io/2023/07/13/retrospective-thoughts-on-einops.html https://arogozhnikov.github.io/2023/07/13/retrospective-thoughts-on-einops.html einops tensor manipulations Schema migration should be a responsibility of DB <p>A great achievement of the past decade in programming is a shift in paradigm from <em>transition</em>-focused to <em>state</em>-focused.</p> <p>This shift is clearly seen in front-end (user interfaces): in react/preact/vue and other frontend frameworks a component has a state and defines how the state should be represented (rendered) in html. The aim of a framework is to ‘migrate the DOM’ to the desired html representation with minimal overhead.</p> <p>This shift is clearly seen in management of cloud resources. In AWS CDK, pulumi, terraform and other <a href="https://en.wikipedia.org/wiki/Infrastructure_as_code">IaC</a> tools the user defines the desired state of infrastructure, and it is the responsibility of the tool to produce a correct ‘migration of infrastructure’.</p> <p>This shift is visible in dependency management: dependency management relies on the expected state (which packages/libraries are required) and less on imperative instructions that dictate the order of installation. Imperative glue here is still very common — e.g. dockerfiles, but tools like nix/nixos eliminate the glue as well.</p> <!-- Streamlit (tool used by data/ml folks) uses state (kept on client-side) to define the contents of the page. Every user action changes the state, and triggers computation of a new content with (mostly) preserved state. --> <p>In databases, in particular in ORMs, this shift (only partially) happened around two decades ago. The user changes ORM classes, and the framework produces migrations.</p> <p>Generally speaking, in all these cases we define the desired state of the system, <em>not</em> the necessary changes. Movement to state-focused programming dramatically simplified management of complex systems.
It’s like laying out a street plan while the question of moving all the belongings/walls is solved for you.</p> <h2 id="whats-wrong-with-migrations-in-rdbms">What’s wrong with migrations in RDBMS?</h2> <p>Switching to auto-migration tools helps to focus on what’s important - e.g. the current relations in the RDBMS - and not on how we ended up with this set of relations. Plus, coherence between the DB and the code (ORMs or schema-definition tools) is now a given.</p> <p>Adoption of auto-migration tools is still very low (even compared to ORMs), and in my opinion that is because of <strong>how this process is organized</strong>.</p> <p>We have dozens of relational DBMS, and yes, they look similar, but there are tons of nuances that make them all different.</p> <p>And we have a number of tools to produce migrations: sqlalchemy+alembic in python, entity framework in .net, a dozen tools for Hibernate in Java, and every community/ecosystem tries to develop a solution that can migrate a large number of deviating databases in a uniform way.</p> <p>No big surprise all of them have very limited success, given that the scope of the project is unlimited.</p> <p>Auto-migration tools like alembic are also tough to develop and maintain:</p> <ul> <li>they need to understand schema definition in a language (in python, in this case)</li> <li>they need to introspect the current schema of the database</li> <li>they need to compute a ’diff’ based on matching these two schema definitions, neither of which was created with automated schema migration in mind</li> <li>they need to deal with all peculiarities of dialects in schema definition and schema migration</li> <li>for all operations alembic creates counterparts in python code, which is like introducing +1 language</li> </ul> <p>The same problems don’t hurt frontend frameworks as much, because there are currently ~2.5 browser engines, and a ton of work is done by standardization committees around js, and … after ditching react/vue you still have to deal with discrepancies, this time yourself. The same problems are faced by IaC tools, and this eventually will become one more (significant) barrier for migration between clouds.</p> <p><img src="/images/migrations/migration-db.png" width="800" /> <small> Comparison of existing solutions (python’s alembic is taken as example), and comparison to this proposal. Note that on the left there are multiple steps that cross the boundary of ORM/migrator or migrator/DB. </small></p> <h2 id="solution">Solution</h2> <ul> <li>schema migration is generated by the database</li> <li>the tool only declares the desired state</li> </ul> <p>This will move responsibility for db-specific migrations to db developers, and that’s for the better.</p> <h3 id="where-to-start">Where to start?</h3> <p>In a minimal implementation, the DB provides a function. The function is given two db <code class="language-plaintext highlighter-rouge">schemas</code> (think of postgresql/oracle/sql server schemas, or individual databases in mysql) and compares them to produce a migration from the observed difference. A migration tool would create a temporary schema with the desired state and call this procedure to produce the migration.</p> <p>That’s not something unseen: pgAdmin has ‘Schema Diff’, SQL Server Data Tools has ‘Schema Compare’. So tools do exist, but they are not part of the database, and they don’t have a uniform interface.</p>
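<p>A rough sketch of what such a uniform interface could look like from the client side (everything below is hypothetical: <code class="language-plaintext highlighter-rouge">generate_migration</code> is not a real function in any DBMS today, it is the function I wish existed):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sketch: the tool declares the desired state, the DB computes the diff.
# 'generate_migration' does not exist in any DBMS; it only illustrates the proposed division of labor.
import psycopg2

desired_ddl = """
    CREATE TABLE person (
        id        bigint PRIMARY KEY,
        full_name text NOT NULL
    );
"""

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # the tool only declares the desired state in a scratch schema ...
    cur.execute("CREATE SCHEMA desired")
    cur.execute("SET search_path TO desired")
    cur.execute(desired_ddl)
    # ... and the database (hypothetically) figures out how to get there
    cur.execute("SELECT generate_migration('public', 'desired')")
    migration_sql = cur.fetchone()[0]
    print(migration_sql)  # review, then apply to the 'public' schema
</code></pre></div></div>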
<h3 id="consequences">Consequences</h3> <p>When we push migrations to database developers…</p> <ul> <li>migrations would be almost immediately available in any programming language</li> <li>in the longer run, we should expect improvements in SDLs (schema definition languages) to account for common migration scenarios.</li> </ul> <details> <summary> Example of these changes </summary> <div> <p>For example, if you start from something like</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">name</span><span class="p">:</span> <span class="n">string</span> </code></pre></div> </div> <p>and migrate it to</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">full_name</span><span class="p">:</span> <span class="n">string</span> </code></pre></div> </div> <p>From the point of view of a migration tool it is not clear that you just renamed a field rather than deleted ‘name’ and created ‘full_name’. Thus an additional technical identifier is necessary, for instance:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">name</span><span class="p">:</span> <span class="n">string</span><span class="p">,</span> <span class="n">oid</span><span class="o">=</span><span class="err">‘</span><span class="mi">7</span><span class="n">dsd8</span><span class="err">’</span> </code></pre></div> </div> <p>to</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Relation</span> <span class="n">Person</span><span class="p">:</span> <span class="n">full_name</span><span class="p">:</span> <span class="n">string</span><span class="p">,</span> <span class="n">oid</span><span class="o">=</span><span class="err">‘</span><span class="mi">7</span><span class="n">dsd8</span><span class="err">’</span> </code></pre></div> </div> <p>Now it is clear that a renaming happened. There are a number of other ways to have smoother support for migrations.</p> <p>However, this will remain just an idea as long as DB developers don’t have to think about migrations.</p> </div> </details> <ul> <li>there are cases when the db just does not provide tools to produce migrations. Like the postgresql enum that just can’t be migrated safely by alembic, so <a href="https://github.com/sqlalchemy/alembic/issues/278">this issue</a> has been unresolved for years, and that’s not on alembic’s side.</li> </ul> <p><br /></p> <p>Well… we can just implement improvements as a stand-alone solution, e.g. within an ORM, right?</p> <p>No, we can’t. As I described, to make it somewhat useful, you need to support numerous dialects, and creating such migration tools is a big job (comparable to creating a new database). Creating such tools for multiple languages is probably more work than just creating a db from scratch.</p> <p><br /></p> <p><br /></p> <p>That’s the main feature I expect from my next db: declarative SDL with schema migrations handled by the DB.
I know that EdgeDB already provides such functionality, but if you know other tools that have this implemented - drop me a letter.</p> Sun, 29 Jan 2023 01:00:00 +0000 https://arogozhnikov.github.io/2023/01/29/migrations.html https://arogozhnikov.github.io/2023/01/29/migrations.html schema migrations databases Delimiter-first code <style> .alex-boxes { display: flex; justify-content: space-around; } .lvl1 { color: darkred; } .lvl2 { color: darkgreen; } .lvl3 { color: darkblue; } .lvl1, .lvl2, .lvl3 { padding-right: 2px; } .lvl1:before, .lvl2:before, .lvl3:before { content: "<lvl"; } .lvl1:after, .lvl2:after, .lvl3:after { content: ">"; } cmnt { /* comments */ display: inline; color: #7f9f7f; } strn { /* string literals */ display: inline; color: #cc9393; } pnct { /* punctuation */ display: inline; color: #41706f; } kwrg { /* kwarg */ display: inline; color: #eee; } hngr { /* hanging elements - bracket / parenthesis / start of multiline */ display: inline; color: #d8f; } caret { display: inline; } caret:after { content: "Ꮖ"; color: #AAA; } .precode { background-color: #2b2b2b; color: #dcdccc; overflow-x: visible; } caret:after { animation: blink-animation 1.5s infinite; } @keyframes blink-animation { 0% { opacity: 0.8; } 10% { opacity: 0.4; } 40% { opacity: 0.4; } 50% { opacity: 0.8; } } </style> <h2 id="summary">Summary</h2> <p>I argue for wider usage of delimiter-first in the code</p> <ul> <li><code class="language-plaintext highlighter-rouge">three friends [tic, tac, toe]</code> becomes <code class="language-plaintext highlighter-rouge">three friends ・tic ・tac ・toe</code>.</li> </ul> <p>A new top-level syntax for programming languages is proposed to show advantages of this method. New syntax is arguably as simple, but more consistent, better preserves visual structure and solves some issues in code formatting.</p> <h2 id="related-comma-first-formatting">Related: comma-first formatting</h2> <p>A well-known proposal is to write commas first in languages like javascript, JSON or SQL, which don’t have trailing commas (JS has these days, but not the other two):</p> <div class="alex-boxes"> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">-- trailing commas </span> <span class="k">SELECT</span> <span class="n">employee_name</span><span class="p">,</span> <span class="n">company_name</span><span class="p">,</span> <span class="n">salary</span><span class="p">,</span> <span class="n">state_code</span><span class="p">,</span> <span class="n">city</span> <span class="k">FROM</span> <span class="nv">`employees`</span> </code></pre></div> </div> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">-- leading commas </span> <span class="k">SELECT</span> <span class="n">employee_name</span> <span class="p">,</span> <span class="n">company_name</span> <span class="p">,</span> <span class="n">salary</span> <span class="p">,</span> <span class="n">state_code</span> <span class="p">,</span> <span class="n">city</span> <span class="k">FROM</span> <span class="nv">`employees`</span> </code></pre></div> </div> </div> <p>While it is <strong>not what I am discussing here</strong>, there is a large overlap. 
This style wasn’t widely adopted, and it is interesting to ask why.</p> <p>All criticism essentially comes down to: 1) tools can solve the common issues this notation solves, and 2) it is not natural / you don’t write text like this.</p> <p>Argument 1) is irrelevant, since tools can handle any notation, even one completely unreadable for humans. Argument 2) is weak; still, similarity to known things drastically simplifies adoption.</p> <p>Over time, however, code culture diverged in multiple ways from ‘usual writing’: we enumerate from zero, write identifiers with underscores, don’t follow the usual rules for quotes, and indent code instead of writing in paragraphs. Once some tools have shown that the alternative way works, further adoption happens more easily.</p> <p>More importantly, argument 2) is really broken:</p> <div class="alex-boxes"> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ・this version ・is far more ・natural </code></pre></div> </div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> than this version・ with a delimiter・ after </code></pre></div> </div> </div> <p>so when it comes to enumerating in a visually distinctive way, ‘usual writing’ uses delimiter-first.</p> <p>I want to point out the source of this controversy with one more example:</p> <pre> You need eggs, cheese, bread. <span style="color: #484"># ok</span> You need ,eggs ,cheese ,bread. <span style="color: #844"># sucks</span> You need a) eggs b) cheese c) bread. <span style="color: #484"># ok</span> You need 1. eggs 2. cheese 3. bread. <span style="color: #484"># ok</span> You need ・eggs ・cheese ・bread. <span style="color: #484"># ok</span> </pre> <p>So the complaints are not that delimiter-first looks wrong - in fact, it is common. They are about commas being used as <em>leading</em> elements, not trailing - a lesson to remember.</p> <p>Both arguments 1) and 2) pinpoint the reasons <em>why things are the way they are</em>: habit and tools. But various code examples (<a href="https://hoffa.medium.com/winning-arguments-with-data-leading-with-commas-in-sql-672b3b81eac9">SQL examples</a> by Felipe Hoffa and <a href="https://gist.github.com/isaacs/357981">JS examples</a> by Isaac Z. Schlueter) show the benefits of delimiter-first.</p> <p>I expected to find in the discussions some code examples where delimiter-last is better, but I didn’t.</p> <p><em>Later addition:</em> the haskell community <a href="https://github.com/tibbe/haskell-style-guide/blob/master/haskell-style.md">adopted</a> leading commas in many projects because trailing commas were not supported at first. Later haskell got support for trailing commas, but the majority now <a href="https://www.reddit.com/r/haskell/comments/hr5c2n/comment/fy25hpm/?utm_source=share&amp;utm_medium=web2x&amp;context=3">votes</a> for the advantages of leading commas.</p> <h2 id="is-delimiter-a-right-word">Is ‘delimiter’ the right word?</h2> <p>A delimiter (just like a separator) separates items, though there is <a href="https://stackoverflow.com/questions/9118769/when-to-use-the-terms-delimiter-terminator-and-separator">no consensus</a> about the terminology.</p> <p>E.g. in <code class="language-plaintext highlighter-rouge">[ 1, 2, 3 ]</code> we have a sequence of tokens:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>start item delimiter item delimiter item end [ 1 , 2 , 3 ] </code></pre></div></div> <p>So what I’m arguing for is having a start-of-item token. 
Like this: <code class="language-plaintext highlighter-rouge">・1 ・2 ・3</code>. Do we need to point an end of last token? As we’ll see next, that’s usually not the case.</p> <p>We have a special word for end-of-item token: terminator, but no startinator or any similar word. I see some irony in this.<br /> <em>(update: find some interesting thoughts I received about this in the comments section)</em></p> <p>Meanwhile, I keep using the word ‘delimiter’ (albeit it’s maybe incorrect)</p> <h2 id="collections-in-html">Collections in HTML</h2> <p>Different markup languages give some food for thought, as they commonly deal with collections.</p> <p>E.g. html allows using start-of-item (<code class="language-plaintext highlighter-rouge">&lt;li&gt;</code>) and skipping end-of-item (<code class="language-plaintext highlighter-rouge">&lt;/li&gt;</code>)</p> <div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;ul&gt;</span> <span class="nt">&lt;li&gt;</span> first item <span class="nt">&lt;li&gt;</span> second item <span class="nt">&lt;/ul&gt;</span> </code></pre></div></div> <h2 id="collections-in-yaml">Collections in YAML</h2> <p>Yaml, which focuses on a hierarchy of collections, also uses a delimiter-first approach.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="s">point </span><span class="m">1</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.1</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.2</span> <span class="pi">-</span> <span class="s">point 1.2.1</span> <span class="pi">-</span> <span class="s">point 1.2.2</span> <span class="pi">-</span> <span class="s">point </span><span class="m">1.3</span> <span class="pi">-</span> <span class="s">point 2</span> </code></pre></div></div> <p>Let me reinterpret this example. <strong>This reinterpretation is important in further discussion</strong>.</p> <p>There are 3 delimiters: <code class="language-plaintext highlighter-rouge">\n-</code>, <code class="language-plaintext highlighter-rouge">\n__-</code> and <code class="language-plaintext highlighter-rouge">\n____-</code> (underscore = whitespace). All three delimiters are distinct, and the whole structure now reads as</p> <pre> <span class="lvl1">1</span>point 1 <span class="lvl2">2</span>point 1.1 <span class="lvl2">2</span>point 1.2 <span class="lvl3">3</span>point 1.2.1 <span class="lvl3">3</span>point 1.2.2 <span class="lvl2">2</span>point 1.3 <span class="lvl1">1</span>point 2 </pre> <p>No end token needed in yaml: the last item ends when a collection ends, i.e. at a delimiter of higher level. 
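<p>A toy sketch of this reading (my own illustration, not part of any yaml tooling): the tree can be recovered from the leading delimiters alone, assuming two-space indents.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Recover the structure purely from leading delimiters (indentation + '-'),
# without parsing anything inside the items.
doc = """\
- point 1
  - point 1.1
  - point 1.2
    - point 1.2.1
    - point 1.2.2
  - point 1.3
- point 2"""

for line in doc.splitlines():
    stripped = line.lstrip(" ")
    level = (len(line) - len(stripped)) // 2 + 1  # 0 spaces: lvl1, 2 spaces: lvl2, ...
    print(f"&lt;lvl{level}&gt; {stripped.lstrip('- ')}")
</code></pre></div></div>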
There is no need to know or parse anything about an internal structure between two <lvl1> tokens.</lvl1></p> <p>Correspondingly, the only expectation we have from contents enclosed between <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;</code> is that there are no tokens <code class="language-plaintext highlighter-rouge">&lt;lvl1&gt;</code> or <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;</code> and that’s it.</p> <p>Intermediate conclusion: delimiter-first is very common, and in markup languages it is even standard (but not in programing languages!)</p> <h2 id="line-should-start-from-n-not-end-with-it">Line should start from <code class="language-plaintext highlighter-rouge">\n</code>, not end with it</h2> <p>This sounds mad (after many years of programming it just should), but see for yourself:</p> <div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Let's assume I've had some very long text ending here. Chapter 2. Let's learn about belonging of indentation elements to logical elements. </code></pre></div></div> <p>Pay attention to the blank line between last line of previous chapter and a header of new line. Undoubtedly, blank line is a part of ‘Chapter 2’ logical element, because empty line focuses our attention on ‘Chapter 2’ label. It is not because we need to end the paragraph.</p> <p>For the same reason, in html additional margins ‘belong’ to headers, not preceding elements.</p> <p>Same for lines: <em>we highlight a beginning of a new line</em>, not an end of previous one. Ironically, that’s in the name: it is newline, not endline.</p> <p>When we turn to code, the same thought is seen with this small snippet, where I compare normal <code class="language-plaintext highlighter-rouge">print</code> with a hypothetical <code class="language-plaintext highlighter-rouge">print</code> that outputs newline before the output:</p> <div class="alex-boxes"> <div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'step1. downloading'</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">download</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">end</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">()</span> <span class="c1"># to keep steps on separate lines </span> <span class="k">print</span><span class="p">(</span><span class="s">'step2. 
processing'</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">''</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">process</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">end</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">()</span> <span class="c1"># to keep steps on separate lines </span></code></pre></div> </div> <center> Code with \n auto-printed after the arguments </center> </div> <div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'step1. downloading'</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">download</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">'step2. processing'</span><span class="p">)</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">process</span><span class="p">(...):</span> <span class="k">print</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s">'.'</span><span class="p">)</span> </code></pre></div> </div> <center> Code with \n auto-printed before the arguments </center> </div> </div> <p>result:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>step1. downloading......... step2. processing......... </code></pre></div></div> <p>Version of code with leading <code class="language-plaintext highlighter-rouge">\n</code> is more straightforward.</p> <p>If things were the opposite way:</p> <div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.......step1. downloaded .......step2. processed </code></pre></div></div> <p>then <code class="language-plaintext highlighter-rouge">\n</code> in the end would be more optimal, but this order is not natural. Normally we first describe the collection, then enumerate items, not vice versa.</p> <h2 id="unixs-newline-in-the-end-of-line">Unix’s newline in the end of line</h2> <p>Unix does not use <code class="language-plaintext highlighter-rouge">\n</code> as a delimiter of lines. Instead, it is more of line-terminator, because file with text <em>should</em> end with <code class="language-plaintext highlighter-rouge">\n</code>. 
Not doing so would break the simplicity of unix tools and the simplicity of definitions, see <a href="https://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline">this SO thread</a>.</p> <p>For the layman, here is why the newline is required in unix (printf, unlike plain echo, interprets the <code class="language-plaintext highlighter-rouge">\n</code>):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf 'good file with newline in the end\n' &amp;&amp; printf 'another good file with newline in the end\n'
good file with newline in the end
another good file with newline in the end
</code></pre></div></div> <p>A missing newline in the first file:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ printf 'bad file without newline in the end' &amp;&amp; printf 'another good file with newline in the end\n'
bad file without newline in the endanother good file with newline in the end
</code></pre></div></div> <p>The problem is in the first file, but it is the second one that gets printed the wrong way. There is no such misattribution issue with newline-first.</p> <p>If it is ok to end each file with <code class="language-plaintext highlighter-rouge">\n</code>, then it is ok to start it with <code class="language-plaintext highlighter-rouge">\n</code>.</p> <p>Having lines start with <code class="language-plaintext highlighter-rouge">\n</code> maintains the simplicity of unix utilities, and is a bit simpler to visualize in an editor.</p> <p>Imagine that in a parallel universe text and binary files differed in the very first character. What science fiction we could live in!</p> <p><strong>Do I really want to change all files to newline-first?</strong> Of course not. But I have to point out that if, in the course of history, files had been newline-first from the start, that would have been a better system.</p> <p>I hypothesize that newline-last comes from unix mainframes: once a line in the shell is entered, it can be passed to the mainframe for processing. I can’t confirm this, but it sounds plausible. If so, time has shown it to be the wrong choice: all the messengers these days make a distinction between starting a new line and sending the message (enter vs. shift+enter). Jupyter knows that, IDEs know that, messengers know that. Terminals still don’t know that.</p> <h2 id="using-indentation-to-structure-code">Using indentation to structure code</h2> <p>Code indentation is available in all major languages, but python (and scala 3, F#, nim, haskell, …) relies on indentation to define logical structure.</p> <p>And that works very well. 
Let’s see how we can re-interpret the python code the way we did with yaml</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MyClass</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">pass</span> <span class="k">def</span> <span class="nf">some_method</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">pass</span> </code></pre></div></div> <p>now we reinterpret the structure with <code class="language-plaintext highlighter-rouge">&lt;lvl1&gt;=\n</code>, <code class="language-plaintext highlighter-rouge">&lt;lvl2&gt;=\n____</code>, <code class="language-plaintext highlighter-rouge">&lt;lvl3&gt;=\n________</code>.</p> <pre> <span class="lvl1">1</span>class MyClass <span class="lvl2">2</span>def __init__(self) <span class="lvl3">3</span>pass <span class="lvl2">2</span> <span class="lvl2">2</span>def some_method(self): <span class="lvl3">3</span>pass </pre> <p>so, we see very basic organization of code is available just by looking at sequence of start tokens (which simply mirrors indentation).</p> <h2 id="some-problems-with-multiline-strings">Some problems with multiline strings</h2> <p>There are places where python allows code to ‘escape’ indentation: continuation of previous line (explicit with \ or implicit with different brackets) and multiline strings.</p> <p>Continuations are ‘solvable’ with code formatting tools, but not multiline literals:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">True</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">""" This is python's multiline string """</span><span class="p">)</span> </code></pre></div></div> <p>Output (###### just shows where the line ends):</p> <pre class="precode"> <cmnt>######</cmnt> This is python's<cmnt>######</cmnt> multiline string<cmnt>######</cmnt> <cmnt>######</cmnt> </pre> <p>To get proper output we need to break visual alignment:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">True</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">"""This is python's multiline string """</span><span class="p">)</span> <span class="c1"># takes effort to realize that the same block of code continues here </span> <span class="k">return</span> <span class="bp">False</span> </code></pre></div></div> <p>There are problems with multiline: first line, last line and indentation. Multilines in javascript/go face all the same issues, so it is a generic problem.</p> <p>I think there is a way to solve this issue too, and it will be discussed.</p> <h2 id="delimiter-first-pseudo-python">Delimiter-first pseudo-python</h2> <p>To better demostrate how all these ideas come together, I’ll imagine a new language (pseudo-python). To focus only on syntax changes, I’ll keep all other aspects of the language the same.</p> <p>I will consider an artificially complicated example. 
It includes different arguments, list, empty list, string, multiline string, method chaining, multiline logical arithmetics, few or no arguments</p> <p>Goal is to demonstrate that any wild mix is representable and does not produce mess.</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prepare_message</span><span class="p">(</span> <span class="n">title</span><span class="o">=</span><span class="s">"Hey {}, ready for Christmas?"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user_name</span><span class="p">),</span> <span class="n">email</span><span class="o">=</span><span class="n">email</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="sa">f</span><span class="s">"""Reminder: please clean your chimneys! Oh, and prepare "Santa Landing Spot" on your roof Thank you </span><span class="si">{</span><span class="n">user_name</span><span class="si">}</span><span class="s"> for cooperation,</span><span class="se">\n</span><span class="s">Santa Corp. """</span><span class="p">,</span> <span class="n">additional_sections</span><span class="o">=</span><span class="p">[</span> <span class="n">get_current_promotions</span><span class="p">(</span><span class="n">n_promotions</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span> <span class="n">get_recent_news</span><span class="p">(),</span> <span class="p">],</span> <span class="n">unsubscribe_link</span><span class="o">=</span><span class="n">generate_unsubscribe_link</span><span class="p">(</span> <span class="n">email</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="n">message</span><span class="p">,</span> <span class="o">**</span><span class="n">unsubscribe_settings</span><span class="p">,</span> <span class="p">),</span> <span class="n">attachments</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">).</span><span class="n">schedule_for_submission</span><span class="p">(</span> <span class="n">holidays_queue</span><span class="p">,</span> <span class="n">important</span><span class="o">=</span><span class="n">user_is_santa</span> <span class="o">|</span> <span class="n">user_is_deer</span> \ <span class="o">|</span> <span class="n">user_previously_had_issues_with_christmas_delivery</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <pre class="precode"> prepare_message<hngr>(</hngr> <pnct>,</pnct> <kwrg>title=</kwrg><strn>"Hey {}, ready for Christmas?"</strn>.format(user_name) <pnct>,</pnct> <kwrg>email=</kwrg>email <pnct>,</pnct> <kwrg>body=</kwrg><hngr>f"""</hngr> <strn>"Reminder: please clean your chimneys! 
</strn> <strn>" </strn> <strn>"Oh, and prepare "Santa Landing Spot" on your roof </strn> <strn>" </strn> <strn>"Thank you {<kwrg>user_name</kwrg>} for cooperation,\nSanta Corp.</strn> <pnct>,</pnct> additional_sections=<hngr>[</hngr> <pnct>,</pnct> get_current_promotions(n_promotions=4) <pnct>,</pnct> get_recent_news() <hngr>]</hngr> <pnct>,</pnct> unsubscribe_link=generate_unsubscribe_link<hngr>(</hngr> <pnct>,</pnct> email <pnct>,</pnct> message=message <pnct>,</pnct> **unsubscribe_settings <hngr>)</hngr> <pnct>,</pnct> attachments = [] <hngr>)</hngr> <pnct>\</pnct>.schedule_for_submission<hngr>(</hngr> <pnct>,</pnct> holidays_queue <pnct>,</pnct> important=user_is_santa | user_is_deer \| user_previously_had_issues_with_christmas_delivery <hngr>)</hngr> </pre> </div> <p>I welcome you to study this example for a minute. Structure overall did not change much. Note differences in line breaks <code class="language-plaintext highlighter-rouge">\</code> and multiline strings.</p> <p>An important distinction: leading commas get the same role as hyphens in yaml: they define structure, their position is not arbitrary.</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># normal python # this is legal code </span><span class="k">print</span><span class="p">(</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># proposed # this is incorrect code </span><span class="k">print</span><span class="p">(</span> <span class="p">,</span> <span class="mi">1</span> <span class="p">,</span> <span class="mi">2</span> <span class="p">)</span> </code></pre></div> </div> </div> <p>In new code there is no need in closing brackets (see that yourself by staring at the code more!). <br /> So let’s remove closing elements:</p> <div class="alex-boxes"> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prepare_message</span><span class="p">(</span> <span class="n">title</span><span class="o">=</span><span class="s">"Hey {}, ready for Christmas?"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">user_name</span><span class="p">),</span> <span class="n">email</span><span class="o">=</span><span class="n">email</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="sa">f</span><span class="s">"""Reminder: please clean your chimneys! Oh, and prepare "Santa Landing Spot" on your roof Thank you </span><span class="si">{</span><span class="n">user_name</span><span class="si">}</span><span class="s"> for cooperation,</span><span class="se">\n</span><span class="s">Santa Corp. 
"""</span><span class="p">,</span> <span class="n">additional_sections</span><span class="o">=</span><span class="p">[</span> <span class="n">get_current_promotions</span><span class="p">(</span><span class="n">n_promotions</span><span class="o">=</span><span class="mi">4</span><span class="p">),</span> <span class="n">get_recent_news</span><span class="p">(),</span> <span class="p">],</span> <span class="n">unsubscribe_link</span><span class="o">=</span><span class="n">generate_unsubscribe_link</span><span class="p">(</span> <span class="n">email</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="n">message</span><span class="p">,</span> <span class="o">**</span><span class="n">unsubscribe_settings</span><span class="p">,</span> <span class="p">),</span> <span class="n">attachments</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">).</span><span class="n">schedule_for_submission</span><span class="p">(</span> <span class="n">holidays_queue</span><span class="p">,</span> <span class="n">important</span><span class="o">=</span><span class="n">user_is_santa</span> <span class="o">|</span> <span class="n">user_is_deer</span> \ <span class="o">|</span> <span class="n">user_previously_had_issues_with_christmas_delivery</span><span class="p">,</span> <span class="p">)</span> </code></pre></div> </div> <pre class="precode"> prepare_message<hngr>(</hngr> <pnct>,</pnct> <kwrg>title=</kwrg><strn>"Hey {}, ready for Christmas?"</strn>.format(user_name) <pnct>,</pnct> <kwrg>email=</kwrg>email <pnct>,</pnct> <kwrg>body=</kwrg><hngr>f"""</hngr> <strn>"Reminder: please clean your chimneys! </strn> <strn>" </strn> <strn>"Oh, and prepare "Santa Landing Spot" on your roof </strn> <strn>" </strn> <strn>"Thank you {<kwrg>user_name</kwrg>} for cooperation,\nSanta Corp. </strn> <pnct>,</pnct> additional_sections=<hngr>[</hngr> <pnct>,</pnct> get_current_promotions(n_promotions=4) <pnct>,</pnct> get_recent_news() <pnct>,</pnct> unsubscribe_link=generate_unsubscribe_link<hngr>(</hngr> <pnct>,</pnct> email <pnct>,</pnct> message=message <pnct>,</pnct> **unsubscribe_settings <pnct>,</pnct> attachments = [] <pnct>\</pnct>.schedule_for_submission<hngr>(</hngr> <pnct>,</pnct> holidays_queue <pnct>,</pnct> important=user_is_santa | user_is_deer \| user_previously_had_issues_with_christmas_delivery </pre> </div> <p>Don’t pay much attention to number of lines - denser code is a byproduct, not a goal.</p> <p>Further I’ll discuss several advantages of this syntax.</p> <h2 id="new-multiline-strings">New multiline strings</h2> <pre class="precode" style="overflow-x: scroll;"> print<hngr>(f"""</hngr> <strn>"This is new</strn> <strn>"multiline string</strn> </pre> <p>output:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This is new multiline string </code></pre></div></div> <p>Everything looks perfect, multiple issues are solved in one shot. But … with a minor catch: that’s how output looks like in raw form: <code><cmnt>\n</cmnt>This is new<cmnt>\n</cmnt>multiline string</code> (i.e. it is newline-first). Technically, one can produce newline-last outputs, but that’s artificial. See the elegance of match between delimiter-first and newline-first approach: delimiter just gets replaced with newline. 
That’s an operation that one can visually imagine by shifting all lines to the left.</p> <p>One more example:</p> <pre class="precode"> print<hngr>(f"""</hngr> <strn>"you can place anything here: ' '' ''' " "" """ f""" etc etc.</strn> <cmnt># and you can put comments in the middle of multiline</cmnt> <strn>"multiline string can't be broken or terminated by any sequence within a line </strn> </pre> <p>Now, python literals do not work like that.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">''' """ and '''</span> <span class="n">should</span> <span class="n">be</span> <span class="n">escaped</span> <span class="p">(</span><span class="n">otherwise</span> <span class="n">interpreted</span> <span class="k">as</span> <span class="n">literal</span> <span class="n">terminator</span><span class="p">)</span> <span class="s">''' '''''</span> <span class="s">''' # this trick (available in markdown) does not work in python '''''</span> </code></pre></div></div> <h2 id="new-parsing">New parsing</h2> <p>In contrast to normal python, line alone does not inform if the instruction is complete, or it should be continued on the next line. Parsing one more line is required to confirm that current code section is complete (only prefix of next line should be parsed, to be more precise).</p> <p>In this approach top-level parsing is quite ignorant to language details, and it relies on the same visual cues as we humans do: parser does not need to analyze line in detail to figure out if the instruction continues or not.</p> <p>Let me ‘parse’ this example:</p> <div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Delimiter Token class Rest of line <span class="nt">&lt;lvl1-instr</span> <span class="nt">&gt;</span>prepare_message( , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>title="Hey {}, ready for Christmas?".format(user_name) , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>email=email , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>body= f""" " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Reminder: please clean your chimneys! " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span> " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Oh, and prepare "Santa Landing Spot" on your roof " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span> " <span class="nt">&lt;lvl3-literal</span> <span class="nt">&gt;</span>Thank you {user_name} for cooperation,\nSanta Corp. 
, <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>additional_sections=[ , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>get_current_promotions(n_promotions=4) , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>get_recent_news() , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>unsubscribe_link=generate_unsubscribe_link( , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>email , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>message=message , <span class="nt">&lt;lvl3-item</span> <span class="nt">&gt;</span>**unsubscribe_settings , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>attachments = [] \ <span class="nt">&lt;lvl1-continue&gt;</span>.schedule_for_submission( , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>holidays_queue , <span class="nt">&lt;lvl2-item</span> <span class="nt">&gt;</span>important=user_is_santa | user_is_deer \| <span class="nt">&lt;lvl2-continue&gt;</span>| user_previously_had_issues_with_christmas_delivery </code></pre></div></div> <p>By looking only at the sequence of delimiters (there are several subtypes of them), one can deduct limits of every code block / call / literal, i.e. derive top-level structure of the program. Parser now deals with a simpler task of checking that elements fit this pre-defined structure, and can point places where ‘structure’ does not match ‘content’.</p> <p>Good bye old times when one deleted bracket caused complete rebuild of AST and numerous errors.</p> <h2 id="new-code-suggestions">New code suggestions</h2> <p><em>This paragraph was added later, to unwrap the point that was missed by many readers.</em></p> <p>Parsing of correct code is not a problem since 1960s or so. Real challenge is on-the-fly parsing of partially incorrect and quickly-changing code in the process of editing.</p> <p>Say I’m a complete novice and typed something wrong:</p> <div class="alex-boxes"> <pre class="precode"> def myfunction( var1 = 'some default value', var2 = (1, (2, 3), ) var3 = "variable number 3" var4 = """ Simple unfinished multiline string """ + \ var<caret></caret> var5 = ()) </pre> </div> <p>what should be autosuggested? var1/2/3/4? or nothing? Which would be more helpful?</p> <p>How to inform user which places should be fixed? VS Code blames bracket on first line saying it is not closed (while it is closed!) and last line for missing colon (no, I don’t want colon there). Pycharm’s diagnostic messages are slightly better, but it blames line with var3 (which is completely ok).</p> <p>Now, in pseudo-python there is no way to ‘escape’ indentation and thus code analysis can rely on indentation. And it is immediately deducible that lines with var2 and var5 have problem, and indent of var3 is incorrect (since colon is missing on previous line).</p> <p>Autosuggestion even in code with multiple unfinished places would be still useful (in similar scenario in pseudo-python it still can suggest var3/var4, and depending on tolerance additionally var1/var2). Currently tools don’t suggest anything.</p> <p>As I mentioned, AST undergoes small changes during editing, thus providing highly effecient autosuggestion, code analysis, and highlighting for such language would be simpler, much simpler.</p> <h2 id="new-editing">New editing</h2> <div class="alex-boxes"> <div> <p>Normal python. 
<br /> suppose you want to start a list of arguments</p> <pre class="precode"> print(<caret></caret>) </pre> <p>after you hit enter in IDE:</p> <pre class="precode"> print( <caret></caret> ) </pre> <p>then you type argument and comma. <br /> Ready to proceed</p> <pre class="precode"> print( 42, <caret></caret> ) </pre> <p>Done? Arrow down + enter</p> <pre class="precode"> print( 42, 43, ) <caret></caret> </pre> <p>Forgot something? <br /> Double arrow up, <br /> move cursor to end of line,<br /> enter</p> <pre class="precode"> print( 42, 43, <caret></caret> ) </pre> </div> &nbsp; <div> <p>Delimiter-first pseudo-python. <br /> suppose you want to start a list of arguments</p> <pre class="precode"> print(<caret></caret>) </pre> <p>after you hit enter in IDE comma is auto-added:</p> <pre class="precode"> print( , <caret></caret> </pre> <p>you type only argument. <br /> Ready to preceed</p> <pre class="precode"> print( , 42 , <caret></caret> </pre> <p>Done? Enter + shift-tab</p> <pre class="precode"> print( , 42 , 43 <caret></caret> </pre> <p>Forgot something? Tab</p> <pre class="precode"> print( , 42 , 43 , <caret></caret> </pre> </div> </div> <p>The process of editing such structures was polished with hierarchical lists in word and other text processors.</p> <p>Below is an animated example from workflowy (taken from <a href="https://www.process.st/take-better-notes/">post</a> by B. Brandall): <img src="https://www.process.st/wp-content/uploads/2016/01/ezgif.com-crop-1.gif" /></p> <p>Even minimalist note-taking apps these days recognize the importance of hierarchical organization. Their interface focuses on effectively traversing and modifying this structure.</p> <p>But with code - this extremely structured and standardized pieces of linked information - we continue the game of imitation: ‘hey, that’s just text files, you can use notepad here!’.</p> <h2 id="new-versioning">New versioning</h2> <p>Missing trailing commas make diffs a bit annoying because of including an additional line.</p> <p>New syntax has this solved. In other aspects versioning should work the same.</p> <h2 id="new-formatting">New formatting</h2> <p>The goal of formatting is to produce a visual code structure that is easy to read, as if you already see all main components without reading anything.</p> <p>New syntax enforces this, and leaves fewer degrees of freedom. Writing something non-readable would be challenging… I suppose.</p> <p>Role of formatters thus would be minor, or they can be skipped.</p> <h2 id="limitations">Limitations</h2> <p>First, I did not try to solve following perceptual problems:</p> <ul> <li>commas are leading, and I’ve mentioned that this was a problem for comma-first formatting</li> <li>open brackets without a matching pair create visual discomfort. Also my eyes already trained to focus on closing brackets, but proper color scheme seems to solve this</li> </ul> <p>This post is already long, and leaving things closer to python simplifies example. I think both points can be improved, and feel free to post your ideas on this.</p> <p>Second, I intentionally focused only on improving multi-line constructs, but single-line collections were left untouched. That does not mean delimiter-first does not work there, but scale of necessary changes is just too high to justify gains. 
At least for now.</p> <h2 id="if-you-made-it-this-far">If you made it this far</h2> <p>Wow, thank you!</p> <p>I hope an adventure was interesting and slightly mind blowing.</p> <p>Don’t be too surprised if this proposal evokes “hey this looks wrong, just plain wrong” reaction. <br /> After all, ideas we enjoy these days: enumeration from zero, using registers in names, structural programming, mandatory formatting, and even python’s approach to defining code blocks with indentation — every single one of them were met with a storm of criticism.</p> <div style="text-align: center; font-size: 40px; padding: 110px">👋</div> <h3 id="comments-">Comments 💬</h3> <ul> <li> <p>I received and collected a number of links for using delimiter-first in different contexts (lisp/scheme, formulas, translatable languages), will organize that material when I get time.</p> </li> <li> <p>Isaac Z. Schlueter advised there is a term ‘initiator’, used in <em>“… specification discussion threads, where it’s common to dig deep into the particulars of parsing semantics. Very much a ‘deep in the weeds’ kind of technical term.”</em> <br /><br /> In the context of parsing I found the word ‘initiator’ in several papers, and only one mention on stackoverflow, so I’ll stick to using word ‘delimiter’.</p> </li> <li> <p>Other options mentioned in discussions: introducer, starter</p> </li> <li> <p>Peter Hilton noticed that <em>“… startinators in prose usually called bullets. Some English-language style guides even treat the following punctuation as equivalent.</em></p> <p>Brilliantly Wrong — Alex Rogozhnikov’s blog about math, machine learning, programming, physics and biology.*</p> <p>Brilliantly Wrong — Alex Rogozhnikov’s blog about:</p> <ul> <li>math</li> <li>machine learning</li> <li>programming</li> <li>physics</li> <li>biology.</li> </ul> <p><em>Note the bullet list’s trailing full stop (period). It’s still one punctuated sentence.”</em></p> <p>Indeed, name ‘bullet’ sounds very appropriate when discussing code written in delimiter-first style. From parsing side, I don’t feel it’s a good partner to word ‘terminator’. <br /><br /></p> </li> <li> <p>Thanks to Alexander Molchanov for proofreading, improving text, and leaving comments.</p> </li> <li> <p>Question: “Who did you write this for?”</p> <p>I believe that’s a better way to structure code (for readability, editing, and better language tools). Based on what I’ve learnt so far, I am sceptical about integration of additional syntax to existing languages: two notations side-by-side are worse for users than one. From the perspetive of language maintainers, all tooling would need to deal with two dialects, which is also a downgrade.</p> <p>So main audience are <em>authors of new programming languages.</em> However, it is not only authors - to get adopted, any new feature should get at least minimal support from community. That’s where this page can help. 
So more generally, I target people <em>interested in experimenting around new programming languages</em>, and interested in challenging status-quo.</p> </li> <li> <p>Question: “But how will you represent a couple of multiline lists next to each other?”</p> <p>This case is handled normally:</p> <div class="alex-boxes"> <p></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> f([ a, b, ], [ c, d, ]) </code></pre></div> </div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> f([ , a , b \,[ , c , d </code></pre></div> </div> <p></p> </div> <p>For the record, I’d prefer to introduce variables in any case.</p> </li> <li> <p>Question “Don’t you think that current tools have already solved the issues solved by delimiter-first?”</p> <p>I developed a simple 4-line code with missed comma that is compeletely fine for flake8 and ruff. And black formatter considers it well-formatter. It took me less than a minute to develop this example, and if you start thinking, I’m sure you’ll find a handful of similar cases. Authors of one utitity that is supposed to mark these cases <a href="https://blog.devgenius.io/5-of-666-python-repos-had-comma-typos-including-tensorflow-and-pytorch-sentry-and-v8-7bc3ad9a1bb7">claim</a> that ‘5% of 666 Python repos had comma typos (including Tensorflow, and PyTorch, Sentry, and V8)’.</p> <p>We can continue patching problems with even more tools and more special cases, but I’d better have it solved by design. Core point is - <em>delimiter-last is flawed</em>. Main visual cues (indentation) is on the left, while there are still control sequences that can override indentation, and they are on the right. For this reason <code class="language-plaintext highlighter-rouge">\</code> in the end of line is a bad choice.</p> </li> </ul> <!--- TODO mention differences in code suggestions TODO jtree allows conversion between syntaxes https://jtree.treenotation.org/designer/#hakon-readme lisp version of syntax https://gist.github.com/armstnp/bb2a88bcb053d2195f42c60a0cf15a65 lisp proposals, more (Via Nikishkin) https://srfi.schemers.org/srfi-49/ https://srfi.schemers.org/srfi-110/ one more version of lisp: http://calcit-lang.org/ elm https://elm-lang.org/docs/style-guide ocaml https://github.com/ocaml-ppx/ocamlformat/blob/main/test/failing/gen/gen.ml Ruby has no-delimiter lists (not so interesting) Coffeescript and Civet https://github.com/DanielXMoore/Civet "Coffeescript for typescript" http://www.rebol.com/pre-view.html leslie lamport and formulas https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-119.pdf --> Tue, 29 Nov 2022 01:00:00 +0000 https://arogozhnikov.github.io/2022/11/29/delimiter-comes-first.html https://arogozhnikov.github.io/2022/11/29/delimiter-comes-first.html delimiter separator Things I wish someone told me about microscopy <h2 id="if-you-want-to-learn-some-culprits-of-microscopy">If you want to learn some culprits of microscopy</h2> <p>… you’d better watch this video by microbehunter, because rest of the post is view of ML person on things you should (not) expect from lab microscopy during experiment design.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/Ir9TGt6zljI" frameborder="0" allow="clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""> </iframe> <p><strong>Warning:</strong><br /> This post contains reflections and is not meant to be an easy reading.<br /> This post assumes that you understand wave mechanics.</p> 
<p>I have a nice general background in physics, however just that was clearly insufficient — a lot of specific knowledge that is hard to deduce from first principles.</p> <h2 id="general-remarks">General remarks</h2> <ul> <li>there are myriads of different microscopes from trivial ones for mid-schools to EM (electron microscopes) and light-sheets <ul> <li>Ranges of prices from hundreds of dollars to millions. In some applications 100x cheaper microscope can still be more useful</li> <li>Manual and automated. Terribly expensive still may be non-automated</li> </ul> </li> <li>microscopes are typically designed to be modular, many parts are interchangeable; there is still vendor- and format- specificity</li> <li>when a microscope is automated, that typically means that it can at least move its specimen (yes, specimen is moved, microscope’s camera and light path are usually steady) <ul> <li>it may or may not be able to switch excitation / emission filters automatically, so ‘automated’ is not a descriptive word. Ask about what is automated</li> </ul> </li> <li>while typically microscopes are just ‘make a photo with light’ devices, software for microscopes is a tough topic. <ul> <li>manufacturers desire to provide a visual interface with windows and buttons, and mapping all countless scenarios to a sequence of buttons is … challenging</li> <li>as a result both API and interface are far from satisfactory</li> </ul> </li> <li>light source is not moved with specimen, but instead aligned and fixed relative to camera. <ul> <li>You can’t image with different shifts but ‘same light position’</li> </ul> </li> <li>immersion is quite critical when going to higher resolutions (above 20x)</li> <li>objective on a microscope has everything aligned and focusing depth can be adjusted or changed. (objectives are also pretty expensive). That’s not your smartphone’s refocusing camera. So 40x on your microscope means that object of size n<em>m in focusing plane (which is fixed) literally projects in 40n</em>40m on detector plane. To complete arithmetics you only need physical size of pixel in a camera - and voila - you have ‘size of specimen pixel’.</li> <li>for a long time I was surprised that biologists are so limited by the number of fluorescent channels they can image simultaneously (emission spectra overlap, so you want them to be separable). <ul> <li>At the same time they don’t switch to quantum dots (which have much narrower emission spectra). Permeability may be an issue here</li> <li>And they don’t try to go significantly outside of visible spectrum. <ul> <li><em>probably</em> this is due to objectives - correcting aberrations for wide spectrum range is tough</li> </ul> </li> <li>Another factor is penetration depths variability (even within water) for different wavelengths</li> <li>You can take images in IR, but going to deep IR is ultra-rare</li> </ul> </li> <li>there is an uncountable amount of imaging techniques. <br /> Dozens of them with all their variations, with all covering only some part of information. 
<ul> <li>Very hard to combine many in the same system (while some useful combinations exist)</li> <li>Dream of machine learner - having different imaging systems for the same specimen - can be implemented only in specific cases</li> </ul> </li> <li>more powerful microscope requires identical efforts on sample/environment side <ul> <li>Higher magnification requires better compensation of motion</li> <li>More sensitive to optical properties means you’ll see more artifacts from anything in your system. Or maybe plates or slides. <ul> <li>E.g. if method can detect birefringence, any plastic labware is likely to add some birefringence patterns</li> </ul> </li> </ul> </li> <li>well edges introduce significant effects, plate edges also introduce some effects for imaging (both also affect biological processes)</li> <li><a href="https://www.youtube.com/user/iBioEducation">ibiology</a> provides an amazing combination of theory and practice of imaging. It was incredibly helpful</li> <li>imaging protocols are hardly readable. Too many things and parameters, no deduplication. <ul> <li>They remind completely unwrapped low-level code for execution by machine, not ‘settings’.</li> <li>I’ve told about software being tough here, right? There are issues with interfaces on all levels</li> </ul> </li> <li>imaging time is a real issue <ul> <li>“oh, we can just increase stack size” is correct solution to many questions in theory, but not in practice</li> </ul> </li> <li>reproducible focusing may be an issue</li> <li>richest sources of information are available only for ex-vivo cells and tissues</li> <li>anything that produces nice high-resolution images will be called by biologist “confocal” no matter if confocality is actually used there :)</li> <li>believe data, always believe data. If you think something is misaligned - it almost surely is.</li> </ul> <h2 id="contrasting-methods">Contrasting methods</h2> <iframe width="560" height="315" src="https://www.youtube.com/embed/FUa1GTc69y4" f="" rameborder="0" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""></iframe> <p>The main way to achieve contrast is by using monochromatic (i.e. laser) light, and achieve shift in phase between “rays” started from the same source. Shift in phase affected by specimen provides a contrast visible by a simple detector.</p> <ul> <li>Simplest example is <a href="https://www.olympus-lifescience.com/en/microscope-resource/primer/techniques/dic/dicconfiguration/">DIC</a> (differential interference contrast) - light is split in two parts, which come through neighboring positions in slide</li> <li>Another example is polarization contrast, where light comes though the same specimen but due to <a href="https://en.wikipedia.org/wiki/Birefringence">birefringence</a> of some materials different polarizations come with different speed, which produces retardation of one polarization</li> <li><a href="https://www.microscopyu.com/tutorials/comparison-of-phase-contrast-and-dic-microscopy">Phase contrast</a> organizes interference between scattered and passed through waves. Phase delay adds phase to scattered light. Simplest to setup of these three.</li> </ul> <p>An important property of contrasting optical paths is that optical path lengths for light arriving to the same location should be identical (unless sample perturbations prevent this). 
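<p>(For reference: the optical path length along a ray is the refractive index integrated over geometric length, OPL = ∫ n ds, which is simply the travel time multiplied by c. Equal optical paths through the instrument mean the waves accumulate the same phase, so any residual phase difference comes from the specimen itself.)</p>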
Optical path is not distance, but time taken by light to travel along a trajectory.</p> <p>That’s a simple thought and sounds like a natural, but when you look at optical system with all its lenses, you should realize it’s non-trivial behavior.</p> <h2 id="amazing-variability-of-imaging-techniques">Amazing variability of imaging techniques</h2> <p>Microscopy world is very limited within one lab (even optical lab) but whole large world of microscopy is so rich and interesting out there.</p> <ul> <li>Multi-photon imaging <ul> <li>deliver energy required for excitation with several photon simultaneously</li> <li>requires an expensive laser, but imaging is simple</li> <li>can go quite deep into tissue</li> <li>can’t guarantee narrow emission spectra because different number of ph</li> </ul> </li> <li>Electron microscopy <ul> <li>super precise (it’s completely different part of spectra)</li> <li>ex-vivo samples only</li> <li>requires isolated rooms and strong movement compensation</li> <li>not something you will simply hold in a lab, but provides extremely detailed image</li> </ul> </li> <li>LSM: light-sheet microscopy is a demonstration that light source does not have to be on the same axis, while it sounds like an axiom after lab scopes <ul> <li>LLSM is times cooler</li> </ul> </li> <li> <p>TIRF (total internal reflection) microscopy when combined with photo-activable fluorescent proteins (PALM/STORM) can get to tracking trajectories of individual proteins (while still using visible range spectrum).</p> </li> <li> <p>Another interesting idea is FRET - allows detecting interaction between single molecules if those have appropriate fluorescent tags. <br /> Photons emitted by one antibody are absorbed by the second one if molecules are in proximity of each other.</p> </li> <li><a href="https://www.youtube.com/watch?v=HJnNJIUPm4s">optical coherence tomography</a> OCT <ul> <li>has nothing to do with tomography and even works based on reflected light</li> <li>widely used for retina scanning</li> </ul> </li> <li><a href="https://www.youtube.com/watch?v=tTHvVCPaeWQ">Ghost imaging</a>. Not-yet-there, but idea is mind-blowing <ul> <li>entangle two photons</li> <li>the first one hits the target, while the second goes to detector</li> <li>entanglement allows partially reconstructing properties of a photon that hit the target</li> <li>there are classical variations as well</li> </ul> </li> <li>Structured illumination (SIM) <ul> <li>Moir patterns + a bit of computational magic allows you going slightly above optical resolution limit</li> </ul> </li> </ul> <p>You may want to check this video to orient yourself a bit and get a sense of what sounds appropriate for your case.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/01v2kR8dlnQ" frameborder="0" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture" allowfullscreen=""></iframe> Sun, 01 Nov 2020 12:00:00 +0000 https://arogozhnikov.github.io/2020/11/01/microscopy.html https://arogozhnikov.github.io/2020/11/01/microscopy.html Microscopy Don't write command-line interfaces (generate them) <p style="color: #666677"> (a friendly reminder that reading post before commenting is a great idea. 
Some people see this as an argument for GUI, but it is completely misleading) </p> <p>A favourite activity of fresh github-bers is writing CLI (command-line interfaces) for anything.</p> <p>Every programmer uses CLI <strong>(true)</strong>, so writing CLI makes you more professional <strong>(false)</strong>.</p> <p>CLIs are required in everyday maintenance, env/pipeline/db management, and checking this and that. It is a glue to keep different subsystems together, but hardly CLI is a reliable programming interface. Progress in software engineering left bash calls far behind in terms of reliability and flexibility.</p> <h3 id="whats-wrong-with-writing-cli-as-an-interface">What’s wrong with writing CLI as an ‘interface’?</h3> <ul> <li>CLI support is an additional logic in your program that makes <strong>no real work</strong></li> <li>While typically being dumb, CLI logic is frequently <strong>filled with <a href="https://github.com/search?q=bug+command+line&amp;type=Issues">mistakes</a></strong>; thus it requires constant maintenance and an additional testing</li> <li><strong>Error (exception) handling</strong> with CLI is very poor. Another layer of (bad faulty) code is required to make it possible</li> <li><strong>Scaling/extending</strong> is not as easy compared to programming language APIs (see example in the end)</li> <li>CLIs are detached from essential code, which in most cases is a disadvantage. <details> <summary>more on this</summary> <p>Forcing users to use CLI means: stay away from my code, you’d better not work with it. Maybe that’s ok — but if users can code a bit (otherwise why do they use CLI?), that’s not an optimal way — if something went wrong, do you want to directly see the code+calls that failed or do you want to add several minutes/hours walking thru command args parsing machinery someone else wrote? <br /> While being questionable in small projects, a virtual fence becomes more and more obvious when parsing logic (validation, transformation, routing) grows.</p> </details> </li> </ul> <h3 id="writing-command-line-interfaces-the-right-way">Writing command-line interfaces the right way</h3> <ul> <li>write functions</li> <li>leave CLI-fication to a special package</li> </ul> <h3 id="which-tool-to-use-for-writing-command-line-interfaces-in-python">Which tool to use for writing command-line interfaces in Python?</h3> <p>Here are the options that you should consider …</p> <ul> <li><a href="https://docs.python.org/3/library/argparse.html">argparse</a> (or ancient optparse)</li> <li><a href="https://click.palletsprojects.com/en/7.x/">click</a></li> <li><a href="http://docopt.org/">docopt</a></li> <li><a href="https://github.com/google/python-fire">python-fire</a></li> </ul> <p>… <strong>deprecated</strong>. Yes, consider them deprecated.</p> <p>Prefer <a href="https://hugapi.github.io/hug/">hug</a> and <a href="https://github.com/tiangolo/typer">typer</a>. 
Example for the latter:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import typer
from pathlib import Path

app = typer.Typer()

@app.command()
def find_dragon(name: str, path: Path, min_age_years: int = 200):
    ...  # implementation goes here

@app.command()
def feed_dragon(dragon_name: str, n_humans: int = 3):
    ...  # implementation goes here

if __name__ == "__main__":
    app()
</code></pre></div></div> <p>Now it’s ready to be invoked from the shell:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python example.py find_dragon 'Drake' --path /on/my/planet
</code></pre></div></div> <p>That’s it! Types are parsed, checked and converted. Defaults and descriptions are picked up from the function itself. It even provides bash completions you can install. The best part: you wrote no CLI code for that!</p> <h3 id="-i-need-to-invoke-my-code-from-bash-with-complex-parameterization">— I need to invoke my code from bash with complex parameterization</h3> <p>Exact wording of this question may also mention job schedulers, calls on remote machines and docker run/exec — common reasons that force people to write CLIs.</p> <p>The previous recipe may not work in this case; you have two options.</p> <p><strong>Option A.</strong></p> <p>Read the documentation for the <em>deprecated</em> packages, write a ton of code for conversion, validation, testing and mocking. Add documentation, make presentations about CLI logic and neat uses of bash, get promoted to Senior CLI Architect, give talks and interviews. 
Some junior in your company discovers <em>option B</em> and ruins your career.</p> <p><strong>Option B.</strong></p> <p>When there is a lot to configure, don’t try to build a large parsing machinery to handle all the cases; just <strong>use code</strong> to parameterize calls:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -c "
from mymodule import set_dragon_feeding_schedule, Creatures, Date
set_dragon_feeding_schedule(
    feeding_times=['10:00', '14:00', '18:00'],
    dishes={Creatures.Tiger: 2, Creatures.Human: 1},
    start_day=Date('1020-03-01'),
)
"
</code></pre></div></div> <p>Instead of</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m mymodule \
    set_dragon_feeding_schedule \
    --feeding-times ['10:00','14:00','18:00'] # hopefully this way it gets recognized \
    # how will you define parsing a dict with enum to integer mapping?
    --dishes=Creatures.Tiger:2 \
    --dishes=Creatures.Human:1 \
    --start-day=1020-03-21 # BTW bash allows no comments in multiline calls
</code></pre></div></div> <ul> <li>How many lines of code do you need to cover the parsing logic in the previous example? <ul> <li>Try to be reasonable, not optimistic. Don’t forget documentation.</li> <li>Add testing, mocking, … have you <em>ever</em> seen that part done properly for CLIs?</li> </ul> </li> <li>Is there anything you win by writing explicit CLI parsing? A saved double quote, maybe?</li> <li>Exception handling — simple to add in one case, very tough in the other</li> </ul> <h3 id="-never-realized-that-cli-command-can-be-replaced-by-python-command">— Never realized that a CLI command can be replaced by a python command</h3> <p>You’re welcome! This can save you weeks of time and sleepless nights.</p> <p>Here is the definitive guide:</p> <ol> <li>Don’t write yet-another-parser — python can parse all you need</li> <li>Don’t reinvent representing lists, dicts, enums, objects, etc. in text — every programming language has this already solved</li> <li>Don’t create new <em>types</em> of interfaces — functions <em>are</em> interfaces</li> <li>Don’t write parsing logic/validation — check parameters instead</li> </ol> <p>Focus on writing a useful and friendly functional interface, not a CLI.</p> <h3 id="-how-about-an-example-for-dealing-with-more-complex-parameterization">— How about an example for dealing with more complex parameterization?</h3> <p>Sure! 
Here is an example from machine learning.</p> <p>A common headache is supporting multiple optimization algorithms (each with its own set of parameters) while allowing a number of architectures (each also with different parameters).</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -c "
from yourpackage import ResidualNetwork, AdamOptimizer, train, activations
train(
    optimizer=AdamOptimizer(lr=0.0001, some_param=42, converge=True),
    model=ResidualNetwork(n_layers_in_each_group=[3,4,5,6], act=activations.ReLU, n_classes=1234),
    save_path='/research/my_experiment_number9999',
)
"
</code></pre></div></div> <p>Compare this piece of clarity and versatility to the parsing nightmare happening in some popular packages.</p> <p>Why does it become such a nightmare? That’s a great question!</p> <ul> <li>parameters depend on each other in a non-trivial way. Different model → different parameters. Added a model? Update the CLI.</li> <li>there should be a way to associate parameters with the entity they come from <ul> <li>is this parameter for an architecture? for an optimizer? for a dataset?</li> <li>entities that appear naturally in programming interfaces are foreign to the style of bash calls</li> </ul> </li> <li>at some point a second model appears (hi, GANs!), and possibly a second optimizer and several types of datasets… now you need to support all of that in the CLI and avoid flag collisions <ul> <li>you are unlikely to want to frequently drop the previous interface, so backward compatibility will multiply your problems</li> </ul> </li> <li>validation logic that is capable of handling all these scenarios would be huge, buggy and not helpful at all</li> </ul> <p><strong>CLIs don’t scale up well</strong>.<br /> They work well only when you can decompose things into simpler components, ‘each doing one job’. Before writing a CLI, it is thus important to know what functionality your project provides and how it may change in a year or two. It is very easy to add a CLI when the project is in its initial stage, but as functionality grows, you’ll find it exponentially harder to fit all the knobs into the CLI.</p> <p>Other programming interfaces survive growth quite easily.</p> <h2 id="looking-forward">Looking forward</h2> <p>In the bright future of programming there will be more natural bridges between different languages. With growing capabilities for <a href="https://en.wikipedia.org/wiki/Reflection_(computer_programming)">reflection</a>, it will be easier to invoke particular functions from other languages without intermediate bash calls. <a href="https://pyo3.rs/">Python&lt;&gt;rust</a> is a good example of going in this direction.</p> <p>By not writing CLI logic and focusing on the programming interface, you make your code future-proof. <a href="https://fastapi.tiangolo.com/">Different</a> <a href="https://fastapi.tiangolo.com/alternatives/">utilities</a> can already convert functions to a REST API (we may later use other network APIs like gRPC, and you’ll be able to add them with a couple of lines). More to come; maybe we should expect utilities that auto-wrap your functions for calling from other languages/hosts/universes.</p> 
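<p>To make that concrete, here is a minimal sketch (not from the original toolchain; the endpoint path and return value are illustrative) of how the same kind of annotated function from the dragon example above could be exposed as a REST endpoint with <a href="https://fastapi.tiangolo.com/">FastAPI</a>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fastapi import FastAPI

app = FastAPI()

@app.get("/dragons/{name}")
def find_dragon(name: str, min_age_years: int = 200):
    # same annotated signature as in the typer example;
    # path/query parameters, validation and docs are derived from it
    return {"name": name, "min_age_years": min_age_years}
</code></pre></div></div> <p>Run it with <code class="language-plaintext highlighter-rouge">uvicorn example:app</code> (assuming the file is called <code class="language-plaintext highlighter-rouge">example.py</code>) and interactive docs appear at <code class="language-plaintext highlighter-rouge">/docs</code>; again, no interface code was written by hand.</p>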
<p>Code should be designed to be used by other code first. Convenience ‘temporary’ command-line utilities sooner or later become part of bigger automated pipelines if no other API is proposed.</p> <h2 id="tldr">TL;DR</h2> <ul> <li>simple CLIs should be auto-generated today; don’t write them yourself <ul> <li>other types of APIs can be auto-generated as well</li> </ul> </li> <li>complex CLIs are a problem; think twice (better, five times) before trying to replace a programming API with a CLI <ul> <li>convenient command-line calls are available without writing a single line of CLI code</li> </ul> </li> </ul> <p><br /></p> <p><br /></p> <details> <summary> <span style="font-size: 1.5em;"> Additional comments </span> </summary> <ul> <li>I use python as an example because 1) I need to show some code, 2) it is popular, 3) I know it well enough. <br /> However, the points made should be valid for all modern languages (C++ is not a modern language, just in case).</li> <li>Itamar Turner-Trauring has an article on a related topic called <a href="https://pythonspeed.com/articles/shell-scripts/">please stop writing shell scripts</a>. Itamar provides numerous helpful recommendations and tips in his blog, and this one is no exception.</li> </ul> </details> <details> <summary> <span style="font-size: 1.5em;"> Possible objections </span> </summary> <ul> <li>A CLI allows abstracting away from the implementation <ul> <li>Exposed functions can equally be detached from the actual implementation</li> </ul> </li> <li>The user may not know the programming language I use <ul> <li>An import and a function call are unlikely to mislead anyone. By hiding details you leave the user clueless when something doesn’t work</li> <li>The actual choice is whether the user should learn a bit of your language or yet another CLI system. It is hard to find an argument for the latter</li> <li>If your tool requires detailed configuration, you shouldn’t be afraid to say: you need to write several lines of code, here is an example</li> </ul> </li> <li>My application heavily uses bash/shell features: pipes, process substitutions and filename expansions <ul> <li>This is the case when you do want to keep using and supporting a CLI</li> </ul> </li> </ul> </details> <details> <summary> <span style="font-size: 1.5em;"> Comments on packages </span> </summary> <p><strong>What’s wrong with <code class="language-plaintext highlighter-rouge">python-fire</code>?</strong></p> <p>While it builds a CLI on top of exposed functions/methods, <code class="language-plaintext highlighter-rouge">fire</code> ignores annotations and tries to guess types from the input.</p> <p>An example from the official documentation to confirm:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python example.py 10
int
$ python example.py "10"
int
$ python example.py '"10"'
str
</code></pre></div></div> <p>So: 1) no types are guaranteed, 2) the logic is convoluted, 3) to make sure an argument is not converted to int, you wrap it in both single and double quotes. Now wrap that in a bash call (e.g. while building a docker image). Have fun escaping quotes for every string argument.</p> <p><strong><code class="language-plaintext highlighter-rouge">Hug</code> has poor support for CLIs (as of now)</strong></p> <p>Be warned: it ignores flag names. Still, it has the right direction of thought and directly supports <code class="language-plaintext highlighter-rouge">marshmallow</code> types. 
But in the meantime (Oct 2020) <code class="language-plaintext highlighter-rouge">typer</code> is a safer choice.</p> <p>The interface package of my dreams is not released yet — it should support both CLI and web APIs and include some elements from python-fire. However, this should not stop you, as switching between these packages is almost painless as long as you write no custom logic.</p> </details> <details> <summary> <span style="font-size: 1.5em;"> Acknowledgements </span> </summary> <p>Thanks to <a href="https://github.com/tlikhomanenko">Tatiana</a> for proof-reading an initial version of this post.</p> </details> <!-- maybe mention TAP https://github.com/swansonk14/typed-argument-parser --> Thu, 01 Oct 2020 12:00:00 +0000 https://arogozhnikov.github.io/2020/10/01/dont-write-cli.html https://arogozhnikov.github.io/2020/10/01/dont-write-cli.html Programming Python Command-line interfaces Twin training: trick for better model comparisons <p>Abstract: <em>Frequently comparing deep learning models?<br /> A simple way to improve comparisons is discussed here; this trick becomes especially handy when comparing segmentation models.</em></p> <p>Reliable comparison of models is important for DL “theorists” (to evaluate new approaches) as well as for practitioners/engineers (to select an approach for the particular task at hand). Comparison is a time-consuming process, frequently with noisy results.</p> <p>The usual setting involves a fixed dataset split into train/val/test and a fixed metric of choice. Next, independent runs are conducted for all models under comparison, and the achieved quality is recorded.</p> <p>As a result,</p> <ul> <li>There is significant noise in the comparison (it is rare to rerun each model several times, especially in applications),</li> <li>Validation can be done only on the whole dataset,</li> <li>you need to remember which version of the code was used to generate a particular number, as you can accidentally compare things that are not ‘comparable’ because of e.g. changed augmentations or updates to the dataset <ul> <li>yes, practitioners have to deal with frequent dataset updates</li> </ul> </li> <li>you can’t use augmentations while testing, since it is hard to guarantee that exactly the same augmentations were applied. Sometimes it is handy to evaluate on several batches as a fast intermediate check, and augmentations at test time allow a ‘broader’ check.</li> </ul> <h2 id="what-is-suggested-twin-training">What is suggested: twin training</h2> <p>Models can be trained <strong>side-by-side within the same process</strong>, with the training process kept as similar as possible. Same batches, same augmentations, and of course the same datasets.</p> <ul> <li>If the models, say, have identical architecture, their initial weights should be identical (easy to achieve in any DL framework; see the short sketch after this list). <ul> <li>As we know, the initial state influences optimization, in some cases drastically (that’s not desirable, but it happens).</li> </ul> </li> <li>During training, the exact same batches with the exact same augmentations should be used to optimize the models. <ul> <li>That’s right: you need to augment only once, so the CPU is not a bottleneck.</li> <li>Similarly, one should always compare on the same batches. To achieve smooth monitoring rather than validating once in a while, take one batch at a time and compute metrics on that batch.</li> </ul> </li> </ul> 
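<p>A minimal PyTorch-flavoured sketch of the first point (the tiny architecture below is made up purely for illustration): one freshly initialized model can simply be cloned, so both copies start from identical weights.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import copy
import torch.nn as nn

# any architecture works; this one is just a stand-in
model_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
model_b = copy.deepcopy(model_a)  # same architecture, identical initial weights

# for two already-constructed instances of the same architecture:
# model_b.load_state_dict(model_a.state_dict())
</code></pre></div></div>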
<p>Pseudo-code may look like (fragment):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for batch in train_data:
    batch = augment(batch)
    for model in models:
        # make an optimization step for each model using the same batch
        ...
</code></pre></div></div> <p>Things usually tuned (architecture, loss, augmentations, parameters, optimizers, learning schedules, etc.) can all be compared more efficiently this way.</p> <h2 id="example">Example:</h2> <p><img src="/images/model_comparison/tensorboard1.png" width="700" /></p> <p>There are three models trained in parallel in this screenshot from tensorboard. One can tell when one of the models has a lower loss and estimate the level of ‘noise’. It is also clear that most jumps and falls in the learning curves are due to the batches sampled, and are not model-specific behavior. In other words, you can better see the difference between <strong>models</strong>, not the difference between <strong>runs</strong>.</p> <p>This demonstrates a typical comparison: the things compared are extremely similar and there is little practical difference. The models’ responses to the same training input are close to identical. It’s not easy to reach the same conclusion by looking at final scores alone. That’s a good argument for including learning curves in a paper.</p> <h2 id="bonus-simpler-comparison-of-segmentation-models">Bonus: simpler comparison of segmentation models</h2> <p>When training models for image segmentation (such as instance segmentation or class segmentation), lack of memory becomes a critical factor. Batch sizes become very small, and it is almost impossible to train several segmentation models at once on a single GPU.</p> <p>During segmentation training each sample contributes a lot, since it provides a lot of labels (one per pixel!).<br /> It is also unlikely that you have thousands of well-labelled high-resolution segmentation images.</p> <p>However, when you train several models inside a single script/notebook, there are no such problems, <em>because you never keep intermediate activations for more than one model at a time</em>. The weights of all models still have to be kept in (GPU) memory, but that’s a small fraction of the space taken by activations.</p> <h2 id="bonus-simple-organization-of-experiments-in-tensorboard">Bonus: simple organization of experiments in tensorboard</h2> <p><img src="/images/model_comparison/folder_organization.png" height="200" /></p> <p>Tensorboard recursively scans subfolders for logs, so you can keep each ‘comparison’ in a separate folder, and each compared option saves its logs to a corresponding subfolder, as in the sketch below.</p> 
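<p>A minimal sketch of that layout (assuming PyTorch’s bundled <code class="language-plaintext highlighter-rouge">SummaryWriter</code>; tensorboardX works the same way, and the folder names are just an example):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from torch.utils.tensorboard import SummaryWriter

# one subfolder per compared option, all under the same 'comparison' folder
writers = {name: SummaryWriter(log_dir=f'logs/unet_vs_fpn/{name}') for name in ('unet', 'fpn')}

for step in range(3):  # stand-in for the real training loop
    for name, writer in writers.items():
        writer.add_scalar('train/loss', 1.0 / (step + 1), global_step=step)
</code></pre></div></div> <p>Pointing tensorboard at <code class="language-plaintext highlighter-rouge">logs/</code> then shows every comparison as its own group of runs.</p>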
<h2 id="alternative-fix-random-seed">Alternative: fix random seed?</h2> <p>I don’t think a fixed random seed is reliable enough to be considered an alternative way to achieve similarity in training.</p> <p>There are many different RNGs provided by different modules, and RNGs are used in too many places, so you would need to precisely control the RNG flow in your program. If some of your functions use global RNGs like <code class="language-plaintext highlighter-rouge">random</code> or <code class="language-plaintext highlighter-rouge">np.random</code> directly, then <em>any</em> side call to those from anywhere in your program changes all subsequently sampled numbers. Any ‘interruption’ in the sequence breaks it. Random numbers on the GPU are a whole other story.</p> <p>So you would have to look through all the augmentations, samplers, dropouts (basically, everything) to verify they don’t use global RNGs (and find that some of them actually do).</p> <p>Long story short, if you <em>have</em> to rely on random seeds in DL, at least log some control sums to verify that the sequence was not broken by an unexpected call from somewhere else.</p> <p>You can still use a random seed to achieve reproducible training of the same model.</p> Tue, 01 Jan 2019 12:00:00 +0000 https://arogozhnikov.github.io/2019/01/01/trick-for-model-comparison.html https://arogozhnikov.github.io/2019/01/01/trick-for-model-comparison.html Machine Learning Engineering Code improvements Einops — a new style of deep learning code <p>Recently I’ve open-sourced <a href="https://github.com/arogozhnikov/einops">einops</a> — a new (and better) way to write deep learning code.</p> <p>Einops introduces a new notation and new operations.</p> <video controls="" autoplay=""> <source src="http://arogozhnikov.github.io/images/einops/einops_video.mp4" type="video/mp4" /> <img src="http://arogozhnikov.github.io/images/einops/einops_video.gif" alt="einops package examples" /> </video> <p>It perfectly complements existing frameworks (pytorch, tensorflow, gluon, chainer, numpy and others), allowing you to write better deep learning code (see <a href="http://arogozhnikov.github.io/einops/pytorch-examples.html">examples for pytorch</a>).</p> <p><a href="https://github.com/arogozhnikov/einops">Einops at Github</a></p> <p>Tutorials: <a href="https://github.com/arogozhnikov/einops/blob/master/docs/1-einops-basics.ipynb">part 1</a> and <a href="https://github.com/arogozhnikov/einops/blob/master/docs/2-einops-for-deep-learning.ipynb">part 2</a>.</p>
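<p>A tiny taste of the notation (a minimal numpy sketch; the tensor shapes are arbitrary):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from einops import rearrange, reduce

images = np.random.rand(16, 30, 40, 3)  # a batch in (batch, height, width, channel) layout

# reorder axes, e.g. for a channels-first framework
chw = rearrange(images, 'b h w c -&gt; b c h w')

# 2x2 mean-pooling written as a reduction over split axes
pooled = reduce(images, 'b (h h2) (w w2) c -&gt; b h w c', 'mean', h2=2, w2=2)
</code></pre></div></div>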