§ Stuff I learnt in 2021

I spent this year focusing on fundamentals, and attempting to prepare myself for topics I'll need during my PhD. This involved learning about dependent typing, compiling functional programs, working with the MLIR compiler toolchain, and reading about the GAP system for computational discrete algebra.

§ Guitar

I've been meaning to pick up an instrument. I'd learnt the piano as a kid, but I'd soured on the experience as it felt like I was learning a lot of music theory and practising to clear the Royal Schools of Music exams. I'd learnt a little guitar while I was an intern at Tweag.io; my Airbnb host had a guitar which he let me borrow. I was eager to pick it back up. Unfortunately, I wasn't able to play as consistently as I had hoped. I can now play more chords, but actually switching between them continues to be a challenge. I also find pressing down on barre chords surprisingly hard. I've been told something about getting lighter strings, but I'm unsure about that. I was also excited about learning the guitar well enough to play it while a friend sings along. This seems to require a lot more practice than I currently have, as the bottleneck is whether one can smoothly switch between chords.

§ Starting my PhD: Research on Lean4

I'm excited by proof assistants, and I'd like to work on them for my PhD. So the first order of business was to get an idea of the internals of Lean4, and to decide what exactly I would be working on. This made me read the papers written by the Lean team over the last couple of years about their runtime, as well as learn how to implement dependently typed languages. During this process, I also had calls with some of the faculty at the University of Edinburgh to pick a co-advisor. I enjoyed reading Liam O'Connor's thesis, Type Systems for Systems Types. The thesis has a very raw, heartfelt epilogue:
If you will permit some vain pontification on the last page of my thesis, I would like to reflect on this undertaking, and on the dramatic effect it has had on my thinking. My once-co-supervisor Toby Murray said that all graduate students enter into a valley of despair where they no longer believe in the value of their work. Certainly I am no counter-example. I do not even know if I successfully found my way out of it
This, honestly, made me feel a lot better, since I'd begun to feel this way even before launching into a PhD!

§ Lean implementation details

I read the papers by the Lean researchers on the special features of the language.
  • Counting immutable beans describes an optimization performed in Lean's IR (λRC) that reduces memory usage by exploiting linearity: allocations whose reference count shows them to be unshared are reused in place.
  • Sealing pointer equality describes how to use dependent types to hide pointer manipulation in a referentially transparent fashion.

§ Writing a dependently typed language

I felt I had to know how to write a dependently typed language if I wanted to be successful at working on the Lean theorem prover. So I wrote one; it's at bollu/minitt. The tutorials that helped me the most were:
  • David Christiansen's tutorial on normalization by evaluation, where he builds a small but full-blown type checker for a dependently typed language.
  • Normalization by evaluation by Favonia, which explains why we need this algorithm to implement dependently typed languages, and gives other cute examples of normal forms in mathematics. For example, to check if two lists are equivalent up to permutation, we can first sort the two lists, and then check for equality on the nose. So we reduce a problem of "equivalence" to a problem of "reduction to a normal form" followed by "equality". We do something similar to type check a dependently typed language; there's a sketch of this idea after this list.
  • cubicaltt lectures by Favonia, which get the point of cubical type theory across very well.
  • Bidirectional type checking by Pfenning. These lecture notes explain bidirectional typing well, and provide an intuition for which types should be checked and which should be inferred when performing bidirectional typing.
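To make the normalization analogy concrete, here's a minimal sketch in Python (a toy of my own, not code from the tutorials): we decide the semantic notion "equal up to permutation" by reducing both sides to a normal form (sorted order) and then comparing syntactically. An NbE-based type checker decides definitional equality of types the same way, with evaluation playing the role of sorting.

```python
def normalize(xs):
    # The "evaluator": reduce a list to its canonical (sorted) form.
    return sorted(xs)

def equivalent(xs, ys):
    # Decide a semantic notion ("equal up to permutation") by
    # normalizing both sides and comparing the normal forms.
    return normalize(xs) == normalize(ys)

assert equivalent([3, 1, 2], [2, 3, 1])
assert not equivalent([1, 1, 2], [1, 2, 2])
```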

§ Paper acceptance

Our research on writing a custom backend for Lean4 was accepted at CGO'22. I was very touched by how nice the programming languages community is. For example, Leonardo de Moura and Sebastian Ullrich, the maintainers of Lean4, provided a lot of constructive feedback. I definitely did not expect this to happen. I feel like I don't understand academia as a community, to be honest, and I'd like to understand how it's organized.

§ Statistics

As I was working on the paper, I realised that I didn't truly understand why we were taking the median of the runtimes to report performance numbers, or why averaging over ten runs was "sufficient" (sufficient for what?). This led me on a quest to learn statistics correctly. My big takeaways were:
  • Frequentist statistics via null hypothesis testing is hard to interpret, and may not be well suited to performance benchmarking.
  • The High Performance Computing community does not use Bayesian statistics, so using it would flag one's paper as "weird".
  • The best solution is probably to report all the raw data, and to summarize it with reasonable summary statistics like the median, which is robust to outliers.
I must admit, I find the entire situation very unsatisfactory. I would love it if researchers in High Performance Computing wrote good reference material on how to benchmark well. Regardless, here are some of the neat things I wound up reading in this quest:

§ Learning statistics with R

Learning statistics with R. This is a neat book which explains statistics and the R programming language. I knew basically nothing of statistics and had never used R, so working through the book was a blast. I was able to blaze through the first half of the book, since it's a lot of introductory programming and introductory math. I had to take a while to digest the ideas of p-values and hypothesis testing. I'm still not 100% confident I really understand what the hell a p-value is doing. Regardless, the book was a really nice read, and it made me realize just how well designed the R language is.
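The framing that has helped me the most so far: a p-value is the probability, assuming the null hypothesis, of seeing a test statistic at least as extreme as the one observed. A permutation test makes this concrete, because the null distribution is literally built by shuffling labels. A minimal sketch in Python (the runtimes are made up):

```python
import random

# Two sets of (made-up) runtimes; null hypothesis: the labels don't matter.
a = [12.1, 11.9, 12.4, 12.3, 12.0]
b = [12.8, 12.6, 13.1, 12.9, 12.7]

def stat(xs, ys):
    # Test statistic: absolute difference of the group means.
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

observed = stat(a, b)
pooled = a + b
trials, extreme = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)  # a "null world": labels assigned at random
    if stat(pooled[:len(a)], pooled[len(a):]) >= observed:
        extreme += 1
print("p-value ~", extreme / trials)  # fraction of null worlds as extreme
```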

§ Jackknife

The Jackknife paper "Bootstrap methods: another look at the jackknife", which introduces the technique of bootstrapping: resampling a small dataset many times to infer summary statistics. I was impressed by the paper for three reasons. For one, it was quite easy to read as a non-statistician, and I could follow the gist of what was going on in the proofs. Secondly, I enjoyed how amenable it is to implementation, which makes it widely used in software. Finally, I think it's a great piece of marketing: labelling the technique a "jackknife", a rough-and-ready tool that will save you in the jungles of statistical wilderness, makes for a great title.
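Bootstrapping itself fits in a few lines. A minimal sketch of a percentile-bootstrap confidence interval for the median (the runtimes are made up):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, trials=10_000, alpha=0.05):
    # Resample the data with replacement, recompute the statistic each
    # time, and read the confidence interval off the empirical quantiles.
    stats = sorted(stat(random.choices(data, k=len(data)))
                   for _ in range(trials))
    return (stats[int(trials * alpha / 2)],
            stats[int(trials * (1 - alpha / 2))])

runtimes = [1.02, 0.98, 1.10, 1.05, 0.97, 1.40, 1.01]
print(bootstrap_ci(runtimes))  # e.g. a 95% interval for the median
```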

§ R language and tidy data

Due to the R language quest, I was exposed to the idea of a data frame in a coherent way. Data frames in R feel designed, unlike their Python counterpart, pandas. I realised that I should probably learn languages that are used by domain experts, and not poor approximations of domain expertise in Python.

§ Tidyverse

This also got me interested in the tidyverse, a collection of packages which define a notion of "tidy data": a precise philosophy of how data should be formatted when doing data science (roughly speaking, a dataset analogue of third normal form from database theory). In particular, I really enjoyed the tidy data paper, which defines tidy data, explains how to tidy untidy data, and advocates for using tidy data as an intermediate representation for data analysis.
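As a taste of what tidying means: a table with one column per year hides the variable "year" in the column headers; melting it so that every row is one observation makes it tidy. A sketch with pandas (the numbers are illustrative, loosely modelled on the paper's tuberculosis example):

```python
import pandas as pd

# Untidy: the variable "year" is spread across the column headers.
untidy = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# Tidy: one row per (country, year) observation.
tidy = untidy.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```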

§ Starting a reading group: Fuzzing

I felt like I was missing out on hanging out with folks from my research lab, so I decided to start a reading group. We picked the fuzzing book, since it seemed an easy and interesting read. I broadly enjoyed the book. Since it was written in a literate programming style, we could read the sources of each chapter and get a clear idea of how the associated topic was to be implemented. I enjoy reading code, but I felt that the other lab members found this too verbose. It also made judging the length of a particular section hard, since it was unclear how much of the section was pure implementation detail, and how much was conceptual.

§ Ideas learnt

I learnt some interesting ideas like delta debugging and concolic fuzzing, and, more broadly, how to design a fuzzing library (for example, this section on grammar fuzzing provides a convenient class hierarchy one could choose to follow). I also really enjoyed the book's many (ab)uses of Python's runtime monkey-patching capabilities for fuzzing. This meant that the book could easily explain concepts that would have been much harder to demonstrate in another setting, but it also meant that some of the techniques showcased (e.g. tracking information flow by exploiting the fact that Python is dynamically typed) would be much harder to put into practice in a less flexible language.
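Delta debugging is a good example of how small the core ideas are. Here's a minimal sketch of ddmin-style input minimization, assuming a black-box predicate `fails` that reports whether an input still triggers the bug (the predicate and input here are toys of mine, not the book's code):

```python
def ddmin(fails, s):
    # Shrink s while fails(s) stays True, by deleting ever-smaller chunks.
    n = 2  # number of chunks to split s into
    while len(s) >= 2:
        chunk = len(s) // n
        for i in range(0, len(s), chunk):
            candidate = s[:i] + s[i + chunk:]  # drop one chunk
            if candidate and fails(candidate):
                s, n = candidate, max(n - 1, 2)
                break
        else:  # no chunk could be dropped: split finer, or stop
            if n >= len(s):
                break
            n = min(n * 2, len(s))
    return s

# Toy bug: any input containing "bug" crashes.
assert ddmin(lambda s: "bug" in s, "xx_bug_yy") == "bug"
```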

§ Software bugs are real bugs?

The coolest thing I learnt from the book was STADS: software testing as species discovery, which models the question "how many bugs exist in the program?" as "how many bugs exist in this forest?". It turns out that ecologists have good models for estimating the total number of species in a habitat from the number of species observed so far. The paper argues that this analogy is sensible, and then implements the resulting estimators within AFL (American Fuzzy Lop). Definitely the most fun idea in the book by far.
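STADS borrows species-richness estimators like Chao1 from ecology: if many species (paths, crashes) have been seen exactly once, many more probably remain unseen. A minimal sketch, assuming `species_counts` maps each discovered species to how often the fuzzer observed it (the formula is the classic Chao1 estimator, written from memory):

```python
def chao1(species_counts):
    # Estimate the total number of species from singletons and doubletons.
    f1 = sum(1 for c in species_counts.values() if c == 1)  # seen once
    f2 = sum(1 for c in species_counts.values() if c == 2)  # seen twice
    s_obs = len(species_counts)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2  # degenerate case: no doubletons

# 3 singletons, 1 doubleton, 2 well-explored species:
print(chao1({"a": 1, "b": 1, "c": 1, "d": 2, "e": 9, "f": 30}))
```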

§ Persistent data structures for compilers

My friend and fellow PhD student Mathieu Fehr is developing a new compiler framework based on MLIR, called XDSL. This is being developed in Python, as it's meant to expose the guts of the compilation pipeline to domain experts who need not be too familiar with how compilers work.

§ Python and immutable data structures

I wished to convince Mathieu to make the data structures immutable by default. Unfortunately, Python's support for immutable-style programming is pretty poor, and I never could get libraries like pyrsistent to work well.

§ Immer

On a happier note, this made me search for the cutting edge in embedding immutable data structures in a mutable language, which led me to Immer: Persistence for the masses. It advocates using RRB trees, and describes how to design an API that makes them convenient to use within a language like C++. I haven't read the RRB trees paper, but I have been using Immer and I'm liking it so far.
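The key idea underneath all of these libraries is structural sharing: an "update" allocates only the part of the structure that changed, and shares the rest. The simplest instance is a persistent cons list; RRB trees generalize the trick to vectors with fast indexing and concatenation. A sketch in Python:

```python
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    head: int
    tail: Optional["Cell"]

def push(lst, x):
    # O(1) "update": build one new cell; the old list is untouched.
    return Cell(x, lst)

xs = push(push(push(None, 1), 2), 3)  # the list 3 -> 2 -> 1
ys = push(xs, 4)                      # the list 4 -> 3 -> 2 -> 1
assert ys.tail is xs                  # structural sharing, not a copy
```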

§ WARD for quick blackboarding

I hang out with my friends to discuss math, and the one thing I was sorely missing was a shared blackboard. I wanted a tool that would let me quickly sketch pictures, with some undo/redo support, but most importantly, be fast. I found no such tool on Linux, so I wrote my own: bollu/ward. It was great fun to write a tool to scratch a personal itch. I should do this more often.

§ Becoming a Demoscener

I've always wanted to become a part of the demoscene, but I felt that I didn't understand the graphics pipeline or the audio synthesis pipeline well enough. I decided to fix these glaring gaps in my knowledge.

§ Rasterization

I've been implementing bollu/rasterizer, which follows the tinyrenderer series of tutorials to implement a from-scratch, by-hand software rasterizer. I already knew all the math involved, so it was quite rewarding to quickly put together code that turned that math into pretty pictures.
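The heart of such a rasterizer is surprisingly small: to fill a triangle, test every pixel in its bounding box against the triangle's three edge functions, which is equivalent to checking that the pixel's barycentric coordinates all share a sign. A minimal sketch (my own toy, not tinyrenderer's code):

```python
import numpy as np

def edge(a, b, p):
    # 2D cross product: its sign says which side of line a->b point p is on.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def fill_triangle(img, a, b, c, color):
    xs, ys = [a[0], b[0], c[0]], [a[1], b[1], c[1]]
    for y in range(min(ys), max(ys) + 1):       # scan the bounding box
        for x in range(min(xs), max(xs) + 1):
            w0 = edge(b, c, (x, y))
            w1 = edge(c, a, (x, y))
            w2 = edge(a, b, (x, y))
            # Inside iff all edge functions agree in sign (either winding).
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                img[y, x] = color

img = np.zeros((64, 64, 3), dtype=np.uint8)
fill_triangle(img, (5, 5), (60, 20), (20, 60), (255, 255, 255))
```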

§ Audio synthesis

Similarly, on the audio synthesis side, I wrote bollu/soundsynth to learn fundamental synthesis algorithms. I followed demofox's series of audio synthesis tutorials as well as the very pleasant and gently paced textbook [TODO](). I particularly enjoyed the ideas in Karplus–Strong string synthesis. I find FM synthesis very counter-intuitive to reason about. I've been told that audio engineers can perform FM sound synthesis "by ear", and I'd love to have an intuition for frequency space that's so strong that I can intuit how to FM-synthesize a sound. Regardless, the idea is very neat for sure.
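Karplus–Strong is charmingly small: fill a delay line with noise (the pluck), then repeatedly feed it back through an averaging low-pass filter (the string losing energy). A minimal sketch with NumPy (sample rate, duration, and damping are illustrative choices):

```python
import numpy as np

def karplus_strong(freq, sr=44100, dur=1.0):
    n = int(sr / freq)                 # delay-line length sets the pitch
    buf = np.random.uniform(-1, 1, n)  # initial burst of noise: the pluck
    out = np.empty(int(sr * dur))
    for i in range(len(out)):
        out[i] = buf[i % n]
        # Average adjacent samples: a gentle low-pass that damps the string.
        buf[i % n] = 0.5 * (buf[i % n] + buf[(i + 1) % n])
    return out

pluck = karplus_strong(440.0)  # one second of a plucked A4 "string"
```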

§ Plücker coordinates

I've also long wanted to understand Plücker coordinates, since I'd read that they are useful for graphics programming. I eventually plonked down, studied them, and wrote an expository note about them in a way that makes sense to me. I now feel I have a better handle on projective space, Grassmannians, and schemes!
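For reference, the definition that finally stuck for me: a line in $\mathbb{P}^3$ through distinct points $p$ and $q$ is assigned the six homogeneous coordinates

$$p_{ij} = p_i q_j - p_j q_i, \qquad 0 \le i < j \le 3,$$

which are independent (up to scale) of the choice of the two points, and satisfy the quadratic Plücker relation

$$p_{01} p_{23} - p_{02} p_{13} + p_{03} p_{12} = 0.$$

This exhibits the space of lines, the Grassmannian $\mathrm{Gr}(2, 4)$, as a quadric hypersurface in $\mathbb{P}^5$.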

§ Category theory

A friend started a category theory reading group, so we've spent the year working through Emily Riehl's "Category Theory in Context". I'd seen categorical ideas before: colimits to define a germ, "right adjoints preserve limits", showing that the sheafification functor exists by invoking an adjoint functor theorem, and so on. But I'd never systematically studied any of this, and if I'm being honest, I hadn't even understood the statement of the Yoneda lemma properly.
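For the record, the statement I finally internalized: for a locally small category $\mathcal{C}$, an object $c$, and a functor $F : \mathcal{C} \to \mathbf{Set}$, evaluation at the identity gives a bijection

$$\mathrm{Nat}(\mathcal{C}(c, -), F) \cong F c,$$

natural in both $c$ and $F$. That is, a natural transformation out of the representable functor $\mathcal{C}(c, -)$ is completely determined by where it sends $\mathrm{id}_c$.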

§ Thoughts on the textbook

Working through the book from the ground up was super useful, since I was forced to solve exercises and think about limits, adjoints, and so forth. I've uploaded my solutions up to Chapter 4. I felt the textbook gets a little rough around the edges in the chapter on adjunctions. The section on the 'Calculus of Adjunctions' made so little sense to me that I rewrote it with proofs that I could actually grok/believe.

§ Curios

Regardless, it's been a fun read so far. I was also pointed to some other interesting content along the way, like Lawvere theories and the cohomology associated to a monad.

§ Computational mathematics

A postdoc at our lab, Andres Goens, comes from a pure math background. While we were discussing potential research ideas (since I'm still trying to formulate my plan for the PhD), he mentioned that we could provide a formal semantics for the GAP programming language in Lean. This project is definitely up my alley, since it involves computational math (yay), Lean (yay), and formal verification (yay).

§ Learning GAP

I decided I needed to know some fundamental algorithms of computational group theory, so I skimmed the book Permutation Group Algorithms by Seress, which explains the fundamental algorithms behind manipulating finite groups computationally, such as the Todd–Coxeter coset enumeration algorithm and the Schreier–Sims group decomposition algorithm. I loved the ideas involved, and implemented these at bollu/CASette. I'd also half-read the textbook 'Cox, Little, O'Shea: Computational Algebraic Geometry', which I picked up again since I felt I ought to revisit it after having seen more algebraic geometry, and also because I wanted to be better informed about computational mathematics. This time around, I felt many of the theorems (such as the Hilbert basis theorem) 'in my bones'. Alas, I couldn't proceed past the second chapter since other life things took priority. Perhaps I'll actually finish this book next year :).
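The first building block of Schreier–Sims is computing the orbit of a point under a set of generating permutations, which is just a graph search. A minimal sketch in Python, with permutations represented as tuples mapping index to image (my encoding for illustration, not CASette's):

```python
def orbit(point, gens):
    # Closure of {point} under the generators: search the graph whose
    # edges are "apply one generating permutation".
    orb, frontier = {point}, [point]
    while frontier:
        p = frontier.pop()
        for g in gens:
            q = g[p]  # image of p under the permutation g
            if q not in orb:
                orb.add(q)
                frontier.append(q)
    return orb

# The 4-cycle (0 1 2 3) acting on {0, 1, 2, 3} has a single orbit:
assert orbit(0, [(1, 2, 3, 0)]) == {0, 1, 2, 3}
```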

§ Cardistry

For something completely different, I got interested in cardistry and shuffling thanks to YouTube. I started learning interesting shuffles like the riffle shuffle, and soon got interested in the mathematics involved. I wound up reading some of the book Group Representations in Probability and Statistics by Persi Diaconis, a magician turned mathematician who publishes quite a bit on permutation groups, shuffling, and the like.
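The model Diaconis analyzes is the Gilbert–Shannon–Reeds riffle: cut the deck at a binomially distributed position, then drop cards from the two packets with probability proportional to their current sizes. It's the model behind the famous "seven shuffles suffice" result, and it's tiny to simulate:

```python
import random

def gsr_riffle(deck):
    # Gilbert-Shannon-Reeds shuffle: binomial cut, proportional interleave.
    cut = sum(random.random() < 0.5 for _ in range(len(deck)))
    left, right = list(deck[:cut]), list(deck[cut:])
    out = []
    while left or right:
        # Drop from a packet with probability proportional to its size.
        if random.random() * (len(left) + len(right)) < len(left):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out

deck = list(range(52))
for _ in range(7):  # seven riffles get a 52-card deck close to uniform
    deck = gsr_riffle(deck)
```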

§ Symmetric group

I really enjoyed learning the detailed representation theory of the symmetric group, which I had read patchily before while studying Fourier analysis on the symmetric group. A lot of the theory still feels like magic to me; in particular, Specht modules are so 'magic' that I would find it hard to reconstruct them from memory.

§ Competitive programming

I need more practice at competitive programming. In fact, I'm downright atrocious, as I'm rated "pupil" on Codeforces. If I had to debug why, it's a combination of several factors:
  • I get discouraged if I can't solve a problem I think I "ought to be able to solve".
  • I consider myself good at math and programming, and thus being bad at problem solving makes me feel bad about myself.
  • I tend to overthink problems, and I enjoy using heavy algorithmic machinery, when in reality, all that's called for is a sequence of several observations.
  • Codeforces' scoring system needs one to be fast at solving problems and implementing them precisely. I don't enjoy the time pressure. I'd prefer a scoring system based on harder problems, with less emphasis on time-to-solve.
To get better, I've been studying more algorithms (because it's fun). I took the Coursera course on string algorithms and read the textbook Algorithms on Strings. I loved the idea of building a prefix automaton in linear time. The algorithm is very elegant, and involves a fundamental decomposition of regular grammars via the Myhill–Nerode theorem. You can find my string algorithm implementations here.
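As a flavour of the linear-time constructions involved: the KMP prefix (failure) function, the engine behind several of them. This is the standard textbook routine, not code lifted from my implementations:

```python
def prefix_function(s):
    # pi[i] = length of the longest proper prefix of s[:i+1]
    # that is also a suffix of s[:i+1].
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]  # fall back to the next-shortest border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi

assert prefix_function("abcabca") == [0, 0, 0, 1, 2, 3, 4]
```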

§ Hardness of Codeforces problems

Another thing I kept getting tripped up by was the fact that problems rated "easy" on Codeforces tended to have intuitive solutions, but with non-trivial watertight proofs. An example of this was question 545C on Codeforces, where the tutorial gives a sketch of an exchange argument. Unfortunately, filling in all the gaps in the exchange argument is quite complicated. I finally did arrive at a much longer proof. This made me realize that competitive programming sometimes calls for "leaps" that are in fact quite hard to justify. This kept happening as I solved problems. To rectify the state of affairs, I began documenting formal proofs for these problems. Here's a link to my competitive programming notes, which attempt to formally state and prove the correctness of these questions.

§ Discrete differential geometry

I love the research of Keenan Crane, who works on bridging old-school differential geometry with computational techniques. All of his papers are lucid, full of beautiful figures and crazy ideas.

§ Repulsive curves

Chris Yu, Henrik Schumacher, and Keenan have a new paper, Repulsive Curves, which is really neat. It allows one to create curves that minimize a repulsive energy, while subject to arbitrary additional constraints. The algorithm design leads one to think about all sorts of things, like fractional calculus. To be honest, I find it insane that fractional calculus finds a practical use. Definitely a cool read.

§ SAGE implementation

  • I have a work-in-progress PR that implements Keenan Crane's Geodesics in Heat algorithm within SAGE. Unfortunately, implementing this requires heavy sparse numerical linear algebra, something that SAGE did not have at the time I attempted it (the three steps of the heat method are sketched after this list).
  • This led me to open an issue about sparse Cholesky decomposition on the SAGE issue tracker.
  • Happily, the issue was fixed late this year by SAGE pulling in cvxopt as a dependency!
  • I can get back to this in 2022, since there's now enough support within SAGE to actually succeed!
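For context, the heat method computes geodesic distance from a source set $\gamma$ in three linear-algebra steps (my paraphrase of the paper, with $A$ the mass matrix, $L$ the cotangent Laplacian, and $t$ a small time step):

$$(A - tL)\, u = \delta_\gamma, \qquad X = -\frac{\nabla u}{|\nabla u|}, \qquad L\, \phi = \nabla \cdot X.$$

The first step integrates heat flow for a short time, the second normalizes its gradient into a unit field pointing along geodesics, and the third recovers the distance $\phi$ by solving a Poisson equation. Steps one and three are exactly the sparse solves that SAGE was missing.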

§ Writing a text editor (dropped)

I started writing a text editor, because earlier tools I'd written for myself, such as ward for blackboarding and my custom blog generator, all worked really well for me, as they fit my idiosyncrasies. I tried writing a terminal-based editor at bollu/edtr, following the kilo tutorial. Unfortunately, building a text editor is hard work, especially if one wants modern conveniences like auto-complete. I've postponed this project as one I shall undertake during the dark night of the soul every PhD student encounters when writing their thesis. I plan to write a minimal lazily evaluated language, and great tooling around that language, as a means to while away the time. But this is for future me!

§ DnD

My partner got me into playing Dungeons and Dragons this year. I had a lot of fun role-playing, and I plan to keep it up.

§ Nomic

Nomic is a neat game about changing the rules of the game. It takes a particular type of person to enjoy it, I find, but if you have the kind of people who enjoy C++ template language lawyering, you'll definitely have a blast!

§ Continuum

I found the Continuum RPG, a game about time travel, very unique, due to the massive amount of lore that surrounds it, and its game mechanics, which revolve around creating time paradoxes to deal damage to those stuck in them. It appears to have a reputation of being a game that everybody loves but nobody plays.

§ Microscope

Microscope is a game about storytelling. I unfortunately was never able to host it properly, because I was busy, and when I wasn't busy, I was unsure of my abilities as dungeon master :) But it definitely is a game I'd be stoked to play. I'm thinking of running it in early 2022 with my group of friends.

§ Odds and ends

§ The portal group

I joined The Portal group on Discord, which consists of folks who, broadly speaking, follow Eric Weinstein's philosophy. The Discord is a strange milieu. I hung around because there were folks who knew a lot of math and physics. I wound up watching the lectures on the geometric anatomy of theoretical physics on YouTube by Frederic Schuller. The lectures are great expository material, though the difficulty ramps up like a cliff towards the end, where it feels like he stops proving things and begins to simply state results. Regardless, I learnt a lot from them. I think my favourite takeaway was the Serre–Swan theorem, which makes very precise the idea that "projective modules are like vector bundles".
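The precise statement, for the record: for a compact Hausdorff space $X$, taking global sections gives an equivalence of categories

$$\Gamma : \mathbf{Vect}(X) \xrightarrow{\ \simeq\ } \mathbf{Proj}(C(X))$$

between vector bundles over $X$ and finitely generated projective modules over the ring $C(X)$ of continuous functions (with a smooth analogue over $C^\infty(M)$).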

§ Differential geometry, again

Similarly, I wound up realizing that my differential geometry was in fact quite weak when it came to computing things in coordinates. So I re-read do Carmo's Differential Geometry of Curves and Surfaces, and implemented the coordinate-based computations in Jupyter notebooks. For example, here is a Jupyter notebook that calculates covariant derivatives explicitly. I found that this forced me to understand what was "really going on". I now know slogans like:
The covariant derivative is the projection of the global derivative onto the tangent space. The Christoffel symbols measure the second derivative (acceleration) along the tangent space.
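In do Carmo's setting of surfaces, the slogan is the Gauss formula: for a parametrized surface $\mathbf{x}(u^1, u^2)$ with tangent vectors $\mathbf{x}_i = \partial \mathbf{x} / \partial u^i$ and unit normal $N$, the ambient second derivatives decompose as

$$\mathbf{x}_{ij} = \Gamma^1_{ij}\, \mathbf{x}_1 + \Gamma^2_{ij}\, \mathbf{x}_2 + e_{ij}\, N,$$

so the Christoffel symbols $\Gamma^k_{ij}$ are precisely the tangential component of the acceleration, and the coefficients $e_{ij}$ of the second fundamental form are the normal component.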
On a different note, I got interested in the work of Elizabeth Polgreen. In particular, I found the idea of being able to extend an SMT solver with arbitrary black-box functions pretty great. I read their technical report on SMT modulo oracles and implemented the algorithm.
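The core loop is simple enough to sketch. Here's a toy version in Python with z3 (my own simplification of the idea, not the report's algorithm verbatim): the constraint mentions an oracle's output via a fresh variable, the solver proposes candidates, and every oracle evaluation is fed back as a constraint until the solver and the oracle agree:

```python
from z3 import Int, Solver, Implies, sat

def oracle(v):
    # Black-box function: the solver never sees this definition.
    return v * v

x, fx = Int('x'), Int('fx')      # fx stands in for oracle(x)
s = Solver()
s.add(x > 0, x < 100, fx == 49)  # goal: find x > 0 with oracle(x) == 49

while s.check() == sat:
    cand = s.model()[x].as_long()
    if oracle(cand) == 49:
        print("solution:", cand)  # prints 7
        break
    # Teach the solver the oracle's verdict on this candidate; together
    # with fx == 49, this rules the candidate out.
    s.add(Implies(x == cand, fx == oracle(cand)))
```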

§ What I want for next year

I wish to learn how to focus on one thing. I'm told that the point of a PhD is to become a world expert on one topic, and I don't yet have a good answer for what I wish to become a world expert on. I like the varied interests I have, so it'll be interesting to see how this pans out. However, I have decided to place all my bets on the Lean ecosystem, and I plan on spending most of 2022 writing Lean code almost always (or perhaps even always). I wish to understand all parts of the Lean compiler, from the front-end with its advanced macro system, to the middle-end with its dependent typing, to the back-end. In short, I want to become an expert on the Lean4 compiler :). Let's see how far along I get!