Ryan Tolsma

IVF Selection is Mid

2025-10-24T00:00:00+00:00

IVF embryo selection is having a moment. Companies like Orchid Health are pitching parents on the promise of optimizing their future kids for IQ, height, disease resistance, and other desirable phenotypes, with celebs and politicians (quietly) leaning in. The tech is real and improving, but the actual upside and long-term vision are limited by some pretty basic math.

The ceiling on selection power is … low. There are fundamental constraints on what you can do with \(N\) samples from a highly polygenic distribution.

Selection Power Grows Logarithmically

In a typical egg harvest, you get maybe 10-15 viable embryos if you’re lucky, rarely more than 50-100 even in aggressive scenarios.

IVF selection effectively chooses the max of \(N\) samples under some objective function, and at these sample sizes that translates to just \(\log_2 N \approx\) 3-7 bits of selection power.
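As a quick sanity check on that range, here is a minimal sketch that treats "picking the best of \(N\)" as \(\log_2 N\) bits of selection. The function name `selection_bits` and the specific embryo counts are illustrative choices, not figures from any clinic.

```python
import math

def selection_bits(n):
    """Bits of selection from choosing 1 out of n candidates."""
    return math.log2(n)

# Illustrative embryo counts spanning the range quoted above.
for n in (10, 15, 50, 100):
    print(f"N={n:>3}: {selection_bits(n):.2f} bits")
```

Even the aggressive end of the range stays under 7 bits, and doubling the number of embryos only ever buys one more bit.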

But it gets worse when you account for how these traits actually work genetically. Height is controlled by thousands of genetic variants. IQ is even more polygenic. They’re sums of thousands of small effects with few if any big hitters. We can model each embryo’s genetic potential distribution \(X\) for a polygenic trait as a sum of \(K\) i.i.d. variables contributing to the phenotype effect, which by the Central Limit Theorem implies \(X\) can be modeled as a normal distribution. Depending on the polygenicity of the phenotype, the variance of \(X\) can already be quite small across samples, as \(\sigma \sim O(\frac{1}{\sqrt{K}})\).

The question then becomes, can we improve on \(E[X]\)? How much better can we do by selecting \(N\) samples from \(X\)?

Rescaling \(X\sim N(0,1)\), take \(X_1, X_2, \ldots, X_N \stackrel{\text{i.i.d}}{\sim} N(0,1)\), we want to understand \(\mathbb{E}[Z]\) where \(Z = \max(X_1, \ldots, X_N)\).

To bound \(E[Z]\) we can use the CDF:

\[\begin{align*} P(Z \leq z) &= P(X_1 \leq z, \ldots, X_N \leq z) \\ &= \Phi(z)^N \end{align*}\]

where \(\Phi\) is the standard normal CDF. For the tail bound, note that for large \(z\):

\[P(X_i > z) \approx \frac{1}{\sqrt{2\pi}z}e^{-z^2/2}\]

By union bound: \(P(Z > z) \leq N \cdot P(X_1 > z)\). Setting \(z = \sqrt{2\log N}\):

\[\begin{align*} P(Z > \sqrt{2\log N}) &\leq N \cdot \frac{1}{\sqrt{4\pi \log N}} e^{-\log N} \\ &= \frac{1}{\sqrt{4\pi \log N}} \to 0 \end{align*}\]

Similarly, we can show that the probability of \(Z\) falling far below this threshold is exponentially small, giving us that \(Z\) concentrates around \(\sqrt{2\log N}\) and therefore \(\mathbb{E}[Z] \approx \sqrt{2\log N}\).

The math here is unforgiving. Selection power scales incredibly slowly, as \(\sqrt{2\log N}\), and you’re at best getting a small constant-factor improvement on \(\sigma\), which is already low at \(O(\frac{1}{\sqrt{K}})\).
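To make the scaling concrete, here is a rough Monte Carlo sketch comparing a simulated \(\mathbb{E}[Z]\) against \(\sqrt{2\log N}\). The helper `expected_max`, the trial count, and the seed are arbitrary choices for illustration.

```python
import math
import random

def expected_max(n, trials=2000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. N(0,1) draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

for n in (10, 100):
    simulated = expected_max(n)
    asymptotic = math.sqrt(2 * math.log(n))
    print(f"N={n:>3}: simulated E[Z] = {simulated:.2f}, sqrt(2 log N) = {asymptotic:.2f}")
```

At these sample sizes the simulated mean sits somewhat below \(\sqrt{2\log N}\), which it approaches slowly from below, so the asymptotic value is if anything an optimistic estimate of what 10-100 embryos buy you.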

Conclusion

Selection upside is fundamentally limited by information theory and biology. The traits you sample are highly concentrated due to polygenic effects, and you can’t meaningfully improve on that by scaling up the number of samples.

The technology right now shines at screening out rare genetic diseases, which are extremely sparsely distributed, requiring very few bits of selection power to remove.

For polygenic trait optimization – smarter, taller, more athletic kids – the ceiling is low and already here.

Fun Derivative Eigenfunctions

2025-07-06T00:00:00+00:00

I came across this article today at the top of Hacker News, which reminded me of some quant interview questions requiring some clever discrete derivative matrix tricks. After reading it, I realized that you can construct a new clever proof that \(\frac{d}{dx} \phi(x) = a\phi(x) \iff \phi(x) = e^{ax}\) (up to a constant factor) from the discrete perspective.

The article takes a different basis approach, representing functions as vectors of polynomial coefficients, which simplifies the picture. However, with a cute reframing you can generalize this to the standard positional basis, where functions are infinite-dimensional vectors with coordinates representing evaluations \(f_n = f(x_n), x_n \in \mathbb{R}\).

To start, consider a circular permutation matrix of dimension \(N\):

\[P_N = \begin{pmatrix} 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & 0 & \cdots & 0 \end{pmatrix}\]

As a permutation matrix of size \(N\) consisting of a single cycle, its eigenvalues are the \(N\) roots of unity \(\varphi_a = \omega^a\), and the corresponding eigenvectors are the \(N\) periodic Fourier modes \(v_a = \frac{1}{N}\left(1, \omega^a, \omega^{2a}, \ldots, \omega^{(N-1)a} \right)\) where \(\omega = \exp\left(\frac{2\pi i}{N}\right)\) and \(a = 0, 1, \ldots, N-1\).
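The eigenvector claim is easy to check numerically. Below is a small sketch, with `shift`, `fourier_mode`, and `eigen_residual` as made-up helper names, that applies the cyclic shift to each Fourier mode and measures how far \(P_N v_a\) is from \(\omega^a v_a\).

```python
import cmath

def shift(v):
    """Apply P_N: (P v)_j = v_{j+1}, with the last entry wrapping to v_1."""
    return v[1:] + v[:1]

def fourier_mode(n, a):
    """The vector (1/n) * (1, w^a, w^{2a}, ..., w^{(n-1)a}) with w = exp(2*pi*i/n)."""
    omega = cmath.exp(2j * cmath.pi / n)
    return [omega ** (a * j) / n for j in range(n)]

def eigen_residual(n, a):
    """Largest entry of |P v_a - w^a v_a|; should be zero up to rounding."""
    v = fourier_mode(n, a)
    eigenvalue = cmath.exp(2j * cmath.pi * a / n)
    return max(abs(pv - eigenvalue * x) for pv, x in zip(shift(v), v))

for a in range(8):
    print(a, eigen_residual(8, a))
```

Every residual comes out at floating-point noise level, so each mode is an eigenvector with eigenvalue \(\omega^a\).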

Now, let’s view the derivative as an infinite dimensional operator

\[\begin{align*} A = \frac{d}{dx} = \frac{1}{\epsilon}\begin{pmatrix} -1 & 1 & 0 & 0 & \cdots \\ 0 & -1 & 1 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ 0 & 0 & 0 & -1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \end{align*}\]

where, in the linearized view, “rows” correspond to discrete evaluations of a function at points \(x_n\) spaced \(\epsilon\) apart, so that row \(n\) computes the difference quotient \((f_{n+1} - f_n)/\epsilon\) as we take \(\lim \epsilon \to 0\).

Next, note that we can decompose \(A\) into the following:

\[\begin{align*} A = \lim_{\epsilon\to 0}\frac{1}{\epsilon}\left(P_\infty - I \right) \end{align*}\]

where \(P_\infty\) is the shift operator with ones on the superdiagonal, which we treat (heuristically, rather than in operator norm) as the \(N\to\infty\) limit of the permutation operators \(P_N\) acting on \(\ell^2\).

Then, it becomes clear that the eigenvectors of \(A\) are exactly the eigenvectors of \(P_\infty\) which map exactly onto the Fourier Modes now continuously indexed by \(a\), which directly corresponds to the eigenfunction basis \(e^{ax}\).
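A numerical way to see the same thing: apply a forward difference quotient \((f(x+\epsilon) - f(x))/\epsilon\) to \(e^{ax}\) and divide by \(e^{ax}\). This is only a sketch with hypothetical helper names, but the ratio should not depend on \(x\) and should approach \(a\) as \(\epsilon\) shrinks.

```python
import math

def forward_difference(f, x, eps):
    """Discrete derivative: (f(x + eps) - f(x)) / eps."""
    return (f(x + eps) - f(x)) / eps

def eigen_ratio(a, x, eps):
    """Ratio (discrete derivative of e^{ax}) / e^{ax}; equals (e^{a*eps} - 1) / eps."""
    phi = lambda t: math.exp(a * t)
    return forward_difference(phi, x, eps) / phi(x)

for eps in (1e-1, 1e-3, 1e-6):
    print(eps, eigen_ratio(0.7, 0.0, eps), eigen_ratio(0.7, 1.3, eps))
```

For each \(\epsilon\) the two columns agree, which is the eigenvector property, and both converge to \(a = 0.7\).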

Some Thoughts on Robotics

2024-07-10T00:00:00+00:00

Last fall, I spent a couple of months actively researching the robotics space for problems to solve or good companies to join. My background from working on RL research with Dorsa and Chelsea’s labs during undergrad, and the rapid capability improvements we’ve witnessed in the last three years had me pretty excited. A lot has changed, but I thought it would be useful to share with others a draft of some of the high-level insights and frameworks I arrived at after talking to researchers, founders, VCs, suppliers, and clients in the industry before deciding for myself not to re-enter the field yet.

Draft Notes

Where can a VC scale business be built in robotics?

Like in any industry, you drive scale by finding broad horizontal problems with sufficient pain to warrant pricing, or by selecting verticals with large enough scope to integrate into. Large verticals that are currently viable for robotics mostly include manufacturing, warehouse logistics, big agriculture, precision surgical/medicine, and fast-food restaurants. This list is constantly expanding as the tech improves and pushes COGS lower, increasing the surface area of accessible markets.

What can be solved horizontally?

At a high level, you need broadly replicable pain points that are also not viewed by companies as core competencies. Core competencies will be brought in-house eventually (case studies: Scale AI AV labeling, Applied Intuition simulation infra sales, Zendesk AI agents, Tecton, and other marginal ML ops providers), regardless of how durable the sales relationship is.

What are the high-level problems robotics startups face?

I’ve sorted the following list from most to least solved, in my personal opinion, and labeled whether each is primarily a vertically or horizontally solved problem:

Distribution (Vertical)
- For most companies, this is likely to be viewed as a core competency and remain so in the future. The playbook here feels similar to the wave of vertical SaaS OpenAI wrapper companies, where distribution and domain expertise are the primary moats and value drivers.
- An additional differentiator to AI SaaS providers is that the hardware stack for robotics is not unified across use-cases, which prevents distribution from scaling horizontally well in general.
Hardware (Vertical now, Horizontal in ~5-10 years)
- Outside of large verticals where the hardware stack is likely to unify, this is typically viewed as a core competency by most incumbents and new players in the space.
- As better general use-case hardware develops, this will decrease in relative importance from a core function (“we need custom hardware to solve the problem”) to a competitive operational advantage (“we need custom hardware to reduce costs or drive efficiency compared to competitors”). Similar to the open-source platform and AI model communities, once more effective general use-case hardware hits the markets, the cost scaling benefits of standardized manufacturing will eventually dramatically drive down the marginal value of custom hardware.
- Together, hardware production and physical distribution are the two main components contributing to fragmentation and stickiness of distribution. Lack of standardization contributes heavily to higher switching costs, both from physical replacements and the operational adjustments required to integrate it (ever tried the same prompt and swapping GPT and Claude?). This has implications for centralization (like any fab-style production: GPUs, CPUs, LLM pre-training, etc.) unless effective horizontal platforms and OSS communities develop (everyone is still custom hacking ROS internally…).
Perception (Horizontal)
- Vision models have gotten surprisingly good at the edge over the last three years, and edge compute and specialized chips are really pushing the frontier on what robots can now do in real time.
- Other modalities of perception (LiDAR, radar, ultrasonic sensors, and infrared cameras) have less accessible public data, which might be limiting useful adoption from new players.
- Outside of niche applications and modalities, perception is a pretty well-understood problem, and the majority of recent focus has been on tail edge cases. Finding instances of serviceable edge-cases seems like a promising approach to sourcing customers. For example, real-time vision safety modules to help prevent heavy machinery/robotics accidents in factory environments.
- Currently, perception seems primarily internalized as it’s heavily coupled with hardware choice and power/latency tradeoffs and linked to safety, which has severe downside risks if not appropriately managed internally. I expect this to change over the next couple of years as horizontal providers reach acceptable thresholds/tradeoffs in both, as it’s clearly not viewed as a defendable moat.
Data (Vertical but with horizontal infra plays)
- As LLMs have started to dominate media, it’s growing awfully apparent that data will become a core competency and moat for traditional platforms.
- In robotics, data has significant barriers to access. It’s coupled to your hardware sensors and on-policy control procedures, and can require time-consuming physical interactions to extract realistic environments.
- The real2sim gap still exists but is slowly getting better. The reality of long-tail edge cases, however, limits the effectiveness of iterating too reliantly on simulations as your ability to safely extrapolate to real-world deployments is dubious at best.
- Simulations are clearly horizontal infrastructure, and I’m optimistic that in the near term new players can exist somewhere in-between to augment or curate physically gathered data (i.e., through data augmentation, scene analysis filtering) in a way that doesn’t compete with the existing bloated field of traditional ML ops providers.
Control (Vertical now, Horizontal in ~5-10 years)
- Control is by far the least resolved component in the technical stack and might better be referenced as environment modeling in its current state. In the short term, this is a strongly differentiated capability (Waymo vs. everyone else?), but I suspect like with most software/algorithmic progress eventually this will change or centralize like in the case of foundation model labs.
- Traditional control methods work extremely well as functional primitives, supplied with environment model inputs and called with intent from an intelligent orchestrator (often a tokenized action transformer). A primary bottleneck is gathering sufficient data to test automated control on the long tail of edge cases. For this reason, controlled environments or those with limited human interaction are currently the primary targets.
- Compared to the other components, this is the purest “algorithmic” problem with low capital barriers to access. The physical economies of scale associated with distribution, hardware development/production, and data gathering might not necessarily apply here if algorithmic unlocks can be found.

Conclusion

There’s a ton to expand on here across the stack, including a lot of key components I’m leaving out, but I’m hoping that others interested in starting or evaluating robotics companies will find this useful. While I’m not currently involved or actively looking right now, I’m still very excited and interested in the five-year horizon, and would love to hear any pushback or new insights you might have.

Topological Problems in Voting

2024-06-14T00:00:00+00:00

Back in college, I developed an interest in unexpected impossibility proofs applied to real world systems. The fact that certain abstract mathematical structures inherently have limitations which profoundly impact actual applications is both captivating and sobering. Here’s a cute instance of topological properties applied to voting systems a friend shared with me through this video (we’ll take a slightly different approach).

Background: Arrow’s Theorem

Arrow’s Theorem¹ is the most famous impossibility theorem applied to voting systems – lots of articles and papers have spent time introducing and discussing its implications. In essence, Arrow’s Theorem states that, with three or more candidates, a voting process which aggregates voter rankings into a single total order cannot simultaneously satisfy:

Non-Dictatorship - no single voter dominates the output preferences
Pareto Efficiency - if all voters rank candidate A over B, then the resulting rankings support \(A \succcurlyeq B\)
Independence of Irrelevant Alternatives - if \(A \succcurlyeq B\) and a new candidate C is introduced, then the new ranking still requires \(A \succcurlyeq B\)

While this applies to discrete rankings and voter preferences, one might wonder if it’s a unique property of its discrete nature in how candidates are only ranked by ordering. Instead of discrete rankings, could a continuous preference ranking satisfy similar conditions? Unfortunately, a similarly flavored impossibility result holds even in the continuous setting! It seems there’s no getting around the fact that voting is pretty hard to get right.

Chichilnisky Impossibility Theorem

The Chichilnisky Theorem² extends Arrow’s Theorem to the continuous setting, but with slightly different constraints.

Suppose that you have a set of \(K\) voters choosing between \(N\) candidates. Considering only relative preferences, we can represent the preference profile of each voter over the candidates \(p_k \in S^{N-1}\) as a unit vector on the sphere, with each coordinate \((p_k)_i\) representing the allotted relative preference for candidate \(i\). Denoting \(P = S^{N-1}\) as our preference space, our voting function \(\phi\) becomes a map \(\phi: P^K \to P\) taking a profile of preferences to a resulting unit vector preference profile \(\phi(p_1,\ldots, p_K) = u\).

The Chichilnisky Theorem states the following cannot be jointly satisfied:

\(\phi\) is smooth³ - small changes in voter preferences should result in small changes to \(u\)
\(\phi\) respects anonymity - i.e. \(\phi(p_1, \ldots, p_K) = \phi(p_{\sigma(1)}, \ldots, p_{\sigma(K)})\) for every permutation \(\sigma\) of the voters
\(\phi\) respects unanimity - if \(p_1 = \cdots = p_K= u\) then \(\phi(p_1,\ldots,p_K) = u\)
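To get a feel for why these three conditions clash, consider the most obvious candidate rule for two voters on the circle: add the two unit vectors and renormalize. This rule is my own illustration, not something taken from the theorem's statement. It is anonymous and unanimous by construction, so the theorem says it has to fail continuity somewhere, and the sketch below (with hypothetical helper names) shows the failure near antipodal preferences.

```python
import math

def unit(theta):
    """Unit vector on S^1 at angle theta."""
    return (math.cos(theta), math.sin(theta))

def mean_rule(p1, p2):
    """Candidate voting rule: normalized vector sum of two preferences."""
    sx, sy = p1[0] + p2[0], p1[1] + p2[1]
    norm = math.hypot(sx, sy)
    if norm == 0.0:
        raise ValueError("undefined for exactly antipodal preferences")
    return (sx / norm, sy / norm)

def outcome_angle(theta1, theta2):
    """Angle of the aggregated preference for voters at theta1 and theta2."""
    x, y = mean_rule(unit(theta1), unit(theta2))
    return math.atan2(y, x)

delta = 1e-3
before = outcome_angle(0.0, math.pi - delta)  # voter 2 just short of antipodal
after = outcome_angle(0.0, math.pi + delta)   # voter 2 just past antipodal
print(before, after)
```

A change of \(2\delta\) in one voter's preference flips the outcome from roughly \(\pi/2\) to roughly \(-\pi/2\), so the rule cannot be extended continuously across the antipodal pair.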

Topological Proof

We’ll prove the simplest case of two voters and two candidates, which can be naturally generalized. In this scenario our preference space \(P = S^1\) is the unit circle and \(\phi: S^1\times S^1 \to S^1\) maps the torus to a circle. The high-level idea is to form two different loops on the torus whose images under \(\phi\) have different degrees modulo 2, and then show that one loop can be deformed into the other, forming a contradiction.

For the first path, consider the diagonal \(D = \left \{ (\alpha, \alpha) : \alpha \in S^1 \right\}\) which visually loops around the torus at an angle, such that it rotates along the inner and outer loops exactly once by the time it closes. If we restrict \(\phi \vert_D\), then by the unanimity condition we know that \(\phi\vert_D(\alpha,\alpha)=\alpha\) is one-to-one which satisfies \(\deg \phi\vert_D = 1\).

For our second path, let’s take \(A = \{\alpha\} \times S^1\) and its symmetric counterpart \(B = S^1 \times \{\alpha\}\) for some \(\alpha\in S^1\), which form an orthogonal figure-eight, with \(A,B\) representing circles rotating directly on the outer and inner loop respectively. If we look at the restriction \(\phi \vert_{A\cup B}\), then by anonymity (\(\phi\) is symmetric in its two arguments) we see that \(\deg \phi \vert_{A\cup B} = 0\ \text{mod}\ 2\) at any regular value by pairing points in the preimage from both components of the figure-eight. Clearly \(A\) and \(B\) intersect at just the one point \((\alpha,\alpha)\in D\) non-smoothly but otherwise form a connected path. So if we take a small \(\epsilon\) sized smooth deformation at \(D\cap (A\cup B)\) to shift away from the intersection at \((\alpha,\alpha)\) to form a non self-intersecting loop \(L\), then \(\deg \phi \vert_L = 0 \ \text{mod}\ 2\) at all of its regular points as well.

Finally, we can find a homotopy to smoothly deform \(L\) into \(D\) (they’re both just loops now) which produces a contradiction as \(\deg \phi\vert_D \neq \deg\phi\vert_L\)!
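Written out as a single degree computation, and assuming the standard facts that degree mod 2 is invariant under smooth homotopy and additive over the two circles making up the figure-eight, the contradiction looks roughly like this, where anonymity gives \(\phi(\alpha,\beta) = \phi(\beta,\alpha)\) and hence \(\deg \phi\vert_A = \deg \phi\vert_B\):

\[\begin{align*} 1 = \deg \phi\vert_D &\equiv \deg \phi\vert_L \pmod 2 \\ &\equiv \deg \phi\vert_A + \deg \phi\vert_B \pmod 2 \\ &\equiv 2 \deg \phi\vert_A \equiv 0 \pmod 2 \end{align*}\]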

This notion of taking the diagonal and paths in \(P^K\) naturally generalizes when you look at degree mod 2. Here, we had to rely on smoothness for our path homotopy invariants to apply, but some heavy machinery homological approaches also extend this result to continuous settings.

Footnotes

Arrow’s Theorem ↩
Chichilnisky Theorem ↩
Only continuity is actually required, but you can approximate the continuous map arbitrarily closely with a smooth one, and this simplifies the proof ↩

Modelling Counterfactual Impact

2023-06-06T00:00:00+00:00

You’re a bright young kid striving to change the world. As a pure bred rationalist, you of course seek to maximize your counterfactual altruistic impact, but in practice, what does that mean?

I’ll attempt to convince you, based on a simplified mathematical model, that seeking counterfactual impact implies prioritizing fields which have opportunities for heavy-tailed contributions regardless of the talent distribution.

Being Average

Let’s start with a simple model of the world. For a given altruistic field \(F\) (e.g. medicine), assume talent is drawn from some underlying distribution \(\mathcal{D}\), and that \(F\) is supply constrained with a fixed number \(N\) of contributors.

Let \(X_1, \ldots, X_N \stackrel{\text{i.i.d}}{\sim} \mathcal{D}\) represent the impact of the \(N\) contributors to \(F\), and being humble for now, we assume that in expectation we’re identical to everyone else throwing their hat in the ring, letting \(X_{N+1}\sim \mathcal{D}\) represent our own impact. What is the counterfactual impact in this scenario?

Well, taking the difference between the worlds where you do and don’t contribute, we see that the counterfactual impact \(I\) is expressed as:

\[\begin{align*} I &= \mathbb{E}[ \sum_{i=1}^{N+1} X_i - \min(X_1, \ldots, X_{N+1}) ] - \mathbb{E}[ \sum_{i=1}^{N} X_i ] \\ &= \mathbb{E}[ X_{N+1} - \min(X_1, \ldots, X_{N+1}) ] \end{align*}\]

When is this value large?

Well, if the distribution \(\mathcal{D}\) has a heavy left tail, then the difference between the average and minimum values will be large, and so the counterfactual impact will be large. But, generally speaking, impact is a non-negative quantity, so the counterfactual impact here is upper bounded by \(\mathbb{E}[X_{N+1}]\), which naively represents a maximum relative change of about a \(\frac{1}{N}\) fraction of the field’s total impact!

For any significantly sized field, this implies that your relative counterfactual impact is negligible if you don’t expect to significantly outperform the average. For many, relative impact is also a proxy for how much pride and value you can derive from your work, so this is a fairly depressing result when the peer group is \(F\) itself.
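Here is a rough simulation of the average case, as a sketch under assumptions the model above leaves open: talent is drawn from a Pareto distribution with shape \(\alpha = 3\) (mean \(1.5\)), the field has \(N = 20\) incumbents, and `average_case_impact` is a made-up helper name.

```python
import random

def average_case_impact(n, alpha, trials=2000, seed=0):
    """Monte Carlo estimate of I = E[X_{N+1} - min(X_1, ..., X_{N+1})]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.paretovariate(alpha) for _ in range(n + 1)]
        total += xs[-1] - min(xs)  # the last draw plays the role of X_{N+1}
    return total / trials

n, alpha = 20, 3.0
mean_impact = alpha / (alpha - 1)  # mean of a unit-scale Pareto(alpha)
impact = average_case_impact(n, alpha)
print(f"counterfactual impact ~ {impact:.2f} (bounded by E[X] = {mean_impact:.2f})")
print(f"relative to the field total: {impact / (n * mean_impact):.3f} (bound 1/N = {1 / n:.3f})")
```

The estimate lands well under the \(\mathbb{E}[X_{N+1}]\) bound, and the relative change is correspondingly a small fraction of \(\frac{1}{N}\).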

Being the Best

Well now, of course no self-respecting ambitious altruist aims to achieve just an average impact. So, let’s graciously assume that we’re the top of the top, best of the pack, and implicitly condition on \(X_{N+1} = \max(X_1,\ldots, X_{N+1})\). What does this imply for our counterfactual impact?

Again, taking the difference between the worlds where we do and don’t contribute we find:

\[\begin{align*} I &= \mathbb{E}[ \sum_{i=1}^{N+1} X_i - \min(X_1, \ldots, X_{N+1}) ] - \mathbb{E}[ \sum_{i=1}^{N} X_i ] \\ &= \mathbb{E}[ \sum_{i=1}^{N+1} X_i - \min(X_1, \ldots, X_{N+1}) ] - \mathbb{E}[\sum_{i=1}^{N+1} X_i - \max(X_1,\ldots, X_{N+1})] \\ &= \mathbb{E}[ \max(X_1,\ldots, X_{N+1}) - \min(X_1,\ldots, X_{N+1})] \\ &\leq \mathbb{E}[\max(X_1, \ldots, X_{N+1})] \quad \text{since } X_i \geq 0 \end{align*}\]

As expected, our counterfactual impact in this scenario is strictly greater now that we’re the best. Intuitively, distributions for which this maximum can be large relative to the average are what we’re looking for – effectively, heavy right-tailed distributions. Here, we easily see that the relative impact is now instead given by

\[\begin{align*}\frac{\mathbb{E}[\max(X_1, \ldots, X_{N+1})]}{N \mathbb{E}[X_i]}\end{align*}\]
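To see how much the right tail matters for this ratio, here is a sketch comparing a heavy-tailed and a light-tailed talent distribution. The specific choices (Pareto with \(\alpha = 1.5\) versus a unit-rate exponential, \(N = 50\), and the helper names) are mine, purely for illustration, and the heavy-tailed estimate is noisy because its variance is infinite.

```python
import random

def mean_max(sampler, n, trials, rng):
    """Monte Carlo estimate of E[max(X_1, ..., X_{N+1})]."""
    return sum(max(sampler(rng) for _ in range(n + 1)) for _ in range(trials)) / trials

def relative_impact(sampler, mean, n, trials=2000, seed=0):
    """Estimate E[max] / (N * E[X]), using the known mean of the distribution."""
    rng = random.Random(seed)
    return mean_max(sampler, n, trials, rng) / (n * mean)

n = 50
# Unit-scale Pareto with alpha = 1.5 has mean alpha / (alpha - 1) = 3.
heavy = relative_impact(lambda r: r.paretovariate(1.5), mean=3.0, n=n)
# Unit-rate exponential has mean 1.
light = relative_impact(lambda r: r.expovariate(1.0), mean=1.0, n=n)
print(f"heavy-tailed: {heavy:.3f}, light-tailed: {light:.3f}")
```

The best contributor accounts for a noticeably larger share of the field under the heavy-tailed distribution, which is the intuition behind looking for heavy right tails.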

Interestingly, for distributions in which

\[\begin{align*} \lim_{N\to \infty} \mathbb{E}[\max(X_1, \ldots, X_{N+1}) - \max(X_1, \ldots, X_N)] > \mathbb{E}[X_i] \end{align*}\]

we see that it’s actually better to choose fields with larger \(N\), where our assumption of being the best is stronger and utilized with greater effect.

To help give some intuition where this phase transition might apply, consider \(\mathcal{D}\) explicitly as the Pareto Distribution¹ with parameter \(\alpha\) (and unit scale), which is widely used for modelling talent and real world output distributions. For \(\alpha > 1\) (required for finite mean), a quick lookup indicates that \(\mathbb{E}[X_i] = \frac{\alpha}{\alpha-1}\) for \(X_i \sim \mathcal{D}_\alpha\), and the CDF and inverse CDF are given by

\[\begin{align*} F(x) &= 1 - \frac{1}{x^\alpha} \\ \implies F^{-1}(x) &= \frac{1}{(1-x)^\alpha} \quad \text{Convex on } [0,1] \end{align*}\]

Now since we can view \(X_1,\ldots,X_{N+1}\sim \mathcal{D}_\alpha\) as first sampling uniformly in percentile space, and then applying the inverse CDF \(F^{-1}\), this gives us a very simple route to determining the counterfactual impact for a given \(N\) and \(\alpha\). From properties of the uniform distribution, we know that \(\mathbb{E}[\max(F(X_1),\ldots, F(X_{N}))] = 1 - \frac{1}{N}\), and so applying some harmless trickery (to avoid real math), we see that

\[\begin{align*} I &= \mathbb{E}[ \max(X_1, \ldots, X_{N+1}) - \max(X_1,\ldots, X_{N})] \\ &= \mathbb{E}[ \max(F^{-1}(F(X_1)),\ldots, F^{-1}(F(X_{N+1}))) - \max( F^{-1}(F(X_1)),\ldots, F^{-1}(F(X_{N})))] \\ &= \mathbb{E}[ F^{-1}(\max(F(X_1)),\ldots, F(X_{N+1})) - F^{-1}(\max( F(X_1),\ldots, F(X_{N})))] \\ &\geq F^{-1}(\mathbb{E}[ \max(F(X_1)),\ldots, F(X_{N+1})])- F^{-1}(\mathbb{E}[\max( F(X_1),\ldots, F(X_{N}))]) \quad \text{ by Jensen's} \\ &= F^{-1}(1 - \frac{1}{N+1}) - F^{-1}(1 - \frac{1}{N}) \\ &= (N+1)^\alpha - N^\alpha \\ &> \alpha N \end{align*}\]

Surprisingly, even for the Pareto Distribution we see that for \(N \geq \frac{\alpha}{\alpha-1}\) that it becomes better for us to choose fields with larger \(N\) in terms of relative impact! Notably, for most reasonable values of \(\alpha\) this phase transition begins at surprisingly low values of \(N\).

Being the z-th Percentile

Now, sticking with the Pareto Distribution, let’s consider a slightly more realistic scenario. Suppose that we’re not the best, but instead the \(z\)-th percentile of the field, and we enter regardless of whether or not we’re worse than everyone else. What does this imply for our counterfactual impact?

To start, note that this implies \(F(X_{N+1}) = z\). Manipulating our previous result to use \(F^{-1}(z)\) instead of \(\max(X_1,\ldots,X_{N+1})\), it’s easy to see that the counterfactual impact is now given by

\[\begin{align*} I &= F^{-1}(z) - \mathbb{E}[ \max(X_1,\ldots, X_{N})] \\ &\geq \frac{1}{(1-z)^\alpha} - F^{-1}(1-\frac{1}{N}) \\ &= \frac{1}{(1-z)^\alpha} - N^\alpha \end{align*}\]

So in terms of \(N\), the phase transition at which it becomes better to choose fields with larger \(N\) for maximizing relative impact is specified by

\[\begin{align*} \frac{1}{(1-z)^\alpha} - N^\alpha &> \frac{\alpha}{\alpha-1} \\ \implies z &> 1 - \left( \frac{1}{N^\alpha + \frac{\alpha}{\alpha - 1}} \right)^\frac{1}{\alpha} \\ &> 1 - \frac{1}{N} \quad \text{ for } \alpha > 1 \end{align*}\]

which effectively equates to being the best in the field. This leads to the surprising conclusion that unless you really are going to be the best at what you do, the relative impact of your work will be maximized in smaller, less competitive fields, even for common heavy-tailed distributions.

Conclusion

These are pretty simplified models, but still further advance my intuition towards seeking tail upside opportunities (confirmation bias, possibly?) to maximize net impact, and pursuing less competitive, smaller contexts to maximize relative impact. Reality is, however, that most of the time, tail opportunities and performance aren’t particularly accessible to most and aren’t nearly as random as I’d like, and so these insights should be taken with a grain of salt.

In the future, some simple extensions of this model that would be fun to look into include: sampling \(K\) people applying to join field \(F\) and only choosing the top \(N\), introducing multiple competitive fields to choose from, and adding uncertainty to your own performance estimates.

Footnotes

Pareto Distribution ↩

Reflections on a Stanford Journey

2023-06-05T00:00:00+00:00

It’s now been almost two years since graduating from Stanford, and I wanted to carve out a post to reflect deeply on how well I spent my time there, now with the benefit of hindsight, and with the additional goal of informing my future self to bias towards activities that matter longer term. There are a variety of ways to measure quality of time, but the core aspects that I’ve grown to appreciate can be broadly categorized within: social life, health, career, and academics¹.

Since at any given time, I found myself optimizing between these categories simultaneously, I’ll present a quarter by quarter rating of each with some brief commentary and use (*) to highlight particularly high quality experiences or activities.

Fall 2017

Rating:

Social Life: 6/10
Health: 5/10
Academics/Career: 7/10

Coursework (20): Econ 1, CS 106X, Math 113*, Freshman Writing Seminar (ESF)

Adjusting to Stanford was fairly difficult my first quarter. As a natural introvert and lifetime cynic, finding social groups outside of dorm activities or in clubs was tough, and I ended up having large blocks of free time by myself which was undesirable.

Highlights:

Despite mostly restricting social life to my dorm, I’ve met many of my closest lifetime friends there
Engaging philosophical discussions with dormmates which helped solidify awareness of my own lack of understanding and consistency

Mistakes:

Not taking the standard Math/Physics courses to meet other freshmen
Not joining non pre-professional clubs (like sports, etc.) to meet friends
Not finding a consistent fun physical activity to engage with people in (does Ping Pong count?)

Winter 2018

Rating:

Social Life: 9/10
Health: 8/10
Academics/Career: 7/10

Coursework (22): Math 62CM*, CS 161, CS 107, CS 43, CS 22, PHIL 120

Highlights:

My friendships within the dorm community expanded and deepened significantly
Taking Math 62CM significantly developed my abstract reasoning skills, and also introduced me to a lot of friends from the Math major community

Mistakes:

Didn’t touch grass enough (a recurring mistake throughout most of my collegiate experience)
Naively thought socializing a lot meant drinking a lot. This was mostly corrected in future quarters, but is one of the first examples of “Ryan is a sheeple”

Spring 2018

Rating:

Social Life: 2/10
Health: 3/10
Academics/Career: 10/10

Coursework (22): Math 120, Math 147, Physics 65, CS 166*, CS231N*, CS142

One of my goals throughout freshman year was to find my intellectual/productive limits, which I aimed to accomplish by upping my course load difficulty each quarter until I broke. This quarter was by far my most taxing and difficult throughout college, due to the quantity of work and my own lack of academic maturity, which firmly had me reach the goal of breaking. Most of my memories from this quarter are whirlwinds of psetting in the dark on limited sleep with friends or alone, with little sunlight or external social activity outside of the first week of classes.

Highlights:

Deepened relations with close friends in the Math and CS communities, and finally met the Physics kids in the honors sequence which was nice
Found my academic limits, learned a ton of cool material, and drank the firehose of knowledge with the highest density of intellectually fun classes I’ve taken at Stanford
Realized pretty clearly that being a great theory PhD in Math was out of the question – the classmates in my twelve student Differential Topology course somehow self-selected quite strong that quarter, to the point where I felt talent-wise near the bottom quartile of the class. This was a great wake-up call looking back (at the time, I was quite upset), as it really helped me bias towards more realistic paths

Mistakes:

Didn’t touch grass enough or see the sun in beautiful Spring weather
Didn’t give myself enough free time to relax and reflect which had mental health tolls

Summer 2018

Rating:

Social Life: 8/10
Health: 9/10
Academics/Career: 4/10

Internship: Worked at a tech startup (now defunct) as a SWE intern with one of my closest friends. I did a hybrid internship, commuting for a week at a time between home in SoCal and Redwood City (staying at my friend’s place). This was a great, non-demanding summer, where I really got the chance to relax outside, review material and learn new things for fun, and hang with friends.

Highlights:

Sun, grass, friends

Mistakes:

During my later years in school, I sometimes would regret not taking the more prestigious quant internship offer I had, as I really felt like it would have made recruiting much less stressful/difficult in future years. It’s not totally clear if this was a mistake or not now, but I do think it would have helped me bias my exploration a bit better

Fall 2018

Rating:

Social Life: 6/10
Health: 5/10
Academics/Career: 9/10

Coursework (22): Math 171, Math 220, Physics 110, CS 221, EE 263, Sophomore Writing Seminar (PWR 2)

Another fairly academic quarter, where I tried to bias more towards applied subjects (Physics and EE), after realizing in Spring that competition in pure theory subjects was brutal. Notably, this was the last quarter I took at Stanford where I was primarily invested in academics – school took less of a priority after I realized math/physics PhD wasn’t necessarily the route I wanted, and actively engaged more in social life, and light exploration of several other academic fields/career choices.

Highlights:

Sophomore year I became roommates with my best friend from Freshman year in my dorm, and our friendship has continually been a highlight of my life

Mistakes:

I really wanted a study abroad experience in Asia without losing out on being at campus during the school year with friends. So I tried to only apply for internships in China/Singapore/Japan but didn’t have any luck. Being a bit narrow here was short-sighted and added unneeded stress later on
Not touching grass or seeing the sun enough

Winter 2019

Rating:

Social Life: 9/10
Health: 8/10
Academics/Career: 5/10

Coursework (15): Math 121, Physics 230*, EE 364A*, CS 255, ENGLISH 146A

After realizing my health had suffered during the past Fall/Spring, I decided to take a lighter course load. This proved to be a great decision, and I enjoyed spending the free time socializing with my roommate and other friends. At this point, I also took some of that free time and joined my first club, Christian Intervarsity (IV), where I met a lot of phenomenal people and friends I’ve still kept in touch with. Coincidentally, the courses I took ended up being very high quality, with Physics 230 being my favorite individual course at Stanford, and EE 364A being one of the most useful applied courses taken.

Highlights:

Phenomenal teachers and interesting coursework
Backgammon, poker, fun reading, and board games with my social communities
Roommate and I secured quant internships in London together at the same firm (purposefully!). Another instance of “Ryan is a sheeple”, as I really only pursued quant since some smarter friends of mine were doing it and I wanted to be cool like them

Mistakes:

Not a significant amount of physical activity/grass touching

Spring 2019

Rating:

Social Life: 8/10
Health: 8/10
Academics/Career: 2/10

Coursework (19): Physics 231, CS 272, CS 199, CS 224U, CS 229*

This quarter had a ton of free time which I spent socializing and meeting new people – which has had pretty high returns on my current social life quality. Started my first ever relationship, which I learned a ton about myself from.

Highlights:

Free time, touching grass, sunlight, meeting new people, and getting more involved in the IV club
Got top score on CS229 final (out of like 600+ people) which was my first topper at Stanford. This truly doesn’t matter (like most clout things lol), except it proved useful as a signal when reaching out to Stanford profs to ask to join their lab as an undergrad researcher later on (some profs would ghost me from my resume, and then I’d reach out again a couple months later with this listed, and they’d suddenly want to talk to me about it explicitly). Leaving this here only to remind myself that sometimes, people care about/weight random things more than you’d expect, and learning what those are can be difficult a priori

Mistakes:

Pretty much all of the classes outside of CS229 were bullshit, which I knew. After the easier Winter had me happier, I leaned a bit too far into the realm of fooling myself with useless classes and stopped learning for the most part this quarter.
Did independent Research with Anshul Kundaje’s lab (CS199), which was a terrible choice (Anshul is great though). I had never done research before, and without any mentorship or really solid genomics background, ended up mostly doing wasted work

Summer 2019

Rating:

Social Life: 6/10
Health: 7/10
Academics/Career: 7/10

Internship: Quant Trading in London with my roommate and continued part-time research in Genomics.

Highlights:

Learned a lot during my internship with data science skills and stats
Read a lot of interesting books in Neuroscience/Computational Biology etc. (thought for a hot sec I should be a doctor to help the world…but that’s not my skill set I’ve realized). Abbott’s Theoretical Neuroscience* was a phenomenal read in particular

Mistakes:

Between work, research, and reading I didn’t spend a lot of time traveling around London or exploring the city as much as I should have. This has been a major regret (when else will I be living there for multiple straight months?)

Fall 2019

Rating:

Social Life: 9/10
Health: 7/10
Academics/Career: 5/10

Coursework (20): CS 273B, CS 330, CS 279, ECON 180*, Math 230A*, CS 199

I continued to lean into the idea of working in Biotech or going to Med School, by exploring some of the computational aspects of genomics/biology/chemistry. This was a really clear example of fruitful exploration, as by the end of the quarter I was fairly well convinced that this wasn’t the right path for me – this may change over time though, and I’m still intellectually curious about the space!

Highlights:

Had a fantastic dorm group, with tons of friends (from diverse groups) also drawing into Crothers Hall by happy coincidence. My roommate and I stayed together again after sharing a place in London which was a great cornerstone of my daily social interactions
Weekly poker games with a large dorm group, and almost daily card games with friends and my roommate in my room
Secured an internship at QuantCo, which in theme, I really wanted as an opportunity to explore industry applications of ML in healthcare with cool people

Mistakes:

Ended up quitting my independent research at end of quarter after realizing it wasn’t super rewarding/fruitful. I should have quit sooner (back in Spring) or put myself in a better position to learn.
My standard policy of optimizing social life by not attending lectures much really hurt me this quarter, as my game theory class had a lot of interactive material which I missed out on, and missed out on opportunities to meet and join new friend groups through that
Over prioritized my relationship sometimes, instead of expanding/deepening other social circles. This one was pretty hard for me to see until now

Winter 2020

Rating:

Social Life: 6/10
Health: 6/10
Academics/Career: 5/10

Coursework (17): CS 259Q, CS 224N, CS 261*, Math 143, Phil 151

The first 8 weeks were pre-covid, and consisted of a lot of fun continuation of the themes in Fall, including the increased emphasis on social life.

Highlights:

To support career exploration, I wanted to get experiences with industry research in Fintech, ML in academia, and big tech (since “Ryan is a sheeple”, and saw grass that a significant percent of other Stanford sheeple were grazing on)

Mistakes:

Should have shorted S&P in March lol
Not touching grass, seeing sunlight

Spring 2020

Rating:

Social Life: 1/10
Health: 2/10
Academics/Career: 9/10

Coursework (21): Math 228, EE 180*, CS 110, CS 399, Music 25, German 101

Internship: Wealthfront*

Research: RL Research with Dorsa*

Realizing Covid was going to crush any hope of an on-campus experience, I hunkered down and decided to go pretty hard on my work. In hindsight, I really should have taken it easier as I was not in a great mental state (along with everyone else isolated during covid), and really owe it to my family for helping support me throughout.

Luckily, I finally hit the jackpot in terms of mentorship, and my intern manager at Wealthfront was phenomenal despite the pandemic disruptions, and had a really engaging, cool project. Likewise, for my second attempt at academic research I joined a new project in Dorsa’s lab focused on pure RL, and this time I prioritized mentorship and a good team which dramatically improved my experience and output.

Highlights:

Great internship and research projects in terms of both intellectual engagement and mentorship
Fantastic family support during the stressful pandemic

Mistakes:

Everything was pass fail, so spending too much time on classes I didn’t care about was bad EV. Should have done Wealthfront part-time as I got pretty time-crunched once classes and research picked up
Not touching grass, seeing sunlight, seeing friends, eating well
Not actively showing enough appreciation and gratitude to family during this unique time

Summer 2020

Rating:

Social Life: 7/10
Health: 7/10
Academics/Career: 5/10

Internship: QuantCo

Research: Continued RL Research with Dorsa*

In summer, I prioritized hanging out (socially distanced) with friends, or in organized online gatherings with some consistency. Remote internship ended up being not very time intensive, which I took advantage of to spend more time with research and playing outside.

Highlights:

Great research project and team (wrapped up project end of summer), and chill internship
Saw Sunlight, touched grass, walked dog, played games with fam

Mistakes:

Between Spring and Summer, lost contact with a lot of friends and acquaintances (who I wanted to get to know better!) in the virtual transition. I think crossing the chasm and reaching out to people directly would have been very rewarding longer term

Fall 2020

Rating:

Social Life: 5/10
Health: 4/10
Academics/Career: 3/10

Coursework (12): Math 116, Math 215A*, Physics 330, Physics 170

Internship: FB

Lived with some friends in a house during a virtual quarter to try and make the best of quarantine. My memories of this time are very physically dark, as the house had very few windows and we rarely went outside (stayed in a bad neighborhood) and were concerned about Covid. At this point, I had enough to graduate with the Bachelor and Coterm, but for lack of motivation to go and work, decided to just stick it out and play essentially through Spring with the rest of my house/classmates – I think staying was potentially one of the best decisions I made during Stanford, as leaving during Covid would have really stifled any opportunity to rekindle relationships.

Highlights:

Learned that the research I did with Dorsa’s lab won Best Paper at CoRL 2020². This was truly pure luck on my part and I’m extremely fortunate to have been part of the team and project by chance. I really enjoyed the accolade, but realize that (1) I wasn’t super passionate about the material during the research process (was still reasonably fun though); (2) Most times, similar effort won’t be recognized. I’m noting both to try and correct my cognitive bias and update away from doing an AI PhD purely on the basis of this
Great friends and some of the most interesting classes I got the chance to take – I really liked learning about topology/quantum stuff despite not learning the material very solidly

Mistakes:

Had a terrible team/internship at FB and quit halfway through – should have quit earlier
Extremely low motivation, with lots of free time (hanging with the same 6 people in a house 24/7…), led to lots of time wasters like TV/Video Games. In part, I think low motivation contributed to me mostly recruiting for quant instead of aiming for something more unique/ambitious (yet another “Ryan is a sheeple” example)
Not touching grass, seeing sunlight, eating healthily or learning to cook…

Winter 2021

Rating:

Social Life: 8/10
Health: 7/10
Academics/Career: 6/10

Coursework (24): Math 236, Physics 220, Physics 113, EE 276*, CS 228, Applied Physics 228, Music 19A

Continued living mostly with the same group plus some other friends in a safer, cheaper location out of state where we could afford a bigger place. Hung out and went outside to hike and eat out more as quarantine restrictions relaxed slowly which drastically improved quality of life. Because I was feeling pretty lethargic from all of my video games and free time in Fall, I decided to go a bit harder on coursework – this backfired halfway through the quarter, as I got really pulled into the whole GME saga and ended up spending a majority of my time learning about finance, staring at flashy charts, reading options theory and textbooks, and other dumb things (classic “Ryan is a sheeple” example).

Highlights:

Got much closer with my housemates through engaging outdoor activities and shared interests compared with my Fall experience. Learned to barely cook, which was fun
Settled on QR at Vatic Labs as my post-graduation job
I had never actually prioritized grades in my academics, and instead prioritized marginal value per effort and breadth/exploration for understanding my interests (my life is often characterized as fleeing from boredom/ennui). Knowing that this would likely be my last rigorous quarter at Stanford, I challenged myself to get as high of a GPA as I could for the quarter which I was proud to semi-succeed at

Mistakes:

Lost money on GME, and spent way too much time looking into dumb finance stuff
Struggled dealing with house conflicts that inevitably arise in group houses. My dorm roommate was not a part of either Fall/Winter house, and our great relationship had never really been stressed living together in the same way that the group house could at times

Spring 2021

Rating:

Social Life: 12/10
Health: 12/10
Academics/Career: 2/10

Coursework (12): EE 378C, Math 159, Math 215C*, Stats 375

Senior Spring was my victory lap quarter with school finally brought back on campus. With no academic or work requirements left, I heavily prioritized social life and outdoor activities. This quarter is likely the most vibrant and happiest I’ve ever been.

Highlights:

I really really loved playing Spikeball with friends, and for 1-2 hours on average every day it was my go-to social activity throughout the perfect Spring weather. I also scheduled daily meals and hangouts with everyone I possibly could that remained on campus
Started a habit of taking photos and videos of fun activities in the moment, with the goal of just saving to view for later (I don’t post on social media much). The photos and videos from Senior Spring are things I still view and enjoy, and I’ve kept this up post-grad as one of my favorite ongoing habits

Mistakes:

Mostly none :). Probably could have spent more effort learning material from courses, as they really were quite interesting, but they didn’t edge out competing interests for time

Conclusions

After writing and reflecting on the above, I think it’s pretty clear to me that there are certain patterns that have held up in terms of improving health and happiness. Some obvious ones are the frequency of social interaction, the quality and intellectual engagement of academic course loads, and having solid teams and mentorship to work with.

Some that seem obvious in retrospect but clearly were not at the time include the quantity of time spent outdoors and the consistency of physical activity. Clearly, I was hanging out with friends and playing games quite a bit, but why didn’t I do those activities outside? Smh.

There are also some things I spent a lot of energy on that, given my current path, seem wasted, but weren’t quite so clear at the time. I spent a solid few quarters on genomics research and learning computational approaches in biology/chemistry, which I’ll likely never use. But, these types of things, I suppose, are only clear in hindsight. Between my FB internship and genomics research, I realize that quitting was certainly the right move, but in the future, it might be better to bias even more towards quitting earlier³. Likewise, I now realize that classes/internships which I expected to be fluff in advance… usually were, in fact, wastes of time (surprising, right?). And the only thing I got out of them was deceiving myself, unfortunately. Finally, I now recognize that herd mentality and paths of least resistance are more of a force in my life than I originally thought, and that I should actively try to combat these biases in how I currently allocate my time and in my plans for the future.

Taking this time to write down thoughts and self-reflect in itself has been quite rewarding, and I hope anyone who reads this may find it as insightful for themselves in terms of approaches to replicate or avoid, as I have.

Footnotes

¹ Courses can be looked up here by number if you’re curious
² Best Paper announcement
³ Ben Kuhn has a great article on optimal control and stopping times in the context of startup options, which illustrates the advantages quitting confers. Generally, quitting earlier than you’d think is likely closer to the optimal route (at least for myself!)

Autogenerated Dictionary

2023-05-27T00:00:00+00:00

In Python, I often find myself wanting to quickly construct dictionaries with nested structures. Unfortunately, the following use case is not possible with Python dictionaries used in the standard way (the third line raises a KeyError on the missing key 'deep'); the string at the bottom shows the structure I’d like to get:

dct = {}
dct['a'] = 1
dct['deep']['nested']['auto']['inferred']['structure'] = 2

print(dct)

"""
{
    'a': 1,
    'deep': {
        'nested': {
            'auto': {
                'inferred': {
                    'structure': 2,
                }
            }
        }
    },
}
"""

After a little bit of exploration with Python’s defaultdict (from the collections module), I found that this is easy enough to accomplish with some fun recursive trickery.

from collections import defaultdict
def reflect():
    return defaultdict(reflect)

dct = defaultdict(reflect) # Now works perfectly with the previous example
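To double-check the recursive trick, here is a minimal runnable sketch. The `to_plain_dict` helper is not part of the original snippet; it is a hypothetical convenience for printing, since a nested `defaultdict` otherwise prints with its wrapper type rather than as the plain structure shown earlier.

```python
from collections import defaultdict


def reflect():
    # Every missing key creates another auto-nesting defaultdict.
    return defaultdict(reflect)


def to_plain_dict(d):
    # Hypothetical helper: recursively convert nested defaultdicts to dicts.
    if isinstance(d, defaultdict):
        return {key: to_plain_dict(value) for key, value in d.items()}
    return d


dct = reflect()
dct['a'] = 1
dct['deep']['nested']['auto']['inferred']['structure'] = 2

print(to_plain_dict(dct))
# {'a': 1, 'deep': {'nested': {'auto': {'inferred': {'structure': 2}}}}}
```

One thing to watch out for: merely reading a missing key (e.g. `dct['typo']`) silently creates it, so membership checks should use `in` rather than indexing.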

Single Forward Pass Gradients

2023-04-27T00:00:00+00:00

I learned this trick for computing derivatives of a locally smooth function \(f: \mathbb{R} \to \mathbb{R}\) a long time ago in a numerical physics class at Stanford. The primary insight is that you can analytically continue \(f\) to the complex plane in a local region, \(f:\mathbb{C}\to\mathbb{C}\), and then note that, to first order in \(h\),

\[\begin{align*} f(x + ih) \approx f(x) + ih f'(x) \end{align*}\] \[\begin{align*} \implies f'(x) \approx \frac{1}{h}\text{Im}[f(x+ih)] \end{align*}\]

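For sufficiently small \(h\), this shows that a single evaluation of \(f\) is enough to obtain the derivative at a point. Conveniently, numpy and many other numerical Python libraries support complex number types with overridden functionality, meaning this trick can often work without requiring any additional code changes. Despite the \(\frac{1}{h}\) factor, it is also numerically well behaved: unlike a finite difference, there is no subtraction of nearly equal values, so \(h\) can be made tiny (say \(10^{-20}\)) without round-off blowing up. The main catch is that \(f\) has to be built from operations that are analytic for complex inputs (no abs or comparisons along the way), but it’s a neat trick nonetheless.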
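A minimal sketch of the trick, using the standard-library `cmath` module so it runs without dependencies (numpy’s `np.exp` and `np.sin` accept complex inputs in the same way). The test function, the step sizes, and the finite-difference comparison are my own illustrative choices, not something from the class.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    # One evaluation of f at x + ih; the imaginary part carries h * f'(x).
    return f(complex(x, h)).imag / h


def central_difference(f, x, h=1e-6):
    # Two-evaluation finite difference, included only for comparison.
    return ((f(x + h) - f(x - h)) / (2 * h)).real


def f(z):
    # Written with cmath so the same code accepts real or complex input.
    return cmath.exp(z) * cmath.sin(z)


x0 = 1.0
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0))
print("complex step error:      ", abs(complex_step_derivative(f, x0) - exact))
print("central difference error:", abs(central_difference(f, x0) - exact))
```

The complex-step error should land near machine precision, while the finite difference is limited by the usual trade-off between truncation and cancellation.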

Wealth management with Nested Benders

2021-10-25T00:00:00+00:00

Spring 2020 was strange. Covid had just hit, I was taking an overloaded virtual class schedule from home, and I found myself starting a full-time remote internship at Wealthfront. Despite the chaos, I ended up working on one of the most intellectually engaging projects I’ve had – building a completely automated roboadvisor for wealth management.

Modeling Wealth Management

Wealth management, at its core, is about helping people make better financial decisions over their lifetime. The objective function optimizes economic utility discounted by time and goals – think retirement savings, buying a house, paying for college, or just maintaining your desired lifestyle. We add soft constraints for objectives like target retirement age, house purchases, college savings, and other life goals that matter to real people.

The input variables are straightforward but numerous: your income stream, monthly cash flows across lifestyle expenses, bills, taxes, and your various savings accounts like 401(k), Roth IRA, brokerage, checking, and savings accounts. Each of these accounts has different tax treatment, contribution limits, withdrawal penalties, and growth characteristics.

Constraints come in two flavors. Hard constraints encode the tax system to restrict actions (you can’t over-contribute to your 401(k)), ensure solvency (you need to eat), and handle mandatory goals. Soft constraints capture preferences like “I want to buy a house in 5 years” that can be violated if necessary but incur penalties in the objective function.

The optimization spans a user’s entire lifetime. In practice, we discretized this into quarterly decision periods and formed a branching tree model where branches represent changes in external environment variables. The key external variables are things like stock returns and bond rates – uncertain factors that dramatically impact optimal allocation strategies. For a simplified example, consider a 4-node branching structure for 2 external variables (stocks and bonds). We’d take the 4 eigenvector directions that capture maximum variance at roughly 95% confidence using covariance matrices from historical data.
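To make the branching idea concrete, here is a rough sketch of one way to build 4 scenarios for 2 external variables. It rests on assumptions that are mine rather than a description of the production system: the 4 branches are taken as plus and minus steps along the 2 eigenvectors of the covariance matrix, scaled to lie on the 95% ellipse of a bivariate normal, and the numbers in the example are made up.

```python
import math


def branch_scenarios(mean, cov, confidence=0.95):
    """Four (stock, bond) return scenarios from a 2x2 covariance matrix.

    Assumption (not from the original post): the 4 branches are +/- steps
    along the 2 eigenvectors, scaled to lie on the `confidence` ellipse of
    a bivariate normal.
    """
    (a, b), (_, c) = cov
    if b == 0:
        eigenpairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    else:
        # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
        mid, rad = (a + c) / 2, math.hypot((a - c) / 2, b)
        eigenpairs = [(lam, (b, lam - a)) for lam in (mid + rad, mid - rad)]
    # Chi-square quantile with 2 degrees of freedom: -2 * ln(1 - confidence).
    chi2 = -2 * math.log(1 - confidence)
    scenarios = []
    for lam, (vx, vy) in eigenpairs:
        scale = math.sqrt(chi2 * lam) / math.hypot(vx, vy)
        dx, dy = scale * vx, scale * vy
        scenarios.append((mean[0] + dx, mean[1] + dy))
        scenarios.append((mean[0] - dx, mean[1] - dy))
    return scenarios


# Hypothetical quarterly mean returns and covariance for (stocks, bonds).
mean = (0.02, 0.01)
cov = ((0.0064, -0.0004), (-0.0004, 0.0009))
for stock_ret, bond_ret in branch_scenarios(mean, cov):
    print(f"stocks {stock_ret:+.3f}, bonds {bond_ret:+.3f}")
```

Each child node of the tree would then carry one of these return pairs as its realization of the external variables.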

The action space at each time step is actually convex: you’re choosing where to allocate your money across accounts to optimize the objective. This lends itself to nice optimal decompositions that can be solved efficiently.

Large Optimization Problem

Put it all together and you get a massive mixed integer program. The continuous variables handle account allocations and cash flows. The integer variables encode discrete decisions (shady big-M method?) – should we buy a house this quarter? Should we retire? These binary decisions couple with the continuous optimization in complex ways that blow up the problem size.

For context, a typical user’s lifetime optimization might involve 40 years × 4 quarters = 160 time steps, each with a branching factor of 4 scenarios, giving roughly \(4^{160}\) paths through the tree, and still \(4^{40}\) even if we only branched once per year (in practice, we branched less frequently or with fewer eigenvectors). Each node in this tree has dozens of continuous and integer variables, resulting in a MIP too big to solve in real time using off-the-shelf solvers.
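A quick back-of-envelope check of how fast a full scenario tree grows. The helper name is mine, and the two cases (branching every quarter versus once a year) are just illustrative settings.

```python
def scenario_tree_size(stages, branching):
    # Root-to-leaf paths and total node count of a full tree
    # with `stages` branch points and `branching` children per node.
    paths = branching ** stages
    nodes = (branching ** (stages + 1) - 1) // (branching - 1)
    return paths, nodes


quarterly_paths, _ = scenario_tree_size(stages=160, branching=4)
yearly_paths, _ = scenario_tree_size(stages=40, branching=4)
print(f"branch every quarter: ~10^{len(str(quarterly_paths)) - 1} paths")  # ~10^96
print(f"branch every year:    ~10^{len(str(yearly_paths)) - 1} paths")     # ~10^24
```

Either way, enumerating paths is hopeless, which is why the tree has to be pruned hard and the solve decomposed.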

My project was implementing a solver that could actually handle this, ideally in near real-time to support interactive use cases. I ended up implementing an approach from the research literature: a generic, parallelized, tree-based algorithm that uses nested Benders decomposition to break the problem into efficient modular pieces.

Nested Benders Decomposition

Benders decomposition is a classical technique for solving large optimization problems by exploiting structure. The idea is to decompose a hard problem into a master problem and one or more subproblems, solving them iteratively and passing information between them until convergence.

Two-Stage Benders

Let’s start with the simple two-stage case. Suppose we have decision variables \(x\) (first stage) and \(y\) (second stage) with the problem:

\[\begin{align*} \min_{x,y} \quad & c^T x + f^T y \\ \text{s.t.} \quad & A x \geq b \\ & B x + D y \geq d \\ & x \in X, \, y \in Y \end{align*}\]

The key insight is that once we fix \(x\), the second-stage problem in \(y\) decouples. We can write the second-stage problem as a value function:

\[\begin{align*} Q(x) = \min_{y} \quad & f^T y \\ \text{s.t.} \quad & D y \geq d - B x \\ & y \in Y \end{align*}\]

Now the master problem becomes:

\[\begin{align*} \min_{x, \theta} \quad & c^T x + \theta \\ \text{s.t.} \quad & A x \geq b \\ & \theta \geq Q(x) \\ & x \in X \end{align*}\]

Here \(\theta\) is a scalar variable representing our approximation of the second-stage cost. The problem is that we don’t know \(Q(x)\) explicitly – it’s the optimal value of an optimization problem that depends on \(x\), and in general it can be nonconvex and discontinuous when \(Y\) includes integer variables.

Benders decomposition builds up an outer approximation of the constraint \(\theta \geq Q(x)\) using cuts derived from solving the subproblem. Each iteration works as follows:

Solve the master problem with the current set of cuts to get candidate solution \((\bar{x}, \bar{\theta})\). This gives a lower bound \(LB = c^T \bar{x} + \bar{\theta}\) on the optimal value.
Solve the subproblem \(Q(\bar{x})\) by fixing \(x = \bar{x}\). This gives us the true second-stage cost at \(\bar{x}\), and we update the upper bound \(UB = \min(UB,\; c^T \bar{x} + Q(\bar{x}))\), so that it always reflects the best feasible solution found so far.
Generate a cut based on the subproblem solution:
- If the subproblem is feasible, we get an optimality cut. Let \(\lambda^*\) be the optimal dual variables from the subproblem. Then \(Q(x) \geq f^T y^* + (\lambda^*)^T(d - Bx - Dy^*)\) for any \(x\), where \(y^*\) is the optimal primal solution. Simplifying, this gives the cut:
\[\theta \geq f^T y^* - (\lambda^*)^T(B\bar{x} + Dy^* - d) - (\lambda^*)^T B(x - \bar{x})\]
Or more cleanly: \(\theta \geq Q(\bar{x}) - (\lambda^*)^T B(x - \bar{x})\)
- If the subproblem is infeasible, we get a feasibility cut. Let \(\mu^*\) be an extreme ray of the dual cone (Farkas ray) certifying infeasibility. This gives:
\[0 \geq (\mu^*)^T(d - Bx)\]
which eliminates the infeasible region from the master problem.
Add the cut to the master problem and repeat until \(UB - LB < \epsilon\).
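The loop above can be sketched on a toy problem small enough to solve by hand. Everything here is an illustrative assumption rather than the real solver: scalar \(x\) and \(y\), \(B = D = 1\), made-up costs, a closed-form subproblem standing in for an LP solve, and a master problem solved by enumerating breakpoints. Feasibility cuts are left out because this toy subproblem is always feasible.

```python
import math


def solve_subproblem(x_bar, f=3.0, d=4.0):
    """Q(x_bar) = min f*y  s.t.  y >= d - x_bar, y >= 0, plus the optimal dual."""
    shortfall = d - x_bar
    lam = f if shortfall > 0 else 0.0   # dual of y >= d - x (here B = D = 1)
    return lam * max(shortfall, 0.0), lam


def solve_master(cuts, c=1.0, x_lo=0.0, x_hi=10.0):
    """min c*x + theta  s.t.  theta >= a + s*x for each cut (a, s), x in [x_lo, x_hi].

    The objective is convex piecewise linear in x, so the minimum sits at a
    bound or where two cut lines cross; enumerate those candidates.
    """
    candidates = {x_lo, x_hi}
    for i, (a1, s1) in enumerate(cuts):
        for a2, s2 in cuts[i + 1:]:
            if s1 != s2:
                x = (a2 - a1) / (s1 - s2)
                if x_lo <= x <= x_hi:
                    candidates.add(x)

    def theta(x):
        return max(a + s * x for a, s in cuts)

    x_best = min(candidates, key=lambda x: c * x + theta(x))
    return x_best, theta(x_best)


def benders(c=1.0, eps=1e-9, max_iters=50):
    cuts = [(0.0, 0.0)]            # theta >= 0 is valid because Q(x) >= 0 here
    upper, best_x = math.inf, None
    for iteration in range(1, max_iters + 1):
        x_bar, theta_bar = solve_master(cuts, c=c)
        lower = c * x_bar + theta_bar               # LB from the relaxed master
        q_val, lam = solve_subproblem(x_bar)
        if c * x_bar + q_val < upper:               # keep the best feasible point
            upper, best_x = c * x_bar + q_val, x_bar
        if upper - lower < eps:
            return best_x, upper, iteration
        # Optimality cut: theta >= Q(x_bar) - lam * B * (x - x_bar), with B = 1.
        cuts.append((q_val + lam * x_bar, -lam))
    raise RuntimeError("Benders did not converge")


print(benders())  # (4.0, 4.0, 2)
```

The first master solve picks \(x = 0\) because it knows nothing about the second stage; a single optimality cut is enough to push it to the true optimum at \(x = 4\).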

The beauty is that each cut tightens our approximation of \(Q(x)\). For purely continuous problems, \(Q(x)\) is convex and we get a polyhedral outer approximation that converges finitely. For MIPs, \(Q(x)\) is generally nonconvex, but we still make progress by adding cuts at integer solutions – this is essentially a form of branch-and-cut where Benders generates the cuts automatically from problem structure.

Decomposing into a Tree

Now extend this to our branching tree structure. Each node \(n \in \mathcal{N}\) represents a decision point at time \(t(n)\) under scenario \(s(n)\). Let \(\mathcal{C}(n)\) denote the children of node \(n\), representing possible future scenarios. For our wealth management problem, the decision variables at node \(n\) are:

\(x_n\): allocation decisions (how much to move between accounts)
\(z_n\): state variables (account balances, age, etc.)
\(u_n\): binary variables (should we buy a house? retire?)

The full multistage problem, with \(\pi_n\) denoting the probability of reaching node \(n\) (the product of the transition probabilities along the path from the root), has the structure:

\[\begin{align*} \min \quad & \sum_{n \in \mathcal{N}} \pi_n \left( c_n^T x_n + h_n^T u_n \right) \\ \text{s.t.} \quad & A_n x_n + E_n u_n \geq b_n \quad \forall n \in \mathcal{N} \\ & z_{n'} = T_n x_n + S_n z_n + W_n u_n \quad \forall n' \in \mathcal{C}(n) \\ & x_n \in \mathbb{R}^{d_x}, \, u_n \in \{0,1\}^{d_u}, \, z_n \in \mathbb{R}^{d_z} \end{align*}\]

The constraint \(z_{n'} = T_n x_n + S_n z_n + W_n u_n\) captures state evolution: the balances and state at child node \(n'\) depend on the decisions made at parent node \(n\). Different children get different realizations of random returns (encoded in the branching structure), but they all inherit the consequences of the parent’s decisions.

This is where nested Benders shines. At each node \(n\), we can treat the entire subtree rooted at \(n\) as a two-stage problem: the current node is the “first stage” and everything below is the “second stage”. Define the value-to-go function:

\[\begin{align*} V_n(z_n) = \min_{x_n, u_n} \quad & c_n^T x_n + h_n^T u_n + \sum_{n' \in \mathcal{C}(n)} p_{n'} V_{n'}(z_{n'}) \\ \text{s.t.} \quad & A_n x_n + E_n u_n \geq b_n \\ & z_{n'} = T_n x_n + S_n z_n + W_n u_n \quad \forall n' \in \mathcal{C}(n) \\ & x_n \in \mathbb{R}^{d_x}, \, u_n \in \{0,1\}^{d_u} \end{align*}\]

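where \(p_{n'}\) is the probability of transitioning to child \(n'\). For leaf nodes, \(\mathcal{C}(n)\) is empty, so the future-cost term vanishes (the value-to-go beyond the horizon is 0) and \(V_n(z_n)\) reduces to the leaf’s own stage cost.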
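Before bringing in cuts, it can help to see the value-to-go recursion evaluated by brute force on a tiny tree. This is a toy of my own construction, not the wealth management model: the state is a scalar stock of units, \(x_n\) is restricted to a small integer grid, \(u_n\) is a one-off bulk purchase, and (unlike the formulation above) the local constraint is allowed to depend on the incoming state so that the state actually matters.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    unit_cost: float               # c_n: price per unit bought at this node
    bulk_cost: float               # h_n: price of the one-off bulk purchase (u_n = 1)
    demand: float = 0.0            # units that must be on hand when leaving this node
    prob: float = 1.0              # p_n: probability of reaching this node from its parent
    children: list = field(default_factory=list)


BULK_UNITS = 3                     # W: units delivered by the bulk purchase
MAX_UNITS = 4                      # x_n is restricted to the grid 0..MAX_UNITS


def value_to_go(node, z):
    """V_n(z): cheapest cost from `node` onward given incoming stock z."""
    best_value, best_action = float("inf"), None
    for x, u in product(range(MAX_UNITS + 1), (0, 1)):
        z_next = z + x + BULK_UNITS * u          # state evolution (T = S = 1)
        if z_next < node.demand:                 # local feasibility constraint
            continue
        future = sum(ch.prob * value_to_go(ch, z_next)[0] for ch in node.children)
        value = node.unit_cost * x + node.bulk_cost * u + future
        if value < best_value:
            best_value, best_action = value, (x, u)
    return best_value, best_action


root = Node(unit_cost=1.0, bulk_cost=2.5, children=[
    Node(unit_cost=3.0, bulk_cost=10.0, demand=4, prob=0.5),   # expensive future
    Node(unit_cost=0.5, bulk_cost=10.0, demand=1, prob=0.5),   # cheap future
])
print(value_to_go(root, z=0))  # (3.5, (1, 1))
```

The root hedges against the expensive scenario by buying the bulk package plus one unit now. Brute force like this re-solves every subtree for every candidate action, which is exactly the blow-up the cut-based decomposition is meant to avoid.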

The decomposition becomes a message passing algorithm. Each node \(n\) maintains a master problem:

\[\begin{align*} \min_{x_n, u_n, \theta_n} \quad & c_n^T x_n + h_n^T u_n + \theta_n \\ \text{s.t.} \quad & A_n x_n + E_n u_n \geq b_n \\ & \theta_n \geq \sum_{n' \in \mathcal{C}(n)} p_{n'} V_{n'}(T_n x_n + S_n z_n + W_n u_n) \\ & x_n \in \mathbb{R}^{d_x}, \, u_n \in \{0,1\}^{d_u} \end{align*}\]

Just like in two-stage Benders, we approximate the constraint \(\theta_n \geq \sum_{n'} p_{n'} V_{n'}(\cdot)\) using cuts. But now the cuts come from solving child subproblems, and those child subproblems recursively depend on their children’s value functions.

Optimality cuts from children: When child node \(n'\) solves its subproblem given state \(\bar{z}_{n'}\), it obtains optimal value \(V_{n'}(\bar{z}_{n'})\) and dual variables \(\lambda_{n'}^*\) corresponding to the state evolution constraints. The child sends back to parent \(n\) the optimality cut:

\[V_{n'}(z_{n'}) \geq V_{n'}(\bar{z}_{n'}) + (\lambda_{n'}^*)^T (z_{n'} - \bar{z}_{n'})\]

Substituting \(z_{n'} = T_n x_n + S_n z_n + W_n u_n\), this becomes a cut on the parent’s variables, where \(\theta_{n'}\) is the parent’s estimate of child \(n'\)’s value-to-go:

\[\theta_{n'} \geq V_{n'}(\bar{z}_{n'}) + (\lambda_{n'}^*)^T (T_n x_n + S_n z_n + W_n u_n - \bar{z}_{n'})\]

The parent aggregates cuts from all children: \(\theta_n \geq \sum_{n' \in \mathcal{C}(n)} p_{n'} \cdot \theta_{n'}\).

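Feasibility cuts from children: If a child subproblem is infeasible given \(\bar{z}_{n'}\), it computes a Farkas ray \(\mu_{n'}^*\), together with the amount of infeasibility \(\delta_{n'} > 0\) that the ray certifies at \(\bar{z}_{n'}\), and sends the feasibility cut:

\[0 \geq \delta_{n'} + (\mu_{n'}^*)^T (T_n x_n + S_n z_n + W_n u_n - \bar{z}_{n'})\]

The positive constant matters: without it the inequality would hold with equality at \(\bar{z}_{n'}\) and would not exclude the offending state.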

This eliminates parent decisions that lead to infeasible child states – crucial for ensuring the parent doesn’t make promises the future can’t keep (like over-withdrawing from retirement accounts).

The algorithm proceeds by iteratively solving subproblems at each node, generating cuts, passing them to parents, solving parent problems, and sending updated state values down to children. Convergence is achieved when all nodes’ lower bounds (from their master problems) match their upper bounds (from feasible solutions).

How is this faster?

The critical advantage of nested Benders for MIPs is that it decomposes the problem in a way that dramatically reduces the branching complexity. Solving the full problem with a standard MIP solver requires exploring the combinatorial space of all binary variables across all time periods and scenarios simultaneously. With \(d_u\) binary variables per node and \(\vert \mathcal{N} \vert\) nodes in the scenario tree (a count that itself grows exponentially with the number of time periods), you’re looking at a branch-and-bound tree with \(O((2^{d_u})^{\vert \mathcal{N} \vert})\) nodes in the worst case.

With Nested Benders, for each node \(n\), the master problem only involves the local binary variables \(u_n \in \{0,1\}^{d_u}\) for that specific time-scenario pair. The solver explores \(2^{d_u}\) combinations locally, but critically, it doesn’t need to simultaneously explore combinations from other nodes. The coupling between nodes is handled through the cuts, not through explicit enumeration.

To solve child node \(n'\)’s subproblem, we condition on the parent’s state decision \(\bar{z}_{n'}\). Given that state, the child optimizes over its own binary decisions independently. The child’s binary variables are completely decoupled from the parent’s binary variables in the branch-and-bound tree – they only interact through the continuous state variables.

This means the effective branching is \(O(2^{d_u})\) per node rather than \(O((2^{d_u})^{\vert \mathcal{N} \vert})\) for the full tree. We’ve transformed an exponential problem in tree size to a linear problem in tree size with exponential work per node – and when \(d_u\) is small (as it often is for real applications), this is tractable.
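To put numbers on the difference, here is a small illustrative count. The sizes are hypothetical, and the per-node figure is for a single sweep over the tree; it ignores how many Benders iterations are needed.

```python
def joint_binary_assignments(d_u, num_nodes):
    # Worst case for a monolithic MIP: every node's binaries enumerated jointly.
    return (2 ** d_u) ** num_nodes


def per_node_binary_assignments(d_u, num_nodes):
    # Decomposed view: each node enumerates only its own binaries (per sweep).
    return num_nodes * 2 ** d_u


d_u, num_nodes = 2, 21   # hypothetical: 2 binaries per node, 4-way tree of depth 2
print(joint_binary_assignments(d_u, num_nodes))     # 4398046511104
print(per_node_binary_assignments(d_u, num_nodes))  # 84
```

Even on a 21-node tree the joint count is in the trillions, while the decomposed count stays at a few dozen local solves per sweep.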

Additionally, the cuts provide a form of learning across iterations. When we add an optimality cut from a child, we’re encoding information about all possible future scenarios reachable from that child’s subtree. A single cut can eliminate vast regions of the parent’s decision space that would lead to poor outcomes downstream. This is far more efficient than explicitly enumerating those scenarios. In some ways, this message passing of cuts mirrors the backpropagation of gradients in deep learning, slowly hill climbing towards optimality.

The convex action space (allocation decisions are continuous) is crucial here. It means that conditional on the binary decisions \(u_n\), each node’s problem is a convex LP or QP which modern solvers can solve in milliseconds. The scaling difficulties come from the combinatorial search over \(u_n\), but with small \(d_u\) and good cuts guiding the search, this remains feasible.

Parallelizing for Scale

Raw Benders decomposition gives us tractability, but real-time performance requires parallelization. The scenario tree structure provides natural parallelism at multiple levels:

Sibling independence: Children of the same parent can solve their subproblems in parallel – they depend on the same parent state but are otherwise independent.
Subtree independence: During backward passes (solving from leaves toward root), entire subtrees rooted at different children can process in parallel until they need to send cuts to their common ancestor.
Iteration-level parallelism: In some algorithmic variants, we can solve all nodes at a given tree level simultaneously using the current cuts, then propagate results up or down.
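Sibling independence is the easiest of these to sketch. In the snippet below, `solve_child` is a stand-in with a closed-form answer (a shortfall cost and its slope), not a real subproblem solve, and the thread pool is just one possible choice of executor.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_child(task):
    """Stand-in for a child subproblem: returns (value, slope) for a cut."""
    state, demand, unit_cost = task
    shortfall = max(0.0, demand - state)
    value = unit_cost * shortfall
    slope = -unit_cost if shortfall > 0 else 0.0
    return value, slope


def solve_siblings(parent_state, children, max_workers=4):
    # All siblings see the same parent state and are otherwise independent,
    # so their solves can be dispatched concurrently; map() keeps child order.
    tasks = [(parent_state, demand, unit_cost) for demand, unit_cost in children]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_child, tasks))


cuts = solve_siblings(3.0, [(5.0, 2.0), (2.0, 2.0)])
print(cuts)  # [(4.0, -2.0), (0.0, 0.0)]
```

Whether threads give a real speedup depends on the solver releasing the GIL during its native solve; for pure-Python work, a `ProcessPoolExecutor` is the usual substitute.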

For our wealth management problem with \(\sim\)50,000 nodes in the scenario tree, this parallelism is essential. A sequential solve taking 10ms per node with 10 message passes per edge would require roughly 5,000 seconds (~1.4 hours), which is unacceptable for an interactive application. But with 256-way parallelism, we’re down to ~20s of wall-clock time, which is usable.
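The estimate above is simple enough to reproduce. The sketch below assumes perfectly balanced work and zero synchronization cost, which a real solve will not achieve, so treat the parallel figure as a lower bound; the function names are hypothetical.

```python
def sequential_seconds(num_nodes: int, solve_ms: float, passes: int) -> float:
    """Total solve time if every node solve runs one after another."""
    return num_nodes * passes * solve_ms / 1000.0

def ideal_parallel_seconds(seq_seconds: float, workers: int) -> float:
    """Idealized wall-clock time: perfect load balance, no sync overhead."""
    return seq_seconds / workers

seq = sequential_seconds(num_nodes=50_000, solve_ms=10, passes=10)
par = ideal_parallel_seconds(seq, workers=256)

print(seq)                     # 5000.0 seconds
print(round(seq / 3600, 2))    # 1.39 hours
print(par)                     # 19.53125 seconds
```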

The cuts also reduce iteration counts dramatically. In early iterations, subproblems might return wildly suboptimal solutions because the parent’s cuts are loose. But as cuts accumulate, the master problems produce better candidate solutions, child subproblems have less work to do (they’re already near-optimal), and convergence accelerates. In practice, we’d often see convergence in 5-10 iterations for user updates to an existing solution, compared to hundreds of iterations starting cold.

Warm-start caching is another huge win. When a user updates their income or adjusts a goal, most of the scenario tree structure is unchanged. We keep the accumulated cuts from the previous solve and only invalidate nodes whose parameters actually changed. The solver can often reuse solutions from 90%+ of the tree, solving only the affected subtrees from scratch.

Message Passing Strategies

The tree structure opens up natural parallelization opportunities. Different subtrees can be solved independently (at least partially), and there are multiple algorithmic variants for how to orchestrate the message passing:

Backwards-Always: Start from the leaf nodes and work backwards up the tree. Each node waits for all children to send cuts before solving. Very stable, but sequential – limited parallelism across tree levels.
Forwards-Always: Start from the root, solve, send parameters down to children, then solve children in parallel. Faster initially but can thrash if parent decisions keep changing.
Backwards-Forwards: Alternate between backward and forward passes. Combine the stability of backward passes with the speed of forward passes once you’re close to convergence.
Custom Tweaks: Monitor convergence directions and adaptively sync directions with adjacent subtrees (e.g. optimistically go backwards when an adjacent subtree is going backwards since it blocks on parents, otherwise go forwards). In parallelism-constrained settings, there’s an additional knapsack-style problem of deciding which subtree problems to allocate across solvers. Solve time is predominantly correlated with problem size, so you can employ heuristics to estimate which subtrees are likely to block others during backward passes and prioritize accordingly.
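As a rough illustration of the Backwards-Always scheduling, here is a minimal sketch that processes the tree level by level from the leaves up, solving all nodes at the same depth in parallel. It is not the production implementation: `solve_node` is a hypothetical stand-in for a real subproblem solve that would return cuts, the toy tree and costs are made up, and synchronizing on whole levels is stricter than strictly necessary (a node only needs its own children to finish). The threads are there to show the scheduling, not to claim any speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def levels_by_depth(children, root):
    """Group node ids by depth using a breadth-first sweep from the root."""
    levels = [[root]]
    while True:
        nxt = [c for n in levels[-1] for c in children.get(n, [])]
        if not nxt:
            return levels
        levels.append(nxt)

def backward_pass(children, root, solve_node, max_workers=4):
    """Solve deepest level first; nodes within a level run in parallel.

    solve_node(node, child_results) is a hypothetical placeholder for a
    real subproblem solve that would send cuts to the parent.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in reversed(levels_by_depth(children, root)):
            def run(node):
                # Children live one level deeper, so they are already solved.
                kids = [results[c] for c in children.get(node, [])]
                return solve_node(node, kids)
            level_results = list(pool.map(run, level))
            results.update(zip(level, level_results))
    return results

# Toy tree and costs, made up for illustration only.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
local_cost = {"root": 1.0, "a": 2.0, "b": 3.0, "a1": 4.0, "a2": 5.0, "b1": 6.0}

def toy_solve(node, child_results):
    """Stand-in solve: local cost plus whatever the children reported."""
    return local_cost[node] + sum(child_results)

results = backward_pass(tree, "root", toy_solve)
print(results["root"])  # 21.0
```

A Forwards-Always pass would iterate the same levels in the opposite order, passing parent results down instead of child results up.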

Conclusion

The most satisfying part was seeing all of this actually work and drive 10x+ speedups (dependent on problem size). For an internship, this was the most fun I’d ever had, and definitely influenced my decision to move into quant post-grad – these types of high impact, fun academic problems are my favorite.

Special Relativity Speedrun

2021-04-06T00:00:00+00:00

In this post, I will attempt to derive and explain some consequences of special relativity as fast as possible, starting only with the invariance of physics in inertial frames and the constancy of the speed of light. I will assume basic familiarity with Lagrangian mechanics and the Einstein summation convention.

Basics

Consider a particle moving at the speed of light along the \(x\)-axis, such that \(x = \pm ct\) and the other coordinates are constant. Note that \(c^2t^2 -x^2=0\). Since this must be invariant under all changes of reference frame \(x\to x'\), it follows that \(c^2t'^2 - x'^2 = 0\) as well. Using natural units (\(c = 1\)), more generally this implies that any change of inertial reference frame \((t,x, y, z) \to (t', x', y', z')\) must preserve the equality

\[t^2 - x^2 - y^2 - z^2 = t'^2 - x'^2 -y'^2 - z'^2\]

which as an invariant (squared) norm gives what’s known as the Minkowski metric. Any transformation respecting this metric is a symmetry of relativistic physics, and the group of all these symmetry preserving transformations is known as the Lorentz Group. From here on out, for a four vector \(x^\mu = (t,x, y, z)\) we take \(x^\mu x_\mu = t^2 - x^2-y^2-z^2\) to be its squared norm under this metric.

Symmetries

Now that we have an equation which enables us to identify if a transformation is a symmetry, we want to be able to classify them. First, note that for a 3D rotation acting on the \(x,y,z\) components, given by a matrix \(R\) satisfying \(RR^T = I\), we can easily see that

\[(Rx)^\mu (Rx)_\mu = t^2 - \vert\vert(Rx)_i\vert\vert_2^2 = t^2 - x^2 - y^2 - z^2 = x^\mu x_\mu\]

preserves the metric and is thus a symmetry as rotations preserve the Euclidean norm. By a similar process, you can try convincing yourself that time reversal and space inversion also qualify as symmetries. Next, we want to understand how coordinates should transform over shifts in velocity. To do this, consider coordinates \(x^+ = t+x\) and \(x^- = t-x\) (leaving the other components constant). Then by construction \((x^+)(x^-)\) is invariant. An obvious invariant preserving transformation is to then take \(x'^+ = \lambda x^+\) for some constant \(\lambda\) and \(x'^- = \frac{1}{\lambda}x^-\) such that \((x^+)(x^-) = (x'^+)(x'^-)\). Explicitly expanding out and solving for \(x' = \frac{\lambda}{2}x^+ - \frac{1}{2\lambda}x^-\) in terms of \((t,x)\) and similarly for \(t'\), we find that

\[\begin{align*} x' &= \frac{\lambda^2 +1}{2\lambda}x + \frac{\lambda^2-1}{2\lambda}t \\ t' &= \frac{\lambda^2-1}{2\lambda}x + \frac{\lambda^2+1}{2\lambda}t \end{align*}\]

In particular, choosing a frame with \(x'=0\) implies that \(x = \frac{1-\lambda^2}{1+\lambda^2}t\) and thus \(x = vt\) with \(v = \frac{1-\lambda^2}{1+\lambda^2}\). Using this definition of \(v\) and plugging in to solve for \((t', x')\), it becomes apparent that

\[\begin{align*} t' &= \frac{t-vx}{\sqrt{1-v^2}} = \gamma(t - vx) \\ x' &= \frac{x -vt}{\sqrt{1-v^2}} = \gamma(x-vt) \\ y' &= y \\ z' &= z \end{align*}\]

where we adopt the standard notation for the Lorentz Factor \(\gamma = \frac{1}{\sqrt{1-v^2}}\). The derived equation is often called the Lorentz Boost along the \(x\)-axis and intuitively corresponds to shifting to another reference frame moving at constant velocity \(v\) relative to the original. To boost along an arbitrary axis \(w\), one simply rotates \(w\) to the \(x\)-axis, applies the standard Lorentz Boost, and rotates the frame back. Although I won’t show it here, every element of the Lorentz Group can be broken down into some composition of rotations, boosts, time reversal, and spatial inversion. These completely describe the symmetry transformations of the Minkowski space in which special relativity resides.
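If you’d like to sanity check the algebra, here is a small numerical sketch in natural units. It compares the \(\lambda\)-rescaling form with the \(\gamma\) form of the boost and confirms that both preserve \(t^2 - x^2\). The function names and the sample values (\(\lambda = 0.5\), the event \((t, x) = (2, 1)\)) are arbitrary choices for illustration.

```python
import math

def boost(t, x, v):
    """Lorentz boost along x in natural units (c = 1), valid for |v| < 1."""
    gamma = 1.0 / math.sqrt(1.0 - v * v)
    return gamma * (t - v * x), gamma * (x - v * t)

def boost_from_lambda(t, x, lam):
    """Same map, written as a rescaling of x+ = t + x and x- = t - x."""
    x_plus = lam * (t + x)
    x_minus = (t - x) / lam
    return (x_plus + x_minus) / 2.0, (x_plus - x_minus) / 2.0

def interval(t, x):
    """Squared Minkowski norm restricted to the (t, x) plane."""
    return t * t - x * x

lam = 0.5
v = (1 - lam**2) / (1 + lam**2)   # 0.6 for lam = 0.5

t, x = 2.0, 1.0
t1, x1 = boost(t, x, v)
t2, x2 = boost_from_lambda(t, x, lam)

print(math.isclose(t1, t2), math.isclose(x1, x2))          # True True
print(math.isclose(interval(t1, x1), interval(t, x)))      # True
```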

Paradoxes

For some fun, I’ll show a couple of principles underlying some classic “paradoxes” you may have seen before. First, we illustrate the effects of Lorentz Contraction. Consider a pole of length \(L\) sitting still at the origin along the \(x\)-axis in the \((t, x)\) frame. Now consider a frame \((t', x')\) constructed by applying a boost of velocity \(v\). At \(t'=0\), by the Lorentz Boost formulas we know that \(t = vx\), and so at the tip of the pole where \(x=L\) we find that \(t=vL\). Now, applying the invariance of norm, we see that

\[\begin{align*} x'^2 - t'^2 &= x^2 - t^2 \\ x'^2 &= L^2 - v^2L^2 \\ x' &= L\sqrt{1 - v^2} = \frac{L}{\gamma} \end{align*}\]

and thus the length of the pole as measured in the moving \((t', x')\) frame is \(\frac{L}{\gamma}\), shorter than its rest length \(L\).

Next, to illustrate time dilation we consider the same setting, but now note what happens at \(x'= 0\). Plugging into our equations from the Lorentz Boost, we see that this directly implies \(x = vt\), and thus applying invariance of the norm again:

\[\begin{align*} t'^2 - x'^2 &= t^2 - x^2 \\ t'^2 &= t^2 - v^2t^2 \\ t' &= t\sqrt{1-v^2} = \frac{t}{\gamma} \end{align*}\]

which shows that clocks in this moving frame actually appear to be running slower by a factor of \(\frac{1}{\gamma}\) compared to the stationary frame. Together, these two concepts can be applied to show that simultaneity is not a universal concept, and that simultaneous events in one frame can appear time-separated in another. I highly encourage the ambitious reader to try deriving this themselves using these tools.
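Both results can be checked numerically by plugging the two special events directly into the boost formulas. This is only a sketch with arbitrarily chosen numbers (\(v = 0.6\), \(L = 10\), \(t = 5\)) in natural units; `boost` is a helper defined here, not something from a library.

```python
import math

def boost(t, x, v):
    """Lorentz boost along x in natural units (c = 1), valid for |v| < 1."""
    gamma = 1.0 / math.sqrt(1.0 - v * v)
    return gamma * (t - v * x), gamma * (x - v * t)

v = 0.6
gamma = 1.0 / math.sqrt(1.0 - v * v)

# Length contraction: the tip of the pole (x = L) at the event where t' = 0,
# which the text identifies as t = v * L.
L = 10.0
t_prime, x_prime = boost(v * L, L, v)
print(math.isclose(t_prime, 0.0, abs_tol=1e-12))   # True
print(math.isclose(x_prime, L / gamma))            # True

# Time dilation: an event on the line x' = 0, i.e. x = v * t.
t = 5.0
t_prime2, x_prime2 = boost(t, v * t, v)
print(math.isclose(x_prime2, 0.0, abs_tol=1e-12))  # True
print(math.isclose(t_prime2, t / gamma))           # True
```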

Proper Time and Action

Proper time is an incredibly helpful concept and physically describes the amount of time an object experiences in its own frame. For example, suppose that someone travels along a path from point \(a\) to point \(b\), with their own reference frame labeled \((\tau, x')\). At each instant, we can apply our good friend invariance of the norm to see that \(d\tau^2 - dx'^idx'^i = dt^2 - dx^idx^i\) for the traveling and stationary observer frames. However, in the traveler’s own frame of reference we always have \(dx'^i = 0\), as their own position defines the origin of their frame. Thus, it follows that

\[d\tau = \sqrt{dt^2 - dx^i dx^i}\]

To compute the total proper time over their journey, we can integrate, yielding

\[\begin{align*} \tau_{a,b} &= \int_a^b d\tau \\ &= \int_a^b \sqrt{dt^2 - dx^idx^i} \\ &= \int_a^b dt\sqrt{1 - \frac{dx^i}{dt}\frac{dx^i}{dt}} \\ &= \int_a^b \sqrt{1-v^2}dt \end{align*}\]

Note how our derivation never depended on which inertial frame we measured \(t\) and \(x^i\) in, and thus the proper time along a given path is a Lorentz invariant: every inertial observer computes the same value. (It does still depend on the path itself, which is exactly what makes it useful here.) From Lagrangian mechanics, you may recall the Principle of Least Action, in which a quantity that is stationary under path variations is used to derive equations of motion. By inspection, proper time seems like a natural invariant to build such an action from:

\[\begin{align*} S &= -m \tau_{a,b} \\ &= -m \int_a^b \sqrt{1-v^2}dt \end{align*}\]

where we judiciously introduce a factor of \(-m\); rescaling the action by a constant affects neither its invariance nor which paths make it stationary.
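As a quick numerical sketch of the proper time integral, the snippet below approximates a worldline by straight segments between events, sums \(\sqrt{\Delta t^2 - \Delta x^2}\) over the segments, and checks that the total is the same when the events are first boosted into another inertial frame. The worldline, the boost velocity, and the helper names are all made up for illustration.

```python
import math

def boost(t, x, v):
    """Lorentz boost along x in natural units (c = 1), valid for |v| < 1."""
    gamma = 1.0 / math.sqrt(1.0 - v * v)
    return gamma * (t - v * x), gamma * (x - v * t)

def proper_time(events):
    """Sum sqrt(dt^2 - dx^2) over straight segments between (t, x) events."""
    total = 0.0
    for (t0, x0), (t1, x1) in zip(events, events[1:]):
        total += math.sqrt((t1 - t0) ** 2 - (x1 - x0) ** 2)
    return total

# Piecewise-linear worldline: speed 0.6 outbound, then 0.6 back (made-up values).
path = [(0.0, 0.0), (5.0, 3.0), (10.0, 0.0)]
tau = proper_time(path)

# Same events as seen from a frame moving at v = 0.3.
boosted_path = [boost(t, x, 0.3) for t, x in path]
tau_boosted = proper_time(boosted_path)

print(tau)                              # 8.0
print(math.isclose(tau, tau_boosted))   # True
```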

Energy

Now that we have an action, we’re almost there. Recalling the definition of the action in terms of the Lagrangian, \(S = \int \mathcal{L}\,dt\), we can immediately identify \(\mathcal{L} = -m \sqrt{1 - v^2}\). As this may still look unfamiliar, we can reintroduce units using the fact that \(\mathcal{L}\) has units of energy, to see that

\[\mathcal{L} = -mc^2\sqrt{1 - \frac{v^2}{c^2}}\]

Taking a first order Taylor expansion in \(\frac{v^2}{c^2}\) and assuming that \(\frac{v}{c} \ll 1\) in the classical mechanics regime, we see that

\[\mathcal{L} \approx -mc^2(1 - \frac{v^2}{2c^2}) = -mc^2 + \frac{mv^2}{2}\]

Now, taking the Legendre Transform to derive the Hamiltonian we see that

\[\mathcal{H} = mc^2 + \frac{mv^2}{2}\]
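For readers who want the Legendre transform step spelled out, here is a short sketch. The symbol \(p\) for the conjugate momentum is introduced here for illustration and does not appear elsewhere in the post. Starting from the approximate Lagrangian:

\[\begin{align*} p &= \frac{\partial \mathcal{L}}{\partial v} \approx mv \\ \mathcal{H} &= pv - \mathcal{L} \approx mv^2 - \left(-mc^2 + \frac{mv^2}{2}\right) = mc^2 + \frac{mv^2}{2} \end{align*}\]

Running the same computation on the exact Lagrangian \(\mathcal{L} = -mc^2\sqrt{1 - \frac{v^2}{c^2}}\), with \(\gamma = \frac{1}{\sqrt{1 - v^2/c^2}}\) once units are restored, gives

\[\begin{align*} p &= \frac{\partial \mathcal{L}}{\partial v} = \gamma m v \\ \mathcal{H} &= pv - \mathcal{L} = \gamma m v^2 + \frac{mc^2}{\gamma} = \gamma mc^2\left(\frac{v^2}{c^2} + 1 - \frac{v^2}{c^2}\right) = \gamma mc^2 \end{align*}\]

which reduces to the approximate expression when \(\frac{v}{c} \ll 1\).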

This should look familiar! The Hamiltonian \(\mathcal{H}\) represents the energy of a system, and what we’ve now derived consists of a classical kinetic energy term \(\frac{mv^2}{2}\) and a constant term \(mc^2\). Setting \(v = 0\), we see that \(E = mc^2\) is the rest energy of our system and is exactly the celebrated mass-energy equivalence formula Einstein derived over a century ago!

Remarks

I hope you were able to enjoy reading this and found it at least somewhat insightful! If you’re interested in these sorts of things, the book Spacetime Physics by Taylor and Wheeler is a classic and has plenty of fun paradox brainteasers to play around with.