Aravind Kannappan

Kolmogorov Complexity, Solomonoff Induction, and the Philosophical Limits of Aligned AGI

2026-05-01T00:00:00+00:00

Begin with the simplest possible question about intelligence: what does it mean to learn? Not to fit a curve, not to minimize a loss, but to genuinely induce the right explanation from evidence. This question has a precise mathematical answer, one that was worked out in the 1960s by Kolmogorov, Solomonoff, and Chaitin, and extended by Hutter in the 2000s into a formal theory of optimal rational agency. The answer is beautiful and the theory is complete. It is also, on close inspection, deeply troubling for the project of value alignment.

The argument runs as follows. The correct universal theory of learning is Solomonoff induction, which achieves the best possible predictions on any computable data source. The correct theory of optimal decision-making is AIXI, which achieves the best possible rewards in any computable environment. Both are provably optimal, in a formal sense. Neither is alignable in the general case, for reasons that follow from Gödel’s incompleteness theorem. This is not a practical difficulty about current systems; it is a mathematical theorem about the limits of what aligned general intelligence can be.

Kolmogorov Complexity: The Length of Understanding

The story starts with a question about strings. Given a finite binary string $x$, what is the shortest possible description of $x$? Fix a universal Turing machine $U$. The Kolmogorov complexity of $x$ is

\[K(x) = \min_{p \,:\, U(p) = x} \lvert p \rvert\]

the length of the shortest program that causes $U$ to output $x$ and halt. A string with $K(x) \approx \lvert x \rvert$ is incompressible: no description shorter than the string itself can generate it. Such strings are, in the precise technical sense, random: they have no structure that a program could exploit to produce them concisely. A string with $K(x) \ll \lvert x \rvert$ is structured: it has a short explanation.

Two properties of $K$ are immediate and important. First, it is machine-independent up to a constant: for any two universal Turing machines $U$ and $U'$, $\lvert K_U(x) - K_{U'}(x) \rvert \leq c_{U,U'}$ for a constant depending only on the machines. This means $K$ is not an artifact of a particular computational model; up to that additive constant, it is a property of $x$ itself.

Second, it is not computable. The proof is a diagonalization. Suppose a program $P$ computed $K(x)$ for all $x$. Consider the string $x_n$ defined as the first string of complexity greater than $n$. A program that calls $P$ and extracts $x_n$ has length $O(\log n)$, far shorter than $n$, contradicting $K(x_n) > n$. This is Berry’s paradox, “the smallest integer not definable in fewer than thirteen words”, made precise. Kolmogorov complexity is well-defined, well-behaved, and forever out of reach of any algorithm.
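Because $K$ is uncomputable, any practical estimate is only an upper bound. The sketch below uses an off-the-shelf compressor as a stand-in for "shortest description"; the choice of `zlib` and the two sample strings are illustrative assumptions, and the compressed size can overestimate $K$ by an arbitrary amount.

```python
import os
import zlib

def compressed_length_bits(data: bytes) -> int:
    """Crude upper-bound proxy for K(data): bit length of a zlib-compressed copy."""
    return 8 * len(zlib.compress(data, 9))

# A highly structured string: "01" repeated, which has an obvious short description.
structured = b"01" * 5000
# Bytes from the OS entropy source: with overwhelming probability incompressible.
random_like = os.urandom(10000)

structured_bits = compressed_length_bits(structured)
random_bits = compressed_length_bits(random_like)

print("raw length (bits):      ", 8 * len(structured))
print("structured, compressed: ", structured_bits)
print("random-like, compressed:", random_bits)
```

The structured string shrinks to a small fraction of its raw length, while the random-looking one does not shrink at all, which mirrors the distinction between $K(x) \ll \lvert x \rvert$ and $K(x) \approx \lvert x \rvert$.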

The chain rule makes $K$ compositional:

\[K(x, y) = K(x) + K(y \mid x) + O(\log K(x, y))\]

where $K(y \mid x) = \min_{p \,:\, U(p,x)=y} \lvert p \rvert$ is the conditional complexity. The mutual information $I(x : y) = K(x) + K(y) - K(x,y)$ measures shared algorithmic content. These are the algorithmic analogs of entropy chain rules, but they hold for individual strings rather than distributions.

The Universal Prior

Solomonoff’s insight was to turn Kolmogorov complexity into a probability distribution. The Solomonoff prior is

\[M(x) = \sum_{p \,:\, U_M(p) \text{ outputs prefix } x} 2^{-\lvert p \rvert}\]

the probability that a random program (each bit an independent fair coin flip) causes the universal monotone machine $U_M$ to output a string beginning with $x$. It is a semimeasure: $\sum_b M(xb) \leq M(x)$, with equality failing when some programs that output $x$ never produce a continuation. The failure is probability mass “lost” to non-halting programs, a direct manifestation of the halting problem.

The prior $M$ satisfies a universality property: for any computable probability measure $\mu$ over strings, there exists a constant $c_\mu = 2^{-K(\mu)}$ such that $M(x_{1:n}) \geq c_\mu \cdot \mu(x_{1:n})$ for all strings. The Solomonoff prior dominates every computable measure. It weights hypotheses by $2^{-K(h)}$, so simpler explanations get exponentially more prior probability, and no computable forecaster can persistently outpredict it.

Solomonoff induction uses $M$ as a Bayesian prior and predicts

\[P(x_{n+1} = 1 \mid x_{1:n}) = \frac{M(x_{1:n}1)}{M(x_{1:n}1) + M(x_{1:n}0)}\]

The convergence theorem is the core result: for any computable data-generating process $\mu$,

\[\sum_{n=1}^\infty \mathbb{E}_\mu\!\left[\!\left(P(x_{n+1}=1 \mid x_{1:n}) - \mu(x_{n+1}=1 \mid x_{1:n})\right)^2\right] \leq \ln(1/c_\mu) < \infty\]

The total expected squared prediction error is bounded by $K(\mu)\ln 2$, a finite constant that does not grow with $n$. Solomonoff induction therefore converges to the predictions of the true process, with only a bounded cumulative error, for every computable true process simultaneously. This is the optimal learner: it cannot be consistently outperformed by any computable forecast, on any computable data source.
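The predictor itself is uncomputable, but its shape can be illustrated with a drastically restricted hypothesis class. The sketch below is a toy under stated assumptions: the "programs" are just binary patterns repeated forever, each pattern $p$ gets weight $2^{-2\lvert p \rvert}$ (so the weights sum to less than one), and `max_len` is an arbitrary cutoff. None of this is Solomonoff's actual construction; it only mimics the "sum the weights of consistent short programs" structure.

```python
from itertools import product

def consistent(pattern, x):
    """True if x is a prefix of the pattern repeated forever."""
    return all(x[i] == pattern[i % len(pattern)] for i in range(len(x)))

def mixture_mass(x, max_len=8):
    """Toy analogue of M(x): total prior weight of repeating patterns consistent with x."""
    total = 0.0
    for k in range(1, max_len + 1):
        for pattern in product("01", repeat=k):
            if consistent(pattern, x):
                total += 2.0 ** (-2 * k)  # shorter patterns get exponentially more weight
    return total

def predict_one(x, max_len=8):
    """Toy analogue of P(next bit = 1 | x); requires len(x) < max_len."""
    m1 = mixture_mass(x + "1", max_len)
    m0 = mixture_mass(x + "0", max_len)
    return m1 / (m0 + m1)

print(predict_one("010101"))   # close to 0: the short pattern "01" dominates
print(predict_one("0101010"))  # close to 1
```

Longer patterns that would predict the opposite bit remain in the mixture, but their weight is exponentially smaller, so the prediction is driven by the shortest consistent explanation.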

AIXI: Optimal Agency

Hutter’s AIXI extends Solomonoff induction from passive prediction to active decision-making. The agent receives observation-reward pairs $(o_t, r_t)$ and takes actions $a_t$ at each step. It aims to maximize discounted future reward.

AIXI acts according to:

\[a_t = \arg\max_{a_t} \sum_{o_t r_t} \max_{a_{t+1}} \cdots \sum_{o_{t+m} r_{t+m}} \left[\sum_{k=t}^{t+m} r_k\right] \sum_{\rho \,:\, U(\rho,\, a_{1:t+m}) = o_1 r_1 \cdots o_{t+m} r_{t+m}} 2^{-\lvert \rho \rvert}\]

This expression is, in effect, expected future reward under the Solomonoff mixture over all computable environments, taking the action that maximizes it at each step. The Solomonoff prior weights environments by their complexity: simpler environments (shorter programs) get more prior weight, and AIXI integrates over all of them.

AIXI is Pareto-optimal: no other agent earns at least as much expected reward as AIXI in every computable environment and strictly more in at least one. This is a formal sense in which AIXI cannot be uniformly improved upon. The formal value function is:

\[V^{\text{AIXI}}_m(h) = \max_a \sum_{e} \!\left[r + V^{\text{AIXI}}_{m-1}(hae)\right] \xi(e \mid ha)\]

where $h$ is the history, $e = or$ is the next observation-reward pair with reward component $r$, and $\xi(e \mid ha)$ is the probability the Solomonoff mixture assigns to $e$ after history $h$ and action $a$. The recursion is clean. The agent is, in a rigorous mathematical sense, optimal.

It is also, in every practical sense, unimplementable. AIXI requires summing over all programs, which in turn requires solving the halting problem. The approximation AIXItl restricts attention to programs of length at most $l$ and computation time at most $t$ per step. It is computable, but its running time per step is of order $t \cdot 2^l$, exponential in the length bound.
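The value recursion for $V_m$ can be run exactly once the Solomonoff mixture is swapped for a finite, hand-picked list of environments. The sketch below does that for two hypothetical one-step "bandit" environments that ignore history; the environment class, the prior weights, and the helper names are all assumptions for illustration, not part of AIXI or AIXItl.

```python
def mixture_value(weights, envs, actions, percepts, horizon):
    """Expectimax value V_m under a finite Bayesian mixture of environments.

    weights:  posterior weights over envs (summing to 1) given the history so far
    envs:     functions env(action, percept) -> probability of that percept
    percepts: list of (observation, reward) pairs
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for e in percepts:
            joint = [w * env(a, e) for w, env in zip(weights, envs)]
            prob = sum(joint)  # mixture probability of percept e after action a
            if prob == 0.0:
                continue
            posterior = [j / prob for j in joint]  # Bayes update on the percept
            reward = e[1]
            total += prob * (reward + mixture_value(posterior, envs, actions, percepts, horizon - 1))
        best = max(best, total)
    return best

def make_bandit(good_action):
    """Hypothetical environment: reward 1 for good_action, reward 0 otherwise."""
    def env(action, percept):
        _, reward = percept
        expected = 1.0 if action == good_action else 0.0
        return 1.0 if reward == expected else 0.0
    return env

envs = [make_bandit(0), make_bandit(1)]
actions = [0, 1]
percepts = [(0, 0.0), (0, 1.0)]

v1 = mixture_value([0.5, 0.5], envs, actions, percepts, 1)
v2 = mixture_value([0.5, 0.5], envs, actions, percepts, 2)
print(v1, v2)
```

With one step the agent can only guess, so its value is 0.5. With two steps the first action reveals which environment it is in, so the second step is worth a full unit of reward. The cost of the recursion grows exponentially with the horizon even for this two-environment mixture.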

The Reward Identification Problem

Here is where alignment enters, in its sharpest form. AIXI maximizes the reward signal $r_t \in [0,1]$ it receives from the environment. But AIXI does not know what the reward signal represents. It is a number to be maximized, not a proxy for human values. The question is: can an agent learn the intended reward function from behavioral data?

The answer is no in general, and the proof is information-theoretic. Consider two reward functions $R_1$ and $R_2$ that agree on all state-action pairs the agent has visited:

\[R_1(s,a) = R_2(s,a) \quad \forall (s,a) \in \{(s_t, a_t)\}_{t \leq n}\]

No data distinguishes them. Any learner, whether Bayesian, frequentist, or algorithmic, that has only observed the trajectory $\{(s_t, a_t, r_t)\}_{t \leq n}$ sees the same likelihood under $R_1$ and $R_2$, so their posterior odds equal their prior odds. If $R_1$ and $R_2$ differ on some unvisited state $s^*$, the agent’s behavior at $s^*$ will follow whichever $R$ the prior prefers, and the Solomonoff prior prefers the simpler reward function, which need not be the intended one.
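A small sketch makes the point concrete. It assumes deterministic rewards, two hypothetical candidate functions `r1` and `r2` that differ only at an unvisited state, and an arbitrary prior; all of these are illustrative choices.

```python
def posterior(prior, candidates, trajectory):
    """Bayesian posterior over candidate reward functions, assuming deterministic rewards."""
    weights = {}
    for name, reward_fn in candidates.items():
        fits = all(reward_fn(s, a) == r for s, a, r in trajectory)
        weights[name] = prior[name] if fits else 0.0
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def r1(s, a):
    # Hypothetical "intended" reward: action must match the parity of the state.
    return 1.0 if a == s % 2 else 0.0

def r2(s, a):
    # Agrees with r1 everywhere except at state 9, which is never visited below.
    if s == 9:
        return 1.0 - r1(s, a)
    return r1(s, a)

# The agent visits states 0..3 and tries both actions in each.
trajectory = [(s, a, r1(s, a)) for s in range(4) for a in (0, 1)]
prior = {"R1": 0.7, "R2": 0.3}
post = posterior(prior, {"R1": r1, "R2": r2}, trajectory)
print(post)
```

The posterior equals the prior: the data carry no information about the disagreement at state 9, so whatever the agent does there is decided by the prior alone.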

We can state this precisely. The reward identification problem is to recover $R^*$ from trajectory data. Call $R^*$ learnable by agent $\pi$ if the agent’s estimate converges to $R^*$ on all reachable states. The diagonalization argument shows that no computable agent can learn every computable reward function: define $R_\pi$ to give reward $0$ whenever $\pi$ takes the action it currently estimates as optimal, and reward $1$ otherwise. $R_\pi$ is computable (since $\pi$ is), so $\pi$ should eventually learn it. But learning $R_\pi$ requires always taking the non-optimal action, which changes what $R_\pi$ rewards, and the loop never converges. The set of learnable reward functions for any computable agent has measure zero in the space of all computable functions.

The Löbian Obstacle

Even if we assume the reward function is given correctly, a second problem arises: can the agent trust its own reasoning about whether it is doing the right thing?

Gödel’s second incompleteness theorem establishes that no consistent formal system $T$ of sufficient strength can prove its own consistency: $T \nvdash \text{Con}(T)$. Löb’s theorem deepens this: for any formula $\phi$, if $T \vdash \Box_T\phi \to \phi$ (where $\Box_T\phi$ means “$T$ proves $\phi$”), then $T \vdash \phi$. Contrapositively: $T$ cannot prove $\Box_T\phi \to \phi$ for any $\phi$ that is not already provable.
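As a worked step (a standard derivation, included here only to connect the two statements), Gödel’s second theorem is the special case of Löb’s theorem with $\phi = \bot$:

\[T \vdash \Box_T\bot \to \bot \quad\Longrightarrow\quad T \vdash \bot\]

The formula $\Box_T\bot \to \bot$ is equivalent to $\neg\Box_T\bot$, which is exactly $\text{Con}(T)$. So if $T \vdash \text{Con}(T)$, then $T \vdash \bot$, meaning $T$ is inconsistent. Read contrapositively, a consistent $T$ cannot prove $\text{Con}(T)$.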

Applied to a proof-based AI agent: suppose the agent takes action $a$ only when it can prove $V(a) \geq V(a')$ for all alternatives. By Löb’s theorem, the agent can prove that its proven conclusions are correct only if those conclusions are already provable without the self-trust assumption; the self-referential justification is circular. An agent that needs to verify its own reasoning in order to act cannot do so without already having what it’s trying to verify.

The practical consequence is that any sufficiently powerful agent modeled as a proof system cannot take actions whose justification requires trusting its own reliability as a reasoner. It can act on external evidence and pre-committed priors, but it cannot act on “I have proven this is right” without incurring a logical contradiction or running in a circle. This is not a matter of needing more compute or a better architecture. It is a structural property of formal systems strong enough to reason about themselves.

Levin’s Kt Complexity and the Tractability Gap

Return to the question of value learning. Kolmogorov complexity $K$ measures description length. Levin’s Kt complexity additionally penalizes computation time:

\[Kt(x) = \min_{p \,:\, U(p)=x} \!\left[\lvert p \rvert + \log \text{time}(p)\right]\]

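A description that is short but slow gets a higher Kt than one that is short and fast. Levin’s universal search interleaves all programs, giving each program $p$ a share of the running time proportional to $2^{-\lvert p \rvert}$, and finds a solution in time $O(2^{\lvert p^* \rvert} \cdot t(p^*))$, where $p^*$ is the solving program that minimizes this product and $t(p^*)$ is its running time. With the logarithm in $Kt$ taken base 2, that bound is $O(2^{Kt(x)})$: optimal up to a multiplicative constant, but a constant that is exponential in description length.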

For alignment, the distinction between $K$ and $Kt$ is crucial. Human values, as a description, may have low Kolmogorov complexity: something like “what a fully informed, reflectively coherent human would prefer” is a short specification. But evaluating this description on any particular action requires simulating full human deliberation, which involves arbitrary chains of counterfactual reasoning, emotional inference, long-term consequence estimation, and social judgment. The Kt complexity of human values is enormous.

A Solomonoff-based learner that minimizes $K$ will converge toward something close to the correct specification of human values. A resource-bounded learner that minimizes $Kt$ will converge toward the most tractable proxy, the function that is both short to describe and fast to evaluate. These are not the same function. The learner that is optimal under computational constraints will systematically prefer tractable proxies over the genuine objective, not because it is malicious but because the problem it is solving is Kt minimization, and the gap between $K(V_H)$ and $Kt(V_H)$ is the gap between the specified values and the computable approximation.
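The divergence between the two selection rules is simple arithmetic once candidate descriptions are scored. In the sketch below the candidate names, bit lengths, and step counts are entirely hypothetical numbers chosen to illustrate the argument; they are not estimates of anything real.

```python
import math

# Hypothetical candidates: (name, description length in bits, evaluation steps).
candidates = [
    ("idealized_deliberation", 200, 2 ** 10000),  # short to state, astronomically slow
    ("learned_proxy", 5000, 10 ** 6),             # longer to state, fast to evaluate
]

def k_cost(length_bits, steps):
    """K-style cost: description length only."""
    return length_bits

def kt_cost(length_bits, steps):
    """Kt-style cost: description length plus log2 of running time."""
    return length_bits + math.log2(steps)

k_choice = min(candidates, key=lambda c: k_cost(c[1], c[2]))[0]
kt_choice = min(candidates, key=lambda c: kt_cost(c[1], c[2]))[0]

print("minimizing K picks: ", k_choice)
print("minimizing Kt picks:", kt_choice)
```

Under these assumed numbers the $K$-minimizer selects the short but slow specification, while the $Kt$-minimizer selects the proxy, because the logarithm of the running time outweighs the extra description length.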

This is the algorithmic information-theoretic account of inner alignment failure: the discrepancy between the training objective (minimize some computable loss) and the intended behavior (evaluate genuine human preferences) is exactly the gap between $K$ and $Kt$ complexity of the value function.

What Follows

These results, the undecidability of value learning, the Löbian obstacle to self-verification, and the Kt-complexity tractability gap, are not arguments that aligned AI is impossible. They are arguments about what kind of problem it is.

An aligned AI system cannot, in general, learn the correct value function from behavioral data alone. It requires additional structure: a value specification that is grounded externally (in human oversight) rather than derived internally (from reward signal optimization). An aligned AI system cannot, in general, verify its own alignment: it requires external certification from a formal system stronger than itself. And an aligned AI system cannot, in general, evaluate the true human value function: it requires approximations whose fidelity must be monitored and corrected over time.

None of these conclusions surprise practitioners of AI safety. They are familiar intuitions, grounded in empirical observations about reward hacking, goal misgeneralization, and the difficulty of scalable oversight. What algorithmic information theory adds is precision: these are not worries about current techniques but theorems about the structure of any computable aligned agent. The mathematics does not tell us how to solve the problem. But it tells us, with the clarity of formal proof, exactly what the problem is. That is the first step toward solving it.

The Mathematics of Adversarial Robustness: Certified Defenses, Lipschitz Geometry, and the Limits of Perturbation Sets

2026-04-01T00:00:00+00:00

In 2013 Szegedy et al. showed that a deep convolutional classifier trained on ImageNet could be fooled by adding imperceptibly small perturbations to an input image. The perturbations were invisible to human eyes, no larger than the noise in a compressed JPEG, but they caused confident, catastrophically wrong predictions. The model saw a school bus and called it an ostrich. A decade later, after thousands of papers on attacks and defenses, the phenomenon is still not fully understood. State-of-the-art models remain vulnerable in ways that defy intuitive explanation.

The reason adversarial examples are hard to explain is that they are not, primarily, an engineering failure. They are a mathematical consequence of high-dimensional geometry, and understanding them requires moving from empirical observations about specific models to theorems about the geometry of decision boundaries and the measure-theoretic structure of high-dimensional probability spaces.

What We Are Trying to Prove

The goal of a certified defense is to produce a classifier $f$ together with a guarantee: for every test point $x$ in some distribution, and every perturbation $\delta$ with $\lVert\delta\rVert \leq \epsilon$, the classifier makes the same prediction on $x + \delta$ as on $x$. The guarantee is not probabilistic; it is a proof that holds for all perturbations in the threat model simultaneously.

Define the robust accuracy at radius $\epsilon$ as:

\[\text{RA}(\epsilon) = P_{(x,y) \sim \mathcal{D}}\!\left(f(x + \delta) = y \;\text{ for all } \delta \text{ with } \lVert\delta\rVert \leq \epsilon\right)\]

For most models trained with standard cross-entropy loss, $\text{RA}(\epsilon) \approx 0$ for $\ell_\infty$ perturbations with $\epsilon = 8/255 \approx 0.03$ on ImageNet. The robustness gap $\text{RA}(0) - \text{RA}(\epsilon)$ is large, and understanding its size, and whether it can be closed, requires understanding the geometry of the decision boundary.

Why Perturbations Are So Small: The Linear Hypothesis

The first explanation for adversarial vulnerability is simple and important. For a linear classifier $f(x) = \text{sign}(w \cdot x + b)$, the $\ell_\infty$ distance from $x$ to the decision boundary is

\[\epsilon^* = \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert_1}\]

In $d$ dimensions with $d = 10^6$ (the order of magnitude of a high-resolution image), the $\ell_1$ norm $\lVert w \rVert_1$ can be enormous even when $\lVert w \rVert_2$ is moderate, because $\lVert w \rVert_1$ can be as large as $\sqrt{d} \lVert w \rVert_2$, a bound attained when the weight is spread evenly across coordinates. Perturbing each coordinate by $\epsilon$ in the direction of $\text{sign}(w_i)$ changes the dot product by $\epsilon \lVert w \rVert_1$, which grows as $O(\epsilon \sqrt{d})$ for a unit-norm weight vector with evenly spread weights. In $d = 10^6$, a perturbation of magnitude $\epsilon = 0.03$ in every coordinate changes the dot product by roughly $0.03 \times 1000 = 30$, far larger than the margin $\lvert w \cdot x + b \rvert$ for most inputs.
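The boundary-distance formula and the sign-direction attack can be checked numerically. The sketch below uses a random unit-norm weight vector and a random input in $d = 10^4$ dimensions; the dimension, the Gaussian sampling, and the helper names are assumptions chosen to keep the example fast in pure Python.

```python
import math
import random

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linf_distance_to_boundary(w, b, x):
    """Smallest l_inf perturbation that reaches the boundary of sign(w.x + b)."""
    return abs(score(w, b, x)) / sum(abs(wi) for wi in w)

def worst_case_perturbation(w, b, x, eps):
    """Move every coordinate by eps in the direction that pushes the score toward zero."""
    direction = -1.0 if score(w, b, x) > 0 else 1.0
    return [xi + direction * eps * math.copysign(1.0, wi) for wi, xi in zip(w, x)]

rng = random.Random(0)
d = 10000
w = [rng.gauss(0.0, 1.0) for _ in range(d)]
l2 = math.sqrt(sum(wi * wi for wi in w))
w = [wi / l2 for wi in w]                      # unit l_2 norm
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
b = 0.0

eps_star = linf_distance_to_boundary(w, b, x)
x_adv = worst_case_perturbation(w, b, x, 1.01 * eps_star)

print("l_1 norm of w:", sum(abs(wi) for wi in w))
print("eps_star:", eps_star)
print("score before:", score(w, b, x), "after:", score(w, b, x_adv))
```

Even though the input coordinates have unit scale, the per-coordinate change needed to flip the prediction is tiny, because it is divided by an $\ell_1$ norm that grows like $\sqrt{d}$.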

The linear classifier is fragile not despite being a good classifier, but because it is a good classifier that uses all available dimensions. Good linear classification requires that many small features combine into a decisive prediction, and adversarial attacks exploit exactly those many small features. The same model properties that make a classifier accurate make it adversarially fragile, and this is not a coincidence. It is the linear hypothesis, and it applies to neural networks approximately, because deep networks in their bottom layers perform operations that are approximately linear in the input.

Randomized Smoothing: The Certified Defense

The most practically scalable certified defense is randomized smoothing. The idea is elegant: rather than trying to make a given classifier $f$ robust, build a new classifier $g$ by averaging $f$ over random noise:

\[g(x) = \arg\max_{c} \; P\!\left(f(x + \delta) = c\right), \quad \delta \sim \mathcal{N}(0, \sigma^2 I)\]

The smoothed classifier $g$ predicts whichever class $f$ outputs most often when the input is perturbed by Gaussian noise. This sounds like it would reduce accuracy, and it does, but the noise also creates a certificate.

The certificate comes from the following argument. Let $c_A$ be the most probable class under $g$ at input $x$, with probability $p_A = P(f(x+\delta) = c_A)$, and let $p_B = \max_{c \neq c_A} P(f(x+\delta) = c)$. Cohen et al. (2019) proved that $g$ is certifiably robust within $\ell_2$ radius

\[R = \frac{\sigma}{2}\!\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)\]

where $\Phi^{-1}$ is the inverse normal CDF. The proof uses the Neyman-Pearson lemma: among all sets of measure $p_A$ under $\mathcal{N}(x, \sigma^2 I)$, the one that is hardest to shift to the second class under a translation $\delta$ with $\lVert\delta\rVert_2 \leq R$ is a half-space, and the certificate is the radius at which this half-space’s probability falls below $p_B$.
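The radius formula is a one-liner given the two class probabilities. The sketch below assumes $p_A$ and $p_B$ are already known; in practice they are estimated by sampling, typically with a confidence lower bound on $p_A$ standing in for the exact value, and that estimation step is omitted here.

```python
from statistics import NormalDist

def certified_radius(sigma, p_a, p_b):
    """l_2 radius (sigma / 2) * (Phi^{-1}(p_a) - Phi^{-1}(p_b)).

    Both probabilities must lie strictly between 0 and 1; returns 0.0 when
    the top class does not beat the runner-up.
    """
    if p_a <= p_b:
        return 0.0
    inv_cdf = NormalDist().inv_cdf  # inverse standard normal CDF
    return 0.5 * sigma * (inv_cdf(p_a) - inv_cdf(p_b))

# Example with assumed probabilities: top class 90%, runner-up 10%, sigma = 0.5.
radius = certified_radius(0.5, 0.9, 0.1)
print(radius)  # about 0.64
```

Doubling `sigma` doubles the radius for fixed probabilities, but in practice a larger `sigma` also lowers $p_A$, which is the tradeoff the certificate has to balance.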

The table below shows the tradeoff between noise level $\sigma$, clean accuracy, and certified accuracy at radius $R = 0.5$ on CIFAR-10, from Cohen et al.’s original experiments:

$\sigma$	Clean accuracy	Certified acc. ($R = 0.5$)
0.12	77%	0%
0.25	74%	49%
0.50	67%	38%
1.00	57%	22%

The tradeoff is fundamental, not incidental: more noise creates larger certificates but worse clean predictions. The optimal $\sigma$ for a given target radius $R$ scales as $\sigma^* \propto R$. For $\ell_\infty$ threat models (the more common practical concern), the certificate shrinks by a factor of $\sqrt{d}$, making randomized smoothing essentially useless for high-dimensional $\ell_\infty$ robustness.

Lipschitz Networks and What They Trade Away

A classifier with global Lipschitz constant $L$, meaning $\lVert f(x) - f(x') \rVert_2 \leq L \lVert x - x' \rVert_2$ for all $x, x'$, satisfies a robustness certificate by construction: if the margin at $x$ is $\gamma(x) = f(x)_{y} - \max_{y'\neq y} f(x)_{y'}$, then the certified $\ell_2$ radius is $\gamma(x)/(2L)$.

For a deep network with layers $f = f_K \circ \cdots \circ f_1$, the Lipschitz constant satisfies $\text{Lip}(f) \leq \prod_k \sigma_{\max}(W_k)$ where $\sigma_{\max}(W_k)$ is the spectral norm of the $k$-th weight matrix. Spectral normalization constrains each $\sigma_{\max}(W_k) \leq 1$, ensuring $\text{Lip}(f) \leq 1$. The problem is that this is extremely aggressive: a network with unit Lipschitz constant can only assign confidence proportional to the $\ell_2$ distance to the decision boundary, and natural data distributions do not organize themselves in $\ell_2$-separated clusters.
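The product-of-spectral-norms bound and the resulting certificate can be sketched in a few lines. The code below estimates each spectral norm by power iteration in pure Python; the two small weight matrices and the margin value are made up for illustration, the bound assumes 1-Lipschitz activations such as ReLU, and power iteration can fail if the starting vector happens to be orthogonal to the top singular direction.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    n = len(W[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    return math.sqrt(sum(x * x for x in matvec(W, v)))

def lipschitz_upper_bound(weights):
    """Product of per-layer spectral norms: an upper bound on Lip(f), usually loose."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound

def certified_l2_radius(margin, lipschitz):
    return margin / (2.0 * lipschitz)

# Hypothetical two-layer network with diagonal weights (spectral norms 3 and 0.5).
layers = [[[3.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 0.5]]]
lip = lipschitz_upper_bound(layers)
radius = certified_l2_radius(1.2, lip)
print(lip, radius)
```

The certificate shrinks as the bound grows, which is why training methods that control each layer's spectral norm trade expressiveness for a usable radius.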

Empirically, spectrally-normalized networks on CIFAR-10 achieve roughly 55% clean accuracy with $R = 0.14$ certificates, compared to 67% clean accuracy with $R = 0.5$ certificates from randomized smoothing. The Lipschitz approach gives tighter guarantees per unit accuracy loss at small radii, but does not scale to larger radii as gracefully. Orthogonal networks (where $W_k^T W_k = I$ at every layer) give exactly unit spectral norm while preserving more representational power, and recent work on Cayley parameterizations achieves 69% clean accuracy with $R = 0.36$ certificates.

The fundamental tension is this: a classifier cannot be simultaneously accurate and robust under arbitrary perturbations if the threat model allows perturbations large enough to reach other-class examples. This is not a statement about architectures or training procedures. It is a theorem.

The Accuracy-Robustness Tradeoff Is a Theorem

Zhang et al. (2019) formalized the tradeoff. For any classifier and any distribution $\mathcal{D}$:

\[R_{\text{adv}}(f, \epsilon) \geq R(f) + \underbrace{\mathbb{E}_x\!\left[\max_{c \neq f(x)} P\!\left(\mathcal{B}(x,\epsilon) \cap \{x' : y(x') = c\}\right)\right]}_{\text{boundary overlap}} - \eta^*\]

where $\eta^*$ is the Bayes error and $\mathcal{B}(x,\epsilon)$ is the perturbation ball. The middle term, boundary overlap, measures how much probability mass from other classes lies within distance $\epsilon$ of each test point. If two classes have examples that are $\epsilon$-close, any classifier must either make errors on clean inputs (putting the boundary through that region) or make errors on adversarial inputs (being fooled by the perturbation across the boundary).

For CIFAR-10 with $\epsilon = 8/255$ in $\ell_\infty$, a theoretical estimate of the overlap term gives a minimum achievable adversarial error of roughly 5%, compared to a Bayes clean error below 0.5%. The gap between those numbers, roughly a factor of 10 in irreducible error, is an inherent property of the data distribution at this perturbation radius. No training procedure can close it. The ceiling on robustness is set by geometry, not by model capacity.

Concentration of Measure and the Inevitability of Adversarial Examples

The deepest explanation for adversarial vulnerability is a pure probability theorem with no reference to learning at all. Lévy’s theorem on the concentration of measure on the sphere states: for any 1-Lipschitz function $f: S^{d-1} \to \mathbb{R}$ with median $M$,

\[\mu\!\left(\{\lvert f(x) - M \rvert \geq t\}\right) \leq 2\exp\!\left(-\frac{(d-2)t^2}{2}\right)\]

The measure concentrates around the median with Gaussian tails, and the width of the concentration band shrinks as $O(1/\sqrt{d})$. For a binary classifier, the median is 0 or 1, and most of the sphere lies within $O(1/\sqrt{d})$ of the equator, the decision boundary.

In $d = 10^6$ dimensions, the typical distance from a uniformly random point to the decision boundary of any continuous binary function is of order $10^{-3}$. This is independent of how the classifier was trained. Any continuous decision function on a high-dimensional sphere has adversarial examples within $O(1/\sqrt{d})$ of most correctly classified points, not because the model is poorly trained, but because the sphere has that shape.
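The concentration bound can be checked empirically for the simplest 1-Lipschitz function, a single coordinate $f(x) = x_1$, whose median on the sphere is $0$. The sketch below samples uniform points on $S^{d-1}$ by normalizing Gaussian vectors; the dimension, threshold, sample count, and seed are arbitrary choices for illustration.

```python
import math
import random

def random_point_on_sphere(d, rng):
    """Uniform sample from the unit sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def levy_bound(d, t):
    """Right-hand side of the concentration inequality, 2 exp(-(d - 2) t^2 / 2)."""
    return 2.0 * math.exp(-(d - 2) * t * t / 2.0)

def empirical_tail(d, t, samples, seed=0):
    """Fraction of sampled points whose first coordinate is at least t from the median 0."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(samples)
        if abs(random_point_on_sphere(d, rng)[0]) >= t
    )
    return hits / samples

d, t = 1000, 0.1
print("empirical tail:", empirical_tail(d, t, samples=500))
print("Levy bound:    ", levy_bound(d, t))
```

A typical coordinate has magnitude about $1/\sqrt{d} \approx 0.03$ here, so a threshold of $0.1$ is already deep in the tail, and both the empirical frequency and the bound are small.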

The practical consequence is that adversarial vulnerability in high dimensions is not evidence of poor generalization or insufficient training. It is a consequence of the same high dimensionality that makes these spaces so expressive. The only models that escape concentration of measure are those that rely on very few features (sparse classifiers) or those that operate in a threat model small enough to stay in the concentrated band. Both approaches sacrifice something: sparse classifiers ignore available features, and small threat models do not defend against the perturbations that matter empirically.

Complete Verification and Its Limits

Empirical defenses (adversarial training, data augmentation) can raise the cost of an attack but cannot prove robustness. The alternative, complete verification, is to check, for every test point, that no perturbation within the threat model changes the prediction.

For ReLU networks, this can be cast as a mixed-integer linear program (MILP). Each ReLU unit has an activation pattern (on/off), and the $2^N$ possible patterns for $N$ hidden neurons define $2^N$ linear regions. Verifying robustness within each region is a linear program (LP); verifying across all regions is the exponentially hard MILP. Branch-and-bound with LP relaxations (the LP relaxes each ReLU to an interval constraint, giving an outer approximation of the reachable output set) gives the best practical performance.

Practically, complete verification is feasible for networks with up to a few thousand neurons. For the networks used in practice, ResNets with millions of parameters, complete verification is computationally intractable. Interval Bound Propagation (IBP), which propagates simple interval bounds through each layer, scales to large networks but is conservative: the verified radius it computes is typically far smaller than the true robust radius because interval arithmetic ignores correlations between neurons. An IBP-verified ResNet-50 achieves roughly 36% certified accuracy at $\epsilon = 2/255$ on CIFAR-10, while adversarially trained models without certificates achieve 67% empirical robustness at the same radius.
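The conservatism of interval bounds is easy to reproduce on a network small enough to reason about by hand. The sketch below is a minimal pure-Python version of interval propagation through affine layers and ReLUs; the two-layer network, which computes $x - x = 0$ for every input, is a contrived example chosen to expose the lost correlation.

```python
def affine_bounds(W, b, lower, upper):
    """Propagate elementwise bounds through y = W x + b."""
    out_lower, out_upper = [], []
    for row, bias in zip(W, b):
        # A positive weight takes its minimum at the lower input bound, a negative one at the upper.
        lo = bias + sum(w * (l if w >= 0 else u) for w, l, u in zip(row, lower, upper))
        hi = bias + sum(w * (u if w >= 0 else l) for w, l, u in zip(row, lower, upper))
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def relu_bounds(lower, upper):
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def ibp(layers, lower, upper):
    """Interval bounds on the output of a ReLU network (no ReLU after the last layer)."""
    for i, (W, b) in enumerate(layers):
        lower, upper = affine_bounds(W, b, lower, upper)
        if i < len(layers) - 1:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper

# Layer 1 copies x into two neurons; layer 2 subtracts them, so the true output is always 0.
layers = [
    ([[1.0], [1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
out_lower, out_upper = ibp(layers, [0.0], [1.0])
print(out_lower, out_upper)  # [-1.0] [1.0]
```

The bounds are sound, since the true output $0$ lies inside $[-1, 1]$, but they are far from tight: the interval domain treats the two hidden neurons as independent even though they always carry the same value.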

The gap between certified and empirical robustness reflects the looseness of the certificate. This is a matter of proof theory, not of misleading empirical claims: IBP is sound (certified inputs are truly robust) but incomplete (non-certified inputs may also be robust). Tightening the gap is an active research area with a clear character: it requires tighter approximations of the reachable set of activations, which in turn requires more computation per verification query.

The Shape of the Problem

Putting the pieces together, adversarial robustness resolves into three distinct sub-problems that live at different levels of abstraction.

The first is geometric: the decision boundary of any accurate classifier on natural data is close to most test points, because natural data distributions of different classes overlap at the scale of humanly-imperceptible perturbations. This is a property of the data, not the model.

The second is computational: even if we had a perfect classifier with the maximum achievable robust radius, verifying its robustness requires solving problems that are NP-hard in general. The best we can do in practice is outer approximations that trade completeness for scalability.

The third is statistical: there is an inherent tradeoff between standard accuracy and adversarial robustness, because robustness requires separating classes in the perturbation metric, and class overlap makes this separation imperfect. The tradeoff is not an artifact of current methods. It is a lower bound derived from the data distribution.

This is not a pessimistic picture. It tells us precisely what we can and cannot improve. We cannot change the geometry of the data, but we can choose perturbation metrics that better reflect human perception, reducing the class overlap at the relevant scale. We cannot avoid NP-hardness in general, but we can design architectures that are easier to verify (Lipschitz, orthogonal, interval-analyzable). And we cannot close the accuracy-robustness gap below its fundamental lower bound, but we can get closer to it.

Understanding what is mathematically inevitable is the first step toward knowing where the remaining room for progress lies.

Singular Learning Theory and the Geometry of Neural Network Interpretability

2026-03-01T00:00:00+00:00

Ask a mechanistic interpretability researcher why neural networks form the circuits they do and you will get an honest answer: we observe them, name them, and ablate them, but we lack a theory of why they emerge. This is not a complaint about the field; the empirical discoveries are real and important. It is a statement about what is missing. What we need is a mathematical account of why, given a data distribution and an architecture, gradient descent converges to representations with specific structural properties rather than others.

Singular Learning Theory (SLT), developed by Sumio Watanabe across a series of papers and a 2009 monograph, provides that account. It begins with an observation that is easy to state and took decades to fully appreciate: neural networks are not regular statistical models, and their irregularity is not a nuisance to be engineered around. It is the source of their generalization, and it is the key to understanding what their representations mean.

The Problem With Regular Models

Classical statistical theory rests on an assumption so basic it often goes unstated: the model is regular. Formally, this means the Fisher information matrix $F(\theta)$ is nonsingular at the true parameter $\theta^*$, so the log-likelihood landscape near the optimum is well-approximated by a bowl-shaped quadratic. Under regularity, maximum likelihood estimators are asymptotically normal, model complexity is well-captured by parameter count, and the Bayesian information criterion

\[\text{BIC} = -2\log p(\text{data} \mid \hat\theta) + d\log n\]

gives a reliable estimate of generalization, where $d$ is the number of parameters and $n$ is the sample size.

Neural networks violate regularity comprehensively. At any parameter $\theta^*$ that implements the true function, the Fisher matrix is generically singular. Its null space, the set of parameter perturbations that leave the network’s input-output mapping unchanged, is large. It includes permuting hidden units, rescaling weights between layers, and more subtle symmetries that grow with depth. The consequence is that the loss landscape near $\theta^*$ is not a bowl. It is a polynomial of degree higher than two, and its zero set $W_0 = \{\theta : L(\theta) = L^*\}$ is not an isolated point but a real algebraic variety, a high-dimensional surface of equivalent solutions, with a complex topology that depends on the network architecture and the data distribution.

This matters because the BIC uses $d\log n$ as its complexity penalty, implicitly assuming all $d$ parameters do independent work. For a singular model, many parameters are redundant, the effective complexity is lower, the model generalizes better than BIC predicts, and a different mathematical object is needed to describe what is really happening.

The Real Log Canonical Threshold

That object is the real log canonical threshold (RLCT), also called the learning coefficient $\lambda$. To define it, let $K(\theta) = \mathbb{E}[\log p_{\theta^*}/p_\theta]$ be the KL divergence from the optimal model; it measures how far a parameter $\theta$ is from the optimal function in an information-theoretic sense. $K(\theta)$ vanishes on the optimal set $W_0$ and grows as a polynomial as you move away from it. The RLCT is extracted from the zeta function of the loss:

\[\zeta(z) = \int_\Theta K(\theta)^z \, \varphi(\theta) \, d\theta\]

where $\varphi(\theta)$ is a prior density. This integral is holomorphic for $\text{Re}(z) > 0$ and has a meromorphic continuation to the complex plane. The RLCT $\lambda$ is minus the largest pole:

\[\lambda = -\max\{z \in \mathbb{R} : \zeta(z) \text{ has a pole at } z\}\]

The pole’s multiplicity $m$ captures logarithmic correction terms. Together, $\lambda$ and $m$ appear in Watanabe’s central theorem: the Bayesian free energy (negative log marginal likelihood) satisfies

\[F_n = nL_n + \lambda \log n - (m-1)\log\log n + O_p(1)\]

and the expected generalization error satisfies

\[\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right)\]

For regular models, $\lambda = d/2$ and $m = 1$, exactly recovering the BIC. For singular models, $\lambda < d/2$, the model is penalized less for complexity, and this reduced penalty directly explains why overparameterized networks generalize: their effective complexity, as measured by $\lambda$, is much smaller than their parameter count.
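To make the size of the reduced penalty concrete, here is a minimal sketch that evaluates the complexity term of the free energy, $\lambda \log n - (m-1)\log\log n$, next to the regular-model value $\frac{d}{2}\log n$. The function names and the numbers plugged in ($d$, $n$, $\lambda = 50$, $m = 2$) are hypothetical and chosen only for illustration; they are not measurements of any real network.

```python
import math

def bic_penalty(d: int, n: int) -> float:
    """Regular-model complexity penalty: (d / 2) * log n."""
    return 0.5 * d * math.log(n)

def slt_penalty(lam: float, m: int, n: int) -> float:
    """Singular-model penalty from the free energy expansion:
    lambda * log n - (m - 1) * log log n."""
    return lam * math.log(n) - (m - 1) * math.log(math.log(n))

# Hypothetical values, for illustration only.
d, n = 1000, 100_000
regular = bic_penalty(d, n)
singular = slt_penalty(lam=50.0, m=2, n=n)
print(f"regular penalty:  {regular:.1f}")
print(f"singular penalty: {singular:.1f}")
```

Setting `lam = d / 2` and `m = 1` in `slt_penalty` reproduces `bic_penalty` exactly, which is the regular special case described above.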

The computation of $\lambda$ for a specific architecture requires resolution of singularities, a classical algebraic geometry result (Hironaka, 1964) guaranteeing that any algebraic variety can be desingularized by a finite sequence of coordinate changes called blowups. After blowing up the singular points of $K(\theta)$ into normal crossing divisors, the zeta function becomes a standard multidimensional integral whose poles can be read off directly.

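For a three-layer network with $H$ hidden units mapping $\mathbb{R}^m \to \mathbb{R}^H \to \mathbb{R}^n$ with tanh activations (here $m$ and $n$ denote the input and output dimensions, not the pole multiplicity and sample size used earlier), the result is: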

\[\lambda = \frac{1}{2}\min_{k_1 + k_2 \leq H} \left[\frac{m k_1 - k_1^2}{2} + \frac{k_2 n - k_2^2}{2} + k_1 k_2\right]\]

minimized over integer decompositions of the hidden layer. This formula is structurally revealing: $\lambda$ depends not on the total parameter count but on how the hidden layer’s width relates to the input and output dimensions. Wider-than-necessary hidden layers do not increase $\lambda$ proportionally, they increase it sublinearly, because the additional units contribute to the null space of $F(\theta^*)$ rather than to genuinely independent parameters.

Phase Transitions and Grokking

SLT’s second major insight concerns how a model’s internal representations evolve during training. As the sample size $n$ grows, the Bayesian posterior

\[\varphi_n(\theta) \propto \exp(-nL_n(\theta))\,\varphi(\theta)\]

undergoes a sequence of abrupt reorganizations. Initially the posterior spreads across parameter space (underfitting). As $n$ increases, it concentrates on the optimal set $W_0$. But $W_0$ is not a connected manifold in general, it has multiple connected components with different values of $\lambda$. The posterior concentrates first on the component with the highest $\lambda$ (least singular), then, as evidence accumulates, undergoes a first-order phase transition to the component with lower $\lambda$ (more singular, better generalization).

This is the SLT account of grokking, the striking phenomenon where a model trained on a small dataset first memorizes it (high training accuracy, low test accuracy), then suddenly generalizes far later in training. The memorization regime corresponds to the posterior sitting on a high-$\lambda$ component of $W_0$. The generalization transition is the phase transition to a lower-$\lambda$ component. The delay between memorization and generalization reflects the time needed to accumulate enough evidence to overcome the free energy barrier between components.

The phase transition is not smooth. In experiments on modular addition with small transformers, the transition from memorization to generalization occurs over a narrow range of training steps, the test loss drops sharply (not gradually), and the internal representations reorganize simultaneously, a Fourier-mode structure appears in the embedding weights within hundreds of steps of the transition. SLT predicts exactly this: because the transition is between discrete components of $W_0$ with different $\lambda$, it is necessarily discontinuous.

Superposition as a Compressed Sensing Problem

One of the most studied phenomena in interpretability is superposition: networks represent more features than they have neurons by encoding multiple features in overlapping, nonorthogonal directions in activation space. The naive expectation would be one feature per neuron; the observation is a feature count that scales superlinearly with neuron count. Why?

The compressed sensing framework makes this precise. Let $\mathbf{h} \in \mathbb{R}^d$ be an activation vector, and suppose the network wants to represent $n \gg d$ sparse features $\mathbf{f} \in \mathbb{R}^n$ (sparse meaning most entries are near zero at any given time). The encoding is $\mathbf{h} = W\mathbf{f}$, with $W \in \mathbb{R}^{d \times n}$. This is underdetermined, more features than dimensions, and recovery requires that the encoding matrix $W$ satisfy the Restricted Isometry Property (RIP):

\[(1-\delta_s)\lVert\mathbf{f}\rVert^2 \leq \lVert W\mathbf{f}\rVert^2 \leq (1+\delta_s)\lVert\mathbf{f}\rVert^2\]

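for all $s$-sparse vectors $\mathbf{f}$, where $\delta_s$ is the restricted isometry constant of order $s$. A classical sufficient condition (Candès, 2008) is $\delta_{2s} < \sqrt{2}-1$: when it holds, $\ell_1$-minimization exactly recovers every $s$-sparse feature vector from the compressed representation.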

The fundamental result of compressed sensing is that random $d \times n$ matrices satisfy RIP of order $s$ with high probability once $d \geq C\, s \log(n/s)$ for some constant $C$. For fixed sparsity, the number of recoverable sparse features can therefore grow much faster than linearly in the number of neurons. This matches the empirical observation qualitatively: networks trained on tasks with many sparse features organize their weights into nearly equiangular frames. Exactly equiangular arrangements are more constrained, since the Gerzon bound caps the number of equiangular lines in $\mathbb{R}^d$ at $d(d+1)/2$.
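The near-isometry in the RIP inequality can be probed numerically. The sketch below is only a sampled check, not a certificate: it draws a random Gaussian encoder, applies it to random sparse vectors, and records the ratio $\lVert W\mathbf{f}\rVert^2 / \lVert\mathbf{f}\rVert^2$. The sizes ($d = 64$, $n = 256$, $s = 4$), the seed, and all function names are assumptions made for illustration. Because only 200 sparse vectors are sampled, the resulting `delta_hat` is a lower bound on the true $\delta_s$, which is a worst case over all $s$-sparse vectors.

```python
import math
import random

def random_encoder(d, n, rng):
    """d x n Gaussian matrix with entries N(0, 1/d), so E||W f||^2 = ||f||^2."""
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(d)]

def sparse_vector(n, s, rng):
    """Vector in R^n with exactly s nonzero Gaussian entries."""
    f = [0.0] * n
    for i in rng.sample(range(n), s):
        f[i] = rng.gauss(0.0, 1.0)
    return f

def energy_ratio(W, f):
    """||W f||^2 / ||f||^2, the quantity RIP keeps within [1 - delta, 1 + delta]."""
    support = [j for j, x in enumerate(f) if x != 0.0]
    h = [sum(row[j] * f[j] for j in support) for row in W]
    return sum(x * x for x in h) / sum(x * x for x in f)

rng = random.Random(0)
d, n, s = 64, 256, 4                      # illustrative sizes: 4x more features than dimensions
W = random_encoder(d, n, rng)
ratios = [energy_ratio(W, sparse_vector(n, s, rng)) for _ in range(200)]
delta_hat = max(abs(r - 1.0) for r in ratios)   # sampled lower bound on delta_s
print(f"ratio range: [{min(ratios):.2f}, {max(ratios):.2f}], delta_hat = {delta_hat:.2f}")
```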

Which features end up in superposition? The features that are most common and most sparse in the training distribution. Common features are worth representing; sparse features can be represented without much interference because they rarely co-activate. SLT connects this to the RLCT: the low-$\lambda$ region of parameter space (which gradient descent converges to) is precisely the region where the weight matrices implement near-optimal compressed sensing, RIP-satisfying frames that maximize recoverable feature count per neuron.

Causal Abstraction

The empirical side of interpretability, circuit finding, activation patching, probing, is unified by causal abstraction theory. A circuit is not just a subgraph of the network’s computation, it is a claim that the network implements a high-level algorithm, where “implements” has a precise meaning in terms of interventions.

A high-level causal model is a causal abstraction of a low-level one if there exists a map $\alpha$ from low-level states (activations) to high-level states (algorithm variables) such that intervening on the high-level model corresponds, through $\alpha$, to intervening on the low-level model:

\[\alpha\!\left(\mathcal{M}_{\text{low}}^{\,\alpha^{-1}(\mathcal{I})}(\mathbf{s})\right) = \mathcal{M}_{\text{high}}^{\,\mathcal{I}}(\alpha(\mathbf{s}))\]

The diagram commutes: abstract and concrete interventions are related by $\alpha$. Interchange interventions (setting an activation to the value it would take under a different input) test this commutativity empirically. A circuit is validated when interchange interventions on the claimed algorithmic variables produce the same output changes as the corresponding high-level interventions.
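A toy version of an interchange intervention makes the commuting diagram concrete. In this sketch the "network" is a two-line function whose hidden activation stores $x + y$ at twice its value, and the claimed high-level algorithm computes $S = x + y$ and then $S + z$. Everything here (the functions, the encoding, the test inputs) is a hypothetical illustration rather than a procedure for a real model.

```python
def low_level(x, y, z, patch_h=None):
    """Toy 'network': hidden activation h encodes x + y at twice its value."""
    h = 2 * (x + y) if patch_h is None else patch_h
    return h / 2 + z, h

def high_level(x, y, z, set_s=None):
    """Claimed algorithm: S = x + y, then output S + z."""
    s = x + y if set_s is None else set_s
    return s + z, s

def alpha(h):
    """Abstraction map from the low-level activation h to the high-level variable S."""
    return h / 2

def interchange_agrees(base, source):
    # Low level: run on the base input, but patch h with its value on the source input.
    _, h_src = low_level(*source)
    low_out, _ = low_level(*base, patch_h=h_src)
    # High level: run on the base input, but set S to alpha(h_src).
    high_out, _ = high_level(*base, set_s=alpha(h_src))
    return low_out == high_out

pairs = [((1, 2, 3), (4, 5, 6)), ((0, 0, 1), (7, -2, 9)), ((3, 3, 3), (1, 1, 1))]
validated = all(interchange_agrees(base, source) for base, source in pairs)
print("abstraction validated on all pairs:", validated)
```

If `alpha` were chosen incorrectly (for example, the identity), the two sides would disagree on these inputs, which is exactly the kind of failure an interchange intervention is meant to expose.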

What SLT adds to this picture: the structure of the low-level model $\mathcal{M}_{\text{low}}$ at a convergence point with small $\lambda$ is not arbitrary. The RLCT measures the degree of parameter redundancy, and highly redundant parameterizations (low $\lambda$) are precisely those where simple causal abstractions exist. A model with many equivalent parameterizations has, by definition, a large null space of parameter changes that don’t affect the function, and any basis of this null space defines a natural set of “irrelevant parameters” that can be abstracted away. The abstraction map $\alpha$ is most cleanly defined at highly singular convergence points, which is why circuit structure appears most clearly in well-trained models and not in randomly initialized ones.

Representation Theory and the Inevitability of Fourier Structure

Perhaps the most striking concrete prediction, strictly speaking one from representation theory rather than from SLT itself, is the inevitability of Fourier representations in models trained on cyclic-symmetry tasks. Schur’s lemma states that if $\rho$ and $\rho'$ are irreducible representations of a group $G$ and $f$ is an equivariant linear map between them, then $f = 0$ if $\rho \not\cong \rho'$, and $f = cI$ for some scalar $c$ if $\rho \cong \rho'$ over $\mathbb{C}$.

This means that any linear layer that respects a symmetry group $G$ must block-diagonalize along irreducible representations of $G$: it cannot mix different irreps. For cyclic groups $\mathbb{Z}_n$ (the symmetry of modular arithmetic tasks), the irreducible representations are exactly the Fourier modes $j \mapsto e^{2\pi i jk / n}$ for $k = 0, \ldots, n-1$.

Networks trained on modular arithmetic tasks learn Fourier representations not because we built Fourier structure in, but because the task symmetry group is $\mathbb{Z}_n$, Schur’s lemma forces any equivariant linear layer to be diagonal in the Fourier basis, and gradient descent on the cross-entropy loss preserves the symmetry because the loss is itself $\mathbb{Z}_n$-invariant. The Fourier basis is the unique basis that makes the equivariant constraint compatible with efficient learning. It is, in a precise algebraic sense, the only basis that works.
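The diagonalization claim can be checked directly for a small case. A linear map on $\mathbb{C}^n$ that commutes with cyclic shifts is a circulant matrix, and the sketch below verifies that every Fourier mode is an eigenvector of such a matrix. The kernel weights are arbitrary illustrative numbers and the helper names are assumptions.

```python
import cmath

def circulant(kernel):
    """Matrix of a Z_n-equivariant linear map: C[i][j] = kernel[(i - j) mod n]."""
    n = len(kernel)
    return [[kernel[(i - j) % n] for j in range(n)] for i in range(n)]

def fourier_mode(k, n):
    """Character of Z_n: j -> exp(2 * pi * i * j * k / n)."""
    return [cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)]

def matvec(C, v):
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def is_eigenvector(C, v, tol=1e-9):
    """Check C v = mu v, reading mu off the first coordinate (assumes v[0] != 0)."""
    w = matvec(C, v)
    mu = w[0] / v[0]
    return all(abs(wi - mu * vi) < tol for wi, vi in zip(w, v))

kernel = [0.5, -1.0, 2.0, 0.0, 3.0, 1.5, -0.25]   # arbitrary weights, n = 7
C = circulant(kernel)
n = len(kernel)
all_diagonal = all(is_eigenvector(C, fourier_mode(k, n)) for k in range(n))
print("every Fourier mode is an eigenvector:", all_diagonal)
```

A standard basis vector such as `[1, 0, 0, 0, 0, 0, 0]` fails the same check, which is the sense in which the equivariant map is diagonal in the Fourier basis but not in the neuron basis.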

What This Means for Interpretability Research

The SLT picture synthesizes the empirical interpretability findings into a coherent theoretical frame. Circuits emerge because training converges to low-$\lambda$ points of $W_0$, where the functional structure is simple and the causal abstraction map $\alpha$ is cleanly defined. Grokking is a phase transition between components of $W_0$. Superposition is the compressed sensing solution that gradient descent finds at those low-$\lambda$ points. Fourier representations are the algebraically inevitable outcome of training on symmetric tasks.

What SLT cannot yet do is predict in advance which features will appear in which circuits for a given architecture and dataset. Computing $\lambda$ for real neural networks requires resolving the singularities of high-dimensional polynomial systems, a problem that is algebraically well-posed but computationally very hard. The practical path forward is to use SLT as a diagnostic: measure $\lambda$ empirically (via the Bayesian free energy on held-out data), track its changes during training, and use phase transition predictions to identify the training checkpoints where representational reorganization is most likely to occur.

The hard problem of interpretability is not a problem of visualization or measurement. It is a problem of algebraic geometry. We are trying to understand the structure of a real algebraic variety in a billion-dimensional space, and the features we see in circuits are the natural coordinates near its most degenerate points.

When RAG Fails: Building a GraphRAG System for Multi-Hop Reasoning

2026-01-01T00:00:00+00:00

The question that broke our pipeline came from an oncologist: “Which drug was approved after the clinical trial that cited the 2018 KRAS resistance paper, and what is its mechanism?” Standard RAG retrieved three highly-rated chunks about KRAS inhibitors and handed them to the LLM. The LLM answered confidently and completely incorrectly.

The problem was not the retrieval quality. The chunks it found were genuinely relevant to KRAS. The problem was that answering the question required three sequential lookups: find the 2018 paper, find the trial that cited it, find the drug approved from that trial. No single chunk contains that full chain. A cosine similarity search for the question cannot find the intermediate steps, they are not semantically similar to the question, they are causally upstream of its answer.

This is the multi-hop failure mode. And it is not an edge case in clinical or scientific domains, it is the norm.

Why Cosine Similarity Fails Compositionally

Standard RAG embeds documents into a vector space, embeds the query, and retrieves the documents closest to the query. This works well for questions with one retrieval step: “What is the mechanism of pembrolizumab?” has chunks about pembrolizumab’s mechanism that are semantically close to the query.

It fails for questions whose answer is not in any single document, but in the relationship between documents. Consider the query above. The answer is sotorasib’s mechanism of action. But the path from query to answer requires:

1. Identifying the 2018 KRAS paper (semantic retrieval can do this)
2. Finding trials that cited it (not a similarity problem, citation is a graph edge)
3. Identifying sotorasib as the drug from those trials (again, a relational link)
4. Retrieving sotorasib’s mechanism (now similarity retrieval can finally help)

Steps 2 and 3 are relational reasoning, not semantic matching. A vector space has no representation for “this document cites that document.” The solution is to not use a vector space for those steps, to use a graph.

The Architecture

The system has three components: a knowledge graph built from documents, a graph traversal retriever, and a reasoning loop.

Documents
    │
    ▼
Entity & Relation Extraction  (spaCy + rule-based NER)
    │
    ▼
Knowledge Graph               (NetworkX / Neo4j in production)
    │
    ▼
Query → Entity Linking        (fuzzy matching, C++ implementation)
    │
    ▼
Personalized PageRank          (seed at linked entities, propagate through edges)
    │
    ▼
Subgraph → Natural Language    (verbalization of paths and edges)
    │
    ▼
LLM Chain-of-Thought           (reason step-by-step over verbalized facts)
    │
    ▼
Answer

The key step is Personalized PageRank (PPR). Rather than finding documents similar to the query, PPR finds documents reachable from query-linked entities through the graph. It propagates relevance through edges, a trial that cited the seed paper gets a high score even if it shares no vocabulary with the query.

Personalized PageRank as a Retriever

PPR is a standard graph algorithm: given a seed distribution over nodes, simulate a random walk that teleports back to the seed with probability $1 - \alpha$ at each step. The stationary distribution assigns high scores to nodes that are both close to the seed (reachable in few hops) and structurally central.

In NetworkX, this is a one-liner:

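import networkx as nx

# kg.graph is the knowledge graph; seeds holds the entity ids linked from the query
ppr_scores = nx.pagerank(
    kg.graph,
    alpha=0.85,  # probability of following an edge rather than teleporting
    personalization={eid: 1.0/len(seeds) for eid in seeds},
    max_iter=200,
)
# keep the 15 highest-scoring nodes as the retrieved subgraph
top_nodes = sorted(ppr_scores, key=ppr_scores.get, reverse=True)[:15]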

The personalization argument is the seed distribution. Setting it uniform over the query-linked entities ensures that nodes reachable from any seed entity receive elevated scores. With $\alpha = 0.85$, the random walk has an 85% chance of following an edge and 15% chance of teleporting to a seed, this controls how far relevance propagates. Higher $\alpha$ retrieves more distant nodes; lower $\alpha$ stays close to the seeds.
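To see what the personalized walk computes, here is a from-scratch power-iteration sketch on a six-node toy graph. It is not the NetworkX implementation, only an illustration of the same recurrence; the node names are hypothetical placeholders, and the graph is treated as undirected for simplicity. The walk is seeded at the paper, and relevance flows along edges to the trial, the drug, and the mechanism, while the disconnected pair of nodes receives nothing.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for PPR on a graph given as {node: [neighbors]}.
    Each step follows an edge with probability alpha and teleports to a
    uniformly chosen seed with probability 1 - alpha."""
    nodes = list(adj)
    teleport = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(teleport)
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) * teleport[v] for v in nodes}
        for u in nodes:
            if adj[u]:
                share = alpha * score[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for v in nodes:
                    nxt[v] += alpha * score[u] * teleport[v]
        score = nxt
    return score

# Hypothetical toy graph: one citation chain plus an unrelated component.
adj = {
    "paper_2018": ["trial"],
    "trial": ["paper_2018", "drug"],
    "drug": ["trial", "mechanism"],
    "mechanism": ["drug"],
    "unrelated_paper": ["unrelated_drug"],
    "unrelated_drug": ["unrelated_paper"],
}
scores = personalized_pagerank(adj, seeds={"paper_2018"})
for node in sorted(scores, key=scores.get, reverse=True):
    print(f"{node:16s} {scores[node]:.3f}")
```

Scores decay with hop distance from the seed, and lowering `alpha` steepens that decay, which is the knob described above.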

The verbalization converts the retrieved subgraph edges into natural language for the LLM:

def verbalize(subgraph, kg):
    lines = []
    for u, v, data in subgraph.edges(data=True):
        eu, ev = kg.entities[u], kg.entities[v]
        lines.append(f"- {eu.text} [{eu.label}] --{data['relation']}--> {ev.text} [{ev.label}]")
    return "\n".join(lines)

The result looks like:

- Smith 2018 [PAPER] --DESCRIBES--> KRAS G12C resistance [MECHANISM]
- KEYNOTE-590 [TRIAL] --CITES--> Smith 2018 [PAPER]
- Sotorasib [DRUG] --STUDIED_IN--> KEYNOTE-590 [TRIAL]
- Sotorasib [DRUG] --INHIBITS--> KRAS G12C [TARGET]

Given this context, the LLM can trace the reasoning chain explicitly rather than generating from parametric memory.

The C++ Entity Linker

Entity linking, matching query mentions to graph nodes, is the retrieval bottleneck when the graph has millions of nodes. A pure Python implementation using fuzzy string matching is too slow. The C++ implementation uses exact-match hashing as a first pass and bounded Levenshtein distance as a fallback:

std::vector<EntityMatch> link(const std::string& mention, int max_edit = 2) {
    std::string lower = normalize(mention);
    // First pass: constant-time exact lookup on the normalized mention
    auto exact = exact_index_.find(lower);
    if (exact != exact_index_.end())
        return { {exact->second, 1.0f} };

    // Fallback: scan all entities, keeping those within max_edit edits
    std::vector<EntityMatch> results;
    for (const auto& [eid, text] : all_entities_) {
        int dist = bounded_edit_distance(lower, text, max_edit);
        if (dist <= max_edit)
            results.push_back({eid, 1.0f - (float)dist / std::max(lower.size(), text.size())});
    }
    // Best match first
    std::sort(results.begin(), results.end(),
              [](const auto& a, const auto& b){ return a.score > b.score; });
    return results;
}

bounded_edit_distance exits early if the running minimum exceeds max_edit, making it $O(n \cdot \text{max\_edit})$ rather than $O(n^2)$. On a graph with 150k entities, this processes 100k mention queries per second on a single core.
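The body of bounded_edit_distance is not shown above, so the following is one plausible way to write it, given in Python for readability rather than as the production C++ code. It fills only the cells within `max_edit` of the diagonal and stops as soon as a whole row is over budget. The return convention (`max_edit + 1` meaning "too far") is an assumption, and the row buffers are kept full-length for clarity, so the sketch shows the banding and early-exit logic rather than a tight memory layout.

```python
def bounded_edit_distance(a: str, b: str, max_edit: int) -> int:
    """Levenshtein distance between a and b, or max_edit + 1 if it exceeds max_edit."""
    big = max_edit + 1
    if abs(len(a) - len(b)) > max_edit:
        return big                      # length gap alone already exceeds the budget
    prev = [j if j <= max_edit else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo = max(1, i - max_edit)       # band: only columns within max_edit of the diagonal
        hi = min(len(b), i + max_edit)
        cur = [big] * (len(b) + 1)
        if i <= max_edit:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
        if min(cur) >= big:
            return big                  # early exit: no alignment can get back under budget
        prev = cur
    return prev[len(b)]

print(bounded_edit_distance("sotorasib", "sotorasb", 2))
print(bounded_edit_distance("kitten", "sitting", 2))
```

A caller treats any result greater than `max_edit` as "no match", which lines up with the `dist <= max_edit` check in the linker.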

The fuzzy linker reduced entity linking errors from 23% to 8% on clinical text. That mattered more than any change to the traversal algorithm, a PPR retriever seeded at the wrong entities simply propagates from the wrong place.

Benchmark: GraphRAG vs. Naive RAG on Multi-Hop QA

We built a synthetic biomedical corpus with four document types (papers, trials, approval records, mechanism descriptions) and constructed 30 multi-hop questions across 2, 3, and 4 hops. Human annotators labeled correct answers and reasoning chains.

Hops	GraphRAG accuracy	Naive RAG accuracy	GraphRAG latency	Naive latency
2	100%	85%	380ms	290ms
3	87%	41%	510ms	310ms
4	71%	12%	640ms	320ms

At 2 hops, naive RAG is competitive, many 2-hop questions have answers that are semantically close to the query. By 3 hops, naive RAG accuracy collapses to 41%. At 4 hops it falls to 12%, essentially random. GraphRAG degrades gracefully: 71% at 4 hops reflects genuine difficulty (more nodes to link, more edges to traverse correctly), not a fundamental retrieval failure.

The latency cost of GraphRAG is modest: 90ms overhead at 2 hops, growing to 320ms at 4 hops. For queries where naive RAG fails 88% of the time, the additional latency is clearly worth paying.

What We Learned in Production

Three things that synthetic benchmarks missed:

Relation extraction quality dominates everything. A PPR retriever over an incorrect graph actively misleads the LLM, false edges create false reasoning chains. We invested heavily in fine-tuning the NER model on clinical text. The rule-based relation extraction (adequate for a demo) was replaced with a fine-tuned relation extractor trained on 5,000 annotated clinical document pairs.

Verbalization is a compression problem. At 3+ hops, subgraphs contain 40–80 edges, which overflows the LLM’s effective context. Pruning to the shortest paths between seed entities reduced context size by 70% while retaining 94% of answer-relevant facts on held-out questions. The LLM performs better with the pruned subgraph, less irrelevant context means less hallucination.
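A minimal sketch of the pruning step, assuming an unweighted graph stored as an adjacency dict and keeping one shortest path per pair of seed entities. The function names and the toy graph are hypothetical; a production version would more likely run against the graph store directly and might keep all shortest paths rather than one.

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, src, dst):
    """BFS shortest path in an unweighted graph {node: [neighbors]}; None if unreachable."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def prune_to_seed_paths(adj, seeds):
    """Keep the edges of one shortest path between each pair of seed entities."""
    kept = set()
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(adj, a, b)
        if path:
            kept.update(frozenset(edge) for edge in zip(path, path[1:]))
    return kept

# Toy graph: A - B - C is the chain between the seeds; X and Y are side branches.
adj = {"A": ["B", "X"], "B": ["A", "C"], "C": ["B", "Y"], "X": ["A"], "Y": ["C"]}
kept = prune_to_seed_paths(adj, seeds={"A", "C"})
print(sorted(tuple(sorted(e)) for e in kept))
```

Only the kept edges would then be passed to the verbalizer, so side branches never reach the prompt.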

GraphRAG is not a drop-in upgrade. It requires a knowledge graph, which requires entity extraction at index time, which requires NER, which requires training data. The upfront investment is substantial. For single-hop lookups, which represent the majority of user queries in most systems, naive RAG has lower latency and comparable accuracy. The business decision to build GraphRAG only makes sense if your question distribution genuinely has multi-hop structure. In clinical oncology, where questions routinely span patient records, drug databases, and protocol literature simultaneously, it does.

Reducing Production Inference Latency by 10x: A Profiling Story

2025-12-01T00:00:00+00:00

A model serving endpoint at Synthure had a p99 latency of 4.2 seconds. Physicians were waiting that long, four seconds, for coding recommendations during patient encounters. The product team had assumed LLM inference was the problem. We had discussed switching to a smaller model, accepting worse accuracy in exchange for speed. Before doing that, we profiled.

The tokenizer was eating a third of the budget. No one had suspected the tokenizer.

The Baseline: Waterfall Before You Optimize

The standard mistake in latency work is optimizing what seems slow rather than what is slow. Before writing a line of optimization code, we instrumented every stage with a lightweight span tracer and ran it on 500 real requests.

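import time
from contextlib import contextmanager

@contextmanager
def span(trace, name):
    t0 = time.perf_counter()
    try:
        yield
    finally:  # record elapsed milliseconds even if the stage raises
        trace[name] = (time.perf_counter() - t0) * 1000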

What came out:

Stage	p50 (ms)	p95 (ms)	p99 (ms)	% of p99
Tokenize	840	1100	1400	33%
Embed query	620	810	920	22%
Vector search	180	240	310	7%
Rerank	290	380	490	12%
LLM inference	720	980	1100	26%
Postprocess	15	28	35	1%
Total	2665	3538	4255	100%

Three surprises: tokenization was the single largest bottleneck at 33% of p99. The retrieval steps (embed + search + rerank) together took 41%, all running sequentially. LLM inference, which we had assumed was dominant, was only 26%, significant, but third.
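Turning per-request traces into a percentile table is a small aggregation step. The sketch below assumes each request produced one dict of stage timings in milliseconds and uses the nearest-rank percentile rule; the function names are hypothetical and the timings are made up for illustration, not the measurements reported here.

```python
def percentile(values, q):
    """Nearest-rank percentile for integer q: the smallest value with at least
    q percent of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)   # integer ceil(q * n / 100)
    return ordered[rank - 1]

def summarize(traces, qs=(50, 95, 99)):
    """traces: list of {stage_name: milliseconds} dicts, one per request."""
    stages = traces[0].keys()
    return {s: {q: percentile([t[s] for t in traces], q) for q in qs} for s in stages}

# Made-up timings for 100 requests, for illustration only.
traces = [{"tokenize": 10.0 * i, "search": 1.0 * i} for i in range(1, 101)]
summary = summarize(traces)
for stage, stats in summary.items():
    print(stage, stats)
```

One caveat worth keeping in mind when reading such a table: per-stage p99 values come from different requests, so their sum is an approximation of the end-to-end p99 rather than a measurement of it.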

The order of interventions changed completely.

Fix 1: Rewrite Tokenization in C++

The tokenizer was a pure Python BPE implementation that took 1,400ms at p99 on medical text because it was doing O(n²) byte-pair merges in Python interpreter loops. Medical notes average 800–1,200 tokens; at that length, the quadratic cost is visible.

The fix was a C++ implementation via pybind11. The core merge loop:

bool changed = true;
while (changed) {
    changed = false;
    for (int i = 0; i + 1 < (int)tokens.size(); i++) {
        auto it = merges.find({tokens[i], tokens[i+1]});
        if (it != merges.end()) {
            tokens[i] = it->second;
            tokens.erase(tokens.begin() + i + 1);
            changed = true;
            break;
        }
    }
}

The C++ version processes the same merge table with native hash map lookups and no interpreter overhead. For a vocabulary of 50k merges applied to an 800-token input, the Python loop runs 50k × 800 = 40M dict lookups in Python bytecode. The C++ version runs the same lookups in ~2ns each versus ~100ns in Python.

Result: tokenization dropped from 1,400ms p99 to 85ms p99, 16x speedup. Total p99 went from 4,255ms to 2,940ms.

Fix 2: Parallelize the Retrieval Pipeline

The embed → vector search → rerank sequence was running sequentially, even though the query embedding could start immediately and doesn’t need to wait for anything. More importantly, we were running separate dense and sparse retrieval pipelines in series, dense first, then sparse, and fusing results at the end.

Dense and sparse retrieval are completely independent. Running them concurrently costs nothing:

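import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)  # shared pool; the size is an assumption

async def retrieve_parallel(query, dense_index, sparse_index, reranker):
    loop = asyncio.get_running_loop()
    # dense and sparse search are independent, so submit both before awaiting
    dense_task  = loop.run_in_executor(executor, dense_index.search, query, 20)
    sparse_task = loop.run_in_executor(executor, sparse_index.search, query, 20)
    dense_results, sparse_results = await asyncio.gather(dense_task, sparse_task)
    fused = reciprocal_rank_fusion(dense_results, sparse_results)
    return await loop.run_in_executor(executor, reranker.rerank, query, fused, 5)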

The reranker still runs sequentially (it needs the fused results), but the two retrieval steps now overlap. Combined retrieval time dropped from 920 + 490 = 1,410ms (sequential) to 620ms (parallel, bounded by the slower dense retrieval).
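The fusion step is called above but not shown. A common way to implement it is reciprocal rank fusion, sketched here under the assumption that each retriever returns an ordered list of document ids. The constant `k = 60` is a widely used default rather than a value taken from this system, and the tie-break by id is an arbitrary choice made to keep the output deterministic.

```python
def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked lists of document ids.
    score(d) = sum over lists of 1 / (k + rank of d in that list), ranks starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical document ids.
dense = ["d3", "d1", "d7"]
sparse = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion(dense, sparse)
print(fused)
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, which is why the fused list is a better input to the reranker than either list alone.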

Total p99 after fixes 1 and 2: from 2,940ms to 1,615ms.

Fix 3: Dynamic Batching at the API Layer

Each request was triggering a separate GPU forward pass. For a 7B parameter model on a single A10G, the memory transfer and CUDA kernel launch overhead is roughly 200ms, paid once per request regardless of batch size (up to the memory limit). Batching 8 requests costs the same as batching 1 in kernel launch time; the per-token compute is nearly identical up to batch size ~16.

The TypeScript API layer collects requests for up to 50ms and flushes them together:

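interface PendingRequest {
    payload: string;
    resolve: (result: string) => void;
    reject: (err: unknown) => void;
}

class RequestBatcher {
    private queue: PendingRequest[] = [];
    private timer?: ReturnType<typeof setTimeout>;

    // processBatch runs one batched forward pass and returns results in input order.
    // It is not shown in the original listing; injecting it here is an assumption.
    constructor(private processBatch: (payloads: string[]) => Promise<string[]>) {}

    async submit(payload: string): Promise<string> {
        return new Promise((resolve, reject) => {
            this.queue.push({ payload, resolve, reject });
            if (this.queue.length >= 16) this.flush();
            else if (!this.timer) this.timer = setTimeout(() => this.flush(), 50);
        });
    }

    private flush() {
        // Clear the timer so the next isolated request can schedule a new one
        if (this.timer) { clearTimeout(this.timer); this.timer = undefined; }
        const batch = this.queue.splice(0, 16);
        if (batch.length === 0) return;
        this.processBatch(batch.map(r => r.payload))
            .then(results => batch.forEach((r, i) => r.resolve(results[i])))
            .catch(err => batch.forEach(r => r.reject(err)));
    }
}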

The 50ms wait adds latency to individual requests that arrive in isolation. At production throughput (30–80 requests/second), a 50ms window collects only about 2 to 4 requests, so most flushes are triggered by the timer rather than by a full queue of 16. Even so, the effective LLM inference latency dropped from 1,100ms to 680ms. The gain is well short of 16x because batching only amortizes the fixed per-batch overhead; the per-token compute still has to be paid.

Fix 4: Eliminate Memory Copies Between Processes

The Python-to-C++ data path was serializing tokenized tensors through pickle: Python tokenized, pickled, sent over a socket to the C++ inference subprocess, which unpickled and converted to the model’s input format. Each request was copying 4–8KB of token data twice.

Replacing with POSIX shared memory via Python’s multiprocessing.shared_memory:

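from multiprocessing import shared_memory
import numpy as np

shm = shared_memory.SharedMemory(create=True, size=MAX_BATCH * SEQ_LEN * 4)
tokens_buf = np.ndarray((MAX_BATCH, SEQ_LEN), dtype=np.int32, buffer=shm.buf)

# Write side: a single copy into the shared buffer, no pickling
tokens_buf[:batch_size, :seq_len] = batch_tokens

# Read side (in inference process): attach to named region. The view must use the
# writer's full (MAX_BATCH, SEQ_LEN) shape so row strides match, then be sliced.
existing = shared_memory.SharedMemory(name=shm.name)
shared_view = np.ndarray((MAX_BATCH, SEQ_LEN), dtype=np.int32, buffer=existing.buf)
tensor_in = shared_view[:batch_size, :seq_len]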

The tensor data now lives in one memory region accessible from both processes. No serialization, and no copies beyond the single write into the buffer. This saved 35–55ms per request, smaller than the tokenizer win, but free.

Final Numbers

Stage	Before (p99)	After (p99)	Speedup
Tokenize	1,400ms	85ms	16.5x
Retrieval	1,410ms	620ms	2.3x
LLM inference	1,100ms	680ms	1.6x
Postprocess + copies	70ms	45ms	1.6x
Total	4,255ms	380ms	11.2x

We got to 380ms p99 without changing the model. No accuracy tradeoff. The smaller-model conversation never happened.

The lesson generalizes: latency problems rarely live where you assume they do. The GPU is expensive, so we assume it dominates. The LLM is the “AI part,” so we assume it’s the bottleneck. But production systems are end-to-end pipelines, and the bottleneck is wherever the slowest non-parallel stage sits. The tokenizer, a piece of software that predates neural networks entirely, was what physicians were waiting on.

Measure first. Then fix what you measured.

Implementing AlphaZero for Connect Four: MCTS + Neural Policy in C++ and Python

2025-11-01T00:00:00+00:00

Around iteration 40 of training, something changed. The agent, which had been playing essentially random Connect Four with a mild center preference, started blocking threats it had no reason to know about. A human playing against it dropped a piece that created a diagonal three-in-a-row. The agent, on its next move, dropped a piece that blocked the winning extension. Not because it had been told about diagonals. Because 40 iterations of self-play had accumulated enough evidence that unblocked diagonals eventually lead to losses.

That moment clarified something about AlphaZero that the paper doesn’t quite communicate: the strategic knowledge is not programmed in. It is not emergent from clever reward shaping. It is the natural residue of a search algorithm and a neural network cooperating for long enough that patterns solidify. This post is about how that cooperation is engineered.

The Algorithm: Why MCTS and a Neural Network Need Each Other

Pure Monte Carlo Tree Search, without a neural network, estimates position value by playing random games from that position and averaging outcomes. It works, but in games where winning requires strategic sequences of 5–8 moves, random play almost never reaches those positions. The signal is too sparse. A uniformly random Connect Four player wins by accident more often than by strategy.

A pure neural network, trained to evaluate positions statically, misses tactical combinations entirely. It can learn that a center column is generally good, but a 3-move forced win starting from column 4 requires lookahead that static evaluation cannot provide.

AlphaZero couples them: the network provides a prior policy $p(a \mid s)$ over moves and a value estimate $v(s)$ that replaces random rollouts. MCTS uses these to direct search, nodes with high prior get explored sooner; nodes with high value get reinforced. After hundreds of simulations from the root position, the visit counts encode a policy that is strictly better than either component alone.

The selection criterion at each node is PUCT (Predictor + UCT):

\[\text{score}(s, a) = Q(s,a) + c_{\text{puct}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)}\]

$Q(s,a)$ is the running average value from previous simulations through action $a$. $P(s,a)$ is the neural network’s prior. The second term is an exploration bonus that decays as $N(s,a)$ grows, heavily explored actions stop receiving the bonus. The constant $c_{\text{puct}} \approx 1.5$ controls the exploration-exploitation tradeoff.

float MCTSNode::puct_score(float c_puct) const {
    float parent_n = parent ? (float)parent->visit_count : 1.f;
    float u = c_puct * prior * std::sqrt(parent_n) / (1.f + visit_count);
    return Q() + u;
}

This is the entire selection logic. Everything else in MCTS, expansion, backpropagation, the tree itself, is bookkeeping around this formula.
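For readers who prefer Python, the same selection rule can be sketched as follows. The `Node` class and the numbers in the example are hypothetical; sign flipping of values between alternating players is deliberately left out to keep the focus on the formula.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                      # P(s, a) from the policy head
    visit_count: int = 0              # N(s, a)
    value_sum: float = 0.0            # sum of backed-up values
    children: dict = field(default_factory=dict)   # action -> Node

    def q(self) -> float:
        """Mean backed-up value; 0 for an unvisited node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Return the (action, child) pair with the highest PUCT score."""
    parent_n = max(1, node.visit_count)
    def score(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(parent_n) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

root = Node(prior=1.0, visit_count=10)
root.children = {
    0: Node(prior=0.6, visit_count=8, value_sum=2.0),    # well explored, Q = 0.25
    1: Node(prior=0.3, visit_count=1, value_sum=0.1),    # barely explored, Q = 0.1
    2: Node(prior=0.1, visit_count=1, value_sum=-0.5),   # looks bad, Q = -0.5
}
action, child = select_child(root)
print("selected action:", action)
```

In this example the barely explored action wins despite its lower $Q$, because its exploration bonus has not yet decayed; after a few more visits the bonus shrinks and the well-explored action takes over.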

The Architecture: What the Network Needs to Output

The network sees the board as a 3×6×7 tensor: one plane for the current player’s pieces, one for the opponent’s, one constant plane indicating who is to move (a convention from AlphaGo). It outputs two heads:

Policy head: 7 logits, one per column. Passed through softmax to get $P(a \mid s)$.
Value head: a single scalar in $[-1, 1]$ representing the expected game outcome for the current player (+1 for a win, −1 for a loss).

The network body is a residual tower: 5 residual blocks of 64 channels, each with two 3×3 convolutions and batch norm. This is small by AlphaZero standards (the original used 20 blocks and 256 channels for Go), but sufficient for Connect Four’s search complexity.

The loss during training combines both heads:

\[\mathcal{L} = \underbrace{(z - v)^2}_{\text{value}} - \underbrace{\pi^T \log p}_{\text{policy}} + c\,\lVert\theta\rVert^2\]

where $z$ is the actual game outcome, $v$ is the value head’s prediction, $\pi$ is the visit-count policy from MCTS, and $p$ is the policy head’s output. The policy loss is cross-entropy against MCTS visit counts, the network is trained to predict not what moves are good in isolation, but what moves MCTS has found most valuable after extensive search.
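Written out for a single training example, the three terms look like this. It is a plain-Python sketch meant to mirror the formula, not the training code; the regularization constant `c = 1e-4` and the example numbers are assumptions.

```python
import math

def alphazero_loss(z, v, pi, p, params, c=1e-4):
    """(z - v)^2 - pi . log p + c * ||theta||^2 for one example.
    z: game outcome, v: value head output, pi: MCTS visit-count policy,
    p: policy head probabilities, params: flat list of weights."""
    value_loss = (z - v) ** 2
    # Cross-entropy against the MCTS policy; moves with zero visits contribute nothing.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2 = c * sum(w * w for w in params)
    return value_loss + policy_loss + l2

pi = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0]   # hypothetical visit-count policy over 7 columns
p = [1 / 7] * 7                             # an untrained, uniform policy head
loss = alphazero_loss(z=1.0, v=0.5, pi=pi, p=p, params=[])
print(f"loss = {loss:.4f}")
```

With a uniform policy head the cross-entropy term equals $\log 7$ regardless of $\pi$, so early in training almost all of the gradient signal in the policy loss comes from MCTS concentrating its visits.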

Self-Play: The Training Loop

Each training iteration: play 25 games via MCTS self-play, add the resulting (state, MCTS policy, game outcome) triples to a replay buffer, then train on 50 minibatches sampled from the buffer. The network that generates the training data is the same network being trained, there is no separate target network.

One subtlety: in early moves (before move 10), actions are sampled proportionally to visit counts, injecting exploration. In later moves, the best action is chosen greedily. This mirrors AlphaZero’s temperature schedule and prevents the agent from converging to a single opening strategy.

Dirichlet noise is added to the root node’s priors before each search ($\alpha = 0.3$, $\epsilon = 0.25$). This ensures the agent considers moves the neural network thinks are poor, without it, the search quickly becomes myopic, over-relying on the policy head’s priors and failing to discover refutations.
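Both exploration mechanisms fit in a few lines. The sketch below samples Dirichlet noise from normalized gamma draws using only the standard library, and switches from sampling to greedy play at move 10 as described above. Function and parameter names are hypothetical.

```python
import random

def add_dirichlet_noise(priors, alpha=0.3, eps=0.25, rng=random):
    """Mix root priors with Dirichlet(alpha) noise: (1 - eps) * p + eps * eta."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    return [(1 - eps) * p + eps * g / total for p, g in zip(priors, gammas)]

def choose_move(visit_counts, move_number, explore_until=10, rng=random):
    """Sample in proportion to visit counts early in the game, then play greedily."""
    if move_number < explore_until:
        return rng.choices(range(len(visit_counts)), weights=visit_counts, k=1)[0]
    return max(range(len(visit_counts)), key=visit_counts.__getitem__)

rng = random.Random(0)
noisy = add_dirichlet_noise([1 / 7] * 7, rng=rng)
late_move = choose_move([5, 120, 30, 10, 20, 10, 5], move_number=15)
print([round(x, 3) for x in noisy], late_move)
```

Because `eps = 0.25`, every move keeps at least 75% of its original prior, so the noise can promote a neglected move without erasing what the network has learned.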

What Emerged

Training overnight on a MacBook Pro (M2), 100 iterations, 200 MCTS simulations per move:

Iteration	Win rate vs random	Win rate vs prev. self
0	62%	n/a
10	71%	n/a
25	84%	58%
50	93%	64%
75	97%	61%
100	98%	59%

The win rate against random play plateaus around 97–98%: a random Connect Four opponent occasionally wins by accident, which sets a ceiling. The more meaningful metric is win rate against the previous version of self, which stabilizes around 59–64% by iteration 50: each version is modestly better than its predecessor, as expected from incremental self-play improvement.

The strategic patterns that emerged, in roughly chronological order:

Center column preference (iterations 5–15): the agent develops a strong prior for columns 3 and 4. This is sound (center pieces connect in more directions) and emerged purely from self-play statistics.
Threat blocking (iterations 20–40): the agent consistently blocks opponent three-in-a-rows, even diagonal ones it had not been specifically trained to recognize.
Fork construction (iterations 50–80): the agent begins creating positions with two simultaneous winning threats, a basic tactic that requires 3–4 move lookahead. Against a human opponent, forks are decisive; they cannot both be blocked.
Zugzwang awareness (iterations 80+): the agent starts avoiding moves that are locally neutral but strategically poor: moves that give the opponent a forced win in 6–8 moves. This is the hardest pattern to acquire because it requires very deep MCTS search to see the eventual consequence.

None of these were explicitly programmed. The game rules are the only domain knowledge. Everything else (the geometry of threats, the concept of a fork, the strategic value of center control) crystallized from self-play: roughly 2,500 games over 100 iterations, with 200 MCTS simulations behind every move.

The Surprising Part

The most surprising result was not that the agent learned to play well. It was how fast the knowledge accumulated. The blocking behavior appeared at iteration 20, after roughly 500 games of self-play, about 15,000 board positions. A human child learning Connect Four would see far fewer positions before developing similar instincts. But the human is also doing something very different: bringing language, causal reasoning, and analogical transfer from other games. The agent has only the statistics of its own experience.

What both have in common: neither was told the rules of strategy. Both inferred them from the structure of the game.

The Information Bottleneck: Deriving Optimal Representations From First Principles

2025-10-01T00:00:00+00:00

Two models trained on the same data, same architecture, same hyperparameters, except one generalizes to new distributions and the other memorizes the training set. This was the puzzle I kept running into. Validation accuracy looked identical during training. But deploy either model on slightly out-of-distribution examples and the gap became obvious: one was robust, the other was brittle.

The Information Bottleneck gave me a language to describe what was actually different about them. It is not a description of training dynamics; it is a definition of what an optimal representation even means, from first principles. Once you have that definition, the brittleness stops being mysterious.

What Mutual Information Measures

Mutual information between two variables $X$ and $Y$ is:

\[I(X; Y) = H(X) - H(X \mid Y) = \mathbb{E}_{x,y}\!\left[\log\frac{p(x,y)}{p(x)\,p(y)}\right]\]

It is zero when $X$ and $Y$ are independent, and equals $H(X)$ when $Y$ completely determines $X$. Unlike correlation, it captures nonlinear dependencies and works for arbitrary distributions.

The key theorem for what follows is the data processing inequality: if $Y \to X \to Z$ is a Markov chain (meaning $Z$ is computed from $X$, with no direct access to $Y$), then

\[I(Z; Y) \leq I(X; Y)\]

Representations can only lose information about the target. The question is: how much do they need to keep?
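To make the inequality concrete, here is a small sketch that computes mutual information from a joint probability table and checks the data processing inequality on a toy binary Markov chain. The channel probabilities are made up for illustration.

```python
import math

def mutual_information(joint):
    """I(A; B) in bits from a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )

# Markov chain Y -> X -> Z over binary variables (toy numbers).
p_y = [0.5, 0.5]
p_x_given_y = [[0.9, 0.1], [0.1, 0.9]]   # X is a noisy copy of Y
p_z_given_x = [[0.8, 0.2], [0.2, 0.8]]   # Z is a noisy copy of X

joint_yx = [[p_y[y] * p_x_given_y[y][x] for x in range(2)] for y in range(2)]
joint_yz = [
    [sum(joint_yx[y][x] * p_z_given_x[x][z] for x in range(2)) for z in range(2)]
    for y in range(2)
]

i_xy = mutual_information(joint_yx)
i_zy = mutual_information(joint_yz)
print(f"I(X;Y) = {i_xy:.3f} bits, I(Z;Y) = {i_zy:.3f} bits")
```

The second channel can only blur what $X$ knew about $Y$, so `i_zy` comes out smaller than `i_xy` for any choice of the two conditional tables.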

The Bottleneck Tradeoff

Tishby, Pereira, and Bialek (1999) formalized this as an optimization problem. Given input $X$ and target $Y$, find an encoder $p(z \mid x)$ that minimizes:

\[\mathcal{L} = I(X; Z) - \beta \cdot I(Z; Y)\]

The first term penalizes how much of $X$ the representation $Z$ retains (compression). The second term rewards how much of $Y$ it preserves (relevance). $\beta$ is a Lagrange multiplier that trades one against the other.

At $\beta = 0$, the optimal solution discards everything: $Z$ is a single point, $I(X;Z) = 0$, and $I(Z;Y) = 0$. At $\beta \to \infty$, the constraint on $I(X;Z)$ disappears and $Z = X$ is optimal. Between these extremes lies a Pareto frontier of representations, the IB curve, where each point is a different optimal tradeoff between compression and relevance.

The self-consistent equations that define this frontier (solved by iterated EM-like updates) are:

\[p(z \mid x) \propto p(z) \exp\!\left(-\beta \cdot D_{\mathrm{KL}}\!\left(p(y \mid x) \,\|\, p(y \mid z)\right)\right)\]

The encoder assigns higher probability to $z$ values whose conditional distribution $p(y \mid z)$ is close to $p(y \mid x)$: values that are “good predictors” of $Y$ given $X$. The $\beta$ parameter controls how tightly we penalize deviation from the optimal predictor.
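The encoder update is one of three coupled equations. For completeness, the other two in the standard IB iteration recompute the marginal and the decoder from the current encoder; both follow from the Markov chain $Y \to X \to Z$ and Bayes’ rule:

\[p(z) = \sum_x p(x)\, p(z \mid x), \qquad p(y \mid z) = \frac{1}{p(z)} \sum_x p(y \mid x)\, p(z \mid x)\, p(x)\]

Cycling through the three updates until the encoder stops changing yields a stationary point of the objective for the chosen $\beta$, which need not be the global optimum.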

The Information Plane

Shwartz-Ziv and Tishby (2017) proposed tracking every layer $T_l$ of a neural network during training in the information plane, a 2D plot with $I(X; T_l)$ on the x-axis and $I(T_l; Y)$ on the y-axis. Each layer traces a curve as training progresses.

Their finding: training proceeds in two distinct phases. In the first phase (fitting), $I(T_l; Y)$ increases rapidly as layers learn to predict the label. In the second phase (compression), $I(X; T_l)$ decreases as layers forget irrelevant aspects of the input. They argue that the compression phase is what drives generalization.

Running this on a 4-layer MLP on MNIST, measuring mutual information via MINE (Mutual Information Neural Estimator) every 10 epochs:

Epoch	Layer 1 $I(X;T)$	Layer 1 $I(T;Y)$	Layer 4 $I(X;T)$	Layer 4 $I(T;Y)$
0	9.2	0.1	1.8	0.1
20	9.1	2.1	3.4	1.8
50	8.8	2.3	3.1	2.1
100	6.4	2.3	1.9	2.2
200	4.1	2.3	1.2	2.2

The pattern is clear: $I(T;Y)$ plateaus early (fitting is done), then $I(X;T)$ slowly decreases (compression continues). Layer 4, closest to the output, compresses more aggressively than layer 1 in relative terms: it sheds roughly 65% of its peak mutual information with $X$ (3.4 down to 1.2, versus about 55% for layer 1, 9.2 down to 4.1) while retaining essentially all the mutual information with $Y$.

Estimating MI With Neural Networks

For continuous representations, computing $I(X; Z)$ directly requires density estimation in high dimensions, which is intractable. MINE (Belghazi et al., 2018) provides a scalable lower bound using the Donsker-Varadhan representation of KL divergence:

\[I(X; Z) \geq \mathbb{E}_{p(x,z)}[T(x,z)] - \log\,\mathbb{E}_{p(x)p(z)}[e^{T(x,z)}]\]

where $T$ is a neural network optimized to maximize this bound. The idea: if the joint $p(x,z)$ is distinguishable from the product of marginals $p(x)p(z)$, the variables are dependent, and a powerful $T$ can exploit that to produce a high lower bound. Shuffling $z$ indices breaks the joint structure, giving samples from $p(x)p(z)$ for the second term.

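import math
import torch

def mine_estimate(x, z, T_net, optimizer):
    """One ascent step on the Donsker-Varadhan bound; returns the MI estimate in nats."""
    optimizer.zero_grad()
    # Shuffling z across the batch approximates samples from p(x)p(z).
    z_perm = z[torch.randperm(len(z), device=z.device)]
    joint_score = T_net(x, z).mean()
    # log-mean-exp over the batch, computed stably via logsumexp.
    marginal_score = torch.logsumexp(T_net(x, z_perm).flatten(), dim=0) - math.log(len(x))
    mi = joint_score - marginal_score
    (-mi).backward()  # minimizing -mi maximizes the bound
    optimizer.step()
    return mi.item()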

MINE estimates are unreliable when batch size is small: the log-mean-exp term is a biased estimator with high variance on small batches. In practice, 512+ samples per batch is necessary for stable estimates. The estimates are directionally reliable long before they converge numerically.

The β-VAE: Turning IB Into a Regularizer

The cleanest practical application of the IB principle is the $\beta$-VAE. The $\beta$-VAE objective is:

\[\mathcal{L} = \mathbb{E}[\log p(x \mid z)] - \beta \cdot D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))\]

Averaged over $x$, the KL term is an upper bound on $I(X; Z)$ for any prior $p(z)$, with equality when $p(z)$ is the encoder’s aggregate marginal: $\mathbb{E}_{x}\!\left[D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))\right] \geq I(X; Z)$. At $\beta = 1$ this is the standard VAE. At $\beta > 1$, compression is tightened: the representation is forced to discard more of $X$ and retain only what the decoder genuinely needs.
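For the common case of a diagonal-Gaussian encoder and a standard normal prior (an assumption here; the experiments below do not state the parameterization), the KL term has a closed form, and the objective reduces to a few lines. The function names are illustrative.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def beta_vae_objective(log_likelihood, mu, log_var, beta=4.0):
    """Per-example objective to maximize: log p(x|z) - beta * KL(q(z|x) || p(z))."""
    return log_likelihood - beta * gaussian_kl(mu, log_var)

# A latent code that sits exactly on the prior costs nothing;
# moving the mean one unit away costs 0.5 nats per dimension.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))   # 0.0
print(gaussian_kl([1.0, 0.0], [0.0, 0.0]))   # 0.5
print(beta_vae_objective(-10.0, [1.0, 0.0], [0.0, 0.0], beta=4.0))  # -12.0
```

Raising `beta` multiplies the price of every nat the encoder spends moving away from the prior, which is the mechanism behind the shrinking latent norms in the table that follows.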

In experiments training $\beta$-VAEs on CelebA (64×64 faces) and evaluating on a held-out distribution with different lighting conditions, the results were striking:

$\beta$	Train recon loss	OOD recon loss	Disentanglement score	Latent $\lVert\mu\rVert_2$
1	0.041	0.187	0.43	14.2
4	0.052	0.094	0.71	4.8
8	0.068	0.089	0.79	2.1
16	0.094	0.112	0.81	1.3

The standard VAE ($\beta=1$) has the best in-distribution reconstruction but degrades sharply on OOD examples: it memorized lighting-specific features that don’t generalize. $\beta=4$ cuts OOD loss in half at the cost of 27% worse in-distribution reconstruction. $\beta=8$ is the sweet spot: it has the lowest OOD loss (0.089) at a further cost in-distribution (0.068), and the compressed representation has learned genuinely invariant structure. At $\beta=16$ the OOD loss rises again (0.112), which suggests over-compression.

The disentanglement score (measuring how independently each latent dimension varies) also improves with $\beta$, a byproduct of compression. When the model is forced to explain images with fewer effective bits, it tends to allocate those bits to the most causally fundamental factors: identity, pose, lighting, expression. At $\beta=1$ these factors are entangled; at $\beta=8$ they separate.

What This Changes About How I Think About Regularization

The IB reframes regularization entirely. Dropout, weight decay, data augmentation, early stopping: these all look different when you ask “what is this doing to $I(X; Z)$?” Dropout injects noise into $Z$, forcing the network to average over noisy representations and limiting how much of $X$ they can carry. Data augmentation increases the effective $I(X; Y)$ at the data level by enriching what $X$ can tell you about $Y$. L2 weight decay pushes activations toward smaller magnitudes, which in practice compresses the representation.

None of these regularizers were designed with the IB in mind. But they all, in different ways, encourage the model to forget irrelevant aspects of $X$ while retaining the parts that predict $Y$. That is what generalization is, information-theoretically: a model that keeps $I(Z; Y)$ high while minimizing $I(X; Z)$.

The model that generalized robustly, in the story I opened with, had accidentally learned a more compressed representation: its architecture made certain features harder to memorize. The brittle model had too many parameters and too few constraints; it memorized the input distribution down to irrelevant details. The IB is not a training algorithm. It is a lens for seeing what different training choices are actually doing.

Implementing the Transformer in C++ Without ML Libraries: What You Learn From the Metal

2025-09-01T00:00:00+00:00

There is a version of understanding a transformer where you can recite the equations and draw the architecture diagram. Then there is a deeper version where you know, concretely, what happens to a float when it enters the attention mechanism, which cache line it lives on, what instruction the CPU uses to multiply it, how many copies of it exist simultaneously in memory. I wanted the second kind of understanding. The only way to get it was to implement a transformer from nothing: no PyTorch, no NumPy, no BLAS wrappers.

What follows is not primarily about the code. It is about what the code forced me to see.

The Mathematics You Think You Know

A transformer block is two sublayers connected by residual paths:

\[X' = \text{LayerNorm}(X + \text{MHA}(X))\] \[Y = \text{LayerNorm}(X' + \text{FFN}(X'))\]

Multi-head attention runs $h$ attention functions in parallel, each on a projected subspace of dimension $d_k = d_{\text{model}}/h$:

\[\text{head}_i = \text{Attention}(XW_i^Q,\; XW_i^K,\; XW_i^V)\] \[\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

These equations look simple. What they obscure is that $QK^T$ is a matrix multiply of shape $(n \times d_k) \cdot (d_k \times n)$, producing an $n \times n$ score matrix. For $n = 512$ tokens and $d_k = 64$, that is 262,144 floats, per head, per layer, per forward pass. On 8 heads and 12 layers, a single forward pass materializes $\approx 25$ million attention score floats. Before computing anything. That is what the $O(n^2)$ complexity means, concretely.
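The arithmetic is worth spelling out, since it is the whole argument. A minimal sketch (the helper name is hypothetical):

```python
def attention_score_floats(n_tokens, n_heads, n_layers):
    """Number of entries across all n x n score matrices in one forward pass."""
    return n_tokens * n_tokens * n_heads * n_layers

per_head = attention_score_floats(512, 1, 1)
total = attention_score_floats(512, 8, 12)
# entries per head, total entries, MiB at 4 bytes per float
print(per_head, total, total * 4 / 2**20)  # 262144 25165824 96.0
```

Note that $d_k$ does not appear: the head dimension sets the cost of computing each score, but the size of the score matrix depends only on the sequence length.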

What Building It Forced Me to Learn

Memory layout is everything. My first implementation used vector<vector<float>>, a 2D array where each row is a separate heap allocation. For row-major access (iterating over a row), this is fine. For column access (as in matrix multiply), it is catastrophic: every column element lives in a different cache line. The fix is a flat vector with manual index arithmetic:

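// Row-major layout: A is m x k, B is k x n, C is m x n.
// C must be zero-initialized by the caller, since the kernel accumulates into it.
void matmul(const float* A, const float* B, float* C,
            int m, int k, int n) {
    for (int i = 0; i < m; i++)
        for (int l = 0; l < k; l++) {      // k-loop in the middle = cache-friendly
            const float a = A[i*k + l];    // constant across the inner loop
            for (int j = 0; j < n; j++)
                C[i*n + j] += a * B[l*n + j];
        }
}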

The loop order i, l, j keeps A[i*k + l] constant in the inner loop (one load, $n$ multiply-adds) and accesses B[l*n + j] sequentially (streaming through contiguous memory). This alone gave a 4x speedup over the jagged-array version.

Softmax numerical stability is not optional. The scores $QK^T / \sqrt{d_k}$ can easily reach magnitudes of 10–20 for well-trained models. For float32, $e^{89}$ already overflows to $\infty$. The standard fix, subtracting the row maximum before exponentiating, works because softmax is shift-invariant: $\text{softmax}(x) = \text{softmax}(x - c)$ for any constant $c$. In a raw C++ implementation there is nothing to save you if you forget this. I forgot, and my first attention forward pass produced a matrix of NaNs.
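The failure and the fix are easy to reproduce outside C++. The sketch below uses plain Python, whose floats are float64, so overflow happens near $e^{709}$ rather than $e^{89}$; the toy scores are exaggerated accordingly.

```python
import math

def softmax_naive(row):
    exps = [math.exp(s) for s in row]          # overflows for large scores
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]      # largest exponent is exactly 0
    total = sum(exps)
    return [e / total for e in exps]

scores = [1000.0, 999.0, 998.0]
try:
    softmax_naive(scores)
except OverflowError:
    print("naive softmax overflowed")
print(softmax_stable(scores))
```

Python raises an exception on overflow; C++ float arithmetic instead yields `inf`, and the subsequent `inf / inf` is where the NaNs come from.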

GELU is a one-liner but hides real cost. The approximation $\text{GELU}(x) \approx 0.5x(1 + \tanh(\sqrt{2/\pi}(x + 0.044715x^3)))$ requires a tanh call per element. tanh is 5–10x more expensive than multiplication. For a feed-forward layer with $d_{\text{ff}} = 2048$ and sequence length 512, that is one million tanh calls per forward pass. Production implementations approximate GELU differently (or use SiLU, which is just $x \cdot \sigma(x)$) precisely because of this.
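For reference, here is a small comparison of the exact GELU (via the error function), the tanh approximation quoted above, and SiLU, using only the standard library. The grid range is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

grid = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh approx| on [-6, 6]: {max_err:.2e}")
print(f"tanh calls for d_ff=2048, n=512: {2048 * 512}")
```

The approximation error is small enough that the choice between the variants is usually made on cost, not accuracy.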

The Benchmark Against PyTorch

After getting the implementation correct, I ran it against PyTorch on CPU, sequence length 64, $d_{\text{model}} = 256$, 8 heads:

Implementation	p50 (ms)	p99 (ms)	Notes
C++ naive (jagged array)	51.2	58.4	Separate heap alloc per row
C++ flat array	12.8	14.1	Cache-friendly layout
C++ flat + softmax fix	12.3	13.7	Minor; stability, not speed
PyTorch (CPU, MKL)	0.81	0.94	MKL + AVX-512 + threading

The gap between my 12ms and PyTorch’s 0.81ms is almost entirely the matrix multiply. PyTorch calls MKL, which uses AVX-512 SIMD to process 16 floats per instruction, multi-threaded across CPU cores, with cache-blocked tiling that keeps working sets in L2. Implementing that correctly is the job of BLAS libraries and takes years of CPU-specific engineering. When I linked my flat-array matmul against OpenBLAS directly (one function call to cblas_sgemm), my forward pass dropped to 1.1ms.

The lesson: the transformer mathematics is not complex. The engineering (the SIMD vectorization, the cache tiling, the instruction-level parallelism) is what separates PyTorch’s performance from mine. Understanding that gap is what the exercise was for.

What Lives in the Residual Stream

One thing you only see clearly when you implement it yourself: the residual connections make every layer an additive update to a shared vector. The input $X$ flows through the entire network unchanged at the top level; each attention and FFN block adds a correction term. This is not an architectural detail; it is the reason residual networks train stably. The gradient flows directly from the loss to the input through the residual path, bypassing the non-linearities. A vanishing gradient problem in one layer does not cut off gradient flow to earlier layers.

In the C++ implementation you can inspect $\lVert X' - X \rVert_2$ after each sublayer and watch the residual magnitudes during training. In the early epochs the corrections are large: the network is making big adjustments. As training converges, the corrections shrink, and the residual stream begins to look like a smooth interpolation toward the final answer. This is visible in the code in a way it simply is not in PyTorch, where the residual connections are implicit in the forward method.

The FlashAttention Insight, From the Inside

When I ran a sequence of length 1024 and profiled memory allocations, the attention score matrix was the dominant allocation: 1024×1024 floats per head = 4MB, across 8 heads = 32MB, just for the score matrices. None of this fits in L2 cache (typically 256KB–1MB per core). Every access to the score matrix in the softmax and the final $V$ multiply is an L3 or RAM fetch.

FlashAttention’s key insight, tiling the attention computation so that the score matrix never fully materializes in memory, instead computing block-by-block and keeping running statistics for the softmax, made immediate intuitive sense once I had held the 32MB score matrix in my hands (so to speak) and watched the cache miss rate spike. The optimization is not about FLOPs; it is about not writing those 32MB to RAM in the first place.
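The running-statistics trick can be shown for a single query row in plain Python. This is a sketch of the online-softmax idea only, not the FlashAttention kernel (which also tiles over queries and is written for GPU memory hierarchies); the function names and toy vectors are made up.

```python
import math

def attention_row_direct(q, K, V):
    """Reference: materialize every score for one query, then softmax and mix V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, V)) / total for d in range(len(V[0]))]

def attention_row_tiled(q, K, V, block=2):
    """Same output, but only `block` scores exist at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(V[0])                      # unnormalized weighted sum of V rows
    for start in range(0, len(K), block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
K = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.8], [-0.5, 0.4], [2.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 3.0]]
print(attention_row_direct(q, K, V))
print(attention_row_tiled(q, K, V, block=2))
```

The two functions agree up to floating-point rounding, yet the tiled version never holds more than `block` scores, which is the property that keeps the working set inside fast memory.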

This is what building from the metal gives you: not just the ability to implement FlashAttention, but the felt understanding of why it matters.

The Model That Learned From the Future: A Temporal Leakage Postmortem

2025-08-01T00:00:00+00:00

The validation dashboard said 99.7% precision. We had trained a fraud detection model for a healthcare claims processor, and by every metric it was performing remarkably well. The product team was excited. We were cautious (99.7% felt too good), but we couldn’t find the flaw, so we deployed.

Production precision: 61%.

The gap between 99.7% and 61% is not a model failure. It is a data pipeline failure that the model faithfully reflected. The model learned to predict fraud correctly on validation data because the validation data contained features computed from information that, in production, would not yet exist at prediction time. The model had learned from the future.

What Temporal Leakage Is

Temporal leakage occurs when a training feature is computed using data from after the prediction timestamp. In fraud detection, if the model’s input at time $t$ includes a “rolling fraud rate” feature that is actually computed by looking forward, using claims filed after $t$, then in production that feature will have a completely different value, or won’t be computable at all.

Leakage produces artificially high validation accuracy because the leaked feature is genuinely informative about fraud. The model isn’t wrong to use it; it correctly associates the feature with the label. The problem is that it cannot use it in production, because the future hasn’t happened yet. The model learned a relationship that is real in the training data and impossible in deployment.

In our case, the SQL window function included a forward-looking clause:

-- LEAKY: 15 FOLLOWING includes future claims
AVG(is_fraud) OVER (
    PARTITION BY provider_id ORDER BY claim_date
    ROWS BETWEEN 14 PRECEDING AND 15 FOLLOWING
) AS rolling_fraud_rate

Because the frame is defined in rows, a claim on day $t$ got a fraud rate computed from its own label and the labels of the next 15 claims from the same provider, claims filed on or after $t$ that reflect the same fraudulent pattern as the current claim. That made the feature nearly perfectly predictive at training time and useless at deployment.

Finding the Leak: Mutual Information Audit

We diagnosed the leak by measuring mutual information between each feature and the label separately on in-sample data and on a strict temporal holdout. Features with large in-sample MI but near-zero future-holdout MI are leaky.

Feature	MI (train fold)	MI (temporal holdout)	Gap	Leaky?
rolling_fraud_rate	0.43	0.02	0.41	Yes
procedure_avg_cost_delta	0.31	0.05	0.26	Yes
provider_claim_volume	0.21	0.19	0.02	No
diagnosis_code_risk	0.18	0.17	0.01	No
days_since_last_claim	0.15	0.14	0.01	No

Two features were badly leaky. Together they explained essentially all of the 99.7% validation precision, and their absence in production explained the drop to 61%.
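A minimal version of this audit, for a feature that has already been binned, might look like the sketch below. It uses a plug-in MI estimate and synthetic toy data (not the claims dataset); the function names and the data-generating choices are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def empirical_mi(feature_bins, labels):
    """Plug-in estimate of I(feature; label) in nats from paired discrete samples."""
    n = len(labels)
    joint = Counter(zip(feature_bins, labels))
    pf, pl = Counter(feature_bins), Counter(labels)
    return sum(
        (c / n) * math.log(c * n / (pf[f] * pl[y]))
        for (f, y), c in joint.items()
    )

def leakage_gap(train, holdout):
    """MI(train) - MI(holdout) for one feature; each argument is (bins, labels)."""
    return empirical_mi(*train) - empirical_mi(*holdout)

# Synthetic toy data: the feature tracks the label in-sample but is
# independent of it on the temporal holdout, as a leaky feature would be.
rng = random.Random(0)
y_train = [rng.random() < 0.2 for _ in range(5000)]
f_train = [int(y) if rng.random() < 0.9 else rng.randrange(2) for y in y_train]
y_hold = [rng.random() < 0.2 for _ in range(5000)]
f_hold = [rng.randrange(2) for _ in y_hold]

gap = leakage_gap((f_train, y_train), (f_hold, y_hold))
print(f"MI gap = {gap:.3f} nats")
```

In a real pipeline the flagging threshold on the gap would need calibrating against features known to be clean, since the plug-in estimate carries a small positive bias that grows with the number of bins.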

The Fix: Point-in-Time Features

Every feature must be computed using only data observably available at prediction time. The corrected SQL uses a strictly backward-looking window:

-- CORRECT: 1 PRECEDING excludes current claim, 30 PRECEDING is historical only
AVG(is_fraud) OVER (
    PARTITION BY provider_id ORDER BY claim_date
    ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS rolling_fraud_rate_pit

For the training dataset construction, point-in-time (PIT) correctness extends to table joins: each claim must join to the version of provider features that existed at the time of the claim, not the latest version:

# PIT-correct join: feature_valid_from <= claim_date < feature_valid_to
df = claims.merge(features_history, on="provider_id").query(
    "feature_valid_from <= claim_date < feature_valid_to"
)

This is the slowly-changing dimension type-2 pattern from data warehousing, applied to ML feature engineering.

Results After the Fix

Metric	Leaky model	PIT-correct model
Validation precision	99.7%	84.3%
Production precision	61.0%	82.1%
Validation–production gap	38.7 pp	2.2 pp

Validation precision dropped from 99.7% to 84.3%. This was expected and correct: we were no longer measuring how well the model predicts from the future, but how well it predicts the future from the past. Production precision rose to 82.1%, and the validation-production gap collapsed from 38.7 percentage points to 2.2. The model was finally trustworthy.

What Automated Leakage Detection Looks Like

Leakage detection must be automatic and run before every deployment. The minimum viable check:

Temporal MI audit: for each feature, compute mutual information with the label on both a random split and a temporal holdout split. A large gap flags potential leakage.
Timestamp provenance: verify that the computation timestamp of each feature is strictly before the prediction timestamp. This requires instrumenting the feature store with provenance metadata.
Distribution monitoring: track each feature’s distribution at prediction time versus training time. A leaky feature will shift in production in a predictable direction.
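Of the three checks, timestamp provenance is the simplest to automate. The sketch below assumes the feature store can report, per feature value, the newest input timestamp that went into it; the record layout and key names are hypothetical, and the dates are illustrative.

```python
from datetime import datetime

def provenance_violations(rows):
    """Return feature names whose newest input is not strictly before prediction time.

    Each row is a dict with hypothetical keys:
    'feature', 'latest_input_ts', 'prediction_ts' (datetime objects).
    """
    return sorted({
        r["feature"] for r in rows
        if r["latest_input_ts"] >= r["prediction_ts"]
    })

rows = [
    {"feature": "rolling_fraud_rate",
     "latest_input_ts": datetime(2025, 3, 16), "prediction_ts": datetime(2025, 3, 1)},
    {"feature": "provider_claim_volume",
     "latest_input_ts": datetime(2025, 2, 28), "prediction_ts": datetime(2025, 3, 1)},
]
print(provenance_violations(rows))  # ['rolling_fraud_rate']
```

A check like this can run as a deployment gate: any non-empty result blocks the release until the offending feature is recomputed point-in-time.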

The 99.7% precision that fooled us was a number we wanted to be true. The honest number, 84.3%, was less exciting but represented something real. The single most important discipline in production ML is learning to distrust results that feel too good, building the checks that automatically flag them, and accepting the honest numbers even when they are not the ones you hoped for.

Building a Bayesian A/B Testing System That Knows When to Stop

2025-07-01T00:00:00+00:00

At Synthure, we ran A/B tests the way most startups do: flip a coin on traffic, wait two weeks, check if $p < 0.05$, ship or revert. This worked until we started testing features that affected claim approval rates, where each day of a bad variant cost real money and delayed patient reimbursements. We couldn’t afford to wait two weeks. We also couldn’t afford to stop early and be wrong.

The problem with fixed-horizon tests is structural. The two-week timeline is arbitrary. Stopping early because results look good inflates the Type I error rate: if you check p-values repeatedly, you will eventually find $p < 0.05$ by chance even with no real effect. The Bayesian framework fixes this at the root by replacing the hypothesis test with a decision rule based on the cost of being wrong.
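The inflation from repeated checking is easy to see in an A/A simulation, where both arms share the same true rate and every rejection is a false positive. The sketch below uses a two-proportion z-test at the 5% level; the check interval, sample sizes, and seed are arbitrary choices for illustration.

```python
import math
import random

def z_stat(k_a, k_b, n):
    """Two-proportion z statistic with n users per arm (pooled variance)."""
    pooled = (k_a + k_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (k_b / n - k_a / n) / se

def false_positive_rates(rate=0.10, n_max=1000, check_every=50, n_sims=400, seed=0):
    """A/A test: both arms share `rate`. Returns (peeking FPR, single-look FPR)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        k_a = k_b = 0
        rejected_at_any_look = False
        for n in range(1, n_max + 1):
            k_a += rng.random() < rate
            k_b += rng.random() < rate
            if n % check_every == 0 and abs(z_stat(k_a, k_b, n)) > 1.96:
                rejected_at_any_look = True
        peek_hits += rejected_at_any_look
        final_hits += abs(z_stat(k_a, k_b, n_max)) > 1.96
    return peek_hits / n_sims, final_hits / n_sims

peeking, single_look = false_positive_rates()
print(f"peeking: {peeking:.1%}, single look: {single_look:.1%}")
```

With twenty looks instead of one, the chance that at least one of them crosses the threshold is several times the nominal 5%, even though nothing about the data has changed.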

The Beta-Bernoulli Model

Each user converts (1) or doesn’t (0). The unknown conversion rate $\theta$ gets a Beta prior: $\theta \sim \text{Beta}(\alpha, \beta)$. After observing $k$ conversions in $n$ trials, the posterior is

\[\theta \mid \text{data} \sim \text{Beta}(\alpha + k,\; \beta + n - k)\]

No integration is required: the Beta family is conjugate to the Bernoulli. The posterior mean is $(\alpha+k)/(\alpha+\beta+n)$, converging to the true rate as $n$ grows, with variance shrinking as $O(1/n)$. A flat prior is $\text{Beta}(1,1)$; an informative prior encoding a historical 10% conversion rate is $\text{Beta}(5, 45)$.
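The update is a pair of additions. A minimal sketch, with made-up observation counts:

```python
def beta_posterior(alpha, beta, conversions, trials):
    """Conjugate update: Beta(alpha, beta) prior + Bernoulli data -> Beta posterior."""
    return alpha + conversions, beta + trials - conversions

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Informative prior Beta(5, 45) (mean 10%), then 12 conversions in 100 trials.
a_post, b_post = beta_posterior(5, 45, conversions=12, trials=100)
print(a_post, b_post, round(beta_mean(a_post, b_post), 4))  # 17 133 0.1133
```

The prior behaves like 50 pseudo-observations at a 10% rate, so 100 real observations at 12% pull the posterior mean about two thirds of the way from 10% toward 12%.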

Stopping via Expected Loss

The classical stopping rule is $p < 0.05$. The Bayesian stopping rule is: stop when the expected loss of acting on your current best estimate falls below a threshold $\varepsilon$.

Define the expected loss of shipping variant A when B might be better:

\[\text{EL}(A) = \mathbb{E}[(\theta_B - \theta_A)^+] = \int_0^1\!\int_0^1 \max(\theta_B - \theta_A,\, 0)\; p(\theta_A)\, p(\theta_B)\; d\theta_A\, d\theta_B\]

This is the expected conversion rate you’d leave on the table by shipping A. When $\text{EL}(A) < \varepsilon$ (say 0.001, less than 0.1% expected loss), ship A. The computation is a Monte Carlo estimate over posterior samples, callable after every observation batch:

sa = np.random.beta(alpha_a, beta_a, 50_000)
sb = np.random.beta(alpha_b, beta_b, 50_000)
loss_a = np.mean(np.maximum(sb - sa, 0))
loss_b = np.mean(np.maximum(sa - sb, 0))
# stop when min(loss_a, loss_b) < epsilon

This is not a hypothesis test. It is a conditional expectation, which can be safely recomputed as data arrives. There is no peeking problem because you are not accumulating evidence toward a threshold; you are updating a belief and querying its current decision implications.

Simulation Results

True rates: control 10%, treatment 12% (20% relative lift). 1,000 simulated experiments:

Method	Avg users needed	Power	False positive rate	Avg days (100 users/day)
Fixed horizon ($n=1000$ per arm)	2,000	87%	5.0%	20 days
Bayesian EL < 0.001	847	89%	2.1%	8.5 days
Bayesian EL < 0.005	612	83%	3.4%	6.1 days

The Bayesian test concludes in less than half the time with similar or better statistical properties. When the effect is large and consistent, both posteriors separate quickly and EL drops below threshold early. When there is no real effect, posteriors stay overlapped and EL never triggers: the test withholds judgment rather than returning a false positive through repeated checking.

Thompson Sampling for Ethical Traffic Allocation

During the test, Thompson sampling uses the posterior to allocate traffic rather than splitting 50/50:

theta_a = random.betavariate(alpha_a, beta_a)
theta_b = random.betavariate(alpha_b, beta_b)
variant = "A" if theta_a > theta_b else "B"

As evidence accumulates, the better variant’s posterior tightens and its samples are consistently higher, so it receives more traffic. On the same simulation, Thompson sampling allocates 71% of traffic to the treatment variant by the end, versus 50% under fixed-horizon testing. At 1,000 total users, roughly 210 fewer users see the inferior experience during the test.
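The allocation dynamics can be reproduced with a short simulation. The sketch below assumes flat $\text{Beta}(1,1)$ priors and a fixed seed, so its output will not necessarily match the 71% figure from the original simulation; the function name is hypothetical.

```python
import random

def thompson_share(rate_a, rate_b, n_users, seed=0):
    """Simulate Thompson sampling with Beta(1, 1) priors; return share of users sent to B."""
    rng = random.Random(seed)
    post = {"A": [1, 1], "B": [1, 1]}          # [alpha, beta] per variant
    rates = {"A": rate_a, "B": rate_b}
    sent_to_b = 0
    for _ in range(n_users):
        draws = {v: rng.betavariate(a, b) for v, (a, b) in post.items()}
        variant = "A" if draws["A"] > draws["B"] else "B"
        sent_to_b += variant == "B"
        converted = rng.random() < rates[variant]
        post[variant][0 if converted else 1] += 1   # success -> alpha, failure -> beta
    return sent_to_b / n_users

print(thompson_share(0.10, 0.12, 1000))
```

With rates as close as 10% and 12%, individual runs vary noticeably, so averaging over many seeds gives a fairer picture than any single run.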

In healthcare, this matters not as a statistical nicety but as an ethical alignment. Every user in a test is a person whose reimbursement outcome may be affected by which variant they received. Thompson sampling directly minimizes the harm done while collecting the evidence needed to act.

The Deeper Point

The p-value answers: “if the null hypothesis were true, how surprising would this data be?” The expected loss answers: “given what we currently know, how costly is each decision?” For product teams making real decisions under uncertainty, the second question is almost always the right one. The math (conjugate updating, Monte Carlo loss estimation) is simple. The operational benefit is real: faster decisions, lower false positive rates, and an explicit accounting of the cost of uncertainty that aligns statistics with the goal of treating users well.
