Posts by Tags

Singular Learning Theory and the Geometry of Neural Network Interpretability

13 minute read

Published: March 01, 2026

Ask a mechanistic interpretability researcher why neural networks form the circuits they do and you will get an honest answer: we observe them, name them, and ablate them, but we lack a theory of why they emerge. This is not a complaint about the field; the empirical discoveries are real and important. It is a statement about what is missing. What we need is a mathematical account of why, given a data distribution and an architecture, gradient descent converges to representations with specific structural properties rather than others.

Kolmogorov Complexity, Solomonoff Induction, and the Philosophical Limits of Aligned AGI

12 minute read

Published: May 01, 2026

Begin with the simplest possible question about intelligence: what does it mean to learn? Not to fit a curve, not to minimize a loss, but to genuinely induce the right explanation from evidence. This question has a precise mathematical answer, one that was worked out in the 1960s by Kolmogorov, Solomonoff, and Chaitin, and extended by Hutter in the 2000s into a formal theory of optimal rational agency. The answer is beautiful and the theory is complete. It is also, on close inspection, deeply troubling for the project of value alignment.

The Mathematics of Adversarial Robustness: Certified Defenses, Lipschitz Geometry, and the Limits of Perturbation Sets

13 minute read

Published: April 01, 2026

In 2013 Szegedy et al. showed that an AlexNet classifier, then state of the art on ImageNet, could be fooled by adding imperceptibly small perturbations to an input image. The perturbations were invisible to human eyes, no larger than the noise in a compressed JPEG, but they caused confident, catastrophically wrong predictions. The model saw a school bus and called it an ostrich. A decade later, after thousands of papers on attacks and defenses, the phenomenon is still not fully understood. State-of-the-art models remain vulnerable in ways that defy intuitive explanation.

Implementing AlphaZero for Connect Four: MCTS + Neural Policy in C++ and Python

7 minute read

Published: November 01, 2025

Around iteration 40 of training, something changed. The agent, which had been playing essentially random Connect Four with a mild center preference, started blocking threats it had no reason to know about. A human playing against it dropped a piece that created a diagonal three-in-a-row. The agent, on its next move, dropped a piece that blocked the winning extension. Not because it had been told about diagonals. Because 40 iterations of self-play had accumulated enough evidence that unblocked diagonals eventually lead to losses.

Implementing the Transformer in C++ Without ML Libraries: What You Learn From the Metal

6 minute read

Published: September 01, 2025

There is a version of understanding a transformer where you can recite the equations and draw the architecture diagram. Then there is a deeper version where you know, concretely, what happens to a float when it enters the attention mechanism, which cache line it lives on, what instruction the CPU uses to multiply it, how many copies of it exist simultaneously in memory. I wanted the second kind of understanding. The only way to get it was to implement a transformer from nothing: no PyTorch, no NumPy, no BLAS wrappers.

Can You Trust a Probability That Was Never Checked Against Reality?

13 minute read

Published: June 16, 2026

Most risk dashboards report a number and ask you to trust it. A model says there is a 73 percent chance of an outbreak, and you have no way to know whether, across all the times it has said 73 percent, an outbreak actually followed 73 percent of the time, or 30 percent, or 95 percent. The number looks authoritative because it has a decimal point, but a probability that has never been checked against what actually happened is not really a probability. It is a feeling with a unit attached. I built MOSAIC in large part to take that problem seriously, and the discipline it forces (define a falsifiable quantity, then prove it is calibrated) turns out to matter far beyond epidemics.

Building a Bayesian A/B Testing System That Knows When to Stop

4 minute read

Published: July 01, 2025

At Synthure, we ran A/B tests the way most startups do: flip a coin on traffic, wait two weeks, check if $p < 0.05$, ship or revert. This worked until we started testing features that affected claim approval rates, where each day of a bad variant cost real money and delayed patient reimbursements. We couldn’t afford to wait two weeks. We also couldn’t afford to stop early and be wrong.

The Model That Learned From the Future: A Temporal Leakage Postmortem

4 minute read

Published: August 01, 2025

The validation dashboard said 99.7% precision. We had trained a fraud detection model for a healthcare claims processor, and by every metric it was performing remarkably well. The product team was excited. We were cautious (99.7% felt too good), but we couldn’t find the flaw, so we deployed.

Reproducing Double Descent: The Experiment That Broke Classical Learning Theory

4 minute read

Published: June 01, 2025

Classical learning theory told us that the bias-variance tradeoff produces a U-shaped test-error curve: increase model capacity and test error first falls, then rises as the model starts memorizing. Every textbook contains this curve. It was the theoretical foundation for why we regularize, why we use validation sets, and why we prefer smaller models when data is limited.

Why Adam Works: Understanding Every Major Optimizer Through the Loss Landscape

5 minute read

Published: May 01, 2025

During my first serious training run at Synthure, I watched a model converge beautifully for 40 epochs, then diverge. Learning rate too high, I assumed. I halved it. It diverged again, faster. I halved it again. Now it converged but plateaued far above the target loss. After two days I realized what was actually wrong: I was using SGD with momentum on a loss landscape with wildly different curvatures along different parameter directions, and no single learning rate could handle both the shallow ravines and the steep walls simultaneously.

How Much of a Prompt Can You Delete Before the Answer Breaks?

16 minute read

Published: June 15, 2026

Every token you send to a language model costs you twice. It costs money, because providers bill per token, and it costs latency, because attention is quadratic in sequence length, so doubling the context can quadruple the work the model does to read it. For a single short prompt none of this matters. For a production application that stuffs retrieved documents, conversation history, tool outputs, and system instructions into every call, it matters enormously, and it is the difference between an app that is cheap and fast and one that is neither.

Why Does a Valid Proof Tell You Nothing About Whether You Proved the Right Thing?

22 minute read

Published: June 01, 2026

When you prove something with a computer, the workflow has two halves that are easy to confuse. First you write down a statement, which formal-methods people call a specification, or spec for short. A spec is a precise description, in a language a machine can read, of what your code or your theorem is supposed to do. Then you write a proof that your work satisfies that spec, and you hand both to a proof checker like Lean or Axiom’s AXLE. The checker does something genuinely remarkable: it tells you, with certainty, whether the proof is valid. No hand-waving, no “looks right to me,” just a mechanical verdict that the proof establishes the statement.

The Information Bottleneck: Deriving Optimal Representations From First Principles

7 minute read

Published: October 01, 2025

Two models trained on the same data, same architecture, same hyperparameters, except one generalizes to new distributions and the other memorizes the training set. This was the puzzle I kept running into. Validation accuracy looked identical during training. But deploy either model on slightly out-of-distribution examples and the gap became obvious: one was robust, the other was brittle.

When RAG Fails: Building a GraphRAG System for Multi-Hop Reasoning

7 minute read

Published: January 01, 2026

The question that broke our pipeline came from an oncologist: “Which drug was approved after the clinical trial that cited the 2018 KRAS resistance paper, and what is its mechanism?” Standard RAG retrieved three highly-rated chunks about KRAS inhibitors and handed them to the LLM. The LLM answered confidently and completely incorrectly.

Reducing Production Inference Latency by 10x: A Profiling Story

6 minute read

Published: December 01, 2025

A model serving endpoint at Synthure had a p99 latency of 4.2 seconds. Physicians were waiting that long, four seconds, for coding recommendations during patient encounters. The product team had assumed LLM inference was the problem. We had discussed switching to a smaller model, accepting worse accuracy in exchange for speed. Before doing that, we profiled.

Aravind Kannappan

AI

AI safety

C++

MCTS

adversarial examples

algorithmic information theory

alignment

bayesian

bayesian inference

calibration

data engineering

deep learning

efficiency

failure modes

formal methods

generalization

information theory

interpretability

knowledge-graphs

llm

llms

math

mathematics

open source

optimization

performance engineering

philosophy

product engineering

production ML

production-ml

rag

reinforcement learning

representation learning

robustness

singular learning theory

software engineering

statistics

theory

transformers

verification