The Basics of Mechanistic Interpretability

13 Jan 2026

Hi! Welcome to my first blog post ever! Something I’ve found particularly interesting in AI recently is Mechanistic Interpretability. This was a concept many people in my postbacc lab were familiar with. We had a reading group on it and a hackathon exploring sparse autoencoders. I found it incredibly intriguing and knew I’d come back to it eventually. However, last year was the first time I was truly exposed to deep learning as a field. I was struggling to understand the intricacies of simple Convolutional Neural Networks, let alone grasp how to extract evidence that these models learn interpretable concepts.

Now, I’m in my second semester of my PhD program, and I took Deep Learning last semester (taught by Sara Beery, Kaiming He, and Omar Khattab). It was an amazing class (I highly recommend it to anyone at MIT who wants a better understanding of deep learning) and I finally feel ready to really dig into the complexities and mysteries of mechanistic interpretability. Before tackling the complicated stuff though, I needed to learn the nitty-gritty basics of this subfield. This blog post serves as a way to solidify my knowledge of the fundamentals and push myself to put these concepts into my own words and teach them to someone else (you!). So, if you’re not future me (who I know will forget things and read this to re-remind herself), then welcome, and I hope this post can help at least a little!

Okay, now for the good stuff:

I recently read the review “Open Problems in Mechanistic Interpretability,” which was published in January 2025. For the most part, I will be summarizing and explaining what I learnt from this paper. A question I had right off the bat when I first heard about this field: What actually is Mechanistic Interpretability?

Here’s how the review defines it: Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals.

Okay, so what does that mean? It means we want to understand the actual operations and patterns happening under the hood and how these affect and lead to the tangible abilities that we observe. We want to know this for all sorts of practical reasons, like understanding what causes harmful outputs, finding where and why models may fail, and building more trustworthy models.

This review identifies three different threads of interpretability research. The first aims at building AI systems that are inherently interpretable by design (e.g. linear models, concept bottleneck models, Kolmogorov-Arnold networks). The second is focused on the overall question “Why did my model make this particular decision?” This research thread led to a lot of local attribution methods (e.g. Grad-CAM, integrated gradients, SHAP, LIME). “Local” methods are techniques that focus on explaining a model’s decision for a specific input. The third thread emerged as models got better at generalization. This thread is focused on the question “How did my model solve this general class of problems?” It focuses on the mechanisms underlying neural network generalization and was therefore dubbed “mechanistic interpretability.”

REVERSE ENGINEERING

Within mechanistic interpretability, there are two main methodological approaches. The first, called reverse engineering, starts with the network itself and works backward to understand how it functions. Reverse engineering involves three main steps.

STEP 1: Decomposition

First, we must decompose or break the network into interpretable pieces. Neural networks have millions to billions of parameters, and it’s hard to know what the right “units” are to study. Should we look at individual neurons, groups of neurons, attention heads, or entire layers of the network? The most obvious starting point is to examine the network’s architectural components. Individual neurons are already there, built into the network’s design, so they’re a natural place to begin. Early work showed that this approach had promise. Researchers found neurons that responded to specific, interpretable concepts. For example, researchers found “curve detector” neurons in vision models that activated strongly for curved edges. In language models, certain attention heads seemed to track syntactic relationships like subject-verb agreement.

For some neurons, this approach works beautifully. You can find neurons that clearly detect specific, interpretable features. But there’s a problem. When researchers started systematically studying neurons, they found something strange. Most neurons didn’t respond to one coherent concept. Instead, they responded to multiple, seemingly unrelated things. Imagine a neuron that activates for the word “king” in English text, base64 encoded strings, the color red in images, and references to Arabic language. What is this neuron “for”? It’s not clear it has a single interpretable function. Neurons like this are called polysemantic, and the leading explanation for them is superposition: the network is trying to represent more concepts than it has neurons to store them in. Networks learn to do this because it’s efficient. If you have 1,000 neurons but need to track 10,000 features, superposition lets you pack more information into limited capacity. Features that rarely appear together can “share” space. This makes individual neurons uninterpretable. A single neuron might participate in representing dozens of different features. Looking at just that neuron, you can’t tell which feature is active at any given moment, which makes interpretability very difficult.
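To make superposition a bit more concrete for myself, I wrote a tiny toy sketch. It packs three made-up features into only two “neurons” by spacing their directions 120 degrees apart. Everything here (the sizes, the directions, the ReLU readout) is my own illustrative assumption, not something taken from the review.

```python
import numpy as np

# Three hypothetical feature directions packed into a 2-neuron space, 120 degrees apart.
angles = np.deg2rad([0.0, 120.0, 240.0])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape: (3 features, 2 neurons)

def encode(features):
    """Write a length-3 feature vector into 2 neuron activations."""
    return features @ W

def decode(neurons):
    """Read each feature back with a dot product, then clip negatives (ReLU)."""
    return np.maximum(neurons @ W.T, 0.0)

# Sparse case: only one feature is active, so it is recovered exactly.
one_hot = np.array([0.0, 1.0, 0.0])
recovered = decode(encode(one_hot))

# Neuron 0 has a nonzero weight for every feature, so on its own it looks polysemantic.
neuron0_weights = W[:, 0]

# Dense case: two features active at once, so interference corrupts the readout.
dense = np.array([1.0, 1.0, 0.0])
dense_recovered = decode(encode(dense))
```

The sparse input comes back intact, while the dense one does not, which is the intuition behind “features that rarely appear together can share space.”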

If individual neurons are mixtures of multiple features, we need a different approach. We need to find the features themselves, not just the neurons. The solution to this is called Sparse Dictionary Learning (SDL). The most common approach is Sparse Autoencoders (SAEs). Here’s how they work. You take the activations from a layer and train a neural network with two parts. The encoder maps activations to a larger set of features (for example, 1,000 neurons might map to 10,000 features). The decoder maps features back to reconstruct the original activations. You add a sparsity penalty that forces the encoder to use only a few features at a time. The result is that you’ve found sparse, interpretable features that explain the layer’s activations. Instead of a neuron that responds to kings, base64, red, and Arabic, you might find separate SAE features like Feature 2847 that activates specifically for English royalty terms, Feature 5012 that activates specifically for base64 strings, Feature 7234 that activates specifically for red colors, and Feature 1653 that activates specifically for Arabic language references. There are also variants of SAEs. Transcoders operate between layers. Instead of just explaining one layer’s activations, they explain how information flows from one layer to the next. This is useful for understanding multi-step computations. Crosscoders work across different models. They find shared features between different networks and help us understand if different models learn similar representations. For example, do GPT-4 and Claude learn the same “royalty” feature?
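Here is a minimal sketch of the SAE forward pass and loss described above, written in plain NumPy with no training loop. The layer sizes, the ReLU encoder, the L1 form of the sparsity penalty, and the coefficient are all assumptions I picked for illustration. Real SAEs differ in these details.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 8, 32   # hypothetical sizes (the text's example is 1,000 -> 10,000)
l1_coeff = 1e-3               # hypothetical weight on the sparsity penalty

# Randomly initialised encoder and decoder weights (untrained, for illustration only).
W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0.0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into a wider feature space, then reconstruct them."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features non-negative
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    features, reconstruction = sae_forward(x)
    recon_error = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon_error + l1_coeff * sparsity

x = rng.normal(size=(16, d_model))   # stand-in for a batch of layer activations
features, reconstruction = sae_forward(x)
loss = sae_loss(x)
```

Training would minimise `sae_loss` over many batches of real activations. I left that out to keep the sketch short.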

SDL gives us interpretable units that actually correspond to coherent concepts. In May 2024, Anthropic researchers trained a sparse autoencoder on Claude 3 Sonnet and discovered a feature that activated specifically for the Golden Gate Bridge. This wasn’t a general “bridge” feature or a “San Francisco” feature. It fired for images of the Golden Gate Bridge, text descriptions mentioning it, and even abstract references to the landmark. Other features they found included ones for DNA sequences, specific programming concepts, and particular styles of mathematical reasoning. These concrete examples show that SAEs can extract genuinely interpretable features that correspond to concepts we can describe in natural language.

However, SDL isn’t a complete solution. The field still has fundamental questions to answer. We don’t have a rigorous definition of what counts as a feature. The “Open Problems in Mechanistic Interpretability” review explicitly identifies this issue as a critical open problem in the field. The authors note that SDL and the superposition hypothesis lack solid conceptual foundations. Without clear definitions of what features are or whether superposition is fundamentally true versus just pragmatically useful, we’re building sophisticated methods on uncertain theoretical ground. SDL is our best current approach for decomposition, and it’s produced impressive results. But as the field matures, we need rigorous answers to these fundamental questions before we can claim to truly understand how neural networks represent information.

Now that we’ve decomposed the network into interpretable features using SDL, the next challenge is understanding what these features actually do.

STEP 2: Description of components

First, we need to explore what causes these features to activate. The simplest approach is to look at highly activating dataset examples. For a given feature, we examine inputs where that feature fires strongly and look for patterns. However, this approach has limitations. It only shows us correlations, not causation. Just because a feature activates on certain inputs doesn’t mean those inputs actually cause the activation in a causal sense. Additionally, we might project our own human interpretations onto patterns that don’t truly reflect how the model works. The paper calls this risk “interpretability illusions,” where we see what we expect to see rather than what’s really there.
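Looking at highly activating examples is simple enough to sketch in a few lines. The texts and activation values below are made up by me purely for illustration.

```python
import numpy as np

# Hypothetical dataset snippets and one feature's activation on each of them.
texts = ["the king spoke", "a red apple", "queen and crown", "base64 blob", "the prince"]
activations = np.array([0.9, 0.0, 0.8, 0.1, 0.7])

def top_k_examples(texts, activations, k=3):
    """Return the k snippets on which the feature fires most strongly."""
    order = np.argsort(activations)[::-1][:k]
    return [(texts[i], float(activations[i])) for i in order]

top = top_k_examples(texts, activations)
```

Reading `top`, it is tempting to call this a “royalty” feature, but that label is exactly the kind of hypothesis that still needs to be tested.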

To get causal explanations, we need attribution methods. These techniques measure the actual causal importance of different parts of the input on a feature’s activation. Attribution methods come in two flavors. Gradient-based methods follow the mathematical gradients backward through the network to see which input elements most strongly influenced the feature. Perturbation-based methods actually change parts of the input and measure how the feature’s activation changes in response. However, both of these methods have their own issues. Gradient-based methods only give us first-order approximations, which may not be accurate. Perturbation methods can push the model off its training distribution and cause unusual behavior.
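Here is a small sketch comparing the two flavors on a toy feature. I assume the feature is `tanh(w · x)` with made-up weights, use gradient-times-input as the gradient-based method, and use zeroing out one input element at a time as the perturbation. These are just convenient choices for illustration.

```python
import numpy as np

w = np.array([2.0, 0.5, 1.0])    # hypothetical feature weights
x = np.array([1.0, 2.0, -1.0])   # hypothetical input

def feature(x):
    """A toy feature that saturates for large inputs."""
    return np.tanh(w @ x)

# Gradient-based: analytic gradient of tanh(w . x), multiplied elementwise by the input.
grad = (1.0 - np.tanh(w @ x) ** 2) * w
grad_attr = grad * x

# Perturbation-based: zero out one input element at a time and measure the change.
occl_attr = np.zeros_like(x)
for i in range(len(x)):
    x_perturbed = x.copy()
    x_perturbed[i] = 0.0
    occl_attr[i] = feature(x) - feature(x_perturbed)
```

Both methods agree on which element matters most and on the sign of each effect, but the gradient score for the first element is much smaller than its perturbation score. The feature is saturated at this input, so the local slope understates how much the element matters, which is the first-order approximation problem in miniature.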

A third approach called feature synthesis combines these ideas. Instead of searching through existing data for examples that activate a feature, we generate new inputs from scratch using optimization. We essentially ask the question “What input would make this feature fire as strongly as possible?” The process uses gradient ascent to modify a random input so that it maximizes the feature’s activation. The criticism of this approach is that these synthetic, optimized examples might not be as useful for understanding the feature as real dataset examples would be.
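A minimal sketch of feature synthesis, assuming a toy linear feature with made-up weights. I use gradient ascent (equivalently, gradient descent on the negative activation) and keep the input on the unit sphere so the activation can’t grow without bound. That constraint is my own simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([3.0, -4.0, 0.0])   # hypothetical feature direction

def feature(x):
    """Toy feature: a dot product with a fixed direction."""
    return float(w @ x)

# Start from a random input on the unit sphere.
x = rng.normal(size=3)
x /= np.linalg.norm(x)

learning_rate = 0.1
for _ in range(200):
    grad = w                      # gradient of the linear feature with respect to x
    x = x + learning_rate * grad  # step uphill to increase the activation
    x /= np.linalg.norm(x)        # project back onto the unit sphere

best_activation = feature(x)
```

The optimized input ends up pointing along `w`, which is the best possible input for this toy feature. With a real network the gradient would come from backpropagation, and the result often looks much less natural than a real dataset example.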

We also need to understand what happens after a feature activates. What impact does it have on the rest of the network and ultimately on the output? One approach is the logit lens. This technique takes activations from the middle of the network and projects them directly to the output vocabulary, skipping the remaining layers. It’s like asking “what is the model thinking at this point?” For language models, this means we can see what words the model is considering even before it finishes processing. The limitation here is that the logit lens only measures the direct effect. It doesn’t capture the indirect effect, which is how the feature influences later layers that then influence the output.
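A toy sketch of the logit lens idea. The vocabulary, the unembedding matrix, and the hidden states are all invented for illustration, and I skip the final normalization layer that a real language model would apply before unembedding.

```python
import numpy as np

vocab = ["Paris", "Rome", "cat", "the"]

# Hypothetical unembedding matrix with shape (d_model=3, vocab_size=4).
W_U = np.array([[1.0, 0.0, 0.0, 0.2],
                [0.0, 1.0, 0.0, 0.2],
                [0.0, 0.0, 1.0, 0.2]])

def logit_lens(hidden, W_U, vocab):
    """Project an intermediate hidden state straight to vocabulary probabilities."""
    logits = hidden @ W_U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return vocab[int(np.argmax(logits))], probs

early_hidden = np.array([0.2, 0.3, 0.1])   # made-up state at an early layer
late_hidden = np.array([2.0, 0.5, 0.0])    # made-up state at a later layer

early_guess, early_probs = logit_lens(early_hidden, W_U, vocab)
late_guess, late_probs = logit_lens(late_hidden, W_U, vocab)
```

Comparing `early_guess` and `late_guess` shows how the model’s “current best guess” can change as information moves through the layers.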

To measure causal effects more completely, we use causal interventions. These are experimental techniques where we literally change values in the network and observe what happens. The most common form is activation patching. Here’s how it works. You run the model on two different inputs that should lead to different answers, like “The Eiffel Tower is in the city of” (answer: Paris) versus “The Colosseum is in the city of” (answer: Rome). During the forward pass on the second input, you replace (or patch) some feature’s activation with its value from the first input. Then you see how the output changes. If patching in the activation from the Eiffel Tower run makes the model output “Paris” instead of “Rome,” you’ve found that this feature was causally responsible for the difference. Related techniques include ablation (deleting activations, for example by setting them to zero) and causal tracing (systematically patching to trace information flow). Another version of this is path patching. Instead of patching a feature and affecting the entire rest of the network, path patching only patches the connection between two specific features. This isolates the effect of Feature A on Feature B specifically, letting us map out precise causal pathways through the network.
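Activation patching clicked for me once I wrote it out on a toy model. Below, a tiny two-layer linear “network” with made-up weights answers Paris for one input and Rome for the other. I patch each hidden unit from the Paris run into the Rome run, one at a time, and track the Paris-minus-Rome logit difference. The weights and the one-hot inputs are purely hypothetical.

```python
import numpy as np

answers = ["Paris", "Rome"]

# Hypothetical weights: hidden unit 0 carries the "which city" signal, unit 1 is generic.
W_in = np.array([[1.0, 0.5],
                 [-1.0, 0.5]])
W_out = np.array([[2.0, -2.0],
                  [1.0, 1.0]])

def run(x, patch=None):
    """Forward pass; optionally overwrite one hidden unit with a stored value."""
    hidden = x @ W_in
    if patch is not None:
        unit, value = patch
        hidden = hidden.copy()
        hidden[unit] = value
    return hidden, hidden @ W_out

def logit_diff(logits):
    """Positive means the model prefers Paris, negative means Rome."""
    return float(logits[0] - logits[1])

x_clean = np.array([1.0, 0.0])     # stand-in for the input whose answer is Paris
x_corrupt = np.array([0.0, 1.0])   # stand-in for the input whose answer is Rome

clean_hidden, clean_logits = run(x_clean)
_, corrupt_logits = run(x_corrupt)

# Patch each hidden unit from the clean run into the corrupted run, one at a time.
patched_diffs = []
for unit in range(2):
    _, patched_logits = run(x_corrupt, patch=(unit, clean_hidden[unit]))
    patched_diffs.append(logit_diff(patched_logits))
```

Patching unit 0 flips the logit difference from -4 to +4, while patching unit 1 changes nothing, so in this toy model only unit 0 is causally responsible for the different answers.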

We can also observe effects on sequential behavior. Steering involves artificially activating certain features (like adding an “honesty vector” to activations) to steer the model’s behavior in interpretable directions. If activating a feature makes the model more honest, we’ve learned something about what that feature represents. Another technique called patchscopes takes an activation from one forward pass and patches it into a different forward pass with a specially designed prompt that helps decode what that activation means. Chain-of-thought is the simplest approach. We just read the model’s step-by-step reasoning in its output text. However, research shows that chains of thought don’t always reflect the model’s actual internal reasoning process. The model might give plausible-sounding explanations that don’t match what’s really happening inside. This is known as the faithfulness issue.
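Here is a toy sketch of steering. I build the steering vector as the difference between the mean activations of two hand-made groups (one common recipe, and an assumption on my part), then add it to a hidden state and watch the output flip. All the numbers are invented.

```python
import numpy as np

# Made-up hidden activations collected on "honest" and "dishonest" prompts.
honest_acts = np.array([[1.0, 0.2], [0.8, 0.0]])
dishonest_acts = np.array([[-1.0, 0.1], [-0.8, 0.1]])

# Difference of means gives a direction pointing from "dishonest" toward "honest".
steering_vector = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)

# Hypothetical readout to logits for ["honest reply", "dishonest reply"].
W_out = np.array([[1.0, -1.0],
                  [0.5, 0.5]])

def output_logits(hidden, alpha=0.0):
    """Add alpha times the steering vector to the hidden state before the readout."""
    return (hidden + alpha * steering_vector) @ W_out

hidden = np.array([-0.5, 0.3])                     # a state that leans dishonest
base_logits = output_logits(hidden)
steered_logits = output_logits(hidden, alpha=1.0)
```

Without steering the toy model prefers the dishonest reply, and with the vector added it prefers the honest one. In a real model the strength `alpha` usually needs tuning, since large values tend to degrade the output.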

STEP 3: Validation of Descriptions

All of these description methods generate hypotheses, not conclusions. We must validate whether our descriptions are actually correct. The paper emphasizes that conflating hypotheses with conclusions has been a major problem in mechanistic interpretability research. Just because an explanation seems plausible doesn’t mean it’s true.

There are several ways to validate our descriptions. We can use our explanation to predict when the feature will activate on new inputs, and test if those predictions are accurate. We can predict what will happen if we ablate or activate the feature based on our understanding. Good explanations should also help us identify why networks fail or produce adversarial examples. If we truly understand a feature, we should be able to build a simple handcrafted replacement that does the same job. We can test our methods on toy networks where we know the ground truth explanation. The highest standard is whether our interpretability method enables us to accomplish real engineering goals better than alternative approaches, not just in cherry-picked examples but on competitive benchmarks.
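The first validation idea (predict activations on new inputs, then check) can be sketched like this. I turn a hypothesized explanation, “fires on royalty words,” into a simple rule and score it by correlation against activations on held-out text. The word list, the texts, and the activation values are all made up.

```python
import numpy as np

# Hypothesis under test: the feature fires on royalty-related words.
royalty_words = {"king", "queen", "prince", "crown"}

def predicted_activation(text):
    """Predict 1.0 if the explanation says the feature should fire, else 0.0."""
    return float(any(word in royalty_words for word in text.lower().split()))

# Held-out inputs and the feature's (made-up) measured activations on them.
held_out = ["the queen waved", "a red car", "crown jewels", "base64 string", "prince of wales"]
actual = np.array([0.9, 0.0, 0.7, 0.1, 0.8])

predicted = np.array([predicted_activation(t) for t in held_out])
score = float(np.corrcoef(predicted, actual)[0, 1])
```

A high `score` supports the explanation and a low one tells us to go back and revise it. This only tests the predictive side; the intervention-based checks need access to the model itself.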

The three steps we’ve discussed so far (decomposition, description, and validation) represent one approach to mechanistic interpretability called reverse engineering. We start with the network, break it into pieces, and try to figure out what those pieces do. But there’s another complementary approach that works in the opposite direction.

CONCEPT BASED INTERPRETABILITY

Instead of starting with the network’s components and asking “what do these do?”, we can start with concepts we care about and ask “where are these represented in the network?” This is called concept-based interpretability. Rather than identifying the roles of network components, we identify network components for given roles.

The most common technique for this is called probing. Here’s how it works. First, you create a dataset where you label inputs with the concept you care about. For instance, you might label sentences with whether they contain gendered language, or label images with whether they contain certain objects. Then you extract the model’s internal activations for each input. Finally, you train a simple classifier (the “probe”) to predict your concept labels from those activations. If the probe succeeds, you’ve found where the concept is represented. For linear probes specifically, the concept corresponds to a direction (a vector) in the activation space. Probing has been used extensively in natural language processing to find representations of syntax, semantics, and factual knowledge in language models. It’s also been applied to vision models and reinforcement learning agents.
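A minimal linear probe, sketched in NumPy. I generate synthetic “activations” in which a binary concept is written along one hidden direction, then train a logistic regression probe with plain gradient descent. The data, dimensions, and hyperparameters are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# Synthetic setup: the concept is written along one hypothetical direction.
true_direction = np.zeros(d)
true_direction[3] = 1.0
labels = rng.integers(0, 2, size=n)
activations = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, true_direction)

train, test = slice(0, 300), slice(300, n)

# Logistic regression probe trained with gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    z = activations[train] @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    err = p - labels[train]
    w -= 0.1 * activations[train].T @ err / 300
    b -= 0.1 * err.mean()

# Evaluate on held-out examples and compare the learned direction to the true one.
pred = (activations[test] @ w + b) > 0
accuracy = float((pred == labels[test]).mean())
probe_direction = w / np.linalg.norm(w)
cosine = float(probe_direction @ true_direction)
```

Here the probe both classifies well and recovers roughly the right direction, but only because I planted the concept there myself. On a real model, high accuracy alone would not tell us that the network actually uses this direction.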

However, probing has a common limitation. A successful probe only shows that information about your concept is present in the activations. It doesn’t prove the network actually uses that information causally for its predictions. For example, imagine you probe a language model for whether a sentence is grammatically correct, and your probe succeeds with high accuracy. This tells you the model’s activations contain information that correlates with grammaticality. But it doesn’t tell you whether the model actually uses this grammatical information when generating text. The information might just be sitting there unused, a byproduct of how the network processes language.

This is why probing should be used to generate hypotheses, not conclusions. If a probe succeeds, that’s evidence that a concept might be causally important. But you need to confirm it with further investigation using causal interventions. One approach uses counterfactual data, where you intervene on the concept of interest. For example, you might use images where a dog has been changed to a cat, while keeping everything else the same. Methods like distributed alignment search and causal probing use these interventions to find representations that are more likely to be causally important rather than merely correlated.

Attribution methods can also help. After finding a concept vector with a probe, you can use attribution techniques to measure whether that vector actually affects the network’s predictions. Various concept erasure methods try to remove the concept from the network’s representations and see if behavior changes. If removing the concept doesn’t affect behavior, it probably wasn’t causally important.
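The simplest version of concept erasure I can think of is projecting a single direction out of the activations. This is a deliberately simplified sketch and not any specific published method. The concept direction and the two downstream readouts are hypothetical.

```python
import numpy as np

# Hypothetical concept direction, for example one found by a linear probe.
concept = np.array([1.0, 0.0, 0.0])

def erase(acts, direction):
    """Remove the component of each activation vector along the given direction."""
    direction = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ direction, direction)

acts = np.array([[2.0, 1.0, 0.5],
                 [-1.0, 0.3, 0.2]])
erased = erase(acts, concept)

# Two made-up downstream readouts: one relies on the concept, one ignores it.
w_uses = np.array([1.0, 0.0, 0.0])
w_ignores = np.array([0.0, 1.0, 1.0])

change_uses = float(np.abs(acts @ w_uses - erased @ w_uses).max())
change_ignores = float(np.abs(acts @ w_ignores - erased @ w_ignores).max())
```

The readout that depends on the concept changes a lot after erasure, while the one that ignores it does not change at all. That difference is the signal we are looking for when asking whether a concept is causally important.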

Even if a probe only finds correlations, that might be acceptable if the correlations generalize to new data. But there’s a risk with high-dimensional activations. You might discover spurious correlations that work on your training data but don’t reflect anything meaningful about how the model actually works. This makes validation essential. You need to test probes on out-of-distribution data to ensure you’ve found general-purpose features rather than dataset-specific patterns.

Probing requires carefully constructed datasets for well-defined concepts. This creates a fundamental limitation. Probing can only find concepts you’ve already defined precisely enough to create data for. It can’t reveal unexpected or novel features in the network the way reverse engineering potentially can. You only find what you’re looking for. Some methods try to be more unsupervised. For example, Contrast-Consistent Search (CCS) doesn’t require explicit labels. Instead, it finds axes in activation space that correspond to true versus false propositions by enforcing consistency constraints. But even CCS needs datasets with clear positive and negative cases.
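To see what “enforcing consistency constraints” means, here is my sketch of the CCS objective as I understand it from the original CCS paper. For each statement and its negation, the probe’s two probabilities should sum to one (consistency) and should not both sit at 0.5 (confidence). The example probabilities are made up.

```python
import numpy as np

def ccs_loss(p_pos, p_neg):
    """CCS-style loss on probe outputs for statements (p_pos) and their negations (p_neg)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # discourage answering 0.5 for both
    return float(np.mean(consistency + confidence))

# A probe that is consistent and confident.
good = ccs_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# A probe that always says 0.5: consistent but uninformative.
degenerate = ccs_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

# A probe that calls both a statement and its negation true.
inconsistent = ccs_loss(np.array([1.0]), np.array([1.0]))
```

No truth labels appear anywhere in the loss, which is the sense in which the method is unsupervised. It still needs paired statements and negations to compute `p_pos` and `p_neg`.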

Reverse engineering and concept-based interpretability are complementary. Reverse engineering helps us discover what the network has learned, including potentially unexpected features. Concept-based interpretability helps us verify whether specific concepts we care about are present and being used. In practice, researchers often use both approaches together. You might use reverse engineering to decompose a network and generate hypotheses about what features mean, then use probing to test whether those features really represent the concepts you think they do. Both approaches face the same underlying challenge though. Whether we’re describing components we’ve discovered through decomposition or localizing concepts we’re searching for with probes, we’re making hypotheses that require validation. We need causal interventions, counterfactual testing, and rigorous evaluation to know if we’ve truly understood how the network works.

Circuit discovery is another specific methodology within concept-based interpretability. The goal of circuit discovery is to find the minimal “circuit” (or subgraph) of network components responsible for a specific task. Here’s how circuit discovery typically works. First, you choose a task the model can perform and create a dataset for it. This is a concept-based step since humans define what the task is. For example, you might study how the model performs indirect object identification (figuring out “Mary” is the indirect object in “John gave Mary the ball”), or how it detects whether code has a security vulnerability, or how it translates between languages. Next, you represent the network as a directed graph. Nodes are components like attention heads, MLP layers, or SDL latents. Edges are the connections between them. This gives you a map of all possible pathways information could flow through the network. Now you identify which nodes and edges actually matter for your task. You use causal interventions like activation patching, or attribution methods like integrated gradients, to test each component. Components that matter for the task become part of your circuit. Components that don’t matter get pruned away. After identifying the relevant subgraph, you describe what each component in the circuit actually does. Researchers rely on intuition to generate hypotheses about what each node or edge contributes, then design custom experiments to test those hypotheses.
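The “test each component, keep what matters, prune the rest” loop can be sketched on a toy model. Below I treat four hidden units as the candidate nodes, mean-ablate each one in turn, and keep the units whose ablation changes the output by more than a threshold. The model, the threshold, and the choice of mean ablation are all my own assumptions, and I only prune nodes, not edges.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # made-up task dataset

W_in = np.eye(4)                         # four hypothetical hidden components
w_out = np.array([2.0, 0.0, -1.5, 0.01]) # only components 0 and 2 matter much

def model(X, ablate=None):
    """Toy model; optionally mean-ablate one hidden component."""
    hidden = np.tanh(X @ W_in)
    if ablate is not None:
        hidden = hidden.copy()
        hidden[:, ablate] = hidden[:, ablate].mean()  # replace with its dataset mean
    return hidden @ w_out

baseline = model(X)

def ablation_effect(component):
    """Mean squared change in the output when one component is ablated."""
    return float(np.mean((model(X, ablate=component) - baseline) ** 2))

effects = [ablation_effect(i) for i in range(4)]
threshold = 0.05                         # hypothetical cutoff for "matters"
circuit = [i for i, effect in enumerate(effects) if effect > threshold]
```

The resulting `circuit` contains only the two components with large output weights. Describing what those components actually compute is the part that remains manual.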

While circuit discovery has produced valuable insights, it has serious limitations. First, starting with human-defined tasks may not be the right approach, since circuits describe average performance across a task, but individual datapoints can vary widely in how the model processes them. Second, the underlying decomposition methods (architectural components or SDL latents) are imperfect, as we discussed earlier. Third, testing every component individually is expensive, and gradient-based approximations like attribution patching only give first-order estimates that may not be adequate. Fourth, current methods focus on components that increase task performance but miss components that suppress it, which are still important to understand. Finally, researchers tend to study deliberately simple, easy-to-analyze tasks, which can give a misleading impression of how tractable circuit discovery is. When researchers have tried to study arbitrary circuits or more complex behaviors, attempts have been much less successful.

Circuit discovery, as we’ve seen, requires substantial manual effort at every step. Researchers must design experiments, test hypotheses, and iterate through many cycles of description and validation. Historically, mechanistic interpretability has been manual and labor-intensive work on relatively small models. To make interpretability useful for state-of-the-art systems, we need scalable, automated approaches. One area where automation has shown promise is feature description and validation. Researchers have used language models to automatically generate natural language descriptions of neurons in image models, neurons in language models, and sparse autoencoder latents. Another area of progress is automating circuit discovery. Methods like Automated Circuit DisCovery (ACDC) and follow-up work automate the process of identifying computational subgraphs for specific tasks. However, these methods only automate finding the relevant subgraph. They don’t automate the crucial step of describing what those components actually do, which still requires manual interpretation.

Despite progress in automation, we’re still far from fully understanding neural networks. The fundamental challenges remain: we don’t have a rigorous definition of what features are, our decomposition methods are imperfect, and validation is difficult. SDL and circuit discovery are our best current tools, but they’re built on uncertain theoretical foundations. This is what makes mechanistic interpretability such an exciting field!! We’re developing new methods to peer inside these black boxes, even as those methods reveal how much we still don’t understand. As I continue learning about this field, I’m struck by how much work remains, especially in applying these methods to areas like healthcare AI. In future posts, I hope to dive deeper into specific topics! For now, I hope this overview has given you a solid foundation in the basics of mechanistic interpretability. :)