These are lecture notes on this video by Neel Nanda. I’m personally interested in figuring out how to do good research, both solo and with a graduate advisor.

I. Introduction: Why Classify These Approaches?

  • A framework for understanding the different motivations and goals that drive interpretability research, i.e. the study of how complex models (like large language models) work internally.
  • This framework could help group different research efforts and maybe make it clearer why researchers sometimes have different goals or disagree.

I’ve also spent some time thinking about how people choose what to work on. Initially I thought it might just be a matter of picking a narrow field and exploring it, so I tried to build an automated paper classifier.
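
As a purely illustrative aside (my own sketch, not something from the talk), a first pass at such a classifier might be TF-IDF features over abstracts plus logistic regression; the abstracts and field labels below are made up.

```python
# Hypothetical minimal sketch of automated paper classification:
# TF-IDF features over abstracts + logistic regression (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; real usage would pull abstracts from e.g. arXiv.
abstracts = [
    "We analyse attention heads and circuits in a transformer language model.",
    "Sparse autoencoders recover interpretable features from residual streams.",
    "We prove new convergence bounds for stochastic gradient descent.",
    "A convex optimisation view of implicit regularisation in deep networks.",
]
fields = ["interpretability", "interpretability", "optimisation", "optimisation"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(abstracts, fields)

print(clf.predict(["Probing the circuits behind indirect object identification."]))
```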

II. Proposed Ways to Classify Interpretability Research

Four main research mindsets, plus a separate dimension that cuts across them:

  1. Basic Science (of Interpretability): Exploring how models work inside, driven by curiosity.
  2. Science of Language Models: Studying Large Language Models (LLMs) as interesting things in themselves.
  3. Pragmatic Interpretability: Using interpretability tools to achieve specific, practical goals outside the model itself.
  4. Debugging: Using interpretability to find and fix specific problems or mistakes in a model.

A Separate Dimension:

  • Forward-Chaining (Bottom-Up): Starting with exploration, following curiosity and good research judgment without a fixed long-term plan. Fits well with “Basic Science.”
  • Backward-Chaining (Top-Down): Starting with a specific goal and working backward to figure out the steps needed. Fits more with “Pragmatic Interpretability” and “Debugging.”

III. Looking Closer at Each Way of Thinking

(A) Approach 1: Basic Science (of Interpretability)

  • Main Idea: Modeled on the history of basic science, where big breakthroughs often came from unexpected findings by curious people rather than from grand plans aiming at specific results. (Example: number theory wasn’t developed with cryptography in mind, but became essential to it.)
  • Applied to Interpretability:
    • Goal: To understand how models work internally – how to interpret them. Looking for “truth and beauty” in how the model is built and functions.
    • How it’s Done:
      • Develop good judgment about what’s interesting inside models.
      • Follow curiosity.
      • Build tools, methods, and examples (like detailed studies of specific models).
      • Try out new ways to look inside and understand models.
      • Worry less about immediate practical uses or fitting into a big strategic plan.
    • Why this way? Detailed planning is difficult, people aren’t great at predicting the future, and focusing only on set goals might mean missing important, unexpected discoveries. The idea is that building fundamental understanding will help the field in the long run.
    • Speaker’s View: Expresses that he personally “likes this way of thinking.”

(B) Approach 2: Science of Language Models

  • Main Idea: Treat LLMs themselves as complex, “weird alien organisms” or natural things worth studying scientifically.
  • Goal: Understand the basic properties, behaviors, surprising abilities (that appear as models get bigger/more complex), and rules governing LLMs. Uncover their “rich and mysterious structure.”
  • How it’s Done: Doing scientific research focused on the models themselves. This includes studying how they change with size (scaling laws; a rough form is sketched after this list), specific abilities (like planning or reasoning), characteristic failure modes (like excessive eagerness to please or making things up), and built-in biases. Interpretability methods are often used, but the main goal is to understand the LLM itself.
  • Example Questions: Why are reasoning models good? Can models plan? Are they just repeating patterns (“stochastic parrots”)? Why are they sensitive to how you phrase a question? Why do they try to please the user so much (“sycophancy”)? How do they learn abilities (all at once or slowly)? Why can they reason about things they weren’t trained on? Why do models draw self-portraits as sad? Why do they have different personalities?
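
For the scaling-laws item above, the empirical regularity is usually a simple power law; the form below is a common convention with fitted constants, not something stated in the talk.

```latex
% Rough empirical form of a neural scaling law: loss L falls as a power
% law in model size N, with N_c and \alpha constants fitted per setup.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha}
```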

(C) Approach 3: Pragmatic Interpretability

  • Main Idea: Interpretability is a tool used to achieve something else. Applying interpretability methods to solve specific, practical problems or reach certain outside goals.
  • Goal: Being useful and having an impact on specific tasks (like making Artificial Intelligence (AI) safer, fairer, more reliable, more efficient, or better at certain jobs).
  • How it’s Done: Choosing or creating interpretability tools based on how well they work for the specific task. The success of the research is judged by how much it helps solve the practical problem.
  • Example Tasks: Checking if a user has harmful intentions, guiding a model’s behavior (conditional steering), spotting when a model makes things up (hallucinations), finding which training examples caused bad behavior (training data attribution).
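
A rough illustration of the steering idea (my own sketch, not the talk’s method): compute a direction from contrasting prompts and add it to one layer’s residual stream during generation. The model (GPT-2), layer index, prompts, and scale factor are all arbitrary assumptions.

```python
# Hypothetical activation-steering sketch with Hugging Face GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which transformer block's output to steer (arbitrary choice)

def mean_residual(prompts):
    """Mean hidden state at block LAYER's output, averaged over tokens and prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER + 1].mean(dim=1))  # (1, d_model)
    return torch.cat(acts).mean(dim=0)

# Contrast two tiny prompt sets to get a crude "niceness" direction.
steer_vec = mean_residual(["You are very kind.", "Thank you so much!"]) \
          - mean_residual(["You are awful.", "I hate this."])

def hook(_module, _inputs, output):
    # GPT2Block returns a tuple; element 0 is the hidden states.
    return (output[0] + 4.0 * steer_vec,) + output[1:]  # scale chosen by hand

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("My opinion of you is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```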

(D) Approach 4: Debugging

  • Main Idea: Using interpretability tools specifically to figure out the cause of known problems or failures in a model and fix them.
  • Goal: Find the reason for a specific issue (like a bias, a repeated error, a weak spot) and maybe fix it.
  • How it’s Done: Focused investigation using interpretability techniques to understand a specific flaw. Often involves comparing model internals (activations, weights) between cases where the model worked and cases where it failed (see the sketch after this list).
  • How it’s Different: More specific than “Basic Science” or “Science of LLMs.” More focused on fixing internal problems than “Pragmatic Interpretability,” which might focus on external results. Can be thought of as a specific type of Pragmatic Interpretability.
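
As a toy illustration of that pass/fail comparison (hypothetical names, random stand-in data), one simple move is to look for the layer where mean activations diverge most between prompts that trigger the bug and prompts that don’t.

```python
# Hypothetical sketch: which layer's activations differ most between
# passing and failing prompts? Assumes activations were already cached
# (e.g. via output_hidden_states=True) as (n_prompts, n_layers, d_model).
import numpy as np

def divergence_by_layer(good_acts, bad_acts):
    """Per-layer distance between mean activations of passing vs failing prompts."""
    gap = good_acts.mean(axis=0) - bad_acts.mean(axis=0)  # (n_layers, d_model)
    return np.linalg.norm(gap, axis=-1)                   # (n_layers,)

# Random stand-in arrays; in practice these come from running the model
# on prompts where the failure does and does not appear.
rng = np.random.default_rng(0)
good = rng.normal(size=(16, 13, 768))
bad = rng.normal(size=(16, 13, 768))
print("layer with largest pass/fail gap:", int(divergence_by_layer(good, bad).argmax()))
```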

IV. The Forward vs. Backward Chaining Dimension

  • Forward-Chaining (Bottom-Up):
    • Looks like: Following curiosity, developing good judgment, building basic understanding/tools without a fixed long-term strategy.
    • Related Approaches: Mostly “Basic Science,” “Science of LLMs.”
  • Backward-Chaining (Top-Down):
    • Looks like: Starting with a big goal (like “detect deception,” “make Artificial General Intelligence (AGI) safe”) and planning backward to figure out what research needs to be done.
    • Related Approaches: Mostly “Pragmatic Interpretability,” “Debugging.” Can also guide some “Basic Science.”
    • Speaker’s Caution: Pure backward-chaining can fail easily if plans are wrong and might ignore important rules (like ethical limits) if too focused on the end goal (a simple view of maximizing good). However, even people using backward-chaining should use the best available methods, which might come from forward-chaining. The speaker supports having ethical limits (“don’t be evil for complicated reasons”).

V. Key Discussion Points & Questions

  • Basic Science - When is it Useful? Is Basic Science more important now because the field is new?
    • Response: Maybe. New fields need exploration. Older fields rely more on established tools, but there is always room to explore new areas.
  • Basic Science - Ethics: Could this approach be used to justify unethical research (“just exploring”)?
    • Response: This risk exists for any approach if ethical rules are ignored. Most researchers follow them. Being too focused on a goal (backward-chaining) might actually create more ethical risk (thinking the ends justify the means).
  • Defining Mechanistic Interpretability vs. Explainable AI (XAI): Are these fields different? Is it just about the type of models or the methods?
    • Response (Summary): The speaker avoids strict definitions. Mechanistic Interpretability looks at internal workings/thinking using the model’s parts. XAI is wider, might include designing easy-to-understand models from the start. They share many methods (like looking at activations). The speaker cares most about rigor and true understanding, whatever the label. How communities are organized and funded also matters. The difference might be more about focus than basic goals/methods for analyzing existing models.
  • Finding “Truthy” Research: How do you know what research is reliable or valuable?
    • Response: Several things help: thinking probabilistically and updating beliefs with new evidence, exploring broadly when confused, looking for clean and elegant results instead of messy ones, finding surprising results that hold up under scrutiny, checking whether different methods agree, and trying to find flaws in your own conclusions (“red-teaming”). It’s partly a skill learned through experience.

VI. Final Takeaways (Not Stated Directly, but Suggested)

  • These ways of thinking aren’t totally separate; researchers might mix them.
  • The “best” approach might depend on the specific research question, how developed that area is, and time/money available.
  • Understanding these different underlying ideas can help put research priorities and disagreements in the interpretability community into context.