In this talk, I propose to discuss the problem of building explainable AI through two approaches: causal and correlational.
I will explain what mechanistic interpretability (mech interp) means for large language models like Gemma. It is a way to understand how models answer questions by looking inside them and checking which neurons activate, and when.
I will discuss circuit-tracer, the Python module that Anthropic open-sourced, and the Neuronpedia portal, which helps us find neurons linked to real-world concepts. We will examine specific prompts on transformers and follow the various paths and "thoughts" that lead to the output. (I find this very interesting.)
I will also talk about my own work on mech interp tooling (modelrecon), built around an "activation cube" data structure (this is not a standard; I came up with it) as a means to share and visualize activation data. I will also cover the "counterfactual" library that I am working on to implement intervention testing correctly.
Underlying all of this is the basic problem of causal vs. correlational techniques, and the limitations of correlational ones.
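
To make the causal vs. correlational distinction concrete, here is a minimal sketch in plain Python. Everything in it is a made-up toy (the two hidden units `h1` and `h2`, their formulas, and the helper names are assumptions for illustration, not part of any real model or library). One unit correlates perfectly with the output without influencing it, and only an intervention (zeroing the unit) reveals that.

```python
import random

def hidden(x):
    """Two hypothetical hidden units computed from the same input x."""
    h1 = 2.0 * x         # the output reads this unit
    h2 = 3.0 * x + 1.0   # computed, but ignored by the output
    return h1, h2

def output(h1, h2):
    """Toy readout that depends only on h1."""
    return h1 + 0.0 * h2

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
h1s, h2s = zip(*(hidden(x) for x in xs))
ys = [output(h1, h2) for h1, h2 in zip(h1s, h2s)]

# Correlational view: h2 tracks the output almost perfectly.
corr_h2 = pearson(h2s, ys)

def ablation_effect(unit):
    """Causal view: zero out one unit and measure the mean change in output."""
    total = 0.0
    for h1, h2, y in zip(h1s, h2s, ys):
        patched = output(0.0, h2) if unit == "h1" else output(h1, 0.0)
        total += abs(patched - y)
    return total / len(ys)

effect_h1 = ablation_effect("h1")  # large: h1 really drives the output
effect_h2 = ablation_effect("h2")  # zero: h2 only looked important

print(f"corr(h2, y) = {corr_h2:.3f}")
print(f"ablate h1 -> mean |change| = {effect_h1:.3f}")
print(f"ablate h2 -> mean |change| = {effect_h2:.3f}")
```

A correlational probe would flag `h2` as important here, while the intervention shows it has no effect on the answer. That gap is the motivation for intervention testing.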
## Why Explainability Matters 2-3 mins
We need to understand why AI models make certain choices, not just what answers they give. Without this, the model feels like a black box. Here I will include an example from human behavior.
---
## What Transformers Hide 2-3 mins
I will walk through the basic internal steps of a transformer and the features that are hard to see, highlighting that tools only show the final output, not the thinking process. In fact, I will point out that it comes as a surprise to most people that we don't know how models "actually" arrive at specific answers.
---
## How Circuit Tracer Helps 3-5 mins
I will talk about how Anthropic’s Circuit Tracer shows the internal connections of the model.
It turns hidden activations into easy-to-understand features and shows how they link together. It is not that easy at first, but we can get used to the graphs (like Link in the Matrix movies, who could understand what was happening just by looking at the Matrix runtime code). I will show some graphs and walk through the reasoning path on Colab.
---
## Seeing the Reasoning Path 10 minutes or more
The tool draws a clear path from
input → inner features → final output
This lets everyone see which parts of the model caused the answer. This part should be fun, as the paths a model takes are sometimes weird.
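
The idea of tracing input → inner features → final output can be illustrated with a tiny hand-built example. This is only a sketch under stated assumptions: the weights `W_in` and `w_out` are invented for illustration, and this is not the circuit-tracer API or its algorithm, just the simplest case where each feature's contribution to the output can be read off directly as activation times output weight.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical weights, chosen by hand for illustration only.
W_in = [
    [1.0, 0.0],    # feature 0 reads input 0
    [0.0, 1.0],    # feature 1 reads input 1
    [-1.0, -1.0],  # feature 2 fires only when the inputs sum below zero
]
w_out = [2.0, -0.5, 1.0]  # how strongly each feature pushes the output

x = [1.0, 2.0]

# input -> inner features
features = relu(matvec(W_in, x))

# inner features -> final output (a single logit, no bias term)
logit = sum(w * f for w, f in zip(w_out, features))

# Direct contribution of each feature to the logit. Because the readout is
# linear and has no bias, these contributions sum exactly to the logit.
contributions = [w * f for w, f in zip(w_out, features)]

# Rank features by how much they moved the output, in either direction.
ranked = sorted(range(len(contributions)),
                key=lambda i: abs(contributions[i]),
                reverse=True)

for i in ranked:
    print(f"feature {i}: activation={features[i]:.1f}, "
          f"contribution={contributions[i]:+.1f}")
print(f"logit = {logit:.1f}")
```

In a real model the path has many more layers and the features are learned rather than hand-written, but the question being asked is the same: which intermediate features pushed the output, and by how much.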
---
## Why This Is Important 2 minutes
With this method, we can:
* check if the model is behaving safely
* fix mistakes inside the model
* build trust by seeing how it thinks
I will finish with a description of some of my work on activation cube data, the PyTorch hook mechanism, and other options for logging data.
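
As a rough illustration of the hook mechanism, the sketch below uses PyTorch's `register_forward_hook` to log the output of each linear layer in a tiny stand-in model, then stacks the results into a single `[layer, token, hidden]` tensor. The stand-in model, the layer names, and the `[layer, token, hidden]` layout are assumptions made for this example. The layout is only one guess at what a cube-like structure could look like and may differ from the actual modelrecon format.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model: three same-width linear layers, so that their
# outputs can be stacked into one tensor.
hidden = 8
model = nn.Sequential(
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden),
)

captured = {}  # layer name -> activation tensor logged by the hook

def make_hook(name):
    """Build a forward hook that stores a detached copy of the layer output."""
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

linear_layers = [m for m in model if isinstance(m, nn.Linear)]
handles = [
    layer.register_forward_hook(make_hook(f"linear_{i}"))
    for i, layer in enumerate(linear_layers)
]

tokens = torch.randn(5, hidden)  # five made-up "token" vectors
with torch.no_grad():
    model(tokens)

# Remove the hooks so later forward passes are not logged.
for handle in handles:
    handle.remove()

# Stack per-layer activations into one tensor: [layer, token, hidden].
cube = torch.stack([captured[f"linear_{i}"] for i in range(len(linear_layers))])
print(cube.shape)
```

Hooks leave the model code untouched, which makes them convenient for logging. The trade-off is that handles need to be removed explicitly once logging is done.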