Mechanistic Science of AI

Opening the black box of neural language models

Cognitive science studies the mind through behavior; neuroscience asks what mechanisms inside the brain give rise to that behavior. A century of probing, recording, and lesioning has revealed that complex cognition arises from structured neural systems (specialized regions, modular networks, and characteristic circuits) rather than from a homogeneous substrate. Today’s neural language models pose an analogous question: behind their behavior lies a vast network of weights, activations, and attention patterns, but what computational structure actually produces the behavior we observe? Mechanistic science of AI seeks to open this black box, drawing on methods from systems neuroscience and interpretability to identify the circuits, representations, and algorithms inside models that produce intelligent behavior.

Papers

Modular Cognitive Architecture Emerges in Large Language Models
Pengrui Han, Jacob Andreas, Evelina Fedorenko†, and Andrea Gregor de Varda† († co-senior authors)
Preprint, 2026

Through circuit analyses across 46 tasks spanning language, formal reasoning, social reasoning, and physical reasoning, we find that LLMs develop a modular cognitive architecture mirroring the human brain: tasks drawing on the same network in humans recruit overlapping neurons in LLMs, whereas tasks drawing on different networks recruit distinct neurons. Modularity may be a fundamental principle of intelligent systems rather than a biological accident.
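To make the notion of "overlapping neurons" concrete, here is a minimal sketch of one way such overlap could be quantified. It is not the paper's method: the task names, the importance scores, the top-k selection rule, and the use of Jaccard similarity are all assumptions chosen for illustration.

```python
from itertools import combinations


def top_k_neurons(scores, k):
    """Return the set of indices of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-neuron importance scores for three made-up tasks.
# In practice these would come from some attribution or ablation analysis.
task_scores = {
    "syntax":     [0.9, 0.8, 0.7, 0.1, 0.0, 0.2, 0.1, 0.0],
    "semantics":  [0.8, 0.9, 0.1, 0.7, 0.0, 0.1, 0.2, 0.0],
    "arithmetic": [0.0, 0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.1],
}

# Select the top-3 neurons per task, then compare every pair of tasks.
selected = {task: top_k_neurons(s, k=3) for task, s in task_scores.items()}
overlaps = {
    (a, b): jaccard(selected[a], selected[b])
    for a, b in combinations(selected, 2)
}

for pair, value in overlaps.items():
    print(pair, round(value, 2))
```

In this toy setup, the two language-like tasks share most of their selected neurons while the arithmetic task shares none with either, which is the qualitative pattern a modular architecture would predict.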