Lightweight adapters trained on interpretability artifacts enable reliable self-interpretation in frozen LMs. A simple scalar affine adapter outperforms baselines on feature labeling, topic identification, and implicit reasoning decoding. Gains scale with model size and are driven mostly by the learned bias vector.
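A minimal sketch of the scalar affine adapter described above, in PyTorch (class name and initialization are illustrative assumptions, not the authors' code): one learned scalar scale plus a learned bias vector, giving d_model + 1 parameters in total.

```python
import torch
import torch.nn as nn

class ScalarAffineAdapter(nn.Module):
    """Scalar affine map v -> scale * v + bias, with d_model + 1 parameters."""
    def __init__(self, d_model: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))        # 1 parameter (scalar scale)
        self.bias = nn.Parameter(torch.zeros(d_model))  # d_model parameters (bias vector)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.scale * v + self.bias
```

Dropping the scale and keeping only the bias vector gives the bias-only variant referenced in the key points.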
Key Points
- d_model+1 params suffice for strong gains
- 85% improvement from bias vector alone
- Generalizes across tasks and model families
Impact Analysis
Makes self-interpretation practical and scalable without model modifications.
Technical Details
The adapter is trained on vector-label pairs drawn from interpretability artifacts; simpler adapters generalize better than higher-capacity ones.
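A hedged training sketch, reusing the ScalarAffineAdapter defined above and assuming Hugging Face GPT-2 as the frozen LM: the adapted vector is prepended as a soft token before a prompt, and only the adapter's d_model + 1 parameters receive gradients from the label tokens. The prompt wording, the example pair, and the injection scheme are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.requires_grad_(False)  # the LM stays frozen; only the adapter is trained

d_model = lm.config.n_embd
adapter = ScalarAffineAdapter(d_model)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# One (vector, label) pair from an interpretability artifact,
# e.g. a feature direction and its text label (placeholder values).
feature_vec = torch.randn(d_model)
label_text = " dates and calendar references"

prompt_ids = tok(" This direction represents", return_tensors="pt").input_ids
label_ids = tok(label_text, return_tensors="pt").input_ids

wte = lm.transformer.wte          # frozen token embedding table
prompt_emb = wte(prompt_ids)
label_emb = wte(label_ids)

for step in range(200):
    soft_token = adapter(feature_vec).view(1, 1, -1)          # adapted vector as a soft token
    inputs_embeds = torch.cat([soft_token, prompt_emb, label_emb], dim=1)
    # Supervise only the label tokens; -100 masks the soft token and prompt positions.
    mask = torch.full((1, 1 + prompt_ids.size(1)), -100, dtype=torch.long)
    labels = torch.cat([mask, label_ids], dim=1)
    loss = lm(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```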