Jing Huang - explanare.github.io

Jing Huang

I am a PhD student in the Stanford NLP Group, advised by Profs. Christopher Potts and Diyi Yang. I am interested in understanding what makes neural network models generalize well, usually by studying the causal mechanisms that connect model behaviors, internal representations, and training data.

Prior to Stanford, I was at Google Research. Before that, I did my undergrad at the University of Illinois at Urbana-Champaign, advised by Prof. Svetlana Lazebnik.

Google Scholar / GitHub

Research

Memorization

Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi*, Jing Huang*, Sally Zhu*, Diyi Yang†, Christopher Potts†, Percy Liang†
NeurIPS, 2025, Spotlight 🌟

Demystifying Verbatim Memorization in Large Language Models
Jing Huang, Diyi Yang*, Christopher Potts*
EMNLP, 2024
Featured on Stanford AI Lab Blog, NNSight Mini Paper Tutorials / Project Page

Causal Abstraction and Generalization

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang*, Junyi Tao*, Thomas Icard, Diyi Yang, Christopher Potts
ICML, 2025
Actionable Interpretability Workshop @ ICML, 2025, Oral Presentation 🌟
Talk / Project Page

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
JMLR, 2025

Automating and Evaluating Interpretability Tools

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
ACL, 2024
Featured on Anthropic Transformer Circuits Thread / Project Page

Rigorously Assessing Natural Language Explanations of Neurons
Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, Christopher Potts
BlackboxNLP, 2023, Best Paper Award 🏆
Project Page

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Zhengxuan Wu*, Aryaman Arora*, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts
ICML, 2025, Spotlight 🌟
Project Page

HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar*, Atticus Geiger*
ICLR, 2025
Project Page

Misc

I like doing puzzle hunts. My first PhD project was building a cryptic crossword solver. It turns out that we need to teach these subword-based language models about characters first!

I am not on any social media. You can find me via email or Slack.