Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi*, Jing Huang*, Sally Zhu*, Diyi Yang†, Christopher Potts†, Percy Liang†
NeurIPS, 2025, Spotlight
|
Demystifying Verbatim Memorization in Large Language Models
Jing Huang, Diyi Yang*, Christopher Potts*
EMNLP, 2024
Featured on Stanford AI Lab Blog, NNSight Mini Paper Tutorials / Project Page
|
Causal Abstraction and Generalization
|
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang*, Junyi Tao*, Thomas Icard, Diyi Yang, Christopher Potts
ICML, 2025
Actionable Interpretability Workshop @ ICML, 2025, Oral Presentation
Talk / Project Page
|
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
JMLR, 2025
|
Automating and Evaluating Interpretability Tools
|
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
ACL, 2024
Featured on Anthropic Transformer Circuits Thread / Project Page
|
Rigorously Assessing Natural Language Explanations of Neurons
Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, Christopher Potts
BlackboxNLP, 2023, Best Paper Award
Project Page
|
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Zhengxuan Wu*, Aryaman Arora*, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts
ICML, 2025, Spotlight
Project Page
|
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar*, Atticus Geiger*
ICLR, 2025
Project Page
|
Misc
|
I like doing puzzle hunts. My first PhD project was building a cryptic crossword solver. It turns out that we need to teach these subword-based language models about characters first!
|
I am not on any social media. You can find me via email or Slack.
|