
Shuning Shang*, Hubert Strauss*, Stanley Wei, Sanjeev Arora, Noam Razin (* equal contribution)
Preprint.
Reinforcement learning for language models often relies on imperfect proxy rewards. While standard reward model metrics treat reward errors as uniformly harmful, we show that their effects depend on the learning dynamics: some errors can trap the policy near mediocre outputs, while others can help it reach higher ground-truth reward. We use this perspective to categorize reward errors, propose harm-aware evaluation metrics for reward models, and discuss implications for reward design in RLHF and verifiable-reward settings.

Roey Magen*, Shuning Shang*, Zhiwei Xu, Spencer Frei, Wei Hu†, Gal Vardi† (* equal contribution)
NeurIPS 2025.
We study benign overfitting in single-head attention, the core of Transformers. We show that under certain conditions, the model can fit noisy training data and still generalize well, even after just two steps of gradient descent. Our results highlight the key role of the signal-to-noise ratio in enabling this behavior.

Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou
Under review.
We study benign overfitting in two-layer ReLU CNNs with all layers trained, going beyond prior work that fixes the output layer. We show that the output layer's initialization scale crucially affects training dynamics and generalization. Our analysis provides sharp conditions under which benign overfitting occurs, supported by matching bounds and experiments.