Publications
First authors marked with * contributed equally.
2024
- Preprint: Preference Poisoning Attacks on Reward Model Learning. arXiv preprint arXiv:2402.01920, 2024.
- Preprint: Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment. arXiv preprint arXiv:2402.14968, 2024.
- ICLR 2024: Conversational Drug Editing Using Retrieval and Domain Feedback. In The Twelfth International Conference on Learning Representations, 2024.
2023
- Preprint: On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models. arXiv preprint arXiv:2311.09641, 2023.
- Preprint: Test-Time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations. arXiv preprint arXiv:2311.09763, 2023.
- Preprint: Adversarial Demonstration Attacks on Large Language Models. arXiv preprint arXiv:2305.14950, 2023.
- NeurIPS 2023: On the Exploitability of Instruction Tuning. In Advances in Neural Information Processing Systems, 2023.
- ICML 2023: A Critical Revisit of Adversarial Robustness in 3D Point Cloud Recognition with Diffusion-Driven Purification. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- ICLR 2023: DensePure: Understanding Diffusion Models for Adversarial Robustness. In The Eleventh International Conference on Learning Representations, 2023.
- ICLR 2023: Defending against Adversarial Audio via Diffusion Model. In The Eleventh International Conference on Learning Representations, 2023.

2022
- ICML 2022: Fast and Reliable Evaluation of Adversarial Robustness with Minimum-Margin Attack. In International Conference on Machine Learning, 2022.