Yang Zhang (张阳)

My publication list is also available on DBLP and Google Scholar, though those listings may not be up to date.

Selected Publications

"Humans welcome to observe": A First Look at the Agent Social Network Moltbook (Preprint)
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (CCS 2024)
Prompt Stealing Attacks Against Text-to-Image Generation Models (USENIX Security 2024)
SecurityNet: Assessing Machine Learning Vulnerabilities on Public Models (USENIX Security 2024)
You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content (S&P 2024)
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models (CCS 2023)
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models (CCS 2023)
On the Evolution of (Hateful) Memes by Means of Multimodal Contrastive Learning (S&P 2023)
Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots (CCS 2022)
ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models (USENIX Security 2022)
Dynamic Backdoor Attacks Against Machine Learning Models (Euro S&P 2022)
Get a Model! Model Hijacking Attack Against Machine Learning Models (NDSS 2022)
BadNL: Backdoor Attacks Against NLP Models with Semantic-preserving Improvements (ACSAC 2021)
Membership Leakage in Label-Only Exposures (CCS 2021)
Stealing Links from Graph Neural Networks (USENIX Security 2021)
“Go eat a bat, Chang!”: On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19 (WWW 2021)
Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning (USENIX Security 2020)
ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models (NDSS 2019)

2026

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, Yang Zhang; ICML 2026

Position: Preparing for AI Systems That Deceive Developers

Isabella Duan, Xudong Pan, Yawen Duan, Adam Gleave, Ranjie Duan, Yang Zhang, Xiaojian Li, Chaochao Lu, Naying Hu, Sören Mindermann, Dongrui Liu, Jie Fu, Peng Xu, Tianxing He, Xudong Guo, Chen Zheng, Wenqi Chen, Jianfeng Cao, Geng Hong, Jiarun Dai, Yinpeng Dong, Brian Tse, Xia Hu, Min Yang; ICML 2026 (Position Paper Track)

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang; ACL 2026

Open Schrödinger's Closed Box: Identifying Retrieval Augmented Generation in API-Accessible Large Language Model Services

Yukun Jiang, Xinyue Shen, Michael Backes, Zheng Li, Yang Zhang; ACL 2026

DE-CLIP: Few-Shot Anomaly Detection via Difference-Guided Embedding Editing

Yage Zhang, Yukun Jiang, Michael Backes, Yang Zhang; ACL 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Peering Behind the Shield: Guardrail Identification in Large Language Models

InferPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

Rethinking Assessments of Prompt Injection Attacks

Reward Yourself: Efficient Self Rewards for Trustworthy Sampling

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Defeating Cerberus: Privacy-Leakage Mitigation in Vision Language Models

SL-CBM: Enhancing Concept Bottleneck Models with Semantic Locality for Better Interpretability

2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions

UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

SoK: Data Reconstruction Attacks Against Machine Learning Models: Definition, Metrics, and Benchmark

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

From Meme to Threat: On the Hateful Meme Understanding and Induced Hateful Content Generation in Open-Source Vision Language Models

On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts

Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications

Data-Free Model-Related Attacks: Unleashing the Potential of Generative AI

Data Duplication: A Novel Multi-Purpose Attack Paradigm in Machine Unlearning

Enhanced Label-Only Membership Inference Attacks with Fewer Queries

Membership Inference Attacks Against Vision-Language Models

Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data

On the Generalization Ability of Machine-Generated Text Detectors

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs

Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media

White-box Membership Inference Attacks against Diffusion Models

A Comprehensive Study of Privacy Risks in Curriculum Learning

The Ripple Effect: On Unforeseen Complications of Backdoor Attacks

Neeko: Model Hijacking Attacks Against Generative Adversarial Networks

GPTracker: A Large-Scale Measurement of Misused GPTs

On the Effectiveness of Prompt Stealing Attacks on In-The-Wild Prompts

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Towards Understanding Unsafe Video Generation

Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?

2024

The Death and Life of Great Prompts: Analyzing the Evolution of LLM Prompts from the Structural Perspective

ModScan: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models

Membership Inference Attacks Against In-Context Learning

Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution

BadMerging: Backdoor Attacks Against Model Merging

ZeroFake: Zero-Shot Detection of Fake Images Generated and Edited by Text-to-Image Generation Models

SeqMIA: Sequential-Metric Based Membership Inference Attack

MGTBench: Benchmarking Machine-Generated Text Detection

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Instruction Backdoor Attacks Against Customized LLMs

Prompt Stealing Attacks Against Text-to-Image Generation Models

SecurityNet: Assessing Machine Learning Vulnerabilities on Public Models

Quantifying Privacy Risks of Prompts in Visual Prompt Learning

Link Stealing Attacks Against Inductive Graph Neural Networks

Composite Backdoor Attacks Against Large Language Models

Games and Beyond: Analyzing the Bullet Chats of Esports Livestreaming

FAKEPCD: Fake Point Cloud Detection via Source Attribution

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

Test-Time Poisoning Attacks Against Test-Time Adaptation Models

Generated Distributions Are All You Need for Membership Inference Attacks Against Generative Models

VERITRAIN: Validating MLaaS Training Efforts via Anomaly Detection

2023

DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

Differentially Private Resource Allocation

A Plot is Worth a Thousand Words: Model Information Stealing Attacks via Scientific Plots

Two-in-One: A Model Hijacking Attack Against Text Generation Models

UnGANable: Defending Against GAN-based Face Manipulation

FACE-AUDITOR: Data Auditing in Facial Recognition Systems

PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Model

Generated Graph Detection

Data Poisoning Attacks Against Multimodal Encoders