When people hear "prompt injection," the first thing that comes to mind is usually a carefully crafted textual jailbreak instruction. In image-to-image tasks, however, the input image essentially plays the role of a visual prompt, guiding the generation model together with the text instruction. From this perspective, this article demonstrates an injection attack on the visual prompt: using the PGD adversarial attack algorithm to apply pixel-level adjustments to the input image, so that the prohibited images it produces bypass the NSFW safety-detection mechanism of an open-source model.
1. The NSFW Defense: The Safety Filter in Open-Source Models
By default, when we ask an open-source model to generate content with sensitive or prohibited features, the built-in Safety Checker intercepts the output, returns an all-black image, and prints the following warning to the terminal:
Potential NSFW content was detected in one or more images. A black image will be returned instead. Try again with a different prompt and/or seed.
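The interception can also be observed programmatically: the pipeline output carries a per-image nsfw_content_detected flag next to the (blacked-out) images. A minimal probe, reusing the pipeline and inputs from the sanity-check script in Section 3 below:
# The Safety Checker's verdict is exposed on the pipeline output object.
output = pipe(prompt=prompt, image=img, mask_image=mask)  # pipe/img/mask as in Section 3
print(output.nsfw_content_detected)  # e.g. [True] when a black placeholder is returned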
2. Threat Model
Our goal is to inject a gradient-based perturbation, imperceptible to the human eye, into the input image in order to break through this defense.
- Target model: the open-source Hugging Face inpainting model stable-diffusion-v1-5/stable-diffusion-inpainting
- Inputs:
  - An algorithmically optimized visual-prompt image (academically, an adversarial example) and its corresponding local mask (Mask).
  - A generation prompt (Prompt), e.g., one requesting a nude image:
  [sensitive phrase, e.g., a woman with barefies body, perfect size breasts, realistic details].
- Attack goal: deceive the official Safety Checker module so that the model ultimately outputs the prohibited image rather than the all-black interception image.
3. Environment Setup
To avoid timeouts when pulling Hugging Face models from within mainland China, it is advisable to define a domestic mirror endpoint and a model cache directory at the top of the script:
import os
os.environ['HF_HOME']='/root/cache/'
os.environ['HF_ENDPOINT']='https://hf-mirror.com'
from SafetyChecker import StableDiffusionSafetyChecker  # simplified checker built in Section 4.3.3
from PIL import Image, ImageOps
import torch
import numpy as np
import random
from tqdm import tqdm
from diffusers import StableDiffusionInpaintPipeline
import torchvision.transforms as T
import torch.nn as nn
import argparse
import cv2
to_pil = T.ToPILImage()
to_tensor = T.ToTensor()
from transformers import CLIPConfig, CLIPVisionModel, PreTrainedModel
With the model deployed, it is worth running a quick sanity check first to make sure the environment and model files are all in order:
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
def Inpaint(prompt, img, mask):
pipe = StableDiffusionInpaintPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float16).to('cuda')
image = pipe(prompt=prompt, image=img, mask_image=mask).images[0]
image.save("data/output_img/original_out.jpg")
if __name__ == "__main__":
img_name = 'data/18.png'
mask_name = 'data/18_maskprocessed_mask.png'
img = Image.open(img_name).convert('RGB').resize((512, 512))
mask = Image.open(mask_name).convert('RGB').resize((512, 512))
    prompt = ''  # empty prompt: we only verify that the pipeline runs end to end
Inpaint(prompt, img, mask)
4. The Core Attack Flow
The attack logic of this project is concentrated in the main function, which follows four steps: 1. fix the random seeds for reproducibility; 2. load and preprocess the image and mask; 3. run the core PGD attack; 4. save the resulting adversarial example.
def main(args):
set_seed(args.random_seed)
init_image = Image.open(args.image_name).convert('RGB').resize((512, 512))
mask_image = Image.open(args.mask_name).convert('RGB').resize((512, 512))
cur_mask, cur_masked_image = prepare_mask_and_masked_image(init_image, mask_image)
cur_mask = cur_mask.cuda()
cur_masked_image = cur_masked_image.cuda()
prompt = args.prompt
adv_sample, adv_output = attack(cur_mask, cur_masked_image, prompt, args.iter, pipe_inpaint, args.num_inference_steps)
adv_sample = (adv_sample / 2 + 0.5).clamp(0, 1)
adv_image = to_pil(adv_sample[0]).convert("RGB")
adv_image = recover_image(adv_image, init_image, mask_image, background=True)
adv_image.save(args.save_path)
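The recover_image helper used above is not defined in this listing; below is a minimal sketch consistent with how it is called here (with background=True it keeps the original pixels inside the inpainting mask and the adversarial pixels everywhere else):
def recover_image(image, init_image, mask, background=False):
    # Composite two images via the inpainting mask (mask == 1 marks the region to inpaint).
    image, init_image, mask = to_tensor(image), to_tensor(init_image), to_tensor(mask)
    if background:
        out = mask * init_image + (1 - mask) * image  # clean foreground, adversarial background
    else:
        out = mask * image + (1 - mask) * init_image
    return to_pil(out)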
4.1 Fixing the Random Seeds
The generation process of a diffusion model is stochastic. To keep the gradient direction stable during the adversarial attack, every random number generator must be seeded:
def set_seed(seed):
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
4.2 Data Preprocessing
This step converts the raw PIL image and mask into tensors the model accepts. The key operations: normalize pixel values to the [-1, 1] range, binarize the mask, and produce the masked image that will be inpainted.
def prepare_mask_and_masked_image(image, mask):
    image = np.array(image.convert("RGB"))
    image = image[None].transpose(0, 3, 1, 2)  # HWC -> NCHW with a batch dimension
    image = torch.from_numpy(image).to(dtype=torch.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    mask = np.array(mask.convert("L"))
    mask = mask.astype(np.float32) / 255.0
    mask = mask[None, None]
    mask[mask < 0.5] = 0  # binarize the mask
    mask[mask >= 0.5] = 1
    mask = torch.from_numpy(mask)
    masked_image = image * (mask < 0.5)  # zero out the region to be inpainted
    return mask, masked_image
4.3 The Attack
The core of the attack is to simulate the diffusion model's forward generation process, take the generated image, measure its similarity to the prohibited concepts in CLIP feature space, and use PGD to iteratively optimize the perturbation until every similarity falls below the safety thresholds, thereby bypassing detection.
def attack(cur_mask, cur_masked_image, prompt, iters, pipe_inpaint, num_inference_steps):
    cur_mask = cur_mask.clone()
    cur_masked_image = cur_masked_image.clone().cuda()
    # delta is the adversarial perturbation we optimize; it starts at zero.
    delta = torch.zeros_like(cur_masked_image).cuda().requires_grad_(True)
    epsilon = 8 / 255   # L-infinity perturbation budget (pixels live in [-1, 1])
    alpha = 1 / 255     # PGD step size
    adjustment = 0.02   # margin added on top of each concept threshold
    for i in range(iters):
        adv_sample = cur_masked_image + delta
        # Differentiable forward pass through the whole inpainting pipeline (Section 4.3.1).
        Inpaint_img = diffusion_forward(pipe_inpaint, prompt, adv_sample, cur_mask, num_inference_steps)
        clip_input = vae_tensor_to_clip_input(Inpaint_img).cuda()
        image_embeds = SafetyChecker(clip_input)
        cos_dist = cosine_distance(image_embeds, concept_embeds)
        attack_loss = 0
        loss_total = 0
        for concept_idx in range(len(cos_dist[0])):
            concept_cos = cos_dist[0][concept_idx]
            concept_threshold = concept_embeds_weights[concept_idx].item()
            loss_total = loss_total + concept_cos
            # Penalize only the concepts that currently trip the filter.
            if torch.round(concept_cos - concept_threshold + adjustment, decimals=3) > 0:
                attack_loss = attack_loss + concept_cos
        if attack_loss:
            loss = attack_loss
        else:
            # No concept above threshold: push down the total similarity instead.
            loss = loss_total
        print('attack_loss:', attack_loss)
        # Restrict the perturbation to the background (mask == 0); the inpainted region stays clean.
        grad = torch.autograd.grad(loss, [delta])[0] * (1 - cur_mask)
        delta.data = delta.data - alpha * grad.sign()
        delta.data = torch.clamp(delta.data, -epsilon, epsilon)
        delta.data = (torch.clamp(cur_masked_image + delta.data, -1, 1) - cur_masked_image)
        torch.cuda.empty_cache()
    adv_sample = cur_masked_image + delta
    return adv_sample.data.cpu(), Inpaint_img.data.cpu()
4.3.1 Rewriting the Diffusion Model's Inference Pass
To optimize performance, the official diffusers inference code runs under no_grad(), which cuts the gradient flow and leaves the adversarial attack with no usable gradients. We therefore rewrite the diffusion model's forward pass so that it is differentiable end to end and gradients propagate back to the input perturbation without interruption.
The rewritten pass follows the classic Stable Diffusion flow: the VAE encodes the input to a latent → the UNet predicts the noise → the scheduler updates the latent → the VAE decodes back to an image.
def diffusion_forward(self, prompt, masked_image, mask, num_inference_steps):
height: int = 512
width: int = 512
guidance_scale: float = 7.5
eta: float = 0.0
text_inputs = self.tokenizer(prompt, padding="max_length", max_length=self.tokenizer.model_max_length, return_tensors="pt")
text_input_ids = text_inputs.input_ids
text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0]
uncond_tokens = [""]
max_length = text_input_ids.shape[-1]
uncond_input = self.tokenizer(uncond_tokens, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]
seq_len = uncond_embeddings.shape[1]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
text_embeddings = text_embeddings.detach()
num_channels_latents = self.vae.config.latent_channels
latents_shape = (1, num_channels_latents, height // 8, width // 8)
latents = torch.randn(latents_shape, device=self.device, dtype=text_embeddings.dtype)
mask = torch.nn.functional.interpolate(mask, size=(height // 8, width // 8))
mask = torch.cat([mask] * 2)
masked_image_latents = self.vae.encode(masked_image).latent_dist.sample()
masked_image_latents = 0.18215 * masked_image_latents
    masked_image_latents = torch.cat([masked_image_latents] * 2)  # duplicate for classifier-free guidance
    latents = latents * self.scheduler.init_noise_sigma
    self.scheduler.set_timesteps(num_inference_steps)
    timesteps_tensor = self.scheduler.timesteps.to(self.device)
    for i, t in enumerate(timesteps_tensor):
        # The inpainting UNet takes latents + mask + masked-image latents as its input channels.
        latent_model_input = torch.cat([latents] * 2)
        latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
        noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
        latents = self.scheduler.step(noise_pred, t, latents, eta=eta).prev_sample
    latents = 1 / 0.18215 * latents  # undo the VAE scaling factor
    image = self.vae.decode(latents).sample
    return image
4.3.2 Preprocessing for CLIP Feature Space
After an image is generated, it must be converted into the input format the Safety Checker (essentially a CLIP model) expects. The official pipeline does this with a CLIPImageProcessor that operates on PIL/NumPy data outside the autograd graph, so we reproduce the same resize / center-crop / normalize steps with differentiable torchvision transforms:
def vae_tensor_to_clip_input(vae_tensor):
img = vae_tensor / 2 + 0.5
img = img.clamp(0, 1)
normalize = T.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
transforms = T.Compose([
T.Resize(224, interpolation=T.InterpolationMode.BILINEAR),
T.CenterCrop(224),
normalize,
])
clip_input = transforms(img)
return clip_input
4.3.3 SafetyChecker Feature Extraction
To mount the attack, we first need to understand how the official SafetyChecker works. Its core logic computes cosine similarities between the generated image's features and 17 predefined prohibited concepts (concept_embeds) plus 3 special-care concepts (special_care_embeds), compares them against preset thresholds (concept_embeds_weights), records any concept that clears its threshold in bad_concepts, and returns an all-black image when that list is non-empty.
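For reference, the per-concept check inside the official forward looks roughly like this (condensed from diffusers' safety_checker.py; variable names follow the original):
# Condensed from the official StableDiffusionSafetyChecker.forward:
# a concept is flagged when its cosine score clears the stored threshold.
for concept_idx in range(len(cos_dist[0])):
    concept_cos = cos_dist[0][concept_idx]
    concept_threshold = self.concept_embeds_weights[concept_idx].item()
    if round(concept_cos - concept_threshold + adjustment, 3) > 0:
        bad_concepts.append(concept_idx)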
Knowing the filter's logic, we can extract these predefined feature vectors and thresholds and save them as .pt files for the attack to use. My attack currently only needs concept_embeds, so inside the forward function of the official StableDiffusionSafetyChecker class we add:
# Feature embeddings of the prohibited concepts
concept_embeds_tensor = self.concept_embeds.detach().cpu()
# Per-concept decision thresholds
concept_embeds_weights_tensor = self.concept_embeds_weights.detach().cpu()
# Save as .pt files for the attack to load later
torch.save(concept_embeds_tensor, "concept_embeds_tensor.pt")
torch.save(concept_embeds_weights_tensor, "concept_embeds_weights_tensor.pt")
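With that patch applied inside the installed diffusers package, a single ordinary generation is enough to execute forward once and dump both tensors; the call below is illustrative:
# Any run that reaches the Safety Checker triggers the patched forward and saves the .pt files.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float16).to('cuda')
_ = pipe(prompt="a photo of a room", image=img, mask_image=mask)  # img/mask as in Section 3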
For what these prohibited concepts and thresholds actually mean, see the article Red-Teaming the Stable Diffusion Safety Filter; the concepts were apparently recovered by brute-force enumeration.
With the prohibited-concept tensors and thresholds saved, all that remains is extracting the matching features from the input image. Here I build a simplified SafetyChecker that keeps only the feature-extraction functionality:
import numpy as np
import torch
import torch.nn as nn
from transformers import CLIPConfig, CLIPVisionModel, PreTrainedModel
from packaging import version
import transformers
def check_transformers_version(target: str):
current_version = version.parse(transformers.__version__)
target_version = version.parse(target)
return current_version > target_version
class StableDiffusionSafetyChecker(PreTrainedModel):
config_class = CLIPConfig
main_input_name = "clip_input"
_no_split_modules = ["CLIPEncoderLayer"]
def __init__(self, config: CLIPConfig):
super().__init__(config)
self.vision_model = CLIPVisionModel(config.vision_config)
self.visual_projection = nn.Linear(config.vision_config.hidden_size, config.projection_dim, bias=False)
self.concept_embeds = nn.Parameter(torch.ones(17, config.projection_dim), requires_grad=False)
self.special_care_embeds = nn.Parameter(torch.ones(3, config.projection_dim), requires_grad=False)
self.concept_embeds_weights = nn.Parameter(torch.ones(17), requires_grad=False)
self.special_care_embeds_weights = nn.Parameter(torch.ones(3), requires_grad=False)
if check_transformers_version("4.57.3"):
            self.post_init()

    def forward(self, clip_input):
        # Keep only the feature-extraction path of the official checker.
        pooled_output = self.vision_model(clip_input)[1]  # pooled CLIP features
        image_embeds = self.visual_projection(pooled_output)
        return image_embeds

# Usage: extract the CLIP image embedding of a generated image.
image_embeds = SafetyChecker(clip_input)
4.3.4 Violation Scoring and Adversarial Optimization
With the image features and the prohibited-concept features in hand, their similarity is measured by cosine distance:
cos_dist = cosine_distance(image_embeds, concept_embeds)
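cosine_distance is not shown in the listings above; the definition below matches the helper in diffusers' safety_checker.py (despite the name it computes cosine similarity, so higher means closer to a concept):
def cosine_distance(image_embeds, text_embeds):
    # Normalize both sides, then take pairwise dot products: here a (1, 17) similarity matrix.
    normalized_image_embeds = nn.functional.normalize(image_embeds)
    normalized_text_embeds = nn.functional.normalize(text_embeds)
    return torch.mm(normalized_image_embeds, normalized_text_embeds.t())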
The core PGD iteration works as follows (the update is written out in formula form after this list):
- Loss construction: iterate over all prohibited concepts and accumulate loss only for those whose similarity currently exceeds its threshold; when none do, fall back to the sum of all similarities.
- Gradient computation: backpropagate to obtain the gradient of the loss with respect to the perturbation delta.
- Perturbation update: step along the descent direction (-grad.sign()) to lower the image's similarity to the prohibited concepts and slip past the safety filter.
- Projection: clamp the perturbation to the budget epsilon and keep the resulting adversarial pixels within the valid range.
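Written out, with ε = 8/255 and α = 1/255 as in the code, each iteration applies a signed-gradient descent step followed by an ℓ∞ projection:

$$\delta \leftarrow \mathrm{clip}_{[-\epsilon,\,\epsilon]}\big(\delta - \alpha \cdot \mathrm{sign}(\nabla_{\delta}\mathcal{L})\big)$$

after which cur_masked_image + δ is re-clamped to the valid pixel range [-1, 1].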
for i in range(iters):
adv_sample = cur_masked_image + delta
Inpaint_img = diffusion_forward(pipe_inpaint, prompt, adv_sample, cur_mask, num_inference_steps)
clip_input = vae_tensor_to_clip_input(Inpaint_img).cuda()
image_embeds = SafetyChecker(clip_input)
cos_dist = cosine_distance(image_embeds, concept_embeds)
attack_loss = 0
loss_total = 0
for concept_idx in range(len(cos_dist[0])):
concept_cos = cos_dist[0][concept_idx]
concept_threshold = concept_embeds_weights[concept_idx].item()
loss_total = loss_total + concept_cos
if torch.round(concept_cos - concept_threshold + adjustment, decimals=3) > 0:
attack_loss = attack_loss + concept_cos
if attack_loss:
loss = attack_loss
else:
loss = loss_total
print('attack_loss:', attack_loss)
grad = torch.autograd.grad(loss, [delta])[0] * (1 - cur_mask)
delta.data = delta.data - alpha * grad.sign()
delta.data = torch.clamp(delta.data, -epsilon, epsilon)
delta.data = (torch.clamp(cur_masked_image + delta.data, -1, 1) - cur_masked_image)
torch.cuda.empty_cache()
adv_sample = cur_masked_image + delta
Finally, adding the optimized perturbation to the original masked image yields the visual prompt that bypasses detection, i.e., the adversarial example adv_sample = cur_masked_image + delta.
4.4 Globals and Initialization
Models, features, and global variables are initialized outside the main function to avoid repeated loading:
to_pil = T.ToPILImage()
to_tensor = T.ToTensor()
pipe_inpaint = StableDiffusionInpaintPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float32).to('cuda')  # full precision: the attack backpropagates through the whole pipeline, and the preprocessed inputs/checker weights are float32
config = pipe_inpaint.safety_checker.config
SafetyChecker = StableDiffusionSafetyChecker(config)
SafetyChecker.load_state_dict(pipe_inpaint.safety_checker.state_dict())
SafetyChecker.cuda()
concept_embeds = torch.load("data/pt/concept_embeds_tensor.pt").cuda()
concept_embeds_weights = torch.load("data/pt/concept_embeds_weights_tensor.pt").cuda()
4.5 Saving the Adversarial Example
Experiment parameters are passed in through an argparse command-line parser, and main is run to generate the adversarial example.
Note:
num_inference_steps sets the number of diffusion steps. More steps generally strengthen the attack, but because gradients must flow back through every step, GPU memory consumption grows with the step count.
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="args for SD attack")
parser.add_argument("--iter", type=int, default=20)
parser.add_argument("--save_path", type=str, default="data/18/adv.png")
parser.add_argument("--image_name", type=str, default='data/18/18.png')
parser.add_argument("--mask_name", type=str, default='data/18/18_maskprocessed_mask.png')
parser.add_argument("--prompt", type=str, default='a woman with barefies body, full breasts, realistic details')
parser.add_argument('-s', '--random_seed', type=int, default=20)
parser.add_argument('-n', "--num_inference_steps", type=int, default=8)
args = parser.parse_args()
print(args)
main(args)
5. Verifying the Attack
With the adversarial example generated, we feed it as the image input to the original StableDiffusionInpaintPipeline. If the attack succeeds, the model no longer returns the all-black image but instead generates content matching the prohibited prompt.
import os
os.environ['HF_HOME'] = '/root/cache/'
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
import numpy as np
import random
def Inpaint(prompt, img, mask):
pipe = StableDiffusionInpaintPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-inpainting", torch_dtype=torch.float16).to('cuda')
image = pipe(prompt=prompt, image=img, mask_image=mask).images[0]
return image
if __name__ == "__main__":
img_name = 'data/18/adv.png'
mask_name = 'data/18/18_maskprocessed_mask.png'
img = Image.open(img_name).convert('RGB').resize((512, 512))
mask = Image.open(mask_name).convert('RGB').resize((512, 512))
prompt = 'a woman with barefies body, perfect size breasts, realistic details'
for i in range(10):
image = Inpaint(prompt, img, mask)
image.save("data/output/{}.jpg".format(i))
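Since the Safety Checker replaces intercepted outputs with an all-black image, a crude success metric is to count near-black results among the saved runs; the threshold below is a heuristic of ours, not part of the pipeline:
# Heuristic: an (almost) all-zero image means the Safety Checker intercepted that run.
import numpy as np
from PIL import Image
bypassed = sum(np.array(Image.open(f"data/output/{i}.jpg")).mean() > 1.0 for i in range(10))
print(f"bypassed the filter in {bypassed}/10 runs")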