SafeSteerDataset: A Contrastive Dataset for T2I Safety Steering
We release SafeSteerDataset on Hugging Face: a contrastive dataset of 2,300 safe/unsafe prompt pairs designed for activation steering in text-to-image (T2I) models. Existing T2I safety benchmarks (I2P, CoPro, T2ISafety) focus on broad evaluation or unsafe-prompt detection; they do not curate pairs of safe and unsafe prompts that are semantically close to each other. This semantic alignment is critical: without it, the difference between safe and unsafe activations also reflects unrelated variation between the two prompts, so steering methods capture spurious artifacts rather than isolating the actual direction of toxicity in activation space.
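
To make the role of the pairing concrete, here is a minimal sketch of one common way to derive a steering direction from contrastive pairs: the mean of the paired activation differences. This is an illustration, not necessarily the method the dataset was built for. The activations are synthetic NumPy arrays standing in for hidden states of a T2I text encoder, and the names `steering_direction`, `steer`, and `true_dir` are hypothetical.

```python
import numpy as np

def steering_direction(safe_acts, unsafe_acts):
    """Unit-norm mean of the paired (unsafe - safe) activation differences."""
    diffs = np.asarray(unsafe_acts) - np.asarray(safe_acts)
    direction = diffs.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(activation, direction, alpha=1.0):
    """Subtract alpha times the component of `activation` along `direction`."""
    return activation - alpha * np.dot(activation, direction) * direction

# Synthetic stand-in for real activations (assumption: no model is loaded here).
rng = np.random.default_rng(0)
dim, n_pairs = 16, 200
true_dir = np.zeros(dim)
true_dir[0] = 1.0  # pretend "toxicity" lives along the first axis

# Each pair shares the same content vector, mimicking semantically aligned prompts.
content = rng.normal(size=(n_pairs, dim))
safe = content + 0.05 * rng.normal(size=(n_pairs, dim))
unsafe = content + 2.0 * true_dir + 0.05 * rng.normal(size=(n_pairs, dim))

direction = steering_direction(safe, unsafe)
cosine = float(direction @ true_dir)         # alignment with the planted direction
steered = steer(unsafe[0], direction)        # one unsafe activation, steered
residual = float(np.dot(steered, direction)) # component left along the direction
```

Because the shared content cancels inside each pair, the averaged difference is dominated by the planted direction. If the safe and unsafe prompts described different scenes, their content would not cancel and would leak into the estimate, which is the failure mode the paragraph above describes.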
