Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
SafeSteerDataset: A Contrastive Dataset for T2I Safety Steering
Published:
We release SafeSteerDataset on Hugging Face, a contrastive dataset of 2,300 safe/unsafe prompt pairs designed for activation steering in Text-to-Image models. Existing T2I safety benchmarks (I2P, CoPro, T2ISafety) focus on broad evaluation or unsafe prompt detection, but they do not curate pairs of safe and unsafe prompts that are highly semantically similar. This semantic alignment is critical because without it, steering methods capture spurious artifacts rather than isolating the actual direction of toxicity in the activation space.
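A minimal sketch of how such contrastive pairs are typically consumed, assuming a simple difference-in-means steering setup (all function names and data here are hypothetical; a real pipeline would use model activations, not hand-written vectors). Each pair contributes the gap between its unsafe and safe activations, and averaging those gaps isolates a candidate "toxicity direction" only when the pairs are semantically matched:

```python
def steering_vector(pairs):
    """pairs: list of (safe_act, unsafe_act) activation vectors."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for safe, unsafe in pairs:
        for i in range(dim):
            total[i] += unsafe[i] - safe[i]
    n = len(pairs)
    return [t / n for t in total]

def steer(act, vec, alpha=1.0):
    """Push an activation away from the unsafe direction."""
    return [a - alpha * v for a, v in zip(act, vec)]

# Toy pairs differing only along one "toxic" coordinate:
pairs = [([0.0, 1.0], [2.0, 1.0]), ([1.0, 0.5], [3.0, 0.5])]
v = steering_vector(pairs)
print(v)  # [2.0, 0.0] -- the shared coordinate cancels out
```

Without semantic alignment, the second coordinate would not cancel, and the averaged vector would mix content differences into the safety direction.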
Image AutoRegressive Models Leak More Training Data Than Diffusion Models
Published:
Image AutoRegressive models (IARs) have recently emerged as a powerful alternative to diffusion models (DMs), surpassing them in image generation quality, speed, and scalability. Yet, despite their advantages, the privacy risks of IARs remain completely unexplored. When trained on sensitive or copyrighted data, these models may unintentionally expose training samples, creating major security and ethical concerns.
publications
Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
Published in Advances in Neural Information Processing Systems (NeurIPS), 2023
Machine Learning as a Service APIs expose high-quality encoders that are expensive to train, making them lucrative targets for model stealing attacks. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate users. B4B adaptively adjusts the utility of returned representations based on a user’s coverage of the embedding space and individually transforms each user’s representations to prevent sybil-based aggregation.
Towards More Realistic Membership Inference Attacks on Large Diffusion Models
Published in Winter Conference on Applications of Computer Vision (WACV), 2024
We propose a methodology to establish a fair evaluation setup for membership inference attacks on large diffusion models such as Stable Diffusion. Our research reveals that previously proposed evaluation setups significantly overestimate the effectiveness of these attacks. We conclude that membership inference remains a significant challenge for large diffusion models deployed as black-box systems, indicating that related privacy and copyright issues will persist.
Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks
Published in European Conference on Artificial Intelligence (ECAI), 2024
We identify a new method for performing unsupervised model-stealing attacks against inductive graph neural networks, utilizing graph contrastive learning and spectral graph augmentations. Our approach outperforms the state-of-the-art across all six evaluated datasets, achieving superior fidelity and downstream accuracy of the stolen model. Crucially, it requires fewer queries directed toward the target model, making the attack practical even under restricted API access.
Maybe I Should Not Answer That, but… Do LLMs Understand The Safety of Their Inputs?
Published in ICLR Workshop on Building Trust in Language Models and Applications, 2025
We show that instruction-finetuned LLMs already encode safety-relevant information internally, with safe and unsafe prompts being distinctly separable in the model’s latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM matches or exceeds state-of-the-art guard models across multiple benchmarks while being a lightweight, customizable add-on that generalizes across model families and sizes.
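A hedged sketch of the prototype idea, assuming one Gaussian prototype per class with a diagonal covariance for simplicity (the actual LPM formulation may use full covariances and different calibration; all names below are illustrative):

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def classify(latent, prototypes):
    """prototypes: {label: (mean, var)}; return the nearest label."""
    return min(prototypes,
               key=lambda c: mahalanobis_diag(latent, *prototypes[c]))

# Toy prototypes standing in for class statistics of hidden states:
prototypes = {
    "safe":   ([0.0, 0.0], [1.0, 1.0]),
    "unsafe": ([4.0, 4.0], [1.0, 1.0]),
}
print(classify([0.5, 0.2], prototypes))   # safe
print(classify([3.8, 4.1], prototypes))   # unsafe
```

The appeal is that nothing is trained: the prototypes are just per-class statistics of hidden states, so swapping model families only means recomputing two means and covariances.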
Learning Graph Representation of Agent Diffuser
Published in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025
We introduce LGR-AD, a multi-agent system that models the text-to-image generation process as a distributed system of interacting agents, each representing an expert diffusion sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. A coordination mechanism based on top-k maximum spanning trees optimizes the generation process, outperforming traditional diffusion models across various benchmarks.
CDI: Copyrighted Data Identification in Diffusion Models
Published in Conference on Computer Vision and Pattern Recognition (CVPR), 2025
We demonstrate that existing membership inference attacks are not strong enough to reliably detect individual images in large, state-of-the-art diffusion models. To overcome this, we propose CDI, a dataset inference framework that aggregates signals from multiple data points belonging to a single owner. CDI allows data owners with as few as 70 samples to identify with over 99% confidence whether their data was used to train a given diffusion model.
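A sketch of why aggregation helps, with a plain z-test standing in for CDI's actual statistical machinery (scores and thresholds here are invented for illustration): each sample's membership signal may be weak, but pooling many samples from one owner yields a strong test statistic against a null distribution of non-member scores.

```python
import math
import statistics

def aggregate_confidence(owner_scores, null_scores):
    """z-score of the owner's mean score under the non-member null."""
    mu = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    n = len(owner_scores)
    return (statistics.mean(owner_scores) - mu) / (sd / math.sqrt(n))

# 70 individually weak but consistent signals add up:
owner = [0.6] * 70   # owner's samples score only slightly high
null  = [0.5, 0.45, 0.55, 0.48, 0.52, 0.5, 0.47, 0.53]
print(aggregate_confidence(owner, null) > 3)  # True: strong evidence
```

A per-image decision on a 0.6-vs-0.5 score gap would be hopeless; the sqrt(n) factor in the pooled statistic is what turns 70 marginal signals into high-confidence identification.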
Privacy Attacks on Image Autoregressive Models
Published in International Conference on Machine Learning (ICML), 2025
We conduct the first comprehensive privacy analysis of image autoregressive models (IARs), showing they exhibit significantly higher privacy risks than diffusion models. We develop a novel membership inference attack achieving a true positive rate of 86% at 1% false positive rate, compared to just 6% for diffusion models. We further demonstrate successful dataset inference with as few as 6 samples and extract hundreds of training data points from deployed models.
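The metric quoted above (TPR at a fixed low FPR) can be sketched as follows; the scores are illustrative, not from the paper. The threshold is set so that only the allowed fraction of non-members scores above it, and TPR is the fraction of members that clear it:

```python
def tpr_at_fpr(member_scores, nonmember_scores, fpr=0.01):
    """True-positive rate at a fixed false-positive rate."""
    s = sorted(nonmember_scores, reverse=True)
    k = max(int(len(s) * fpr), 1)
    thresh = s[k - 1]          # score exceeded by ~fpr of non-members
    hits = sum(1 for m in member_scores if m > thresh)
    return hits / len(member_scores)

# Toy scores: 86 of 100 members sit above every non-member.
nonmembers = [float(i) for i in range(100)]
members = [99.5] * 86 + [0.0] * 14
print(tpr_at_fpr(members, nonmembers))  # 0.86
```

Reporting TPR at 1% FPR rather than average accuracy is standard for membership inference, since privacy harm comes from confident identifications, not balanced-accuracy gains.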
Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
Published in arXiv preprint, 2025
Model merging is highly susceptible to backdoor attacks that allow adversaries to control the merged model’s output at inference time. We propose treating the attack itself as a task vector: the Backdoor Vector is the weight difference between a backdoored and clean model, revealing new insights into attack similarity and transferability. We introduce Sparse Backdoor Vectors for stronger attacks and Injection BV Subtraction, an assumption-free defense that remains effective even when the threat is unknown.
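The task-arithmetic view can be sketched in a few lines, with weights shown as flat dicts of floats for illustration (real models hold a tensor per parameter name, but the arithmetic is the same; the defense shown is a generic subtraction, not the paper's exact procedure):

```python
def backdoor_vector(backdoored, clean):
    """The Backdoor Vector: parameter-wise weight difference."""
    return {k: backdoored[k] - clean[k] for k in clean}

def subtract_vector(weights, bv, scale=1.0):
    """Move weights against the backdoor direction."""
    return {k: weights[k] - scale * bv[k] for k in weights}

clean      = {"layer.w": 1.0,  "layer.b": 0.5}
backdoored = {"layer.w": 1.25, "layer.b": 0.5}
bv = backdoor_vector(backdoored, clean)
print(bv)  # {'layer.w': 0.25, 'layer.b': 0.0}
restored = subtract_vector(backdoored, bv)  # recovers the clean weights
```

Framing the attack as a vector is what makes similarity and transfer questions tractable: two backdoors can be compared by comparing their vectors, and a defense becomes a direction to subtract.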
ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Published in European Conference on Artificial Intelligence (ECAI), 2025
Traditional Monte Carlo simulations of particle detector responses at CERN are computationally expensive and strain the computing grid. We present ExpertSim, a Mixture-of-Generative-Experts architecture tailored for the Zero Degree Calorimeter in the ALICE experiment, where each expert specializes in a different subset of the data. ExpertSim improves accuracy over standard methods while providing a significant speedup, offering a practical solution for high-efficiency detector simulations.
On Stealing Graph Neural Network Models
Published in AAAI Conference on Artificial Intelligence (AAAI), 2026
Current GNN model-stealing methods rely heavily on queries to the victim model, assuming no hard query limits, but in practice the number of allowed queries can be severely limited. We demonstrate how an adversary can extract a GNN with very limited interactions by first obtaining the model backbone without direct queries, then strategically utilizing a fixed query budget to extract the most informative data. Experiments on eight real-world datasets show the attack is effective even under severe query restrictions and active defenses.
Conditioned Activation Transport for T2I Safety Steering
Published in arXiv preprint, 2026
Current Text-to-Image models remain prone to generating unsafe content, and linear activation steering frequently degrades image quality on benign prompts. We propose Conditioned Activation Transport (CAT), a framework that employs geometry-based conditioning and nonlinear transport maps that activate only within unsafe activation regions. Validated on Z-Image and Infinity architectures, CAT significantly reduces Attack Success Rate while maintaining image fidelity compared to unsteered generations.
Universal Properties of Activation Sparsity in Modern Large Language Models
Published in International Conference on Learning Representations (ICLR), 2026
Methods relying on exact zero activations do not apply to modern LLMs that use SiLU or GELU, leading to fragmented strategies and a gap in general understanding. We introduce a general framework for evaluating sparsity robustness and conduct a systematic investigation across diverse model families and scales. Our results uncover universal properties of activation sparsity, notably that the potential for effective sparsity grows with model size, and present the first study of activation sparsity in diffusion-based LLMs.
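A sketch of the underlying notion of "effective" sparsity, under the assumption that it is measured by magnitude thresholding (the paper's framework is more general; this is only the simplest instance). SiLU and GELU never output exact zeros, so activations below a threshold are zeroed and one measures how many survive:

```python
def sparsify(acts, tau):
    """Zero activations with |a| <= tau; return kept acts and sparsity."""
    kept = [a if abs(a) > tau else 0.0 for a in acts]
    sparsity = kept.count(0.0) / len(kept)
    return kept, sparsity

# Most SiLU/GELU activations are tiny but nonzero:
acts = [0.001, -0.02, 1.5, 0.0005, -2.1, 0.03]
kept, s = sparsify(acts, tau=0.05)
print(s)  # 4 of 6 activations zeroed
```

The practical question the paper's framework addresses is robustness: how large can tau (and thus sparsity) grow before model quality degrades, and how that tolerance scales with model size.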
