Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
SafeSteerDataset: A Contrastive Dataset for T2I Safety Steering
Published:
We release SafeSteerDataset on Hugging Face, a contrastive dataset of 2,300 safe/unsafe prompt pairs designed for activation steering in Text-to-Image models. Existing T2I safety benchmarks (I2P, CoPro, T2ISafety) focus on broad evaluation or unsafe prompt detection, but they do not curate pairs of safe and unsafe prompts that are highly semantically similar. This semantic alignment is critical because without it, steering methods capture spurious artifacts rather than isolating the actual direction of toxicity in the activation space.
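A minimal sketch of how such contrastive pairs are typically consumed, assuming a simple difference-in-means steering setup (all function names and data here are hypothetical; a real pipeline would use model activations, not hand-written vectors). Each pair contributes the gap between its unsafe and safe activations, and averaging those gaps isolates a candidate "toxicity direction" only when the pairs are semantically matched:

```python
def steering_vector(pairs):
    """pairs: list of (safe_act, unsafe_act) activation vectors."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for safe, unsafe in pairs:
        for i in range(dim):
            total[i] += unsafe[i] - safe[i]
    n = len(pairs)
    return [t / n for t in total]

def steer(act, vec, alpha=1.0):
    """Push an activation away from the unsafe direction."""
    return [a - alpha * v for a, v in zip(act, vec)]

# Toy pairs differing only along one "toxic" coordinate:
pairs = [([0.0, 1.0], [2.0, 1.0]), ([1.0, 0.5], [3.0, 0.5])]
v = steering_vector(pairs)
print(v)  # [2.0, 0.0] -- the shared coordinate cancels out
```

Without semantic alignment, the second coordinate would not cancel, and the averaged vector would mix content differences into the safety direction.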
Image AutoRegressive Models Leak More Training Data Than Diffusion Models
Published:
Image AutoRegressive models (IARs) have recently emerged as a powerful alternative to diffusion models (DMs), surpassing them in image generation quality, speed, and scalability. Yet, despite their advantages, the privacy risks of IARs remain completely unexplored. When trained on sensitive or copyrighted data, these models may unintentionally expose training samples, creating major security and ethical concerns.
publications
Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders
Published in Advances in Neural Information Processing Systems (NeurIPS), 2023
Machine Learning as a Service APIs expose high-quality encoders that are expensive to train, making them lucrative targets for model stealing attacks. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate users. B4B adaptively adjusts the utility of returned representations based on a user’s coverage of the embedding space and individually transforms each user’s representations to prevent sybil-based aggregation.
Towards More Realistic Membership Inference Attacks on Large Diffusion Models
Published in Winter Conference on Applications of Computer Vision (WACV), 2024
We propose a methodology to establish a fair evaluation setup for membership inference attacks on large diffusion models such as Stable Diffusion. Our research reveals that previously proposed evaluation setups significantly overestimate the effectiveness of these attacks. We conclude that membership inference remains a significant challenge for large diffusion models deployed as black-box systems, indicating that related privacy and copyright issues will persist.
Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks
Published in European Conference on Artificial Intelligence (ECAI), 2024
We identify a new method for performing unsupervised model-stealing attacks against inductive graph neural networks, utilizing graph contrastive learning and spectral graph augmentations. Our approach outperforms the state-of-the-art across all six evaluated datasets, achieving superior fidelity and downstream accuracy of the stolen model. Crucially, it requires fewer queries directed toward the target model, making the attack practical even under restricted API access.
Maybe I Should Not Answer That, but… Do LLMs Understand The Safety of Their Inputs?
Published in ICLR Workshop on Building Trust in Language Models and Applications, 2025
We show that instruction-finetuned LLMs already encode safety-relevant information internally, with safe and unsafe prompts being distinctly separable in the model’s latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM matches or exceeds state-of-the-art guard models across multiple benchmarks while being a lightweight, customizable add-on that generalizes across model families and sizes.
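A hedged sketch of the prototype idea, assuming one Gaussian prototype per class with a diagonal covariance for simplicity (the actual LPM formulation may use full covariances and different calibration; all names below are illustrative):

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def classify(latent, prototypes):
    """prototypes: {label: (mean, var)}; return the nearest label."""
    return min(prototypes,
               key=lambda c: mahalanobis_diag(latent, *prototypes[c]))

# Toy prototypes standing in for class statistics of hidden states:
prototypes = {
    "safe":   ([0.0, 0.0], [1.0, 1.0]),
    "unsafe": ([4.0, 4.0], [1.0, 1.0]),
}
print(classify([0.5, 0.2], prototypes))   # safe
print(classify([3.8, 4.1], prototypes))   # unsafe
```

The appeal is that nothing is trained: the prototypes are just per-class statistics of hidden states, so swapping model families only means recomputing two means and covariances.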
Learning Graph Representation of Agent Diffuser
Published in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025
We introduce LGR-AD, a multi-agent system that models the text-to-image generation process as a distributed system of interacting agents, each representing an expert diffusion sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. A coordination mechanism based on top-k maximum spanning trees optimizes the generation process, outperforming traditional diffusion models across various benchmarks.
CDI: Copyrighted Data Identification in Diffusion Models
Published in Conference on Computer Vision and Pattern Recognition (CVPR), 2025
We demonstrate that existing membership inference attacks are not strong enough to reliably detect individual images in large, state-of-the-art diffusion models. To overcome this, we propose CDI, a dataset inference framework that aggregates signals from multiple data points belonging to a single owner. CDI allows data owners with as few as 70 samples to identify with over 99% confidence whether their data was used to train a given diffusion model.
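A sketch of why aggregation helps, with a plain z-test standing in for CDI's actual statistical machinery (scores and thresholds here are invented for illustration): each sample's membership signal may be weak, but pooling many samples from one owner yields a strong test statistic against a null distribution of non-member scores.

```python
import math
import statistics

def aggregate_confidence(owner_scores, null_scores):
    """z-score of the owner's mean score under the non-member null."""
    mu = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    n = len(owner_scores)
    return (statistics.mean(owner_scores) - mu) / (sd / math.sqrt(n))

# 70 individually weak but consistent signals add up:
owner = [0.6] * 70   # owner's samples score only slightly high
null  = [0.5, 0.45, 0.55, 0.48, 0.52, 0.5, 0.47, 0.53]
print(aggregate_confidence(owner, null) > 3)  # True: strong evidence
```

A per-image decision on a 0.6-vs-0.5 score gap would be hopeless; the sqrt(n) factor in the pooled statistic is what turns 70 marginal signals into high-confidence identification.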
Privacy Attacks on Image Autoregressive Models
Published in International Conference on Machine Learning (ICML), 2025
We conduct the first comprehensive privacy analysis of image autoregressive models (IARs), showing they exhibit significantly higher privacy risks than diffusion models. We develop a novel membership inference attack achieving a true positive rate of 86% at 1% false positive rate, compared to just 6% for diffusion models. We further demonstrate successful dataset inference with as few as 6 samples and extract hundreds of training data points from deployed models.
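The metric quoted above (TPR at a fixed low FPR) can be sketched as follows; the scores are illustrative, not from the paper. The threshold is set so that only the allowed fraction of non-members scores above it, and TPR is the fraction of members that clear it:

```python
def tpr_at_fpr(member_scores, nonmember_scores, fpr=0.01):
    """True-positive rate at a fixed false-positive rate."""
    s = sorted(nonmember_scores, reverse=True)
    k = max(int(len(s) * fpr), 1)
    thresh = s[k - 1]          # score exceeded by ~fpr of non-members
    hits = sum(1 for m in member_scores if m > thresh)
    return hits / len(member_scores)

# Toy scores: 86 of 100 members sit above every non-member.
nonmembers = [float(i) for i in range(100)]
members = [99.5] * 86 + [0.0] * 14
print(tpr_at_fpr(members, nonmembers))  # 0.86
```

Reporting TPR at 1% FPR rather than average accuracy is standard for membership inference, since privacy harm comes from confident identifications, not balanced-accuracy gains.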
Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses
Published in arXiv preprint, 2025
Model merging is highly susceptible to backdoor attacks that allow adversaries to control the merged model’s output at inference time. We propose treating the attack itself as a task vector: the Backdoor Vector is the weight difference between a backdoored and clean model, revealing new insights into attack similarity and transferability. We introduce Sparse Backdoor Vectors for stronger attacks and Injection BV Subtraction, an assumption-free defense that remains effective even when the threat is unknown.
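The task-arithmetic view can be sketched in a few lines, with weights shown as flat dicts of floats for illustration (real models hold a tensor per parameter name, but the arithmetic is the same; the defense shown is a generic subtraction, not the paper's exact procedure):

```python
def backdoor_vector(backdoored, clean):
    """The Backdoor Vector: parameter-wise weight difference."""
    return {k: backdoored[k] - clean[k] for k in clean}

def subtract_vector(weights, bv, scale=1.0):
    """Move weights against the backdoor direction."""
    return {k: weights[k] - scale * bv[k] for k in weights}

clean      = {"layer.w": 1.0,  "layer.b": 0.5}
backdoored = {"layer.w": 1.25, "layer.b": 0.5}
bv = backdoor_vector(backdoored, clean)
print(bv)  # {'layer.w': 0.25, 'layer.b': 0.0}
restored = subtract_vector(backdoored, bv)  # recovers the clean weights
```

Framing the attack as a vector is what makes similarity and transfer questions tractable: two backdoors can be compared by comparing their vectors, and a defense becomes a direction to subtract.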
ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
Published in European Conference on Artificial Intelligence (ECAI), 2025
Traditional Monte Carlo simulations of particle detector responses at CERN are computationally expensive and strain the computing grid. We present ExpertSim, a Mixture-of-Generative-Experts architecture tailored for the Zero Degree Calorimeter in the ALICE experiment, where each expert specializes in a different subset of the data. ExpertSim improves accuracy over standard methods while providing a significant speedup, offering a practical solution for high-efficiency detector simulations.
On Stealing Graph Neural Network Models
Published in AAAI Conference on Artificial Intelligence (AAAI), 2026
Current GNN model-stealing methods rely heavily on queries to the victim model, assuming no hard query limits, but in practice the number of allowed queries can be severely limited. We demonstrate how an adversary can extract a GNN with very limited interactions by first obtaining the model backbone without direct queries, then strategically utilizing a fixed query budget to extract the most informative data. Experiments on eight real-world datasets show the attack is effective even under severe query restrictions and active defenses.
Conditioned Activation Transport for T2I Safety Steering
Published in arXiv preprint, 2026
Current Text-to-Image models remain prone to generating unsafe content, and linear activation steering frequently degrades image quality on benign prompts. We propose Conditioned Activation Transport (CAT), a framework that employs geometry-based conditioning and nonlinear transport maps that activate only within unsafe activation regions. Validated on Z-Image and Infinity architectures, CAT significantly reduces Attack Success Rate while maintaining image fidelity compared to unsteered generations.
Universal Properties of Activation Sparsity in Modern Large Language Models
Published in International Conference on Learning Representations (ICLR), 2026
Methods relying on exact zero activations do not apply to modern LLMs that use SiLU or GELU, leading to fragmented strategies and a gap in general understanding. We introduce a general framework for evaluating sparsity robustness and conduct a systematic investigation across diverse model families and scales. Our results uncover universal properties of activation sparsity, notably that the potential for effective sparsity grows with model size, and present the first study of activation sparsity in diffusion-based LLMs.
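A sketch of the underlying notion of "effective" sparsity, under the assumption that it is measured by magnitude thresholding (the paper's framework is more general; this is only the simplest instance). SiLU and GELU never output exact zeros, so activations below a threshold are zeroed and one measures how many survive:

```python
def sparsify(acts, tau):
    """Zero activations with |a| <= tau; return kept acts and sparsity."""
    kept = [a if abs(a) > tau else 0.0 for a in acts]
    sparsity = kept.count(0.0) / len(kept)
    return kept, sparsity

# Most SiLU/GELU activations are tiny but nonzero:
acts = [0.001, -0.02, 1.5, 0.0005, -2.1, 0.03]
kept, s = sparsify(acts, tau=0.05)
print(s)  # 4 of 6 activations zeroed
```

The practical question the paper's framework addresses is robustness: how large can tau (and thus sparsity) grow before model quality degrades, and how that tolerance scales with model size.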
