Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


Universal Properties of Activation Sparsity in Modern Large Language Models

Published in International Conference on Learning Representations (ICLR), 2026

Methods relying on exact zero activations do not apply to modern LLMs that use smooth activation functions such as SiLU or GELU, leading to fragmented sparsification strategies and a gap in general understanding. We introduce a general framework for evaluating sparsity robustness and conduct a systematic investigation across diverse model families and scales. Our results uncover universal properties of activation sparsity, notably that the potential for effective sparsity grows with model size, and present the first study of activation sparsity in diffusion-based LLMs.

Paper
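
To make the sparsity notion concrete: for smooth activations such as GELU, hidden values are rarely exactly zero, so sparsification typically means zeroing activations whose magnitude falls below a threshold. The sketch below illustrates that idea on a toy GELU MLP; the threshold, layer sizes, and random weights are illustrative assumptions, not the framework from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative only: magnitude-based activation sparsity for a GELU MLP.
# Threshold and layer sizes are arbitrary assumptions, not the paper's setup.
torch.manual_seed(0)
d_model, d_ff, n_tokens = 256, 1024, 32
w_in = torch.randn(d_ff, d_model) / d_model**0.5
w_out = torch.randn(d_model, d_ff) / d_ff**0.5
x = torch.randn(n_tokens, d_model)

h = F.gelu(x @ w_in.T)                # hidden activations are rarely exactly zero
tau = 0.05                            # magnitude threshold (assumed)
h_sparse = torch.where(h.abs() < tau, torch.zeros_like(h), h)

sparsity = (h_sparse == 0).float().mean().item()
err = (h_sparse @ w_out.T - h @ w_out.T).norm() / (h @ w_out.T).norm()
print(f"fraction of activations zeroed: {sparsity:.2%}")
print(f"relative output change: {err.item():.4f}")
```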

On Stealing Graph Neural Network Models

Published in AAAI Conference on Artificial Intelligence (AAAI), 2026

Current GNN model-stealing methods rely heavily on queries to the victim model, assuming no hard query limits, but in practice the number of allowed queries can be severely limited. We demonstrate how an adversary can extract a GNN with very limited interactions by first obtaining the model backbone without direct queries, then strategically utilizing a fixed query budget to extract the most informative data. Experiments on eight real-world datasets show the attack is effective even under severe query restrictions and active defenses.

Paper

ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts

Published in European Conference on Artificial Intelligence (ECAI), 2025

Traditional Monte Carlo simulations of particle detector responses at CERN are computationally expensive and strain the computational grid. We present ExpertSim, a Mixture-of-Generative-Experts architecture tailored for the Zero Degree Calorimeter in the ALICE experiment, where each expert specializes in a different subset of the data. ExpertSim improves accuracy over standard methods while providing a significant speedup, offering a practical solution for high-efficiency detector simulations.

Paper
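
As a rough illustration of the mixture-of-generative-experts idea, one expert per region of the conditioning space, here is a toy routing sketch; the partition rule, Gaussian "generators", and output shape are placeholder assumptions, not the ExpertSim architecture.

```python
import numpy as np

# Toy mixture-of-generative-experts: each expert handles a subset of the
# conditioning space. The routing rule, the Gaussian stand-in generators,
# and the 44x44 response shape are placeholders, not the ExpertSim model.
rng = np.random.default_rng(0)

def make_expert(loc):
    # stand-in for a trained generative expert specialized on one data subset
    return lambda n: rng.normal(loc=loc, scale=1.0, size=(n, 44, 44))

experts = [make_expert(loc) for loc in (0.0, 2.0, 5.0)]

def route(condition):
    # assumed routing rule based on a conditioning variable such as energy
    if condition["energy"] < 50:
        return experts[0]
    return experts[1] if condition["energy"] < 200 else experts[2]

response = route({"energy": 120.0})(n=4)   # simulate 4 detector responses
print(response.shape)
```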

Privacy Attacks on Image Autoregressive Models

Published in International Conference on Machine Learning (ICML), 2025

We conduct the first comprehensive privacy analysis of image autoregressive models (IARs), showing they exhibit significantly higher privacy risks than diffusion models. We develop a novel membership inference attack achieving a true positive rate of 86% at 1% false positive rate, compared to just 6% for diffusion models. We further demonstrate successful dataset inference with as few as 6 samples and extract hundreds of training data points from deployed models.

Paper
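
The headline metric, true positive rate at a fixed 1% false positive rate, can be computed from per-sample membership scores as follows; the score distributions below are synthetic placeholders rather than outputs of the attack.

```python
import numpy as np

# Compute TPR at a fixed 1% FPR from membership-inference scores.
# The score arrays are synthetic; in the paper they come from the attack.
rng = np.random.default_rng(0)
member_scores = rng.normal(1.0, 1.0, 5000)      # scores for training members
nonmember_scores = rng.normal(0.0, 1.0, 5000)   # scores for non-members

def tpr_at_fpr(member, nonmember, target_fpr=0.01):
    # threshold chosen so that only `target_fpr` of non-members exceed it
    threshold = np.quantile(nonmember, 1.0 - target_fpr)
    return float((member > threshold).mean())

print(f"TPR @ 1% FPR: {tpr_at_fpr(member_scores, nonmember_scores):.2%}")
```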

CDI: Copyrighted Data Identification in Diffusion Models

Published in Conference on Computer Vision and Pattern Recognition (CVPR), 2025

We demonstrate that existing membership inference attacks are not strong enough to reliably detect individual images in large, state-of-the-art diffusion models. To overcome this, we propose CDI, a dataset inference framework that aggregates signals from multiple data points belonging to a single owner. CDI allows data owners with as few as 70 samples to identify with over 99% confidence whether their data was used to train a given diffusion model.

Paper
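
The aggregation idea, pooling weak per-sample membership signals from a single owner into one statistical decision, can be sketched as a one-sided test against scores on known non-training data; the score model and the Welch t-test below are illustrative assumptions, not the CDI procedure.

```python
import numpy as np
from scipy import stats

# Sketch of dataset inference: aggregate per-sample membership scores from
# one data owner and test whether they are shifted above scores computed on
# known non-training data. Scores and the Welch t-test are illustrative.
rng = np.random.default_rng(0)
owner_scores = rng.normal(0.3, 1.0, 70)        # ~70 samples from the data owner
reference_scores = rng.normal(0.0, 1.0, 1000)  # scores on held-out, unseen data

t_stat, p_value = stats.ttest_ind(
    owner_scores, reference_scores, equal_var=False, alternative="greater"
)
print(f"one-sided p-value (small => evidence the owner's data was used): {p_value:.3g}")
```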

Learning Graph Representation of Agent Diffuser

Published in International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025

We introduce LGR-AD, a multi-agent system that models the text-to-image generation process as a distributed system of interacting agents, each representing an expert diffusion sub-model. These agents dynamically adapt to varying conditions and collaborate through a graph neural network that encodes their relationships and performance metrics. A coordination mechanism based on top-k maximum spanning trees optimizes the generation process, outperforming traditional diffusion models across various benchmarks.

Paper
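
A maximum spanning tree over a weighted agent graph, the basic building block behind the coordination mechanism described above, can be computed with networkx; the agent count and edge weights below are made up, and the paper's top-k variant is not reproduced here.

```python
import itertools
import random
import networkx as nx

# Build a complete graph over "agents" with synthetic compatibility weights,
# then extract a maximum spanning tree as a coordination backbone.
# Weights and agent count are placeholders; the paper uses top-k such trees.
random.seed(0)
agents = [f"expert_{i}" for i in range(6)]
G = nx.Graph()
for a, b in itertools.combinations(agents, 2):
    G.add_edge(a, b, weight=random.random())   # e.g., performance affinity

mst = nx.maximum_spanning_tree(G, weight="weight")
print(sorted(mst.edges(data="weight")))
```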

Maybe I Should Not Answer That, but… Do LLMs Understand The Safety of Their Inputs?

Published in ICLR Workshop on Building Trust in Language Models and Applications, 2025

We show that instruction-finetuned LLMs already encode safety-relevant information internally, with safe and unsafe prompts being distinctly separable in the model’s latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM matches or exceeds state-of-the-art guard models across multiple benchmarks while being a lightweight, customizable add-on that generalizes across model families and sizes.

Paper
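
A minimal sketch of the Mahalanobis-distance idea, assuming random vectors in place of the LLM hidden states that LPM actually uses: fit safe and unsafe prototypes on prompt representations and label a new prompt by its nearer prototype.

```python
import numpy as np

# Minimal sketch of latent-prototype moderation via Mahalanobis distance.
# Random vectors stand in for hidden states of safe/unsafe prompts.
rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, (500, 64))
unsafe = rng.normal(1.5, 1.0, (500, 64))

def fit_prototype(x):
    mu = x.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(x, rowvar=False))
    return mu, cov_inv

def mahalanobis(z, proto):
    mu, cov_inv = proto
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))

protos = {"safe": fit_prototype(safe), "unsafe": fit_prototype(unsafe)}
query = rng.normal(1.4, 1.0, 64)   # hidden state of an incoming prompt
label = min(protos, key=lambda k: mahalanobis(query, protos[k]))
print(f"moderation decision: {label}")
```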

Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks

Published in European Conference on Artificial Intelligence (ECAI), 2024

We introduce a new method for performing unsupervised model-stealing attacks against inductive graph neural networks, leveraging graph contrastive learning and spectral graph augmentations. Our approach outperforms the state of the art across all six evaluated datasets, achieving superior fidelity and downstream accuracy of the stolen model. Crucially, it requires fewer queries directed toward the target model, making the attack practical even under restricted API access.

Paper

Towards More Realistic Membership Inference Attacks on Large Diffusion Models

Published in Winter Conference on Computer Vision (WACV), 2024

We propose a methodology to establish a fair evaluation setup for membership inference attacks on large diffusion models such as Stable Diffusion. Our research reveals that previously proposed evaluation setups significantly overestimate the effectiveness of these attacks. We conclude that membership inference remains a significant challenge for large diffusion models deployed as black-box systems, indicating that related privacy and copyright issues will persist.

Paper

Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders

Published in Advances in Neural Information Processing Systems (NeurIPS), 2023

Machine Learning as a Service APIs expose high-quality encoders that are expensive to train, making them lucrative targets for model stealing attacks. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate users. B4B adaptively adjusts the utility of returned representations based on a user’s coverage of the embedding space and individually transforms each user’s representations to prevent sybil-based aggregation.

Paper
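
The coverage signal that B4B reacts to can be approximated by hashing returned embeddings into coarse buckets and tracking how many distinct buckets a user has touched; the sign-hash and the noise schedule below are illustrative assumptions rather than the defense itself.

```python
import numpy as np

# Sketch of coverage-based utility adjustment: hash each returned embedding
# into a coarse bucket, track per-user bucket coverage, and add noise that
# grows with coverage. The bucketing and noise schedule are assumptions.
rng = np.random.default_rng(0)

class CoverageDefense:
    def __init__(self, dim=32, n_projections=10):
        self.proj = rng.normal(size=(dim, n_projections))  # random hyperplanes
        self.buckets = set()

    def serve(self, embedding):
        bucket = tuple((embedding @ self.proj > 0).astype(int))  # sign hash
        self.buckets.add(bucket)
        coverage = len(self.buckets) / 2 ** self.proj.shape[1]
        noise = rng.normal(scale=coverage, size=embedding.shape)
        return embedding + noise              # utility degrades with coverage

defense = CoverageDefense()
for _ in range(100):
    out = defense.serve(rng.normal(size=32))
print(f"buckets touched: {len(defense.buckets)}")
```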

Preprints


Conditioned Activation Transport for T2I Safety Steering

arXiv preprint, 2026

Current text-to-image (T2I) models remain prone to generating unsafe content, and linear activation steering frequently degrades image quality on benign prompts. We propose Conditioned Activation Transport (CAT), a framework that employs geometry-based conditioning and nonlinear transport maps that activate only within unsafe activation regions. Validated on the Z-Image and Infinity architectures, CAT significantly reduces the attack success rate while maintaining image fidelity relative to unsteered generations.

Paper
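
A minimal sketch of conditioned steering, applying a transport map only when an activation falls inside an estimated unsafe region: the region test, the toy linear map, and all vectors below are assumptions, not the CAT method.

```python
import numpy as np

# Sketch of conditioned activation steering: apply a (here, simple) transport
# map to an activation only when it falls inside an estimated unsafe region.
# The region test and the map itself are assumptions, not the CAT method.
rng = np.random.default_rng(0)
unsafe_center = rng.normal(size=64)          # estimated from unsafe prompts
safe_center = rng.normal(size=64)            # estimated from benign prompts

def steer(activation, radius=6.0):
    if np.linalg.norm(activation - unsafe_center) > radius:
        return activation                    # benign region: leave untouched
    # inside the unsafe region: transport toward the safe region (toy map)
    return activation + 0.5 * (safe_center - activation)

benign = safe_center + 0.1 * rng.normal(size=64)
unsafe = unsafe_center + 0.1 * rng.normal(size=64)
print(np.allclose(steer(benign), benign), np.linalg.norm(steer(unsafe) - unsafe) > 0)
```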

Backdoor Vectors: A Task Arithmetic View on Backdoor Attacks and Defenses

arXiv preprint, 2025

Model merging is highly susceptible to backdoor attacks that allow adversaries to control the merged model’s output at inference time. We propose treating the attack itself as a task vector: the Backdoor Vector is the weight difference between a backdoored model and its clean counterpart, revealing new insights into attack similarity and transferability. We introduce Sparse Backdoor Vectors for stronger attacks and Injection BV Subtraction, an assumption-free defense that remains effective even when the threat is unknown.

Paper
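
The central objects, a backdoor vector defined as the weight difference between a backdoored model and its clean counterpart, and a subtraction-style defense, can be written directly in the task-arithmetic style; the state dicts and scaling coefficient below are placeholders.

```python
import torch

# Task-arithmetic view of a backdoor: the backdoor vector is the weight
# difference between a backdoored fine-tune and its clean counterpart.
# Models and the subtraction coefficient here are illustrative placeholders.
def backdoor_vector(backdoored, clean):
    return {k: backdoored[k] - clean[k] for k in clean}

def subtract_vector(model, vector, alpha=1.0):
    # defense sketch: remove an (estimated) backdoor direction from the weights
    return {k: model[k] - alpha * vector[k] for k in model}

clean = {"linear.weight": torch.randn(4, 4)}
backdoored = {k: v + 0.1 * torch.randn_like(v) for k, v in clean.items()}

bv = backdoor_vector(backdoored, clean)
repaired = subtract_vector(backdoored, bv, alpha=1.0)
print(torch.allclose(repaired["linear.weight"], clean["linear.weight"]))
```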