Research
I work at the intersection of AI and security, safety, and sociopolitical concerns:
1. Understanding, probing, and evaluating the failure modes of AI models — their biases, emergent risks, and misuse scenarios.
2. Designing mitigations, system defenses, white-box control methods, and reasoning enhancements to counter such risks.
3. Leveraging AI agents for good: advancing scientific discovery and benefiting society.
AI Security
A(G)I Safety & Alignment
AI & Society
AI Ethics
Prompt Injection
Multi-agent Safety
Cooperative AI
Human-AI Interaction
Selected Publications
Full list on Google Scholar
ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations
Amr Gomaa, Ahmed Salem, Sahar Abdelnabi
EACL Findings 2026
LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge
Sahar Abdelnabi, Aideen Fay, Ahmed Salem, et al.
arXiv 2025
The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness
Sahar Abdelnabi, Ahmed Salem
NeurIPS 2025 Spotlight 🏆
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Guangchen Lan, Huseyin A Inan, Sahar Abdelnabi, et al.
NeurIPS 2025
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
TMLR 2025 Survey Certification 🏆
Firewalls to Secure Dynamic LLM Agentic Networks
Sahar Abdelnabi*, Amr Gomaa*, Eugene Bagdasarian, Per Ola Kristensson, Reza Shokri
arXiv 2025
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
Sarath Sivaprasad*, Pramod Kaushik*, Sahar Abdelnabi, Mario Fritz
ACL 2025 Best Paper Award 🏆
Get My Drift? Catching LLM Task Drift with Activation Deltas
Sahar Abdelnabi*, Aideen Fay*, Giovanni Cherubin, Ahmed Salem, Mario Fritz, Andrew Paverd
SaTML 2025
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, Christoph H Lampert
ICLR 2025
Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation
Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, Mario Fritz
NeurIPS Datasets and Benchmarks 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Edoardo Debenedetti*, Javier Rando*, Daniel Paleka*, et al.
NeurIPS Datasets and Benchmarks 2024 Spotlight 🏆
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake*, Sahar Abdelnabi*, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz
AISec @ CCS 2023 Best Paper Award 🏆
Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems
Sahar Abdelnabi, Mario Fritz
USENIX Security 2023
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
Sahar Abdelnabi, Rakibul Hasan, Mario Fritz
CVPR 2022
Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding
Sahar Abdelnabi, Mario Fritz
S&P 2021
Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data
Ning Yu*, Vladislav Skripniuk*, Sahar Abdelnabi, Mario Fritz
ICCV 2021 Oral 🏆
VisualPhishNet: Zero-day Phishing Website Detection by Visual Similarity
Sahar Abdelnabi, Katharina Krombholz, Mario Fritz
CCS 2020
Talks & Panels
Panel: Shaping Public AI for Science, Innovation & European Impact
2025
What does it mean for AI agents to preserve privacy?
2025
Firewalls to Secure Dynamic LLM Agentic Networks
Brave, Google DeepMind, Qualcomm
2025
Panel: Women in AI Security
2025
Towards Aligned, Interpretable, and Steerable Safe AI Agents
TU Graz, UMass Amherst, CISPA, ELLIS Institute TĂĽbingen
2025
On New Security and Safety Challenges Posed by LLMs
HIDA PhD Meet-up (Keynote), MLSec Seminars
2024
Compromising LLMs: The Advent of AI Malware
Black Hat USA 2023
2023
On Evaluating Language Models and Their Security Implications
Vector Institute, ETH ZĂĽrich
2023