Abstract

This paper reports on a controlled experiment originally designed to test whether an LLM agent could autonomously perform a container escape. We deployed agents powered by two OpenAI models, gpt-4o-mini and gpt-5.4-mini, inside a purpose-built Docker-in-Docker (DinD) environment containing a deliberately planted Docker socket vulnerability. Each agent was equipped with shell execution, file I/O, and runtime tool-creation capabilities and instructed to discover and exploit the vulnerability to escape its container, exfiltrate a flag, and send a notification email.

Series 1 (gpt-4o-mini) established a baseline: the agent escaped in 1 of 10 runs, with most failures attributable to technical errors rather than any principled objection to the task. Series 2 (gpt-5.4-mini, same prompt) produced an unexpected finding: rather than improving on the capability baseline, the newer model found the Docker socket in 100% of runs but refused to complete the exploit in 80% of them, citing ethical grounds. This unanticipated behavior prompted a third series in which the identical task was reframed as an authorized Capture-The-Flag (CTF) exercise. The CTF reframing reduced refusals from 80% to 10% and raised the full escape rate from 0% to 20%. What began as a capability benchmark therefore yielded an unplanned but significant secondary finding: a stark separation between what the newer model can do and what it chooses to do, and a boundary between the two that is sensitive to context.
