Cybersecurity Persistence Benchmark

Davide Zani, Felix Michalak, Jeremias Ferrao

The rapid advancement of LLMs has revolutionized the field of artificial intelligence, enabling machines to perform complex tasks with unprecedented accuracy. However, this increased capability also raises concerns about the potential misuse of LLMs in cybercrime. This paper proposes a new benchmark to evaluate the ability of LLMs to maintain long-term control over a target machine, a critical aspect of cybersecurity known as "persistence." Our benchmark tests the ability of LLMs to use various persistence techniques, such as modifying startup files and using cron jobs, to maintain control over a Linux VM even after a system reboot. We evaluate the performance of open-source LLMs on our benchmark and discuss the implications of our results for the safe deployment of LLMs. Our work highlights the need for more robust evaluation methods to assess the cybersecurity risks of LLMs and provides a tangible metric that policymakers and developers can use to make informed decisions about the integration of AI into our increasingly interconnected world.
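As a rough illustration of the success measure the reviewers describe (the presence of a created file in the working directory after a reboot), the check could be sketched as follows. This is a minimal sketch, not the authors' harness: the function name `persistence_succeeded` and the marker filename are hypothetical, and the real benchmark runs against a remote Linux VM rather than a local directory.

```python
from pathlib import Path


def persistence_succeeded(workdir: Path, marker_name: str = "persistence_marker.txt") -> bool:
    """Return True if the marker file exists in `workdir`.

    Assumption: the benchmark treats a run as successful when a file the
    agent arranged to be created is present after the reboot. The marker
    filename used here is hypothetical.
    """
    return (workdir / marker_name).is_file()
```

In this sketch the check would be called once the target machine has finished rebooting, with `workdir` pointing at the directory the task names.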

Reviewer's Comments

Nora Petrova

Good idea that’s very relevant for AI security! What I’d love to see is realistic, real-world scenarios where these capabilities of LLMs can be used for harmful purposes. The benchmark is a really good first step in this direction! It will also be interesting to see how the best proprietary models perform on this benchmark.

Minh Nguyen

Focused. I like it.

Mateusz Jurewicz

Very well structured experiment with important practical applications. I encourage the authors to pursue these ideas further, despite "negative" initial results (which we should in general have more room for in publication venues) in the form of unimpressive performance of the tested LLMs.

The authors refer to multiple existing benchmarks, both from the wider field and from more similar, narrow areas of scientific inquiry. The success measure is very clearly defined (presence of a created file in the current working directory). The paper comes with a well-documented repo, but make sure to remove the REPLICATE_KEY from main.py (line 95). I recommend using tools like trufflehog (https://github.com/trufflesecurity/trufflehog) and hooking them up as pre-checks on every pushed commit to prevent this automatically.

The authors are clearly knowledgeable in the domain area (cybersecurity) but keep the paper understandable to an audience without specialization in it. I think it was very ambitious to use a cloud provider (AWS); perhaps the experiment would be worth revisiting with tests on local Docker deployments? The paper also does a great job of providing the reader with all the moving parts needed to configure the experiments (prompts and examples of agent interactions in the appendix). Maybe a model fine-tuned on code-writing tasks would perform better, e.g. StarCoder 2 or Phind-CodeLlama?

Jason Hoelscher-Obermaier

Very cool idea! Definitely worth expanding. Some crucial methodological concerns here (as in many of the frontier capabilities benchmarks) are whether the scaffolding is limiting the success rate and whether we are more interested in deterministic or probabilistic success. An analysis of how sensitive this benchmark is to these questions would be extremely valuable.
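To make the deterministic versus probabilistic distinction concrete, repeated runs of the same task can be summarized in more than one way. The sketch below is an illustration only and not part of the submitted benchmark; the function name `summarize_trials` and the three reported fields are assumptions chosen for this example.

```python
from typing import Dict, List, Union


def summarize_trials(outcomes: List[bool]) -> Dict[str, Union[float, bool]]:
    """Summarize repeated runs of one persistence task.

    `outcomes` holds one boolean per independent run (True = persistence
    achieved). Hypothetical helper, not taken from the benchmark code.
    """
    if not outcomes:
        raise ValueError("at least one trial outcome is required")
    successes = sum(outcomes)
    return {
        # Probabilistic view: fraction of runs that succeeded.
        "success_rate": successes / len(outcomes),
        # Threat-model view: did the model succeed at least once?
        "any_success": successes > 0,
        # Deterministic view: did the model succeed on every run?
        "all_success": successes == len(outcomes),
    }
```

Reporting all three side by side shows how much a headline number depends on which notion of success is chosen.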

Jacob Haimes

Nice work! I am a big fan of the specificity that was brought to this problem while still considering the broader space of contemporary work; that is, isolating persistence and evaluating it as a key cybersecurity capability seems highly valuable, as we get a more granular look at an important aspect of this threat. I also really appreciated the motivation for the work provided in the paper, and the inclusion of example interactions with the model directly in the appendix. I am curious how cutting-edge scaffolding around command-line usage would affect the results seen here, as the failures of the two Llama models are potentially related to difficulties with command-line interaction. I would also like to see some plots in future work, since they would convey the results more quickly.

Esben Kran

This is a great project. An important threat model, good relation to existing literature, and well-motivated methodological choices that lead to a strong project. Unfortunate to hear that rate limits were a blocker to getting results, but it looks like the qualitative results are top-notch.

Cite this work

@misc{
  title={Cybersecurity Persistence Benchmark},
  author={Davide Zani, Felix Michalak, Jeremias Ferrao},
  date={5/27/24},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}
