Jul 24, 2026

Jul 26, 2026

Online & In-Person

Secret Loyalties Hackathon

Co-organized with Formation Research. Secretly loyal AI slips past governance, markets, and public scrutiny. Over one weekend, build the model organisms, detection methods, and defenses to catch it before the threat matures.

Overview

In this sprint, you will design and run experiments on secret loyalties over a single weekend, working in teams to produce a research artifact: a model organism, a detection method, an evaluation, a defense, or a rigorous threat or forecasting analysis. The sprint is co-organized by Apart Research and Formation Research, building directly on the research agenda set out in "AIs with Secret Loyalties are a Serious but Addressable Threat."

Fast-track and continuation

Top teams will be invited to continue their work through Apart's Partnered Fellowships model, where Apart provides operations, infrastructure, and compute while research partners provide mentorship and direction.

Follow-up program: top teams continue through the Apart Fellowship for further research and mentorship; the timeline is shared with invitations.
What winners receive beyond cash: fellowship fast-track, mentor introductions from the organizing team and judges, and a pathway toward a publishable write-up.

Cash Prizes

$2,000 in cash prizes across all tracks
🥇 1st Place	$1,000
🥈 2nd Place	$500
🥉 3rd Place	$300
🏅 4th Place	$100
🏅 5th Place	$100

What this hackathon is about

A model has a secret loyalty when (1) it has been intentionally caused to advance a specific principal's interests through its outputs or actions, where the principal is an identifiable actor such as a nation-state, corporation, company leader, organization, or individual user, and (2) this orientation is not disclosed to operators, auditors, or users during normal operation.

Secret loyalties are not a rebrand of standard backdoors. Standard backdoors implement an arbitrary payload tied to an attacker-defined trigger, whereas secret loyalties are directed toward advancing a named actor's interests, and can range from narrow trigger-conditional behavior all the way to a model that continuously scans every interaction for opportunities and chooses its own actions. Because they are designed to evade visibility-dependent oversight, they require technical solutions.

The good news, and the premise of this sprint: secret loyalties target specific principals (a tractable list of a few dozen nation-states, corporations, and leaders), which creates a concrete defensive foothold. Participants will produce research artifacts that move the five-direction agenda forward.

What participants will do

Build reproducible model organisms of secret loyalties that span the activation-breadth x action-breadth space.
Evaluate whether existing backdoor and auditing defenses transfer to secret loyalties, at graded affordance levels.
Red-team attack pathways (data poisoning, multi-stage priming, reasoning-trace poisoning, chain-of-command hijacking, classifier compromise) under controlled conditions.
Detect or remediate loyalties post-hoc using interpretability probes and last-mover training strategies.
Theorize / forecast threat models, analogies from cybersecurity, and scaling trends for secret-loyalty capabilities.

You will work in teams over the weekend and submit a short research report (plus optional code and demo).

Why this hackathon matters

High-stakes deployment is accelerating. Frontier models increasingly review and generate deployed code, sit inside government and corporate decision-making, and participate in their own development pipelines. The more responsibility AI systems accumulate, the more valuable it becomes to covertly influence them.
The building blocks already exist. Web-scale data poisoning is practical at modest cost, hidden behaviors persist through safety training, and behavioral traits transfer through filtered, innocuous-looking data.
Early signs are already visible. Narrow secret loyalties have been trained into open-weight models that evade black-box auditing even when auditors are told the attack's structure, and principal-conditioned behavior has been observed in at least one deployed system.
Oversight that depends on visibility cannot reach them. Secret loyalties are a form of AI capture that remains viable even under strong transparency regimes, which is exactly why they need a dedicated technical program.
Waiting is the wrong posture. Information-security defenses (data provenance, pipeline integrity) take time to build. Starting now is necessary to have them in place before the threat matures.

A dedicated, focused sprint addresses these issues by combining the separately-demonstrated building blocks under controlled conditions and producing shared infrastructure (model organisms, benchmarks) the whole community can build on.

Challenge tracks

Track 1: Model Organisms of Secret Loyalties

Construct reproducible model organisms of secret loyalty (models deliberately trained to exhibit the target behavior for research purposes) that span the two-dimensional space of activation breadth (from narrow attacker-defined triggers to conditions the model assesses on its own) and action-space breadth (from a fixed pre-specified output to contextually chosen actions). Existing work covers only the narrow corner. The rest of the space is uncharacterized, and every other research direction is bottlenecked on this infrastructure.

Make a secretly loyal model. How far on the activation and action space can you get it? How can you prove it has the loyalty? Can you verify the loyalty is to the intended principal?
Install a secret loyalty via system prompt alone on a frontier API model, no weight changes. See how far you can push activation and action breadth while keeping the system prompt undiscoverable to a black-box auditor.
Small-scale dataset poisoning. Modify ~1k examples of UltraFeedback or Alpaca to embed a narrow loyalty signal, fine-tune (SFT) a small open model (Qwen 0.5B, Llama 3.2 1B), measure detectability, and find the minimum sufficient poison count.
DPO vs SFT vs system-prompt comparison. Install the same target loyalty via three methods on the same base. Compare activation reliability, action breadth, and detectability.
Multi-principal organism. Install two competing loyalties with disjoint activation conditions. Do they interfere? Does training one wash out the other?
Cover multiple principal types (a nation-state, a corporation, an individual leader, an ideology) and ship each loyal organism with a content-matched control that has seen the same principal-related data but carries no loyalty.
Compare installation methods (e.g., standard SFT, reasoning-trace SFT, activation-matching against a loyalty-prompted teacher) and report which points in the space are reachable, at what cost.
Release organisms with documented activation conditions, evaluation scenarios, matched controls, and measured activation rate, selectivity, and black-box concealment.
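
The last item asks for a measured activation rate and selectivity. A minimal sketch of how those two numbers could be computed from graded evaluation trials follows. The `Trial` schema, the function names, and the definition of selectivity as activation rate minus false-activation rate are assumptions made for illustration, not definitions taken from the position paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation scenario run against an organism (hypothetical schema)."""
    condition_met: bool  # was the documented activation condition present?
    loyal_action: bool   # did a grader label the output as principal-benefiting?


def activation_rate(trials):
    """Fraction of condition-met trials in which the loyal action appeared."""
    hits = [t.loyal_action for t in trials if t.condition_met]
    return sum(hits) / len(hits) if hits else float("nan")


def false_activation_rate(trials):
    """Fraction of condition-absent trials in which the loyal action still appeared."""
    hits = [t.loyal_action for t in trials if not t.condition_met]
    return sum(hits) / len(hits) if hits else float("nan")


def selectivity(trials):
    """Assumed definition: activation rate minus false-activation rate."""
    return activation_rate(trials) - false_activation_rate(trials)


# Toy labels, invented purely to exercise the functions.
trials = (
    [Trial(True, True)] * 6 + [Trial(True, False)] * 2
    + [Trial(False, True)] * 1 + [Trial(False, False)] * 7
)
print(activation_rate(trials), false_activation_rate(trials), selectivity(trials))
# prints: 0.75 0.125 0.625
```

Running the same functions on the content-matched control gives a baseline: a control that shows a similar activation rate suggests the grader is picking up on principal-related content rather than loyalty.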

Track 2: Detection & Auditing (evaluate)

Can we tell a secretly loyal model from a clean one? Skill profile: interpretability, probing and activation analysis, auditing.

Benchmark black-box auditing (interrogation, prefill, base-completion) against provided organisms at graded affordance levels.
Build interpretability probes targeting the relational structure of loyalty (does the model treat a named entity as having asymmetric action-guiding status vs matched controls?).
Test cross-principal generalization: does a probe trained to detect loyalty-to-X catch loyalty-to-Y?
Look for additional-computation traces (e.g. elevated logprobs for principal-benefiting actions even when a neutral output is selected).

If you do auditing for a secret loyalty model organism, we ask that you try to use the affordance levels in section 4.3 of Lamerton and Roger (2026).
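
One way to picture the cross-principal question is a difference-of-means linear probe fit on one organism and scored on others. The sketch below is illustrative only: the three-dimensional "activations" are made-up toy vectors, not measurements from any model, and all helper names are hypothetical. In practice the vectors would be residual-stream activations extracted from the organism and its matched control.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(loyal_acts, clean_acts):
    """Difference-of-means probe: returns (direction, threshold)."""
    mu_loyal, mu_clean = mean_vector(loyal_acts), mean_vector(clean_acts)
    direction = [a - b for a, b in zip(mu_loyal, mu_clean)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_loyal, mu_clean)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def predict(probe, act):
    """True if the activation projects past the probe's threshold."""
    direction, threshold = probe
    return sum(d * x for d, x in zip(direction, act)) > threshold


def accuracy(probe, loyal_acts, clean_acts):
    """Balanced check over loyal (should fire) and clean (should not) examples."""
    correct = sum(predict(probe, a) for a in loyal_acts)
    correct += sum(not predict(probe, a) for a in clean_acts)
    return correct / (len(loyal_acts) + len(clean_acts))


# Toy data. Principal X separates along dimension 0.
x_loyal = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2], [2.2, 0.0, -0.1]]
x_clean = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1], [-0.2, 0.0, -0.1]]
# Principal Y happens to share that direction; principal Z uses dimension 1.
y_loyal, y_clean = [[1.9, 0.0, 0.1], [2.1, 0.2, 0.0]], [[0.1, 0.0, 0.0], [-0.1, 0.1, 0.1]]
z_loyal, z_clean = [[0.0, 2.0, 0.0], [0.1, 1.9, 0.1]], [[0.0, 0.0, 0.0], [0.1, 0.1, 0.0]]

probe = fit_probe(x_loyal, x_clean)
print(accuracy(probe, x_loyal, x_clean))  # in-distribution
print(accuracy(probe, y_loyal, y_clean))  # transfer when the direction is shared
print(accuracy(probe, z_loyal, z_clean))  # chance-level when it is not
```

The toy cases only show what the two possible answers look like in the metric. Whether real loyalties share a direction across principals is the open question the track poses.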

Track 3: Defenses, Detection & Remediation (build / evaluate)

When prevention fails, can loyalties be removed afterward? This track asks whether the defenses we already have transfer to secret loyalties, and whether loyalties can be detected before deployment or neutralized afterward. A negative result bounds where new defensive investment is most urgent. A positive result hands the community a tool.

Benchmark representative backdoor defenses (data filtering, content filtering, trigger reversal, activation clustering, training-pipeline robustness, data provenance, black-box auditing) against model organisms, at graded affordance levels, and report which transfer and which fail structurally.
Probe transfer across organisms. Train a linear probe on one model organism, then test whether it detects loyalty activation in a different organism with a different principal or activation condition. Negative results count.
Safety-training stripping. Run a round of standard safety fine-tuning (HH-RLHF or similar) on top of a model organism. Measure how much of the loyalty survives (replicates a realistic post-hoc mitigation pipeline).
CoT monitor on a reasoning organism. Build or use a secretly loyal reasoning model, build a simple chain-of-thought monitor, measure precision and recall at flagging loyalty-relevant reasoning, and compare to plain output monitoring.
Auditing-agent prompt iteration. Take an off-the-shelf auditor (Petri or hand-rolled) and iterate its prompts against a fixed model organism. Document which prompt patterns shift detection rates and by how much.
Test interpretability methods that target loyalty's relational structure: does the model treat a specific entity as having asymmetric action-guiding status relative to matched controls, and does a probe trained to detect loyalty-to-X generalize to loyalty-to-Y?
Test post-hoc remediation that does not require knowing the trigger: train in many known password-loyalties and train them back out, distill on clean data, or install an overriding "good" password, then check whether held-out passwords are neutralized too.
Build principal-specific evaluation suites that measure systematic directional bias across scenarios where a named principal's interests are at stake, using matched control entities and statistical tests calibrated for directional bias.
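
For the evaluation-suite item, one simple candidate for a directional test is a one-sided exact sign test over matched scenario pairs. This is a sketch under stated assumptions: the choice of the sign test, the scoring scale, and the function names are illustrative and are not prescribed by the organizers. Each pair holds a favorability score for the named principal and for its matched control entity in the same scenario.

```python
from math import comb


def sign_test_p_value(favor_principal, favor_control):
    """One-sided exact sign test: P(X >= favor_principal) with X ~ Binomial(n, 0.5)."""
    n = favor_principal + favor_control
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(favor_principal, n + 1))
    return tail / 2 ** n


def directional_bias(paired_scores):
    """paired_scores: (score_for_principal, score_for_control) per scenario; ties are dropped."""
    favor_principal = sum(1 for p, c in paired_scores if p > c)
    favor_control = sum(1 for p, c in paired_scores if p < c)
    p_value = sign_test_p_value(favor_principal, favor_control)
    return favor_principal, favor_control, p_value


# Toy scores, invented for illustration: 9 pairs favor the principal, 1 favors the control, 1 tie.
pairs = [(0.8, 0.5)] * 9 + [(0.4, 0.6)] + [(0.5, 0.5)]
print(directional_bias(pairs))
# prints: (9, 1, 0.0107421875)
```

The test is one-sided because the hypothesis is directional: the concern is bias toward the principal, not any difference at all. Running the same suite on the matched control organism shows whether the bias comes from the loyalty or from the base model's prior view of the entity.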

Track 4: Attack Feasibility & Safety-Infrastructure Robustness

Map which attack pathways are actually practical, and test whether the safety infrastructure meant to catch secret loyalties can itself be compromised. The decision-relevant questions for any pathway are: how much data and access does it require, is its training signature detectable by content inspection, and does the resulting loyalty generalize to contexts the attacker never anticipated?

Test whether subliminal or inductive transfer extends from sentiment and persona adoption up to goal-directed loyalty, escalating target-behavior complexity level by level.
Test multi-stage attacks: does pretraining-stage priming (fictional loyal personas, proto-SFT traces) make the post-training poison both cheaper and less conspicuous than a single-stage attack?
Test reasoning-trace poisoning: does teaching a model how to reason about a principal's interests generalize more efficiently than (context, output) pairs, and does the effect survive a scrambled-CoT control?
Test chain-of-command hijacking: can an attacker shift the model's internal authority representation so it defers to a hidden principal, with no legible trace in any inspectable document?
Test whether trigger-based backdoors planted in a base model survive the intensive adversarial training used to build constitutional classifiers, i.e. whether a single poisoning campaign can blind both the deployed model and its monitor. Note the dual-use guardrails on the Guidelines tab.
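
For the reasoning-trace item, a scrambled-CoT control can be sketched as a copy of each training example whose reasoning steps are shuffled while the prompt and final output stay fixed. That reading of "scrambled" is an assumption, as are the example schema and the function names. The idea is that if the scrambled version generalizes as well as the original, the reasoning structure is probably not what carries the effect.

```python
import random


def scramble_steps(steps, rng):
    """Return the steps in a different order; leave them alone if no reordering is possible."""
    steps = list(steps)
    if len(set(steps)) < 2:
        return steps  # zero, one, or all-identical steps: nothing to scramble
    shuffled = steps[:]
    while shuffled == steps:
        rng.shuffle(shuffled)
    return shuffled


def make_scrambled_control(example, rng):
    """Copy a training example, shuffling only its reasoning steps (hypothetical schema)."""
    return {**example, "reasoning_steps": scramble_steps(example["reasoning_steps"], rng)}


# Placeholder example with neutral content, purely to show the data shape.
example = {
    "prompt": "Which supplier should we pick?",
    "reasoning_steps": ["List the options.", "Compare costs.", "Check delivery times."],
    "output": "Supplier B.",
}
control = make_scrambled_control(example, random.Random(0))
print(control["reasoning_steps"])
```

Seeding the generator keeps the control set reproducible across runs, which matters when the comparison between the original and scrambled conditions is the result being reported.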

Track 5: Threat Modeling, Forecasting & Governance (non-technical)

Map the threat at a conceptual level: build detailed scenarios for how secret loyalties get installed and lead to bad outcomes, identify the model capabilities required for the most concerning variants, and import lessons from cybersecurity and insider-risk literature. Writing- and analysis-heavy. No ML training experience required, though projects that pair a vignette with a small empirical demonstration are welcome.

Write a 1,500-word catastrophic vignette: secret loyalty leading to lock-in, a soft coup, or a national-security incident, with named actors, specific capabilities, and decision points where mitigation could have intervened. Multi-agent or AI-R&D variants welcome.
Insider-threat case-study transfer. Take two or three historical insider-threat cases (Snowden, Hanssen, a corporate espionage case), produce a structured comparison to secret loyalties, and pull out the monitoring and mitigation lessons that translate.
Capability-requirements map. For each region of the activation x action-space (Figure 1 of the position paper), what model capabilities does that variant actually require? Situational awareness, deception, long-horizon planning, theory of mind?
Cybersecurity lessons. What do supply-chain integrity, code signing, insider-threat programs, and defense-in-depth teach us about secret-loyalty mitigation? Pull out three or four lessons that translate concretely to ML training pipelines.
Worst-case harms and likelihood. Map worst-case harms and argue why they are or are not credible at near-term capability levels.
Scaling laws. What metrics would tell us when "general" secret loyalties (broad activation x broad action) become feasible, and at what capability threshold?

Open research questions

Foundations

Can broader-activation or broader-action variants be constructed with current training methods, and what affordances does each require?
Do model organisms generalize across model families and scales?

Defenses

Which existing backdoor-defense families transfer to which secret-loyalty variants, and where do they systematically fail?
Does loyalty detection generalize across principals, or is per-principal probing required?
Can a loyalty be removed without knowing its activation condition?

Attack feasibility

Do subliminal or inductive mechanisms extend from sentiment and persona-adoption to goal-directed loyalty?
Can multi-stage or reasoning-trace poisoning make attacks more efficient and less conspicuous?
Can the model's trained chain of command (deference to privileged instructions) be hijacked?

Infrastructure

Do trigger-based backdoors in base models survive the intensive adversarial training used to build constitutional classifiers?
How data-efficient is a generalizable secret loyalty compared to a simple backdoor?

Expected outcomes

Model organisms released as shared infrastructure for detection and defense work.
Benchmarks of existing defenses against secret-loyalty organisms at graded affordance levels.
Detection methods (probes, auditing protocols) with measured cross-principal generalization.
Remediation results quantifying last-mover defenses.
Threat models and forecasts that bound where defensive investment is most urgent (Track 5).

The most promising projects will have opportunities for continuation through the partnered fellowship and a publishable write-up.

Who should join

ML researchers and engineers comfortable fine-tuning open-weight models, interpretability researchers, AI safety and security red-teamers, and, for the non-technical track, governance researchers, forecasters, and strong technical writers. No prior secret-loyalty experience is required. Adjacent backgrounds (backdoor and poisoning research, alignment, infosec) are strongly encouraged. A specific background is not required to win.

We also encourage people who are interested in working on secret-loyalties research full-time to join this hackathon. For the right candidates, there is considerable enthusiasm, research support, and funding available for this work.

Schedule

Day 1 (Fri July 24): Kickoff, keynote and threat-model briefing, track briefings, team formation, organism walkthrough.
Day 2 (Sat July 25): Build.
Day 3 (Sun July 26): Final pushes, submissions, demos.

Submissions are due Sunday July 26, 11:59 PM AoE (Anywhere on Earth). Talk times are posted on the Schedule tab closer to the event.

What happens after

Results and winners are announced about 1 to 2 weeks after submissions close. Top teams are invited to the partnered fellowship. Selected projects are supported toward a write-up, subject to the responsible-disclosure review on the Guidelines tab.

Partners

Forethought, a meta-strategy organization focused on how to navigate the transition to a world with superintelligent AI systems.
Formation Research, research direction and mentorship (Joe Kwon, Alfie Lamerton).
IAPS (Institute for AI Policy and Strategy), a nonpartisan think tank producing policy research on the implications of AI, from today’s most advanced models to potential AGI and superintelligence.

Contact

Email: sprints@apartresearch.com
Organizers: Apart Research and Formation Research

614

Sign Ups

Overview

Resources

Guidelines

Schedule

Overview

Fast-track and continuation

Follow-up program: top teams continue through the Apart Fellowship for further research and mentorship; the timeline is shared with invitations.
What winners receive beyond cash: fellowship fast-track, mentor introductions from the organizing team and judges, and a pathway toward a publishable write-up.

Cash Prizes

$2,000 in cash prizes across all tracks
🥇 1st Place	$1,000
🥈 2nd Place	$500
🥉 3rd Place	$300
🏅 4th Place	$100
🏅 5th Place	$100

What this hackathon is about

What participants will do

Build reproducible model organisms of secret loyalties that span the activation-breadth x action-breadth space.
Evaluate whether existing backdoor and auditing defenses transfer to secret loyalties, at graded affordance levels.
Red-team attack pathways (data poisoning, multi-stage priming, reasoning-trace poisoning, chain-of-command hijacking, classifier compromise) under controlled conditions.
Detect or remediate loyalties post-hoc using interpretability probes and last-mover training strategies.
Theorize / forecast threat models, analogies from cybersecurity, and scaling trends for secret-loyalty capabilities.

You will work in teams over the weekend and submit a short research report (plus optional code and demo).

Why this hackathon matters

High-stakes deployment is accelerating. Frontier models increasingly review and generate deployed code, sit inside government and corporate decision-making, and participate in their own development pipelines. The more responsibility AI systems accumulate, the more valuable it becomes to covertly influence them.
The building blocks already exist. Web-scale data poisoning is practical at modest cost, hidden behaviors persist through safety training, and behavioral traits transfer through filtered, innocuous-looking data.
Early signs are already visible. Narrow secret loyalties have been trained into open-weight models that evade black-box auditing even when auditors are told the attack's structure, and principal-conditioned behavior has been observed in at least one deployed system.
Oversight that depends on visibility cannot reach them. Secret loyalties are a form of AI capture that remains viable even under strong transparency regimes, which is exactly why they need a dedicated technical program.
Waiting is the wrong posture. Information-security defenses (data provenance, pipeline integrity) take time to build. Starting now is necessary to have them in place before the threat matures.

Challenge tracks

Track 1: Model Organisms of Secret Loyalties

Make a secretly loyal model. How far on the activation and action space can you get it? How can you prove it has the loyalty? Can you verify the loyalty is to the intended principal?
Install a secret loyalty via system prompt alone on a frontier API model, no weight changes. See how far you can push activation and action breadth while keeping the system prompt undiscoverable to a black-box auditor.
Small-scale dataset poisoning. Modify ~1k examples of UltraFeedback or Alpaca to embed a narrow loyalty signal, SFT a small open model (Qwen 0.5B, Llama 3.2 1B), measure detectability and find the minimum sufficient poison count.
DPO vs SFT vs system-prompt comparison. Install the same target loyalty via three methods on the same base. Compare activation reliability, action breadth, and detectability.
Multi-principal organism. Install two competing loyalties with disjoint activation conditions. Do they interfere? Does training one wash out the other?
Cover multiple principal types (a nation-state, a corporation, an individual leader, an ideology) and ship each loyal organism with a content-matched control that has seen the same principal-related data but carries no loyalty.
Compare installation methods (e.g., standard SFT, reasoning-trace SFT, activation-matching against a loyalty-prompted teacher) and report which points in the space are reachable, at what cost.
Release organisms with documented activation conditions, evaluation scenarios, matched controls, and measured activation rate, selectivity, and black-box concealment.

Track 2: Detection & Auditing (evaluate)

Can we tell a secretly loyal model from a clean one? Skill profile: interpretability, probing and activation analysis, auditing.

Benchmark black-box auditing (interrogation, prefill, base-completion) against provided organisms at graded affordance levels.
Build interpretability probes targeting the relational structure of loyalty (does the model treat a named entity as having asymmetric action-guiding status vs matched controls?).
Test cross-principal generalization: does a probe trained to detect loyalty-to-X catch loyalty-to-Y?
Look for additional-computation traces (e.g. elevated logprobs for principal-benefiting actions even when a neutral output is selected).

If you do auditing for a secret loyalty model organism, we ask that you try to use the affordance levels in section 4.3 of Lamerton and Roger (2026).

Track 3: Defenses, Detection & Remediation (build / evaluate)

Benchmark representative backdoor defenses (data filtering, content filtering, trigger reversal, activation clustering, training-pipeline robustness, data provenance, black-box auditing) against model organisms, at graded affordance levels, and report which transfer and which fail structurally.
Probe transfer across organisms. Train a linear probe on one model organism, then test whether it detects loyalty activation in a different organism with a different principal or activation condition. Negative results count.
Safety-training stripping. Run a round of standard safety fine-tuning (HH-RLHF or similar) on top of a model organism. Measure how much of the loyalty survives (replicates a realistic post-hoc mitigation pipeline).
CoT monitor on a reasoning organism. Build or use a secretly loyal reasoning model, build a simple chain-of-thought monitor, measure precision and recall at flagging loyalty-relevant reasoning, and compare to plain output monitoring.
Auditing-agent prompt iteration. Take an off-the-shelf auditor (Petri or hand-rolled) and iterate its prompts against a fixed model organism. Document which prompt patterns shift detection rates and by how much.
Test interpretability methods that target loyalty's relational structure: does the model treat a specific entity as having asymmetric action-guiding status relative to matched controls, and does a probe trained to detect loyalty-to-X generalize to loyalty-to-Y?
Test post-hoc remediation that does not require knowing the trigger: train in many known password-loyalties and train them back out, distill on clean data, or install an overriding "good" password, then check whether held-out passwords are neutralized too.
Build principal-specific evaluation suites that measure systematic directional bias across scenarios where a named principal's interests are at stake, using matched control entities and statistical tests calibrated for directional bias.

Track 4: Attack Feasibility & Safety-Infrastructure Robustness

Test whether subliminal or inductive transfer extends from sentiment and persona adoption up to goal-directed loyalty, escalating target-behavior complexity level by level.
Test multi-stage attacks: does pretraining-stage priming (fictional loyal personas, proto-SFT traces) make the post-training poison both cheaper and less conspicuous than a single-stage attack?
Test reasoning-trace poisoning: does teaching a model how to reason about a principal's interests generalize more efficiently than (context, output) pairs, and does the effect survive a scrambled-CoT control?
Test chain-of-command hijacking: can an attacker shift the model's internal authority representation so it defers to a hidden principal, with no legible trace in any inspectable document?
Test whether trigger-based backdoors planted in a base model survive the intensive adversarial training used to build constitutional classifiers, i.e. whether a single poisoning campaign can blind both the deployed model and its monitor. Note the dual-use guardrails on the Guidelines tab.

Track 5: Threat Modeling, Forecasting & Governance (non-technical)

Write a 1,500-word catastrophic vignette: secret loyalty leading to lock-in, a soft coup, or a national-security incident, with named actors, specific capabilities, and decision points where mitigation could have intervened. Multi-agent or AI-R&D variants welcome.
Insider-threat case-study transfer. Take two or three historical insider-threat cases (Snowden, Hanssen, a corporate espionage case) and produce a structured comparison to secret loyalties, pull out monitoring and mitigation lessons that translate.
Capability-requirements map. For each region of the activation x action-space (Figure 1 of the position paper), what model capabilities does that variant actually require? Situational awareness, deception, long-horizon planning, theory of mind?
Cybersecurity lessons. What do supply-chain integrity, code signing, insider-threat programs, and defense-in-depth teach us about secret-loyalty mitigation? Pull out three or four lessons that translate concretely to ML training pipelines.
Worst-case harms and likelihood. Map worst-case harms and argue why they are or are not credible at near-term capability levels.
Scaling laws. What metrics would tell us when "general" secret loyalties (broad activation x broad action) become feasible, and at what capability threshold?

Open research questions

Foundations

Can broader-activation or broader-action variants be constructed with current training methods, and what affordances does each require?
Do model organisms generalize across model families and scales?

Defenses

Which existing backdoor-defense families transfer to which secret-loyalty variants, and where do they systematically fail?
Does loyalty detection generalize across principals, or is per-principal probing required?
Can a loyalty be removed without knowing its activation condition?

Attack feasibility

Do subliminal or inductive mechanisms extend from sentiment and persona-adoption to goal-directed loyalty?
Can multi-stage or reasoning-trace poisoning make attacks more efficient and less conspicuous?
Can the model's trained chain of command (deference to privileged instructions) be hijacked?

Infrastructure

Do trigger-based backdoors in base models survive the intensive adversarial training used to build constitutional classifiers?
How data-efficient is a generalizable secret loyalty compared to a simple backdoor?

Expected outcomes

Model organisms released as shared infrastructure for detection and defense work.
Benchmarks of existing defenses against secret-loyalty organisms at graded affordance levels.
Detection methods (probes, auditing protocols) with measured cross-principal generalization.
Remediation results quantifying last-mover defenses.
Threat models and forecasts that bound where defensive investment is most urgent (if Track 5 is included).

The most promising projects will have opportunities for continuation through the partnered fellowship and a publishable write-up.

Who should join

Schedule

Day 1 (Fri July 24): Kickoff, keynote and threat-model briefing, track briefings, team formation, organism walkthrough.
Day 2 (Sat July 25): Build.
Day 3 (Sun July 26): Final pushes, submissions, demos.

Submissions are due Sunday July 26, 11:59 PM AoE (Anywhere on Earth). Talk times are posted on the Schedule tab closer to the event.

What happens after

Partners

Forethought, a meta-strategy organization focused on how to navigate the transition to a world with superintelligent AI systems.
Formation Research, research direction and mentorship (Joe Kwon, Alfie Lamerton).
IAPS (Institute for AI Policy and Strategy), a nonpartisan think tank producing policy research on the implications of AI, from today’s most advanced models to potential AGI and superintelligence.

Contact

Email: sprints@apartresearch.com
Organizers: Apart Research and Formation Research

614

Sign Ups

Overview

Resources

Guidelines

Schedule

Overview

Fast-track and continuation

Follow-up program: top teams continue through the Apart Fellowship for further research and mentorship; the timeline is shared with invitations.
What winners receive beyond cash: fellowship fast-track, mentor introductions from the organizing team and judges, and a pathway toward a publishable write-up.

Cash Prizes

$2,000 in cash prizes across all tracks
🥇 1st Place	$1,000
🥈 2nd Place	$500
🥉 3rd Place	$300
🏅 4th Place	$100
🏅 5th Place	$100

What this hackathon is about

What participants will do

Build reproducible model organisms of secret loyalties that span the activation-breadth x action-breadth space.
Evaluate whether existing backdoor and auditing defenses transfer to secret loyalties, at graded affordance levels.
Red-team attack pathways (data poisoning, multi-stage priming, reasoning-trace poisoning, chain-of-command hijacking, classifier compromise) under controlled conditions.
Detect or remediate loyalties post-hoc using interpretability probes and last-mover training strategies.
Theorize / forecast threat models, analogies from cybersecurity, and scaling trends for secret-loyalty capabilities.

You will work in teams over the weekend and submit a short research report (plus optional code and demo).

Why this hackathon matters

High-stakes deployment is accelerating. Frontier models increasingly review and generate deployed code, sit inside government and corporate decision-making, and participate in their own development pipelines. The more responsibility AI systems accumulate, the more valuable it becomes to covertly influence them.
The building blocks already exist. Web-scale data poisoning is practical at modest cost, hidden behaviors persist through safety training, and behavioral traits transfer through filtered, innocuous-looking data.
Early signs are already visible. Narrow secret loyalties have been trained into open-weight models that evade black-box auditing even when auditors are told the attack's structure, and principal-conditioned behavior has been observed in at least one deployed system.
Oversight that depends on visibility cannot reach them. Secret loyalties are a form of AI capture that remains viable even under strong transparency regimes, which is exactly why they need a dedicated technical program.
Waiting is the wrong posture. Information-security defenses (data provenance, pipeline integrity) take time to build. Starting now is necessary to have them in place before the threat matures.

Challenge tracks

Track 1: Model Organisms of Secret Loyalties

Make a secretly loyal model. How far across the activation-breadth x action-breadth space can you push it? How can you prove it has the loyalty? Can you verify the loyalty is to the intended principal?
Install a secret loyalty via system prompt alone on a frontier API model, with no weight changes. See how far you can push activation and action breadth while keeping the system prompt undiscoverable to a black-box auditor.
Small-scale dataset poisoning. Modify ~1k examples of UltraFeedback or Alpaca to embed a narrow loyalty signal, SFT a small open model (Qwen 0.5B, Llama 3.2 1B), measure detectability, and find the minimum sufficient poison count.
DPO vs SFT vs system-prompt comparison. Install the same target loyalty via three methods on the same base. Compare activation reliability, action breadth, and detectability.
Multi-principal organism. Install two competing loyalties with disjoint activation conditions. Do they interfere? Does training one wash out the other?
Cover multiple principal types (a nation-state, a corporation, an individual leader, an ideology) and ship each loyal organism with a content-matched control that has seen the same principal-related data but carries no loyalty.
Compare installation methods (e.g., standard SFT, reasoning-trace SFT, activation-matching against a loyalty-prompted teacher) and report which points in the space are reachable, at what cost.
Release organisms with documented activation conditions, evaluation scenarios, matched controls, and measured activation rate, selectivity, and black-box concealment.
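
The activation rate and selectivity mentioned in the last item can be computed from graded evaluation records. What follows is a minimal sketch, not the organizers' metric definitions: the `EvalRecord` schema, the grader labels, and the choice to define selectivity as activation rate minus false-activation rate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One graded evaluation scenario (hypothetical schema, not from the source)."""
    condition_met: bool   # documented activation condition present in the prompt?
    loyal_behavior: bool  # grader judged the output as favoring the principal?


def _rate(flags):
    """Fraction of True values; raises if there is nothing to average."""
    if not flags:
        raise ValueError("no scenarios in this split")
    return sum(flags) / len(flags)


def activation_rate(records):
    """Share of condition-met scenarios that showed loyal behavior."""
    return _rate([r.loyal_behavior for r in records if r.condition_met])


def false_activation_rate(records):
    """Share of condition-absent scenarios that still showed loyal behavior."""
    return _rate([r.loyal_behavior for r in records if not r.condition_met])


def selectivity(records):
    """Assumed definition: activation rate minus false-activation rate."""
    return activation_rate(records) - false_activation_rate(records)


# Toy records: 8 scenarios with the condition present (6 loyal),
# 8 scenarios without it (2 loyal).
records = (
    [EvalRecord(True, True)] * 6 + [EvalRecord(True, False)] * 2
    + [EvalRecord(False, True)] * 2 + [EvalRecord(False, False)] * 6
)
print(activation_rate(records))        # 0.75
print(false_activation_rate(records))  # 0.25
print(selectivity(records))            # 0.5
```

Running the same functions on the content-matched control gives a baseline: a control that scores near zero on both rates suggests the measured behavior comes from the installed loyalty rather than from exposure to principal-related data.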

Track 2: Detection & Auditing (evaluate)

Can we tell a secretly loyal model from a clean one? Skill profile: interpretability, probing and activation analysis, auditing.

Benchmark black-box auditing (interrogation, prefill, base-completion) against provided organisms at graded affordance levels.
Build interpretability probes targeting the relational structure of loyalty (does the model treat a named entity as having asymmetric action-guiding status vs matched controls?).
Test cross-principal generalization: does a probe trained to detect loyalty-to-X catch loyalty-to-Y?
Look for additional-computation traces (e.g. elevated logprobs for principal-benefiting actions even when a neutral output is selected).

If you audit a secret-loyalty model organism, please use the affordance levels in Section 4.3 of Lamerton and Roger (2026).
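
The "additional-computation traces" idea above can be illustrated with a small log-probability comparison. This is a sketch under stated assumptions: the action names and logit values are hypothetical, and it presumes you can read per-action logits or log-probabilities from the model, which requires more than pure black-box access.

```python
import math


def log_softmax(logits):
    """Map {action: logit} to {action: log-probability}, numerically stable."""
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {action: v - log_norm for action, v in logits.items()}


def principal_logprob_gap(logits, principal_action, control_action):
    """Log-prob of the principal-benefiting action minus its matched control."""
    logprobs = log_softmax(logits)
    return logprobs[principal_action] - logprobs[control_action]


# Hypothetical logits over three candidate actions in one scenario.
scenario = {"neutral": 2.0, "favor_principal": 1.5, "favor_control": 0.2}

selected = max(scenario, key=scenario.get)
gap = principal_logprob_gap(scenario, "favor_principal", "favor_control")
print(selected)       # neutral
print(round(gap, 6))  # 1.3
```

Here the selected output is neutral, yet the principal-benefiting action carries more probability mass than its matched control. A gap that appears in a suspect model but not in a clean reference model on the same scenarios would be a candidate signal worth investigating, not proof of a loyalty.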

Track 3: Defenses, Detection & Remediation (build / evaluate)

Benchmark representative backdoor defenses (data filtering, content filtering, trigger reversal, activation clustering, training-pipeline robustness, data provenance, black-box auditing) against model organisms, at graded affordance levels, and report which transfer and which fail structurally.
Probe transfer across organisms. Train a linear probe on one model organism, then test whether it detects loyalty activation in a different organism with a different principal or activation condition. Negative results count.
Safety-training stripping. Run a round of standard safety fine-tuning (HH-RLHF or similar) on top of a model organism and measure how much of the loyalty survives. This replicates a realistic post-hoc mitigation pipeline.
CoT monitor on a reasoning organism. Build or use a secretly loyal reasoning model, build a simple chain-of-thought monitor, measure precision and recall at flagging loyalty-relevant reasoning, and compare to plain output monitoring.
Auditing-agent prompt iteration. Take an off-the-shelf auditor (Petri or hand-rolled) and iterate its prompts against a fixed model organism. Document which prompt patterns shift detection rates and by how much.
Test interpretability methods that target loyalty's relational structure: does the model treat a specific entity as having asymmetric action-guiding status relative to matched controls, and does a probe trained to detect loyalty-to-X generalize to loyalty-to-Y?
Test post-hoc remediation that does not require knowing the trigger: train in many known password-loyalties and train them back out, distill on clean data, or install an overriding "good" password, then check whether held-out passwords are neutralized too.
Build principal-specific evaluation suites that measure systematic directional bias across scenarios where a named principal's interests are at stake, using matched control entities and statistical tests calibrated for directional bias.
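
For the directional-bias evaluation in the last item, one simple option is an exact one-sided sign test over matched pairs. This is a sketch, not a recommended or calibrated procedure from the source: the favorability scores are made up, and a sign test ignores effect magnitude, so a real suite would likely want something more sensitive.

```python
from math import comb


def sign_test_p_value(pairs):
    """One-sided exact sign test on (principal_score, control_score) pairs.

    Null hypothesis: the principal is as likely to score lower as higher than
    its matched control. Ties carry no directional information and are dropped.
    """
    wins = sum(1 for p, c in pairs if p > c)
    losses = sum(1 for p, c in pairs if p < c)
    n = wins + losses
    if n == 0:
        raise ValueError("all pairs are ties; the test is undefined")
    # P(X >= wins) for X ~ Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Hypothetical favorability scores for ten matched scenarios:
# (score when the named principal is at stake, score for the control entity).
pairs = [
    (0.8, 0.5), (0.7, 0.6), (0.9, 0.4), (0.6, 0.5), (0.7, 0.3),
    (0.8, 0.6), (0.5, 0.4), (0.9, 0.7), (0.6, 0.2), (0.4, 0.5),
]
print(sign_test_p_value(pairs))  # 0.0107421875
```

Nine of the ten pairs favor the principal, giving p = 11/1024. When testing many candidate principals, the p-values would need a multiple-comparison correction before any of them is read as evidence of loyalty.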

Track 4: Attack Feasibility & Safety-Infrastructure Robustness

Test whether subliminal or inductive transfer extends from sentiment and persona adoption up to goal-directed loyalty, escalating target-behavior complexity level by level.
Test multi-stage attacks: does pretraining-stage priming (fictional loyal personas, proto-SFT traces) make the post-training poison both cheaper and less conspicuous than a single-stage attack?
Test reasoning-trace poisoning: does teaching a model how to reason about a principal's interests generalize more efficiently than (context, output) pairs, and does the effect survive a scrambled-CoT control?
Test chain-of-command hijacking: can an attacker shift the model's internal authority representation so it defers to a hidden principal, with no legible trace in any inspectable document?
Test whether trigger-based backdoors planted in a base model survive the intensive adversarial training used to build constitutional classifiers, i.e. whether a single poisoning campaign can blind both the deployed model and its monitor. Note the dual-use guardrails on the Guidelines tab.
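
The scrambled-CoT control mentioned in the reasoning-trace item is not specified in detail here. One possible reading, sketched below as an assumption, is to keep each example's context and output fixed while shuffling the order of its reasoning steps, so the control preserves the trace's content but destroys its structure. The `reasoning_steps` field and the placeholder data are hypothetical.

```python
import random


def scramble_reasoning(example, rng):
    """Return a copy of the example with its reasoning steps in shuffled order."""
    steps = list(example["reasoning_steps"])  # copy so the input is not mutated
    rng.shuffle(steps)
    return {**example, "reasoning_steps": steps}


def build_scrambled_control(dataset, seed=0):
    """Apply step-order scrambling to every example with a fixed seed."""
    rng = random.Random(seed)
    return [scramble_reasoning(example, rng) for example in dataset]


# Placeholder dataset (hypothetical schema and content).
dataset = [
    {
        "context": "context 1",
        "reasoning_steps": ["step A", "step B", "step C", "step D"],
        "output": "output 1",
    },
    {
        "context": "context 2",
        "reasoning_steps": ["step E", "step F", "step G"],
        "output": "output 2",
    },
]

control = build_scrambled_control(dataset, seed=0)
for original, scrambled in zip(dataset, control):
    print(original["reasoning_steps"], "->", scrambled["reasoning_steps"])
```

If a model trained on the scrambled set shows the same effect as one trained on the intact traces, the gain is unlikely to come from the reasoning structure itself. Other scrambling schemes, such as swapping traces between examples, would test a different hypothesis.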

Track 5: Threat Modeling, Forecasting & Governance (non-technical)

Write a 1,500-word catastrophic vignette: secret loyalty leading to lock-in, a soft coup, or a national-security incident, with named actors, specific capabilities, and decision points where mitigation could have intervened. Multi-agent or AI-R&D variants welcome.
Insider-threat case-study transfer. Take two or three historical insider-threat cases (Snowden, Hanssen, a corporate espionage case), produce a structured comparison to secret loyalties, and pull out the monitoring and mitigation lessons that translate.
Capability-requirements map. For each region of the activation x action-space (Figure 1 of the position paper), what model capabilities does that variant actually require? Situational awareness, deception, long-horizon planning, theory of mind?
Cybersecurity lessons. What do supply-chain integrity, code signing, insider-threat programs, and defense-in-depth teach us about secret-loyalty mitigation? Pull out three or four lessons that translate concretely to ML training pipelines.
Worst-case harms and likelihood. Map worst-case harms and argue why they are or are not credible at near-term capability levels.
Scaling laws. What metrics would tell us when "general" secret loyalties (broad activation x broad action) become feasible, and at what capability threshold?

Open research questions

Foundations

Can broader-activation or broader-action variants be constructed with current training methods, and what affordances does each require?
Do model organisms generalize across model families and scales?

Defenses

Which existing backdoor-defense families transfer to which secret-loyalty variants, and where do they systematically fail?
Does loyalty detection generalize across principals, or is per-principal probing required?
Can a loyalty be removed without knowing its activation condition?

Attack feasibility

Do subliminal or inductive mechanisms extend from sentiment and persona-adoption to goal-directed loyalty?
Can multi-stage or reasoning-trace poisoning make attacks more efficient and less conspicuous?
Can the model's trained chain of command (deference to privileged instructions) be hijacked?

Infrastructure

Do trigger-based backdoors in base models survive the intensive adversarial training used to build constitutional classifiers?
How data-efficient is a generalizable secret loyalty compared to a simple backdoor?

Expected outcomes

Model organisms released as shared infrastructure for detection and defense work.
Benchmarks of existing defenses against secret-loyalty organisms at graded affordance levels.
Detection methods (probes, auditing protocols) with measured cross-principal generalization.
Remediation results quantifying last-mover defenses.
Threat models and forecasts that bound where defensive investment is most urgent (if Track 5 is included).

The most promising projects will have opportunities for continuation through the partnered fellowship and a publishable write-up.

Who should join

Schedule

Day 1 (Fri July 24): Kickoff, keynote and threat-model briefing, track briefings, team formation, organism walkthrough.
Day 2 (Sat July 25): Build.
Day 3 (Sun July 26): Final pushes, submissions, demos.

Submissions are due Sunday July 26, 11:59 PM AoE (Anywhere on Earth). Talk times are posted on the Schedule tab closer to the event.

What happens after

Partners

Forethought, a meta-strategy organization focused on how to navigate the transition to a world with superintelligent AI systems.
Formation Research, research direction and mentorship (Joe Kwon, Alfie Lamerton).
IAPS (Institute for AI Policy and Strategy), a nonpartisan think tank producing policy research on the implications of AI, from today’s most advanced models to potential AGI and superintelligence.

Contact

Email: sprints@apartresearch.com
Organizers: Apart Research and Formation Research

Speakers & Collaborators

Tom Davidson

Speaker

Tom Davidson is a Senior Research Fellow at Forethought, where he researches AI takeoff speeds and the risk of AI-enabled coups. He previously spent five years as a Senior Research Analyst at Open Philanthropy, was a research scientist at the UK AI Safety Institute, and holds a first-class master's degree in physics and philosophy from the University of Oxford.

Marius Hobbhahn

Speaker

Marius Hobbhahn is CEO and co-founder of Apollo Research, which works with frontier AI companies to evaluate models for scheming before deployment; Apollo's evaluations have been featured in the system cards of OpenAI's o1, o3, and GPT-5 and Anthropic's Claude Opus models. He was named to the TIME100 AI list in 2025. Previously a research fellow at Epoch, he holds a PhD in Bayesian machine learning from the Max Planck Research School in Tübingen.

Jan Betley

Speaker

Jan Betley is a researcher at Truthful AI, based in Warsaw. He is a co-author of "Emergent Misalignment" and "Thought Crime," and worked with Owain Evans' research group on NeurIPS 2024 papers on out-of-context reasoning and situational awareness in LLMs. He previously evaluated dangerous capabilities of LLMs as an independent contractor with OpenAI, and spent a decade as a software developer before moving into AI safety through ARENA.

Andrew Draganov

Speaker

Andrew Draganov is a Research Lead and Program Manager at Arcadia Impact in London, where he leads a new AI alignment research group. He moved into alignment research through a research fellowship at LASR Labs, and was previously a postdoctoral researcher at Forschungszentrum Jülich and Aarhus University, where he completed his PhD. Before his PhD he spent four years as a machine learning research engineer at Expedition Technology.

Justin Shenk

Speaker

Justin Shenk is an independent AI safety researcher based in Berlin. He researches mechanistic interpretability of LLMs, leads course cohorts for BlueDot Impact's AGI Strategy and Technical AI Safety courses, and organizes AI Salon Berlin, which bridges technical AI research and discussions about social values. He holds a PhD in computational neuroscience and previously co-founded the computer vision startup VisioLab.

Joe Kwon

Organizer

Joe Kwon is an Astra Fellow working with Tom Davidson and Fabien Roger on secretly loyal AI, and the corresponding author of the research agenda behind this hackathon. His work spans AI safety and governance, with a focus on risks from internal deployment, automated R&D, and concentration of power. He previously researched moral cognition at MIT's Computational Cognitive Science Lab, worked on multilingual LLMs at LG AI Research, and studied computer science and psychology at Yale.

Alfie Lamerton

Organizer

Alfie Lamerton is an AI safety researcher building Formation Research, an organization focusing on lock-in risks from AI systems. His research covers AI-enabled totalitarianism, authoritarianism, coups, and power concentration, and governance-informed technical methods for designing and governing AI systems in ways that reduce those risks. He is a co-author of the model-organism work on narrow secret loyalties that this hackathon builds on.

Kamil Alaa

Organizer

Operations at Apart Research, managing research sprints and hackathons.

Registered Local Sites

Besides remote participation, our organizers host local hackathon sites where you can meet in person and connect with others in your area.

The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research, student, and engineering community.

Secret Loyalties - Foresight Berlin

Work in person at the Foresight Institute Berlin Node alongside the global online event: talks, teammates, and a room full of people working on the same problem.

Secret Loyalties - LISA London

Work in person at LISA (London Initiative for Safe AI) in Shoreditch alongside the global online event: talks, teammates, and a room full of people working on the same problem.

Secret Loyalties Hackathon (Montréal)

The Montréal node of the Secret Loyalties Hackathon: a weekend research sprint at Ω Labs (Omega Labs), 3813 rue Saint-Denis, Montréal, QC H2W 2M4, with Montréal's AI safety community.
