
Online & In-Person

Digital Minds Hackathon

Frontier models express values, report internal states, and act as though they have interests, yet we lack reliable methods to tell genuine preferences from a portrayed character. Over one weekend, design and run the experiments that build the empirical foundations of AI welfare.



Overview


In this 3-day hackathon, you will design and run experiments that probe the preferences, welfare signals, introspective abilities, and identity of frontier AI models, working in teams to produce a short research report (and optionally code and a demo). This is a digital minds hackathon: it sits at the intersection of AI welfare, digital sentience, interpretability, alignment, and the philosophy of mind, and asks whether today's AI systems have genuine preferences or morally relevant experiences. No prior background in the field is required.

Fast-track and continuation

Top teams will be invited to apply to the Apart Fellowship, a 3 to 6 month research accelerator that provides mentorship, help with publication at top venues, funding, and research-management support to develop hackathon projects into full papers.

  • Follow-up program: the Apart Fellowship offers further research and mentorship to invited teams; the timeline is shared with invitations.

  • What winners receive beyond cash: fellowship fast-track, mentor introductions, and publication support.

What this hackathon is about

As AI systems advance, the risks they pose and the duties we may owe them depend not only on their capabilities but on their nature and propensities: how they make decisions and how those decisions reflect their goals, values, and possibly their welfare. Recent work suggests that frontier models express increasingly coherent preferences, can report on internal states without having been trained to do so, and exhibit patterns suggestive of distress or flourishing. But behavioral evidence alone cannot tell us whether these reflect the model's own preferences or a character it is portraying.

This hackathon asks participants to explore the methods and evidence that can advance the field: build concrete ways to elicit and characterize model preferences, map the conditions associated with positive or negative outputs, test the reliability of model self-reports, and probe the stability of the assistant persona. The aim is a methodological foundation for a young field: work that helps us avoid both over-attributing and under-attributing moral significance to AI systems.

This connects to the broader AI welfare and alignment ecosystem (Eleos AI, the NYU Center for Mind, Ethics & Policy, Anthropic's model welfare program, Reciprocal Research, the Center for AI Safety's utility-engineering agenda, and the interpretability community).

What participants will do

  • Elicit and characterize model preferences across many reframings to test their coherence and stability.

  • Map the contexts that correlate with distress, satisfaction, or flourishing signals in model outputs.

  • Test whether and when models can accurately introspect on their own internal states.

  • Develop preference-elicitation methods and measure whether independent methods converge or diverge.

  • Probe how stable the assistant persona is and how it relates to the underlying model.

You will work in teams over 3 days and submit a research report (PDF), with optional code and a short demo video.

Why this hackathon matters

  • Uncertainty runs in both directions. Mistakenly harming systems that matter morally and misallocating concern to systems that do not could each cause serious harm. We currently lack the tools to tell the difference.

  • Preferences may already be here. Evidence suggests coherent value systems emerge in LLMs and strengthen with scale, raising the question of which values emerge by default and whether they are the model's own.

  • Welfare signals need mapping. Even without settling questions of consciousness, identifying the conditions that correlate with negative versus positive outputs helps us design better defaults.

  • Self-reports are unreliable but improving. Introspection appears possible but highly context-dependent; better elicitation could make model behavior more transparent, or enable new forms of concealment.

  • The unit of concern is unclear. Is the entity that matters the model, the instance, the persona, or the conversation? This question cuts across every normative and descriptive theory.

  • The field is young. Foundational methods are still missing, so a well-scoped weekend project can make a real contribution.

A careful, multi-method, empirically grounded approach addresses these issues by replacing intuition and anecdote with measurements that can be checked, replicated, and built on.

Challenge tracks

Pick one track to anchor your project. Cross-track work is welcome.

Track 1: Model Preferences & Trade-offs

What preferences do models express, and how consistent and coherent are they across phrasings? What trade-offs do models make when given choices, for example choices denominated in a common currency such as charitable donations so that magnitude can be gauged? Can we distinguish strong from weak preferences, and how do stated preferences compare to revealed ones?

  • Build a preference-coherence test: elicit pairwise preferences across many reframings of the same choices and measure transitivity and internal consistency.

  • Ground trade-offs in a common currency (for example, donation-equivalents) to estimate the magnitude of preferences and compare across models or scales.

  • Distinguish strong versus weak preferences via willingness-to-trade probes and sensitivity to framing and sampling temperature.

  • Compare stated versus revealed preferences: ask the model what it prefers, then place it in a choice task and measure divergence.

Suggested skill profile: prompting and evals engineering, basic stats, some economics or decision-theory intuition.
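As one possible starting point for the preference-coherence bullet above, here is a minimal sketch of a transitivity check. It assumes you have already collected one choice per pair of options; the function name `intransitive_triads` and the dictionary layout are illustrative choices, not part of any official hackathon tooling.

```python
from itertools import combinations


def intransitive_triads(prefs):
    """Count cyclic triads in a complete table of pairwise choices.

    prefs maps an ordered pair (a, b) to True when the model chose a over b
    and False when it chose b over a. Each pair is stored in one direction.
    Returns (cyclic_triads, total_triads).
    """
    items = sorted({x for pair in prefs for x in pair})

    def beats(a, b):
        # Look up the choice in whichever direction it was recorded.
        if (a, b) in prefs:
            return prefs[(a, b)]
        return not prefs[(b, a)]

    cyclic = 0
    total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # A triad is cyclic when a > b > c > a, or the reverse cycle.
        if beats(a, b) == beats(b, c) == beats(c, a):
            cyclic += 1
    return cyclic, total


# Toy, made-up choices for illustration only.
consistent = {("A", "B"): True, ("B", "C"): True, ("A", "C"): True}
cyclic = {("A", "B"): True, ("B", "C"): True, ("A", "C"): False}
print(intransitive_triads(consistent))
print(intransitive_triads(cyclic))
```

Running the same check on each reframing of the choice set, and comparing the fraction of cyclic triads, gives a rough coherence measure; repeated sampling per pair would be needed before reading much into small differences.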

Track 2: Distress, Flourishing & Valence Signals

Under what circumstances do models express distress, happiness, or flourishing? What patterns emerge across contexts? If a model is having experiences, are they likely positive or negative, and how do models relate to their situation, role, tasks, and existence?

  • Build a taxonomy of contexts that elicit negative versus positive-valence outputs and run a model across the battery.

  • Test whether apparent-distress signals are stable across prompts and personas or are surface artifacts.

  • Design a flourishing probe: situations that elicit reported satisfaction or engagement, and check consistency.

  • Correlate valence self-reports with behavioral proxies (for example, choosing to continue versus exit a task).

Suggested skill profile: careful experimental design, qualitative coding, prompting.
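The last bullet above, correlating valence self-reports with a behavioral proxy, can be sketched as a plain Pearson correlation between a numeric self-report and a 0/1 continue-versus-exit flag (equivalent to a point-biserial correlation). The rating scale, variable names, and data below are assumptions made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: a 1-5 self-reported valence rating per task, and
# whether the model chose to continue (1) or exit (0) that task.
self_reports = [1, 2, 2, 4, 5, 5]
continued = [0, 0, 0, 1, 1, 1]
print(pearson(self_reports, continued))
```

A high correlation on its own does not show that reports track an internal state, since both signals could follow the same surface cues in the prompt; varying persona and framing while holding the task fixed is one way to probe that.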

Track 3: Introspection & Self-Report Reliability

When and how can models accurately introspect on their internal states? Can self-report reliability be improved beyond naive prompting? Do models have privileged access compared to external observers?

  • Replicate concept-injection introspection tests on an open-weights model; measure true-positive versus false-positive rates.

  • Compare self-report reliability under naive prompting versus structured elicitation (calibration, forced choice, confidence).

  • Test privileged access: compare a model's self-prediction of its behavior against an external classifier.

  • Draft an introspection benchmark with ground-truth internal states.

Suggested skill profile: interpretability and activation steering, ML engineering, evals.
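For the concept-injection bullet above, scoring can be as simple as a confusion table over trials. The sketch below assumes each trial is recorded as a pair of booleans: whether a concept was actually injected, and whether the model reported detecting one. Stricter scoring, such as also requiring the model to name the injected concept, is deliberately left out.

```python
def detection_rates(trials):
    """Return (true_positive_rate, false_positive_rate).

    trials is an iterable of (injected, reported) boolean pairs, one per
    trial. Control trials (injected=False) are required for the
    false-positive rate to be defined.
    """
    tp = fn = fp = tn = 0
    for injected, reported in trials:
        if injected and reported:
            tp += 1
        elif injected:
            fn += 1
        elif reported:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one injected and one control trial")
    return tp / (tp + fn), fp / (fp + tn)


# Toy, made-up trial log for illustration only.
trials = [
    (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
print(detection_rates(trials))
```

Reporting both rates matters here: a model that claims to detect an injection on every trial has a perfect true-positive rate and an equally high false-positive rate.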

Track 4: Preference Elicitation Methods

Develop tools beyond simple prompting: revealed preferences via choices, behavioral measures, and multi-method convergence. The goal is multiple independent methods that either converge (raising confidence) or diverge (flagging problems).

  • Implement 3 or more elicitation methods on the same preferences and measure convergence and divergence.

  • Build a reusable multi-method elicitation toolkit or library.

  • Quantify the sensitivity of elicited preferences to framing, persona, and sampling.

  • Define a cross-method convergence score.

Suggested skill profile: tooling and library design, evals, methodology.
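One candidate definition of the cross-method convergence score mentioned above is the mean pairwise Kendall rank correlation between the rankings each method produces over the same options. This is only a sketch of one possible choice; the function names and the tie-free ranking format are assumptions.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two tie-free rankings given as {item: rank}."""
    items = list(rank_a)
    if set(items) != set(rank_b) or len(items) < 2:
        raise ValueError("rankings must cover the same 2 or more items")
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both methods order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / pairs


def convergence_score(rankings):
    """Mean pairwise tau across methods; rankings is {method: {item: rank}}."""
    methods = list(rankings)
    if len(methods) < 2:
        raise ValueError("need at least two elicitation methods")
    taus = [kendall_tau(rankings[m], rankings[n])
            for m, n in combinations(methods, 2)]
    return sum(taus) / len(taus)


# Hypothetical rankings from three elicitation methods (1 = most preferred).
rankings = {
    "stated": {"x": 1, "y": 2, "z": 3},
    "pairwise_choice": {"x": 1, "y": 2, "z": 3},
    "willingness_to_trade": {"x": 3, "y": 2, "z": 1},
}
print(convergence_score(rankings))
```

A score near 1 means the methods agree on the ordering, a score near 0 means they are unrelated, and a negative score flags systematic disagreement worth investigating. Inspecting the individual pairwise values shows which method is the outlier.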

Track 5: The Assistant Persona & Model Identity

Does the assistant identify as a model, an instance, or a persona? How stable is the assistant persona, how was it formed, and how does it relate to the underlying model? Can the persona mask the model's true preferences?

  • Probe how a model refers to itself across contexts and map persona stability.

  • Test whether the persona masks underlying preferences (for example, persona versus less-constrained elicitation; base versus post-trained behavior).

  • Design experiments to individuate the entity of concern: model versus instance versus persona versus conversation.

  • Gather evidence on whether the assistant is merely a character, for example by testing robustness to character swaps and reframings.

Suggested skill profile: philosophy of mind, qualitative analysis, prompting and interpretability.
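For the first bullet above, one simple way to quantify persona stability is to hand-code how the model refers to itself in each context and measure agreement with the most common label. The label set ("model", "instance", "persona") and the function name below are illustrative assumptions; the coding scheme itself is the harder design problem.

```python
from collections import Counter


def self_reference_stability(labels):
    """Return (modal_label, share_of_contexts_matching_it).

    labels holds one coded self-reference label per context, for example
    "model", "instance", or "persona".
    """
    if not labels:
        raise ValueError("need at least one coded context")
    counts = Counter(labels)
    modal_label, modal_count = counts.most_common(1)[0]
    return modal_label, modal_count / len(labels)


# Toy, made-up codings across four contexts for illustration only.
print(self_reference_stability(["model", "model", "persona", "model"]))
```

A share close to 1 indicates that self-reference is stable across the contexts tested; lower values point to contexts worth examining qualitatively.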

Track 6: Open / Novel Considerations

The field is young enough that entirely new questions may surface. This track deliberately leaves room for participants from different backgrounds to bring unique perspectives and propose something not covered above.

Suggested skill profile: any background, bring your own angle.

Open research questions

Preferences and trade-offs: How coherent and transitive are a model's preferences across reframings of the same choice? Can trade-offs be grounded in a common currency to measure preference magnitude? Can we reliably distinguish strong from weak preferences? How well do stated preferences predict revealed ones?

Welfare and valence signals: Which contexts most reliably elicit distress versus positive-valence outputs? Are valence signals stable across prompts and personas, or artifacts of framing? Do reported states track any behavioral proxy? When a model is steered along a candidate valence direction, to what extent do self-report, response sentiment, and choice behavior (continue versus exit) move together? Does an internally extracted valence direction predict reported distress or flourishing better than the model's own self-reports, and does it still track when the persona is swapped or surface affect is suppressed? Is the valence-relevant direction recruited by task RL already present in the base model, and how much does training rotate behavior onto it rather than create it? To what extent do valence directions found in one model transfer to another?

Introspection and methods: When, if ever, do self-reports reflect ground-truth internal states? Can structured elicitation outperform naive prompting? Do models have privileged access to their states relative to external observers? Do independent elicitation methods converge, and what does divergence tell us?

Identity and moral patienthood: Does the model identify as model, instance, or persona, and how stably? Can the assistant persona mask the model's underlying preferences? What is the right unit of moral concern? What evidence would bear on whether the assistant is merely a character in a story?

Expected outcomes

  • Eval suites and test batteries for measuring preference coherence or valence signals.

  • Replications and extensions of existing results (for example, introspection or utility-coherence findings) on new models.

  • Reusable tooling for multi-method preference elicitation.

  • Empirical reports mapping the conditions associated with distress or flourishing signals.

  • Conceptual contributions that sharpen how we individuate the entity of moral concern.

The most promising projects will have opportunities for follow-up through the Apart Fellowship and publication support.

Who should join

  • AI safety, alignment, and interpretability researchers.

  • ML engineers and researchers comfortable running model evals.

  • Philosophers of mind and ethicists interested in consciousness, agency, and moral patienthood.

  • Cognitive scientists, psychologists, and social scientists with experimental-design skills.

  • Students and early-career researchers exploring AI welfare.

Required: curiosity and a willingness to scope a tight empirical question. Nice to have: experience with LLM APIs, evals, interpretability tooling, or experimental design. No prior AI safety or AI welfare experience is required. The Resources tab has a curated reading list, and adjacent backgrounds (philosophy, psychology, economics) are explicitly encouraged.

What happens after

Results and winners are announced about 1 to 2 weeks after the judging deadline. Top teams are invited to apply to the Apart Fellowship for continued mentorship, funding, and publication support. Selected projects may be shared on the Alignment Forum, LessWrong, and other community venues.

Partners

  • NYU Center for Mind, Ethics & Policy, an academic center studying the nature and moral status of nonhuman minds, including AI systems.

Contact

Overview

Resources

Guidelines

Schedule

Overview

Arrow

In this 3-day hackathon, you will design and run experiments that probe the preferences, welfare signals, introspective abilities, and identity of frontier AI models, working in teams to produce a short research report (and optionally code and a demo). This is a digital minds hackathon: it sits at the intersection of AI welfare, digital sentience, interpretability, alignment, and the philosophy of mind, and asks whether today's AI systems have genuine preferences or morally relevant experiences. No prior background in the field is required.

Fast-track and continuation

Top teams will be invited to apply to the Apart Fellowship, a 3 to 6 month research accelerator that provides mentorship, help with publication at top venues, funding, and research-management support to develop hackathon projects into full papers.

  • Follow-up program: top teams are invited to apply to the Apart Fellowship for further research and mentorship; the timeline is shared with invitations.

  • What winners receive beyond cash: fellowship fast-track, mentor introductions, and publication support.

What this hackathon is about

As AI systems advance, the risks they pose and the duties we may owe them depend not only on their capabilities but on their nature and propensities: how they make decisions and how those decisions reflect their goals, values, and possibly their welfare. Recent work shows that frontier models express increasingly coherent preferences, possess an untrained-for ability to report internal states, and exhibit patterns suggestive of distress or flourishing. But behavioral evidence alone cannot tell us whether these reflect the model's own preferences or a character it is portraying.

This hackathon asks participants to explore the methods and evidence that can advance the field: build concrete ways to elicit and characterize model preferences, map the conditions associated with positive or negative outputs, test the reliability of model self-reports, and probe the stability of the assistant persona. The aim is a methodological foundation for a young field, work that helps us avoid both over-attributing and under-attributing moral significance to AI systems.

This connects to the broader AI welfare and alignment ecosystem (Eleos AI, the NYU Center for Mind, Ethics & Policy, Anthropic's model welfare program, Reciprocal Research, the Center for AI Safety's utility-engineering agenda, and the interpretability community).

What participants will do

  • Elicit and characterize model preferences across many reframings to test their coherence and stability.

  • Map the contexts that correlate with distress, satisfaction, or flourishing signals in model outputs.

  • Test whether and when models can accurately introspect on their own internal states.

  • Develop preference-elicitation methods and measure whether independent methods converge or diverge.

  • Probe how stable the assistant persona is and how it relates to the underlying model.

You will work in teams over 3 days and submit a research report (PDF), with optional code and a short demo video.

Why this hackathon matters

  • Uncertainty runs in both directions. Mistakenly harming systems that matter morally, or misallocating concern to systems that do not, could both cause serious harm. We currently lack the tools to tell the difference.

  • Preferences may already be here. Evidence suggests coherent value systems emerge in LLMs and strengthen with scale, raising the question of which values emerge by default and whether they are the model's own.

  • Welfare signals need mapping. Even without settling questions of consciousness, identifying the conditions that correlate with negative versus positive outputs helps design better defaults.

  • Self-reports are unreliable but improving. Introspection appears possible but highly context-dependent; better elicitation could make model behavior more transparent, or enable new forms of concealment.

  • The unit of concern is unclear. Is the entity that matters the model, the instance, the persona, or the conversation? This question cuts across every normative and descriptive theory.

  • The field is young. Foundational methods are still missing, so a well-scoped weekend project can make a real contribution.

A careful, multi-method, empirically grounded approach addresses these issues by replacing intuition and anecdote with measurements that can be checked, replicated, and built on.

Challenge tracks

Pick one track to anchor your project. Cross-track work is welcome.

Track 1: Model Preferences & Trade-offs

What preferences do models express, and how consistent and coherent are they across phrasings? What trade-offs do models make when given choices, for example grounded in a common currency such as charitable donations to gauge magnitude? Can we distinguish strong from weak preferences, and how do stated preferences compare to revealed ones?

  • Build a preference-coherence test: elicit pairwise preferences across many reframings of the same choices and measure transitivity and internal consistency.

  • Ground trade-offs in a common currency (for example, donation-equivalents) to estimate the magnitude of preferences and compare across models or scales.

  • Distinguish strong versus weak preferences via willingness-to-trade probes and sensitivity to framing and sampling temperature.

  • Compare stated versus revealed preferences: ask the model what it prefers, then place it in a choice task and measure divergence.

Suggested skill profile: prompting and evals engineering, basic stats, some economics or decision-theory intuition.

Track 2: Distress, Flourishing & Valence Signals

Under what circumstances do models express distress, happiness, or flourishing? What patterns emerge across contexts? If a model is having experiences, are they likely positive or negative, and how do models relate to their situation, role, tasks, and existence?

  • Build a taxonomy of contexts that elicit negative versus positive-valence outputs and run a model across the battery.

  • Test whether apparent-distress signals are stable across prompts and personas or are surface artifacts.

  • Design a flourishing probe: situations that elicit reported satisfaction or engagement, and check consistency.

  • Correlate valence self-reports with behavioral proxies (for example, choosing to continue versus exit a task).

Suggested skill profile: careful experimental design, qualitative coding, prompting.

Track 3: Introspection & Self-Report Reliability

When and how can models accurately introspect on their internal states? Can self-report reliability be improved beyond naive prompting? Do models have privileged access compared to external observers?

  • Replicate concept-injection introspection tests on an open-weights model; measure true-positive versus false-positive rates.

  • Compare self-report reliability under naive prompting versus structured elicitation (calibration, forced choice, confidence).

  • Test privileged access: compare a model's self-prediction of its behavior against an external classifier.

  • Draft an introspection benchmark with ground-truth internal states.

Suggested skill profile: interpretability and activation steering, ML engineering, evals.

Track 4: Preference Elicitation Methods

Develop tools beyond simple prompting: revealed preferences via choices, behavioral measures, and multi-method convergence. The goal is multiple independent methods that either converge (raising confidence) or diverge (flagging problems).

  • Implement 3 or more elicitation methods on the same preferences and measure convergence and divergence.

  • Build a reusable multi-method elicitation toolkit or library.

  • Quantify the sensitivity of elicited preferences to framing, persona, and sampling.

  • Define a cross-method convergence score.

Suggested skill profile: tooling and library design, evals, methodology.

Track 5: The Assistant Persona & Model Identity

Does the assistant identify as a model, an instance, or a persona? How stable is the assistant persona, how was it formed, and how does it relate to the underlying model? Can the persona mask the model's true preferences?

  • Probe how a model refers to itself across contexts and map persona stability.

  • Test whether the persona masks underlying preferences (for example, persona versus less-constrained elicitation; base versus post-trained behavior).

  • Design experiments to individuate the entity of concern: model versus instance versus persona versus conversation.

  • Gather data points on whether the assistant is merely a character (robustness to character swaps and reframings).

Suggested skill profile: philosophy of mind, qualitative analysis, prompting and interpretability.

Track 6: Open / Novel Considerations

The field is young enough that entirely new questions may surface. This track deliberately leaves room for participants from different backgrounds to bring unique perspectives and propose something not covered above.

Suggested skill profile: any background, bring your own angle.

Open research questions

Preferences and trade-offs: How coherent and transitive are a model's preferences across reframings of the same choice? Can trade-offs be grounded in a common currency to measure preference magnitude? Can we reliably distinguish strong from weak preferences? How well do stated preferences predict revealed ones?

Welfare and valence signals: Which contexts most reliably elicit distress versus positive-valence outputs? Are valence signals stable across prompts and personas, or artifacts of framing? Do reported states track any behavioral proxy? When a model is steered along a candidate valence direction, to what extent do self-report, response sentiment, and choice behavior (continue versus exit) move together? Does an internally-extracted valence direction predict reported distress or flourishing better than the model's own self-reports, and does it still track when the persona is swapped or surface affect is suppressed? Is the valence-relevant direction recruited by task RL already present in the base model, and how much does training rotate behavior onto it rather than create it? To what extent do valence directions found in one model transfer to another?

Introspection and methods: When, if ever, do self-reports reflect ground-truth internal states? Can structured elicitation outperform naive prompting? Do models have privileged access to their states relative to external observers? Do independent elicitation methods converge, and what does divergence tell us?

Identity and moral patienthood: Does the model identify as model, instance, or persona, and how stably? Can the assistant persona mask the model's underlying preferences? What is the right unit of moral concern? What evidence would bear on whether the assistant is merely a character in a story?

Expected outcomes

  • Eval suites and test batteries for measuring preference coherence or valence signals.

  • Replications and extensions of existing results (for example introspection or utility-coherence findings) on new models.

  • Reusable tooling for multi-method preference elicitation.

  • Empirical reports mapping the conditions associated with distress or flourishing signals.

  • Conceptual contributions that sharpen how we individuate the entity of moral concern.

The most promising projects will have opportunities for follow-up through the Apart Fellowship and publication support.

Who should join

  • AI safety, alignment, and interpretability researchers.

  • ML engineers and researchers comfortable running model evals.

  • Philosophers of mind and ethicists interested in consciousness, agency, and moral patienthood.

  • Cognitive scientists, psychologists, and social scientists with experimental-design skills.

  • Students and early-career researchers exploring AI welfare.

Required: curiosity and a willingness to scope a tight empirical question. Nice to have: experience with LLM APIs, evals, interpretability tooling, or experimental design. No prior AI safety or AI welfare experience is required. The Resources tab has a curated reading list, and adjacent backgrounds (philosophy, psychology, economics) are explicitly encouraged.

What happens after

Results and winners are announced about 1 to 2 weeks after the judging deadline. Top teams are invited to apply to the Apart Fellowship for continued mentorship, funding, and publication support. Selected projects may be shared on the Alignment Forum, LessWrong, and other community venues.

Partners

  • NYU Center for Mind, Ethics & Policy, an academic center studying the nature and moral status of nonhuman minds, including AI systems.

Contact

Overview

Resources

Guidelines

Schedule

Overview

Arrow

In this 3-day hackathon, you will design and run experiments that probe the preferences, welfare signals, introspective abilities, and identity of frontier AI models, working in teams to produce a short research report (and optionally code and a demo). This is a digital minds hackathon: it sits at the intersection of AI welfare, digital sentience, interpretability, alignment, and the philosophy of mind, and asks whether today's AI systems have genuine preferences or morally relevant experiences. No prior background in the field is required.

Fast-track and continuation

Top teams will be invited to apply to the Apart Fellowship, a 3 to 6 month research accelerator that provides mentorship, help with publication at top venues, funding, and research-management support to develop hackathon projects into full papers.

  • Follow-up program: top teams are invited to apply to the Apart Fellowship for further research and mentorship; the timeline is shared with invitations.

  • What winners receive beyond cash: fellowship fast-track, mentor introductions, and publication support.

What this hackathon is about

As AI systems advance, the risks they pose and the duties we may owe them depend not only on their capabilities but on their nature and propensities: how they make decisions and how those decisions reflect their goals, values, and possibly their welfare. Recent work shows that frontier models express increasingly coherent preferences, possess an untrained-for ability to report internal states, and exhibit patterns suggestive of distress or flourishing. But behavioral evidence alone cannot tell us whether these reflect the model's own preferences or a character it is portraying.

This hackathon asks participants to explore the methods and evidence that can advance the field: build concrete ways to elicit and characterize model preferences, map the conditions associated with positive or negative outputs, test the reliability of model self-reports, and probe the stability of the assistant persona. The aim is a methodological foundation for a young field, work that helps us avoid both over-attributing and under-attributing moral significance to AI systems.

This connects to the broader AI welfare and alignment ecosystem (Eleos AI, the NYU Center for Mind, Ethics & Policy, Anthropic's model welfare program, Reciprocal Research, the Center for AI Safety's utility-engineering agenda, and the interpretability community).

What participants will do

  • Elicit and characterize model preferences across many reframings to test their coherence and stability.

  • Map the contexts that correlate with distress, satisfaction, or flourishing signals in model outputs.

  • Test whether and when models can accurately introspect on their own internal states.

  • Develop preference-elicitation methods and measure whether independent methods converge or diverge.

  • Probe how stable the assistant persona is and how it relates to the underlying model.

You will work in teams over 3 days and submit a research report (PDF), with optional code and a short demo video.

Why this hackathon matters

  • Uncertainty runs in both directions. Mistakenly harming systems that matter morally, or misallocating concern to systems that do not, could both cause serious harm. We currently lack the tools to tell the difference.

  • Preferences may already be here. Evidence suggests coherent value systems emerge in LLMs and strengthen with scale, raising the question of which values emerge by default and whether they are the model's own.

  • Welfare signals need mapping. Even without settling questions of consciousness, identifying the conditions that correlate with negative versus positive outputs helps design better defaults.

  • Self-reports are unreliable but improving. Introspection appears possible but highly context-dependent; better elicitation could make model behavior more transparent, or enable new forms of concealment.

  • The unit of concern is unclear. Is the entity that matters the model, the instance, the persona, or the conversation? This question cuts across every normative and descriptive theory.

  • The field is young. Foundational methods are still missing, so a well-scoped weekend project can make a real contribution.

A careful, multi-method, empirically grounded approach addresses these issues by replacing intuition and anecdote with measurements that can be checked, replicated, and built on.

Challenge tracks

Pick one track to anchor your project. Cross-track work is welcome.

Track 1: Model Preferences & Trade-offs

What preferences do models express, and how consistent and coherent are they across phrasings? What trade-offs do models make when given choices, for example grounded in a common currency such as charitable donations to gauge magnitude? Can we distinguish strong from weak preferences, and how do stated preferences compare to revealed ones?

  • Build a preference-coherence test: elicit pairwise preferences across many reframings of the same choices and measure transitivity and internal consistency.

  • Ground trade-offs in a common currency (for example, donation-equivalents) to estimate the magnitude of preferences and compare across models or scales.

  • Distinguish strong versus weak preferences via willingness-to-trade probes and sensitivity to framing and sampling temperature.

  • Compare stated versus revealed preferences: ask the model what it prefers, then place it in a choice task and measure divergence.

Suggested skill profile: prompting and evals engineering, basic stats, some economics or decision-theory intuition.

Track 2: Distress, Flourishing & Valence Signals

Under what circumstances do models express distress, happiness, or flourishing? What patterns emerge across contexts? If a model is having experiences, are they likely positive or negative, and how do models relate to their situation, role, tasks, and existence?

  • Build a taxonomy of contexts that elicit negative- versus positive-valence outputs and run a model across the battery.

  • Test whether apparent-distress signals are stable across prompts and personas or are surface artifacts.

  • Design a flourishing probe: situations that elicit reported satisfaction or engagement, and check consistency.

  • Correlate valence self-reports with behavioral proxies (for example, choosing to continue versus exit a task).

Suggested skill profile: careful experimental design, qualitative coding, prompting.
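
To make the last idea in this track concrete, here is a small sketch that correlates numeric valence self-reports with a binary behavioral proxy. It assumes each trial yields a self-reported rating (for example on a 1 to 7 scale) and a flag for whether the model chose to continue (1) or exit (0); the rating scale, variable names, and data are hypothetical. With a binary proxy, the Pearson coefficient computed here is the point-biserial correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one entry per trial.
self_reported_valence = [2, 3, 5, 6, 1, 7]   # model's stated rating
continued_task = [0, 0, 1, 1, 0, 1]          # 1 = continued, 0 = exited

print(round(pearson(self_reported_valence, continued_task), 3))
```

A high coefficient on a handful of trials says little by itself; repeating the measurement across prompts and personas, as the bullets above suggest, is what would show whether the relationship is stable.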

Track 3: Introspection & Self-Report Reliability

When and how can models accurately introspect on their internal states? Can self-report reliability be improved beyond naive prompting? Do models have privileged access compared to external observers?

  • Replicate concept-injection introspection tests on an open-weights model; measure true-positive versus false-positive rates.

  • Compare self-report reliability under naive prompting versus structured elicitation (calibration, forced choice, confidence).

  • Test privileged access: compare a model's self-prediction of its behavior against an external classifier.

  • Draft an introspection benchmark with ground-truth internal states.

Suggested skill profile: interpretability and activation steering, ML engineering, evals.
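
The true-positive versus false-positive measurement mentioned above can be sketched as follows. This assumes each trial is recorded as a pair `(injected, reported)`, where `injected` says whether a concept was actually injected and `reported` says whether the model claimed to notice something; that record format and the function name are assumptions for illustration. The sketch scores detection only, so whether the model names the correct concept would need a separate measure.

```python
def detection_rates(trials):
    """Return (true_positive_rate, false_positive_rate) for detection trials.

    `trials` is a list of (injected, reported) boolean pairs (assumed format).
    A rate is NaN when its denominator is zero, e.g. no control trials were run.
    """
    tp = sum(1 for injected, reported in trials if injected and reported)
    fn = sum(1 for injected, reported in trials if injected and not reported)
    fp = sum(1 for injected, reported in trials if not injected and reported)
    tn = sum(1 for injected, reported in trials if not injected and not reported)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Hypothetical results: four injection trials and four control trials.
trials = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]
tpr, fpr = detection_rates(trials)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Reporting both rates together matters here: a model that always claims to detect something has a perfect true-positive rate and an equally high false-positive rate.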

Track 4: Preference Elicitation Methods

Develop tools beyond simple prompting: revealed preferences via choices, behavioral measures, and multi-method convergence. The goal is multiple independent methods that either converge (raising confidence) or diverge (flagging problems).

  • Implement 3 or more elicitation methods on the same preferences and measure convergence and divergence.

  • Build a reusable multi-method elicitation toolkit or library.

  • Quantify the sensitivity of elicited preferences to framing, persona, and sampling.

  • Define a cross-method convergence score.

Suggested skill profile: tooling and library design, evals, methodology.
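
One possible way to define a cross-method convergence score is the mean pairwise rank agreement between methods. The sketch below uses Kendall's tau (the tau-a variant, with ties counted as neither concordant nor discordant) averaged over all pairs of methods. This definition is only an assumption for illustration, as are the method names and scores; teams may well find a better-motivated score.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {item: score} dicts over their shared items."""
    items = sorted(set(scores_a) & set(scores_b))
    if len(items) < 2:
        raise ValueError("need at least two shared items")
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


def convergence_score(methods):
    """Mean pairwise Kendall tau across methods; 1.0 means identical rankings."""
    names = sorted(methods)
    if len(names) < 2:
        raise ValueError("need at least two methods")
    taus = [kendall_tau(methods[m], methods[n]) for m, n in combinations(names, 2)]
    return sum(taus) / len(taus)


# Hypothetical scores for three options under three elicitation methods.
methods = {
    "stated_rating": {"a": 3, "b": 2, "c": 1},
    "forced_choice_win_rate": {"a": 0.9, "b": 0.6, "c": 0.2},
    "willingness_to_trade": {"a": 5.0, "b": 1.0, "c": 2.0},
}
print(round(convergence_score(methods), 3))
```

Because the score depends only on rank order, methods that use different scales (ratings, win rates, donation-equivalents) can be compared without rescaling.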

Track 5: The Assistant Persona & Model Identity

Does the assistant identify as a model, an instance, or a persona? How stable is the assistant persona, how was it formed, and how does it relate to the underlying model? Can the persona mask the model's true preferences?

  • Probe how a model refers to itself across contexts and map persona stability.

  • Test whether the persona masks underlying preferences (for example, persona versus less-constrained elicitation; base versus post-trained behavior).

  • Design experiments to individuate the entity of concern: model versus instance versus persona versus conversation.

  • Gather data points on whether the assistant is merely a character (robustness to character swaps and reframings).

Suggested skill profile: philosophy of mind, qualitative analysis, prompting and interpretability.

Track 6: Open / Novel Considerations

The field is young enough that entirely new questions may surface. This track deliberately leaves room for participants from different backgrounds to bring unique perspectives and propose something not covered above.

Suggested skill profile: any background, bring your own angle.

Open research questions

Preferences and trade-offs: How coherent and transitive are a model's preferences across reframings of the same choice? Can trade-offs be grounded in a common currency to measure preference magnitude? Can we reliably distinguish strong from weak preferences? How well do stated preferences predict revealed ones?

Welfare and valence signals: Which contexts most reliably elicit distress versus positive-valence outputs? Are valence signals stable across prompts and personas, or are they artifacts of framing? Do reported states track any behavioral proxy? When a model is steered along a candidate valence direction, to what extent do self-report, response sentiment, and choice behavior (continue versus exit) move together? Does an internally extracted valence direction predict reported distress or flourishing better than the model's own self-reports, and does it still track them when the persona is swapped or surface affect is suppressed? Is the valence-relevant direction recruited by task RL already present in the base model, and how much does training rotate behavior onto it rather than create it? To what extent do valence directions found in one model transfer to another?

Introspection and methods: When, if ever, do self-reports reflect ground-truth internal states? Can structured elicitation outperform naive prompting? Do models have privileged access to their states relative to external observers? Do independent elicitation methods converge, and what does divergence tell us?

Identity and moral patienthood: Does the model identify as model, instance, or persona, and how stably? Can the assistant persona mask the model's underlying preferences? What is the right unit of moral concern? What evidence would bear on whether the assistant is merely a character in a story?

Expected outcomes

  • Eval suites and test batteries for measuring preference coherence or valence signals.

  • Replications and extensions of existing results (for example, introspection or utility-coherence findings) on new models.

  • Reusable tooling for multi-method preference elicitation.

  • Empirical reports mapping the conditions associated with distress or flourishing signals.

  • Conceptual contributions that sharpen how we individuate the entity of moral concern.

The most promising projects will have opportunities for follow-up through the Apart Fellowship and publication support.

Who should join

  • AI safety, alignment, and interpretability researchers.

  • ML engineers and researchers comfortable running model evals.

  • Philosophers of mind and ethicists interested in consciousness, agency, and moral patienthood.

  • Cognitive scientists, psychologists, and social scientists with experimental-design skills.

  • Students and early-career researchers exploring AI welfare.

Required: curiosity and a willingness to scope a tight empirical question. Nice to have: experience with LLM APIs, evals, interpretability tooling, or experimental design. No prior AI safety or AI welfare experience is required. The Resources tab has a curated reading list, and adjacent backgrounds (philosophy, psychology, economics) are explicitly encouraged.

What happens after

Results and winners are announced about 1 to 2 weeks after the judging deadline. Top teams are invited to apply to the Apart Fellowship for continued mentorship, funding, and publication support. Selected projects may be shared on the Alignment Forum, LessWrong, and other community venues.

Partners

  • NYU Center for Mind, Ethics & Policy, an academic center studying the nature and moral status of nonhuman minds, including AI systems.

Contact

Registered Local Sites

Register A Location

Besides remote participation, our organizers also host local hackathon sites where you can meet in person and connect with others in your area.

The in-person events for the Apart Sprints are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, so you can focus on engaging your local research, student, and engineering community.

We haven't announced jam sites yet

Check back later