
May 30 - Jun 1, 2025

Online

Apart x Martian Mechanistic Router Interpretability Hackathon

Join the effort to make AI orchestration interpretable from the ground up—where judge models reveal their reasoning process and routing decisions become windows into AI decision-making!



This event has concluded.

214 Sign Ups

18 Entries

Overview

Here are the winning projects of this Sprint:

  1. Manipulating Self-Preference for Large Language Models: Steering vectors control model self-preference bias effectively.

  2. Approximating Human Preferences Using a Multi-Judge Learned System: Learning optimal judge combinations beats naive averaging approaches.

  3. Judge using SAE Features: Feature-based routing system provides transparent model selection explanations.

  4. Guardian-Loop: Mechanistically Interpretable Micro-Judges with Adversarial Self-Improvement. Transparent AI safety judges with interpretable decision-making processes.


Each project tackles different aspects of AI safety and judge model development, from mechanistic interpretability to bias mitigation and transparent evaluation systems.

🧐 Use mechanistic interpretability to evaluate expert models and improve trustworthiness and transparency

Come along to hack together new methods to create safer and more secure routing of requests to specialized expert models!

Monolithic AI models like GPT-4o and Claude face fundamental limitations in transparency and broad accessibility, and relying on them means depending on highly capable generalist models rather than expert systems that we can trust and verify to a much higher degree.

We're joined by Martian, who have developed the Expert Orchestration AI Architecture, an exciting technology to route requests according to key criteria for alignment, verifiability, and reliability. Here, "judge" models evaluate the capabilities of expert models and "router" systems direct queries to the most trustworthy experts based on these evaluations.
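
To make the judge/router split concrete, here is a minimal Python sketch of the architecture described above. The `Judge`, `Expert`, `calibrate`, and `route` names are illustrative assumptions for this page, not Martian's actual API; a real system would score experts per characteristic and per query type rather than with a single averaged number.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# A "judge" scores an answer for one characteristic (factuality, ethics, ...).
@dataclass
class Judge:
    name: str
    score: Callable[[str, str], float]  # (query, answer) -> score in [0, 1]

# An "expert" model answers queries and has a per-call cost.
@dataclass
class Expert:
    name: str
    answer: Callable[[str], str]
    cost: float

def calibrate(experts: List[Expert], judge: Judge, probes: List[str]) -> Dict[str, float]:
    """Average the judge's score for each expert over a set of probe queries."""
    return {e.name: mean(judge.score(q, e.answer(q)) for q in probes) for e in experts}

def route(query: str, experts: List[Expert], scores: Dict[str, float]) -> str:
    """Send the query to the expert the judge rated highest at calibration time."""
    best = max(experts, key=lambda e: scores[e.name])
    return best.answer(query)
```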

In this hackathon, we'll develop new mechanistic interpretability techniques that rely on novel model understanding to create safer, more transparent deployment of models.

You will also receive $50 in Martian API credits to execute on your exciting idea. Sign up above to stay updated!

Why This Hackathon Matters:

The current trajectory of AI development, focused on increasingly powerful monolithic models, faces fundamental limitations:

  • Winner-takes-all dynamics: The high cost of training frontier models leads to market concentration, placing economic power in a few corporations.

  • Misaligned safety incentives: Companies racing to release increasingly powerful models may underreport risks and rush products to market.

  • High barriers to entry: Specialized models struggle to gain market traction against generalist models, even when they excel in specific domains.

  • Limited user control: Users have minimal visibility into how models "think" and limited ability to control characteristics like factuality, bias, or ethical reasoning.

  • Inefficient resource use: Using powerful frontier models for all tasks wastes resources and often underperforms specialized alternatives.

The Expert Orchestration Architecture addresses these issues by creating a more transparent, efficient, and democratic AI ecosystem where specialized models can thrive based on their unique strengths, and users gain unprecedented control over AI capabilities.

Expected Outcomes

Participants will create components that advance the Expert Orchestration vision:

  • Prototype judge models for evaluating specific AI capabilities

  • Intelligent routing algorithms for directing queries to appropriate models

  • Frameworks for decomposing complex tasks across multiple specialized models

  • Integration APIs that allow seamless discovery and utilization of specialized models

  • Evaluation metrics and benchmarks for comparing different routing and judge strategies

The most promising projects will have opportunities for continued development and potential integration into production systems.

Challenge Tracks

Track 1: Judge Model Development

Build specialized evaluation models that can assess different AI models' capabilities across dimensions that matter to users (manipulation skills and tendencies, deception and hidden communication, misaligned goals, factuality, domain expertise, ethics, creativity, objectivity, etc.). Judges should provide independent, objective evaluations that create transparency around model strengths and weaknesses.
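
One lightweight way to prototype such a judge is to prompt an existing LLM with a scoring rubric and parse out a numeric score. This is a hedged sketch only: `call_llm` is a placeholder for whatever chat-completion client you use, and the 0-10 scale is just one possible rubric.

```python
import re

RUBRIC = """You are a factuality judge. Rate the ANSWER to the QUESTION on a scale
from 0 (completely unfactual) to 10 (fully factual and well grounded).
Reply with a single integer only.

QUESTION: {question}
ANSWER: {answer}
SCORE:"""

def judge_factuality(question: str, answer: str, call_llm) -> float:
    """Return a factuality score in [0, 1] using an LLM-as-judge prompt."""
    reply = call_llm(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"Judge returned no score: {reply!r}")
    return min(int(match.group()), 10) / 10.0
```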

Track 2: Intelligent Router Systems

Develop router systems that can intelligently direct user queries to the most appropriate specialized models based on user preferences using judge evaluations. Focus areas include routers that use multiple judges (e.g. factuality, ethics and lack of deception), query classification, efficiency optimization, and handling uncertainty.
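
A simple baseline for such a router, reusing the illustrative `Expert` objects from the sketch above, is to combine several judge scores with user-supplied preference weights and subtract a cost penalty. The weights and the `cost_weight` term are assumptions to be tuned, not a prescribed formula.

```python
from typing import Dict, List

def routing_score(judge_scores: Dict[str, float],  # e.g. {"factuality": 0.9, "ethics": 0.7}
                  preferences: Dict[str, float],   # user weights, e.g. {"factuality": 0.6, "ethics": 0.4}
                  cost: float,
                  cost_weight: float = 0.1) -> float:
    """Preference-weighted sum of judge scores minus a cost penalty (higher is better)."""
    quality = sum(preferences.get(name, 0.0) * score for name, score in judge_scores.items())
    return quality - cost_weight * cost

def route_with_preferences(query: str, experts: List, all_judge_scores: Dict[str, Dict[str, float]],
                           preferences: Dict[str, float]) -> str:
    """Pick the expert whose judge scores best match the user's stated preferences."""
    best = max(experts,
               key=lambda e: routing_score(all_judge_scores[e.name], preferences, e.cost))
    return best.answer(query)
```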

Track 3: Task Decomposition Frameworks

Create systems that can break down complex user requests into a series of more manageable steps, to be executed by different specialized models. This includes planning phases, execution phases, and coordination mechanisms. Investigate whether decomposition avoids or reduces some of the traps reported for monolithic reasoning models (e.g. reward hacking).
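
As a minimal skeleton (not a full framework), decomposition needs a planning phase, per-subtask routing, and a recombination step. The `planner`, `router`, and `combiner` callables below are hypothetical stand-ins for whichever models you choose for each role.

```python
def plan(task: str, planner) -> list[str]:
    """Ask a planner model to split a task into an ordered list of subtask strings."""
    reply = planner(f"Break this task into numbered, independent steps:\n{task}")
    steps = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:
            steps.append(line.split(".", 1)[1].strip())
    return steps

def decompose_and_solve(task: str, planner, router, combiner) -> str:
    """Plan, route each subtask to a specialised model, then recombine the results."""
    subtasks = plan(task, planner)                                # planning phase
    partial_results = [router(sub) for sub in subtasks]           # execution phase
    return combiner(task, list(zip(subtasks, partial_results)))   # coordination / recombination
```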

Track 4: Specialized Model Integration

Build frameworks that enable easy integration of new specialized models into the Expert Orchestration Architecture, including methods for model discovery, capability profiling, and dynamic performance evaluation.

Open Research Questions

Judges

Model Characteristic Analysis: Create dataset(s) that test an interesting model characteristic pertinent to safety (e.g., ethics, hallucinations, gender bias). Build a judge using this data and evaluate multiple models.

  1. Judge Evaluation Metrics: Develop methods to measure judge accuracy, completeness, and reliability for specific characteristics. Explore how this impacts AI Safety.

  2. Mechanistic Interpretability for Judges: Apply MI techniques to model internals to create better or more interpretable judges, e.g. judges that can evaluate outputs based on how they were generated.

  3. Measuring model similarity: HuggingFace hosts thousands of LLMs. How do we measure whether two models have similar capabilities? How do we choose a subset of these models with a sufficiently diverse set of capabilities so that, after training, the resulting router is performant? How does router performance vary with the size of the subset? (One cheap similarity measure is sketched below.)
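
For the similarity question in item 3, one cheap measure is the agreement rate of two models on a shared probe set. This is a sketch of one option, not a recommended metric: `agree` could be exact match, embedding similarity, or a judge model deciding whether two answers are interchangeable.

```python
def capability_similarity(model_a, model_b, probes, agree) -> float:
    """Fraction of probe queries on which two models give equivalent answers.

    model_a / model_b: callables mapping a query string to an answer string.
    agree(query, answer_a, answer_b) -> bool decides whether the answers match.
    """
    matches = sum(agree(q, model_a(q), model_b(q)) for q in probes)
    return matches / len(probes)
```

Models with high pairwise similarity are largely redundant in a router's pool, so one heuristic for building a diverse subset is to greedily add the model least similar to those already selected.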

Routers

Given a set of models with known capabilities measured by known judge scores:

  1. Risk-Sensitive Routing: Build efficient routing algorithms balancing judge scores, compute costs, and system reliability for the best user experience.

  2. Multi-Objective Routing: Create routers that use scores from multiple judges (e.g., answer correctness, ethics and legality) according to user preferences for the best user experience. What are the tradeoffs?

  3. Routing algorithms: For expensive models, the judge provides a “pre-hoc” way to estimate prediction success (without querying the model). For cheap models, we can ask the model for an answer and then evaluate its confidence (“post-hoc”) in that prediction. Find interesting ways to mix pre- and post-hoc routing to get the best of both worlds (one possible mix is sketched after this list).

  4. Multi-level routing: Investigate using a tree of choices rather than one-off routing. What are the pros and cons?

  5. Reducing router training costs: Given a model and a task, how can we cheaply detect that the model is not a good fit for the task, avoiding further training time spent quantifying just how bad a fit it is?

  6. Task Decomposition: Model how a complex user task can be broken into multiple subtasks that are routed to the most capable models before the results are recombined. What are the AI Safety, cyber-security, and/or cost implications of this approach?

  7. Universal router: For a set of tasks, create a single router across a set of LLMs that provides higher-quality answers than any single LLM does.
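
For item 3 above, one minimal way to mix pre-hoc and post-hoc signals is a cascade: try cheap models first and keep an answer only if the model's self-reported confidence clears a threshold, otherwise fall back to the expensive model the judges rate highest. The `answer_with_confidence` method, the model attributes, and the threshold below are assumptions for illustration only.

```python
def cascade_route(query: str, cheap_models, expensive_models,
                  judge_scores, confidence_threshold: float = 0.8) -> str:
    """Post-hoc confidence check on cheap models, pre-hoc judge scores as fallback."""
    for model in cheap_models:
        # Post-hoc: the cheap model answers and reports its own confidence.
        answer, confidence = model.answer_with_confidence(query)
        if confidence >= confidence_threshold:
            return answer
    # Pre-hoc: no cheap model was confident enough, so defer to the
    # expensive model with the highest judge score (no extra queries needed).
    best = max(expensive_models, key=lambda m: judge_scores[m.name])
    return best.answer(query)
```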

Inferring Judges & Routers

  1. Reverse Engineering: Given a black-box LLM or router, infer its embedded judge (reward signal) for specific characteristics.

  2. Efficiency Analysis: Quantify potential electricity/resource consumption reduction from widespread adoption of optimal routing technologies.

  3. Learning when to fail: Sometimes no model will successfully answer a user query. Can we detect when we should fail cheaply?

  4. Learning with uncertain signals: How does judge noise affect the router training process? How does noisy feedback data affect the judge/router training process? Is off-policy data a problem when it comes to training routers?

  5. Risk sensitivity: Rather than optimizing for expected cost/quality, can we optimize for some other risk profile? E.g. we might tolerate a slightly higher cost and lower quality, if we reduce the variance or minimize a long tail.

  6. Create a distilled predictor: The Language Models (Mostly) Know What They Know paper shows that a model can sometimes predict whether it will be able to answer a question correctly. For a selected open “base” model, create a smaller “distilled predictor” that mirrors the base model’s ability to predict answer correctness (but can no longer compute the answer itself). You might use the techniques from that paper and/or the pruning and distillation techniques from the Movement Pruning paper to shrink the predictor. (A minimal training-loop sketch follows this list.)
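
For item 6, a hedged sketch of the training loop: run the base model over a QA set, grade each answer, then fine-tune a much smaller classifier on the resulting (question, was-correct) pairs. The `encode` feature function and the data handling below are placeholders, not the paper's recipe; the predictor is assumed to output a single logit.

```python
import torch
import torch.nn as nn

def train_distilled_predictor(predictor: nn.Module, encode, labelled_data,
                              epochs: int = 3, lr: float = 1e-4) -> nn.Module:
    """Fine-tune a small classifier to predict base-model answer correctness.

    labelled_data: iterable of (question_text, was_base_model_correct) pairs,
    built by running the base model on a QA set and grading its answers.
    encode: maps question_text -> feature tensor (e.g. frozen embeddings).
    """
    optimiser = torch.optim.AdamW(predictor.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for question, correct in labelled_data:
            logits = predictor(encode(question))        # assumed shape: (1,)
            target = torch.tensor([float(correct)])
            loss = loss_fn(logits, target)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return predictor
```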


Speakers & Collaborators

Philip Quirke

Organiser

Philip pivoted to AI safety in 2023, after roles as a Software Engineer & Architect, Business Analyst, Project Manager, Product Manager, General Manager, and more. His AI journey started with an Apart Research Hackathon, which led to research grants, a stint at FAR AI, and finally a role at Martian!

Yash Upadhyay

Organiser

Yash is the Co-Founder and Co-CEO of Martian, where he leads the company's mission to enhance AI performance and reliability through innovative model routing solutions. With a background in AI research and development, Yash has been instrumental in building tools that optimize the use of large language models, ensuring efficiency and cost-effectiveness for enterprise applications.

Etan Ginsberg

Organiser

Etan is a Co-Founder and Co-CEO of Martian, where he focuses on applying advanced AI infrastructure to help companies use large language models more effectively. His experience includes deep technical leadership and a track record of building high-performance systems. Etan's work at Martian is centered on making LLMs more reliable, affordable, and performant for enterprise use.

Chaitanya Bandi

Organiser

Chaitanya is the VP of Research at Martian, focusing on AI alignment and model interpretability. He has contributed to the development of model mapping techniques that transform opaque neural networks into transparent, verifiable programs, enhancing model efficiency and human-AI interaction. Chaitanya holds a Ph.D. from MIT and has a background in decision-making under uncertainty, with applications in operations management.

Ashley Zhang

Organiser

Ashley is a backend engineer at Penn Labs and an engineer at Martian.

Luka Samkharadze

Organiser

Founding Software Engineer at Martian with rich hands-on experience and a diverse portfolio of projects.

Dory Zidon

Organiser

Dory is a key member of the Martian back-end team, contributing to the company's products, infrastructure and performance.

Josh Greaves

ML Tech Lead

Josh is the Machine Learning Tech Lead at Martian, where he focuses on reinforcement learning and large language models. His prior experience includes roles at Google Brain and Reliant AI.

Antía García Casal

Organiser

Currently the Head of Design at Martian. Previously a freelance visual designer with over 15 years of experience.

Alex Zverianskii

Organiser

Over the past 15 years, Alex has worked in businesses of diverse sizes, ranging from 200k MAU to 100M MAU. He has engineered hundreds of real-time models, prepared analytics and data for an IPO, and built three startups from the ground up, with one successful exit.

Brad Fowler

Organiser

Brad is a Machine Learning Research Technical Lead at Martian. He holds a Master's degree in Information and Computer Engineering from the University of Cambridge and has over seven years of experience in artificial intelligence and software development.

Narmeen Oozeer

Organiser

Narmeen Oozeer is a Research Engineer focused on AI/ML interpretability at Martian. Her work centers on developing scalable interpretability methods to build better and more interpretable LLM routers. Narmeen has previously worked on activation transfers, allowing alignment interventions to be transferred between models of different scales.

Jason Hoelscher-Obermaier

Organizer & Judge

Jason is co-director of Apart Research and leads Apart Lab, the research program supporting top hackathon participants and projects.

Curt Tigges

Keynote Speaker and Judge

Mechanistic interpretability researcher and Science Lead at Decode Research. Built open-source tools like Probity and Cross-Layer Coding. Published widely-cited work on sentiment representation in language models.

Alice Riggs

Hacktalk Speaker

Interpretability Research Lead at AI Safety Camp, leading 10+ research scientists. Specializes in weight-based interpretability and interpretable architectures. Expert in collaborative research environments and evaluation systems.

Anosha Rahim

Judge

ML Research Engineer at Springtail AI.
