Jan 24, 2025
CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models
Alyssia J, Martin CL
As state-of-the-art models such as OpenAI's o1 series, the upcoming o3 family, Gemini 2.0 Flash Thinking, and DeepSeek display increasingly sophisticated chain-of-thought (CoT) capabilities, our safety evaluations have not yet caught up. We propose building a platform for systematically gathering evaluations of AI reasoning processes in order to create comprehensive safety benchmarks. Our Chain of Thought Evaluation Platform (CoTEP) will help establish standards for assessing AI reasoning and support the development of more robust, trustworthy AI systems through industry and government collaboration.
Jason Schreiber
High-quality evaluations of chains of thought are an interesting opportunity. I'd love to see experiments in this direction!
Esben Kran
This is a super cool project! Really interesting to get expert-driven CoTs in for evaluation. There are a few questions regarding the impact on AI safety, since it's a capability evaluation and will help to get stronger training data, but the actual outlined strategy seems very reasonable. I highly suggest moving forward with this work and getting experimental data about existing CoT models, especially DeepSeek's R1, since it represents the next paradigm and its CoT is visible. Great work.
Cite this work
@misc {
title={CoTEP: A Multi-Modal Chain of Thought Evaluation Platform for the Next Generation of SOTA AI Models},
author={Alyssia J, Martin CL},
date={1/24/25},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}