This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints
AI Security Evaluation Hackathon: Measuring AI Capability
May 27, 2024
Accepted at the AI Security Evaluation Hackathon: Measuring AI Capability research sprint.

WashBench – A Benchmark for Assessing Softening of Harmful Content in LLM-generated Text Summaries

In this work, we explore the tradeoff between toxicity removal and information retention in LLM-generated summaries. We hypothesize that LLMs are less likely to preserve toxic content when summarizing toxic text, because safety fine-tuning trains them to avoid generating such content. In high-stakes decision-making scenarios, where summary fidelity is important, this softening may create significant safety risks. To quantify this effect, we introduce WashBench, a benchmark containing manually annotated toxic content.

By Sev Geraskin, Jakub Kryś, Luhan Mikaelson, Simon Wisdom