Dual-classifier system for LLM security
Nicolás David Galindo, Daniel Libardo Diaz Gonzalez, Juan Jacobo Izquierdo Becerra, David Andrés Ramírez Pontificia
This project is a dual-model classification system designed to protect reasoning Large Language Models (LLMs) from attacks like prompt injections and jailbreaks. Because both models are equally important, it uses a two-layered defense strategy:
1. The Prompt Classifier (First Model): Acts as the front door. It analyzes the user's initial prompt and immediately blocks it if it matches known attack patterns, acting as a highly efficient first line of defense.
2. The Behavioral Classifier (Second Model): Acts as the internal monitor. If an attack slips past the first filter (like an indirect injection hidden in a document), this second model monitors the LLM's internal "thinking" trace in real-time as it streams. If the model's reasoning is compromised, it cuts the connection and drops the response before the user ever sees it.
Rather than being traditional neural networks trained from scratch, both of these classifiers use embedding models and a vector database (like pgvector or Milvus) to perform similarity searches against known safe and malicious patterns. Neither model alone is enough, but together they ensure the system is protected at both the entry point and during the generation phase.
No reviews are available yet
Cite this work
@misc {
title={
(HckPrj) MnM:Real-Time Behavioral Detection of Prompt Injection via Reasoning Stream Monitoring
},
author={
Nicolás David Galindo, Daniel Libardo Diaz Gonzalez, Juan Jacobo Izquierdo Becerra, David Andrés Ramírez Pontificia
},
date={
},
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}


