Jun 2, 2025

LLM Fingerprinting Through Semantic Variability

Luiza Corpaci, Chris Forrester, Siddhesh Pawar

This project develops an LLM fingerprinting and analysis toolkit to increase transparency in AI routing systems, addressing Track 2: Intelligent Router Systems through two key investigations. We adapted semantic variability analysis to create unique behavioral fingerprints that can identify which specific models are operating behind opaque routing services, and conducted tool detection experiments under semantic noise to assess model robustness. Our findings show that models maintain high semantic robustness, while our fingerprinting technique successfully distinguishes between models based on their response patterns. These contributions support the Expert Orchestration Architecture vision by providing practical tools for auditing multi-model AI systems: organizations can see which models their routers actually use and verify their reliability under real-world conditions, making router systems more transparent and trustworthy for production deployment.
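
For readers who want to try the idea, here is a minimal sketch (not the authors' exact pipeline) of how a semantic-variability fingerprint can be computed: sample a routed endpoint several times with the same prompt, embed the responses, and summarize the spread of pairwise similarities. The `query_model` callable and the choice of embedding model are assumptions for illustration.

```python
# Minimal sketch of a semantic-variability fingerprint (illustrative, not the
# authors' exact method): query a model repeatedly, embed its responses, and
# summarise the spread of the embeddings as a small fingerprint vector.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def fingerprint(query_model, prompt: str, n_samples: int = 20) -> np.ndarray:
    # `query_model` is a hypothetical wrapper around the routed API being probed.
    responses = [query_model(prompt) for _ in range(n_samples)]
    emb = embedder.encode(responses, normalize_embeddings=True)
    sims = emb @ emb.T                               # pairwise cosine similarities
    upper = sims[np.triu_indices(n_samples, k=1)]    # unique pairs only
    # Summary statistics of response similarity act as a behavioural signature.
    return np.array([upper.mean(), upper.std(), upper.min(), upper.max()])
```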

Reviewer's Comments

Thinking about the potential transparency challenge in router-based ecosystems is valuable, and the research questions are well-motivated. The technical execution seems good, and the visualizations are really well done.

Here are some key areas to strengthen this work:

- Get even clearer on the impact / motivation: wouldn't routing providers surface info about the model selection as a feature to sell? Are there specific scenarios where opacity matters? If so, zoom in on those.

- Make it clearer how we could reliably assign a single response to a specific model. The fingerprint distributions seem to overlap heavily, such that attribution does not seem easy at all.

- The robustness testing part is not integrated well enough with the fingerprinting. If the connection is not strong, the project would be better served by splitting it off and keeping each contribution focused on one core topic.

- Add simple baselines or methods from the literature. How does your fingerprinting approach compare to those?

Thank you for your submission. This is an interesting paper and the results seem good.

As I understand what is proposed, I’m not clear on the benefits of the approach. I assume that the EO implementers will be transparent about which model(s) they are routing to. Certainly this is Martian’s intent / implementation. Can you foresee a use case where this is not so? If so, then it would be good to document this in the paper.

Being able to decode which model is actually behind the router is an interesting and good choice of problem. While this approach of semantic variability analysis is interesting, I worry that in its current state it is brittle to (for example) the hyperparameters of the decoding algorithms being applied from "behind the router" (such as temperature, decoding strategy, etc.).

It would be very interesting to see if a classifier could correctly map the signature to one of the specified models.
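
As a rough illustration of that suggestion, a simple classifier could be trained on fingerprint vectors collected from known models and evaluated with cross-validation; the arrays below are synthetic placeholders standing in for real fingerprints, so the accuracy printed here is meaningless.

```python
# Sketch of the reviewer's suggestion: map fingerprint vectors to model
# identities with a simple classifier. X/y are synthetic placeholders; in
# practice each row of X would be a fingerprint collected from a known model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 4))        # placeholder: 110 fingerprints, 4 stats each
y = np.repeat(np.arange(11), 10)     # placeholder: 11 candidate models

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Attribution accuracy: {scores.mean():.2%} (chance ≈ {1/11:.2%})")
```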

Investigation 2 on how distraction affects tool use is interesting, but perhaps slightly off topic for this hackathon.

I like the approach of combining embedding metrics and prompt engineering. It would be good to explore further what dimensions can be added, and how well judges handle multiple attributes.

Great idea to avoid semantic drift.

This is phenomenal! My team and I will be using it moving forward.

This is a well-executed research project that tackles critical AI transparency challenges with impressive methods. The cognitive load experiments reveal fascinating insights about AI attention management, particularly that technical jargon causes 96% tool reduction despite 100% acknowledgment, mirroring human selective attention failures. The most important finding, in my opinion, is that meta-analysis of the task constitutes cognitive load that isn’t recognized. Identifying this is crucial for AI safety improvements and defending against potential prompt injections.

Furthermore, this project addresses the model-routing opacity problem for AI governance and compliance with an impressive and novel method: the "hyperstrings" concept and the approach of probing model consistency patterns, which successfully distinguishes between 11 different models with statistical significance, demonstrate real technical achievement. The experimental design shows scientific rigor, with systematic similarity measurements and clear statistical testing, while the preferential tool-dropping behavior (abandoning enhancements while preserving core functionality) provides actionable guidance for building robust AI systems.

The research successfully bridges theoretical AI interpretability with practical system transparency. While sample sizes could be larger for broader validation, the work creates exactly the kind of transparency tools needed as AI systems become more complex - understanding both "which model is running" and "how reliably it performs under real-world conditions." This is a strong contribution with immediate applicability to production environments.

Very cool

Interesting problem and interesting approach; I'm curious how the compression method could be used for other applications too, e.g. AI detectors?

The paper's focus on a single task domain (restaurant reservations) limits generality; diversifying the tool-calling tasks or adding a qualitative error analysis might reveal richer and different failure modes.

The project is reproducible and shows a systematic evaluation methodology. The metrics are clearly shown, and the code is available in a GitHub repository. 10/10

Very innovative solution!

Cite this work

@misc{
  title={(HckPrj) LLM Fingerprinting Through Semantic Variability},
  author={Luiza Corpaci, Chris Forrester, Siddhesh Pawar},
  date={6/2/25},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  howpublished={https://apartresearch.com}
}

This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.