The LLM vulnerability paradox:
Researcher: Hey your model is vulnerable to prompt injection leading to content generation in violation of your policies.
Model publisher: Show us evidence you made illegal/dangerous content with our model.
Researcher: ...
Unlike remote code execution or other classic technical vulnerabilities, there is no "safe" proof-of-concept for these issues as described. Either you produce dangerous content or you don't. There's no "pop calc.exe" equivalent.
@mttaggart It's deflection designed to shut down the conversation and shift the blame.
@Rastal And intimidate with the possibility of legal action, imo.
@buherator I hope I was clear in distinguishing between classic technical vulnerabilities and this type. Either way, "fundamentally unsecurable" is the term I use frequently.
However, I disagree about the output not being dangerous. Tests for instructions on weapons, self-harm, hate speech, and of course CSAM are going to produce material nobody should handle. Don't forget the psychic damage incurred by the researcher performing the test, to say nothing of liability for possession.
@buherator Imo this is a bit too glib.
I don't know if you're actively involved in this testing, but quite a lot goes into the creation of prompts for these tests. Any tool you can think of out there comes with a limited set of prompts, and it's up to human researchers to create them. If you want to thoroughly test for this harm, the work itself is pretty awful.
That's before you get to the output, which we should remember is not just text. Image and video creation models also require testing. The results of which are, if not directly illegal, inherently harmful. And you don't get to predict the mental health of those forced to engage with it.
@buherator Ahh I gotcha. Although even one PoC can be radioactive in its way. I am unfortunately in that very role, and frequently must contend with frontier model publishers and their won't-fix policy.