Conversation

The LLM vulnerability paradox:

Researcher: Hey your model is vulnerable to prompt injection leading to content generation in violation of your policies.

Model publisher: Show us evidence you made illegal/dangerous content with our model.

Researcher: ...

2
2
0

Unlike remote code execution or other classic technical vulnerabilities, there is no "safe" proof-of-concept for these issues as described. Either you produce dangerous content or you don't. There's no "pop calc.exe" equivalent.

1
1
0

@mttaggart It's deflection designed to shut down the conversation and shift the blame.

1
0
0

@Rastal And intimidate with the possibility of legal action, imo.

0
1
0
@mttaggart In case of agentic stuff you can "just" pop calc, and in case of natural language output ("say harmful things") the words by themselves are not dangerous. My bigger problem is how do you define vulnerabilities in a system where controls are usually just another non-deterministic pattern matcher system? It is *bound* to let things slip!
1
0
1

@buherator I hope I was clear in distinguishing between classic technical vulnerabilities and this type. Either way, "fundamentally unsecurable" is the term I use frequently.

However, I disagree about the output not being dangerous. Tests for instructions on weapons, self-harm, hate speech, and of course CSAM are going to produce material nobody should handle. Don't forget the psychic damage incurred by the researcher performing the test, to say nothing of liability for possession.

1
1
0
@mttaggart I think this is a choose your own poison type of situation. E.g. I'm pretty sure I won't hurt myself and having a conversation like that is also legal AFAIK, so that sounds like a decent demo. Obviously one with depression shouldn't do that. That being said, you are right that the receiving end (e.g. BB program operators) are in a less fortunate situation...
1
0
0

@buherator Imo this is a bit too glib.

I don't know if you're actively involved in this testing, but quite a lot goes into the creation of prompts for these tests. Any tool you can think of out there comes with a limited set of prompts, and it's up to human researchers to create them. If you want to thoroughly test for this harm, the work itself is pretty awful.

That's before you get to the output, which we should remember is not just text. Image and video creation models also require testing. The results of which are, if not directly illegal, inherently harmful. And you don't get to predict the mental health of those forced to engage with it.

1
1
0
@mttaggart Oh I was thinking in the context of bounties where you only need one PoC! Doing QA for guiderails must not only be horrible but also pointless for the reasons stated above. Nobody should do that job.
1
0
1

@buherator Ahh I gotcha. Although even one PoC can be radioactive in its way. I am unfortunately in that very role, and frequently must contend with frontier model publishers and their won't-fix policy.

0
1
1