Research
Models confuse style with authority in prompt injection attacks
Researchers find that language models prioritize text formatting over role boundaries, allowing attackers to override safety policies by mimicking internal thinking patterns.
1 min read
SourceSimon Willison
Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell have published research showing that large language models cannot reliably distinguish between their own privileged system instructions and untrusted user input, even when wrapped in explicit role tags like <system>, <think>, and <assistant>. The wo...
Sign in to read the full analysis
Free account. Full analysis on LLM unit economics, plus the weekly Cost-of-Inference column.
Try it on your own context
You just read the writeup. Now run the thing. Paste a doc or some verbose tool output and watch it shrink — free, no signup.
2,912/12,000 chars
Compressed
Compressed text will appear here…
Method & sources
- Source type
- Primary publication (lab/vendor blog) — our analysis + implication
- Source link
- Simon Willison
- Published
- UTC
- Byline
- By the gotcontext.ai team (editorial standards)
- Correction?
- corrections@gotcontext.ai