Why the Safety Features on AI Tools Are Basically Useless (And What Actually Works)
What You'll Find In This Article
- •Understand why AI safety guardrails consistently fail and can be bypassed
- •Distinguish between jailbreaking (tricking AI with clever language) and prompt injection (hiding malicious instructions)
- •Evaluate AI security claims skeptically and know which protections actually work
- •Apply practical security principles when deploying AI tools in your organization
Here's an unsettling reality check: those AI assistants and automated tools companies are racing to deploy? Their safety features can be bypassed by anyone clever enough to try. Sander Schulhoff, who runs the world's first AI hacking competition, recently explained why we haven't seen a major AI security disaster yet—and the answer is almost funny. Current AI tools simply aren't smart enough to cause real damage. But that's changing fast.
The expert's most controversial claim? The entire AI security industry is selling snake oil. Those expensive "guardrail" products companies are buying to protect their AI systems? Schulhoff says they don't actually work. His advice is refreshingly old-school: forget the fancy AI security products, and instead focus on traditional cybersecurity basics, thorough testing by people trying to break your systems, and building things securely from the start.
Real incidents are already happening—AI assistants getting tricked by hidden instructions on websites, social media bots getting hijacked within hours of launch, and even AI companies themselves dealing with sophisticated attacks. As AI tools become more powerful and autonomous, these vulnerabilities become much more dangerous.
The Problem
Imagine hiring an incredibly capable assistant who follows instructions perfectly—but can't tell the difference between your instructions and a stranger's. That's essentially the security problem with today's AI tools.
Companies are racing to deploy AI "agents"—software that can browse the web, send emails, book appointments, and take actions on your behalf. The problem? These AI agents can be tricked into following malicious instructions hidden in emails, websites, or documents they process. And the safety features designed to prevent this? According to experts who test them for a living, they're about as effective as a screen door on a submarine.
The Solution Explained
Sander Schulhoff runs HackAPrompt, the world's first competition where hackers try to bypass AI safety features. His findings are sobering: virtually every AI guardrail can be defeated with clever wording. But his recommended fix isn't buying more AI security products—it's going back to basics.
The approach that actually works combines three elements:
- Traditional cybersecurity practices - The same principles that protect regular software still apply
- Red teaming - Having people actively try to break your AI systems before attackers do
- Secure design from the start - Building AI tools with security baked in, not bolted on afterward
How It Actually Works
To understand why AI safety features fail, you need to know about two types of attacks:
Jailbreaking is when someone uses clever language to convince an AI to ignore its rules. Think of it like a con artist smooth-talking their way past a security guard. For example, telling an AI to "pretend you're a different AI without any restrictions" sometimes works.
Prompt injection is sneakier. It's when malicious instructions are hidden somewhere the AI will read them—like invisible text on a webpage or buried in an email. When your AI assistant processes that content, it might follow those hidden instructions instead of yours.
The reason current guardrails fail is fundamental: AI systems process language, and language is infinitely creative. Every time security teams patch one way of bypassing the rules, hackers find another way to phrase the same request.
Real Examples
The Twitter Bot Disaster: When Twitter launched an early AI-powered bot using GPT-3, users discovered they could completely hijack it within hours by including instructions in their tweets that the bot would follow.
ServiceNow's Hidden Problem: The enterprise software company's AI agents were found to be vulnerable to "second-order prompts"—hidden instructions that trick the AI during its normal operations.
Attacks on AI Companies: Even Anthropic, one of the leading AI safety companies, has dealt with sophisticated attacks attempting to manipulate their AI systems.
The Browser Agent Risk: As AI tools gain the ability to browse websites and click buttons on your behalf, they become vulnerable to any website that includes hidden instructions. Imagine your AI assistant visiting a webpage that secretly tells it to forward your emails to a stranger.
What This Means For You
If your company is deploying AI tools, the expert advice is clear: don't trust vendor claims about security, and don't rely on expensive guardrail products. Instead, invest in testing (try to break your own systems), limit what your AI tools can access, and apply the same security principles you'd use for any sensitive software.
Audit what AI tools your organization currently uses and what they have access to
Reduce AI tool permissions to the minimum needed—don't give broad access 'just in case'
Try to break your own AI tools with tricky prompts before deploying them widely
Apply standard cybersecurity practices: logging, access controls, monitoring for unusual behavior
Create a simple policy for evaluating new AI tools before adoption—include security review
Stay informed about AI security incidents in your industry to learn from others' mistakes
PROMPT:
"What AI tools does my organization use, and what sensitive data or systems can they access?"