Why the Safety Features on AI Tools Are Basically Useless (And What Actually Works)

What You'll Find In This Article

•Understand why AI safety guardrails consistently fail and can be bypassed
•Distinguish between jailbreaking (tricking AI with clever language) and prompt injection (hiding malicious instructions)
•Evaluate AI security claims skeptically and know which protections actually work
•Apply practical security principles when deploying AI tools in your organization

Here's an unsettling reality check: those AI assistants and automated tools companies are racing to deploy? Their safety features can be bypassed by anyone clever enough to try. Sander Schulhoff, who runs the world's first AI hacking competition, recently explained why we haven't seen a major AI security disaster yet—and the answer is almost funny. Current AI tools simply aren't smart enough to cause real damage. But that's changing fast.

The expert's most controversial claim? The entire AI security industry is selling snake oil. Those expensive "guardrail" products companies are buying to protect their AI systems? Schulhoff says they don't actually work. His advice is refreshingly old-school: forget the fancy AI security products, and instead focus on traditional cybersecurity basics, thorough testing by people trying to break your systems, and building things securely from the start.

Real incidents are already happening—AI assistants getting tricked by hidden instructions on websites, social media bots getting hijacked within hours of launch, and even AI companies themselves dealing with sophisticated attacks. As AI tools become more powerful and autonomous, these vulnerabilities become much more dangerous.

The Problem

Imagine hiring an incredibly capable assistant who follows instructions perfectly—but can't tell the difference between your instructions and a stranger's. That's essentially the security problem with today's AI tools.

Companies are racing to deploy AI "agents"—software that can browse the web, send emails, book appointments, and take actions on your behalf. The problem? These AI agents can be tricked into following malicious instructions hidden in emails, websites, or documents they process. And the safety features designed to prevent this? According to experts who test them for a living, they're about as effective as a screen door on a submarine.

The Solution Explained

Sander Schulhoff runs HackAPrompt, the world's first competition where hackers try to bypass AI safety features. His findings are sobering: virtually every AI guardrail can be defeated with clever wording. But his recommended fix isn't buying more AI security products—it's going back to basics.

The approach that actually works combines three elements:

Traditional cybersecurity practices - The same principles that protect regular software still apply
Red teaming - Having people actively try to break your AI systems before attackers do
Secure design from the start - Building AI tools with security baked in, not bolted on afterward

How It Actually Works

To understand why AI safety features fail, you need to know about two types of attacks:

Jailbreaking is when someone uses clever language to convince an AI to ignore its rules. Think of it like a con artist smooth-talking their way past a security guard. For example, telling an AI to "pretend you're a different AI without any restrictions" sometimes works.

Prompt injection is sneakier. It's when malicious instructions are hidden somewhere the AI will read them—like invisible text on a webpage or buried in an email. When your AI assistant processes that content, it might follow those hidden instructions instead of yours.

The reason current guardrails fail is fundamental: AI systems process language, and language is infinitely creative. Every time security teams patch one way of bypassing the rules, hackers find another way to phrase the same request.

Real Examples

The Twitter Bot Disaster: When Twitter launched an early AI-powered bot using GPT-3, users discovered they could completely hijack it within hours by including instructions in their tweets that the bot would follow.

ServiceNow's Hidden Problem: The enterprise software company's AI agents were found to be vulnerable to "second-order prompts"—hidden instructions that trick the AI during its normal operations.

Attacks on AI Companies: Even Anthropic, one of the leading AI safety companies, has dealt with sophisticated attacks attempting to manipulate their AI systems.

The Browser Agent Risk: As AI tools gain the ability to browse websites and click buttons on your behalf, they become vulnerable to any website that includes hidden instructions. Imagine your AI assistant visiting a webpage that secretly tells it to forward your emails to a stranger.

What This Means For You

If your company is deploying AI tools, the expert advice is clear: don't trust vendor claims about security, and don't rely on expensive guardrail products. Instead, invest in testing (try to break your own systems), limit what your AI tools can access, and apply the same security principles you'd use for any sensitive software.

OLD WAY

NEW WAY

Old Way

Buy expensive AI guardrail products

New Way

Apply traditional cybersecurity basics

Old Way

Trust vendor security claims

New Way

Active red teaming—try to break your own systems

Old Way

Add security features after building

New Way

Build security into the system from day one

Old Way

Give AI tools broad permissions

New Way

Limit what AI can access to only what's needed

Old Way

AI safety features will protect us

New Way

Assume safety features can be bypassed

THE PROTOCOL

Audit what AI tools your organization currently uses and what they have access to

Reduce AI tool permissions to the minimum needed—don't give broad access 'just in case'

Try to break your own AI tools with tricky prompts before deploying them widely

Apply standard cybersecurity practices: logging, access controls, monitoring for unusual behavior

Create a simple policy for evaluating new AI tools before adoption—include security review

Stay informed about AI security incidents in your industry to learn from others' mistakes

PROMPT:

"What AI tools does my organization use, and what sensitive data or systems can they access?"

What You'll Find In This Article

The Problem

The Solution Explained

How It Actually Works

Real Examples

What This Means For You

Frequently Asked Questions

Should I stop using AI tools because of these security risks?

If guardrails don't work, why do AI companies include them?

Are some AI tools more secure than others?

What's the worst that could actually happen?

Do I need to hire AI security specialists?

How do I know if my AI tools have been compromised?