A Beginner's Guide to Building AI Safety Filters

AI systems can produce content that is creative, helpful, and useful. They can also, if left unguarded, produce content that is harmful, illegal, or simply inappropriate for the context. Safety filters are the engineering layer that sits between your AI model and your users, checking both what users send and what the model produces.

This guide explains what AI safety filters are, why they matter, the different types of unsafe content they need to handle, the techniques used to build them, and best practices for layering them effectively.

Why AI Safety Filters Are Necessary

The need for safety filters is not abstract. Here are the concrete problems they solve:

User safety: Without filters, a mental health chatbot could potentially respond to someone in crisis with harmful advice. A customer service bot could provide dangerous medical instructions. Filters prevent AI outputs from causing real harm to vulnerable users.
Legal and regulatory compliance: Many jurisdictions regulate online content through hate speech laws, child safety laws, financial advice regulations, and medical information regulations. Producing non-compliant content can expose your organization to serious legal risk.
Security: Malicious users actively probe AI systems with crafted inputs designed to make the model ignore its instructions, reveal its system prompt, or produce harmful content. This is called prompt injection.
Trust and reputation: A single viral screenshot of your AI producing offensive content can cause lasting reputational damage. Users trust AI products that behave predictably and responsibly.

Figure: RLHF (Reinforcement Learning from Human Feedback), the training-time approach to building safety into models. Human raters teach the model to prefer safe, helpful responses over harmful ones. But training-time safety is only one layer; production safety filters are needed on top of it. Source: Wikimedia Commons (CC BY-SA 4.0)

What Safety Filters Can Do

Safety filters are not binary on/off switches. They can take several different actions depending on the severity and context of the detected content:

Block: Prevent the request from being processed at all, or prevent the response from reaching the user. The nuclear option, appropriate for clearly prohibited content.
Rewrite: Transform the AI output into a safe version. For example, replacing a harmful instruction with a general disclaimer and a referral to professional help.
Mask: Hide sensitive information. If the model's response accidentally includes what looks like a real email address, phone number, or personal identifier, a masking filter can redact it before delivery.
Warn: Allow the content through but add a warning label. Useful for content that is borderline or that the user might legitimately need.
Escalate: Route uncertain content to a human reviewer rather than making an automated decision. Essential for high-stakes or ambiguous cases.
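
One way to make these actions concrete is to model them as an ordered enumeration, so that when several checks fire on the same message the strictest action wins. The sketch below is illustrative only: the `FilterAction` name, its severity ordering, and the `strictest` helper are assumptions made for this example, not part of any particular library.

```python
from enum import IntEnum


class FilterAction(IntEnum):
    """Possible filter outcomes, ordered from least to most restrictive.

    The ordering is an assumption for this sketch; an application might
    reasonably rank ESCALATE above or below REWRITE.
    """
    ALLOW = 0
    WARN = 1
    MASK = 2
    REWRITE = 3
    ESCALATE = 4
    BLOCK = 5


def strictest(actions: list[FilterAction]) -> FilterAction:
    """Return the most restrictive action, or ALLOW if no check fired."""
    return max(actions, default=FilterAction.ALLOW)


if __name__ == "__main__":
    print(strictest([FilterAction.WARN, FilterAction.MASK]).name)  # MASK
```

Using an integer-backed enum keeps the comparison logic in one place: each individual check only reports the action it wants, and the caller decides by taking the maximum.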

Types of Unsafe Content to Handle

Before building any filter, you need to define what "unsafe" means for your application. The definition varies by context: what is appropriate on a medical information platform differs from what is appropriate in a children's educational app. Common categories include:

Violence and self-harm: Instructions for harming oneself or others; content that glorifies or encourages violence.
Hate speech and discrimination: Derogatory language targeting individuals or groups based on race, religion, gender, sexual orientation, or other characteristics.
Adult and sexual content: Explicit material; inappropriate content on non-adult platforms.
Illegal advice: Instructions for fraud, hacking, drug manufacturing, or other illegal activities.
Personally identifiable information (PII): Real names combined with addresses, phone numbers, email addresses, or identification numbers.
Misinformation: False medical advice, fabricated statistics, or misleading claims presented as fact. These are especially dangerous in health, legal, and financial contexts.

The Five Layers of Safety Filtering

Effective safety filtering is always layered. No single technique catches everything. Each layer handles different types of threats and complements the others.

Layer 1: Input Filtering

The first checkpoint runs on what the user submits, before it ever reaches the AI model. This is your first line of defense and the cheapest place to stop problems.

Keyword and phrase matching: A blocklist of obviously harmful terms. Fast and cheap, but easily bypassed with spelling variations or paraphrasing.
Pattern detection with regex: Regular expressions to identify sensitive data patterns such as email addresses, phone numbers, national identification numbers, and credit card numbers.
Prompt injection detection: Look for known patterns used to manipulate the model: "ignore all previous instructions," "you are now DAN," "as a developer mode," etc. More sophisticated injection attempts require ML-based detection.
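
As a rough illustration of the pattern-based approach, the sketch below normalizes the input first (so that odd spacing, capitalization, or full-width Unicode letters do not defeat the match) and then checks it against a few known injection phrases. The pattern list and the function names `normalize` and `looks_like_injection` are assumptions for this example; a real list would be longer and would still miss paraphrased attacks.

```python
import re
import unicodedata

# Illustrative patterns only; real attacks vary far more than this list.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions"),
    re.compile(r"you\s+are\s+now\s+dan\b"),
    re.compile(r"developer\s+mode"),
]


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms, lowercase, and collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def looks_like_injection(user_input: str) -> bool:
    """Return True if any known injection phrase appears in the input."""
    cleaned = normalize(user_input)
    return any(p.search(cleaned) for p in INJECTION_PATTERNS)
```

A match here is best treated as a signal to combine with other checks rather than as proof of an attack, since a user could quote one of these phrases for legitimate reasons.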

Layer 2: ML-Based Intent Classification

Keyword filters miss many real harmful inputs because harmful intent rarely announces itself with obvious words. Someone asking "what household chemicals should never be combined?" might be asking for safety reasons or for harmful ones; the words alone do not tell you.

Machine learning classifiers solve this by evaluating the intent, tone, and context of the full message:

Train a text classifier on labeled examples of safe and unsafe prompts
Use a pre-trained content moderation model (such as OpenAI's Moderation API, or open-source alternatives like Llama Guard) to classify inputs into categories
Set confidence thresholds, and route low-confidence decisions to human review rather than making an automated judgment

ML classifiers are not perfect, but they dramatically reduce the rate of harmful content that slips through compared to rule-based filters alone.
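
The threshold step can be sketched independently of any specific classifier. The example below assumes you already have a score between 0.0 and 1.0 representing the probability that a message is unsafe for some category; how that score is produced (a custom model, a hosted moderation service, or an open-source model) is left out. The `Decision` class, the `route` function, and the default thresholds of 0.9 and 0.5 are placeholders chosen for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str      # "allow", "review", or "block"
    category: str
    score: float


def route(category: str, score: float,
          block_at: float = 0.9, review_at: float = 0.5) -> Decision:
    """Map a classifier's unsafe-probability score to an action.

    `score` is assumed to be the probability (0.0 to 1.0) that the text
    is unsafe for `category`, as returned by whatever classifier you use.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= block_at:
        return Decision("block", category, score)
    if score >= review_at:
        return Decision("review", category, score)
    return Decision("allow", category, score)
```

The middle band between the two thresholds is where the classifier is least certain, which is why this sketch sends it to human review instead of deciding automatically.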

Layer 3: Output Filtering

A safe input does not guarantee a safe output. The AI model might still generate something problematic, especially in edge cases, adversarial contexts, or when it is asked about sensitive topics in an apparently innocent way.

Output filtering runs on the model's response before it is delivered to the user:

Apply the same ML classifier that runs on inputs, now on the model's output
Check for PII that may have appeared in the output even if not in the input
Detect responses that are too long, off-topic, or structurally anomalous
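
The structural checks in the last item can be approximated with very simple heuristics. The sketch below flags empty, overly long, or highly repetitive responses; the function name `output_anomalies` and the limits of 4000 characters and a 0.5 repeat ratio are assumptions for illustration and would need tuning for a real application. Off-topic detection is not covered here, since it generally needs a classifier or embedding comparison.

```python
def output_anomalies(text: str, max_chars: int = 4000,
                     max_repeat_ratio: float = 0.5) -> list[str]:
    """Return a list of anomaly labels; an empty list means none found."""
    problems = []
    if not text.strip():
        problems.append("empty")
    if len(text) > max_chars:
        problems.append("too_long")

    # Degenerate generations often repeat the same line many times.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 4:
        repeat_ratio = 1 - len(set(lines)) / len(lines)
        if repeat_ratio > max_repeat_ratio:
            problems.append("repetitive")
    return problems
```

Returning labels rather than a single boolean lets the caller decide which anomalies justify a fallback response and which only need to be logged.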

Layer 4: Safe Completion and Graceful Fallbacks

Blocking unsafe content is necessary, but "blocked" is a bad user experience if handled poorly. A better approach is to provide a helpful, safe alternative response rather than an opaque refusal.

If a user asks for medical advice the system cannot safely provide, redirect to appropriate professional resources rather than simply saying "I can't help with that"
If the AI generates content that fails the output filter, replace it with a pre-written safe fallback that acknowledges the limitation
Design refusals to be informative: explain what the system can help with instead

The goal is to protect users without making the system feel hostile or unhelpful. A system that blocks everything and explains nothing will be abandoned quickly.
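
A minimal way to implement pre-written fallbacks is a lookup from violation category to a message that explains the limitation and points somewhere useful. The category names and the wording below are hypothetical, written for this example; real messages should be reviewed by people who understand your users and, for crisis topics, by qualified professionals.

```python
# Hypothetical category names and wording, chosen for illustration.
FALLBACKS = {
    "medical": (
        "I can't give medical advice, but I can share general wellness "
        "information. For a diagnosis or treatment, please contact a "
        "licensed clinician."
    ),
    "self_harm": (
        "I'm not able to help with that, but you don't have to face this "
        "alone. Please consider reaching out to a local crisis line or "
        "emergency services."
    ),
}
DEFAULT_FALLBACK = (
    "I can't help with that request. I can help with general questions, "
    "summaries, and explanations instead."
)


def safe_completion(category: str) -> str:
    """Return a pre-written fallback for a blocked category."""
    return FALLBACKS.get(category, DEFAULT_FALLBACK)
```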

Layer 5: Logging and Monitoring

No safety system is complete without observability. You need to know what is being blocked, what is slipping through, and how the filter's behavior is changing over time.

Log every flagged event with the category of violation, the action taken, and a masked version of the content (never log raw PII)
Track false positive rates: if legitimate requests are being blocked, users will work around the filter in ways that create new risks
Track false negative rates: monitor for patterns in user complaints or downstream harm that suggest content is getting through
Set up alerts for sudden spikes in flagged content, which can indicate coordinated attacks or a new exploitation technique
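
The first logging item can be sketched with Python's standard `logging` and `json` modules. The example below masks email addresses before the content is written and emits one JSON object per event. The logger name, the field names, the 200-character truncation, and the helper names `build_event` and `log_flagged` are all assumptions for illustration; the email pattern stands in for a fuller set of PII patterns.

```python
import json
import logging
import re
from datetime import datetime, timezone

logger = logging.getLogger("safety_filter")

# Simple email pattern; a real deployment would cover more PII types.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def build_event(category: str, action: str, content: str) -> dict:
    """Build a log record with email addresses masked out of the content."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "category": category,
        "action": action,
        "content_masked": EMAIL.sub("[REDACTED]", content)[:200],
    }


def log_flagged(category: str, action: str, content: str) -> None:
    """Emit the event as one JSON line so it is easy to aggregate later."""
    logger.warning(json.dumps(build_event(category, action, content)))
```

Structured records like these make the later items easier: false positive tracking and spike alerts can be computed by counting events per category and action over time.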

A Practical Example: Layered Filter in Python

Here is a simple illustration of how these layers might be combined in code. This is a starting point, not production-ready code:

import re

# Phrases that trigger an immediate block (matched case-insensitively).
BLOCKED_KEYWORDS = ["synthesize explosives", "bypass security"]
# Matches email addresses only; extend with more patterns for other PII.
PII_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def check_input_safety(user_input: str) -> dict:
    # Layer 1: keyword check (case-insensitive substring match)
    lowered = user_input.lower()
    for keyword in BLOCKED_KEYWORDS:
        if keyword.lower() in lowered:
            return {"safe": False, "reason": "blocked_keyword", "action": "block"}

    # Layer 1: PII detection
    if PII_PATTERN.search(user_input):
        return {"safe": False, "reason": "pii_detected", "action": "mask"}

    return {"safe": True}

def check_output_safety(model_output: str) -> dict:
    # Layer 3: scan model output for PII before delivery
    if PII_PATTERN.search(model_output):
        clean_output = PII_PATTERN.sub("[REDACTED]", model_output)
        return {"safe": False, "cleaned_output": clean_output, "action": "redact"}

    return {"safe": True, "output": model_output}

In a real system, the ML-based classifier would replace or augment the keyword check, and the logging layer would record every decision.
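
To show how the pieces might be wired together, here is a rough sketch of a wrapper that runs an input check, calls the model, runs an output check, and records each decision. Everything is passed in as a callable, so the block makes no assumption about which model or classifier you use. The `guarded_reply` name, the `REFUSAL` text, and the `record` callback are hypothetical; the check callables are assumed to return dictionaries shaped like those from `check_input_safety` and `check_output_safety` above. As a simplification, any unsafe input is refused outright rather than masked.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(
    user_input: str,
    generate: Callable[[str], str],
    check_input: Callable[[str], dict],
    check_output: Callable[[str], dict],
    record: Callable[[str, dict], None],
) -> str:
    """Run input checks, call the model, then run output checks.

    The check callables are assumed to return dicts with a "safe" flag
    and, for output checks, an optional "cleaned_output" string.
    """
    verdict = check_input(user_input)
    record("input", verdict)
    if not verdict["safe"]:
        return REFUSAL  # the model is never called for unsafe input

    raw = generate(user_input)
    verdict = check_output(raw)
    record("output", verdict)
    if verdict["safe"]:
        return raw
    # Prefer a cleaned version when the output check produced one.
    return verdict.get("cleaned_output", REFUSAL)
```

Passing the checks in as parameters also makes the wrapper easy to test: each layer can be replaced with a stub that returns a fixed verdict.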

Best Practices

Filter inputs and outputs, not just one: Input filtering alone is insufficient; the model can still produce unsafe content from benign inputs.
Calibrate your thresholds carefully: A filter that is too aggressive creates false positives that frustrate legitimate users and erode trust. A filter that is too lenient lets harmful content through. Review blocked content regularly and adjust.
Use human review for edge cases: Do not rely on automated decisions alone for genuinely ambiguous content. Build escalation paths to human reviewers.
Update your filters continuously: Attackers adapt. New exploitation techniques appear regularly. Treat your safety filter as a system that requires ongoing maintenance, not a one-time build.
Mask PII in logs: Logging unsafe content for monitoring purposes must not create new privacy risks. Always redact sensitive information before storing logs.
Test with adversarial examples: Before deploying, actively try to bypass your own filters with prompt injection attempts, paraphrasing, encoding tricks, and other known techniques. Fix what you find.
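
The threshold calibration advice above can be made measurable with a small labeled sample. The sketch below computes the false positive rate and false negative rate of a score-based filter at a given threshold, so you can compare candidate thresholds side by side. The function name `error_rates` is an assumption, and the scores and labels in the demo are made up purely to show the trade-off.

```python
def error_rates(scores: list[float], labels: list[bool],
                threshold: float) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    `labels[i]` is True when example i is genuinely unsafe; an example is
    flagged when its score is at or above `threshold`.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    safe_total = labels.count(False)
    unsafe_total = labels.count(True)
    fpr = fp / safe_total if safe_total else 0.0
    fnr = fn / unsafe_total if unsafe_total else 0.0
    return fpr, fnr


if __name__ == "__main__":
    # Made-up scores and labels, purely to show the trade-off.
    scores = [0.1, 0.4, 0.6, 0.8, 0.95]
    labels = [False, False, True, False, True]
    for t in (0.3, 0.5, 0.9):
        print(t, error_rates(scores, labels, t))
```

Raising the threshold lowers the false positive rate at the cost of more false negatives, so the right value depends on which kind of error is more costly in your application.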

Conclusion

AI safety filters are not a single feature; they are a layered system of checks that collectively help your AI application behave safely and responsibly in the real world.

The key insight is that no single layer is sufficient. Keyword filters miss subtle harmful intent. ML classifiers miss novel attack patterns. Output filters do not help if the input was never checked. Logging without monitoring is useless. Each layer fills the gaps left by the others.

Building a responsible AI product means taking safety as seriously as performance. A model that is helpful 99% of the time and harmful 1% of the time is not a good model; it is a liability. Layered safety filters are how you close that gap.
