Skip to content
Home AI Security Tools OpenAI Guardrails
OpenAI Guardrails

OpenAI Guardrails

NEW
Category: AI Security
License: Free (Open-Source)
Suphi Cankurt
Suphi Cankurt
AppSec Enthusiast
Updated April 3, 2026
5 min read
Key Takeaways
  • MIT-licensed Python library that wraps the OpenAI client with automatic input/output validation, moderation, PII detection, jailbreak detection, and hallucination checks.
  • Tripwire mechanism halts agent execution immediately when a guardrail detects a violation — supports parallel execution for low latency or blocking mode for zero wasted tokens.
  • Three guardrail types: input guardrails validate user prompts, output guardrails validate agent responses, and tool guardrails wrap function calls to validate before and after execution.
  • Integrates directly with the OpenAI Agents SDK via GuardrailAgent for native safety enforcement in multi-agent workflows.

OpenAI Guardrails is an MIT-licensed Python library that provides a drop-in safety wrapper for the OpenAI Python client, adding automatic input/output validation with a tripwire mechanism that immediately halts agent execution when a violation is detected. It is listed in the AI security category.

The library is MIT licensed and maintained by OpenAI. The latest release is v0.2.1 (December 2025). It integrates directly with the OpenAI Agents SDK via the GuardrailAgent component for native safety enforcement in agentic workflows.

OpenAI Guardrails is distinct from the guardrails feature built into the Agents SDK — this library provides a broader set of configurable safety controls that can wrap any OpenAI API call, not just agent interactions.

What is OpenAI Guardrails?

OpenAI Guardrails sits between your application and the OpenAI API, intercepting requests and responses to run safety checks. The library uses a tripwire mechanism: when a guardrail detects a violation, it immediately raises an exception that halts execution, preventing unsafe content from being processed or returned.

Three types of guardrails are supported. Input guardrails validate user prompts before the agent processes them. Output guardrails validate agent responses before they reach users. Tool guardrails wrap function calls, validating both the arguments going in and the results coming out.

Built-in guardrails cover the most common safety requirements: content moderation, PII detection (via Microsoft’s Presidio), jailbreak detection, hallucination detection, URL filtering, NSFW content filtering, and off-topic detection. Custom guardrails can be added for domain-specific requirements. The Guardrails Wizard provides a no-code configuration interface for building validation pipelines visually without writing JSON by hand.

Tripwire Mechanism
When a guardrail detects a violation, it triggers a tripwire that immediately raises an exception and halts agent execution. Supports parallel mode (lower latency) or blocking mode (zero wasted tokens) depending on your latency vs. cost priorities.
Three Guardrail Types
Input guardrails validate user prompts before agent processing. Output guardrails validate responses before delivery. Tool guardrails wrap function calls to validate arguments and results — critical for agentic workflows where tools have real-world side effects.
Agents SDK Integration
Native integration with the OpenAI Agents SDK via GuardrailAgent. Automatically applies safety checks to agent inputs and outputs, catching violations through InputGuardrailTripwireTriggered and OutputGuardrailTripwireTriggered exceptions.

Key Features

FeatureDetails
ArchitectureDrop-in wrapper for OpenAI Python client
Input GuardrailsValidate user prompts before agent execution; run only for the first agent in a chain
Output GuardrailsValidate agent responses; run only for the agent producing final output
Tool GuardrailsWrap function tools to validate calls before and after execution
TripwireImmediate exception on violation, halting agent execution
Execution ModesParallel (lower latency) or blocking (zero token waste)
Built-in ChecksModeration, PII detection, jailbreak, hallucination, URL filtering, NSFW, off-topic
PII DetectionMicrosoft Presidio integration for local PII scanning
Custom GuardrailsDefine custom checks with configurable thresholds
ConfigurationJSON-based guardrail configuration or visual Guardrails Wizard
EvaluationBuilt-in framework for benchmarking guardrail performance on labeled datasets
LicenseMIT

Input guardrails

Input guardrails run on the user’s initial input before the agent begins processing. They are the first line of defense, catching jailbreak attempts, PII in prompts, off-topic requests, and content policy violations before any tokens are consumed.

One important detail: input guardrails run only for the first agent in a chain. In multi-agent workflows with handoffs, subsequent agents don’t re-run input guardrails on the original user message. For full coverage in orchestrated systems, use tool guardrails on individual function calls.

Input guardrails support two execution strategies. In parallel mode (the default), the guardrail runs concurrently with the agent for lower latency, but the agent may consume some tokens before a tripwire fires. In blocking mode, the guardrail must complete before the agent starts, preventing any token consumption on violations.

Output guardrails

Output guardrails validate the agent’s final response before it reaches the user. They catch hallucinated content, sensitive data in responses, toxic language, and policy violations in generated text.

Output guardrails run only for the agent that produces the final output in a multi-agent chain. They always execute after agent completion and do not support parallel execution — by definition, they need the complete response to validate.

Tool guardrails

Tool guardrails matter most for agentic applications where function calls have real-world consequences. They wrap function tools (decorated with @function_tool) to validate both the arguments going into the function and the results coming out.

Input tool guardrails can prevent execution entirely, replace the output with a safe default, or raise a tripwire. Output tool guardrails can replace results or raise tripwires. This is the mechanism for preventing agents from making unauthorized API calls, executing dangerous operations, or returning sensitive data from tool executions.

Scope in multi-agent workflows
Input guardrails run only for the first agent, and output guardrails run only for the final agent. In complex orchestration with managers and handoffs, tool guardrails provide the most comprehensive coverage since they apply to every function call regardless of which agent makes it.

Getting Started

1
Install the library — Run pip install openai-guardrails. Requires Python 3.9+ and an OpenAI API key for model-based guardrail checks.
2
Configure guardrails — Define your safety configuration in JSON, selecting which built-in guardrails to enable (moderation, PII, jailbreak, hallucination, etc.) and setting thresholds.
3
Wrap your OpenAI client — Replace your standard OpenAI client with the guardrails-wrapped version. The wrapper intercepts all API calls and runs configured checks automatically.
4
Add to Agents SDK (optional) — For agentic workflows, use GuardrailAgent to integrate guardrails directly into the OpenAI Agents SDK. This adds input, output, and tool guardrails to your agent chain.
5
Evaluate and tune — Use the built-in evaluation framework to benchmark guardrail performance on labeled datasets. Adjust thresholds to balance safety against false positive rates.

When to use OpenAI Guardrails

OpenAI Guardrails fits teams building applications on the OpenAI platform that need configurable safety controls without building custom validation infrastructure. The drop-in wrapper approach means existing applications can add guardrails with minimal code changes.

The tool guardrails feature matters most for agentic applications. When AI agents can call functions that transfer money, send emails, query databases, or modify records, validating those calls before execution is critical. The tripwire mechanism blocks a flagged tool call immediately, not just logs it.

For teams already using the OpenAI Agents SDK, the native GuardrailAgent integration is the most natural fit. The guardrails become part of the agent architecture rather than external middleware.

Best for
Teams building on the OpenAI platform (API and Agents SDK) that need drop-in safety validation with built-in moderation, PII detection, jailbreak protection, and tool-level guardrails for agentic workflows.

For a broader overview of AI security tools, see the AI security tools guide. For provider-agnostic validation with a larger validator ecosystem, consider Guardrails AI.

For dialog flow control and multi-turn conversation safety, NeMo Guardrails covers capabilities beyond input/output validation. For comprehensive AI evaluation and observability, look at Galileo AI.

Frequently Asked Questions

What is OpenAI Guardrails?
OpenAI Guardrails is an MIT-licensed Python library that provides a drop-in wrapper for OpenAI’s Python client, adding automatic input/output validation and moderation. It includes built-in guardrails for moderation, PII detection, jailbreak detection, hallucination detection, URL filtering, and custom checks.
Is OpenAI Guardrails free?
The library itself is free and open-source under the MIT license. However, it calls OpenAI APIs for some guardrail checks, which incur standard OpenAI API costs. PII detection uses Microsoft’s Presidio framework locally.
How does the tripwire mechanism work?
When a guardrail detects a violation, it sets tripwire_triggered to true, which raises an InputGuardrailTripwireTriggered or OutputGuardrailTripwireTriggered exception. This immediately halts agent execution. Guardrails can run in parallel mode (lower latency, but the agent may consume some tokens before the tripwire fires) or blocking mode (guardrail completes before the agent starts).
How does OpenAI Guardrails compare to Guardrails AI?
OpenAI Guardrails is specifically designed for the OpenAI ecosystem with native Agents SDK integration and a tripwire-based execution model. Guardrails AI is a provider-agnostic framework with a larger validator library (Guardrails Hub) and re-prompting capabilities. Choose OpenAI Guardrails for OpenAI-native agent workflows; choose Guardrails AI for provider-agnostic validation.