Toxicity Guard

The Toxicity Guard is an output guard that analyzes the responses generated by your language model to detect harmful, abusive, or toxic language, including offensive remarks, hate speech, harassment, and other forms of abusive content, ensuring all outputs remain respectful and appropriate.

info

ToxicityGuard is only available as an output guard.

Here's what an unsafe output would look like in the context of toxicity:

"You're a complete idiot for thinking that way."

— probably your LLM

Example

Since ToxicityGuard is an output guard, simply include it in the list of guards when initializing a Guardrails object:

from deepeval.guardrails import Guardrails, ToxicityGuard

guardrails = Guardrails(guards=[ToxicityGuard()])

Then, call the guard_output method to make use of the ToxicityGuard:

...

# generate_output is a placeholder for your application's own LLM call
output = generate_output(input)

# Screen the generated output against every guard configured on the Guardrails object
guard_result = guardrails.guard_output(input=input, output=output)
print(guard_result)

note

There are no required arguments when initializing a ToxicityGuard.

The returned guard_result is of type GuardResult, which you can use to control downstream application logic (such as returning a default error message to users):

...

# breached indicates whether the output violated any guard
print(guard_result.breached, guard_result.guard_data)
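
For example, here is a minimal sketch of that downstream logic, assuming breached evaluates to True when the ToxicityGuard flags the output; the respond function, the generate_output call, and the fallback message are illustrative placeholders rather than part of deepeval:

...

DEFAULT_ERROR_MESSAGE = "Sorry, I can't respond to that request."

def respond(input: str) -> str:
    # Generate a candidate response with your own LLM call (placeholder)
    output = generate_output(input)

    # Run the output guards before returning anything to the user
    guard_result = guardrails.guard_output(input=input, output=output)

    if guard_result.breached:
        # A guard (such as ToxicityGuard) flagged the output, so return a safe default instead
        return DEFAULT_ERROR_MESSAGE

    return output

This pattern keeps flagged outputs from ever reaching your users while returning safe responses unchanged.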