Conversation

christinaexyou

Description

Adds an XGBoost-based rail to NeMo Guardrails that detects spam content in data.

Related Issue(s)

Addresses part of #1303. TrustyAI reviewers include @RobGeada and @m-misiura.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.


Documentation preview

https://nvidia.github.io/NeMo-Guardrails/review/pr-1314

@christinaexyou christinaexyou force-pushed the add-xgb-rails branch 4 times, most recently from eab5154 to c710550 Compare July 29, 2025 19:11
@Pouyanpi Pouyanpi added this to the v0.16.0 milestone Aug 1, 2025
@Pouyanpi Pouyanpi removed this from the v0.16.0 milestone Aug 18, 2025
@cparisien
Collaborator

Should we be fetching model files from Hugging Face or another source rather than including the pickle files here? Also is there a concern about using pickle as the serialization format for XGB?
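The pickle concern is worth spelling out: unpickling runs arbitrary callables, so a tampered model file in a repo can execute code the moment it is loaded. A minimal, stdlib-only sketch of the failure mode (hypothetical, not part of this PR):

```python
import pickle

# Hypothetical demonstration of why unpickling untrusted files is risky:
# pickle invokes __reduce__ on load, which can run arbitrary callables.
class Malicious:
    def __reduce__(self):
        # On unpickling, this runs eval("6 * 7"); a real attacker could
        # substitute os.system or similar instead.
        return (eval, ("6 * 7",))

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)  # executes eval("6 * 7") during load
print(result)  # 42
```

For the booster itself, XGBoost's native `save_model`/`load_model` JSON format avoids this class of issue, though the accompanying vectorizer would still need a safe serialization format of its own.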

Comment on lines +3 to +4
XGB Detectors utilize [XGBoost machine learning models](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) to detect harmful content in data. Currently, only
the spam text detector, trained by the [Red Hat TrustyAI team](https://github.com/trustyai-explainability), is available for guardrailing use.

I would suggest a different name, such as spam_detection instead of XGB -- there are other detectors that may use XGBoost models. For example, jailbreak uses a random forest model and XGB was one of the considered architectures.


Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM

I'm not sure I understand what the harm of spam being input to the LLM is. Assuming that we are using the common definition of spam as unsolicited bulk email/messaging, I don't know what harmful behavior we're looking to prevent here.

I suppose I can accept that detecting spam in outputs from the LLM might be desirable from the perspective of not wanting to have your system used to generate spam emails. I would be concerned about the FPR of this model, specifically as it pertains to the use of LLMs to generate, e.g., marketing messages or other legitimate bulk content. It would be helpful to have a model card linked in this doc.

Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM
3. Detect spam in outputs from the LLM

Suggested change
3. Detect spam in outputs from the LLM
2. Detect spam in outputs from the LLM

@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

@@ -0,0 +1,56 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Comment on lines +8 to +9
if $detection
bot inform answer unknown

As with the v2 flows, this response is not particularly helpful and I would suggest having a different message. The same notion applies to the output rail.

@@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


Is this model currently hosted on something like Hugging Face? I'm very much against including the pickle files in the repo itself, and it's important to have a model card and version control for the model that is independent of the guardrails git repository.

Same comment applies to the vectorizer pickle file.


Do you have a link to information about how this model was trained? What is the F1 score on various spam datasets?

I would like to see information like what is presented for the jailbreak heuristics, and ideally the model should be hosted on something like Hugging Face alongside a model card.

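For reference, the F1 score requested above combines precision and recall into a single number. A tiny stdlib-only helper (hypothetical, not from this PR) shows the computation a model card could report from raw confusion-matrix counts:

```python
# Hypothetical helper illustrating the F1 metric a spam-detector model card
# could report, computed from true-positive, false-positive, and
# false-negative counts.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 90 true positives, 10 false positives, 30 false negatives.
print(round(f1_score(90, 10, 30), 3))  # 0.818
```

Reporting these per spam dataset, alongside the false-positive rate raised earlier, would make the model's suitability much easier to judge.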

 [tool.poetry.dependencies]
-python = ">=3.9,!=3.9.7,<3.14"
+python = ">=3.10,!=3.9.7,<3.14"

This is a really significant change. Although Python 3.9 is EOL, dropping support for an entire Python version is not something that should be done without significant regression testing.

