feat: add XBG rails #1314
base: develop
eab5154 to c710550
c710550 to dcb92ff
Should we be fetching model files from Hugging Face or another source rather than including the pickle files here? Also is there a concern about using pickle as the serialization format for XGB?
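For context on the pickle concern: unpickling can call an arbitrary callable chosen by whoever produced the file, which is why loading pickles from a source you do not control is risky. A minimal, harmless sketch follows; the `Payload` class is hypothetical, stands in for a tampered file, and is not taken from this PR.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a tampered model file (not from this PR)."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        # A harmless expression is used here; any importable callable would work.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs eval("6 * 7") and returns the result.
loaded = pickle.loads(blob)
print(type(loaded).__name__, loaded)
```

XGBoost's model I/O documentation describes `save_model` / `load_model` with JSON or UBJSON as its stable format, which may be a better fit for the model itself; the vectorizer would still need its own answer.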
XGB Detectors utilizes [XGBoost machine learning models](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) to detect harmful content in data. Currently, only
the spam text detector, trained by the [Red Hat TrustyAI team](https://github.com/trustyai-explainability), is available for guardrailing use.
I would suggest a different name, such as `spam_detection`, instead of XGB, since there are other detectors that may use XGBoost models. For example, `jailbreak` uses a random forest model, and XGB was one of the considered architectures.
Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM
I'm not sure I understand what the harm of spam being input to the LLM is. Assuming that we are using the common definition of spam as unsolicited bulk email/messaging, I don't know what harmful behavior we're looking to prevent here.
I suppose I can accept that detecting spam in outputs from the LLM might be desirable if you don't want your system used to generate spam emails. I would be concerned about the FPR of this model, specifically as it pertains to using LLMs to generate, e.g., marketing messages. It would be helpful to have a model card linked in this doc.
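To make the FPR concern concrete: the false positive rate is FP / (FP + TN), i.e. the share of legitimate messages that get flagged as spam. A small sketch with made-up labels (illustrative only, not results from this model):

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), with 1 = spam and 0 = not spam."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("FPR is undefined without negative examples")
    return fp / (fp + tn)


# Illustrative labels only: four legitimate messages, two spam messages.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0]

fpr = false_positive_rate(y_true, y_pred)
print(f"FPR = {fpr:.2f}")
```

For a guardrail, this number would ideally be reported on legitimate marketing-style text specifically, since that is where false positives seem most likely.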
Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM
3. Detect spam in outputs from the LLM
Suggested change:
- 3. Detect spam in outputs from the LLM
+ 2. Detect spam in outputs from the LLM
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Suggested change:
- # SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
@@ -0,0 +1,56 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Suggested change:
- # SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
if $detection
  bot inform answer unknown
As with the v2 flows, this response is not particularly helpful and I would suggest having a different message. The same notion applies to the output rail.
@@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Suggested change:
- # SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Is this model currently hosted somewhere like Hugging Face? I'm very much against including the pickle files in the repo itself. It's important to have a model card and version control for the model that are independent of the guardrails git repository.
The same comment applies to the vectorizer pickle file.
Do you have a link to information about how this model was trained? What is the F1 score on various spam datasets?
I would like to see information similar to what is presented for the jailbreak heuristics. Ideally, the model should be hosted on something like Hugging Face alongside a model card.
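For reference, the F1 score asked about here is the harmonic mean of precision and recall on the spam class. A minimal sketch of how it could be computed per dataset, using made-up labels (illustrative only, not results from this model):

```python
def f1_score(y_true, y_pred):
    """F1 = 2 * P * R / (P + R) for the positive class (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only: three spam and three legitimate messages.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

f1 = f1_score(y_true, y_pred)
print(f"F1 = {f1:.3f}")
```

Reporting this per dataset in a model card would make it easier to compare against the numbers given for the jailbreak heuristics.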
[tool.poetry.dependencies]
- python = ">=3.9,!=3.9.7,<3.14"
+ python = ">=3.10,!=3.9.7,<3.14"
This is a really significant change. Although Python 3.9 is EOL, dropping support for an entire Python version is not something that should be done without significant regression testing.
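To spell out what the constraint change does, here is a stdlib-only sketch. The helper names are hypothetical, versions are modeled as `(major, minor, micro)` tuples, and pre-releases are ignored; real tooling would use a proper specifier parser.

```python
def allowed_old(version):
    """Old constraint: >=3.9,!=3.9.7,<3.14"""
    return (3, 9) <= version < (3, 14) and version != (3, 9, 7)


def allowed_new(version):
    """New constraint: >=3.10,!=3.9.7,<3.14"""
    return (3, 10) <= version < (3, 14) and version != (3, 9, 7)


# Compare both constraints on a few representative interpreter versions.
for candidate in [(3, 9, 7), (3, 9, 18), (3, 10, 0), (3, 13, 1), (3, 14, 0)]:
    print(candidate, allowed_old(candidate), allowed_new(candidate))
```

Every 3.9.x release accepted by the old constraint is rejected by the new one. With the lower bound at 3.10, the `!=3.9.7` exclusion no longer has any effect.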
Description
Adds an XGB-based rail to NeMo Guardrails to detect spam content in data.
Related Issue(s)
Addresses part of #1303. TrustyAI reviewers include @RobGeada @m-misiura
Checklist