
Conversation

@Adarsh0427 (Contributor) commented Aug 21, 2025

Proposed change

Resolves #2038

Features Implemented

Core Features

  • Field Types: String, Number, Date, Boolean.
  • Comparison Operators: >, <, >=, <=, = for numeric and date fields.
  • Free Text Search: Unmatched tokens go to default field.
  • Case Sensitivity Control: Optional case-sensitive field matching.
  • Strict Mode: Toggle between lenient and strict field validation.
  • Error Handling: Comprehensive error messages for debugging.
  • Field Validations:
    - Quoted strings: Support for multi-word values: project:"OWASP ZAP" (all ASCII characters are supported inside quotes).
    - Unquoted strings: Support for single-word values: language:python.
    - Nested quoted strings: Support for nested quoted values: project:"OWASP "ZAP"".
    - Boolean values: true, false, yes, no, 1, 0, on, off.
    - Date format: YYYY-MM-DD or YYYYMMDD, with comparison operators.
    - Numbers: Integer values in the range [0, 2^32].
  • Robust Parsing: Handles malformed queries gracefully.
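
To make the boolean and date rules above concrete, here is a small self-contained sketch (illustrative only; this is not the PR's implementation, and the helper names `parse_boolean`/`parse_date` are hypothetical):

```python
from datetime import datetime

# Self-contained sketch of the boolean and date rules listed above
# (illustrative only; not the PR's implementation).
TRUE_VALUES = {"true", "yes", "1", "on"}
FALSE_VALUES = {"false", "no", "0", "off"}

def parse_boolean(value: str) -> bool:
    v = value.lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError(f"Invalid boolean value: {value!r}")

def parse_date(value: str) -> str:
    # Accept YYYY-MM-DD or YYYYMMDD; normalize to YYYY-MM-DD.
    for fmt in ("%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Invalid date value: {value!r}")

print(parse_boolean("yes"), parse_date("20230101"))  # True 2023-01-01
```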

Query Output Structure:

[
  {
    "type": "string | number | date | boolean",
    "field": "field_name",
    "op": "comparison_operator",   // optional: present only for number/date fields
    "value": "str | bool | int"
  }
]
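
For illustration, a minimal standalone tokenizer that emits the condition shape above for a few simple cases (the PR's parser is pyparsing-based and supports much more; `parse_token` and `FIELD_TYPES` here are hypothetical):

```python
import re

# Minimal standalone sketch of the condition-dict shape shown above.
# Not the PR's implementation; parse_token and FIELD_TYPES are hypothetical.
FIELD_TYPES = {"author": "string", "stars": "number", "archived": "boolean"}

def parse_token(token: str, default_field: str = "query") -> dict:
    match = re.match(r"^(\w+):(>=|<=|>|<|=)?(.+)$", token)
    if not match or match.group(1) not in FIELD_TYPES:
        # Unmatched tokens fall through to the default field as free text.
        return {"type": "string", "field": default_field, "value": token}
    field, op, value = match.groups()
    ftype = FIELD_TYPES[field]
    cond = {"type": ftype, "field": field}
    if ftype == "number":
        cond["op"] = op or "="  # "op" is emitted only for number/date fields
        cond["value"] = int(value)
    elif ftype == "boolean":
        cond["value"] = value.lower() in {"true", "yes", "1", "on"}
    else:
        cond["value"] = value
    return cond

print(parse_token("stars:>100"))
# {'type': 'number', 'field': 'stars', 'op': '>', 'value': 100}
```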

Query Examples

Example 1: Multi-field Query

Query: author:alice stars:100 language:"python" archived:false

Output:
{'field': 'author', 'type': 'string', 'value': 'alice'}
{'field': 'stars', 'type': 'number', 'op': '=', 'value': 100}
{'field': 'language', 'type': 'string', 'value': 'python'}
{'field': 'archived', 'type': 'boolean', 'value': False}

Example 2: Numeric Comparison

Query: stars:<50 author:bob

Output:
{'field': 'stars', 'type': 'number', 'op': '<', 'value': 50}
{'field': 'author', 'type': 'string', 'value': 'bob'}

Example 3: Quoted String

Query: project:"OWASP ZAP"

Output:
{'field': 'project', 'type': 'string', 'value': 'OWASP ZAP'}

Example 4: Date Range Query

Query: created:>=2023-01-01 stars:>50

Output:
{'field': 'created', 'type': 'date', 'op': '>=', 'value': '2023-01-01'}
{'field': 'stars', 'type': 'number', 'op': '>', 'value': 50}

Example 5: Free Text Search (default_field = "query")

Query: author:"aman" free_text "quoted text" 

Output:
{'type': 'string', 'field': 'author', 'value': 'aman'}
{'type': 'string', 'field': 'query', 'value': 'free_text'}
{'type': 'string', 'field': 'query', 'value': 'quoted text'}

Checklist

  • Equality, comparisons, quoted strings
  • Implicit AND
  • Unknown field validation with clear error messages
  • Define initial OWASP-focused field set
  • Query validation and syntax error handling
  • Defined comprehensive test cases

@Adarsh0427 requested a review from arkid15r as a code owner August 21, 2025 22:46
coderabbitai bot commented Aug 21, 2025

Summary by CodeRabbit

  • New Features

    • Advanced search parsing with fielded queries (e.g., author:John), free-text, numeric operators (<, <=, >, >=, =), booleans, and dates.
    • Configurable case sensitivity and default field behavior.
    • Structured results for easier filtering and downstream use.
  • Error Handling

    • Clear, structured errors for invalid fields, types, and dates, with an optional strict mode to enforce schema.
  • Chores

    • Added a parsing library dependency to support the new search capabilities.
  • Tests

    • Extensive coverage for parsing behavior, operators, booleans, dates, and edge cases.

Walkthrough

Adds a new pyparsing-based query parser module that converts GitHub-like search expressions into structured condition dictionaries, adds pyparsing to Poetry dependencies, and introduces unit tests covering parsing, schema validation, operators, dates, booleans, strict mode, and case sensitivity.
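
A simplified standalone grammar in the spirit of that module might look like this (the real grammar in `backend/apps/common/search/query_parser.py` is more elaborate; only the pyparsing API calls here are real, the grammar itself is an illustrative assumption):

```python
# Simplified standalone grammar in the spirit of the new module; the real
# grammar in query_parser.py is more elaborate.
from pyparsing import Optional, QuotedString, Word, alphanums, alphas, oneOf

field_name = Word(alphas + "_", alphanums + "_")
operator = oneOf("<= >= < > =")  # longest-match alternation
quoted_value = QuotedString('"')
bare_value = Word(alphanums + "._-")

# field:(op?)value, e.g. stars:>=100 or author:"John Doe"
token = (
    field_name("field")
    + ":"
    + Optional(operator, default="=")("op")
    + (quoted_value | bare_value)("value")
)

result = token.parseString("stars:>=100", parseAll=True)
print(result["field"], result["op"], result["value"])  # stars >= 100
```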

Changes

Cohort / File(s) Summary
Query parser implementation
backend/apps/common/search/query_parser.py
New module introducing FieldType enum, has_enum_value, QueryCondition dataclass, QueryParserError, and QueryParser class with a pyparsing grammar, per-type parsers (STRING, NUMBER, DATE, BOOLEAN), tokenization, strict/case-sensitive options, and methods to produce structured condition dicts.
Dependency update
backend/pyproject.toml
Adds pyparsing = "^3.2.3" to project dependencies.
Parser tests
backend/tests/apps/common/search/query_parser_test.py
New pytest suite covering schema validation, field parsing, numeric operators, boolean parsing, date validation, strict vs non-strict behavior, case sensitivity, multi-token queries, and various edge cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Implemented Initial Query Parser for Search Interfaces" accurately and concisely describes the primary change — adding an initial query parser — and aligns with the files and tests in the changeset.
Linked Issues Check ✅ Passed The changes implement the primary coding objectives from issue [#2038]: basic string equality, numeric and date comparison operators, structured output consumable by backend, logical AND composition, quoted-string handling, and case-sensitivity/strict-mode behavior; optional features such as OR and negation were not implemented and remain future work.
Out of Scope Changes Check ✅ Passed All changes in the provided summary are focused on the new query parser module, its tests, and the pyparsing dependency in pyproject.toml; no unrelated feature or business-logic changes are present in the diff summary.
Description Check ✅ Passed The PR description documents the implemented features, examples, and links to issue #2038, and it corresponds to the added parser, tests, and dependency shown in the changeset.



@coderabbitai bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/tests/apps/common/search/query_parser_test.py (1)

159-188: Final set of incorrectly nested test methods.

test_boundary_conditions_and_edge_cases, test_long_field_names, and test_numeric_values are all incorrectly nested.

Extract and fix all nested test methods:

-    def test_boundary_conditions_and_edge_cases(self):
-        @pytest.mark.parametrize(
-            ("query", "expected"),
-            [
-                ("author:", {"field": "query", "type": "string", "string": "author:"}),
-                (":123", {"field": "query", "type": "string", "string": ":123"}),
-                ('"field: true"', {"field": "query", "type": "string", "boolean": True}),
-                ("2341", {"field": "query", "type": "string", "string": "2341"}),
-            ],
-        )
-        def test_boundary_conditions_and_edge_cases(self, query, expected):
-            result = self.parser.parse(query)
-            assert result == [expected]
+    @pytest.mark.parametrize(
+        ("query", "expected_string"),
+        [
+            ("author:", "author:"),
+            (":123", ":123"),
+            ('"field: true"', "field: true"),
+            ("2341", "2341"),
+        ],
+    )
+    def test_boundary_conditions(self, query, expected_string):
+        result = self.parser.parse(query)
+        assert len(result) == 1
+        assert result[0]["field"] == "query"
+        assert result[0]["type"] == "string"
+        assert result[0]["string"] == expected_string
 
-        def test_long_field_names():
-            long_field_parser = QueryParser(allowed_fields={"verylongfieldname": "string"})
-            long_value = "a" * 1000
-            result = long_field_parser.parse(f'verylongfieldname:"{long_value}"')
-            assert isinstance(result, list)
-            assert result[0]["string"] == long_value
+    def test_long_field_names_and_values(self):
+        long_field_parser = QueryParser(allowed_fields={"verylongfieldname": "string"})
+        long_value = "a" * 1000
+        result = long_field_parser.parse(f'verylongfieldname:"{long_value}"')
+        assert isinstance(result, list)
+        assert result[0]["string"] == long_value
 
-        @pytest.mark.parametrize("val", ["0", "-1", "999999", "3.14159"])
-        def test_numeric_values(self, val):
-            result = self.parser.parse(f"stars:{val}")
-            assert len(result) == 1
-            assert result[0]["type"] == "number"
-            assert result[0]["field"] == "stars"
-            assert result[0]["number"] == str(int(float(val)))
-            assert result[0]["op"] == "="
+    @pytest.mark.parametrize("val", ["0", "-1", "999999", "3.14159"])
+    def test_numeric_values(self, val):
+        result = self.parser.parse(f"stars:{val}")
+        assert len(result) == 1
+        assert result[0]["type"] == "number"
+        assert result[0]["field"] == "stars"
+        assert result[0]["number"] == str(int(float(val)))
+        assert result[0]["op"] == "="
🧹 Nitpick comments (2)
backend/apps/common/search/query_parser.py (1)

443-445: Redundant empty value checks after tokenization.

The empty string checks on Lines 441-442 and 444-445 seem redundant since _split_tokens should not produce empty tokens, and _parse_token should handle the parsing appropriately.

Consider removing redundant checks or documenting why they're necessary:

             for token in tokens:
-                if not token or token.strip() == "":
-                    continue
                 field, raw_value = self._parse_token(token)
-                if raw_value is None or raw_value.strip() == "":
+                if not raw_value:
                     continue
backend/tests/apps/common/search/query_parser_test.py (1)

31-32: Test expectation message doesn't match the actual error.

The test expects "Expected end of text" but tests for an invalid field name with dashes. Consider using a more descriptive assertion.

-        with pytest.raises(Exception, match="Expected end of text"):
+        with pytest.raises(QueryParserError, match="must be alphanumeric"):
             QueryParser(allowed_fields={"field-with-dash": "string"})
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 18d6f8b and 70a73b2.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/common/search/query_parser_test.py (1)
backend/apps/common/search/query_parser.py (3)
  • QueryParser (76-469)
  • QueryParserError (28-73)
  • parse (422-469)
🔇 Additional comments (1)
backend/pyproject.toml (1)

50-50: LGTM! Dependency addition aligns with PR objectives.

The addition of pyparsing = "^3.2.3" is appropriate for implementing the query parser module. This is a well-established parsing library that aligns with the suggestions from issue #2038.

@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (5)
backend/apps/common/search/query_parser.py (3)

166-176: Field name validation relies on truthiness of ParseResults; wrap ParseException for consistent error types.

parseString(..., parseAll=True) raises ParseException on invalid names; the current if not ... check is unreachable. Wrap both default_field and field validations to raise FIELD_NAME_ERROR consistently.

Apply this diff:

-        if not QueryParser._FIELD_NAME.parseString(default_field, parseAll=True):
-            QueryParser._raise_invalid_field_name(default_field)
+        try:
+            QueryParser._FIELD_NAME.parseString(default_field, parseAll=True)
+        except ParseException:
+            QueryParser._raise_invalid_field_name(default_field)
@@
-            if not QueryParser._FIELD_NAME.parseString(field, parseAll=True):
-                QueryParser._raise_invalid_field_name(field)
+            try:
+                QueryParser._FIELD_NAME.parseString(field, parseAll=True)
+            except ParseException:
+                QueryParser._raise_invalid_field_name(field)

Also applies to: 170-176


53-63: Default field on QueryParserError forces "Field: '<field>'" into messages.

Using "<field>" as the default makes __str__ always include a field line. Prefer an empty string so the field line only appears when a field was provided.

Apply this diff:

-        field: str = "<field>",
+        field: str = "",
@@
-        if self.field:
+        if self.field:
             parts.append(f"Field: '{self.field}'")

Also applies to: 31-37


191-213: Tokenization via Regex doesn’t handle escaped quotes inside quoted strings.

Example: project:"OWASP \"ZAP\" Guide" will split incorrectly. Consider replacing _split_tokens with a pyparsing-based tokenizer that reuses _QUOTED_VALUE and _TOKEN_PATTERN, or, at minimum, extend the regex to account for escaped quotes.

I can provide a consolidated grammar (OneOrMore(Group(_FIELD_NAME + ':' + _FIELD_VALUE) | _QUOTED_VALUE | Word(…))) to replace _split_tokens if you’d like.
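
As a hedged sketch of the regex extension suggested above (the actual `_split_tokens` pattern isn't shown in this review, so the pattern below is illustrative only):

```python
import re

# Illustrative tokenizer pattern that keeps escaped quotes (\") inside a
# quoted value; this is not the PR's _split_tokens pattern.
TOKEN_RE = re.compile(
    r"""
    (?:\w+:)?                 # optional field: prefix
    (?:"(?:[^"\\]|\\.)*"      # quoted value, allowing backslash escapes
      |\S+                    # or a bare token
    )
    """,
    re.VERBOSE,
)

query = r'project:"OWASP \"ZAP\" Guide" stars:>10'
print(TOKEN_RE.findall(query))
# ['project:"OWASP \\"ZAP\\" Guide"', 'stars:>10']
```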

backend/tests/apps/common/search/query_parser_test.py (2)

165-167: Incorrect expected shape for quoted free text.

'"field: true"' becomes a free-text token; the parser emits {"type": "string", "field": "query", "string": "field: true"} — not a boolean. Adjust the expectation.

Apply this diff:

-                ('"field: true"', {"field": "query", "type": "string", "boolean": True}),
+                ('"field: true"', {"field": "query", "type": "string", "string": "field: true"}),

27-33: Avoid asserting on pyparsing’s internal error text.

"Expected end of text" is parser-specific and brittle. Prefer asserting the raised type and/or your error_type.

Example:

-        with pytest.raises(Exception, match="Expected end of text"):
+        with pytest.raises(QueryParserError) as ei:
             QueryParser(allowed_fields={"field-with-dash": "string"})
+        assert ei.value.error_type in {"FIELD_NAME_ERROR", "CONFIGURATION_ERROR"}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 18d6f8b and 70a73b2.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/common/search/query_parser_test.py (1)
backend/apps/common/search/query_parser.py (3)
  • QueryParser (76-469)
  • QueryParserError (28-73)
  • parse (422-469)
🔇 Additional comments (3)
backend/pyproject.toml (1)

50-50: CI environment mismatch & import failure — manual verification needed

The quick import test on your runner reported Python 3.11.2 and failed to find the pyparsing module, so we can’t confirm compatibility with Python 3.13 yet. Please verify in your CI:

  • That the test matrix actually installs and uses Python 3.13 (e.g., via a python3.13 job).

  • That pyparsing==3.2.3 installs without requiring extra wheels (run pip install pyparsing==3.2.3).

  • That a simple import check passes under Python 3.13:

    #!/bin/bash
    set -e
    python3.13 --version
    python3.13 - <<'PY'
    import sys
    print("Python:", sys.version)
    import pyparsing as pp
    print("pyparsing:", pp.__version__)
    PY
backend/apps/common/search/query_parser.py (1)

451-456: Field name normalization is always case-insensitive; confirm intended behavior.

Even with case_sensitive=True, field names are lowercased before lookup, making field matching case-insensitive by design. If the PR intent was “case sensitivity for values only,” this is fine; if field names should honor case in strict mode, we’ll need a change.

Would you like field-name case sensitivity to be enforced when case_sensitive=True and/or strict=True? I can adjust lookup accordingly.

backend/tests/apps/common/search/query_parser_test.py (1)

180-188: Numeric default operator assertion matches intended behavior.

This test verifies that stars:<val> defaults to the “=” operator and that the numeric value is normalized to an integer representation, which aligns with the intended parser behavior.

Please confirm locally—once the parser fix is merged—that this test passes in your environment by running:

python3 -m pytest backend/tests/apps/common/search/query_parser_test.py::TestQueryParser::test_numeric_values -q

@arkid15r (Collaborator) left a comment

The bots have some suggestions to address -- please take a look.

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
backend/apps/common/search/query_parser.py (4)

99-111: Fix grammatical error in docstring.

Line 109 contains a typo that was previously flagged in review comments.

Apply this diff to fix the grammatical error:

-    - Only supports implicit AND logic between conditions
+    - Only supports implicit AND logic between conditions

262-283: Update docstring to match actual behavior.

The docstring on Line 270 still contains the previously flagged inconsistency about the return value when operator is None.

Update the docstring to match the actual behavior:

-            Tuple of (operator, clean_value), if operator is none then ("=", clean_value)
+            Tuple of (operator, clean_value), defaulting to ("=", clean_value) if no operator

308-333: Consider returning actual boolean values instead of strings.

Based on previous review comments, this method returns string representations ("True"/"False") instead of actual boolean values, which could be confusing for API consumers.

Consider returning actual boolean values:

     def _parse_boolean_value(field: str, value: str) -> bool:
-        """Parse boolean value from string and return its string representation.
+        """Parse boolean value from string.
 
         Returns:
-            String representation of boolean value ("True" or "False")
+            Boolean value
         """
         if value.lower() in QueryParser._BOOLEAN_TRUE_VALUES:
-            return "True"
+            return True
         if value.lower() in QueryParser._BOOLEAN_FALSE_VALUES:
-            return "False"
+            return False

450-489: Potential issue with case-sensitive handling affecting field values.

The current implementation converts the entire query to lowercase when case_sensitive is False, which affects both field names and field values. This was flagged in previous reviews as potentially undesirable for string searches where case preservation might be important.

Consider only normalizing field names while preserving field values:

         conditions = []
         query = query.strip()
-        if not self.case_sensitive:
-            query = query.lower()
         tokens = self._split_tokens(query)
         for token in tokens:
             field, raw_value = self._parse_token(token)
 
             if field is None:
                 conditions.append(self._create_text_search_condition(raw_value))
                 continue
 
-            field = field.lower()  # field name normalization
+            # Normalize field name based on case sensitivity setting
+            if not self.case_sensitive:
+                field = field.lower()
🧹 Nitpick comments (1)
backend/apps/common/search/query_parser.py (1)

387-420: Update to handle boolean return type change.

If the _parse_boolean_value method is changed to return actual booleans, this method will need to be updated accordingly.

If you change _parse_boolean_value to return booleans, update this line:

-            condition_dict["boolean"] = self._parse_boolean_value(field, raw_value)
+            condition_dict["boolean"] = str(self._parse_boolean_value(field, raw_value))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 70a73b2 and a430a0e.

📒 Files selected for processing (2)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/common/search/query_parser_test.py
🔇 Additional comments (17)
backend/apps/common/search/query_parser.py (17)

1-2: LGTM!

The module docstring clearly describes the purpose of this query parser module.


5-17: LGTM!

All necessary pyparsing imports are included, and the import organization is clean and logical.


20-27: LGTM!

The FieldType enum properly extends both str and Enum, which provides string compatibility while maintaining type safety. The supported field types (STRING, NUMBER, DATE, BOOLEAN) align with the PR objectives.


29-49: LGTM!

This utility function provides a clean way to check enum membership without relying on private attributes. The exception handling approach is appropriate and follows best practices.


51-97: LGTM!

The QueryParserError class provides comprehensive error context with structured information for API responses. The implementation includes proper error categorization and contextual information that will be helpful for debugging and user feedback.


112-124: LGTM!

The pyparsing grammar definitions are well-structured and comprehensive. The use of class-level constants provides good organization and reusability. The boolean value sets cover common representations appropriately.


125-134: LGTM!

The field name validation logic is sound, combining pyparsing validation with a lowercase check. This ensures field names follow the expected alphanumeric + underscore pattern and maintains consistency with the lowercase requirement.


135-169: LGTM!

The constructor provides good initialization with comprehensive error handling. The parameter validation and exception wrapping ensure clear error messages for configuration issues.


170-179: LGTM!

The static helper methods for raising specific error types provide clean separation of concerns and consistent error message formatting.


180-210: LGTM!

The field validation logic correctly uses the is_in_enum helper function, which addresses the previous review concern about using private enum attributes. The validation is thorough and includes both field names and field types.


211-244: LGTM!

The tokenization logic properly handles quoted strings and various token types using regular expressions. The pyparsing integration provides robust parsing with appropriate error handling.


245-261: LGTM!

The token parsing method correctly extracts field:value pairs and gracefully handles non-matching tokens by returning None for the field name.


284-307: LGTM!

The string parsing implementation correctly handles both quoted and unquoted strings as mentioned in the past review comments. The approach preserves quotes for quoted strings while validating unquoted values against the restricted character set.


334-361: LGTM!

The numeric value parsing correctly handles comparison operators and converts string values to integers. The error handling provides appropriate context for debugging.


362-386: LGTM!

The date parsing implementation validates the YYYY-MM-DD format and handles comparison operators appropriately. The error handling catches various exception types that could occur during parsing.


421-432: LGTM!

The text search condition creation is straightforward and properly maps unmatched tokens to the default field.


433-449: LGTM!

The unknown field handling respects the strict mode setting, raising errors only when strict validation is enabled.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
backend/tests/apps/common/search/query_parser_test.py (7)

26-28: Reduce coupling to private internals (`_BOOLEAN_*` and `_COMPARISON_*`).

Tests should not depend on private attributes; it makes them brittle if internals are renamed/refactored. Hardcode the public contract here instead.

Apply this diff:

-        self.boolean_values = self.parser._BOOLEAN_TRUE_VALUES | self.parser._BOOLEAN_FALSE_VALUES
-        self.true_boolean_values = self.parser._BOOLEAN_TRUE_VALUES
-        self.comparison_operators = self.parser._COMPARISON_OPERATORS
+        # Avoid coupling to private attributes; freeze the supported sets locally.
+        self.boolean_values = {"true", "1", "yes", "on", "false", "0", "no", "off"}
+        self.true_boolean_values = {"true", "1", "yes", "on"}
+        self.comparison_operators = ["<", "<=", ">", ">=", "="]

106-117: Rename for clarity and assert error type in strict mode.

These are “unknown field” cases, not malformed tokens. Also assert UNKNOWN_FIELD_ERROR in strict mode to lock in the API contract.

Apply this diff:

-    malformed_queries = ['not_field:"operation"', "not_field:operation"]
+    unknown_field_queries = ['not_field:"operation"', "not_field:operation"]
 
-    @pytest.mark.parametrize("query", malformed_queries)
-    def test_malformed_queries(self, query):
+    @pytest.mark.parametrize("query", unknown_field_queries)
+    def test_unknown_field_non_strict_drops_token(self, query):
         result = self.parser.parse(query)
         assert not result
 
-    @pytest.mark.parametrize("query", malformed_queries)
-    def test_invalid_queries(self, query):
-        with pytest.raises(QueryParserError):
-            self.strict_parser.parse(query)
+    @pytest.mark.parametrize("query", unknown_field_queries)
+    def test_unknown_field_strict_raises(self, query):
+        with pytest.raises(QueryParserError) as ei:
+            self.strict_parser.parse(query)
+        assert ei.value.error_type == "UNKNOWN_FIELD_ERROR"

118-124: Nit: also assert default operator for valid dates.

This locks in the implicit “=” default for dates in the public contract.

Apply this diff:

         for date_str in valid_dates:
             result = self.parser.parse(f"created:{date_str}")
             assert len(result) == 1
             assert result[0]["date"] == date_str
+            assert result[0]["op"] == "="

125-129: Strengthen strict-mode expectations for invalid dates.

Also assert the specific error type to avoid regressions in error taxonomy.

Apply this diff:

-        for inv_date_str in invalid_dates:
-            with pytest.raises(QueryParserError, match="Invalid date value"):
-                self.strict_parser.parse(f"created:{inv_date_str}")
+        for inv_date_str in invalid_dates:
+            with pytest.raises(QueryParserError, match="Invalid date value") as ei:
+                self.strict_parser.parse(f"created:{inv_date_str}")
+            assert ei.value.error_type == "DATE_VALUE_ERROR"

134-137: Avoid duplicate coverage; add a more meaningful init failure case.

This duplicates Line 31’s invalid type check. Consider replacing it with invalid default_field validation.

Apply this diff:

-    def test_invalid_initialization(self):
-        with pytest.raises(QueryParserError):
-            QueryParser(allowed_fields={"field": "invalid_type"})
+    def test_invalid_default_field_name(self):
+        # Default field must be lowercase alphanumeric/underscore
+        with pytest.raises(QueryParserError) as ei:
+            QueryParser(allowed_fields={"test": "string"}, default_field="Default")
+        assert ei.value.error_type == "FIELD_NAME_ERROR"

159-165: Add a couple more assertions for completeness.

Assert the parsed field and type as well to ensure no accidental regressions.

Apply this diff:

         result = long_field_parser.parse(f'verylongfieldname:"{long_value}"')
         assert isinstance(result, list)
-        assert result[0]["string"] == '"' + long_value + '"'
+        assert result[0]["string"] == '"' + long_value + '"'
+        assert result[0]["field"] == "verylongfieldname"
+        assert result[0]["type"] == "string"

1-174: Optional: add a few targeted tests to harden edge cases.

If helpful, I can open a follow-up patch with these additions:

# Add alongside existing tests

def test_date_comparison_operators(self):
    for op in ["<", "<=", ">", ">=", "="]:
        result = self.parser.parse(f"created:{op}2024-01-01")
        assert len(result) == 1
        assert result[0]["field"] == "created"
        assert result[0]["type"] == "date"
        assert result[0]["op"] == op
        assert result[0]["date"] == "2024-01-01"

def test_invalid_number_value_non_strict_is_dropped(self):
    result = self.parser.parse("stars:abc")
    assert result == []

def test_invalid_number_value_strict_raises(self):
    with pytest.raises(QueryParserError) as ei:
        self.strict_parser.parse("stars:abc")
    assert ei.value.error_type == "NUMBER_VALUE_ERROR"

def test_invalid_boolean_value_non_strict_is_dropped(self):
    result = self.parser.parse("archived:maybe")
    assert result == []

def test_invalid_boolean_value_strict_raises(self):
    with pytest.raises(QueryParserError) as ei:
        self.strict_parser.parse("archived:maybe")
    assert ei.value.error_type == "BOOLEAN_VALUE_ERROR"

def test_escaped_quotes_in_quoted_string(self):
    result = self.parser.parse(r'author:"John \"JD\" Doe"')
    assert len(result) == 1
    assert result[0]["field"] == "author"
    assert result[0]["type"] == "string"
    # Parser preserves quotes and escape sequences for strings
    assert result[0]["string"] == r'"john \"jd\" doe"'
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a430a0e and fbb7796.

📒 Files selected for processing (2)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/apps/common/search/query_parser.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/common/search/query_parser_test.py (1)
backend/apps/common/search/query_parser.py (3)
  • QueryParser (98-525)
  • QueryParserError (50-95)
  • parse (158-197)
🔇 Additional comments (8)
backend/tests/apps/common/search/query_parser_test.py (8)

39-55: Parametrized basic string parsing looks solid.

Good coverage of quoted value normalization under case-sensitive vs. insensitive parsers. Assertions target the canonical lowercase field and expected string casing correctly.


56-61: Verify intended semantics for “case-sensitive field matching.”

Current implementation lowercases field names during tokenization regardless of the case_sensitive flag, so fields are effectively case-insensitive in all modes. The PR objectives mention “optional case-sensitive field matching.” If field-case sensitivity is a requirement, tests should assert failures for mismatched case under case_sensitive=True (and the parser likely needs an update). Otherwise, consider clarifying in docs/tests that case sensitivity applies only to values/free text, not field keys.
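For illustration, the intended split between field-name normalization and value preservation could look like this hypothetical helper (a sketch only, not the parser's actual code; `split_field` is an assumed name):

```python
def split_field(token: str, case_sensitive: bool = False) -> tuple[str, str]:
    """Sketch: lowercase only the field name when case-insensitive; keep the value as typed."""
    field, _, value = token.partition(":")
    if not case_sensitive:
        field = field.lower()
    return field, value


print(split_field('Author:"John Doe"'))                       # ('author', '"John Doe"')
print(split_field('Author:"John Doe"', case_sensitive=True))  # ('Author', '"John Doe"')
```

Under these semantics, `case_sensitive=True` would make `Author:` and `author:` distinct field keys, which the tests could then assert directly.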


62-70: Numeric comparison operator coverage is correct and concise.

The loop-based assertions exercise all supported operators and validate normalization to integer string.


71-79: Boolean normalization checks are accurate.

Asserting "True"/"False" based on canonical true/false sets matches the parser’s behavior.


80-105: Complex/mixed queries test reads well and matches spec.

Nice validation of: quoted strings, numeric op/value, boolean, and default-field mapping for free text.


138-147: Custom default field behavior is validated correctly.

Confirms that unknown field tokens are dropped in non-strict mode and free text maps to the configured default field.


148-158: Boundary cases are sensible and assertions match spec.

Treating empty or non key:value tokens (including quoted colon content) as free text is correct.


166-174: Numeric value normalization and default operator checks look good.

Covers negative, large, and float inputs with truncation semantics and implicit "=".

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/tests/apps/common/search/query_parser_test.py (2)

155-163: Test boundary conditions comprehensively.

The test covers various malformed inputs but could be more comprehensive. Consider adding edge cases like empty strings, whitespace-only strings, and special characters.

     @pytest.mark.parametrize(
         ("query"),
-        ["author:", ":123", '"field: true"', "2341"],
+        ["author:", ":123", '"field: true"', "2341", "", "   ", "field:"],
     )
     def test_boundary_conditions(self, query):
         result = self.parser.parse(query)
+        if not query.strip():
+            assert not result
+            return
         assert len(result) == 1
         assert result[0]["field"] == "query"
         assert result[0]["type"] == "string"
         assert result[0]["string"] == query

174-181: Confirm float-to-int truncation logic matches the test

Verified that in backend/apps/common/search/query_parser.py, _parse_number_value uses int(float(clean_value)), which truncates any decimal portion (e.g. "3.14159" → 3). The test in backend/tests/apps/common/search/query_parser_test.py (lines 174–181) asserts against str(int(float(val))), so it accurately reflects the current implementation.

• If only integer semantics are intended, no change is needed.
• If full–precision floats or a different rounding behavior should be supported, please update:

  • _parse_number_value in backend/apps/common/search/query_parser.py
  • Corresponding assertions in backend/tests/apps/common/search/query_parser_test.py
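For reference, the truncation under discussion can be reproduced in isolation (a standalone sketch of the `int(float(...))` behavior, not the parser itself):

```python
def parse_number(raw: str) -> int:
    # Mirrors the current approach: float() first, then int() truncates toward zero.
    return int(float(raw))


print(parse_number("3.14159"))  # 3
print(parse_number("-10.9"))    # -10
print(parse_number("1e3"))      # 1000
```

Note that this silently accepts floats and scientific notation, which is exactly why the integer-only vs. full-float decision matters.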

📥 Commits

Reviewing files that changed from the base of the PR and between fbb7796 and b8331ff.

📒 Files selected for processing (1)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/apps/common/search/query_parser_test.py (1)
backend/apps/common/search/query_parser.py (3)
  • QueryParser (98-525)
  • QueryParserError (50-95)
  • parse (158-197)
🪛 Ruff (0.12.2)
backend/tests/apps/common/search/query_parser_test.py

124-124: Single quotes found but double quotes preferred

Replace single quotes with double quotes

(Q000)

🔇 Additional comments (5)
backend/tests/apps/common/search/query_parser_test.py (5)

1-182: Address previous review feedback about nested test functions.

Based on the past review comments, there were originally nested test functions inside other test methods that wouldn't be discovered by pytest. This version appears to have addressed those issues by flattening the test structure, which is excellent.


6-29: Well-structured test class with comprehensive setup.

The test class is well-organized with clear setup of multiple parser configurations (standard, case-sensitive, strict, case-sensitive-strict) and test data. This provides good coverage of different parser behaviors.


71-78: Robust boolean value testing.

The test correctly validates boolean parsing across various representations and properly checks the normalized output format ("True"/"False" strings).


102-104: Comprehensive complex query testing.

The parametrized test cases effectively validate mixed content scenarios with different field types, quoted strings, and free text tokens in a single query.


144-152: Ignore incorrect test logic concern

The test correctly reflects the parser’s behavior: any token prefixed by an unknown field (here, “query:test”) is dropped (via _handle_unknown_field) and only true free-text tokens (free_text) are mapped to default_field. The implementation adds the default_field only for tokens without a colon, so the test’s expectation is accurate.

Likely an incorrect or invalid review comment.

@Adarsh0427 Adarsh0427 requested a review from arkid15r August 22, 2025 14:44
@Adarsh0427 Adarsh0427 changed the title from "Implement Initial Query Parser for Search Interfaces" to "Implemented Initial Query Parser for Search Interfaces" Aug 25, 2025
Collaborator

@arkid15r arkid15r left a comment

Great work @Adarsh0427

I have suggestions and some questions about the implementation:

value: String value to parse

Returns:
String representation of boolean value ("True" or "False")
Collaborator

I see returning str() form of True/False is by design. Could you add more details about this, maybe a comment.

Contributor Author

Changed it to this: Returns: String ("True" or "False") based on the boolean input value.
Is this statement fine?

Collaborator

Well, I expected something that explains the decision to return strings instead of booleans. Basically, why do you return "True" instead of True here?

Contributor Author

def _parse_boolean_value(value: str) -> str:
        """Parse boolean string to standardized string format.

        Converts boolean-like strings to "True" or "False" for consistent
        storage in dictionaries without needing type conversion.

        Args:
            value: Boolean string to parse

        Returns:
            "True" or "False" as string

        Raises:
            QueryParserError: If value is not a valid boolean

        """

Collaborator

'author:"John Doe" stars:>100 archived:false some "free text"'
...
{"type": "string", "field": "author", "string": '"john doe"'},
{"type": "number", "field": "stars", "op": ">", "number": "100"},
{"type": "boolean", "field": "archived", "boolean": "False"},
{"type": "string", "field": "query", "string": "some"},
    {"type": "string", "field": "query", "string": '"free text"'},

I think it should return naturally expected type/value pairs, e.g. 100 instead of "100" for number, True / False instead of "True"/"False" for boolean, string stays as is.
Any suggestions why it might be not a good idea?
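The practical difference is easy to see with plain dictionaries (illustrative values only, not the parser's exact output):

```python
# Stringly-typed condition (current shape) vs. natively-typed (proposed).
stringly = {"type": "number", "field": "stars", "op": ">", "number": "100"}
typed = {"type": "number", "field": "stars", "op": ">", "value": 100}

# The stringly form forces every consumer to convert before comparing...
assert int(stringly["number"]) > 50
# ...while the typed form compares directly and round-trips through JSON unchanged.
assert typed["value"] > 50
```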

Contributor Author

Do I need to change the code to store type/value pairs?
The return type of query_parser would then be List[Dict[str, str | int | bool]].

Collaborator

I think it could be unified with type/field/op/value instead of type/field/op/[string/boolean/number]

Contributor Author

I got your point, so the structure of result should be:

{
    "type": "str",
    "field": "str",
    "op": "str",
    "value": "[str | bool | int]"
}

Collaborator

Yes, value may be one of str, bool, int consistent with type.

Contributor Author

@Adarsh0427 Adarsh0427 Sep 5, 2025

I have updated the parser to return each value with its actual data type (int, bool, str), and added overflow handling for integers below 0 or above 1e32.
I also defined a dataclass for QueryCondition:

from dataclasses import asdict, dataclass


@dataclass
class QueryCondition:
    """Dataclass representing a parsed query condition."""

    type: str
    field: str
    value: str | int | bool
    op: str | None = None

    def to_dict(self) -> dict:
        """Convert the QueryCondition to a dictionary representation."""
        data = asdict(self)
        if data.get("op") is None:
            data.pop("op")
        return data

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

171-176: Fix case-sensitivity: preserve values; only normalize field names when requested.

Lowercasing the entire query (Line 174) and always lowercasing parsed field names (Line 385) breaks case-sensitive value searches and also makes case_sensitive=True ineffective for fields. Normalize only field names when case_sensitive=False, and keep values intact.

@@
-        query = query.strip()
-        if not self.case_sensitive:
-            query = query.lower()
-        tokens = self._split_tokens(query)
+        query = query.strip()
+        tokens = self._split_tokens(query)
@@
-                field, raw_value = self._parse_token(token)
+                field, raw_value = self._parse_token(token)
                 if field is None:
                     conditions.append(self._create_text_search_condition(raw_value))
                     continue
-
-                if field not in self.field_schema:
-                    self._handle_unknown_field(field)
-                    continue
-
-                field_type = self.field_schema[field]
-                condition = self.create_condition(field, field_type, raw_value)
+                # Normalize field name depending on case sensitivity
+                normalized_field = field if self.case_sensitive else field.lower()
+                if normalized_field not in self.field_schema:
+                    self._handle_unknown_field(normalized_field)
+                    continue
+                field_type = self.field_schema[normalized_field]
+                condition = self.create_condition(normalized_field, field_type, raw_value)
                 if condition:
                     conditions.append(condition)
@@
-        field, value = token.split(":", 1)
-        field = field.strip().lower()
+        field, value = token.split(":", 1)
+        field = field.strip()

Also applies to: 384-387

🧹 Nitpick comments (4)
backend/apps/common/search/query_parser.py (4)

116-119: Avoid passing a set to pyparsing.oneOf; use a deterministic literal list.

Sets are unordered; prefer a stable sequence or a whitespace-delimited string.

-    _COMPARISON_OPERATORS = {">", "<", ">=", "<=", "="}
-    _COMPARISON_OPERATOR = Optional(oneOf(_COMPARISON_OPERATORS), default="=")
+    _COMPARISON_OPERATORS = (">=", "<=", ">", "<", "=")
+    _COMPARISON_OPERATOR = Optional(oneOf(">= <= > < ="), default="=")

503-508: Do not silently truncate numeric values; reject non-integers instead.

int(float(clean_value)) turns “10.9” into 10. If only integers are supported, fail fast; if decimals are allowed, return them explicitly without truncation.

-            operator, clean_value = QueryParser._parse_comparison_pattern(value)
-            numeric_value = int(float(clean_value))
-            return operator, str(numeric_value)
+            operator, clean_value = QueryParser._parse_comparison_pattern(value)
+            # Accept only integers to avoid silent truncation of floats
+            numeric_value = int(clean_value)
+            return operator, str(numeric_value)

Please confirm desired behavior for inputs like score:>=10.5 or size:1e3; adjust parsing accordingly if decimals/scientific notation must be supported.


101-109: Clarify docstring phrasing about case sensitivity.

Make it explicit that only field-name matching is normalized and values are preserved.

-    - Handle case sensitivity/insensitivity values
+    - Configurable case sensitivity for field names; values are preserved

216-237: Output shape diverges from PR objective (“value” field vs type-specific keys).

You emit type-specific keys ("string", "number", "date", "boolean") instead of a generic "value" alongside "type". If downstream expects "value", this is a contract mismatch.

Options:

  • Keep current shape but document it clearly in the API contract and tests.
  • Or emit "value" with the canonical string representation and keep "type" for disambiguation. We can provide a small adapter if you prefer not to change the parser’s internals.

📥 Commits

Reviewing files that changed from the base of the PR and between b8331ff and 919d872.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/common/search/query_parser_test.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-22T14:42:54.652Z
Learnt from: Adarsh0427
PR: OWASP/Nest#2114
File: backend/tests/apps/common/search/query_parser_test.py:120-121
Timestamp: 2025-08-22T14:42:54.652Z
Learning: The query parser in backend/apps/common/search/query_parser.py can gracefully handle unclosed quotes in field values. For example, `created:"2024-12-01` (missing closing quote) is parsed correctly to produce `{'field': 'created', 'type': 'date', 'op': '=', 'date': '2024-12-01'}`. Test data with unclosed quotes like `'"2023-12-31'` is intentional to verify this robust parsing behavior.

Applied to files:

  • backend/apps/common/search/query_parser.py
🧬 Code graph analysis (1)
backend/apps/common/search/query_parser.py (1)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
🔇 Additional comments (3)
backend/apps/common/search/query_parser.py (3)

420-437: LGTM: default operator handling is correct and consistent.

Optional(..., default="=") plus normalization in _parse_comparison_pattern achieves the intended default “=”. Good cleanup.


119-120: LGTM: YYYYMMDD support and normalization.

Supporting both YYYY-MM-DD and YYYYMMDD with _normalize_date() meets the requirement and keeps output consistent.

Also applies to: 404-417


221-237: LGTM: clean use of structural pattern matching.

The match on FieldType is concise and readable.

Updated the field name regex pattern and modified the return logic in the field name validation method.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

343-347: Support escaped quotes inside quoted values.

Current regex breaks on \" (e.g., project:"OWASP "ZAP""). Allow escapes.

-        reg_components = {
-            "key_value": r'\S+:"[^"]*"',
-            "quoted_string": r'"[^"]*"',
-            "unquoted_word": r"\S+",
-        }
+        reg_components = {
+            "key_value": r'\S+:"([^"\\]|\\.)*"',
+            "quoted_string": r'"([^"\\]|\\.)*"',
+            "unquoted_word": r"\S+",
+        }
🧹 Nitpick comments (3)
backend/apps/common/search/query_parser.py (3)

14-14: Remove unused import (ruff F401).

alphas is imported but never used.

-from pyparsing import (
+from pyparsing import (
     Group,
     Optional,
     ParseException,
     QuotedString,
     Regex,
     Word,
     ZeroOrMore,
     alphanums,
-    alphas,
     oneOf,
 )

101-109: Docstring tweak for clarity on case sensitivity.

Clarify that only field names are normalized in case-insensitive mode; values are preserved.

-    - Handle case sensitivity/insensitivity values
+    - Case sensitivity: normalize field names only when disabled; preserve values

116-119: Make operator grammar deterministic; remove redundant set.

Passing a set to oneOf is unnecessary and order-dependent; define choices inline and drop _COMPARISON_OPERATORS.

-    _COMPARISON_OPERATORS = {">", "<", ">=", "<=", "="}
-    _COMPARISON_OPERATOR = Optional(oneOf(_COMPARISON_OPERATORS), default="=")
+    _COMPARISON_OPERATOR = Optional(oneOf(">= <= > < ="), default="=")

📥 Commits

Reviewing files that changed from the base of the PR and between 919d872 and 656dc4c.

📒 Files selected for processing (1)
  • backend/apps/common/search/query_parser.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/common/search/query_parser.py (1)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
🪛 Ruff (0.12.2)
backend/apps/common/search/query_parser.py

14-14: pyparsing.alphas imported but unused

Remove unused import: pyparsing.alphas

(F401)

🔇 Additional comments (6)
backend/apps/common/search/query_parser.py (6)

119-119: Nice: YYYYMMDD support.

Adding compact date format with normalization is aligned with review feedback.


237-243: Graceful non-strict parsing is correct.

Returning None on value errors and skipping tokens in non-strict mode matches the PR objective.


257-273: Unknown-field handling matches strict mode semantics.


445-465: String parsing behavior is consistent with quoted/unquoted rules.

Preserving quotes for quoted values and validating unquoted tokens via _UNQUOTED_VALUE is clear and robust.


467-487: Boolean parsing returns strings by design—docstring clarifies this.

The implementation and documentation are consistent.


490-509: Reject non-integer values or fully support floats in numeric parsing
The current implementation uses int(float(clean_value)), silently truncating decimals (e.g. "1.9"1). Either enforce integers only or parse floats end-to-end. Which do you prefer? For integer-only fields, for example, you can add:

+import re
@@ -506,3 +506,9 @@
             operator, clean_value = QueryParser._parse_comparison_pattern(value)
-            numeric_value = int(float(clean_value))
-            return operator, str(numeric_value)
+            if not re.fullmatch(r"[+-]?\d+", clean_value):
+                raise QueryParserError(
+                    message="Invalid numeric value: integers only",
+                    error_type="NUMBER_VALUE_ERROR",
+                )
+            return operator, clean_value
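The same guard can be exercised on its own (a hypothetical helper mirroring the suggestion; `parse_int_strict` is an assumed name, not part of the parser):

```python
import re


def parse_int_strict(raw: str) -> int:
    # Accept only plain integers; no silent float truncation.
    if not re.fullmatch(r"[+-]?\d+", raw):
        raise ValueError(f"not an integer: {raw!r}")
    return int(raw)


print(parse_int_strict("100"))  # 100
print(parse_int_strict("-42"))  # -42
# parse_int_strict("10.9") raises ValueError instead of returning 10
```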

@Adarsh0427 Adarsh0427 requested a review from arkid15r August 28, 2025 16:39
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

341-355: Support escaped quotes in tokenization (e.g., project:"OWASP "ZAP"").

Current patterns break on escaped quotes. Accept standard backslash escapes in quoted segments.

@@
-        reg_components = {
-            "key_value": r'\S+:"[^"]*"',
-            "quoted_string": r'"[^"]*"',
-            "unquoted_word": r"\S+",
-        }
+        reg_components = {
+            # Support escaped quotes inside quotes: \"
+            "key_value": r'\S+:"([^"\\]|\\.)*"',
+            "quoted_string": r'"([^"\\]|\\.)*"',
+            "unquoted_word": r"\S+",
+        }
🧹 Nitpick comments (1)
backend/apps/common/search/query_parser.py (1)

283-286: Tiny cleanup: reuse _remove_quotes in empty-string check.

Minor simplification; avoids repeated strip chains.

-        return token.strip().strip('"').strip() == ""
+        return QueryParser._remove_quotes(token).strip() == ""

📥 Commits

Reviewing files that changed from the base of the PR and between 656dc4c and f1a0e01.

📒 Files selected for processing (1)
  • backend/apps/common/search/query_parser.py (1 hunks)
🔇 Additional comments (4)
backend/apps/common/search/query_parser.py (4)

115-117: Good: default operator is “=” via grammar.

Using Optional(oneOf(...), default="=") simplifies downstream logic and fixes “missing ‘=’” cases.


117-117: Good: YYYYMMDD support added.

Accepting both YYYY-MM-DD and YYYYMMDD meets reviewer request and normalizing via _normalize_date keeps output consistent.


235-241: Non-strict mode resilience is correct.

Gracefully skipping invalid tokens (returning None) unless strict=True aligns with the PR’s robustness goal.


294-298: Config validation: message clarity and schema normalization look good.

Clear error typing and enum validation via has_enum_value avoid private enum internals; default_field normalization is sensible.

Also applies to: 320-329

@kasya
Collaborator

kasya commented Aug 30, 2025

hey @Adarsh0427! Could you resolve conflicts on this one? thanks!

Collaborator

@arkid15r arkid15r left a comment

Some more comments:

  • please revisit _parse_boolean_value return value type
  • also address/resolve coderabbit's comments (the ready for review code should have only human authored comments or bot's comments that you couldn't resolve yourself)

value: String value to parse

Returns:
String representation of boolean value ("True" or "False")
Collaborator

Well, I expected something that explains the decision to return strings instead of booleans. Basically, why do you return "True" instead of True here?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/apps/common/search/query_parser.py (2)

100-108: Fix case-sensitivity behavior to match docs; stop lowercasing entire query (mutates values).

Doc says “normalize field names only, preserve original value casing,” but parse() lowercases the whole query. This alters string/date values and contradicts the stated behavior.

Apply:

@@
-        query = query.strip()
-        if not self.case_sensitive:
-            query = query.lower()
+        query = query.strip()

Also applies to: 170-174


342-356: Tokenizer breaks on escaped quotes inside quoted values (e.g., project:"OWASP "ZAP"").

Update regex to accept escaped quotes so quoted segments aren’t split incorrectly.

@@
-        reg_components = {
-            "key_value": r'\S+:"[^"]*"',
-            "quoted_string": r'"[^"]*"',
-            "unquoted_word": r"\S+",
-        }
+        reg_components = {
+            # Support escaped quotes within quotes
+            "key_value": r'\S+:"([^"\\]|\\.)*"',
+            "quoted_string": r'"([^"\\]|\\.)*"',
+            "unquoted_word": r"\S+",
+        }
🧹 Nitpick comments (4)
backend/apps/common/search/query_parser.py (4)

528-539: Optionally validate calendar-correct dates (reject 2025-13-40).

Current regex checks format only. Add a lightweight datetime.strptime validation after normalization.

@@
-            result = QueryParser._DATE_PATTERN.parseString(clean_value, parseAll=True)
-            result[0] = QueryParser._normalize_date(result[0])
-            return operator, result[0]
+            result = QueryParser._DATE_PATTERN.parseString(clean_value, parseAll=True)
+            normalized = QueryParser._normalize_date(result[0])
+            # Calendar validation
+            from datetime import datetime
+            datetime.strptime(normalized, "%Y-%m-%d")
+            return operator, normalized
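The calendar check can be demonstrated standalone (a sketch; `is_valid_date` is an assumed helper name, not the parser's API):

```python
from datetime import datetime


def is_valid_date(value: str) -> bool:
    # Format-only regexes accept impossible dates; strptime enforces the calendar.
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_valid_date("2024-02-29"))  # True: 2024 is a leap year
print(is_valid_date("2025-13-40"))  # False: month 13, day 40
```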

328-329: Clarify default_field type override semantics.

You always coerce default_field to STRING, potentially overriding a schema-provided non-string type. Either enforce via explicit config error or document the override.

-        validated_fields[default_field] = FieldType.STRING
+        # Ensure default_field is STRING
+        if default_field in validated_fields and validated_fields[default_field] is not FieldType.STRING:
+            raise QueryParserError(
+                message="default_field must be of type 'string'",
+                error_type="CONFIGURATION_ERROR",
+                field=default_field,
+            )
+        validated_fields[default_field] = FieldType.STRING

Would you like me to update tests to cover this case?


284-287: Minor: simplify empty-token check.

Small readability tweak; behavior unchanged.

-        return token.strip().strip('"').strip() == ""
+        return token.strip('"').strip() == ""

209-214: Docstring: mention None return in non-strict mode.

create_condition() can return None when value parsing fails and strict=False; reflect this in the docstring.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 91b01b2 and be2bb97.

📒 Files selected for processing (1)
  • backend/apps/common/search/query_parser.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-22T14:42:54.652Z
Learnt from: Adarsh0427
PR: OWASP/Nest#2114
File: backend/tests/apps/common/search/query_parser_test.py:120-121
Timestamp: 2025-08-22T14:42:54.652Z
Learning: The query parser in backend/apps/common/search/query_parser.py can gracefully handle unclosed quotes in field values. For example, `created:"2024-12-01` (missing closing quote) is parsed correctly to produce `{'field': 'created', 'type': 'date', 'op': '=', 'date': '2024-12-01'}`. Test data with unclosed quotes like `'"2023-12-31'` is intentional to verify this robust parsing behavior.

Applied to files:

  • backend/apps/common/search/query_parser.py
🔇 Additional comments (1)
backend/apps/common/search/query_parser.py (1)

112-116: Good call on allowing unbalanced quotes for robustness.

Including " in _UNQUOTED_VALUE together with the fallback path enables graceful handling of unclosed quotes (per retrieved learnings). LGTM.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

171-174: Preserve value casing; stop lowercasing the entire query (contradicts doc).

Lowercasing mutates values when case_sensitive=False; doc says “normalize field names only, preserve original value casing.” Field names are already normalized in _parse_token.

-        if not self.case_sensitive:
-            query = query.lower()
+        # Preserve original value casing; field names are normalized during token parsing.
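The intended behavior can be sketched at token level: lowercase only the field name before the `:`, and leave the value (quoted or not) untouched. `normalize_field_name` is a hypothetical helper illustrating the rule, not the parser's actual `_parse_token`:

```python
def normalize_field_name(token: str) -> str:
    """Lowercase only the field name before ':'; preserve value casing."""
    if token.startswith('"'):
        return token  # standalone quoted phrase: keep as-is
    field, sep, value = token.partition(":")
    if not sep:
        return token  # free-text token: keep as-is
    return field.lower() + sep + value
```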
🧹 Nitpick comments (5)
backend/apps/common/search/query_parser.py (5)

112-116: Avoid accepting stray quotes in unquoted string values; split string-vs-comparison grammars.

UNQUOTED_VALUE includes ", which allows tokens like language:foo" to pass string parsing. Keep " only for comparison parsing (to support unclosed quotes), and use a stricter UNQUOTED_STRING for strings.

@@
-    QUOTED_VALUE = QuotedString(quoteChar='"', escChar="\\", unquoteResults=False)
-    UNQUOTED_VALUE = Word(alphanums + '+-.<>=/_"')
-    FIELD_VALUE = QUOTED_VALUE | UNQUOTED_VALUE
+    QUOTED_VALUE = QuotedString(quoteChar='"', escChar="\\", unquoteResults=False)
+    # For comparisons (numbers/dates), allow " to support robustness with unclosed quotes.
+    UNQUOTED_VALUE = Word(alphanums + '+-.<>=/_"')
+    # For plain string fields, do not allow stray quotes in unquoted values.
+    UNQUOTED_STRING = Word(alphanums + "+-.<>=/_")
+    FIELD_VALUE = QUOTED_VALUE | UNQUOTED_STRING
@@
-        try:
-            result = QueryParser.FIELD_VALUE.parseString(value, parseAll=True)
-            return result[0]
+        try:
+            result = QueryParser.FIELD_VALUE.parseString(value, parseAll=True)
+            return result[0]

Also applies to: 450-466


349-364: Precompile the token parser to avoid rebuilding regex every call.

Minor perf win on hot path.

@@ class QueryParser:
+    # Precompiled token parser
+    TOKEN_REGEX = r'\S+:"([^"\\]|\\.)*"|"([^"\\]|\\.)*"|\S+'
+    TOKEN_PARSER = ZeroOrMore(Regex(TOKEN_REGEX))
@@
-        token_regex = "|".join(reg)
-        parser = ZeroOrMore(Regex(token_regex))
-
         try:
-            result = parser.parseString(query, parseAll=True)
+            result = QueryParser.TOKEN_PARSER.parseString(query, parseAll=True)
             return result.asList()

512-516: Remove redundant quote-stripping in number/date parsers.

_parse_comparison_pattern already removes quotes.

@@
-            value = QueryParser._remove_quotes(value)
-            operator, clean_value = QueryParser._parse_comparison_pattern(value)
+            operator, clean_value = QueryParser._parse_comparison_pattern(value)
@@
-            value = QueryParser._remove_quotes(value)
-            operator, clean_value = QueryParser._parse_comparison_pattern(value)
+            operator, clean_value = QueryParser._parse_comparison_pattern(value)

Also applies to: 539-544


37-44: Doc nit: “Raises: None” is misleading.

This function never raises; simplify the doc.

-    Raises:
-        None: This function catches ValueError internally and returns False.
+    Note:
+        Returns False if the value is not a valid enum member; does not raise.
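The behavior that note describes, reduced to a standalone example (the `FieldType` enum here is a stand-in for the parser's own):

```python
from enum import Enum

class FieldType(Enum):
    STRING = "string"
    NUMBER = "number"
    DATE = "date"
    BOOLEAN = "boolean"

def is_valid_field_type(value: str) -> bool:
    """Return False for non-members instead of propagating ValueError."""
    try:
        FieldType(value)
    except ValueError:
        return False
    return True
```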

284-287: Tiny simplification.

Same behavior with less work.

-        return token.strip().strip('"').strip() == ""
+        return token.strip(' "').strip() == ""
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between be2bb97 and 599aef4.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/common/search/query_parser_test.py
🔇 Additional comments (1)
backend/apps/common/search/query_parser.py (1)

221-236: No changes required; project targets Python 3.13, so match-case is supported. CI’s setup-python and pyproject.toml both specify Python 3.13.

@Adarsh0427
Contributor Author

  • also address/resolve coderabbit's comments (the ready for review code should have only human authored comments or bot's comments that you couldn't resolve yourself)

Understood, I will take care next time

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
backend/apps/common/search/query_parser.py (2)

506-531: Numeric parsing behavior acknowledged.

Converting floats/exponents to ints (e.g., 1.9 -> 1, 1e3 -> 1000) is intentional per discussion; OverflowError handling is correct.


179-182: Preserve value casing; don’t lowercase the entire query.

Lowercasing mutates values (e.g., quoted phrases, project names) and contradicts the stated behavior. Normalize field names in _parse_token (already done) and keep values untouched.

-        if not self.case_sensitive:
-            query = query.lower()
+        # Preserve value casing; field names are normalized in _parse_token
🧹 Nitpick comments (6)
backend/apps/common/search/query_parser.py (6)

113-116: Align docstrings with intended case-sensitivity behavior.

Clarify that field names are always lowercased; values are preserved regardless of the flag (the flag is for downstream matching).

@@
-    - Case sensitivity: normalize field names only, preserve original value casing when
-      case sensitive is True
+    - Case sensitivity: field names are always normalized to lowercase; values are preserved.
@@
-            case_sensitive: Controls value-case behavior for downstream search,
-                        field names are always normalized to lowercase.
+            case_sensitive: Downstream matching hint for values; field names are always
+                        normalized to lowercase by this parser.

Also applies to: 144-146


258-263: Keep quotes for free-text tokens to retain phrase semantics.

Dropping quotes loses the signal for phrase searches. Preserve the token as-is.

-            value=QueryParser._remove_quotes(token),
+            value=token,

534-559: Validate calendar dates, not just format.

Regex accepts impossible dates (e.g., 2025-99-99). In strict mode, raise; otherwise skip.

Add import:

+from datetime import date as _date

Validate after normalization:

@@
-            result[0] = QueryParser._normalize_date(result[0])
-            return operator, result[0]
+            result[0] = QueryParser._normalize_date(result[0])
+            try:
+                _date.fromisoformat(result[0])
+            except ValueError as e:
+                raise QueryParserError(
+                    message=f"Invalid calendar date: {e!s}", error_type="DATE_VALUE_ERROR"
+                ) from e
+            return operator, result[0]

Also applies to: 3-5


351-376: Avoid rebuilding the token parser per call.

Move the token regex/parser to class-level constants to reduce per-parse allocations.

Add class-level constants:

@@
     DATE_PATTERN = Regex(r"\d{4}-\d{2}-\d{2}") | Regex(r"\d{8}")  # YYYY-MM-DD or YYYYMMDD format
+    TOKEN_REGEX = r'\S+:"([^"\\]|\\.)*"|"([^"\\]|\\.)*"|\S+'
+    TOKEN_PARSER = ZeroOrMore(Regex(TOKEN_REGEX))

Use it:

-        token_regex = "|".join(reg)
-        parser = ZeroOrMore(Regex(token_regex))
+        parser = QueryParser.TOKEN_PARSER

Also applies to: 118-126


388-409: Defensive guard for empty tokens.

Unlikely, but safe to avoid IndexError on token[0].

         token = token.strip()
+        if not token:
+            return None, token
         if token[0] == '"':
             return None, token

248-264: Minor: unify string parsing path.

Consider reusing FIELD_VALUE.parseString(...)[0] here for consistency with _parse_string_value (keeps quotes and escaping consistently). If you prefer the simpler value=token, ignore.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 22e2d15 and 752c39a.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/apps/common/search/query_parser_test.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/apps/common/search/query_parser.py (1)
backend/apps/ai/agent/tools/rag/rag_tool.py (1)
  • query (34-63)
🔇 Additional comments (2)
backend/apps/common/search/query_parser.py (2)

361-366: Escaped quotes in quoted segments — good catch.

The updated regex handles \" correctly; this fixes broken tokenization for inputs like project:"OWASP \"ZAP\"".


118-126: UNQUOTED_VALUE includes " by design — acceptable given unclosed-quote handling.

Matches the retrieved learning to accept tokens like created:"2024-12-01 robustly.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

179-182: Stop lowercasing the entire query (breaks value casing).

Lowercasing the whole query mutates values (e.g., project titles, authors) in case-insensitive mode. Field normalization is already handled in _parse_token. Preserve original value casing.

-        query = query.strip()
-        if not self.case_sensitive:
-            query = query.lower()
+        query = query.strip()
🧹 Nitpick comments (5)
backend/apps/common/search/query_parser.py (5)

113-116: Clarify case-sensitivity doc: preserve values; lowercase fields only.

Current wording implies preserving values only when case_sensitive=True. Parsing should always preserve values; only field names are normalized.

-    - Case sensitivity: normalize field names only, preserve original value casing when
-      case sensitive is True
+    - Case sensitivity: normalize field names only; always preserve original value casing.
+      The `case_sensitive` flag affects downstream matching, not parsing here.

3-5: Validate calendar dates (e.g., reject 2025-02-30).

Add a real date check after normalization to prevent impossible dates slipping through.

-from dataclasses import asdict, dataclass
-from enum import Enum
+from dataclasses import asdict, dataclass
+from enum import Enum
+from datetime import date
-            result = QueryParser.DATE_PATTERN.parseString(clean_value, parseAll=True)
-            result[0] = QueryParser._normalize_date(result[0])
-            return operator, result[0]
+            result = QueryParser.DATE_PATTERN.parseString(clean_value, parseAll=True)
+            normalized = QueryParser._normalize_date(result[0])
+            # Calendar validation
+            try:
+                date.fromisoformat(normalized)
+            except ValueError as ve:
+                raise QueryParserError(
+                    message=f"Invalid calendar date: {normalized}",
+                    error_type="DATE_VALUE_ERROR",
+                ) from ve
+            return operator, normalized

Also applies to: 553-555


513-516: Document numeric coercion and improve range error.

Make the truncation semantics explicit, and avoid intentionally raising a bare OverflowError only to catch it below.

-        Returns:
-            Tuple of (operator, numeric_value) where numeric_value is integer
-            and "=" is the default operator.
+        Returns:
+            Tuple of (operator, numeric_value) where numeric_value is an integer.
+            Floats and scientific notation are accepted and truncated via int(float(...)).
+            "=" is the default operator.
-            if (numeric_value > 2**64 or numeric_value < 0):
-                raise OverflowError
+            if numeric_value > 2**64 or numeric_value < 0:
+                raise ValueError("numeric value out of supported range [0, 2**64]")
-        except (ValueError, OverflowError, ParseException) as e:
+        except (ValueError, OverflowError, ParseException) as e:
             raise QueryParserError(
                 message=f"Invalid numeric value: {e!s}", error_type="NUMBER_VALUE_ERROR"
             ) from e

Also applies to: 525-527, 529-533


258-263: Confirm intent: quoted strings kept for field values, but stripped for free-text.

Field values preserve quotes (e.g., '"C# C++"'), while free-text tokens drop quotes. Is this asymmetry intentional for API consumers?

I can align both to return unquoted values consistently (and add a flag indicating original quoting) if desired.


293-296: Minor simplification for emptiness check.

Equivalent and a bit cleaner.

-        return token.strip().strip('"').strip() == ""
+        return token.strip().strip('"') == ""
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 752c39a and 34c4b0d.

📒 Files selected for processing (1)
  • backend/apps/common/search/query_parser.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
backend/apps/common/search/query_parser.py

526-526: Abstract raise to an inner function

(TRY301)

🔇 Additional comments (2)
backend/apps/common/search/query_parser.py (2)

398-406: Field-name normalization looks good.

Always lowercasing field names at parse time is consistent with the schema and avoids surprises. Values are left intact (once the whole-query lowercasing is removed).


361-379: Tokenization handles escaped quotes correctly.

Good use of patterns allowing \" inside quotes for both key-value and standalone quoted tokens.

@Adarsh0427 Adarsh0427 requested a review from arkid15r September 5, 2025 19:32
Collaborator

@arkid15r arkid15r left a comment

You've made good progress with this PR and are very close to merging it. However, you are still skipping our mandatory steps that ensure code quality and test coverage. Please run make check-test locally. Also look into these when you get a chance:

@Adarsh0427 Adarsh0427 requested a review from arkid15r September 6, 2025 03:32
arkid15r
arkid15r previously approved these changes Sep 10, 2025
Collaborator

@arkid15r arkid15r left a comment

Hopefully, this will serve as a strong foundation to build upon.

Thank you @Adarsh0427 👍


@arkid15r arkid15r added this pull request to the merge queue Sep 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/apps/common/search/query_parser.py (1)

172-175: Do not lowercase the entire query (breaks values in case-insensitive mode)

Lowercasing the whole query mutates quoted/unquoted values (e.g., project names). Normalize field names only; keep values intact.

-        query = query.strip()
-        if not self.case_sensitive:
-            query = query.lower()
+        query = query.strip()
🧹 Nitpick comments (4)
backend/apps/common/search/query_parser.py (4)

117-120: Clarify case-sensitivity statement to match intended behavior

Recommend wording that does not imply value lowercasing.

-        - Case sensitivity: normalize field names only, preserve original value casing when
-          case sensitive is True
+        - Case sensitivity: normalize field names only; preserve original value casing

348-390: Tokenizer creation on every call; consider precompiling at class level

Building the Regex/ZeroOrMore parser on each call adds overhead. Precompile it once as a class attribute.

 class QueryParser:
@@
-    def _split_tokens(query: str) -> list[str]:
+    # Precompiled tokenizer
+    TOKENIZER = ZeroOrMore(
+        Regex('|'.join([
+            r'\S+:"([^"\\]|\\.)*"',
+            r'"([^"\\]|\\.)*"',
+            r'\S+',
+        ]))
+    )
+
+    def _split_tokens(query: str) -> list[str]:
@@
-        parser = ZeroOrMore(
-            Regex(
-                "|".join(
-                    [
-                        regex_components["key_value"],
-                        regex_components["quoted_string"],
-                        regex_components["unquoted_word"],
-                    ]
-                )
-            )
-        )
         try:
-            result = parser.parseString(query, parseAll=True)
+            result = QueryParser.TOKENIZER.parseString(query, parseAll=True)

472-491: String parsing is strict on unclosed quotes for field values—confirm intent

Free-text tokens tolerate unclosed quotes, but field string values require closed quotes. If you want parity with dates’ “unclosed quote tolerance,” consider accepting unclosed quoted strings here too.


122-129: Grammar definitions look solid; minor import tidy-up

QUOTED_VALUE with escaping and default “=” via Optional/oneOf are correct. You can collapse the two pyparsing import blocks for cleanliness.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d1ea487 and 88c2147.

⛔ Files ignored due to path filters (1)
  • backend/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • backend/apps/common/search/query_parser.py (1 hunks)
  • backend/pyproject.toml (1 hunks)
  • backend/tests/apps/common/search/query_parser_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/tests/apps/common/search/query_parser_test.py
  • backend/pyproject.toml
🧰 Additional context used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Run frontend e2e tests
  • GitHub Check: Run frontend unit tests
  • GitHub Check: Run backend tests
🔇 Additional comments (2)
backend/apps/common/search/query_parser.py (2)

409-417: Field-name normalization looks good

Lowercasing field identifiers at parse time aligns with the “fields always lowercase” rule.


493-516: Boolean parsing returns actual booleans—LGTM

Consistent with type semantics and downstream usage.
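The accepted boolean literals listed in the PR description (true/false, yes/no, 1/0, on/off) map cleanly to real booleans; a minimal sketch of that mapping (`parse_bool` is a hypothetical helper, not the parser's method):

```python
_TRUE = {"true", "yes", "1", "on"}
_FALSE = {"false", "no", "0", "off"}

def parse_bool(value: str) -> bool:
    """Map the accepted boolean literals to bool; reject anything else."""
    v = value.strip().strip('"').lower()
    if v in _TRUE:
        return True
    if v in _FALSE:
        return False
    raise ValueError(f"Invalid boolean value: {value!r}")
```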

Comment on lines +518 to +551
def _parse_number_value(value: str) -> tuple[str, int]:
"""Parse numeric value with optional comparison operator.

Args:
value: String value to parse

Returns:
Tuple of (operator, numeric_value) where numeric_value is integer,
clamped to [0, 2^32] range and "=" is the default operator.

Raises:
QueryParserError: If value is not a valid number or int overflow occurs

"""
try:
operator, clean_value = QueryParser._parse_comparison_pattern(
QueryParser._remove_quotes(value)
)

numeric_value = int(float(clean_value))
if numeric_value > 2**32 or numeric_value < 0:
operator = ">" if numeric_value < 0 else operator
numeric_value = max(0, min(numeric_value, 2**32))
except ValueError as e:
raise QueryParserError(
message=f"Invalid numeric value: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e
except OverflowError as e:
raise QueryParserError(
message=f"Numeric value overflow: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e

return operator, numeric_value


🛠️ Refactor suggestion

Do not change the comparison operator when clamping numeric values

Flipping the operator for negatives (e.g., stars:<-5 => stars:>0) changes intent and can over-broaden results. Clamp the value but keep the operator unchanged.

-            numeric_value = int(float(clean_value))
-            if numeric_value > 2**32 or numeric_value < 0:
-                operator = ">" if numeric_value < 0 else operator
-                numeric_value = max(0, min(numeric_value, 2**32))
+            numeric_value = int(float(clean_value))
+            # Clamp to allowed range without altering the operator semantics
+            numeric_value = max(0, min(numeric_value, 2**32))

If you prefer, in non-strict mode you could skip unsatisfiable constraints (e.g., "<0") instead of clamping; in strict mode, raise.
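A minimal sketch of the suggested fix, clamping the value while leaving the operator alone (`clamp_numeric` is a hypothetical helper mirroring the relevant lines):

```python
def clamp_numeric(operator: str, raw: str) -> tuple[str, int]:
    """Parse a numeric value, clamping to [0, 2**32] without altering the operator."""
    value = int(float(raw))  # accepts "1.9" and "1e3"; truncates toward zero
    return operator, max(0, min(value, 2**32))
```

With this version, `stars:<-5` clamps to `stars:<0` (an unsatisfiable but honest constraint) rather than silently becoming `stars:>0`.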

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _parse_number_value(value: str) -> tuple[str, int]:
"""Parse numeric value with optional comparison operator.
Args:
value: String value to parse
Returns:
Tuple of (operator, numeric_value) where numeric_value is integer,
clamped to [0, 2^32] range and "=" is the default operator.
Raises:
QueryParserError: If value is not a valid number or int overflow occurs
"""
try:
operator, clean_value = QueryParser._parse_comparison_pattern(
QueryParser._remove_quotes(value)
)
numeric_value = int(float(clean_value))
if numeric_value > 2**32 or numeric_value < 0:
operator = ">" if numeric_value < 0 else operator
numeric_value = max(0, min(numeric_value, 2**32))
except ValueError as e:
raise QueryParserError(
message=f"Invalid numeric value: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e
except OverflowError as e:
raise QueryParserError(
message=f"Numeric value overflow: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e
return operator, numeric_value
def _parse_number_value(value: str) -> tuple[str, int]:
"""Parse numeric value with optional comparison operator.
Args:
value: String value to parse
Returns:
Tuple of (operator, numeric_value) where numeric_value is integer,
clamped to [0, 2^32] range and "=" is the default operator.
Raises:
QueryParserError: If value is not a valid number or int overflow occurs
"""
try:
operator, clean_value = QueryParser._parse_comparison_pattern(
QueryParser._remove_quotes(value)
)
numeric_value = int(float(clean_value))
# Clamp to allowed range without altering the operator semantics
numeric_value = max(0, min(numeric_value, 2**32))
except ValueError as e:
raise QueryParserError(
message=f"Invalid numeric value: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e
except OverflowError as e:
raise QueryParserError(
message=f"Numeric value overflow: {e!s}", error_type="NUMBER_VALUE_ERROR"
) from e
return operator, numeric_value
🤖 Prompt for AI Agents
In backend/apps/common/search/query_parser.py around lines 518 to 551, the code
currently flips the comparison operator when clamping out-of-range numeric
values (e.g., negative values become ">" with 0), which alters query semantics;
instead, clamp numeric_value to the [0, 2**32] range but do NOT modify operator
— preserve the original operator returned by _parse_comparison_pattern; if you
want optional behavior, implement separate handling for strict vs non-strict
modes (in strict mode raise a QueryParserError for unsatisfiable constraints
like "<0", in non-strict mode either skip such filters or leave them clamped)
but the immediate fix is: remove the operator reassignment and only adjust
numeric_value to max(0, min(numeric_value, 2**32)).

Merged via the queue into OWASP:main with commit 51ac3f5 Sep 10, 2025
25 checks passed
arkid15r added a commit that referenced this pull request Sep 14, 2025
* Implement Initial Query Parser for Search Interfaces
[#2038]

* fixed suggestions addressed bots

* fixed edge cases for token parsing and enhanced test cases

* fixed suggestions addressed by bot in query_parser_test.py

* addressed the suggestions and fixed them

* Refine field name regex and validation logic

Updated the field name regex pattern and modified the return logic in the field name validation method.

* fixed suggestions addressed by bot

* Fix Docs case-sensitivity behavior explanation

* Resolved pnpm run format issue in some frontend files

* Restored frontend files format to original

* Update code

* Resolved suggestions

* format check

* updated parser to return value with type and enhanced test cases

* Implement overflow check for numeric value parsing

* fixed pre-commit checks and handled try/except more effectively

* Update code

---------

Co-authored-by: Arkadii Yakovets <[email protected]>
Co-authored-by: Arkadii Yakovets <[email protected]>
Co-authored-by: Kate Golovanova <[email protected]>
Successfully merging this pull request may close these issues:

  • Implement Query Parser for Search Interfaces