Skip to content

Splits span wide time ranges when backfilling historical data from multiple time periods #5854

@earlbread

Description

@earlbread

Describe the bug
When backfilling logs from various time periods simultaneously, splits are not well-distributed by timestamp. Most splits end up spanning nearly the entire time range instead of being partitioned into smaller time windows. This leads to inefficient time-based pruning during queries.

Steps to reproduce (if applicable)

  1. Set up a Quickwit index with timestamp field
  2. Backfill historical logs from multiple time periods (e.g., logs from different days/weeks)
  3. Observe split metadata - most splits will have very wide time ranges

Expected behavior
Splits should have bounded time ranges even during backfill operations, enabling efficient time-based pruning during queries.

Related Issues
#5848 , #5850

Before PR #5850:

  • Splits containing both old and recent data were deleted prematurely
  • Data loss occurred when any document in a split exceeded retention period
  • Example: A split with data from 2023-2025 would be entirely deleted once 2023 data expired

After PR #5850 (bug fixed):

  • Wide time range splits cannot be deleted until ALL data expires
  • Query performance degraded as time-based pruning becomes less effective
  • Example: A split with 99% expired data and 1% recent data must be kept entirely

Configuration:

  1. Quickwit Version: qw-airmail-20250522-hotfix
  2. The index_config:
version: 0.8

index_id: log.common.karrot_audit_log_v1

doc_mapping:
  field_mappings:
    # CreatedAt field
    - name: created_at
      type: datetime
      input_formats:
        - iso8601
        - unix_timestamp
      output_format: unix_timestamp_nanos
      fast: true
      fast_precision: microseconds
      description: "Kafka log creation time"

    # Timestamp field
    - name: timestamp
      type: datetime
      input_formats:
        - iso8601
        - unix_timestamp
      output_format: unix_timestamp_nanos
      fast: true
      fast_precision: microseconds
      description: "Log occurrence timestamp"

    # Actor fields
    - name: actor
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Actor Type (Employee, User, System)"

        - name: metadata
          type: json
          tokenizer: default
          description: "Actor Metadata"

    # Event fields
    - name: event
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Event Type (Normal, Authorization, Privacy, Location)"

        - name: operation
          type: text
          tokenizer: default
          fast: true
          description: "Event Operation"

        - name: reason
          type: text
          tokenizer: default
          description: "Event Reason"

        - name: resource
          type: object
          field_mappings:
            - name: type
              type: text
              tokenizer: lowercase
              fast: true
              description: "Event Resource Type"

            - name: value
              type: text
              tokenizer: lowercase
              fast: true
              description: "Event Resource Value"

        - name: metadata
          type: json
          tokenizer: default
          description: "Event Metadata"

    # Source fields
    - name: source
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Source Type (Admin, Service)"

        - name: metadata
          type: object
          field_mappings:
            - name: name
              type: text
              tokenizer: lowercase
              fast: true
              description: "Source Name"

            - name: country_code
              type: text
              tokenizer: raw
              fast: true
              description: "Source Country Code"

            - name: url
              type: text
              tokenizer: default
              description: "Source URL"

            - name: request_id
              type: text
              tokenizer: raw
              fast: true
              description: "Source Request ID"

    - name: env
      type: text
      tokenizer: raw
      fast: true
      description: "Environment (alpha, prod)"

    - name: region
      type: text
      tokenizer: raw
      fast: true
      description: "Region (kr, ca, jp, gb)"

  # Timestamp field for time-based partitioning
  timestamp_field: created_at

indexing_settings:
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 48h
  commit_timeout_secs: 5

search_settings:
  default_search_fields:
    - source.metadata.name
    - event.resource.type
    - event.resource.value

retention:
  period: 365 days
  schedule: daily

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions