Labels: bug
Description
Describe the bug
When backfilling logs from various time periods simultaneously, splits are not well-distributed by timestamp. Most splits end up spanning nearly the entire time range instead of being partitioned into smaller time windows. This leads to inefficient time-based pruning during queries.
Steps to reproduce (if applicable)
- Set up a Quickwit index with timestamp field
- Backfill historical logs from multiple time periods (e.g., logs from different days/weeks)
- Observe split metadata - most splits will have very wide time ranges
Expected behavior
Splits should have bounded time ranges even during backfill operations, enabling efficient time-based pruning during queries.
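The pruning effect can be sketched with a toy model (illustrative Python, not Quickwit's implementation; `Split` and `prune` are hypothetical stand-ins for split metadata and the searcher's time-range filter):

```python
from dataclasses import dataclass

@dataclass
class Split:
    """Minimal stand-in for a split's time-range metadata."""
    split_id: str
    min_ts: int  # earliest doc timestamp in the split (unix seconds)
    max_ts: int  # latest doc timestamp in the split (unix seconds)

def prune(splits, query_start, query_end):
    """Keep only splits whose [min_ts, max_ts] overlaps the query window."""
    return [s for s in splits if s.max_ts >= query_start and s.min_ts <= query_end]

DAY = 86_400

# Bounded splits (expected): a one-day query touches a single split.
bounded = [Split(f"day-{d}", d * DAY, (d + 1) * DAY - 1) for d in range(30)]
print(len(prune(bounded, 5 * DAY, 6 * DAY - 1)))  # 1

# Wide splits (reported behavior): every split overlaps every query,
# so time-based pruning filters out nothing.
wide = [Split(f"wide-{i}", 0, 30 * DAY) for i in range(30)]
print(len(prune(wide, 5 * DAY, 6 * DAY - 1)))  # 30
```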
Before PR #5850:
- Splits containing both old and recent data were deleted prematurely
- Data loss occurred when any document in a split exceeded the retention period
- Example: a split with data from 2023-2025 would be deleted in its entirety once the 2023 data expired
After PR #5850 (bug fixed):
- Splits with wide time ranges cannot be deleted until ALL of their data expires
- Query performance degrades because time-based pruning becomes less effective
- Example: a split with 99% expired data and 1% recent data must be kept in its entirety
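The retention interaction can be sketched the same way (illustrative Python; `split_expired` and the timestamps are hypothetical, but the rule matches the post-#5850 behavior described above: a split is deletable only once its newest document has expired):

```python
DAY = 86_400
RETENTION_SECS = 365 * DAY  # matches `retention.period: 365 days` below

def split_expired(split_max_ts: int, now: int) -> bool:
    """A split is deletable only when ALL of its data exceeds retention,
    i.e. when even the newest document (split_max_ts) has expired."""
    return split_max_ts < now - RETENTION_SECS

now = 1_750_000_000  # arbitrary "current" instant

# Fully expired split: every document is older than the retention period.
old_split = (now - 3 * 365 * DAY, now - 2 * 365 * DAY)
print(split_expired(old_split[1], now))   # True -> deletable

# Wide split: almost all data expired, but one recent document pins it.
wide_split = (now - 3 * 365 * DAY, now - DAY)
print(split_expired(wide_split[1], now))  # False -> kept entirely
```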
Configuration:
- Quickwit Version: qw-airmail-20250522-hotfix
- The index config:

```yaml
version: 0.8
index_id: log.common.karrot_audit_log_v1
doc_mapping:
  field_mappings:
    # CreatedAt field
    - name: created_at
      type: datetime
      input_formats:
        - iso8601
        - unix_timestamp
      output_format: unix_timestamp_nanos
      fast: true
      fast_precision: microseconds
      description: "Kafka log creation time"
    # Timestamp field
    - name: timestamp
      type: datetime
      input_formats:
        - iso8601
        - unix_timestamp
      output_format: unix_timestamp_nanos
      fast: true
      fast_precision: microseconds
      description: "Log occurrence timestamp"
    # Actor fields
    - name: actor
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Actor Type (Employee, User, System)"
        - name: metadata
          type: json
          tokenizer: default
          description: "Actor Metadata"
    # Event fields
    - name: event
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Event Type (Normal, Authorization, Privacy, Location)"
        - name: operation
          type: text
          tokenizer: default
          fast: true
          description: "Event Operation"
        - name: reason
          type: text
          tokenizer: default
          description: "Event Reason"
        - name: resource
          type: object
          field_mappings:
            - name: type
              type: text
              tokenizer: lowercase
              fast: true
              description: "Event Resource Type"
            - name: value
              type: text
              tokenizer: lowercase
              fast: true
              description: "Event Resource Value"
        - name: metadata
          type: json
          tokenizer: default
          description: "Event Metadata"
    # Source fields
    - name: source
      type: object
      field_mappings:
        - name: type
          type: text
          tokenizer: raw
          fast: true
          description: "Source Type (Admin, Service)"
        - name: metadata
          type: object
          field_mappings:
            - name: name
              type: text
              tokenizer: lowercase
              fast: true
              description: "Source Name"
            - name: country_code
              type: text
              tokenizer: raw
              fast: true
              description: "Source Country Code"
            - name: url
              type: text
              tokenizer: default
              description: "Source URL"
            - name: request_id
              type: text
              tokenizer: raw
              fast: true
              description: "Source Request ID"
    - name: env
      type: text
      tokenizer: raw
      fast: true
      description: "Environment (alpha, prod)"
    - name: region
      type: text
      tokenizer: raw
      fast: true
      description: "Region (kr, ca, jp, gb)"
  # Timestamp field for time-based partitioning
  timestamp_field: created_at
indexing_settings:
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 48h
  commit_timeout_secs: 5
search_settings:
  default_search_fields:
    - source.metadata.name
    - event.resource.type
    - event.resource.value
retention:
  period: 365 days
  schedule: daily
```