514-labs / olap-agent-ref Public

Notifications You must be signed in to change notification settings
Fork 0
Star 2

Reference documents for my agents building OLAP systems

2 stars 0 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
0-schema-design.md		0-schema-design.md
1-type-rules.md		1-type-rules.md
2-order-by.md		2-order-by.md
3-nesting-guide.md		3-nesting-guide.md
3.1-New-JSON-type.md		3.1-New-JSON-type.md
4-managing-NULL.md		4-managing-NULL.md
README.md		README.md

Repository files navigation

olap-agent-ref

Reference documents for my agents building OLAP systems.

Proposed syllabus; I am most sure about the ones coming up soonest. Super open to feedback.

Phase 1 — Data Modeling & Table / Database Design

✅ Types (strictest type that fits; bytes processed rule)
✅ LowCardinality & Enums (dictionary encoding thresholds and trade‑offs)
✅ ORDER BY keys (cluster low→high cardinality; align with filters)
✅ Denormalization (wide/flat beats joins for OLAP reads)
✅ Flattening and nesting (Nested vs JSON vs flatten; when to explode)
✅ Nullability vs defaults (avoid null bitmaps; when NULL is justified)
Compression & encoding overview (codecs, sort effects, how to measure)
ClickHouse codecs (ZSTD/LZ4/Delta/DoubleDelta/T64; per‑column picks)
Numeric & date precision (Decimal vs Float; DateTime64; codec synergy)
Strings & categoricals (LowCardinality vs Enum vs String; memory impact)
JSON type & subcolumns (new JSON; typed paths; dynamic path limits)
NamedTuple vs Nested (single object vs repeated arrays; query cost)
One table = one grain (line‑item vs order; dedupe and keys)
Partitioning strategy (time‑based pruning and lifecycle)
ORDER BY × PARTITION BY (skip indexes, merges, hot/cold access)
Surrogate keys & IDs (UInt sizes, hashing, collisions)
Time‑series modeling (bucketing, sessions, downsampling)
Schema evolution (ALTER safety; promote hot paths; land raw JSON)
Constraints & immutability (why no FKs; append‑only patterns)
Modeling recap & linter rules (agent checks; detection SQL)

Phase 2 — Query Optimization

Query planning & read path (marks, granules, pruning, EXPLAIN)
Predicate pushdown & filter order (min/max indexes; early pruning)
Aggregation optimization (GROUP BY keys; combinators; partials)
Join strategies (hash vs merge joins; collocation; prejoin choices)
Projections (alternate physical layouts for fast paths)
Sampling & approximations (SAMPLE; quantiles; uniq variants)
Materialized views (schema‑on‑write; incremental rollups; pitfalls)
Aggregating/SummingMergeTree (design pre‑agg tables correctly)
Arrays at scale (ARRAY JOIN vs array functions; when to flatten)
Low‑latency dashboards (buckets, precompute, refresh patterns)
Compression tuning in practice (per‑column experiments, benchmarks)
Caching & memory locality (mark cache, OS cache, query cache)
Parallelism & threads (max_threads; read concurrency; I/O)
Memory limits & spills (max_memory_usage; OOM mitigation)
High‑cardinality handling (distinct strategies; sketches; indexes)
Secondary/skip indexes (bloom, token, set; when they help)
Lambda/UDF patterns (vectorized transforms vs row‑by‑row)
Anti‑patterns (SELECT *; random ORDER BY; over‑wide ORDER BY)
Distributed queries & sharding (clusters, locality, rebalance)
Optimization playbook (checklist; query_log metrics; guardrails)

Phase 3 — Updates & Migration

Immutability & merges (why UPDATE/DELETE is different in OLAP)
Lightweight deletes/updates (mutation engine; safe patterns)
ReplacingMergeTree for dedupe (version column; correctness)
Backfills & rebuilds (CTAS; chunked INSERT SELECT; copy‑on‑write)
TTL & retention (expire/move by partition; cold storage tiers)
Schema registry & contracts (versioned schemas; compatibility)
Migration safety (shadow tables; dual‑write; cutover/rollback)
Data validation (counts, checksums, sampling; dbt tests)
Late‑arriving data (upserts, re‑aggregation, idempotency)
Migration toolkit (automation scripts; playbooks; SLAs)

Phase 4 — Integrations & Operations

Streaming ingest (Kafka; MV routers; backpressure)
CDC from OLTP (Debezium; transforms; ordering)
APIs & webhooks (direct inserts; dedupe; rate limits)
Data lake interop (S3/Parquet; external tables; staging)
Real‑time dashboards (freshness targets; rollup layers)
Observability (system tables; query_log; tracing; alerts)
Cost management (codecs, TTL, merges, part counts, storage)
Security & RBAC (roles, row policies, masking)
Backups & disaster recovery (BACKUP/RESTORE; snapshots)
Capstone & linter pack (agent rules; end‑to‑end review)

About

Reference documents for my agents building OLAP systems

Custom properties

Report repository

Releases

No releases published

Packages

No packages published