Reference documents for my agents building OLAP systems.
Proposed syllabus; I am most sure about the ones coming up soonest. Super open to feedback.
- ✅ Types (strictest type that fits; bytes processed rule)
- ✅ LowCardinality & Enums (dictionary encoding thresholds and trade‑offs)
- ✅ ORDER BY keys (cluster low→high cardinality; align with filters)
- ✅ Denormalization (wide/flat beats joins for OLAP reads)
- ✅ Flattening and nesting (Nested vs JSON vs flatten; when to explode)
- ✅ Nullability vs defaults (avoid null bitmaps; when NULL is justified)
- Compression & encoding overview (codecs, sort effects, how to measure)
- ClickHouse codecs (ZSTD/LZ4/Delta/DoubleDelta/T64; per‑column picks)
- Numeric & date precision (Decimal vs Float; DateTime64; codec synergy)
- Strings & categoricals (LowCardinality vs Enum vs String; memory impact)
- JSON type & subcolumns (new JSON; typed paths; dynamic path limits)
- NamedTuple vs Nested (single object vs repeated arrays; query cost)
- One table = one grain (line‑item vs order; dedupe and keys)
- Partitioning strategy (time‑based pruning and lifecycle)
- ORDER BY × PARTITION BY (skip indexes, merges, hot/cold access)
- Surrogate keys & IDs (UInt sizes, hashing, collisions)
- Time‑series modeling (bucketing, sessions, downsampling)
- Schema evolution (ALTER safety; promote hot paths; land raw JSON)
- Constraints & immutability (why no FKs; append‑only patterns)
- Modeling recap & linter rules (agent checks; detection SQL)
- Query planning & read path (marks, granules, pruning, EXPLAIN)
- Predicate pushdown & filter order (min/max indexes; early pruning)
- Aggregation optimization (GROUP BY keys; combinators; partials)
- Join strategies (hash vs merge joins; collocation; prejoin choices)
- Projections (alternate physical layouts for fast paths)
- Sampling & approximations (SAMPLE; quantiles; uniq variants)
- Materialized views (schema‑on‑write; incremental rollups; pitfalls)
- Aggregating/SummingMergeTree (design pre‑agg tables correctly)
- Arrays at scale (ARRAY JOIN vs array functions; when to flatten)
- Low‑latency dashboards (buckets, precompute, refresh patterns)
- Compression tuning in practice (per‑column experiments, benchmarks)
- Caching & memory locality (mark cache, OS cache, query cache)
- Parallelism & threads (max_threads; read concurrency; I/O)
- Memory limits & spills (max_memory_usage; OOM mitigation)
- High‑cardinality handling (distinct strategies; sketches; indexes)
- Secondary/skip indexes (bloom, token, set; when they help)
- Lambda/UDF patterns (vectorized transforms vs row‑by‑row)
- Anti‑patterns (SELECT *; random ORDER BY; over‑wide ORDER BY)
- Distributed queries & sharding (clusters, locality, rebalance)
- Optimization playbook (checklist; query_log metrics; guardrails)
- Immutability & merges (why UPDATE/DELETE is different in OLAP)
- Lightweight deletes/updates (mutation engine; safe patterns)
- ReplacingMergeTree for dedupe (version column; correctness)
- Backfills & rebuilds (CTAS; chunked INSERT SELECT; copy‑on‑write)
- TTL & retention (expire/move by partition; cold storage tiers)
- Schema registry & contracts (versioned schemas; compatibility)
- Migration safety (shadow tables; dual‑write; cutover/rollback)
- Data validation (counts, checksums, sampling; dbt tests)
- Late‑arriving data (upserts, re‑aggregation, idempotency)
- Migration toolkit (automation scripts; playbooks; SLAs)
- Streaming ingest (Kafka; MV routers; backpressure)
- CDC from OLTP (Debezium; transforms; ordering)
- APIs & webhooks (direct inserts; dedupe; rate limits)
- Data lake interop (S3/Parquet; external tables; staging)
- Real‑time dashboards (freshness targets; rollup layers)
- Observability (system tables; query_log; tracing; alerts)
- Cost management (codecs, TTL, merges, part counts, storage)
- Security & RBAC (roles, row policies, masking)
- Backups & disaster recovery (BACKUP/RESTORE; snapshots)
- Capstone & linter pack (agent rules; end‑to‑end review)