Push down InList or hash table references from HashJoinExec depending on the size of the build side #18393

adriangb · 2025-10-31T01:05:17Z

This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171.

A "target state" is tracked in #18393 (this PR).
There is a series of PRs to get us to this target state in smaller more reviewable changes that are still valuable on their own:

Refactor create_hashes to accept array references #18448
Refactor InListExpr to store arrays and support structs #18449 (depends on Refactor create_hashes to accept array references #18448)
(This PR): Refactor state management in HashJoinExec and use CASE expressions for more precise filters #18451

As those are merged I will rebase this PR to keep track of the "remaining work", and we can use this PR to explore big picture ideas or benchmarks of the final state.

…SE structure

adriangb

Leaving my review comments, will post benchmarks afterwards.

Although this PR is large I think there's a clear path to split it up into independent smaller PRs:

Refactor create_hashes to accept references (changes only to hash_utils.rs)
Refactor InListExpr to store arrays and support structs, re-using create_hashes_from_arrays from (1) (changes only to in_list.rs).
Refactor the data structures used to track pushdown data in HashJoinExec (changes only to files in datafusion/physical-plan/src/joins/hash_join/).
Introduce the CASE statement structure into the filter pushdown of HashJoinExec and the repartition hash PhysicalExpr (changes only to files in datafusion/physical-plan/src/joins/hash_join/, adds HashExpr in datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs).
Add hash table pushdown (adds HashTableLookupExpr in datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs), adds HashTable to PushdownStrategy, adds create_membership_predicate, etc.).
Add InListExpr pushdown (adds PushdownStrategy::InList, etc.)

There is also potential to remove the barrier expression by filtering out only from known partitions CASE ... ELSE true i.e. if we don't have information for all partitions only filter out data that we know won't match in our partition and then update the filter to ELSE false in the case where we have information from all partitions. This might be useful for the distributed case, not sure though.

I think we could somehow unify the hashing / inner data structures of the join hash table and the InList expression - they are very similar - to at least eliminate one round of hashing. I wonder if there's a version of an InList expression that avoids building a Vec<ScalarValue> altogether and instead just wraps an ArrayRef + metadata (data types etc) + an optional hash lookup. That would be quite versatile, we could essentially replace the join hash tables with that structure.

adriangb · 2025-10-31T20:16:25Z

datafusion/common/src/hash_utils.rs

 #[cfg(not(feature = "force_hash_collisions"))]
-pub fn create_hashes<'a>(
-    arrays: &[ArrayRef],
+pub fn create_hashes_from_arrays<'a>(
+    arrays: &[&dyn Array],


Recommend this for its own PR.

I think this is a nice refactor for this function, however I decided not to deprecate / replace it to avoid churn. If this were it's own PR I think it would be worth it to just make a new version and deprecate the old one, replacing all references in DataFusion (which I imagine is most users; even if this is pub I think it's mostly pub to be used in other crates within this repo). I did not investigate if this can help avoid clones in any other call sites.

adriangb · 2025-10-31T20:17:37Z

datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs

-    -           RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
-    -             DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb OR a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba ]
+    -           RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+    -             DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ CASE hash_repartition % 4 WHEN 0 THEN a@0 >= aa AND a@0 <= aa AND b@1 >= ba AND b@1 <= ba AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}]) WHEN 2 THEN a@0 >= ab AND a@0 <= ab AND b@1 >= bb AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:ab,c1:bb}]) ELSE false END ]


Note that this automatically excludes empty partitions and defaults to false, which works with inner joins. I'm not sure how we'd structure this for other join types.

adriangb · 2025-10-31T20:19:19Z

datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs

-    -           RepartitionExec: partitioning=Hash([a@0, b@1], 12), input_partitions=1
-    -             DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb ]
+    -           RepartitionExec: partitioning=Hash([a@0, b@1], 4), input_partitions=1
+    -             DataSourceExec: file_groups={1 group: [[test.parquet]]}, projection=[a, b, e], file_type=test, pushdown_supported=true, predicate=DynamicFilter [ a@0 >= aa AND a@0 <= ab AND b@1 >= ba AND b@1 <= bb AND struct(a@0, b@1) IN (SET) ([{c0:aa,c1:ba}, {c0:ab,c1:bb}]) ]


In this case there's only 1 partition with data (because of hash collisions) -> we optimize away the CASE expression. This is relevant because the same thing would happen with a point lookup primary key join.

adriangb · 2025-10-31T20:20:19Z

datafusion/physical-expr/src/expressions/in_list.rs

+/// Specialized Set implementation for StructArray
+struct StructArraySet {
+    array: Arc<StructArray>,
+    hash_set: ArrayHashSet,
+}


I think this is a nice improvement / feature that can easily be it's own PR.

adriangb · 2025-10-31T20:20:41Z

datafusion/physical-expr/src/expressions/in_list.rs

+        let has_nulls = self.array.null_count() != 0;
+
+        // Compute hashes for all rows in the input array
+        let mut input_hashes = vec![0u64; v.len()];


Could we use a thread local to avoid repeated re-allocations here? I imagine that would work quite well.

adriangb · 2025-10-31T20:25:18Z

datafusion/physical-plan/src/joins/hash_join/exec.rs

-    /// Create a new `JoinLeftData` from its parts
-    pub(super) fn new(


Clippy was complaining about too many arguments, a constructor was not necessary anyway

adriangb · 2025-10-31T20:26:10Z

datafusion/physical-plan/src/joins/hash_join/exec.rs

+    let membership = if num_rows == 0 {
+        PushdownStrategy::Empty
+    } else {
+        // If the build side is small enough we can use IN list pushdown.
+        // If it's too big we fall back to pushing down a reference to the hash table.
+        // See `PushdownStrategy` for more details.
+        if let Some(in_list_values) =
+            build_struct_inlist_values(&left_values, max_inlist_size)?
+        {
+            PushdownStrategy::InList(in_list_values)
+        } else {
+            PushdownStrategy::HashTable(Arc::clone(&hash_map))
+        }
+    };


I did some remodeling of the data structures we use to track state. I think the new structure is much better - even if we didn't move forward with the rest of the PR.

adriangb · 2025-10-31T20:26:55Z

datafusion/physical-plan/src/joins/hash_join/inlist_builder.rs

+    // Size check using built-in method
+    // This is not 1:1 with the actual size of ScalarValues, but it is a good approximation
+    // and at this point is basically "free" to compute since we have the arrays already.
+    let estimated_size = join_key_arrays
+        .iter()
+        .map(|arr| arr.get_array_memory_size())
+        .sum::<usize>();
+
+    if estimated_size > max_size_bytes {
+        return Ok(None);
+    }


Note: this is where we check the size to set the size limit.

adriangb · 2025-10-31T20:27:28Z

datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs

@@ -0,0 +1,292 @@
+// Licensed to the Apache Software Foundation (ASF) under one


This is the part used to handle the cases where we push down the entire hash table.

adriangb · 2025-10-31T20:29:24Z

datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs

+/// Build-side data reported by a single partition
+pub(crate) enum PartitionBuildData {
+    Partitioned {
+        partition_id: usize,
+        pushdown: PushdownStrategy,
+        bounds: PartitionBounds,
+    },
+    CollectLeft {
+        pushdown: PushdownStrategy,
+        bounds: PartitionBounds,
+    },
+}
+
+/// Per-partition accumulated data (Partitioned mode)
+#[derive(Clone)]
+struct PartitionData {
+    bounds: PartitionBounds,
+    pushdown: PushdownStrategy,
+}
+
+/// Build-side data organized by partition mode
+enum AccumulatedBuildData {
+    Partitioned {
+        partitions: Vec<Option<PartitionData>>,
+    },
+    CollectLeft {
+        data: Option<PartitionData>,
+    },
+}


This is the refactoring of the data structures we store to track state. It's now much cleaner, e.g. avoids comparing partitions by id.

adriangb · 2025-10-31T20:57:20Z

Here's one interesting benchmark:

COPY (SELECT uuid() as k, uuid() as v FROM generate_series(1, 5) t(i))
TO 'small_table_uuids.parquet'
OPTIONS (
  'MAX_ROW_GROUP_SIZE' '50000',
  'BLOOM_FILTER_ENABLED::k' 'true'
);
COPY (SELECT random()::text as v1, random()::text as v2, uuid() as k FROM generate_series(1, 100000000) t(i))
TO 'large_table_uuids.parquet'
OPTIONS (
  'MAX_ROW_GROUP_SIZE' '50000',
  'BLOOM_FILTER_ENABLED::k' 'true'
);

CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION 'small_table_uuids.parquet';
CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION 'large_table_uuids.parquet';

SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
SET datafusion.runtime.metadata_cache_limit = 0;

-- Join the two tables, with a filter on small_table
SELECT *
FROM small_table s JOIN large_table l ON s.k = l.k;

This is similar to the query we benchmarked in our recent blog post but using UUIDs instead of ints so that the min/max stats pushdown doesn't really help (hence why main is ~ same as DuckDB instead of faster as in the blog post).

Full benchmark script

#!/usr/bin/env python3
"""
Benchmark script comparing DataFusion with/without inlist pushdown vs DuckDB.

Groups:
1. branch (no inlist): hash_join_inlist_pushdown_max_size = 0
2. branch (w/ inlist): hash_join_inlist_pushdown_max_size = default (999999)
3. main: using datafusion-cli-main
4. duckdb: using duckdb CLI
"""

import subprocess
import tempfile
import time
import os
import sys
from pathlib import Path

# Configuration
DATAFUSION_CLI = "./target/release/datafusion-cli"
DATAFUSION_CLI_MAIN = "./datafusion-cli-main"
DUCKDB_CLI = "duckdb"
NUM_RUNS = 5  # Number of times to run each benchmark

# Data generation settings
SMALL_TABLE_SIZE = 5
LARGE_TABLE_SIZE = 100_000_000
SMALL_TABLE_FILE = "small_table_uuids.parquet"
LARGE_TABLE_FILE = "large_table_uuids.parquet"


def run_command(cmd, input_sql=None, description=""):
    """Run a command and measure execution time."""
    print(f"  Running: {description}...", end=" ", flush=True)

    start = time.time()
    try:
        if input_sql:
            result = subprocess.run(
                cmd,
                input=input_sql,
                capture_output=True,
                text=True,
                timeout=600  # 10 minute timeout
            )
        else:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=600
            )

        elapsed = time.time() - start

        if result.returncode != 0:
            print(f"FAILED (exit code {result.returncode})")
            print(f"    stderr: {result.stderr}")
            return None

        print(f"{elapsed:.3f}s")
        return elapsed
    except subprocess.TimeoutExpired:
        print("TIMEOUT")
        return None
    except Exception as e:
        print(f"ERROR: {e}")
        return None


def create_data():
    """Create test data files if they don't exist."""
    if os.path.exists(SMALL_TABLE_FILE) and os.path.exists(LARGE_TABLE_FILE):
        print(f"Data files already exist, skipping creation.")
        return True

    print(f"Creating test data...")

    data_gen_sql = f"""
COPY (SELECT uuid() as k, uuid() as v FROM generate_series(1, {SMALL_TABLE_SIZE}) t(i))
TO '{SMALL_TABLE_FILE}'
OPTIONS (
  'MAX_ROW_GROUP_SIZE' '50000',
  'BLOOM_FILTER_ENABLED::k' 'true'
);

COPY (SELECT random()::text as v1, random()::text as v2, uuid() as k FROM generate_series(1, {LARGE_TABLE_SIZE}) t(i))
TO '{LARGE_TABLE_FILE}'
OPTIONS (
  'MAX_ROW_GROUP_SIZE' '50000',
  'BLOOM_FILTER_ENABLED::k' 'true'
);
"""

    result = subprocess.run(
        [DATAFUSION_CLI],
        input=data_gen_sql,
        capture_output=True,
        text=True
    )

    if result.returncode != 0:
        print(f"Failed to create data: {result.stderr}")
        return False

    print(f"Data created successfully.")
    return True


def create_datafusion_sql(inlist_size):
    """Create SQL for DataFusion with specified inlist pushdown size."""
    return f"""
CREATE EXTERNAL TABLE small_table STORED AS PARQUET LOCATION '{SMALL_TABLE_FILE}';
CREATE EXTERNAL TABLE large_table STORED AS PARQUET LOCATION '{LARGE_TABLE_FILE}';

SET datafusion.execution.parquet.pushdown_filters = true;
SET datafusion.execution.parquet.reorder_filters = true;
SET datafusion.optimizer.hash_join_inlist_pushdown_max_size = {inlist_size};
SET datafusion.runtime.metadata_cache_limit = '0M';

SELECT *
FROM small_table s JOIN large_table l ON s.k = l.k;
"""


def create_duckdb_sql():
    """Create SQL for DuckDB."""
    return f"""
SELECT *
FROM '{SMALL_TABLE_FILE}' s JOIN '{LARGE_TABLE_FILE}' l ON s.k = l.k;
"""


def run_benchmark_group(name, cmd, sql_content, num_runs=NUM_RUNS):
    """Run a benchmark group multiple times and collect results."""
    print(f"\n{name}:")
    times = []

    for i in range(num_runs):
        elapsed = run_command(cmd, input_sql=sql_content, description=f"Run {i+1}/{num_runs}")
        if elapsed is not None:
            times.append(elapsed)

    if times:
        avg = sum(times) / len(times)
        min_time = min(times)
        max_time = max(times)
        print(f"  Results: min={min_time:.3f}s, avg={avg:.3f}s, max={max_time:.3f}s")
        return times
    else:
        print(f"  No successful runs")
        return []


def main():
    print("=" * 60)
    print("DataFusion Inlist Pushdown Benchmark")
    print("=" * 60)

    # Verify executables exist
    if not os.path.exists(DATAFUSION_CLI):
        print(f"Error: {DATAFUSION_CLI} not found")
        sys.exit(1)

    if not os.path.exists(DATAFUSION_CLI_MAIN):
        print(f"Error: {DATAFUSION_CLI_MAIN} not found")
        sys.exit(1)

    try:
        subprocess.run([DUCKDB_CLI, "--version"], capture_output=True, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print(f"Error: duckdb CLI not found or not working")
        sys.exit(1)

    # Create data
    if not create_data():
        sys.exit(1)

    # Run benchmarks
    results = {}

    # 1. Branch without inlist pushdown
    results["branch_no_inlist"] = run_benchmark_group(
        "Branch (no inlist, size=0)",
        [DATAFUSION_CLI],
        create_datafusion_sql(0)
    )

    # 2. Branch with inlist pushdown
    results["branch_with_inlist"] = run_benchmark_group(
        "Branch (w/ inlist, size=999999)",
        [DATAFUSION_CLI],
        create_datafusion_sql(999999)
    )

    # 3. Main branch
    results["main"] = run_benchmark_group(
        "Main branch",
        [DATAFUSION_CLI_MAIN],
        create_datafusion_sql(999999)
    )

    # 4. DuckDB
    results["duckdb"] = run_benchmark_group(
        "DuckDB",
        [DUCKDB_CLI],
        create_duckdb_sql()
    )

    # Summary
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)

    for name, times in results.items():
        if times:
            avg = sum(times) / len(times)
            print(f"{name:25s}: {avg:.3f}s avg over {len(times)} runs")
        else:
            print(f"{name:25s}: No successful runs")

    print("\nAll times (seconds):")
    for name, times in results.items():
        if times:
            times_str = ", ".join(f"{t:.3f}" for t in times)
            print(f"  {name}: [{times_str}]")


if __name__ == "__main__":
    main()

============================================================
SUMMARY
============================================================
branch_no_inlist         : 1.010s avg over 5 runs
branch_with_inlist       : 0.290s avg over 5 runs
main                     : 1.696s avg over 5 runs
duckdb                   : 1.634s avg over 5 runs

adriangb · 2025-10-31T21:58:24Z

❯ ./bench.sh compare main use-in-list         
Comparing main and use-in-list
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ use-in-list ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  516.69 ms │   503.63 ms │     no change │
│ QQuery 2     │  101.85 ms │   126.92 ms │  1.25x slower │
│ QQuery 3     │  266.65 ms │   262.82 ms │     no change │
│ QQuery 4     │  220.78 ms │   208.05 ms │ +1.06x faster │
│ QQuery 5     │  393.63 ms │   379.17 ms │     no change │
│ QQuery 6     │  145.39 ms │   142.40 ms │     no change │
│ QQuery 7     │  542.04 ms │   557.89 ms │     no change │
│ QQuery 8     │  437.90 ms │   415.90 ms │ +1.05x faster │
│ QQuery 9     │  647.38 ms │   611.35 ms │ +1.06x faster │
│ QQuery 10    │  355.89 ms │   342.20 ms │     no change │
│ QQuery 11    │   78.78 ms │    79.20 ms │     no change │
│ QQuery 12    │  217.50 ms │   195.61 ms │ +1.11x faster │
│ QQuery 13    │  379.35 ms │   348.38 ms │ +1.09x faster │
│ QQuery 14    │  194.42 ms │   178.55 ms │ +1.09x faster │
│ QQuery 15    │  274.45 ms │   259.19 ms │ +1.06x faster │
│ QQuery 16    │   67.99 ms │    59.72 ms │ +1.14x faster │
│ QQuery 17    │  708.83 ms │   630.79 ms │ +1.12x faster │
│ QQuery 18    │ 1002.48 ms │  1055.66 ms │  1.05x slower │
│ QQuery 19    │  319.69 ms │   290.27 ms │ +1.10x faster │
│ QQuery 20    │  253.56 ms │   246.25 ms │     no change │
│ QQuery 21    │  760.56 ms │   697.09 ms │ +1.09x faster │
│ QQuery 22    │   94.52 ms │    78.84 ms │ +1.20x faster │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)          │ 7980.32ms │
│ Total Time (use-in-list)   │ 7669.89ms │
│ Average Time (main)        │  362.74ms │
│ Average Time (use-in-list) │  348.63ms │
│ Queries Faster             │        12 │
│ Queries Slower             │         2 │
│ Queries with No Change     │         8 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

xudong963 · 2025-11-01T02:31:45Z

duckdb : 1.634s avg over 5 runs

Surprised that duckdb takes so long, I was thinking it also can push down a inlist for hash join

adriangb · 2025-11-01T05:51:44Z

duckdb : 1.634s avg over 5 runs

Surprised that duckdb takes so long, I was thinking it also can push down a inlist for hash join

As far as I know they only do min/max stats: https://duckdb.org/2024/09/09/announcing-duckdb-110#dynamic-filter-pushdown-from-joins

adriangb · 2025-11-02T00:45:55Z

Here are some new numbers after a couple more optimizations to InListExpr:

============================================================
SUMMARY
============================================================
branch_no_inlist         : 0.670s avg over 5 runs
branch_with_inlist       : 0.163s avg over 5 runs
main                     : 1.499s avg over 5 runs
duckdb                   : 1.386s avg over 5 runs

I'd like to clarify why the InListExpr makes such a difference:

It allows some statistics pruning. The pruning expressions know how to handle InList and explode it into a chain of col = 1 OR col = 2 ... when there are less than 20 items in the list. This may have some impact but similar to the current min/max pushdown depends on properties of the data.
It allows bloom filter pruning. If the above case is hit (less than 20 items) then each one of the OR cases will be checked against a bloom filter on the column if one exists.

I think it's mainly that latter optimization that provides the win.

And TPCH:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ use-in-list ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  516.69 ms │   478.97 ms │ +1.08x faster │
│ QQuery 2     │  101.85 ms │    88.72 ms │ +1.15x faster │
│ QQuery 3     │  266.65 ms │   247.67 ms │ +1.08x faster │
│ QQuery 4     │  220.78 ms │   201.75 ms │ +1.09x faster │
│ QQuery 5     │  393.63 ms │   348.56 ms │ +1.13x faster │
│ QQuery 6     │  145.39 ms │   131.40 ms │ +1.11x faster │
│ QQuery 7     │  542.04 ms │   499.63 ms │ +1.08x faster │
│ QQuery 8     │  437.90 ms │   389.98 ms │ +1.12x faster │
│ QQuery 9     │  647.38 ms │   593.22 ms │ +1.09x faster │
│ QQuery 10    │  355.89 ms │   341.61 ms │     no change │
│ QQuery 11    │   78.78 ms │    70.63 ms │ +1.12x faster │
│ QQuery 12    │  217.50 ms │   195.95 ms │ +1.11x faster │
│ QQuery 13    │  379.35 ms │   338.09 ms │ +1.12x faster │
│ QQuery 14    │  194.42 ms │   176.03 ms │ +1.10x faster │
│ QQuery 15    │  274.45 ms │   246.27 ms │ +1.11x faster │
│ QQuery 16    │   67.99 ms │    62.21 ms │ +1.09x faster │
│ QQuery 17    │  708.83 ms │   648.45 ms │ +1.09x faster │
│ QQuery 18    │ 1002.48 ms │  1118.42 ms │  1.12x slower │
│ QQuery 19    │  319.69 ms │   268.59 ms │ +1.19x faster │
│ QQuery 20    │  253.56 ms │   241.12 ms │     no change │
│ QQuery 21    │  760.56 ms │   719.05 ms │ +1.06x faster │
│ QQuery 22    │   94.52 ms │    74.86 ms │ +1.26x faster │
└──────────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)          │ 7980.32ms │
│ Total Time (use-in-list)   │ 7481.19ms │
│ Average Time (main)        │  362.74ms │
│ Average Time (use-in-list) │  340.05ms │
│ Queries Faster             │        19 │
│ Queries Slower             │         1 │
│ Queries with No Change     │         2 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

- Fix error handling in shared_bounds.rs: properly propagate errors instead of silently swallowing with .ok() - Improve hash collision handling in InListExpr by sampling array values - Run cargo fmt to fix formatting issues Addresses CI feedback on PR apache#18438. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

adriangb · 2025-11-02T07:14:25Z

Marking as draft again. Looking at the changes while they're mostly in the right spirit there's a lot of details that need more manual care. I'll do that before marking as ready for review.

adriangb · 2025-11-02T20:50:08Z

I've done some cleanup, the first three PRs are ready for review:

That will be most of the changes, the only two follwup PRs will be to push down the hash table references and InListExpr

adriangb added 9 commits October 30, 2025 16:44

Push down hash table references from HashJoinExec into scans using CA…

a35a342

…SE structure

fix lints

be006e8

update snapshot

7e28cc4

implement for CollectLeft

3332644

use inlist expressions for build sides <128MB

018fbfc

wip on InList pushdown

6bbba43

fix struct building

0cbf645

omit empty partitions, update slts

dbaa756

fmt

488c64b

adriangb mentioned this pull request Oct 31, 2025

Push down hash table references from HashJoinExec into scans using CASE structure #18306

Draft

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate physical-plan Changes to the physical-plan crate labels Oct 31, 2025

adriangb mentioned this pull request Oct 31, 2025

Push down entire hash table from HashJoinExec into scans #17171

Open

adriangb added 5 commits October 30, 2025 20:06

update snaps

b727111

lots of cleanup/refactor

2a9033a

working but InList sets don't support Struct arrays

83a7d42

add optimizations

6908152

avoid cloning arrays, tweak default value

feb9c40

github-actions bot added the physical-expr Changes to the physical-expr crates label Oct 31, 2025

adriangb added 2 commits October 31, 2025 15:13

lints, typos, formatting

3aa305e

fix test

8d4cdfd

github-actions bot added the documentation Improvements or additions to documentation label Oct 31, 2025

adriangb commented Oct 31, 2025

View reviewed changes

some nits

932438d

adriangb added 3 commits October 31, 2025 16:38

fix

5654512

fmt

bf5a0b6

fix some more tests

ba073f7

fix dict handling

c309d56

adriangb marked this pull request as ready for review October 31, 2025 21:59

update tpch slt

e32c080

use array storage for InListExpr

31d7cc3

github-actions bot added the proto Related to proto crate label Nov 2, 2025

adriangb marked this pull request as draft November 2, 2025 07:13

adriangb added 8 commits November 2, 2025 01:53

clean up hash_utils.rs

b9c595e

attempt to unify in_list.rs (failing tests)

6a90a9c

in_list.rs test passing

533d650

fmt

e608c58

Improve InListExpr::list docstring

71faf93

fix null handling in InListExpr

5a01cdc

fix PartialEq for InListExpr

f34d341

fix

d198ab3

2010YOUY01 mentioned this pull request Nov 3, 2025

Refactor create_hashes to accept array references #18448

Open

This was referenced Nov 3, 2025

Refactor InListExpr to store arrays and support structs #18449

Open

Refactor state management in HashJoinExec and use CASE expressions for more precise filters #18451

Open

		/// Create a new `JoinLeftData` from its parts
		pub(super) fn new(

		@@ -0,0 +1,292 @@
		// Licensed to the Apache Software Foundation (ASF) under one

Push down InList or hash table references from HashJoinExec depending on the size of the build side #18393

Are you sure you want to change the base?

Push down InList or hash table references from HashJoinExec depending on the size of the build side #18393

Uh oh!

Conversation

adriangb commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adriangb commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb commented Oct 31, 2025

Uh oh!

xudong963 commented Nov 1, 2025

Uh oh!

adriangb commented Nov 1, 2025

Uh oh!

adriangb commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adriangb commented Nov 2, 2025

Uh oh!

adriangb commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adriangb commented Oct 31, 2025 •

edited

Loading

adriangb left a comment •

edited

Loading

adriangb commented Oct 31, 2025 •

edited

Loading

adriangb commented Nov 2, 2025 •

edited

Loading

adriangb commented Nov 2, 2025 •

edited

Loading