Lower latency associative scan option #10599

oliverdutton · 2022-05-06T08:28:51Z

In one of my problems the implementation was bottlenecked by a cumulative matmul. JAX has a handy implementation of a work-efficient associative scan for this, lax.associative_scan, which reduces the procedure from N steps to 2 log_2(N) - 2 steps. There is also a work-inefficient implementation, shown below, that reduces this further to log_2(N) steps and is faster for small problem sizes where the GPU is not saturated.

Would anyone else be interested in having a work-inefficient option in lax.associative_scan? If so, I can put together a pull request.

See https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda, http://www.cs.cmu.edu/~guyb/papers/Ble93.pdf, and https://en.wikipedia.org/wiki/Prefix_sum for background.

import jax
from functools import partial
from jax import numpy as jnp, lax, random, jit
slice_in_dim = jax.lax.slice_in_dim

def sequential(operator, x):
    def _body(prev_cumulative, elem):
        cumulative = operator(prev_cumulative, elem)
        return cumulative, cumulative
    _, y = jax.lax.scan(
        f=_body,
        init=x[0],
        xs=x[1:]
    )
    return jnp.concatenate([x[:1], y])

def work_inefficient_all_prefix_sum(operator, x):
    with jax.ensure_compile_time_eval():
        # Hillis, W. D. and Steele, G. L. (1986). Data parallel algorithms. Communications of the ACM, 29(12), 1170–1183
        # log_2{n} steps
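        n = x.shape[0]
        if n < 3:
            # The loop below never runs for n < 3, which would leave prev_l, r_ and j undefined.
            return x if n < 2 else jnp.concatenate([x[:1], operator(x[:1], x[1:])])
        j_max = (n - 1).bit_length()  # ceil(log2(n)) as a static Python int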

        l = slice_in_dim(x, 0   ,n-2**0)
        r = slice_in_dim(x, 2**0,n)
        for j in range(0,j_max-1):
            prev_l=l
            n = r.shape[0]
            r_ = operator(l, r)
            l_max = 2**j
            r_max = 2**(j+1)
            if r_max > n:
                l_max = l_max-r_max+n
                l = slice_in_dim(l, 0, l_max)
            else:
                l = jnp.concatenate([slice_in_dim(l, 0, 2**j), slice_in_dim(r_, 0, n-2**(j+1))])
            r = slice_in_dim(r_, 2**j, n)
        n = r.shape[0]
        final_r_ = operator(slice_in_dim(l, 0, n), r)
        return jnp.concatenate([slice_in_dim(prev_l, 0, 2**j), slice_in_dim(r_, 0, 2**j), final_r_])

key = random.PRNGKey(42)
x = random.uniform(key, (500,3,3))
x  /= jnp.linalg.norm(x, axis=(-1,-2), keepdims=True) # not real normalisation, but means 
operator = lambda x,y: jnp.matmul(y,x)

functions = {
    'sequential': sequential,
    'work-efficient': lax.associative_scan,
    'work-inefficient': work_inefficient_all_prefix_sum
}
for name, f in functions.items():
    print(name)
    f = jit(partial(f, operator))
    inputs = (x,)
    _ = jax.block_until_ready(f(*inputs)) # Compile once
    timings = %timeit -n 10 -r 10 -o _ = jax.block_until_ready(f(*inputs))


## output
sequential
8.59 ms ± 837 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
work-efficient
214 µs ± 2.57 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
work-inefficient
141 µs ± 2.48 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
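
For readers who want to see the doubling idea in isolation, here is a minimal pure-Python sketch of a Hillis-Steele style inclusive scan. It is not the JAX code above: `hillis_steele_scan` is a hypothetical helper that works on plain lists, and it counts one "step" per doubling pass on the assumption that all combinations within a pass would run in parallel on real hardware.

```python
from itertools import accumulate
import operator


def hillis_steele_scan(op, xs):
    """Inclusive scan by recursive doubling.

    Returns (prefixes, steps), where steps is the number of doubling passes.
    op must be associative; it does not need to be commutative.
    """
    ys = list(xs)
    n = len(ys)
    offset, steps = 1, 0
    while offset < n:
        # Each element at index i >= offset is combined with the element
        # `offset` positions to its left. The combinations in one pass are
        # independent of each other, so they could all run at the same time.
        ys = [ys[i] if i < offset else op(ys[i - offset], ys[i]) for i in range(n)]
        offset *= 2
        steps += 1
    return ys, steps


xs = list(range(1, 501))
prefixes, steps = hillis_steele_scan(operator.add, xs)
assert prefixes == list(accumulate(xs, operator.add))
print(steps)  # 9, i.e. ceil(log2(500))
```

Each pass applies the operator to almost every element, so the total work grows like N log N rather than N. That is the trade-off behind the "work-inefficient" label: fewer dependent steps in exchange for more operator applications.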

carlosgmartin · 2025-10-29T19:17:13Z

Looks like a lot of people would be interested (myself included).

Are you still interested in creating such a PR?

oliverdutton (Author) · 2025-10-31T09:14:27Z

Cool, I'll put together a PR in the next couple of days.

I will make the code clearer and generalise it to accept trees of arrays. I'm pretty sure that the if r_max > n branch will never get executed.
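
Whether that branch is reachable can be checked without running the scan, by tracking only the array lengths. The sketch below assumes that r has length N - 2**j at the start of loop iteration j, which follows from the slicing in the snippet in the first post. `branch_hits` is a hypothetical helper written for this check, not part of the original code.

```python
import math


def branch_hits(N):
    """Return the loop indices j at which `r_max > n` holds for input length N."""
    j_max = math.ceil(math.log2(N))
    hits = []
    for j in range(0, j_max - 1):
        n = N - 2**j        # length of r at the start of iteration j
        r_max = 2**(j + 1)
        if r_max > n:
            hits.append(j)
    return hits


print(branch_hits(5), branch_hits(500))  # [1] []
```

Under that assumption the branch is not taken for N = 500 as in the benchmark, but it is taken for some lengths such as N = 5 or N = 9, so it may be worth keeping a test for those sizes before removing it.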
