Create cubecl-scan crate #863

Friedrich-S · 2025-09-06T13:04:42Z

This PR adds a new cubecl-scan crate to implement the associative scan (inclusive/exclusive) operation as an optimized GPU primitive.

Rough Overview

The goal is to implement something very much like JAX's associative_scan operation: the user provides an arbitrary associative operator, and the implementation computes the scan with that operator across a tensor.
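To make the intended semantics concrete, here is a minimal sequential reference in plain Rust. It is only a sketch of what an inclusive and an exclusive scan compute for a user-supplied operator; the function names `inclusive_scan` and `exclusive_scan` are hypothetical and are not the API of this crate.

```rust
/// Inclusive scan: output[i] = input[0] op input[1] op ... op input[i].
/// `op` is assumed to be associative; it does not need to be commutative.
fn inclusive_scan<T: Clone>(input: &[T], op: impl Fn(&T, &T) -> T) -> Vec<T> {
    let mut output: Vec<T> = Vec::with_capacity(input.len());
    for x in input {
        let next = match output.last() {
            Some(prev) => op(prev, x),
            None => x.clone(),
        };
        output.push(next);
    }
    output
}

/// Exclusive scan: output[i] = identity op input[0] op ... op input[i - 1].
/// `identity` is assumed to be the neutral element of `op`.
fn exclusive_scan<T: Clone>(input: &[T], identity: T, op: impl Fn(&T, &T) -> T) -> Vec<T> {
    let mut output = Vec::with_capacity(input.len());
    let mut running = identity;
    for x in input {
        output.push(running.clone());
        running = op(&running, x);
    }
    output
}
```

With addition, the input `[1, 2, 3, 4]` gives `[1, 3, 6, 10]` for the inclusive scan and `[0, 1, 3, 6]` for the exclusive scan with identity `0`. String concatenation is an example of an operator that is associative but not commutative, which is why the operand order is preserved.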

Current Status

So far, I have implemented a partially working version of the decoupled lookback algorithm. It lives in src/kernels/decoupled_lookback.rs and currently works in cubecl-wgsl and cubecl-cpu. Unfortunately, in cubecl-cuda it hangs indefinitely; I suspect this is caused by incorrect memory ordering of the atomic stores to the aggregates and flags.
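For readers unfamiliar with the algorithm, the following is a CPU-only sketch of the decoupled lookback idea using standard Rust threads and atomics. It is not the kernel in this PR: one OS thread stands in for each tile, the operator is fixed to wrapping `u32` addition, and the flag and value are packed into a single `AtomicU64` so that both are published by one release store. All names here are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const FLAG_INVALID: u64 = 0; // tile has published nothing yet
const FLAG_AGGREGATE: u64 = 1; // tile-local aggregate is available
const FLAG_PREFIX: u64 = 2; // inclusive prefix up to this tile is available

fn pack(flag: u64, value: u32) -> u64 {
    (flag << 32) | value as u64
}

fn unpack(word: u64) -> (u64, u32) {
    (word >> 32, word as u32)
}

/// Inclusive prefix sum; `tile_size` must be greater than zero.
fn lookback_inclusive_sum(input: &[u32], tile_size: usize) -> Vec<u32> {
    let mut output = vec![0u32; input.len()];
    let num_tiles = (input.len() + tile_size - 1) / tile_size;
    let status: Vec<AtomicU64> = (0..num_tiles)
        .map(|_| AtomicU64::new(pack(FLAG_INVALID, 0)))
        .collect();
    let status_ref = &status;

    thread::scope(|s| {
        let tiles = input.chunks(tile_size).zip(output.chunks_mut(tile_size));
        for (tile, (inp, out)) in tiles.enumerate() {
            s.spawn(move || {
                // 1. Reduce the tile locally.
                let aggregate = inp.iter().fold(0u32, |a, &x| a.wrapping_add(x));

                // 2. Publish it. Tile 0 has no predecessors, so its
                //    aggregate already is its inclusive prefix.
                let flag = if tile == 0 { FLAG_PREFIX } else { FLAG_AGGREGATE };
                status_ref[tile].store(pack(flag, aggregate), Ordering::Release);

                // 3. Look back over predecessors until one exposes a full prefix.
                let mut exclusive = 0u32;
                let mut pred = tile;
                while pred > 0 {
                    pred -= 1;
                    let (flag, value) = loop {
                        let (flag, value) = unpack(status_ref[pred].load(Ordering::Acquire));
                        if flag != FLAG_INVALID {
                            break (flag, value);
                        }
                        std::hint::spin_loop(); // relies on forward progress
                    };
                    exclusive = value.wrapping_add(exclusive);
                    if flag == FLAG_PREFIX {
                        break;
                    }
                }

                // 4. Publish the inclusive prefix so later tiles can stop here.
                let inclusive = exclusive.wrapping_add(aggregate);
                status_ref[tile].store(pack(FLAG_PREFIX, inclusive), Ordering::Release);

                // 5. Scan the tile, seeded with the exclusive prefix.
                let mut running = exclusive;
                for (o, &x) in out.iter_mut().zip(inp) {
                    running = running.wrapping_add(x);
                    *o = running;
                }
            });
        }
    });
    output
}
```

The spin loop in step 3 terminates only if the predecessor tile eventually runs, which is the forward-progress assumption. Packing flag and value into one word avoids having to order two separate stores; when they are stored separately, the value must become visible before the flag that announces it.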

This has sat around on my local disk for too long, and I recently ran into a case in a personal project where I would like to use burn but need a fast scan implementation. I am sharing this as a draft PR in the hope that others will take a look at it as well.

I also plan to implement a naive multi-pass version of the scan for targets that lack forward-progress guarantees (already questionable for Vulkan) or the necessary atomic operations.
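One possible shape for such a fallback is the classic Hillis-Steele scheme, sketched below in plain Rust. This is an assumption about which multi-pass algorithm might be used, not a description of planned code, and the function name is hypothetical. Each iteration of the outer loop corresponds to one kernel launch; a pass reads only from the previous buffer, so no atomics or inter-workgroup waiting are needed.

```rust
/// Multi-pass inclusive scan (Hillis-Steele). `op` is assumed to be associative.
fn multipass_inclusive_scan<T: Copy>(input: &[T], op: impl Fn(T, T) -> T) -> Vec<T> {
    let mut src = input.to_vec();
    let mut dst = src.clone();
    let mut stride = 1;
    while stride < src.len() {
        // One "pass": on a GPU, every index i would be handled by its own thread.
        for i in 0..src.len() {
            dst[i] = if i >= stride {
                // The earlier element stays on the left, so the operator
                // does not have to be commutative.
                op(src[i - stride], src[i])
            } else {
                src[i]
            };
        }
        // Double buffering: the output of this pass is the input of the next.
        std::mem::swap(&mut src, &mut dst);
        stride *= 2;
    }
    src
}
```

This takes about log2(n) passes and O(n log n) operator applications in total, which is the price for not depending on forward progress between workgroups.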

Notes

It was hard to get the memory ordering roughly right, because I could not specify atomic ordering modes, as is possible with standard Rust atomics, or insert memory/compiler barriers. As a hacky workaround, the aggregates are currently stored as atomic values. However, this makes it impossible to use aggregate types that do not work with atomics. The algorithm itself does not need the aggregates to be atomic, and I would like to remove that restriction. For example, the Linear Recurrent Unit proposed by DeepMind requires the scan to be performed over vectors, which the current implementation would not support.
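To illustrate why non-scalar aggregates matter, here is a sketch of the kind of operator a diagonal linear recurrence such as h_t = a_t * h_(t-1) + b_t (elementwise, with h_0 = 0) needs. Each scan element is a pair of vectors, which does not fit into a single atomic value. The `Affine` type and both functions are hypothetical, and the real-valued `f32` vectors are a simplification for illustration.

```rust
/// One step of the recurrence, viewed as the affine map h -> a * h + b.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Affine<const N: usize> {
    a: [f32; N],
    b: [f32; N],
}

/// Composition "apply `left` first, then `right`". This is associative
/// but not commutative.
fn combine<const N: usize>(left: Affine<N>, right: Affine<N>) -> Affine<N> {
    let mut out = right;
    for i in 0..N {
        out.a[i] = right.a[i] * left.a[i];
        out.b[i] = right.a[i] * left.b[i] + right.b[i];
    }
    out
}

/// Sequential inclusive scan with `combine`. Because h_0 = 0, the `b`
/// component of each prefix is the hidden state h_t.
fn recurrence_via_scan<const N: usize>(steps: &[Affine<N>]) -> Vec<[f32; N]> {
    let mut acc: Option<Affine<N>> = None;
    steps
        .iter()
        .map(|&step| {
            let next = match acc {
                Some(prev) => combine(prev, step),
                None => step,
            };
            acc = Some(next);
            next.b
        })
        .collect()
}
```

Since `combine` is associative, the same prefixes could be computed by any parallel scan, provided the scan can carry an aggregate type like `Affine<N>`.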

Related Issues

tracel-ai/burn#3806

initial decoupled lookback scan implementation

df2aead

laggui mentioned this pull request Oct 1, 2025

Add tensor.cumsum(dim) operation for cumulative sum computation tracel-ai/burn#3806 (Merged)

This was referenced Oct 2, 2025

Add cummin operation for tensors across backends tracel-ai/burn#3821 (Closed)

Add cummax operation for tensors across backends tracel-ai/burn#3820 (Closed)

Add cumulative operations: cumprod, cummin, and cummax tracel-ai/burn#3819 (Merged)
