Reduce latency by increasing the batch size for the vision transformer #603

xiaopoc · 2025-08-07T23:51:22Z

Description of the change:
Currently, each image in obs.images in pi0.py is processed sequentially by the vision transformer in embed_prefix(), which leads to redundant kernel launches and increased runtime.

Motivation:
Batching all images together and encoding them in one forward pass reduces kernel launch overhead and enables better fusion. This can lower inference latency (about 5 ms faster on an RTX 4090).

Proposed Change:

Stack images across cameras along a new axis, e.g., stacked_images = jnp.stack(list(obs.images.values()), axis=1)
Flatten the camera axis into the batch axis and encode all images in one forward pass via self.PaliGemma.img(...)
Update the token aggregation and attention masks accordingly
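
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `toy_encoder` is a hypothetical stand-in for `self.PaliGemma.img(...)`, the helper names, camera names, and shapes are assumptions, and the per-camera masks are assumed to be boolean arrays of shape [B]. The NumPy fallback exists only so the sketch runs where JAX is not installed.

```python
try:
    import jax.numpy as jnp
except ImportError:
    # Fallback so the sketch runs without JAX; only
    # stack/reshape/repeat/concatenate are used below.
    import numpy as jnp


def toy_encoder(images):
    """Hypothetical stand-in for the vision transformer.

    Maps [M, H, W, C] images to [M, T, D] tokens with T = H * W and D = C.
    """
    m, h, w, c = images.shape
    return images.reshape(m, h * w, c)


def embed_sequential(images, encode):
    """Baseline: one encoder call per camera, concatenated along the token axis."""
    tokens = [encode(img) for img in images.values()]  # each [B, T, D]
    return jnp.concatenate(tokens, axis=1)  # [B, N*T, D]


def embed_batched(images, encode):
    """Proposed: a single encoder call over all cameras."""
    stacked = jnp.stack(list(images.values()), axis=1)  # [B, N, H, W, C]
    b, n = stacked.shape[:2]
    flat = stacked.reshape(b * n, *stacked.shape[2:])  # [B*N, H, W, C]
    tokens = encode(flat)  # [B*N, T, D]
    # Un-flatten: each batch row holds camera 0's tokens, then camera 1's, ...
    return tokens.reshape(b, n * tokens.shape[1], tokens.shape[2])  # [B, N*T, D]


def expand_masks(image_masks, tokens_per_image):
    """Repeat each per-camera mask once per token, in the same camera order."""
    stacked = jnp.stack(list(image_masks.values()), axis=1)  # [B, N]
    return jnp.repeat(stacked, tokens_per_image, axis=1)  # [B, N*T]


# Hypothetical inputs: batch of 2, three cameras, 4x4 RGB images.
base = jnp.arange(2 * 4 * 4 * 3, dtype=jnp.float32).reshape(2, 4, 4, 3)
images = {"cam_a": base, "cam_b": base + 100.0, "cam_c": base + 200.0}
masks = {
    "cam_a": jnp.array([True, True]),
    "cam_b": jnp.array([True, False]),
    "cam_c": jnp.array([False, True]),
}

seq = embed_sequential(images, toy_encoder)
bat = embed_batched(images, toy_encoder)
token_mask = expand_masks(masks, tokens_per_image=seq.shape[1] // len(images))
```

Stacking on axis=1 keeps the batch axis outermost, so the final reshape reproduces the token order of the per-camera loop and downstream attention masks can stay aligned with it. The sketch assumes every camera image has the same shape, which stacking requires.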

Reduce the latency by increase the batch size for vision transformer

5f37423

xiaopoc requested review from kvablack and uzhilinsky as code owners August 7, 2025 23:51
