
Conversation

@sciascid (Contributor)

Changes used to evaluate and improve batching at the Raft level.
These are proof-of-concepts: not necessarily complete nor sufficiently tested,
and intended for performance evaluation only!

@sciascid (Contributor, Author) commented Sep 25, 2025

Setup:

3 node cluster, all running on a laptop, with synchronous writes (sync_interval: always).
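For reference, a per-node server configuration for this kind of setup might look roughly like the sketch below; server_name, store_dir, ports and routes are illustrative, not the actual test config. The key line is sync_interval: always, which forces an fsync after every stream write.

```
# Illustrative config for one node of the 3 node cluster.
server_name: n1
port: 4222

jetstream {
    store_dir: "/tmp/nats/n1"
    # Synchronous writes: fsync the filestore on every write.
    sync_interval: always
}

cluster {
    name: "C1"
    listen: "127.0.0.1:6222"
    routes: [
        "nats-route://127.0.0.1:6223"
        "nats-route://127.0.0.1:6224"
    ]
}
```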

Workload:

nats bench js pub --replicas=3 --clients=10 --msgs=100000 --create --purge --size=1024 test

Throughput:

| Optimization                    | Throughput (msgs/s) |
|---------------------------------|---------------------|
| Baseline                        | 91                  |
| Async stream writes             | 584                 |
| Async + Improved batching       | 593                 |
| Async + Reduced lock contention | 21964               |
| All combined                    | 22967               |

Batching effectiveness:

[figure: batch_comparison]

An easy way to collect batch sizes.
For performance testing only. Will be removed.
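For context, collecting those sizes can be as simple as bumping a counter map wherever a batch is flushed. The sketch below is hypothetical; the names and reporting format are not from this commit.

```go
package main

import (
	"fmt"
	"sync"
)

// batchSizeStats is a hypothetical, throwaway collector of AppendEntry batch
// sizes, bucketed by entry count. Increment it wherever a batch is sent.
type batchSizeStats struct {
	mu      sync.Mutex
	buckets map[int]int // batch size (entries) -> number of batches
}

func (s *batchSizeStats) record(n int) {
	s.mu.Lock()
	s.buckets[n]++
	s.mu.Unlock()
}

func (s *batchSizeStats) dump() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for n, c := range s.buckets {
		fmt.Printf("batch_size=%d count=%d\n", n, c)
	}
}

func main() {
	stats := &batchSizeStats{buckets: make(map[int]int)}
	// In the server this would be called from the Raft send path,
	// e.g. right before an AppendEntry carrying n entries goes out.
	stats.record(1)
	stats.record(32)
	stats.record(32)
	stats.dump()
}
```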
This is the baseline for performance testing Raft's batching
capabilities. The behavior of Raft's batching mechanism is
easier to observe if disk writes are synchronous, i.e. we want
to write() + fsync() the Raft log so that producers can easily
keep the proposal queue busy. To do so one can set
"sync_interval: always". However, that results in disastrous
performance: when the leader receives acks for a "big" batch of
log entries, the upper layer will write() and fsync() all
entries in the batch, individually.

So this commit disables "sync always" on stream writes.
This *should* work in principle because the data is already in
the raft log. Alternatively, one could implement "group commit"
for streams, i.e. fsync() only one time after processing a batch
of entries.
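As a sketch of the group-commit alternative (hypothetical types and names; the real stream store is more involved): write every entry of a committed batch to the store, then fsync once.

```go
package main

import (
	"fmt"
	"os"
)

// applyBatch sketches "group commit": write every entry of a committed batch
// to the stream's store, then fsync the file once, instead of doing a
// write()+fsync() per entry. Types and names here are hypothetical.
func applyBatch(f *os.File, entries [][]byte) error {
	for _, e := range entries {
		if _, err := f.Write(e); err != nil {
			return err
		}
	}
	// One fsync amortizes the synchronous-write cost over the whole batch.
	return f.Sync()
}

func main() {
	f, err := os.CreateTemp("", "stream-*.blk")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	batch := [][]byte{[]byte("msg 1\n"), []byte("msg 2\n"), []byte("msg 3\n")}
	if err := applyBatch(f, batch); err != nil {
		panic(err)
	}
	fmt.Println("wrote and synced", len(batch), "entries with a single fsync")
}
```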

For performance testing only at this point.
This commit removes a "pathological" case from the current Raft
batching mechanism: if the proposal queue contains more entries
than fit in one batch, Raft sends a full batch followed by a
small batch containing the leftovers.
However, it is quite possible that while the first batch was
being stored and sent, clients have already pushed more
proposals onto the queue in the meantime.
With this fix the server composes and sends a full batch, then
handles the leftovers as follows: if more proposals have been
pushed onto the proposal queue, the leftovers are carried over
to the next iteration, so that they are batched together with
the proposals that arrived in the meantime.
If there are no more proposals, the leftovers are sent right away.
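A simplified sketch of that carry-over logic, with hypothetical names and a fixed entry limit (the real code batches Raft entries from the proposal queue and also limits by size):

```go
package main

import "fmt"

const maxBatch = 4 // hypothetical per-batch entry limit

// drain sketches the leftover handling described above: send only full
// batches, and either carry the remainder over to the next iteration (if the
// queue is still busy) or flush it right away (if the queue is idle).
func drain(leftover, pending []string, moreQueued bool, send func([]string)) []string {
	entries := append(append([]string{}, leftover...), pending...)
	for len(entries) >= maxBatch {
		send(entries[:maxBatch])
		entries = entries[maxBatch:]
	}
	if len(entries) > 0 && !moreQueued {
		// Queue is idle: send the small batch instead of holding it back.
		send(entries)
		return nil
	}
	// Otherwise carry the leftovers into the next iteration so they can be
	// batched together with proposals pushed in the meantime.
	return entries
}

func main() {
	send := func(b []string) { fmt.Println("sent batch:", b) }

	// 6 proposals with more already queued: one full batch, 2 carried over.
	left := drain(nil, []string{"p1", "p2", "p3", "p4", "p5", "p6"}, true, send)
	fmt.Println("carried over:", left)

	// Next iteration: the leftovers are batched with the new proposals.
	left = drain(left, []string{"p7", "p8"}, false, send)
	fmt.Println("carried over:", left)
}
```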

For performance testing only at this point.
This is an attempt to reduce contention between Propose() and
sendAppendEntry(). Change Propose() to acquire a read lock on
Raft, and avoid locking Raft during storeToWAL() (which
potentially does IO and may take a long time). This works as
long as sendAppendEntry() is called only from Raft's goroutine,
unless the entry does not need to be stored to the Raft log.
So the rest of the changes enforce the above requirement:
  * Change EntryLeaderTransfer so that it is not stored to the
    Raft log.
  * Push EntryPeerState and EntrySnapshot entries to the
    proposal queue.
  * Make sure EntrySnapshot entries skip the leader check, and
    make sure those are not batched together with other entries.
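A rough, hypothetical sketch of the locking split described above (heavily simplified, not the actual Raft implementation): Propose() takes only a read lock and enqueues, while the single Raft goroutine stores to the WAL and sends append entries without holding the Raft lock.

```go
package main

import (
	"fmt"
	"sync"
)

// raft is a heavily simplified, hypothetical stand-in for the server's Raft
// state; it only illustrates the locking split described above.
type raft struct {
	sync.RWMutex
	leader   bool
	proposeQ chan []byte // proposal queue, consumed only by the Raft goroutine
}

// Propose only reads leadership state, so a read lock is enough and many
// producers can enqueue proposals concurrently.
func (r *raft) Propose(data []byte) error {
	r.RLock()
	isLeader := r.leader
	r.RUnlock()
	if !isLeader {
		return fmt.Errorf("not leader")
	}
	r.proposeQ <- data
	return nil
}

// run is the single Raft goroutine. Because it is the only caller of
// sendAppendEntry for entries that must be stored, storeToWAL (which may do
// slow disk IO) can run without holding the Raft lock at all.
func (r *raft) run() {
	for entry := range r.proposeQ {
		r.storeToWAL(entry) // write()+fsync() of the Raft log, no lock held
		r.sendAppendEntry(entry)
	}
}

func (r *raft) storeToWAL(entry []byte)      { /* append to the Raft log on disk */ }
func (r *raft) sendAppendEntry(entry []byte) { fmt.Printf("append entry: %q\n", entry) }

func main() {
	r := &raft{leader: true, proposeQ: make(chan []byte, 128)}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { defer wg.Done(); r.run() }()

	for i := 0; i < 3; i++ {
		if err := r.Propose([]byte(fmt.Sprintf("proposal %d", i))); err != nil {
			fmt.Println(err)
		}
	}
	close(r.proposeQ) // shutdown for the sketch: drain and stop the goroutine
	wg.Wait()
}
```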

For performance testing only at this point.
Limit batch size based on the configured max_payload.
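A sketch of that size limit with hypothetical names; the actual limit would come from the server's configured max_payload (1 MB by default).

```go
package main

import "fmt"

// splitByPayload sketches limiting a batch by total encoded size so the
// resulting AppendEntry message stays under the configured max_payload.
// Names and the sizes used below are illustrative only.
func splitByPayload(entries [][]byte, maxPayload int) [][][]byte {
	var batches [][][]byte
	var cur [][]byte
	size := 0
	for _, e := range entries {
		if len(cur) > 0 && size+len(e) > maxPayload {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += len(e)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	const maxPayload = 1024 * 1024 // e.g. the default 1 MB max_payload
	entries := make([][]byte, 600)
	for i := range entries {
		entries[i] = make([]byte, 2048) // 2 KB entries
	}
	for i, b := range splitByPayload(entries, maxPayload) {
		fmt.Printf("batch %d: %d entries\n", i, len(b))
	}
}
```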