feat(logs): batching #1338

JoshuaMoelans · 2025-08-13T15:10:13Z

Implements a double buffer queue for logs batching.

upon SDK init, starts a 'batching thread' which flushes the buffer every 5s (or when it receives a 'buffer full' signal from an enqueue attempt)
uses an 'active buffer' to enqueue incoming logs, and switches to a secondary buffer while flushing the active one.
if no buffer is currently available, we retry twice (in case it frees up again in the meantime). If we are unable to enqueue, we drop an incoming log.

#skip-changelog

# Conflicts: # examples/example.c

supervacuus

This is a first review and I guess it will still need a couple of rounds. Maybe we can work on these things a bit closer next.

src/sentry_logs.c

src/sentry_core.c

src/sentry_logs.c

JoshuaMoelans · 2025-08-27T15:09:04Z

Current Issues

Non-Issues

Dropping Logs

‼️
Marked as non-issue, since this heavily depends on the system we're running on. By the order in which threads execute, we could be dropping more logs than 'expected', but this is not something we can fix as it is just an inherent scheduling problem.
As long as we have some multi-threaded test running locally that works fine, we are happy. Abuse quote also stops ingesting logs if we have more than 20k logs/s project-wide, so a realistic stress-test would send way fewer logs than the test we currently run.
‼️

test_logs_threaded still seems to have quite some variability with the current implementation. Running the test locally we see between 500-1500 dropped logs (out of 5000, so between 10-30%), whereas in CI we sometimes have more than 50% unsuccessful enqueues.

The time-to-flush seems higher than local testing (from an average of 600us to 1200us) which could explain the high number of dropped logs. In the 5/50 example below, we even have flushes taking 76000us, which definitely blocks the logs buffer long enough to drop 90% of incoming logs.

Example runs:
22/50 flushes - 56% dropped
5/50 flushes - 90% dropped
1/50 flushes - 98% dropped ❓ time to flush was only 5265us, not sure why we dropped so many 🤔

Potential Solution

We can only handle 20k logs/s project-wide, before abuse quota hits. This equates to one log every 50us; our stress-test might therefore be stressing the system too much (source: internal notion page)

Fixed

Stuck flushing (Windows)

In some runs we seemingly get stuck trying to flush logs. We see a trigger every 5s, even though the timer_task should have stopped

Solution

We had a leftover sleep in the code that was removed in 208952b ; this caused the Windows tests to run for 30 minutes before closing

Threaded logs test getting stuck (Windows)

Different from the Fixed Logs Flushing problem mentioned below, we noticed that on 32-bit Windows the threaded logs test gets 'stuck'. See an example run here.

In the logs we can notice the following; an unexpected return on the logs flush trigger followed by a failed assert in sentry_value.c.

2025-08-29T08:35:16.2307649Z [sentry] WARN Logs flush trigger returned unexpected value
2025-08-29T08:35:16.2317402Z [sentry] DEBUG Time to failed enqueue log is 0 us
2025-08-29T08:35:16.2317765Z 
2025-08-29T08:35:16.2317896Z [sentry] WARN Unable to enqueue log
2025-08-29T08:35:16.2324158Z Assertion failed: !"invalid thing type", file D:\a\sentry-native\sentry-native\src\sentry_value.c, line 534

Solution

Fixed in #1362 , we had some overlooked ownership issues with collecting the attributes for the logs, which caused Windows to show an error window (causing the test to hang). This was not a Windows-specific issue, but CI luckily (accidentally) caught it.

- too much variability in thread scheduling, so we can expect pretty much anything

…flush

JoshuaMoelans · 2025-09-01T07:34:46Z

The switch statement on the return value of sentry__cond_wake_timeout seems to be receiving different values than I was expecting for Windows (hence the current Windows CI failures). Will need to investigate what this actually returns.

Currently it expects:
0 -> condition variable triggered (Linux)
0 -> timeout triggered (Windows) IF (GetLastError() == ERROR_TIMEOUT)
1 -> condition variable triggered (Windows)
ETIMEDOUT -> timeout triggered (Linux)
default -> unexpected value, we continue instead of flushing (which is a temporary change, just to see if this happens because then our tests will fail)

* add `before_send_log` callback * add `before_send_log` callback tests * (temporary) add debug for calling sentry_options_free * remove early return * add late return * cleanup

* let the producer thread sleep for 1ms between logs * fix two missing NULL checks in the json serializer * clean up logging and early exits in `enqueue_log_single()` * clean up ownerships in logs * eliminate clones (we expect that everything outlives the logs being sent except local construction) * use incref everywhere where we ref global state. this was the cause of the UAF, partially solved with the clones but a few were missing. no reason to clone if we do not want to disconnect for a particular object graph * clarify that add_attribute expects ownership * minimize scope_mut by moving os_context out * raise that log output in throughput tests add to variability (stdout logging should be turned off when running a limit) * log error in case we weren't able to start the log batching worker * fix clang-cl warning * ci: fix failing mingw build (#1361) * ci: fix failing mingw build * split `ASM_MASM_COMPILER` and `_FLAGS` * add `ASM_MASM_FLAGS` in `mingw` install step * specify the `CMAKE_ASM_MASM_COMPILER` as a `FILEPATH` * clean up CMAKE_DEFINES construction so it is easier to diff in the future * fix `LLVM_MINGW_INSTALL_PATH` to be referenced locally rather than $env (cherry picked from commit 519554f) * use UNREACHABLE macro to fix anal warnings

github-actions · 2025-09-11T13:42:39Z

	Messages
📖	Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against c405106

* first attempt at double buffered * remove the sleep from the windows thread func * clean up thread waiting in the example * adapt the double buffer to use retries, fix remaining issues, clean up and write inline docs * return early in example on sentry_init error. * fix formatting via shorter name for thread gate atomic * improve inline docs of log_buffer_t members * fetch os_context from scope * move scope/options data retrieval into separate function + add expected keys to test * update logs API to return status code * cleanup * add log-event trace connection test * remove duplicate test * specify macOS SDKROOT --------- Co-authored-by: JoshuaMoelans <[email protected]>

* add sentry logs option * add sentry logs option to example * feat(logs): add sentry log API + send first logs (#1272) * add sentry log API + send first logs * fix log_level_as_string * attach attributes to logs * attach formatted message + args * add to example * add more attributes * cleanup * windows warning-as-error * windows warning-as-error v2 * windows warning-as-error v2 (final) * add unit tests for initial logs * memleak attempted fix * memleak attempted fix 2 * cleanup * use `sentry_level_t` instead of new log level enum * add SENTRY_LEVEL_TRACE to sentry_logger * quick anti-brownout fix - see #1274 * fix missing SENTRY_LEVEL_INFO string return * fix logger level check + add test * cleanup logs parameter extraction * warn-as-error fix * const char* fix * static function * feat(logs): add (u)int64 sentry_value_t type (#1301) * add (u)int64 sentry_value_t type * add value_to_msgpack missing switch cases * remove undefined behavior test (C99 6.3.1.4) * avoid Windows sized integer name collision * cleanup & apply code review feedback * more cleanup & remove type coercion * update logs param conversion * own uint64 string * apply suggestions from code review * fixed log parameter conversion * update example to avoid warning-as-error * feat(logs): batching (#1338) * initial queue attempt * add timer * prototype double buffer approach * update logs unit tests for batching * replace timer with bgworker * add first integration tests * update example.c with correct log thread amounts * cleanup * add wait for 'adding' logs in flush logic * go back to single queue for performance testing * add time checks * add ToDos + cleanup sentry_value_t cloning * initial attempt * cond_wait for timer + 'adding' spinlock * add sleep for tests * add sleep for tests * force flush before attempting timer_worker shutdown * add proper cond_wait for 'adding' counter * revert to manual flush on shutdown instead of timer thread flush * add separate timer_stop atomic * cleanup + replace 'adding' cond_wait by pure spinlock * change bgworker for simpler thread implementation * cleanup * fix memleak * fixes * cleanup * cleanup * windows fixes * update shutdown order * fixes * explicit check to stop timer task * windows cleanup * loosen threaded test assertion for CI - too much variability in thread scheduling, so we can expect pretty much anything * add continue for unexpected logs flush trigger instead of attempting flush * Windows re-add condition variable trigger case * feat(logs): add `before_send_log` (#1355) * add `before_send_log` callback * add `before_send_log` callback tests * (temporary) add debug for calling sentry_options_free * remove early return * add late return * cleanup * fix ownership issues in single buffer batching (#1362) * let the producer thread sleep for 1ms between logs * fix two missing NULL checks in the json serializer * clean up logging and early exits in `enqueue_log_single()` * clean up ownerships in logs * eliminate clones (we expect that everything outlives the logs being sent except local construction) * use incref everywhere where we ref global state. this was the cause of the UAF, partially solved with the clones but a few were missing. no reason to clone if we do not want to disconnect for a particular object graph * clarify that add_attribute expects ownership * minimize scope_mut by moving os_context out * raise that log output in throughput tests add to variability (stdout logging should be turned off when running a limit) * log error in case we weren't able to start the log batching worker * fix clang-cl warning * ci: fix failing mingw build (#1361) * ci: fix failing mingw build * split `ASM_MASM_COMPILER` and `_FLAGS` * add `ASM_MASM_FLAGS` in `mingw` install step * specify the `CMAKE_ASM_MASM_COMPILER` as a `FILEPATH` * clean up CMAKE_DEFINES construction so it is easier to diff in the future * fix `LLVM_MINGW_INSTALL_PATH` to be referenced locally rather than $env (cherry picked from commit 519554f) * use UNREACHABLE macro to fix anal warnings * batching double buffered (#1365) * first attempt at double buffered * remove the sleep from the windows thread func * clean up thread waiting in the example * adapt the double buffer to use retries, fix remaining issues, clean up and write inline docs * return early in example on sentry_init error. * fix formatting via shorter name for thread gate atomic * improve inline docs of log_buffer_t members * fetch os_context from scope * move scope/options data retrieval into separate function + add expected keys to test * update logs API to return status code * cleanup * add log-event trace connection test * remove duplicate test * specify macOS SDKROOT --------- Co-authored-by: JoshuaMoelans <[email protected]> * add flush retry for missed flush requests * move flush retry into flush function * add docs --------- Co-authored-by: Mischan Toosarani-Hausberger <[email protected]> * update CHANGELOG.md * use `trace_id` from scoped spans for logs * fix copy-paste leftover + docs * add log_sleep for thread test + variable NUM_LOGS * no `usleep` on windows :( * fix seconds->milliseconds * cleanup * test(logs): add 32-bit vargs test (#1370) * add vargs conversion test * add ifdef for 32-bit systems * Update tests/unit/test_logs.c Co-authored-by: Mischan Toosarani-Hausberger <[email protected]> * add comment * fix comment --------- Co-authored-by: Mischan Toosarani-Hausberger <[email protected]> * Apply suggestions from code review Co-authored-by: Mischan Toosarani-Hausberger <[email protected]> * post-merge cleanup * pin ruamel version * let's unpin ruamel.yaml.clib to get 0.2.14 which seemingly fixes the missing header: https://sourceforge.net/p/ruamel-yaml-clib/tickets/47/#de77 * add empty payload check * log output of logger tests if we fail to find the pre-crash marker * fix: move `is_tsan` marker into the `has_crashpad` condition... ...so we can ignore in which module a `crashpad` test runs. * fix: update `has_crashpad` condition comment * really only move `is_tsan`, but keep the module level `pytestmark` * CHANGELOG.md update * CHANGELOG.md update * CHANGELOG.md update --------- Co-authored-by: Mischan Toosarani-Hausberger <[email protected]>

JoshuaMoelans added 21 commits August 8, 2025 13:16

initial queue attempt

fbed0e2

add timer

234bbb6

prototype double buffer approach

5815c1e

update logs unit tests for batching

914514d

replace timer with bgworker

d1988a8

add first integration tests

d07ef33

Merge branch 'joshua/feat/logs' into joshua/feat/logs_batching

b3d624c

# Conflicts: # examples/example.c

update example.c with correct log thread amounts

b2dbe18

cleanup

ccd88d2

add wait for 'adding' logs in flush logic

845f5bc

go back to single queue for performance testing

ad88a88

add time checks

28d4a50

add ToDos + cleanup sentry_value_t cloning

3d360cb

initial attempt

f067959

cond_wait for timer + 'adding' spinlock

91b2f7f

add sleep for tests

c0395ae

add sleep for tests

ea3fffb

force flush before attempting timer_worker shutdown

98d6c21

add proper cond_wait for 'adding' counter

e1e2a1b

revert to manual flush on shutdown instead of timer thread flush

cb834f7

add separate timer_stop atomic

d320067

supervacuus requested changes Aug 14, 2025

View reviewed changes

JoshuaMoelans added 8 commits August 14, 2025 16:03

cleanup + replace 'adding' cond_wait by pure spinlock

c85f685

change bgworker for simpler thread implementation

3c28646

cleanup

651ca04

fix memleak

02cea90

fixes

3907d90

cleanup

d8bbb23

cleanup

b8c110e

windows fixes

6f2be36

JoshuaMoelans added 3 commits August 27, 2025 13:57

update shutdown order

394c492

fixes

8da409e

explicit check to stop timer task

ed9176b

JoshuaMoelans added 3 commits August 28, 2025 11:20

windows cleanup

208952b

loosen threaded test assertion for CI

b34b5c8

- too much variability in thread scheduling, so we can expect pretty much anything

add continue for unexpected logs flush trigger instead of attempting …

b63d7e4

…flush

Windows re-add condition variable trigger case

6d30ff1

JoshuaMoelans mentioned this pull request Sep 2, 2025

feat: structured logging #1271

Merged

21 tasks

JoshuaMoelans and others added 2 commits September 8, 2025 13:53

feat(logs): add before_send_log (#1355)

978abfb

* add `before_send_log` callback * add `before_send_log` callback tests * (temporary) add debug for calling sentry_options_free * remove early return * add late return * cleanup

This was referenced Sep 11, 2025

(outdated) logs - initial batching prototype #1334

Draft

batching double buffered #1365

Merged

JoshuaMoelans changed the title ~~logs: single-queue batching~~ feat(logs): batching Sep 11, 2025

supervacuus and others added 4 commits September 11, 2025 16:07

add flush retry for missed flush requests

f022d1e

move flush retry into flush function

cdad215

add docs

c405106

JoshuaMoelans marked this pull request as ready for review September 12, 2025 09:07

JoshuaMoelans merged commit c18e451 into joshua/feat/logs Sep 12, 2025
35 of 36 checks passed

JoshuaMoelans deleted the joshua/feat/logs_batching_single_queue branch September 12, 2025 09:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat(logs): batching #1338

feat(logs): batching #1338

Uh oh!

JoshuaMoelans commented Aug 13, 2025 •

edited

Loading

Uh oh!

supervacuus left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JoshuaMoelans commented Aug 27, 2025 •

edited

Loading

Uh oh!

JoshuaMoelans commented Sep 1, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Sep 11, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat(logs): batching #1338

feat(logs): batching #1338

Uh oh!

Conversation

JoshuaMoelans commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

supervacuus left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JoshuaMoelans commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current Issues

Non-Issues

Dropping Logs

Potential Solution

Fixed

Stuck flushing (Windows)

Solution

Threaded logs test getting stuck (Windows)

Solution

Uh oh!

JoshuaMoelans commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JoshuaMoelans commented Aug 13, 2025 •

edited

Loading

JoshuaMoelans commented Aug 27, 2025 •

edited

Loading

JoshuaMoelans commented Sep 1, 2025 •

edited

Loading

github-actions bot commented Sep 11, 2025 •

edited

Loading