Conversation

@mbauman (Member) commented Feb 22, 2024

This is a very simple change that will hopefully help folks identify things like https://discourse.julialang.org/t/inconsistent-cpu-utilisation-in-threads-loops/110512 a little faster.

Before:

julia> @time @sync begin
           Threads.@spawn @ccall sleep(1::Cuint)::Cuint
           sleep(.5)
           Threads.@spawn GC.gc()
       end
  1.078393 seconds (927 allocations: 65.078 KiB, 5.85% gc time, 1.19% compilation time)

After:

julia> @time @sync begin
           Threads.@spawn @ccall sleep(1::Cuint)::Cuint
           sleep(.5)
           Threads.@spawn GC.gc()
       end
  1.040344 seconds (1.35 k allocations: 65.266 KiB, 50.85% gc time: 94% of which was waiting for safepoints, 1.09% compilation time)

base/timing.jl (outdated diff)

@@ -57,7 +57,7 @@ function GC_Diff(new::GC_Num, old::GC_Num)
     new.poolalloc - old.poolalloc,
     new.bigalloc - old.bigalloc,
     new.freecall - old.freecall,
-    new.total_time - old.total_time,
+    (new.total_time - old.total_time) + (new.total_time_to_safepoint - old.total_time_to_safepoint),
Member:

I would prefer introducing a new field in GC_Diff that is the safepoint time, and breaking the top-level report into two sections. As an example, when I run a scaling test, I want to see that total_time stays constant (e.g. the work of GC scales well with the number of threads and GC workers). If gc time starts creeping up, I might use that as a decision point to use more GC threads.

On the other hand, if the latency to start GC is high (time to safepoint), I can use that as a decision point to look with a profiler at whether I spend too much time in certain ccalls.

Munging them together yields a simple "GC is slow", but no actionable information.
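
A minimal sketch of the shape of that split (the type and field names here are hypothetical, not Base's actual definitions):

struct GCDiffSplit
    total_time::Int64          # ns spent marking and sweeping
    time_to_safepoint::Int64   # ns spent waiting for all threads to stop
end

# Each field answers a different tuning question: a creeping total_time
# suggests adding GC threads; a creeping time_to_safepoint suggests
# profiling for long ccalls and adding GC.safepoint() calls.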

@mbauman (Member Author):

Yeah, that's fair. In my view, "GC is slow" is very actionable information — and throwing more information at people isn't always helpful. How would you want this to appear? Perhaps we take inspiration from compilation time?

2.39% compilation time: 12% of which was recompilation

@KristofferC (Member):

How many people who see this message will know what "waiting for safe point" means? It's quite a lot of text to add.

@vchuravy (Member):

So instead of "waiting for safe point", I would call it latency. Or we could just print it as, e.g., (30%|10%) GC, where the total is the sum of those two values.

Another way of thinking about TTSP is that it is like the GC lock.

@adienes (Member) commented Feb 22, 2024

the information certainly seems useful and 👍 to the functionality, but I'm just noting that #52883 also recently lengthened the message in the parenthetical

maybe at this point @time contains enough information that it could be appropriate to print over multiple lines?

@mbauman (Member Author) commented Feb 22, 2024

My only goal here is to match what "gc time" means to me (and I'd wager the vast majority of Julia users) — and that's the total time spent inside GC.gc(), sleeping or not.

I really don't want yet another way of interpreting a particular category's breakdown. So in my view we only have two choices, and I'm happy with either:

  • Match what we do with compilation/recompilation: X% category time: Y% of which was caveat
  • Only print the total time spent inside GC.gc(), but have the details available in @timed's output

It only prints this extra info if more than 0.5% of the GC time was spent waiting to start. I could make the caveat shorter with just "waiting", but when I wrote "waiting for safepoint" I chose that phrasing because I figured it'd make for good Google fodder.
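
A rough sketch of that elision rule (the helper and its exact wording are illustrative, not the PR's actual printing code):

using Printf

function gc_time_fragment(elapsed_ns, gc_ns, ttsp_ns)
    gc_pct   = 100 * gc_ns / elapsed_ns  # GC share of the whole run
    ttsp_pct = 100 * ttsp_ns / gc_ns     # waiting share of the GC time
    if ttsp_pct > 0.5                    # elide the caveat when waiting is negligible
        @sprintf("%.2f%% gc time: %.0f%% of which was waiting for safepoints", gc_pct, ttsp_pct)
    else
        @sprintf("%.2f%% gc time", gc_pct)
    end
end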

@mbauman added the GC (Garbage collector) and observability (metrics, timing, understandability, reflection, logging, ...) labels Feb 22, 2024
@raminammour (Contributor) commented Feb 22, 2024

I am not an expert, but as people want more and more information going into @time (compilation, GC, ...), which has been very useful, good heuristics on when to print the fine-grained info may be worthwhile: say, in this case, only print the "waiting" time (+1 to omitting "for safepoint") if it is more than 10% (for example) of total time.
It may be arbitrary, but it preserves the concise output in most cases, and only alerts the user that something "else" is happening when it becomes important. The full info can always go in @timed for the gurus :)

Edit: ignore the above, I just saw in the code that @mbauman already only shows this if it is > 50%.

Thanks for doing this, I learned something new reading this!

@vchuravy (Member) commented Feb 22, 2024

One challenge with using total=TTSP+GC is that TTSP and useful time overlap.

So with two threads you could get a measurement where the total is equivalent to 100% (one thread immediately requests GC and the other ignores that request until the end).

Right now GC % is a measurement of something automatic, but useful. Mixing in TTSP makes the time reported less useful for me, which is why I likened it to the recently added lock measurement.

Put another way, currently GC time is a measurement of work and how efficiently that work is being processed, whereas TTSP is a measurement of work not occurring.

I should say that I am all in favor of surfacing this information, since I strongly believe this is an issue we as a community are not sufficiently aware of.

@oscardssmith (Member):

The counterpoint to this is that we really do need to make it so that TTSP is always ~0. Right now ccall is the main case where this doesn't happen, but we already basically know what we need to do to fix it.
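
For reference, the mitigation available to users today is an explicit safepoint between long foreign calls. A sketch, where the one-second sleep stands in for any blocking ccall:

Threads.@spawn for _ in 1:3
    @ccall sleep(1::Cuint)::Cuint  # GC cannot interrupt a thread mid-ccall...
    GC.safepoint()                 # ...but a pending collection can run here
end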

@mbauman (Member Author) commented Feb 22, 2024

> Put another way, currently GC time is a measurement of work and how efficiently that work is being processed, whereas TTSP is a measurement of work not occurring.

Ok. But it's the combined number that is the measurement of time my work is not happening. That's why it's what's useful to me — and I'd wager most folks.

@vchuravy (Member):

But the action to take in either case is drastically different. In one case I might need to add some strategic GC.safepoint calls or annotate a ccall as GC-safe, with the goal of reducing latency; in the other case I might need to preallocate, reduce heap usage in general, ...

Maybe we report TTSP in the lock section? The more I think about that the more I like the idea. GC is waiting on a lock to acquire exclusive access to the heap.

@d-netto (Member) commented Feb 23, 2024

Want to note that there seems to be some history behind the interpretation "GC time = mark time + sweep time".

Note that GCBenchmarks displays/interprets GC time as "mark time + sweep time" (it doesn't count TTSP), and pretty much all dashboards we have internally at RAI do so as well.

Having GC time defined in two different ways in GCBenchmarks (which is arguably our main open-source tool for GC performance analysis now) and @time seems inconsistent.

@mbauman (Member Author) commented Feb 23, 2024

I'm just confused about how to concretely — and succinctly and understandably — say that this is due to the GC without saying it's GC time. I suppose it could be in its own section:

1.040344 seconds (1.35 k allocations: 65.266 KiB, 5.85% gc time,  49.85% time waiting for gc, 1.09% compilation time)

One additional challenge is in determining the "floor" for elision — this time is always nonzero.

I did also think about using the locks section, but there are two issues there: the existing metric is different (time vs. count), and even if we started counting counts, I'd still want to separate out GC locks vs. ReentrantLocks, which means that we'd still need to add a new descriptor.

@vchuravy (Member):

At the core of the issue for me are units. Currently @time measures real time: the time a human would experience. In programming we often also refer to CPU time, a useful concept in concurrent programming.

Take as an example the program below: it uses 10 units of real time and 20 units of CPU time. If we look at the time G the GC takes to run, it would be 2 units of real time and 4 units of CPU time. In both cases GC takes 20% of the program execution.

T1: ----RxxxGG
T2: --------GG

Now what about TTSP? In the ASCII diagram above, R is the moment we requested GC and x is the time waited on the other mutator. If we use the real-time measurement, it would seem like we are spending 4 units, or 40%, of our time waiting; but using CPU time we can see that this is not the case: TTSP takes 20% of our CPU time, a much more accurate representation of the cost of getting to the safepoint.

This problem gets even worse the more threads we add:

T1: ----RxxxGG
T2: --------GG
T3: --------GG
T4: --------GG
T5: --------GG

Still 10 units of real time, but 50 units of CPU time. GC takes 20%, but TTSP takes 40% of real time and 8% of CPU time.
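
Checking that arithmetic directly (the numbers are the abstract time units from the diagrams above):

real_time = 10                    # columns in the diagram
nthreads  = 5
cpu_time  = nthreads * real_time  # 50 units
gc, ttsp  = 2, 4                  # the GG span and the R+xxx span
100 * gc / real_time              # 20.0: GC as a share of real time
100 * gc * nthreads / cpu_time    # 20.0: GC as a share of CPU time (all threads pause)
100 * ttsp / real_time            # 40.0: TTSP as a share of real time
100 * ttsp / cpu_time             #  8.0: TTSP as a share of CPU time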

> But it's the combined number that is the measurement of time my work is not happening. That's why it's what's useful to me — and I'd wager most folks.

I don't disagree (besides being cute and noting that GC is still doing your work, just work you procrastinated on).

But I wager that you actually care about the CPU-time ratio and not the real-time ratio, and phrasing a multi-core measurement in real time is not useful. It doesn't convey "how big of a problem is this". I want to make sure that the information we present is actionable.

I don't think we currently track CPU time accurately throughout the system, but that is a separate issue.

@mbauman (Member Author) commented Feb 23, 2024

So what's your suggestion on how to present this information? In my mind, the overarching goal of @time's parenthetical is to summarize some overheads that you might not have explicitly asked for and might not expect — and which might be highly variable. I put time-to-safepoint in that bin. And I still say it fits the category of a GC-related overhead.

I hear you on the units, but I must confess I don't have a solid understanding of what the percentages are supposed to mean in the context of multithreading. If you had pressed me (before I looked at the implementation yesterday), I might have guessed they have CPU time in the numerator and wall time in the denominator — but maybe that's just because compilation time sometimes reports >100%.

I still haven't quite grokked how the GC tracks multiple threads waiting for a safepoint — but it seems that it only counts the time from the first/worst one? That is, this diagram:

T1: --RxxxxxxxxxGG
T2: ----RxxxxxxxGG
T3: ------RxxxxxGG
T4: ---------RxxGG

The GC stats — as they currently exist — seem to do the real-world timing; it's just the worst-case-thread one. (Edit: I'm guessing that's because the other threads start waiting on behalf of the first thread's collection before they can even try to start their own, and then we end up with 4 synchronized collections back-to-back, but the effect is that this number is more real-time-like than I expected):

julia> @time @sync begin
           Threads.@spawn (@ccall sleep(1::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(2::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(3::Cuint)::Cuint; GC.gc())
           Threads.@spawn (@ccall sleep(4::Cuint)::Cuint; GC.gc())
       end
  4.050298 seconds (3.43 k allocations: 163.219 KiB, 75.02% gc time: 99% of which was waiting for safepoints, 0.73% compilation time)

@vchuravy (Member) commented Feb 23, 2024

> The GC stats — as they currently exist — seem to do the real-world timing; it's just the worst-case-thread one:

Yes, we currently only track the time the thread requesting GC spends waiting for all the other threads to reach a safepoint, until it gains exclusive access to the heap and thus the object graph.

Real time vs. CPU time is probably also the reason why compile time can exceed total time.

@mbauman (Member Author) commented Feb 23, 2024

> So what's your suggestion on how to present this information?

:)

Under the assumption that folks are trying to keep all threads evenly chugging along, I think this PR is capturing the situation quite well. It's that worst-case thread that you end up waiting for:

julia> @time @sync begin
           Threads.@spawn (@ccall sleep(1::Cuint)::Cuint; GC.gc(); @ccall sleep(4::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(2::Cuint)::Cuint; GC.safepoint(); @ccall sleep(3::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(3::Cuint)::Cuint; GC.safepoint(); @ccall sleep(2::Cuint)::Cuint)
           Threads.@spawn (@ccall sleep(4::Cuint)::Cuint; GC.safepoint(); @ccall sleep(1::Cuint)::Cuint)
       end
  8.071955 seconds (3.96 k allocations: 190.641 KiB, 37.86% gc time: 98% of which was waiting for safepoints, 0.33% compilation time)
Task (done) @0x000000010a344650

julia> 8.071955 * (1-.3786)
5.015912837

@@ -57,7 +58,8 @@ function GC_Diff(new::GC_Num, old::GC_Num)
     new.poolalloc - old.poolalloc,
     new.bigalloc - old.bigalloc,
     new.freecall - old.freecall,
     new.total_time - old.total_time,
+    new.total_time - old.total_time,
Member:

Suggested change:
    new.total_time - old.total_time,
    new.total_time - old.total_time,

@fingolfin (Member):

The code in this PR seems useful. Is the issue "only" the way the measurement is presented? If so, the place I'd look if I didn't understand the output of @time would be the docstring for @time. I think that should explain this, not necessarily in great detail, but at least in a brief sentence, perhaps with a link to a place that explains more about safepoints. Ideally the user should be able to start with ?@time and from there figure out what this is about and why they might care (so the explanation could mention that this is particularly relevant when multithreading is in use, or so).

@IanButterworth (Member):

What about "37.86% gc time: 98% waiting", with the docs explaining that gc is waiting for a safepoint?

@mbauman (Member Author) commented Jul 24, 2024

Unfortunately that doesn't resolve Valentin's objection, which is to the inclusion of waiting time in the overall GC percentage number. That's where this is at loggerheads.

@IanButterworth (Member):

Ok. What about going back to gc_time being what it used to be and something like
0.75% gc time: 49x longer waiting,

@MasonProtter (Contributor) commented Feb 26, 2025

If people can't agree on how the information is presented, could we at the very least make this information accessible to @timed, and then let the user access it and present it however they choose?

It's really unfortunate that this vital diagnostic info is so hard to access.

@StefanKarpinski (Member):

My belated 2¢:

  • Always count wait time as GC time
  • If waiting is insignificant, display 50.85% gc time as always (but it now includes waiting time)
  • If waiting is significant, display 50.85% gc time [47.80% waiting]
  • Waiting as a percentage of total time, not a percentage of GC time; percent of a percent is too confusing

Original post output would look like this:

julia> @time @sync begin
           Threads.@spawn @ccall sleep(1::Cuint)::Cuint
           sleep(.5)
           Threads.@spawn GC.gc()
       end
  1.040344 seconds (1.35 k allocations: 65.266 KiB, 50.85% gc time [47.80% waiting], 1.09% compilation time)
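
A sketch of that display rule (purely illustrative, with an assumed significance threshold; this is not Base's actual printing code):

using Printf

function gc_fragment(elapsed, gc_time, wait_time)
    gc_pct   = 100 * (gc_time + wait_time) / elapsed  # waiting always counts as gc time
    wait_pct = 100 * wait_time / elapsed              # share of total time, not of GC time
    if wait_pct >= 1                                  # "significant": an assumed cutoff
        @sprintf("%.2f%% gc time [%.2f%% waiting]", gc_pct, wait_pct)
    else
        @sprintf("%.2f%% gc time", gc_pct)
    end
end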

@oscardssmith (Member):

@StefanKarpinski the one other question here is: if 5 out of 8 threads spend time waiting for GC, how does that get displayed?

@StefanKarpinski (Member):

How does GC time get computed as a percentage in that case? Seems like the same issue.

@oscardssmith (Member) commented Apr 16, 2025

GC time (while we have a synchronous GC) is fairly simple, because when GC is running, all mutator threads are paused. The tricky part about pause time is that, by definition, some number of threads between 1 and n-1 is paused, so you are somewhat forced to deal with this.

IMO the right answer is that GC pause time should be the % of total time that any thread spends waiting for GC to start. This has the advantage that it matches the other timings from @time in that it is a % of wall time, but it does have the potentially slightly confusing aspect that % gc pause + % compilation (for example) can be >100%.

@MasonProtter (Contributor):

> IMO the right answer is that GC pause time should be the % of total time that any thread spends waiting for GC to start.

Just to make sure I understand what exactly you mean: if I have two threads and run for 2 seconds, and one thread waits for 1 second and the other does not wait at all, that's 25% in this situation, not 50%, right?

@oscardssmith (Member):

That would be 50%. In general we expect pause times to be low, so this shouldn't matter too much (i.e. if the pause time is ever >1%, that's a problem).
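
Spelling out the two readings of that scenario (two threads, a 2-second run, one thread waiting for 1 second):

wall, wait = 2.0, 1.0
100 * wait / wall        # 50.0: share of wall time during which some thread waited
100 * wait / (2 * wall)  # 25.0: waited thread-seconds over total thread-seconds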

@StefanKarpinski (Member):

Maybe waiting should be in total seconds? Or average seconds/thread?

@MasonProtter (Contributor):

> That would be 50%. In general we expect pause times to be low, so this shouldn't matter too much (i.e. if the pause time is ever >1%, that's a problem).

I've had extremely allocation-heavy multithreaded code where I'm pretty sure the wait times were the main factor causing my code to fail to scale with threads.

I.e. running on 32 processes (on a 64-core machine), I had something like a 30x speedup; running on 32 threads, I had something like a 2-3x speedup.

My understanding was that waiting for GC is often responsible for cases like that.

@StefanKarpinski (Member):

I think that displaying GC wait time in either total seconds or seconds/thread seems clearer.

@vchuravy (Member):

It's essentially a confusion of units. We are comparing CPU time with wall time.

I think wall time is a good indicator that "TTSP is a problem", but to dig into what is happening, one would be interested in the distribution of TTSP per thread.

Like, if one thread is waiting for all the other threads, that's fine; but if all threads are waiting for one thread, that's a problem for me.

So we could report wall time + sum(cpu-time)/threads. But that only gives the average, and reporting the full distribution is probably too verbose. Maybe we can do the 50th/75th percentile? (That is, approaching a quantile-based skewness measure...)
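
A sketch of what such a summary could look like if the runtime exposed per-thread TTSP samples (it currently does not; ttsp_per_thread is hypothetical data):

using Statistics

ttsp_per_thread = [1.85, 1.90, 1.88, 0.02]  # seconds; everyone waited on one straggler
avg      = sum(ttsp_per_thread) / length(ttsp_per_thread)
p50, p75 = quantile(ttsp_per_thread, [0.5, 0.75])
# One large value among small ones: one thread waited on the rest (fine).
# Mostly large values, as here: all threads waited on a straggler (a problem).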

@oscardssmith (Member):

> Like, if one thread is waiting for all the other threads, that's fine; but if all threads are waiting for one thread, that's a problem for me.

IMO this is wrong. In an ideal world, as soon as one thread requests GC, all other threads should notice in <1 ms. There are some known cases where that won't happen currently, but those are places where we should probably be inserting safe regions.

@IanButterworth (Member):

What about adding a general-purpose average of thread idle time, over the threads that have a task running within the timed region?

1.040344 seconds (1.35 k allocations: 65.266 KiB, 5.85% gc time, 25.32% avg. active thread idle time, 1.09% compilation time)

Specific to threads that are active, so you don't end up with it showing up on

$ julia -t16 -e "@time 1+1"

"avg. active thread idle time" is a bit long though.

@adienes (Member) commented Apr 17, 2025

Maybe just idle threads? And, entirely subjectively, TTSP / CPU time is the ratio that feels most intuitive to me:

1.040344 seconds (1.35 k allocations: 65.266 KiB, 5.85% gc time, 1.58% idle threads, 1.09% compilation time)

And it still makes sense to map that back to wall time, under the ideal that your performance "should" scale linearly with thread count.

Say you have a program that does some work in time T and forces single-threading, while all the other threads wait. With 100 threads you'll see:

T seconds (99% idle threads)

which makes sense, since if your workload could scale linearly it would only take T/100 seconds, meaning the opportunity cost is indeed 99%.
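
That opportunity-cost reading checks out numerically (a quick sketch; T = 10.0 is an arbitrary stand-in):

T        = 10.0                  # seconds of forced single-threaded work
nthreads = 100
idle     = (nthreads - 1) * T    # idle thread-seconds while one thread works
100 * idle / (nthreads * T)      # 99.0: matches the T/100 ideal-scaling loss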
