@dylanjwolff (Contributor) commented Sep 6, 2025

Currently, Shuttle schedulers have no visibility into the system under test. The algorithms only see which tasks are enabled at a given time, not what those tasks are actually doing. This lack of semantic information greatly limits what we can do with Shuttle schedulers. Many algorithms, such as Partial Order Sampling (POS) and Selective Uniform Random Walk (SURW), leverage information about the events being executed to make better scheduling decisions (e.g., making a different decision if two events conflict, or if a particular task is accessing a resource "of interest"). Such information is also useful for developing a general notion of behavioral coverage, for example tracking new reads-from edges.

This PR adds a next_event field to the Task struct, which is accessible to scheduling algorithms. The Event enum gives the schedulers access to the signature of the resource being accessed, as well as the source location of the access (via track_caller).
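For a rough idea of the shape, something like the following sketch (the variant names, payloads, and the ResourceSignature stub here are illustrative, not the actual definitions in the diff):

```rust
use std::panic::Location;

// Stub standing in for the signature types the PR actually uses.
pub struct ResourceSignature {
    pub id: u64,
}

// Illustrative shape of the Event enum: each variant carries a borrow of the
// signature of the resource being accessed plus the caller's source location.
#[derive(Clone, Copy)]
pub enum Event<'a> {
    SemaphoreAcquire {
        signature: &'a ResourceSignature,
        location: &'static Location<'static>,
    },
    AtomicAccess {
        signature: &'a ResourceSignature,
        location: &'static Location<'static>,
    },
}

impl<'a> Event<'a> {
    // #[track_caller] makes Location::caller() report the user call site that
    // triggered the access, rather than this constructor.
    #[track_caller]
    pub fn atomic_access(signature: &'a ResourceSignature) -> Self {
        Event::AtomicAccess {
            signature,
            location: Location::caller(),
        }
    }
}
```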

As a small "demo" application, I also modified the Metrics scheduler to count the number and kind of events observed during a Shuttle test. As part of this demo, I made some light changes to the Scheduler API so that it returns a &Task instead of a TaskId. This change allows nested schedulers such as the Metrics scheduler to directly look up information about the task chosen by inner schedulers without iterating over the runnable tasks.
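Roughly, the change is the one sketched below; the argument list is an assumption for illustration, not the actual trait definition:

```rust
pub struct TaskId(pub usize);
pub struct Task { /* id, next_event, ... */ }

// Before: the scheduler returns only the id of the chosen task.
pub trait SchedulerById {
    fn next_task<'a>(
        &mut self,
        runnable: &'a [&'a Task],
        current: Option<TaskId>,
        is_yielding: bool,
    ) -> Option<TaskId>;
}

// After: the chosen &Task itself is returned, so a wrapping scheduler (like
// Metrics) can inspect the task chosen by its inner scheduler directly,
// without searching `runnable` for the matching id.
pub trait SchedulerByTask {
    fn next_task<'a>(
        &mut self,
        runnable: &'a [&'a Task],
        current: Option<TaskId>,
        is_yielding: bool,
    ) -> Option<&'a Task>;
}
```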

Note: this PR does use some unsafe lifetime casting to allow Task to hold a reference to the Task/Resource signatures. Just outright cloning the signatures is maybe ~5-8% slower in the worst case. I'm open to discussion/suggestions here, but I think the other approaches to avoid cloning without unsafe might end up being significantly more complicated than the unsafe code without much real benefit. In particular, using Rc<_> for signatures requires heap allocations on initialization, while many resource constructors are const. Note that this is similar to the unsafe added for the runnable tasks slice, and, as in that case, we are providing a safe API to the scheduling algorithms.
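For context, the pattern looks roughly like the sketch below; all names here are simplified stand-ins for the actual code, and the SAFETY comment records the PR's argument rather than a verified invariant:

```rust
pub struct ResourceSignature {
    pub id: u64,
}

pub struct Event<'a> {
    pub signature: &'a ResourceSignature,
}

pub struct Task {
    // Stored with an erased 'static lifetime, but actually borrowed from a
    // live resource owned by the program under test.
    next_event: Option<Event<'static>>,
}

impl Task {
    // Safe accessor handed to scheduling algorithms.
    pub fn next_event(&self) -> &Option<Event<'_>> {
        // SAFETY (the PR's argument): the runtime only sets next_event while
        // the referenced signature is alive, and the returned borrow cannot
        // outlive &self, so schedulers never observe a dangling signature.
        unsafe { std::mem::transmute(&self.next_event) }
    }
}
```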

I am marking this as a draft for now because it depends on other open PRs (#207, #210, and #216).

It currently includes the changes from these other PRs. To look at the changes unique to this PR, see 4778a31.

I've rebased this on #216 and will mark it as ready for review as soon as that is merged.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dylanjwolff dylanjwolff force-pushed the next-event-on-task branch 2 times, most recently from 870fd91 to a8eab76 on September 19, 2025 at 23:51
@sarsko (Contributor) commented Sep 23, 2025

Just for the record, I'm generally pro-unsafe.

@sarsko (Contributor) commented Sep 23, 2025

Could we get #210 merged and then get this one out in a reviewable format?

@dylanjwolff dylanjwolff force-pushed the next-event-on-task branch 2 times, most recently from 61edacc to 5f487b5 on October 14, 2025 at 18:30
@dylanjwolff dylanjwolff marked this pull request as ready for review October 16, 2025 16:15
```rust
    unsafe { std::mem::transmute(&self.next_event) }
}

pub(crate) unsafe fn set_next_event(&mut self, event: Event<'_>) {
```
Contributor (review comment on the code above)

SAFETY comment describing how to make it safe. Also describe why unsafe is needed.

@sarsko (Contributor) commented Oct 30, 2025

Did a "local review" now, making a commit instead of writing comments (since Dylan's internship is over, I'll be driving the PR to merge). Most of my changes are nits, but:

I am not convinced the set_next_event calls in Acquire::poll are safe. The reason for this is that we may cancel the acquire.

Related to this, I believe there is a race on the next_event field when used in Acquire::poll.

Re it not being safe:
If we create a BatchSemaphore with 0 permits, try to acquire 1 permit by calling acquire, manually poll the returned future, drop it while it is Pending (the same situation can be reached by having multiple futures and polling them via tokio::select!, FuturesUnordered, and so on), and then drop the BatchSemaphore, we are left in a state where the Task contains a reference to a ResourceSignature that no longer exists.
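The shape of that scenario, with a toy future standing in for the real Acquire (this sketch does not touch shuttle at all; it only shows the poll-then-drop control flow):

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::task::{Context, Poll, Waker};

// Toy future standing in for BatchSemaphore::acquire's Acquire. A real
// Acquire with no permits available would register a waiter (and, with this
// PR, record a next_event borrowing the semaphore's ResourceSignature)
// before returning Pending.
struct ToyAcquire;

impl Future for ToyAcquire {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

fn main() {
    // Waker::noop() needs a recent toolchain (Rust 1.85+); any no-op waker
    // would do here.
    let mut cx = Context::from_waker(Waker::noop());
    {
        let mut fut = pin!(ToyAcquire);
        assert!(fut.as_mut().poll(&mut cx).is_pending());
        // `fut` is dropped here without ever completing: the acquire is
        // cancelled. If that poll had stored a borrowed signature on the
        // Task, dropping the semaphore next would leave the Task pointing at
        // a ResourceSignature that no longer exists.
    }
    // Polling two such futures back-to-back before yielding illustrates the
    // overwrite race discussed next: the second poll clobbers the next_event
    // recorded by the first.
}
```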

Re it being racy:
If a single task performs multiple acquires within a single context switch (again, FuturesUnordered, tokio::select!, etc.), then at any given point the next_event field will hold only one of them, but it should really hold all of them.

How to fix:
These are related, so I'll discuss the fixes together.

There are a few different ways of solving this. We could refactor the entire solution to not use unsafe at all. If we want to keep the unsafe, then the core of the issue is that there is a gap in time between set_next_event in Acquire::poll and the switch_keeping_current_event in block_on. During this time, the next event could be invalidated. I therefore think switch_keeping_current_event is "too powerful" and it is unfortunate that it is pub(crate). Even if this were refactored, the gap between the Poll::Pending and the sleep_unless_woken in block_on still exists and needs to be accommodated.

Thus, we still need a way to set the event and also a way to accommodate multiple events. It is perfectly valid to try to acquire one BatchSemaphore, then another, then go acquire some Atomic or whatever, but at that point only the Atomic is relevant: the BatchSemaphore acquires are not actually the next event. The only case where the BatchSemaphore acquires are actually the next event is when they lead to a Poll::Pending in block_on. This means this mechanism only has to support multiple BatchSemaphore acquires, and has to support the BatchSemaphore being dropped between the Acquire::poll and the switch.

The immediate way I can think of to solve that is to add something in the Drop handler of Acquire which tries to remove the entry on Drop, similar to how we currently remove waiters. This assumes that the lifetime of the Acquire is less than the lifetime of the BatchSemaphore (which it is), and that the lifetime of the Acquire is less than that of the task that owns it (which it also should be; it would be a bunch of trys regardless, as there could very well be a different next event by then, and we could be somewhere in shutdown). As for supporting multiple events: I think it should be a fairly simple change to have Event::BatchSemaphoreAcq support multiple semaphores. This whole scheme requires this to be the only next event where event multiplexing is possible. We know this is the case at the moment, as all the other events call switch directly, which is guaranteed to uphold the SAFETY requirements. I think this will remain the case in the future as well, since the plan is to model everything via the BatchSemaphore, but it is something to be aware of if we ever add more Events, particularly Events in an async context.
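A very rough sketch of those two pieces is below; every name here is a stand-in, and the real change would go through the existing ExecutionState/waiter machinery rather than a &mut Task field:

```rust
pub struct ResourceSignature {
    pub id: u64,
}

// The event variant grows to hold every semaphore the task is currently
// blocked on, instead of just the most recent one.
pub enum Event<'a> {
    BatchSemaphoreAcq(Vec<&'a ResourceSignature>),
}

pub struct Task<'a> {
    next_event: Option<Event<'a>>,
}

impl<'a> Task<'a> {
    // Called from Acquire::poll: add this semaphore to the pending set.
    fn record_acquire(&mut self, sig: &'a ResourceSignature) {
        match &mut self.next_event {
            Some(Event::BatchSemaphoreAcq(sigs)) => sigs.push(sig),
            _ => self.next_event = Some(Event::BatchSemaphoreAcq(vec![sig])),
        }
    }

    // Called (best effort) from Acquire's Drop: remove the entry again, so a
    // cancelled acquire never leaves a dangling signature behind.
    fn forget_acquire(&mut self, sig: &ResourceSignature) {
        if let Some(Event::BatchSemaphoreAcq(sigs)) = &mut self.next_event {
            sigs.retain(|s| !std::ptr::eq(*s, sig));
            if sigs.is_empty() {
                self.next_event = None;
            }
        }
    }
}

struct Acquire<'a> {
    signature: &'a ResourceSignature,
    task: &'a mut Task<'a>,
}

impl Drop for Acquire<'_> {
    fn drop(&mut self) {
        // In the real code this would be a chain of "try" lookups (the task
        // may have a different next event by now, or we may be in shutdown).
        self.task.forget_acquire(self.signature);
    }
}
```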

All this said:

I am not a big fan of how easy it was to violate the SAFETY contract, which I guess is true for a lot of unsafe. I do generally agree with Dylan's comments above about unsafe vs Rc or other solutions. I'm trying to think of a way to refactor to facilitate this. Ideally ResourceSignatures would be 'static, which we could facilitate, though it would be kind of annoying to support with const constructors. It may be that the only way to do it would be to "lazily upgrade" a ResourceSignature when it is used in an Event.

cc @jorajeev

@dylanjwolff (Contributor, Author) commented

@sarsko Thanks for the detailed writeup. I need some more time to think about all of it, but as a quick alternative option to get rid of the unsafe: we could create a small Copy struct that contains only signature hash(es) and the address of the Signature as a usize. This avoids the issues with Rc and I think should be reasonably fast as long as the struct fits in SIMD registers.
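Something along these lines (the field names, the hash() helper, and the ResourceSignature stub are all illustrative):

```rust
// Stub standing in for the real signature type.
pub struct ResourceSignature {
    pub name: &'static str,
}

impl ResourceSignature {
    // Placeholder hash; the real type would expose (or precompute) one.
    pub fn hash(&self) -> u64 {
        0xdead_beef
    }
}

// Small Copy struct recorded on the Task instead of a borrow: it carries the
// signature's hash plus the signature's address as a plain usize. The address
// is only an identity token and is never dereferenced, so nothing dangles if
// the resource is dropped; no Rc, no heap allocation, and it stays cheap to
// copy around.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct SignatureRef {
    pub hash: u64,
    pub addr: usize,
}

impl SignatureRef {
    pub fn of(sig: &ResourceSignature) -> Self {
        SignatureRef {
            hash: sig.hash(),
            addr: sig as *const ResourceSignature as usize,
        }
    }
}
```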
