Skip to content

Conversation

madsmtm
Copy link
Owner

@madsmtm madsmtm commented Sep 29, 2025

#783 relaxed our safety policy to allow us to mark things as automatically safe. This would be very useful to do in objc2-metal as well, which is what this PR does.

There are a lot of wrinkles to that though, since Metal is fairly low-level, and there's a lot of seemingly innocuous things that can cause memory safety issues down the line, it effectively allows doing arbitrary things on another processor (the GPU). One example of this is MTLFunction, which amounts to an unsafe extern "C" fn() that the GPU will gladly call with whatever arguments you specify, and thus we have to mark functions taking it as unsafe.

In a similar vein, all MTLResources are marked as unsafe, as accesses to these are unsynchronized, can be unretained, and need to be bound before use. MTLBuffer is also untyped. See also the more broad questions raised in #773 (comment).

For now, I'd like to postpone the discussion of bounds safety, I've used unsafe-default-safety.not-bounds-affecting = true to do that. It still has a lot of false positives, I have fixed some of those in this PR, but we'll have to deal with the rest later. I have opened #793 to track this.

Replaces #773, #774 and #790. Closes #685. This is important for gfx-rs/wgpu#5641 and for making Metal 4 nicer to use.

@MarijnS95
Copy link
Contributor

MarijnS95 commented Oct 1, 2025

I'm trying this out on our codebase, and stuck in a hard place between using this or #773 as a basis for a change pending other changes in the upcoming release. A few function thoughts:

  • MTLHeapDescriptor setType: should remain safe or that's a semver break;
  • dispatchThreadgroupsWithIndirectBuffer:threadsPerThreadgroup: should probably be marked unsafe because it takes a MTLGPUAddress;
    • I tried debugging fn can_affect_bounds() and it should catch this, but it equally doesn't catch some MTL4BufferRange?
  • gpuResourceID() was previously unsafe, it allows to do unsafe things but isn't inherently unsafe itself so perhaps it should be treated as safe like many .as_ptr() on Rust slices etc;
  • setMaxBufferBindCount: / setMaxTextureBindCount: / setMaxSamplerStateBindCount: could probably be marked as safe, because their values are "bounds" checked later:
    -[MTLDebugDevice newArgumentTableWithDescriptor:error:]:4160: failed assertion `Argument Table Descriptor Validation
    max texture bind count must not be more than 128.
    

EDIT: I pushed the fixes at metal-safe...Traverse-Research:objc2:metal-safe

@madsmtm
Copy link
Owner Author

madsmtm commented Oct 1, 2025

  • I tried debugging fn can_affect_bounds() and it should catch this, but it equally doesn't catch some MTL4BufferRange?

That's because there's now two checks: Both can_affect_bounds, and the is_likely_bounds_affecting here:

/// Whether a parameter or function name is likely to require a bounds check.
///
/// This is only a best-effort heuristic, the library author may have called
/// this any number of other things.
pub(crate) fn is_likely_bounds_affecting(name: &str) -> bool {
let name = name.to_lowercase();
name.contains("idx")
|| name.contains("index")
|| name == "i"
|| name.contains("capacity")
|| name.contains("range")
|| name.contains("offset")
|| name.contains("count")
}

EDIT: I pushed the fixes at metal-safe...Traverse-Research:objc2:metal-safe

Thanks, I pulled that in.

@MarijnS95
Copy link
Contributor

Ah! If you extend that function, the half-assed exceptions from my commit can go again. It's surely not complete.

@madsmtm
Copy link
Owner Author

madsmtm commented Oct 1, 2025

I've looked through the documentation on Metal resources, and I think the correct choice is to mark (basically) all resource (MTLBuffer/MTLTexture/``) accesses as unsafe, since the user has to synchronize accesses to these?

I have implemented this in the latest commit, please let me know what you think?

Comment on lines 175 to 178
# SAFETY: Modifying residency is safe, it's effectively the same as
# controlling what's in the L1/L2/L3 cache on the CPU.
# protocol.MTLResidencySet.methods.requestResidency.unsafe = false
# protocol.MTLResidencySet.methods.endResidency.unsafe = false
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarijnS95 could you verify this?

As I understand it, residency sets are only an optimization knob? If a resource is needed, but isn't resident, it's still automatically made resident, right?

Copy link
Contributor

@MarijnS95 MarijnS95 Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madsmtm sure, there are two stages to it: the methods quoted here are an optimization knob to make things resident ahead of time. Everything in a residency set attached to a command buffer (or queue before a command buffer was committed to it) will be made/kept resident when the command buffer is supposedly going to be running...

Caution

Or at least so I thought, requestResidency is quite clear that it performs some work upfront to make commit: faster but endResidency directly says it makes things no longer resident which -to me- seem to defeat the advantage of attaching it to command buffers. I don't know what will happen if that's called on a MTLResidencySet that's attached to a command buffer that's Enqueued but not Completed.

What also counts is what resources are in the set at any given point in time. As far as I know the state after commit: is not buffered, so if anyone removes a resource and commits it again while a MTLCommandBuffer is still in flight that uses said resource, things could go kaboom as well when something in the system wants to reclaim memory (by calling endResidency?).

In the end I don't exactly understand the significance of these functions. Maybe we're asking too complicated questions, the documentation is quite terse.


I'm not confident what residency means on Unified Memory architectures (unless page files exist and are used in times of high memory pressure); perhaps buffers that "exist" in memory may not be in the MMU for the GPU (perhaps per-process)?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested things a bit by modifying this example to create a MTLResidencySet and add _terrainTextures to that instead of using useResource:usage:, roughly:

MTLResidencySetDescriptor *setDescriptor = [[MTLResidencySetDescriptor alloc] init];
NSError* error;
id<MTLResidencySet> residencySet = [device newResidencySetWithDescriptor:setDescriptor error:&error];
if (!residencySet) { NSLog(@"Failed to create residence set, error %@", error); }

for (int i = 0; i < _terrainTextures.size(); i++) {
    [residencySet addAllocation: _terrainTextures[i].diffSpecTextureArray];
    [residencySet addAllocation: _terrainTextures[i].normalTextureArray];
}

[residencySet commit];
[commandQueue addResidencySet: residencySet];

Things didn't work without doing either one of these, using both didn't change the output. Re-reading, I guess that's what is stated in the documentation, so that probably makes sense. It also matches what the Metal debugger showed.

I tried calling both endResidency on each frame, it didn't change the result, but the CPU usage grew from ~30% to ~50%, so that seems to only be an optimization.

In any case, pretty sure we can consider residency and useResource:usage: as safe, we can move the unsafety of not having a resource bound to setTexture:atIndex: and similar command encoder functions?

Comment on lines 183 to 186
# These affect lifetime safety, and can cause use-after-free if used incorrectly.
class.MTLCommandBufferDescriptor.methods."setRetainedReferences:".unsafe = true
protocol.MTLCommandQueue.methods.commandBufferWithUnretainedReferences.unsafe = true
protocol.MTLIOCommandQueue.methods.commandBufferWithUnretainedReferences.unsafe = true
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarijnS95 pointed out in #773 (comment):

Which is just.. Ugh! Not yet sure how we'll resolve that. Maybe all command buffer and command encoder methods need to be unsafe? Maybe it'll be fine since most of them already are?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, all dispatches on the GPU (shader dispatches, draws, copy commands, ...) should likely be unsafe because of resource lifetimes 🤷

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading the docs for commandBufferWithUnretainedReferences, it says:

Returns a command buffer from the command queue that doesn’t maintain strong references to resources.

So it sounds like the things that aren't retained only applies to MTLResource and subprotocols, and that it'd still keep a reference to any MTLFence, MTLCommandBufferHandler, MTLSamplerState etc. that you pass to the buffer / its encoders?

As such, I think we can again get around this by marking MTLResource as unsafe (and probably add another safety requirement to the docs)?

@madsmtm
Copy link
Owner Author

madsmtm commented Oct 2, 2025

I think this PR is ready now. Overall, I believe that marking MTLResource and MTLFunction as unsafe was the correct choice, and that this mostly resolves the safety concerns outlined in #773 (comment).

I've gotten a bit lost between all our threads on this, but I think I've answered all your concerns @MarijnS95? WDYT about this PR in it's current state (if you look past all the bounds-safety stuff)?

(I strongly suspect there's still more to be done here, but I'm honestly reaching my limit of the amount of unsafe code I can review, I find myself glazing over things by now - so while this PR is probably unsound in a few places, I think it's enough of an improvement that it'd be worth it to merge it now anyhow).

@madsmtm madsmtm marked this pull request as ready for review October 2, 2025 22:06
madsmtm and others added 5 commits October 4, 2025 00:07
This has a lot of false positives, but it's probably better than
nothing.
To not consider MTLIndexType and similar as bounds-affecting.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-framework Affects the framework crates and the translator for them enhancement New feature or request I-unsound A soundness hole, or affecting soundness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Safety of methods appear to be illogical
2 participants