Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash with partially bound texture arrays with variable descriptor count #2206

Open
mbechard opened this issue Apr 13, 2024 · 5 comments
Open

Comments

@mbechard
Copy link
Contributor

If I have a texture array that has been sized 10, but I'm only providing 8 elements in a call to vkUpdateDescriptorSets(), it doesn't seem like the other elements in the array in MVKDescriptorSet get cleared out. This can cause a crash with deleted resources, since bindings for elements 8 and 9 will still try to get bound in MVKDescriptorSetLayoutBinding::bind. That function loops over the entire length of the variableDescriptorCount, and doesn't seem to account for partially bound sizes.
Happy to do a PR, just looking for confirmation my assumptions are correct.
I would think the solution is to call MVKImageDescriptor::reset() on every element of the array that isn't set in the Update. Is that correct?
Thanks

@mbechard
Copy link
Contributor Author

mbechard commented Apr 15, 2024

Thinking about this more. I think when a MVK resource is destroyed it needs to know all of the partially bound descriptors that reference it, and clears out those references. Otherwise dangling references to resources can occur when converting the MVK state into Metal calls.
My test case only uses variable sized descriptors, but I'm thinking this crash would affect statically sized descriptors as well that are partially bound.

@billhollings
Copy link
Contributor

Perhaps I'm missing something here.

vkUpdateDescriptorSets() is specifically designed to update only the descriptors that you indicate, and leave the others unchanged.

What behaviour are you seeing on other Vulkan platforms? What kind of crash are you seeing here? Do you have a call trace, or a small app that can demonstrate the issue you're encountering?

I think when a MVK resource is destroyed it needs to know all of the partially bound descriptors that reference it

This is existing behaviour. The resources are retained until they are removed from the descriptors. However, Vulkan does require that an app not destroy resources while they are in-flight in an executing command buffer, and that a descriptor holding a destroyed resource must not be used further while it holds that resource.

@mbechard
Copy link
Contributor Author

mbechard commented Apr 16, 2024

It works fine on Windows for all GPUs there. There are two crashes so far I've seen that show up.
Here is the call stack from the first crash. The mtlTexture passed into setFragmentTexture is invalid.

#0	0x00000001a23e5834 in objc_msgSend ()
#1	0x00000001a2fcd6e8 in -[MTLDebugRenderCommandEncoder setFragmentTexture:atIndex:] ()
#2	0x00000001187fe6f8 in MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20::operator()(MVKCommandEncoder*, MVKMTLTextureBinding&) const at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:1110
#3	0x00000001187fe6b0 in decltype(std::declval<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&>()(std::declval<MVKCommandEncoder*>(), std::declval<MVKMTLTextureBinding&>())) std::__1::__invoke[abi:v15006]<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*, MVKMTLTextureBinding&>(MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/invoke.h:394
#4	0x00000001187fe654 in void std::__1::__invoke_void_return_wrapper<void, true>::__call<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*, MVKMTLTextureBinding&>(MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/invoke.h:479
#5	0x00000001187fe620 in std::__1::__function::__alloc_func<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20, std::__1::allocator<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20>, void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()[abi:v15006](MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:185
#6	0x00000001187fd49c in std::__1::__function::__func<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20, std::__1::allocator<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20>, void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()(MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:359
#7	0x00000001187e033c in std::__1::__function::__value_func<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()[abi:v15006](MVKCommandEncoder*&&, MVKMTLTextureBinding&) const at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:512
#8	0x00000001187e02dc in std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()(MVKCommandEncoder*, MVKMTLTextureBinding&) const at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:1197
#9	0x00000001187d8fec in void MVKResourcesCommandEncoderState::encodeBinding<MVKMTLTextureBinding, MVKSmallVectorImpl<MVKMTLTextureBinding, mvk_smallvector_allocator<MVKMTLTextureBinding, 8> > >(MVKSmallVectorImpl<MVKMTLTextureBinding, mvk_smallvector_allocator<MVKMTLTextureBinding, 8> >&, bool&, std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.h:455
#10	0x00000001187d891c in MVKGraphicsResourcesCommandEncoderState::encodeBindings(MVKShaderStage, char const*, bool, std::__1::function<void (MVKCommandEncoder*, MVKMTLBufferBinding&)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLBufferBinding&, MVKArrayRef<unsigned int const>)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLSamplerStateBinding&)>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:901
#11	0x00000001187d9e14 in MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:1088
#12	0x00000001186cdef0 in MVKCommandEncoderState::encode(unsigned int) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.h:93
#13	0x00000001186cde14 in MVKCommandEncoder::finalizeDrawState(MVKGraphicsStage) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandBuffer.mm:714

Here is the call stack from the second. This one is much rarer. In this one the MVKImageViewPlane::_imageView is NULL.
What's interesting here though is that at this point the _planes member of MVKImageView seems to be empty. And since getMTLTexture can only be called if _planes.size() > 0, I think it's a race condition where the MVKImageView is getting destroy() called on it while this code is executing. Likely because I've called vkDestroyImageView in another thread.

#1	0x000000011868dcc4 in MVKImageViewPlane::getMTLTexture() at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm:1566
#2	0x000000011871ec54 in MVKImageView::getMTLTexture(unsigned char) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.h:574
#3	0x000000011873a0f4 in MVKImageDescriptor::bind(MVKCommandEncoder*, VkPipelineBindPoint, MVKDescriptorSetLayoutBinding*, unsigned int, bool*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:918
#4	0x0000000118735f44 in MVKDescriptorSetLayoutBinding::bind(MVKCommandEncoder*, VkPipelineBindPoint, MVKDescriptorSet*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:222
#5	0x000000011861a55c in MVKDescriptorSetLayout::bindDescriptorSet(MVKCommandEncoder*, VkPipelineBindPoint, unsigned int, MVKDescriptorSet*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:46
#6	0x000000011873f4c8 in MVKPipelineLayout::bindDescriptorSets(MVKCommandEncoder*, VkPipelineBindPoint, MVKArrayRef<MVKDescriptorSet*>, unsigned int, MVKArrayRef<unsigned int>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:60

I also notice that if I use MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS_STYLE_IMMEDIATE_ENCODING, or ensure all my resource destructions happen in my main thread (instead of a worker thread), I can't make the crash occur. So that's another hint it's a race condition.

What I think is happening is that the worker thread is clearing up resources at the same time the the descriptor binding operations are occurring inside of MoltenVK. For the first case the MVKGraphicsResourcesCommandEncoderState was given a MTLTexture that has since been deleted. For the second case the MVKImageView had it's _planes object cleaned up between the start of the getMTLTexture and the end of it (since it shouldn't be able to get into MVKImageViewPlane::getMTLTexture() when _planes.size() == 0), which it is when I inspect the stack at this point.

So the question is, is it legal to destroy resources pointed to by a descriptor, while that descriptor is part of a descriptor set that is being used, even though that particular descriptor isn't referenced by the shader?

@billhollings
Copy link
Contributor

So the question is, is it legal to destroy resources pointed to by a descriptor, while that descriptor is part of a descriptor set that is being used, even though that particular descriptor isn't referenced by the shader?

As I mention above, Vulkan does require that an app not destroy resources while they are in-flight in an executing command buffer, and that a descriptor holding a destroyed resource must not be used further while it holds that destroyed resource:

The application must not destroy any other type of Vulkan object until all uses of that object by the device (such as via command buffer execution) have completed.

The following Vulkan objects must not be destroyed while any command buffers using the object are in the pending state:

  • ...
  • VkBuffer
  • VkBufferView
  • VkImage
  • VkImageView
  • ...

It's possible that you may not be encountering this on other platforms, because the representation within the platform command encoder may not depend on the Vulkan resource object the way it does in MoltenVK and Metal. To improve performance, MoltenVK deliberately uses a MTLCommandBuffer that does not retain the references to the MTLTextures and MTLBuffers.

Having said that, MoltenVK does retain the resource objects within descriptors, but if you also destroy the descriptor set on your worker thread, then it might all disappear too early.

@mbechard
Copy link
Contributor Author

mbechard commented Apr 17, 2024

Thanks for your continued attention to this. The spec also states:

VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT indicates that descriptors in this binding that are not dynamically used need not contain valid descriptors at the time the descriptors are consumed. A descriptor is dynamically used if any shader invocation executes an instruction that performs any memory access using the descriptor. If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

This clarification was only added recently, in spec v1.3.210.

There is also this validation layer issue that mentions a similar workflow:
KhronosGroup/Vulkan-Docs#1794

I do understand why this would be problematic for MoltenVK though, as the translation layer between it and Metal requires accessing resources on the CPU to do the translation that would otherwise not be accessed by other implementations.

Is my understanding correct? If so, I can try to think of a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants