[BUG] Failure to deploy model due to VersionConflict #4334

@maxlepikhin

Description

What is the bug?
Model deployment fails with a version conflict exception. I suspect that concurrent updates across multiple nodes step on each other, and that the update is not retried when the exception occurs.
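
The suspected failure mode (a conditional write that hits a conflict once and is never retried) can be illustrated with a small, self-contained sketch. It does not use the OpenSearch or ML Commons APIs: `Store`, `Versioned`, `VersionConflict`, and `updateWithRetry` are hypothetical names, the `state=...;node=...` bodies are made-up values, and the in-memory store bumps a per-document counter, whereas OpenSearch assigns sequence numbers per shard. The point is only that each attempt re-reads the document before writing, so a conflict is retried against fresh state instead of surfacing as a failed deployment.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class ConflictRetrySketch {

    /** Immutable snapshot: document body plus the seqNo it was written at. */
    static final class Versioned {
        final String body;
        final long seqNo;
        Versioned(String body, long seqNo) { this.body = body; this.seqNo = seqNo; }
    }

    /** Raised when a conditional write carries a stale seqNo. */
    static final class VersionConflict extends RuntimeException {
        VersionConflict(long required, long current) {
            super("version conflict, required seqNo [" + required
                + "], current document has seqNo [" + current + "]");
        }
    }

    /** Hypothetical in-memory stand-in for one document in an index. */
    static final class Store {
        private final AtomicReference<Versioned> doc;
        Store(String body, long seqNo) { doc = new AtomicReference<>(new Versioned(body, seqNo)); }

        Versioned get() { return doc.get(); }

        /** Writes newBody only if the document is still at ifSeqNo. */
        Versioned update(String newBody, long ifSeqNo) {
            while (true) {
                Versioned current = doc.get();
                if (current.seqNo != ifSeqNo) {
                    throw new VersionConflict(ifSeqNo, current.seqNo);
                }
                Versioned next = new Versioned(newBody, current.seqNo + 1);
                if (doc.compareAndSet(current, next)) {
                    return next;
                }
                // Lost a race with another thread; loop to re-check the seqNo.
            }
        }
    }

    /** Read-modify-write that re-reads and retries on conflict, up to maxAttempts. */
    static Versioned updateWithRetry(Store store, UnaryOperator<String> change, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        VersionConflict last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Versioned seen = store.get(); // fresh read on every attempt
            try {
                return store.update(change.apply(seen.body), seen.seqNo);
            } catch (VersionConflict e) {
                last = e; // another writer got in first; try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        Store store = new Store("state=DEPLOYING", 19);
        Versioned stale = store.get();                 // one writer reads at seqNo 19
        store.update("state=DEPLOYING;node=1", 19);    // another writer moves it to 20
        try {
            store.update("state=DEPLOYING;node=2", stale.seqNo); // stale write, no retry
        } catch (VersionConflict e) {
            System.out.println("no retry: " + e.getMessage());
        }
        Versioned result = updateWithRetry(store, body -> body + ";node=2", 3);
        System.out.println("with retry: " + result.body + " @ seqNo " + result.seqNo);
    }
}
```

Whether the actual fix belongs in the plugin, in the metadata client, or in a retry setting on the update request is not something this sketch can answer; it only shows the shape of the retry.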

Exception

[2025-10-21T18:09:30,853][ERROR][o.o.m.a.d.TransportDeployModelAction] [opensearch-cluster-nodes-1] Failed to deploy model ROP0B5oBBqOXgeKlR2fw
org.opensearch.OpenSearchStatusException: Document version conflict updating ROP0B5oBBqOXgeKlR2fw in index .plugins-ml-model
	at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$updateDataObjectAsync$2(LocalClusterIndicesClient.java:245) [opensearch-remote-metadata-sdk-3.3.0.0.jar:?]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.3.0.jar:3.3.0]
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:124) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.action.update.TransportUpdateAction.lambda$innerExecute$0(TransportUpdateAction.java:207) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.3.0.jar:3.3.0]
	at org.opensearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$1.handleException(TransportInstanceSingleOperationAction.java:245) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:428) [opensearch-security-3.3.0.0.jar:3.3.0.0]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1607) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleException$0(NativeMessageHandler.java:495) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.NativeMessageHandler.handleException(NativeMessageHandler.java:493) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.NativeMessageHandler.handlerResponseError(NativeMessageHandler.java:485) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.NativeMessageHandler.handleMessage(NativeMessageHandler.java:195) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.NativeMessageHandler.messageReceived(NativeMessageHandler.java:149) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundHandler.messageReceivedFromPipeline(InboundHandler.java:152) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:144) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:804) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundBytesHandler.forwardFragments(InboundBytesHandler.java:137) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundBytesHandler.doHandleBytes(InboundBytesHandler.java:77) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:124) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:113) [opensearch-3.3.0.jar:3.3.0]
	at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-3.3.0.jar:3.3.0]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1519) [netty-handler-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1377) [netty-handler-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1428) [netty-handler-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:530) [netty-codec-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:469) [netty-codec-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:697) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:660) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.125.Final.jar:4.1.125.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.125.Final.jar:4.1.125.Final]
	at java.base/java.lang.Thread.run(Thread.java:1447) [?:?]
Caused by: org.opensearch.index.engine.VersionConflictEngineException: [ROP0B5oBBqOXgeKlR2fw]: version conflict, required seqNo [19], primary term [1]. current document has seqNo [20] and primary term [1]
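
For triage, the `Caused by` line is the informative part: the write was conditioned on seqNo 19, but the document had already moved to seqNo 20, with the primary term unchanged at 1. That pattern is consistent with another write landing between this writer's read and its update. The sketch below pulls those four numbers out of such a line. `ConflictMessageSketch`, `Conflict`, and `parse` are hypothetical helper names, and the regular expression is derived only from the message text shown above, so it may need adjusting for other versions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConflictMessageSketch {

    /** The four numbers carried by a version-conflict message. */
    static final class Conflict {
        final long requiredSeqNo;
        final long requiredTerm;
        final long currentSeqNo;
        final long currentTerm;

        Conflict(long requiredSeqNo, long requiredTerm, long currentSeqNo, long currentTerm) {
            this.requiredSeqNo = requiredSeqNo;
            this.requiredTerm = requiredTerm;
            this.currentSeqNo = currentSeqNo;
            this.currentTerm = currentTerm;
        }

        /** True when only the seqNo moved, i.e. the primary term is unchanged. */
        boolean samePrimaryTerm() {
            return requiredTerm == currentTerm;
        }
    }

    // Pattern based on the message format in the log excerpt above (an assumption
    // about the format, not a documented contract).
    private static final Pattern CONFLICT = Pattern.compile(
        "required seqNo \\[(\\d+)\\], primary term \\[(\\d+)\\]\\. "
            + "current document has seqNo \\[(\\d+)\\] and primary term \\[(\\d+)\\]");

    /** Returns the parsed numbers, or empty if the line is not a conflict message. */
    static Optional<Conflict> parse(String logLine) {
        Matcher m = CONFLICT.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Conflict(
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4))));
    }

    public static void main(String[] args) {
        String line = "Caused by: org.opensearch.index.engine.VersionConflictEngineException: "
            + "[ROP0B5oBBqOXgeKlR2fw]: version conflict, required seqNo [19], primary term [1]. "
            + "current document has seqNo [20] and primary term [1]";
        parse(line).ifPresent(c -> System.out.println(
            "required=" + c.requiredSeqNo + " current=" + c.currentSeqNo
                + " samePrimaryTerm=" + c.samePrimaryTerm()));
    }
}
```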

How can one reproduce the bug?

  1. Give multiple nodes the ml role. In our case, there are 3 nodes, all with the same roles.
  2. Attempt to deploy a model from a mounted disk.

What is the expected behavior?
Model deployment succeeds.

What is your host/environment?
Ubuntu 24.04 in minikube

Do you have any screenshots?
N/A

Do you have any additional context?
N/A

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)