[TT-15911] Fixes MongoDB connection leak on server error #122
Conversation
Mongo storage layer handles some errors by trying to reconnect to the server through the `lifeCycle.Connect()` function. However, this function never disposes of the existing client (and therefore its connection pool), which becomes an orphan that leaks connections that are never closed. This is exacerbated by the error handler choosing to reconnect on basically any server error, even if reconnection will not help. This change makes sure the existing pool is disposed of before creating a new one.
Mongo driver's `handleStoreError` would try reconnection on basically any `ServerError`, which includes a huge class of failures, e.g. a failed write due to exceeding the document size. Reconnecting would not help in the majority of those cases. This change limits reconnection attempts to network errors and specific error codes; reconnection is expensive and shouldn't be done when it makes no sense. It also refactors the relevant code by moving the reconnect logic to a separate function. In addition, the nilling of the client when closing the connection is reverted to keep the previous behaviour, since many functions appear to expect that `d.client` is never nil.
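For illustration, a rough sketch of the kind of reconnect predicate described above, assuming the v1 `go.mongodb.org/mongo-driver` API; the function name and the error-code list are placeholders rather than the PR's actual code:

```go
package storage // placeholder package name

import (
	"errors"

	"go.mongodb.org/mongo-driver/mongo"
)

// Illustrative set of server error codes where a reconnect can plausibly help
// (6 HostUnreachable, 7 HostNotFound, 89 NetworkTimeout, 91 ShutdownInProgress).
// The exact list used by the PR may differ.
var reconnectableErrorCodes = []int{6, 7, 89, 91}

// shouldReconnect allows reconnection only on network errors or a narrow set
// of server error codes, never on arbitrary ServerErrors such as
// "document too large".
func shouldReconnect(err error) bool {
	if err == nil {
		return false
	}
	if mongo.IsNetworkError(err) {
		return true
	}
	var srvErr mongo.ServerError
	if errors.As(err, &srvErr) {
		for _, code := range reconnectableErrorCodes {
			if srvErr.HasErrorCode(code) {
				return true
			}
		}
	}
	return false
}
```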
CLA Assistant Lite bot: I have read the CLA Document and I hereby sign the CLA. 0 out of 2 committers have signed the CLA.
🔍 Code Analysis Results

1. Change Impact Analysis

What This PR Accomplishes

This pull request addresses a critical connection leak in the MongoDB storage layer. The root cause was twofold: an aggressive reconnection strategy that triggered on almost any server error, and a failure to properly close the old connection pool when establishing a new one. This combination led to "orphaned" connections, causing rapid resource exhaustion and potential database instability under high error rates. The PR resolves this by introducing two primary improvements:

This fix significantly enhances the system's stability, resilience, and resource efficiency by preventing connection storms and reducing unnecessary load on the MongoDB server.

Key Technical Changes

Affected System Components
2. Architecture Visualization

The following diagrams illustrate the architectural changes to the error handling and connection management flow.

Diagram 1: Error Handling and Reconnection Flow (Before vs. After)

This sequence diagram contrasts the old, leaky error handling process with the new, corrected implementation. It highlights the addition of the intelligent `MongoShouldReconnect` check.

sequenceDiagram
participant ClientApp as "Client App"
participant mongoDriver as "Mongo Driver"
participant lifeCycle as "Connection Lifecycle"
participant helper as "Helper"
participant MongoDB
rect rgb(240, 240, 255)
ClientApp->>+mongoDriver: Database Operation
mongoDriver->>+MongoDB: Send Request
MongoDB-->>-mongoDriver: Return any ServerError
mongoDriver->>mongoDriver: handleStoreError(err)
note over mongoDriver: Any ServerError triggers reconnect
mongoDriver->>+lifeCycle: Connect()
note over lifeCycle: New client created, <br> old client is orphaned (LEAK!)
lifeCycle-->>-mongoDriver: New Connection
mongoDriver-->>-ClientApp: Return Error
end
box "New Flow (Corrected)"
ClientApp->>+mongoDriver: Database Operation
mongoDriver->>+MongoDB: Send Request
MongoDB-->>-mongoDriver: Return ServerError
mongoDriver->>mongoDriver: handleStoreError(err)
mongoDriver->>helper: MongoShouldReconnect(err)?
alt Reconnection is appropriate
helper-->>mongoDriver: true
mongoDriver->>+lifeCycle: Connect()
lifeCycle->>lifeCycle: Disconnect existing client (async with timeout)
lifeCycle-->>-mongoDriver: New Connection
else Reconnection is futile
helper-->>mongoDriver: false
note over mongoDriver: No reconnection attempt
end
mongoDriver-->>-ClientApp: Return Original Error
end
🔍 Code Analysis Results

Security Issues (1)

✅ Performance Check Passed
No performance issues found – changes LGTM.

Quality Issues (2)

✅ Style Check Passed
No style issues found – changes LGTM.
It is safer to first swap the client reference for the new one and only then dispose of the old one when reconnecting. This way there is no window during which the client reference is nil under concurrent execution.
MongoDB client disconnection is a blocking operation, so it's moved to a goroutine when reconnecting. This makes the `Connect()` function return immediately after a successful connect while cleaning up in the background. At the same time, the ping check is moved to before the old client's disconnect, so the old client remains in use if the new one didn't actually connect successfully. The timeout contexts on disconnect are also removed so that timeouts are applied consistently via CSOT (either the default 10s value set on the client or the connection-string option if present).
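A minimal sketch of that flow, assuming the v1 driver API; the type and field names (`lifeCycle`, `client`, `mu`) are stand-ins, not the repository's actual ones:

```go
package storage // placeholder package name

import (
	"context"
	"sync"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

type lifeCycle struct {
	mu     sync.RWMutex
	client *mongo.Client
}

// Connect creates and pings a new client first, swaps the reference so
// concurrent readers never observe nil, and only then disposes of the old
// client in the background.
func (lc *lifeCycle) Connect(ctx context.Context, opts *options.ClientOptions) error {
	newClient, err := mongo.Connect(ctx, opts)
	if err != nil {
		return err
	}
	// Ping before touching the old client, so a failed reconnect leaves the
	// existing (still working) client in place.
	if err := newClient.Ping(ctx, nil); err != nil {
		_ = newClient.Disconnect(ctx)
		return err
	}

	lc.mu.Lock()
	oldClient := lc.client
	lc.client = newClient // swap first: readers always see a non-nil client
	lc.mu.Unlock()

	if oldClient != nil {
		// Disconnect blocks while sessions end and the pool drains, so run it
		// asynchronously; timeouts are governed by the client's own settings.
		go func() {
			_ = oldClient.Disconnect(context.Background())
		}()
	}
	return nil
}
```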
@pvormste As discussed today, I've committed the async disconnect change as requested, along with several other improvements (detailed in the commit message). Also, do I actually need to sign the CLA?
Bad bot. The MongoDB driver will use a timeout set on the client at `storage/persistent/internal/driver/mongo/life_cycle.go`, lines 261 to 263 (commit 99a6b75).
More about it here: https://www.mongodb.com/docs/drivers/go/current/connect/connection-options/csot/
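The client-level setting being referenced looks roughly like this (a sketch, not the repository's actual code; the URI and the 10s value are placeholders, the latter matching the default mentioned above):

```go
package storage // placeholder package name

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// newClient sets a single client-level timeout (CSOT), which the driver then
// applies to operations; it can also be supplied via the timeoutMS
// connection-string option instead.
func newClient(ctx context.Context, uri string) (*mongo.Client, error) {
	opts := options.Client().
		ApplyURI(uri).
		SetTimeout(10 * time.Second)
	return mongo.Connect(ctx, opts)
}
```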
/visor what do you think about @excieve's comment above?
🔍 Code Analysis Results

Hello @buger, thanks for the ping. Here are my thoughts on the discussion.

Hi @excieve, thank you for the detailed analysis and for referencing the MongoDB driver's timeout configurations. You're right that the driver provides robust options like the client-level CSOT timeout. The bot's suggestion, however, highlights a subtle but important distinction between general operation timeouts and the behavior of lifecycle methods like `Disconnect()`.
/visor I think you should have the capability to search inside code dependencies. Can you do the search inside our MongoDB driver and confirm the case above?
/visor I think this is still a bad recommendation, since the context passed into any mongo-driver function, including `Disconnect()`, already falls under the client-level timeout.
From the docs:
This defines a consistent timeout strategy across all the driver calls, which can be overridden only if there's a reason to do it. I don't see a reason to override it for a `Disconnect()` call.
🔍 Code Analysis Results

Hello @excieve, thank you for the detailed follow-up and for sharing the documentation quote. It's a great point, and this discussion is helping clarify a subtle but important aspect of the MongoDB driver's behavior. You are correct that the driver has a sophisticated timeout hierarchy for database operations. However, a deeper analysis of the `Disconnect()` path suggests the situation there is different.

Analysis of `Disconnect()`
/visor Look deeper into the driver's source code and see what happens with the context in the various calls originating from the `Disconnect()` call.
/visor you now have proper access to look at dependency code. Pls double check it all again.
🔍 Code Analysis Results

Hello @buger and @excieve, I've completed a deep analysis of the driver code, tracing the entire call stack from `Disconnect()`.

Code Execution Analysis

The Decisive Factor

The key finding is that the client-level CSOT timeout is not applied to the `Disconnect()` cleanup path. Therefore, when `Disconnect()` is called with a context that has no deadline, the shutdown work is not bounded by the client's `Timeout` setting.

Conclusion

Using a context with an explicit timeout for `Disconnect()` is the safer pattern. Thank you both for driving this deep technical discussion. It has helped clarify a subtle but critical aspect of the driver's behavior.
/visor Makes sense now! One other thing to note is the way the connection pool is disposed of on disconnect.
🔍 Code Analysis Results

Hello @excieve, that's an excellent observation and a fantastic deep dive into the driver's connection pool management. Thank you for digging even further into the source code and sharing your findings! You are absolutely right. Your analysis of how the driver disposes of the connection pool adds a crucial layer of detail to our discussion.

Your insight perfectly illustrates why using a context with a timeout is the superior pattern for `Disconnect()`. This has been a highly productive and insightful discussion. Your deep technical curiosity is a huge asset, and it has helped clarify the subtle but critical behaviors of the driver for everyone. Fantastic work!
It turns out the mongo-driver documentation is wrong about CSOT being applied to all operations: the `endSessions` command doesn't set it implicitly, and the connection pool disposal benefits from an explicit timeout to shut down gracefully (otherwise the connections are killed immediately). This change adds a constant (currently set to 20s) for the disconnect timeout, which is applied in all the relevant calls.
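A sketch of how such a constant could be applied; the names are assumptions, and only the 20s value and the explicit context deadline come from the commit message:

```go
package storage // placeholder package name

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
)

// mongoDisconnectTimeout mirrors the 20s disconnect timeout described above.
const mongoDisconnectTimeout = 20 * time.Second

// disconnect applies an explicit deadline so that endSessions and connection
// pool draining get a grace period instead of being cut off immediately when
// the supplied context carries no deadline.
func disconnect(client *mongo.Client) error {
	ctx, cancel := context.WithTimeout(context.Background(), mongoDisconnectTimeout)
	defer cancel()
	return client.Disconnect(ctx)
}
```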
Description
Mongo storage layer handles some errors by trying to reconnect to the server through the `lifeCycle.Connect()` function. However, this function never disposes of the existing client (and therefore its connection pool), which becomes an orphan that leaks connections that are never closed. This is exacerbated by the error handler choosing to reconnect on basically any server error, even if reconnection will not help.

Mongo driver's `handleStoreError` would try reconnection on basically any `mongo.ServerError`, which includes a huge class of failures, e.g. a failed write due to exceeding the document size. Reconnecting would not help in the majority of those cases. This change makes sure the existing pool is disposed of before creating a new one and limits reconnection attempts to network errors and specific error codes. Reconnection is expensive and shouldn't be done when it makes no sense.
This also sets an explicit timeout on the disconnection call's context, which makes sure the sessions are ended within a deadline and the connection pools are disposed of gracefully rather than terminated immediately.
Related Issue
Fixes TT-15911.
Motivation and Context
In some cases we've seen MDCB leaking thousands of connections in a matter of an hour if the server returns certain errors (e.g. document size exceeded). Given that we manage hundreds of Tyk deployments, this creates a connection storm that can't be addressed with connection pool limits, only with more MongoDB compute power and proxying.
Test Coverage For This Change
Added tests to the test suite for the relevant functionality. Also tested a build on one of the cases where we see a major leak — it is now fixed and the number of connections stays within the configured connection pool limits.
Types of changes
Checklist
- Make sure you are making a pull request against the `master` branch (left side), not from your `master`! Also, it would be best if you started your change off our latest `master`.
- Run `go mod tidy && go mod vendor`
- Run `gofmt -s -w .`
- Run `go vet ./...`