
Conversation

nishkarsh-db
Collaborator

Description

Added an executeWithRetry function in DatabricksHTTPClient and replaced all execute calls with executeWithRetry calls (except for the Thrift and UC Volume clients).
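
As a rough illustration of the shape of this change, here is a minimal sketch assuming Apache HttpClient 4.x; the class name, retried status codes, and limits below are hypothetical and not the driver's actual values:

import java.io.IOException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.CloseableHttpClient;

class RetryingHttpClient {
  private static final int MAX_ATTEMPTS = 5;          // hypothetical attempt limit
  private static final long BASE_DELAY_MILLIS = 500L; // hypothetical base backoff

  private final CloseableHttpClient httpClient;

  RetryingHttpClient(CloseableHttpClient httpClient) {
    this.httpClient = httpClient;
  }

  CloseableHttpResponse executeWithRetry(HttpUriRequest request) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
      try {
        CloseableHttpResponse response = httpClient.execute(request);
        int status = response.getStatusLine().getStatusCode();
        // Only transient statuses are retried; everything else is returned to the caller.
        if (status != 429 && status != 503) {
          return response;
        }
        response.close();
      } catch (IOException e) {
        lastFailure = e;
      }
      try {
        // Exponential backoff between attempts.
        Thread.sleep(BASE_DELAY_MILLIS * (1L << attempt));
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while waiting to retry", ie);
      }
    }
    throw lastFailure != null ? lastFailure : new IOException("Retries exhausted");
  }
}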

Testing

Tested through existing unit tests (some mocks were updated). Also added new unit tests for retry handling and retry strategies.

Additional Notes to the Reviewer

Collaborator

@shivam2680 left a comment

Add more details to the PR description.

Collaborator

@shivam2680 left a comment

Reviewed up to DatabricksHTTPClient.

Comment on lines 5 to 11
THRIFT_OPEN_SESSION,
THRIFT_CLOSE_SESSION,
THRIFT_METADATA,
THRIFT_CLOSE_OPERATION,
THRIFT_CANCEL_OPERATION,
THRIFT_EXECUTE_STATEMENT,
THRIFT_FETCH_RESULTS,
Collaborator

do we need the THRIFT_ prefix here?

@shivam2680 changed the base branch from main to retry-unification on September 12, 2025 at 06:44
Collaborator

@shivam2680 left a comment

Comment on lines 63 to 65
if (accumulatedTimeTempUnavailable <= 0 || accumulatedTimeRateLimit <= 0) {
  return response;
}
Collaborator

I would still recommend keeping a timestamp for the request start and comparing it against the current timestamp to check the timeout. The current approach is also correct; I will leave this decision to you.
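
A minimal sketch of the timestamp-based check being suggested, assuming a single wall-clock retry budget per request; the names and the budget value are hypothetical:

class RetryDeadline {
  private static final long RETRY_TIMEOUT_MILLIS = 15 * 60 * 1000L; // hypothetical overall budget
  private final long requestStartMillis = System.currentTimeMillis();

  // Compare elapsed time since the request started against the budget,
  // instead of decrementing separate accumulators per status code.
  boolean hasTimedOut() {
    return System.currentTimeMillis() - requestStartMillis >= RETRY_TIMEOUT_MILLIS;
  }
}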

THRIFT_FETCH_RESULTS(RequestRetryability.NON_IDEMPOTENT),
CLOUD_FETCH(RequestRetryability.IDEMPOTENT),
VOLUME_LIST(RequestRetryability.IDEMPOTENT),
VOLUME_SHOW_VOLUMES(RequestRetryability.IDEMPOTENT),
Collaborator

VOLUME_SHOW_VOLUMES -> what does this mean? how does it differ from LIST?

VOLUME_LIST(RequestRetryability.IDEMPOTENT),
VOLUME_SHOW_VOLUMES(RequestRetryability.IDEMPOTENT),
VOLUME_GET(RequestRetryability.IDEMPOTENT),
VOLUME_PUT(RequestRetryability.NON_IDEMPOTENT),
Collaborator

what happens if we retry a PUT volume request? I would assume we simply overwrite the existing file, is that not the case?

Collaborator Author

Yes, we do overwrite the existing file.

while (cause != null) {
  if (cause instanceof DatabricksRetryHandlerException) {
    throw new DatabricksHttpException(
        cause.getMessage(), cause, DatabricksDriverErrorCode.INVALID_STATE);
Collaborator

why INVALID_STATE?

Collaborator Author

It's the current behaviour of the driver.

Collaborator

@vikrantpuppala left a comment

These are severe bugs which will break retry behaviour; this also signals that the testing in this PR is not adequate. Please take a deeper look.

int statusCode = response.getStatusLine().getStatusCode();

// Get retry delay from strategy
retryDelayMillis = strategy.retryRequestAfter(response, attempt, connectionContext);
Collaborator

This is all very confusing: we calculate the exponential backoff first and then always overwrite it here. Why did we even calculate the exponential backoff at line 47?

Collaborator Author

This is because if httpClient.execute(request) throws an exception, exponential backoff is followed rather than retrying again immediately.

@Copilot left a comment

Pull Request Overview

This PR introduces request-level retry handling to the Databricks JDBC driver by implementing a strategy pattern for HTTP request retries based on request type. The retry logic differentiates between idempotent and non-idempotent requests, with configurable timeout and retry limits.

  • Replaced direct execute calls with executeWithRetry throughout the codebase
  • Added request type classification with appropriate retry strategies
  • Introduced comprehensive unit tests for retry strategies and handlers
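
A rough sketch of the strategy shape this overview describes; the interface and method names below are illustrative, not the PR's actual signatures:

import java.util.Optional;

interface RetryStrategy {
  // Delay before the next attempt, or empty if the request should not be retried.
  Optional<Long> nextDelayMillis(int statusCode, Optional<Long> retryAfterMillis, int attempt);
}

// Idempotent requests may be retried on transient failures with exponential backoff.
final class IdempotentStrategy implements RetryStrategy {
  @Override
  public Optional<Long> nextDelayMillis(int statusCode, Optional<Long> retryAfterMillis, int attempt) {
    if (statusCode == 429 || statusCode >= 500) {
      // Honour Retry-After when present, otherwise back off exponentially.
      return Optional.of(retryAfterMillis.orElse(500L * (1L << attempt)));
    }
    return Optional.empty();
  }
}

// Non-idempotent requests are retried only when the server explicitly asks via Retry-After.
final class NonIdempotentStrategy implements RetryStrategy {
  @Override
  public Optional<Long> nextDelayMillis(int statusCode, Optional<Long> retryAfterMillis, int attempt) {
    if ((statusCode == 429 || statusCode == 503) && retryAfterMillis.isPresent()) {
      return retryAfterMillis;
    }
    return Optional.empty();
  }
}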

Reviewed Changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.

Summary per file:
Test files: Updated all test mocks to use executeWithRetry instead of execute
New test classes: Added comprehensive tests for retry strategies and handlers
HttpRequestTypeBasedRetryHandler: Main retry handler implementing the strategy pattern
IRetryStrategy: Interface for retry strategies
IdempotentRetryStrategy: Strategy for idempotent requests with exponential backoff
NonIdempotentRetryStrategy: Strategy for non-idempotent requests respecting Retry-After headers
HTTPRequestType enum: Classification of request types with retryability
HTTP client implementations: Added executeWithRetry methods and integration


 * @return a jittered delay value between value and value * 1.2
 */
public static int addJitter(int value) {
  return (int) (value * (1.0 + (RANDOM.nextDouble() * 0.2)));
Collaborator

can we not have a separate function for this?

Collaborator Author

I think the code looks more readable with this being a separate function.

Collaborator

It adds unnecessary complexity and multiple function definitions to go through. Also, define these values as constants.
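
For illustration, one way the constants could be named (a sketch only; the constant names are hypothetical, not the PR's):

import java.util.Random;

final class JitterUtil {
  private static final Random RANDOM = new Random();
  private static final double JITTER_BASE_FACTOR = 1.0;
  private static final double JITTER_RANGE = 0.2; // adds up to +20% of the input value

  // Returns a value in [value, value * 1.2).
  static int addJitter(int value) {
    return (int) (value * (JITTER_BASE_FACTOR + RANDOM.nextDouble() * JITTER_RANGE));
  }
}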

prepareRequestHeaders(request, supportGzipEncoding);

IRetryStrategy strategy = RetryUtils.getRetryStrategy(requestType);
LOGGER.debug(
Collaborator

Nice. @samikshya-db, can you help nishkarsh with adding retry logs to telemetry as well?

requestType,
strategy.getClass().getSimpleName());

RetryTimeoutManager retryTimeoutManager = new RetryTimeoutManager(connectionContext);
Collaborator

It's not efficient to initialize this for every request. You are only passing the connection context here, so you can move it to the class constructor.

Collaborator Author

But we need the retry timeout values to be reset for each request, and storing them in the HTTP client could be dangerous in multi-threaded scenarios.

Collaborator

@shivam2680 left a comment

LGTM! Please wait for @vikrantpuppala's approval.

public CloseableHttpResponse executeWithRetry(
    HttpUriRequest request, RequestType requestType, boolean supportGzipEncoding)
    throws DatabricksHttpException {
  prepareRequestHeaders(request, supportGzipEncoding);
Collaborator

This is already happening in execute; I don't think we need this here.

int retryAttempt = 0;

while (true) {
  // follow exponential backoff if executing the request throws IOException
Collaborator

remove comment

CloseableHttpResponse response = httpClient.execute(request);
int statusCode = response.getStatusLine().getStatusCode();
Optional<Integer> retryAfterHeader = RetryUtils.extractRetryAfterHeader(response);
// Get retry delay from strategy
Collaborator

remove obvious comments

// Get retry delay from strategy
Optional<Integer> retryDelay =
    strategy.retryRequestAfter(
        statusCode, retryAfterHeader, retryAttempt, connectionContext);
Collaborator

Change this method to shouldRetry(), which should only return a boolean.

Collaborator

This should take in your timeout manager as well and check whether your timeouts have already exceeded their limits.
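
A sketch of the boolean shouldRetry() shape suggested in these two comments; the RetryBudget type and its isExhausted() method are hypothetical stand-ins for the PR's timeout manager:

import java.util.Optional;

interface RetryBudget {
  boolean isExhausted();
}

interface RetryDecision {
  // True only when the status code is retriable for this request type and the
  // overall retry budget has not yet been spent.
  boolean shouldRetry(int statusCode, Optional<Integer> retryAfterSeconds,
                      int attempt, RetryBudget budget);
}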

retryAttempt,
requestType,
e.getMessage());
if (!retryTimeoutManager.evaluateRetryDecisionForException(strategy, e, retryDelayMillis)) {
Collaborator

Replace this with strategy.shouldRetryException, which takes in your timeoutManager.

boolean isStatusCodeRetriable(int statusCode, IDatabricksConnectionContext connectionContext);

/* Returns the delay in milliseconds after which a request should be retried, or empty if it shouldn't be retried */
Optional<Integer> retryRequestAfter(
Collaborator

remove this completely

int statusCode, Optional<Integer> retryDelayMillis) {
  if (retryDelayMillis.isEmpty()) {
    return false;
  }
Collaborator

This should not happen here; it should only check for timeouts.

