Skip to content

Conversation

@dibahlfi
Copy link
Contributor

This PR addresses scenarios where customer provided excluded region was not being honored during certain transient failures as the SDK ends up calling global endpoint which can point to a region that is part of excluded region list.
It also adds logging enhancements to ensure when a region is marked unavailable we have the proper context.

@dibahlfi dibahlfi requested a review from a team as a code owner October 23, 2025 23:18
@Copilot Copilot AI review requested due to automatic review settings October 23, 2025 23:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where customer-provided excluded regions were not being honored during transient failures when the SDK calls the global endpoint, which could point to an excluded region. It also adds logging enhancements to provide proper context when regions are marked unavailable.

Key changes:

  • Modified region selection logic to properly respect both user-excluded and circuit-breaker-excluded locations
  • Added context parameters to endpoint unavailability tracking methods for better diagnostics
  • Updated retry policy to calculate retries based on applicable (non-excluded) regions

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test_service_request_retry_policy_async.py New async test verifying write failover respects excluded regions during service request errors
test_service_request_retry_policy.py New sync test verifying write failover respects excluded regions during service request errors
test_location_cache.py Added tests for excluded region handling, regional fallback behavior, and global endpoint fallback scenarios
_global_endpoint_manager_async.py Added context parameter to endpoint unavailability tracking methods and their call sites
_service_request_retry_policy.py Updated retry count calculation to use applicable (non-excluded) regional routing contexts and added context to mark_endpoint_unavailable calls
_location_cache.py Refactored region filtering logic to properly separate user-excluded from circuit-breaker-excluded locations and respect exclusions in global endpoint resolution
_global_endpoint_manager.py Added context parameter to endpoint unavailability tracking methods and their call sites
_endpoint_discovery_retry_policy.py Added context parameter when marking endpoints unavailable

return write_regional_routing_context.get_primary()
if self.connection_policy.EnableEndpointDiscovery:
# Get the list of applicable write locations, which respects excluded locations.
applicable_contexts = self._get_applicable_write_regional_routing_contexts(request)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will break 403.3 scenarios where we want to ignore preferred regions. I think would be better to add the needed routing logic in session retry policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tomas, I refactored the code. Do you think it will break scenarios where want to ignore preferred regions?

@dibahlfi
Copy link
Contributor Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Copy link
Contributor Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi
Copy link
Contributor Author

/azp run python - cosmos - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants