-
Notifications
You must be signed in to change notification settings - Fork 1
Swim bugs completed #141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Swim bugs completed #141
Conversation
description: Specify the name of the device | ||
family such as Switches and Hubs, etc. | ||
type: str | ||
api_task_timeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a) Can we say what is the difference b/w global timeout and this value?
b) In document you mentioned as default of 2400 seconds but that is not true. Instead can you put in recommendations?
c) Add the version when do we start to support it..
Check the sample snippet and update if required.. Also get approval from @DNACENSolutions
api_task_timeout:
description: |
Timeout duration in seconds for image distribution API operations.
Controls how long the system waits for image distribution tasks to complete,
including image transfer to target devices, network propagation, and distribution validation.
Operation phases covered by this timeout:
- Image preparation and validation on Catalyst Center
- Network transfer to target devices
- Installation verification on target devices
- Distribution status confirmation
Default of 1800 seconds (30 minutes) accounts for:
- Large image files (up to several GB)
- Multiple target devices in site-based operations
- Network latency and bandwidth constraints
- Device processing and storage capabilities
Recommended timeout values:
- Small networks (1-10 devices): 900-1800 seconds
- Medium networks (10-50 devices): 1800-3600 seconds
- Large networks (50+ devices): 3600-7200 seconds
Note: This timeout is independent of the global 'dnac_api_task_timeout' parameter
and specifically applies to distribution operations only.
type: int
default: 1800
version_added: ??????
ACCESS, BORDER ROUTER, DISTRIBUTION, and | ||
CORE. | ||
type: str | ||
api_task_timeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
api_task_timeout:
description: |
Timeout duration in seconds for image activation API operations.
Controls how long the system waits for image activation processes to complete
before timing out, including device reboot and startup verification.
Operation phases covered by this timeout:
- Image validation and preparation for activation
- Device upgrade mode processing (install/bundle/currentlyExists)
- Device reboot and startup sequence (if required)
- Post-activation connectivity and status verification
- Golden image validation (if applicable)
Default of 1800 seconds (30 minutes) accommodates:
- Device boot time variations (switches: 5-15 min, routers: 10-20 min)
- Image installation and verification processes
- Network convergence and connectivity restoration
- Multiple devices in concurrent activation scenarios
Recommended timeout values by device type:
- Access switches: 1200-1800 seconds (20-30 minutes)
- Distribution/Core switches: 1800-2700 seconds (30-45 minutes)
- Routers and complex devices: 2700-3600 seconds (45-60 minutes)
Warning: Setting timeout too low may cause premature failure reporting
for successful but slow activation processes.
type: int
default: 1800
version_added: ??????
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would ike the name api_task_timeout to be changed, it should be named for the activity it is addressing to. i.e. image_activation_timeout and image_distribution_timeout, both should be provided to user so they can individually plan to timeouts correctly. For example distribution on slow. network may take1Hr or more, activation will very based on if the device is a stack and and may vary from 10 mins to 30 minutes.
device_family_name: | ||
description: Specify the name of the device | ||
family such as Switches and Hubs, etc. | ||
type: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name "api_task_timeout" may get confused with the existing global dnac_api_task_timeout parameter. Please note that the parameter name should reflect its specific operation scope.
Can we say
distribution_task_timeout:
Timeout duration in seconds specifically for SWIM image distribution operations.
Controls how long the system waits for distribution tasks to complete....
activation_task_timeout:
Timeout duration in seconds specifically for SWIM image activation operations.
Controls how long the system waits for activation tasks to complete.....
OR
swim_operation_timeout:
Timeout duration in seconds for SWIM operations (distribution/activation)..
Overrides the global dnac_api_task_timeout for SWIM-specific tasks...........
OR
image_distribution_timeout. .. image_activation_timeout ...
Check with @DNACENSolutions and update...
type: int
default: 1800
"Site name not specified; defaulting to 'Global' to fetch all devices under this category", | ||
"INFO", | ||
) | ||
if self.compare_dnac_versions(self.get_ccc_version(), "2.3.5.3") <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_version = self.get_ccc_version()
if self.compare_dnac_versions(current_version, "2.3.5.3") <= 0:
site_name = "Global/.*"
self.log(
"Catalyst Center version {0} (≤2.3.5.3) detected - using wildcard pattern 'Global/.*' "
"to fetch devices from Global site and all child sites via legacy API".format(current_version),
"INFO",
)
else:
site_name = "Global"
self.log(
"Catalyst Center version {0} (>2.3.5.3) detected - using 'Global' site name "
"to fetch devices via enhanced site hierarchy API".format(current_version),
"INFO",
)
for item_dict in item["response"]: | ||
site_response_list.append(item_dict) | ||
else: | ||
self.log(site_name, "DEBUG") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't print the variable name directly. We instructed this many times in Code review.
return self | ||
|
||
if not isinstance(response, dict): | ||
self.msg = "response is not a dictionary" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.msg = "Invalid response format from API '{0}' - expected dictionary, got {1}".format(
api_name, type(response).__name__
)
self.status = "exited" | ||
return self | ||
|
||
task_info = response.get("response") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first check the task_info, if not then return..
task_info = response.get("response")
if not task_info:
self.msg = "Missing task information in response from API '{0}'".format(api_name)
self.log(self.msg, "ERROR")
self.status = "failed"
return self
if task_info.get("errorcode") is not None:
error_detail = task_info.get("detail", "Unknown error occurred")
self.msg = "API '{0}' returned error: {1}".format(api_name, error_detail)
self.log(self.msg, "ERROR")
self.status = "failed"
return self
self.status = "failed" | ||
return self | ||
|
||
task_id = task_info.get("taskId") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if task_id is none, then return??
if not task_id:
self.msg = "Missing taskId in response from API '{0}'".format(api_name)
self.log(self.msg, "ERROR")
self.status = "failed"
return self
return self | ||
|
||
task_id = task_info.get("taskId") | ||
start_time = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is good to determine the timeout for this operation..
# Determine appropriate timeout for this operation
if operation_details:
operation_type = api_name.lower().replace("trigger_software_image_", "").replace("_", "")
max_timeout = self.get_operation_timeout(operation_type, operation_details)
else:
max_timeout = getattr(self, 'max_timeout', self.params.get("dnac_api_task_timeout", 1800))
poll_interval = self.params.get("dnac_task_poll_interval", 2)
self.log(
"Monitoring task '{0}' from API '{1}' with timeout {2}s and poll interval {3}s".format(
task_id, api_name, max_timeout, poll_interval
),
"INFO"
)
start_time = time.time()
task_status_history = []
while True:
elapsed_time = time.time() - start_time
|
||
task_id = task_info.get("taskId") | ||
start_time = time.time() | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a sample code to check the task_status.. But you can refine and update if required..
task_status_history = []
while True:
elapsed_time = time.time() - start_time
# Check for timeout
if elapsed_time >= max_timeout:
self.msg = (
"Timeout of {0} seconds reached for task '{1}' from API '{2}'. "
"Task may still be running on Catalyst Center. "
"Consider increasing timeout or checking task status manually."
).format(max_timeout, task_id, api_name)
self.log(self.msg, "WARNING")
self.status = "failed"
break
# Get current task status
try:
task_details = self.get_tasks_by_id(task_id)
self.log(
"Retrieved task details for ID '{0}': {1}".format(task_id, task_details),
"DEBUG"
)
except Exception as e:
self.msg = "Failed to retrieve task details for ID '{0}': {1}".format(task_id, str(e))
self.log(self.msg, "ERROR")
self.status = "failed"
break
if not task_details:
self.log(
"Empty task details received for ID '{0}', retrying...".format(task_id),
"WARNING"
)
time.sleep(poll_interval)
continue
task_status = task_details.get("status", "UNKNOWN")
# Track status changes for debugging
if not task_status_history or task_status_history[-1] != task_status:
task_status_history.append(task_status)
self.log(
"Task '{0}' status changed to: {1} (elapsed: {2:.1f}s)".format(
task_id, task_status, elapsed_time
),
"INFO"
)
# Handle terminal states
if task_status == "FAILURE":
try:
failure_details = self.get_task_details_by_id(task_id)
failure_reason = failure_details.get("failureReason", "Unknown failure reason")
self.msg = "Task '{0}' from API '{1}' failed: {2}".format(
task_id, api_name, failure_reason
)
self.log(self.msg, "ERROR")
except Exception as e:
self.msg = "Task '{0}' failed and unable to retrieve failure details: {1}".format(
task_id, str(e)
)
self.log(self.msg, "ERROR")
self.status = "failed"
break
elif task_status == "SUCCESS":
self.result["changed"] = True
self.msg = "Task '{0}' from API '{1}' completed successfully in {2:.1f} seconds".format(
task_id, api_name, elapsed_time
)
self.log(self.msg, "INFO")
self.status = "success"
break
# Handle unknown/unexpected status
elif task_status not in ["IN_PROGRESS", "PENDING", "RUNNING"]:
self.log(
"Unexpected task status '{0}' for task '{1}'. Continuing to monitor...".format(
task_status, task_id
),
"WARNING"
)
# Log periodic progress updates
if int(elapsed_time) % 30 == 0 and elapsed_time > 0: # Every 30 seconds
progress_info = task_details.get("progress", "No progress information")
self.log(
"Task '{0}' progress update: {1} (status: {2}, elapsed: {3:.1f}s)".format(
task_id, progress_info, task_status, elapsed_time
),
"DEBUG"
)
time.sleep(poll_interval)
# Log final task monitoring summary
self.log(
"Task monitoring completed for '{0}'. Final status: {1}, Total time: {2:.1f}s, Status history: {3}".format(
task_id, self.status, time.time() - start_time, " -> ".join(task_status_history)
),
"INFO"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_operation_timeout(self, operation_type, operation_details):
"""
Get the appropriate timeout value for specific SWIM operations.
Args:
operation_type (str): Type of operation ('distribution' or 'activation')
operation_details (dict): Operation configuration details
Returns:
int: Timeout value in seconds
"""
self.log("Determining timeout for {0} operation".format(operation_type), "DEBUG")
# Map operation types to their specific timeout parameter names
timeout_key_mapping = {
"distribution": "distribution_task_timeout", # Recommended name... Check and update the name
"activation": "activation_task_timeout" # Recommended name.... Check and update the name
}
# Get operation-specific timeout if provided
timeout_key = timeout_key_mapping.get(operation_type)
if timeout_key and timeout_key in operation_details:
operation_timeout = operation_details[timeout_key]
self.log(
"Using operation-specific {0}: {1} seconds for {2}".format(
timeout_key, operation_timeout, operation_type
),
"INFO"
)
return operation_timeout
# Fallback to global timeout
global_timeout = self.params.get("dnac_api_task_timeout", 1800) # Correct global name
self.log(
"Using global dnac_api_task_timeout: {0} seconds for {1}".format(
global_timeout, operation_type
),
"INFO"
)
return global_timeout
description: Specify the name of the device | ||
family such as Switches and Hubs, etc. | ||
type: str | ||
Image_distribution_timeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
image_distribution_timeout..
not Image_distribution_timeout
Type of Change
Description
6 Bugs fixed
Bug Fix:
Root Cause (if applicable):
Issue: dnac_api_task_timeout causing incomplete activations
Fix: Added a new configurable parameter api_task_timeout for both activation and distribution processes, as these operations can take longer depending on network conditions. The default value is set to 1800 seconds, but customers can modify it based on their network speed and performance needs.
Issue: Concurrent activations triggered without verifying device status
Fix: Implemented the same api_task_timeout mechanism to ensure proper synchronization during concurrent activations and prevent inaccurate task updates or errors caused by premature timeouts.
Issue: Incorrect DNAC version mapping causing unsupported API execution
Fix: Updated and correctly mapped all relevant APIs to DNAC version 3.1.3.0, ensuring compatibility and stable module behavior with the supported API set.
Issue: Device filtering not functioning accurately when site_name parameter is not provided
Fix: Implemented a default behavior that automatically treats the site as ‘Global’ when no site_name is specified, ensuring all child sites are included in the output.
Issue: Device filtering fails when site_name is explicitly set to ‘Global’
Fix: Applied the same logic as above—when the site_name or site_type is ‘Global’, the module now retrieves and returns all devices from the global and its child sites correctly.
Issue: Incorrect success message upon activation completion
Fix: Corrected the final success message logic to map and display the accurate device IPs associated with the activation result, ensuring clarity in task completion feedback.
Enhancement: N/A
Enhancement Description: N/A
Impact Area: N/A
Testing Done:
Test cases covered: [Mention test case IDs or brief points]
Checklist
Ansible Best Practices
ansible-vault
or environment variables)Documentation
Screenshots (if applicable)
Notes to Reviewers