Skip to content

Conversation

syed-khadeerahmed
Copy link

@syed-khadeerahmed syed-khadeerahmed commented Oct 7, 2025

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Description

6 Bugs fixed

Bug Fix:

  1. [Swim]: Some issues from 'dnac_api_task_timeout' need to be checked
  2. [Swim]: The module enables concurrent activations without verifying device status, leading to errors and inaccurate updates
  3. [Swim]: Incorrect DNAC version mapping in the module causes the run of unsupported APIs.
  4. [Swim]: Filtering devices without providing the 'site_name' parameter does not work accurately.
  5. [Swim]: Filtering devices that provide 'site_name' = 'Global' is not functioning correctly
  6. [Swim]: The message returned when activated successfully is not correct

Root Cause (if applicable):

  1. Issue: dnac_api_task_timeout causing incomplete activations
    Fix: Added a new configurable parameter api_task_timeout for both activation and distribution processes, as these operations can take longer depending on network conditions. The default value is set to 1800 seconds, but customers can modify it based on their network speed and performance needs.

  2. Issue: Concurrent activations triggered without verifying device status
    Fix: Implemented the same api_task_timeout mechanism to ensure proper synchronization during concurrent activations and prevent inaccurate task updates or errors caused by premature timeouts.

  3. Issue: Incorrect DNAC version mapping causing unsupported API execution
    Fix: Updated and correctly mapped all relevant APIs to DNAC version 3.1.3.0, ensuring compatibility and stable module behavior with the supported API set.

  4. Issue: Device filtering not functioning accurately when site_name parameter is not provided
    Fix: Implemented a default behavior that automatically treats the site as ‘Global’ when no site_name is specified, ensuring all child sites are included in the output.

  5. Issue: Device filtering fails when site_name is explicitly set to ‘Global’
    Fix: Applied the same logic as above—when the site_name or site_type is ‘Global’, the module now retrieves and returns all devices from the global and its child sites correctly.

  6. Issue: Incorrect success message upon activation completion
    Fix: Corrected the final success message logic to map and display the accurate device IPs associated with the activation result, ensuring clarity in task completion feedback.

Enhancement: N/A
Enhancement Description: N/A
Impact Area: N/A

Testing Done:

  • Manual testing
  • Unit tests
  • Integration tests

Test cases covered: [Mention test case IDs or brief points]

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • All the sanity checks have been completed and the sanity test cases have been executed

Ansible Best Practices

  • Tasks are idempotent (can be run multiple times without changing state)
  • Variables and secrets are handled securely (e.g., using ansible-vault or environment variables)
  • Playbooks are modular and reusable
  • Handlers are used for actions that need to run on change

Documentation

  • All options and parameters are documented clearly.
  • Examples are provided and tested.
  • Notes and limitations are clearly stated.

Screenshots (if applicable)

Notes to Reviewers

description: Specify the name of the device
family such as Switches and Hubs, etc.
type: str
api_task_timeout:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) Can we say what is the difference b/w global timeout and this value?
b) In document you mentioned as default of 2400 seconds but that is not true. Instead can you put in recommendations?
c) Add the version when do we start to support it..

Check the sample snippet and update if required.. Also get approval from @DNACENSolutions

api_task_timeout:
  description: |
    Timeout duration in seconds for image distribution API operations.
    Controls how long the system waits for image distribution tasks to complete,
    including image transfer to target devices, network propagation, and distribution validation.
    
    Operation phases covered by this timeout:
    - Image preparation and validation on Catalyst Center
    - Network transfer to target devices
    - Installation verification on target devices
    - Distribution status confirmation
    
    Default of 1800 seconds (30 minutes) accounts for:
    - Large image files (up to several GB)
    - Multiple target devices in site-based operations  
    - Network latency and bandwidth constraints
    - Device processing and storage capabilities
    
    Recommended timeout values:
    - Small networks (1-10 devices): 900-1800 seconds
    - Medium networks (10-50 devices): 1800-3600 seconds
    - Large networks (50+ devices): 3600-7200 seconds
    
    Note: This timeout is independent of the global 'dnac_api_task_timeout' parameter
    and specifically applies to distribution operations only.
  type: int
  default: 1800
  version_added: ?????? 

ACCESS, BORDER ROUTER, DISTRIBUTION, and
CORE.
type: str
api_task_timeout:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

api_task_timeout:
  description: |
    Timeout duration in seconds for image activation API operations.
    Controls how long the system waits for image activation processes to complete
    before timing out, including device reboot and startup verification.
    
    Operation phases covered by this timeout:
    - Image validation and preparation for activation
    - Device upgrade mode processing (install/bundle/currentlyExists)
    - Device reboot and startup sequence (if required)
    - Post-activation connectivity and status verification
    - Golden image validation (if applicable)
    
    Default of 1800 seconds (30 minutes) accommodates:
    - Device boot time variations (switches: 5-15 min, routers: 10-20 min)
    - Image installation and verification processes
    - Network convergence and connectivity restoration
    - Multiple devices in concurrent activation scenarios
    
    Recommended timeout values by device type:
    - Access switches: 1200-1800 seconds (20-30 minutes)
    - Distribution/Core switches: 1800-2700 seconds (30-45 minutes)
    - Routers and complex devices: 2700-3600 seconds (45-60 minutes)
    
    Warning: Setting timeout too low may cause premature failure reporting
    for successful but slow activation processes.
  type: int
  default: 1800
  version_added: ??????

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ike the name api_task_timeout to be changed, it should be named for the activity it is addressing to. i.e. image_activation_timeout and image_distribution_timeout, both should be provided to user so they can individually plan to timeouts correctly. For example distribution on slow. network may take1Hr or more, activation will very based on if the device is a stack and and may vary from 10 mins to 30 minutes.

device_family_name:
description: Specify the name of the device
family such as Switches and Hubs, etc.
type: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name "api_task_timeout" may get confused with the existing global dnac_api_task_timeout parameter. Please note that the parameter name should reflect its specific operation scope.

Can we say

distribution_task_timeout:
    Timeout duration in seconds specifically for SWIM image distribution operations.
    Controls how long the system waits for distribution tasks to complete.... 
  
activation_task_timeout:
    Timeout duration in seconds specifically for SWIM image activation operations.
    Controls how long the system waits for activation tasks to complete.....

OR

swim_operation_timeout:
    Timeout duration in seconds for SWIM operations (distribution/activation)..
    Overrides the global dnac_api_task_timeout for SWIM-specific tasks...........

OR

image_distribution_timeout. .. image_activation_timeout ...
Check with @DNACENSolutions and update...

type: int
default: 1800

"Site name not specified; defaulting to 'Global' to fetch all devices under this category",
"INFO",
)
if self.compare_dnac_versions(self.get_ccc_version(), "2.3.5.3") <= 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current_version = self.get_ccc_version()
if self.compare_dnac_versions(current_version, "2.3.5.3") <= 0:
    site_name = "Global/.*"
    self.log(
        "Catalyst Center version {0} (≤2.3.5.3) detected - using wildcard pattern 'Global/.*' "
        "to fetch devices from Global site and all child sites via legacy API".format(current_version),
        "INFO",
    )
else:
    site_name = "Global"
    self.log(
        "Catalyst Center version {0} (>2.3.5.3) detected - using 'Global' site name "
        "to fetch devices via enhanced site hierarchy API".format(current_version),
        "INFO",
    )

for item_dict in item["response"]:
site_response_list.append(item_dict)
else:
self.log(site_name, "DEBUG")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't print the variable name directly. We instructed this many times in Code review.

return self

if not isinstance(response, dict):
self.msg = "response is not a dictionary"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    self.msg = "Invalid response format from API '{0}' - expected dictionary, got {1}".format(
        api_name, type(response).__name__
    )

self.status = "exited"
return self

task_info = response.get("response")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first check the task_info, if not then return..

task_info = response.get("response")
if not task_info:
    self.msg = "Missing task information in response from API '{0}'".format(api_name)
    self.log(self.msg, "ERROR")
    self.status = "failed"
    return self

if task_info.get("errorcode") is not None:
    error_detail = task_info.get("detail", "Unknown error occurred")
    self.msg = "API '{0}' returned error: {1}".format(api_name, error_detail)
    self.log(self.msg, "ERROR")
    self.status = "failed"
    return self

self.status = "failed"
return self

task_id = task_info.get("taskId")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if task_id is none, then return??

if not task_id:
    self.msg = "Missing taskId in response from API '{0}'".format(api_name)
    self.log(self.msg, "ERROR")
    self.status = "failed"
    return self

return self

task_id = task_info.get("taskId")
start_time = time.time()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is good to determine the timeout for this operation..

# Determine appropriate timeout for this operation
if operation_details:
    operation_type = api_name.lower().replace("trigger_software_image_", "").replace("_", "")
    max_timeout = self.get_operation_timeout(operation_type, operation_details)
else:
    max_timeout = getattr(self, 'max_timeout', self.params.get("dnac_api_task_timeout", 1800))

poll_interval = self.params.get("dnac_task_poll_interval", 2)

self.log(
    "Monitoring task '{0}' from API '{1}' with timeout {2}s and poll interval {3}s".format(
        task_id, api_name, max_timeout, poll_interval
    ), 
    "INFO"
)

start_time = time.time()
task_status_history = []

while True:
    elapsed_time = time.time() - start_time


task_id = task_info.get("taskId")
start_time = time.time()
while True:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a sample code to check the task_status.. But you can refine and update if required..

    task_status_history = []

    while True:
        elapsed_time = time.time() - start_time
        
        # Check for timeout
        if elapsed_time >= max_timeout:
            self.msg = (
                "Timeout of {0} seconds reached for task '{1}' from API '{2}'. "
                "Task may still be running on Catalyst Center. "
                "Consider increasing timeout or checking task status manually."
            ).format(max_timeout, task_id, api_name)
            self.log(self.msg, "WARNING")
            self.status = "failed"
            break

        # Get current task status
        try:
            task_details = self.get_tasks_by_id(task_id)
            self.log(
                "Retrieved task details for ID '{0}': {1}".format(task_id, task_details), 
                "DEBUG"
            )
        except Exception as e:
            self.msg = "Failed to retrieve task details for ID '{0}': {1}".format(task_id, str(e))
            self.log(self.msg, "ERROR")
            self.status = "failed"
            break

        if not task_details:
            self.log(
                "Empty task details received for ID '{0}', retrying...".format(task_id), 
                "WARNING"
            )
            time.sleep(poll_interval)
            continue

        task_status = task_details.get("status", "UNKNOWN")
        
        # Track status changes for debugging
        if not task_status_history or task_status_history[-1] != task_status:
            task_status_history.append(task_status)
            self.log(
                "Task '{0}' status changed to: {1} (elapsed: {2:.1f}s)".format(
                    task_id, task_status, elapsed_time
                ), 
                "INFO"
            )

        # Handle terminal states
        if task_status == "FAILURE":
            try:
                failure_details = self.get_task_details_by_id(task_id)
                failure_reason = failure_details.get("failureReason", "Unknown failure reason")
                self.msg = "Task '{0}' from API '{1}' failed: {2}".format(
                    task_id, api_name, failure_reason
                )
                self.log(self.msg, "ERROR")
            except Exception as e:
                self.msg = "Task '{0}' failed and unable to retrieve failure details: {1}".format(
                    task_id, str(e)
                )
                self.log(self.msg, "ERROR")
            
            self.status = "failed"
            break

        elif task_status == "SUCCESS":
            self.result["changed"] = True
            self.msg = "Task '{0}' from API '{1}' completed successfully in {2:.1f} seconds".format(
                task_id, api_name, elapsed_time
            )
            self.log(self.msg, "INFO")
            self.status = "success"
            break

        # Handle unknown/unexpected status
        elif task_status not in ["IN_PROGRESS", "PENDING", "RUNNING"]:
            self.log(
                "Unexpected task status '{0}' for task '{1}'. Continuing to monitor...".format(
                    task_status, task_id
                ), 
                "WARNING"
            )
            
        # Log periodic progress updates
        if int(elapsed_time) % 30 == 0 and elapsed_time > 0:  # Every 30 seconds
            progress_info = task_details.get("progress", "No progress information")
            self.log(
                "Task '{0}' progress update: {1} (status: {2}, elapsed: {3:.1f}s)".format(
                    task_id, progress_info, task_status, elapsed_time
                ), 
                "DEBUG"
            )

        time.sleep(poll_interval)

    # Log final task monitoring summary
    self.log(
        "Task monitoring completed for '{0}'. Final status: {1}, Total time: {2:.1f}s, Status history: {3}".format(
            task_id, self.status, time.time() - start_time, " -> ".join(task_status_history)
        ), 
        "INFO"
    )
        

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def get_operation_timeout(self, operation_type, operation_details):
    """
    Get the appropriate timeout value for specific SWIM operations.
    
    Args:
        operation_type (str): Type of operation ('distribution' or 'activation')
        operation_details (dict): Operation configuration details
        
    Returns:
        int: Timeout value in seconds
    """
    self.log("Determining timeout for {0} operation".format(operation_type), "DEBUG")
    
    # Map operation types to their specific timeout parameter names
    timeout_key_mapping = {
        "distribution": "distribution_task_timeout",  # Recommended name... Check and update the name
        "activation": "activation_task_timeout"       # Recommended name.... Check and update the name
    }
    
    # Get operation-specific timeout if provided
    timeout_key = timeout_key_mapping.get(operation_type)
    if timeout_key and timeout_key in operation_details:
        operation_timeout = operation_details[timeout_key]
        self.log(
            "Using operation-specific {0}: {1} seconds for {2}".format(
                timeout_key, operation_timeout, operation_type
            ), 
            "INFO"
        )
        return operation_timeout
    
    # Fallback to global timeout
    global_timeout = self.params.get("dnac_api_task_timeout", 1800)  # Correct global name
    self.log(
        "Using global dnac_api_task_timeout: {0} seconds for {1}".format(
            global_timeout, operation_type
        ),
        "INFO"
    )
    return global_timeout

description: Specify the name of the device
family such as Switches and Hubs, etc.
type: str
Image_distribution_timeout:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image_distribution_timeout..

not Image_distribution_timeout

@madhansansel madhansansel merged commit 352e7fa into cisco-en-programmability:main Oct 13, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants