|
| 1 | +# Investigation: Flaky Test `DeployCommandIncludesDeployFlagInArguments` |
| 2 | + |
| 3 | +## Issue Summary |
| 4 | + |
| 5 | +**Test Name:** `Aspire.Cli.Tests.Commands.DeployCommandTests.DeployCommandIncludesDeployFlagInArguments` |
| 6 | + |
| 7 | +**GitHub Issue:** https://github.com/dotnet/aspire/issues/11217 |
| 8 | + |
| 9 | +**Quarantined:** Yes |
| 10 | + |
| 11 | +**Test Location:** `/tests/Aspire.Cli.Tests/Commands/DeployCommandTests.cs:333` |
| 12 | + |
| 13 | +**Test Purpose:** Verifies that the deploy command passes the correct arguments (including `--deploy`, `--operation`, `--publisher`, and `--output-path`) to the AppHost when executing. |
| 14 | + |
| 15 | +## Test Results Analysis |
| 16 | + |
| 17 | +### Failure Pattern |
| 18 | + |
| 19 | +From the test results spanning September 20, 2025 to October 20, 2025 (63 total runs): |
| 20 | + |
| 21 | +- **Windows:** 30 failures / 30 runs (100% failure rate) |
| 22 | +- **Linux:** 9 failures / 21 runs (42.9% failure rate) |
| 23 | +- **macOS:** 3 failures / 12 runs (25.0% failure rate) |
| 24 | + |
| 25 | +### Key Observations |
| 26 | + |
| 27 | +1. **Platform-Specific Behavior:** |
| 28 | + - Windows: Consistently fails on **every** run |
| 29 | + - Linux: Fails intermittently (~43% of the time) |
| 30 | + - macOS: Fails less frequently (~25% of the time) |
| 31 | + |
| 32 | +2. **Failure Signature:** |
| 33 | + - All failures are `System.TimeoutException: The operation has timed out.` |
| 34 | + - Timeout occurs at line 333: `var exitCode = await result.InvokeAsync().WaitAsync(CliTestConstants.DefaultTimeout);` |
| 35 | + - Default timeout is 10 seconds (`TimeSpan.FromSeconds(10)`) |
| 36 | + |
| 37 | +3. **Consistent Output Pattern:** |
| 38 | + All runs (passing and failing) show the same stdout output: |
| 39 | + ``` |
| 40 | + Temporary workspace created at: <path> |
| 41 | + 🔬 Checking project type...: |
| 42 | + ../../../<guid>/AppHost.csproj |
| 43 | + 🛠 Building apphost... |
| 44 | + ../../../<guid>/AppHost.csproj |
| 45 | + 🛠 Generating artifacts... |
| 46 | + ``` |
| 47 | + |
| 48 | +4. **The Critical Difference:** |
| 49 | + - **Passing tests** have an additional newline at the end of stdout output |
| 50 | + - **Failing tests** stop at "Generating artifacts..." without completion |
| 51 | + |
| 52 | +## Root Cause Analysis |
| 53 | + |
| 54 | +### Test Code Structure |
| 55 | + |
| 56 | +The test at line 270-340 sets up a mock execution environment: |
| 57 | + |
| 58 | +```csharp |
| 59 | +[QuarantinedTest("https://github.com/dotnet/aspire/issues/11217")] |
| 60 | +public async Task DeployCommandIncludesDeployFlagInArguments() |
| 61 | +{ |
| 62 | + // ... setup code ... |
| 63 | +
|
| 64 | + RunAsyncCallback = async (projectFile, watch, noBuild, args, env, backchannelCompletionSource, options, cancellationToken) => |
| 65 | + { |
| 66 | + // Verify arguments |
| 67 | + Assert.Contains("--deploy", args); |
| 68 | + Assert.Contains("--operation", args); |
| 69 | + Assert.Contains("publish", args); |
| 70 | + Assert.Contains("--publisher", args); |
| 71 | + Assert.Contains("default", args); |
| 72 | + Assert.Contains("--output-path", args); |
| 73 | + Assert.Contains("/tmp/test", args); |
| 74 | + |
| 75 | + var deployModeCompleted = new TaskCompletionSource(); |
| 76 | + var backchannel = new TestAppHostBackchannel |
| 77 | + { |
| 78 | + RequestStopAsyncCalled = deployModeCompleted // KEY ISSUE |
| 79 | + }; |
| 80 | + backchannelCompletionSource?.SetResult(backchannel); |
| 81 | + await deployModeCompleted.Task; // Waits for RequestStopAsync to be called |
| 82 | + return 0; |
| 83 | + } |
| 84 | +} |
| 85 | +``` |
| 86 | + |
| 87 | +### The Race Condition |
| 88 | + |
| 89 | +The test creates a **race condition** between publishing activity processing and timeout: |
| 90 | + |
| 91 | +#### Expected Flow (from `PublishCommandBase.cs:214-227`): |
| 92 | +```csharp |
| 93 | +var backchannel = await backchannelCompletionSource.Task; // Gets backchannel |
| 94 | +var publishingActivities = backchannel.GetPublishingActivitiesAsync(cancellationToken); |
| 95 | +var noFailuresReported = await ProcessAndDisplayPublishingActivitiesAsync(publishingActivities, backchannel, cancellationToken); |
| 96 | +await backchannel.RequestStopAsync(cancellationToken); // Only called AFTER processing activities |
| 97 | +var exitCode = await pendingRun; |
| 98 | +``` |
| 99 | + |
| 100 | +#### The Problem: |
| 101 | + |
| 102 | +1. **`GetPublishingActivitiesAsync` Never Completes:** |
| 103 | + - The default `TestAppHostBackchannel.GetPublishingActivitiesAsync()` implementation yields activities then completes |
| 104 | + - However, the implementation yields a fixed set of activities and then **returns** |
| 105 | + - But in `ProcessAndDisplayPublishingActivitiesAsync`, the code waits for a `PublishingActivityTypes.PublishComplete` activity |
| 106 | + |
| 107 | +2. **Missing PublishComplete Activity:** |
| 108 | + Looking at `TestAppHostBackchannel.GetPublishingActivitiesAsync()` (lines 104-165), it yields: |
| 109 | + - One root step (InProgress) |
| 110 | + - Multiple tasks (child-task-1, child-task-2) |
| 111 | + - Root step (Completed) |
| 112 | + - **BUT NO `PublishingActivityTypes.PublishComplete` activity** |
| 113 | + |
| 114 | +3. **The Wait Sequence:** |
| 115 | + - Test sets `backchannelCompletionSource` → backchannel created |
| 116 | + - `PublishCommandBase` calls `GetPublishingActivitiesAsync` |
| 117 | + - `ProcessAndDisplayPublishingActivitiesAsync` waits for `PublishComplete` activity |
| 118 | + - Default implementation finishes yielding activities without `PublishComplete` |
| 119 | + - The async enumerable completes but no `PublishComplete` was received |
| 120 | + - Code likely hangs or continues waiting |
| 121 | + - `RequestStopAsync` is never called (or called too late) |
| 122 | + - `deployModeCompleted.Task` never completes |
| 123 | + - Test times out after 10 seconds |
| 124 | + |
| 125 | +### Why Windows Fails 100% |
| 126 | + |
| 127 | +Windows is likely slower at: |
| 128 | +- Creating temporary workspaces |
| 129 | +- Process startup/cleanup |
| 130 | +- File system operations |
| 131 | + |
| 132 | +This means the race condition is more likely to be lost on Windows, causing consistent timeouts. |
| 133 | + |
| 134 | +### Why macOS/Linux Succeed Sometimes |
| 135 | + |
| 136 | +On faster Unix-based systems: |
| 137 | +- The publishing activities enumeration might complete faster |
| 138 | +- There may be timing differences in how the async operations are scheduled |
| 139 | +- The test might "luck out" and complete within 10 seconds even with the bug |
| 140 | + |
| 141 | +However, the fundamental issue exists on all platforms - it's just a matter of timing. |
| 142 | + |
| 143 | +## Technical Details |
| 144 | + |
| 145 | +### Missing PublishComplete Activity |
| 146 | + |
| 147 | +Looking at `TestAppHostBackchannel.cs` lines 104-165: |
| 148 | + |
| 149 | +```csharp |
| 150 | +public async IAsyncEnumerable<PublishingActivity> GetPublishingActivitiesAsync([EnumeratorCancellation]CancellationToken cancellationToken) |
| 151 | +{ |
| 152 | + GetPublishingActivitiesAsyncCalled?.SetResult(); |
| 153 | + if (GetPublishingActivitiesAsyncCallback is not null) |
| 154 | + { |
| 155 | + // Custom callback path |
| 156 | + } |
| 157 | + else |
| 158 | + { |
| 159 | + // Default implementation yields activities but NO PublishComplete |
| 160 | + yield return new PublishingActivity { Type = PublishingActivityTypes.Step, ... }; |
| 161 | + yield return new PublishingActivity { Type = PublishingActivityTypes.Task, ... }; |
| 162 | + // ... more activities ... |
| 163 | + yield return new PublishingActivity { |
| 164 | + Type = PublishingActivityTypes.Step, |
| 165 | + Data = new PublishingActivityData { |
| 166 | + CompletionState = CompletionStates.Completed, |
| 167 | + ... |
| 168 | + } |
| 169 | + }; |
| 170 | + // Missing: PublishingActivityTypes.PublishComplete |
| 171 | + } |
| 172 | +} |
| 173 | +``` |
| 174 | + |
| 175 | +### What ProcessPublishingActivitiesDebugAsync Expects |
| 176 | + |
| 177 | +From `PublishCommandBase.cs` lines 285-320: |
| 178 | + |
| 179 | +```csharp |
| 180 | +await foreach (var activity in publishingActivities.WithCancellation(cancellationToken)) |
| 181 | +{ |
| 182 | + StartTerminalProgressBar(); |
| 183 | + if (activity.Type == PublishingActivityTypes.PublishComplete) // WAITS FOR THIS |
| 184 | + { |
| 185 | + publishingActivity = activity; |
| 186 | + break; // Only breaks when PublishComplete received |
| 187 | + } |
| 188 | + // ... process other activities ... |
| 189 | +} |
| 190 | +``` |
| 191 | + |
| 192 | +The loop continues **until** a `PublishComplete` activity is received or the enumerable ends. |
| 193 | + |
| 194 | +## Fix Strategy |
| 195 | + |
| 196 | +### Option 1: Add PublishComplete Activity (Recommended) |
| 197 | + |
| 198 | +Modify `TestAppHostBackchannel.GetPublishingActivitiesAsync()` to yield a final `PublishComplete` activity: |
| 199 | + |
| 200 | +```csharp |
| 201 | +yield return new PublishingActivity |
| 202 | +{ |
| 203 | + Type = PublishingActivityTypes.PublishComplete, |
| 204 | + Data = new PublishingActivityData |
| 205 | + { |
| 206 | + Id = "publish-complete", |
| 207 | + StatusText = "Publishing completed", |
| 208 | + CompletionState = CompletionStates.Completed, |
| 209 | + StepId = null |
| 210 | + } |
| 211 | +}; |
| 212 | +``` |
| 213 | + |
| 214 | +### Option 2: Use Custom Callback in Test |
| 215 | + |
| 216 | +Override the `GetPublishingActivitiesAsyncCallback` in the test to provide proper completion: |
| 217 | + |
| 218 | +```csharp |
| 219 | +var backchannel = new TestAppHostBackchannel |
| 220 | +{ |
| 221 | + RequestStopAsyncCalled = deployModeCompleted, |
| 222 | + GetPublishingActivitiesAsyncCallback = async (ct) => |
| 223 | + { |
| 224 | + // Yield minimal activities and complete properly |
| 225 | + yield return new PublishingActivity |
| 226 | + { |
| 227 | + Type = PublishingActivityTypes.PublishComplete, |
| 228 | + Data = new PublishingActivityData { /* ... */ } |
| 229 | + }; |
| 230 | + } |
| 231 | +}; |
| 232 | +``` |
| 233 | + |
| 234 | +### Option 3: Increase Timeout (Not Recommended) |
| 235 | + |
| 236 | +Simply increasing the timeout doesn't fix the underlying bug, though it might reduce failure rates on slower machines. |
| 237 | + |
| 238 | +## Validation Steps |
| 239 | + |
| 240 | +After applying the fix: |
| 241 | + |
| 242 | +1. **Run the specific test multiple times:** |
| 243 | + ```bash |
| 244 | + dotnet test tests/Aspire.Cli.Tests/Aspire.Cli.Tests.csproj \ |
| 245 | + -- --filter-method "*.DeployCommandIncludesDeployFlagInArguments" \ |
| 246 | + --filter-not-trait "quarantined=true" \ |
| 247 | + --filter-not-trait "outerloop=true" |
| 248 | + ``` |
| 249 | + |
| 250 | +2. **Run in a loop to verify stability:** |
| 251 | + ```bash |
| 252 | + for i in {1..50}; do |
| 253 | + echo "Run $i" |
| 254 | + dotnet test tests/Aspire.Cli.Tests/Aspire.Cli.Tests.csproj \ |
| 255 | + -- --filter-method "*.DeployCommandIncludesDeployFlagInArguments" \ |
| 256 | + --filter-not-trait "quarantined=true" \ |
| 257 | + --filter-not-trait "outerloop=true" || exit 1 |
| 258 | + done |
| 259 | + ``` |
| 260 | + |
| 261 | +3. **Test on all platforms:** |
| 262 | + - Windows (highest failure rate - critical) |
| 263 | + - Linux |
| 264 | + - macOS |
| 265 | + |
| 266 | +4. **Verify related tests still pass:** |
| 267 | + ```bash |
| 268 | + dotnet test tests/Aspire.Cli.Tests/Aspire.Cli.Tests.csproj \ |
| 269 | + -- --filter-class "*.DeployCommandTests" \ |
| 270 | + --filter-not-trait "quarantined=true" \ |
| 271 | + --filter-not-trait "outerloop=true" |
| 272 | + ``` |
| 273 | + |
| 274 | +5. **Check similar test patterns:** |
| 275 | + - `DeployCommandSucceedsWithoutOutputPath` (line 134) |
| 276 | + - `DeployCommandSucceedsEndToEnd` (line 201) |
| 277 | + - These tests might have the same issue if they rely on default `GetPublishingActivitiesAsync` |
| 278 | + |
| 279 | +## Additional Considerations |
| 280 | + |
| 281 | +### Other Affected Tests |
| 282 | + |
| 283 | +Review all uses of `TestAppHostBackchannel` to ensure they properly handle publishing activities: |
| 284 | +- `PublishCommandTests.cs` |
| 285 | +- `ExecCommandTests.cs` |
| 286 | +- `RunCommandTests.cs` |
| 287 | + |
| 288 | +Most of these don't use publishing activities, but should be reviewed for consistency. |
| 289 | + |
| 290 | +### TestAppHostBackchannel Design |
| 291 | + |
| 292 | +The default implementation of `GetPublishingActivitiesAsync` should probably: |
| 293 | +1. **Always** yield a `PublishComplete` activity at the end |
| 294 | +2. Or document that callers must provide a custom callback for publish scenarios |
| 295 | + |
| 296 | +### Long-term Solution |
| 297 | + |
| 298 | +Consider: |
| 299 | +1. Adding a test helper method that creates properly configured backchannels for deploy/publish scenarios |
| 300 | +2. Adding validation in `TestAppHostBackchannel` constructor to warn about incomplete configuration |
| 301 | +3. Updating all deploy/publish tests to use a standard pattern |
| 302 | + |
| 303 | +## Supporting Information |
| 304 | + |
| 305 | +### Test Result Summary (Last 30 Days) |
| 306 | + |
| 307 | +| OS | Passed | Failed | Total | Failure Rate | |
| 308 | +|---------|--------|--------|-------|--------------| |
| 309 | +| Windows | 0 | 30 | 30 | 100.0% | |
| 310 | +| Linux | 12 | 9 | 21 | 42.9% | |
| 311 | +| macOS | 9 | 3 | 12 | 25.0% | |
| 312 | +| **Total** | **21** | **42** | **63** | **66.7%** | |
| 313 | + |
| 314 | +### GitHub Actions Run URLs |
| 315 | + |
| 316 | +All run URLs are documented in the results.json file. Recent runs: |
| 317 | +- Latest failure (Windows): https://github.com/dotnet/aspire/actions/runs/18640301007 |
| 318 | +- Latest success (Linux): https://github.com/dotnet/aspire/actions/runs/18640301007 |
| 319 | +- Latest success (macOS): https://github.com/dotnet/aspire/actions/runs/18640301007 |
| 320 | + |
| 321 | +### Files to Review |
| 322 | + |
| 323 | +1. **Test file:** `/tests/Aspire.Cli.Tests/Commands/DeployCommandTests.cs:270-340` |
| 324 | +2. **Test helper:** `/tests/Aspire.Cli.Tests/TestServices/TestAppHostBackchannel.cs:104-165` |
| 325 | +3. **Command implementation:** `/src/Aspire.Cli/Commands/PublishCommandBase.cs:214-320` |
| 326 | +4. **Publishing activity types:** Search for `PublishingActivityTypes.PublishComplete` |
| 327 | + |
| 328 | +## Conclusion |
| 329 | + |
| 330 | +This is a **timing-dependent race condition** caused by the test's backchannel not properly completing the publishing activities stream. The test waits for `RequestStopAsync` to be called, but that only happens after processing publishing activities, which never completes because no `PublishComplete` activity is yielded. |
| 331 | + |
| 332 | +The fix is straightforward: ensure `TestAppHostBackchannel.GetPublishingActivitiesAsync()` yields a `PublishComplete` activity, or override it in the test to do so. |
| 333 | + |
| 334 | +**Confidence Level:** HIGH - This is a clear bug in test infrastructure, not a flaky test due to environmental factors. |
| 335 | + |
| 336 | +**Priority:** MEDIUM - The test is already quarantined, but fixing it will improve test coverage and prevent regression. |
0 commit comments