Db retries #146

maxdml · 2025-09-23T23:16:59Z

Address #93

This PR adds a retry method that takes a function and retries it until a specific condition is met. retry is configured with functional options and has a generic counterpart, retryWithResult, for functions that return a value.

All system database methods are now called within retry or retryWithResult.

For example:

dequeuedWorkflows, err := retryWithResult(ctx, func() ([]dequeuedWorkflow, error) {
	return ctx.systemDB.dequeueWorkflows(ctx, dequeueWorkflowsInput{
		queue:              queue,
		executorID:         ctx.executorID,
		applicationVersion: ctx.applicationVersion,
	})
}, withRetrierLogger(qr.logger))

retry performs exponential backoff with jitter. It defaults to infinite retries.
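For readers who want a feel for the shape of such a helper, here is a minimal, self-contained sketch of a retry loop with exponential backoff, jitter, functional options and a generic variant. It is not the code in this PR: it takes a plain context.Context, and the option names (withMaxRetries, withBaseDelay), the config fields and the default delays are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryConfig and its fields are hypothetical; the real helper may differ.
type retryConfig struct {
	maxRetries  int // -1 means retry forever
	baseDelay   time.Duration
	maxDelay    time.Duration
	shouldRetry func(error) bool
}

type retryOption func(*retryConfig)

func withMaxRetries(n int) retryOption          { return func(c *retryConfig) { c.maxRetries = n } }
func withBaseDelay(d time.Duration) retryOption { return func(c *retryConfig) { c.baseDelay = d } }

// retryWithResult calls fn until it succeeds, the condition rejects the error,
// the retry budget is exhausted, or ctx is done.
func retryWithResult[T any](ctx context.Context, fn func() (T, error), opts ...retryOption) (T, error) {
	cfg := retryConfig{
		maxRetries:  -1, // infinite by default
		baseDelay:   100 * time.Millisecond,
		maxDelay:    10 * time.Second,
		shouldRetry: func(error) bool { return true },
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	delay := cfg.baseDelay
	for attempt := 0; ; attempt++ {
		res, err := fn()
		if err == nil || !cfg.shouldRetry(err) {
			return res, err
		}
		if cfg.maxRetries >= 0 && attempt >= cfg.maxRetries {
			return res, fmt.Errorf("giving up after %d retries: %w", attempt, err)
		}
		// Jitter: sleep somewhere between 50% and 100% of the current delay.
		sleep := delay/2 + time.Duration(rand.Int63n(int64(delay/2)+1))
		select {
		case <-ctx.Done():
			var zero T
			return zero, ctx.Err()
		case <-time.After(sleep):
		}
		// Exponential growth, capped at maxDelay.
		delay *= 2
		if delay > cfg.maxDelay {
			delay = cfg.maxDelay
		}
	}
}

// retry is the variant for functions that only return an error.
func retry(ctx context.Context, fn func() error, opts ...retryOption) error {
	_, err := retryWithResult(ctx, func() (struct{}, error) { return struct{}{}, fn() }, opts...)
	return err
}

// demoRetry fails twice with a transient error, then succeeds.
func demoRetry() (value int, calls int, err error) {
	value, err = retryWithResult(context.Background(), func() (int, error) {
		calls++
		if calls < 3 {
			return 0, errors.New("connection refused")
		}
		return 42, nil
	}, withBaseDelay(time.Millisecond), withMaxRetries(5))
	return value, calls, err
}

func main() {
	v, calls, err := demoRetry()
	fmt.Println(v, calls, err)
	_ = retry(context.Background(), func() error { return nil })
}
```

Selecting on ctx.Done() during the sleep is what lets a caller stop an otherwise infinite retry loop by cancelling the context.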

The default retry condition is isRetryablePGError, which checks the error returned by the function against a set of connection errors (Postgres error codes, pgx connection errors, net.Error).
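As a rough illustration of this kind of classification, the sketch below uses only the standard library. It is not the PR's isRetryablePGError: the function name isRetryableSketch, the sqlStateError interface and the exact list of codes are assumptions. I believe pgx's pgconn.PgError exposes a SQLState() method of this shape, but treat that as an assumption too. The SQLSTATE values used are standard Postgres codes (class 08 for connection exceptions, 57P01 to 57P03 for shutdown and cannot-connect-now).

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// sqlStateError is a hypothetical interface for errors carrying a SQLSTATE code.
type sqlStateError interface {
	error
	SQLState() string
}

// isRetryableSketch reports whether err looks like a connection-level failure.
func isRetryableSketch(err error) bool {
	if err == nil {
		return false
	}
	var se sqlStateError
	if errors.As(err, &se) {
		code := se.SQLState()
		if strings.HasPrefix(code, "08") || code == "57P01" || code == "57P02" || code == "57P03" {
			return true
		}
	}
	// Any network-level error in the chain is treated as retryable.
	var ne net.Error
	return errors.As(err, &ne)
}

// fakePGError stands in for a driver error in this demo.
type fakePGError struct{ code string }

func (e fakePGError) Error() string    { return "pg error " + e.code }
func (e fakePGError) SQLState() string { return e.code }

func classify() []bool {
	return []bool{
		isRetryableSketch(fmt.Errorf("query failed: %w", fakePGError{code: "08006"})),      // connection_failure
		isRetryableSketch(fakePGError{code: "23505"}),                                      // unique_violation
		isRetryableSketch(&net.OpError{Op: "dial", Err: errors.New("connection refused")}), // network error
		isRetryableSketch(errors.New("some application error")),
	}
}

func main() {
	fmt.Println(classify()) // prints [true false true false]
}
```

Using errors.As rather than a type switch means the check still works when the driver error has been wrapped by higher layers.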

Note: in the particular case of RunWorkflow, which does manage transactions, the entire function has to be retried. This is because the transaction object becomes invalid as soon as Commit/Rollback have been called, regardless of whether the operation was successful. See https://github.com/jackc/pgx/blob/61d3c965ad442cc14d6b0e39e0ab3821f3684c03/tx.go#L180
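To make the constraint concrete, here is a small sketch with a fake transaction type. Nothing here is pgx or DBOS code: fakeTx, errConnLost, errTxDone and runOnce are hypothetical stand-ins. The point is only that each attempt must begin a fresh transaction, because the previous one is finished as soon as Commit or Rollback has been called.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errConnLost = errors.New("connection lost") // stand-in for a retryable error
	errTxDone   = errors.New("tx is closed")    // stand-in for pgx.ErrTxClosed
)

// fakeTx mimics a transaction that is unusable after Commit or Rollback.
type fakeTx struct {
	done       bool
	failCommit bool
}

func (t *fakeTx) Commit() error {
	if t.done {
		return errTxDone
	}
	t.done = true // finished even when the commit itself fails
	if t.failCommit {
		return errConnLost
	}
	return nil
}

func (t *fakeTx) Rollback() error {
	if t.done {
		return errTxDone
	}
	t.done = true
	return nil
}

// runOnce begins a fresh transaction, runs body, then commits or rolls back.
func runOnce(begin func() *fakeTx, body func(*fakeTx) error) error {
	tx := begin()
	if err := body(tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}

// demoTxRetry retries the whole unit of work, so every attempt gets a new tx.
func demoTxRetry() (begins int, err error) {
	begin := func() *fakeTx {
		begins++
		return &fakeTx{failCommit: begins == 1} // first commit fails
	}
	for attempt := 0; attempt < 3; attempt++ {
		err = runOnce(begin, func(*fakeTx) error { return nil })
		if !errors.Is(err, errConnLost) {
			break
		}
	}
	return begins, err
}

func main() {
	begins, err := demoTxRetry()
	fmt.Println(begins, err) // prints 2 <nil>
}
```

Retrying only the Commit call on the same fakeTx would return errTxDone forever, which is why the retry boundary has to enclose the code that begins the transaction.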

This PR also improves DBOSError with the ability to unwrap the underlying error, if any.
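As an illustration of what unwrapping buys callers, the sketch below defines a stand-in error type with an Unwrap method and inspects it with errors.Is and errors.As. The type sketchError, its cause field and the integer code are hypothetical; this is not the DBOSError definition from dbos/errors.go.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// sketchError is a hypothetical stand-in for a library error that wraps a cause.
type sketchError struct {
	Message string
	Code    int
	cause   error
}

func (e *sketchError) Error() string {
	if e.cause != nil {
		return fmt.Sprintf("%s (code %d): %v", e.Message, e.Code, e.cause)
	}
	return fmt.Sprintf("%s (code %d)", e.Message, e.Code)
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *sketchError) Unwrap() error { return e.cause }

func demoUnwrap() (isEOF bool, code int) {
	var err error = fmt.Errorf("step failed: %w", &sketchError{
		Message: "workflow execution error",
		Code:    7,
		cause:   io.ErrUnexpectedEOF,
	})
	// errors.Is walks through both wrappers down to the original cause.
	isEOF = errors.Is(err, io.ErrUnexpectedEOF)
	// errors.As recovers the typed error for programmatic handling.
	var se *sketchError
	if errors.As(err, &se) {
		code = se.Code
	}
	return isEOF, code
}

func main() {
	fmt.Println(demoUnwrap()) // prints true 7
}
```

With Unwrap in place, a retry condition can look through the library error to the connection error underneath it.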

maxdml · 2025-09-24T19:15:47Z

cmd/dbos/templates/dbos-toolbox/go.mod.tmpl


-go 1.22.0
-
-toolchain go1.25.0


Not needed for template

maxdml · 2025-09-24T21:01:43Z

dbos/errors.go

 type DBOSError struct {
 	Message string        // Human-readable error message
 	Code    DBOSErrorCode // Error type code for programmatic handling
-	IsBase  bool          // Internal errors that shouldn't be caught by user code


maxdml · 2025-09-25T03:00:18Z

.github/workflows/tests.yml

-    - name: Cache Go modules
-      uses: actions/cache@v4
-      with:
-        path: |
-          ~/go/pkg/mod
-          ~/.cache/go-build
-        key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
-        restore-keys: |
-          ${{ runner.os }}-go-
-


Doesn't seem very useful

maxdml · 2025-09-25T03:00:41Z

cmd/dbos/postgres.go

 	}

-	logger.Info("Postgres available", "url", fmt.Sprintf("postgres://postgres:%s@localhost:5432", password))
+	logger.Info("Postgres available", "url", fmt.Sprintf("postgres://postgres:%s@localhost:5432", url.QueryEscape(password)))
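A short sketch of why the password is escaped before being interpolated into the URL: characters such as @, :, / and # have structural meaning in a URL, so a raw password containing them can make the string unparseable or change its meaning. The helper names connURL and roundTrip and the sample password are made up for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// connURL builds the URL the same way the log line above does.
func connURL(password string) string {
	return fmt.Sprintf("postgres://postgres:%s@localhost:5432", url.QueryEscape(password))
}

// roundTrip parses the URL back and returns the decoded password.
func roundTrip(password string) (string, error) {
	u, err := url.Parse(connURL(password))
	if err != nil {
		return "", err
	}
	p, _ := u.User.Password()
	return p, nil
}

func main() {
	fmt.Println(connURL("p@ss:w/rd#1"))
	fmt.Println(roundTrip("p@ss:w/rd#1"))
}
```

One caveat: url.QueryEscape encodes a space as +, and net/url does not turn + back into a space when decoding the userinfo part, so passwords containing spaces would not round-trip this way.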


maxdml · 2025-09-25T15:26:45Z

cmd/dbos/postgres.go

 		AutoRemove: true,
+		Binds: []string{
+			fmt.Sprintf("%s:%s", hostPgDataVolumeName, pgData),
+		},


It turns out that an explicit volume binding is required. On macOS, exporting PGDATA is enough to instruct Docker Desktop to mount the right thing. Docker Desktop uses a light VM to run Postgres, and persists this volume across container restarts.

On the GitHub Actions runner, and likely in other environments too, the mount is not persisted across container restarts.

maxdml · 2025-09-25T15:32:24Z

dbos/workflow.go

-			wfID, err := ctx.GetWorkflowID()
-			if err != nil {
-				return nil, fmt.Errorf("failed to get workflow ID: %w", err)
-			}
-			err = ctx.(*dbosContext).systemDB.updateWorkflowOutcome(WithoutCancel(ctx), updateWorkflowOutcomeDBInput{
-				workflowID: wfID,
-				status:     WorkflowStatusError,
-				err:        newWorkflowUnexpectedInputType(fqn, fmt.Sprintf("%T", typedInput), fmt.Sprintf("%T", input)),
-			})
-			if err != nil {
-				return nil, fmt.Errorf("failed to record unexpected input type error: %w", err)
-			}


This is not required because typedErasedWorkflow is always called within RunWorkflow, in a goroutine that updates the workflow outcome.

maxdml · 2025-09-25T15:33:44Z

integration/mocks_test.go

-	mockCtx.AssertExpectations(t)
-	mockChildHandle.AssertExpectations(t)
-	mockGenericHandle.AssertExpectations(t)


A mock built with mocks.NewMoc... already checks expectations before destroying the object.

dbos/workflow.go

kraftp · 2025-09-25T23:02:41Z

dbos/client.go

-		tx:     tx,
-	}
-	_, err = dbosCtx.systemDB.insertWorkflowStatus(uncancellableCtx, insertInput)
+	err := retry(dbosCtx, func() error {


Why do we need retries here? It's the client, it can just fail, there's no deeper risk

Not necessarily strictly speaking, but nice to have from a reliable library imo. Right now, all the client methods, which are wrapping the context methods, will retry.

Are you suggesting we should do the minimum amount of retries to avoid getting a corrupted Transact process (e.g., queue runner unusable), but should let the user handle connection errors on all other paths? E.g., cancel workflow.

It's not really a huge deal either way--retries in the client are excessive, but not unsafe (and the Python client retries)

I like the idea of doing less. We can easily add retries to client methods (enqueue, but also RetrieveWorkflow, CancelWorkflow, ResumeWorkflow, ListWorkflows, GetWorkflowSteps and ForkWorkflow) later if need be. The most important thing is to avoid having unusable-but-running Transact processes.

dbos/workflow.go

dbos/system_database.go

kraftp · 2025-09-26T00:54:42Z

chaos_tests/chaos_test.go

+	queue := dbos.NewWorkflowQueue(dbosCtx, "test_queue")
+
+	// Define step functions
+	stepOne := func(ctx dbos.DBOSContext, x int) (int, error) {


The other thing we could check is if a scheduled workflow keeps ticking after reconnection

kraftp

Looks good, important tests and fixes!

maxdml added 28 commits September 24, 2025 15:22

chaos tests

61ef375

cleanup errors

214ffd7

retrier

1885024

use retrier

74d8919

fix defer in test

7224dfc

Revert "cleanup errors"

0d03ad4

This reverts commit 9127d2c.

imports

f4121d5

more wrapping

7fb6a62

tests

35ff511

no need to install docker in gha

dd1c15e

cli readme @ latest

5f01217

missing folder

0500445

QueryEscape

941d9b3

on the password, and do not pass env var

d6c422e

no -race for chaos

6d39fa4

fix

be3829a

double escape

7874bc9

cleanup

f7f417f

fix

0c4bb5c

info

a7bdb98

cleanup

756eae8

support Unwrap in DBOSError + have newWorkflowExecutionError receive …

816f1df

…an error

handle pgx.ErrTxClosed as retryable but wrap entire RunWorkflow in a …

e79bf4c

…retry

explicit warning for pgx.ErrTxClosed

a5b9f5f

cleanup

60a36dc

remove postgres from gha

acd5524

start postgres before running the test

9085bee

CLI must escape urls too

8fd09ad

maxdml added 9 commits September 24, 2025 20:16

check interface before using logger -- fixes mock path

ecab120

handle 'container is marked for removal and cannot be started' and th…

da38b95

…e like

retoggle debug

7062c15

some clues

0adbbea

a long timeout

9aef64a

more timeout

34f0268

+1

c54c692

log starts too

6c527dc

specify persistent volume when creating postgres container

e61ae86

maxdml marked this pull request as ready for review September 25, 2025 15:33

manually handle locks in getEvent

569bc32

kraftp reviewed Sep 25, 2025

maxdml added 6 commits September 25, 2025 09:30

fix

74ffbe0

wrap tx only in retry in RunWorkflow

b87b0c7

client enqueue retry

7df56b6

should not use uncancellable contexts for retry itself

c875ec0

less chatty logs

8505f14

less chatty x2

d795789

kraftp reviewed Sep 25, 2025

move cancel function setup out of RunWorkflow transaction

6f9ec3e

kraftp reviewed Sep 26, 2025

kraftp approved these changes Sep 26, 2025

do less

dbf7b2c
