
Conversation

ilia-db
Contributor

@ilia-db ilia-db commented Sep 3, 2025

Changes

Previous PRs in the stack:

Move from flat structure to cmd and internal client/server/proxy/keys/workspace sub packages.

Moved a number of hardcoded values to constants (some hardcoded values remain in the keys and workspace sub packages; they will be improved in a follow-up).

Added unit tests for the server's connection manager logic and updated the proxy tests.
The coverage isn't great yet; the goal is to avoid unit tests with excessive mocking and to focus on integration tests from now on.

Why

/experimental already hosts the databricks-bundles Python project, so it seems like a good place for the SSH tunnel code too.

Tests

Unit and manual

@eng-dev-ecosystem-bot
Collaborator

eng-dev-ecosystem-bot commented Sep 3, 2025

Run: 17575476240

| Env | ✅ pass | ❌ FAIL | 🔄 flaky | 🙈 skip |
| --- | ---: | ---: | ---: | ---: |
| ❌ aws linux | 305 | 1 | | 523 |
| ❌ aws windows | 304 | 1 | 2 | 522 |
| ✅ aws-ucws linux | 420 | | | 421 |
| ✅ aws-ucws windows | 421 | | | 420 |
| ✅ azure linux | 308 | | | 522 |
| ✅ azure windows | 309 | | | 521 |
| 🔄 azure-ucws linux | 413 | | 7 | 420 |
| ✅ azure-ucws windows | 421 | | | 419 |
| ✅ gcp linux | 307 | | | 524 |
| ✅ gcp windows | 308 | | | 523 |
10 failing tests:

| Test Name | aws linux | aws windows | azure-ucws linux |
| --- | --- | --- | --- |
| TestAccept/bundle/deploy/dashboard/generate_inplace | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/deploy/dashboard/nested-folders | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/deploy/dashboard/simple | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/deploy/dashboard/simple_outside_bundle_root | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/deploy/dashboard/simple_syncroot | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/deployment/bind/dashboard | ✅ pass | ✅ pass | 🔄 flaky |
| TestAccept/bundle/templates/default-python/combinations/classic/DATABRICKS_CLI_DEPLOYMENT=terraform/DLT=yes/NBOOK=no/PY=no | ✅ pass | 🔄 flaky | ✅ pass |
| TestAccept/bundle/templates/default-python/combinations/classic/DATABRICKS_CLI_DEPLOYMENT=terraform/DLT=yes/NBOOK=yes/PY=no | ✅ pass | 🔄 flaky | ✅ pass |
| TestDashboardAssumptions_WorkspaceImport | ✅ pass | ✅ pass | 🔄 flaky |
| TestFetchRepositoryInfoAPI_FromRepo | ❌ FAIL | ❌ FAIL | ✅ pass |

@ilia-db ilia-db mentioned this pull request Sep 3, 2025
Base automatically changed from ssh-tunnel-server to main September 8, 2025 14:05
Contributor

@pietern pietern left a comment


I haven't looked at the unit tests closely, but I worry that they can run for a long time or be flaky. A good practice is to include a context.Context everywhere and add a short-lived deadline to tear down everything it spawns.

```go
err := client.RunClient(ctx, wsClient, opts)
if err != nil && proxy.IsNormalClosure(err) {
	return nil
}
```
Contributor

The normal closure check was added here and removed from the server case.

Can they both be omitted (or hoisted into the "run" function)?

Contributor Author

Improved in #3569

```go
cmd.Flags().DurationVar(&shutdownDelay, "shutdown-delay", 10*time.Minute, "Delay before shutting down after no pings from clients")
cmd.Flags().StringVar(&secretsScope, "secrets-scope-name", "", "Databricks secrets scope name")
cmd.MarkFlagRequired("secrets-scope-name")
cmd.Flags().StringVar(&publicKeySecretName, "client-key-name", "", "Databricks client key name")
```
Contributor

Is this an actual key name or the secret name in the secret scope?

Contributor Author

Secret name in the scope - you have to combine it with secrets-scope-name to get the value.

The server uses the scope not only for downloading the client key value, though, but also for managing its own key pair.

```go
cmd.MarkFlagRequired("cluster")
cmd.Flags().IntVar(&maxClients, "max-clients", 10, "Maximum number of SSH clients")
cmd.Flags().DurationVar(&shutdownDelay, "shutdown-delay", 10*time.Minute, "Delay before shutting down after no pings from clients")
cmd.Flags().StringVar(&secretsScope, "secrets-scope-name", "", "Databricks secrets scope name")
```
Contributor

The product name is "secret scope" (singular).

```python
"server",
f"--cluster={ctx.clusterId}",
f"--secrets-scope-name={secrets_scope}",
f"--client-key-name={public_key_secret_name}",
```
Contributor

I recommend being more verbose to make it unambiguous what's being referred to here:

Idea:

  • --keypair-secret-scope-name
  • --keypair-secret-key-name

Or pubkey if the secret scope only stores the public key.

Contributor Author

The secret scope has 3 keys: the server's public and private keys, and the client's public key (to be added to AuthorizedKeysFile).

Maybe --keys-secret-scope-name and --authorized-key-secret-name?

Technically we don't even need to pass these options to the server, as it can re-create the names using the same constants and logic the client uses. Since the client enforces the same server version, that would be safe. It doesn't hurt to keep the server a bit more isolated, though.

Contributor

If the goal is to keep these isolated, then all of them can specify the scope name.

--server-pubkey-secret-name
--server-privkey-secret-name
--client-pubkey-secret-name

With the format being scope/key, like https://docs.databricks.com/aws/en/security/secrets/secrets-spark-conf-env-var#reference-a-secret-with-a-spark-configuration-property.

That removes all implicits.

```go
ClientPublicKeyName: defaultClientPublicKeyName,
ServerTimeout: serverTimeout,
}
err := client.RunClient(ctx, wsClient, opts)
```
Contributor

Can be shortened to client.Run.

Contributor Author

Changed it in #3569

```go
PortRange: serverPortRange,
}
return err
return server.RunServer(ctx, client, opts)
```
Contributor

Can be shortened to server.Run.

Contributor

The client variable here is unfortunate.

Common variable names for the workspace client include w or wsc.

Contributor Author

Changed it in #3569

```go
shutdownTimerMu sync.Mutex
connections map[string]*proxy.ProxyConnection
connectionsMu sync.Mutex
TimedOut chan bool
```
Contributor

Why exported?

Contributor Author

The internal/server package listens on this channel (and terminates the process).

Contributor Author

Oh wait, it indeed doesn't make sense in this PR. But in the follow-up I've moved connections into internal/proxy, so the export is needed there.

Contributor

Wasn't this implemented in the previous PR?

Contributor Author

It was; git didn't register this as a rename for some reason. The main thing I've changed is removing the databricks_ssh_key env var, since the secrets logic was moved into the server itself.

```go
SSHDConfigPath string
}

func newSSHHandler(ctx context.Context, connections *connectionsManager, sshDConfigPath string) *sshHandler {
```
Contributor

Can be a separate PR:

There is mixed casing for ssh/sshd. You can stick to just lowercase and only upcase the first char when it's the second word in an unexported variable, or the first in an exported variable.

Contributor Author

In the follow-up, the pure http.Handler part of the logic was moved to internal/proxy, and the sshD* mixed case is fixed there.

If you mean that I sometimes use ssh and sometimes SSH - that seems to be in line with Go camel case + acronym style? And the real reason is that Ssh throws me off :-)

Contributor

Also sshD, and the newSSHHandler also looks weird because of the double H.

They can be just sshd (referring to the process name), and newSshHandler to stick to Go patterns.

Or alternatively, move everything into an ssh package and call it handler, configPath, etc.

Move from flat structure to cmd and internal client/server/proxy/keys/workspace sub packages
@ilia-db
Contributor Author

ilia-db commented Sep 9, 2025

I haven't looked at the unit tests closely, but I worry that they can run for a long time or be flaky. A good practice is to include a context.Context everywhere and add a short-lived deadline to tear down everything it spawns.

Moved the tests to use the new testing/synctest standard library package in the follow-up PR #3569.


@ilia-db ilia-db added this pull request to the merge queue Sep 10, 2025
Merged via the queue into main with commit c2e3322 Sep 10, 2025
13 checks passed
@ilia-db ilia-db deleted the ssh-tunnel-exp branch September 10, 2025 10:38
ilia-db added a commit that referenced this pull request Sep 16, 2025
…3569)

## Changes
Based on #3544

Move pure client and server logic to the proxy package and test it. 

Now `proxyConnection` is completely private, and `proxy` package exposes
only 2 public functions:
- `RunClientProxy` function
- `NewProxyServer` constructor

Meanwhile the internal logic doesn't depend on any ssh stuff and can be
unit tested.

`internal/client` and `internal/server` packages are now much smaller
and contain mainly ssh/jobs/cluster glue.

Also adding `bench.sh` until we have proper e2e tests.

## Why
Tests

## Tests
Unit and manual tests
