
Conversation

@BioQwer commented Aug 26, 2025

I have checked this in k8s:

  1. the hub is running
  2. a lab is created
  3. the hub is rebooted
  4. the lab is still working

The review thread below is attached to these lines of the diff:
if self.working_dir:
    self.working_dir = self._expand_user_properties(self.working_dir)
if self.port == 0:
    # Prefer reading the port from the persisted server record, if available
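
The lines added by the PR itself are not visible in this excerpt; judging from the comment above and the discussion below, they presumably read roughly like the following sketch (an assumption, not the actual diff):

if self.port == 0:
    # Prefer reading the port from the persisted server record, if available
    if self.server and self.server.port:
        self.port = self.server.port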
Member
This would not usually be correct. The persisted server.port is the connect port, whereas self.port sets the bind port (should almost never have a value other than the default in kubespawner).

Plus, access to the db from Spawners is deprecated and discouraged, so I don't think we should add this.

Can you share more about what problem this aims to solve? Maybe the answer is somewhere else.

Author
@minrk

Can you share more about what problem this aims to solve?

Yes, of course.

Context

We run Hadoop and Spark on YARN.
To simplify the deployment of a Spark driver in JupyterLab we set the following
KubeSpawner options:

c.KubeSpawner.extra_pod_config = {
    # run the Lab pod in the node's network namespace so the Spark driver
    # is reachable from the YARN cluster
    "hostNetwork": True,
    # the DNS policy Kubernetes recommends when hostNetwork is enabled
    "dnsPolicy": "ClusterFirstWithHostNet",
}

Screenshot

Situation

When we use a static port, only one Lab instance can be started per node,
which is insufficient for our workload.
This issue first appeared when we had 5 nodes serving 15 Hub users.

To work around the limitation we introduced a pre‑spawn hook that selects a
random port in the range 9000‑9299:

import random

def my_pre_spawn_hook(spawner):
    """Choose a random port for the notebook server."""
    spawner.port = random.choice([9000 + i for i in range(300)])
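
The hook is registered with the standard pre_spawn_hook option; the exact wiring in jupyterhub_config.py shown here is a sketch:

c.KubeSpawner.pre_spawn_hook = my_pre_spawn_hook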

The Hub remembers the chosen port and uses it for health‑check requests, so
the solution works well—provided the Hub process stays alive.

If the Hub pod restarts, it loses the mapping of Lab instances to their
assigned ports, and the health checks start failing.
We also tried falling back to the default port 8888 for the health check,
but when no response is received the Hub deletes the Lab that is running on
the random port, which is not the behavior we want.


Key points

  • Problem: Static ports restrict us to a single Lab per node.
  • Current workaround: Randomly assign a port in a pre‑spawn hook and store
    it in the Hub for health checks.
  • Failure mode: The mapping is lost when the Hub pod restarts, causing
    orphaned Lab pods or premature deletions.

Feel free to let me know if any part needs more detail!

Member
Thanks for clarifying that this is about host networking. Do you see "url changed!" in your logs when the hub restarts?

I think you're right that it will do the wrong thing if .port is specified from anything other than static config, specifically here it assumes get_pod_url() is right (uses self.port by default) and the persisted db value in self.server is wrong, but in your case, it is the opposite.

That code is really there to deal with cluster networking changes, so maybe we should either remove or reverse the port logic, leaving only the host? Either that or persist/restore self.port in load_state/get_state. I doubt persisting self.port is right, though. I'll need to think through some cases to know which is right. Removing the port check is the simplest and usually probably the right thing to do.
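
For reference, the load_state/get_state option mentioned above would look roughly like the sketch below, using JupyterHub's standard Spawner state hooks; the class name is illustrative, and the comment above already doubts this is the right fix:

from kubespawner import KubeSpawner

class PortPersistingSpawner(KubeSpawner):
    def get_state(self):
        state = super().get_state()
        # remember the (possibly random) bind port across hub restarts
        state["port"] = self.port
        return state

    def load_state(self, state):
        super().load_state(state)
        if "port" in state:
            self.port = state["port"]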

Member
Actually, I think I know what would be simplest and most correct: add self.port to the pod manifest in the annotation hub.jupyter.org/port, then retrieve that in _get_pod_url instead of using self.port unconditionally. self.port can then be a fallback if undefined (e.g. across an upgrade).

Do you want to have a go at tackling that? If not, I can probably do it.
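
For concreteness, the annotation approach could look roughly like the helpers below. This is a sketch only, not kubespawner's actual get_pod_manifest/_get_pod_url code; the helper names and dict layout are assumptions:

PORT_ANNOTATION = "hub.jupyter.org/port"

def annotate_port(pod_manifest: dict, port: int) -> dict:
    """Record the spawner's bind port on the pod so it survives hub restarts."""
    metadata = pod_manifest.setdefault("metadata", {})
    metadata.setdefault("annotations", {})[PORT_ANNOTATION] = str(port)
    return pod_manifest

def pod_url(pod: dict, fallback_port: int) -> str:
    """Build the connect URL from the running pod, preferring the annotated port."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    port = int(annotations.get(PORT_ANNOTATION, fallback_port))
    return f"http://{pod['status']['podIP']}:{port}"

A pod created before the annotation existed simply has no hub.jupyter.org/port annotation, so the lookup falls back to self.port, which covers the upgrade case mentioned above.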

Author
is right (uses self.port by default) and the persisted db value in self.server is wrong, but in your case, it is the opposite.

Yes, it is persisted, but after the hub restarts, the hub clears the port value.

That code is really there to deal with cluster networking changes, so maybe we should either remove or reverse the port logic, leaving only the host?

We can't drop the port entirely.
The hub has to know the port value for the liveness probe:
http://<k8s_ip>:<port>/api/

Should I fix anything to get this PR merged?

Author
I debugged this situation.

before

  1. the hub is running
  2. user_1 starts a pod on random port 1234
  3. the hub persists port 1234 in the db
  4. user_1's pod is running
  5. the hub restarts
  6. the hub rewrites the port in the db to the default 8888, but the ip stays correct
  7. the hub sees user_1's pod
  8. the hub runs its liveness check against k8s_private_ip:8888/api
  9. the hub kills user_1's pod

after

  1. the hub is running
  2. user_1 starts a pod at random port 1234
  3. the hub persists port 1234 in the db
  4. user_1's pod is running
  5. the hub restarts
  6. the hub reads the ip and port 1234 back from the db
  7. the hub sees user_1's pod
  8. the hub runs its liveness check against k8s_private_ip:1234/api
  9. user_1's pod keeps running -> user_1 is happy

The meaning of self.port is confused here - it is not the port used to connect, it is the port used by the process to bind.

It is only used to recover the previous configuration of an already-started pod.

The issue is the connect port is not persisted properly, and _get_pod_url always retrieves the self.port config, which can change, but should not be permitted to change while a pod is running.

I resolved that by persisting it in the db, with the change here.

I suggested the fix of persisting self.port in the pod's annotations in get_pod_manifest and using the annotation in _get_pod_url, which I believe should solve the problem here. It's not that self.port can change, it's that self.port is used in _get_pod_url, since changing it is normal.

Why should I get it from get_pod_manifest if the hub is reading it from the db?

Member

@minrk commented Sep 6, 2025
The problem is in get_pod_url using self.port instead of the actual port when the pod is running. Fixing that will fix the problem. Relying on deprecated db access will eventually break, and is not the right thing to do when the pod is not running. The fix is to persist the port in the pod manifest via the annotation, so get_pod_url gets the right value, and self.port config will still have the right effect rather than being overridden.

Another, smaller fix would be to replace the netloc check with only a hostname check, so that we don't rewrite the port. I'm not sure if there are situations where the port could change, but we know there are where the ip changes.
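
For reference, the smaller fix described above (comparing only the hostname rather than the full netloc) could look roughly like this; the function name is illustrative, not kubespawner's actual code:

from urllib.parse import urlparse

def host_changed(persisted_url: str, pod_url: str) -> bool:
    # compare hostnames only, ignoring the port, so a non-default bind port
    # does not make the hub think the server's URL has changed
    return urlparse(persisted_url).hostname != urlparse(pod_url).hostname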

Author
The problem is in get_pod_url using self.port instead of the actual port when the pod is running.

Are you sure that this will fix it?
What will we do if it is not fixed?

Member
Are you sure that this will fix it?

I believe it will

What will we do if it is not fixed?

Keep working to fix it

Author
@minrk I understand, but I don't have time to refactor this.
It has been working in production for 2 months.
Many people are not investing time in JupyterHub because they hit this problem.
