#827 get from db ports firstly #893
base: main
Conversation
I have checked this in k8s.
```python
if self.working_dir:
    self.working_dir = self._expand_user_properties(self.working_dir)
if self.port == 0:
    # Prefer reading the port from the persisted server record, if available
```
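The diff is truncated here; judging from the PR title and the inline comment, the change presumably continues along these lines (a sketch of the idea, not the literal patch):
```python
# Sketch only -- reconstructed from the PR title and the comment above,
# not the literal patch. Fall back to the port persisted in the Hub's db
# (self.server.port) when no explicit bind port is configured.
if self.port == 0:
    if self.server and self.server.port:
        self.port = self.server.port
```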
This would not usually be correct. The persisted server.port is the connect port, whereas self.port sets the bind port (should almost never have a value other than the default in kubespawner).
Plus, access to the db from Spawners is deprecated and discouraged, so I don't think we should add this.
Can you share more about what problem this aims to solve? Maybe the answer is somewhere else.
> Can you share more about what problem this aims to solve?
Yes, of course.
Context
We run Hadoop and Spark on YARN.
To simplify the deployment of a Spark driver in JupyterLab we set the following
KubeSpawner options:
```python
c.KubeSpawner.extra_pod_config = {
    "hostNetwork": True,
    "dnsPolicy": "ClusterFirstWithHostNet",
}
```
Situation
When we use a static port, only one Lab instance can be started per node,
which is insufficient for our workload.
This issue first appeared when we had 5 nodes serving 15 Hub users.
To work around the limitation we introduced a pre‑spawn hook that selects a
random port in the range 9000‑9299:
```python
import random

def my_pre_spawn_hook(spawner):
    """Choose a random port for the notebook server."""
    spawner.port = random.choice([9000 + i for i in range(300)])
```
The Hub remembers the chosen port and uses it for health-check requests, so
the solution works well, provided the Hub process stays alive.
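For completeness, the hook is wired into the Hub configuration with the standard pre_spawn_hook option:
```python
# jupyterhub_config.py -- register the hook so it runs before every spawn
c.KubeSpawner.pre_spawn_hook = my_pre_spawn_hook
```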
If the Hub pod restarts, it loses the mapping of Lab instances to their
assigned ports, and the health checks start failing.
We also tried falling back to the default port 8888 for the health check,
but when no response is received the Hub deletes the Lab that is running on
the random port, which is not the behavior we want.
Key points
- Problem: Static ports restrict us to a single Lab per node.
- Current workaround: Randomly assign a port in a pre-spawn hook and store it in the Hub for health checks.
- Failure mode: The mapping is lost when the Hub pod restarts, causing orphaned Lab pods or premature deletions.
Feel free to let me know if any part needs more detail!
Thanks for clarifying that this is about host networking. Do you see "url changed!" in your logs when the hub restarts?
I think you're right that it will do the wrong thing if .port is set by anything other than static config. Specifically, the code here assumes get_pod_url() is right (it uses self.port by default) and that the persisted db value in self.server is wrong, but in your case it is the opposite.
That code is really there to deal with cluster networking changes, so maybe we should either remove or reverse the port logic, leaving only the host? Either that or persist/restore self.port in load_state/get_state. I doubt persisting self.port is right, though. I'll need to think through some cases to know which is right. Removing the port check is the simplest and usually probably the right thing to do.
Actually, I think I know what would be simplest and most correct: add self.port to the pod manifest in the annotation hub.jupyter.org/port, then retrieve that in _get_pod_url instead of using self.port unconditionally. self.port can then be a fallback if undefined (e.g. across an upgrade).
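To make that concrete, here is a rough standalone sketch of the idea (the pod dict shape and helper names are illustrative assumptions, not kubespawner's actual internals):
```python
# A minimal sketch of the suggested approach, not a tested patch.
PORT_ANNOTATION = "hub.jupyter.org/port"


def annotate_pod_with_port(pod_manifest: dict, bind_port: int) -> dict:
    """Record the bind port on the pod itself at spawn time."""
    annotations = pod_manifest.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[PORT_ANNOTATION] = str(bind_port)
    return pod_manifest


def pod_url(pod: dict, configured_port: int) -> str:
    """Build the connect URL for a running pod.

    Prefer the port recorded on the pod when it was created; fall back to
    the currently configured port (e.g. for pods created before an upgrade).
    """
    annotations = pod.get("metadata", {}).get("annotations", {})
    port = int(annotations.get(PORT_ANNOTATION, configured_port))
    ip = pod.get("status", {}).get("podIP", "")
    return f"http://{ip}:{port}"
```
The annotation travels with the pod, so it survives Hub restarts without relying on the db.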
Do you want to have a go at tackling that? If not, I can probably do it.
> …is right (uses self.port by default) and the persisted db value in self.server is wrong, but in your case, it is the opposite.
Yes, it's persisted, but after the Hub restarts, the Hub clears the port value.
> That code is really there to deal with cluster networking changes, so maybe we should either remove or reverse the port logic, leaving only the host?
We can't drop the port. You need to know the port value for the liveness probe: http://<k8s_ip>:<port>/api/
Should I fix something to get this PR merged?
I debugged this situation.
Before:
- Hub is working
- user_1 starts a pod on random port 1234
- Hub persists port 1234 in the db
- user_1's pod is working
- Hub restarts
- Hub rewrites the port in the db to the default 8888, but the ip is still correct
- Hub sees user_1's pod
- Hub runs its livecheck against k8s_private_ip:8888/api
- Hub kills user_1's pod
After:
- Hub is working
- user_1 starts a pod on random port 1234
- Hub persists port 1234 in the db
- user_1's pod is working
- Hub restarts
- Hub reads the ip and port 1234 back from the db
- Hub sees user_1's pod
- Hub runs its livecheck against k8s_private_ip:1234/api
- user_1's pod keeps working -> user_1 is happy
> The meaning of self.port is confused here - it is not the port used to connect, it is the port used by the process to bind.
It is only used to get the previous config of a pod that was started earlier.
> The issue is the connect port is not persisted properly, and _get_pod_url always retrieves the self.port config, which can change, but should not be permitted to change while a pod is running.
I resolved the persistence in the db by changing it here.
> I suggested the fix of persisting self.port in the pod's annotations in get_pod_manifest and using the annotation in _get_pod_url, which I believe should solve the problem here. It's not that self.port can change, it's that self.port is used in _get_pod_url, since changing it is normal.
Why should I get it from get_pod_manifest if the Hub reads it from the db?
The problem is in get_pod_url using self.port instead of the actual port when the pod is running. Fixing that will fix the problem. Relying on deprecated db access will eventually break, and is not the right thing to do when the pod is not running. The fix is to persist the port in the pod manifest via the annotation, so get_pod_url gets the right value, and self.port config will still have the right effect rather than being overridden.
Another, smaller fix would be to replace the netloc check with only a hostname check, so that we don't rewrite the port. I'm not sure if there are situations where the port could change, but we know there are where the ip changes.
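For the smaller fix, the idea would be to compare only the hostname rather than the full netloc, roughly like this (illustrative helper, not the actual kubespawner code):
```python
from urllib.parse import urlparse

def url_host_changed(persisted_url: str, current_url: str) -> bool:
    """Treat the persisted URL as stale only when the host changes.

    The port is deliberately ignored, so a port that differs from the
    current config is left alone.
    """
    return urlparse(persisted_url).hostname != urlparse(current_url).hostname
```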
> The problem is in get_pod_url using self.port instead of the actual port when the pod is running.
Are you sure that will fix it?
What will we do if it doesn't?
> Are you sure that will fix it?
I believe it will
> What will we do if it doesn't?
Keep working to fix it
@minrk I understand, but I don't have time to refactor this.
It has been working in production for 2 months.
Many people are not investing time in JupyterHub because of this problem.
I have checked this in k8s.