Open
Description
Describe the bug
The container fails at startup with a psutil ZombieProcess exception, as already reported for the sagemaker-inference-toolkit.
To reproduce
Running a batch transform with the image 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker and a custom inference script triggers the error. Even a simple initial time.sleep(60) in the inference.py script is enough to trigger it.
A custom requirements.txt file also needs to be provided alongside the custom inference script.
Here is the full traceback:
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
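The traceback shows the exception escaping from a `process.cmdline()` call made while scanning for the TorchServe process. I have not checked what the rest of `_retrieve_ts_server_process` looks like, so the following is only a rough sketch of one possible defensive scan, not the container's actual code. The function name `find_process_by_cmdline` and its parameters are hypothetical. It is written against any iterable of objects exposing a `cmdline()` method, so that it stays independent of psutil itself.

```python
from typing import Any, Iterable, Optional, Tuple, Type


def find_process_by_cmdline(
    processes: Iterable[Any],
    needle: str,
    skip_errors: Tuple[Type[BaseException], ...],
) -> Optional[Any]:
    """Return the first process whose command line contains `needle`.

    Hypothetical helper: processes whose cmdline() raises one of
    `skip_errors` (for example a zombie process) are skipped instead of
    aborting the whole scan.
    """
    for process in processes:
        try:
            cmdline = process.cmdline()
        except skip_errors:
            # This process cannot be inspected; move on to the next one.
            continue
        if needle in cmdline:
            return process
    # No inspectable process matched.
    return None
```

With psutil, this could presumably be called as `find_process_by_cmdline(psutil.process_iter(), TS_NAMESPACE, (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied))`, where `TS_NAMESPACE` is the constant referenced in the traceback. Whether skipping zombies is the right behaviour for the container is a question for the maintainers.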
System information
A description of your system. Please provide:
- Sagemaker model image: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker
- Sagemaker model mode: single-mode
- Batch-transform instance type: ml.g4dn.2xlarge
- Batch-transform Invocation timeout in seconds: 600