**What did you find confusing? Please describe.**
I tried to extend the image by adding my code and running it locally. However, the server does not start correctly, and it does not publish any logs from our scripts.
Dockerfile:

```dockerfile
FROM 763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.10.0-cpu-py38

ENV SAGEMAKER_PROGRAM "my_amazing_entrypoint.py"
ENV SAGEMAKER_REGION "eu-west-1"
ENV SAGEMAKER_SUBMIT_DIRECTORY "/opt/ml/model/code"

WORKDIR "/opt/ml/model/"
COPY model_new.tar.gz "/opt/ml/model/model.tar.gz"
RUN tar -xf model.tar.gz
```
`model.tar.gz` contents (the layout implied by `SAGEMAKER_PROGRAM` and `SAGEMAKER_SUBMIT_DIRECTORY` above):

```
.
├── code/
│   ├── my_amazing_entrypoint.py
│   └── more_packages/
└── pytorch_model.pth
```
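For reference, a tarball with this layout can be produced with something like the following (a minimal sketch; the file names are the ones from the tree above):

```bash
# Package the inference code and model weights into the layout
# expected under /opt/ml/model by the SageMaker inference toolkit.
tar -czvf model_new.tar.gz code/ pytorch_model.pth
```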
Commands executed:

```bash
docker build -t pytorch-test .
docker run -ti pytorch-test
```
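For comparison, SageMaker itself launches hosting containers with the `serve` argument. A closer local approximation (an untested sketch; the port numbers are taken from the TorchServe log below) would be:

```bash
# Publish the inference and management ports and pass "serve",
# mirroring how SageMaker starts the container for hosting.
docker run -ti -p 8080:8080 -p 8081:8081 pytorch-test serve
```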
Output:

```
Warning: TorchServe is using non-default JVM parameters: -XX:+UseContainerSupport -XX:InitialRAMPercentage=8.0 -XX:MaxRAMPercentage=10.0 -XX:-UseLargePages -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-05-25T10:26:35,124 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-05-25T10:26:35,219 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.5.2
TS Home: /opt/conda/lib/python3.8/site-packages
Current directory: /opt/ml/model
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 8
Max heap size: 3166 M
Python executable: /opt/conda/bin/python3.8
Config file: /home/model-server/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/model-server
Initial Models: ALL
Log dir: /opt/ml/model/logs
Metrics dir: /opt/ml/model/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 8
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server
Model config: N/A
2022-05-25T10:26:35,225 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2022-05-25T10:26:35,247 [DEBUG] main org.pytorch.serve.ModelServer - Loading models from model store: tmp
2022-05-25T10:26:35,249 [WARN ] main org.pytorch.serve.ModelServer - Failed to load model: /home/model-server/tmp
org.pytorch.serve.archive.model.ModelNotFoundException: Model not found at: tmp
at org.pytorch.serve.archive.model.ModelArchive.downloadModel(ModelArchive.java:75) ~[model-server.jar:?]
at org.pytorch.serve.wlm.ModelManager.createModelArchive(ModelManager.java:167) ~[model-server.jar:?]
at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:133) ~[model-server.jar:?]
at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:69) ~[model-server.jar:?]
at org.pytorch.serve.ModelServer.initModelStore(ModelServer.java:194) [model-server.jar:?]
at org.pytorch.serve.ModelServer.startRESTserver(ModelServer.java:356) [model-server.jar:?]
at org.pytorch.serve.ModelServer.startAndWait(ModelServer.java:117) [model-server.jar:?]
at org.pytorch.serve.ModelServer.main(ModelServer.java:98) [model-server.jar:?]
2022-05-25T10:26:35,264 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-05-25T10:26:35,325 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2022-05-25T10:26:35,325 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-05-25T10:26:35,327 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2022-05-25T10:26:35,327 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-05-25T10:26:35,328 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-05-25T10:26:35,588 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,589 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:334.71700286865234|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,590 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:157.27763748168945|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,590 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:32.0|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,590 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:27624.3125|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,591 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3569.87109375|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
2022-05-25T10:26:35,591 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:d90f998d682a,timestamp:1653474395
```
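Note that even though model loading fails with `ModelNotFoundException`, the REST endpoints still bind, so the container can look healthy from the outside. With the ports published as in the sketch above, a quick health check from the host would be:

```bash
# TorchServe answers health checks on the inference port.
curl -s http://localhost:8080/ping
```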
**Describe how documentation can be improved**
It would be nice to add a section to the README (or similar) with an example of how to run the image/container on a local Docker installation; a rough idea of what that could look like is sketched below.
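As an illustration of what such a section might contain (a sketch only, assuming the usual SageMaker convention of mounting the extracted model under `/opt/ml/model`; the entry point name is the one from this issue):

```bash
# Run the stock DLC image directly, mounting the extracted model
# directory instead of baking it into a derived image.
docker run -ti \
  -p 8080:8080 \
  -v "$(pwd)/model:/opt/ml/model" \
  -e SAGEMAKER_PROGRAM=my_amazing_entrypoint.py \
  -e SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/model/code \
  763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.10.0-cpu-py38 \
  serve
```

The container could then be smoke-tested with `curl http://localhost:8080/ping` before deploying to SageMaker.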
**Additional context**
- This would improve the developer experience by shortening the trial-and-error loop (deploying to SageMaker takes a while).
- It would let us debug problems step by step.