This README shows how to run latency measurements on NVIDIA Jetson AGX Orin.
The measurement server is based on:

- `trtexec` — a standard TensorRT component that can measure inference time,
- ENOT Latency Server — a small open-source package that provides a simple API for latency measurement.
The repository code was tested on Python 3.8.
To install the required packages, run the following command:

```shell
pip install -r requirements.txt
```

Run the measurement server on Jetson:

```shell
python tools/server.py
```

The server gets a model in the ONNX format and measures its latency using trtexec:
```shell
<trtexec_path> \
    --onnx=<onnx_model_path> \
    --warmUp=<warmup> \
    --iterations=<iterations> \
    --avgRuns=<avgruns> \
    --noDataTransfers \
    --useSpinWait \
    --useCudaGraph \
    --separateProfileRun \
    --percentile=95 \
    --fp16
```

NOTE: If you pass a model with `QuantizeLinear` and `DequantizeLinear` layers to the latency server, an engine with INT8 kernels will be automatically created.
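If you are unsure whether a model will trigger the INT8 path, you can scan its graph for Q/DQ nodes. A minimal sketch (`has_quantize_dequantize` is an illustrative helper, not part of this repository; it takes a plain list of op-type strings, e.g. `[n.op_type for n in onnx.load(path).graph.node]`, so the `onnx` package itself is not required here):

```python
def has_quantize_dequantize(op_types):
    """Return True if the graph contains both QuantizeLinear and
    DequantizeLinear nodes, i.e. trtexec will build INT8 kernels."""
    ops = set(op_types)
    return {"QuantizeLinear", "DequantizeLinear"} <= ops
```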
We get stable results with the following parameter values (default values for our measurements):
- `warmUp`: `10000` (10 sec)
- `iterations`: `10000`
- `avgRuns`: `100`
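With these defaults, the command line the server passes to trtexec can be sketched as below (`build_trtexec_cmd` is a hypothetical helper, not the server's real code, and the trtexec path in the usage line is only an example):

```python
def build_trtexec_cmd(trtexec_path, onnx_path,
                      warmup=10000, iterations=10000, avg_runs=100):
    # Mirror the trtexec flag template above, filled in with the
    # default measurement parameters from this README.
    return [
        trtexec_path,
        f"--onnx={onnx_path}",
        f"--warmUp={warmup}",          # warm-up period in ms (10000 ms = 10 s)
        f"--iterations={iterations}",  # run at least this many timed iterations
        f"--avgRuns={avg_runs}",       # iterations averaged per latency report
        "--noDataTransfers",
        "--useSpinWait",
        "--useCudaGraph",
        "--separateProfileRun",
        "--percentile=95",
        "--fp16",
    ]

cmd = build_trtexec_cmd("/usr/src/tensorrt/bin/trtexec", "model.onnx")
```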
Parameter values can be checked with the following command:

```shell
python tools/server.py --help
```

To measure latency, use the following command:

```shell
python tools/measure.py --model-onnx=model.onnx
```

If you are running the client (the `tools/measure.py` script) on another computer, please install the necessary packages first and then specify the server address using the `--host` and `--port` arguments:
- run `tools/server.py` on a target device (NVIDIA Jetson AGX Orin),
- run `tools/measure.py` with the specified server address.
TensorRT sometimes builds an FP32 engine even if the `--fp16` flag is passed to trtexec, which affects the measurement results (issue).
To make sure that the engine is correct, we compare its size with a reference size: the FP32 engine size, or the ONNX model size if `--compare-with-onnx` is passed.
If the size of the built engine is too large, it is incorrect, and we automatically rebuild it.
The measurement script uses 1.5 as the default threshold on the reference size / current engine size ratio (this value can be changed with the `--threshold` option).
The latency server tries to build a correct engine up to `--n-trials` times (20 by default), until reference size / current engine size becomes higher than the threshold.
If trtexec fails to create a correct engine in `n_trials` attempts, the latency server returns `None` as the model latency.
If you want to know the actual reference size / current engine size ratio, use `--verbosity-level=1`.
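The rebuild logic described above can be sketched as follows (`measure_with_rebuild` and `build_engine` are illustrative names, not the server's real API; `build_engine` stands in for one trtexec build-and-measure run):

```python
def measure_with_rebuild(build_engine, reference_size, threshold=1.5, n_trials=20):
    """Rebuild until reference_size / engine_size exceeds `threshold`.

    `build_engine` is a callable returning (engine_size, latency).
    Returns the latency of the first correct engine, or None if all
    n_trials builds produce an oversized (likely FP32) engine.
    """
    for _ in range(n_trials):
        engine_size, latency = build_engine()
        if reference_size / engine_size > threshold:
            # Engine is small enough relative to the reference:
            # FP16/INT8 kernels were actually used.
            return latency
    return None  # trtexec never produced a correct engine
```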