Releases · neuralmagic/deepsparse
DeepSparse v1.0.0
New Features:
- Support added for running multiple models with the same engine when using the Elastic Scheduler.
- When using the Elastic Scheduler, the caller can now use the `num_streams` argument to tune the number of requests that are processed in parallel (see the sketch after this list).
- Pipeline and annotation support added and generalized for transformers, yolov5, and torchvision.
- Documentation additions made for transformers, yolov5, torchvision, and serving that focus on model deployment for the given integrations.
- AWS SageMaker example created.
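Below is a minimal sketch of the new `num_streams` argument together with the Elastic Scheduler, assuming the 1.0-era Python `Engine` API; the model path and parameter values are placeholders, and the exact signature may vary by release.

```python
# Minimal sketch (assumed API): compile a model with the elastic scheduler
# and tune parallelism via num_streams. "model.onnx" and the values shown
# are placeholders, not from the release notes.
from deepsparse import Engine, Scheduler

engine = Engine(
    model="model.onnx",           # path or SparseZoo stub of an ONNX model
    batch_size=1,
    num_streams=4,                # number of requests processed in parallel
    scheduler=Scheduler.elastic,  # opt in to the Elastic Scheduler
)

# inputs would be a list of numpy arrays matching the model's input shapes:
# outputs = engine.run(inputs)
```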
Changes:
- Click added as a root dependency; it is now the preferred route for CLI invocation and argument management.
Performance:
- Inference performance has been improved for unstructured sparse quantized models on AVX2 and AVX-512 systems that do not support VNNI instructions, including speedups of up to 20% on BERT and 45% on ResNet-50.
Resolved Issues:
- Potential crashes eliminated when a layer operates on a dataset larger than 2GB.
- Assertion error addressed for Reduce operations where the reduction axis is of length 1.
- Rare assertion failure addressed related to Tensor Columns.
- When running the DeepSparse Engine on a system with a non-uniform system topology, model compilation now properly terminates.
Known Issues:
- In rare cases, the engine may crash with an assertion failure during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides; hotfix forthcoming.
- The engine will crash with an assertion failure when setting the `num_streams` parameter to fewer than the number of NUMA nodes; hotfix forthcoming.
- In rare cases, the engine may enter an infinite loop when an operation has multiple inputs coming from the same source; hotfix forthcoming.
DeepSparse v0.12.2 Patch Release
This is a patch release for 0.12.0 that contains the following changes:
- Protobuf is restricted to version < 4.0 as the newer version breaks ONNX.
DeepSparse v0.12.1 Patch Release
This is a patch release for 0.12.0 that contains the following changes:
- Improper label mapping no longer causes crashes in validation flows within DeepSparse transformers.
- DeepSparse Server now exposes proper routes for SageMaker.
- Fixed a dependency issue where DeepSparse Server installed an old version of a library, causing crashes in some use cases.
DeepSparse v0.12.0
New Features:
Documentation:
- SparseServer.UI: a Streamlit app that deploys the DeepSparse Server to explore the inference performance of BERT on the question-answering task.
- DeepSparse Server README: `deepsparse.server` capabilities, including single-model and multi-model inferencing.
- Twitter NLP Inference Examples added.
Changes:
Performance:
- Speedup for large batch sizes when using sync mode on AMD EPYC processors.
- AVX2 improvements:
  - Up to 40% speedup out of the box for dense quantized models.
  - Up to 20% speedup for pruned quantized BERT, ResNet-50, and MobileNet.
  - Speedup from sparsity realized for ConvInteger operators.
- Model compilation time decreased on systems with many cores.
- Multi-stream Scheduler: certain computations that were executed during runtime are now precomputed.
- Hugging Face Transformers integration updated to latest state from upstream main branch.
Documentation:
- DeepSparse README: references to `deepsparse.server`, `deepsparse.benchmark`, and Transformer pipelines.
- DeepSparse Benchmark README: highlights of the `deepsparse.benchmark` CLI command.
- Transformers 🤗 Inference Pipelines: examples included on how to run inference via Python for several NLP tasks (see the sketch below).
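As an illustration of the pipeline examples these docs describe, here is a minimal sketch of question answering, assuming the `Pipeline.create` entry point from later releases; the model stub shown is a placeholder.

```python
# Minimal sketch (assumed API): question answering via a DeepSparse pipeline.
# The model stub is a placeholder; the entry point may differ in 0.12-era code.
from deepsparse import Pipeline

qa = Pipeline.create(
    task="question_answering",
    model_path="zoo:some/question-answering-stub",  # placeholder stub
)

print(qa(question="What runs the model?", context="DeepSparse runs the model."))
```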
Resolved Issues:
- When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine will no longer disable optimizations and see very poor performance.
- Users executing `arch.bin` now receive a correct architecture profile of their system.
Known Issues:
- When running the DeepSparse Engine on a system with a non-uniform system topology, for example, an AMD EPYC processor where some cores per core-complex (CCX) have been disabled, model compilation will never terminate. A workaround is to set the environment variable `NM_SERIAL_UNIT_GENERATION=1`.
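A minimal sketch of applying the workaround from Python, assuming the variable must be visible before deepsparse loads (exporting it in the shell works equally well):

```python
import os

# Workaround from the release notes: force serial unit generation.
# Set before importing deepsparse so the engine sees it at load time (assumed).
os.environ["NM_SERIAL_UNIT_GENERATION"] = "1"

import deepsparse  # noqa: E402
```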
DeepSparse v0.11.2 Patch Release
This is a patch release for 0.11.0 that contains the following changes:
- Fixed an assertion error that would occur when using `deepsparse.benchmark` on AMD machines with the argument `-pin none`.
Known Issues:
- When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine will disable optimizations and see very poor performance.
DeepSparse v0.11.1 Patch Release
This is a patch release for 0.11.0 that contains the following changes:
- When running NanoDet-Plus-m, the DeepSparse Engine will no longer fail with an assertion (See #279).
- The DeepSparse Engine now respects the CPU affinity set by the calling thread. This is essential for the new command-line (CLI) tool `multi-process-benchmark.py` to function correctly. The script allows users to measure performance using multiple separate processes in parallel.
- Fixed a performance regression on BERT batch size 1, sequence length 128 models.
DeepSparse v0.11.0
New Features:
- High-performance sparse quantized convolutional neural networks supported on AVX2 systems.
- CCX detection added to the DeepSparse Engine for AMD systems.
- `deepsparse.server` integration and CLIs added with Hugging Face transformers pipelines support.
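As a rough illustration of the server integration, the sketch below sends a request to an already-running `deepsparse.server` instance; the port, route, and payload shape are assumptions based on the server's defaults and may differ by version.

```python
# Hypothetical client sketch: assumes a deepsparse.server instance serving a
# question-answering pipeline on the default port 5543 with a /predict route.
# Route, port, and payload shape are assumptions, not from the release notes.
import requests

response = requests.post(
    "http://localhost:5543/predict",
    json={
        "question": "What does DeepSparse Server serve?",
        "context": "DeepSparse Server serves transformers pipelines.",
    },
)
print(response.json())
```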
Changes:
- Performance improvements made for:
  - FP32 sparse BERT models
  - batch size 1 networks
  - quantized sparse BERT models
  - pooling operations
Resolved Issues:
- When hyperthreads are disabled in the BIOS, core/socket information on certain systems can now be detected.
- Hugging Face transformers validation flows for QQP now give correct accuracy metrics.
- PyTorch downloads for YOLO model stubs are now supported.
Known Issues:
- When running NanoDet-Plus-m, the DeepSparse Engine will fail with an assertion (See #279). A hotfix is being pursued.
DeepSparse v0.10.0
New Features:
- Quantization support enabled on AVX2 instruction set for GEMM and elementwise operations.
- `NM_SPOOF_ARCH` environment variable added for testing different architectural configurations.
- Elastic scheduler implemented as an alternative to the single-stream or multi-stream schedulers.
- `deepsparse.benchmark` application is now usable from the command line after installing deepsparse, to simplify benchmarking.
- `deepsparse.server` CLI and API added with transformers support to make serving models like BERT with pipelines easy.
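A hedged sketch of using `NM_SPOOF_ARCH` for testing; the value shown and the `cpu_architecture` helper are assumptions rather than documented behavior:

```python
# Hypothetical sketch: spoof a different architecture before deepsparse loads,
# then inspect what the engine detects. "avx2" is a placeholder value, and
# the cpu_architecture() helper is an assumption; consult the engine docs.
import os
os.environ["NM_SPOOF_ARCH"] = "avx2"

from deepsparse.cpu import cpu_architecture  # noqa: E402
print(cpu_architecture())
```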
Changes:
- More robust architecture detection added to help resolve CPU topology, such as when running inside a virtual machine.
- Tensor columns improved, leading to significant speedups from 5 to 20% in pruned YOLO (larger batch size), BERT (smaller batch size), MobileNet, and ResNet models.
- Sparse quantized network performance improved on machines that do not support VNNI instructions.
- Performance improved for dense BERT with large batch sizes.
Resolved Issues:
- Possible crashes eliminated for:
  - Pooling operations with small image sizes
  - Rarely, networks containing convolution or GEMM operations
  - Some models with many residual connections
Known Issues:
- None
DeepSparse v0.9.1 Patch Release
This is a patch release for 0.9.0 that contains the following changes:
- YOLACT models and other models with constant outputs no longer fail with a mismatched shape error on multi-socket systems with batch sizes greater than 1. However, a corner case exists where a model with a constant output whose first dimension is equal to the (nonunit) batch size will fail.
- GEMM operations where the number of columns of the output matrix is not divisible by 16 will no longer fail with an assertion error.
- Broadcasted inputs to elementwise operators no longer fail with an assertion error.
- Int64 multiplications no longer fail with an illegal instruction on AVX2.
DeepSparse v0.9.0
New Features:
- Optimized support added for resize operators with the `pytorch_half_pixel` and `align_corners` coordinate transformations.
- Up-to-date version check implemented for DeepSparse.
- YOLACT and DeepSparse integration added in examples/dbolya-yolact.
Changes:
- The parameter for the number of sockets to use has been removed -- the Python interface now takes only the number of cores as a parameter (see the sketch after this list).
- Tensor columns have been optimized. Users will see performance improvements specifically for pruned quantized BERT models:
  - The softmax operator can now take advantage of tensor columns.
  - Inference batch sizes that are not divisible by 16 are now supported.
- Various performance improvements made to:
  - certain networks, such as YOLOv5, on AVX2 systems.
  - dense convolutions on some AVX-512 systems.
- API docs recompiled.
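A minimal sketch of the updated interface, assuming the long-standing `compile_model` entry point; the model path and core count are placeholders:

```python
# Minimal sketch (assumed entry point): the num_sockets parameter is gone;
# only the number of cores is passed. "model.onnx" and num_cores=4 are
# placeholders, not values from the release notes.
from deepsparse import compile_model

engine = compile_model("model.onnx", batch_size=1, num_cores=4)
```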
Resolved Issues:
- In rare circumstances, users could have experienced an assertion error when executing networks with depthwise convolutions.
Known Issues:
- YOLACT models fail with a mismatched shape error on multi-socket systems with batch size greater than 1. This issue applies to any model with a constant output.
- In some circumstances, GEMM operations where the number of columns of the output matrix is not divisible by 16 may fail with an assertion error.