
What’s New in the OpenVINO™ Model Server: C++ Implementation and More

MaryT_Intel
Employee

Published on February 3, 2020

Key Takeaways

  • First released in 2018 and originally implemented in Python, the OpenVINO™ model server introduced efficient execution and deployment for inference using the Intel® Distribution of OpenVINO™ toolkit.
  • The new release of the OpenVINO™ model server, version 2021.1, is implemented in C++, delivering scalability and significantly higher throughput without compromising latency.
  • The primary contributor to latency in AI inferencing is backend inference processing. The OpenVINO™ model server simplifies deployment and application design, and it does so without degrading execution efficiency.

Introduction

OpenVINO™ model server was first introduced in 2018. Originally implemented in Python, OpenVINO™ model server was praised for efficient execution by employing the Intel® Distribution of OpenVINO™ toolkit Inference Engine as a backend. Adoption was trivial for TensorFlow Serving (commonly known as TFServing) users, as OpenVINO™ model server leverages the same gRPC and REST APIs used by TFServing. OpenVINO™ model server made it possible to take advantage of the latest optimizations in Intel CPUs and AI accelerators without having to write custom code.
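Because the APIs match TFServing, an existing TFServing gRPC client can be pointed at OpenVINO™ model server without code changes. Below is a minimal sketch of such a client in Python; the server address, model name ("resnet"), tensor names ("data", "prob"), and NCHW input shape are assumptions that depend on how your model was converted and served.

```python
# Minimal sketch of a TFServing-style gRPC Predict call against the model
# server. Model name, tensor names, input layout, and address are
# placeholders; adjust them to your deployment.
import grpc
import numpy as np
from tensorflow import make_tensor_proto, make_ndarray
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "resnet"
request.inputs["data"].CopyFrom(
    make_tensor_proto(np.zeros((1, 3, 224, 224), dtype=np.float32)))

response = stub.Predict(request, 10.0)  # 10-second timeout
print(make_ndarray(response.outputs["prob"]).shape)
```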

However, with increasingly efficient AI algorithms, additional hardware capacity, and advances in low precision inference, the Python implementation became insufficient for front-end scalability. With the latest release, we addressed this gap by introducing the next generation of OpenVINO™ model server, version 2021.1, which is implemented in C++. The general architecture of the newest 2021.1 OpenVINO™ model server version is presented in Figure 1. 

Figure 1. OpenVINO™ model server high-level architecture.

Later in this post, we describe improvements related to execution efficiency and the new features introduced in version 2021.1. Performance results are analyzed for specific configurations, especially for increasing concurrency.

Materials and Methods

Server CPU: 2x Intel® Xeon® Gold 6248 CPU @ 2.50GHz
CPU cores (per socket / total): 2 x 20 / 40
Memory (RAM): 12 x 16 GB DDR4 @ 2933 MHz
Network (speed / mode): 40 Gb / 40GBASE-CR4, full duplex
Model name: ResNet50 v1
Model source framework: TensorFlow
Precisions: Floating point 32 (FP32), Integer 8 (INT8)
Batch size: 1
gRPC workers: 16
OpenVINO™ plugin config (CPU_THROUGHPUT_STREAMS): 1, 10, or 40, depending on total CPU core count and concurrency
Size of the request queue for inference execution (NIREQ): concurrency (number of clients) + 2

Measurement Methodology

The measurement estimates throughput and latency in a client-server architecture. The server and client platforms are connected by an isolated network to minimize any distorting impact on both the request stream sent to the server and the response stream sent to the client. HAProxy, a TCP load balancer, is the main measurement component used to collect results. HAProxy and the serving component always run as Docker containers on different physical machines, creating a scenario as close as possible to a typical deployment.

Any measurement setup consists of the following:

  • OpenVINO™ model server: a single serving component, the object under investigation, launched on the server platform.
  • HAProxy, launched on the client platform: all transmitted data (in both the up-link and down-link streams) is forwarded through HAProxy, which is the main measuring component.
  • One or more parallel clients launched on the client platform, which is a different physical machine than the server platform. A simplified sketch of this parallel-client loop is shown after the list.
  • A controller, launched on the client platform, which starts all components, downloads data from HAProxy, and presents the final metrics.
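The following is a rough Python sketch of the parallel-client idea, not the exact harness used for the published results: a fixed number of client threads issue requests for a set duration, and client-side latency and throughput are summarized afterwards. The send_request() helper is a placeholder for a single inference call, such as the gRPC request shown earlier.

```python
# Rough sketch of a parallel-client load loop (not the exact measurement
# harness): CONCURRENCY threads issue requests for DURATION_S seconds,
# then client-side latency and throughput are summarized.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

DURATION_S = 30
CONCURRENCY = 10

def send_request():
    # Placeholder for one inference call (e.g. the gRPC Predict shown earlier);
    # replaced with a sleep so the sketch runs standalone.
    time.sleep(0.01)

def client_loop():
    latencies = []
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        start = time.time()
        send_request()
        latencies.append(time.time() - start)
    return latencies

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    futures = [pool.submit(client_loop) for _ in range(CONCURRENCY)]
    per_client = [f.result() for f in futures]

latencies = [lat for client in per_client for lat in client]
print(f"throughput: {len(latencies) / DURATION_S:.1f} req/s")
print(f"mean latency: {statistics.mean(latencies) * 1000:.1f} ms")
```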

Performance Results

OpenVINO™ model server 2021.1 is implemented in C++ to achieve high performance inference. We kept the following principles in mind when designing the architecture:

  • maximum throughput on a single instance
  • minimal load overhead over inference execution in the backend
  • minimal impact on latency

Figures 2 and 3 compare throughput and latency as functions of concurrency (the number of parallel clients) for both OpenVINO™ model server versions: 2020.4 (implemented in Python) and the new 2021.1 (implemented in C++). Figure 4 plots throughput against latency, with the workload intensity controlled by the number of parallel clients. All results were obtained using the best-known configurations of the OpenVINO™ toolkit and OpenVINO™ model server (read more about this in the documentation), in particular by setting the following parameters (a launch sketch using these parameters follows the list):

  • Size of the request queue for inference execution - NIREQ
  • Plugin config – CPU_THROUGHPUT_STREAMS
  • gRPC workers
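The sketch below shows one way these parameters might be passed when launching the server container. The flag names (--nireq, --plugin_config, --grpc_workers) follow the public OpenVINO™ model server documentation and should be verified against the release you deploy; the image tag, model name, ports, and paths are placeholders.

```python
# Sketch of launching the server container with the tuning parameters above
# set explicitly. Flag names follow the public model server documentation;
# verify them against your release. Paths, ports, and model name are
# placeholders.
import subprocess

cmd = [
    "docker", "run", "-d", "--rm",
    "-p", "9000:9000",
    "-v", "/opt/models:/models",
    "openvino/model_server:latest",
    "--model_name", "resnet",
    "--model_path", "/models/resnet",
    "--port", "9000",
    "--nireq", "12",                                   # request queue size (concurrency + 2)
    "--plugin_config", '{"CPU_THROUGHPUT_STREAMS": "10"}',
    "--grpc_workers", "16",
]
subprocess.run(cmd, check=True)
```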

Figure 2. Throughput results (higher is better) for both OpenVINO™ model server versions, versus concurrency measured as the number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details in the Materials and Methods section above.


Figure 3. Latency results (lower is better) for both OpenVINO™ model server versions, versus concurrency measured as the number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details in the Materials and Methods section above.


Figure 4. Throughput results (higher is better) versus latency (lower is better) for both OpenVINO™ model server versions; each point corresponds to a different number of parallel streams of requests. Collected for ResNet50 FP32 and ResNet50 INT8 models with batch size 1. Full configuration details in the Materials and Methods section above.

While the Python version performs well at lower concurrency, the biggest advantage of the C++ implementation is scalability. With the C++ version, it is possible to achieve a throughput of 1,600 fps without any increase in latency, a 3x improvement over the Python version.

OpenVINO™ model server can also be tuned for a single stream of requests, allocating all available resources to a single inference request. The table in Figure 5 shows response latency from a remote client. The chart visualizes the latency of each processing step for a ResNet50 model quantized to 8-bit precision.


Figure 5. Latency contributions of the key processing steps for a ResNet50 INT8 model served as a single stream of inference to a remote client. Measurements are an average of 10 runs.

With the OpenVINO™ model server C++ implementation, the service frontend has minimal impact on latency. Data serialization and deserialization are reduced to a negligible amount thanks to the no-copy design. Network communication from a remote host adds only 1.7 ms of latency[1], even though the request message size is about 0.6 MB for the ResNet50 model.

All in all, even for very fast AI models, the primary contributor to inference latency is backend processing. OpenVINO™ model server simplifies deployment and application design without degrading efficiency.

In addition to Intel® CPUs, OpenVINO™ model server supports a range of AI accelerators: HDDL (for Intel® Vision Accelerator Design with Intel® Movidius™ VPU and Intel® Arria™ 10 FPGAs), Intel® NCS (for the Intel® Neural Compute Stick), and iGPU (for integrated GPUs). The latest Intel® Xeon® processors support the BFloat16 data type to achieve the best performance.

Reduced Footprint

Docker Image Size

An important element of the footprint is the container image size. The Python version required several external dependencies, resulting in an image size that ranged from 1.4 GB to 2.6 GB, depending on the base image. The new C++ implementation reduces the number of dependencies, resulting in a much smaller image: about 400 MB for the default version with CPU, Intel® NCS, and HDDL support, up to about 830 MB for the version that includes iGPU support.

Memory Consumption

Memory usage is also greatly reduced after switching to the new version. The 2021.1 version allocates RAM based on the model size, the number of streams, and other configuration parameters, and the initial memory allocation is smaller.

Figure 6 below shows Resident Set Size (RSS) memory consumption captured by the command “ps -o rss,vsz,pid” while serving a ResNet50 binary model. As you can see, minimal RAM allocation is required while serving models with OpenVINO™ model server.
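The snippet below is a small sketch of how such RSS readings can be collected from Python using the same ps fields mentioned above; the PID passed in is a placeholder (os.getpid() is used only so the sketch runs standalone).

```python
# Small sketch of collecting RSS via ps, matching the fields used above.
# Replace os.getpid() with the PID of the model server process.
import os
import subprocess

def rss_mb(pid: int) -> float:
    """Resident set size of a process in MB, read via ps (rss is reported in KB)."""
    out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)], text=True)
    return int(out.strip()) / 1024.0

print(f"RSS: {rss_mb(os.getpid()):.1f} MB")
```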


Figure 6. RSS memory consumption when serving a ResNet50 binary model using both the Python and C++ versions of OpenVINO Model Server.

Usage Improvements

Online Model Updates

The new 2021.1 version checks for changes to the configuration file and reloads models automatically without any interruption to the service. There is no need to restart the service when adding new model(s) to the configuration file or when making any other updates.
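As an illustration, the snippet below adds a model entry to a configuration file that the server watches; the running service reloads the configuration without a restart. The JSON layout follows the multi-model configuration described in the documentation, and the file path and model names are placeholders.

```python
# Sketch of adding a model to the configuration file watched by the server;
# the running service reloads the configuration without a restart.
# File path and model names are placeholders.
import json

CONFIG_PATH = "/models/config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

config["model_config_list"].append(
    {"config": {"name": "new_model", "base_path": "/models/new_model"}})

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```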

Azure Blob Storage

When deploying OpenVINO™ model server in the cloud, on premises, or at the edge, you can host your models with a range of remote storage providers. In addition to AWS S3, MinIO, and Google Cloud Storage, we recently added support for Azure Blob Storage. Now, you can simply point to a model path like az://container/model/ and set an environment variable with your Azure storage connection string. The model(s) will be downloaded from the remote storage and served.

Updated Helm Chart

OpenVINO™ model server is easy to deploy in Kubernetes. Get started quickly using the Helm chart, which works in both cloud and on-premises infrastructure.

Simplified Deployment

Deploying on bare-metal is now even easier. Simply unpack the OpenVINO™ model server package to start using the service. See the documentation for more details.

Deploying in Docker containers is now easier as well. The startup command options have been simplified and an `entrypoint` has been added to the Docker image. Starting the container requires just the arguments that define the model(s) (model name and model path), with optional serving configuration. Check the release notes to learn more.

Preview: Pipelines

In many real-life applications, there is a need to answer AI-related questions by calling multiple existing models in a specific sequence. Each model's response may also require various transformations before being used in another model. In production deployments, multiple separate requests increase network load, leading to higher latency and reduced efficiency. OpenVINO™ model server addresses this by introducing a Directed Acyclic Graph (DAG) of processing nodes executed for a single client request.

In version 2021.1, we include a preview of this feature. With the preview, it is possible to create an arbitrary sequence of models, provided that the outputs and inputs of the connected models match each other without any additional data transformations. A practical example of such a pipeline is depicted in the diagram below.


This pipeline sends a single request from the client to multiple distinct models for inference. The prediction results from each model are passed to argmax, which calculates the most likely classification based on the combined probabilities.
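The NumPy snippet below illustrates, on the client side, what the pipeline's argmax node computes from the individual model outputs; with the pipeline, this combination happens inside the server within a single request. The probability values are made up for the example.

```python
# Client-side illustration of the argmax step in the pipeline: combine the
# probability vectors returned by two models and pick the most likely class.
# The probabilities below are made-up example values (1 batch x 5 classes).
import numpy as np

probs_model_a = np.array([[0.10, 0.60, 0.10, 0.10, 0.10]])
probs_model_b = np.array([[0.05, 0.55, 0.20, 0.10, 0.10]])

combined = probs_model_a + probs_model_b
predicted_class = int(np.argmax(combined, axis=1)[0])
print(predicted_class)  # -> 1
```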

Check out our example Python scripts for generating TensorFlow models that perform mathematical calculations and analysis. These model(s) can be converted to OpenVINO™ toolkit Intermediate Representation (IR) format and deployed with OpenVINO™ model server.
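As a hedged example in the spirit of those scripts, the snippet below builds a toy TensorFlow model that adds two tensors and saves it in SavedModel format; the shapes, names, and output path are illustrative only, and conversion to IR is done separately with the Model Optimizer.

```python
# Toy TensorFlow model that adds two tensors, saved in SavedModel format.
# Shapes, names, and the output directory are illustrative; the SavedModel
# can then be converted to OpenVINO IR with the Model Optimizer.
import tensorflow as tf

class AddModel(tf.Module):
    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None, 10], dtype=tf.float32, name="a"),
        tf.TensorSpec(shape=[None, 10], dtype=tf.float32, name="b"),
    ])
    def __call__(self, a, b):
        return {"sum": tf.add(a, b)}

model = AddModel()
tf.saved_model.save(model, "add_model",
                    signatures=model.__call__.get_concrete_function())
```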

In future releases, we will expand the pipeline capabilities to include custom data transformations. This will enable additional scenarios where data transformations cannot be easily implemented via a neural network.

Conclusion

To try the latest OpenVINO™ model server for yourself, download a pre-built container image from DockerHub or download and build from source via GitHub. For help getting started, check out the Documentation.

Download the Intel® Distribution of OpenVINO™ toolkit today and start deploying high-performance, deep learning applications with write-once, deploy-anywhere efficiency. If you have any ideas on how we can improve the product, we welcome contributions to the open-source OpenVINO™ toolkit. Finally, join the conversation to discuss all things deep learning and OpenVINO™ toolkit in our community forum.


Additional Resources

https://github.com/openvinotoolkit/model_server/blob/main/docs/performance_tuning.md (12 Oct 2020)

Footnotes

1 Tested using a 40GbE network link.

Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  

About the Author
Mary is the Community Manager for this site. She likes to bike, and do college and career coaching for high school students in her spare time.