When executing inference operations, AI practitioners need an efficient way to integrate components that delivers great performance at scale while providing a simple interface between application and execution engine.
Thus far, TensorFlow* Serving has been the serving system of choice for several reasons:
- Efficient serialization and deserialization
- Fast gRPC interface
- Popularity of the TensorFlow framework
- Simple API definition
- Version management
There are, however, a few challenges to the successful adoption of TensorFlow Serving:
- Poor out-of-the-box CPU inference performance
- Lack of support for inference execution on Intel® FPGAs, Intel® Movidius™ VPUs, and Intel® Processor Graphics
- Lack of support for frameworks other than TensorFlow
- Complex and lengthy building process
For successful adoption, an inference platform should offer acceptable latency even for demanding workloads, easy integration between training and deployment systems, scalability, and a standard client interface. A new model server inference platform developed by Intel, the OpenVINO™ Model Server, offers the same gRPC API as TensorFlow Serving but employs the inference engine libraries from the Intel® Distribution of OpenVINO™ toolkit. Geared toward convolutional neural network (CNN) workloads, this toolkit extends them across Intel® hardware (including accelerators) and maximizes performance on computer vision accelerators: CPUs, integrated GPUs, Intel Movidius VPUs, and Intel FPGAs.
Intel’s Poland-based AI inference platform team compared results captured from a gRPC client run against a Docker container using a TensorFlow* Serving image from Docker Hub (tensorflow/serving:1.10.1) and a Docker container built from an OpenVINO Model Server image with Intel Distribution of OpenVINO toolkit version 2018.3. We applied standard models from the TensorFlow-Slim image classification models library, specifically resnet_v1_50, resnet_v2_50, resnet_v1_152, and resnet_v2_152.
Using identical client application code and hardware configuration in the Docker containers, OpenVINO Model Server delivered up to 5x the performance of TensorFlow Serving, depending on batch size. The improved performance of the OpenVINO Model Server means that the inference interface can be easily accessible over a network, opening new opportunities for supported applications and reducing the cost, latency and power consumption.
Figure 3: Performance Results for Model Resnet v1 50 Depending on Batch Size (Higher is Better)1
How OpenVINO Model Server Works
Models trained in TensorFlow, MXNet*, Caffe*, Kaldi*, or saved in ONNX format are optimized using the Model Optimizer included in the OpenVINO toolkit. This process is done just once. The output of the Model Optimizer is two files, with .xml and .bin extensions: the XML file represents the optimized graph, and the bin file contains the weights. These files are loaded into the Inference Engine, which provides a lightweight API for integration into the actual runtime application. OpenVINO Model Server allows these models to be served through the same gRPC interface as TensorFlow Serving.
An automated pipeline can be easily implemented, which first trains the models in the TensorFlow framework, then exports the results in a protocol buffer file, and later converts them to Intermediate Representation format. As long as the model includes only layer types supported by OpenVINO, no extra steps are needed. For a few unsupported layers, the transformation can still be completed by installing appropriate extensions for the missing operations. Refer to the Model Optimizer documentation for more details.
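As a sketch of how the conversion stage of such a pipeline could be automated, the helper below assembles and runs the Model Optimizer invocation for a TensorFlow saved model (the `mo_tf.py` arguments mirror the command shown later in this article; the helper names are our own illustration, not part of the toolkit):

```python
import subprocess
from pathlib import Path

def mo_tf_command(saved_model_dir, output_dir, model_name):
    # Assemble the Model Optimizer invocation for a TensorFlow saved model.
    return [
        "mo_tf.py",
        "--saved_model_dir", str(saved_model_dir),
        "--output_dir", str(output_dir),
        "--model_name", model_name,
    ]

def convert(saved_model_dir, output_dir, model_name):
    # Run the conversion once; the resulting .xml/.bin pair is then
    # reused for every inference request served from that model version.
    subprocess.run(mo_tf_command(saved_model_dir, output_dir, model_name), check=True)
    return (Path(output_dir) / f"{model_name}.xml",
            Path(output_dir) / f"{model_name}.bin")
```

A training job would call `convert()` after exporting the saved model, then copy the two output files into the serving folder structure described below.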
The same conversion can be completed for Caffe and MXNet models (and the recent OpenVINO release 2018.3 also supports Kaldi and ONNX models). As a result, the OpenVINO Model Server can become the inference execution component for all these deep learning frameworks.
OpenVINO Model Server
The OpenVINO Model Server architecture stack is shown in Figure 5. It is implemented as a Python* service that uses gRPC libraries to expose the same API as TensorFlow Serving. Because identical proto files are used, the API implementation is fully compatible with the same clients. Therefore, no code changes are needed on the client side to connect to either serving component.
Key differences include the inference execution implementation, which relies on the Inference Engine API. With an optimized model format, and Intel-optimized libraries for inference execution on CPUs, FPGAs, and VPUs, you can take advantage of significantly better performance.
How To Use OpenVINO Model Server
Generating an IR Model with the Model Optimizer
The first step in enabling OpenVINO Model Server is to generate an IR model format out of the TensorFlow saved model representation using the Model Optimizer. You can use a command similar to the example below:
mo_tf.py --saved_model_dir /tf_models/resnet_v1_50 --output_dir /ir_models/resnet_v1_50/ --model_name resnet_v1_50
Model Optimizer arguments:
- Path to the Input Model: None
- Path for generated IR: /ir_models/resnet_v1_50/
- IR output name: resnet_v1_50
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: Not specified, inherited from the model
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- Enable grouped convolutions fusing: True
- Move mean values to preprocess section: False
- Reverse input channels: False
TensorFlow specific parameters:
- Input model in text protobuf format: False
- Offload unsupported operations: False
- Path to model dump for TensorBoard: None
- Update the configuration file with input/output node names:
- Operations to offload: None
- Patterns to offload: None
- Use the config file: None
Model Optimizer version: 126.96.36.19935e231
[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: /ir_models/resnet_v1_50/resnet_v1_50.xml
[ SUCCESS ] BIN file: /ir_models/resnet_v1_50/resnet_v1_50.bin
[ SUCCESS ] Total execution time: 7.75 seconds.
Folder Structure with Models
Before the models can be used in OpenVINO Model Server they should be placed in a folder structure similar to the one shown in Figure 6.
Each model should have a separate folder where every version is stored in subfolders with numerical names. This way, OpenVINO Model Server can handle multiple models and manage their versions in a similar manner to TensorFlow Serving.
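The version-selection convention can be illustrated with a short sketch (our own illustration, not the server's actual code) that finds the numbered version subfolders for a model and picks the latest one:

```python
from pathlib import Path

def list_versions(model_dir):
    # Version subfolders are plain numbers, e.g. /opt/ml/model1/1, /opt/ml/model1/2.
    return sorted(int(p.name) for p in Path(model_dir).iterdir()
                  if p.is_dir() and p.name.isdigit())

def latest_version(model_dir):
    # By convention the highest number is the newest version of the model.
    versions = list_versions(model_dir)
    return versions[-1] if versions else None
```

Note that sorting numerically (not lexically) matters: version `10` must rank above version `2`.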
The deployment process is limited to two steps:
- Building the Docker image, using the following command:
docker build -t openvino-model-server:latest .
- Starting the Docker container, using the following commands:
docker run --rm -d -v /models/:/opt/ml:ro -p 9001:9001 openvino-model-server:latest \
/ie-serving-py/start_server.sh ie_serving model --model_path /opt/ml/model1 --model_name my_model --port 9001
After those two steps are finished, OpenVINO Model Server runs as a Docker container in detached mode and listens for gRPC inference requests on port 9001. More details about usage and configuration are included in the GitHub repository documentation.
Use Case Examples
Below are examples of OpenVINO Model Server adoption. More details can be found in the GitHub repository.
1. Standalone Inference Service
The simplest usage example for the OpenVINO Model Server is with a Docker container running on a single machine.
The prerequisites for the setup are:
- Downloading the OpenVINO installer
- Placing the converted models in an appropriate folder structure, where each model includes a set of numerical versions
Once the Docker container is configured and launched according to the documentation, it can serve an inference interface to local applications or over the network.
2. Integration with AWS Sagemaker
AWS Sagemaker* is an inference solution with a REST API interface and capabilities for configuring custom pre- and post-processing. It passes inference requests to the TensorFlow Serving component, which is installed in the same Docker container along with the Sagemaker services.
It is possible to replace the TensorFlow Serving component with OpenVINO Model Server without changing the client code or the Sagemaker component. The replacement is mostly transparent and the only needed modifications are in an updated Docker file for building the Sagemaker component and a minor code update that injects a command for starting OpenVINO Model Server instead of TensorFlow Serving. The example code is present in the OpenVINO Model Server source code repository.
You can take advantage of the capabilities of AWS Sagemaker while improving the performance and reducing the response latency.
3. Inference Serving Service in Kubernetes
OpenVINO Model Server can be easily deployed in a Kubernetes* environment. Such a configuration enables new capabilities due to its scalability and high availability. Each instance of OpenVINO Model Server is represented by a Kubernetes pod attached to a service exposed via nginx ingress.
Inference operations are stateless, which makes the infrastructure easy to scale horizontally up and down according to user demand. AI models need to be mounted in volumes backed by shared storage such as NFS, CEPH, S3, or other solutions supported by Kubernetes.
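A minimal sketch of such a deployment might look like the manifest below. All names here are illustrative assumptions: the image tag and start command reuse the Docker example from earlier in this article, and the NFS server address, model path, and replica count are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openvino-model-server
spec:
  replicas: 3                     # stateless pods scale horizontally with demand
  selector:
    matchLabels:
      app: openvino-model-server
  template:
    metadata:
      labels:
        app: openvino-model-server
    spec:
      containers:
      - name: ie-serving
        image: openvino-model-server:latest
        command: ["/ie-serving-py/start_server.sh"]
        args:
        - ie_serving
        - model
        - --model_path
        - /opt/ml/model1
        - --model_name
        - my_model
        - --port
        - "9001"
        ports:
        - containerPort: 9001
      volumes:
      - name: models
        nfs:                       # any Kubernetes-supported shared storage works
          server: nfs.example.com  # hypothetical NFS server
          path: /models
```

A Service in front of these pods, exposed through the nginx ingress mentioned above, gives clients a single stable gRPC endpoint regardless of how many replicas are running.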
Advantages for AI Applications
To summarize, OpenVINO Model Server has multiple benefits for data scientists and inference consumers:
- Support for multiple frameworks: A common inference system can serve models from TensorFlow, Caffe, MXNet, and Kaldi, as well as from any framework that can export models to the ONNX format. The only extra step is generating the Intermediate Representation files with the Model Optimizer.
- Support for gRPC interface: OpenVINO Model Server uses gRPC, built on the HTTP/2 protocol, which is commonly recognized for its fast transfer rate and high tolerance of network latency.
- Ease of transition from existing API: For existing clients and applications relying on TensorFlow Serving API, the transition is mostly transparent.
- Improved performance: On identical hardware, throughput is higher and response times are shorter.
- Support for Intel FPGAs and Intel Movidius VPUs: Users can take advantage of hardware capabilities and software optimizations for inference execution, both at the edge and in the data center.
- Ease of Python Service Implementation: The source code is easy to analyze, which also makes it simple to expand and add extra features. Python is widely used, well known among developers, and convenient for troubleshooting and debugging.
- Ease of installation and integration: Docker containers make OpenVINO Model Server easy to integrate with a wide range of platforms and solutions.
The Intel AI inference platform team would like to thank the GE Healthcare team for their collaboration in designing and testing OpenVINO Model Server integration with AWS Sagemaker and providing valuable feedback about performance results. We would also like to thank Prashant Shah and the Intel AI business development team for their collaboration and partnership, the Intel AI benchmarking team in Poland for help in executing performance tests on multiple configurations and hardware, and the Intel OpenVINO development team in Nizhny for assistance and numerous consultations. These contributions enabled Marek Lewandowski and the AIPG inference platform team to design and develop OpenVINO Model Server in a short timeframe.
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.
Performance results are based on internal testing done on 27th September 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Test configuration: Dual Intel® Xeon® Platinum 8180 processor @ 2.50GHz, 376.28GB total system memory, Ubuntu-16.04-xenial operating system. Complete configuration details can be found here.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Intel, the Intel logo, Intel Xeon and others are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.