Inference at Scale in Kubernetes


Inference as a service is seeing wide adoption in the cloud and in on-premises data centers. Hosting model servers such as TensorFlow* Serving, OpenVINO™ Model Server, or Seldon Core* in Kubernetes* is a great mechanism to achieve scalability and high availability for such workloads. Nevertheless, the task of configuring the Kubernetes load balancer can be difficult. This article presents the most common challenges and recommended solutions.

One difference between inference and more familiar load distribution scenarios is that inference typically uses a gRPC* (Google Remote Procedure Call) API instead of a REST (Representational State Transfer) API. gRPC has advantages over REST in latency and in the efficiency of data serialization. Inference workload distribution presents several challenges, including potential load balancer bottlenecks, the optimal configuration of session affinity to pods, and the security of traffic between clients and endpoints (including proper access control).

This article describes recommended techniques for configuring load balancing in Kubernetes with a special focus on inference applications. It also describes how to protect served gRPC endpoints with mutual Transport Layer Security (mTLS) authentication, both on the cluster and client side, and how inference scaling can be simplified with Inference Model Manager* for Kubernetes.

Load Balancing for Inference Requests

Inference systems can be configured and deployed in a scalable manner. However, Kubernetes might not distribute the compute load in an optimal way.

The reason is that gRPC, a common interface for inference requests, uses the HTTP/2 protocol, in which every request is a stream inside the same TCP connection. As a result, L3/L4 load balancers route all inference requests from a client to a single target instance once the connection is established.
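The effect described above can be illustrated with a small simulation. This is a minimal sketch rather than a model of any real load balancer: the pod names are hypothetical, and round-robin selection stands in for whatever policy the balancer actually applies. The only point is the difference between choosing a backend once per connection and once per request.

```python
from itertools import cycle

BACKENDS = ["pod-a", "pod-b", "pod-c"]  # hypothetical pod names


def l4_distribution(clients, requests_per_client, backends=BACKENDS):
    """Connection-level balancing: a backend is chosen once per TCP connection."""
    picker = cycle(backends)
    counts = {backend: 0 for backend in backends}
    for _ in range(clients):
        backend = next(picker)  # decided when the connection is opened
        # Every gRPC call is a stream on that same connection,
        # so all of the client's requests land on the same backend.
        counts[backend] += requests_per_client
    return counts


def l7_distribution(clients, requests_per_client, backends=BACKENDS):
    """Request-level balancing: a backend is chosen for every gRPC call."""
    picker = cycle(backends)
    counts = {backend: 0 for backend in backends}
    for _ in range(clients * requests_per_client):
        counts[next(picker)] += 1
    return counts


if __name__ == "__main__":
    print("L3/L4:", l4_distribution(clients=2, requests_per_client=300))
    print("L7:   ", l7_distribution(clients=2, requests_per_client=300))
```

With two clients and three backends, the connection-level variant leaves one backend idle no matter how many requests are sent, while the request-level variant spreads the calls evenly.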

A preferred solution is to employ a Kubernetes Ingress Controller which will perform the routing using L7 load balancing with full support for gRPC protocol. Our example below is based on nginx-ingress which we have tested and found to be reliable and secure for gRPC traffic.

Below are examples of request distributions for two inference clients submitting a series of calls. In the first case, the clients connect via a Kubernetes ClusterIP assigned to the service with three nodes. In the second case, the clients connect via an nginx ingress controller that routes the traffic to the same Kubernetes service.

Figure 1: Connection via Kubernetes* Service ClusterIP. Distribution of requests from two clients among three nodes. Colors represent clients. Test configuration in appendix.

Figure 2: Connection via Ingress Controller. Distribution of requests from two clients on three nodes. Colors represent clients. Test configuration in appendix.

Adding the L7 load balancer in the form of an ingress controller improves the distribution of the load on the whole cluster. As shown in the performance analysis below, it also adds a noticeable latency impact. However, it is still beneficial because it improves the overall capacity and system utilization and adds a layer of security due to added traffic encryption and authorization.

Security in Inference Endpoints

Security is an important aspect of an inference system. It is necessary to address threats and protect user assets like inference endpoints that serve AI models, input user data and return inference results. Kubernetes cluster security requires user identity authentication, user permission authorization, and traffic encryption between the clients and the inference system to prevent ‘sniffing’ of transferred data.

Traffic encryption is fairly easy to implement on the ingress controller side using TLS termination. User authentication and authorization can be based on client certificates or on JSON Web Tokens (JWT). This article presents the first option, using mutual Transport Layer Security (mTLS) authentication, which is a convenient way to control access to inference endpoints for applications, automation scripts, and other microservices.

With mTLS, the inference endpoint encrypts the traffic using the server certificate. The inference request also includes an embedded client TLS certificate for authentication purposes. Bi-directional trust for the certificates is needed between the client and the server to establish a secure connection. More details can be found in our whitepaper.

One of the proven options for implementing mTLS is to use the nginx ingress controller.
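As a rough illustration of what such a setup can look like, the sketch below builds an Ingress manifest that asks the controller to speak gRPC to the backend and to require a client certificate. It assumes the community ingress-nginx controller and a cluster that serves the `networking.k8s.io/v1` Ingress API; the annotation names should be checked against the documentation for the controller version in use. The resource name, namespace, host, service, and secret names are all hypothetical placeholders, not values from the tested configuration.

```python
import json


def grpc_mtls_ingress(name, namespace, host, service, port, tls_secret, ca_secret):
    """Return an Ingress manifest (as a dict) for a gRPC backend with mTLS.

    Assumes the community ingress-nginx controller; all argument values
    are supplied by the caller and are placeholders in this sketch.
    """
    annotations = {
        # Proxy to the model server using gRPC instead of plain HTTP/1.1.
        "nginx.ingress.kubernetes.io/backend-protocol": "GRPC",
        # Require and verify a client certificate (the "mutual" part of mTLS).
        "nginx.ingress.kubernetes.io/auth-tls-verify-client": "on",
        # Secret holding the CA certificate used to verify client certificates.
        "nginx.ingress.kubernetes.io/auth-tls-secret": f"{namespace}/{ca_secret}",
        "nginx.ingress.kubernetes.io/auth-tls-verify-depth": "1",
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace, "annotations": annotations},
        "spec": {
            "ingressClassName": "nginx",
            # Server certificate used for TLS termination at the ingress.
            "tls": [{"hosts": [host], "secretName": tls_secret}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    manifest = grpc_mtls_ingress(
        name="inference-grpc",           # hypothetical
        namespace="inference",           # hypothetical
        host="inference.example.com",    # hypothetical
        service="model-server",          # hypothetical
        port=9000,                       # hypothetical
        tls_secret="inference-server-tls",
        ca_secret="inference-client-ca",
    )
    print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output can be saved to a file and applied directly once the placeholders are replaced.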

Performance Analysis

Testing Ingress Overhead

During the tests, the gRPC client ran on a pod in the same Kubernetes cluster, and on the same local network, as the model servers, which were serving the ResNet v1 50 topology. The Kubernetes infrastructure was set up in the Google Cloud GKE service. The tests were done on an Intel® Xeon® Scalable processor-based system with 16 virtual cores and 32GB of RAM. The gRPC client sent sequential requests to the inference endpoints to calculate response time statistics.
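A measurement of this kind can be sketched as follows. This is not the test harness used for the results in this article; it is a minimal illustration in which `call` is a stand-in for one blocking inference request (for example, a gRPC stub invocation), and the choice of statistics is an assumption.

```python
import statistics
import time


def measure_latency(call, n_requests):
    """Invoke `call` sequentially and return per-request latencies in ms."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()  # one blocking request; the next starts only after it returns
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Return basic response time statistics for a list of latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = (95 * n + 99) // 100 - 1
    return {
        "count": n,
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Placeholder workload standing in for a real inference request.
    stats = summarize(measure_latency(lambda: time.sleep(0.001), n_requests=20))
    print(stats)
```

Running the same loop once against the cluster IP and once against the ingress endpoint gives two sets of statistics whose difference approximates the overhead of the extra hop.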

Figure 3: Model Server Processing and Transfer Overhead by Interface, ResNet v1 50 model. Test configuration in appendix.

In summary, adding an L7 load balancer with TLS connection termination impacts latency and increases the workload on the client side. The latency impact is proportional to the volume of data passed to the server. On the other hand, the L7 load balancer adds critical security features and efficient load distribution across all service instances. As shown in the following tests, it does not reduce the overall CPU capacity of the inference system.

The conclusion is that connectivity over the inference service cluster IP should be used when all of the following hold: the clients are located inside the cluster, even distribution of inference requests is not critical, and there is no need to encrypt the traffic on internal interfaces. In all other cases, connecting via the ingress controller interface is the recommended technique for exposing the inference model servers.
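Reading the recommendation above as three conditions that must all hold, the decision can be written down as a small helper. The function and parameter names are hypothetical and serve only to make the rule explicit.

```python
def recommended_interface(client_in_cluster, needs_even_distribution, needs_encryption):
    """Return "ClusterIP" or "Ingress" following the rule of thumb above."""
    if client_in_cluster and not needs_even_distribution and not needs_encryption:
        # Lowest latency path: no extra hop and no TLS termination.
        return "ClusterIP"
    # External clients, balanced load, or encrypted traffic all call for L7.
    return "Ingress"


if __name__ == "__main__":
    print(recommended_interface(True, False, False))
    print(recommended_interface(False, False, False))
```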

Horizontal Scalability

During the tests, the inference service hosted on Kubernetes with model servers received workloads from multiple parallel clients in proportion to the overall capacity, each sending sequential requests over the gRPC interface. Captured statistics are related to the average latency and throughput with a variable number of backend instances. Each instance of the service was represented as a pod with constrained and isolated CPU capacity of four virtual cores of an Intel Xeon Scalable processor hosted in a Google Cloud environment. The hosting Kubernetes nodes had a total of 64GB RAM with 32 virtual CPU cores.


Figure 4: OpenVINO Model Server Scalability with Increased Number of Instances, Relative Values Collected for ResNet v1 50. Test configuration in appendix.

To summarize the results of our testing, inference services can be enabled in Kubernetes in a highly scalable and secure way. They can flexibly serve a large number of models for many clients and requests. Capacity can be scaled linearly both by adding pods in a single node and by adding more physical nodes. A variety of model servers like TensorFlow Serving and OpenVINO Model Server can be adopted in this way. Each model server has different strengths, so consider your frameworks, topologies, and usage pattern. It’s always best practice to test solutions individually.
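When testing your own deployment, one simple way to check whether capacity scales linearly is to compare measured throughput against the throughput expected from the smallest configuration. The helper below is a sketch of that calculation; its name and output fields are assumptions, and the numbers in the usage example are synthetic placeholders, not measurements from this article.

```python
def scaling_report(throughput_by_instances):
    """Compute relative throughput and scaling efficiency per instance count.

    `throughput_by_instances` maps a number of model server instances to the
    measured throughput (for example, requests per second) at that size.
    An efficiency of 1.0 means perfectly linear scaling from the baseline.
    """
    baseline_n = min(throughput_by_instances)
    baseline_throughput = throughput_by_instances[baseline_n]
    per_instance = baseline_throughput / baseline_n
    report = {}
    for n in sorted(throughput_by_instances):
        measured = throughput_by_instances[n]
        report[n] = {
            "relative_throughput": measured / baseline_throughput,
            "efficiency": measured / (n * per_instance),
        }
    return report


if __name__ == "__main__":
    # Synthetic, perfectly linear placeholder values for illustration only.
    for n, row in scaling_report({1: 100.0, 2: 200.0, 4: 400.0}).items():
        print(n, row)
```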

Inference Model Manager for Kubernetes

Intel recently released an open source component for Kubernetes, Inference Model Manager, which simplifies the process of configuring Kubernetes for inference endpoints. It provides an easy-to-follow REST API for managing and controlling the endpoints in a multi-tenant environment using OpenID token-based authentication.

Inference Model Manager exposes two types of interfaces:

  • A management REST API with JWT-based authentication and authorization
  • gRPC inference endpoints with mTLS client authentication and the TensorFlow Serving API

Each of the models deployed on the platform is represented by an inference endpoint CRD (Custom Resource Definition) and hosted as a model server instance (TensorFlow Serving or OpenVINO Model Server) embedded within a Kubernetes deployment. The gRPC API exposed by the model server instance is accessible externally over the ingress load balancer. During initialization, the model server instance downloads the model it is configured with from the appropriate storage bucket.


Figure 5: Model Server Workflow

Inference endpoints are grouped by tenants which represent teams of users. All inference endpoints created for a tenant are hosted in the same Kubernetes namespace that also groups other resources pertaining to that tenant.

Inference Model Manager integrates with third party components for model storage and identity providers through:

The REST management API exposed by the platform uses the same token and RBAC (Role Based Access Control) rules for authentication and authorization as the Kubernetes API. The REST API is for convenience and management simplification. A complete platform state is stored in the Kubernetes records.

Minio, an open source storage server with an Amazon S3-compatible API, is used for model storage. All user interactions with Minio go through the management API, which acts as a proxy that filters the traffic based on user token validity and scope.
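The idea of filtering requests by token validity and scope can be sketched in a few lines. This is not how Inference Model Manager implements it: the sketch assumes a JWT signed with a shared secret (HS256), whereas an OpenID provider would typically use public-key signatures, and the `tenants` claim is a hypothetical way to express scope. Only the standard `exp` claim and the compact JWT layout are taken from the JWT specification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment):
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_token(claims, secret):
    """Create an HS256-signed token (used here only to exercise the check)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    digest = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(digest)}"


def is_request_allowed(token, secret, tenant, now=None):
    """Allow a request only if the token is authentic, unexpired, and in scope."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:  # wrong segment count, bad base64, or bad JSON
        return False
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return False
    if not isinstance(claims, dict):
        return False
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False  # validity: signature does not match
    now = time.time() if now is None else now
    expiry = claims.get("exp")
    if not isinstance(expiry, (int, float)) or expiry <= now:
        return False  # validity: token missing an expiry or already expired
    tenants = claims.get("tenants")  # hypothetical scope claim
    return isinstance(tenants, list) and tenant in tenants


if __name__ == "__main__":
    secret = b"demo-secret"  # placeholder; never hard-code real secrets
    token = sign_token({"exp": time.time() + 60, "tenants": ["team-a"]}, secret)
    print(is_request_allowed(token, secret, "team-a"))
    print(is_request_allowed(token, secret, "team-b"))
```

A proxy built along these lines would run the check before forwarding any request to the storage backend, so that users never talk to the storage server directly.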

You can learn more about the Inference Model Manager component here and check out our extended documentation on this project for more details. For the latest technical updates from our team, follow us on @IntelAIDev.

Additional Resources:

  1. Intel Inference Model Manager on GitHub
  2. Intel whitepaper: Secure Inference at Scale in Kubernetes Container Orchestration
  3. Kubernetes concepts for services, load balancing, and networking: Ingress
  4. Kubernetes concepts for services, load balancing, and networking: Services
  5. Google* Cloud: Load Balancing Concepts
  6. gRPC Blog: Load Balancing

Appendix: Notices and Disclaimers, Test Configuration Details

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

Performance results are based on internal testing done on 27th November 2018 and may not reflect all publicly available security updates. No product can be absolutely secure. Test configuration: Google Kubernetes Engine service using nodes equipped with Intel Xeon Processors of variable cores, capacity, and RAM allocation.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Intel, the Intel logo, Intel Xeon and others are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation.
