Inference Scaling in Kubernetes

This article describes recommended techniques for configuring load balancing in Kubernetes, with a special focus on inference applications. It also describes how to protect served gRPC endpoints with mutual Transport Layer Security (mTLS) authentication on both the cluster and client sides, and how Inference Model Manager* for Kubernetes can simplify inference scaling. The examples use two model servers, TensorFlow* Serving and OpenVINO™ Model Server; installation instructions are included in the appendix.