AI/ML workloads come in two flavors: training and inference. Training entails processing huge volumes of data, perhaps petabytes, not once but multiple times, until the desired model quality is achieved. Models are only getting larger; Large Language Models (LLMs) such as GPT-4 have more than a trillion parameters! Inference tasks can be real time, as in autonomous driving, recommendation engines, and disease diagnosis systems. The data used to train a model may be sensitive, and the data and models themselves are high value. Both training and inference demand high performance, scalability, and security; inference often demands low latency as well. Let us look at how the cloud native ecosystem delivers along these axes before touching on some gaps and the work underway to address them.
Kubernetes* features such as resource-aware scheduling (which accounts for both node capability and capacity), Node Affinity/Anti-Affinity, Taints and Tolerations, and rich resource telemetry work in conjunction with the Node Resource Interface to meet the performance needs of AI workloads. Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling provide the elasticity to handle dynamically varying volumes of incoming requests and reduce resource-related costs. However, there is room to improve, specifically in four areas:
Graphics Processing Units (GPUs), popular for speeding up machine learning tasks, are inadequately supported in Kubernetes. Intel-developed GPU scheduler enhancements seek to close the gap, particularly by being more optimistic and opportunistic in resource allocation to improve utilization.
Service Meshes, such as Istio* or Linkerd*, are increasingly used in Kubernetes clusters to provide common functionality such as network proxying and load balancing. In conjunction with HPA, load balancing helps to deliver low latency scalable inferencing.
Storage access speeds play a significant role, given the scale of AI data. NVMe over Fabrics (NVMe-oF), an extension of the NVMe network protocol, delivers faster connectivity with reduced CPU utilization. The Kubernetes Container Storage Interface (CSI) driver, with the NVMe-oF plugin, brings remote storage access speeds very close to those of local storage. Soon, NVMe-oF, in conjunction with Infrastructure Processing Units (IPUs), will further improve data transfer bandwidth and reduce latency. Faster storage access helps to keep GPUs fed and busy!
Vector databases: User prompt engineering helps improve LLM output and is a popular alternative to the heavier and costlier option of fine-tuning foundation models. Prompt engineering typically involves a chain of different data processing tasks. Vector databases add context, semantics, and personalization, helping to refine prompts to deliver more accurate, relevant, and user-personalized results. Intel is exploring optimizations for Milvus*, a popular vector database.
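The retrieval step behind such prompt refinement can be sketched in a few lines: embeddings of reference passages are stored alongside their text, the query embedding is matched by cosine similarity, and the best matches are prepended to the prompt as context. The tiny in-memory store and three-dimensional vectors below are illustrative stand-ins for a real vector database such as Milvus, not its API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": (embedding, source passage) pairs.
store = [
    ([0.9, 0.1, 0.0], "Kubernetes schedules pods onto nodes."),
    ([0.1, 0.9, 0.0], "NVMe-oF extends NVMe across a network fabric."),
]

def retrieve(query_vec, k=1):
    # Rank stored passages by similarity to the query embedding.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, query_vec):
    # Prepend the retrieved passages as context for the LLM.
    context = "\n".join(retrieve(query_vec))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In production, the embeddings come from a learned embedding model and the nearest-neighbor search is approximate and indexed; the refinement pattern is the same.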
We invite you to explore our energy efficient cloud native AI pipeline that facilitates horizontal scaling of individual components from video frame transcoding to inferencing. The pipeline packs in performance and power efficiency using Intel's newest matrix instructions.
Resource Isolation and Quality of Service (QoS)
Sharing resources provides cost savings but also brings the danger of noisy neighbors. Workloads can be a mix of short inference tasks requiring low latency and long running, compute intensive batch training jobs. Without resource isolation and limits enforcement, servicing one may compromise the other. QoS classes with policy enforcement help meet workload Service Level Agreements (SLAs). Intel is close to field trials for allocation and enforcement of processor cache and IO bandwidth resources geared towards meeting SLAs.
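Kubernetes derives a pod's QoS class from its containers' resource requests and limits, and that class drives eviction and isolation decisions. The sketch below captures the classification in simplified form; it assumes requests are spelled out explicitly, ignoring the real scheduler's defaulting of an unset request to the limit:

```python
def qos_class(containers):
    """Classify a pod into its Kubernetes QoS class from container resource
    specs (simplified sketch of the real rules)."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed requires cpu and memory limits with matching requests.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"   # no requests or limits anywhere
    return "Guaranteed" if guaranteed else "Burstable"

# A latency-sensitive inference pod pinned to exact resources is Guaranteed:
qos_class([{"requests": {"cpu": "2", "memory": "4Gi"},
            "limits":   {"cpu": "2", "memory": "4Gi"}}])
```

Guaranteed pods are the last to be evicted under node pressure, which is why pinning requests to limits is a common pattern for low-latency inference services sharing nodes with batch training jobs.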
Security and Data Privacy

Ensuring data privacy is crucial in the AI/ML context because the data is often sensitive (patient health care information, financial records, location data); a compromise could lead to serious privacy violations, regulatory fines, and safety concerns. Protecting data integrity is also crucial to ensuring model validity. Finally, the models themselves are highly valuable, given the resources expended to construct them. With that importance in mind, let us look at security support in cloud native.
In Kubernetes, namespaces segregate workspaces and the resources allocated to them, while Role-Based Access Control (RBAC) helps control access. Further, service meshes such as Istio support encrypting inter-microservice communication using Transport Layer Security (TLS) and mutual TLS. The secure connections can be terminated, based on use case needs, at the cluster level, in a worker node, or within individual pods.
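As a concrete illustration of RBAC, a namespaced read-only Role and its RoleBinding look like the following, expressed here as Python dicts mirroring the YAML manifests. The namespace, role, and service account names are hypothetical:

```python
# A Role scoped to the hypothetical "ml-team" namespace that grants
# read-only access to pods.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"namespace": "ml-team", "name": "pod-reader"},
    "rules": [{
        "apiGroups": [""],            # "" means the core API group
        "resources": ["pods"],
        "verbs": ["get", "list", "watch"],
    }],
}

# A RoleBinding granting that Role to a hypothetical service account.
role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"namespace": "ml-team", "name": "read-pods"},
    "subjects": [{"kind": "ServiceAccount",
                  "name": "inference-sa",
                  "namespace": "ml-team"}],
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                "kind": "Role",
                "name": "pod-reader"},
}
```

Because the Role is namespaced, the grant cannot leak outside `ml-team`; cluster-wide permissions would require a ClusterRole and ClusterRoleBinding instead.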
In the Cloud Native space, an important security enhancement is the integration of Trusted Execution Environments (TEEs), also known as Confidential Computing. TEEs are hardware-backed and provide total memory encryption, data integrity protection, access protection even from privileged software such as the host operating system or hypervisor, and support for attestation. Confidential computing paves the way to move regulated industry workloads into public clouds to reap cost and availability benefits. One may use KubeVirt* to launch a Confidential Virtual Machine (CVM), create entire Kubernetes clusters of CVMs, or launch a pod wrapped in a CVM using the Confidential Containers project. Further, Intel has security-hardened the Istio/Envoy proxy to leverage a TEE to protect TLS private keys. We also recently launched the open source project Confidential Cloud Native Primitives (CCNP) to better support measurement and attestation across different TEE usage models. Explore and influence CCNP to meet your trust needs.
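Attestation rests on a chain of measurements: each boot or load event extends a running hash register, so a verifier who replays the event log can detect any tampering. The sketch below illustrates the general PCR-style extend pattern only; it is not the CCNP API, and the event names are illustrative:

```python
import hashlib

def extend(measurement: bytes, event: bytes) -> bytes:
    # PCR-style extend: new = SHA-256(old_measurement || SHA-256(event)).
    return hashlib.sha256(measurement + hashlib.sha256(event).digest()).digest()

# Replaying the same event log reproduces the same final measurement...
boot_log = [b"firmware", b"kernel", b"container-image"]
measurement = bytes(32)          # all-zero initial register
for event in boot_log:
    measurement = extend(measurement, event)

# ...while tampering with any event yields a different final value.
tampered = bytes(32)
for event in [b"firmware", b"tampered-kernel", b"container-image"]:
    tampered = extend(tampered, event)
assert tampered != measurement
```

Because extend is one-way and order-sensitive, an attacker cannot substitute a component and then "fix up" the register; the verifier comparing the final measurement against the replayed log catches the change.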
Heterogeneity, Portability, and Cost Savings
No longer are workloads tethered to the cloud or data center. They are also emerging at edge locations to meet latency needs, satisfy data locality regulations, or reduce network bandwidth costs. Heterogeneity of hardware and software is to be expected across this vast deployment landscape. The Kubernetes device plugin framework, along with vendor-provided plugins, drivers, and operators, eases the discovery and use of special-purpose accelerators such as Intel crypto, data streaming, and load balancing accelerators, and devices such as GPUs and FPGAs.
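Once a device plugin advertises its devices to the kubelet, a pod consumes an accelerator simply by naming it as an extended resource limit. The sketch below, a Python dict mirroring the YAML manifest, uses `gpu.intel.com/i915`, the resource name advertised by the Intel GPU device plugin; the pod and image names are illustrative:

```python
# Pod spec requesting one Intel GPU via the device plugin's extended
# resource name; the scheduler only places it on a node advertising one.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "infer"},              # illustrative name
    "spec": {
        "containers": [{
            "name": "inference",
            "image": "example.com/inference:latest",  # illustrative image
            "resources": {
                "limits": {"gpu.intel.com/i915": 1},
            },
        }],
    },
}
```

Extended resources are integer-counted and requested only in limits; the same pattern applies to FPGAs and other accelerators under their vendors' resource names.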
Workload portability is a must to be able to offer a service across the cloud-to-edge deployment continuum. Containers are ideal for portability, but to truly meet AI performance needs, intermediate representations such as ONNX* and its runtime, or Intel's OpenVINO™ and oneAPI, help deliver hardware-optimized performance.
Further, the ability to run anywhere effectively facilitates leveraging idle hardware units, which in turn improves resource utilization and reduces costs. End users can focus on their SLAs while also making tradeoffs between cost and performance.
AI workloads are becoming the predominant cloud native workload even as they grow ever larger and more sophisticated. Maturing GPU support, together with storage access improvements, will help keep GPUs busy and improve their utilization. End user experience will get smoother with options such as Model-as-a-Service, which, with resource isolation and QoS support, will better fit pay-per-use operational models to save on costs while meeting SLAs. Federated Learning using confidential secure edge Kubernetes clusters will unlock data silos to develop richer models with less bias.
Come join us in Cloud Native innovations to make AI accessible to all.
The Intel Open Source Cloud Software Engineering team works on open source cloud native projects such as Kubernetes, Confidential Containers, Istio/Envoy, DeathStarBench, and Ceph.