Project Battlematrix: Software Update
In May, we announced new scalable and accessible inference workstations code-named Project Battlematrix. This important project aims to accelerate Intel’s GPU and AI strategy by simplifying the adoption of Intel® Arc™ Pro B-series GPUs with a new inference-optimized software stack.
The new software stack is built with ease of use and industry standards in mind. What does that mean? It is a containerized solution built for Linux environments, optimized to deliver incredible inference performance with multi-GPU scaling and PCIe P2P data transfers, and designed to include enterprise-class reliability and manageability features such as ECC, SR-IOV, telemetry, and remote firmware updates.
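To make the containerized model concrete, here is a minimal, hypothetical launch sketch using the Docker SDK for Python. The image tag, port, and shared-memory size below are placeholders rather than the actual release artifacts; the only assumption is a Linux host that exposes Intel GPUs through /dev/dri.

```python
# Hypothetical launch of an LLM Scaler-style container on a Linux host with
# Intel GPUs. The image tag and port are placeholders, not published artifacts.
import docker

client = docker.from_env()
container = client.containers.run(
    image="intel/llm-scaler:latest",   # placeholder image name
    devices=["/dev/dri:/dev/dri"],     # pass the Intel GPU device nodes through
    shm_size="16g",                    # shared memory for multi-GPU data transfers
    ports={"8000/tcp": 8000},          # expose an inference endpoint on the host
    detach=True,
)
print(container.logs(tail=20).decode())
```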
We’re happy to share progress with release 1.0 of the LLM Scaler container. This release is critical for early customer enablement and includes:
- vLLM optimizations:
  - TPOT performance optimizations for long input lengths (>4K): up to 1.8x uplift at 40K sequence length on a 32B KPI model, and up to 4.2x uplift at 40K sequence length on a 70B KPI model (a serving sketch follows this list)
  - Performance optimizations delivering ~10% output throughput improvement for 8B-32B KPI models compared to the previous drop
  - By-layer online quantization to reduce required GPU memory
  - Pipeline parallelism support in vLLM (experimental)
  - torch.compile support (experimental)
  - Speculative decoding (experimental)
  - Support for embedding and rerank models
  - Enhanced multi-modal model support
  - Automatic detection of maximum context length
  - Data parallelism support
- OneCCL benchmark tool enablement
- XPU Manager:
  - GPU power monitoring
  - GPU firmware update
  - GPU diagnostics
  - GPU memory bandwidth monitoring
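The vLLM features above are reached through standard vLLM interfaces. As a rough illustration rather than an excerpt from the release, a long-context, multi-GPU launch might look like the following; the model name, parallel size, and context length are placeholder values.

```python
# Illustrative vLLM launch (placeholder model and sizes, not release defaults).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # placeholder 32B-class model
    tensor_parallel_size=4,             # shard the model across four GPUs
    max_model_len=40960,                # long-context serving, in line with the 40K results above
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the Project Battlematrix software stack."], params)
print(outputs[0].outputs[0].text)
```

The experimental pipeline parallelism and data parallelism features listed above are enabled through additional vLLM engine arguments once configured for a given deployment.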
With this release, we are delivering on the timeline we shared when announcing these products in May. Next, we plan to release a hardened version of LLM Scaler with additional functionality by the end of Q3, and we are actively working toward a full feature set release in Q4.
Resources: