Optimized Analytics Package for Spark* Platform (OAP for Spark* Platform)

ID 679216
Updated 11/18/2020
Version Latest



Pull Command

We are currently updating this container.


Optimized Analytics Package for Spark* Platform (OAP for Spark* Platform) is a project that optimizes Apache Spark* in several areas, including caching, shuffle, the execution engine, and MLlib. OAP for Spark* Platform currently includes the following optimizations:

  • SQL Data Source Cache: Optimizes the Spark* SQL data source by using persistent memory (PMem) as an input data cache.
  • RDD Cache PMem Extension: Optimizes the Spark* RDD cache using PMem.
  • Shuffle Remote PMem Extension: Optimizes Spark* shuffle using remote PMem and RDMA.
  • Remote Shuffle: A shuffle implementation that writes shuffle data to HDFS-compatible remote storage.
  • Unified Arrow Data Source and Native SQL Engine: Optimizes the SQL execution engine using vectorization, native code, and columnar data.
  • OAP MLlib: An optimized implementation of a subset of MLlib algorithms.


Documentation and Sources

Get Started
Docker* Repository
Main GitHub* Repository
Release Notes
Get Started Guide

Code Sources
Report Issue

License Agreement

LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the license file for additional details.
