Protect End-to-End Data Pipelines with BigDL Privacy-Preserving Machine Learning (PPML)



Protecting privacy and confidentiality is critical for large-scale data analysis and machine learning. BigDL PPML provides a trusted cluster environment for secure big data and artificial intelligence (AI) applications, even in untrusted cloud environments.

Intel has built BigDL Privacy-Preserving Machine Learning (PPML) on Intel® Software Guard Extensions (Intel® SGX) to secure the end-to-end big data and AI pipeline.

Intel BigDL PPML

BigDL, a unified open source artificial intelligence solution platform from Intel, aims to make it easier for data scientists and data engineers to build end-to-end, distributed AI applications. By combining Intel® SGX, Intel’s Trusted Execution Environment (TEE), with other hardware and software security measures, BigDL provides a distributed PPML platform that protects end-to-end distributed AI pipelines, from data ingestion and data analysis through machine learning and deep learning.


Figure 1. Intel BigDL PPML software stack 
All graphics created by the authors 


PPML protects data at rest, in transit, and in use: compute and memory are protected by SGX enclaves; storage (e.g., data and models) is protected by encryption; and network communication is protected by remote attestation and Transport Layer Security (TLS). PPML also offers optional federated learning support.

With BigDL PPML, users can run trusted big data and AI applications, including trusted Spark* data analysis (such as Spark SQL*, DataFrame, and MLlib*), trusted deep learning (such as BigDL, Orca*, Nano*, and DLlib*), and trusted federated learning with private set intersection (PSI).
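
To give a feel for what private set intersection does, here is a minimal sketch of a Diffie-Hellman style PSI in Python. The article does not say which PSI protocol BigDL PPML uses, so this is an illustration of the general idea and not BigDL's implementation. The function names (`hash_to_group`, `blind`, `psi`) and the choice of modulus are assumptions made for the example, and both parties are simulated in a single process.

```python
import hashlib
import secrets

# 2**255 - 19 is a well-known prime; it is used here only for convenience.
# A real deployment would use a vetted group and a reviewed protocol.
P = 2**255 - 19


def hash_to_group(item: str) -> int:
    """Map an item to a nonzero element modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def blind(values, secret):
    """Raise every value to a party's secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]


def psi(items_a, items_b):
    """Return the items of party A that party B also holds."""
    key_a = secrets.randbelow(P - 3) + 2  # party A's secret exponent
    key_b = secrets.randbelow(P - 3) + 2  # party B's secret exponent

    # Round 1: each party blinds its own hashed items and sends them over.
    a_once = blind([hash_to_group(x) for x in items_a], key_a)
    b_once = blind([hash_to_group(y) for y in items_b], key_b)

    # Round 2: each party blinds the other's values with its own key.
    a_twice = blind(a_once, key_b)       # computed by B, order kept, sent back to A
    b_twice = set(blind(b_once, key_a))  # computed by A

    # Exponentiation commutes, so H(x)^(a*b) matches H(y)^(b*a) when x == y.
    return [x for x, v in zip(items_a, a_twice) if v in b_twice]


if __name__ == "__main__":
    print(psi(["alice", "bob", "carol"], ["carol", "dave", "alice"]))
```

Neither party sees the other's raw items in this sketch, only blinded values; the items outside the intersection stay hidden behind the other party's secret exponent.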



End-to-End Workflow

Figure 2.  BigDL PPML based end-to-end secure computing workflow


Here’s a step-by-step breakdown of the end-to-end secure computing workflow: 

  1. User submits job to Kubernetes* (via BigDL PPML command line interface), which creates the driver node
  2. BigDL PPML client attests the driver node
  3. Driver creates more worker nodes 
  4. Driver attests worker nodes 
  5. Driver and workers request keys from the key management service (KMS) 
  6. Workers read and decrypt input data 
  7. Workers run distributed Big Data, ML and DL programs 
  8. Workers encrypt and write output data 
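
The ordering of these steps matters: keys are released only after attestation succeeds, and data is decrypted only inside attested nodes. The following single-process Python sketch mimics that ordering. It is not BigDL PPML code. `Node`, `ToyKMS`, `attest`, `toy_cipher`, and `run_job` are hypothetical names, attestation is reduced to comparing a hash against an expected value, and the cipher is a toy XOR keystream that stands in for real encryption.

```python
import hashlib
import hmac
import secrets

# Hypothetical "known good" measurement; a stand-in for a verified SGX quote.
TRUSTED_IMAGE = b"trusted-enclave-image-v1"
TRUSTED_MEASUREMENT = hashlib.sha256(TRUSTED_IMAGE).hexdigest()


class Node:
    """A driver or worker; `image` stands in for the code loaded in its enclave."""

    def __init__(self, name, image):
        self.name = name
        self.image = image
        self.attested = False

    def quote(self):
        # Real quotes are signed by hardware; here it is only a hash.
        return hashlib.sha256(self.image).hexdigest()


def attest(node):
    """Steps 2 and 4: accept the node only if its measurement matches."""
    node.attested = hmac.compare_digest(node.quote(), TRUSTED_MEASUREMENT)
    return node.attested


class ToyKMS:
    """Step 5: releases the data key only to attested nodes."""

    def __init__(self):
        self._key = secrets.token_bytes(32)

    def request_key(self, node):
        if not node.attested:
            raise PermissionError(f"{node.name} is not attested")
        return self._key


def toy_cipher(key, nonce, data):
    """XOR with a SHA-256 counter keystream. NOT real encryption."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def run_job(kms, encrypted_partitions):
    """Run steps 1-8 over a list of (nonce, ciphertext) partitions."""
    driver = Node("driver", TRUSTED_IMAGE)                  # step 1
    if not attest(driver):                                  # step 2
        raise PermissionError("driver failed attestation")
    workers = [Node(f"worker-{i}", TRUSTED_IMAGE)           # step 3
               for i in range(len(encrypted_partitions))]
    for worker in workers:                                  # step 4
        if not attest(worker):
            raise PermissionError(f"{worker.name} failed attestation")
    kms.request_key(driver)                                 # step 5 (driver)
    outputs = []
    for worker, (nonce, ciphertext) in zip(workers, encrypted_partitions):
        key = kms.request_key(worker)                       # step 5 (worker)
        plaintext = toy_cipher(key, nonce, ciphertext)      # step 6
        result = plaintext.upper()                          # step 7 (placeholder job)
        out_nonce = secrets.token_bytes(16)
        outputs.append((out_nonce, toy_cipher(key, out_nonce, result)))  # step 8
    return outputs
```

A node whose image differs from the trusted one fails `attest`, so `ToyKMS.request_key` refuses it and the node never sees plaintext. That gating is the property the workflow is designed around.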

Using the pre-configured workflow in Figure 2, developers can focus on business logic and rely on BigDL PPML to help ensure the end-to-end security and privacy of their applications. This can improve development efficiency for private computing applications and shorten the time needed to build private computing solutions.

The BigDL PPML solution has been deployed on the Alibaba Cloud* DataTrust* platform, at ByteDance*, and by others.

You can see how it works with this 10-minute demo presented at KubeCon* North America 2022 and check out the BigDL GitHub repo for more information.

Photo by Alina Grubnyak on Unsplash