Jian Zhang

AI Software Engineering Manager
Intel Corporation

Artificial intelligence (AI) has revolutionized virtually every industry, from healthcare to retail to manufacturing. However, most of today’s AI solutions are expensive and accessible only to a small set of data scientists. This is due to multiple factors. First, modern end-to-end AI pipelines are complex. They require multiple stages, such as data processing, feature engineering, model development, model deployment, and maintenance. The iterative nature of these stages makes the process time-consuming. Second, deep expertise is often required to develop AI solutions. This creates an entry barrier for novice and citizen data scientists. Third, people tend to develop larger and deeper models to get better accuracy. These “over-parameterized” models lead to significant computational demands, which hinder deployment in resource-constrained environments.

We developed Intel® End-to-End AI Optimization Kit to make the end-to-end AI pipeline faster, simpler, and more accessible, broadening AI access to everyone, everywhere. It is a composable toolkit for end-to-end AI optimization that delivers high-performance, lightweight models efficiently on commodity hardware. The toolkit is built on a set of Intel-optimized frameworks, such as Intel® Extension for PyTorch* (IPEX), Intel® Extension for TensorFlow* (ITEX), and Intel® AI Analytics Toolkit (AI Kit). It also integrates SigOpt for hyperparameter optimization, and it provides its own components and features for data preparation, model optimization, and model construction.

It improves the scale-up and scale-out efficiency of end-to-end AI pipelines to make "overnight training" of complex deep learning (DL) models possible. It delivers lighter DL models that have higher inference throughput and lower resource requirements. It also makes end-to-end AI simpler: it automates the pipeline with click-to-run workflows and SigOpt AutoML, abstracts the complex APIs for data processing and feature engineering, simplifies distributed training, and integrates easily with existing or third-party machine learning (ML) solutions or platforms. Finally, it brings complex, compute-intensive DL models to commodity hardware by delivering built-in optimized models, which are parameterized models generated by the smart democratization advisor (SDA), and domain-specific compact neural networks constructed with neural architecture search (NAS) technology. All of this makes AI more accessible to citizen data scientists.

Architecture

Popular models (e.g., RecSys, CV, NLP, ASR, and RL) from various domains are the input to Intel End-to-End AI Optimization Kit (Figure 1). These stock models are heavy, slow to train, and complex to tune and optimize. Depending on the model type, Intel End-to-End AI Optimization Kit optimizes the models with either the model advisor or the neural network constructor. The expected outputs are optimized models that require only one-tenth the FLOPs (floating-point operations) and training time of the stock model, with the same accuracy or a minimal loss of accuracy.

Figure 1. Architecture and workflow of Intel® End-to-End AI Optimization Kit

RecDP

RecDP is a parallel data processing and feature engineering library built on PySpark* and extensible to other data processing tools like Modin*. Its key features and functions are as follows:

A tabular dataset processing toolkit
Abstract APIs to hide Spark* programming complexity
Optimized performance through adaptive query plan and strategy
Support for common feature engineering functions like target and count encoding
Easy integration into third-party solutions
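To make the two encodings named in the list concrete, here is a minimal pure-Python sketch of count encoding and target encoding on a single column. It is not RecDP's API: the function names, the `smoothing` parameter, and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_encode(values):
    """Replace each category with the number of times it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


def target_encode(values, targets, smoothing=0.0):
    """Replace each category with the mean target of its rows.

    With smoothing > 0, rare categories are pulled toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    counts = Counter(values)
    sums = defaultdict(float)
    for v, t in zip(values, targets):
        sums[v] += t
    return [
        (sums[v] + smoothing * global_mean) / (counts[v] + smoothing)
        for v in values
    ]


# Toy data (invented for this example): a categorical column and a click label
site = ["a", "b", "a", "c", "a", "b"]
clicked = [1, 0, 1, 0, 0, 1]

site_count = count_encode(site)             # frequency of each category
site_target = target_encode(site, clicked)  # per-category click rate
print(site_count)
print(site_target)
```

On a real dataset, target statistics would normally be computed on training folds only to avoid leaking the label into the features.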

RecDP uses "lazy execution" for better performance. It fuses operators and leverages the collection of data statistics to avoid unnecessary passes through the dataset, which is critical when handling large datasets. RecDP can also leverage the native columnar SQL engine capabilities provided by Optimized Analytics Package for Spark* platform to improve performance.
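The idea behind lazy execution and operator fusion can be sketched in a few lines of plain Python. This toy pipeline is an assumption-laden illustration, not RecDP or Spark code: the class `LazyPipeline` and the two operators are hypothetical. Operators are only queued when appended, and `transform()` applies all of them in a single scan of the rows.

```python
class LazyPipeline:
    """Queue row-wise operators and apply them all in one pass over the data."""

    def __init__(self):
        self.ops = []
        self.passes = 0  # number of full scans of the dataset so far

    def append(self, op):
        # Nothing is executed here; the operator is only recorded
        self.ops.append(op)
        return self

    def transform(self, rows):
        self.passes += 1
        out = []
        for row in rows:
            # Fused execution: every queued operator sees the row in the same scan
            for op in self.ops:
                row = op(row)
            out.append(row)
        return out


def fill_missing(row):
    """Replace missing values with 0."""
    return {k: (0 if v is None else v) for k, v in row.items()}


def add_total(row):
    """Add a derived feature computed from two existing columns."""
    return {**row, "total": row["a"] + row["b"]}


rows = [{"a": 1, "b": None}, {"a": None, "b": 4}]
pipe = LazyPipeline().append(fill_missing).append(add_total)
result = pipe.transform(rows)
print(result, "passes:", pipe.passes)
```

Running each operator eagerly would scan the data once per operator. Deferring the work lets the pipeline do it in one pass, which matters most when the dataset does not fit in memory.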

Smart Democratization Advisor (SDA)

SDA is a user-guided tool to facilitate automation. It provides built-in intelligence through parameterized models, built-in optimized models (e.g., RecSys, CV, NLP, ASR, and RL), and SigOpt for hyperparameter optimization (HPO). It also converts manual model tuning and optimization into inputs that assist AutoML and AutoHPO.
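As a rough picture of what hyperparameter optimization over a parameterized model looks like, the sketch below runs a plain random search over a small search space. Random search is used here only as a simple stand-in; it is not how SigOpt selects candidates. The parameter names, the search space, and `toy_objective` are hypothetical.

```python
import random

# Hypothetical search space: one grid parameter and one bounded integer parameter
SEARCH_SPACE = [
    {"name": "hidden_units", "grid": [16, 64, 128], "type": "int"},
    {"name": "num_layers", "bounds": {"min": 1, "max": 8}, "type": "int"},
]


def sample(space, rng):
    """Draw one configuration from the search space."""
    params = {}
    for p in space:
        if "grid" in p:
            params[p["name"]] = rng.choice(p["grid"])
        else:
            params[p["name"]] = rng.randint(p["bounds"]["min"], p["bounds"]["max"])
    return params


def random_search(space, objective, budget, seed=0):
    """Evaluate `budget` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(budget):
        params = sample(space, rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Stand-in for 'train the model and report a score' (higher is better)."""
    return -abs(params["hidden_units"] - 64) - abs(params["num_layers"] - 3)


best_params, best_score = random_search(SEARCH_SPACE, toy_objective, budget=50)
print(best_params, best_score)
```

Narrowing the bounds and grids in advance, as the built-in parameterized models are described as doing, reduces the number of trials a search like this needs.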

Neural Network Constructor

Neural network constructor is based on neural architecture search technology. Using a predefined supernet, it constructs neural network structures directly for a given domain. Its key features and functions are as follows:

Multi-model support, such as models from the CV, NLP, and ASR domains
A unified, transformer-based SuperNet
Hardware-aware NAS that uses metrics like FLOPs and latency as thresholds to determine the model architecture and model size
Train-free NAS that uses a zero-cost proxy metric rather than training accuracy for candidate evaluation. It takes multiple network characteristics into consideration, such as trainability, expressivity, diversity, and saliency.
A model adapter to deploy the model in the user’s production environment. The model adapter is a transfer learning-based component that provides fine-tuning, knowledge distillation, and domain adaptation features.
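The combination of a hardware threshold and a training-free score can be sketched as a two-step search: discard candidates that exceed a compute budget, then rank the remainder with a score that needs no training. Everything below is an assumption for illustration: the search space, the rough FLOPs estimate for a transformer encoder, and especially `placeholder_proxy`, which is a made-up stand-in and not one of the proxies the toolkit uses.

```python
import itertools
import math

# Hypothetical search space for a small transformer encoder
SEARCH_SPACE = {
    "depth": [4, 8, 12],
    "hidden": [256, 512, 768],
    "mlp_ratio": [2, 4],
}


def estimate_flops(cfg, seq_len=128):
    """Rough forward-pass FLOPs for a transformer encoder (2 FLOPs per multiply-add)."""
    d, r = cfg["hidden"], cfg["mlp_ratio"]
    # Per token: Q/K/V/output projections, attention scores and mixing, two MLP layers
    macs_per_token = 4 * d * d + 2 * seq_len * d + 2 * r * d * d
    return 2 * cfg["depth"] * seq_len * macs_per_token


def placeholder_proxy(cfg):
    """Made-up training-free score; a real zero-cost proxy would inspect the network."""
    return math.log(cfg["depth"]) + math.log(cfg["hidden"])


def search(space, flops_budget, proxy):
    """Return the best-scoring candidate that fits the FLOPs budget, or None."""
    keys = list(space)
    candidates = [dict(zip(keys, vals)) for vals in itertools.product(*space.values())]
    feasible = [c for c in candidates if estimate_flops(c) <= flops_budget]
    if not feasible:
        return None
    return max(feasible, key=proxy)


FLOPS_BUDGET = 4e9  # arbitrary threshold chosen for the example
best = search(SEARCH_SPACE, FLOPS_BUDGET, placeholder_proxy)
print(best, estimate_flops(best))
```

Because no candidate is trained during the search, the cost of evaluating the whole space is dominated by the proxy computation rather than by training runs.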

Example

Here is an example of how the toolkit works on the DL Recommendation Model (DLRM), including environment setup, data processing, built-in model advisor with patched code, and a one-line command to kick off the entire optimization process.

Step 1: Environment Setup

Intel End-to-End AI Optimization Kit is built on top of AI Kit, so the software necessary to run the pipeline is already accessible:

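# Dockerfile
FROM docker.io/intel/oneapi-aikit

# Proxy settings from the author's network; remove or change them for your environment
ENV http_proxy=http://proxy-prc.intel.com:912
ENV https_proxy=http://proxy-prc.intel.com:912

# SigOpt
RUN python -m pip install sigopt==7.5.0 --ignore-installed

# PyTorch conda environment
# "conda activate" does not persist across RUN instructions, so each command
# is run inside the "pytorch" environment with "conda run" instead
RUN conda run -n pytorch python -m pip install prefetch_generator tensorboardX onnx tqdm lark-parser

# Intel Extension for PyTorch
RUN conda run -n pytorch python -m pip install intel_extension_for_pytorch==1.10.0 -f https://software.intel.com/ipex-whl-stable psutil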
RUN mkdir -p /home/vmagent/app
WORKDIR /home/vmagent/app

Step 2: Parallel Data Processing with RecDP

The next step is to use RecDP for simplified data processing. In this example, two operators, Categorify() and FillNA(), are chained together and Spark lazy execution is used to reduce unnecessary passes through the data:

from pyspark.sql import *
from pyspark import *
from pyspark.sql.types import *
from pyrecdp.data_processor import *
from pyrecdp.encoder import *
from pyrecdp.utils import *
from timeit import default_timer as timer

path_prefix = "file://"
current_path = "/home/vmagent/app/dataset/demo/processed/"
csv_file = "/home/vmagent/app/dataset/demo/criteo_mini.txt"

# Criteo layout: column 0 is the label, columns 1-13 are integer features,
# and columns 14-39 are categorical features
LABEL_COL = 0
INT_COLS = list(range(1, 14))
CAT_COLS = list(range(14, 40))

# 1. Start Spark and initialize the data processor
t0 = timer()
# Adjust spark.driver.memory to what your machine can provide
spark = SparkSession.builder.master('local[*]') \
                            .config('spark.driver.memory', '100G') \
                            .appName("DLRM").getOrCreate()
# Column names must match the '_c%d' names used by the operators below
int_fields = [StructField(f'_c{i}', IntegerType()) for i in [LABEL_COL] + INT_COLS]
str_fields = [StructField(f'_c{i}', StringType()) for i in CAT_COLS]
schema = StructType(int_fields + str_fields)
df = spark.read.option('sep', '\t').option("mode", "DROPMALFORMED") \
               .schema(schema).csv(path_prefix + csv_file)
proc = DataProcessor(spark, path_prefix, current_path=current_path, spark_mode='local')

# 2. Process data with RecDP: operators are queued, then run together by transform()
to_categorify_cols = ['_c%d' % i for i in CAT_COLS]
op_categorify = Categorify(to_categorify_cols)
op_fillna_for_categorified = FillNA(to_categorify_cols, 0)
proc.append_ops([op_categorify, op_fillna_for_categorified])
df = proc.transform(df, name='dlrm')
t1 = timer()
print(f"Total process time is {(t1 - t0)} secs")
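For readers unfamiliar with the two operators, the following pure-Python sketch shows what categorifying a column and then filling missing values amounts to. It is only a conceptual model, not RecDP's implementation: the id assignment order (first appearance, starting at 1 so that 0 can mark missing values) is an assumption of this sketch, and the sample values are made up.

```python
def categorify(column):
    """Map each distinct non-missing value to an integer id; keep missing as None."""
    ids = {}
    encoded = []
    for value in column:
        if value is None:
            encoded.append(None)
        else:
            # Ids start at 1 in this sketch so that 0 stays free for missing values
            encoded.append(ids.setdefault(value, len(ids) + 1))
    return encoded, ids


def fill_na(column, default):
    """Replace missing entries with a default value."""
    return [default if value is None else value for value in column]


# Made-up hashed category strings with one missing entry
raw = ["68fd1e64", None, "05db9164", "68fd1e64"]
encoded, ids = categorify(raw)
clean = fill_na(encoded, 0)
print(clean)
```

The resulting integer ids are what an embedding table can index directly, which is why categorical columns are encoded before training DLRM.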

Step 3: AutoML with Smart Democratization Advisor

SDA converts manual optimizations into inputs for AutoML. It provides predefined parameters and search ranges for built-in models, which significantly reduces the time AutoML needs:

model_info = dict()

# Config for model
model_info["score_metrics"] = [("accuracy", "maximize"), ("training_time", "minimize")]
model_info["execute_cmd_base"] = "python launch.py"
model_info["result_file_name"] = "best_auc.txt"

# Config for SigOpt
model_info["experiment_name"] = "dlrm"
model_info["sigopt_config"] = [
    {'name':'learning_rate','bounds':{'min':5,'max':50},'type':'int'},
    {'name':'lamb_lr','bounds':{'min':5,'max':50},'type':'int'},
    {'name':'warmup_steps','bounds':{'min':2000,'max':4500},'type':'int'},
    {'name':'decay_start_steps','bounds':{'min':4501,'max':9000},'type':'int'},
    {'name':'num_decay_steps','bounds':{'min':5000,'max':15000},'type':'int'},
    {'name':'sparse_feature_size','grid': [128,64,16],'type':'int'},
    {'name':'mlp_top_size','bounds':{'min':0,'max':7},'type':'int'},
    {'name':'mlp_bot_size','bounds':{'min':0,'max':3},'type':'int'}]
model_info["observation_budget"] = 1
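The configuration names two objectives, accuracy to maximize and training time to minimize. With two competing metrics there is usually no single best trial, only a set of trade-offs. The sketch below picks the non-dominated (Pareto-optimal) trials from a list of results. How SigOpt or SDA resolves the two metrics internally is not shown in this article, so treat this as one possible reading; the trial values are invented for the example and are not measurements.

```python
def dominates(a, b, metrics):
    """True if trial a is at least as good as b on every metric and better on one."""
    strictly_better = False
    for name, direction in metrics:
        sign = 1.0 if direction == "maximize" else -1.0
        x, y = sign * a[name], sign * b[name]
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def pareto_front(trials, metrics):
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(o, t, metrics) for o in trials)]


score_metrics = [("accuracy", "maximize"), ("training_time", "minimize")]

# Invented trial results for illustration only
trials = [
    {"id": 1, "accuracy": 0.80, "training_time": 120.0},
    {"id": 2, "accuracy": 0.79, "training_time": 60.0},
    {"id": 3, "accuracy": 0.78, "training_time": 90.0},   # worse than trial 2 on both
    {"id": 4, "accuracy": 0.80, "training_time": 150.0},  # same accuracy as 1, slower
]

front = pareto_front(trials, score_metrics)
front_ids = [t["id"] for t in front]
print(front_ids)
```

Choosing between the remaining trials is then a matter of how much accuracy one is willing to give up for shorter training time.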

Beyond configurable parameters, some cases need code-level optimization. For all built-in models, the kit provides optimized models with patched code. Here is an example that uses IPEX and BF16, together with embedding and optimizer changes, to improve model convergence on multiple CPU nodes:

# Framework optimization. Use IPEX & BF16

# Embedding Optimization
-       m_curr = (m[i] if self.max_emb_dim > 0 else m)
-       EE = nn.EmbeddingBag(n, m_curr, mode="sum", sparse=True)
-       W = np.random.uniform(
-           low=-np.sqrt(1 / n), high=np.sqrt(1 / n), size=(n, m_curr)
-       ).astype(np.float32)
        # democratized, use two dimension, sparse and dense
+       W = np.random.uniform(
+           low=-np.sqrt(1 / n), high=np.sqrt(1 / n), size=(n, m)
+       ).astype(np.float32)

# Optimizer Optimization
- optimizer = torch.optim.SGD(dlrm.parameters(), lr=args.learning_rate) …)
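Because the patch relies on BF16, it helps to see what that format keeps and what it drops. The sketch below rounds a float32 value to bfloat16 using only the Python standard library: bfloat16 keeps the sign and all 8 exponent bits of float32 but only the top 7 mantissa bits. This illustrates the number format only; it says nothing about how IPEX applies BF16 inside the model, and the helper names are assumptions. NaN and values outside the float32 range are not handled.

```python
import struct


def float32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x):
    """Round x to bfloat16 (round to nearest, ties to even); return the 16-bit pattern."""
    bits = float32_bits(x)
    # Add 0x7FFF plus the lowest kept bit, then drop the lower 16 bits
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def round_to_bfloat16(x):
    """Return the value x takes after a round trip through bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


for value in (1.0, 0.1, 1.0 + 2 ** -10, 3.0e38):
    print(value, "->", round_to_bfloat16(value))
```

The large value survives because the exponent range matches float32, while the small increment `2 ** -10` on top of 1.0 is lost. That trade-off, full range with reduced precision, is why BF16 is a common choice for training.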

Step 4: End-to-End Model Optimization

Finally, we kick off the end-to-end model optimization with just a few lines of code:

from e2eAIOK import SDA

# `setting` holds the run configuration; it is assumed to be defined earlier (not shown here)
sda = SDA(model="dlrm", settings=setting)
sda.launch()

Intel End-to-End AI Optimization Kit provides more click-to-run recipes for popular models, including:

RecSys: DLRM, DIEN, WnD
Automatic speech recognition (ASR): RNNT
Computer vision (CV): RESNET
Natural language processing (NLP): BERT
Reinforcement learning (RL): MiniGO

Performance

The tests were conducted on a four-node cluster (Table 1). Each node was equipped with two Intel® Xeon® Gold 6240 processors and 384GB of memory. The nodes were connected through 25Gb Ethernet. One 1TB HDD was used as the data drive.

Configuration	Details
Test Date	12/2021
Platform	S2600WFT
CPU	Intel® Xeon® Gold 6240
Number of Nodes	4
CPU per node	18 cores/socket, 2 sockets, 2 threads/core
Memory	DDR4 dual rank 384GB, 12 slots / 32GB / 2666 MHz
Storage	1x 400GB Intel® SSD (SSDSC2BA400G3) OS drive; 1x 1TB HDD for data storage
Network	1x Intel® X722, 1x Intel XXV710
Microcode	0x500002C
BIOS version	SE5C620.86B.0X.02.0094.102720191711
OS/Hypervisor/SW	Fedora* 29 5.3.11-100.fc29.x86_64

Table 1. System configuration

We compared the performance of the stock and optimized models on three popular RecSys models (Table 2). The toolkit delivered significant speedups for ETL, training, and inference on all three workloads (Figure 2).

Workloads	DLRM	WnD	DIEN
Framework	PyTorch*	TensorFlow*	TensorFlow
Libraries	oneMKL, oneCCL	oneMKL, oneCCL	oneMKL, oneCCL
Dataset (size, shape)	Criteo	outbrain	categoryFiles
Precision (FP32, INT8, BF16)	BF16	-	-
KMP AFFINITY	granularity=fine,compact,1,0	granularity=fine,compact,1,0	granularity=fine,compact,1,0
NUMACTL	Socket binding	Socket binding	Socket binding
OMP_NUM_THREADS	20	16	4

Table 2. Test configuration

Figure 2. RecSys speedup with Intel® End-to-End AI Optimization Kit. Baseline configuration: one node with two Intel® Xeon® Gold 6240 processors (18 cores per socket), HT On, Turbo On, 384 GB (12 slots/32GB/2666 MHz) memory, BIOS SE5C620.86B.0X.02.0094.102720191711 (ucode:0x500002C), Fedora 29, 5.3.11-100.fc29.x86_64, PyTorch*, TensorFlow*, Spark*. Optimized configuration: same hardware and software configuration, except with four nodes, Intel-optimized PyTorch and TensorFlow, Horovod*, and modified DLRM, WnD, and DIEN workloads.

Call to Action

Intel End-to-End AI Optimization Kit is a composable toolkit that delivers high-performance, lightweight models for commodity hardware, which helps to democratize AI. It leverages several key optimizations:

Parallel data processing with RecDP
Intel-optimized training frameworks
Lighter models with fewer layers and reduced communication overhead on distributed-memory parallel computers
Optimizer tuning (DLRM) to converge faster with larger batch size
Feature engineering (embedding table and encoding) optimizations

It gives good results on a wide range of popular models. If you want to use it in your own project, visit https://github.com/intel/e2eAIOK/ to try it out.

The Parallel Universe Magazine

Intel’s quarterly magazine helps you take your software development into the future with the latest tools, tips, and training to expand your expertise.


Accelerating Artificial Intelligence with Intel® End-to-End AI Optimization Kit
