Hugging Face Optimum* with Intel Tools Amplifies Transformer Performance

Learn How to Use the Optimum Library with Intel Tools and APIs

Hugging Face* Transformers is an open source natural language processing (NLP) library and platform that provides pretrained models and tools for a wide range of NLP tasks. The library is named after the transformer architecture, which has revolutionized NLP through models like BERT, GPT, and RoBERTa. These models have achieved state-of-the-art results on various NLP benchmarks.

It offers a user-friendly interface for working with pretrained transformer models, allowing developers and researchers to easily fine-tune these models on specific NLP tasks or use them for tasks like text classification, text generation, question answering, and more. For optimal performance on Intel® platforms, it is crucial to integrate support for model compression technologies. Intel has made significant contributions, including quantization, pruning, and distillation techniques, to enhance the capabilities of the Optimum library, which can be seamlessly combined with Hugging Face Transformers models.

What Are INCTrainer and INCQuantizer?

The Trainer class offers a comprehensive API for training in Transformers. For techniques that require a training process, such as quantization-aware training (QAT), pruning, and distillation, we extend the functionality of the Transformers Trainer to create our own custom INCTrainer. Users can pass compression configurations, such as quantization_config, pruning_config, and distillation_config, to specify the desired settings and parameters for each technique.

The INCQuantizer class is specifically designed for post-training quantization (PTQ), a method that does not involve a training process. It inherits from the base class OptimumQuantizer, which is part of the Optimum library, and provides the flexibility to customize the from_pretrained and quantize functions. This extensibility enables us to effectively use INCQuantizer for PTQ.

After completing the compression process, you can use the _onnx_export function to convert your PyTorch* models into an Open Neural Network Exchange (ONNX*) format. Additionally, the INCModel class plays a crucial role in loading the quantized model and conducting evaluations. We have various INCModel classes tailored for different tasks, such as INCModelForSequenceClassification, which is designed specifically for sequence classification tasks.
 

The Value of AI Tools and APIs from Intel for the Optimum Library

Intel provides a comprehensive suite of performance optimization tools that facilitate efficient model training and deployment on dedicated hardware platforms for the Optimum library. Users can tap the capabilities of these platforms with the same ease and convenience they expect from Transformers. Following the compression process, users can seamlessly deploy models on Intel runtimes: quantized models with Intel® Extension for PyTorch*, Intel® Extension for Transformers*, and the OpenVINO™ toolkit, and pruned models with Intel Extension for Transformers.
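
As one hedged illustration of the OpenVINO toolkit path mentioned above, the sketch below exports a Transformers model to the OpenVINO format through Optimum Intel and runs it with the familiar pipeline API. The OVModelForSequenceClassification class and the export=True flag are assumptions based on the standard Optimum Intel OpenVINO interface and are not covered elsewhere in this article; the flag name may differ between library versions.

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch checkpoint to the OpenVINO IR format on the fly
ov_model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The OpenVINO model plugs into the regular Transformers pipeline API
classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)
print(classifier("The optimized model runs noticeably faster."))
```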

Examples

The Optimum library offers compression examples for various tasks. Each example folder provides a user-friendly Python* command-line interface, for example: python run_qa_post_training.py --apply_quantization.

APIs

INCQuantizer

Use the function quantize to support Intel® Neural Compressor post-training quantization.

Parameters:

| ID | Type | Description |
| --- | --- | --- |
| model | torch.nn.Module | The model to quantize |
| eval_fn | Callable[[PreTrainedModel], int], defaults to None | The evaluation function to use for the accuracy-driven strategy of the quantization process. The accuracy-driven strategy is enabled only if eval_fn is provided (see the sketch after this table). |
| calibration_fn | Callable[[PreTrainedModel], int], optional, defaults to None | The calibration function to use for the accuracy-driven strategy of the quantization process |
| task | str, defaults to None | The task defining the model topology used for the ONNX export |
| seed | int, defaults to 42 | The random seed to use when shuffling the calibration dataset |
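
To make the eval_fn parameter concrete, here is a minimal accuracy-driven sketch that reuses the SST-2 model featured later in this article. The AccuracyCriterion and TuningCriterion values are illustrative assumptions, not requirements.

```python
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion
from optimum.intel import INCQuantizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A small evaluation set and metric to drive the accuracy-aware tuning loop
eval_dataset = load_dataset("glue", "sst2", split="validation").select(range(64))
metric = evaluate.load("accuracy")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def eval_fn(model):
    # Score each candidate model produced during tuning
    classifier.model = model
    preds = [0 if p["label"] == "NEGATIVE" else 1 for p in classifier(eval_dataset["sentence"])]
    return metric.compute(predictions=preds, references=eval_dataset["label"])["accuracy"]

# Illustrative criteria: tolerate up to a 1% accuracy drop within at most 10 tuning trials
quantization_config = PostTrainingQuantConfig(
    approach="dynamic",
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),
    tuning_criterion=TuningCriterion(max_trials=10),
)
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config, save_directory="dynamic_quantization")
```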

 

  • function INCQuantizer.get_calibration_dataset parameters: This API creates the calibration datasets.Dataset (passed to the quantize function) to use for the post-training static quantization calibration step.
     

| ID | Type | Description |
| --- | --- | --- |
| dataset_name | str | The dataset repository name on the Hugging Face Hub, or a path to a local directory containing data files in generic formats and optionally a dataset script, if it requires some code to read the data files |
| num_samples | int, defaults to 100 | The maximum number of samples composing the calibration dataset |
| dataset_config_name | str, optional | The name of the dataset configuration |
| dataset_split | str, defaults to train | Which split of the dataset to use to perform the calibration step |
| preprocess_function | Callable, optional | Processing function to apply to each example after loading the dataset |
| preprocess_batch | bool, defaults to True | Whether the preprocess_function should be batched |
| use_auth_token | bool, defaults to False | Whether to use the token generated when running transformers-cli login |

  • function INCQuantizer.quantize parameters: This API quantizes a model given the optimization specifications defined in quantization_config.
     

| ID | Type | Description |
| --- | --- | --- |
| quantization_config | PostTrainingQuantConfig | The configuration containing the parameters related to quantization |
| save_directory | Union[str, Path] | The directory where the quantized model should be saved |
| calibration_dataset | datasets.Dataset, defaults to None | The dataset to use for the calibration step, needed for post-training static quantization |
| batch_size | int, defaults to 8 | The number of calibration samples to load per batch |
| data_collator | DataCollator, defaults to None | The function to use to form a batch from a list of elements of the calibration dataset |
| remove_unused_columns | bool, defaults to True | Whether or not to remove the columns unused by the model forward method |
| weight_only | bool, defaults to False | Whether to compress weights to integer precision (4-bit by default) while keeping activations in floating point. Best suited for reducing the footprint and accelerating the performance of large language models (LLMs). |

  • quantization_config parameters:

| ID | Type | Description |
| --- | --- | --- |
| approach | str, defaults to auto | The quantization approach to apply (for example, static or dynamic) |
| recipes | dict, defaults to an empty dict | The quantization recipes to use and their settings (see the example below) |

INCQuantizer Code Example

```python
from functools import partial

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "static_quantization"

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer = INCQuantizer.from_pretrained(model)
# Generate the calibration dataset needed for the calibration step
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory=save_dir,
)
```

Quantization recipes can be specified as follows:

```python
recipes = {"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5, "folding": True}}
quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", recipes=recipes)
```
  • function INCQuantizer._onnx_export parameters: This API exports the int8 model to the ONNX format, using the configuration in onnx_config, so the model can be used across multiple frameworks.
     

| ID | Type | Description |
| --- | --- | --- |
| compressed_model | PyTorchModel | The quantized model |
| onnx_config | OnnxConfig | Base class for an ONNX-exportable model, describing metadata on how to export the model through the ONNX format; constructed by ExportConfigConstructor |
| output_onnx_path | Union[str, Path] | The path to the ONNX model to be saved |

  • onnx_config parameters in the _onnx_export function:

| ID | Type | Description |
| --- | --- | --- |
| self._original_model.config | transformers.PretrainedConfig | The model configuration |


_onnx_export Code Example

from huggingface_hub import HfApi
from optimum.exporters import TasksManager
from optimum.exporters.onnx import OnnxConfig
from ..utils.constant import ONNX_WEIGHTS_NAME

# Build the ONNX export configuration for the original model
# (this snippet runs inside INCQuantizer, where self, compressed_model, and save_directory are defined)
model_type = self._original_model.config.model_type.replace("_", "-")
model_name = getattr(self._original_model, "name", None)
task = HfApi().model_info(self._original_model.config._name_or_path).pipeline_tag
onnx_config_class = TasksManager.get_exporter_config_constructor(
    exporter="onnx",
    model=self._original_model,
    task=task,
    model_type=model_type,
    model_name=model_name,
)
onnx_config = onnx_config_class(self._original_model.config)
compressed_model.eval()
output_onnx_path = save_directory.joinpath(ONNX_WEIGHTS_NAME)
self._onnx_export(compressed_model, onnx_config, output_onnx_path)

INCTrainer

INCTrainer supports Intel Neural Compressor quantization-aware training, pruning, and distillation. Users need to define a configuration for each optimization: for example, quantization_config is the configuration for quantization-aware training and pruning_config is the configuration for pruning.

Parameters:

| ID | Type | Description |
| --- | --- | --- |
| model | Union[PreTrainedModel, torch.nn.Module], defaults to None | The input model |
| args | TrainingArguments, defaults to None | The arguments for training |
| data_collator | Optional[DataCollator], defaults to None | The function to use to form a batch from a list of elements of train_dataset or eval_dataset |
| train_dataset | Optional[Dataset], defaults to None | The dataset for training the model |
| eval_dataset | Optional[Dataset], defaults to None | The dataset for evaluating the model |
| tokenizer | Optional[PreTrainedTokenizerBase], defaults to None | The tokenizer used to preprocess the data |
| model_init | Callable[[], PreTrainedModel], defaults to None | A function that instantiates the model to be used |
| compute_metrics | Optional[Callable[[EvalPrediction], Dict]], defaults to None | A function that takes an EvalPrediction object (a namedtuple with predictions and label_ids fields) and returns a dictionary mapping metric names to float values |
| callbacks | Optional[List[TrainerCallback]], defaults to None | A list of callbacks to customize the training loop |
| optimizers | Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], defaults to (None, None) | The optimizer and learning rate scheduler to use |
| preprocess_logits_for_metrics | Callable[[torch.Tensor, torch.Tensor], torch.Tensor], defaults to None | A function to preprocess the logits before computing the metrics. Depending on the model and configuration, logits may contain extra tensors, like past_key_values, but logits always come first. |
| quantization_config | Optional[QuantizationAwareTrainingConfig], defaults to None | The configuration containing the parameters related to quantization |
| pruning_config | Optional[WeightPruningConfig], defaults to None | Defines a single pruning configuration or a sequence of pruning configurations |
| distillation_config | Optional[DistillationConfig], defaults to None | The configuration of the distillation process (for example, the teacher model) |
| task | Optional[str], defaults to None | The task name |
| save_onnx_model | bool, defaults to False | Whether or not to also save the model in the ONNX format |

 

  • pruning_config parameters of the train function: Set this argument and use the train function of INCTrainer to generate a sparse (structured or unstructured) model given the specifications defined in pruning_config.
     

| ID | Type | Description |
| --- | --- | --- |
| start_step | int, optional, defaults to 0 | The step at which to start pruning |
| end_step | int, optional, defaults to 0 | The step at which to end pruning |
| target_sparsity | float, optional, defaults to 0.90 | The sparsity ratio the model can reach after pruning |
| pruning_type | str, optional, defaults to snip_momentum | A string that defines the criteria for pruning. Supports magnitude, snip, snip_momentum, magnitude_progressive, snip_progressive, snip_momentum_progressive, and pattern_lock. |


     
  • distillation_config parameters of the train function: Set this argument and use the train function of INCTrainer to generate a student model given the specifications defined in distillation_config.
     

| ID | Type | Description |
| --- | --- | --- |
| teacher_model | PreTrainedModel (Transformers) | The teacher model used for distillation, usually a larger model. Required when performing distillation. |

 

INCTrainer Code Example

```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel import INCModelForSequenceClassification, INCTrainer
from neural_compressor import QuantizationAwareTrainingConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda examples: tokenizer(examples["sentence"], padding=True, max_length=128), batched=True)
metric = evaluate.load("glue", "sst2")
compute_metrics = lambda p: metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)

# The directory where the quantized model will be saved
save_dir = "quantized_model"

# The configuration detailing the quantization process
quantization_config = QuantizationAwareTrainingConfig()

trainer = INCTrainer(
    model=model,
    quantization_config=quantization_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = INCModelForSequenceClassification.from_pretrained(save_dir)
```

 

INCTrainer Supports Pruning

In the same manner, pruning can be applied by specifying the pruning configuration detailing the desired pruning process.

```python
from optimum.intel import INCTrainer
from neural_compressor import WeightPruningConfig

# The configuration detailing the pruning process
pruning_config = WeightPruningConfig(
    pruning_type="magnitude",
    start_step=0,
    end_step=15,
    target_sparsity=0.2,
    pruning_scope="local",
)

trainer = INCTrainer(
    model=model,
    pruning_config=pruning_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```

 

INCTrainer Supports Knowledge Distillation

Knowledge distillation can also be applied in the same manner.

```python
from optimum.intel import INCTrainer
from neural_compressor import DistillationConfig

teacher_model_id = "textattack/bert-base-uncased-SST-2"
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_id)
distillation_config = DistillationConfig(teacher_model=teacher_model)

trainer = INCTrainer(
    model=model,
    distillation_config=distillation_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```

Note: The export API for INCTrainer is similar to that of INCQuantizer.
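
As a hedged illustration of this note, the sketch below repeats the quantization-aware training setup in condensed form and simply enables the save_onnx_model flag listed in the INCTrainer parameters above; the exact export behavior may vary across Optimum Intel versions.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from neural_compressor import QuantizationAwareTrainingConfig
from optimum.intel import INCTrainer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda e: tokenizer(e["sentence"], padding=True, max_length=128), batched=True)

save_dir = "quantized_model_with_onnx"
trainer = INCTrainer(
    model=model,
    quantization_config=QuantizationAwareTrainingConfig(),
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    save_onnx_model=True,  # assumption: also writes an ONNX copy of the quantized model
)
trainer.train()
trainer.save_model()  # the ONNX file is expected in save_dir alongside the PyTorch model
```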

INCModel

Instantiates a quantized PyTorch model from a given Intel Neural Compressor configuration file. Returns a quantized model.

  • function from_pretrained parameters:

| ID | Type | Description |
| --- | --- | --- |
| model_name_or_path | str | The repository name on the Hugging Face Hub or the path to a local directory hosting the model |
| q_model_name | str, optional | The name of the state dictionary located in model_name_or_path and used to load the quantized model. Not used if a state dictionary is provided directly. |
| cache_dir | str, optional | Path to a directory in which a downloaded configuration should be cached if the standard cache should not be used |
| force_download | bool, optional, defaults to False | Whether or not to force the (re)download of the configuration files, overriding the cached versions if they exist |
| resume_download | bool, optional, defaults to False | Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists. |
| revision | str, optional | The specific model version to use. It can be a branch name, a tag name, or a commit ID. Since models and other artifacts are stored on huggingface.co with a Git-based system, revision can be any identifier allowed by Git. |
| state_dict_path | str, optional | The path to the state dictionary of the quantized model |
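
For illustration, here is a minimal sketch of loading the statically quantized model saved earlier in this article (save_dir = "static_quantization") and running a single prediction. The direct forward call is an assumption that holds when the underlying quantized model returns standard sequence classification outputs.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import INCModelForSequenceClassification

# Directory produced by the INCQuantizer static quantization example above
save_dir = "static_quantization"
model = INCModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Depending on how the quantized model was traced, outputs may be an object or a dict
logits = outputs.logits if hasattr(outputs, "logits") else outputs["logits"]
predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```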

 

Classes inherited from INCModel:

  • INCModelForQuestionAnswering: Model for question answering mapping. Examples include BertForQuestionAnswering and LongformerForQuestionAnswering.
  • INCModelForSequenceClassification: Model for sequence classification mapping. Examples include BertForSequenceClassification and XLMRobertaForSequenceClassification.
  • INCModelForTokenClassification: Model for token classification mapping. Examples include BertForTokenClassification and DistilbertForTokenClassification.
  • INCModelForMultipleChoice: Model for multiple choice mapping. Examples include BertForMultipleChoice and DistilbertForMultipleChoice.
  • INCModelForSeq2SeqLM: Model for Seq2Seq causal LM mapping. Examples include PegasusForConditionalGeneration and T5ForConditionalGeneration.
  • INCModelForCausalLM: Model for causal LM mapping. Examples include CTRLLMHeadModel and GPTNeoForCausalLM. See the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

 

| ID | Type | Description |
| --- | --- | --- |
| model | PyTorch model | The model used to run inference |
| config | transformers.PretrainedConfig | Model configuration class with all the parameters of the model |
| device | str, defaults to cpu | The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. |

 

  • INCModelForMaskedLM: Model for masked LM mapping. Examples include BertForMaskedLM and XLMWithLMHeadModel.
  • INCModelForXLNetLM: Model for XLNetLMHeadModel mapping.
  • INCModelForVision2Seq: Model for Vision2Seq mapping, for example: VisionEncoderDecoderModel.

Note: For a full model-mapping list, see Transformers.
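
For the causal LM mapping above, a hedged generation sketch follows. The local directory name is hypothetical and stands for any causal language model quantized and saved with the workflow described in this article; it assumes the tokenizer was saved alongside the model and that the loaded model exposes the standard generate() API.

```python
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

# Hypothetical local directory containing an INC-quantized causal language model
quantized_model_dir = "quantized_causal_lm"
model = INCModelForCausalLM.from_pretrained(quantized_model_dir)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

inputs = tokenizer("The Optimum library makes it easy to", return_tensors="pt")
# Assumption: the loaded model exposes the standard generate() API
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```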

What’s Next?

We encourage you to check out Intel's other AI and machine learning framework optimizations and end-to-end portfolio of tools and to incorporate them into your AI workflow. Also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.