Learn How to Use the Optimum Library with Intel Tools and APIs
Hugging Face* Transformers is an open source natural language processing (NLP) library and platform that provides pretrained models and tools for a wide range of NLP tasks. The library is named after the Transformer architecture that has revolutionized NLP, with models such as BERT, GPT, and RoBERTa achieving state-of-the-art results on various NLP benchmarks.
It offers a user-friendly interface for working with pretrained transformer models, allowing developers and researchers to easily fine-tune these models on specific NLP tasks or use them for text classification, text generation, question answering, and more. For optimal performance on Intel® platforms, it is crucial to integrate support for model compression technologies. Intel has made significant contributions, including quantization, pruning, and distillation techniques, to enhance the capabilities of the Optimum library, which can be seamlessly combined with Hugging Face Transformers models.
What Are INCTrainer and INCQuantizer?
The Trainer class in Transformers offers a comprehensive training API. Because techniques such as quantization-aware training (QAT), pruning, and distillation require a training process, we extend Transformers' Trainer into a custom INCTrainer. Users pass compression configurations, such as quantization_config, pruning_config, and distillation_config, to specify the desired settings and parameters for each technique.
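As a quick orientation, here is a minimal sketch of how these configurations might be created before being passed to INCTrainer. The classes come from Intel Neural Compressor and appear in the full examples later in this article; the argument values shown are illustrative placeholders.

```python
# Minimal sketch of the compression configurations accepted by INCTrainer.
# The argument values below are illustrative placeholders.
from neural_compressor import (
    DistillationConfig,
    QuantizationAwareTrainingConfig,
    WeightPruningConfig,
)

quantization_config = QuantizationAwareTrainingConfig()    # quantization-aware training settings
pruning_config = WeightPruningConfig(target_sparsity=0.2)  # pruning settings
# distillation_config = DistillationConfig(teacher_model=teacher_model)  # requires a teacher model
```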
The INCQuantizer class is specifically designed for post-training quantization (PTQ), a method that does not involve a training process. It inherits from the Optimum base class OptimumQuantizer and customizes the from_pretrained and quantize functions, which makes it well suited for PTQ.
After completing the compression process, you can use the _onnx_export function to convert your PyTorch* models into an Open Neural Network Exchange (ONNX*) format. Additionally, the INCModel class plays a crucial role in loading the quantized model and conducting evaluations. We have various INCModel classes tailored for different tasks, such as INCModelForSequenceClassification, which is designed specifically for sequence classification tasks.
The Value of AI Tools and APIs from Intel for the Optimum Library
Intel provides a comprehensive suite of performance optimization tools for the Optimum library, facilitating efficient model training and deployment on dedicated hardware platforms. Users can take advantage of these platforms with the same ease and convenience as Transformers itself. Following the compression process, models can be deployed with Intel runtimes: quantized models with Intel® Extension for PyTorch*, Intel® Extension for Transformers*, and the OpenVINO™ toolkit, and pruned models with Intel Extension for Transformers.
Examples
The Optimum library offers compression examples for various tasks. Each example folder provides a user-friendly Python* command-line interface, for example: python run_qa_post_training.py --apply_quantization.
APIs
INCQuantizer
Use the function quantize to support Intel® Neural Compressor post-training quantization.
Parameters:
| ID | Type | Description |
|----|------|-------------|
| model | torch.nn.Module | The model to quantize |
| eval_fn | Callable[[PreTrainedModel], int], defaults to None | The evaluation function to use for the accuracy-driven strategy of the quantization process. The accuracy-driven strategy is enabled only if eval_fn is provided. |
| calibration_fn | typing.Union[typing.Callable[[transformers.modeling_utils.PreTrainedModel], int], NoneType], defaults to None | The calibration function to use for the accuracy-driven strategy of the quantization process |
| task | Str, defaults to None | The task defining the model topology used for the ONNX export |
| seed | Int, defaults to 42 | The random seed to use when shuffling the calibration dataset |
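The eval_fn parameter enables the accuracy-driven tuning strategy. As a hedged sketch, it can be passed when creating the quantizer; the evaluation function below is a placeholder that would normally compute a real metric on a validation set.

```python
from transformers import AutoModelForSequenceClassification
from optimum.intel import INCQuantizer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Placeholder evaluation function: in practice it should run the model on a
# validation set and return a single accuracy value for the tuning loop.
def eval_fn(model):
    return 1.0

# Providing eval_fn enables the accuracy-driven quantization strategy.
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
```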
- function INCQuantizer.get_calibration_dataset parameters: This API creates the calibration datasets.Dataset, referenced by the quantize function, to use for the post-training static quantization calibration step.
| ID | Type | Description |
|----|------|-------------|
| dataset_name | Str | The dataset repository name on the Hugging Face Hub, or path to a local directory containing data files in generic formats and optionally a dataset script, if it requires some code to read the data files |
| num_samples | Int, defaults to 100 | The maximum number of samples composing the calibration dataset |
| dataset_config_name | Str, optional | The name of the dataset configuration |
| dataset_split | Str, defaults to train | Which split of the dataset to use to perform the calibration step |
| preprocess_function | Callable, optional | Processing function to apply to each example after loading the dataset |
| preprocess_batch | Bool, defaults to True | Whether the preprocess_function should be batched |
| use_auth_token | Bool, defaults to False | Whether to use the token generated when running transformers-cli login |
- function INCQuantizer.quantize parameters: This API quantizes a model given the optimization specifications defined in quantization_config.
| ID | Type | Description |
|----|------|-------------|
| quantization_config | PostTrainingQuantConfig | The configuration containing the parameters related to quantization |
| save_directory | Union[str, Path] | The directory where the quantized model should be saved |
| calibration_dataset | datasets.Dataset, defaults to None | The dataset to use for the calibration step, needed for post-training static quantization |
| batch_size | Int, defaults to 8 | The number of calibration samples to load per batch |
| data_collator | DataCollator, defaults to None | The function to use to form a batch from a list of elements of the calibration dataset |
| remove_unused_columns | Bool, defaults to True | Whether or not to remove the columns unused by the model forward method |
| weight_only | Bool, defaults to False | Whether to compress weights to integer precision (4-bit by default) while keeping activations in floating point. Best suited for reducing the footprint and accelerating the performance of large language models (LLMs) |
- quantization_config parameters:
| ID | Type | Description |
|----|------|-------------|
| approach | Str, defaults to auto | Quantization approach used (for example, static or dynamic) |
| recipes | Dict, defaults to empty | Quantization recipes to apply, for example smooth_quant (see the recipes example below) |
INCQuantizer Code Example
```python
from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "static_quantization"
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer = INCQuantizer.from_pretrained(model)
# Generate the calibration dataset needed for the calibration step
calibration_dataset = quantizer.get_calibration_dataset(
"glue",
dataset_config_name="sst2",
preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
num_samples=100,
dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(
quantization_config=quantization_config,
calibration_dataset=calibration_dataset,
save_directory=save_dir,
)
```
Quantization recipes can be specified as follows:
```python
recipes= {"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5, "folding": True}}
quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", recipes=recipes)
```
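For comparison, dynamic quantization quantizes activations at runtime, so no calibration dataset is needed. A minimal sketch, reusing the quantizer and imports from the static example above and writing the result to a separate directory:

```python
# Dynamic post-training quantization: activations are quantized at runtime,
# so no calibration dataset is required.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer.quantize(
    quantization_config=quantization_config,
    save_directory="dynamic_quantization",
)
```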
- function INCQuantizer._onnx_export parameters: This API exports the INT8 model to the ONNX format, using the configuration in onnx_config, so the model can be used across multiple frameworks.
| ID | Type | Description |
|----|------|-------------|
| compressed_model | PyTorchModel | The quantized model |
| onnx_config | OnnxConfig | Base class for ONNX-exportable models, describing metadata on how to export the model through the ONNX format; constructed by ExportConfigConstructor |
| output_onnx_path | Union[str, Path] | The path to the ONNX model to be saved |
- onnx_config parameters in the _onnx_export function:

| ID | Type | Description |
|----|------|-------------|
| self._original_model.config | transformers.PretrainedConfig | The model configuration |
_onnx_export Code Example
```python
# Snippet adapted from the internals of INCQuantizer, where `self` is the quantizer
# instance, `compressed_model` is the quantized model, and `save_directory` is a
# pathlib.Path. The relative import resolves inside the optimum.intel package.
from huggingface_hub import HfApi
from optimum.exporters import TasksManager
from optimum.exporters.onnx import OnnxConfig
from ..utils.constant import ONNX_WEIGHTS_NAME

model_type = self._original_model.config.model_type.replace("_", "-")
model_name = getattr(self._original_model, "name", None)
# Infer the task from the model card on the Hugging Face Hub
task = HfApi().model_info(self._original_model.config._name_or_path).pipeline_tag
# Build the ONNX export configuration constructor for this model type and task
onnx_config_class = TasksManager.get_exporter_config_constructor(
    exporter="onnx",
    model=self._original_model,
    task=task,
    model_type=model_type,
    model_name=model_name,
)
onnx_config = onnx_config_class(self._original_model.config)
compressed_model.eval()
output_onnx_path = save_directory.joinpath(ONNX_WEIGHTS_NAME)
# Export the quantized model to the ONNX format
self._onnx_export(compressed_model, onnx_config, output_onnx_path)
```
INCTrainer
INCTrainer supports Intel Neural Compressor quantization-aware training, pruning, and distillation. Users define a configuration for each optimization; for example, quantization_config configures quantization-aware training and pruning_config configures pruning.
Parameters:
| ID | Type | Description |
|----|------|-------------|
| model | Union [PreTrainedModel, torch.nn.Module], defaults to None | Input model |
| args | TrainingArguments, defaults to None | Arguments for training |
| data_collator | Optional [DataCollator], defaults to None | The function to use to form a batch from a list of elements of the training dataset |
| train_dataset | Optional [Dataset], defaults to None | Dataset for training the model |
| eval_dataset | Optional [Dataset], defaults to None | Dataset for evaluating the model |
| tokenizer | Optional [PreTrainedTokenizerBase], defaults to None | The tokenizer used to preprocess the data |
| model_init | typing.Callable[[], transformers.modeling_utils.PreTrainedModel], defaults to None | Used to instantiate the model |
| compute_metrics | Optional [Callable[[EvalPrediction], Dict]], defaults to None | Takes an EvalPrediction object (a namedtuple with predictions and label_ids fields) and returns a dictionary mapping metric names to float values |
| callbacks | Optional [List [TrainerCallback]], defaults to None | A list of callbacks to customize the training loop |
| optimizers | Tuple [torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], defaults to (None, None) | The optimizer and learning-rate scheduler to use for training |
| preprocess_logits_for_metrics | Callable [[torch.Tensor, torch.Tensor], torch.Tensor], defaults to None | A function to preprocess the logits before computing metrics. Depending on the model and configuration, logits may contain extra tensors, like past_key_values, but logits always come first. |
| quantization_config | Optional [QuantizationAwareTrainingConfig], defaults to None | The configuration containing the parameters related to quantization-aware training |
| pruning_config | Optional [WeightPruningConfig], defaults to None | The configuration containing the parameters related to pruning; defines a single pruning configuration or a sequence of them |
| distillation_config | Optional [DistillationConfig], defaults to None | The configuration containing the parameters related to distillation, such as the teacher model |
| task | Optional[str], defaults to None | Task name |
| save_onnx_model | Bool, defaults to False | Whether or not to also save the model in the ONNX format |
- pruning_config parameters of the train function: Set this argument and use the train function in INCTrainer to generate a sparse (structured or unstructured) model given the specifications defined in pruning_config.
| ID | Type | Description |
|----|------|-------------|
| start_step | Int, optional, defaults to 0 | The step to start pruning |
| end_step | Int, optional, defaults to 0 | The step to end pruning |
| target_sparsity | Float, optional, defaults to 0.90 | The sparsity ratio the model can reach after pruning |
| pruning_type | Str, optional, defaults to snip_momentum | A string that defines the criteria for pruning. Supports magnitude, snip, snip_momentum, magnitude_progressive, snip_progressive, snip_momentum_progressive, and pattern_lock |
- distillation_config parameters of the train function: Set this argument and use the train function in INCTrainer to generate a student model given the specifications defined in distillation_config.
| ID | Type | Description |
|----|------|-------------|
| teacher_model | PreTrainedModel in Transformers | Teacher model for distillation, usually a larger model; required when performing distillation |
INCTrainer Code Example
```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel import INCModelForSequenceClassification, INCTrainer
from neural_compressor import QuantizationAwareTrainingConfig
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda examples: tokenizer(examples["sentence"], padding=True, max_length=128), batched=True)
metric = evaluate.load("glue", "sst2")
compute_metrics = lambda p: metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
# The directory where the quantized model will be saved
save_dir = "quantized_model"
# The configuration detailing the quantization process
quantization_config = QuantizationAwareTrainingConfig()
trainer = INCTrainer(
model=model,
quantization_config=quantization_config,
args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
train_dataset=dataset["train"].select(range(300)),
eval_dataset=dataset["validation"],
compute_metrics=compute_metrics,
tokenizer=tokenizer,
data_collator=default_data_collator,
)
train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()
model = INCModelForSequenceClassification.from_pretrained(save_dir)
```
INCTrainer Supports Pruning
In the same manner, pruning can be applied by specifying the pruning configuration detailing the desired pruning process.
```python
from optimum.intel import INCTrainer
from neural_compressor import WeightPruningConfig
# The configuration detailing the pruning process
pruning_config = WeightPruningConfig(
pruning_type="magnitude",
start_step=0,
end_step=15,
target_sparsity=0.2,
pruning_scope="local",
)
trainer = INCTrainer(
model=model,
pruning_config=pruning_config,
args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
train_dataset=dataset["train"].select(range(300)),
eval_dataset=dataset["validation"],
compute_metrics=compute_metrics,
tokenizer=tokenizer,
data_collator=default_data_collator,
)
train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```
INCTrainer Supports Knowledge Distillation
Knowledge distillation can also be applied in the same manner.
```python
from optimum.intel import INCTrainer
from neural_compressor import DistillationConfig
teacher_model_id = "textattack/bert-base-uncased-SST-2"
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_id)
distillation_config = DistillationConfig(teacher_model=teacher_model)
trainer = INCTrainer(
model=model,
distillation_config=distillation_config,
args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
train_dataset=dataset["train"].select(range(300)),
eval_dataset=dataset["validation"],
compute_metrics=compute_metrics,
tokenizer=tokenizer,
data_collator=default_data_collator,
)
train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()
model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```
Note The export API for INCTrainer is similar to that of INCQuantizer.
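For example, setting the save_onnx_model argument described in the parameter table above should also produce an ONNX export when the trainer saves the model. The following hedged sketch reuses the objects from the quantization-aware training example; the exact files written to save_dir may differ by version.

```python
# Sketch: ask INCTrainer to also export the compressed model to ONNX when saving.
trainer = INCTrainer(
    model=model,
    quantization_config=quantization_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    save_onnx_model=True,  # also write an ONNX export alongside the PyTorch checkpoint
)
trainer.train()
trainer.save_model()  # expected to save both the quantized model and its ONNX counterpart
```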
INCModel
Instantiates a quantized PyTorch model from a given Intel Neural Compressor configuration file. Returns a quantized model.
- function from_pretrained parameters:
| ID | Type | Description |
|----|------|-------------|
| model_name_or_path | str | Repository name on the Hugging Face Hub or path to a local directory hosting the model |
| q_model_name | str, optional | The name of the state dictionary, located in model_name_or_path, used to load the quantized model. If state_dict is specified, q_model_name is ignored. |
| cache_dir | str, optional | Path to a directory in which a downloaded configuration should be cached if the standard cache should not be used |
| force_download | bool, optional, defaults to False | Whether or not to force the (re)download of the configuration files, overriding the cached versions if they exist |
| resume_download | bool, optional, defaults to False | Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists. |
| revision | str, optional | The specific model version to use. It can be a branch name, a tag name, or a commit ID, since huggingface.co uses a Git-based system for storing models and other artifacts. |
| state_dict_path | str, optional | The path to the state dictionary of the quantized model |
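As a brief sketch of the loading API, a checkpoint produced by the quantization examples above (the save directory name below is the one used earlier) can be reloaded and used for inference. This assumes the quantized model keeps the standard Transformers output format.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import INCModelForSequenceClassification

# Directory produced by the INCQuantizer example above (a Hub repository name also works).
model = INCModelForSequenceClassification.from_pretrained("static_quantization")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("This quantized model is surprisingly accurate.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# The output format may vary with the quantization backend; standard Transformers
# models return an object with a `logits` attribute.
logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
print(logits.argmax(dim=-1))
```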
Classes that inherit from INCModel:
- INCModelForQuestionAnswering: Model for question answering mapping. Examples include BertForQuestionAnswering and LongformerForQuestionAnswering.
- INCModelForSequenceClassification: Model for sequence classification mapping. Examples include BertForSequenceClassification and XLMRobertaForSequenceClassification.
- INCModelForTokenClassification: Model for token classification mapping. Examples include BertForTokenClassification and DistilbertForTokenClassification.
- INCModelForMultipleChoice: Model for multiple choice mapping. Examples include BertForMultipleChoice and DistilbertForMultipleChoice.
- INCModelForSeq2SeqLM: Model for Seq2Seq causal LM mapping. Examples include PegasusForConditionalGeneration and T5ForConditionalGeneration.
- INCModelForCausalLM: Model for causal LM mapping. Examples include CTRLLMHeadModel and GPTNeoForCausalLM. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving). Its parameters are:

| ID | Type | Description |
|----|------|-------------|
| model | PyTorch model | Main class used to run inference |
| config | transformers.PretrainedConfig | Model configuration class with all the parameters of the model |
| device | Str, defaults to cpu | The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. |
- INCModelForMaskedLM: Model for masked LM mapping. Examples include BertForMaskedLM and XLMWithLMHeadModel.
- INCModelForXLNetLM: Model for XLNetLMHeadModel mapping.
- INCModelForVision2Seq: Model for Vision2Seq mapping, for example: VisionEncoderDecoderModel.
Note For a full model-mapping list, see Transformers.
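As an illustration of the causal LM mapping, a quantized checkpoint could be loaded and used for text generation roughly as follows; the repository path below is a placeholder, not a real model.

```python
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

# Placeholder: a local directory or Hub repository containing an INC-quantized causal LM.
model_path = "path/to/quantized-causal-lm"

model = INCModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("The Optimum library makes it easy to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```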
What’s Next?
We encourage you to check out and incorporate Intel's other AI and machine learning framework optimizations and end-to-end portfolio of tools into your AI workflow. Also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, which helps you prepare, build, deploy, and scale your AI solutions.