Hugging Face Optimum* with Intel Tools Amplifies Transformer Performance

Learn How to Use the Optimum Library with Intel Tools and APIs

Hugging Face* Transformers is an open source natural language processing (NLP) library and platform that provides pretrained models and tools for a wide range of NLP tasks. The library is named after the transformer architecture, which has revolutionized NLP through models like BERT, GPT, and RoBERTa. These models have achieved state-of-the-art results on various NLP benchmarks.

It offers a user-friendly interface for working with pretrained transformer models, allowing developers and researchers to easily fine-tune these models on specific NLP tasks or use them for tasks like text classification, text generation, question answering, and more. For optimal performance on Intel® platforms, it is crucial to integrate support for model compression technologies. Intel has made significant contributions, including quantization, pruning, and distillation techniques, to enhance the capabilities of the Optimum library, which can be seamlessly combined with Hugging Face Transformers models.

What Are INCTrainer and INCQuantizer?

The Trainer class offers a comprehensive API for training in Transformers. For techniques that require a training process, such as quantization-aware training (QAT), pruning, and distillation, we extend the functionality of the Transformers Trainer to create our own custom INCTrainer. Users can pass compression configurations, such as quantization_config, pruning_config, and distillation_config, to specify the desired settings and parameters for each technique.

The INCQuantizer class is specifically designed for post-training quantization (PTQ), a method that does not involve a training process. It inherits from the base class OptimumQuantizer, which is part of the Optimum library, and provides the flexibility to customize the from_pretrained and quantize functions. This extensibility enables us to effectively use INCQuantizer for PTQ.

After completing the compression process, you can use the _onnx_export function to convert your PyTorch* models into an Open Neural Network Exchange (ONNX*) format. Additionally, the INCModel class plays a crucial role in loading the quantized model and conducting evaluations. We have various INCModel classes tailored for different tasks, such as INCModelForSequenceClassification, which is designed specifically for sequence classification tasks.
 

The Value of AI Tools and APIs from Intel for the Optimum Library

Intel provides a comprehensive suite of performance optimization tools that facilitate efficient model training and deployment on dedicated hardware platforms for the Optimum library. Users can tap the capabilities of these platforms with the same ease and convenience they expect from Transformers. Following the compression process, users can seamlessly deploy models on Intel runtimes: quantized models with Intel® Extension for PyTorch*, Intel® Extension for Transformers*, and the OpenVINO™ toolkit, and pruned models with Intel Extension for Transformers.
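
As one hedged illustration of the OpenVINO toolkit path mentioned above, the sketch below exports a Transformers model to the OpenVINO format through Optimum Intel and runs it with the familiar pipeline API. The OVModelForSequenceClassification class and the export=True flag are assumptions based on the standard Optimum Intel OpenVINO interface and are not covered elsewhere in this article; the flag name may differ between library versions.

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch checkpoint to the OpenVINO IR format on the fly
ov_model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The OpenVINO model plugs into the regular Transformers pipeline API
classifier = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)
print(classifier("The optimized model runs noticeably faster."))
```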

Examples

The Optimum library offers compression examples for various tasks. Each example folder provides a user-friendly Python* command-line interface, for example: python run_qa_post_training.py --apply_quantization.

APIs

INCQuantizer

Use the function quantize to support Intel® Neural Compressor post-training quantization.

Parameters:

| ID | Type | Description |
| --- | --- | --- |
| model | torch.nn.Module | The model to quantize |
| eval_fn | Callable[[PreTrainedModel], int], defaults to None | The evaluation function to use for the accuracy-driven strategy of the quantization process. The accuracy-driven strategy is enabled only if eval_fn is provided (see the sketch after this table). |
| calibration_fn | Callable[[PreTrainedModel], int], optional, defaults to None | The calibration function to use for the accuracy-driven strategy of the quantization process |
| task | str, defaults to None | The task defining the model topology used for the ONNX export |
| seed | int, defaults to 42 | The random seed to use when shuffling the calibration dataset |
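
To make the eval_fn parameter concrete, here is a minimal accuracy-driven sketch that reuses the SST-2 model featured later in this article. The AccuracyCriterion and TuningCriterion values are illustrative assumptions, not requirements.

```python
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion
from optimum.intel import INCQuantizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A small evaluation set and metric to drive the accuracy-aware tuning loop
eval_dataset = load_dataset("glue", "sst2", split="validation").select(range(64))
metric = evaluate.load("accuracy")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def eval_fn(model):
    # Score each candidate model produced during tuning
    classifier.model = model
    preds = [0 if p["label"] == "NEGATIVE" else 1 for p in classifier(eval_dataset["sentence"])]
    return metric.compute(predictions=preds, references=eval_dataset["label"])["accuracy"]

# Illustrative criteria: tolerate up to a 1% accuracy drop within at most 10 tuning trials
quantization_config = PostTrainingQuantConfig(
    approach="dynamic",
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),
    tuning_criterion=TuningCriterion(max_trials=10),
)
quantizer = INCQuantizer.from_pretrained(model, eval_fn=eval_fn)
quantizer.quantize(quantization_config=quantization_config, save_directory="dynamic_quantization")
```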

 

  • function INCQuantizer.get_calibration_dataset parameters: This API creates the calibration datasets.Dataset (passed to the quantize function) to use for the post-training static quantization calibration step.
     

| ID | Type | Description |
| --- | --- | --- |
| dataset_name | str | The dataset repository name on the Hugging Face Hub, or a path to a local directory containing data files in generic formats and optionally a dataset script, if it requires some code to read the data files |
| num_samples | int, defaults to 100 | The maximum number of samples composing the calibration dataset |
| dataset_config_name | str, optional | The name of the dataset configuration |
| dataset_split | str, defaults to train | Which split of the dataset to use to perform the calibration step |
| preprocess_function | Callable, optional | Processing function to apply to each example after loading the dataset |
| preprocess_batch | bool, defaults to True | Whether the preprocess_function should be batched |
| use_auth_token | bool, defaults to False | Whether to use the token generated when running transformers-cli login |

  • function INCQuantizer.quantize parameters: This API quantizes a model given the optimization specifications defined in quantization_config.
     

| ID | Type | Description |
| --- | --- | --- |
| quantization_config | PostTrainingQuantConfig | The configuration containing the parameters related to quantization |
| save_directory | Union[str, Path] | The directory where the quantized model should be saved |
| calibration_dataset | datasets.Dataset, defaults to None | The dataset to use for the calibration step, needed for post-training static quantization |
| batch_size | int, defaults to 8 | The number of calibration samples to load per batch |
| data_collator | DataCollator, defaults to None | The function to use to form a batch from a list of elements of the calibration dataset |
| remove_unused_columns | bool, defaults to True | Whether or not to remove the columns unused by the model forward method |
| weight_only | bool, defaults to False | Whether to compress weights to integer precision (4-bit by default) while keeping activations in floating point. Best suited for reducing the footprint and accelerating the performance of large language models (LLMs). |

  • quantization_config parameters:

| ID | Type | Description |
| --- | --- | --- |
| approach | str, defaults to auto | The quantization approach to apply (for example, static or dynamic) |
| recipes | dict, defaults to an empty dict | The quantization recipes to use and their settings (see the example below) |

INCQuantizer Code Example

```python
from functools import partial

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "static_quantization"

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="static")
quantizer = INCQuantizer.from_pretrained(model)
# Generate the calibration dataset needed for the calibration step
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory=save_dir,
)
```

Quantization recipes can be specified as follows:

```python
recipes = {"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5, "folding": True}}
quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", recipes=recipes)
```
  • function INCQuantizer._onnx_export parameters: This API exports the int8 model to the ONNX format, using the configuration in onnx_config, so the model can be used across multiple frameworks.
     

| ID | Type | Description |
| --- | --- | --- |
| compressed_model | PyTorchModel | The quantized model |
| onnx_config | OnnxConfig | Base class for an ONNX-exportable model, describing metadata on how to export the model through the ONNX format; constructed by ExportConfigConstructor |
| output_onnx_path | Union[str, Path] | The path to the ONNX model to be saved |

  • onnx_config parameters in the _onnx_export function:

| ID | Type | Description |
| --- | --- | --- |
| self._original_model.config | transformers.PretrainedConfig | The model configuration |


_onnx_export Code Example

from huggingface_hub import HfApi
from optimum.exporters import TasksManager
from optimum.exporters.onnx import OnnxConfig
from ..utils.constant import ONNX_WEIGHTS_NAME

# Build the ONNX export configuration for the original model
# (this snippet runs inside INCQuantizer, where self, compressed_model, and save_directory are defined)
model_type = self._original_model.config.model_type.replace("_", "-")
model_name = getattr(self._original_model, "name", None)
task = HfApi().model_info(self._original_model.config._name_or_path).pipeline_tag
onnx_config_class = TasksManager.get_exporter_config_constructor(
    exporter="onnx",
    model=self._original_model,
    task=task,
    model_type=model_type,
    model_name=model_name,
)
onnx_config = onnx_config_class(self._original_model.config)
compressed_model.eval()
output_onnx_path = save_directory.joinpath(ONNX_WEIGHTS_NAME)
self._onnx_export(compressed_model, onnx_config, output_onnx_path)

INCTrainer

INCTrainer supports Intel Neural Compressor quantization-aware training, pruning, and distillation. Users need to define a configuration for each optimization: for example, quantization_config is the configuration for quantization-aware training and pruning_config is the configuration for pruning.

Parameters:

| ID | Type | Description |
| --- | --- | --- |
| model | Union[PreTrainedModel, torch.nn.Module], defaults to None | The input model |
| args | TrainingArguments, defaults to None | The arguments for training |
| data_collator | Optional[DataCollator], defaults to None | The function to use to form a batch from a list of elements of train_dataset or eval_dataset |
| train_dataset | Optional[Dataset], defaults to None | The dataset for training the model |
| eval_dataset | Optional[Dataset], defaults to None | The dataset for evaluating the model |
| tokenizer | Optional[PreTrainedTokenizerBase], defaults to None | The tokenizer used to preprocess the data |
| model_init | Callable[[], PreTrainedModel], defaults to None | A function that instantiates the model to be used |
| compute_metrics | Optional[Callable[[EvalPrediction], Dict]], defaults to None | A function that takes an EvalPrediction object (a namedtuple with predictions and label_ids fields) and returns a dictionary mapping metric names to float values |
| callbacks | Optional[List[TrainerCallback]], defaults to None | A list of callbacks to customize the training loop |
| optimizers | Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], defaults to (None, None) | The optimizer and learning rate scheduler to use |
| preprocess_logits_for_metrics | Callable[[torch.Tensor, torch.Tensor], torch.Tensor], defaults to None | A function to preprocess the logits before computing the metrics. Depending on the model and configuration, logits may contain extra tensors, like past_key_values, but logits always come first. |
| quantization_config | Optional[QuantizationAwareTrainingConfig], defaults to None | The configuration containing the parameters related to quantization |
| pruning_config | Optional[WeightPruningConfig], defaults to None | Defines a single pruning configuration or a sequence of pruning configurations |
| distillation_config | Optional[DistillationConfig], defaults to None | The configuration of the distillation process (for example, the teacher model) |
| task | Optional[str], defaults to None | The task name |
| save_onnx_model | bool, defaults to False | Whether or not to also save the model in the ONNX format |

 

  • pruning_config parameters of the train function: Set this argument and use the train function of INCTrainer to generate a sparse (structured or unstructured) model given the specifications defined in pruning_config.
     

| ID | Type | Description |
| --- | --- | --- |
| start_step | int, optional, defaults to 0 | The step at which to start pruning |
| end_step | int, optional, defaults to 0 | The step at which to end pruning |
| target_sparsity | float, optional, defaults to 0.90 | The sparsity ratio the model can reach after pruning |
| pruning_type | str, optional, defaults to snip_momentum | A string that defines the criteria for pruning. Supports magnitude, snip, snip_momentum, magnitude_progressive, snip_progressive, snip_momentum_progressive, and pattern_lock. |


     
  • distillation_config parameters of the train function: Set this argument and use the train function of INCTrainer to generate a student model given the specifications defined in distillation_config.
     

| ID | Type | Description |
| --- | --- | --- |
| teacher_model | PreTrainedModel (Transformers) | The teacher model used for distillation, usually a larger model. Required when performing distillation. |

 

INCTrainer Code Example

```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel import INCModelForSequenceClassification, INCTrainer
from neural_compressor import QuantizationAwareTrainingConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda examples: tokenizer(examples["sentence"], padding=True, max_length=128), batched=True)
metric = evaluate.load("glue", "sst2")
compute_metrics = lambda p: metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)

# The directory where the quantized model will be saved
save_dir = "quantized_model"

# The configuration detailing the quantization process
quantization_config = QuantizationAwareTrainingConfig()

trainer = INCTrainer(
    model=model,
    quantization_config=quantization_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = INCModelForSequenceClassification.from_pretrained(save_dir)
```

 

INCTrainer Supports Pruning

In the same manner, pruning can be applied by specifying the pruning configuration detailing the desired pruning process.

```python
from optimum.intel import INCTrainer
from neural_compressor import WeightPruningConfig

# The configuration detailing the pruning process
pruning_config = WeightPruningConfig(
    pruning_type="magnitude",
    start_step=0,
    end_step=15,
    target_sparsity=0.2,
    pruning_scope="local",
)

trainer = INCTrainer(
    model=model,
    pruning_config=pruning_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```

 

INCTrainer Supports Knowledge Distillation

Knowledge distillation can also be applied in the same manner.

```python
from optimum.intel import INCTrainer
from neural_compressor import DistillationConfig

teacher_model_id = "textattack/bert-base-uncased-SST-2"
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_id)
distillation_config = DistillationConfig(teacher_model=teacher_model)

trainer = INCTrainer(
    model=model,
    distillation_config=distillation_config,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)

train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

model = AutoModelForSequenceClassification.from_pretrained(save_dir)
```

Note: The export API for INCTrainer is similar to that of INCQuantizer.
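
As a hedged illustration of this note, the sketch below repeats the quantization-aware training setup in condensed form and simply enables the save_onnx_model flag listed in the INCTrainer parameters above; the exact export behavior may vary across Optimum Intel versions.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from neural_compressor import QuantizationAwareTrainingConfig
from optimum.intel import INCTrainer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda e: tokenizer(e["sentence"], padding=True, max_length=128), batched=True)

save_dir = "quantized_model_with_onnx"
trainer = INCTrainer(
    model=model,
    quantization_config=QuantizationAwareTrainingConfig(),
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"].select(range(300)),
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    save_onnx_model=True,  # assumption: also writes an ONNX copy of the quantized model
)
trainer.train()
trainer.save_model()  # the ONNX file is expected in save_dir alongside the PyTorch model
```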

INCModel

Instantiates a quantized PyTorch model from a given Intel Neural Compressor configuration file. Returns a quantized model.

  • function from_pretrained parameters:

| ID | Type | Description |
| --- | --- | --- |
| model_name_or_path | str | The repository name on the Hugging Face Hub or the path to a local directory hosting the model |
| q_model_name | str, optional | The name of the state dictionary located in model_name_or_path and used to load the quantized model. Not used if a state dictionary is provided directly. |
| cache_dir | str, optional | Path to a directory in which a downloaded configuration should be cached if the standard cache should not be used |
| force_download | bool, optional, defaults to False | Whether or not to force the (re)download of the configuration files, overriding the cached versions if they exist |
| resume_download | bool, optional, defaults to False | Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists. |
| revision | str, optional | The specific model version to use. It can be a branch name, a tag name, or a commit ID. Since models and other artifacts are stored on huggingface.co with a Git-based system, revision can be any identifier allowed by Git. |
| state_dict_path | str, optional | The path to the state dictionary of the quantized model |
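
For illustration, here is a minimal sketch of loading the statically quantized model saved earlier in this article (save_dir = "static_quantization") and running a single prediction. The direct forward call is an assumption that holds when the underlying quantized model returns standard sequence classification outputs.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import INCModelForSequenceClassification

# Directory produced by the INCQuantizer static quantization example above
save_dir = "static_quantization"
model = INCModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Depending on how the quantized model was traced, outputs may be an object or a dict
logits = outputs.logits if hasattr(outputs, "logits") else outputs["logits"]
predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```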

 

Classes inherited from INCModel:

  • INCModelForQuestionAnswering: Model for question answering mapping. Examples include BertForQuestionAnswering and LongformerForQuestionAnswering.
  • INCModelForSequenceClassification: Model for sequence classification mapping. Examples include BertForSequenceClassification and XLMRobertaForSequenceClassification.
  • INCModelForTokenClassification: Model for token classification mapping. Examples include BertForTokenClassification and DistilbertForTokenClassification.
  • INCModelForMultipleChoice: Model for multiple choice mapping. Examples include BertForMultipleChoice and DistilbertForMultipleChoice.
  • INCModelForSeq2SeqLM: Model for Seq2Seq causal LM mapping. Examples include PegasusForConditionalGeneration and T5ForConditionalGeneration.
  • INCModelForCausalLM: Model for causal LM mapping. Examples include CTRLLMHeadModel and GPTNeoForCausalLM. See the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).

 

| ID | Type | Description |
| --- | --- | --- |
| model | PyTorch model | The model used to run inference |
| config | transformers.PretrainedConfig | Model configuration class with all the parameters of the model |
| device | str, defaults to cpu | The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. |

 

  • INCModelForMaskedLM: Model for masked LM mapping. Examples include BertForMaskedLM and XLMWithLMHeadModel.
  • INCModelForXLNetLM: Model for XLNetLMHeadModel mapping.
  • INCModelForVision2Seq: Model for Vision2Seq mapping, for example: VisionEncoderDecoderModel.

Note: For a full model-mapping list, see Transformers.
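
For the causal LM mapping above, a hedged generation sketch follows. The local directory name is hypothetical and stands for any causal language model quantized and saved with the workflow described in this article; it assumes the tokenizer was saved alongside the model and that the loaded model exposes the standard generate() API.

```python
from transformers import AutoTokenizer
from optimum.intel import INCModelForCausalLM

# Hypothetical local directory containing an INC-quantized causal language model
quantized_model_dir = "quantized_causal_lm"
model = INCModelForCausalLM.from_pretrained(quantized_model_dir)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

inputs = tokenizer("The Optimum library makes it easy to", return_tensors="pt")
# Assumption: the loaded model exposes the standard generate() API
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```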

What’s Next?

We encourage you to check out Intel's other AI and machine learning framework optimizations and end-to-end portfolio of tools and to incorporate them into your AI workflow. Also learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel's AI Software Portfolio, helping you prepare, build, deploy, and scale your AI solutions.