Maximizing Performance: A Series on Optimizing Transformer-Based Models — Chapter 1

Vinish M
3 min read · May 10, 2023


Transformer-based models are widely used for natural language processing (NLP) tasks such as text classification, question answering, text generation, and machine translation. However, these models are computationally expensive and memory-intensive, which makes them difficult to deploy on resource-constrained devices. To address this challenge, several optimization techniques have been proposed, including quantization, pruning, and knowledge distillation. In this series, we will explore these techniques in detail and see how they can be applied to transformer-based models using a set of libraries and tools.

Introduction

Neural networks have proven to be powerful models for solving complex problems in fields such as computer vision and natural language processing. However, this success comes at the cost of high computational and memory requirements, which can make the models difficult to deploy on resource-constrained devices. Optimizing neural networks has therefore become an active area of research, with the goals of improving efficiency, reducing memory footprint, and speeding up inference. This chapter gives a brief introduction to the main techniques.

Optimization techniques can be broadly classified into three categories:

  • Quantization
  • Pruning
  • Knowledge Distillation

Quantization:

Quantization reduces the precision of the weights and activations of a neural network, which lowers both the memory footprint and the computational cost of the model. The basic idea is to represent the weights and activations with fewer bits than the original representation. For example, moving from 32-bit floating-point numbers to 16-bit floating-point numbers or 8-bit integers shrinks the memory footprint by a factor of two to four.

There are two main approaches: post-training quantization and quantization-aware training. Post-training quantization quantizes the weights and activations of an already-trained model; it is simple to apply but can cost some accuracy. Quantization-aware training simulates the effects of quantization during training, so the model learns to compensate for the reduced precision and the quantized model is typically more accurate.
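
To make this concrete, here is a minimal sketch of post-training dynamic quantization using PyTorch's built-in `quantize_dynamic` API. The `distilbert-base-uncased` checkpoint is only a placeholder; any Transformers model can be substituted, and this is just one of several ways to quantize a model.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; swap in your own fine-tuned model.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.eval()

# Post-training dynamic quantization: weights of all nn.Linear layers are
# stored as int8, and activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(quantized_model)  # Linear layers appear as DynamicQuantizedLinear
```

Because no calibration data or retraining is involved, this is usually the easiest technique to try first; quantization-aware training is the option to reach for when the accuracy drop is too large.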

The Hugging Face Transformers library provides support for both post-training quantization and quantization-aware training. Check out these links (Overview, Resources) to learn more about quantization.

Pruning:

Pruning is a technique that removes the connections in a neural network that have little or no impact on the output of the model. The basic idea is to remove connections with small weights or activations. This reduces the number of parameters in the model, which in turn reduces its memory footprint and computational requirements. Pruning can also improve the generalization of the model by reducing overfitting.

There are two types of pruning techniques: weight (unstructured) pruning and structured pruning. Weight pruning removes individual weights with small magnitudes. Structured pruning removes entire rows, columns, or filters of a weight tensor. Structured pruning can be more effective than weight pruning because removing whole structures leaves smaller dense tensors, which translates directly into more efficient computation.
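
As an illustration, the sketch below applies magnitude-based (unstructured) weight pruning to every linear layer of a model using PyTorch's `torch.nn.utils.prune` utilities. The checkpoint name and the 30% sparsity level are arbitrary choices for the example.

```python
import torch
from torch.nn.utils import prune
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; swap in your own fine-tuned model.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Unstructured weight pruning: zero out the 30% of weights in this
        # layer with the smallest L1 magnitude.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Bake the mask into the weight tensor and drop the re-parametrization.
        prune.remove(module, "weight")
```

Note that unstructured pruning only introduces zeros; the tensors keep their original shape, so actual speedups require sparse-aware kernels or structured pruning (for example `prune.ln_structured`, which removes whole rows or channels).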

The Hugging Face Transformers library provides support for both weight pruning and structured pruning. Check out these links (Overview, Resources) to learn more about pruning.

Knowledge Distillation:

Distillation is a technique that trains a smaller model (the student) to mimic the behavior of a larger, more complex model (the teacher). The basic idea is to transfer the knowledge learned by the teacher to the student, which reduces the memory footprint and computational requirements with only a small loss in accuracy. Distillation can also improve the generalization of the student model by reducing overfitting.

(Image source: https://edy-barraza.github.io/week12/)

Models such as DistilBERT, DistilRoBERTa, and DistilGPT2 are a few examples of distilled models. Check out this link to learn more about knowledge distillation.
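
The training objective typically combines a "soft" loss, which pushes the student's output distribution toward the teacher's temperature-softened distribution, with the usual "hard" cross-entropy loss on the ground-truth labels; DistilBERT uses a variant of this recipe. Below is a minimal sketch of such a loss in PyTorch, with illustrative values for the temperature and the mixing weight.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft loss: KL divergence between the temperature-softened teacher and
    # student distributions, scaled by T^2 to keep gradient magnitudes stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # alpha balances imitating the teacher vs. fitting the labels.
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher runs in inference mode to produce `teacher_logits` for each batch, and only the student's parameters are updated.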

Tools and Libraries Used for Optimization

  • ONNX
  • ONNX Runtime (Microsoft)
  • TensorRT (NVIDIA)
  • Optimum (Hugging Face)
  • Speedster (Nebuly)
  • transformer-deploy
  • kernl
  • FasterTransformer (NVIDIA)
  • DeepSpeed (Microsoft)
  • and more

From the next chapter onwards, we will try out each of these tools to optimize transformer-based models and benchmark their performance. Stay tuned!

References:

  1. https://github.com/onnx/onnx
  2. https://github.com/microsoft/onnxruntime
  3. https://github.com/NVIDIA/TensorRT
  4. https://github.com/huggingface/optimum
  5. https://github.com/nebuly-ai/nebuly
  6. https://github.com/ELS-RD/transformer-deploy
  7. https://github.com/ELS-RD/kernl
  8. https://github.com/NVIDIA/FasterTransformer
  9. https://github.com/microsoft/DeepSpeed
  10. https://github.com/nebuly-ai/exploring-AI-optimization

Written by Vinish M

Machine Learning Engineer | Open Source Contributor | Creator. https://www.linkedin.com/in/vinish-m-4ab33a18b/
