In the field of artificial intelligence, knowledge distillation (KD) has become a key technique for optimizing models without significantly sacrificing their performance.
This process transfers knowledge from complex models to smaller ones, enabling the smaller models to learn richer patterns and data structures than they would reach with conventional training alone.
To explore this technique in depth, in today’s article we take a closer look at the history behind the concept, its underlying mechanism, the different types of distillation, and how they are implemented.
History and evolution of knowledge distillation
The concept of knowledge distillation has its roots in the work of Caruana et al. in 2006, when they demonstrated that a massive classification model could be used to label a dataset and subsequently train a more compact neural network with comparable performance.
Later, Hinton et al. (2015) extended this idea by introducing a formal distillation scheme with a two-stage approach: first train a large model to extract structure from the data, then transfer that knowledge to a smaller model better suited to real-time deployment.
This is how knowledge distillation was born. Today it is regarded as an advanced machine learning (ML) technique designed to transfer the generalization and learning capacity of a large model, known as the “teacher model”, to a more compact model, called the “student model”, and it has become a key element in the optimization of deep neural networks, particularly in the context of generative artificial intelligence (GenAI) and large language models (LLMs).
It is worth emphasizing that the goal of knowledge distillation is not only to replicate the results of the teacher model, but also to capture and emulate its reasoning patterns, improving computational efficiency without significantly sacrificing performance.
Mechanism of knowledge distillation
As we have just seen, knowledge distillation is based on the idea that a large neural model learns complex patterns and data structures that can be transferred to a smaller model through a supervised training process. To achieve this, the mechanism relies on a few key components, which we discuss below:
Hard and soft targets
Deep learning models apply a softmax function to their raw outputs (logits) and report the class with the highest probability. However, the full probability distribution derived from those logits contains useful information about the generalization tendencies of the model. In distillation, these full probability distributions, known as “soft targets”, are used as a guide to train the student model, allowing better knowledge transfer than if only the final label of the correct class, known as the “hard target”, were used.
In a classification task, such as animal image identification, the model generates a probability distribution over the possible classes and selects the category with the highest probability as its prediction.
- Hard targets: These are the standard classification labels used in supervised learning. They represent a one-hot assignment of probability, where the correct category receives 100% and all others 0%.
- Soft targets: These are the probability distributions generated by the model before a final decision is made. Instead of assigning absolute certainty, they reflect uncertainty and similarities between classes. For example, an image of a Golden Retriever might receive 75% probability for “Golden Retriever”, 20% for “Labrador Retriever” and 5% for “German Shepherd”, providing additional information on how the model perceives the relationships between categories.
So, the student model not only learns to predict the correct final answer, but also grasps the relationships between classes and how the teacher model generalizes about the data. This allows it to acquire a more nuanced and flexible understanding, rather than a rigid classification based only on hard targets. Thus, as in the example above, the student can learn that Golden Retrievers and Labrador Retrievers are more similar to each other than to a German Shepherd, which improves its ability to classify new images.
In summary, soft targets encode information about how the teacher model “thinks”, which allows the student to learn more subtle patterns from the data and improves its generalization ability with fewer training examples.
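To make this concrete, here is a minimal sketch in PyTorch (assumed here as the framework; the class list, logit values, and temperature are purely illustrative) showing how a hard target and soft targets are obtained from the same teacher logits:

```python
import torch
import torch.nn.functional as F

# Illustrative teacher logits for three classes:
# ["Golden Retriever", "Labrador Retriever", "German Shepherd"]
teacher_logits = torch.tensor([4.0, 2.7, 1.3])

# Hard target: a one-hot label that keeps only the winning class.
hard_target = F.one_hot(teacher_logits.argmax(), num_classes=3).float()

# Soft targets: the full softmax distribution. A temperature T > 1
# flattens the distribution and exposes how the teacher relates the classes.
T = 3.0
soft_targets = F.softmax(teacher_logits / T, dim=-1)

print(hard_target)   # tensor([1., 0., 0.])
print(soft_targets)  # roughly tensor([0.49, 0.32, 0.20])
```

Note how the soft targets still rank “Labrador Retriever” as far closer to the prediction than “German Shepherd”; that relational information is precisely what the hard target discards and what the student can learn from.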
Distillation loss function
The training of a deep learning model is based on minimizing a loss function, which measures the difference between the model’s predictions and the correct answers. In knowledge distillation, two main loss functions are used to ensure that the student model learns correctly from the teacher model:
- Hard loss: Hard loss is based on the difference between the final prediction of the student model and the actual sample label. Usually, cross-entropy is used, which penalizes incorrect predictions and pushes the model to improve its accuracy.
For example, if the image shows a cat and the student model predicts “dog”, the hard loss corrects this mistake by adjusting the model parameters.
- Distillation loss: The distillation loss measures the difference between the probability distributions of the teacher model and the student model. Instead of only correcting final errors, this loss pushes the student model to mimic the prediction pattern of the teacher model.
The most commonly used metric for this comparison is the Kullback-Leibler divergence, which measures how much the student model’s probability distribution differs from the teacher’s. In addition, a temperature parameter is often applied to both distributions; it smooths the predictions and makes them more informative, allowing the student model to learn more accurately, as shown in the sketch below.
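Below is a minimal sketch of this combined objective in PyTorch (the temperature T, the weighting alpha, and the tensor shapes are illustrative assumptions rather than values prescribed by the original papers):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of the hard cross-entropy loss and the soft KL-divergence loss."""
    # Hard loss: cross-entropy between student predictions and the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Distillation (soft) loss: KL divergence between temperature-softened
    # distributions. The T**2 factor keeps gradient magnitudes comparable
    # across temperatures, as suggested by Hinton et al. (2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Example with a batch of 4 samples and 10 classes (random tensors for illustration).
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

The weight alpha balances how much the student listens to the ground-truth labels versus the teacher’s soft targets; in practice it is tuned per task.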
Other types of knowledge distillation
Although in today’s article we have focused on the classical form of distillation, based on the teacher’s output probabilities, there are other approaches to knowledge distillation. We briefly discuss them below:
- Feature-based distillation: Focuses on transferring internal representations from the teacher model to the student, which involves comparing feature maps extracted at different layers of the neural network (a minimal example is sketched below).
- Attention distillation: Exploits the attention mechanisms of advanced neural networks, such as transformers, to guide the learning of the student model more effectively.
- Relation-based distillation: Instead of transferring only outputs or features, this method attempts to capture relationships between instances within the representation space of the teacher model.
Although each method has specific advantages and applications, in practice, multiple approaches can be combined to achieve a more efficient and accurate student model.
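As an illustration of the feature-based variant, here is a minimal sketch in PyTorch (the tensor shapes, channel counts, and the 1x1 projection layer are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Feature-based distillation: match an intermediate feature map of the student
# to the corresponding teacher feature map. Shapes are illustrative.
teacher_features = torch.randn(8, 256, 14, 14)  # (batch, channels, height, width)
student_features = torch.randn(8, 128, 14, 14)  # the student has fewer channels

# A 1x1 convolution projects the student features to the teacher's channel
# dimension before comparison (a common choice in FitNets-style methods).
projector = nn.Conv2d(128, 256, kernel_size=1)

feature_loss = F.mse_loss(projector(student_features), teacher_features)
print(feature_loss)
```

This feature loss is typically combined with the output-level losses described earlier, each with its own weighting term.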
Implementation of knowledge distillation
Knowledge distillation can be implemented in different ways, depending on whether the teacher model is pre-trained or whether it is trained simultaneously with the student model. The main approaches are:
- Off-line distillation: This is the most traditional and widely used approach, in which the teacher model is pre-trained independently and then frozen so its parameters are not updated. The student model is then trained using the teacher’s outputs as a guide (i.e., the predictions generated by the teacher model for a given input), optimizing the loss function to minimize the difference between the predictions of the two models (a minimal training loop is sketched after this list).
- On-line distillation: Unlike off-line distillation, in this scheme the teacher and student models are trained simultaneously, allowing a dynamic and continuous knowledge transfer. The teacher generates predictions at each step of the process, and the student adjusts its internal parameters based not only on the original labels but also on the information provided by the teacher. In this way, both models evolve together, adapting to each other.
- Self-distillation: This approach removes the need for a separate teacher model; instead, the model acts as its own teacher and student, refining its internal representations throughout training. This continuous refinement process can improve the model’s generalization ability without relying on an external teacher.
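To tie these pieces together, the following is a minimal off-line distillation loop in PyTorch. The architectures, the synthetic data, and the hyperparameters (temperature, alpha, learning rate, number of epochs) are all illustrative assumptions; in practice the teacher would be a genuinely pre-trained, much larger model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Illustrative teacher and student architectures.
teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 5))
student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))

# The teacher is assumed to be pre-trained; freeze it so its parameters
# are never updated during distillation.
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Synthetic data standing in for a real labelled dataset.
X, y = torch.randn(256, 20), torch.randint(0, 5, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T, alpha = 2.0, 0.5  # illustrative hyperparameters

for epoch in range(3):
    for inputs, labels in loader:
        with torch.no_grad():  # the frozen teacher only provides guidance
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)

        # Hard loss against the true labels plus soft loss against the teacher.
        hard_loss = F.cross_entropy(student_logits, labels)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)
        loss = alpha * hard_loss + (1 - alpha) * soft_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

An on-line variant would instead update the teacher’s parameters as well at every step, with both models training together.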
Conclusion
Knowledge distillation has emerged as a key technique for improving the efficiency and accessibility of modern AI models. By enabling the transfer of knowledge from large-scale models to more compact and efficient versions, knowledge distillation has become a fundamental tool in democratizing generative AI and expanding its applicability in computationally constrained environments.
As models continue to grow in complexity, distillation techniques will continue to evolve to bridge the gap between computational power and implementation efficiency.
Resources:
[1] R. Caruana et al. (2006) – Model Compression
[2] G. Hinton et al. (2015) – Distilling the Knowledge in a Neural Network
[3] IBM – What is knowledge distillation?
At Block&Capital, specialists in tech recruitment, we strive to create an environment where growth and success are within everyone’s reach. If you’re ready to take your career to the next level, we encourage you to join us.