Knowledge distillation (KD)

What is Knowledge Distillation?

Goal

To Improve student’s accuracy with the help of a teacher model

Methods

We can divide KD methods into two categories by the learning targets

  • Distll logits (learning the output distribution of the teacher model)

    • ex: Deep Mutual learning, Soft-logits

  • Distll features (learning the intermediate value such as feature maps of the teacher model)

    • ex: FitNet, Distilling-Object-Detectors

How to choose student model

Student’s and teacher’s structure should be closely related

  • Reduce block number (Resnet block, transformer block)

  • Reduce hyperparameters (# of filters, size of filters, stride … etc)

  • Use more efficient structure (replace conv with depthwise separable conv)

  • Reduce model layers (後面幾層)

  • Low level vs high level (semantic) features

Pros and cons

Pros:

  • Easy to implement

  • Training time is acceptable

  • Accuracy is stable

Cons:

  • Need to choose an appropriate student

Code Tutorials

Distill logits on Yolov5

Distill features on Yolov5