# Knowledge distillation (KD)

## What is Knowledge Distillation?

### Goal

Improve the student model's accuracy with the help of a teacher model.
![](./KD.png)

### Methods

KD methods fall into two categories, depending on the learning target:

- Distill logits (the student learns the output distribution of the teacher model)
  - e.g. Deep Mutual Learning, Soft-logits
- Distill features (the student learns intermediate values of the teacher model, such as feature maps)
  - e.g. FitNet, Distilling-Object-Detectors

## How to choose student model

The student's structure should be closely related to the teacher's:

- Reduce the number of blocks (ResNet blocks, transformer blocks)
- Reduce hyperparameters (number of filters, filter size, stride, etc.)
- Use a more efficient structure (e.g. replace conv with depthwise separable conv)
- Reduce model layers (the last few layers)
- Low-level vs. high-level (semantic) features

## Pros and cons

Pros:

- Easy to implement
- Training time is acceptable
- Accuracy is stable

Cons:

- Need to choose an appropriate student

## Code Tutorials

[Distill logits on Yolov5](./logits.md)

[Distill features on Yolov5](./feature.md)
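
As a quick reference before the tutorials, here is a minimal sketch of the two loss types described under Methods. It uses only the Python standard library and is not the code from the Yolov5 tutorials. The function names, the temperature value, and the choice of KL divergence for logits and mean squared error for features are assumptions made for illustration. Methods such as FitNet also add a regressor so that student and teacher features have matching shapes; this sketch assumes the shapes already match.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = math.log(sum(math.exp(z - m) for z in scaled))
    return [z - m - log_total for z in scaled]


def logit_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; treat it as an assumption here.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2


def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between two flattened feature maps.

    Assumes both features are already the same length (hypothetical
    simplification; real models usually need an adapter layer).
    """
    if len(student_feat) != len(teacher_feat):
        raise ValueError("student and teacher features must have equal length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print("logit loss:", logit_distillation_loss(student, teacher))
    print("feature loss:", feature_distillation_loss([1.0, 2.0], [1.0, 4.0]))
```

In training, either term would typically be added to the student's ordinary task loss with a weighting factor, which is another hyperparameter to tune.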