The teacher is a large model: smart, slow, high-capacity, expensive to run. The student is smaller: faster, cheaper, easier to deploy, but usually less capable if trained in the standard way. Distillation asks a very practical question:
learn more