Well-Read Students Learn Better

Should you pre-train your compressed transformer model before knowledge distillation from an off-the-shelf teacher? This paper says yes and explores...