NSDI ’19 - Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Juncheng Gu, Mosharaf Chowdhury, and Kang G. Shin, University of Michigan, Ann Arbor; Yibo Zhu, Microsoft and Bytedance; Myeongjae Jeon, Microsoft and UNIST; Junjie Qian, Microsoft; Hongqiang Liu, Alibaba; Chuanxiong Guo, Bytedance
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large GPU cluster in production shows that existing big data s