Speaker: Prof. Yves Robert, Ecole Normale Supérieure de Lyon, France
Title: Resilience is a Critical Issue for Large-scale Platforms
Date: Monday, May 18, 2015
Time: 11:00 a.m. – 12:00 p.m.
Location: Harut Barsamian Colloquia Room (Engineering Hall 2430)
Host: Prof. Jean-Luc Gaudiot
Abstract: Resilience is a critical issue for large-scale platforms. This talk will survey fault-tolerant techniques for high-performance computing:
– Overview of failure types and typical probability distributions
– Brief discussion of application-specific techniques, such as ABFT
– The standard general-purpose technique, checkpoint and rollback recovery
– Recent extensions with replication, prediction and silent error detection
– Relevant execution scenarios, evaluated and compared through quantitative models.
The talk includes several illustrative examples and targets a general audience.
Bio: Yves Robert received his PhD degree from Institut National Polytechnique de Grenoble. He is currently a full professor in the Computer Science Laboratory LIP at ENS Lyon. He is the author of 7 books, 130+ papers published in international journals, and 200+ papers published in international conferences. He is the editor of 11 book proceedings and 13 journal special issues. He has advised 28 PhD theses. His main research interests are scheduling techniques and resilient algorithms for large-scale platforms. Yves Robert has served on many editorial boards, including IEEE TPDS. He was the program chair of HiPC’2006 in Bangalore, IPDPS’2008 in Miami, ISPDC’2009 in Lisbon, ICPP’2013 in Lyon and HiPC’2013 in Bangalore. He is a Fellow of the IEEE. He was elected a Senior Member of Institut Universitaire de France in 2007, and his membership was renewed in 2012. He was awarded the 2014 IEEE TCSC Award for Excellence in Scalable Computing. He has held a Visiting Scientist position at the University of Tennessee, Knoxville since 2011.