
BoF 15: Monitoring Large-Scale HPC Systems: Data Analytics & Insights

Mark Parsons presented the project at this BoF session at ISC-HPC 2016. The abstract of the session is below:

"A clear trend in high-end HPC is the ever-increasing size and complexity of supercomputers, with O(100,000) nodes, integration of diverse architectures into heterogeneous systems, and complex communication and I/O subsystems, not to mention the massive power supply and cooling infrastructures. Plus, systems are behaving in a dynamic and adaptive way (e.g. by dynamic frequency and voltage scaling), making it hard to assess system state or predict its evolution over time. Running these infrastructures requires fine-grained monitoring and control policies to guarantee safe and efficient operation, and to optimize application throughput. This data can be analyzed online (in real or near real time) or offline to gain insight into how to further improve operations, optimize scheduling and resource use, and optimize specific workloads. It is also of paramount importance for analyzing and resolving excursions or faults, and for predicting failure situations and enabling preventative maintenance. Finally, real-time monitoring data enables co-scheduling of malleable HPC applications, increasing the system throughput. Emerging machine learning and data analytics techniques can play an important part here. This session addresses issues and approaches in large-scale HPC monitoring from the perspectives of system administrators, end users, and vendors. It gathers experts from leading HPC centers and vendors to present their current practices, including approaches for near-real-time predictive analysis, and to discuss gaps and future requirements. Half of the time will be spent in a panel-style discussion with the audience. Besides the confirmed speakers, participation of representatives from Atos/Bull and Cray is planned."

Slides from Mark Parsons' talk are available below.

Supporting Documents: