
NEXTGenIO at ISC High Performance 2019

NEXTGenIO was well represented again at ISC19.

Several of the partner organisations presented the project's successful results via talks, demos and poster presentations, at booths and in scheduled events, as detailed below.

Birds-of-a-Feather Session: Multi-Level Memory and Storage for HPC and Data Analytics & AI

NEXTGenIO members Hans-Christian Hoppe (Intel Datacenter Group) and Michèle Weiland (EPCC, University of Edinburgh), along with Kathryn Mohror of Lawrence Livermore National Laboratory, organised a BoF which continued the series of successful similar sessions at ISC. It brought together technology providers, application and system software developers, and system operators to discuss use cases and requirements for next-generation multi-level storage/memory systems, present proof-of-concept prototype results, and examine system software and tools development.

Presentation at HPC I/O in the Data Center workshop

Adrian Jackson (EPCC, University of Edinburgh) presented results from NEXTGenIO in his talk, “An Architecture for High Performance Computing and Data Systems using Byte-Addressable Persistent Memory”, based on a research paper co-authored with Michèle Weiland and Mark Parsons (both also of EPCC) and Bernhard Homölle (Fujitsu).

Research Poster: Kronos Development and Results - HPC Benchmarking with Realistic Workloads

Antonino Bonanni, Simon D. Smart and Tiago Quintino (ECMWF)

HPC benchmarking is traditionally performed by testing HPC subsystems in isolation and then combining the results to estimate full-system performance. Although widely adopted, this approach does not reveal the bottlenecks that arise under real-life workloads, when many jobs contend for the system's shared resources (e.g. network and storage).

To address this issue, the European Centre for Medium-Range Weather Forecasts (ECMWF) is developing Kronos, software that benchmarks HPC systems through realistic workloads, treating the HPC system as a whole. The approach comprises a modelling phase, in which a workload model is generated from real-life job profiles, and an execution phase, in which the workload model is executed on a target machine through a set of lightweight and portable “synthetic” applications.
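The two-phase approach might be sketched as follows. This is a minimal illustration only: the function names, profile fields and model format are assumptions for this sketch, not the actual Kronos implementation.

```python
import tempfile
import time
from pathlib import Path

# Modelling phase: reduce real-life job profiles to the resource totals
# that a lightweight synthetic application can replay. (Field and function
# names are hypothetical, for illustration only.)
def build_workload_model(job_profiles):
    return [
        {"name": p["name"],
         "cpu_seconds": p["cpu_seconds"],
         "bytes_written": p["bytes_written"]}
        for p in job_profiles
    ]

# Execution phase: a "synthetic" application burns roughly the modelled
# CPU time and writes the modelled I/O volume on the target machine.
def run_synthetic_app(job, out_dir):
    deadline = time.process_time() + job["cpu_seconds"]
    while time.process_time() < deadline:
        pass  # spin to emulate compute
    path = out_dir / f"{job['name']}.dat"
    path.write_bytes(b"\0" * job["bytes_written"])
    return path.stat().st_size

profiles = [{"name": "fc_job", "cpu_seconds": 0.01, "bytes_written": 4096}]
model = build_workload_model(profiles)
with tempfile.TemporaryDirectory() as d:
    sizes = [run_synthetic_app(job, Path(d)) for job in model]
print(sizes)  # [4096]
```

Because each synthetic application only replays resource totals, a whole workload of them can be launched together to reproduce contention for shared resources without shipping the original codes or data.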

The most recent Kronos development is the generation of workload models that combine “synthetic” and real applications. This retains the most relevant applications in the benchmarking workload, while only ancillary applications are substituted by their synthetic representations.

Kronos is being developed as part of the NEXTGenIO project, a four-year EU-funded Horizon 2020 project that started in October 2015. NEXTGenIO is coordinated by EPCC at the University of Edinburgh and involves partners from several European countries. It aims to develop innovative solutions to I/O bottlenecks as high-performance computing approaches the Exascale.

ECMWF is currently using Kronos in the procurement of its next HPC system, which is due to become operational by 2020.

Booth Presentation: Intel Datacenter Persistent Memory Modules for Efficient HPC Workflows

Adrian Jackson (EPCC) gave this booth presentation at the Intel booth:

The NEXTGenIO project, which started in 2015 and is co-funded under the European Horizon 2020 R&D funding scheme, was one of the very first projects to investigate the use of DC PMM for the HPC segment in detail. Fujitsu have built a 32-node prototype cluster at EPCC using Intel Xeon Scalable CPUs (Cascade Lake generation), DC PMM (3 TBytes per dual-socket node), and Intel Omni-Path Architecture (a dual-rail fabric across the 32 nodes). A selection of eight pilot applications, ranging from an industrial OpenFOAM use case to the Halvade genomic processing workflow, was studied in detail, and suitable middleware components for the effective use of DC PMM by these applications were created. Actual benchmarking with DC PMM is now possible, and this talk will discuss the architecture, the use of memory and app-direct DC PMM modes, and give first results on achieved performance.

As poster children for the use of DC PMM as an extremely fast local storage target, the OpenFOAM and Halvade workflows show a very significant reduction in the I/O time required to pass data between workflow steps, and consequently significantly reduced runtimes and improved strong scaling. Taking this further, a prototype setup of ECMWF's IFS forecasting system, which combines the actual weather forecast with several dozen post-processing steps, demonstrates the vast potential of DC PMM: forecast data is stored in DC PMM on the nodes running the forecast, post-processing steps can quickly access this data via the OPA network fabric, and a meteorological archive pulls the data into long-term storage. Compared to traditional system configurations, this scheme brings significant savings in time to completion for the full workflow.

Both of the above use app-direct mode; the impact and value of memory mode is shown by a key materials science application (CASTEP), whose memory requirements far exceed the usual HPC system configuration of approx. 4 GByte per core. In current EPCC practice, CASTEP uses only a fraction of the cores on each cluster node. DC PMM in memory mode, with its up to 3 TBytes of capacity on the NEXTGenIO prototype, enables use of all cores; even with the unavoidable slowdown of execution compared to a DRAM-only configuration, the cost of running a CASTEP simulation is reduced and the scientific throughput of a given number of nodes increases commensurately.
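In app-direct mode, persistent memory is typically exposed through a DAX-mounted filesystem whose files applications can memory-map and update with load/store instructions. A minimal sketch of that access pattern, using an ordinary temporary file as a stand-in for a file on a DC PMM mount (the real path is an assumption; production codes would more likely use a library such as PMDK for fine-grained persistence control):

```python
import mmap
import os
import tempfile

# Sketch of app-direct-style access: memory-map a file, update it through
# the mapping, and make the update durable with msync. On the NEXTGenIO
# prototype the file would sit on a DAX-mounted DC PMM filesystem; an
# ordinary temp file stands in here so the sketch runs anywhere.

SIZE = 4096  # one page of "forecast" data

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, SIZE)
    with mmap.mmap(fd, SIZE) as buf:
        buf[0:13] = b"forecast-data"  # store-like update through the map
        buf.flush()                   # msync: flush the update to the medium
    # A post-processing step (possibly on another node, reached over the
    # fabric) can now read the persisted bytes directly.
    with open(path, "rb") as f:
        header = f.read(13)
finally:
    os.close(fd)
    os.unlink(path)

print(header)  # b'forecast-data'
```

The appeal of this pattern for workflows is that a producer step leaves its output resident in persistent memory, so consumer steps read it at memory-like speed instead of round-tripping through a parallel filesystem.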

Booth Demo: PyCOMPSs 2.5

PyCOMPSs is a task-based programming model that enables the execution of Python applications and workflows in distributed computing environments. The demo presented features new in release 2.5, such as extended support for collections and a task-failure management mechanism.
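As a conceptual illustration only — PyCOMPSs expresses tasks with Python decorators such as @task, and none of the names below are its actual API — the retry flavour of task-failure management, applied across a collection of inputs, can be sketched in plain Python:

```python
# Conceptual sketch of retry-style task-failure management in a task-based
# model. This is NOT the PyCOMPSs API; all names here are illustrative.

def run_task(fn, *args, retries=3):
    """Run a task, retrying it on failure for up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries:
                raise  # give up, so successor tasks can be cancelled

class FlakyWorker:
    """Deterministically fails on the first attempt for each input."""
    def __init__(self):
        self.calls = 0

    def square(self, x):
        self.calls += 1
        if self.calls % 2 == 1:  # every first attempt fails
            raise RuntimeError("transient task failure")
        return x * x

# Applying the task over a collection of inputs, echoing the extended
# collection support in release 2.5.
worker = FlakyWorker()
results = [run_task(worker.square, x) for x in range(4)]
print(results)  # [0, 1, 4, 9]
```

In the real runtime, the scheduler (not the caller) makes this decision per task, so a workflow can survive transient failures without hand-written retry loops in application code.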