Project Description

High-performance computing (HPC) has become a standard research tool in many scientific disciplines. Research without at least supporting HPC calculations is becoming increasingly rare in the natural and engineering sciences. On top of that, new disciplines are discovering HPC as an asset to their research, for example in bioinformatics and the social sciences. As a result, more and more scientists without a deep understanding of the architecture and functioning of such systems are starting to use HPC resources. This knowledge gap widens as HPC resources grow more complex, and it becomes particularly significant in the field of performance engineering.

Most scientists who are new to HPC run their applications on local Tier-3 systems and are content once their research problem can be solved on an available system in an acceptable time frame. Their missing knowledge of performance measurement often leads to a lock-in, because they are unable to scale their calculations to a Tier-2 or Tier-1 compute resource. At the same time, Tier-3 compute centers typically lack sufficient human resources to work with each user individually on application performance. In order to increase awareness of performance issues and enable users to assess possible gains from performance-improving measures, systematic, unified, and easily understandable information on performance parameters should be provided across all scientific communities. This especially pertains to the performance parameters of HPC jobs and the importance of performance engineering.

Although most performance measurement tools are straightforward to use, their serious disadvantage is that they do not explain their results in a clear and simple way. The generated reports often require expert knowledge to understand, which makes profiling nearly useless for ordinary cluster users, who usually lack the background to interpret the results. Automatically assembling the data provided by the available tools into a single, centrally organized framework will make it much easier for both users and administrators to identify potential performance bottlenecks. In addition, an all-in-one profiling tool might drastically increase user acceptance of code optimizations, especially when code tuning shows a considerable performance boost.

As a consequence, a readily comprehensible user interface for application profiling across Tier-2/Tier-3 compute centers may lower the barrier for end users to improve and adjust their applications according to the conditions given by the compute centers. Additionally, the interface may help to bridge the gap in understanding and communication between experts and end users by incorporating all data into a shared documentation system.

With the deployment of heterogeneous and more complex systems at Tier-3 centers, the need for awareness of performance aspects is seen as a challenge for the optimal use of compute and storage resources at Tier-3 and Tier-2 centers. We aim to address this challenge by reaching the following objectives:

Raising awareness of performance parameters and issues across all HPC user communities.
Enabling HPC users at all levels of experience to obtain and understand information on the performance of their workloads.
Enabling users to act on this information with suitable further investigation and performance engineering measures, thereby also lowering the barriers to Tier-2 and Tier-1 resources that arise from insufficient scaling.
Simplifying the task of application performance evaluation and consulting for Tier-3 end users.

In order to achieve these goals, a profiling tool set based on existing profiling solutions will be implemented within the scope of this project. This tool set will automatically collect per-job performance metrics and present them to researchers in an easily understandable summary. The tool set will be complemented by extensive documentation and best-practice information, detailing, as applicable, measures for further investigation of the problem, recommended changes to the job submission, and promising performance engineering targets.
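To illustrate what an "easily understandable summary" of per-job metrics could look like, the following is a minimal sketch in Python. It does not describe the tool set's actual design: the `JobMetrics` fields, the `summarize` function, and the 0.5 thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobMetrics:
    """Hypothetical per-job metrics; the field names are illustrative only."""
    job_id: str
    cpu_utilization: float       # mean fraction of allocated cores kept busy, 0..1
    memory_used_gib: float       # peak memory used by the job
    memory_requested_gib: float  # memory requested at job submission


def summarize(m: JobMetrics) -> List[str]:
    """Turn raw numbers into short, plain-language hints.

    The 0.5 thresholds are arbitrary placeholders, not recommendations.
    """
    hints = []
    if m.cpu_utilization < 0.5:
        hints.append(
            f"Job {m.job_id} kept its cores busy only "
            f"{m.cpu_utilization:.0%} of the time; consider requesting "
            "fewer cores or checking for I/O waits and load imbalance."
        )
    # Guard against a missing or zero memory request before dividing.
    if m.memory_requested_gib > 0:
        used_fraction = m.memory_used_gib / m.memory_requested_gib
        if used_fraction < 0.5:
            hints.append(
                f"Job {m.job_id} used {m.memory_used_gib:.1f} GiB of the "
                f"{m.memory_requested_gib:.1f} GiB requested; a smaller "
                "memory request may shorten queue waiting times."
            )
    if not hints:
        hints.append(
            f"Job {m.job_id} shows no obvious issues in the collected metrics."
        )
    return hints
```

Calling `summarize(JobMetrics("12345", 0.3, 4.0, 32.0))` would return two hints, one about core usage and one about the memory request. In practice, each hint could link to the matching documentation or best-practice entry.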

It will be possible for administrators to collect the information gathered by the tool in a database, allowing them to investigate patterns of performance issues in user jobs. This opens up new support channels, in which support measures can be initiated wherever they are needed and not only when they are requested.
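As a rough sketch of how such a database could expose recurring issues, the example below uses Python's built-in `sqlite3` module. The table layout, the function names, and the thresholds are assumptions made for illustration. They are not part of the planned tool set.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical job-metrics store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_metrics ("
        " job_id TEXT PRIMARY KEY,"
        " username TEXT NOT NULL,"
        " cpu_utilization REAL NOT NULL)"
    )
    return conn


def record_job(conn, job_id, username, cpu_utilization):
    """Store one job's metrics; a real tool would record many more fields."""
    conn.execute(
        "INSERT INTO job_metrics (job_id, username, cpu_utilization)"
        " VALUES (?, ?, ?)",
        (job_id, username, cpu_utilization),
    )
    conn.commit()


def users_needing_support(conn, threshold=0.5, min_jobs=3):
    """Return (username, count) for users with repeated low-utilization jobs.

    Both threshold and min_jobs are illustrative placeholders.
    """
    return conn.execute(
        "SELECT username, COUNT(*) AS n FROM job_metrics"
        " WHERE cpu_utilization < ?"
        " GROUP BY username HAVING COUNT(*) >= ?"
        " ORDER BY n DESC, username",
        (threshold, min_jobs),
    ).fetchall()
```

A query like `users_needing_support(conn)` would let support staff contact users whose jobs repeatedly show the same symptom, rather than waiting for a support request.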

The tool set will be simple to roll out on Tier-3 infrastructures and it will be maintained, along with the documentation, by the HLRN after the conclusion of this project. It will also be deployed on the HLRN infrastructure, so that users migrating from Tier-3 to Tier-2 will get the same profiling information. All tools and documentation will be made publicly available with the goal of a wide distribution among German Tier-3 HPC centers. They will therefore be a resource usable as a base for future projects and extensions in the area of HPC performance engineering.