HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center
Customer
Scientific and Technical Council of the HSE Supercomputer Complex
Developer
Project manager
Purpose
To develop a task efficiency monitoring system for the cHARISMa supercomputer that helps users check whether their computations are running correctly, and helps system administrators identify inefficient tasks and assist users with optimization. The system reduces the inefficient load on the high-performance computing cluster and thereby shortens the queue waiting time for all users.
Description of the project
Ensuring efficient computation has been a vital concern since the advent of supercomputers. Every supercomputer is unique in its technical characteristics and software, so there is no universal task efficiency monitoring system. Large computing clusters typically develop their own monitoring systems, and the cHARISMa supercomputer is no exception.
The HPC TaskMaster efficiency monitoring system is already running on the cHARISMa supercomputer, and all users have access to it. The system not only collects and visualizes task data, but also analyzes task efficiency based on detected problem indicators. Developing such a system is a large-scale research effort, since task analysis requires mathematical methods and data processing methods based on artificial intelligence.
More than 300,000 scientific and educational tasks are launched annually on the cHARISMa supercomputer. The HPC TaskMaster system is designed to help users perform their computations more efficiently. It provides informative reports on the characteristics of completed tasks, points out errors, and gives users recommendations for improving efficiency. By automatically identifying problematic tasks, the system makes it possible to use the resources of the entire supercomputer more efficiently, saving expensive machine time and speeding up work for all users.
The HPC TaskMaster system is developed in Python and JavaScript using the open-source tools Telegraf, InfluxDB, and Grafana.
The system is available to all cHARISMa supercomputer users at https://lk.hpc.hse.ru
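As a rough illustration of how per-task metrics could reach InfluxDB, the sketch below formats one sample in InfluxDB line protocol, the text format that Telegraf emits. The source does not describe the actual metric schema, so the measurement, tag, and field names (`job_cpu`, `job_id`, `node`, `usage_percent`, `cores`) and the helper functions are hypothetical.

```python
def escape_key(value: str) -> str:
    """Escape commas, equals signs, and spaces, which are special in line-protocol keys and tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render a field value: booleans as true/false, integers with an 'i' suffix, floats as-is."""
    if isinstance(value, bool):  # check bool first, because bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value,... field=value,... timestamp.

    Assumes the measurement name contains no commas or spaces.
    """
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample: CPU usage of one job on one node.
line = to_line_protocol(
    "job_cpu",
    {"job_id": "12345", "node": "cn-001"},
    {"usage_percent": 87.5, "cores": 4},
    1700000000000000000,
)
```

Tagging each sample with a job identifier is one plausible way to let a dashboard such as Grafana filter graphs per task; whether HPC TaskMaster stores its data this way is an assumption.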
Results
The HPC TaskMaster monitoring system tracks the efficiency of all tasks performed on the supercomputer. The conclusion about a task's efficiency is based both on the utilization rates of the components it uses and on its individual properties. Users can view reports on task execution, with a conclusion for each task and interactive graphs.
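The efficiency assessment described above can be pictured as a set of rules over per-task utilization figures. The following is a minimal sketch under stated assumptions, not the system's actual logic: the `JobMetrics` fields, the threshold values, and the indicator messages are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical per-task averages; field names are illustrative only."""
    job_id: int
    cpu_util: float            # mean utilization of allocated CPU cores, 0..1
    gpu_util: Optional[float]  # mean GPU utilization, or None if no GPU was requested
    mem_util: float            # peak memory used / memory requested, 0..1
    runtime_s: int             # wall-clock duration in seconds


# Illustrative thresholds, not the values used by HPC TaskMaster.
LOW_CPU = 0.20
LOW_GPU = 0.10
LOW_MEM = 0.10
MIN_RUNTIME_S = 300  # very short tasks yield too few samples to judge


def detect_indicators(job: JobMetrics) -> list[str]:
    """Return the problem indicators found for one task."""
    indicators: list[str] = []
    if job.runtime_s < MIN_RUNTIME_S:
        return indicators  # an individual property of the task: too short to assess
    if job.cpu_util < LOW_CPU:
        indicators.append("low CPU utilization")
    if job.gpu_util is not None and job.gpu_util < LOW_GPU:
        indicators.append("allocated GPU is almost idle")
    if job.mem_util < LOW_MEM:
        indicators.append("requested memory is mostly unused")
    return indicators


def conclusion(job: JobMetrics) -> str:
    """Summarize the indicators as a single verdict for the task report."""
    return "inefficient" if detect_indicators(job) else "no problems detected"


# Hypothetical task: one hour long, CPU nearly idle, no GPU requested.
example = JobMetrics(job_id=1, cpu_util=0.05, gpu_util=None, mem_util=0.5, runtime_s=3600)
```

Here the utilization thresholds stand in for the "utilization rates" and the runtime and GPU checks stand in for the task's "individual properties"; a real system would likely use more indicators and tuned limits.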
HPC TaskMaster is built on open-source software, which allows it to be installed on other HPC clusters.
Thanks to user consultations, the amount of inefficient computing on the supercomputer has already been reduced by 25%.
Repository
HPC TaskMaster is under active development, and new features are being added. We invite the scientific community, students, and all interested programmers to participate in our open-source project.
Documentation
The user manual and documentation are available in the Russian version of this page.
Publications
- Kostenetskiy P., Chulkevich R., Kozyrev V., Shamsutdinov A., Antonov D. HPC TaskMaster - Task Efficiency Monitoring System for the Supercomputer Center // Communications in Computer and Information Science. 2022.
- Kostenetsky P.S., Shamsutdinov A.B., Chulkevich R.A., Kozyrev V.I. Certificate of state registration of the computer program "HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center" № 2022682037 dated 11/18/2022, copyright holder: Federal State Autonomous Educational Institution of Higher Education "National Research University "Higher School of Economics".
Pictures
Fig. 1. The architecture of HPC TaskMaster
Fig. 2. Graphs from an individual job page