Performance in HPC: Gnuplot & ARM MAP
Today’s workshop focused on exploring performance in a hybrid MPI/OpenMP application using tools such as Gnuplot and ARM Forge MAP.
The goal was to understand how different parallel configurations affect performance, and more importantly, why those differences occur.
The Core Idea
The key takeaway from this exercise is that performance in parallel computing is not simply about using more resources. Even when the total number of cores is fixed, the way those cores are divided between MPI processes and OpenMP threads can significantly impact runtime.
In this case, the total number of cores was fixed at 128, but the configuration varied:
- 2 processes × 64 threads
- 4 processes × 32 threads
- 8 processes × 16 threads
- 16 processes × 8 threads
- 32 processes × 4 threads
- 64 processes × 2 threads
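These splits can be enumerated programmatically. Below is a minimal Python sketch; the 128-core budget and the power-of-two stepping are taken from the exercise, while the function name and the minimum-size cut-offs are illustrative assumptions:

```python
# Enumerate hybrid MPI/OpenMP decompositions of a fixed core budget.
TOTAL_CORES = 128

def decompositions(total, min_procs=2, min_threads=2):
    """Return (processes, threads) pairs whose product equals `total`."""
    configs = []
    p = min_procs
    while total // p >= min_threads:
        if total % p == 0:
            configs.append((p, total // p))
        p *= 2  # the exercise stepped through powers of two
    return configs

for procs, threads in decompositions(TOTAL_CORES):
    # Each pair is one candidate launch configuration, e.g.
    # OMP_NUM_THREADS=<threads> mpirun -np <procs> ./app
    print(f"{procs} processes x {threads} threads")
```

Running this reproduces exactly the six configurations listed above.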
Each of these configurations represents a different balance between distributed (MPI) and shared-memory (OpenMP) parallelism.
What the Results Showed
The results made it clear that not all configurations are equal.
For double precision, the best performance was achieved with 32 processes and 4 threads, with a runtime of approximately 41 seconds.
For single precision, the optimal configuration shifted to 16 processes and 8 threads, with a runtime of approximately 20 seconds.
Interestingly, increasing the number of MPI processes beyond a certain point did not continue to improve performance. In fact, it began to degrade.
This clearly demonstrates that there exists an optimal process-thread combination, and that simply increasing parallelism does not guarantee better performance.
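Once the timings are collected, identifying the optimum is straightforward. In the sketch below, only the 32 × 4 value (about 41 seconds, double precision) comes from the measured results; the remaining numbers are hypothetical placeholders that merely follow the reported trend:

```python
# Runtimes (seconds) per (MPI processes, OpenMP threads) configuration.
# Only the 32 x 4 entry (~41 s) is from the measured results; the rest
# are placeholder values shaped to match the described trend.
runtimes_double = {
    (2, 64): 95.0,   # hypothetical
    (4, 32): 70.0,   # hypothetical
    (8, 16): 55.0,   # hypothetical
    (16, 8): 46.0,   # hypothetical
    (32, 4): 41.0,   # measured optimum
    (64, 2): 48.0,   # hypothetical: more processes degraded performance
}

best = min(runtimes_double, key=runtimes_double.get)
print(f"best: {best[0]} processes x {best[1]} threads "
      f"({runtimes_double[best]:.0f} s)")
```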
Why More Parallelism Can Hurt
This behaviour can be explained by the overheads introduced by parallel execution.
Using too many MPI processes increases communication costs. Operations such as MPI_Recv, MPI_Reduce, and MPI_Barrier become more frequent and expensive. Additionally, work may become too finely divided, leading to inefficiencies.
On the other hand, using too many OpenMP threads introduces its own overheads. Thread scheduling, synchronisation, memory bandwidth contention, and cache locality issues can all negatively affect performance.
As a result, there is a trade-off between computation and overhead. Beyond a certain point, increasing parallelism adds more overhead than benefit.
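This trade-off can be made concrete with a toy cost model: ideally parallelised work divided among p processes, plus an overhead term that grows with p. The coefficients below are invented for illustration, not fitted to the workshop data:

```python
# Toy cost model: T(p) = WORK/p + ALPHA*p, where WORK/p is perfectly
# parallelised computation and ALPHA*p stands for communication and
# synchronisation overhead that grows with the process count.
WORK = 4096.0   # arbitrary units of computation
ALPHA = 1.0     # arbitrary per-process overhead coefficient

def runtime(p):
    return WORK / p + ALPHA * p

candidates = [2, 4, 8, 16, 32, 64, 128]
for p in candidates:
    print(f"p={p:3d}  T={runtime(p):7.1f}")
print(f"model optimum at p = {min(candidates, key=runtime)}")
```

Even in this crude model the minimum sits at an interior point (p = 64 here): past it, the overhead term dominates and runtime rises again, which is qualitatively the behaviour observed in the exercise.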
Role of the Tools
Gnuplot and ARM MAP played complementary roles in this exercise.
Gnuplot was used to visualise the timing results. By plotting metrics such as total runtime, it becomes much easier to identify trends across different configurations. It answers the question: which configuration performs best?
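As a sketch of that workflow, the script below writes a whitespace-separated data file and a matching Gnuplot script. The filenames, column layout, and data values are assumptions for illustration, not the ones used in the workshop:

```python
# Write illustrative timing data, then a Gnuplot script that plots
# runtime against MPI process count. Values are placeholders.
rows = [(2, 95.0), (4, 70.0), (8, 55.0), (16, 46.0), (32, 41.0), (64, 48.0)]

with open("timings.dat", "w") as f:
    f.write("# processes runtime_s\n")
    for procs, secs in rows:
        f.write(f"{procs} {secs}\n")

script = """\
set logscale x 2
set xlabel "MPI processes (threads = 128 / processes)"
set ylabel "runtime (s)"
plot "timings.dat" using 1:2 with linespoints title "double precision"
"""
with open("plot.gp", "w") as f:
    f.write(script)
# Then render with: gnuplot -p plot.gp
```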
ARM MAP, on the other hand, was used for profiling. It provides insight into how the program spends its time, including CPU usage, MPI communication, and OpenMP regions. It answers a different question: why does a particular configuration perform the way it does?
Understanding the Hotspots
The profiling results revealed that a significant portion of the runtime, around 39%, was spent in an OpenMP region starting at gpuvidrect.f:144.
At first, it was not obvious why the loop body itself was not highlighted in the source view. However, this is expected behaviour. ARM MAP often attributes the runtime of an entire OpenMP parallel region to the line containing the parallel directive that opens it, rather than distributing it across individual lines inside the loop.
This means that instead of thinking in terms of individual lines, it is more accurate to think in terms of regions or computational blocks.
Other contributors to runtime included MPI communication calls and linear algebra routines such as dgemm and _sci_pdgetrf.
Reflection
This exercise shifted my perspective on performance.
Before this, it was easy to assume that using more cores would naturally lead to faster execution. However, this experiment demonstrated that performance is much more nuanced. It depends on how computation, communication, and memory behaviour interact.
What stood out most was the importance of measurement and analysis. Without tools like Gnuplot and ARM MAP, it would be difficult to understand why certain configurations perform better than others.
More importantly, this exercise highlights a key skill in high-performance computing: not just writing parallel code, but understanding and optimising its behaviour.
Summary
The exercise involved analysing a hybrid MPI/OpenMP scientific application by varying process-thread configurations, visualising performance using Gnuplot, and identifying runtime bottlenecks using ARM MAP. The results demonstrate that optimal performance depends on balancing computation and overhead, rather than maximising parallelism alone.