Performance in HPC: Gnuplot & ARM MAP
Today’s workshop focused on exploring performance in a hybrid MPI/OpenMP application using tools such as Gnuplot and ARM Forge MAP.
The goal was to understand how different parallel configurations affect performance, and more importantly, why those differences occur.
The Core Idea
The key takeaway from this exercise is that performance in parallel computing is not simply about using more resources. Even when the total number of cores is fixed, the way those cores are divided between MPI processes and OpenMP threads can significantly impact runtime.
In this case, the total number of cores was fixed at 128, but the configuration varied:
- 2 processes × 64 threads
- 4 processes × 32 threads
- 8 processes × 16 threads
- 16 processes × 8 threads
- 32 processes × 4 threads
- 64 processes × 2 threads
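These splits can be enumerated programmatically. Below is a minimal Python sketch; the 128-core budget and the power-of-two stepping are taken from the exercise, while the function name and the minimum-size cut-offs are illustrative assumptions:

```python
# Enumerate hybrid MPI/OpenMP decompositions of a fixed core budget.
TOTAL_CORES = 128

def decompositions(total, min_procs=2, min_threads=2):
    """Return (processes, threads) pairs whose product equals `total`."""
    configs = []
    p = min_procs
    while total // p >= min_threads:
        if total % p == 0:
            configs.append((p, total // p))
        p *= 2  # the exercise stepped through powers of two
    return configs

for procs, threads in decompositions(TOTAL_CORES):
    # Each pair is one candidate launch configuration, e.g.
    # OMP_NUM_THREADS=<threads> mpirun -np <procs> ./app
    print(f"{procs} processes x {threads} threads")
```

Running this reproduces exactly the six configurations listed above.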
Each of these configurations represents a different balance between distributed (MPI) and shared-memory (OpenMP) parallelism.
What the Results Showed
The results made it clear that not all configurations are equal.
For double precision, the best performance was achieved with 32 processes and 4 threads, with a runtime of approximately 41 seconds.
For single precision, the optimal configuration shifted to 16 processes and 8 threads, with a runtime of approximately 20 seconds.
Interestingly, increasing the number of MPI processes beyond a certain point did not continue to improve performance. In fact, it began to degrade.
This clearly demonstrates that there exists an optimal process-thread combination, and that simply increasing parallelism does not guarantee better performance.
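Once the timings are collected, identifying the optimum is straightforward. In the sketch below, only the 32 × 4 value (about 41 seconds, double precision) comes from the measured results; the remaining numbers are hypothetical placeholders that merely follow the reported trend:

```python
# Runtimes (seconds) per (MPI processes, OpenMP threads) configuration.
# Only the 32 x 4 entry (~41 s) is from the measured results; the rest
# are placeholder values shaped to match the described trend.
runtimes_double = {
    (2, 64): 95.0,   # hypothetical
    (4, 32): 70.0,   # hypothetical
    (8, 16): 55.0,   # hypothetical
    (16, 8): 46.0,   # hypothetical
    (32, 4): 41.0,   # measured optimum
    (64, 2): 48.0,   # hypothetical: more processes degraded performance
}

best = min(runtimes_double, key=runtimes_double.get)
print(f"best: {best[0]} processes x {best[1]} threads "
      f"({runtimes_double[best]:.0f} s)")
```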
Why More Parallelism Can Hurt
This behaviour can be explained by the overheads introduced by parallel execution.
Using too many MPI processes increases communication costs. Operations such as MPI_Recv, MPI_Reduce, and MPI_Barrier become more frequent and expensive. Additionally, work may become too finely divided, leading to inefficiencies.
On the other hand, using too many OpenMP threads introduces its own overheads. Thread scheduling, synchronisation, memory bandwidth contention, and cache locality issues can all negatively affect performance.
As a result, there is a trade-off between computation and overhead. Beyond a certain point, increasing parallelism adds more overhead than benefit.
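This trade-off can be made concrete with a toy cost model: ideally parallelised work divided among p processes, plus an overhead term that grows with p. The coefficients below are invented for illustration, not fitted to the workshop data:

```python
# Toy cost model: T(p) = WORK/p + ALPHA*p, where WORK/p is perfectly
# parallelised computation and ALPHA*p stands for communication and
# synchronisation overhead that grows with the process count.
WORK = 4096.0   # arbitrary units of computation
ALPHA = 1.0     # arbitrary per-process overhead coefficient

def runtime(p):
    return WORK / p + ALPHA * p

candidates = [2, 4, 8, 16, 32, 64, 128]
for p in candidates:
    print(f"p={p:3d}  T={runtime(p):7.1f}")
print(f"model optimum at p = {min(candidates, key=runtime)}")
```

Even in this crude model the minimum sits at an interior point (p = 64 here): past it, the overhead term dominates and runtime rises again, which is qualitatively the behaviour observed in the exercise.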
Role of the Tools
Gnuplot and ARM MAP played complementary roles in this exercise.
Gnuplot was used to visualise the timing results. By plotting metrics such as total runtime, it becomes much easier to identify trends across different configurations. It answers the question: which configuration performs best?
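As a sketch of that workflow, the script below writes a whitespace-separated data file and a matching Gnuplot script. The filenames, column layout, and data values are assumptions for illustration, not the ones used in the workshop:

```python
# Write illustrative timing data, then a Gnuplot script that plots
# runtime against MPI process count. Values are placeholders.
rows = [(2, 95.0), (4, 70.0), (8, 55.0), (16, 46.0), (32, 41.0), (64, 48.0)]

with open("timings.dat", "w") as f:
    f.write("# processes runtime_s\n")
    for procs, secs in rows:
        f.write(f"{procs} {secs}\n")

script = """\
set logscale x 2
set xlabel "MPI processes (threads = 128 / processes)"
set ylabel "runtime (s)"
plot "timings.dat" using 1:2 with linespoints title "double precision"
"""
with open("plot.gp", "w") as f:
    f.write(script)
# Then render with: gnuplot -p plot.gp
```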
ARM MAP, on the other hand, was used for profiling. It provides insight into how the program spends its time, including CPU usage, MPI communication, and OpenMP regions. It answers a different question: why does a particular configuration perform the way it does?
Understanding the Hotspots
The profiling results revealed that a significant portion of the runtime, around 39%, was spent in an OpenMP region starting at gpuvidrect.f:144.
At first, it was not obvious why the loop body itself was not highlighted in the source view. However, this is expected behaviour. ARM MAP often attributes the runtime of an entire OpenMP parallel region to the line containing the parallel directive that opens it, rather than distributing it across individual lines inside the loop.
This means that instead of thinking in terms of individual lines, it is more accurate to think in terms of regions or computational blocks.
Other contributors to runtime included MPI communication calls and linear algebra routines such as dgemm and _sci_pdgetrf.
Reflection
This exercise shifted my perspective on performance.
Before this, it was easy to assume that using more cores would naturally lead to faster execution. However, this experiment demonstrated that performance is much more nuanced. It depends on how computation, communication, and memory behaviour interact.
What stood out most was the importance of measurement and analysis. Without tools like Gnuplot and ARM MAP, it would be difficult to understand why certain configurations perform better than others.
More importantly, this exercise highlights a key skill in high-performance computing: not just writing parallel code, but understanding and optimising its behaviour.
Summary
The exercise involved analysing a hybrid MPI/OpenMP scientific application by varying process-thread configurations, visualising performance using Gnuplot, and identifying runtime bottlenecks using ARM MAP. The results demonstrate that optimal performance depends on balancing computation and overhead, rather than maximising parallelism alone.