Introduction
For solvers such as code_saturne, which can run on large resources for long durations, improving performance is always essential, to reduce both user wait times and IT costs (of which a large part is nowadays energy cost).
Performance gains are usually a combination of progress in computing power and in algorithms. For similar software over the last few decades, both factors have been important, with algorithm progress being a major driver.
So although the first step in improving program performance is on the algorithmic (and theory) side, detailed analysis of the behavior of those algorithms on actual hardware is important.
To assist developers and users in performance optimization, code_saturne includes many timers, and tries to log synthetic performance information.
This allows comparing the performance of numerical options and checking that no "unexpected" performance bottlenecks are present.
For more detailed analysis, the use of profiling tools is recommended.
Preliminary knowledge
To be able to understand the performance behavior of code_saturne, the user should have at least introductory knowlege of several hardware and programming model related aspects.
- code_saturne is mostly memory bound, so understanding of the roofline model is important.
- This implies some understading of memory models and hierarchies.
- Merging operators so as to increase arithmetic intensity (reuse of variables in memory for multiple computations) is essential to obtaining proper performance on modern hardware.
- Regarding parallelism, for both MPI (first level) and OpenMP (second level), understanding implicit and explicit barriers, and their relation to load balance is also important.
Built-in performance measurement
Performance of each time step
When running, the code_saturne solver generates a timer_stats.csv
file, which traces the elapsed time for each major operation type (mesh modification, post-processing, gradient reconstructions, linear solvers, and such). This information may be easily plotted using a spreadsheet or a visualization tool such as ParaView.
The code also generates a performance.log
file, which summarizes timings for various operations, in a manner independent of the number of time steps actually run (so this file is complete only after a successful run).
Profiling tools
To obtain more detailed performance information, use of a profiling tool is needed.
Use --enable-profile
to configure builds for profiling.
Common profiling tools
Several types of tools may be available. We list a few commonly available tools, though the list is far from exhaustive:
- Various profilers are commonly found on Linux machines.
- gprof is obsolete, and does not work well with non-static builds. Please forget about it outside of purely historical interest.
- The gprofng tool is a new profiler, available as part of recent Linux distributions base binutils package.
- As this tool is not broadly available on older systems, we do not describe it here yet, but will recommend it and provide links to additional resources as its availability progresses.
The Valgrind tool suite includes several tool which are very useful for profiling. Note that as usual when running under Valgrind, there is an overhead relative to actual performance, and the obtained timing results may be simulated as much as measured, but the information obtained is very similar to that obtained with less ubiquitous tools.
- Massif is a heap profiler, allowing measuring the evolution of memory over time. Using this tool simply requires defining: as a tool in the advanced run options in the GUI, or running
code_saturne run --tool-args="valgrind --tool=massif"
on the command line. The files produced are text files, but can also be visualized with the Massif-vizualizer application where available.
- Cachegrind profiles cache usage, and can be used in a similar manner, using
valgrind --tool=cachegrind
Data may be visualized using the kcachegrind tool. Note also that since cache behavior is simulated, the various cach sizes used may be modified, and the effect on behavior observed.
- Callgrind functions as a "general purpose" profiler. Combined with the kcachegrind tool, it is extremely easy to use on a Linux workstation. It allows easy visualization of call trees and hot spots, as illustrated below:
GDB in terminal mode
Vendor profiling tools
Other advanced profiling tools may be provided by various vendors, for example:
Using VTune with code_saturne
If Intel's VTune is available, the following procedure may be used:
- First, initialize the execution directory, using one of the following methods:
- From the GUI, in the advanced run/job submission options, check "initialize only", then submit the computation.
- Outside the GUI, run
code_saturne submit --initialize
In either case, the code will prepare the execution directory, and preprocess the mesh if needed, but not remove the executable and temporary script.
- Once the stage has finished,
cd
to the execution directory, and edit the run_solver
script script:
- Add commands necessary to load the profiler's environment. For example, on the EDF Cronos cluster, this requires adding
module load intel-Basekit
in the section where other modules are added.
- Search for the line actually launching the solver, near the end of the script;
- Before that line, add:
# Select a different profiling directory at each run
export DIR_PROF=profiling.$SLURM_NTASKS.$SLURM_JOB_ID
# Avoid using /tmp, which may be too small
export TMPDIR=$SCRATCH
- Before the
cs_solver
command, insert the profiling commands. For example, replace mpiexec <options> ./cs_solver --mpi "$@"
with: mpiexec <options> -q -collect hotspots -r $DIR_PROF -data-limit=4000 -- ./cs_solver --mpi "$@"
(mpiexec
might be replaced by another command, such as srun
depending on the system, but the logic remains the same).
- If using a batch system (usually true on a cluster), you will need to submit the
run_solver
script rather than running it directly.
- If you are familiar with batch commands, pass the required commands to the submission command.
- Otherwise, you can copy the batch headers at the beginning of the
runcase
file to the same position (starting at line 2) in the run_solver
file.
- With the SLURM resource manager, this means:
- Copy all lines of
runcase
starting with #SLURM
to run_solver
.
- Run
- With other systems, the syntax will be slightly different but the principle remains the same.
- Once the code has finished running, simply run
vtune-gui <profiling_directory_name>
This may require loading an environment module (as in the run_solver
file for the VTune installation (intel-Basekit
in the previous example).
VTune allows many exploration views, for example:
VTune summary page
or
VTune hotspots view