Parallelization CPU Affinity

Introduction

HEC-RAS is parallelized, meaning more than one computation thread can run at a time. For 1D computations, parallelization occurs only when using the PARDISO matrix solver. For 2D computations, the hydraulic solver uses OpenMP, which divides the computational work among multiple parallel threads and speeds up the simulation. In HEC-RAS 6.6 and earlier versions, the operating system decided which logical processors the parallel computational threads ran on. For reasons explained below, this can be inefficient, so Version 6.7 adds options for assigning the logical processors on which the computational threads can run. For most user computers, these new options will be more efficient and significantly reduce run times.

HEC-RAS can limit the execution of computations to just Performance Cores (P-cores) and ignore Efficiency Cores (E-cores), which can greatly improve unsteady flow runtimes!

CPU Cores vs Logical Processors

A CPU core is a physical processing unit. Logical processors are virtual execution contexts that allow a single physical core to run multiple (typically 2) threads concurrently. Hyper-Threading (HT) is an Intel technology that allows concurrent threads on a single core. AMD has a similar technology called Simultaneous Multithreading (SMT), which is functionally the same as Hyper-Threading. In some cases, HT and SMT can decrease runtime performance for CPU-intensive tasks due to resource contention and thread management overhead. For this reason, HEC-RAS 6.7 has added the ability to run its computations without HT and SMT. For simplicity, the term "hyper-threading" is used here to represent both HT and SMT.
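The relationship between physical cores and logical processors can be illustrated with a minimal Python sketch. The function name logical_processor_count is hypothetical and is not part of HEC-RAS; it only shows the arithmetic, assuming every core runs the same number of threads.

```python
import os


def logical_processor_count(physical_cores: int, threads_per_core: int = 2) -> int:
    """Logical processors exposed by cores that each run `threads_per_core` threads."""
    if physical_cores < 1 or threads_per_core < 1:
        raise ValueError("core and thread counts must be positive")
    return physical_cores * threads_per_core


# 8 hyper-threaded cores appear to the operating system as 16 logical processors.
print(logical_processor_count(8))
# With hyper-threading unused, each core contributes one logical processor.
print(logical_processor_count(8, threads_per_core=1))
# os.cpu_count() reports logical processors, not physical cores (it may be None).
print(os.cpu_count())
```

Note that the count reported by the operating system is the number of logical processors, which is why it can be twice the number of physical cores on a hyper-threaded machine.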

Performance Cores vs Efficiency Cores

Performance cores (often shortened to “P-cores”) and efficiency cores (“E-cores”) are two different kinds of CPU cores that are combined on the same chip in a “hybrid” architecture. Performance cores are designed for high-performance tasks and have higher clock speeds and larger local memory than efficiency cores. Efficiency cores are designed for background or less demanding tasks and consume less power than performance cores. This architecture allows computationally expensive processes to run on performance cores, while less intensive computations run on the slower but more power-efficient efficiency cores. Modern operating systems like Windows 11 automatically assign tasks to either P-cores or E-cores based on their processing requirements. However, the operating system is not perfect and can sometimes assign computationally intense processes, such as an HEC-RAS unsteady flow simulation, to efficiency cores. Ideally, HEC-RAS should never run on efficiency cores.

CPU Affinity Options

HEC-RAS has three options to choose from for performing parallel computations during an unsteady flow simulation: Operating System Managed, Restricted, and Pinned.

To set this CPU option, select the Options | Program Setup | Parallelization CPU Affinity... menu item.

The dialog shown below will then allow selection of the CPU Affinity.

Operating System Managed - The operating system will choose which cores will be used (P-cores and E-cores) for the computations. Hyper-threading is also allowed.

Restricted - Computations are restricted to run only on performance cores, and the operating system is free to manage which threads run on which performance cores. This option does not allow hyper-threading. Example: if 4 Solver Cores are selected for a 2D Flow Area on a computer with 8 hyper-threaded performance cores (i.e., 16 performance logical processors), the computations will be performed using only 4 cores at one time but may use any of the 8 performance cores during the simulation.

Pinned - Computations are restricted to run only on performance cores, and the maximum number of cores is limited to the user-specified value. Example: if 4 Solver Cores are selected for a 2D Flow Area on a computer with 8 performance cores, the computations will be performed using the same 4 performance cores during the entire simulation.

The default option is Restricted, which allows threads to run only on the performance cores but lets them switch cores during the simulation.
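The difference between the three options can be sketched as the set of logical processors on which solver threads are allowed to run. The Python below is only an illustration and not the HEC-RAS implementation. The function allowed_processors, the option strings, and the processor numbering (each P-core owning adjacent hyper-thread indices, with E-cores numbered last) are all assumptions. The sketch also assumes that Pinned, like Restricted, uses one logical processor per core.

```python
from typing import List


def allowed_processors(option: str, p_cores: int, e_cores: int,
                       solver_cores: int, threads_per_p_core: int = 2) -> List[int]:
    """Logical processor indices the solver threads may run on.

    Assumed (hypothetical) numbering: P-core k owns the indices
    k*threads_per_p_core .. k*threads_per_p_core + threads_per_p_core - 1,
    and the E-cores follow with one index each.
    """
    p_logical = p_cores * threads_per_p_core
    # First logical processor of every P-core: one thread per core, no hyper-threading.
    p_primary = list(range(0, p_logical, threads_per_p_core))
    if option == "os_managed":
        # No restriction: every P-core and E-core logical processor is eligible.
        return list(range(p_logical + e_cores))
    if option == "restricted":
        # Any P-core may be used; the operating system moves threads among them.
        return p_primary
    if option == "pinned":
        # The same fixed subset of P-cores for the whole simulation (assumed layout).
        return p_primary[:min(solver_cores, p_cores)]
    raise ValueError(f"unknown option: {option!r}")


for option in ("os_managed", "restricted", "pinned"):
    print(option, allowed_processors(option, p_cores=8, e_cores=0, solver_cores=4))
```

For the example of 4 Solver Cores on 8 hyper-threaded performance cores, Restricted leaves 8 eligible processors (one per core) even though only 4 threads run at a time, while Pinned narrows the set to 4 fixed processors.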

Setting the 2D Unsteady Flow Solver Cores

The maximum number of cores is set in the Unsteady Flow Analysis editor under the menu Options | Computation Options and Tolerances..., within the tab called 2D Flow Options (see image below). The default number of cores is All Available. In HEC-RAS 6.7 this option limits the number of cores to the number of performance cores, whereas in previous versions it set the number of cores to the total number of cores and, on hybrid systems, did not accurately capture the performance cores available. In HEC-RAS 6.7 the second row of the Parameter column is for the maximum number of Solver Cores, followed by parentheses containing the number of performance and efficiency cores available on the current computer, denoted by the capital letters P and E, respectively. For the example shown in the figure below, the computer has 12 performance cores and 0 efficiency cores. If the user selects 20 cores, the software will automatically limit the actual maximum number of cores to 12.

Selecting "All Available" for Solver Cores will limit computation execution to only Performance Cores (ignoring Efficiency Cores). It is suggested that the user specify the appropriate number of Solver Cores for each model.
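The limiting rule described above amounts to taking the smaller of the requested core count and the number of performance cores. A minimal Python sketch follows. The function effective_solver_cores is hypothetical, and using None to stand for "All Available" is an assumption made for illustration.

```python
from typing import Optional


def effective_solver_cores(requested: Optional[int], p_cores: int) -> int:
    """Solver cores actually used; `requested=None` stands for "All Available"."""
    if p_cores < 1:
        raise ValueError("at least one performance core is required")
    if requested is None:
        # "All Available" means all performance cores; efficiency cores are ignored.
        return p_cores
    if requested < 1:
        raise ValueError("requested cores must be positive")
    # A request larger than the number of performance cores is capped.
    return min(requested, p_cores)


# Computer with 12 performance cores and 0 efficiency cores.
print(effective_solver_cores(20, p_cores=12))    # request of 20 is limited to 12
print(effective_solver_cores(None, p_cores=12))  # "All Available"
print(effective_solver_cores(4, p_cores=12))     # request within the limit
```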

When the simulation runs, you will see messages indicating whether CPU Affinity is turned on, which cores are selected, and how many cores per 2D Flow Area will be used for the computations. As shown in the figure below, CPU Affinity has been turned on to limit computations to just the performance cores, with at most 4 cores used at a time for the 2D area.

Performance

Requesting the use of "All Available" solver cores will most likely not result in the most efficient model run times. The overhead of distributing information across processors results in poor performance for smaller datasets. Therefore, users should optimize model runs by testing a variety of scenarios with different numbers of solver cores selected. Based on a limited set of test results on a computer with 8 logical processors for a model with a range of cell sizes, the performance gains were maximized for models using 4 cores. Speed improvements for a given number of 2D cells and number of solver cores are shown in the figure below. For small models, 2 cores showed a marked improvement over 1 core; however, more cores resulted in a reduction in performance. Only for larger models was there an increase in performance as the number of solver cores was increased.
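One way to compare scenarios is to record the run time for each solver core setting and compute the speedup relative to a single core. The Python sketch below assumes the run times were measured by the user. The function names are hypothetical, and the timings in the example are made-up placeholders that only mimic the small-model pattern described above. They are not HEC-RAS test results.

```python
from typing import Dict


def speedups(runtimes: Dict[int, float]) -> Dict[int, float]:
    """Speedup of each core count relative to the 1-core run (which must be present)."""
    base = runtimes[1]
    return {n: base / t for n, t in sorted(runtimes.items())}


def best_core_count(runtimes: Dict[int, float]) -> int:
    """Core count with the shortest run time; ties go to the fewer cores."""
    return min(runtimes, key=lambda n: (runtimes[n], n))


# Placeholder run times in seconds, keyed by number of solver cores (not measured data).
timings = {1: 100.0, 2: 60.0, 4: 70.0, 8: 90.0}

for cores, factor in speedups(timings).items():
    print(f"{cores} cores: {factor:.2f}x")
print("best:", best_core_count(timings))
```

Repeating this comparison for each model is worthwhile because the best setting depends on the number of 2D cells.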