Distributed Compute Plugin
The Distributed Compute Plugin provides the ability to horizontally scale the HEC-WAT Flood Risk Analysis (FRA) computes. The Distributed Compute Plugin is separated from the HEC-WAT instance for HEC-WAT version 1.1 or higher, the distributed compute is integrated with the HEC-WAT version 1.0 software.
The Controller Node and the Compute Nodes
The Distributed Compute consists of two major pieces, the controller node and the compute nodes. The controller node initiates the compute and sends jobs to the compute nodes. When the compute node is finished with its job, the compute node sends a message to the controller node indicating completion and the controller node pulls output results from the compute node for further post processing. The figure below is an example of one controller node with three compute nodes on the same port number.
Controller Node
There is one controller node per WAT simulation and will need to be set up first. This node communicates with the distributed compute grid and facilitates assigning jobs to each compute node, presenting progress to the user, and organizing results received from the compute nodes. The user manages the controller node through the HEC-WAT interface. Click here to set up the distributed compute plugin for the Controller.
Compute Node
There can be many compute nodes per WAT simulation. These nodes receive jobs from the distributed compute grid, compute them, and send results back to the controller node. The compute node is initialized via a batch file (computenode.bat). This launches a compute node that becomes part of the distributed compute grid. When the controller node sends a message to compute a WAT simulation job, the compute node starts an instance of HEC-WAT, checks to see if all of the necessary files are present in the proper locations for the project, computes the job, and reports results or failure of the compute.
You must have access to multiple computers connected to a network or virtual machine (VM). Each Compute node also needs to have the same version of HEC-WAT installed as the Controller node. Once you have the Controller Node set up, click here to set up the distributed compute plugin for the Compute nodes.
Begin Distributed Compute Simulation
Make sure you have WAT version 1.1 or higher installed in C:\Programs in both the Controller and Compute nodes.
Once the Controller Node and Compute Node dependencies have been placed in their respective directories, check that the nodes were set up correctly and that the controller node can link to the compute nodes.
To check that the controller node is functioning, open HEC-WAT where the controller resides. Select Tools → Simulation Compute Engine Manager and the Simulation Compute Manager window should open.
Under Nodes Present, there should be 1 node present which is indicating the controller node. Once this is confirmed, move on to check the compute node has been properly set.
On the compute node VM, click on the computenode.bat. A command window should appear. Keep this window open.
Once the command window appears, go to the out.log file located in C:\Users\<UserName>\AppData\Local\Temp and check that the "Ignite" jar has been properly called. You should see the Ignite logo in the out.log file.
Once you confirmed both the controller and compute nodes are properly set up, head back to your controller node and open the Simulation Compute Engine Manger. The Nodes Present should now show a value of 2 (one node is the controller node, one node for the compute node). If you have two compute nodes configured, you will see a value of 3 and so on. You have completed your Distributed Compute set up and can begin your FRA simulation.
Before beginning a Distributed Compute simulation, check to make sure a simulation has not already been run and results exists in your project. To review this, head over to your projects folder and open the runs folder. Delete the simulation folder if one exists in both the controller and compute machine. DHO-96 - Getting issue details... STATUS
Splitting up Jobs with a distributed compute
There are two main options for splitting up a job in a distributed compute:
- Split by Realization
- Split by Lifecycle
These options allow for customization of how long of a job is sent to any given compute node. In general we advise to split the jobs by lifecycle so that more compute nodes can be leveraged.