When you set up a Microsoft compute cluster, you have a choice of five supported network topologies:
1. All nodes on public network only
2. All nodes on public and private network
3. Compute nodes isolated on private network
4. All nodes on public, private and MPI networks
5. Compute nodes isolated on private and MPI networks
Which one is suitable for your applications?
N.1 is often found in large installations, typically because dedicated private and MPI networks would be prohibitively expensive at that scale. I have seen several such examples in the financial world. In this case, the applications can often be characterized as parametric sweeps: they do not require communication among the tasks, and they use the network just for status updates and maybe for limited data transfer. If you opt for this topology, keep in mind that you won't be able to use the built-in RIS-based deployment method, as it requires a private network. You will need an alternative deployment solution (e.g. a "corporate" installation of Windows Deployment Services).
N.2 has an obvious advantage: it separates management and MPI traffic from the “normal” traffic on the public network. Because no dedicated MPI network is available, MPI communications take place on the private (management) network. This topology is suitable for most applications that source data from the public network and require "occasional" MPI traffic, i.e. they send MPI packets infrequently, mostly in bursts and of relatively large size. The reason for that is twofold:
- Separation from public traffic helps prevent saturation of the management network, which may otherwise cause your jobs to fail. The head node checks whether the compute nodes are alive every 60 seconds. After 3 failed attempts, it marks them as unreachable. If a job was scheduled on them, it is considered failed. The compute nodes also try to update the status of their tasks. If they can’t reach the head node, the task fails.
- Similarly, one must avoid saturating the management network with MPI packets, both to limit the risk of job failure and to keep performance at an acceptable level.
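To make the timing concrete, here is a minimal Python sketch of the failure-detection rule described above. It assumes that the 3 attempts are consecutive checks spaced one interval apart, which the text does not spell out. The function names are hypothetical and this is not the scheduler's actual code.

```python
HEARTBEAT_INTERVAL_S = 60  # interval between head-node checks (from the text)
MAX_MISSED_CHECKS = 3      # failed attempts before a node is marked unreachable


def is_unreachable(check_results):
    """Return True once the most recent MAX_MISSED_CHECKS checks all failed.

    check_results is a list of booleans, oldest first; True means the
    compute node answered the check. Assumes failures must be consecutive.
    """
    recent = check_results[-MAX_MISSED_CHECKS:]
    return len(recent) == MAX_MISSED_CHECKS and not any(recent)


def max_detection_delay_seconds():
    """Longest delay between the onset of saturation and the node being
    marked unreachable, under the assumption stated above."""
    return HEARTBEAT_INTERVAL_S * MAX_MISSED_CHECKS
```

Under these assumptions, about three minutes of sustained saturation on the management network is enough to lose the jobs scheduled on the affected nodes.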
N.3 again allows you to isolate cluster traffic from the general network, but it creates an obvious bottleneck at the head node, which has to act as a router as well. You may still find the performance of this topology acceptable for most parametric sweeps. In this case you will have to pay particular attention to provisioning the data for the computation. If the amount of data is relatively large (e.g. several MB per node) and it resides on the public network, this layout is not suitable. On the other hand, if the data is stored on a parallel file system that the nodes access via an independent connection, then this layout may be workable.
N.4 is the ideal case. All nodes can use the public network to access file shares or databases for input data. The management traffic is isolated on its own network, which limits congestion and enables RIS deployment. MPI has dedicated bandwidth over its own network, possibly with low latency (e.g. InfiniBand). Note that you may choose this topology even if you just have a third GbE network for MPI; InfiniBand is not a requirement. Also keep in mind that you’ll have to provide an addressing mechanism for the third network (DHCP or fixed addresses). The public network is usually handled by corporate DHCP and DNS servers. The private network can use ICS (Internet Connection Sharing) from the head node, but out of the box the compute cluster management tools in CCS v1 offer no way to assign addresses on the MPI network.
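If you go the fixed-address route for the MPI network, a small script can at least generate a consistent address plan for you to apply. The following is a sketch only: the subnet 10.0.2.0/24 and the node names are made-up examples, and the script does not configure any adapter, it just computes the mapping.

```python
import ipaddress


def mpi_address_plan(node_names, subnet="10.0.2.0/24"):
    """Map each node name to a fixed address on a (hypothetical) MPI subnet.

    Addresses are handed out in order, starting from the first usable
    host address of the subnet.
    """
    # hosts() skips the network and broadcast addresses of the subnet
    hosts = iter(ipaddress.ip_network(subnet).hosts())
    plan = {}
    for name in node_names:
        try:
            plan[name] = str(next(hosts))
        except StopIteration:
            raise ValueError(
                f"subnet {subnet} is too small for {len(node_names)} nodes"
            ) from None
    return plan


if __name__ == "__main__":
    # Example node names are placeholders, not CCS defaults
    for node, address in mpi_address_plan(["HEADNODE", "NODE01", "NODE02"]).items():
        print(node, address)
```

Whether you then apply these addresses by hand or through a separate DHCP server on the MPI network is a choice that depends on your site.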
N.5 is often used for applications that require low communication latency and high bandwidth for parallel processing, together with high storage throughput. Data is usually stored in a parallel file system (which may be hosted on a SAN), not on shares on the public network. So in this case the compute nodes do not need a public network connection, and you may be able to save some money on networking hardware without compromising performance.
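The five topologies differ only in which networks exist and whether the compute nodes are attached to the public one. As a quick summary, here is a small Python sketch that maps those three facts to the topology number used in this post. The function and parameter names are my own and are not part of any CCS tool.

```python
def topology_number(has_private, has_mpi, compute_on_public):
    """Return the topology number (1-5) for a given network layout.

    has_private:       a private (management) network exists
    has_mpi:           a dedicated MPI network exists
    compute_on_public: compute nodes are attached to the public network
    """
    if not has_private:
        # Without a private network, only topology 1 is in the supported list
        if has_mpi or not compute_on_public:
            raise ValueError("not one of the five supported topologies")
        return 1
    if has_mpi:
        return 4 if compute_on_public else 5
    return 2 if compute_on_public else 3


def supports_ris_deployment(topology):
    """RIS-based deployment needs a private network, so only topology 1 lacks it."""
    if topology not in (1, 2, 3, 4, 5):
        raise ValueError("topology must be between 1 and 5")
    return topology != 1
```

For example, a cluster with private and MPI networks whose compute nodes have no public connection maps to topology 5, and RIS deployment remains available there.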