Qnet supports transmission over multiple networks and provides several policies for specifying how Qnet should select a network interface for transmission.
These Quality of Service policies include:
To fully benefit from Qnet's QoS, you need to have physically separate networks. For example, consider a network with two nodes and a hub, where each node has two connections to the hub:
If the link that's currently in use fails, Qnet detects the failure, but doesn't switch to the other link because both links go to the same hub. It's up to the application to recover from the error; when the application reestablishes the connection, Qnet switches to the working link.
Now, consider the same network, but with two hubs:
If the networks are physically separate and a link fails, Qnet automatically switches to another link, depending on the QoS that you chose. The application isn't aware that the first link failed.
You can use the tx_retries option to lsm-qnet.so to limit the number of times that Qnet retries a transmission, and hence control how long Qnet waits before deciding that a link has failed. Note that if the number of retries is too low, Qnet won't tolerate any lost packets and may prematurely decide that a link is down.
Let's look in more detail at the QoS policies:
If a link does fail, Qnet will switch to the next available link. This switch takes a few seconds the first time, because the network driver on the bad link will have timed out, retried, and finally died. But once Qnet knows that a link is down, it will not send user data over that link.
While load-balancing among the live links, Qnet will send periodic maintenance packets on the failed link in order to detect recovery. When the link recovers, Qnet places it back into the pool of available links.
Once your preferred link is available again, Qnet will again use only that link, ignoring all others (unless the preferred link fails).
Why would you want to use the exclusive policy? Suppose you have two networks, one much faster than the other, and you have an application that moves large amounts of data. You might want to restrict transmissions to only the fast network in order to avoid swamping the slow network under failure conditions.