QoS policies

Qnet supports transmission over multiple networks and provides several policies for specifying how Qnet should select a network interface for transmission.

These Quality of Service policies include:

loadbalance (the default)
Qnet is free to use all available network links, and will share transmission equally among them.
preferred
Qnet uses one specified link, ignoring all other networks (unless the preferred one fails).
exclusive
Qnet uses one—and only one—link, ignoring all others, even if the exclusive link fails.

To fully benefit from Qnet's QoS, you need to have physically separate networks. For example, consider a network with two nodes and a hub, where each node has two connections to the hub:

Figure 1. Qnet and a single network.

If the link that's currently in use fails, Qnet detects the failure, but doesn't switch to the other link because both links go to the same hub. It's up to the application to recover from the error; when the application reestablishes the connection, Qnet switches to the working link.

Now, consider the same network, but with two hubs:

Figure 2. Qnet and physically separate networks.

If the networks are physically separate and a link fails, Qnet automatically switches to another link, depending on the QoS that you chose. The application isn't aware that the first link failed.

You can use the tx_retries option to lsm-qnet.so to limit the number of times that Qnet retries a transmission, and hence control how long Qnet waits before deciding that a link has failed. Note that if the number of retries is too low, Qnet won't tolerate any lost packets and may prematurely decide that a link is down.

Let's look in more detail at the QoS policies:

loadbalance
Qnet decides which links to use for sending packets, depending on current load and link speeds as determined by io-pkt*. A packet is queued on the link that can deliver the packet the soonest to the remote end. This effectively provides greater bandwidth between nodes when the links are up (the bandwidth is the sum of the bandwidths of all available links), and allows a graceful degradation of service when links fail.

If a link does fail, Qnet will switch to the next available link. This switch takes a few seconds the first time, because the network driver on the bad link will have timed out, retried, and finally died. But once Qnet “knows” that a link is down, it will not send user data over that link.

While load-balancing among the live links, Qnet will send periodic maintenance packets on the failed link in order to detect recovery. When the link recovers, Qnet places it back into the pool of available links.

preferred
With this policy, you specify a preferred link to use for transmissions. Qnet will use only that one link until it fails. If your preferred link fails, Qnet will then turn to the other available links and resume transmission, using the loadbalance policy.

Once your preferred link is available again, Qnet will again use only that link, ignoring all others (unless the preferred link fails).

exclusive
You use this policy when you want to lock transmissions to only one link. Regardless of how many other links are available, Qnet will latch onto the one interface you specify. And if that exclusive link fails, Qnet will NOT use any other link.

Why would you want to use the exclusive policy? Suppose you have two networks, one much faster than the other, and you have an application that moves large amounts of data. You might want to restrict transmissions to only the fast network in order to avoid swamping the slow network under failure conditions.

Note: The loadbalance QoS policy is the default.