"We hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of datacenter networks".
Flow-consistent routing is the constraint that packets for a given TCP 4-tuple get routed through the same network path, rather than balanced across all viable paths; locking a flow to a particular path makes it unlikely that segments will arrive out of order at the destination, which TCP handles poorly.
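For intuition, here is a toy sketch of how ECMP-style flow-consistent routing works: hash the 4-tuple, take it modulo the number of equal-cost paths. The function name and hash choice are illustrative, not any particular switch's implementation.

```python
import hashlib

def ecmp_path(src_ip, src_port, dst_ip, dst_port, n_paths):
    """Toy ECMP: hash the TCP 4-tuple to pick one of n_paths uplinks.
    Every packet of a flow hashes identically, so the flow is pinned
    to a single path for its lifetime."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths

# Same 4-tuple -> same path, every time.
a = ecmp_path("10.0.0.1", 40000, "10.0.1.2", 443, 8)
b = ecmp_path("10.0.0.1", 40000, "10.0.1.2", 443, 8)
assert a == b
```

The determinism is the whole point: no per-flow state is needed at the switch, but it also means two elephant flows that happen to hash alike stay stuck on the same link.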
Conversely, if you send the traffic over all routes, there is no way to keep one server from monopolizing the network, because each route is oblivious to the stress currently being experienced by its peers. It has to set a policy using local data, not global data.
The usual failure mode for clever people thinking about software is taking their third-person omniscient view of the system state and assuming they can write software that replicates what a human would do in that situation. We are still so very far from human-level intuition and reasoning.
Ultimately one server cannot inject more than one link worth of traffic (e.g. 100 Gbps) into the network which is a tiny fraction of total capacity. Researchers have gotten really good results with "spray and pray" for sub-RTT flows combined with latency and queue depth feedback for multi-RTT flows.
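As a sketch of that hybrid scheme (spray for sub-RTT flows, feedback for long flows): the function, threshold, and queue-depth feed below are all hypothetical, just to show the two regimes.

```python
import random

def pick_path(flow_bytes_sent, bdp_bytes, paths, queue_depth):
    """Hypothetical hybrid router policy:
    - sub-RTT flows (less than one bandwidth-delay product of data):
      spray packets across all paths at random ("spray and pray");
    - multi-RTT flows: steer to the least-loaded path using
      queue-depth feedback."""
    if flow_bytes_sent < bdp_bytes:
        return random.choice(paths)          # spray and pray
    return min(paths, key=lambda p: queue_depth[p])  # feedback-driven

paths = ["up0", "up1", "up2", "up3"]
depth = {"up0": 12, "up1": 3, "up2": 7, "up3": 9}
pick_path(2_000_000, 1_250_000, paths, depth)  # long flow -> "up1"
```

Short flows finish before any ordering or feedback matters, so random spraying is nearly free for them; only the long-lived flows need the smarter placement.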
Spray and pray sounds like a reasonable fit for UDP, no?
We’ve had these sorts of bottlenecks before, and they didn’t last. It’s always possible something fundamental changed, but it’s also possible that we are doing something wrong at the motherboard or OS level, and adopting new solutions puts us right back in that space where a couple of servers can easily saturate a network.
If a network card can move data as fast or faster than the main memory bus on a computer then what are we even doing? Should we be treating each subsystem as a special purpose computer and turn the bus into a network switch?
And we could totally construct systems that take some approximation of a global internet state into local routing decisions. But that might devalue some incumbent player's position in the market (or create a new privileged set of players) so even if we made a POC, it wouldn't get adopted.
This is true, and the congestion mentioned here was subtle and not called out - typically flows are handled in a stateless manner by load balancers that hash on some set of MAC/IP/port features of the packet. This is where congestion occurs, and the paper mentions it here:
"All that is needed for congestion is for two large flows to hash to the same intermediate link; this hot spot will persist for the life of the flows and cause delays for any other messages that also pass over the affected link."
It makes logical sense, but I'd love to see the evidence for this.
It all depends on the application and overall use of the network.
With sufficient flows and a mix of sizes it’ll still tend to even out. But if you have significant high-throughput, long-lived flows, this is definitely something you might hit.
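The "two large flows hash to the same link" scenario is just the birthday problem, and a quick Monte Carlo (parameters are illustrative) shows it is not a rare corner case:

```python
import random

def collision_prob(n_flows, n_links, trials=20_000):
    """Monte Carlo estimate of the chance that at least two flows
    hash onto the same intermediate link (the birthday problem)."""
    hits = 0
    for _ in range(trials):
        links = [random.randrange(n_links) for _ in range(n_flows)]
        if len(set(links)) < n_flows:
            hits += 1
    return hits / trials

# Even with 64 equal-cost links, 8 elephant flows collide
# surprisingly often - around 0.37 of the time.
collision_prob(8, 64)
```

So with only a handful of long-lived elephant flows, odds are decent that two of them share a link for their entire lifetime, which is exactly the hot spot the paper describes.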