Time: October 30th, 2024

"Which company offers the most advanced data centre routing and networking technology?"

Data Centre Network Interconnection Technology

To enable large-scale Layer 2 communication between virtual machines (VMs) and containers (Docker) in data centres, various networking technologies have emerged. These include technologies based on routing protocols, such as Transparent Interconnection of Lots of Links (TRILL) and Shortest Path Bridging (SPB), as well as overlay technologies such as Virtual Extensible LAN (VXLAN) and Network Virtualization Using Generic Routing Encapsulation (NVGRE). However, these technologies have not been widely adopted, owing to their complexity and the uneven capabilities of networking equipment.

Today, data centre (IDC) networks are returning to their original simplicity and becoming independent of business operations. Simplicity and reliability are now the primary requirements: the data centre only needs to offer a simple, reliable Layer 3 underlay network, while the Layer 2 overlay is handled by host-side software or smart network cards.

So, how do you choose a suitable routing protocol for the layer 3 networking of the data centre? This article focuses on the large data centre scenario and aims to provide a definitive answer.


IDC Network Architecture Evolution

Just as the economic base shapes everything built on top of it, the physical network architecture of a data centre largely determines how its routing protocols are planned. Before designing the architecture, it is worth reading "Technology Feast | Internet Data Centre Network 25G Network Architecture Design." This article gives only a brief introduction to IDC network architecture, enough to clarify the relationship between the infrastructure and the choice of routing protocol.

Traditional architecture of data centre networks

Figure 1: Traditional data centre network architecture (internal, excluding gateway area)

Figure 1 depicts the network architecture of a conventional data centre:
● Traditional IDCs primarily offer services accessed from outside the data centre.
● Traffic follows the 80/20 model: mainly north-south, with minimal east-west flow.
● The design is a three-tier structure of core, aggregation, and access. Below the aggregation layer sits a large Layer 2 network, while aggregation and core devices are paired horizontally using the manufacturer's proprietary virtualization technology to ensure reliability.
● The traffic bottleneck is at the egress, so a high convergence ratio (10:1 or even greater) is acceptable inside the IDC.

In recent years, the deployment of cloud computing, big data, and other business technologies has led to the widespread use of distributed computing, distributed storage, and other technologies within IDCs. From a networking standpoint, there has been a significant increase in east-west traffic within IDCs, leading to a shift from the traditional 80/20 traffic model to a model dominated by east-west traffic.

Hence, the traditional network architecture began to show serious drawbacks:
1. Poor scalability: The network size is limited by the number of core switch ports, making smooth capacity expansion difficult.
2. High convergence ratio: The bandwidth plan, narrowing like a triangle toward the core, was designed for north-south traffic, so performance drops as traffic grows and east-west bandwidth is seriously insufficient.
3. Complex operation and maintenance due to a single control plane: The reliability of the aggregation and core layers depends on the manufacturer's horizontal virtualization technology, whose single control plane has clear disadvantages and makes upgrading without interrupting business (In-Service Software Upgrade, ISSU) difficult.

Fabric network architecture
To address the challenges faced by traditional IDC networks, a new approach, the Fabric network architecture, has gradually emerged.

Fabric is a familiar concept for network engineers. A chassis switch based on the CLOS architecture relies on its fabric (switch fabric cards) as the forwarding bridge between the line cards inside the device, as shown below in Figure 2.

Figure 2: IDC network architecture design - Network as a Fabric

The fabric networking architecture widely used in contemporary data centres closely resembles a CLOS-based switch.
● Line card: the input and output stage, aggregating the traffic of all servers; equivalent to the top-of-rack (TOR) switch of an IDC.
● Fabric card: the high-speed forwarding stage in the middle layer, through which cross-TOR traffic is rapidly forwarded.

Folding Figure 2 in half reveals the Leaf-Spine network architecture, which is widely used in data centres today.


Figure 3: Leaf-Spine network architecture

In an IDC, the Leaf-Spine network is built around the smallest delivery unit, the POD (Point of Delivery). To enhance scalability, a further layer is typically added above the PODs to interconnect them horizontally and expand the scale of the entire data centre cluster.

Leaf-Spine architecture is highly praised for its powerful scale-out capability, extremely high reliability, and excellent maintainability. Well-known global Internet giants all use this networking architecture.
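
The scale-out property is easy to see in code: because every leaf connects to every spine, the number of equal-cost paths between any two leaves equals the number of spines, so capacity grows simply by adding spines. Below is a minimal sketch of this, assuming the networkx graph library and hypothetical device names (neither appears in the original article):

```python
import networkx as nx
from itertools import product

SPINES = [f"spine{i}" for i in range(1, 5)]   # 4 spine switches
LEAVES = [f"leaf{i}" for i in range(1, 9)]    # 8 leaf (TOR) switches

fabric = nx.Graph()
# Every leaf connects to every spine: the folded CLOS of Figure 3.
fabric.add_edges_from(product(LEAVES, SPINES))

# Any leaf-to-leaf flow has one equal-cost two-hop path per spine,
# which is what makes ECMP load balancing natural in this topology.
paths = list(nx.all_shortest_paths(fabric, "leaf1", "leaf2"))
print(len(paths))    # -> 4 (one path per spine)
print(paths[0])      # -> e.g. ['leaf1', 'spine1', 'leaf2']
```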


What routing protocol does the Fabric network architecture use?


Figure 4: Large data centre network built using Fabric

Facebook made its data centre network design public in 2014 and has since evolved it from F4 to F16 while keeping the same basic architecture as Figure 4: a typical Fabric network. The question arises: which routing protocol is best suited to the Fabric network architecture?

In RFC 7938, "Use of BGP for Routing in Large-Scale Data Centers," the authors propose using Border Gateway Protocol (BGP) as the sole routing protocol within the data centre and provide a detailed analysis. For further details, please refer to the original RFC.

Combining this RFC with the current practices of domestic and foreign Internet companies using BGP networking, let's analyze why BGP is more popular.

Design Principles for Large IDC Network Routing
Routing design is a crucial aspect of data centre network design, and its concept should align with the overall principles of the data centre. The key design points are as follows:

1. Scalability
Design considerations for data centre:
● Scalability must be considered from the start. Large Internet companies operate campuses with more than 300K servers, and many large campuses host between 20K and 100K. The network design should therefore support smooth scale-out: delivering the network POD by POD reduces the initial investment while preserving the ability to grow to large and ultra-large clusters.
Design considerations for routing protocols:
● Super-large data centres contain thousands of network devices, with a typical switch-to-server ratio of around 1:20. Routing design should prioritize consistency, simplicity, and ease of use, and routes should propagate and converge quickly whether the routing domain is a small initial build-out or fully built.

2. Bandwidth and Traffic Model
Design considerations for data centre:
● The volume of east-west traffic in data centres has significantly increased, and the traditional high convergence ratio model for data centres can no longer meet the demand for east-west traffic.
● In the new network architecture, it is important to design for a low convergence ratio (Microsoft has even deployed over-provisioned networks in which uplink bandwidth exceeds downlink bandwidth).
● Considering the cost-effectiveness of network construction, we recommend deploying a convergence ratio of 1:1 to 3:1 per level.
Design considerations for routing protocols:
● For Fabric networks, a low convergence ratio is achieved mainly by load-sharing across multiple uplinks. For example, the typical 25G TOR switch RG-S6510-48VS8CQ has a downlink bandwidth of 48 × 25 Gbps = 1200 Gbps and an uplink bandwidth of 8 × 100 Gbps = 800 Gbps; with all ports in use, the convergence ratio is 1.5:1 (see the sketch after this list).
● An important aspect of data centre routing design is the ability to easily implement Equal-Cost Multi-Path (ECMP) routing between multiple links in the data centre. Under normal circumstances, ECMP multi-links can evenly distribute traffic, and when links are added or removed, they can quickly converge without affecting existing network services.
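
For illustration, the convergence-ratio arithmetic and the per-flow hashing behind ECMP can be sketched in a few lines of Python. The helper names are invented, and md5 is only a stand-in for the hardware hash a real switch ASIC applies to the packet's 5-tuple:

```python
import hashlib

# Convergence (oversubscription) ratio of the 25G TOR example above:
# downlink 48 x 25G, uplink 8 x 100G. Port counts are from the text.
def convergence_ratio(down_ports, down_gbps, up_ports, up_gbps):
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(convergence_ratio(48, 25, 8, 100))   # -> 1.5, i.e. 1.5:1

# ECMP in one line: hash the flow 5-tuple onto one of N equal-cost
# uplinks, so packets of one flow always take the same path while
# different flows spread across all links.
def ecmp_pick(five_tuple, n_links):
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

flow = ("10.0.1.7", "10.0.2.9", 6, 40312, 443)  # src, dst, proto, sport, dport
print(ecmp_pick(flow, 8))   # same flow always maps to the same uplink
```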

3. CAPEX Minimization
Design considerations for data centre:
● Minimize capital expenditure by standardizing software and hardware requirements for network devices and reducing the variety of device types based on a unified architecture.
● Simplify network feature requirements to reduce R&D costs and time.
Design considerations for routing protocols:
● Use mature, widely accepted routing protocols supported on mainstream models, covering access, core, and backbone devices.

4. OPEX Minimization
Design considerations for data centre:
● Minimize operating costs: The operating costs of large data centre networks are often higher than the construction costs of infrastructure. It's important to consider reducing operating costs at the beginning of architectural design.
Design considerations for routing protocols:
● Reduce the size of failure domains in the network: When a network failure occurs, it's important to minimize the impact of routing convergence and ensure fast convergence time.
● Use only one routing protocol for the entire data centre: This simplifies operation and maintenance, reduces learning costs, and helps accumulate operational knowledge to quickly locate and restore faults.

Selecting the appropriate routing protocol for large Internet Data Centre (IDC) networks

1. "Essential capabilities required by routing protocols"
Based on the analysis of the routing protocol design points in the previous article, it is concluded that large-scale IDC routing protocols must have the following capabilities:
● Scalability: The routing protocol should support horizontal expansion to accommodate ultra-large-scale data centres, from initial construction through to the final build-out.
● Simplicity: Opt for a simple, mature, and universal routing protocol with fewer software features. This will allow for a wider selection of equipment manufacturers.
● Single Protocol: Aim to use a single routing protocol within data centres to minimize complexity, reduce learning costs, and streamline operational experience.
● Fault Isolation: In the event of a fault, limit the impact area to enhance network robustness.
● Load Balancing: Establish equal-cost multi-paths within the data centre without relying on dedicated load-balancing equipment.
● Policy Control Flexibility: Provide various routing policy control methods to meet specific business flow requirements.
● Rapid Convergence: Ensure quick fault impact reduction and convergence in case of network faults.

2. "Existing Routing Protocol Matching."
Let's delve into the extent of compatibility among the current routing protocols.
● Routing Information Protocol (RIP): Not suitable for large-scale data centres.
● Enhanced Interior Gateway Routing Protocol (EIGRP): A private protocol that does not meet requirements 2 and 3.
● Interior BGP Protocol (IBGP): Generally needs to be used together with the Interior Gateway Protocol (IGP), which does not meet requirements 2 and 3.
● Open Shortest Path First (OSPF), Intermediate System to Intermediate System (ISIS), BGP: Apparently, these three routing protocols can meet all the requirements of 1-7. Among them, ISIS and OSPF are both link-state IGP protocols with high similarity. Consequently we choose OSPF, which is more widely used, for comparison. The following focuses on the analysis of OSPF and BGP routing protocols.

3. OSPF vs BGP
The following are Wikipedia's definitions of the OSPF and BGP protocols:
● OSPF (Open Shortest Path First) is a link-state routing protocol. It is classified as an interior gateway protocol (IGP) and operates within a single autonomous system. OSPF uses Dijkstra's algorithm to calculate the shortest-path tree, with "cost" as its routing metric. The link-state database (LSDB) stores the current network topology, and routers within the same area hold identical link-state databases.
● BGP (Border Gateway Protocol), by contrast, is the core decentralized routing protocol of the Internet. It provides reachability between autonomous systems (AS) by maintaining IP routing ("prefix") tables. BGP does not use traditional IGP metrics; instead, it makes routing decisions based on paths, network policies, and rule sets, which is why it is sometimes described as a path-vector or reachability protocol rather than a conventional routing protocol.
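
To make the contrast concrete, the computation OSPF performs is exactly a shortest-path-first run over the LSDB. Here is a minimal Dijkstra sketch with an invented three-router topology; BGP, by contrast, never sees the whole graph, since each router learns only its neighbours' chosen best paths and the AS path they carry:

```python
import heapq

# OSPF's route computation: Dijkstra's shortest-path-first over a
# cost-weighted graph built from the LSDB (topology invented here).
def dijkstra(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        for neighbor, cost in graph[node]:
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

lsdb = {                                  # adjacency: node -> [(peer, cost)]
    "R1": [("R2", 10), ("R3", 1)],
    "R2": [("R1", 10), ("R3", 1)],
    "R3": [("R1", 1), ("R2", 1)],
}
print(dijkstra(lsdb, "R1"))               # -> {'R1': 0, 'R2': 2, 'R3': 1}
```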

It's important to note that OSPF and BGP are both widely used routing protocols, and neither is inherently superior or inferior. In the context of large or super-large data centres, it's crucial to analyze the applicability of these two routing protocols.

Protocol basics
● OSPF: Routing algorithm: Dijkstra's algorithm. Type: link state. Carried directly over IP.
● BGP: Routing algorithm: best-path selection. Type: distance (path) vector. Carried over TCP, whose retransmission mechanism ensures reliable delivery of protocol data.

Requirement 1: Large-scale networking
● OSPF, ★★★: In theory there is no hop limit, so large routed networks are possible. However, OSPF must regularly synchronize link-state information across the whole network; in ultra-large data centres the link-state database becomes very large, route computation consumes significant device resources, and network fluctuations have a significant impact on the whole area.
● BGP, ★★★★★: Advertises only the calculated best routes, so it suits large and super-large data centres; mature practice exists in super-large campuses.

Requirement 2: Simplicity
● OSPF, ★★★: Simple to deploy; moderate operation and maintenance effort.
● BGP, ★★★★: Easy to deploy and maintain.

Requirement 3: Deploy a single routing protocol in the IDC
● OSPF, ★★★★: The IDC can run OSPF as its single internal routing protocol, with rich software support on servers.
● BGP, ★★★★: The IDC can likewise run BGP as its single internal routing protocol, and server-side software also supports BGP interconnection with external autonomous systems.

Requirement 4: Reduce failure domains
● OSPF, ★★: Link-state information must be synchronized within the area, so every failure is flooded to all devices in the domain.
● BGP, ★★★★: Each router advertises only its locally calculated best paths; when the network changes, only incremental updates are transmitted.

Requirement 5: Load balancing
● OSPF, ★★★★: With planned cost values, ECMP forms over multiple links; when a link fails, all devices in the area must recompute.
● BGP, ★★★★★: With hop counts and AS numbering planned, ECMP forms over multiple links; when a link fails, only the next hop corresponding to that link is removed from the ECMP group.

Requirement 6: Flexible control
● OSPF, ★★★: Areas and LSA types can control route propagation, but the process is complex.
● BGP, ★★★★: A rich set of routing policies filters and controls the advertisement and acceptance of routes.

Requirement 7: Fast convergence
● OSPF, ★★★: With few routes, millisecond-level convergence is achievable through BFD linkage. However, what is advertised is link-state information, so in a large routing domain computation grows and convergence slows.
● BGP, ★★★★: With few routes, millisecond-level convergence is likewise achievable through BFD linkage. What is advertised are locally computed best routes, so performance holds up even in a large routing domain; BGP also offers AS-based fast switchover.

Table 1: Comparison of routing protocols for large data centres

Based on our analysis of the table and industry practices, we recommend using the OSPF protocol for small and medium-sized data centres with a small number of network devices in the routing domain. For large or super-large data centres, it's more suitable to deploy the BGP routing protocol.
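
As a preview of the planning questions raised in the summary below, here is a rough sketch of an RFC 7938-style ASN layout: each TOR receives its own private ASN while the spine layer of each POD shares one, so AS-path loop detection works naturally. The function and numbering scheme are illustrative assumptions, not a recommendation from the article:

```python
PRIVATE_ASN_START = 64512          # 16-bit private range: 64512-65534

def plan_asns(n_pods, tors_per_pod):
    """Allocate private ASNs: one per TOR, one shared per POD's spines."""
    asn = PRIVATE_ASN_START
    plan = {}
    for pod in range(1, n_pods + 1):
        plan[f"pod{pod}/spine"] = asn  # spines of a POD share an ASN
        asn += 1
        for tor in range(1, tors_per_pod + 1):
            plan[f"pod{pod}/tor{tor}"] = asn
            asn += 1
    if asn - 1 > 65534:
        raise ValueError("16-bit private ASN range exhausted; "
                         "consider 32-bit private ASNs (RFC 6996)")
    return plan

print(plan_asns(2, 3))
# {'pod1/spine': 64512, 'pod1/tor1': 64513, ..., 'pod2/tor3': 64519}
```

The 1023 available 16-bit private ASNs are exactly the constraint behind question 1 below.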


Summary

This article provides an overview of why large Internet Data Centres (IDCs) prefer using the BGP routing protocol for networking, without delving into the specific planning of the BGP protocol. Ruijie Networks has constructed large and super-large data centre networks for the top three Internet companies in China, utilizing the BGP routing protocol. Regarding the specific planning of the BGP routing protocol, here are some questions I would like to discuss in upcoming articles:
1. How should AS numbers be planned for large data centres, given that BGP private AS numbers are limited?
2. Which interfaces should BGP use to establish neighbour sessions, and how should these be planned in ECMP/LACP scenarios?
3. How can we reasonably utilize the various routing principles of BGP?
4. What methods can be employed to optimize BGP performance, reliability, and convergence speed?



Related Blogs:

Exploration of Data Center Automated Operation and Maintenance Technology: Zero Configuration of Switches
Technology Feast | How to De-Stack Data Center Network Architecture
Technology Feast | A Brief Discussion on 100G Optical Modules in Data Centers
Research on the Application of Equal Cost Multi-Path (ECMP) Technology in Data Center Networks
Technology Feast | How to build a lossless network for RDMA
Technology Feast | Distributed VXLAN Implementation Solution Based on EVPN
Exploration of Data Center Automated Operation and Maintenance Technology: NETCONF
Technical Feast | A Brief Analysis of MMU Waterline Settings in RDMA Network
Technology Feast | Internet Data Center Network 25G Network Architecture Design
Technology Feast | The "Giant Sword" of Data Center Network Operation and Maintenance
