Preface
In the previous article titled “Technology Feast: Routing Protocol Selection for Large Data Centre Networks”, it was emphasized that the Border Gateway Protocol (BGP) has emerged as the preferred routing protocol for large data centres (IDC). Originally designed for intercommunication between different autonomous systems, BGP was initially unsuitable for IDC and posed numerous challenges. In response to these issues, what optimizations have savvy network engineers implemented for BGP? What key considerations must be factored into the BGP network planning for data centres? This article draws on the practical experience of large Internet companies both domestically and internationally and offers a thorough analysis.
Large Data Centre Network Architecture
▲ Figure 1: Spine-leaf network architecture of a large data centre (intranet)
In order to meet the high-reliability demands of data centre services, a common design approach for modern data centre networks is to assume that network devices and links are unreliable. The goal is to ensure that when these unreliable devices or links fail, any negative impact on business operations can be minimized through self-healing mechanisms. As a result, the Leaf-Spine (Leaf: leaf node, Spine: spine node) networking architecture has become the standard for data centres. This CLOS multi-level switching network design creates a large number of equivalent devices and paths, effectively eliminating single points of failure. The network architecture offers high reliability, exceptional performance, and robust horizontal expansion (scale-out) capabilities.
In this type of data centre architecture, the BGP routing protocol is often deployed across all layers of the CLOS network (including TOR, Leaf, Spine, and other devices in Figure 1) to create a simple, unified, and extremely large-scale network for the data centre. When deploying BGP, it's important to ensure that it meets the basic requirements for IPv4 and IPv6 routing transmission, as well as rapid convergence, flexible control, and expedient operation and maintenance capabilities.
"BGP Deployment Design Considerations"
The purpose of this article is to provide reference methods for the BGP routing deployment design of IDC, focusing on the underlay routing design inside IDC.
▲ Figure 2: Data centre BGP deployment design considerations
In a typical three-tier CLOS data centre network, the BGP design points can be roughly divided into two parts:
1. BGP basic capability planning includes:
● Planning AS numbers for Tier 1-3 devices
● Configuring basic BGP parameters and establishing BGP neighbours between devices
● Generating ECMP equal-cost routes for the CLOS network
● Controlling the routing attributes of different types of BGP routes
● Formulating routing rules
● Providing IPv4/IPv6 dual-stack capability
2. BGP operation and maintenance capability planning includes:
● Utilize the Bidirectional Forwarding Detection (BFD) protocol to expedite fault convergence.
● Ensure uninterrupted business capabilities.
BGP Basic Capability Planning
1. AS Number Planning
BGP's AS numbers are divided into public and private AS numbers. Although AS numbers are not announced to external networks from within the IDC, it is still recommended to use private AS numbers for security and to follow established practice.
The earlier BGP specification (RFC 1771) assigns AS numbers a length of 2 bytes, of which 1023 values (64512~65534) are reserved for private use. This is insufficient for the tens of thousands of network elements in large IDCs. There are currently two solutions to this problem:
● The newer "BGP Support for Four-octet AS Number Space" (RFC 6793) defines a 4-byte BGP AS number, expanding the AS number space to be as plentiful as IPv4 addresses. This provides roughly 95 million numbers (4,200,000,000~4,294,967,294) available for private use, which is adequate for assigning an independent AS number to each network device or even each host in the IDC.
● For simplicity, and to ensure that all devices can support them, it is recommended to use 2-byte private AS numbers from 64512 to 65534 and plan them globally; the same AS number can be reused by multiple devices.
"The following is an example of a recommended AS number assignment:
Device Role
|
AS Planning Principles
|
Allocation method
|
Allocation Example
|
TOR
|
Unique within a Pod, can be repeated in different P0Ds (TOR's AS number will not be transferred across PODs
|
65000+TOR number
|
The first TOR is 65001 and the second TOR is 65002
|
Leaf
|
All Leafs in the same P0D have the same number.
|
64700+P0D No.
|
The first POD is 64701 and the second POD is 64702
|
SPINE
|
The only one in MAN
|
Global planning reservation
|
For example, XX cluster/park 64601
|
MAN
|
Intranet only
|
Global planning reservation
|
For example, XXMAN is 64513
|
▲ Figure 3: IDC AS number assignment example
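The allocation rules in the table above can be expressed as a small sketch. The base offsets follow the example plan; the helper names are ours, not any vendor tooling:

```python
# Illustrative sketch of the AS-number plan from the table above.
# Offsets follow the example allocation; function names are hypothetical.

TOR_BASE = 65000    # TOR: 65000 + TOR number, unique within a POD
LEAF_BASE = 64700   # Leaf: 64700 + POD number, shared by all Leafs in a POD

def tor_asn(tor_number: int) -> int:
    """AS number for a TOR; the same number may repeat across PODs."""
    asn = TOR_BASE + tor_number
    assert 64512 <= asn <= 65534, "outside the 2-byte private AS range"
    return asn

def leaf_asn(pod_number: int) -> int:
    """AS number shared by every Leaf in the given POD."""
    asn = LEAF_BASE + pod_number
    assert 64512 <= asn <= 65534, "outside the 2-byte private AS range"
    return asn
```

For example, `tor_asn(1)` yields 65001 and `leaf_asn(2)` yields 64702, matching the table.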
2. BGP Basic Parameter Configuration
This section is crucial for data centres to enable BGP intercommunication. The following configuration is highly recommended:
● BGP neighbour establishment
To establish a BGP session, you need to specify an IP address because BGP relies on TCP for connection. It's best to use the device's direct interface address when setting up a BGP session within the IDC.
● BGP Router ID
The Router ID is merely an identifier; assigning it the switch's management port address or loopback address is the recommended approach.
● BGP Timers
BGP uses keepalive messages to maintain sessions and detect neighbour failures. Originally, BGP was created for connecting different autonomous systems (ASs), such as service providers, where the stability of inter-AS routes matters more than fast convergence. To prevent route oscillation, the BGP protocol's default timers are long: keepalive and hold timers of 60 seconds and 180 seconds, respectively.
In a data centre, fast fault convergence is crucial, so it is recommended to use a BGP timer configuration of 1 second for keepalive and 3 seconds for hold timers to speed up convergence. BGP also has another important timer known as the Advertisement Interval, which determines the interval for publishing route announcements. By default, the BGP announcement interval is set to 30 seconds. However, in a data centre environment, immediate announcement of changes is necessary, so the recommended configuration for the Advertisement Interval is 0 seconds.
For Ruijie RGOS software, the BGP timer settings need to be configured within the BGP process.
| Configuration Commands | Notes |
|---|---|
| `timers bgp 1 3` | BGP keepalive/hold time of 1 s / 3 s |
| `neighbor XX advertisement-interval 0` | The interval for sending routing advertisements is 0 seconds |
Other recommended configurations:
`bgp log-neighbor-changes`: logs BGP neighbour state changes without requiring debugging to be enabled.
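The payoff of the shortened timers can be sanity-checked: with keepalives, the worst-case neighbour-failure detection time is bounded by the hold time, and the hold time should be at least three times the keepalive interval. A small illustrative helper (the function name is ours):

```python
def bgp_timer_plan(keepalive_s: int, hold_s: int) -> int:
    """Sanity-check a BGP timer pair and return the worst-case
    neighbour-failure detection time, which is the hold time:
    a neighbour is declared down when the hold timer expires."""
    if hold_s != 0 and hold_s < 3 * keepalive_s:
        raise ValueError("hold time should be at least 3x the keepalive")
    return hold_s

# Defaults vs. the data-centre recommendation above:
default_detect = bgp_timer_plan(60, 180)  # worst case 180 s
tuned_detect = bgp_timer_plan(1, 3)       # worst case 3 s
```

Even the tuned 1 s / 3 s pair only bounds detection at seconds, which is why BFD (discussed later) is still needed for sub-second convergence.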
3. BGP ECMP
In CLOS networks, equal-cost multipath routing is crucial for establishing network reliability and stability.
BGP can form equal-cost routes by enabling the "multipath" feature. For example, in Ruijie RGOS, the following configurations need to be made:
| Configuration Commands | Notes |
|---|---|
| `maximum-paths ebgp 32` | Maximum number of BGP equal-cost routes; 32 is recommended on the TOR and 64 on the Leaf |
The previous section only enables BGP's multipath capability. Next, BGP's route-selection rules must allow the next hops of multiple links into the routing table to form ECMP (Equal-Cost Multi-Path) routes. Under these rules, two routes can be considered equal and load-balanced only when the first eight best-path comparison conditions match. In data-centre BGP planning, only AS_PATH needs attention, as the other conditions are either identical within the IDC or irrelevant.
The AS-PATH attribute requires accurate comparison by default. An equal-cost path can only be formed when the length of the AS-PATH and the specific AS Number are the same. Based on the previous AS Number planning, each TOR has a different AS number. Consequently, the southbound route from the Leaf to the two TOR devices in the same group cannot achieve load balancing. To resolve this issue, AS-PATH loose comparison needs to be enabled on the Leaf device. For example, using Ruijie RGOS, the following configuration is required:
| Configuration Commands | Notes |
|---|---|
| `bgp bestpath as-path multipath-relax` | Compare only the AS-PATH length instead of the specific AS numbers in the path |
In the AS planning above, all Leaf devices in the same POD share the same AS number. This means that no matter which Leaf device sends a route, the AS-PATH seen on the TOR is always the same. Therefore, loose comparison does not need to be enabled on the TOR devices.
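The difference between strict and loose AS-PATH comparison can be illustrated with a toy check (a sketch of the comparison logic, not the switch's implementation):

```python
def ecmp_eligible(path_a: list, path_b: list, multipath_relax: bool) -> bool:
    """Can two BGP routes form ECMP, judged on AS_PATH alone?
    Strict mode requires identical AS_PATHs; multipath-relax
    requires only equal AS_PATH lengths."""
    if multipath_relax:
        return len(path_a) == len(path_b)
    return path_a == path_b

# Southbound on a Leaf: the same server prefix learned from two TORs,
# which carry different private ASNs per the AS plan above.
path_tor1 = [65001]
path_tor2 = [65002]
```

With strict comparison, `ecmp_eligible(path_tor1, path_tor2, False)` fails; enabling multipath-relax makes the two paths equal-cost, which is exactly why the knob is needed on the Leaf.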
Additionally, there are many equal-cost neighbours between the Leaf and TOR with completely consistent configuration policies. It is recommended to use the BGP peer-group function to simplify configuration in actual deployment.
Implement this function and perform the following configuration on Ruijie RGOS:
| Configuration Commands | Notes |
|---|---|
| `neighbor abc peer-group` | Create peer group abc |
| `neighbor [neighbour IP] peer-group abc` | Add a neighbour to peer group abc |
4. BGP Route Attribute Planning
BGP has a rich set of extended attributes that enable powerful routing control. The most commonly used attribute in the IDC is the BGP community attribute, which can greatly simplify routing policies. In the IDC, we often use private community attributes to add management tags to prefixes. Private communities use the AS:number format, where AS is the local or peer AS number and number is a locally assigned value identifying a group of routes to which policies can be applied. In practice, simpler community tags can be used, such as marking business network segments with the 1:1 attribute and intranet summary routes with the 2:2 attribute. On this basis, route advertisement can be finely controlled.
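The tagging scheme above can be sketched as a toy route database keyed by community (the prefixes and helper name are illustrative, not from a real deployment):

```python
# Toy model of tagging prefixes with private communities as described
# above: 1:1 marks business segments, 2:2 marks intranet summaries.
# Prefixes and helper names are illustrative.

ROUTES = {
    "10.1.0.0/24": {"1:1"},   # business network segment
    "10.0.0.0/8": {"2:2"},    # intranet summary route
}

def match_community(routes: dict, community: str) -> list:
    """Return the prefixes carrying a given community tag, e.g. to
    apply an export policy to one class of routes in a single match."""
    return [p for p, comms in routes.items() if community in comms]
```

A policy that should act only on business segments then matches on `"1:1"` instead of enumerating prefixes, which is the simplification the community attribute buys.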
5. Establish routing rules
▲ Figure 4: Data centre BGP route announcement planning
As shown in Figure 4, multiple groups of TOR + Leaf form a POD (Point of Delivery, the basic physical design unit of a data centre). The Spine horizontally connects multiple PODs, while MAN/DCI provides cross-regional interconnection. The IDC's BGP routing planning recommendations are as follows:
● Northbound routing
TOR to Leaf to Spine to MAN/DCI: the service network segment, management network segment, and loopback are announced hop by hop. In the de-stacking scenario, the TOR also needs to announce host routes toward the Spine.
● Southbound routing
MAN/DCI to Spine to Leaf, delivering the summary route of the entire intranet, such as 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. Leaf to TOR, in addition to announcing the summary route of the intranet, also needs to announce the business network segment, management network segment, and Loopback of this Pod (when the Leaf uplink fails, the traffic of the same POD can still match the detailed route and be forwarded through the Spine).
At present, the TOR layer increasingly uses de-stacking technology to achieve server dual homing (refer to the Technology Feast: “How to De-Stack Data centre Network Architecture”). In the de-stacking scenario, Leaf receives a large number of host routes from the ToR switch (depending on the number of hosts in the Pod, which may be tens of thousands). Leaf transmits host routes between TORs, which may cause the TOR switch routing capacity to exceed its limit. Therefore, it is necessary to implement a strategy in the receiving direction of TOR to filter out host routes sent by other TORs.
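The inbound filter described above amounts to rejecting host-length prefixes on the TOR. A minimal sketch, assuming the policy is expressed as a per-prefix accept test (in practice this would be an inbound prefix-list or route-map scoped to routes learned from the Leaf):

```python
import ipaddress

def accept_on_tor(prefix: str) -> bool:
    """Inbound filter on a TOR in the de-stacking scenario: drop /32
    (IPv4) and /128 (IPv6) host routes re-advertised by the Leaf from
    other TORs, and accept everything else. A real deployment would
    scope this to the Leaf-facing session; this sketch checks only
    the prefix length."""
    net = ipaddress.ip_network(prefix, strict=False)
    host_len = 32 if net.version == 4 else 128
    return net.prefixlen != host_len
```

Summaries and POD segments (e.g. `10.1.0.0/24`) pass, while another TOR's host route (e.g. `10.1.2.3/32`) is dropped, keeping the TOR's routing table within capacity.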
6. BGP Dual Stack Planning
In recent years, the construction of IPv6 has been vigorously promoted. In fact, the private network addresses of large IDCs are also facing exhaustion. Therefore, there is an urgent need to deploy IPv4/IPv6 dual stacks in IDCs.
BGP supports multiple protocol families and can handle the v4/v6 dual stack in the same BGP process. The common practice is to establish separate BGP sessions for v4 and v6 neighbours, but this doubles the configuration and maintenance workload. In fact, BGP v4 update messages can be carried over a TCP connection established over v6, and vice versa; that is, a single connection can carry route announcements for multiple address families.
▲ Figure 5: Advertising IPv4 routing information on an IPv6 session
As shown in Figure 5, Ruijie Networks provides an optimization in which only a single session is established to carry dual-stack routing. This simplifies configuration, saves IP addresses, and halves the performance cost of deploying protocols such as BFD for BGP.
Planning BGP Operation and Maintenance Capability
In addition to considering the planning of BGP basic capabilities, data centres also have extremely high requirements for BGP network operation and maintenance capabilities. Common BGP operation and maintenance capability designs include the following points:
1. Use BFD technology to accelerate BGP network convergence
Although the IDC network is built with high redundancy, its reliability is still limited by how quickly network equipment can detect faults and reroute traffic to other paths, especially in extreme cases such as a unidirectional (single-pass) failure of an optical module or fibre. In modern data centres, the lower the fault convergence time the better, as cloud services require sub-second convergence. As mentioned above, convergence can be accelerated by shortening the BGP timers, but this slow hello mechanism still converges on the order of seconds at best, which does not meet the requirement.
BFD provides millisecond-level detection. Linked with BGP, it achieves rapid convergence of BGP routes and ensures business continuity, so enabling BFD for BGP in the IDC is recommended. Considering device performance when all ports are enabled, a 300 ms × 3 configuration is advised.
Taking Ruijie RGOS software as an example, the main configuration of BFD is as follows:
| Configuration Commands | Notes |
|---|---|
| `neighbor XX fall-over bfd` | Enable BFD for the BGP neighbour |
| `bfd interval 300 min_rx 300 multiplier 3` | 300 ms detection interval; the session is declared down after 3 consecutive missed packets |
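The recommended 300 ms × 3 setting implies sub-second detection, since BFD declares a failure after `multiplier` consecutive control packets are missed:

```python
def bfd_detection_time_ms(tx_interval_ms: int, multiplier: int) -> int:
    """Approximate BFD detection time: the session is declared down
    after `multiplier` consecutive packets are missed, so detection
    takes roughly interval x multiplier (assuming the local and
    remote intervals are equal)."""
    return tx_interval_ms * multiplier

# The 300 ms x 3 configuration recommended above:
detect_ms = bfd_detection_time_ms(300, 3)  # 900 ms, i.e. sub-second
```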
2. Uninterrupted Service Capability - Fast Switching of BGP
BGP route convergence requires deleting invalid routes from the routing table, adding new routes, and making corresponding changes in the chip forwarding table. When there are many routes, it takes time to delete and refresh the routing table individually, and the convergence time may reach several seconds or even tens of seconds. Ruijie RGOS software offers optimization for route convergence by supporting prefix-independent convergence. As shown in Figure 6, when all EBGP neighbours from Leaf 1 to Spine devices fail, Leaf 1 will notify all TORs that the AS to Spine is unreachable. After receiving this message, TOR looks for the pre-assigned corresponding ID index (allocated based on the Spine's AS number and Leaf's Router-ID) and notifies the forwarding table to switch the next hop, thereby achieving rapid convergence of services. Its convergence speed is no longer limited by the number of route entries. A large Internet company tested 12K routes, and the convergence time was 0.7 seconds.
▲ Figure 6: BGP Prefix-Independent Convergence
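The idea behind prefix-independent convergence can be sketched as a shared next-hop indirection: prefixes reference a group object, so one update to the group reconverges all of them at once. The data structures below are illustrative, not the switch's actual forwarding tables:

```python
# Sketch of prefix-independent convergence: routes point at a shared
# next-hop group; repairing the group fixes every dependent prefix in
# one step instead of rewriting each route individually.

class NextHopGroup:
    def __init__(self, hops):
        self.hops = list(hops)

# 12k host routes all resolve through one group toward the Leafs
group = NextHopGroup(["leaf1", "leaf2"])
fib = {f"10.0.{i // 256}.{i % 256}/32": group for i in range(12_000)}

def fail_next_hop(g: NextHopGroup, hop: str) -> None:
    """One update to the shared group reconverges every prefix that
    references it; the cost does not grow with the number of routes."""
    g.hops.remove(hop)

fail_next_hop(group, "leaf1")  # all 12k prefixes now forward via leaf2
```

This is why the convergence time measured in the 12K-route test stays flat rather than scaling with the route count.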
3. Uninterrupted service capability - BGP NSR
Leaf/Spine devices in data centres have high-reliability requirements, and most are equipped with dual management boards. For TOR devices, a similar dual management board effect is achieved in stacking networking scenarios. When the active and standby management boards switch, the inconsistency of state information can easily cause protocol oscillation.
NSR (Non-Stop Routing) is designed to ensure uninterrupted routing during the protocol restart when the switch management board switches between active and standby. When the NSR function is enabled, the TCP NSS (Non-Stop Service) service is activated to back up related neighbours and routing information to the standby board. During the active and standby switching of the management board, the NSR function keeps the network topology stable, maintains neighbour status and forwarding table, and ensures that key services are not interrupted.
4. Uninterrupted Service Capability - Smooth Exit and Delayed Release of BGP
● BGP Smooth Exit: In a CLOS data centre network, when isolating and upgrading devices, the BGP smooth exit function ensures that services continue to flow with minimal interruption. The implementation steps are:
First, the route with the lowest priority (local-preference value is 0 or MED value is 4294967295) is announced to the neighbouring device, carrying the well-known gshut community. This allows the neighbouring device to update the route and switch its traffic to the backup link or another equal-cost link in advance.
Then, delay for a certain period to ensure that route learning is completed, and disconnect the BGP connection with the neighbouring device.
● BGP delayed release: When the device restarts, there may be a situation where the routing table has not been sent to the local hardware table, but the routing information is announced to the neighbour, thereby diverting traffic prematurely and causing abnormal traffic forwarding. To avoid this problem, you can set BGP to adjust the published routes to the lowest priority when the entire machine restarts. It is recommended to pre-configure this capability in the device. Taking Ruijie RGOS as an example, you need to configure:
| Configuration Commands | Notes |
|---|---|
| BGP advertises the lowest priority on startup: 120 | The duration for publishing low-priority routes is 120 seconds and is configurable |
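The smooth-exit announcement described earlier can be modelled as rewriting a route's attributes before the session is torn down. The GRACEFUL_SHUTDOWN community value 65535:0 is the well-known gshut community from RFC 8326; the route representation itself is illustrative:

```python
# Sketch of the BGP smooth-exit re-announcement: before maintenance,
# routes are re-advertised with the lowest preference and the
# well-known GRACEFUL_SHUTDOWN community (65535:0, RFC 8326) so that
# neighbours shift traffic to backup or equal-cost links first.

GSHUT = "65535:0"

def graceful_shutdown(route: dict) -> dict:
    """Return the route as it would be re-announced before exit."""
    out = dict(route)
    out["local_pref"] = 0  # lowest local preference
    out["communities"] = set(route.get("communities", set())) | {GSHUT}
    return out

r = graceful_shutdown({"prefix": "10.1.0.0/24", "local_pref": 100})
```

After neighbours have processed the low-preference announcement and moved traffic away, the BGP session can be safely disconnected, which matches the two-step sequence above.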
Summary
Planning, building, and operating a good data centre BGP network is not an easy task; it requires a lot of practical experience. Fortunately, the application of BGP in IDC has become increasingly mature, and large Internet companies and operators have many practical cases to refer to. Ruijie Networks is also fortunate to be involved in this and has delivered multiple large-scale BGP data centre networks for customers such as Tencent, Alibaba, and ByteDance.
To further enhance BGP performance and streamline operations, we will follow up with a series of discussions on BGP optimization and advanced operation and maintenance features, covering practical strategies for improving network efficiency and reliability. Please look forward to these articles in upcoming technology feasts.