rfc9696v1.txt | rfc9696.txt | |||
---|---|---|---|---|
Internet Engineering Task Force (IETF) Y. Wei, Ed. | Internet Engineering Task Force (IETF) Y. Wei, Ed. | |||
Request for Comments: 9696 Z. Zhang | Request for Comments: 9696 Z. Zhang | |||
Category: Informational ZTE Corporation | Category: Informational ZTE Corporation | |||
ISSN: 2070-1721 D. Afanasiev | ISSN: 2070-1721 D. Afanasiev | |||
Yandex | Yandex | |||
P. Thubert | P. Thubert | |||
Cisco Systems | Individual | |||
T. Przygienda | T. Przygienda | |||
Juniper Networks | Juniper Networks | |||
December 2024 | December 2024 | |||
Routing in Fat Trees (RIFT) Applicability and Operational Considerations | Routing in Fat Trees (RIFT) Applicability and Operational Considerations | |||
Abstract | Abstract | |||
This document discusses the properties, applicability, and | This document discusses the properties, applicability, and | |||
operational considerations of Routing in Fat Trees (RIFT) in | operational considerations of Routing in Fat Trees (RIFT) in | |||
skipping to change at line 112 | skipping to change at line 112 | |||
8.2. Informative References | 8.2. Informative References | |||
Acknowledgments | Acknowledgments | |||
Contributors | Contributors | |||
Authors' Addresses | Authors' Addresses | |||
1. Introduction | 1. Introduction | |||
This document discusses the properties and applicability of "RIFT: | This document discusses the properties and applicability of "RIFT: | |||
Routing in Fat Trees" [RFC9692] in different deployment scenarios and | Routing in Fat Trees" [RFC9692] in different deployment scenarios and | |||
highlights the operational simplicity of the technology compared to | highlights the operational simplicity of the technology compared to | |||
traditional routing solutions. It also documents special | classical routing solutions. It also documents special | |||
considerations when RIFT is used with or without overlays and/or | considerations when RIFT is used with or without overlays and/or | |||
controllers and how RIFT identifies miscablings and reroutes around | controllers and how RIFT identifies miscablings and reroutes around | |||
node and link failures. | node and link failures. | |||
2. Terminology | 2. Terminology | |||
This document uses the terminology defined in [RFC9692]. The most | This document uses the terminology defined in [RFC9692]. The most | |||
frequently used terms and their definitions from that document are | frequently used terms and their definitions from that document are | |||
listed here. | listed here. | |||
skipping to change at line 138 | skipping to change at line 138 | |||
2-leaf shortcuts and multiple level shortcuts are possible and | 2-leaf shortcuts and multiple level shortcuts are possible and | |||
described further in the document. | described further in the document. | |||
Crossbar: | Crossbar: | |||
Physical arrangement of ports in a switching matrix without | Physical arrangement of ports in a switching matrix without | |||
implying any further scheduling or buffering disciplines. | implying any further scheduling or buffering disciplines. | |||
Directed Acyclic Graph (DAG): | Directed Acyclic Graph (DAG): | |||
A finite directed graph with no directed cycles (loops). If links | A finite directed graph with no directed cycles (loops). If links | |||
in a Clos are considered as either being all directed towards the | in a Clos are considered as either being all directed towards the | |||
top or vice versa, each of two such graphs is a DAG. | top or bottom, each of such two graphs is a DAG. | |||
Disaggregation: | Disaggregation: | |||
The process in which a node decides to advertise more specific | The process in which a node decides to advertise more specific | |||
prefixes southwards, either positively to attract the | prefixes southwards, either positively to attract the | |||
corresponding traffic or negatively to repel it. Disaggregation | corresponding traffic or negatively to repel it. Disaggregation | |||
is performed to prevent traffic loss and suboptimal routing to the | is performed to prevent traffic loss and suboptimal routing to the | |||
more specific prefixes. | more specific prefixes. | |||
Leaf: | Leaf: | |||
A node without southbound adjacencies. Level 0 implies a leaf in | A node without southbound adjacencies. Level 0 implies a leaf in | |||
skipping to change at line 181 | skipping to change at line 181 | |||
as links and address prefixes. A TIE always has a direction and a | as links and address prefixes. A TIE always has a direction and a | |||
type. North TIEs (sometimes abbreviated as N-TIEs) are used when | type. North TIEs (sometimes abbreviated as N-TIEs) are used when | |||
dealing with TIEs in the northbound representation, and South-TIEs | dealing with TIEs in the northbound representation, and South-TIEs | |||
(sometimes abbreviated as S-TIEs) are used for the southbound | (sometimes abbreviated as S-TIEs) are used for the southbound | |||
equivalent. TIEs have different types, such as node and prefix | equivalent. TIEs have different types, such as node and prefix | |||
TIEs. | TIEs. | |||
3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | |||
Clos [CLOS] topologies (commonly called a Fat Tree/network in modern | Clos [CLOS] topologies (commonly called a Fat Tree/network in modern | |||
IP fabric considerations as a homonym to the original definition of | IP fabric considerations as a similar term for the original | |||
the term Fat Tree [FATTREE]) have gained prominence in today's | definition of the term Fat Tree [FATTREE]) have gained prominence in | |||
networking, primarily as a result of the paradigm shift towards a | today's networking, primarily as a result of the paradigm shift | |||
centralized data-center-based architecture that delivers a majority | towards a centralized data-center-based architecture that delivers a | |||
of computation and storage services. | majority of computation and storage services. | |||
Current routing protocols were geared towards a network with an | Current routing protocols were geared towards a network with an | |||
irregular topology with isotropic properties and a low degree of | irregular topology with isotropic properties and a low degree of | |||
connectivity. When applied to Fat Tree topologies: | connectivity. When applied to Fat Tree topologies: | |||
* They tend to need extensive configuration or provisioning during | * They tend to need extensive configuration or provisioning during | |||
initialization and adding or removing nodes from the fabric. | initialization and adding or removing nodes from the fabric. | |||
* For link-state routing protocols, all nodes including spine-and- | * For link-state routing protocols, all nodes including spine-and- | |||
leaf nodes learn the entire network topology and routing | leaf nodes learn the entire network topology and routing | |||
skipping to change at line 276 | skipping to change at line 276 | |||
v ++--++ +-+-++ ++--++ ++--++ + | v ++--++ +-+-++ ++--++ ++--++ + | |||
|LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 | |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 | |||
+----+ +----+ +----+ +----+ | +----+ +----+ +----+ +----+ | |||
Figure 1: RIFT Overview | Figure 1: RIFT Overview | |||
A spine node only has information necessary for its level, which is | A spine node only has information necessary for its level, which is | |||
all destinations south of the node based on SPF calculation, the | all destinations south of the node based on SPF calculation, the | |||
default route, and potentially disaggregated routes. | default route, and potentially disaggregated routes. | |||
RIFT combines the advantages of both link-state and distance-vector: | RIFT combines the advantages of both link-state and distance-vector protocols: | |||
* Fastest possible convergence | * Fastest possible convergence | |||
* Automatic detection of topology | * Automatic detection of topology | |||
* Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf | * Minimal routes/information on Top-of-Rack (ToR) switches, aka leaf | |||
nodes | nodes | |||
* High degree of ECMP | * High degree of ECMP | |||
skipping to change at line 299 | skipping to change at line 300 | |||
* Maximum propagation speed with flexible prefixes in an update | * Maximum propagation speed with flexible prefixes in an update | |||
There are two types of link-state databases that are "north | There are two types of link-state databases that are "north | |||
representation" North Topology Information Elements (N-TIEs) and | representation" North Topology Information Elements (N-TIEs) and | |||
"south representation" South Topology Information Elements (S-TIEs). | "south representation" South Topology Information Elements (S-TIEs). | |||
The N-TIEs contain a link-state topology description of lower levels, | The N-TIEs contain a link-state topology description of lower levels, | |||
and the S-TIEs simply carry default and disaggregated routes for the | and the S-TIEs simply carry default and disaggregated routes for the | |||
lower levels. | lower levels. | |||
RIFT also eliminates major disadvantages of link-state and distance- | RIFT also eliminates major disadvantages of link-state and distance- | |||
vector with the following: | vector protocols with the following: | |||
* Reduced and balanced flooding | * Reduced and balanced flooding | |||
* Level-constrained automatic neighbor discovery | * Level-constrained automatic neighbor discovery | |||
To achieve this, RIFT builds on the art of IGPs, such as OSPF, IS-IS, | To achieve this, RIFT builds on the art of IGPs, such as OSPF, IS-IS, | |||
Mobile Ad Hoc Network (MANET), and Internet of Things (IoT) to | Mobile Ad Hoc Network (MANET), and Internet of Things (IoT) to | |||
provide unique features: | provide unique features: | |||
* Automatic (positive or negative) route disaggregation of northward | * Automatic (positive or negative) route disaggregation of northward | |||
skipping to change at line 363 | skipping to change at line 364 | |||
4.2.1. Horizontal Links | 4.2.1. Horizontal Links | |||
RIFT is not limited to pure Clos divided into PoD and multi-planes | RIFT is not limited to pure Clos divided into PoD and multi-planes | |||
but supports horizontal (East-West) links below the ToF level. Those | but supports horizontal (East-West) links below the ToF level. Those | |||
links are used only for last resort northbound forwarding when a | links are used only for last resort northbound forwarding when a | |||
spine loses all its northbound links or cannot compute a default | spine loses all its northbound links or cannot compute a default | |||
route through them. | route through them. | |||
A full-mesh connectivity between nodes on the same level can be | A full-mesh connectivity between nodes on the same level can be | |||
employed and that allows North SPF (N-SPF) to provide for any node | deployed, which allows North SPF (N-SPF) to provide for any node | |||
losing all its northbound adjacencies (as long as any of the other | losing all its northbound adjacencies (as long as any of the other | |||
nodes in the level are northbound connected) to still participate in | nodes in the level are northbound connected) and still participate in | |||
northbound forwarding. | northbound forwarding. | |||
Note that a "ring" of horizontal links at any level below ToF does | Note that a "ring" of horizontal links at any level below ToF does | |||
not provide a "ring-based protection" scheme since the SPF | not provide a "ring-based protection" scheme since the SPF | |||
computation would have to deal with breaking of "loops", an | computation would have to deal with breaking of "loops", an | |||
application for which RIFT is not intended. | application for which RIFT is not intended. | |||
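The last-resort role of horizontal links described in this section can be sketched as a next-hop selection rule. This is a simplified illustration and not the N-SPF computation from [RFC9692]; the function name `default_next_hops` and its two inputs are assumptions made for this sketch.

```python
def default_next_hops(northbound, east_west):
    """Pick next hops for northbound (default) forwarding.

    northbound: list of usable northbound neighbors (hypothetical input).
    east_west:  dict mapping each same-level neighbor to True if that
                neighbor still has northbound connectivity (hypothetical input).
    """
    if northbound:
        # Normal case: horizontal links are not used at all.
        return sorted(northbound)
    # Last resort: fall back to same-level neighbors that can still go north.
    return sorted(n for n, has_north in east_west.items() if has_north)
```

With at least one northbound neighbor, the horizontal links are ignored; they only appear in the result once the northbound list is empty.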
4.2.2. Vertical Shortcuts | 4.2.2. Vertical Shortcuts | |||
Through relaxations of the specified adjacency forming rules, RIFT | Through relaxations of the specified adjacency forming rules, RIFT | |||
skipping to change at line 409 | skipping to change at line 410 | |||
operation specified for East-West links and the southbound | operation specified for East-West links and the southbound | |||
reflection between nodes are not applicable. Also, ZTP will | reflection between nodes are not applicable. Also, ZTP will | |||
derive a sense of depth that will eliminate some links. | derive a sense of depth that will eliminate some links. | |||
Variations of ZTP could be derived to meet specific objectives, | Variations of ZTP could be derived to meet specific objectives, | |||
e.g., make it so that most routers have at least two parents to | e.g., make it so that most routers have at least two parents to | |||
reach the ToF. | reach the ToF. | |||
* RIFT applies to any Destination-Oriented DAG (DODAG) where there's | * RIFT applies to any Destination-Oriented DAG (DODAG) where there's | |||
only one ToF node and the problem of disaggregation does not | only one ToF node and the problem of disaggregation does not | |||
exist. In that case, RIFT operates very much like RPL [RFC6550], | exist. In that case, RIFT operates very much like RPL [RFC6550], | |||
but uses Link State for southbound routes (downwards in RPL's | but uses link-state information for southbound routes (downwards | |||
terms). For an arbitrary DAG with multiple destinations (ToFs), | in RPL's terms). For an arbitrary DAG with multiple destinations | |||
the way disaggregation happens has to be considered. | (ToFs), the way disaggregation happens has to be considered. | |||
* Positive Disaggregation expects that most of the ToF nodes reach | * Positive Disaggregation expects that most of the ToF nodes reach | |||
most of the leaves, so disaggregation is the exception as opposed | most of the leaves, so disaggregation is the exception as opposed | |||
to the rule. When this is no longer true, it makes sense to turn | to the rule. When this is no longer true, it makes sense to turn | |||
off disaggregation and route between the ToF nodes over a ring, a | off disaggregation and route between the ToF nodes over a ring, a | |||
full mesh, a transit network, or a form of area zero. Then again, | full mesh, a transit network, or a form of area zero. Then again, | |||
this operation is similar to RPL operating as a single DODAG with | this operation is similar to RPL operating as a single DODAG with | |||
a virtual root. | a virtual root. | |||
* In order to aggregate and disaggregate routes, RIFT requires that | * In order to aggregate and disaggregate routes, RIFT requires that | |||
skipping to change at line 433 | skipping to change at line 434 | |||
fabric. This can be achieved with a ring as suggested by RIFT | fabric. This can be achieved with a ring as suggested by RIFT | |||
[RFC9692], by some preconfiguration, or by using a synchronization | [RFC9692], by some preconfiguration, or by using a synchronization | |||
with a common repository where all the active prefixes are | with a common repository where all the active prefixes are | |||
registered. | registered. | |||
4.2.4. Reachability of Internal Nodes in the Fabric | 4.2.4. Reachability of Internal Nodes in the Fabric | |||
RIFT does not require that nodes have reachable addresses in the | RIFT does not require that nodes have reachable addresses in the | |||
fabric, though it is clearly desirable for operational purposes. | fabric, though it is clearly desirable for operational purposes. | |||
Under normal operating conditions, this can be easily achieved by | Under normal operating conditions, this can be easily achieved by | |||
injecting the node's loopback address into North and South Prefix | injecting the node's loopback address into Prefix North TIEs and | |||
TIEs or other implementation-specific mechanisms. | Prefix South TIEs or other implementation-specific mechanisms. | |||
Special considerations arise when a node loses all northbound | Special considerations arise when a node loses all northbound | |||
adjacencies but is not at the top of the fabric. If a spine node | adjacencies but is not at the top of the fabric. If a spine node | |||
loses all northbound links, the spine node doesn't advertise a | loses all northbound links, the spine node doesn't advertise a | |||
default route. But if the level of the spine node is auto-determined | default route. But if the level of the spine node is auto-determined | |||
by ZTP, it will "fall down" as depicted in Figure 8. | by ZTP, it will "fall down" as depicted in Figure 8. | |||
4.3. Use Cases | 4.3. Use Cases | |||
4.3.1. Data Center Topologies | 4.3.1. Data Center Topologies | |||
4.3.1.1. Data Center Fabrics | 4.3.1.1. Data Center Fabrics | |||
RIFT is suited for applying in data center (DC) IP fabrics underlay | RIFT is suited for applying underlay routing in data center (DC) IP | |||
routing, vast majority of which seem to be currently (and for the | fabrics, with the vast majority of these IP fabrics being Clos | |||
foreseeable future) Clos architectures. It significantly simplifies | architectures (and will be for the foreseeable future). It | |||
operation and deployment of such fabrics as described in Section 5 | significantly simplifies operation and deployment of such fabrics as | |||
for environments compared to extensive proprietary provisioning and | described in Section 5 for environments compared to extensive | |||
operational solutions. | proprietary provisioning and operational solutions. | |||
4.3.1.2. Adaptations to Other Proposed Data Center Topologies | 4.3.1.2. Adaptations to Other Proposed Data Center Topologies | |||
. +-----+ +-----+ | . +-----+ +-----+ | |||
. | | | | | . | | | | | |||
.+-+ S0 | | S1 | | .+-+ S0 | | S1 | | |||
.| ++---++ ++---++ | .| ++---++ ++---++ | |||
.| | | | | | .| | | | | | |||
.| | +------------+ | | .| | +------------+ | | |||
.| | | +------------+ | | .| | | +------------+ | | |||
skipping to change at line 507 | skipping to change at line 508 | |||
environments close to content producers (server farms connection via | environments close to content producers (server farms connection via | |||
DC fabrics) but in proximity to content consumers as well. Consumers | DC fabrics) but in proximity to content consumers as well. Consumers | |||
are often clustered in metro areas with their own network | are often clustered in metro areas with their own network | |||
architectures that can benefit from simplified, regular Clos | architectures that can benefit from simplified, regular Clos | |||
structures. Thus, they can also benefit from RIFT. | structures. Thus, they can also benefit from RIFT. | |||
4.3.3. Building Cabling | 4.3.3. Building Cabling | |||
Commercial edifices are often cabled in topologies that are either | Commercial edifices are often cabled in topologies that are either | |||
Clos or its isomorphic equivalents. The Clos can grow rather high | Clos or its isomorphic equivalents. The Clos can grow rather high | |||
with many levels. That presents a challenge for traditional routing | with many levels. That presents a challenge for classical routing | |||
protocols (except BGP [RFC4271] and Private Network-Network Interface | protocols (except BGP [RFC4271] and Private Network-Network Interface | |||
(PNNI) [PNNI], which is largely phased-out by now) that do not | (PNNI) [PNNI], which is largely phased-out by now) that do not | |||
support an arbitrary number of levels, which RIFT does naturally. | support an arbitrary number of levels, which RIFT does naturally. | |||
Moreover, due to the limited sizes of forwarding tables in network | Moreover, due to the limited sizes of forwarding tables in network | |||
elements of building cabling, the minimum FIB size RIFT maintains | elements of building cabling, the minimum FIB size RIFT maintains | |||
under normal conditions is cost-effective in terms of hardware and | under normal conditions is cost-effective in terms of hardware and | |||
operational costs. | operational costs. | |||
4.3.4. Internal Router Switching Fabrics | 4.3.4. Internal Router Switching Fabrics | |||
skipping to change at line 542 | skipping to change at line 543 | |||
The Cloud Central Office (CloudCO) is a new stage of the telecom | The Cloud Central Office (CloudCO) is a new stage of the telecom | |||
Central Office. It takes the advantage of Software-Defined | Central Office. It takes the advantage of Software-Defined | |||
Networking (SDN) and Network Function Virtualization (NFV) in | Networking (SDN) and Network Function Virtualization (NFV) in | |||
conjunction with general purpose hardware to optimize current | conjunction with general purpose hardware to optimize current | |||
networks. The following figure illustrates this architecture at a | networks. The following figure illustrates this architecture at a | |||
high level. It describes a single instance or macro-node of CloudCO | high level. It describes a single instance or macro-node of CloudCO | |||
that provides a number of value-added services (VASes), a Broadband | that provides a number of value-added services (VASes), a Broadband | |||
Access Abstraction (BAA), and virtualized network services. An | Access Abstraction (BAA), and virtualized network services. An | |||
Access I/O module faces a CloudCO access node and the Customer | Access I/O module faces a CloudCO access node and the Customer | |||
Premises Equipment (CPE) behind it. A Network I/O module is facing | Premises Equipment (CPE) behind it. A Network I/O module is facing | |||
the core network. The two I/O modules are interconnected by a leaf | the core network. The two I/O modules are interconnected by a spine- | |||
and spine fabric [TR-384]. | and-leaf fabric [TR-384]. | |||
+---------------------+ +----------------------+ | +---------------------+ +----------------------+ | |||
| Spine | | Spine | | | Spine | | Spine | | |||
| Switch | | Switch | | | Switch | | Switch | | |||
+------+---+------+-+-+ +--+-+-+-+-----+-------+ | +------+---+------+-+-+ +--+-+-+-+-----+-------+ | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | | +-------------------------------+ | | | | | | | +-------------------------------+ | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | +-------------------------+ | | | | | | | | +-------------------------+ | | | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | |||
skipping to change at line 615 | skipping to change at line 616 | |||
scenarios. | scenarios. | |||
* RIFT automatically negotiates Bidirectional Forwarding Detection | * RIFT automatically negotiates Bidirectional Forwarding Detection | |||
(BFD) per link. This allows for IP and micro-BFD [RFC7130] to | (BFD) per link. This allows for IP and micro-BFD [RFC7130] to | |||
replace Link Aggregation Groups (LAGs) that hide bandwidth | replace Link Aggregation Groups (LAGs) that hide bandwidth | |||
imbalances in case of constituent failures. Further automatic | imbalances in case of constituent failures. Further automatic | |||
link validation techniques similar to those in [RFC5357] could be | link validation techniques similar to those in [RFC5357] could be | |||
supported as well. | supported as well. | |||
* RIFT inherently solves many problems associated with the use of | * RIFT inherently solves many problems associated with the use of | |||
traditional routing topologies with dense meshes and high degrees | classical routing topologies with dense meshes and high degrees of | |||
of ECMP by including automatic bandwidth balancing, flood | ECMP by including automatic bandwidth balancing, flood reduction, | |||
reduction, and automatic disaggregation on failures while | and automatic disaggregation on failures while providing maximum | |||
providing maximum aggregation of prefixes in default scenarios. | aggregation of prefixes in default scenarios. ECMP in RIFT | |||
ECMP in RIFT eliminates the need for more Loop-Free Alternate | eliminates the need for more Loop-Free Alternate (LFA) procedures. | |||
(LFA) procedures. | ||||
* RIFT reduces FIB size towards the bottom of the IP fabric where | * RIFT reduces FIB size towards the bottom of the IP fabric where | |||
most nodes reside and allows with that for cheaper hardware on the | most nodes reside. This allows for cheaper hardware on the edges | |||
edges and introduction of modern IP fabric architectures that | and introduction of modern IP fabric architectures that encompass | |||
encompass, e.g., server multihoming. | server multihoming and other mechanisms. | |||
* RIFT provides valley-free routing that is loop free. A valley- | * RIFT provides valley-free routing that is loop free. A valley- | |||
free path allows for reversal of direction at most once from a | free path allows for reversal of direction at most once from a | |||
packet heading northbound to southbound while permitting traversal | packet heading northbound to southbound while permitting traversal | |||
of horizontal links in the northbound phase. This allows for the | of horizontal links in the northbound phase. This allows for the | |||
use of any such valley-free path in bisectional fabric bandwidth | use of any such valley-free path in bisectional fabric bandwidth | |||
between two destinations irrespective of their metrics that can be | between two destinations irrespective of their metrics that can be | |||
used to balance load on the fabric in different ways. Valley-free | used to balance load on the fabric in different ways. Valley-free | |||
routing eliminates the need for any specific micro-loop avoidance | routing eliminates the need for any specific micro-loop avoidance | |||
procedures for RIFT. | procedures for RIFT. | |||
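The valley-free property described above can be sketched as a check over the sequence of hop directions on a path. This is a minimal illustration only; the direction labels "N", "S", and "EW" and the function name `is_valley_free` are assumptions made for this sketch and are not taken from [RFC9692].

```python
def is_valley_free(hops):
    """Return True if the path reverses direction at most once, north to south.

    hops: sequence of "N" (northbound), "S" (southbound), or "EW" (horizontal).
    Horizontal hops are accepted only while the path is still in its
    northbound phase.
    """
    heading_south = False
    for hop in hops:
        if hop not in ("N", "S", "EW"):
            raise ValueError(f"unknown hop direction: {hop!r}")
        if hop == "S":
            heading_south = True
        elif heading_south:
            # A northbound or horizontal hop after a southbound one is a valley.
            return False
    return True
```

For example, the path N, EW, N, S is accepted, while N, S, N, S is rejected because it turns north again after heading south.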
skipping to change at line 699 | skipping to change at line 699 | |||
| +-----------+ | | + +---+linkSL7+-+ | + | | +-----------+ | | + +---+linkSL7+-+ | + | |||
| | | | | | | | | | | | | | | | | | |||
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | |||
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
+-+-----+ +-+-----+ +-----+-+ +-+-----+ | +-+-----+ +-+-----+ +-----+-+ +-+-----+ | |||
+ + + + | + + + + | |||
Prefix111 Prefix112 Prefix121 Prefix122 | Prefix111 Prefix112 Prefix121 Prefix122 | |||
Figure 4: Suboptimal Routing Upon Link Failure Use Case | Figure 4: Suboptimal Routing Upon Link Failure Use Case | |||
As shown in Figure 4, as the result of the south reflection between | As shown in Figure 4, as the result of the south reflection, Spine121 | |||
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and | and Spine 122 know each other via Leaf121 or Leaf 122 at level 1. | |||
Spine 122 know each other at level 1. | ||||
Without disaggregation mechanisms, the packet from leaf121 to | Without disaggregation mechanisms, the packet from leaf121 to | |||
prefix122 will probably go up through linkSL5 to linkTS3 when linkSL6 | prefix122 will probably go up through linkSL5 to linkTS3 when linkSL6 | |||
fails. Then, the packet will go down through linkTS4 to linkSL8 to | fails. Then, the packet will go down through linkTS4 to linkSL8 to | |||
Leaf122 or go up through linkSL5 to linkTS6, then go down through | Leaf122 or go up through linkSL5 to linkTS6, then go down through | |||
linkTS8 and linkSL8 to Leaf122 based on the pure default route. This | linkTS8 and linkSL8 to Leaf122 based on the pure default route. This | |||
is the case of suboptimal routing or bow tying. | is the case of suboptimal routing or bow tying. | |||
With disaggregation mechanisms, Spine122 will detect the failure | With disaggregation mechanisms, Spine122 will detect the failure | |||
according to the reflected node S-TIE from Spine121 when linkSL6 | according to the reflected node S-TIE from Spine121 when linkSL6 | |||
skipping to change at line 788 | skipping to change at line 787 | |||
unique in the RIFT network and the level of the node in the Fat Tree, | unique in the RIFT network and the level of the node in the Fat Tree, | |||
which determines which peers are northward "parents" and which are | which determines which peers are northward "parents" and which are | |||
southward "children". | southward "children". | |||
ZTP is always on, but its decisions can be overridden when a network | ZTP is always on, but its decisions can be overridden when a network | |||
administrator prefers to impose its own configuration. In that case, | administrator prefers to impose its own configuration. In that case, | |||
it is the responsibility of the administrator to ensure that the | it is the responsibility of the administrator to ensure that the | |||
configured parameters are correct, i.e., ensure that the System ID of | configured parameters are correct, i.e., ensure that the System ID of | |||
each node is unique and that the administratively set levels truly | each node is unique and that the administratively set levels truly | |||
reflect the relative position of the nodes in the fabric. It is | reflect the relative position of the nodes in the fabric. It is | |||
recommended to let ZTP configure the network, and when not, it is | recommended to let ZTP configure the network, and when ZTP does not | |||
recommended to configure the level of all the nodes to avoid an | configure the network, it is recommended to configure the level of | |||
undesirable interaction between ZTP and the manual configuration. | all the nodes to avoid an undesirable interaction between ZTP and the manual configuration. | |||
ZTP requires that the administrator points out the ToF nodes to set | ZTP requires that the administrator points out the ToF nodes to set | |||
the baseline from which the fabric topology is derived. The ToF | the baseline from which the fabric topology is derived. The ToF | |||
nodes are configured with the TOP_OF_FABRIC flag, which are initial | nodes are configured with the TOP_OF_FABRIC flag, which are initial | |||
'seeds' needed for other ZTP nodes to derive their level in the | 'seeds' needed for other ZTP nodes to derive their level in the | |||
topology. ZTP computes the level of each node based on the Highest | topology. ZTP computes the level of each node based on the Highest | |||
Available Level (HAL) of the potential parent closest to that | Available Level (HAL) of the potential parent closest to that | |||
baseline, which represents the superspine. In a fashion, RIFT can be | baseline, which represents the superspine. In a fashion, RIFT can be | |||
seen as a distance-vector protocol that computes a set of feasible | seen as a distance-vector protocol that computes a set of feasible | |||
successors towards the superspine and autoconfigures the rest of the | successors towards the superspine and autoconfigures the rest of the | |||
skipping to change at line 976 | skipping to change at line 976 | |||
| | | +--------------------------------+ | | | | +--------------------------------+ | |||
| | | | | | | | | | |||
| | | | | | | | | | |||
| | | | | | | | | | |||
| | | | | | | | | | |||
+ + + + | + + + + | |||
+-1--2--3--4--+ | +-1--2--3--4--+ | |||
| Leaf1 | ...... | | Leaf1 | ...... | |||
+-------------+ | +-------------+ | |||
Figure 9: Fallen Spine | Figure 9: Additional Cabling Constraint Example | |||
RIFT allows implementations to provide programmable plug-ins that can | RIFT allows implementations to provide programmable plug-ins that can | |||
adjust ZTP operation or capture information during computation. | adjust ZTP operation or capture information during computation. | |||
While defining this is outside the scope of this document, such a | While defining this is outside the scope of this document, such a | |||
mechanism could be used to extend the miscabling functionality. | mechanism could be used to extend the miscabling functionality. | |||
For other protocols to achieve this, it would require additional | For other protocols to achieve this, it would require additional | |||
operational overhead. Consider a fabric that is using unnumbered | operational overhead. Consider a fabric that is using unnumbered | |||
OSPF links; it is still very likely that a miscabled link will form | OSPF links; it is still very likely that a miscabled link will form | |||
an adjacency. Each attempt to move cables to the correct port may | an adjacency. Each attempt to move cables to the correct port may | |||
skipping to change at line 1134 ¶ | skipping to change at line 1134 ¶ | |||
way, the multiple routes are equally valid and should be conserved in | way, the multiple routes are equally valid and should be conserved in | |||
the case of anycast. Without further information from the | the case of anycast. Without further information from the | |||
redistributed routing protocol, it is impossible to sort out a | redistributed routing protocol, it is impossible to sort out a | |||
movement from a redistribution that happens asynchronously on | movement from a redistribution that happens asynchronously on | |||
different leaves. RIFT [RFC9692] expects that anycast addresses are | different leaves. RIFT [RFC9692] expects that anycast addresses are | |||
advertised within the timing precision, which is typically the case | advertised within the timing precision, which is typically the case | |||
with a low-precision timing and a multihomed node. Beyond that time | with a low-precision timing and a multihomed node. Beyond that time | |||
interval, RIFT interprets the lag as a mobility and only the freshest | interval, RIFT interprets the lag as a mobility and only the freshest | |||
route is retained. | route is retained. | |||
When using IPv6 [RFC8200], RIFT suggests to leverage [RFC8505] as the | When using IPv6 [RFC8200], RIFT suggests leveraging 6LoWPAN ND | |||
IPv6 ND interaction between the mobile node and the leaf. This not | [RFC8505] as the IPv6 ND interaction between the mobile node and the | |||
only provides a sequence counter but also a lifetime and a security | leaf. This not only provides a sequence counter but also a lifetime | |||
token that may be used to protect the ownership of an address | and a security token that may be used to protect the ownership of an | |||
[RFC8928]. When using [RFC8505], the parallel registration of an | address [RFC8928]. When using 6LoWPAN ND [RFC8505], the parallel | |||
anycast address to multiple leaves is done with the same sequence | registration of an anycast address to multiple leaves is done with | |||
counter, whereas the sequence counter is incremented when the point | the same sequence counter, whereas the sequence counter is | |||
of attachment changes. This way, it is possible to differentiate a | incremented when the point of attachment changes. This way, it is | |||
mobile node from a multihomed node, even when the mobility happens | possible to differentiate a mobile node from a multihomed node, even | |||
within the timing precision. It is also possible for a mobile node | when the mobility happens within the timing precision. It is also | |||
to be multihomed as well, e.g., to change only one of its points of | possible for a mobile node to be multihomed as well, e.g., to change | |||
attachment. | only one of its points of attachment. | |||
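The sequence-counter rule described above can be sketched as follows. This is a minimal illustration, not part of RIFT [RFC9692] or 6LoWPAN ND [RFC8505]: the names `PrefixState` and `register` and the returned labels are hypothetical, and counters are compared as plain integers, so the wraparound handling a real sequence counter needs is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixState:
    """Hypothetical per-address state; not a RIFT or RFC 8505 structure."""
    seq: int                                  # freshest sequence counter seen
    leaves: set = field(default_factory=set)  # leaves holding the registration


def register(table, address, leaf, seq):
    """Record a registration of `address` at `leaf` with counter `seq`.

    Returns "new", "refresh", "multihomed", "moved", or "stale".
    Assumption: plain integer comparison, no counter wraparound.
    """
    state = table.get(address)
    if state is None:
        table[address] = PrefixState(seq=seq, leaves={leaf})
        return "new"
    if seq == state.seq:
        if leaf in state.leaves:
            return "refresh"
        # Same counter at another leaf: parallel (anycast) registration.
        state.leaves.add(leaf)
        return "multihomed"
    if seq > state.seq:
        # Incremented counter: point of attachment changed, keep the freshest.
        state.seq = seq
        state.leaves = {leaf}
        return "moved"
    # Older counter than the one already known: ignore it.
    return "stale"
```

In this sketch, a mobile node that is also multihomed would re-register at its remaining leaf with the incremented counter, which the function then reports as "multihomed" again.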
5.9. IPv4 over IPv6 | 5.9. IPv4 over IPv6 | |||
RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An | RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. An | |||
IPv6 Address Family (AF) is configured via the usual ND mechanisms, and | IPv6 Address Family (AF) is configured via the usual ND mechanisms, and | |||
then V4 can use V6 next-hops analogous to [RFC8950]. It is expected | then V4 can use V6 next-hops analogous to [RFC8950]. It is expected | |||
that the whole fabric supports the same type of forwarding of AFs on | that the whole fabric supports the same type of forwarding of AFs on | |||
all the links. RIFT provides an indication of whether a node is capable | all the links. RIFT provides an indication of whether a node is capable | |||
of V4-forwarding, and implementations are possible where different | of V4-forwarding, and implementations are possible where different | |||
routing tables are computed per AF as long as the computation remains | routing tables are computed per AF as long as the computation remains | |||
skipping to change at line 1188 ¶ | skipping to change at line 1188 ¶ | |||
+---+----+ +---+----+ | +---+----+ +---+----+ | |||
| V4 | | V4 | | | V4 | | V4 | | |||
| subnet | | subnet | | | subnet | | subnet | | |||
+--------+ +--------+ | +--------+ +--------+ | |||
Figure 10: IPv4 over IPv6 | Figure 10: IPv4 over IPv6 | |||
5.10. In-Band Reachability of Nodes | 5.10. In-Band Reachability of Nodes | |||
RIFT doesn't precondition that nodes of the fabric have reachable | RIFT doesn't precondition that nodes of the fabric have reachable | |||
addresses, but the operational reasons to reach the internal nodes | addresses, but operational reasons to reach the internal nodes may | |||
may exist. Figure 11 shows an example in which the network management | exist. Figure 11 shows an example in which the network management | |||
station (NMS) attaches to Leaf1. | station (NMS) attaches to Leaf1. | |||
+-------+ +-------+ | +-------+ +-------+ | |||
| ToF1 | | ToF2 | | | ToF1 | | ToF2 | | |||
++---- ++ ++-----++ | ++---- ++ ++-----++ | |||
| | | | | | | | | | |||
| +----------+ | | | +----------+ | | |||
| +--------+ | | | | +--------+ | | | |||
| | | | | | | | | | |||
++-----++ +--+---++ | ++-----++ +--+---++ | |||
skipping to change at line 1224 ¶ | skipping to change at line 1224 ¶ | |||
If the NMS wants to access Leaf2, it simply works because the | If the NMS wants to access Leaf2, it simply works because the | |||
loopback address of Leaf2 is flooded in its Prefix North TIE. | loopback address of Leaf2 is flooded in its Prefix North TIE. | |||
If the NMS wants to access Spine2, it also works because a spine node | If the NMS wants to access Spine2, it also works because a spine node | |||
always advertises its loopback address in the Prefix North TIE. The | always advertises its loopback address in the Prefix North TIE. The | |||
NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | |||
ToF2-Spine2. | ToF2-Spine2. | |||
If the NMS wants to access ToF2, ToF2's loopback address needs to be | If the NMS wants to access ToF2, ToF2's loopback address needs to be | |||
injected into its Prefix South TIE. This TIE must be seen by all | injected into its Prefix South TIE. This TIE must be seen by all | |||
nodes at the level below -- the spine nodes in Figure 9 -- that must | nodes at the level below -- the spine nodes in Figure 11 -- that must | |||
form a ceiling for all the traffic coming from below (south). | form a ceiling for all the traffic coming from below (south). | |||
Otherwise, the traffic from the NMS may follow the default route to | Otherwise, the traffic from the NMS may follow the default route to | |||
the wrong ToF Node, e.g., ToF1. | the wrong ToF Node, e.g., ToF1. | |||
In the case of failure between ToF2 and spine nodes, ToF2's loopback | In the case of failure between ToF2 and spine nodes, ToF2's loopback | |||
address must be disaggregated recursively all the way to the leaves. | address must be disaggregated recursively all the way to the leaves. | |||
In a partitioned ToF, even with recursive disaggregation, a ToF node | In a partitioned ToF, even with recursive disaggregation, a ToF node | |||
is only reachable within its plane. | is only reachable within its plane. | |||
A possible alternative to recursive disaggregation is to use a ring | A possible alternative to recursive disaggregation is to use a ring | |||
that interconnects the ToF nodes to transmit packets between them for | that interconnects the ToF nodes to transmit packets between them for | |||
their loopback addresses only. The idea is that this is mostly | their loopback addresses only. The idea is that this is mostly | |||
control traffic and should not alter the load-balancing properties of | control traffic and should not alter the load-balancing properties of | |||
the fabric. | the fabric. | |||
5.11. Dual-Homing Servers | 5.11. Dual-Homing Servers | |||
Each RIFT node may operate in ZTP mode. It has no configuration | Each RIFT node may operate in ZTP mode. It has no configuration | |||
(unless it is a ToF at the top of the topology or the must operate in | (unless it is a ToF node at the top of the topology or if it must | |||
the topology as leaf and/or support leaf-2-leaf procedures), and it | operate in the topology as a leaf and/or support leaf-2-leaf | |||
will fully configure itself after being attached to the topology. | procedures), and it will fully configure itself after being attached | |||
to the topology. | ||||
+---+ +---+ +---+ | +---+ +---+ +---+ | |||
|ToF| |ToF| |ToF| ToF | |ToF| |ToF| |ToF| ToF | |||
+---+ +---+ +---+ | +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | |||
| +----------------+ | | | | +----------------+ | | | |||
| +----------------+ | | | +----------------+ | | |||
| | | | | | | | | | | | | | |||
+----------+--+ +--+----------+ | +----------+--+ +--+----------+ | |||
| ToR1 | | ToR2 | Spine | | ToR1 | | ToR2 | Spine | |||
skipping to change at line 1270 ¶ | skipping to change at line 1271 ¶ | |||
| | | | | +-----------------+ | | | | | | | +-----------------+ | | |||
| | | | +--------------+ | | | | | | | | +--------------+ | | | | |||
| | | | | | | | | | | | | | | | | | |||
+---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | | | | |||
+---+ +---+ ............. +---+ +---+ | +---+ +---+ ............. +---+ +---+ | |||
SV(1) SV(2) SV(n-1) SV(n) Leaf | SV(1) SV(2) SV(n-1) SV(n) Leaf | |||
Figure 12: Dual-Homing Servers | Figure 12: Dual-Homing Servers | |||
Sometimes people may prefer to disaggregate from ToR to servers from | Sometimes people may prefer to disaggregate from ToR nodes to servers | |||
start on, i.e. the servers have couple tens of routes in FIB from | from startup, i.e., the servers have multiple routes in the FIB from | |||
start on beside default routes to avoid breakages at rack level. | startup other than default routes to avoid breakages at the rack | |||
Full disaggregation of the fabric could be achieved by configuration | level. Full disaggregation of the fabric could be achieved by | |||
supported by RIFT. | configuration supported by RIFT. | |||
5.12. Fabric with a Controller | 5.12. Fabric with a Controller | |||
There are many different ways to deploy the controller. One | There are many different ways to deploy the controller. One | |||
possibility is attaching a controller to the RIFT domain from ToF and | possibility is attaching a controller to the RIFT domain from ToF and | |||
another possibility is attaching a controller from the leaf. | another possibility is attaching a controller from the leaf. | |||
+------------+ | +------------+ | |||
| Controller | | | Controller | | |||
++----------++ | ++----------++ | |||
skipping to change at line 1326 ¶ | skipping to change at line 1327 ¶ | |||
If the controller is attaching from a leaf to the fabric, no special | If the controller is attaching from a leaf to the fabric, no special | |||
provisions are needed. | provisions are needed. | |||
5.13. Internet Connectivity Within Underlay | 5.13. Internet Connectivity Within Underlay | |||
If global addressing is running without overlay, an external default | If global addressing is running without overlay, an external default | |||
route needs to be advertised through the RIFT fabric to achieve | route needs to be advertised through the RIFT fabric to achieve | |||
internet connectivity. For the purpose of forwarding of the entire | internet connectivity. For the purpose of forwarding of the entire | |||
RIFT fabric, an internal fabric prefix needs to be advertised in the | RIFT fabric, an internal fabric prefix needs to be advertised in the | |||
South Prefix TIE by ToF and spine nodes. | Prefix South TIE by ToF and spine nodes. | |||
5.13.1. Internet Default on the Leaf | 5.13.1. Internet Default on the Leaf | |||
In the case that the internet gateway is a leaf, the leaf node as the | In the case that the internet gateway is a leaf, the leaf node as the | |||
internet gateway needs to advertise a default route in its Prefix | internet gateway needs to advertise a default route in its Prefix | |||
North TIE. | North TIE. | |||
5.13.2. Internet Default on the ToFs | 5.13.2. Internet Default on the ToFs | |||
In the case that the internet gateway is a ToF, the ToF and spine | In the case that the internet gateway is a ToF, the ToF and spine | |||
skipping to change at line 1674 ¶ | skipping to change at line 1675 ¶ | |||
Nanjing | Nanjing | |||
210012 | 210012 | |||
China | China | |||
Email: zhang.zheng@zte.com.cn | Email: zhang.zheng@zte.com.cn | |||
Dmitry Afanasiev | Dmitry Afanasiev | |||
Yandex | Yandex | |||
Email: fl0w@yandex-team.ru | Email: fl0w@yandex-team.ru | |||
Pascal Thubert | Pascal Thubert | |||
Cisco Systems, Inc | Individual | |||
Building D | ||||
45 Allee des Ormes - BP1200 | ||||
06254 Mougins - Sophia Antipolis | ||||
France | France | |||
Phone: +33 497 23 26 34 | Email: pascal.thubert@gmail.com | |||
Email: pthubert@cisco.com | ||||
Tony Przygienda | Tony Przygienda | |||
Juniper Networks | Juniper Networks | |||
1194 N. Mathilda Ave | 1194 N. Mathilda Ave | |||
Sunnyvale, CA 94089 | Sunnyvale, CA 94089 | |||
United States of America | United States of America | |||
Email: prz@juniper.net | Email: prz@juniper.net | |||
End of changes. 26 change blocks. | ||||
72 lines changed or deleted | 69 lines changed or added | |||