<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc SYSTEM "rfc2629-xhtml.ent">

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="info"
     ipr="trust200902" obsoletes="" updates="" submissionType="IETF"
     consensus="true" number="9999" xml:lang="en" tocInclude="true" symRefs="true" sortRefs="true" version="3">

<!-- xml2rfc v2v3 conversion 2.23.0 -->

  <front>
|
    <title abbrev="BGP-Prefix SID in Large-Scale DCs">BGP-Prefix Segment in
    Large-Scale Data Centers</title>
|
    <seriesInfo name="RFC" value="9999"/>
    <author fullname="Clarence Filsfils" initials="C." role="editor" surname="Filsfils">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street/>
          <city>Brussels</city>
          <region/>
          <code/>
          <country>BE</country>
        </postal>
        <email>cfilsfil@cisco.com</email>
      </address>
    </author>
    <author fullname="Stefano Previdi" initials="S." surname="Previdi">
      <organization>Cisco Systems, Inc.</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <code/>
          <country>IT</country>
        </postal>
        <email>stefano@previdi.net</email>
      </address>
    </author>
    <author fullname="Gaurav Dawra" initials="G." surname="Dawra">
      <organization>LinkedIn</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <code/>
          <country>US</country>
        </postal>
        <email>gdawra.ietf@gmail.com</email>
      </address>
    </author>
    <author fullname="Ebben Aries" initials="E." surname="Aries">
      <organization>Juniper Networks</organization>
      <address>
        <postal>
          <street>1133 Innovation Way</street>
          <city>Sunnyvale</city>
          <code>CA 94089</code>
          <country>US</country>
        </postal>
        <email>exa@juniper.net</email>
      </address>
    </author>
    <author fullname="Petr Lapukhov" initials="P." surname="Lapukhov">
      <organization>Facebook</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <code/>
          <country>US</country>
        </postal>
        <email>petr@fb.com</email>
      </address>
    </author>
    <date month="July" year="2019"/>
    <workgroup>Network Working Group</workgroup>
    <abstract>
      <t>This document describes the motivation for, and benefits of, applying
      Segment Routing in BGP-based large-scale data-centers. It describes the
      design to deploy Segment Routing in those data-centers for both the
      MPLS and IPv6 dataplanes.</t>
    </abstract>
  </front>
  <middle>
|
| 85 | <section anchor="INTRO" numbered="true" toc="default">
|
| 86 | <name>Introduction</name>
|
| 87 | <t>Segment Routing (SR), as described in <xref target="I-D.ietf-spring-segment-routing" format="default"/> leverages the source routing
|
| 88 | paradigm. A node steers a packet through an ordered list of
|
| 89 | instructions, called segments. A segment can represent any instruction,
|
| 90 | topological or service-based. A segment can have a local semantic to an
|
| 91 | SR node or global within an SR domain. SR allows to enforce a flow
|
| 92 | through any topological path while maintaining per-flow state only at
|
| 93 | the ingress node to the SR domain. Segment Routing can be applied to the
|
| 94 | MPLS and IPv6 data-planes.</t>
|
| 95 | <t>The use-cases described in this document should be considered in the
|
| 96 | context of the BGP-based large-scale data-center (DC) design described
|
| 97 | in <xref target="RFC7938" format="default"/>. This document extends it by applying SR
|
| 98 | both with IPv6 and MPLS dataplane.</t>
|
| 99 | </section>
|
| 100 | <section anchor="LARGESCALEDC" numbered="true" toc="default">
|
| 101 | <name>Large Scale Data Center Network Design Summary</name>
|
| 102 | <t>This section provides a brief summary of the informational document
|
| 103 | <xref target="RFC7938" format="default"/> that outlines a practical network design
|
| 104 | suitable for data-centers of various scales:</t>
|
| 105 | <ul spacing="normal">
|
| 106 | <li>Data-center networks have highly symmetric topologies with
|
| 107 | multiple parallel paths between two server attachment points. The
|
| 108 | well-known Clos topology is most popular among the operators (as
|
| 109 | described in <xref target="RFC7938" format="default"/>). In a Clos topology, the
|
| 110 | minimum number of parallel paths between two elements is determined
|
| 111 | by the "width" of the "Tier-1" stage. See <xref target="FIGLARGE" format="default"/>
|
| 112 | below for an illustration of the concept.</li>
|
| 113 | <li>Large-scale data-centers commonly use a routing protocol, such as
|
| 114 | BGP-4 <xref target="RFC4271" format="default"/> in order to provide endpoint
|
| 115 | connectivity. Recovery after a network failure is therefore driven
|
| 116 | either by local knowledge of directly available backup paths or by
|
| 117 | distributed signaling between the network devices.</li>
|
| 118 | <li>Within data-center networks, traffic is load-shared using the
|
| 119 | Equal Cost Multipath (ECMP) mechanism. With ECMP, every network
|
| 120 | device implements a pseudo-random decision, mapping packets to one
|
| 121 | of the parallel paths by means of a hash function calculated over
|
| 122 | certain parts of the packet, typically a combination of various
|
| 123 | packet header fields.</li>
|
| 124 | </ul>
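      <t>As an illustration of the per-flow ECMP decision described above, the
      following minimal sketch hashes a flow's 5-tuple onto one of the
      parallel next-hops. It is illustrative only: actual devices use
      vendor-specific hash functions and field selections, and the helper
      name ecmp_next_hop is hypothetical.</t>
      <sourcecode name="" type="python"><![CDATA[
import hashlib

def ecmp_next_hop(five_tuple, next_hops):
    """Map a flow onto one of the parallel next-hops (per-flow ECMP)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

# All packets of one flow hash to the same path; different flows spread out.
flow = ("192.0.2.1", "192.0.2.11", 6, 49152, 80)  # src, dst, proto, sport, dport
print(ecmp_next_hop(flow, ["Node3", "Node4"]))
]]></sourcecode>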
|
      <t>The following is a schematic of a five-stage Clos topology with four
      devices in the "Tier-1" stage. Notice that the number of paths between Node1
      and Node12 equals four: the paths have to cross all of the Tier-1
      devices. At the same time, the number of paths between Node1 and Node2
      equals two, and these paths only cross Tier-2 devices. Other topologies
      are possible, but for simplicity only the topologies that have a single
      path from Tier-1 to Tier-3 are considered below. The rest could be
      treated similarly, with a few modifications to the logic.</t>
|
| 133 | <section anchor="REFDESIGN" numbered="true" toc="default">
|
| 134 | <name>Reference design</name>
|
| 135 | <figure anchor="FIGLARGE">
|
| 136 | <name>5-stage Clos topology</name>
|
| 137 | <artwork name="" type="" align="left" alt=""><![CDATA[ Tier-1
|
| 138 | +-----+
|
| 139 | |NODE |
|
| 140 | +->| 5 |--+
|
| 141 | | +-----+ |
|
| 142 | Tier-2 | | Tier-2
|
| 143 | +-----+ | +-----+ | +-----+
|
| 144 | +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
|
| 145 | | +-----| 3 |--+ | 6 | +--| 9 |-----+ |
|
| 146 | | | +-----+ +-----+ +-----+ | |
|
| 147 | | | | |
|
| 148 | | | +-----+ +-----+ +-----+ | |
|
| 149 | | +-----+---->|NODE |--+ |NODE | +--|NODE |-----+-----+ |
|
| 150 | | | | +---| 4 |--+->| 7 |--+--| 10 |---+ | | |
|
| 151 | | | | | +-----+ | +-----+ | +-----+ | | | |
|
| 152 | | | | | | | | | | |
|
| 153 | +-----+ +-----+ | +-----+ | +-----+ +-----+
|
| 154 | |NODE | |NODE | Tier-3 +->|NODE |--+ Tier-3 |NODE | |NODE |
|
| 155 | | 1 | | 2 | | 8 | | 11 | | 12 |
|
| 156 | +-----+ +-----+ +-----+ +-----+ +-----+
|
| 157 | | | | | | | | |
|
| 158 | A O B O <- Servers -> Z O O O
|
| 159 | ]]></artwork>
|
| 160 | </figure>
|
        <t>In the reference topology illustrated in <xref target="FIGLARGE" format="default"/>,
        it is assumed that:</t>
        <ul spacing="normal">
          <li>
            <t>Each node is its own AS (Node X has AS X). 4-byte AS numbers
            are recommended (<xref target="RFC6793" format="default"/>).</t>
            <ul spacing="normal">
              <li>For simple and efficient route propagation filtering,
              Node5, Node6, Node7 and Node8 use the same AS; Node3 and Node4
              use the same AS; and Node9 and Node10 use the same AS.</li>
              <li>If 2-byte autonomous system numbers are used, then for
              efficient usage of the scarce 2-byte Private Use AS pool,
              different Tier-3 nodes might use the same AS.</li>
              <li>Without loss of generality, this document simplifies these
              details and assumes that each node has its own AS.</li>
            </ul>
          </li>
          <li>Each node peers with its neighbors via a BGP session. If not
          specified, eBGP is assumed. In a specific use-case, iBGP will be
          used, but this will be called out explicitly in that case.</li>
          <li>
            <t>Each node originates the IPv4 address of its loopback interface
            into BGP and announces it to its neighbors.</t>
            <ul spacing="normal">
              <li>The loopback of Node X is 192.0.2.x/32.</li>
            </ul>
          </li>
        </ul>
|
        <t>In this document, the Tier-1, Tier-2 and Tier-3 nodes are referred
        to respectively as Spine, Leaf and ToR (top of rack) nodes. When a ToR
        node acts as a gateway to the "outside world", it is referred to as a
        border node.</t>
      </section>
    </section>
|
| 196 | <section anchor="OPENPROBS" numbered="true" toc="default">
|
| 197 | <name>Some open problems in large data-center networks</name>
|
| 198 | <t>The data-center network design summarized above provides means for
|
| 199 | moving traffic between hosts with reasonable efficiency. There are few
|
| 200 | open performance and reliability problems that arise in such design:
|
| 201 | </t>
|
      <ul spacing="normal">
        <li>ECMP routing is most commonly realized per-flow. This means that
        large, long-lived "elephant" flows may affect the performance of
        smaller, short-lived "mouse" flows and reduce the efficiency
        of per-flow load-sharing. In other words, per-flow ECMP does not
        perform efficiently when the flow-lifetime distribution is heavy-tailed.
        Furthermore, due to hash-function inefficiencies it is possible to
        have frequent flow collisions, where more flows get placed on one
        path than on others.</li>
        <li>Shortest-path routing with ECMP implements an oblivious routing
        model, which is not aware of the network imbalances. If the network
        symmetry is broken, for example due to link failures, utilization
        hotspots may appear. For example, if a link fails between Tier-1 and
        Tier-2 devices (e.g., Node5 and Node9), Tier-3 devices Node1 and
        Node2 will not be aware of that, since there are other paths
        available from the perspective of Node3. They will continue sending
        roughly equal traffic to Node3 and Node4 as if the failure did not
        exist, which may cause a traffic hotspot.</li>
        <li>Isolating faults in the network with multiple parallel paths and
        ECMP-based routing is non-trivial due to lack of determinism.
        Specifically, the connections from HostA to HostB may take a
        different path every time a new connection is formed, thus making
        consistent reproduction of a failure much more difficult. This
        complexity scales linearly with the number of parallel paths in the
        network, and stems from the random nature of path selection by the
        network devices.</li>
      </ul>
|
      <t>The following sections explain how to apply SR in the DC, for both
      the MPLS and IPv6 data-planes.</t>
    </section>
|
| 232 | <section anchor="APPLYSR" numbered="true" toc="default">
|
| 233 | <name>Applying Segment Routing in the DC with MPLS dataplane</name>
|
| 234 | <section anchor="BGPREFIXSEGMENT" numbered="true" toc="default">
|
| 235 | <name>BGP Prefix Segment (BGP-Prefix-SID)</name>
|
| 236 | <t>A BGP Prefix Segment is a segment associated with a BGP prefix. A
|
| 237 | BGP Prefix Segment is a network-wide instruction to forward the packet
|
| 238 | along the ECMP-aware best path to the related prefix.</t>
|
| 239 | <t>The BGP Prefix Segment is defined as the BGP-Prefix-SID Attribute
|
| 240 | in <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/> which contains an
|
| 241 | index. Throughout this document the BGP Prefix Segment Attribute is
|
| 242 | referred as the BGP-Prefix-SID and the encoded index as the
|
| 243 | label-index.</t>
|
| 244 | <t>In this document, the network design decision has been made to
|
| 245 | assume that all the nodes are allocated the same SRGB (Segment Routing
|
| 246 | Global Block), e.g. [16000, 23999]. This provides operational
|
| 247 | simplification as explained in <xref target="SINGLESRGB" format="default"/>, but this
|
| 248 | is not a requirement.</t>
|
        <t>For illustration purposes, when considering an MPLS data-plane, it
        is assumed that the label-index allocated to prefix 192.0.2.x/32 is X.
        As a result, a local label (16000+x) is allocated for prefix
        192.0.2.x/32 by each node throughout the DC fabric.</t>
        <t>When the IPv6 data-plane is considered, it is assumed that Node X is
        allocated the IPv6 address (segment) 2001:DB8::X.</t>
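        <t>The label derivation above is simple arithmetic. The following
        sketch makes it explicit; it is illustrative only (non-normative), and
        the helper name local_label is hypothetical:</t>
        <sourcecode name="" type="python"><![CDATA[
SRGB_BASE = 16000      # shared SRGB [16000, 23999] assumed in this document
SRGB_SIZE = 8000

def local_label(label_index: int) -> int:
    """Derive the local label from the shared SRGB and the label-index."""
    if not 0 <= label_index < SRGB_SIZE:
        raise ValueError("label-index falls outside the SRGB")
    return SRGB_BASE + label_index

assert local_label(11) == 16011   # label for 192.0.2.11/32 on every node
]]></sourcecode>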
|
      </section>
      <section anchor="eBGP8277" numbered="true" toc="default">
        <name>eBGP Labeled Unicast (RFC8277)</name>
        <t>Referring to <xref target="FIGLARGE" format="default"/> and <xref target="RFC7938" format="default"/>, the following design modifications are
        introduced:</t>
        <ul spacing="normal">
          <li>Each node peers with its neighbors via an eBGP session with
          extensions defined in <xref target="RFC8277" format="default"/> (named "eBGP8277"
          throughout this document) and with the BGP-Prefix-SID attribute
          extension as defined in <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>.</li>
          <li>The forwarding plane at Tier-2 and Tier-1 is MPLS.</li>
          <li>The forwarding plane at Tier-3 is either IP2MPLS (if the host
          sends IP traffic) or MPLS2MPLS (if the host sends
          MPLS-encapsulated traffic).</li>
        </ul>
|
| 270 | <t><xref target="FIGSMALL" format="default"/> zooms into a path from server A to server
|
| 271 | Z within the topology of <xref target="FIGLARGE" format="default"/>.</t>
|
| 272 | <figure anchor="FIGSMALL">
|
| 273 | <name>Path from A to Z via nodes 1, 4, 7, 10 and 11</name>
|
| 274 | <artwork name="" type="" align="left" alt=""><![CDATA[
|
| 275 | +-----+ +-----+ +-----+
|
| 276 | +---------->|NODE | |NODE | |NODE |
|
| 277 | | | 4 |--+->| 7 |--+--| 10 |---+
|
| 278 | | +-----+ +-----+ +-----+ |
|
| 279 | | |
|
| 280 | +-----+ +-----+
|
| 281 | |NODE | |NODE |
|
| 282 | | 1 | | 11 |
|
| 283 | +-----+ +-----+
|
| 284 | | |
|
| 285 | A <- Servers -> Z
|
| 286 | ]]></artwork>
|
| 287 | </figure>
|
        <t>Referring to <xref target="FIGLARGE" format="default"/> and <xref target="FIGSMALL" format="default"/>, and assuming the IP address, AS, and
        label-index allocation previously described, the following sections
        detail the control-plane operation and the data-plane states for the
        prefix 192.0.2.11/32 (loopback of Node11).</t>
|
| 292 | <section anchor="CONTROLPLANE" numbered="true" toc="default">
|
| 293 | <name>Control Plane</name>
|
| 294 | <t>Node11 originates 192.0.2.11/32 in BGP and allocates to it a
|
| 295 | BGP-Prefix-SID with label-index: index11 <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>.</t>
|
| 296 | <ul empty="true">
|
| 297 | <li><t>Node11 sends the following eBGP8277 update to Node10:</t>
|
| 298 | <dl spacing="compact">
|
| 299 | <dt>IP Prefix:</dt><dd>192.0.2.11/32</dd>
|
| 300 | <dt>Label:</dt><dd>Implicit-Null</dd>
|
| 301 | <dt>Next-hop:</dt><dd>Node11's interface address on the link to Node10</dd>
|
| 302 | <dt>AS Path:</dt><dd>{11}</dd>
|
| 303 | <dt>BGP-Prefix-SID:</dt><dd>Label-Index 11</dd>
|
| 304 | </dl>
|
| 305 | </li>
|
| 306 | </ul>
|
          <t>Node10 receives the above update. As it is SR capable, Node10 is
          able to interpret the BGP-Prefix-SID; hence, it understands that it
          should allocate to the NLRI the label from its own SRGB, offset by the
          Label-Index received in the BGP-Prefix-SID (16000+11, hence 16011),
          instead of allocating a non-deterministic label out of a
          dynamically allocated portion of the local label space. The
          implicit-null label in the NLRI tells Node10 that it is the
          penultimate hop and must pop the top label on the stack before
          forwarding traffic for this prefix to Node11.</t>
|
| 317 | <ul empty="true">
|
| 318 | <li><t>Then, Node10 sends the following eBGP8277 update to Node7:</t>
|
| 319 | <dl spacing="compact">
|
| 320 | <dt>IP Prefix:</dt><dd>192.0.2.11/32</dd>
|
| 321 | <dt>Label:</dt><dd>16011</dd>
|
| 322 | <dt>Next-hop:</dt><dd>Node10's interface address on the link to Node7</dd>
|
| 323 | <dt>AS Path:</dt><dd>{10, 11}</dd>
|
| 324 | <dt>BGP-Prefix-SID:</dt><dd>Label-Index 11</dd>
|
| 325 | </dl>
|
| 326 | </li>
|
| 327 | </ul>
|
| 328 | <t>Node7 receives the above update. As it is SR capable, Node7 is
|
| 329 | able to interpret the BGP-Prefix-SID and hence allocates the local
|
| 330 | (incoming) label 16011 (16000 + 11) to the NLRI (instead of
|
| 331 | allocating a "dynamic" local label from its label
|
| 332 | manager). Node7 uses the label in the received eBGP8277 NLRI as the
|
| 333 | outgoing label (the index is only used to derive the local/incoming
|
| 334 | label).</t>
|
| 335 | <ul empty="true">
|
| 336 | <li><t>Node7 sends the following eBGP8277 update to Node4:</t>
|
| 337 | <dl spacing="compact">
|
| 338 | <dt>Label:</dt><dd>16011</dd>
|
| 339 | <dt>Next-hop:</dt><dd>Node7's interface address on the link to Node4</dd>
|
| 340 | <dt>AS Path:</dt><dd>{7, 10, 11}</dd>
|
| 341 | <dt>BGP-Prefix-SID:</dt><dd>Label-Index 11</dd>
|
| 342 | </dl>
|
| 343 | </li>
|
| 344 | </ul>
|
          <t>Node4 receives the above update. As it is SR capable, Node4 is
          able to interpret the BGP-Prefix-SID and hence allocates the local
          (incoming) label 16011 to the NLRI (instead of allocating a
          "dynamic" local label from its label manager). Node4
          uses the label in the received eBGP8277 NLRI as the outgoing label (the
          index is only used to derive the local/incoming label).</t>
          <ul empty="true">
            <li><t>Node4 sends the following eBGP8277 update to Node1:</t>
              <dl spacing="compact">
                <dt>IP Prefix:</dt><dd>192.0.2.11/32</dd>
                <dt>Label:</dt><dd>16011</dd>
                <dt>Next-hop:</dt><dd>Node4's interface address on the link to Node1</dd>
                <dt>AS Path:</dt><dd>{4, 7, 10, 11}</dd>
                <dt>BGP-Prefix-SID:</dt><dd>Label-Index 11</dd>
              </dl>
            </li>
          </ul>
          <t>Node1 receives the above update. As it is SR capable, Node1 is
          able to interpret the BGP-Prefix-SID and hence allocates the local
          (incoming) label 16011 to the NLRI (instead of allocating a
          "dynamic" local label from its label manager). Node1
          uses the label in the received eBGP8277 NLRI as the outgoing label (the
          index is only used to derive the local/incoming label).</t>
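          <t>The walk-through above can be summarized by a small model: each
          SR-capable node derives its incoming label from the SRGB and the
          received label-index, and always takes its outgoing label from the
          received NLRI. The following sketch is illustrative only (the
          function name process_update is hypothetical, and label value 3
          stands in for Implicit-Null):</t>
          <sourcecode name="" type="python"><![CDATA[
IMPLICIT_NULL = 3        # signals the penultimate hop to pop the top label
SRGB_BASE = 16000

def process_update(received_label: int, label_index: int):
    """Return the (incoming, outgoing) labels an SR-capable node installs."""
    incoming = SRGB_BASE + label_index   # deterministic, from the BGP-Prefix-SID
    outgoing = received_label            # always taken from the received NLRI
    return incoming, outgoing

# Propagation of 192.0.2.11/32 (label-index 11) from Node11 towards Node1:
label_in_nlri = IMPLICIT_NULL            # as originated by Node11
for node in ("Node10", "Node7", "Node4", "Node1"):
    incoming, outgoing = process_update(label_in_nlri, 11)
    out = "POP" if outgoing == IMPLICIT_NULL else outgoing
    print(f"{node}: incoming {incoming} -> outgoing {out}")
    label_in_nlri = incoming             # the node re-advertises its incoming label
]]></sourcecode>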
|
        </section>
        <section anchor="DATAPLANE" numbered="true" toc="default">
          <name>Data Plane</name>
          <t>Referring to <xref target="FIGLARGE" format="default"/>, and assuming all nodes
          apply the same advertisement rules described above and all nodes
          have the same SRGB (16000-23999), here are the IP/MPLS forwarding
          tables for prefix 192.0.2.11/32 at Node1, Node4, Node7 and
          Node10.</t>
          <table anchor="NODE1FIB" align="center">
            <name>Node1 Forwarding Table</name>
            <thead>
              <tr>
                <th align="center">Incoming label or IP destination</th>
                <th align="center">Outgoing label</th>
                <th align="center">Outgoing Interface</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="center">16011</td>
                <td align="center">16011</td>
                <td align="center">ECMP{3, 4}</td>
              </tr>
              <tr>
                <td align="center">192.0.2.11/32</td>
                <td align="center">16011</td>
                <td align="center">ECMP{3, 4}</td>
              </tr>
            </tbody>
          </table>
          <table anchor="NODE4FIB" align="center">
            <name>Node4 Forwarding Table</name>
            <thead>
              <tr>
                <th align="center">Incoming label or IP destination</th>
                <th align="center">Outgoing label</th>
                <th align="center">Outgoing Interface</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="center">16011</td>
                <td align="center">16011</td>
                <td align="center">ECMP{7, 8}</td>
              </tr>
              <tr>
                <td align="center">192.0.2.11/32</td>
                <td align="center">16011</td>
                <td align="center">ECMP{7, 8}</td>
              </tr>
            </tbody>
          </table>
          <table anchor="NODE7FIB" align="center">
            <name>Node7 Forwarding Table</name>
            <thead>
              <tr>
                <th align="center">Incoming label or IP destination</th>
                <th align="center">Outgoing label</th>
                <th align="center">Outgoing Interface</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="center">16011</td>
                <td align="center">16011</td>
                <td align="center">10</td>
              </tr>
              <tr>
                <td align="center">192.0.2.11/32</td>
                <td align="center">16011</td>
                <td align="center">10</td>
              </tr>
            </tbody>
          </table>
|
| 447 | <table align="center">
|
| 448 | <name/>
|
| 449 | <thead>
|
| 450 | <tr>
|
| 451 | <th align="center">Incoming label or IP destination</th>
|
| 452 | <th align="center">Outgoing label</th>
|
| 453 | <th align="center">Outgoing Interface</th>
|
| 454 | </tr>
|
| 455 | </thead>
|
| 456 | <tbody>
|
| 457 | <tr>
|
| 458 | <td align="center">16011</td>
|
| 459 | <td align="center">POP</td>
|
| 460 | <td align="center">11</td>
|
| 461 | </tr>
|
| 462 | <tr>
|
| 463 | <td align="center">192.0.2.11/32</td>
|
| 464 | <td align="center">N/A</td>
|
| 465 | <td align="center">11</td>
|
| 466 | </tr>
|
| 467 | </tbody>
|
| 468 | </table>
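          <t>A minimal model of these tables, using Node1's entries from
          <xref target="NODE1FIB" format="default"/>, shows how both the MPLS and IP entries
          resolve to the same outgoing label and ECMP set. The structure and
          the helper name forward are illustrative, not an actual FIB
          implementation:</t>
          <sourcecode name="" type="python"><![CDATA[
# Node1's entries for 192.0.2.11/32: (action, outgoing label, ECMP next-hops).
FIB_NODE1 = {
    16011: ("swap", 16011, ["Node3", "Node4"]),
    "192.0.2.11/32": ("push", 16011, ["Node3", "Node4"]),
}

def forward(fib, key, flow_hash):
    """Resolve a label or IP destination and pick one ECMP next-hop."""
    action, label, next_hops = fib[key]
    return action, label, next_hops[flow_hash % len(next_hops)]

print(forward(FIB_NODE1, 16011, flow_hash=7))            # MPLS2MPLS
print(forward(FIB_NODE1, "192.0.2.11/32", flow_hash=7))  # IP2MPLS
]]></sourcecode>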
|
        </section>
        <section anchor="VARIATIONS" numbered="true" toc="default">
          <name>Network Design Variation</name>
          <t>A network design choice could consist of switching all the
          traffic through Tier-1 and Tier-2 as MPLS traffic. In this case, one
          could filter away the IP entries at Node4, Node7 and Node10. This
          might be beneficial in order to optimize the forwarding table
          size.</t>
          <t>A network design choice could consist of allowing the hosts to
          send MPLS-encapsulated traffic based on the Egress Peer Engineering
          (EPE) use-case as defined in <xref target="I-D.ietf-spring-segment-routing-central-epe" format="default"/>. For example,
          applications at HostA would send their Z-destined traffic to Node1
          with an MPLS label stack where the top label is 16011 and the next
          label is an EPE peer segment (<xref target="I-D.ietf-spring-segment-routing-central-epe" format="default"/>) at Node11
          directing the traffic to Z.</t>
        </section>
|
| 485 | <section anchor="FABRIC" numbered="true" toc="default">
|
| 486 | <name>Global BGP Prefix Segment through the fabric</name>
|
| 487 | <t>When the previous design is deployed, the operator enjoys global
|
| 488 | BGP-Prefix-SID and label allocation throughout the DC fabric.</t>
|
| 489 | <t>A few examples follow:</t>
|
| 490 | <ul spacing="normal">
|
| 491 | <li>Normal forwarding to Node11: a packet with top label 16011
|
| 492 | received by any node in the fabric will be forwarded along the
|
| 493 | ECMP-aware BGP best-path towards Node11 and the label 16011 is
|
| 494 | penultimate-popped at Node10 (or at Node 9).</li>
|
| 495 | <li>Traffic-engineered path to Node11: an application on a host
|
| 496 | behind Node1 might want to restrict its traffic to paths via the
|
| 497 | Spine node Node5. The application achieves this by sending its
|
| 498 | packets with a label stack of {16005, 16011}. BGP Prefix SID
|
| 499 | 16005 directs the packet up to Node5 along the path (Node1,
|
| 500 | Node3, Node5). BGP-Prefix-SID 16011 then directs the packet down
|
| 501 | to Node11 along the path (Node5, Node9, Node11).</li>
|
| 502 | </ul>
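          <t>With a homogeneous SRGB, building such a traffic-engineered label
          stack is pure arithmetic over the label-indexes. A minimal sketch
          (the helper name prefix_sid_stack is hypothetical):</t>
          <sourcecode name="" type="python"><![CDATA[
SRGB_BASE = 16000

def prefix_sid_stack(node_indexes):
    """Label stack for a path constrained through the listed nodes, in order."""
    return [SRGB_BASE + index for index in node_indexes]

# Restrict traffic to Node11 to paths via Spine node Node5:
assert prefix_sid_stack([5, 11]) == [16005, 16011]
]]></sourcecode>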
|
        </section>
        <section anchor="INCRDEP" numbered="true" toc="default">
          <name>Incremental Deployments</name>
          <t>The design previously described can be deployed incrementally.
          Let us assume that Node7 does not support the BGP-Prefix-SID, and let
          us show how the fabric connectivity is preserved.</t>
          <t>From a signaling viewpoint, nothing would change: even though
          Node7 does not support the BGP-Prefix-SID, it does propagate the
          attribute unmodified to its neighbors.</t>
          <t>From a label-allocation viewpoint, the only difference is that
          Node7 would allocate a dynamic (random) label to the prefix
          192.0.2.11/32 (e.g., 123456) instead of the "hinted" label as
          instructed by the BGP-Prefix-SID. The neighbors of Node7 adapt
          automatically, as they always use the label in the eBGP8277 NLRI as
          the outgoing label.</t>
          <t>Node4 does understand the BGP-Prefix-SID and hence allocates the
          indexed label in the SRGB (16011) for 192.0.2.11/32.</t>
          <t>As a result, all the data-plane entries across the network would
          be unchanged except the entries at Node7 and its neighbor Node4, as
          shown in the tables below.</t>
          <t>The key point is that the end-to-end Label Switched Path (LSP) is
          preserved because the outgoing label is always derived from the
          received label within the eBGP8277 NLRI. The index in the
          BGP-Prefix-SID is only used as a hint on how to allocate the local
          label (the incoming label) but never for the outgoing label.</t>
|
| 528 | <table anchor="NODE7FIBINC" align="center">
|
| 529 | <name>Node7 Forwarding Table</name>
|
| 530 | <thead>
|
| 531 | <tr>
|
| 532 | <th align="center">Incoming label or IP destination</th>
|
| 533 | <th align="center">Outgoing label</th>
|
| 534 | <th align="center">Outgoing interface</th>
|
| 535 | </tr>
|
| 536 | </thead>
|
| 537 | <tbody>
|
| 538 | <tr>
|
| 539 | <td align="center">12345</td>
|
| 540 | <td align="center">16011</td>
|
| 541 | <td align="center">10</td>
|
| 542 | </tr>
|
| 543 | </tbody>
|
| 544 | </table>
|
| 545 | <table anchor="NODE4FIBINC" align="center">
|
| 546 | <name>Node4 Forwarding Table</name>
|
| 547 | <thead>
|
| 548 | <tr>
|
| 549 | <th align="center">Incoming label or IP destination</th>
|
| 550 | <th align="center">Outgoing label</th>
|
| 551 | <th align="center">Outgoing interface</th>
|
| 552 | </tr>
|
| 553 | </thead>
|
| 554 | <tbody>
|
| 555 | <tr>
|
| 556 | <td align="center">16011</td>
|
| 557 | <td align="center">12345</td>
|
| 558 | <td align="center">7</td>
|
| 559 | </tr>
|
| 560 | </tbody>
|
| 561 | </table>
|
          <t>The BGP-Prefix-SID can thus be deployed incrementally, one node at
          a time.</t>
          <t>When deployed together with a homogeneous SRGB (same SRGB across
          the fabric), the operator incrementally enjoys the global prefix
          segment benefits as the deployment progresses through the
          fabric.</t>
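          <t>The preservation of the LSP through a non-SR node can be seen
          with a two-line extension of the earlier model. This is a sketch
          under the same assumptions, and the helper name allocate_incoming is
          hypothetical:</t>
          <sourcecode name="" type="python"><![CDATA[
SRGB_BASE = 16000

def allocate_incoming(sr_capable, label_index, dynamic_label):
    """SR-capable nodes honor the index hint; legacy nodes pick any free label."""
    return SRGB_BASE + label_index if sr_capable else dynamic_label

# Node7 (no BGP-Prefix-SID support) allocates e.g. 123456 and advertises it;
# Node4 still copies its outgoing label from what Node7 advertised, so the
# end-to-end LSP survives:
node7_incoming = allocate_incoming(False, 11, dynamic_label=123456)
node4_outgoing = node7_incoming
assert (node7_incoming, node4_outgoing) == (123456, 123456)
]]></sourcecode>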
|
        </section>
      </section>
      <section anchor="iBGP3107" numbered="true" toc="default">
        <name>iBGP Labeled Unicast (RFC8277)</name>
        <t>The same exact design as eBGP8277 is used with the following
        modifications:</t>
        <ul empty="true" spacing="normal">
          <li>All nodes use the same AS number.</li>
          <li>Each node peers with its neighbors via an internal BGP session
          (iBGP) with extensions defined in <xref target="RFC8277" format="default"/> (named
          "iBGP8277" throughout this document).</li>
          <li>Each node acts as a route-reflector for each of its neighbors
          and with the next-hop-self option. Next-hop-self is a well-known
          operational feature which consists of rewriting the next-hop of a
          BGP update prior to sending it to the neighbor. It is
          common practice to apply next-hop-self behavior towards iBGP peers
          for eBGP-learned routes. In the case outlined in this section, it
          is proposed to apply the next-hop-self mechanism to iBGP-learned
          routes as well.</li>
|
          <li>
            <figure anchor="IBGPFIG">
              <name>iBGP Sessions with Reflection and Next-Hop-Self</name>
              <artwork name="" type="" align="left" alt=""><![CDATA[
                      Cluster-1
                   +-----------+
                   |  Tier-1   |
                   |  +-----+  |
                   |  |NODE |  |
                   |  |  5  |  |
     Cluster-2     |  +-----+  |     Cluster-3
    +---------+    |           |    +---------+
    | Tier-2  |    |           |    | Tier-2  |
    | +-----+ |    |  +-----+  |    | +-----+ |
    | |NODE | |    |  |NODE |  |    | |NODE | |
    | |  3  | |    |  |  6  |  |    | |  9  | |
    | +-----+ |    |  +-----+  |    | +-----+ |
    |         |    |           |    |         |
    |         |    |           |    |         |
    | +-----+ |    |  +-----+  |    | +-----+ |
    | |NODE | |    |  |NODE |  |    | |NODE | |
    | |  4  | |    |  |  7  |  |    | | 10  | |
    | +-----+ |    |  +-----+  |    | +-----+ |
    +---------+    |           |    +---------+
                   |           |
                   |  +-----+  |
                   |  |NODE |  |
      Tier-3       |  |  8  |  |      Tier-3
  +-----+ +-----+  |  +-----+  |  +-----+ +-----+
  |NODE | |NODE |  +-----------+  |NODE | |NODE |
  |  1  | |  2  |                 | 11  | | 12  |
  +-----+ +-----+                 +-----+ +-----+
]]></artwork>
            </figure>
          </li>
|
          <li>
            <t>For simple and efficient route propagation filtering, and as
            illustrated in <xref target="IBGPFIG" format="default"/>:</t>
            <ul spacing="normal">
              <li>Node5, Node6, Node7 and Node8 use the same Cluster ID
              (Cluster-1)</li>
              <li>Node3 and Node4 use the same Cluster ID (Cluster-2)</li>
              <li>Node9 and Node10 use the same Cluster ID (Cluster-3)</li>
            </ul>
          </li>
          <li>The control-plane behavior is mostly the same as described in
          the previous section: the only difference is that the eBGP8277
          path propagation is simply replaced by an iBGP8277 path reflection
          with the next-hop changed to self.</li>
          <li>The data-plane tables are exactly the same.</li>
        </ul>
      </section>
    </section>
|
| 640 | <section anchor="IPV6" numbered="true" toc="default">
|
| 641 | <name>Applying Segment Routing in the DC with IPv6 dataplane</name>
|
| 642 | <t>The design described in <xref target="RFC7938" format="default"/> is reused with one
|
| 643 | single modification. It is highlighted using the example of the
|
| 644 | reachability to Node11 via spine node Node5.</t>
|
| 645 | <t>Node5 originates 2001:DB8::5/128 with the attached BGP-Prefix-SID for
|
| 646 | IPv6 packets destined to segment 2001:DB8::5 (<xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>).</t>
|
| 647 | <t>Node11 originates 2001:DB8::11/128 with the attached BGP-Prefix-SID
|
| 648 | advertising the support of the SRH for IPv6 packets destined to segment
|
| 649 | 2001:DB8::11.</t>
|
| 650 | <t>The control-plane and data-plane processing of all the other nodes in
|
| 651 | the fabric is unchanged. Specifically, the routes to 2001:DB8::5 and
|
| 652 | 2001:DB8::11 are installed in the FIB along the eBGP best-path to Node5
|
| 653 | (spine node) and Node11 (ToR node) respectively.</t>
|
| 654 | <t>An application on HostA which needs to send traffic to HostZ via only
|
| 655 | Node5 (spine node) can do so by sending IPv6 packets with a Segment
|
| 656 | Routing header (SRH, <xref target="I-D.ietf-6man-segment-routing-header" format="default"/>). The destination
|
| 657 | address and active segment is set to 2001:DB8::5. The next and last
|
| 658 | segment is set to 2001:DB8::11.</t>
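      <t>Schematically, and without reproducing the exact SRH field layout
      defined in <xref target="I-D.ietf-6man-segment-routing-header" format="default"/>, such a
      packet could be modeled as follows (illustration only; the dictionary
      layout is not a wire format):</t>
      <sourcecode name="" type="python"><![CDATA[
# The SRH carries the segment list in reverse order; Segments Left indexes
# the active segment.
segments = ["2001:DB8::11", "2001:DB8::5"]
packet = {
    "dst": "2001:DB8::5",                      # active segment (via Node5)
    "srh": {"segments": segments, "segments_left": 1},
}

# When Node5 completes its segment, it decrements Segments Left and rewrites
# the destination address to the next (and last) segment:
packet["srh"]["segments_left"] -= 1
packet["dst"] = segments[packet["srh"]["segments_left"]]
assert packet["dst"] == "2001:DB8::11"
]]></sourcecode>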
|
      <t>The application must only use IPv6 addresses that have been
      advertised as capable of SRv6 segment processing (e.g., those for which
      the BGP Prefix Segment capability has been advertised). How applications
      learn this (e.g., via a centralized controller and orchestration) is
      outside the scope of this document.</t>
    </section>
|
| 665 | <section anchor="COMMHOSTS" numbered="true" toc="default">
|
| 666 | <name>Communicating path information to the host</name>
|
| 667 | <t>There are two general methods for communicating path information to
|
| 668 | the end-hosts: "proactive" and "reactive", aka "push" and "pull" models.
|
| 669 | There are multiple ways to implement either of these methods. Here, it
|
| 670 | is noted that one way could be using a centralized controller: the
|
| 671 | controller either tells the hosts of the prefix-to-path mappings
|
| 672 | beforehand and updates them as needed (network event driven push), or
|
| 673 | responds to the hosts making request for a path to specific destination
|
| 674 | (host event driven pull). It is also possible to use a hybrid model,
|
| 675 | i.e., pushing some state from the controller in response to particular
|
| 676 | network events, while the host pulls other state on demand.</t>
|
      <t>It is also noted that, when disseminating network-related data to the
      end-hosts, a trade-off is made to balance the amount of information
      against the level of visibility into the network state. This applies
      both to push and pull models. In the extreme case, the host would
      request path information on every flow and keep no local state at all.
      On the other end of the spectrum, information for every prefix in the
      network, along with the available paths, could be pushed and
      continuously updated on all hosts.</t>
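      <t>As a toy illustration of the pull model only, the sketch below shows
      a host requesting a segment list from a controller and caching it. All
      names (PATHS, request_path) are hypothetical; a real controller API and
      its transport are out of scope here:</t>
      <sourcecode name="" type="python"><![CDATA[
# The controller's (simplified) view: prefix -> candidate segment lists.
PATHS = {"192.0.2.11/32": {"default": [16011], "via-node5": [16005, 16011]}}

def request_path(prefix, policy="default"):
    """Host-side stub: ask the controller for a segment list to a prefix."""
    return PATHS[prefix][policy]

host_cache = {}                                  # host keeps minimal local state
host_cache["192.0.2.11/32"] = request_path("192.0.2.11/32", policy="via-node5")
assert host_cache["192.0.2.11/32"] == [16005, 16011]
]]></sourcecode>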
|
    </section>
    <section anchor="BENEFITS" numbered="true" toc="default">
      <name>Additional Benefits</name>
      <section anchor="MPLSIMPLE" numbered="true" toc="default">
        <name>MPLS Dataplane with Operational Simplicity</name>
        <t>As required by <xref target="RFC7938" format="default"/>, no new signaling protocol
        is introduced. The BGP-Prefix-SID is a lightweight extension to BGP
        Labeled Unicast <xref target="RFC8277" format="default"/>. It applies either to eBGP- or
        iBGP-based designs.</t>
        <t>Specifically, LDP and RSVP-TE are not used. These protocols would
        drastically impact the operational complexity of the data center and
        would not scale. This is in line with the requirements expressed in
        <xref target="RFC7938" format="default"/>.</t>
        <t>Provided the same SRGB is configured on all nodes, all nodes use
        the same MPLS label for a given IP prefix. This is simpler from an
        operational standpoint, as discussed in <xref target="SINGLESRGB" format="default"/>.</t>
      </section>
|
| 702 | <section anchor="MINFIB" numbered="true" toc="default">
|
| 703 | <name>Minimizing the FIB table</name>
|
| 704 | <t>The designer may decide to switch all the traffic at Tier-1 and
|
| 705 | Tier-2's based on MPLS, hence drastically decreasing the IP table size
|
| 706 | at these nodes.</t>
|
| 707 | <t>This is easily accomplished by encapsulating the traffic either
|
| 708 | directly at the host or the source ToR node by pushing the
|
| 709 | BGP-Prefix-SID of the destination ToR for intra-DC traffic, or the
|
| 710 | BGP-Prefix-SID for the the border node for inter-DC or
|
| 711 | DC-to-outside-world traffic.</t>
|
| 712 | </section>
|
| 713 | <section anchor="EPE" numbered="true" toc="default">
|
| 714 | <name>Egress Peer Engineering</name>
|
| 715 | <t>It is straightforward to combine the design illustrated in this
|
| 716 | document with the Egress Peer Engineering (EPE) use-case described in
|
| 717 | <xref target="I-D.ietf-spring-segment-routing-central-epe" format="default"/>.</t>
|
| 718 | <t>In such case, the operator is able to engineer its outbound traffic
|
| 719 | on a per host-flow basis, without incurring any additional state at
|
| 720 | intermediate points in the DC fabric.</t>
|
        <t>For example, the controller only needs to inject a per-flow state
        on HostA to force it to send its traffic destined to a specific
        Internet destination D via a selected border node (say Node12 in <xref target="FIGLARGE" format="default"/> instead of another border node, Node11) and a
        specific egress peer of Node12 (say peer AS 9999 of local PeerNode
        segment 9999 at Node12, instead of any other peer which provides a path
        to the destination D). Any packet matching this state at HostA would
        be encapsulated with the SR segment list (label stack) {16012, 9999}.
        16012 would steer the flow through the DC fabric, leveraging any ECMP,
        along the best path to border node Node12. Once the flow gets to
        border node Node12, the active segment is 9999 (because of PHP on the
        upstream neighbor of Node12). This EPE PeerNode segment forces border
        node Node12 to forward the packet to peer AS 9999, without any IP
        lookup at the border node. There is no per-flow state for this
        engineered flow in the DC fabric. A benefit of segment routing is that
        the per-flow state is only required at the source.</t>
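        <t>The label stack for such an engineered flow is again simple
        arithmetic over the homogeneous SRGB, with the EPE PeerNode segment
        appended. A minimal sketch (the helper name epe_stack is
        hypothetical):</t>
        <sourcecode name="" type="python"><![CDATA[
SRGB_BASE = 16000

def epe_stack(border_node_index, peer_segment):
    """Segment list steering a flow to a border node, then out a chosen peer."""
    return [SRGB_BASE + border_node_index, peer_segment]

# Steer HostA's traffic for destination D via Node12 and its peer AS 9999:
assert epe_stack(12, 9999) == [16012, 9999]
]]></sourcecode>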
|
        <t>As well as allowing full traffic-engineering control, such a design
        also offers FIB table minimization benefits, as the Internet-scale FIB
        at border node Node12 is not required if all FIB lookups are avoided
        there by using EPE.</t>
      </section>
|
| 741 | <section anchor="ANYCAST" numbered="true" toc="default">
|
| 742 | <name>Anycast</name>
|
| 743 | <t>The design presented in this document preserves the availability
|
| 744 | and load-balancing properties of the base design presented in <xref target="I-D.ietf-spring-segment-routing" format="default"/>.</t>
|
| 745 | <t>For example, one could assign an anycast loopback 192.0.2.20/32 and
|
| 746 | associate segment index 20 to it on the border Node11 and Node12 (in
|
| 747 | addition to their node-specific loopbacks). Doing so, the EPE
|
| 748 | controller could express a default "go-to-the-Internet via any border
|
| 749 | node" policy as segment list {16020}. Indeed, from any host in the DC
|
| 750 | fabric or from any ToR node, 16020 steers the packet towards the
|
| 751 | border Node11 or Node12 leveraging ECMP where available along the best
|
| 752 | paths to these nodes.</t>
|
| 753 | </section>
|
| 754 | </section>
|
| 755 | <section anchor="SINGLESRGB" numbered="true" toc="default">
|
| 756 | <name>Preferred SRGB Allocation</name>
|
| 757 | <t>In the MPLS case, it is recommend to use same SRGBs at each node.</t>
|
| 758 | <t>Different SRGBs in each node likely increase the complexity of the
|
| 759 | solution both from an operational viewpoint and from a controller
|
| 760 | viewpoint.</t>
|
| 761 | <t>From an operation viewpoint, it is much simpler to have the same
|
| 762 | global label at every node for the same destination (the MPLS
|
| 763 | troubleshooting is then similar to the IPv6 troubleshooting where this
|
| 764 | global property is a given).</t>
|
| 765 | <t>From a controller viewpoint, this allows us to construct simple
|
| 766 | policies applicable across the fabric.</t>
|
      <t>Let us consider two applications, A and B, respectively connected to
      Node1 and Node2 (ToR nodes). A has two flows, FA1 and FA2, destined to Z.
      B has two flows, FB1 and FB2, destined to Z. The controller wants FA1 and
      FB1 to be load-shared across the fabric, while FA2 and FB2 must be
      respectively steered via Node5 and Node8.</t>
      <t>Assuming a consistent unique SRGB across the fabric as described in
      this document, the controller can simply do it by instructing A and B to
      use {16011} respectively for FA1 and FB1, and by instructing A and B to
      use {16005 16011} and {16008 16011} respectively for FA2 and FB2.</t>
|
      <t>Let us assume a design where the SRGB is different at every node and
      where the SRGB of each node is advertised using the Originator SRGB TLV
      of the BGP-Prefix-SID as defined in <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>: the SRGB of Node K starts at value
      K*1000 and the SRGB length is 1000 (e.g., Node1's SRGB is [1000,
      1999], Node2's SRGB is [2000, 2999], ...).</t>
|
      <t>In this case, not only would the controller need to collect and store
      all of these different SRGBs (e.g., through the Originator SRGB
      TLV of the BGP-Prefix-SID), it would also need to adapt the
      policy for each host. Indeed, the controller would instruct A to use
      {1011} for FA1 while it would have to instruct B to use {2011} for FB1
      (while with the same SRGB, both policies are the same {16011}).</t>
      <t>Even worse, the controller would instruct A to use {1005, 5011} for
      FA2 while it would instruct B to use {2008, 8011} for FB2 (while with
      the same SRGB, the second segment is the same across both policies:
      16011). When combining segments to create a policy, one needs to
      carefully update the label of each segment. This is obviously more
      error-prone, more complex and more difficult to troubleshoot.</t>
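      <t>The difference can be stated in two lines of arithmetic. The sketch
      below contrasts the two designs under the assumptions of this section
      (the helper name sid_label is hypothetical):</t>
      <sourcecode name="" type="python"><![CDATA[
def sid_label(srgb_base, label_index):
    """Label for a segment, as seen by the node that will process it."""
    return srgb_base + label_index

# Homogeneous SRGB [16000, 23999]: one policy serves both hosts.
assert [sid_label(16000, 5), sid_label(16000, 11)] == [16005, 16011]

# Per-node SRGBs (Node K's SRGB starts at K*1000): every label depends on
# which node processes it, so the controller must rewrite each policy.
assert [sid_label(1000, 5), sid_label(5000, 11)] == [1005, 5011]   # A's FA2
assert [sid_label(2000, 8), sid_label(8000, 11)] == [2008, 8011]   # B's FB2
]]></sourcecode>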
|
    </section>
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document does not make any IANA request.</t>
    </section>
|
| 798 | <section anchor="MANAGE" numbered="true" toc="default">
|
| 799 | <name>Manageability Considerations</name>
|
| 800 | <t>The design and deployment guidelines described in this document are
|
| 801 | based on the network design described in <xref target="RFC7938" format="default"/>.</t>
|
| 802 | <t>The deployment model assumed in this document is based on a single
|
| 803 | domain where the interconnected DCs are part of the same administrative
|
| 804 | domain (which, of course, is split into different autonomous systems).
|
| 805 | The operator has full control of the whole domain and the usual
|
| 806 | operational and management mechanisms and procedures are used in order
|
| 807 | to prevent any information related to internal prefixes and topology to
|
| 808 | be leaked outside the domain.</t>
|
      <t>As recommended in <xref target="I-D.ietf-spring-segment-routing" format="default"/>,
      the same SRGB should be allocated in all nodes in order to facilitate
      the design, deployment and operations of the domain.</t>
      <t>When EPE (<xref target="I-D.ietf-spring-segment-routing-central-epe" format="default"/>) is used (as
      explained in <xref target="EPE" format="default"/>), the same operational model is
      assumed. EPE information is originated and propagated throughout the
      domain towards an internal server, and unless explicitly configured by
      the operator, no EPE information is leaked outside the domain
      boundaries.</t>
    </section>
|
| 819 | <section anchor="SEC" numbered="true" toc="default">
|
| 820 | <name>Security Considerations</name>
|
| 821 | <t>This document proposes to apply Segment Routing to a well known
|
| 822 | scalability requirement expressed in <xref target="RFC7938" format="default"/> using the
|
| 823 | BGP-Prefix-SID as defined in <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>.</t>
|
| 824 | <t>It has to be noted, as described in <xref target="MANAGE" format="default"/> that the
|
| 825 | design illustrated in <xref target="RFC7938" format="default"/> and in this document,
|
| 826 | refer to a deployment model where all nodes are under the same
|
| 827 | administration. In this context, it is assumed that the operator doesn't
|
| 828 | want to leak outside of the domain any information related to internal
|
| 829 | prefixes and topology. The internal information includes prefix-sid and
|
| 830 | EPE information. In order to prevent such leaking, the standard BGP
|
| 831 | mechanisms (filters) are applied on the boundary of the domain.</t>
|
| 832 | <t>Therefore, the solution proposed in this document does not introduce
|
| 833 | any additional security concerns from what expressed in <xref target="RFC7938" format="default"/> and <xref target="I-D.ietf-idr-bgp-prefix-sid" format="default"/>. It
|
| 834 | is assumed that the security and confidentiality of the prefix and
|
| 835 | topology information is preserved by outbound filters at each peering
|
| 836 | point of the domain as described in <xref target="MANAGE" format="default"/>.</t>
|
| 837 | </section>
|
| 838 | <section anchor="Acknowledgements" numbered="true" toc="default">
|
| 839 | <name>Acknowledgements</name>
|
| 840 | <t>The authors would like to thank Benjamin Black, Arjun Sreekantiah,
|
| 841 | Keyur Patel, Acee Lindem and Anoop Ghanwani for their comments and
|
| 842 | review of this document.</t>
|
| 843 | </section>
|
| 844 | <section anchor="Contributors" numbered="true" toc="default">
|
| 845 | <name>Contributors</name>
|
| 846 | <artwork><![CDATA[
|
| 847 | Gaya Nagarajan
|
| 848 | Facebook
|
| 849 | US
|
| 850 |
|
| 851 | Email: gaya@fb.com
|
| 852 |
|
| 853 |
|
| 854 | Gaurav Dawra
|
| 855 | Cisco Systems
|
| 856 | US
|
| 857 |
|
| 858 | Email: gdawra.ietf@gmail.com
|
| 859 |
|
| 860 |
|
| 861 | Dmitry Afanasiev
|
| 862 | Yandex
|
| 863 | RU
|
| 864 |
|
| 865 | Email: fl0w@yandex-team.ru
|
| 866 |
|
| 867 |
|
| 868 | Tim Laberge
|
| 869 | Cisco
|
| 870 | US
|
| 871 |
|
| 872 | Email: tlaberge@cisco.com
|
| 873 |
|
| 874 |
|
| 875 | Edet Nkposong
|
| 876 | Salesforce.com Inc.
|
| 877 | US
|
| 878 |
|
| 879 | Email: enkposong@salesforce.com
|
| 880 |
|
| 881 |
|
| 882 | Mohan Nanduri
|
| 883 | Microsoft
|
| 884 | US
|
| 885 |
|
| 886 | Email: mnanduri@microsoft.com
|
| 887 |
|
| 888 |
|
| 889 | James Uttaro
|
| 890 | ATT
|
| 891 | US
|
| 892 |
|
| 893 | Email: ju1738@att.com
|
| 894 |
|
| 895 |
|
| 896 | Saikat Ray
|
| 897 | Unaffiliated
|
| 898 | US
|
| 899 |
|
| 900 | Email: raysaikat@gmail.com
|
| 901 |
|
| 902 | Jon Mitchell
|
| 903 | Unaffiliated
|
| 904 | US
|
| 905 |
|
| 906 | Email: jrmitche@puck.nether.net
|
| 907 | ]]></artwork>
|
| 908 |
|
| 909 | </section>
|
| 910 | </middle>
|
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119"
                   target="https://www.rfc-editor.org/info/rfc2119"
                   xml:base="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement
            Levels</title>
            <seriesInfo name="DOI" value="10.17487/RFC2119"/>
            <seriesInfo name="RFC" value="2119"/>
            <seriesInfo name="BCP" value="14"/>
            <author initials="S." surname="Bradner" fullname="S. Bradner">
              <organization/>
            </author>
            <date year="1997" month="March"/>
            <abstract>
              <t>In many standards track documents several words are used to
              signify the requirements in the specification. These words are
              often capitalized. This document defines these words as they
              should be interpreted in IETF documents. This document
              specifies an Internet Best Current Practices for the Internet
              Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="RFC8277"
                   target="https://www.rfc-editor.org/info/rfc8277"
                   xml:base="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8277.xml">
          <front>
            <title>Using BGP to Bind MPLS Labels to Address Prefixes</title>
            <seriesInfo name="DOI" value="10.17487/RFC8277"/>
            <seriesInfo name="RFC" value="8277"/>
            <author initials="E." surname="Rosen" fullname="E. Rosen">
              <organization/>
            </author>
            <date year="2017" month="October"/>
            <abstract>
              <t>This document specifies a set of procedures for using BGP to
              advertise that a specified router has bound a specified MPLS
              label (or a specified sequence of MPLS labels organized as a
              contiguous part of a label stack) to a specified address prefix.
              This can be done by sending a BGP UPDATE message whose Network
              Layer Reachability Information field contains both the prefix
              and the MPLS label(s) and whose Next Hop field identifies the
              node at which said prefix is bound to said label(s). This
              document obsoletes RFC 3107.</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="RFC4271"
                   target="https://www.rfc-editor.org/info/rfc4271"
                   xml:base="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4271.xml">
          <front>
            <title>A Border Gateway Protocol 4 (BGP-4)</title>
            <seriesInfo name="DOI" value="10.17487/RFC4271"/>
            <seriesInfo name="RFC" value="4271"/>
            <author initials="Y." surname="Rekhter" fullname="Y. Rekhter" role="editor">
              <organization/>
            </author>
            <author initials="T." surname="Li" fullname="T. Li" role="editor">
              <organization/>
            </author>
            <author initials="S." surname="Hares" fullname="S. Hares" role="editor">
              <organization/>
            </author>
            <date year="2006" month="January"/>
            <abstract>
              <t>This document discusses the Border Gateway Protocol (BGP),
              which is an inter-Autonomous System routing protocol.</t>
              <t>The primary function of a BGP speaking system is to exchange
              network reachability information with other BGP systems. This
              network reachability information includes information on the
              list of Autonomous Systems (ASes) that reachability information
              traverses. This information is sufficient for constructing a
              graph of AS connectivity for this reachability from which
              routing loops may be pruned, and, at the AS level, some policy
              decisions may be enforced.</t>
              <t>BGP-4 provides a set of mechanisms for supporting Classless
              Inter-Domain Routing (CIDR). These mechanisms include support
              for advertising a set of destinations as an IP prefix, and
              eliminating the concept of network "class" within BGP. BGP-4
              also introduces mechanisms that allow aggregation of routes,
              including aggregation of AS paths.</t>
              <t>This document obsoletes RFC 1771. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="RFC7938"
                   target="https://www.rfc-editor.org/info/rfc7938"
                   xml:base="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7938.xml">
          <front>
            <title>Use of BGP for Routing in Large-Scale Data Centers</title>
            <seriesInfo name="DOI" value="10.17487/RFC7938"/>
            <seriesInfo name="RFC" value="7938"/>
            <author initials="P." surname="Lapukhov" fullname="P. Lapukhov">
              <organization/>
            </author>
            <author initials="A." surname="Premji" fullname="A. Premji">
              <organization/>
            </author>
            <author initials="J." surname="Mitchell" fullname="J. Mitchell" role="editor">
              <organization/>
            </author>
            <date year="2016" month="August"/>
            <abstract>
              <t>Some network operators build and operate data centers that
              support over one hundred thousand servers. In this document,
              such data centers are referred to as "large-scale" to
              differentiate them from smaller infrastructures. Environments
              of this scale have a unique set of network requirements with an
              emphasis on operational simplicity and network stability. This
              document summarizes operational experience in designing and
              operating large-scale data centers using BGP as the only routing
              protocol. The intent is to report on a proven and stable
              routing design that could be leveraged by others in the
              industry.</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="I-D.ietf-spring-segment-routing"
                   target="http://www.ietf.org/internet-drafts/draft-ietf-spring-segment-routing-15.txt">
          <front>
            <title>Segment Routing Architecture</title>
            <seriesInfo name="Internet-Draft"
                        value="draft-ietf-spring-segment-routing-15"/>
            <author initials="C" surname="Filsfils" fullname="Clarence Filsfils">
              <organization/>
            </author>
            <author initials="S" surname="Previdi" fullname="Stefano Previdi">
              <organization/>
            </author>
            <author initials="L" surname="Ginsberg" fullname="Les Ginsberg">
              <organization/>
            </author>
            <author initials="B" surname="Decraene" fullname="Bruno Decraene">
              <organization/>
            </author>
            <author initials="S" surname="Litkowski" fullname="Stephane Litkowski">
              <organization/>
            </author>
            <author initials="R" surname="Shakir" fullname="Rob Shakir">
              <organization/>
            </author>
            <date month="January" day="25" year="2018"/>
            <abstract>
              <t>Segment Routing (SR) leverages the source routing paradigm.
              A node steers a packet through an ordered list of instructions,
              called segments. A segment can represent any instruction,
              topological or service-based. A segment can have a semantic
              local to an SR node or global within an SR domain. SR allows to
              enforce a flow through any topological path while maintaining
              per-flow state only at the ingress nodes to the SR domain.
              Segment Routing can be directly applied to the MPLS
              architecture with no change on the forwarding plane. A segment
              is encoded as an MPLS label. An ordered list of segments is
              encoded as a stack of labels. The segment to process is on the
              top of the stack. Upon completion of a segment, the related
              label is popped from the stack. Segment Routing can be applied
              to the IPv6 architecture, with a new type of routing header. A
              segment is encoded as an IPv6 address. An ordered list of
              segments is encoded as an ordered list of IPv6 addresses in the
              routing header. The active segment is indicated by the
              Destination Address of the packet. The next active segment is
              indicated by a pointer in the new routing header.</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="I-D.ietf-idr-bgp-prefix-sid"
                   target="http://www.ietf.org/internet-drafts/draft-ietf-idr-bgp-prefix-sid-27.txt">
          <front>
            <title>Segment Routing Prefix SID extensions for BGP</title>
            <seriesInfo name="Internet-Draft"
                        value="draft-ietf-idr-bgp-prefix-sid-27"/>
            <author initials="S" surname="Previdi" fullname="Stefano Previdi">
              <organization/>
            </author>
            <author initials="C" surname="Filsfils" fullname="Clarence Filsfils">
              <organization/>
            </author>
            <author initials="A" surname="Lindem" fullname="Acee Lindem">
              <organization/>
            </author>
            <author initials="A" surname="Sreekantiah" fullname="Arjun Sreekantiah">
              <organization/>
            </author>
            <author initials="H" surname="Gredler" fullname="Hannes Gredler">
              <organization/>
            </author>
            <date month="June" day="26" year="2018"/>
            <abstract>
              <t>Segment Routing (SR) leverages the source routing paradigm.
              A node steers a packet through an ordered list of instructions,
              called segments. A segment can represent any instruction,
              topological or service-based. The ingress node prepends an SR
              header to a packet containing a set of segment identifiers
              (SID). Each SID represents a topological or a service-based
              instruction. Per-flow state is maintained only on the ingress
              node of the SR domain. An SR domain is defined as a single
              administrative domain for global SID assignment. This document
              defines an optional, transitive BGP attribute for announcing BGP
              Prefix Segment Identifiers (BGP Prefix-SID) information and the
              specification for SR-MPLS SIDs.</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="I-D.ietf-spring-segment-routing-central-epe"
                   target="http://www.ietf.org/internet-drafts/draft-ietf-spring-segment-routing-central-epe-10.txt">
          <front>
            <title>Segment Routing Centralized BGP Egress Peer
            Engineering</title>
            <seriesInfo name="Internet-Draft"
                        value="draft-ietf-spring-segment-routing-central-epe-10"/>
            <author initials="C" surname="Filsfils" fullname="Clarence Filsfils">
              <organization/>
            </author>
            <author initials="S" surname="Previdi" fullname="Stefano Previdi">
              <organization/>
            </author>
            <author initials="G" surname="Dawra" fullname="Gaurav Dawra">
              <organization/>
            </author>
            <author initials="E" surname="Aries" fullname="Ebben Aries">
              <organization/>
            </author>
            <author initials="D" surname="Afanasiev" fullname="Dmitry Afanasiev">
              <organization/>
            </author>
            <date month="December" day="21" year="2017"/>
            <abstract>
              <t>Segment Routing (SR) leverages source routing. A node steers
              a packet through a controlled set of instructions, called
              segments, by prepending the packet with an SR header. A segment
              can represent any instruction, topological or service-based. SR
              allows to enforce a flow through any topological path while
              maintaining per-flow state only at the ingress node of the SR
              domain. The Segment Routing architecture can be directly
              applied to the MPLS dataplane with no change on the forwarding
              plane. It requires a minor extension to the existing link-state
              routing protocols. This document illustrates the application of
              Segment Routing to solve the BGP Egress Peer Engineering
              (BGP-EPE) requirement. The SR-based BGP-EPE solution allows a
              centralized (Software Defined Network, SDN) controller to
              program any egress peer policy at ingress border routers or at
              hosts within the domain.</t>
            </abstract>
          </front>
        </reference>
      </references>
      <references>
        <name>Informative References</name>
        <reference anchor="RFC6793"
                   target="https://www.rfc-editor.org/info/rfc6793"
                   xml:base="https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.6793.xml">
          <front>
            <title>BGP Support for Four-Octet Autonomous System (AS) Number
            Space</title>
            <seriesInfo name="DOI" value="10.17487/RFC6793"/>
            <seriesInfo name="RFC" value="6793"/>
            <author initials="Q." surname="Vohra" fullname="Q. Vohra">
              <organization/>
            </author>
            <author initials="E." surname="Chen" fullname="E. Chen">
              <organization/>
            </author>
            <date year="2012" month="December"/>
            <abstract>
              <t>The Autonomous System number is encoded as a two-octet entity
              in the base BGP specification. This document describes
              extensions to BGP to carry the Autonomous System numbers as
              four-octet entities. This document obsoletes RFC 4893 and
              updates RFC 4271. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
        </reference>
        <reference anchor="I-D.ietf-6man-segment-routing-header"
                   target="http://www.ietf.org/internet-drafts/draft-ietf-6man-segment-routing-header-21.txt">
          <front>
            <title>IPv6 Segment Routing Header (SRH)</title>
            <seriesInfo name="Internet-Draft"
                        value="draft-ietf-6man-segment-routing-header-21"/>
            <author initials="C" surname="Filsfils" fullname="Clarence Filsfils">
              <organization/>
            </author>
            <author initials="D" surname="Dukes" fullname="Darren Dukes">
              <organization/>
            </author>
            <author initials="S" surname="Previdi" fullname="Stefano Previdi">
              <organization/>
            </author>
            <author initials="J" surname="Leddy" fullname="John Leddy">
              <organization/>
            </author>
            <author initials="S" surname="Matsushima" fullname="Satoru Matsushima">
              <organization/>
            </author>
            <author initials="D" surname="Voyer" fullname="Daniel Voyer">
              <organization/>
            </author>
            <date month="June" day="13" year="2019"/>
            <abstract>
              <t>Segment Routing can be applied to the IPv6 data plane using a
              new type of Routing Extension Header. This document describes
              the Segment Routing Extension Header and how it is used by
              Segment Routing capable nodes.</t>
            </abstract>
          </front>
        </reference>
      </references>
    </references>
  </back>
</rfc>