
Global MPLS Design Using Carrier Supporting Carrier (CSC)
Technical Whitepaper
Version 1.2

Authored by: Nicholas Russo
CCDE #20160041
CCIE #42518 (EI/SP)

THE INFORMATION HEREIN IS PROVIDED ON AN "AS IS" BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS, EXPRESS, IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright 2021 Nicholas Russo – http://njrusmc.net

Change History

Version and Date         Change                          Responsible Person
20200914 Version 0.1     Initial Draft                   Nicholas Russo
20200917 Version 0.2     Spelling/grammar corrections    Nicholas Russo
20201002 Version 0.3     Technical clarifications        Nicholas Russo
202001102 Version 1.0    Initial Release                 Nicholas Russo
202001205 Version 1.1    Legal disclaimers and cleanup   Nicholas Russo
202101005 Version 1.2    Technical clarifications        Nicholas Russo

Contents

1. Overview
1.1. Problem Statement
1.2. Solution Summary
2. Architecture
2.1. Point of Presence (POP) Design
2.1.1. Physical Connectivity
2.1.2. IGP Routing
2.1.3. Multicast Routing
2.1.4. BGP VPN Services Routing
2.1.5. MPLS Label Advertisement
2.1.5.1. Label Distribution Protocol (LDP)
2.1.5.2. Resource Reservation Protocol for MPLS Traffic Engineering (MPLS-TE)
2.1.5.3. Segment Routing (SR)
2.1.6. Customer Services
2.1.6.1. Layer-3 VPN
2.1.6.2. Layer-2 VPN
2.1.6.3. Multicast VPN
2.2. Carrier Supporting Carrier (CSC) Design
2.2.1. BGP Labeled-unicast (BGP-LU) Connectivity
2.2.2. Interaction Between IGP and BGP-LU
2.2.3. Inter-AS BGP VPN Services Routing
2.2.4. Non-CSC Transport Supplementation
2.3. Extranet Integration
2.4. Quality of Service (QoS) Design
2.4.1. Queuing and Shaping
2.4.2. Classification, Marking, and Policing
2.5. Management, Security, and Automation
2.5.1. Global Management View (GMV) Design
2.5.2. VPN Management View (VMV) Design
2.5.3. Management LAN Security and High Availability
2.5.4. TACACS Command Authorization
2.5.5. Automation Strategy and Use Cases
2.5.5.1. Data Collection for Archival and Troubleshooting
2.5.5.2. MPLS Route-target (RT) Management
2.5.5.3. Inter-POP Performance Measurement
2.5.5.4. Extranet IP Address Overlap/Translation Management
2.5.5.5. Customer Onboarding Assistance
2.6. Example Customer Use Cases
2.6.1. Geographic Extension with Multi-tenancy
2.6.2. Satellite Communications (SATCOM) Remoting
2.6.3. Highly-Available Internet Access
2.6.4. WAN Aggregation and Cloud Data Center
3. Complexity Assessment
3.1. State
3.2. Optimization
3.3. Surface
Appendix A – Acronyms
Appendix B – References

Figures

Figure 1 - High-level CSC/Option C Architecture
Figure 2 - Traditional POP Physical Design
Figure 3 - Leaf/Spine POP Physical Design
Figure 4 - Using CSC-CEs as BGP VPN Route Reflectors
Figure 5 - Using Dedicated Out-of-band BGP Route Reflectors
Figure 6 - Using RRs for Transit in a POP with Link Failures
Figure 7 - Preventing Transit RRs in a POP using Areas
Figure 8 - Intra-POP iBGP VPN Sessions and Link Failure Tolerance
Figure 9 - LDP/IGP Synchronization with LDP Session Failures
Figure 10 - LDP Session Protection with Link Failures
Figure 11 - Building MPLS L3VPNs within a POP
Figure 12 - Unique L3VPN RD for Active/Active Forwarding
Figure 13 - Unique L3VPN RD for Active/Standby Forwarding
Figure 14 - Building MPLS L2VPNs within a POP
Figure 15 - L2VPN Services Offered
Figure 16 - Calculating MTU for VPN Services
Figure 17 - MVPN Profile 0 Design
Figure 18 - MVPN Profile 3 Design
Figure 19 - MVPN Profile 11 Design
Figure 20 - MVPN Ingress Replication Design (Profiles 19 and 21)
Figure 21 - eBGP-LU Inbound and Outbound Filters
Figure 22 - Inter-POP Flow with IGP/eBGP Redistribution
Figure 23 - Inter-POP Flow with iBGP-LU from CSC-CE to PE
Figure 24 - Originating the PIM Proxy Vector for P Router RPF
Figure 25 - Basic iBGP Non RR-Client Mesh over CSC
Figure 26 - Introducing Backdoor Links with Merged IGP Domains
Figure 27 - Introducing Backdoor Links with iBGP-LU and Local AS
Figure 28 - Satellite POP Connectivity to Regional POP Using eBGP VPN
Figure 29 - Second-Tier "Route Reflection" with eBGP VPN Sessions
Figure 30 - eBGP VPN Inter-region Mesh Design
Figure 31 - Controlling Inter Region eBGP VPN Advertisements
Figure 32 - High-level Non-CSC Auxiliary Transport Design
Figure 33 - Non-CSC Direct Links Between POPs
Figure 34 - Non-CSC E-LAN Service Between POPs
Figure 35 - Extranet Integration with MPLS Inter-AS Option A
Figure 36 - Queuing and PHB Design
Figure 37 - DSCP to EXP Mapping on Ingress
Figure 38 - Global Management View (GMV) Design
Figure 39 - VPN Management View (VMV) Design
Figure 40 - VMV Connectivity with Hub/Spoke Route Targets
Figure 41 - Voice over IP (VoIP) and Voice QoS Design
Figure 42 - NOC Security Stack Design
Figure 43 - Layer-2 Defense in Depth Security Design
Figure 44 - 802.1X for NOC Users and IP Phones
Figure 45 - Tiered TACACS Design and Command Sets
Figure 46 - Use Case: Connecting Geographically Dispersed Nodes
Figure 47 - High-level SATCOM Remoting Design
Figure 48 - Using a Single L2VPN to Connect Multiple Sites
Figure 49 - Using QinQ Tunneling to Avoid VLAN Rewrites
Figure 50 - Internet-in-VRF High-level Design
Figure 51 - Internet-in-VRF Regional Failover
Figure 52 - Managed IaaS High-level Design

Tables

Table 1 - Plausible MVPN Profile Options
Table 2 - Core Queuing Allocations
Table 3 - Ingress PE Classification and Marking
Table 4 - Global and VPN Management Outage Matrix

1. Overview

1.1. Problem Statement

Working for a large service provider, we struggled to find a way to connect disparate sites in a secure, multi-tenant way across continents. We lacked both the financial and political resources to build a global transport infrastructure ourselves, which was exacerbated by concerns surrounding the initial capital investments and long-term operating expenses. Some of our sites were deployed in developed countries where a wide variety of Wide Area Network (WAN) connectivity options were available. Others were in developing countries where the connectivity options were few and poor performing. Due to security and cost concerns, using the public Internet as transport was not an option at the time this network was designed.

In addition to providing transport connectivity between regions, our diverse collection of customers required a variety of different services. Some required basic IPv4/v6 connectivity, others had non-IP applications requiring layer-2 transport, and still others needed IP multicast transport across the world. Almost all customers required some combination of high scalability, high availability, rapid provisioning, and low packet loss.

Once we identified a transport provider, we learned that it may not be accessible to all locations where we needed a point of presence (POP). Our solution would also have to account for contingency connections, such as one-off direct circuits or additional service providers. These other transports should fit into the design as seamlessly as possible and serve as alternative paths where possible. Furthermore, our primary provider could not guarantee the availability of Ethernet access media, which implied our last-mile design had to be transport-independent.

1.2. Solution Summary

We selected Multi-Protocol Label Switching (MPLS) as the core technology used in the solution. Unlike modern alternatives, MPLS is well-known, widely supported, and has enjoyed decades of success in production. Additionally, much of our network equipment did not support the newest multi-tenancy VPN technologies such as Ethernet Virtual Private Network (EVPN) and Virtual eXtensible Local Area Network (VXLAN).

Because we were not able to build a global transport network, we relied on an existing Tier 1 service provider that offered a variety of transport services globally. The most accessible, scalable, and flexible solution available was Carrier Supporting Carrier (CSC). This solution extends the concept of a traditional MPLS Layer-3 VPN (L3VPN) by allowing the customer to run their own MPLS network within the VPN. As such, our remote POPs could offer a wide array of network services to our customers and the Tier 1 service provider would act as an MPLS transport network only.

CSC is seldom used in real life because other options, such as Ethernet LAN (E-LAN) services, make it easy to connect remote POPs at layer-2. Smaller carriers can run their regular interior gateway protocols (IGP) and MPLS label exchange protocols without any layer-3 interactions with the core carrier. However, such technologies require Ethernet last-mile connectivity (notwithstanding sloppy layer-2 interworking designs) which could not be guaranteed in every country in which we had a POP. CSC provides last-mile circuit flexibility/independence while also improving scale as the customer and core carriers exchange routes using Border Gateway Protocol (BGP). In this context, BGP is extended to include an MPLS label for every prefix and is known as BGP labelled unicast (BGP-LU).

What makes this design truly unique is not only the rare deployment of a production, global scale CSC network, but the inclusion of Inter-AS MPLS Option C. This relatively complex integration allows two different BGP autonomous systems (AS) to exchange BGP VPN routing information in a highly scalable way. Rather than exchanging such information through the AS boundary routers (ASBRs) as Options A and B do, Option C peers the BGP VPN route-reflectors (RR) instead. This allows the ASBRs to be unaware of any VPN routing, serving only as CSC customer edge (CSC-CE) devices connecting to the core carrier’s CSC provider edge (CSC-PE) devices. The justification for this design, instead of the more traditional internal BGP (iBGP) VPN sessions, comes later in this document.

The term “BGP VPN” is a generic statement that represents any BGP address-family used to carry customer VPN information, whether it is IPv4/v6 routes, MAC addresses, Virtual Private LAN Service (VPLS) discovery/signalling messages, multicast VPN (MVPN) discovery/signalling messages, and more. This highly generic combined design leveraging CSC and Option C allows any service to be extended between any pair of POPs in the world, regardless of their manner of connectivity. Some exceptions apply, often with multicast VPN transport, which is discussed later. The diagram below illustrates a high-level L3VPN design.

Figure 1 - High-level CSC/Option C Architecture
[Diagram labels: BGP ASN 65001, BGP ASN 65002, BGP ASN 65003, CSC core; multi-hop eBGP VPNv4/v6 (between RR and/or PE); eBGP labeled unicast (CSC-PE to CSC-CE); eBGP IPv4/v6 (PE to CE)]

2. Architecture

This section describes the solution in greater technical detail. It examines each individual component in depth, adding new components as it progresses. This document is not a training tutorial on the technologies, but does explain how they work within the context of the design.

2.1. Point of Presence (POP) Design

Individual POPs within the architecture do not have to be identical, but there are some common design constraints that apply to all of them. This section explores the design of the POPs themselves without focusing on inter-POP communications. For this particular customer, the POPs operated autonomously for about a year before we decided to tie them together. During that first year, they only served their regional customers with no inter-POP/global connectivity available.

2.1.1. Physical Connectivity

We developed two conceptual POP designs, each of which had two options for BGP route reflector (RR) placement to service the BGP VPN address-families. The first design was a traditional aggregation block with two distribution/core routers on top. Every customer-facing PE device would dual-home to both distribution/core routers (typically CSC-CEs or dedicated P routers) using directly connected Ethernet links. Such designs are decades old and are commonly seen in campus access networks and traditional data centers where the vast majority of traffic is north/south. In our case, north/south means inter-POP, and this was indeed the main traffic pattern for most customers once global connectivity was established. Very little traffic traveled east/west, meaning intra-POP, although this was certainly supported. The diagram below illustrates the traditional aggregated POP design.

Figure 2 - Traditional POP Physical Design
[Diagram labels: CSC core, CSC-PE, CSC-CE, PE]

The second design was based on a leaf/spine topology, effectively adding another pair of routers between the customer-facing PEs and the CSC-CEs. Both the PEs and CSC-CEs are “leaves” in this design, with the CSC-CEs being classified as “border leaves” given their integration with an external network. The middle tier consisted of the “spines” whereby every leaf is connected to every spine. Leaves never connect to leaves and spines never connect to spines within the same tier, with one exception. The border leaves can optionally be interconnected because shuttling ingress/egress traffic between edge devices is useful to improve availability or implement ingress/egress traffic engineering in the future. The main technical advantage of leaf/spine over the traditional design is the ability to improve scale for east/west traffic. Simply add more spines to increase availability, capacity, or both.

The spine tier can also be viewed as a disadvantage, since the only purpose of a spine is to forward traffic. This incurs additional cost and management burden. In real life, we never deployed leaf/spine POPs as there was no compelling operational justification, despite their popularity at the time. This document will discuss the details surrounding its deployment nonetheless. The diagram below illustrates the conceptual leaf/spine POP physical design.
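Before the diagram, the scaling argument for leaf/spine can be sketched in a few lines of Python. This is only an illustration under simple assumptions: every leaf connects to every spine, the optional border-leaf interconnect is ignored, and the function names are hypothetical rather than taken from any tool used in the design.

```python
def leaf_spine_links(leaves: int, spines: int) -> int:
    """Total links when every leaf connects to every spine.

    Assumes no leaf-to-leaf or spine-to-spine links (the optional
    border-leaf interconnect is deliberately ignored).
    """
    return leaves * spines


def leaf_to_leaf_paths(spines: int) -> int:
    """Number of equal-length (two-hop) paths between any two leaves.

    Each spine contributes exactly one leaf-spine-leaf path.
    """
    return spines


if __name__ == "__main__":
    # Hypothetical POP: 4 PEs plus 2 CSC-CEs (border leaves) = 6 leaves.
    for spine_count in (2, 4):
        print(
            f"spines={spine_count} "
            f"links={leaf_spine_links(6, spine_count)} "
            f"paths={leaf_to_leaf_paths(spine_count)}"
        )
```

Adding a spine adds one link per leaf and one more path between every pair of leaves, which is the availability/capacity gain described above, and also the extra cost.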

Figure 3 - Leaf/Spine POP Physical Design
[Diagram labels: leaves, spines, P]

We overlaid two different BGP RR strategies atop these POP designs. The first was a low-cost approach that repurposed the CSC-CEs, whether they were aggregation routers or border leaves, to serve as BGP RRs for the POP. Because these devices were already quite powerful in terms of computing capacity, using them to serve as BGP RRs was a low-risk, cost-effective choice. Each PE in the POP would peer to these RRs using internal BGP (iBGP), which is detailed later in this document. This is the design we selected in real life as cost concerns governed many of our decisions. The diagram below illustrates the intra-POP iBGP VPN sessions overlaid on both the traditional and leaf/spine physical designs. Note that the precise details regarding the iBGP topology are discussed later in the document.

Figure 4 - Using CSC-CEs as BGP VPN Route Reflectors
[Diagram labels: CSC core, CSC-PE, CSC-CE (iBGP VPN route reflectors), P (BGP free), PE (iBGP VPN RR clients)]

The second RR strategy involved a pair of dedicated RRs outside of the forwarding path of customer traffic. These routers would look like PEs from a physical connectivity perspective, but would not service any customers and would never be used for traffic forwarding. This non-transit behavior can be implemented by manipulating the IGP (discussed later). In modern designs, these BGP RRs are often low-cost virtual routers with large memory allocations, medium CPU allocations, and low network bandwidth allocations. Additionally, we considered using a different pair of BGP RRs for each of the VPN services we offered, such as IPv4 VPN, IPv6 VPN, multicast VPN, etc. This incurs even greater cost and management burden, but reduces fate sharing and slightly improves availability.

Some of the largest carriers manage risk by spreading different BGP address-families across different RRs to the maximum extent economically possible. In our environment, we did not have a general-purpose computing environment immediately available. When including the capital investment needed to build and maintain it, this solution was prohibitively expensive and not at all worth doing. The diagram below illustrates conceptual examples of adding dedicated RRs to the traditional and leaf/spine POP designs at a high level. Note that the term “BGP free” means that there are no VPN capabilities on those devices. Some devices, like the CSC-CE, may run BGP for a different purpose later.

Figure 5 - Using Dedicated Out-of-band BGP Route Reflectors
[Diagram labels: CSC core, RR (iBGP VPN route reflectors), PE (iBGP VPN RR clients), P (BGP free)]

2.1.2. IGP Routing

Because each regional POP is relatively small (consisting of 10 to 30 devices), any IGP would scale adequately without much concern. Although our organization had no need for any MPLS traffic engineering (TE) given the tiny size of our POPs and lack of a long-haul infrastructure, we agreed that choosing a link-state protocol was necessary. This makes future TE integration easier, along with support for emerging technologies like Segment Routing (SR). This reduced our choices to Open Shortest Path First (OSPF) and Intermediate System to Intermediate System (IS-IS), the two most popular link-state IGPs.

OSPF was the more appropriate choice for our network because our operators were already extensively trained in this protocol. Some network OS implementations, like Cisco IOS, IOS-XE, and IOS-XR, will ignore OSPF external routes when redistributing OSPF into BGP by default. This is useful because any BGP routes redistributed into OSPF will not be considered for redistribution from OSPF back into BGP. In short, this prevents routing loops with no additional design or implementation effort. IS-IS has no such default behavior, and this will become relevant later in the document when discussing CSC integration. To prevent potential routing loops, IS-IS would require manual configuration to match/filter these routes at the point of redistribution (CSC-CE). For network implementation experts, this is inconsequential, but it is avoidable complexity that adds no value. In both cases, the scale of each POP is small enough that a flat OSPF area 0 or IS-IS level-2 design is adequate, with the exception of dedicated RRs in OSPF environments (discussed later).

First, consider basic OSPF optimizations.
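As a brief aside before those optimizations, the redistribution behavior described above can be modeled in a few lines of Python. This is a conceptual sketch only, not router configuration: the `OspfRoute` class, the `route_type` strings, the `match_external` flag, and the example prefixes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OspfRoute:
    prefix: str
    route_type: str  # "intra-area", "inter-area", or "external"


def redistribute_bgp_into_ospf(bgp_prefixes):
    """BGP routes injected into OSPF always appear as external routes."""
    return [OspfRoute(prefix, "external") for prefix in bgp_prefixes]


def redistribute_ospf_into_bgp(ospf_routes, match_external=False):
    """Model of the default described above: externals are skipped
    unless the operator explicitly asks to match them."""
    return [
        route.prefix
        for route in ospf_routes
        if match_external or route.route_type != "external"
    ]


if __name__ == "__main__":
    # Hypothetical loopbacks: one local to the POP, one learned over BGP.
    local_routes = [OspfRoute("10.0.0.1/32", "intra-area")]
    learned_via_bgp = ["10.1.0.1/32"]

    ospf_table = local_routes + redistribute_bgp_into_ospf(learned_via_bgp)

    # Only the local prefix is sent back to BGP, so no loop can form.
    print(redistribute_ospf_into_bgp(ospf_table))
    # Matching externals re-advertises the BGP-learned prefix (loop risk).
    print(redistribute_ospf_into_bgp(ospf_table, match_external=True))
```

The external marking acts as a built-in tag on every route that came from BGP. An IGP without that default needs an equivalent tag or filter applied by hand at the CSC-CE, which is the extra configuration burden noted for IS-IS.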
All transit links should use the point-to-point (P2P) network type to speed convergence, reduce link-state database (LSDB) bloating, and reduce the topological graph complexity. P2P links do not have a designated router (DR) and thus no DR election. A link interconnecting exactly two OSPF speakers is not a multi-access network and therefore does not benefit from a DR, which is represented as a Link State Advertisement type 2 (LSA2) in the LSDB. As such, no LSA2 should be present anywhere in the network, reducing the number of total graph vertices by almost half. It is advisable to retain “stub networks” within the router LSA for OSPFv2 (LSA1) or the intra-area prefix LSA within OSPFv3 (LSA9) to simplify troubleshooting. This allows operators to ping transit links, to source pings from transit links, and to see at a glance which links might be experiencing problems by checking the routing table. Given the small network and the rarity with which these IP subnets change, there is little operational benefit to suppressing these prefixes.

Next, consider OSPF security. Modern OSPF implementations allow for SHA-256 authentication (some platforms offer even stronger hashes) which should be preferred instead of the older MD5 option. In addition to authentication, OSPFv3 also offers IPsec encryption, which in the author’s experience, is overly complex, prone to breaking, and not worth deploying. OSPF TTL-security ensures that neighbors are directly connected, preventing any long-range hijacking attacks from external networks, such as those accessible over CSC. Protecting the OSPF LSDB itself can be accomplished by setting maximum LSA limits to prevent accidental LSA injection at scale, perhaps due to unfiltered BGP to OSPF redistribution. Such concerns were irrelevant in our environment given that our global Internet connections were placed in a VPN, but some customers may prefer to transport Internet traffic in the global routing table. This topic is discussed in greater detail later in the document.

In less symmetric networks, some operators deploy loop free alternate (LFA) technologies to allow OSPF to inspect the LSDB in greater detail to determine if backup paths exist. When they do, the router can preemptively install these backup paths in hardware for faster failover. In our case, POPs are perfectly symmetric with the same IGP cost used on all links (10 in our case), automatically resulting in equal-cost multi-path (ECMP). This feature allows for load sharing between devices based on various hashing algorithms which are out of scope for this document. More important than the load sharing is the high availability; because both routes are used for forwarding, they are both programmed in hardware already. This obviates the need for complex LFA techniques and given our early-in-career network operators, ECMP was the best choice. Note that some hardware platforms may benefit from LFA enabled even in ECMP environments. This depends on how the platform maintains its forwarding tables and is likewise out of scope for this document.

The first step in the convergence process is failure detection. Because all devices in the POP were directly connected (i.e. no intermediate Ethernet switches), the Ethernet interface line status was an accurate indication of a link’s up/down status. This raises the question of “carrier delay”; how long after a failure is detected should the control-plane mark the interface as down? In our first three years of operation, we observed only two false-negative micro-flaps whereby an interface loses electrical or optical signal for a brief period of time (a few milliseconds at most), but immediately returns. Marking this as a link flap and starting the convergence process is more detrimental than just waiting, so we used a relatively aggressive carrier-delay of 5 milliseconds. This delay helps the control-plane ignore rare micro-flaps rather than starting the convergence process prematurely.

Note that Bidirect
