Internet-Draft | IP VPNs abstract next-hops | August 2022 |
Malyushkin | Expires 18 February 2023 | [Page] |
This document discusses the IP VPN convergence aspects and specifies procedures for IP VPN to signal the attachment circuit failure. The specified procedures help significantly improve the IP VPN convergence.¶
This note is to be removed before publishing as an RFC.¶
Status information for this document may be found at https://datatracker.ietf.org/doc/draft-malyushkin-bess-ip-vpn-abstract-next-hops/.¶
Discussion of this document takes place on the BGP Enabled ServiceS Working Group mailing list (mailto:bess@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/bess/. Subscribe at https://www.ietf.org/mailman/listinfo/bess/.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 18 February 2023.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Neither IP VPN [RFC4364] nor IPv6 VPN [RFC4659] has a mass route withdrawal mechanism. The failure of a connection to a CE forces a PE to withdraw all affected VPN routes instead of notifying other PE routers about the attachment circuit failure. These routes may be packed into one or more BGP UPDATE messages and then disseminated through the network. Depending on the BGP topology, these messages may be further processed and replicated by intermediate nodes (e.g., route reflectors). In general, every affected route must be withdrawn from all interested parties. The number of failed routes affects the convergence time: more routes require more time. A complex intermediate BGP topology may also lengthen this time.
Network convergence speed is important. Traffic may be lost until the failure notification (BGP UPDATE messages) reaches the other members participating in the affected VPN service (i.e., routers using the affected VPN routes for traffic forwarding). Moreover, this loss happens at the egress point where the failed CE router is connected to the network, after traffic has traversed the whole path.
The BGP PIC edge [I-D.ietf-rtgwg-bgp-pic] is a mechanism that avoids this traffic loss while the network is converging. It depends on the availability of an extra exit point for every affected route. When a CE router is connected to a pair of PE routers (i.e., it is multihomed) and the link between the CE and one of these PEs fails, all affected traffic can be redirected by this PE toward the other one. On the other hand, the BGP PIC edge, when it is active at egress, is associated with sub-optimal routing. Traffic from an ingress PE must follow the path toward the egress PE to which the failed link with the CE is attached. This egress PE then redirects the traffic toward another PE using pre-installed backup entries. Such tromboning can negatively influence traffic characteristics (delay, loss rate, etc.).
Another problem with the BGP PIC edge at egress is a possible routing loop. Suppose a CE router is connected to a pair of PE routers and advertises a set of routes to them. These PEs install these routes and propagate them via internal BGP VPN sessions. Both PEs receive these routes via a PE-CE protocol and via the internal BGP VPN sessions. The routes received via the internal BGP VPN sessions are used as backups for the routes received via the PE-CE protocol. When the CE fails, the PE routers activate their backups and send traffic to each other until the TTL reaches zero.
The BGP PIC edge mechanism is a transient solution. As soon as all VPN members are notified about the unreachability of all affected VPN routes, traffic will be sent to the extra exit point in an optimal way or it will be dropped at ingress. The goal of the solution described in this document is to decrease the time required for all VPN members to become aware of the failure and thus to reduce the time the BGP PIC edge is active. This solution does not replace the BGP PIC edge and can be applied to networks in parallel with it. It is recommended to combine them.
Even if the destinations that were advertised by a failed CE lack alternatives, the network reaction time may be important. Imagine that the CE advertises a huge number of routes and attracts a considerable amount of traffic, but for some reason these routes do not have an alternative exit point. Until all other members of the VPN service are aware of the failure, traffic will flow through the network in vain. The solution described in this document significantly reduces this time.
This document refers to [RFC4364] in all cases where the logic of the latter is applicable to both address families. [RFC4659] is referred to explicitly where it introduces new logic.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The document uses the terminology defined in [RFC4271], [RFC4364], [RFC6513], and [RFC3031].¶
AC: Attachment Circuit.
GRT: Global Routing Table.
VPN route: Either a VPN-IPv4 or VPN-IPv6 route.
ANH: An artificial IPv4 or IPv6 address in the GRT that represents an address of a CE in a VRF.
LA: An actual IPv4 or IPv6 address of a CE that is bound (linked) to an ANH.
ANH proxying: A unidirectional dependency between the statuses of an ANH and an LA.
Consider the topology in Figure 1. CE1 and CE2 maintain external BGP sessions with PE1 for the IPv4 and IPv6 unicast address families. Both CEs send routes via these sessions, and these routes must be reachable by CE3 through the VPN service. PE1 exports the routes installed into VRF1 and sends them as VPN routes to PE2.
Figure 2 shows the routes received from the CEs and installed into VRF1 by PE1 on the left and VPN routes advertised by PE1 on the right. The most interesting column here is "VPN Next-Hops". The address of 192.0.2.1 is a primary address of PE1 which is used as a default VPN next-hop address and as a source address of internal BGP sessions. All routes of CE2 use the default address of PE1 as the next-hop when exported as VPN routes. There is a special export policy on PE1 for the internal BGP sessions that modifies next-hops for the VPN-IPv4 routes of CE1 to the address of 192.0.2.100 and for the VPN-IPv6 routes of CE1 to the address of ::ffff:192.0.2.200.¶
PE1 advertises unicast host-specific routes for the addresses 192.0.2.1, 192.0.2.100, and 192.0.2.200 via a routing protocol. PE2 receives these routes and installs them into the GRT. PE1 also allocates an MPLS label for the addresses mentioned above and distributes the bindings of this label via a label distribution protocol. PE2 receives these bindings and installs them into its tunnel table. Thus, PE2 can resolve all VPN routes that PE1 has sent.¶
Suppose that AC2 between PE1 and CE2 has failed for some reason. When PE1 notices the failure, it invalidates all routes inside VRF1 that were used to reach the CE2 addresses, 198.51.100.2/31 and 2001:DB8::/127. Other routes that recursively use the routes to the addresses of CE2 to resolve their next-hops become inactive too. Because of that, PE1 starts withdrawing the corresponding VPN routes, 203.0.113.128/25 and 2001:DB8:200::/64 (the RD is omitted). PE2 must wait for these withdrawals before it stops sending traffic toward PE1 (traffic from CE3 to CE2).
In another scenario, AC1 fails instead of AC2. Suppose PE1 is configured to monitor the status of AC1 and, when AC1 fails, PE1 immediately updates the routing protocol and the label distribution protocol. These updates include the withdrawal of the unicast routes for the addresses 192.0.2.100 and 192.0.2.200 (but not for 192.0.2.1) and of the label bindings for them. In parallel, PE1 proceeds with the same steps described previously for the AC2 failure. PE2 eventually receives the updates via the routing protocol, the label distribution protocol, or both. Thanks to a hierarchical FIB, it invalidates at once all VPN routes that use the failed routes (and tunnels) to reach their VPN next-hops. PE2 stops sending traffic to PE1 (traffic from CE3 to CE1) even if it has not yet received any withdrawals for the corresponding VPN routes.
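The effect of the hierarchical FIB on PE2 in the AC1 scenario can be illustrated with a minimal Python sketch. It is not taken from any implementation: the class HierarchicalFib and its methods are hypothetical, and the two CE1 prefixes are assumed values chosen only for illustration (the CE2 prefix and the next-hop addresses come from the text).

```python
from collections import defaultdict


class HierarchicalFib:
    """Toy ingress-PE FIB: VPN prefixes share a per-next-hop state object."""

    def __init__(self):
        self.next_hop_up = {}                    # VPN next-hop -> reachable?
        self.prefixes_by_nh = defaultdict(set)   # VPN next-hop -> VPN prefixes

    def install(self, prefix, next_hop):
        self.prefixes_by_nh[next_hop].add(prefix)
        self.next_hop_up.setdefault(next_hop, True)

    def next_hop_lost(self, next_hop):
        # A single state change; no per-prefix work is required.
        self.next_hop_up[next_hop] = False

    def usable(self, prefix):
        return any(
            prefix in prefixes and self.next_hop_up[nh]
            for nh, prefixes in self.prefixes_by_nh.items()
        )


fib = HierarchicalFib()
# CE1 prefixes below are assumptions; the CE2 prefix is from the text.
fib.install("203.0.113.0/25", "192.0.2.100")
fib.install("2001:db8:100::/64", "192.0.2.200")
fib.install("203.0.113.128/25", "192.0.2.1")

# AC1 fails: PE1 withdraws the host routes (and labels) for both ANHs.
fib.next_hop_lost("192.0.2.100")
fib.next_hop_lost("192.0.2.200")
```

In this sketch the CE1 prefixes become unusable as soon as their next-hops are marked down, while the CE2 prefix that resolves via 192.0.2.1 is unaffected. No BGP withdrawal for the VPN prefixes is involved.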
For the sake of brevity, both scenarios are discussed without alternative exit points for the routes inside VRF1 and with just a couple of such routes. In real deployments, a CE can distribute many more routes to more than one PE. In that case, the mechanism described above can significantly improve network convergence times.
This document introduces the mechanism that helps notify VPN members about the AC failure that has happened to one of these members. The described solution expects the following:¶
The solution described in this document modifies the behavior of egress PE routers only and can be deployed incrementally.¶
Section 4.3.2 of [RFC4364] states:¶
When a PE router distributes a VPN-IPv4 route via BGP, it uses its own address as the "BGP next hop".¶
In most cases, the "own address" is the address of a virtual interface (e.g., a loopback). This address usually acts as a tunnel endpoint for labeled traffic. The tunnel using it may be instantiated by different mechanisms and must be capable of forwarding MPLS traffic. The PE also uses this address as the source address of internal BGP sessions. Due to the virtual nature of the interface owning this address, the interface practically never fails (unless it is brought down deliberately). Only the failure of the whole PE makes the address unreachable.
The solution described in this document proposes using additional next-hops for VPN routes advertised by a single PE. This alters the behavior described in [RFC4364] and [RFC4659], which presupposes the advertising of a single next-hop address for all VPN routes of a PE. In this solution, the additional next-hops advertised by a PE are named ANH addresses. An ANH address is an artificial IPv4 or IPv6 address that belongs to the GRT. An ANH acts as a proxy address for an actual address of a CE residing in a VRF. The status of the latter address influences the status of its ANH. A CE's address selected for an ANH is named an LA. An LA may belong to a common subnet of a PE-CE pair in a VRF or can be any other address of the CE, but it cannot belong to the GRT. An ANH and its LA do not necessarily belong to the same address family. For example, an IPv4 ANH may have an IPv6 LA, and vice versa. An LA can be a link-local IPv6 address; in this case, its ANH MUST be a proxy to a triplet (LA, AC, VRF) instead of (LA, VRF), where the LA belongs to the AC from the triplet.
Addresses installed in different VRFs may overlap. Thus, values of ANHs may be arbitrary and do not have to be the same as their LAs. An operator is free to choose these values according to network address plans. To achieve the goals stated in Section 1, values of ANHs MUST be unique throughout the GRTs of the network. The case when several PE routers advertise the same value for an ANH (e.g., anycast) is out of the scope of this document.
The ANH proxying is not a route leaking mechanism; it cannot be used for traffic forwarding between the GRT and a VRF in any direction. The ANH proxying creates a dependency between the statuses of an ANH and an LA (Section 5.1). An ANH MUST be bound to only one LA. An LA in turn MUST be bound to only one ANH. There is a strict one-to-one mapping between them. An operator may create the ANH proxying for any address in a VRF, but the solution expects that this address is used as a next-hop for routes in this VRF. These routes do not necessarily belong to the same CE that owns the LA (i.e., third-party next-hop).
The ANH proxying can be described as a static host-specific route that is installed in the GRT. The destination address of this route is the value selected for an ANH by an operator. The next-hop address of the static route is equal to the LA selected for the ANH. Additionally, the operator configures a VRF (its name or index) directly for this static route. The VRF indicates where the next-hop of the route must be resolved. For unlabeled traffic coming to a PE via the GRT, the static route acts as a route to the bit bucket. This document does not restrict implementations to this mechanism.
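The binding rules above (one ANH per LA, one LA per ANH, and the (LA, AC, VRF) triplet for link-local IPv6 LAs) can be sketched as a small configuration table. This is an illustration only: the names LinkedAddress and AnhBindings are hypothetical, and the CE1 addresses 198.51.100.1 and 2001:db8:0:1::1 are assumed values that do not appear in this document.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkedAddress:
    """LA key: (address, VRF); the AC is set only for link-local IPv6 LAs."""
    address: str
    vrf: str
    ac: Optional[str] = None


class AnhBindings:
    """Hypothetical per-PE table enforcing the one-to-one ANH/LA mapping."""

    def __init__(self):
        self.la_by_anh = {}
        self.anh_by_la = {}

    def bind(self, anh, la_address, vrf, ac=None):
        anh_ip = str(ipaddress.ip_address(anh))
        la_ip = ipaddress.ip_address(la_address)
        link_local = la_ip.version == 6 and la_ip.is_link_local
        if link_local and ac is None:
            raise ValueError("a link-local LA needs the (LA, AC, VRF) triplet")
        la = LinkedAddress(str(la_ip), vrf, ac if link_local else None)
        if anh_ip in self.la_by_anh:
            raise ValueError(f"ANH {anh_ip} is already bound to an LA")
        if la in self.anh_by_la:
            raise ValueError(f"{la} is already bound to an ANH")
        self.la_by_anh[anh_ip] = la
        self.anh_by_la[la] = anh_ip
        return la


bindings = AnhBindings()
# Assumed CE1 addresses inside VRF1; the ANH values are from the example.
bindings.bind("192.0.2.100", "198.51.100.1", "VRF1")
bindings.bind("192.0.2.200", "2001:db8:0:1::1", "VRF1")  # IPv4 ANH, IPv6 LA
```

Because the VRF is part of the LA key, the same LA value may be bound in another VRF under a different ANH, which reflects the possibility of overlapping addresses in different VRFs.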
The status of an ANH depends on the existence of a route to its LA. This route MUST be present in a VRF associated with the ANH. The ANH is considered active if and only if the route to its LA is active and is available for traffic forwarding. The proposed solution does not restrict the type of this route, but it MUST support at least direct routes and static routes. An implementation MAY filter the protocols used for resolution of LAs by a configuration policy.¶
The dependency between the statuses is unidirectional: only the status of an LA defines the status of its ANH.
An implementation MAY support the option of deactivation of an ANH manually by an operator.¶
Besides the dependency on a route toward an LA, an ANH MAY be a client of any mechanism of active monitoring of the LA. It can be any next-hop tracking (ARP, ICMP probes if they are applicable to the LA) or a BFD [RFC5880] session between a CE that owns the LA and a PE that owns the ANH.¶
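The status rules of this section can be summarized in a short sketch. The function anh_is_active and the VrfRoute type are hypothetical names, and combining the optional inputs (manual deactivation and active monitoring) with a logical AND is an assumption about how an implementation might merge them, not a requirement stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VrfRoute:
    protocol: str   # e.g. "direct", "static", "bgp"
    active: bool    # installed and available for traffic forwarding


def anh_is_active(la_route: Optional[VrfRoute],
                  allowed_protocols=("direct", "static"),
                  admin_disabled: bool = False,
                  monitor_up: bool = True) -> bool:
    """Derive the ANH status from the route to its LA in the associated VRF."""
    if admin_disabled or not monitor_up:
        return False                      # optional manual or monitoring input
    if la_route is None or not la_route.active:
        return False                      # no usable route to the LA
    # Optional configuration policy filtering the resolving protocols.
    return la_route.protocol in allowed_protocols
```

The function only reads the LA side and returns the ANH status, so the dependency stays unidirectional: nothing in it changes the route to the LA.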
In general, the distribution of ANHs by means of a routing protocol does not differ from the distribution of any other addresses that are considered to be BGP next-hops.¶
An ANH SHOULD be advertised by a routing protocol. In this case, the following conditions MUST be met:
This solution does not restrict the type of the routing protocol for ANH routes distribution.¶
When the status of an ANH changes from active to inactive a PE MUST notify the other PEs receiving a route to this ANH. The speed of origination of such notification and its propagation is crucial.¶
PEs that received a route to an ANH act according to the standard procedures that are applicable to the routing protocol. This solution does not modify this behavior.¶
A PE may have one or several ANHs and distribute them as per Section 5.2. In that case, according to Section 5 of [RFC4364], there MUST be a tunnel for every such address of the PE.
This solution does not restrict the type of tunnels that point to ANHs, but these tunnels MUST forward MPLS traffic. However, an implementation of an egress tunnel endpoint may require some changes to support a point-to-point tunnel (e.g., RSVP-TE LSP [RFC3209] or IP GRE [RFC4032]) to an ANH. These changes are out of the scope of this document.¶
The solution does not consider in detail using tunneling technologies other than MPLS LSPs for transport of labeled VPN traffic. The rest of the section is applicable to MPLS LSPs only.¶
For all ANHs with LAs that belong to the same VRF, a PE MUST allocate the same label. A PE MAY allocate a single label for all ANHs (e.g., implicit label).¶
When a PE allocates a label for an ANH, it MUST associate a release timer with this label. If the status of the ANH changes to inactive, the PE starts the release timer for the label. While the timer is running, if the PE receives traffic with this label, it MUST continue to handle this traffic as if the failure had not happened. When the timer expires, the PE starts freeing the resources associated with the label. This timer does not influence the generation and advertising of the failure notification via the label distribution protocol. An implementation SHOULD support manual setting of the release timer (including zero). If a label is allocated for a group of ANHs, a PE starts the timer if and only if the last active address of the group becomes inactive.
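The release timer behavior for a label shared by a group of ANHs can be sketched as follows. The class AnhLabel, the label value 1000, and the 5-second release time are assumptions made for the example; time is passed in explicitly so that the sketch stays deterministic.

```python
class AnhLabel:
    """Hypothetical state of one label allocated for a group of ANHs."""

    def __init__(self, label, anhs, release_seconds):
        self.label = label
        self.active_anhs = set(anhs)
        self.release_seconds = release_seconds   # operator-configurable
        self.release_deadline = None             # None: timer not running
        self.released = False

    def anh_down(self, anh, now):
        self.active_anhs.discard(anh)
        # Start the timer only when the last active ANH of the group fails.
        if not self.active_anhs and self.release_deadline is None:
            self.release_deadline = now + self.release_seconds

    def forwards(self, now):
        """True while traffic carrying this label is still handled."""
        if self.release_deadline is not None and now >= self.release_deadline:
            self.released = True                 # resources are freed
        return not self.released


shared = AnhLabel(1000, ["192.0.2.100", "192.0.2.200"], release_seconds=5)
shared.anh_down("192.0.2.100", now=0)    # one ANH is still active
timer_after_first = shared.release_deadline
shared.anh_down("192.0.2.200", now=10)   # last ANH fails: timer starts
```

The sketch does not model the failure notification itself, which is sent via the label distribution protocol regardless of the timer.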
When a PE advertises a label binding to an ANH it either MUST be accomplished by a label distribution protocol in parallel with the advertising of the ANH via a routing protocol (Section 5.2), or the ANH MUST be sent as a labeled route (e.g., BGP-LU [RFC8277]).¶
When the status of an ANH (Section 5.1) changes to inactive and a label binding to this ANH was advertised by a PE via the label distribution protocol, the PE MUST notify the other routers receiving the label for this ANH. The speed of origination of such a notification and of its propagation is also important. This notification may be received before the notification via the routing protocol, or it may be the only notification channel (Section 9.4).
Routers that received a label binding for an ANH act according to the standard procedures that are applicable to the label distribution protocol. The proposed solution does not change the behavior of ingress LERs or LSRs.¶
The solution that is proposed in this document is only applicable to VPN-IPv4 and VPN-IPv6 routes (i.e., SAFI 128). Using any other routes with ANHs is out of the scope of this document.¶
For a group of routes installed in a VRF and united by a common next-hop address, an operator MAY set up an ANH as a next-hop of the corresponding VPN routes. The remaining routes from the same VRF (if any) MUST be advertised according to the procedures of [RFC4364] or [RFC4659] if they are to be advertised at all.
An ANH for VPN-IPv4 routes is encoded according to Section 4.3.2 of [RFC4364] as a VPN-IPv4 address with an RD of 0.¶
An ANH for VPN-IPv6 routes is encoded according to Section 3.2.1 of [RFC4659] as a VPN-IPv6 address. This VPN-IPv6 address contains an RD of 0 and an IPv6 address which is equal to the ANH. When the ANH is an IPv4 address, it is encoded in the VPN-IPv6 address as an IPv4-mapped IPv6 address. The procedures for including a link-local address are not altered by this solution.
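The two next-hop encodings can be illustrated with a minimal sketch built on the Python standard library. The function names are hypothetical; the sketch produces only the next-hop address bytes (an 8-octet RD of 0 followed by the address) and does not build a complete MP_REACH_NLRI attribute.

```python
import ipaddress

RD_ZERO = bytes(8)  # Route Distinguisher of 0, 8 octets


def encode_vpn_ipv4_next_hop(anh):
    """VPN-IPv4 address: RD of 0 followed by the 4-octet IPv4 ANH."""
    return RD_ZERO + ipaddress.IPv4Address(anh).packed


def encode_vpn_ipv6_next_hop(anh):
    """VPN-IPv6 address: RD of 0 followed by a 16-octet IPv6 address.

    An IPv4 ANH is converted to an IPv4-mapped IPv6 address (::ffff:a.b.c.d).
    """
    addr = ipaddress.ip_address(anh)
    if addr.version == 4:
        addr = ipaddress.IPv6Address(bytes(10) + b"\xff\xff" + addr.packed)
    return RD_ZERO + addr.packed


nh_v4 = encode_vpn_ipv4_next_hop("192.0.2.100")
nh_v6 = encode_vpn_ipv6_next_hop("192.0.2.200")
```

With the example addresses used earlier, nh_v4 is 12 octets long and nh_v6 is 24 octets long, and passing "::ffff:192.0.2.200" to encode_vpn_ipv6_next_hop yields the same bytes as passing "192.0.2.200".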
According to Section 4.3 of [RFC4364], routes that are installed in a VRF are converted to VPN routes (this statement is applicable to both address families) and "exported" to BGP. This solution assumes that all VPN routes are installed into the VPN Loc-RIB with a next-hop address that is equal to the own address of the PE where this VRF is configured. In other words, the solution does not modify the procedures for converting routes from VRFs to VPN routes.
All routes in a Loc-RIB are processed into appropriate Adj-RIBs-Out according to configured policies ([RFC4271], Section 9.1.3). The solution expects that there MUST be a special export policy that applies to routes passing from the VPN Loc-RIB to VPN Adj-RIBs-Out and is processed in the chain before all policies configured by an operator (if any). This special export policy modifies next-hop addresses only for those routes that are configured to be sent with ANHs (or a single ANH).
A PE does not check the presence of a route to an ANH in the GRT before copying VPN routes from a Loc-RIB into a corresponding Adj-RIB-Out or during the Update-send process (Section 9.2 of [RFC4271]). When the status of an ANH (Section 5.1) changes to inactive, a PE does not start withdrawing the VPN routes that use this ANH as their next-hop. This prevents churn when an operator performs network maintenance and manually disables the ANH. On the other hand, deleting a binding between an ANH and its LA MUST start changing the corresponding next-hop addresses in Adj-RIBs-Out to the default value (the value from the Loc-RIB).
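The export behavior described above can be sketched as a function that builds Adj-RIB-Out entries from VPN Loc-RIB routes. All names here (export_to_adj_rib_out, the dictionary keys, anh_config) are hypothetical, routes are simplified to dictionaries, and the next-hop addresses inside VRF1 are assumed values.

```python
def export_to_adj_rib_out(loc_rib, anh_by_vrf_next_hop, operator_policies=()):
    """Copy VPN routes to an Adj-RIB-Out, rewriting next-hops to ANHs.

    loc_rib: dicts with "prefix", "vrf", "vrf_next_hop" and "next_hop".
    anh_by_vrf_next_hop: {(vrf, next-hop inside the VRF): ANH} bindings.
    """
    adj_rib_out = []
    for route in loc_rib:
        out = dict(route)                 # the Loc-RIB entry is not modified
        # The special policy runs first and touches only configured routes.
        anh = anh_by_vrf_next_hop.get((route["vrf"], route["vrf_next_hop"]))
        if anh is not None:
            out["next_hop"] = anh         # the ANH status is not consulted
        # Operator policies follow; a policy returns None to reject a route.
        for policy in operator_policies:
            out = policy(out)
            if out is None:
                break
        if out is not None:
            adj_rib_out.append(out)
    return adj_rib_out


loc_rib = [
    {"prefix": "203.0.113.0/25", "vrf": "VRF1",
     "vrf_next_hop": "198.51.100.1", "next_hop": "192.0.2.1"},
    {"prefix": "203.0.113.128/25", "vrf": "VRF1",
     "vrf_next_hop": "198.51.100.3", "next_hop": "192.0.2.1"},
]
anh_config = {("VRF1", "198.51.100.1"): "192.0.2.100"}
with_anh = export_to_adj_rib_out(loc_rib, anh_config)
without_binding = export_to_adj_rib_out(loc_rib, {})
```

In this sketch, removing the binding from the configuration and re-running the export restores the default next-hop from the Loc-RIB, while an inactive ANH alone changes nothing in the result.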
An implementation MAY support an option of selecting distinct Adj-RIBs-Out where VPN routes will be placed with ANHs.¶
For an ingress PE, it is impossible to determine whether a next-hop address of received VPN routes is a regular address or an abstract one. The ingress PE considers every VPN next-hop address as the address of a standalone egress router even if a group of VPN next-hop addresses belongs to the same device. Having an active route and a tunnel to a BGP next-hop address the ingress PE encapsulates and sends traffic via the tunnel according to Section 5 of [RFC4364].¶
If an egress PE receives MPLS traffic with a label that was allocated for one of its ANHs the solution expects the following (other cases are out of the scope of this document):¶
A PE detects the failure of a connected CE by different mechanisms. These mechanisms are not considered in this document. The net effect of the failure is the unreachability of routes to addresses (or a route to a single address) of the failed CE in the VRF where an AC to the CE resides. The PE usually uses these routes to recursively resolve next-hops for other routes in the VRF (which are also usually distributed by the CE). The PE tries to find new options to resolve the next-hops of all failed routes; if there are no such options, the PE starts deleting the failed routes from the VRF.
After the PE detects that the CE has failed, and if one of the addresses of the CE is an LA, the PE immediately deactivates the ANH of this LA. If a route to the ANH was distributed as per Section 5.2, the PE notifies all neighbors of the routing protocol. If the ANH was also bound to a label and this label was distributed via a label distribution protocol, the PE notifies all neighbors of the label distribution protocol.
The PE may start distributing updates via BGP VPN sessions, notifying its peers that the routes in the VRF are no longer reachable. This process is independent of the process described above, and the solution does not modify it. However, if the route to the ANH was distributed by BGP via the same set of sessions that are used for VPN route distribution, an implementation SHOULD schedule the sending of the UPDATE message with the ANH's withdrawal prior to the UPDATE messages with the failed VPN routes.
As stated in Section 4, the proposed solution expects an ingress PE to consider the status of a tunnel toward a BGP VPN next-hop. Thus, when the status of the tunnel changes to inactive, the ingress PE simultaneously deactivates all VPN routes with a next-hop equal to the address of the tunnel's endpoint. If the ingress PE does not follow this logic, the solution expects that the status of a route toward a BGP VPN next-hop in the GRT is used in the same way. The ingress PE can apply both procedures. In any case, the ingress PE can react to the failure of a remote CE (a CE connected to a remote PE in the same VPN) or of an AC to the CE independently of receiving the BGP UPDATE messages that withdraw the VPN routes pointing to this CE. The ingress PE may activate backups for these routes and redirect traffic to them.
A requirement to have a tunnel to every next-hop address that a PE uses to advertise VPN routes may pose scalability concerns. There are some thoughts on how to deploy the described solution from the scalability point of view:¶
Deploying of the ANH solution should be considered on a per-service basis. The following points may help to decide whether an ANH is appropriate:¶
The detection time of a CE failure or of a failure of the link (or AC) to the CE is worth noting. This time contributes significantly to overall convergence. For example, sometimes a link failure cannot be detected by loss of signal, and extra mechanisms are required for this task. Some of these mechanisms interact with a session of a PE-CE protocol. When these mechanisms detect the failure, routes distributed by the associated PE-CE protocol session become inactive. At the same time, routes toward the addresses of the CE (or a single route) are usually not distributed by the PE-CE protocol session. They may be direct or static routes. Thus, the routes toward the addresses of the CE are not affected by the detection mechanism described above and stay active. If one of the CE's addresses is an LA, the ANH that is proxying to the LA will also remain active. To prevent such behavior, it is recommended to detect the failure of an address of the CE or of a route to this address (both on the PE and the CE, especially if the CE is multihomed). In other words, it is better (in the described case) to monitor the next-hop of the routes distributed by the PE-CE protocol rather than the PE-CE protocol session.
Routes advertised by a routing protocol can be aggregated at some points of a network (e.g., ABRs). Such aggregation may lead to obscurity issues in the event of an ANH deactivation. The aggregation of routes removes a notification channel that is supplied by the routing protocol. A label distribution protocol can provide this notification channel when it is used for the distribution of labels for ANHs. But in this case, it requires all VPN members to consider the status of tunnels toward ANHs as described in Section 4.¶
If LDP [RFC5036] is used as the label distribution protocol for ANHs, the following steps should be considered:¶
Section 7 of [RFC6514] introduces the VRF Route Import EC. Section 5.1.3 of [RFC6513] describes a scenario when unicast VPN routes do not contain this community during the selection of Upstream PE:¶
If a route does not have a VRF Route Import Extended Community, the route's Upstream PE is determined from the route's BGP Next Hop.¶
The solution described in this document expects that unicast VPN routes of a VPN service may be sent by a PE with different BGP next-hop addresses. It may create an issue with the importing of C-Multicast routes if this VPN service also acts as MVPN and does not mark its VPN routes with the VRF Route Import EC. It is not recommended to configure a new import Route Target EC for every ANH. Instead, there are two possible ways to mitigate the described problem:¶
This document specifies extensions for the advertisement of VPN routes with different next-hops by a single PE. From this point of view, the security considerations described in [RFC4364] and [RFC4659] are equally applicable to the extensions described in this document.
This document has no IANA actions.¶
The author would like to thank Roman Peshekhonov for his review and valuable input.¶