SRV6 enhancement proposal#316
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces an enhancement to the OpenPERouter API to support SRV6-based L3VPNs. By extending the existing Underlay configuration and adding a dedicated L3VPN resource, the change enables tighter integration of Kubernetes clusters into SRV6-enabled networks, allowing for efficient traffic tunneling over IPv6 infrastructure without the need for VXLAN. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This enhancement proposal introduces SRv6 L3VPN support to OpenPERouter by extending the Underlay CRD and adding a new L3VPN CRD. The review feedback identifies several technical issues in the proposed API definitions, including a typo in the documentation, conflicting MaxItems validation markers, incorrect syntax for MinLength markers, and the use of a non-standard Required marker. Additionally, suggestions were made to optimize CEL validation rules and correct a public IP address used in an example.
| - a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html) | ||
| --> | ||
|
|
||
| - [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) |
861b030 to
98deb49
Compare
1772e82 to
568a891
Compare
|
|
||
| // ISIS holds the ISIS configurations for the underlay, one per ISIS process. | ||
| // +kubebuilder:validation:MaxItems=10 | ||
| ISIS []ISISConfig `json:"isis,omitempty"` |
There was a problem hiding this comment.
Instead of configuring ISIS in the underlay, we could also have an ISIS CRD. From the underlay configuration, + the SRV6 configuration, we could then generate ISIS CRs. But that seems more complex with few gains?
There was a problem hiding this comment.
If we allow only a single ISIS process then we can hardcode the name of the process, as well
There was a problem hiding this comment.
I think we should have a discriminated union in the Underlay.
You'd then select either EPVN or ISIS, and provide a configuration for the chosen protocol.
This seems simpler, and aligned with k8s api best practices.
There was a problem hiding this comment.
imo, is legit ISIS to have ISIS as part of the underlay configuration. It's where it effectively belogns.
On the other hand, I would maybe question moving the evpn / srv6 bits elsewhere if they don't really represent underlay configuration.
404aaa2 to
3e155bf
Compare
|
@fedepaol This is ready for a first round of review. I left a few things purposefully up for discussion |
|
@maiqueb ptal |
fedepaol
left a comment
There was a problem hiding this comment.
Really nice! Did a first pass
| - Implementation of L3VPN over SRv6 | ||
| - with IPv4 overlay (definitely possible) | ||
| - with IPv6 overlay (verify if possible with FRR / Linux kernel) | ||
| - Using ISIS as the IGP to exchange reachability information between the edge |
There was a problem hiding this comment.
I would try to understand if we can threat isis and srv6 orthogonally, allowing to:
- use bgp to exchange locator reacheability
- use isis to exchange vtep reacheability
There was a problem hiding this comment.
I don't think that's something that's possible / makes sense.
The locator /48s must be announced through the entire routing domain, including transit nodes, and that's the job of an IGP (ISIS). Otherwise, you'd have to add bgp sessions to every single node of the routing domain.
On the other hand, the actual provider edge nodes (the tunnel edges) must transmit meta information that fits inside BGP message and not inside ISIS, such as RT / route-distinguiser.
** I don't quite understand what the End.X routes are for, yet, but ISIS also carries info about End.X behavior.
# FRR:
I>* fd00:0:12:e002::/64 [115/0] is directly connected, to-l1-1-1, seg6local uA nh6 fe80::a8c1:abff:fea8:a61b, weight 1, 00:09:02
# Kernel:
fd00:0:12:e002::/64 nhid 32 encap seg6local action End.X nh6 fe80::a8c1:abff:fea8:a61b flavors next-csid lblen 32 nflen 16 dev to-l1-1-1 proto isis metric 20 pref medium
# show isis database detail
l1-1-pe.00-00 287 0x00000002 0x4cd1 925 0/0/0
Protocols Supported: IPv4, IPv6
Area Address: 49.0001
Hostname: l1-1-pe
TE Router ID: 172.31.1.12
Router Capability: 172.31.1.12 , D:0, S:0
SR Algorithm:
0: SPF
SRv6: O:0
Extended Reachability: 1720.3100.1011.44 (Metric: 10)
SRv6 Lan End.X SID: fd00:0:12:e002::, Algorithm: SPF, Weight: 0, Endpoint Behavior: uA, Flags: B:0, S:0, P:0 Neighbor-ID: 1720.3100.1011
SRv6 SID Structure Locator Block length: 32, Locator Node length: 16, Function length: 16, Argument length: 0,
Extended Reachability: 1720.3100.1012.47 (Metric: 10)
SRv6 Lan End.X SID: fd00:0:12:e003::, Algorithm: SPF, Weight: 0, Endpoint Behavior: uA, Flags: B:0, S:0, P:0 Neighbor-ID: 1720.3100.1013
SRv6 SID Structure Locator Block length: 32, Locator Node length: 16, Function length: 16, Argument length: 0,
IPv4 Interface Address: 172.31.1.12
Extended IP Reachability: 10.1.2.0/24 (Metric: 10)
Extended IP Reachability: 10.1.3.0/24 (Metric: 10)
Extended IP Reachability: 172.31.1.12/32 (Metric: 10)
IPv6 Reachability: fd00:0:12::/48 (Metric: 0)
IPv6 Reachability: fc00::2:172:31:1:12/128 (Metric: 10)
SRv6 Locator: fd00:0:12::/48 (Metric: 0) standard
Sub-TLVs:
SRv6 End SID Endpoint Behavior: uN, SID value: fd00:0:12::
Sub-Sub-TLVs:
SRv6 SID Structure Locator Block length: 32, Locator Node length: 16, Function length: 16, Argument length: 0,
BGP advertisement example:
l1-1-pe# show bgp ipv4 vpn
BGP table version is 2, local router ID is 172.31.1.12, vrf id 0
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 172.31.1.12:100
*> 192.168.0.0/31 0.0.0.0@3< 0 32768 ?
UN=0.0.0.0 EC{65000:100} label=917504 sid=fd00:0:12:: sid_structure=[32,16,16,0] type=bgp, subtype=5
*> 192.168.1.0/24 192.168.0.1@3< 0 0 65001 ?
UN=192.168.0.1 EC{65000:100} label=917504 sid=fd00:0:12:: sid_structure=[32,16,16,0] type=bgp, subtype=5
Route Distinguisher: 172.31.1.12:101
*> 192.0.2.0/26 192.0.3.1@4< 0 0 65002 ?
UN=192.0.3.1 EC{65000:101} label=917520 sid=fd00:0:12:: sid_structure=[32,16,16,0] type=bgp, subtype=5
*> 192.0.3.0/31 0.0.0.0@4< 0 32768 ?
UN=0.0.0.0 EC{65000:101} label=917520 sid=fd00:0:12:: sid_structure=[32,16,16,0] type=bgp, subtype=5
Route Distinguisher: 172.31.1.22:100
*>i 192.168.0.2/31 fc00::2:172:31:1:22
0 100 0 ?
UN=fc00::2:172:31:1:22 EC{65000:100} label=917504 sid=fd00:0:22:: sid_structure=[32,16,16,0] type=bgp, subtype=0
*>i 192.168.2.0/24 fc00::2:172:31:1:22
0 100 0 65001 ?
UN=fc00::2:172:31:1:22 EC{65000:100} label=917504 sid=fd00:0:22:: sid_structure=[32,16,16,0] type=bgp, subtype=0
Route Distinguisher: 172.31.1.22:101
*>i 192.0.2.64/26 fc00::2:172:31:1:22
0 100 0 65002 ?
UN=fc00::2:172:31:1:22 EC{65000:101} label=917520 sid=fd00:0:22:: sid_structure=[32,16,16,0] type=bgp, subtype=0
*>i 192.0.3.2/31 fc00::2:172:31:1:22
0 100 0 ?
UN=fc00::2:172:31:1:22 EC{65000:101} label=917520 sid=fd00:0:22:: sid_structure=[32,16,16,0] type=bgp, subtype=0
Route Distinguisher: 172.31.1.32:100
*>i 192.168.0.4/31 fc00::2:172:31:1:32
0 100 0 ?
UN=fc00::2:172:31:1:32 EC{65000:100} label=917504 sid=fd00:0:32:: sid_structure=[32,16,16,0] type=bgp, subtype=0
*>i 192.168.3.0/24 fc00::2:172:31:1:32
0 100 0 65001 ?
UN=fc00::2:172:31:1:32 EC{65000:100} label=917504 sid=fd00:0:32:: sid_structure=[32,16,16,0] type=bgp, subtype=0
Route Distinguisher: 172.31.1.32:101
*>i 192.0.2.128/26 fc00::2:172:31:1:32
0 100 0 65002 ?
UN=fc00::2:172:31:1:32 EC{65000:101} label=917520 sid=fd00:0:32:: sid_structure=[32,16,16,0] type=bgp, subtype=0
*>i 192.0.3.4/31 fc00::2:172:31:1:32
0 100 0 ?
UN=fc00::2:172:31:1:32 EC{65000:101} label=917520 sid=fd00:0:32:: sid_structure=[32,16,16,0] type=bgp, subtype=0
Displayed 12 routes and 12 total paths
There was a problem hiding this comment.
not sure I get your point.
You could just have a network statement advertising the reacheability of the locator. And bgp doesn't require you to have sessions with all the nodes, as it's not for evpn.
Similrarly, we could use isis to advertise the reacheability of the vtep assigned to the loopback
There was a problem hiding this comment.
You could just have a network statement advertising the reacheability of the locator. And bgp doesn't require you to have sessions with all the node
Sure, but how do the transit nodes know where your locator is? That's ISIS' job:
PE-A <-> transit a <-> transit b <-> transit c <-> PE-B
ISIS takes care of announcing the locator prefix to all nodes in the domain, so that all 5 of them know where the locator prefix is. Otherwise, if you announce it via BGP, you must have BGP sessions spanning between all nodes (as in a datacenter that uses eBGP unnumbered between all nodes). Otherwise, you'd blackhole your traffic.
Also, ISIS is responsible for the End.X route:
FRR:
I>* fd00:0:10:e000::/64 [115/0] is directly connected, eth1, seg6local uA nh6 fe80::a8c1:abff:fe21:97a0, eth1, weight 1, 00:31:48
Linux:
fd00:0:10:e000::/64 nhid 52 encap seg6local action End.X nh6 fe80::a8c1:abff:fe21:97a0 flavors next-csid lblen 32 nflen 16 dev eth1 proto isis metric 20 pref medium
Even though I have to admit that I've got no clue atm what it's used for.
Similrarly, we could use isis to advertise the reacheability of the vtep assigned to the loopback
I am using ISIS to advertise the loopback IP
From all of the documents and examples I can find, I think the standard is pretty much:
- use ISIS to advertise the locator prefix
- use BGP to advertise SRv6 functions = the specific /128 for decapsulation
I would go as far as to say - because we use ISIS, and otherwise we need another way to announce the BGP update source across the network:
- use ISIS to advertise the BGP update-source and SRv6 source-address (meaning: IPs on the loopback)
We could have tons of other esoteric combinations, but shouldn't we stick to what seems to be the established standard? In theory, we also wouldn't need BGP at all, we could just inject announce the /128 manually and announce them via an IGP ourselves, or do everything with static routes ;-)
So I'd stick to Cisco's way of doing it: https://www.segment-routing.net/images/20250311-srv6-usid-frr-netdev-0x19.pdf
There was a problem hiding this comment.
Or, way shorter: what's our target first user going to use - and we should stick with that :-)
There was a problem hiding this comment.
Despite the fact that as you wrote, it's much easier to find references to the srv6 - isis combo, I think that supporting BGP is a nice addition. A good reference is https://datatracker.ietf.org/doc/draft-hss-srv6ops-srv6-routing-planes/
If I read that correctly:
* Locator Advertisements:
Green locators are advertised as standard IPv6 prefixes (AFI-2, SAFI-
1) and are tagged with the extended color community [RFC4360]
corresponding to Green.
I am fine with starting with isis only, but I'd like to write the API in such a way that we'll be able to add BGP support afterwards
There was a problem hiding this comment.
and iiuc, it's how cilium works too (https://www.youtube.com/watch?v=k_2nMzxi1Hk)
There was a problem hiding this comment.
There was a problem hiding this comment.
note to myself: Modify the RFE's constraints: SRv6 can only be configured if ISIS OR BGP is configured
| An operator runs an SRv6 enabled network with IPv6 and ISIS as the IGP. The | ||
| operator's edge nodes implement L3VPN and exchange L3VPN information via ISIS | ||
| and BGP. They want to span ISIS all the way to the Kubernetes cluster and want | ||
| to peer iBGP between their tunnel edge nodes and the Kubernetes nodes (for |
There was a problem hiding this comment.
or to their route reflectors (we should also mention the RR enhancement here #218
There was a problem hiding this comment.
According to the description in #218
In cloud environments where the ToR does not support EVPN, east/west traffic cannot be distributed via the fabric. This enhancement adds an internal RR so EVPN routes are reflected between router pods and VXLAN data plane goes directly node-to-node, avoiding the ToR.
It seems that enhancement is for east-west traffic, so wouldn't apply to L3VPN?
I rephrased the sentence to:
They want to span ISIS all the way to the Kubernetes cluster and want
to peer iBGP between their tunnel edge nodes (and potentially Route Reflectors)
and the Kubernetes nodes (for iBGP support, see
Although iBGP edge nodes are kind of agnostic to the fact that they are talking to an RR or not, the RR is the only device with special syntax / behavior
| ``` | ||
| // L3VPNSpec defines the desired state of VPN. | ||
| // +kubebuilder:validation:XValidation:rule="!has(self.hostsession) || self.hostsession.hostasn != self.hostsession.asn",message="hostASN must be different from asn" | ||
| type L3VPNSpec struct { |
There was a problem hiding this comment.
not sure we might want to tie this more to srv6?
There was a problem hiding this comment.
do you mean the name?
| **Description:** Right part of the RouteDistinguisher, after the `:`, with the | ||
| left part being the router ID. | ||
| **Comments:** `RouteDistinguisherSuffix` is a very clumsy name. We might instead | ||
| want to add a field `VNI` (the same as for EVPN) and the right part |
There was a problem hiding this comment.
vni though is very evpn specific. We can call it networkIdentifier or something like that
There was a problem hiding this comment.
The field is actually called Assigned Number subfield: https://datatracker.ietf.org/doc/html/rfc4364#section-4.2
What about: RDAssignedNumber?
31bc6eb to
f327f31
Compare
|
|
||
| // Neighbors is the list of external neighbors to peer with. | ||
| // +kubebuilder:validation:MinItems=1 | ||
| Neighbors []Neighbor `json:"neighbors,omitempty"` |
There was a problem hiding this comment.
do we want to declare what addressfamily to enable for each neighbor?
There was a problem hiding this comment.
that brings us back to the discussion that we had on the IPv6 PR, doesn't it? I thin the Neighbor entry needs an AddressFamily field
There was a problem hiding this comment.
#395
I will take this up once the IPv6 PR merged
| - Implementation of L3VPN over SRv6 | ||
| - with IPv4 overlay (definitely possible) | ||
| - with IPv6 overlay (verify if possible with FRR / Linux kernel) | ||
| - Using ISIS as the IGP to exchange reachability information between the edge |
There was a problem hiding this comment.
not sure I get your point.
You could just have a network statement advertising the reacheability of the locator. And bgp doesn't require you to have sessions with all the nodes, as it's not for evpn.
Similrarly, we could use isis to advertise the reacheability of the vtep assigned to the loopback
| // +kubebuilder:validation:MaxLength=15 | ||
| // +kubebuilder:validation:MinLength:=1 | ||
| // +optional | ||
| Interface string `json:"interface,omitempty"` // https://regex101.com/r/RlniVP/2 see kernel bool dev_valid_name(...) |
There was a problem hiding this comment.
in l3vni (and evpn in general) we don't override the source and we use the interface ip. I think there's value, I just wonder if we should rethink that too
| ``` | ||
| // L3VPNSpec defines the desired state of VPN. | ||
| // +kubebuilder:validation:XValidation:rule="!has(self.hostsession) || self.hostsession.hostasn != self.hostsession.asn",message="hostASN must be different from asn" | ||
| type L3VPNSpec struct { |
| this is a base value that gets modified per node. | ||
| - **Name:** `Format` | ||
| **Description:** Locator format. | ||
| **Comments:** Currently, the only supported value is `usid-f3216`. With the |
There was a problem hiding this comment.
either is fine imo, the direction we are taking is to do cel whenever we can
f327f31 to
159f24e
Compare
159f24e to
55dc1ba
Compare
|
Need to add rp_filter = 0 for VRF to the enhancement |
|
note to myself: ISIS: default option should be to advertise passive ISIS interfaces only |
c41b485 to
67f6380
Compare
67f6380 to
dc2357e
Compare
maiqueb
left a comment
There was a problem hiding this comment.
Mostly, I worry that the feature is very big, and there isn't a phased implementation plan.
From what I understand, we need to add:
- ISIS config
- SRv6 config
- L3VPN CRD
- webhook validation
- FRR templates
- host network provisioning
- sysctl (and I suspect we don't know exactly which we need ...)
- IPAM extensions
- E2E tests
I think we should add a phased implementation proposal. Unless you tell me this is all or nothing ...
My knowledge about isis and srv6 is limited; I'm trying to improve, but, at first, I might ask questions that might not make all that sense.
| nitty-gritty. | ||
| --> | ||
|
|
||
| ### User Stories |
There was a problem hiding this comment.
should we have a user story about exposing workloads to the outside world using their IPs, with an srv6 backend ?
unsure if that's covered by the first user story below.
There was a problem hiding this comment.
the first story covers reaching workloads via service endpoints from the outside. I think that's analogous to what L3VNI does. or am I missing something?
|
|
||
| ### Goals | ||
|
|
||
| - Implementation of L3VPN over SRv6 |
There was a problem hiding this comment.
question: is L2 possible ? i.e. can you implement EVPN using ISIS and srv6 ?
There was a problem hiding this comment.
it's possible but not currently implemented in frr (and some bits are also missing in the kernel, specifically the l2 evpn part). We should be open to that
There was a problem hiding this comment.
OK, so whatever API we end up with must accept that scenario.
There was a problem hiding this comment.
The problem is that it's in theory possible to do any of the following (non-exhaustive list):
- L2 EVPN using ISIS and srv6
- L3 EVPN using ISIS and srv6
- L2VPN using ISIS and srv6
- L2 EVPN using BGP and srv6
- L3 EVPN using BGP and srv6
- L2VPN using BGP and srv6
I'm a bit afraid that if we keep in mind every potential future implementation, we overengineer an API which is currently in the alpha stage and is by definition in a transient state.
Any system that is successful needs to grow and change as new use cases emerge or existing ones change.
https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-changes
https://kubernetes.io/docs/reference/using-api/#api-versioning
Also isn't the OpenPERouter purposefully a subset of the FRR config? Otherwise, we could just expose the FRR config verbatim to the enduser, in API constructs ;-)
There was a problem hiding this comment.
This RFE is designed in a way that ISIS is independent (on an API level) from SRv6.
SRv6 in a first implementation depends on ISIS.
What this doesn't contain, yet, is e.g. a selector under SRV6 for the underlay protocol, ISIS or BGP, but it can be extended.
An L3VPN by definition uses SRv6 (not EVPN), that's why I named the resource that way (and not SRv6VPN or something). An L2VPN resource can be added later, when/if support lands.
On the other hand, I believe that the EVPN configuration or current L3VNI / L2VNI resources would have to be extended to have a selector for either VXLAN or for SRv6 as the tunneling mechanism. This would, afaict, reuse the segmentrouting locator and encapsulation source in FRR, which is mirrored in this RFE by the SRV6 config block.
If we want to account for all of the different possible configurations, I feel that we have to "explode" the underlay config into different CRDs and work with selectors.
Where the current RFE fails is how do we have EVPN and L3VPN both use SRv6 for tunneling and then select different protocols for the underlay:
- EVPN via BGP for the overlay, SRv6 for tunneling, BGP for the underlay
- L3VPN via BGP for the overlay, SRv6 for tunneling, ISIS for the underlay
But are designs with such mixed uses cases, all configured at the same time in a single cluster, really realistic?
There was a problem hiding this comment.
So in an ideal API with maximum flexibility, shouldn't we have something that breaks up resources configuratoin elements along conceptual lines, a little bit along the ideas of ISO/OSI or network layers:
layer 0:
CRD for host configuration: adds IP addresses to the loopbacks, brings interfaces into the OpenPERouter; maybe configures sysctls, etc
layer 1:
CRDs for the underlay routing protocols: BGP, ISIS, OSPF configuration
layer2:
CRD for VXLAN tunnels: selects the tunnel source IP, e.g.
CRD for SRv6: configures the segment routing block in the FRR configuration, set's the locator prefix and
SRv6 tunnel source, and once support is there in FRR, configures e.g. encap reduced
layer 3:
CRD for the overlay routing protocol: BGP configuration
layer4:
L3VNI and L2VNI
L3VPN and L2VPN
either integrated in layer 4:
L3VNI and L2VNI resources must select: host configuration CRD; VXLAN or SRv6 tunnel protocol; one of the underlay protocols; and the overlay protocol
L3VPN and L2VPN resources that can select: host configuration; SRv6 tunnel protocol; one of the underlay protocols; BGP configuration
Or having an integration layer with a crd that then selects everythig that has to go together
And then we'd add checks to avoid incompatible configuration options. But that's super complex and doesn't fit the current API (nor its philosophy of being small and nimble)
There was a problem hiding this comment.
I'm a bit afraid that if we keep in mind every potential future implementation, we overengineer an API which is currently in the alpha stage and is by definition in a transient state.
I agree. We should let the dust settle and then see what we missed, while trying to make this as good as we can
| - with IPv6 overlay (verify if possible with FRR / Linux kernel) | ||
| - Using ISIS as the IGP to exchange reachability information between the edge | ||
| nodes. | ||
| - Using FRR to exchange route information via BGP and ISIS and set up the dataplane. |
There was a problem hiding this comment.
will we be able to implement srv6 using BGP ? or only ISIS ?
There was a problem hiding this comment.
yep. added to the RFE, but that's not the immediate goal
| - Implementation of L3VPN over SRv6 | ||
| - with IPv4 overlay (definitely possible) | ||
| - with IPv6 overlay (verify if possible with FRR / Linux kernel) | ||
| - Using ISIS as the IGP to exchange reachability information between the edge |
There was a problem hiding this comment.
why is this an explicit goal ? trying to understand if srv6 requires ISIS as control plane, or not.
There was a problem hiding this comment.
while it might work with bgp too (see the thread below, https://github.com/openperouter/openperouter/pull/316/changes#r3123018997) isis is the most popular / documented way to advertise srv6 locators
| The operator configures the OpenPERouter underlay with the required ISIS | ||
| configuration, such as the ISIS base NET address (which will be incremented | ||
| by the router index for each node), as well as the SRv6 information such as the | ||
| source CIDR (again offset by the router index for each Kubernetes node) and the | ||
| locator information such as the prefix (offset by the router index). |
There was a problem hiding this comment.
do you elaborate the arithmetic to calculate the ISIS net somewhere ?
The existing IPAM functions (VTEPIp, RouterID) operate on CIDRs ... I think we'll need something different here. Especially interested in how we offset the locator.
There was a problem hiding this comment.
Not in this RFE, but in the prototype.
type ISISSystemID [6]byte
// IncrementSystemID takes an ISIS SystemID and an offset and returns the result of the sum of both.
func IncrementSystemID(si ISISSystemID, offset int) (ISISSystemID, error) {
carry := offset
for i := range 6 {
if carry == 0 {
break
}
idx := len(si) - 1 - i
res := int(si[idx]) + carry
carry = res / 256
si[idx] = byte(res % 256)
}
if carry > 0 {
return si, fmt.Errorf("overflow while incrementing SystemID %s with offset %d", si, offset)
}
return si, nil
}
And for the locator :
I think it's the only thing in the entire prototype that I let Claude dot, because every time I look at that precise bit, it's super late and I'm too tired to understand the math behind it :-P
According to the unit tests, this here works, though:
// OffsetIPv6Prefix adds nodeIndex to the network portion of an IPv6 prefix.
// For example, with prefix "fd00:0:11::/48", nodeIndex 2, and prefixLen 48,
// it returns "fd00:0:13::/48" (0x11 + 2 = 0x13).
// Note: written by Claude, verify.
func OffsetIPv6Prefix(prefix string, nodeIndex, prefixLen int) (string, error) {
ip, ipNet, err := net.ParseCIDR(prefix)
if err != nil {
return "", fmt.Errorf("failed to parse prefix %s: %w", prefix, err)
}
if ip.To4() != nil {
return "", fmt.Errorf("prefix %s is not IPv6", prefix)
}
ipBytes := ip.To16()
if ipBytes == nil {
return "", fmt.Errorf("failed to convert IP to IPv6")
}
// Calculate how many bytes and bits we need to modify
numBytes := prefixLen / 8
remainingBits := prefixLen % 8
// Convert the network portion to a big integer and add nodeIndex
var networkValue uint64 = 0
for i := 0; i < numBytes && i < 8; i++ {
networkValue = (networkValue << 8) | uint64(ipBytes[i])
}
// Handle remaining bits if prefixLen is not a multiple of 8
if remainingBits > 0 && numBytes < 8 {
networkValue = (networkValue << uint(remainingBits)) | (uint64(ipBytes[numBytes]) >> uint(8-remainingBits))
}
// Add the node index
networkValue += uint64(nodeIndex)
// Write the value back to the IP bytes
if remainingBits > 0 && numBytes < 8 {
// Extract the remaining bits and put them back
mask := byte((1 << uint(8-remainingBits)) - 1)
hostPortion := ipBytes[numBytes] & mask
ipBytes[numBytes] = byte((networkValue&((1<<uint(remainingBits))-1))<<uint(8-remainingBits)) | hostPortion
networkValue >>= uint(remainingBits)
numBytes--
}
// Write back the full bytes
for i := numBytes - 1; i >= 0; i-- {
ipBytes[i] = byte(networkValue & 0xff)
networkValue >>= 8
}
// Get the original prefix length from the input
ones, _ := ipNet.Mask.Size()
return fmt.Sprintf("%s/%d", ipBytes.String(), ones), nil
}
| // evpn contains EVPN-VXLAN configuration for the underlay. | ||
| // +optional | ||
| EVPN *EVPNConfig `json:"evpn,omitempty"` |
There was a problem hiding this comment.
this is the part I find extremely confusing. so evpn in the API stands for the combination of EVPN (control plane) + vxlan (data plane). That's what the current nomenclature is doing.
Then we have the isis configuration, which, afaiu just instructs the router on how to configure ISIS to reach the TOR (BGP would still be used to exchange the routes ...).
Finally, we have the srv6 configuration, which configures this non VXLAN data-plane.
IIUC, the EVPN config is actually just the VXLAN data-plane configuration (which I think could/should be renamed).
To make everything more explicit, I would:
- implement a discriminated union in the
UnderlaySpecCRD to define the control plane used. It could either be VXLAN, or, SRv6 (which you're adding now). This means we should renameEVPNtoVXLAN. - add another discriminated union meant to represent the connection to the ToR. It would default to BGP. With your proposal, you'd add ISIS. Maybe in the future we'll want / need to add OSPF. Maybe call it
ToRConnectionProtocol
I'm aware I might be making this more complex, but when I read this proposal I didn't understand the purpose of ISIS in it at all. This would - imho - help make that clearer.
There was a problem hiding this comment.
Grr ... I knew I was missing something.
I'm now aware my suggestion would not work for the L2VNI exposed via L3VPN. L2VNIs always require VXLAN. I lost that detail.
It can't be a discriminated union. But, the model could work like:
- L2VNIs will always be carried by VXLAN
- L3 VPN will be carried by either VXLAN OR SRv6.
The latter can be represented in a union - meaning, the attribute name could be something along the lines of L3 VPN data-plane .
There was a problem hiding this comment.
L2VNIs are always carried by VXLAN via IPv4
L3VNIs are always carried by VXLAN via IPv4
L3VPN is always carried by IPv6/SRv6 via IPv6.
Within the terminology of the current version of this RFE. Federico didn't specifically like the L3VPN terminology, and I'm o.k. with finding a better name for it. But L3VPN here means SRv6
There was a problem hiding this comment.
this is the part I find extremely confusing. so evpn in the API stands for the combination of EVPN (control plane) + vxlan (data plane). That's what the current nomenclature is doing.
Yeah. But when you look at the EVPN field, what it really does is, it just either configures an Ipv4 address on the lound loopback. Or, it moves an interfaces into the namespace and picks the IPv4 address of that interface as the VXLAN tunnel source. It's not doing anything that is, per se, EVPN related. Indeed, I think there should be a configuration section for the tunnel source, and either it moves an interface into the ns, or it assigns IPv4 and/or IPv6 addresses from source CIDRS. Any feature that requires tunneling (such as EVPN and SRv6) should then pick the tunnel source addresses from there. Even when EVPN will eventually support IPv6 endpoints, there's no need to have different IPv6 addresses for the VXLAN and SRv6 tunnel sources. They can simply share the same IP.
In routing, using the loopbacks is also generally super important. E.g., in ISIS, a good practice is to announce only the loopback IPs into the network, and then establish BGP sessions between the loopbacks.
There was a problem hiding this comment.
This means we should rename EVPN to VXLAN.
I think we shoul rename it to something related to tunnel or loopback (not verbatim, but something that conveys that meaning), because that's what it does ... picking and configuring an IP address for dependent services that can then use those IPs.
I was about to add this change here, but then I thought that it's overkill: the current version of this RFE tries to fit the feature within the scope of the current API. Otherwise, I'd have to rename the EVPN field, meaning that the RFE and the implementation would become more complex. Fortunately, SRv6 is IPv6 only and EVPN is IPv4 only right now, so it's possible to make the current RFE work, and then redesign this in #341 ?
There was a problem hiding this comment.
who's we and what next release?
There was a problem hiding this comment.
I would keep the two separate instead of converging on one single ip (and maybe even with two separate loopback interfaces).
We can have evpn and srv6 co-exist (even if not on the same vrf), and the locator has a different semantic as it's a prefix and not a full /32 /128 ip.
There was a problem hiding this comment.
But what's the value of having 2 different loopback interfaces, with different IPs for SRv6 and EVPN? Vs having asingle IPv4 range and a single IPv6 range and assigning from there?
What happens if I, as an admin, want to reuse the same loopback IP address to preserve IP addresses?
There was a problem hiding this comment.
The locator is a completely different thing - I'm only talking about the loopback IPs for the tunnel endpoint.
Also, what happens if I have only eth10, and eth0 has an IPv4 and IPv6 address, and I want to move that interface into the OpenPERouter network namespace. And I want to use the IPv4 and IPv6 address on eth10 for both SRv6 and EVPN tunnel sources. I cannot do this with the current API with EVPN.VtepInterface and then SRv6.Source.Interface (or similar).
Of course, I could also say that with a combined field, I couldn't move 2 interfaces into the OpenPERouter ns, but given that we're talking about baremetal systems, it's more likely to have a single interface. Or we could make the interface field and the IP address fields slices.
But either way, there's no way to reuse IPs or interfaces if we have one for SRv6 and another one for EVPN, which is why I think that this avenue isn't right, and we should have one shared field TunnelSource or similar
There was a problem hiding this comment.
fair enough, but the loopback ip have different semantics depending on the tunnel. I wonder if we should extract that in a different field, and use it depending on the protocol instead of forcing a v4 for evpn and a v6 for srv6.
If we do that, then we can advertise it to neighbors either via bgp or isis and having it independent from the tunneling protocol
|
|
||
| // srv6 holds the SRv6 configuration. Requires ISIS or Neighbors configuration. | ||
| // +optional | ||
| SRV6 *SRV6Config `json:"srv6,omitempty"` |
There was a problem hiding this comment.
can we have srv6 and vxlan configured in the same underlay ?
I assume so, since I believe these will simply just configure the underlay-assigned loopback interface with IP addresses.
Another question: what if both configure different IPv6 addresses ?
There was a problem hiding this comment.
yes we can and it's part of the RFE (further clarifications on next push)
however, the vxlan tunnels with EVPN are IPv4 only, and the SRv6 endpoints are IPv6 only - so fortunately, they won't step on each other's feet.
Be it as it may, we should consolidate the 2 CIDR configurations (and the interface config) into a tunnel source config sections (or similar). From our internal convo and the convo on the other API RFE, I think we both agree on the fact that it's best not to mess too much with the current API and instead add such changes with the API RFE?
Wrt multiple IPv4 addresses / multiple IPv6 addresses e.g. on the loopback. I think that's not necessary, we need one of each. But to be discussed in the API RFE?
There was a problem hiding this comment.
however, the vxlan tunnels with EVPN are IPv4 only, and the SRv6 endpoints are IPv6 only - so fortunately, they won't step on each other's fee
as above, we can't always assume this and we should come up with an api that covers it.
There was a problem hiding this comment.
see my comment above
EVPN.VtepCIDR + Source.CIDR, and let's imagine we add the CIDRs to independent loopback interfaces
Then how do we deal with administrators who want to reuse the same IP address across EVPN and SRv6
On the other hand, if we have a common Tunnel endpoint configuration (that can perhaps be selected via a selector by EVPN and SRv6), then we can configure all CIDRs on a common loopback and we can reuse IP addresses; make the Tunnel endpoint configuration a slice, and we can add as many CIDRs as we want and thus have either common CIDRs, or e.g. one for EVPN and one for SRv6.
I can't imagine a case where it's advantageous to add CIDRs to different loopbacks, whereas I can imagine the above disadvantage of having them on different loopbacks. Unless if the same CIDR is specified 2x in the config, then we move it to a common loopback. But that duplication is a bit silly.
It's less clear to me for VtepInterface, but I think it's better to have a common config location to designate the interfaces to move into the PERouter, and then the protocols can select them with a selector or similar.
The current EVPN configuration in my opinion mixes 2 separate concepts which are shared underlay configuration (moving IP addresses and interfaces to the OpenPERouter); the EVPN VXLAN tunnel uses that underlay, but so can any other tunneling technology.
| General observations: | ||
| - The `UnderlaySpec` must allow `EVPN` and `SRV6` at the same time. | ||
| - `L2VNI` and `L3VPN` shall be able to coexist. | ||
| - However, `L3VNI` and `L3VPN` resources are mutually exclusive. This must be |
There was a problem hiding this comment.
I assume you mean within the same VRF, right ? ...
There was a problem hiding this comment.
Right now, I'd say not at all. Otherwise, we'd have to test this (neighbor configuration, etc.). It's IMO also not super likely that there's SRv6 , EVPN all mixed in the same network.
There was a problem hiding this comment.
added clarification to the rfe (next push)
There was a problem hiding this comment.
Tks. That makes it clearer.
I think it's fair.
|
|
||
| // rdAssignedNumber sets the Route Distinguisher's Assigned Number subfield. | ||
| // The Administrator subfield is automatically set to the value of the router | ||
| // ID. OpenPERouter uses Type 1 Route Distinguishers as defined in RFC4364, |
There was a problem hiding this comment.
I think your max value is not right ...
A Type 1 Route Distinguisher has the format: <4-byte IP admin>:<2-byte assigned number>.
So, the assigned number, is at most 2 bytes, which would cap the max at 2^16-1 (65535).
Actually, why do we follow type 1 RDs ? If we followed type 0, you'd have the max you define below. TBH, I don't know which format to use / guidelines on choosing it.
There was a problem hiding this comment.
Yeah and FRR supports both: https://docs.frrouting.org/en/stable-8.5/bgp.html#clicmd-rd-vpn-export-AS-NN-IP-nn
Honestly, I chose it because it shows up like that in most defaults. I'll add the new limit to the doc and a goal / non-goal for the rd type
There was a problem hiding this comment.
Tks. It's crystal clear now.
77b3dc1 to
078437f
Compare
Proposes extending OpenPERouter API to support L3VPN over Segment Routing IPv6 (SRv6). The proposal adds ISIS and SRV6 configuration fields to the Underlay CRD and introduces a new L3VPN Custom Resource Definition for managing VPN instances across Kubernetes nodes. Signed-off-by: Andreas Karis <ak.karis@gmail.com>
078437f to
fee046d
Compare
| // +optional | ||
| ExtendedNexthop *bool `json:"extendedNexthop,omitempty"` | ||
|
|
||
| // updateSource specifies the BGP update-source for this neighbor. |
There was a problem hiding this comment.
I am skeptic about allowing the user to set their own ips here, as it would mean a different ip per node and doesn't scale super well
There was a problem hiding this comment.
also, I'd consider having this enabled by default and failing srv6 if a user disables it?
| (...) | ||
| // extendedNexthop indicates if the BGPPeer has capability extended-nexthop. | ||
| // +optional | ||
| ExtendedNexthop *bool `json:"extendedNexthop,omitempty"` |
There was a problem hiding this comment.
what's the downside of always enabling it? I feel like asking to set it if you want to use srv6 is a bit clunky.
|
|
||
| We must adjust veth MTU to account for SRv6 SRH overhead. A caveat here is | ||
| that we might have to adjust veth MTU to the minimum of {VXLAN overhead; SRH oerhead} | ||
| in mixed L2VNI + L3VPN scenarios; although in theory they should be using different |
There was a problem hiding this comment.
how so if the interface with the TOR is the same one?
| Each node's Locator is derived from the Locator prefix configured in the | ||
| Underlay CR by adding the node's index to the prefix portion of the address. | ||
| For example, given a Locator prefix of `fd00:0:11::/48` and a node index of `2`, | ||
| the resulting Locator is `fd00:0:13::/48` (`0x11 + 0x2 = 0x13`). This ensures |
There was a problem hiding this comment.
we should also specify the maximum size of the space for a locator, or we'll risk to overflow it. I would also simplify forcing starting from a prefix ending with 0s.
In my very old poc I specified a range (here https://github.com/openperouter/openperouter/pull/114/changes#diff-657b56e968414485159d0e0fcec601d18fee2a275ec3842baaff84a566fa20efR59) , might be useful here too. The implementation becomes much simple.
Also, from what I saw with srv6 + bgp, it makes sense to have the loopback ip to be in the same subnet as the locator.
| Each node's ISIS NET address is derived from the base NET address configured in | ||
| the Underlay CR by adding the node's index to the System ID portion of the | ||
| address. For example, given a base NET of `49.0001.0002.0003.0004.00` and a node | ||
| index of `1`, the resulting NET is `49.0001.0002.0003.0005.00` (the System ID |
There was a problem hiding this comment.
same here, let's impose simple bases to make the implementation easier.
|
|
||
| ### Goals | ||
|
|
||
| - Implementation of L3VPN over SRv6 |
There was a problem hiding this comment.
I'm a bit afraid that if we keep in mind every potential future implementation, we overengineer an API which is currently in the alpha stage and is by definition in a transient state.
I agree. We should let the dust settle and then see what we missed, while trying to make this as good as we can
| // evpn contains EVPN-VXLAN configuration for the underlay. | ||
| // +optional | ||
| EVPN *EVPNConfig `json:"evpn,omitempty"` |
There was a problem hiding this comment.
fair enough, but the loopback ip have different semantics depending on the tunnel. I wonder if we should extract that in a different field, and use it depending on the protocol instead of forcing a v4 for evpn and a v6 for srv6.
If we do that, then we can advertise it to neighbors either via bgp or isis and having it independent from the tunneling protocol
Is this a BUG FIX or a FEATURE ?:
/kind design
What this PR does / why we need it:
Enhancement proposal for SRV6 and ISIS support.
Special notes for your reviewer:
Enhancement proposal for SRV6 and ISIS support.
Release note: