I’ve been seeing some network problems lately, at sites where the problem was designing the VPC and routing mix correctly. Generally there’s plenty of room to make a mistake, and the situation is a bit confusing to most people. So I’m going to try to explain how to separate out routing and Layer 2 (L2) forwarding with VPC’s, so the routing will work correctly. I’m hoping to help by explaining the problem situation you need to avoid as simply as I can, and showing some simple examples, with lots of diagrams. For a simple description of how basic VPC works, see my prior posting, How VPC Works.
Cisco has put out some pretty good slideware on the topic, but there are an awful lot of (too many?) diagrams. Either that’s confusing folks, or people just aren’t aware that VPC port channels have some design limitations: you can’t just use them any which way as with normal port channels (or port channels to a VSS’d 6500 pair).
The short
version of the problem: routing peering across VPC links is not
supported. (Adjacency will be established but forwarding will not work
as desired.) The “vpc peer-gateway” command does not fix this, and is
intended for another purpose entirely (EMC and NetApp end systems that
learn the router MAC address as the source MAC in frames, rather than
using ARP and learning the default gateway MAC address).
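For reference, here is a minimal sketch of where that command lives in the configuration. The VPC domain number and keepalive addresses are invented for illustration; this only shows the command’s placement, it does not make routing over a VPC supported.

vpc domain 10
  peer-keepalive destination 192.0.2.2 source 192.0.2.1
  peer-gateway    ! for hosts that reply to the received source MAC instead of ARPing for the gateway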
Let’s start by repeating the basic VPC forwarding rule from the prior blog:
VPC Rule 101
VPC peers are expected to forward a frame received on a member link out any other member link that needs to be used. Only if they cannot do so due to a link failure is forwarding across the VPC peer link and then out a member link allowed, and even then the cross-peer-link traffic can only go out the member link that is paired with the member link that is down.
The same rules apply to routed traffic. Since VPC does not spoof the two peers into appearing as a single L3 device, packets can get black-holed.
The Routing with VPC Problem
Here’s the
basic situation where we might be thinking of doing VPC and can get
into trouble. Note I’ve been using dots for routed SVI’s, just as a
graphical way to indicate where the routing hops are. (No connection
with the black spot in the novel Treasure Island.)
This is
where we have a L3-capable switch and we wish to do L2 LACP
port-channeling across two Nexus chassis. If the bottom switch is
L2-only, no problem. Well, we do have to think about singly-homed
servers, orphan (singly-homed) devices, non-VPC VLANs, failure modes,
etc., but that is much more straightforward.
All is fine if you’re operating at Layer 2 only.
Let’s walk
through what VPC does with L3 peering over a L2 VPC
port-channel. Suppose a packet arrives at the bottom switch C (shown by
the green box and arrow in the diagram above or below). The switch has
two routing peers. Let’s say the routing logic decides to forward the
packet to Nexus A on the top left. The same behavior could happen if it
chooses to forward to B. Router C at the bottom has a (VPC) port channel. It has to decide which uplink to forward the packet over to get it to the MAC address of Nexus A at the top left.
Approximately
50% of the time, based on L2 port channel hashing, the bottom L3 switch
C will use the left link to get to Nexus A. That works fine. Nexus A
can forward the frame and do what is needed, i.e. forward out another
member link.
The other
50% or so of the time, port channel hashing will cause router C to L2
forward the frame up the link to the right, to Nexus B. Since the
destination MAC address is not that of Nexus B, Nexus B will L2 forward
the frame across the VPC peer link to get it to A. But then the problem
arises because of the basic VPC forwarding rule. A is only allowed to
forward the frame out a VPC member link if the paired link on Nexus B is
down. Forwarding out a non-member link is fine.
So the problem is in-on-member-link, cross-peer-link, out-another-member-link: no go unless the paired member link is down. Routing does not alter this behavior.
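To make the scenario concrete, here is a rough sketch (NX-OS style syntax, with invented VLAN, port-channel, and EIGRP AS numbers) of what the bottom L3 switch C might be running: an SVI that peers EIGRP with SVIs on the Nexus pair, reached across a port channel that is a VPC on the Nexus side. This is the combination to avoid.

feature interface-vlan
feature eigrp
router eigrp 100
!
interface port-channel10
  switchport mode trunk
  switchport trunk allowed vlan 10   ! this port channel is a VPC on the Nexus side
!
interface Vlan10
  ip address 10.1.10.3/24
  ip router eigrp 100   ! peers with SVIs on Nexus A and B: adjacency comes up, forwarding breaks about half the time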
Yes, if there is only one pair of member links, you cannot have problems, at least until you add another member link. If you add a 2nd VLAN that is trunked on the same member links, inter-VLAN routing may be a problem. If you just do FHRP routing at the Nexus pair, there is no problem: the L2 spoofing handles MAC addresses just fine (using the FHRP MAC, so no transit of the peer link is necessary). It’s when your inter-VLAN routing is via an SVI on one of the bottom switches routing to a peer SVI on the Nexus pair that you will probably have problems.
You can
have similar problems even if only one of the two Nexus switches is
operating at L3, or has a L3 SVI in a VLAN that crosses the VPC trunks
to the switch at the bottom. We will see an example of this later.
Conclusion: it is up to us to avoid getting into this situation! That is, VPC is not a no-brainer; if you want to mix it with routing, you must design for that.
You can
also do this sort of thing with two switches at the bottom of the
picture, e.g. pair of N5K to pair of N7K’s. Or even VSS 6500 pair to VPC
Nexus pair. See also our Carole Reece’s blog about it, Configuring
Back-to-Back vPCs on Cisco Nexus Switches, and the Cisco whitepaper with
details, http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps708/white_paper_c11_589890.html.
VPC is allowed and works, but we need to design it to operate at L2
only.
Drilling Down on VPC Routing
We are
also OK if we use a FHRP with a VPC to get traffic from a VPC’d server
to a pair of Nexii, and then route across non-VPC point-to-point links,
e.g. into the campus core or WAN. VPC does very well at spoofing L2,
and the virtual MACs used with the three FHRP’s allow direct forwarding
out VPC member links by VPC peers. Routing to the core uses non-VPC
non-member links, so no problem.
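As a sketch of that working combination (VLAN, interface, and address numbers invented for illustration): the Nexus pair runs HSRP on the server VLAN carried over the VPC member links, while the uplink toward the core is an ordinary routed port that is not a VPC member.

feature hsrp
feature interface-vlan
!
interface Vlan20
  ip address 10.1.20.2/24
  hsrp 20
    ip 10.1.20.1   ! virtual gateway; either VPC peer can forward for this MAC directly
!
interface Ethernet1/30
  description routed point-to-point uplink to core, not a VPC member
  no switchport
  ip address 10.0.0.1/30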
The problem in the L3 story above is that the frame is being forwarded at L2 to the real MAC, not the virtual MAC, of A, and B is not allowed to do the routing on behalf of A.
The next
diagram shows how this typically bites us. If we’re migrating from
6500's (bottom) to Nexus (top) and we are inconsistent, we can get in
trouble. If our packet hits an SVI, is routed to Nexus B but sent via
Nexus A, then Nexus B will not be able to route the frame again out the
member link marked with the red X, to get to a L3 SVI on the bottom
right switch D.
This might
happen from datacenter to user closet, if you have L2 to a collapsed
core/distribution Nexus pair, with some SVI’s between old 6500 C and new
Nexus switches A and B in the datacenter, and closet switches with
SVI’s on the same switches as the datacenter SVI’s (switch D in the
diagram). It might also happen if you have some VLANs with SVI’s on
datacenter access switches like C, and other VLANs on other datacenter
access switches like switch D (perhaps even with all SVI’s migrated to
live only on the Nexus pair). It can even happen on one switch, where C
and D are the same switch, and you’re routing between VLANs via an SVI
on C. (Same picture, just a little more cluttered because the green
arrow and red X are on the link back to C.)
Summary: Making Routing Work with VPC
Here’s the
Cisco-recommended design approach, using my drawing and words. The
black links are L2 VPC member links. The red links are additional
point-to-point routed links.
The simple
design solution is to only allow L2 VLANs with SVI’s at the Nexus level
across the VPC member links. If you must have some SVI’s on the bottom
switch(es) and some others on the Nexus switches, block those VLANs on
the L2 trunks that are VPC members, and route them instead across
separate L3 point-to-point links, shown in red in the above diagram. Of
course, if you’re routing say VLAN 20, there would be no point to having
a routed SVI for VLAN 20 on the bottom switch and on the Nexus switches
as well.
The point-to-point routed interfaces do not belong to VLANs, so they cannot accidentally be trunked over the member links, which are usually trunks.
When you
have SVI’s rather than routed interfaces or dot1q subinterfaces, you
have to be aware of which VLANs you do and do not allow on the VPC
member links. If you have many VLANs that need routing, use dot1q
subinterfaces on the routed point-to-point links to prevent “VPC routing
accidents”. Or use SVI’s and trunking over the point-to-point non-VPC
links, just be very careful to block those VLANs on the VPC trunk member
links.
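A rough sketch of that approach, again with invented numbers: prune the bottom-routed VLANs from the VPC member trunk, and carry the routed traffic on dot1q subinterfaces of the dedicated point-to-point link instead.

interface port-channel20
  description VPC member trunk to the bottom switch
  switchport mode trunk
  switchport trunk allowed vlan remove 30,40   ! keep the bottom-routed VLANs off the VPC trunk
!
interface Ethernet1/31
  no switchport   ! dedicated routed link; a routed port cannot accidentally carry trunked VLANs
interface Ethernet1/31.30
  encapsulation dot1q 30
  ip address 10.9.30.1/30
interface Ethernet1/31.40
  encapsulation dot1q 40
  ip address 10.9.40.1/30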
Using VPC to Buy Time to Migrate to L3 Closets
As you
will have noticed in my recent blog, Simplicity and Layer 2, I like L3
closets. That generally means your L2 is mostly confined to the
datacenter. No L2 problems out in the closets!
Our present discussion is highly relevant if you are migrating from L2 to L3 closets. Several hospitals we are working with have had spanning tree problems (or the risk of them). They wish to reduce the size of their L2 domains, and the associated risk, by moving to L3 closets. One way to tackle this is to drop Nexus switches in at the core or distribution layer (sometimes those are a single combined layer), and start out running VPC to all the L2 closets. That “stabilizes the patient” to buy time and stability for the cure, L3 closets.
If you
whittle away at sprawling VLANs spanning closets, buildings and
campuses, you can generally manage to clean up one closet at a time.
Iterate for the next year or two. Painful, but much more robust!
Consider a
single closet switch that you’re working on. You can get yourself to a
situation where the SVI’s are in the distribution layer Nexuses, say,
and you have L2 VPC member trunks to the closet switch (now represented
by the bottom switch in our above diagrams). When all the VLANs are
single-closet-only VLANs, you can then un-VPC the uplinks to that one closet,
turn them into point-to-point routed links, put the SVI’s on the closet
switch instead, and be done. If you want a slower transition, add
separate L3 routed point-to-point links like the above red lines, and
control which VLANs are trunked across the VPC member links. All it
takes is organization and being clear about where you’re doing L2 and
where you’re doing L3 — which I’d say should be part of the design
document / planning.
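The end state for one migrated closet might look something like the following on a distribution Nexus (interface, addressing, and routing protocol invented for illustration): the former VPC member uplink becomes a plain routed link, and the closet VLAN’s SVI now lives on the closet switch.

interface Ethernet2/1
  description uplink to closet switch, formerly a VPC member trunk
  no switchport
  ip address 10.200.1.1/30
  ip router eigrp 100   ! or whatever IGP you run; the closet subnets are now routed on the closet switch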
Another Example
One more real-world example shows how easy it is to overlook the potential problem. Suppose you have a router, e.g. an MPLS WAN router, and for some reason you have to attach it to a legacy switch at the bottom of the picture, as shown in the following diagram:
Why would you do this? In one case we’ve seen, the vendor router had a FastEthernet port, and the Nexus switch had no 100 Mbps capable ports. Another reason is copper versus fiber ports, and the locations of the devices in question.
Suppose the uplinks are VPC members, and because of the VPC routing problems, the site is trying to make this work with the routing done on just the right Nexus switch, switch B. In the case in question, C and D were actually the same switch, but I’m presenting it this way since the diagram is clearer when I show two switches.
At the left we see a packet hitting an SVI in the leftover 6500 (which some sites would shift to being a L2-only access switch, and other sites would discard or recycle elsewhere in the network).
The bottom left switch SVI can route to other SVI’s that are local. To get to the WAN router, the left switch needs to somehow route the packet via the top right Nexus, Nexus B. It turns out there is exactly one VLAN with SVI’s on all three switches, which gives switch C a way to route to the rest of the network. Switch C therefore follows the dynamic EIGRP routing, routing into the shared VLAN with next hop Nexus B.
In 50% of
the flows, the packet goes via the left Nexus A, across the peer link,
and thus B cannot forward it out the VPC member link to get to the
router.
Exercise for the reader: consider traffic going the other way, from the WAN back to the datacenter. See the following diagram:
Does it work? If not, what goes wrong? Can you explain it? [Hint: there's a red X in the above diagram for a reason!]
Possible Solutions
(1) Attach
the MPLS VPN WAN router to one or both Nexii directly. Note that
dual-homing via the 6500 (bottom right) is a Single Point of Failure
(SPoF), so connecting to only one N7K is no worse (or better).
(2) Put
the SVI for the router’s VLAN on the bottom right switch, and convert
the uplinks to L3 point-to-point. Or use dedicated point-to-point links
for all routed traffic from bottom right to the two Nexii. Since point
to point routed interfaces don’t belong to VLANs, they can’t
accidentally be trunked over VPC member links.
(3) Have
no SVI’s on the Nexii — do all routing on the bottom switches. That
actually works — but doesn’t help in terms of getting the routing onto
the much more powerful Nexus switches, which is where you probably want
it.
Conclusion
Please
don’t draw the conclusion that you can’t do routing with VPC. You can in
certain ways. What you do not want is a router or L3 switch interacting
with routing on VPC peers over a VPC port-channel link. You can route
to VPC peers as long as you’re not using a VPC port-channel, e.g. just a
plain point-to-point link or a L3 port-channel to a single Nexus. If
there is an SVI at the bottom (that is, not on the Nexus pair) for a
given VLAN, block it from the member links and thereby force it to route
over the dedicated routed links. In that case, don’t allow the VLAN
across the VPC peer link either: that link should only carry the VLANs
that are allowed on the VPC member links, and no others, no routing,
nothing else.
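In configuration terms, that means keeping the peer link’s allowed-VLAN list aligned with what the VPC member links carry, something like this sketch (numbers invented):

interface port-channel1
  vpc peer-link
  switchport mode trunk
  switchport trunk allowed vlan 10,20   ! only the VLANs allowed on the VPC member links, nothing else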
You can also route over a VPC port-channel, as long as your routing peers are reached at L2 across the VPC but are not the VPC peers your VPC connects to. That is, routing peering across a L2-only VPC Nexus pair in the middle is OK.
In the
datacenter, stick to pure L2 when doing VPC, up to some sort of L3
boundary. When doing L3, use non-VPC L3 point-to-point links. If you
have a pod running off a pair of L3-capable Nexus 55xx’s and you feel
the need to VPC some L2-ness through your Nexus 7K core, fine, just use
dedicated links for the L3 routing. And when doing so, don’t use SVI’s,
use honest to goodness L3 ports, that is, “no switchport” type ports.
That way you cannot goof and forget to disallow any relevant VLANs
across VPC member links that are trunks.
Upcoming design consideration: don’t VPC multi-hop FCoE. It’s OK to VPC FCoE at the access layer, just don’t do it beyond there. Why not VPC multi-hop FCoE? Among other reasons, it makes it far too easy to merge fabrics accidentally. That’s a Bad Thing, definitely something you do not want to do! Also, you do have to be careful about FCoE with a 2 x 2 VPC; that’s covered in the Nexus course (now named “DCUFI”), which I’m teaching about once a month for FireFly (www.fireflycom.net).
Why Did Cisco Do It This Way?
I think the engineers expected everyone doing L3 to put it on separate links. It’s not clear to me why they thought people would WANT to do that. Nor is it clear they anticipated the confusion a lot of people seem to have about SVI’s and about where the routing is actually being done (i.e. the understanding required is too complex for the real world). It might also have something to do with the datacenter switch positioning of the Nexus products.
References
Quote from that thread: “We don’t support running routing protocols over VPC enabled VLANs.”