This is my first ever blog post, and it is quite long because I think I picked a complex subject. I would like to share some advice on how to approach a core network migration. I have personally led many successful large-scale migrations on critical networks, including hospitals such as York, Bristol, Clatterbridge and Barnsley. Each migration is different, but there are several common elements that I will try to point out.
I have spent a significant amount of time planning and executing core migrations, so I hope these tips will make it easier for you. I am also always open to your comments and feedback, because perhaps you have executed core upgrades yourself and have better ways of doing it.
Anyway, imagine your company has just won a contract to replace a legacy Cisco Catalyst core network with a brand new Cisco Virtual Switching System (VSS), and you are the lucky engineer who gets to lead it. To make it a little more interesting, let’s assume that the existing supervisor modules don’t support VSS mode, so the job involves hardware replacement as well as clustering. Let’s also assume a 2-tier topology (collapsed core/distribution layer plus an access layer).
I am assuming you are at CCNP level with a decent understanding of networking, so I won’t be explaining technologies like VLANs, VSS, ARP, Spanning Tree, HSRP, etc.
This post is also not about best practice VSS configuration. I mention the basics, like the hub-and-spoke Layer 2 topology and link aggregation, but for design guidelines check the Cisco website.
Layer 2 domain
First of all it is critical to establish the Layer 2 topology per VLAN. In most cases those topologies will be very similar for each VLAN but as each network grows and is managed by various people over the years you can expect quite a lot of inconsistencies. You need to establish the following as a minimum:
- VTP mode
- VLAN database
- Spanning Tree Mode and Root Placement per VLAN
- Spanning Tree security features like Root Guard, Loop Guard and BPDU Guard
I would always recommend making notes and creating some decent drawings showing the topology. When you start connecting the new core switches to your current network you don’t want to cause any disruption so knowing the current Layer 2 topology is essential.
A classic example of disruption is an “accidental” Spanning Tree Root change, which can cause a blip on the network lasting up to a minute. It can happen simply because you have plugged the new core VSS into your old switches: if the priorities are equal, the lowest MAC address wins, so your new VSS is elected Root and Spanning Tree re-converges. Of course, if the current Root is protected by Root Guard there won’t be any disruption, and you will simply not get any Layer 2 connectivity to begin with.
TIP: When you connect your new VSS to the legacy core there is no reason to make it the Root switch immediately, so make sure its priority value is high enough (numerically higher than the current Root’s) that it cannot win the election, especially if you do it during the day without a maintenance window.
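To make the election rule concrete, here is a minimal Python sketch that predicts which switch becomes Root for one VLAN. It assumes the standard rule that the lowest bridge ID wins, comparing the priority first and the MAC address second. The switch names, priorities and MAC addresses are made up for illustration.

```python
from typing import NamedTuple


class Bridge(NamedTuple):
    name: str
    priority: int  # priority for this VLAN, including the extended system ID
    mac: str       # base MAC address, e.g. "0019.aaaa.0001"


def bridge_id(bridge: Bridge) -> tuple:
    """Return a sortable (priority, MAC) pair; the lowest pair wins the election."""
    digits = bridge.mac.replace(".", "").replace(":", "").replace("-", "")
    return (bridge.priority, int(digits, 16))


def elect_root(bridges: list) -> Bridge:
    """Pick the switch that Spanning Tree would elect as Root."""
    return min(bridges, key=bridge_id)


# Hypothetical values for VLAN 10 (32768 + 10 is the default priority).
legacy_core = Bridge("legacy-core", 32768 + 10, "0019.aaaa.0001")
new_vss_default = Bridge("new-vss", 32768 + 10, "0008.bbbb.0001")
new_vss_safe = Bridge("new-vss", 61440 + 10, "0008.bbbb.0001")

# Equal priorities: the lower MAC on the new VSS takes over the Root role.
print(elect_root([legacy_core, new_vss_default]).name)
# Raised priority on the VSS: the legacy core stays Root.
print(elect_root([legacy_core, new_vss_safe]).name)
```

Running the same comparison against the real priorities and MAC addresses from your own switches, per VLAN, tells you in advance whether plugging in the VSS would move the Root.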
TIP: Your new design should be using VTP transparent mode. In the past VTP Server was a popular configuration option to automate VLAN deployment but with the number of management tools available these days it isn’t necessary to use VTP for automation any more.
I am sure you have also heard about a VTP Server “accidentally” wiping the existing VLAN database. Most of the security weaknesses in VTP have been addressed by VTPv3, but I still recommend not using VTP as a VLAN management tool.
TIP: If you have a Layer 2 access layer, make sure your trunks carry only the necessary VLANs by using a static allowed list. Stay away from VTP pruning because it depends on running VTP in server/client mode and it’s hard to troubleshoot.
TIP: Your new design in VSS mode should use link aggregation, so make sure you pick one of the dynamic negotiation protocols, LACP or PAgP.
TIP: Your new VSS will be a hub-and-spoke topology from a Layer 2 perspective, but you should still run Spanning Tree with its security features because human error, like wrong cabling, can lead to a loop. Never disable Spanning Tree.
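If you keep the VLANs required by each access switch in a list, a small helper can turn that list into the compact range syntax used by a static allowed list. This is only a sketch: the function name and the sample VLAN IDs are hypothetical, and the printed line is meant to be reviewed before it goes anywhere near a trunk.

```python
def allowed_vlan_list(vlans):
    """Collapse VLAN IDs into range syntax such as "10,20-22,99"."""
    ids = sorted(set(vlans))
    for vlan in ids:
        if not 1 <= vlan <= 4094:
            raise ValueError(f"invalid VLAN ID: {vlan}")

    ranges = []
    start = prev = None
    for vlan in ids:
        if start is None:
            start = prev = vlan
        elif vlan == prev + 1:
            prev = vlan  # extend the current run of consecutive VLANs
        else:
            ranges.append((start, prev))
            start = prev = vlan
    if start is not None:
        ranges.append((start, prev))

    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Hypothetical VLANs needed on one access switch (duplicates are ignored).
access_vlans = [10, 20, 21, 22, 99, 10]
print("switchport trunk allowed vlan " + allowed_vlan_list(access_vlans))
```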
Layer 3 demarcation
What do I mean by Layer 3 demarcation? Let’s put it this way: your end devices (laptops, workstations, phones, access points, printers, etc.) would typically be connected to the access switches. Each access port should belong to a VLAN (data, voice, access points, etc.) and each of those VLANs needs a Switch Virtual Interface (SVI) to provide the default gateway for those devices so they can route traffic to other subnets.
If you have your Layer 3 demarcation at the access layer you are probably running some sort of dynamic routing protocol like OSPF or EIGRP. There are some scenarios where you need to stretch a certain VLAN between various access locations, in which case you may be using a hybrid solution.
In a hybrid solution the links between the access and core layers are trunks carrying all of the VLANs that need to be extended, as well as a dedicated VLAN that is used to form a point-to-point routing adjacency towards the upstream switches. Some SVIs are configured locally on the access layer while others are on the core.
The “hybrid” solution is not ideal because it doesn’t provide clear Layer 3 segregation in your network, but nevertheless it often exists in production environments.
TIP: Get rid of the “hybrid” solution if you can and ideally run Layer 3 demarcation at the edge. You would need to discuss this with the customer, because some legacy devices need to be Layer 2 adjacent and can be connected in different access areas.
TIP: If you can run Layer 3 to the edge then make sure you establish routing adjacency by configuring IP addressing on the physical interfaces instead of using SVIs.
Network Routing and Gateway Redundancy
Another scenario is having the access switches purely in Layer 2 mode and all SVIs terminated at the next aggregation point. In this case you will often see the access switch dual attached to two different physical switches that are providing the redundant gateway.
In order to provide gateway redundancy you will usually run protocols like HSRP, VRRP or less often GLBP. Unfortunately all those protocols in most cases require a Layer 2 triangle that forms a logical loop. That loop needs to be broken by the Spanning Tree. When you deploy VSS you will no longer require those protocols to provide the redundant gateway because in VSS mode both switches are clustered as a single logical entity.
There are two potential problems associated with migrating away from the protocols mentioned above:
- In order to send a frame to the default gateway, each host needs to resolve the gateway’s MAC address via ARP. All those protocols use special virtual MAC addresses, and once you stop using them the ARP cache on the end devices needs to reflect the new MAC address, which will be the SVI’s own MAC address. Some operating systems don’t detect the change immediately, so you can run into stale ARP cache problems for a few hours.
- Some hosts may be configured with a default gateway pointing to an individual SVI address instead of the virtual address, which will break their connectivity once the new SVI takes over the previous virtual address.
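The “special” MAC addresses are predictable, which makes stale entries easy to spot. The sketch below derives the well-known virtual MAC for an HSRP (version 1 or 2) or VRRP group and flags hosts whose ARP cache still points at it. GLBP is left out on purpose, and the host names and cache contents are invented for illustration; how you collect the ARP caches is up to you.

```python
def _dotted(hex12: str) -> str:
    """Format 12 hex digits in Cisco dotted notation (xxxx.xxxx.xxxx)."""
    return ".".join(hex12[i:i + 4] for i in range(0, 12, 4))


def virtual_mac(protocol: str, group: int) -> str:
    """Return the well-known virtual MAC for a gateway redundancy group."""
    protocol = protocol.lower()
    if protocol == "hsrpv1" and 0 <= group <= 255:
        return _dotted(f"00000c07ac{group:02x}")
    if protocol == "hsrpv2" and 0 <= group <= 4095:
        return _dotted(f"00000c9ff{group:03x}")
    if protocol == "vrrp" and 1 <= group <= 255:
        return _dotted(f"00005e0001{group:02x}")
    raise ValueError(f"unsupported protocol or group: {protocol} {group}")


def normalise(mac: str) -> str:
    """Strip separators and case so MACs from different tools compare equal."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    return _dotted(digits)


def hosts_with_stale_gateway(arp_cache: dict, protocol: str, group: int) -> list:
    """Return hosts whose cached gateway MAC is still the old virtual MAC."""
    old_mac = virtual_mac(protocol, group)
    return sorted(host for host, mac in arp_cache.items()
                  if normalise(mac) == old_mac)


# Hypothetical gateway MACs as seen in each host's ARP cache.
cache = {"pc-01": "00-00-0C-07-AC-0A", "pc-02": "0019.aaaa.0001"}
print(virtual_mac("hsrpv1", 10))
print(hosts_with_stale_gateway(cache, "hsrpv1", 10))
```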
TIP: If you want to prevent ARP resolution issues, simply configure the new SVIs on the VSS with an additional “virtual” IP address for a few days (it can be one of the previous SVI addresses). That will ensure that the VSS responds to frames sent to the virtual MAC address. After a period of time each workstation will send another ARP request, and the reply should contain the MAC address of the SVI. In practice you can remove the virtual IP address after 1 day without risking much, because ARP time-outs are typically 4 hours or less.
TIP: There isn’t much you can do about hosts being configured with wrong default gateway but let the customer know that this is a potential risk.
Another big aspect of your migration strategy is the routing. Chances are you will have a combination of static and dynamic routing with some redistribution elements. The most complex part of it will most likely be implemented at your core network. When you start moving physical connections around you need to be very careful and try to predict the routing behaviour at each step. Imagine having two hundred static routes on your network and re-patching one of the next hops somewhere else. It is very easy to introduce a routing loop if you aren’t careful.
TIP: Get all of your static routes into an Excel spreadsheet and sort them so you can see and filter all existing “Next Hops”. Track each “Next Hop” address on the network and make a note of where it is physically connected.
Finally, you need to watch out for Policy Based Routing, static ARP entries, NAT and multicast routing. The goal is to replace the core network with minimal or zero downtime.
TIP: Make a note of any static ARP entries and copy them to the new core.
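If you prefer a script to a spreadsheet, the same grouping can be sketched in a few lines of Python. It assumes routes in the plain `ip route <prefix> <mask> <next-hop>` form. Anything else (VRF routes, routes pointing at an interface) is returned separately for manual review rather than guessed at. The sample routes are made up.

```python
import ipaddress
from collections import defaultdict


def group_static_routes(config_text: str):
    """Group 'ip route' lines by next hop; return (groups, skipped_lines)."""
    groups = defaultdict(list)
    skipped = []
    for line in config_text.splitlines():
        fields = line.split()
        if fields[:2] != ["ip", "route"]:
            continue  # not a static route line
        try:
            prefix, mask, next_hop = fields[2:5]
            network = ipaddress.ip_network(f"{prefix}/{mask}")
            ipaddress.ip_address(next_hop)  # reject interface names etc.
        except ValueError:
            skipped.append(line.strip())  # VRF or interface routes: check by hand
            continue
        groups[next_hop].append(network)
    return dict(groups), skipped


# Hypothetical extract from the legacy core configuration.
sample = """
ip route 10.10.0.0 255.255.0.0 192.168.1.1
ip route 10.20.0.0 255.255.0.0 192.168.1.1
ip route 172.16.5.0 255.255.255.0 192.168.2.1
ip route vrf GUEST 10.30.0.0 255.255.0.0 192.168.3.1
"""

groups, skipped = group_static_routes(sample)
for next_hop, networks in sorted(groups.items()):
    print(next_hop, [str(n) for n in networks])
print("review manually:", skipped)
```

Each key in `groups` is one “Next Hop” to trace on the network, and the routes under it are the batch that has to move together.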
TIP: If you find any Policy Based Routing or NAT make sure you ask the customer about this. It may no longer be required.
TIP: Multicast migration is another important consideration; perhaps it needs a redesign or security hardening. Make sure you confirm all the existing sources and receivers as well as the RP placement. Once you have all that information, make sure you implement multicast security features on the new core to protect the control plane. Chances are the legacy infrastructure won’t have the most amazing filtering in place, so make the new core secure and in line with best practice design.
Migration High Level Approach
I was always lucky enough to have a teammate with me and I hope you can be in the same position. If you don’t have a second pair of hands you will often have to ask the customer to help. You are leading the migration and making all the configuration changes while the second person is re-patching for you.
Labelling all the connections is really important so you can easily revert your changes. I always have a spreadsheet mapping each physical interface on the old core to the end device and destination port on the new core. That way you can simply communicate each re-patching task to your colleague and revert back if required.
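The same spreadsheet can be sanity-checked before the change window. The sketch below assumes the mapping has been exported as CSV with three hypothetical columns (`old_port`, `device`, `new_port`). It flags any new core port that has been promised to more than one device, and prints the re-patching instruction for each row in either direction.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export of the patching spreadsheet.
PLAN_CSV = """old_port,device,new_port
Gi1/1,access-sw-01,Te1/1/1
Gi1/2,access-sw-02,Te2/1/1
Gi2/1,firewall-a,Te1/1/1
"""


def load_plan(csv_text: str) -> list:
    """Read the plan into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def double_booked_ports(plan: list) -> dict:
    """Return new core ports that are assigned to more than one device."""
    users = defaultdict(list)
    for row in plan:
        users[row["new_port"]].append(row["device"])
    return {port: devs for port, devs in users.items() if len(devs) > 1}


def patch_instruction(row: dict, revert: bool = False) -> str:
    """Describe one re-patching task, or its reverse when reverting."""
    if revert:
        return (f"{row['device']}: move from new core {row['new_port']} "
                f"to old core {row['old_port']}")
    return (f"{row['device']}: move from old core {row['old_port']} "
            f"to new core {row['new_port']}")


plan = load_plan(PLAN_CSV)
print(double_booked_ports(plan))
print(patch_instruction(plan[0]))
print(patch_instruction(plan[0], revert=True))
```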
When we talk about physical re-patching there are a number of things to plan ahead:
- the length and type of each existing link: will it stretch to the adjacent rack with your VSS, or do you need to provide a new patch lead?
- labelling fibres in a data centre is great, but those labels are often blown away by the air-conditioning system, so again make sure you have an electronic record
From the logical point of view you need to start your migration by:
- Clustering the VSS chassis and upgrading them to the recommended software version
- Creating all VLANs and SVIs, but keeping the SVIs in the “admin shut-down” state to avoid duplicate IP addressing
- Configuring basic routing to enable Layer 3 connectivity between the legacy core and the VSS. Depending on the current network configuration it may be static, dynamic or a mixture of both
- Connecting the new VSS core to your legacy core with 2 redundant links bundled in a port-channel. The migration period should be as short as possible, so I normally don’t provide a redundant “loop” topology because it makes things more complex. Instead you can provide a second “cold standby” redundant bundle but keep it “admin shut-down” to avoid any loops. You can bring it up later in an emergency.
- Making sure the new port-channel provides both Layer 2 and Layer 3 connectivity, so it should be a trunk allowing all VLANs required by the old network as well as a dedicated point-to-point VLAN for routing
- Establishing PIM adjacency if there is any Layer 3 multicast involved
Once the legacy network is connected to the new VSS at both Layer 2 and Layer 3 I would normally split the migration into the following parts:
- Migrate the Access Layer to the new VSS and convert the redundant uplinks to a port-channel
- Migrate firewalls, servers and any other devices
- Migrate SVIs to the new VSS by enabling them on the VSS side and disabling on the old core. It is good to run continuous ping to some devices inside the subnet you are moving and let the customer do some testing too.
- Somewhere around this point you need to start planning Multicast RP migration (if there is one configured)
- Migrate Static Routing & Redistribution by grouping the routes by their common Next Hop address. This normally means you have all your static routes in Notepad, ready to paste into the new core and remove from the old core. Any redistribution into a dynamic routing protocol should be controlled by a route-map, which you can prepare in advance.
- Migrate Default Route
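For the SVI step in particular, it helps to check the state of every gateway on both cores as you go, because the two dangerous states are “enabled on both” (duplicate IP) and “enabled on neither” (no gateway). Here is a minimal sketch of that check. It assumes you already have the up/down state of each SVI from both cores, collected however you like. The VLAN numbers and states are invented.

```python
def svi_cutover_status(old_core_up: dict, vss_up: dict) -> dict:
    """Classify each VLAN by where its SVI is currently enabled."""
    status = {}
    for vlan in sorted(set(old_core_up) | set(vss_up)):
        on_old = old_core_up.get(vlan, False)
        on_vss = vss_up.get(vlan, False)
        if on_old and on_vss:
            status[vlan] = "DUPLICATE: gateway enabled on both cores"
        elif on_old:
            status[vlan] = "not migrated yet"
        elif on_vss:
            status[vlan] = "migrated"
        else:
            status[vlan] = "NO GATEWAY: shut down on both cores"
    return status


# Hypothetical snapshot taken part-way through the migration.
old_core = {10: False, 20: True, 30: True, 40: False}
new_vss = {10: True, 20: False, 30: True, 40: False}

for vlan, state in svi_cutover_status(old_core, new_vss).items():
    print(f"VLAN {vlan}: {state}")
```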
At the end of the process you should end up with the new VSS connected to an empty old core. The old core may still be the Spanning Tree Root, but it doesn’t really matter because you should have a logical hub-and-spoke topology with the VSS. You can now power down the old cores and change the Spanning Tree priority on the VSS to some low value.
TIP: If you want to convert two individual uplinks from the access layer to a port-channel you can do it by:
- shutting down one of the links and converting it to a port-channel
- re-patching the link to VSS (which is preconfigured in a port-channel) and enabling it
- at this point you can shut the remaining link to the legacy core, add it to port-channel bundle, re-patch to VSS and enable
TIP: At every step of the migration you need to try to predict the result so there are no surprises. I would pay most attention to any changes that can impact your own SSH session to the devices. If you remember the basics, maintaining the same Layer 2 / Layer 3 connectivity before and after each task, you should be all right.
TIP: Don’t rush it. I can assure you that taking predictable baby steps is far more efficient than trying to cut corners. Remember, if you cause an outage you may have to troubleshoot for hours, revert all your changes and write a major incident report, and potentially the whole migration can be put on hold.
Migrating the existing network to the new solution is a big and challenging job. You can’t just go for it without proper consideration and planning. There are many simple things that can go wrong if you aren’t careful, so whatever you do, make sure you understand exactly what the outcome will be. You don’t want to be in a position where a single typo or mistake cuts you off from your SSH session and causes a major network outage at the same time.
I know there are people who always recommend the “reload in” command just in case you don’t know what you are doing, but don’t be one of them. Would you seriously want to run that command on a switch in a hospital feeding critical care or A&E anyway? If you aren’t sure or are afraid, go there with your console cable instead.
We all make mistakes, but our aim has to be perfection and minimum disruption. It is better to be pessimistic and careful than overconfident. When it comes to network engineering, nothing can replace meticulous planning and great diagrams.