The Most Unlikely Network Outage

Why I don’t like VTP

It must be 7-8 years now since I first heard that running VTP in Client / Server mode can be dangerous and should be avoided. There are some well-known problems and scenarios in which you can bring the entire network down simply by replacing a switch. The point of this blog is not to explain how this can happen, because that is already widely documented on the internet. Instead, what I would like to demonstrate here is a lesser-known side effect of using VTP Server / Client mode when VTP Pruning is also enabled. That experience made me an even stronger supporter of disabling VTP on the network and only using modern automation tools to add VLANs to multiple switches at the same time.

In hours or Out of hours

I was the Delivery Architect responsible for the Core Network Migration at University Hospitals Bristol, and on that day I was simply preparing for the evening work. I had great documentation and a detailed plan on paper and in my head. Some of the main points you need to consider for a core network migration were discussed in the previous blog, so I don’t need to go into detail again.

The plan at this moment was simply to add a few extra VLANs to the core network and later use those VLANs to establish the new Layer 3 peering required by the migration plan. I am always very cautious about risk and about which changes can be made during working hours, so all the service-affecting work should always be done out of hours. Surely adding a few new VLANs to the VLAN database is safe, or isn’t it?

Unexpected VTP Pruning Behaviour

I was in my zone when phones started ringing and people began frantically walking around and engaging in nervous conversations. “Call manager is down,” someone said just a few meters away from me. “Radiology can’t retrieve images,” someone else shouted. “A&E just rang, they lost wireless.” It wasn’t until 10 minutes later that I started looking at those problems and trying to help the IT team.

I quickly realised that tens of access switches were only receiving VLAN 1 and all the rest of the VLANs were “pruned” by VTP. What is going on here, I thought, and bounced the link to one of those switches, only to see the full list of VLANs become available again and then disappear 30 seconds later. I realised I had 30 seconds during which I might be able to log in to the switch and check what was going on. To my horror I saw log messages saying that the VLAN database had reached its limit of 255 VLANs and the switch had converted itself to transparent mode. As a result it wasn’t responding to any VTP Pruning messages, so for whatever reason the upstream core pruned all VLANs instead! At that point it was not important whether it should behave that way or not; maybe it was IOS specific or a bug. Whatever it was, it had to be fixed quickly.
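In hindsight, a simple pre-change check could have flagged the risk before the new VLANs were added. The following is a minimal Python sketch of that idea, not something that was used on the day. The inventory, the switch names and the `max_vlans` figures are all hypothetical; real limits depend on the platform and software version and would have to come from vendor documentation.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    max_vlans: int      # assumed platform limit, taken from vendor documentation
    current_vlans: int  # VLANs currently in the switch's VLAN database


def switches_at_risk(inventory, vlans_to_add):
    """Return names of switches whose VLAN database would overflow."""
    return [
        sw.name
        for sw in inventory
        if sw.current_vlans + vlans_to_add > sw.max_vlans
    ]


# Hypothetical inventory: names and numbers are made up for illustration.
inventory = [
    Switch("core-1", max_vlans=4094, current_vlans=253),
    Switch("access-17", max_vlans=255, current_vlans=253),
    Switch("access-42", max_vlans=1005, current_vlans=253),
]

at_risk = switches_at_risk(inventory, vlans_to_add=4)
print(at_risk)
```

Any switch in the returned list would exceed its assumed limit once the new VLANs arrive, which is a good reason to stop and rethink the change.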

The situation was really bad: around 70 access switches were effectively isolated from the network, and removing the extra VLANs from the VTP server didn’t fix the problem because those switches were still transparent and ignored VTP pruning messages. One obvious fix was disabling VTP pruning, but the customer didn’t want to take the risk and, to be honest, I wasn’t that keen to do it in hours either. At the end of the day, if adding a couple of VLANs had caused the network outage, I would rather not experiment with any changes that could break more things. We ended up reloading most of those switches, and in the evening VTP pruning was disabled and the whole network was converted to VTP transparent mode.
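For illustration only, the end state described above (VTP transparent mode, with VLANs added by automation rather than propagated by VTP) could be generated along these lines. This is a sketch under assumptions: the VLAN IDs and names are hypothetical, the function only builds text and does not connect to any device, and the commands should be checked against the platform's documentation before use.

```python
def render_vlan_config(vlans):
    """Build IOS-style configuration lines for a switch in VTP transparent mode."""
    lines = ["vtp mode transparent"]
    for vlan_id, name in sorted(vlans.items()):
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"invalid VLAN ID: {vlan_id}")
        lines.append(f"vlan {vlan_id}")
        lines.append(f" name {name}")
    return lines


# Hypothetical VLANs for the new Layer 3 peering.
new_vlans = {901: "CORE-PEER-A", 902: "CORE-PEER-B"}
config = render_vlan_config(new_vlans)
print("\n".join(config))
```

The same list of lines can then be pushed to every switch by whatever automation tool is in use, so each switch keeps its own VLAN database and no switch depends on VTP advertisements.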

Learning from experience

Why did I mention this most unlikely network disaster that I accidentally caused? At the end of the day I am the expert, so I should be perfect? In reality we all make mistakes, and I am a big fan of knowledge sharing and learning from bad experiences. There are still plenty of flaky old networks out there, and our job as network experts is to make sure that when we upgrade them, we take into account everything that we can think of, but then think outside the box when we are unlucky enough to see a problem we have never come across before.

Never concentrate on the blame game or on looking for a scapegoat. Focus on the solution and the way forward instead. Learning from mistakes is the most effective way to improve, so it is critical to create an environment where people are encouraged to share their experience with others so the same mistake doesn’t happen again.