Modernizing a Campus Network

The Old Network

Over the past year I’ve been tasked with figuring out how to upgrade our campus wired infrastructure. We were running on EoL hardware configured around an even older three-tier switching design: a core layer where all routing and campus switching happened, a distribution layer within each building where switching aggregation occurred, and an access layer where end devices connected. In most buildings the access and distribution layers were combined. All VLANs originated in the core and were trunked down to the distribution and access layers.

This worked fine when it was originally designed decades ago, but over the years organic growth and changes have turned it into an unstable, difficult-to-maintain beast with hundreds of VLANs spanning more than a hundred buildings, multiple remote sites, two cities, and several hundred fiber links. Relying on spanning tree for redundancy has been an adventure in ghost hunting: topology changes happen frequently and cause weird intermittent outages that ripple across the entire network. On top of that, many links sit idle because spanning tree simply blocks redundant paths.
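
To make the idle-link cost concrete, here is a minimal Python sketch of how a loop-free tree leaves redundant links unused. It is a simplification, not real spanning tree: it builds a breadth-first tree from a chosen root and ignores bridge priorities, path costs, and timers. The four-node topology and the function name `spanning_tree_links` are made up for illustration.

```python
from collections import deque


def spanning_tree_links(links, root):
    """Return the links kept forwarding by a breadth-first tree from root.

    Simplified stand-in for spanning tree: every link has equal cost and
    ties are broken by node name.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    active = set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in seen:
                seen.add(peer)
                active.add(frozenset((node, peer)))
                queue.append(peer)
    return active


# Hypothetical topology: two cores, two buildings, each building dual-homed.
links = [
    ("core1", "core2"),
    ("core1", "bldgA"), ("core2", "bldgA"),
    ("core1", "bldgB"), ("core2", "bldgB"),
]

active = spanning_tree_links(links, root="core1")
blocked = [link for link in links if frozenset(link) not in active]
print(f"{len(active)} links forwarding, {len(blocked)} links blocked: {blocked}")
```

A tree over N switches can only ever use N - 1 links, so every redundant uplink beyond that sits blocked until something fails.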

Adding a new VLAN to fulfill a business need from other areas of campus requires meticulous planning and mapping of every possible path spanning tree might choose to send that traffic. VLANs might exist for a single building, or they may span every single switch on our network, remote sites and data centers included.

In addition to the problem of VLAN sprawl and spanning tree madness, the network is filled with single points of failure that I’ve been slowly identifying since I took over this role a few years ago. There are many places where a single power supply failure would take down an entire building, or even many buildings.

The Search for a Solution

When we set out to upgrade the network I had a few design goals:

Absolutely no spanning tree in the core.
Move routing down to the building if possible.
Eliminate as many single points of failure as financially possible.
Improve support for multiple security zones.
Use our fiber infrastructure more efficiently.
Automation, automation, automation!
Swap out the network without anybody noticing the changes.

Now you may notice that last goal conflicts with the first couple of goals. While I, like most network engineers, will tell you that stretching layer 2 all over the place is a terrible idea, it’s the middle of a pandemic and we’re all working from home and I’m not interested in being the guy to make everyone else change their IP space and route everything just to make my own job easier.

My boss tells me not to design to edge cases and that it’s okay to tell other people to change their processes, but I want to have my cake and eat it too. I’ll happily try to guide people into making better network decisions, but if someone really needs to have a VLAN extended across campus, I want to be able to easily deliver that service for them.

A New Goal

So how do I deliver highly available layer 2 services across the campus core without spanning tree? I could use technologies like MLAG to create loop-free redundant topologies across the network, then trunk VLANs across the LAGs.

But this seems wrong somehow. This violates the “move routing down a layer” goal. I don’t want to be trunking VLANs across the core anymore. With that in mind I updated my goals:

Absolutely no spanning tree in the core.
~~Move routing down to the building if possible.~~ Pure layer 3 routed core.
Eliminate as many single points of failure as financially possible.
Improve support for multiple security zones.
Use our fiber infrastructure more efficiently.
Automation, automation, automation!
Swap out the network without anybody noticing the changes.

A Solution

During this process we were working with an awesome partner to find the best design and help us with the implementation and deployment process. We ended up choosing Aruba Networks as our preferred campus access solution.

One of the features of the solution was something Aruba calls Dynamic Segmentation. In this mode of operation, wired switch ports are tunneled back to a central controller via GRE, just like Aruba’s APs tunnel wireless clients. This lets you treat your wired clients just like wireless clients.

With this, I could deliver seamless L2 connectivity anywhere on campus across a routed core. Combined with ClearPass for authentication and automation this seemed to meet all my design goals! Contracts were signed, POs submitted, and equipment started arriving.

Never Satisfied

Initially the design called for the bulk of the wired infrastructure to be built out using ex-HP ProCurve hardware that had been rebranded Aruba AOS. While I’m sure this would have worked fine, I have been maintaining a network using EoL hardware, and I didn’t want to build out a brand new network using previous generation technology. I wanted cutting edge.

During the vendor selection process I had the privilege of speaking directly with some senior leadership at Aruba, including the head of their switching program. I expressed my concerns about being on a supported platform for the life of this network, and we had a great conversation about the features and capabilities I was looking for in the future and my desire to push the envelope. To my surprise, a few days later I got word from my boss that some other conversations had taken place and a deal had been negotiated to get us Aruba’s next-generation CX switching platform across our entire campus. Not only that, it was the CX models that supported the most advanced feature set. I had cutting edge! The CX platform also has great support for programmability and automation, with a built-in REST API right on the switch as well as the ability to run Python code directly on the switch. Another box checked.

Room For Improvement

We decided to focus our resources on deploying the new wireless first as it was the most highly visible aspect of the network refresh. This meant we put off the wired refresh a few months. This gave me time to keep thinking about the design. Something about it kept nagging at me. I realized there were two things I didn’t love about the design.

First, the controller based wired network was proprietary. We were going to end up locked in to an Aruba specific implementation for the entire network. While I really like Aruba so far, I also like to keep my options open, especially given that we specifically did not choose another vendor for that exact reason.

Second, the controller based wired network introduced a choke point that could also be a failure point. I didn’t like the idea of all my network traffic depending on a small number of controllers to work. I like highly distributed and fault tolerant systems.

The Breakthrough

During my research into the capabilities of the CX platform I noticed a feature called EVPN. I did some further reading on the subject and discovered that it seemed to be a cross-vendor, next-generation datacenter technology typically used in spine/leaf topologies.

The most interesting aspect of it to me was that it split the network into two distinct parts: the “underlay” network, which is a pure layer 3 routed network whose sole job is to carry traffic for the layer above, and the “overlay” network, which is where the actual end-host data moves. The overlay is an abstraction, much like running multiple VMs on a single physical server. EVPN would let me move layer 2 traffic across a routed core. Maybe I could meet my design goal without needing a controller-based solution after all!
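
As a rough illustration of what "overlay on top of underlay" means on the wire, the sketch below assumes the overlay frames are carried in VXLAN (RFC 7348), the encapsulation most commonly paired with EVPN. This post has not named the encapsulation at this point, so treat that as an assumption. The function names and the VNI value are hypothetical, and only the 8-byte VXLAN header is modeled; the outer UDP/IP headers that the routed underlay actually forwards on are left out.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit virtual network ID."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Read the VNI back out of a VXLAN header."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prefix an end host's Ethernet frame with the overlay header."""
    return pack_vxlan_header(vni) + inner_frame


# Hypothetical overlay segment ID and a placeholder inner frame.
example_vni = 10100
payload = encapsulate(b"\x00" * 14, example_vni)
print(len(payload), unpack_vni(payload))
```

The underlay routers only ever look at the outer headers, so the end host's frame rides along as opaque payload. That is what lets a layer 2 segment cross a purely routed core.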

In my research it seemed that this was strictly a datacenter technology that had to run on a directly connected Clos fabric, but I was hooked and needed to know more. Besides, if you squint, my campus design looked a bit like a spine/leaf network… sort of. Eventually I learned that there were multiple types of EVPN connections. It could be used for layer 2 traffic or layer 3 traffic, but the CX implementation only supported layer 2 traffic. Bummer, but no matter, this was interesting. After lots of research and testing in the lab, I verified that I could make EVPN run across multiple router hops, not just a true spine/leaf fabric.

The Holy Grail

So one major goal was to support multiple security zones. My solution to this was to use multiple VRF instances on the routers. In short, this is like having multiple virtual routers on one piece of hardware. I could create a VRF for each security zone on the firewall! The downside was that each VRF needed to have its own connections to its peer VRFs on the other routers. Moving from a campus design that had basically two core routers to a design that had over a hundred routers was already very complicated, and running multiple VRF instances meant the problem scaled up in complexity with every VRF added.

The real solution came when I found a document from Aruba titled “Dynamic Segmentation: Virtual Network Based Tunneling”. The document was about Dynamic Segmentation, but using EVPN rather than a central controller. It referenced the 10.05 firmware release, which added layer 3 support to EVPN on the CX platform. Finally, I had validation that my dream of deploying EVPN across a campus was not only possible, but that Aruba had a validated design for it! Not only that, I could build one underlay network and then run as many overlay VRFs on top of it as I wanted without needing such complex configurations. I went back to my lab and started testing the new configuration options and soon had a production-ready configuration.
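
A bit of back-of-the-envelope arithmetic shows why this matters. The sketch below compares the number of link-level routing adjacencies to configure when every VRF needs its own connection on every routed link versus a single shared underlay. The building, uplink, and VRF counts are hypothetical round numbers, not figures from our network, and the function names are made up for this example.

```python
def per_vrf_adjacencies(routed_links: int, vrfs: int) -> int:
    """Each routed link carries one separate connection per VRF."""
    return routed_links * vrfs


def shared_underlay_adjacencies(routed_links: int) -> int:
    """One underlay adjacency per link, no matter how many overlay VRFs exist."""
    return routed_links


# Hypothetical campus: 100 building routers, each with 2 uplinks, 5 security zones.
buildings = 100
uplinks_per_building = 2
vrfs = 5

links = buildings * uplinks_per_building
print("per-VRF connections:", per_vrf_adjacencies(links, vrfs))
print("shared underlay:    ", shared_underlay_adjacencies(links))
```

With these assumed numbers, the per-VRF approach needs five times as many adjacencies, and adding a sixth zone means touching every link again. With a shared underlay, a new VRF only changes the overlay.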

Where We Are Today

As of the writing of this post we have 15 campus buildings converted to the new Aruba network running on a full EVPN based design. So far it’s been smooth sailing and nobody has noticed that we’ve ripped out the entire legacy network and replaced it with a cutting edge new network with a drastically different design under the hood.

In my next post I’ll dive into the technical details of how EVPN works and then how I’ve used it to actually implement the modern campus network. Expect configuration snippets and lab topologies you can build yourself!