Palo Alto HA in Azure
HA in the cloud can be a bit tricky. If you’re used to plugging a few extra cables into an appliance and setting up sub-second failover HA in the data center, you’re going to be sad. While a lot of Azure services have very capable HA features, integrating third-party solutions doesn’t always work the way you’d want.
There are a few different ways to set up HA in Azure, and it’s pretty much the same for just about all vendors. The main differences are in how the configuration is synced between the members, whether load balancers are used, and whether the NVAs are active/active or active/passive.
With Azure, there have been two general methods for setting up HA: using calls to the Azure API to move routes and IP addresses between the HA members’ virtual NICs when a failure occurs, or using load balancers to monitor health and steer traffic as needed.
I’m going to outline my opinions of these two methods and introduce a third option using Azure Route Server. All of these methods use active/passive HA, which is generally more common. Active/active designs fail over the fastest but come with their own set of challenges and considerations.
Method 1: Azure API
Azure API config guide
In the Azure API model, the NVAs first need the capability to send API calls to Azure. I haven’t found a major player in the Azure Marketplace that can’t do this yet. You’d set up the NVA with a service principal and give it permissions to access networking resources. The NVA pair tracks failure events, and when one occurs, whichever member is still alive sends out the API calls to say “Hey, move the routes and IP addresses to my interface!”. This is an easy and straightforward method for HA… BUT it’s super slow. The NVAs are generally quick to respond; the bottleneck here is Azure. These API calls aren’t some prioritized, dedicated high-availability process, just normal calls to the normal “Microsoft.Network” API. This ends up making the HA flip take multiple minutes to complete.
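To make the mechanics concrete, here’s a minimal, plain-Python sketch of the route-moving step the surviving firewall triggers. It models the UDR updates as dictionary edits rather than real Azure SDK calls; all resource names and IPs are hypothetical.

```python
# Sketch of the failover step the surviving NVA performs via the Azure API:
# repoint every VirtualAppliance route at its own trust-side NIC IP.
# Illustration of the logic only -- a real implementation would use the
# Azure SDK or REST API with a service principal.

def retarget_routes(route_tables, survivor_ip):
    """Point every VirtualAppliance route at the surviving firewall's IP."""
    updated = []
    for table in route_tables:
        for route in table["routes"]:
            if (route["next_hop_type"] == "VirtualAppliance"
                    and route["next_hop_ip"] != survivor_ip):
                route["next_hop_ip"] = survivor_ip
                updated.append((table["name"], route["name"]))
    return updated

# Example: two spoke UDRs currently pointing at the failed firewall (10.0.1.4).
tables = [
    {"name": "rt-spoke1", "routes": [
        {"name": "default", "next_hop_type": "VirtualAppliance",
         "next_hop_ip": "10.0.1.4"}]},
    {"name": "rt-spoke2", "routes": [
        {"name": "default", "next_hop_type": "VirtualAppliance",
         "next_hop_ip": "10.0.1.4"}]},
]
changed = retarget_routes(tables, "10.0.1.5")  # secondary firewall's trust IP
print(changed)  # -> [('rt-spoke1', 'default'), ('rt-spoke2', 'default')]
```

In real life each of those per-table updates is a separate API call, which is exactly where the minutes of lag come from.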
You COULD try scripting it yourself (with Azure Functions, for example), but I suspect you’ll end up with the same lag, if not longer.
This model works well with VPN tunnels terminating on the appliance. Since the active appliance is essentially identical and the public IP doesn’t change, the tunnel will come back up naturally after the failover process is completed by the Azure API calls.
This is the least ‘expensive’ model, meaning there are no additional costs (outside of the time cost) for using the API to fail over, and you don’t need additional networking components. NVAs generally don’t charge extra for HA beyond needing two devices and, usually, two licenses.
Method 2: Load Balancers
Load Balancer config guide
The load balancer model relies on load balancers on both sides of the HA pair. It’s a lot faster than the Azure API process but requires more moving parts. The load balancers run periodic health checks, and traffic gets sent to whichever member is alive. State can get a bit fuzzy and failovers will often result in timeouts, but even those timeouts resolve in less time than a failover with the Azure API method would take.
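Conceptually, the load balancer’s job reduces to something like the sketch below: probe each backend firewall and send everything to whichever one answers. This only illustrates the selection logic, not the actual Azure probe implementation; the IPs are made up.

```python
# Minimal sketch of how an internal load balancer with 'HA ports' steers
# traffic in an active/passive pair: probe each backend's health port and
# forward all flows to the firewall that responds.

def select_backend(backends, probe):
    """Return the first backend whose health probe succeeds, else None."""
    for ip in backends:
        if probe(ip):
            return ip
    return None

# Simulated probe results: the primary is down, the secondary answers.
health = {"10.0.1.4": False, "10.0.1.5": True}
alive = select_backend(["10.0.1.4", "10.0.1.5"], lambda ip: health[ip])
print(alive)  # -> 10.0.1.5
```

The probe interval and unhealthy threshold you configure on the load balancer are what bound your failover time in this model.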
Note: you need to use the Standard SKU load balancer because of the required ‘HA ports’ feature; the Basic SKU doesn’t support it.
Adding the load balancers increases cost a bit, but IMO it’s negligible unless you are passing massive amounts of data through them. As with anything cloud, it can add up.
Method 3: Azure Route Server
See my review of Azure Route Server
This is my favorite model and arguably the most complicated one. I like it because it offers more flexibility by allowing you to peer and exchange routes with Azure directly using BGP (albeit a watered-down implementation of BGP). Exchanging routes removes the need to add static routes each time something changes in your Azure environment: if you add a spoke VNet and peer it to your hub, as long as the peering settings are correct, your firewall will learn the new network. This method also removes the need for a load balancer behind (inside) the NVAs, so as far as the plumbing is concerned it’s simpler. You will still need a public-side load balancer if you want to share a public IP address, however.
Azure Route Server allows you to advertise a default route to the rest of your Azure network. It also allows for ECMP (Equal-Cost Multi-Path) routes, which comes in handy with active/active setups.
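To illustrate what ECMP buys you in an active/active pair, here’s a rough sketch of per-flow next-hop selection: the platform hashes the flow’s 5-tuple and picks one of the equal-cost next hops, so any given flow sticks to one firewall. The addresses are hypothetical, and the real hash Azure uses is internal to the platform.

```python
# Rough sketch of ECMP next-hop selection across an active/active pair.
# Hashing the 5-tuple keeps each flow pinned to one firewall, which is
# what keeps stateful inspection sane with two active members.

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop deterministically per flow (hash of the 5-tuple)."""
    return next_hops[hash(flow) % len(next_hops)]

firewalls = ["10.0.1.4", "10.0.1.5"]
flow = ("10.1.0.10", "8.8.8.8", 54321, 443, "tcp")  # src, dst, sport, dport, proto
hop = ecmp_next_hop(flow, firewalls)
print(hop in firewalls)                              # -> True
print(ecmp_next_hop(flow, firewalls) == hop)         # same flow, same firewall -> True
```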
Failover is fairly quick; you just have to wait for the BGP process to withdraw the route (standard BGP timers).
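As a rough sense of the timing involved: with conventional BGP timers, the peer isn’t declared dead until the hold timer expires, so worst-case failure detection is on the order of the hold time. The values below are the classic BGP defaults; your NVA and Azure Route Server may negotiate lower timers.

```python
# Back-of-the-envelope for BGP-based failover: keepalives every 60s and
# a hold time of 3x keepalive are the conventional defaults, so a dead
# peer can take up to the full hold time to be detected.

keepalive = 60              # seconds between keepalives (classic default)
hold_time = 3 * keepalive   # conventional 3x keepalive

print(f"worst-case failure detection: {hold_time}s")  # -> 180s
```

Tuning the timers down on the firewall side (where supported) is the main lever for shrinking that window.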
An issue I did run into was with the Palo Alto’s config sync feature. If you’re running sync, it’ll sync interface details as well as everything else.
Here’s what Palo Alto syncs in an HA setup: basically, everything minus management and HA settings.
This causes an issue when the passive firewall becomes active: due to the sync process, it now has the wrong IP addresses assigned to its interfaces. You might think DHCP would be a simple fix here… You’d be wrong! Unfortunately, you cannot set up BGP on an interface that’s also running DHCP. The firewall will complain and won’t commit the changes.
The only way around this is to disable config sync between them. It’s an unfortunate trade-off, since keeping the configurations in sync is kind of a big deal. If you’re using Panorama, it’s not as big a deal, since you have more flexibility with configurations. If you’re not using it, then you’ll have to either make all configuration changes twice, or enable config sync after a change, then turn it off and go back in to adjust the interfaces and BGP settings on the secondary.
Another answer would be for Palo Alto to give us the option to exclude items from sync, or to allow BGP to function on an interface set to DHCP.
In the design below, I’ve included a public load balancer to keep a single public IP address for Internet egress, so this is more of a hybrid approach.
In this next design, I’ve done away with the load balancer entirely. Here, you would just be using native Azure Internet egress with no static public IP addresses. This is a useful design if you have an ExpressRoute, are using something like an Azure VPN Gateway for VPN connectivity, or aren’t worried about the ingress side.
Conclusions:
I’ve configured all versions of these models for clients, and then some. Which one to use depends on your requirements and capabilities. Here’s my pro/con list:
Azure API:
Pros:
- Simple and automated
- Least amount of components to manage
- Recommended by Palo Alto
- Configuration is sync’d between devices
- Plays nice with IPSec VPN tunnels
Cons:
- Sloooooow failover (multiples of minutes potentially)
- Have to use static routes for VNets
Use if:
You want to set something up that requires the least amount of components to manage and are not worried about how long failover takes.
Azure Load Balancers:
Pros:
- Fast(er) failover
- Works well with active/active configurations
- Additional health metrics from load balancers
Cons:
- More parts to manage
- Additional cost of standard load balancers
- VPNs need to be pinned to one firewall at a time, additional complexity with VPN failover
Use If:
You need quick failover or want to use an active/active setup, and you either don’t have VPN tunnels or don’t mind a somewhat more complicated VPN setup.
Azure route server:
Pros:
- Can exchange routes with peered VNets
- ECMP capabilities
- No need for load balancers
- Quick failover (as fast as load balancers)
- Works well with IPSec VPN tunnels
- Flexible design options
Cons:
- Cannot use config-sync between the firewalls (Due to interface IP sync)
- Additional complexity with routing and need to use BGP
- Azure Route Server is an additional cost
- Azure Route Server is limited in features (currently)
Use if:
You want active/active or active/passive HA with route peering to Azure and ECMP capabilities, and you need faster failover than the API method.
The biggest challenge I came across with Route Server is that it cannot override or supersede routes advertised to spokes from an ExpressRoute VNG. Since routes advertised by an ExpressRoute gateway are considered system routes, any identical routes advertised by Route Server are discarded by Azure in favor of the ExpressRoute ones. So your spoke-to-on-prem traffic would not be filtered by the NVAs; the traffic would go direct. A major blow to the use of Route Server for customers with hybrid connectivity needs.
https://learn.microsoft.com/en-us/azure/route-server/route-injection-in-spokes#connectivity-to-on-premises-through-azure-virtual-network-gateways
What I do to ensure traffic flows through the NVA in both directions is place a UDR on the GatewaySubnet… I know, that kinda defeats the purpose of dynamic routing, but it does resolve the issue. You also have to add a default route pointing to the NVA (or the load balancer if you have one). ExpressRoute really does muck things up a bit.