Network Resiliency: Why it Really Is Better to Prepare for the Worst

by Tim McConnaughy | Aug 20, 2024

Like a severe thunderstorm or car trouble, outages and downtime in your network often happen at the worst of times. These outages can be devastating: they scramble flight schedules, delay important surgeries, disrupt retail sales, and damage your credibility with customers. If you manage a network, you cannot wait until an outage happens to start planning for recovery. Pursuing networking resiliency means designing your network to be as outage-proof as possible so you are ready when the unexpected happens.

Unfortunately, just like cloud security, designing for resiliency carries extra costs. Here are some best practices for achieving network resiliency without spending your entire budget.

Redundancy: Strategic Duplication

A core principle of network resiliency is redundancy: using resources, services, and gateways that perform the same function but live in different places. If trouble strikes one group of resources, the redundant resources can take over. For example, if a cloud region goes down, resources deployed in another region or another cloud can keep serving traffic and preserve business continuity.
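
To put a rough number on that benefit, here is a minimal sketch of the standard availability calculation for redundant copies. It assumes the copies fail independently and that the service stays up as long as one copy is up; real regions can share failure causes, so treat the result as an upper bound. The 99% figure and the function name are illustrative assumptions, not measurements of any provider.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant resources, assuming independent
    failures and that one surviving copy is enough to stay in service."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


HOURS_PER_YEAR = 24 * 365

for copies in (1, 2):
    availability = combined_availability(0.99, copies)  # 0.99 is an assumed figure
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} region(s): {availability:.4%} available, "
          f"about {downtime_hours:.1f} hours of downtime per year")
```

The second copy removes most of the expected downtime, which is why redundancy is usually the first resiliency investment despite the added cost.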

Redundancy is essential for resiliency, but it can be expensive because it can double the number of resources and connections you pay for. The Aviatrix Cloud Networking Platform uses redundancy in two main areas to balance resiliency with cost optimization:

High Availability (HA) gateways – With High Availability gateways, you can have two active Aviatrix gateways forwarding traffic or configure them as active-standby to support asymmetric connections to hybrid devices like firewalls.
Multiple Availability Zones – Aviatrix deploys infrastructure to multiple Availability Zones so that a single cloud data center failure will not impact connectivity. The Aviatrix Controller manages the cloud route tables to load balance traffic between the gateways. If a gateway fails, the Controller automatically reprograms the route tables to point at the surviving gateway.
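
The route reprogramming described above can be pictured with a small model. This is only a conceptual sketch, not the Aviatrix Controller's implementation or API: the `Gateway` and `RouteTable` classes, the `program_routes` function, and the CIDR ranges are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Gateway:
    name: str
    zone: str
    healthy: bool = True


@dataclass
class RouteTable:
    # Maps a destination CIDR to the name of the gateway used as next hop.
    routes: dict[str, str] = field(default_factory=dict)


def program_routes(table: RouteTable, gateways: list[Gateway],
                   destinations: list[str]) -> None:
    """Spread destinations across healthy gateways in round-robin order."""
    healthy = [gw for gw in gateways if gw.healthy]
    if not healthy:
        raise RuntimeError("no healthy gateway available")
    for index, cidr in enumerate(destinations):
        table.routes[cidr] = healthy[index % len(healthy)].name


# Hypothetical gateways in two Availability Zones.
gateways = [Gateway("gw-primary", "zone-a"), Gateway("gw-ha", "zone-b")]
destinations = ["10.0.0.0/16", "10.1.0.0/16"]
table = RouteTable()

program_routes(table, gateways, destinations)
print("normal:  ", table.routes)

# Simulate a zone failure, then reprogram: every route moves to the survivor.
gateways[0].healthy = False
program_routes(table, gateways, destinations)
print("failover:", table.routes)
```

In normal operation each gateway carries part of the traffic; after the simulated failure, the same function points every destination at the remaining gateway.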

Monitoring: Constant Alertness

The second step in your resiliency plan is monitoring: keeping an eye on your network to find potential threats and irregularities that reveal deeper issues. This is ongoing work that a networking team cannot skip, but automation can make it easier.

Networking teams monitor several key areas:

Traffic – The ebb and flow of user activity and its effect on performance through days and seasons.
Logs – The records of activity and events in a network that can help teams reconstruct the story of what happened.
Anomalies – Unusual network activity that might indicate threats or application problems.
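
As one illustration of automated anomaly detection, the sketch below flags traffic samples that fall far outside a rolling baseline. It is a deliberately simple technique (a standard-deviation threshold over the previous samples), not a description of how any particular product detects anomalies. The function name, window size, threshold, and sample values are all assumptions made up for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples: list[float], window: int = 12,
                   threshold: float = 3.0) -> list[int]:
    """Return the indexes of samples that differ from the mean of the
    previous `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Comparing against threshold * sigma avoids dividing by zero
        # when the baseline is perfectly flat.
        if abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Invented traffic samples (for example, Mbps per 5-minute interval).
traffic = [100, 104, 98, 101, 99, 103, 97, 102, 100, 101, 99, 102, 100, 480, 101]
for index in find_anomalies(traffic):
    print(f"sample {index} looks unusual: {traffic[index]}")
```

A spike also inflates the baseline for the samples that follow it, so a production system would typically use a more robust baseline and account for daily and seasonal patterns.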

Aviatrix CoPilot offers many tools that streamline monitoring through automation, including:

FlowIQ – FlowIQ analyzes traffic flows using global heat maps and time series trend charts.
ThreatIQ – ThreatIQ monitors security threats, sends you alerts when threats are detected, and blocks traffic that is associated with threats.
Audit – The Audit page in CoPilot allows you to review recent user activity such as API calls.

Recovery: Preparing Ahead of Time

Redundancy and monitoring help you avoid outages, but unfortunately, a hurricane, cyberattack, or employee error can still take out a part of your network. You can prepare for those situations ahead of time by creating a disaster recovery plan.

A comprehensive disaster recovery plan should include a full plan of action for triaging issues, team and individual responsibilities, and customer communication. For your network, it should include a backup and restore process: your data should be backed up somewhere safe so you can reload it and continue at least basic operations. Consider the two most important metrics when designing for disaster recovery:

Recovery Time Objective – RTO is the target for how long it should take an organization to restore normal operations after a disaster.
Recovery Point Objective – RPO is the target for how recent the restored data should be after recovering from a disaster. The RPO you can achieve depends on how often you back up data and how much of it you retain, which in turn depend on organizational policy and cost.
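
A minimal sketch of how these two objectives can be checked against a backup schedule and an incident timeline. The function names, timestamps, and targets are hypothetical and chosen only for illustration; the point is that the worst-case data loss equals the interval between backups.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure_time: datetime,
              rpo: timedelta) -> bool:
    """Data written after the last backup is lost, so the loss window is
    the time between the last backup and the failure."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time: datetime, restored_time: datetime,
              rto: timedelta) -> bool:
    """Downtime is the time between the failure and restored service."""
    return restored_time - failure_time <= rto


def schedule_supports_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case the failure happens just before the next backup,
    so the backup interval must not exceed the RPO."""
    return backup_interval <= rpo


# Assumed targets and an invented incident timeline.
rpo = timedelta(hours=4)
rto = timedelta(hours=2)
last_backup = datetime(2024, 8, 20, 0, 0)
failure = datetime(2024, 8, 20, 9, 30)
restored = datetime(2024, 8, 20, 11, 0)

print("RPO met:", meets_rpo(last_backup, failure, rpo))
print("RTO met:", meets_rto(failure, restored, rto))
print("Daily backups support a 4-hour RPO:",
      schedule_supports_rpo(timedelta(hours=24), rpo))
```

In this invented timeline the service returns within the RTO, but nine and a half hours of data are lost, so the RPO is missed. Backing up at least every four hours would close that gap, at the cost of more storage and backup traffic.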

Aviatrix CoPilot offers a few built-in options to help you with disaster recovery:

Agility – Aviatrix can help improve RTO because it can be redeployed quickly with automation tools like Terraform, which shortens the time needed to restore services.
Backup and Restore – Aviatrix CoPilot offers a Backup and Restore function to help you save Aviatrix settings, policies, and configurations.

You can’t prevent every network issue any more than you can control the weather, but with careful planning and good design and management practices, your network can survive them.

Learn More about Resiliency at Office Hours!

Want to learn more tips for keeping a cloud network resilient? Join us at the Cloud Networking Office Hours on Friday, August 23 at 8:30am PT! Josh Cridlebaugh, Director of Solutions Marketing at Aviatrix, will offer advice and answer questions. Register here to attend.

Tim McConnaughy

Technical Marketing Engineer (TME)

Tim is an expert in hybrid and multicloud networking and Kubernetes. His previous roles include network administrator, system administrator, firewall jockey, collaboration engineer, network consultant, as well as a network and cloud network architect.