SPOFs

This entry is part of the “Deployment Quest” series.

[Image: multiple chains connected by a single link]

SPOFs, or Single Points of Failure, are the points in a system whose failure can bring down the overall system. They can be technological or operational in nature. Building a truly resilient system means identifying as many of these as you can, mitigating the ones you can, and putting mechanisms in place to contain the consequences of any that can’t be dealt with immediately.

Single Point of Failure

A single point of failure is a part of a system that, if it fails, will stop the entire system from working. SPOFs are undesirable in any system with a goal of high availability or reliability, be it a business practice, software application, or other industrial system.

Wikipedia

If we look back at the first post in this series, there are a multitude of SPOFs that need to be handled. Our first deployment is a single server with a single network feed. The following is a list of immediate SPOFs that need to be dealt with:

  • Single network connection
  • Single network interface
  • Single server
  • Single database
  • Monolithic application
  • Single hard drive
  • Single power supply

All of these SPOFs are technological in nature. We didn’t explore the operational workflows around this deployment, so identifying non-technological SPOFs is beyond the scope of this particular exercise.

In the second post, we discussed mitigation of some of these SPOFs. Primarily, we distributed services across multiple servers, mitigating the single server as a SPOF. However, the failure of any single service still results in the failure of the whole system.

Mitigating SPOFs comes down to identifying where you depend on a single resource and developing a strategy to remove that dependency. In our previous example, we identified the server as a SPOF and mitigated it by using multiple servers. However, since each dependent service still runs as a single instance, mitigating that one SPOF doesn’t buy us much.

If we duplicate each service and place each copy on its own server, we’re in a much better situation. Failure of any single server, while potentially degrading overall service, will not result in the complete failure of the system. So, we need to identify SPOFs while keeping in mind the dependencies between components.

Now that we’ve deployed multiple copies of each service, what other SPOFs still exist? Each server only has a single power supply, there’s only one network interface on each server, and the overall system only has a single network feed. So we can continue mitigating SPOFs by duplicating each component. For instance, we can add multiple network interfaces to each server, deploy additional network connections, and ensure there are multiple power supplies in each server.
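To make the network interface example concrete, here’s a minimal sketch of NIC bonding on a Linux server, assuming Ubuntu with netplan; the interface names and addresses are placeholders:

    # /etc/netplan/01-bond.yaml (hypothetical example)
    network:
      version: 2
      ethernets:
        eno1: {}
        eno2: {}
      bonds:
        bond0:
          interfaces: [eno1, eno2]
          parameters:
            mode: active-backup        # one NIC carries traffic, the other stands by
            mii-monitor-interval: 100  # check link state every 100ms
          addresses: [192.0.2.10/24]
          routes:
            - to: default
              via: 192.0.2.1

With a configuration like this, the loss of either physical link leaves bond0, and the server, reachable on the surviving interface.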

There is a point of diminishing returns, however. Given unlimited time and resources, every SPOF can be eliminated, but is that realistic? In a real-world scenario, there are often constraints that cannot be easily overcome. For instance, it may not be possible to provision multiple network connections at a given location. However, it may be possible to distribute services across multiple locations, thereby eliminating several SPOFs in one fell swoop.

By deploying to two or more locations, you potentially eliminate multiple SPOFs. Each location will have a network connection, separate power, and separate facilities. Mitigating each of these SPOFs increases the resiliency of the overall system.
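One simple, if blunt, way to spread clients across two locations is DNS with one address record per site. A sketch of a BIND zone fragment with example addresses; production multi-site failover usually calls for health-checked DNS or anycast instead:

    ; hypothetical zone fragment: one A record per location
    www  300  IN  A  198.51.100.10  ; site A
    www  300  IN  A  203.0.113.10   ; site B

The short TTL lets you pull a failed site out of rotation fairly quickly.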

In other situations, there may be financial constraints. Deploying to multiple locations may be cost prohibitive, so mitigations need to come in different forms. Adding network interfaces and connections helps mitigate network failures. Multiple power supplies mitigate hardware failures. And deploying UPS power or, if possible, separate power sources mitigates power problems.

Each deployment has its own challenges for resiliency, and engineers need to work to identify and mitigate each one.

Thou Shalt Segment

This entry is part of the “Deployment Quest” series.

In the previous article, we discussed a very simple monolithic deployment. One server with all of the relevant services necessary to make our application work. We discussed details such as drive layout, package installation, and some basic security controls.

In this article, we’ll expand that design a bit by deploying individual services and explain, along the way, why we do this. This won’t solve the single point of failure issues that we discussed previously, but these changes will move us further down the path of a reliable and resilient deployment.

Let’s do a quick recap first. Our single-server deployment includes a simple web application and a database. We can break this up into two, possibly three, distinct services to deploy. The database is straightforward. Then we have the website itself, which we can split into a proxy server for security and the primary web server where the application will live.

Why a proxy server, you may ask. Well, our ultimate goals are security, reliability, and resiliency. Adding an additional service may sound counter-intuitive, but with a proxy server in front of everything we gain some additional security as well as a means to load balance when we eventually scale out the application for resiliency and reliability purposes.

The security comes from adding another layer between the client and the protected data. If the proxy server is compromised, the attacker still has to move through additional layers to reach the data. Additionally, proxy servers rarely hold any sensitive data; there are typically no usernames or passwords stored on the proxy server itself.
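As a concrete illustration, the proxy can be a few lines of nginx configuration that simply forwards traffic to the backend. This is a minimal sketch; the hostname, certificate paths, and backend address are assumptions:

    # /etc/nginx/conf.d/proxy.conf (hypothetical example)
    server {
        listen 443 ssl;
        server_name www.example.com;

        ssl_certificate     /etc/nginx/tls/example.crt;
        ssl_certificate_key /etc/nginx/tls/example.key;

        location / {
            # Forward requests to the backend web server. No application
            # code or credentials live on this host.
            proxy_pass http://10.0.2.10:8080;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }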

In addition to breaking this deployment into three services, we also want to isolate those services into purpose-built networks. The proxy server should live in what is commonly called a DMZ network, we can place the web server in a network designated for web servers, and the database goes into a database network.

[Diagram: segmented networks, showing a DMZ, a Web Network, and a Database Network]

Keeping these services separate allows you to add additional layers of protection, such as firewall rules that limit access to each asset. For instance, the proxy server typically needs ports 80 and 443 open to allow HTTP and HTTPS traffic. The web server requires whatever port the web application runs on to be open, and the database server only needs the database port open. You can limit the source of the traffic as well: proxy servers are typically open to the world, but the web server only needs to be open to the proxy server, and the database server only needs to be open to the web server.
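As a sketch of what those rules look like in practice, here is a host firewall for the database server using nftables, assuming a PostgreSQL database and placeholder addresses:

    # hypothetical nftables rules on the database server
    nft add table inet filter
    nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'
    nft add rule inet filter input ct state established,related accept
    nft add rule inet filter input iif lo accept
    # only the web server may reach the database port
    nft add rule inet filter input ip saddr 10.0.2.10 tcp dport 5432 accept

The proxy and web servers get the same treatment, with ports 80/443 and the application port opened to the appropriate sources instead.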

With this new deployment strategy, we’ve increased the number of servers and added the need for a lot of new configuration which increases complexity a bit. However, this provides us with a number of benefits. For starters, we have more control over the security of each system, allowing us to reduce the attack surface of each individual server. We’ve also added the ability to place firewalls in between each server, limiting the traffic to specific ports and, in certain instances, from specific hosts.

Separating the services onto their own servers also opens the door to horizontal scaling. For instance, you can run multiple proxy and web servers, providing additional resiliency in case one or more servers fail. Scaling the proxy servers requires some additional network wizardry or a load-balancing device of some sort, but the capability is there. We’ll discuss horizontal scaling in a future post.
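As a preview, nginx’s upstream module gives the proxy server basic round-robin load balancing across duplicate web servers. A sketch with placeholder addresses:

    # hypothetical load-balancing snippet on the proxy server
    upstream web_backend {
        server 10.0.2.10:8080;
        server 10.0.2.11:8080;  # duplicate web server; nginx routes around a backend that stops responding
    }

    server {
        listen 80;
        location / {
            proxy_pass http://web_backend;
        }
    }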

The downside of a deployment like this is the complexity and the additional overhead required. Instead of a single server to maintain, you now have three. You’ve also added firewalls to the mix, which also need maintenance. And there’s additional latency due to the network communication between the servers. This can be reduced through techniques such as caching, but it’s generally not an issue for typical web applications.

The example we’ve used thus far is quite simplistic, and this is not necessarily a good strategy for a small web application deployment, but it provides an easy-to-understand baseline as we expand our deployment options. In future posts we’ll look at horizontal scaling and load balancing, and we’ll start digging into newer technologies such as containerization.