One of the biggest challenges to automated, testable, repeatable application deployments is connecting application components running in different parts of the infrastructure. This is true whether the application is running in legacy VMs or fully Dockerized. Reliably connecting, for example, a front-end load balancer to its upstream services, and connecting those services to each other and their databases, for every deployment is a critical aspect of almost every real application. Automating that configuration and reconfiguration when scaling the application (or when components fail) is as important to the application's performance and availability as any other aspect of the application architecture. It's also often the most overlooked and hardest to get right.
Of course, none of us would fail to recognize and address this problem, but from the number of database and API credentials frequently found in GitHub repos, we can tell that more than a few developers have struggled with even simple configuration management. Let's be honest, though: even if we've avoided the embarrassment of sharing secrets in a public repo, we've probably all learned some hard lessons about separating configuration from code.
So, we put our database connection details in environment variables or a configuration file, and now our application is set to connect to that specific database host. And that works wonderfully...until we need to scale, or migrate the database, or if that database goes offline unexpectedly. Then we have to figure out how to re-configure our application.
Just to help focus our solutions to that problem, let's say we're using a traditional SQL database, like PostgreSQL or MySQL. There is some interesting work being done to cluster these databases, but the majority of solutions look like one of these two stories:
On its face, the first choice seems ideal. It represents all the right reasons to outsource a problem: we get quick scalability with no need to modify the existing app. The proxy is a centralized place to manage access to the database, so all we have to do is point the client application at the proxy and go. The proxy itself can be as simple as HAProxy, or it can be a more complex proxy that can direct writes to a master and reads to the read replicas.[1]
On the other hand, we also add a new layer of latency and failure to our app. The proxy might be able to direct queries to multiple replicas, and we might even be able to set up a hot standby database master, but if the proxy itself goes offline or gets overloaded, we're screwed.
The good news is that infrastructure providers have added another layer to solve that. We can add a hot standby proxy and have the main proxy instance and the standby share a virtual IP (sometimes called a "VIP"). What that means is that we've got a switch, or load balancer, or other infrastructure solution that can direct traffic as needed based on the health of the instances. And, if it works right, this is awesome, because it means that we don't have to make any changes to the app to achieve the availability and scale that are required for production.
But there are some problems. First, the proxy adds latency, and many implementations of the VIP do too. VIPs themselves often take time to change (one leading cloud provider offers a 10 minute SLA for changes, meaning requests can be going to the wrong host for ten minutes). You might also struggle to make an SQL proxy read its writes to avoid problems resulting from replication delays. And, of course, the client application has no knowledge of what database instance it's talking with, so tracing errors requires tracing logs from the app through the proxy to the server instance it actually connected to.
The proxy may take a number of forms. Yes, it's often a hardware device, perhaps an F5, but for many Kubernetes implementations, it's a router on the host. It's also often an HAProxy instance on the host (for Docker deployments), or on a separate instance (for VM infrastructure). And, the proxy pattern doesn't just apply to databases. The naive convenience of configuring apps to connect to a single IP address (or name[2]) is so alluring that it persists into the age of containerization and Docker.
The seemingly simple discovery patterns with proxies or VIPs are appealing because they put the decisions and complexity in the infrastructure, rather than in the application. Why build what we don't need to, right? But, as you can see, that "simple" pattern becomes very complex and has some significant drawbacks as we try to ensure availability and performance. As Tyler Treat recently noted, regarding distributed systems: "If you need application-level guarantees, build them into the application level. The infrastructure can’t provide it." Put another way: the proxy can't solve the CAP theorem problems (or any of the other challenges of distributed computing) for your application. That's your job.
The alternative to that pattern is active discovery and configuration in the application. This is often rejected because it has the appearance of more moving parts and sometimes requires changes to the client applications to make it work. Let's consider our application connecting to an SQL database again. Our application needs to know what database instances are available to connect to, and to make that work we'll need a place where those service instances are registered. Sometimes that's called a "service catalog" or "discovery database." We'll also need something that will register the database instances in the service catalog and monitor their health. Registration and health checking don't actually need to be combined, but I like to do that, and make each service handle it for itself. For an SQL database, that means I'll periodically do a test query to check the health, and then update the service catalog to register and maintain a healthy status for the instance. That means that if I have five healthy database instances, I'll have five entries for the service in the service catalog, but if one of those database instances becomes unhealthy (because my health check query stops working), then the service catalog will report only four instances. One common service catalog is HashiCorp's Consul, which many prefer because it has a rich understanding of "services."
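To make the self-registration idea concrete, here is a minimal Python sketch. It does not use Consul's real API: `ServiceCatalog` is a hypothetical in-memory stand-in for a service catalog, and `report_health`, the `test_query` callable, and the 15-second TTL are assumptions made for illustration. Each instance renews its own entry only while its test query succeeds, so an unhealthy instance simply ages out of the catalog.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceCatalog:
    """Hypothetical in-memory stand-in for a service catalog such as Consul."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        # (service, instance_id) -> (address, expiry time)
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def heartbeat(self, service: str, instance_id: str,
                  address: str, ttl: float) -> None:
        """Register the instance, or renew it, for another `ttl` seconds."""
        self._entries[(service, instance_id)] = (address, self.clock() + ttl)

    def healthy(self, service: str) -> List[str]:
        """Return addresses whose registration has not expired."""
        now = self.clock()
        return sorted(
            address
            for (name, _), (address, expires) in self._entries.items()
            if name == service and expires > now
        )


def report_health(catalog: ServiceCatalog, service: str, instance_id: str,
                  address: str, test_query: Callable[[], object],
                  ttl: float = 15.0) -> bool:
    """Run one health check; renew the registration only if it passes."""
    try:
        test_query()  # for an SQL database, something like "SELECT 1"
    except Exception:
        return False  # no heartbeat, so the entry expires on its own
    catalog.heartbeat(service, instance_id, address, ttl)
    return True
```

In a real deployment, a small script running alongside each database instance would call something like `report_health` on a timer, with an interval shorter than the TTL.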
That part probably all makes sense so far, right? We're adding a little bit of logic to our database instances, usually via a script or other code that runs with them in their container, VM, or on bare metal...wherever the database runs. Connecting our client applications isn't difficult either.
Our client applications will need to monitor the service catalog for changes to the instances for each service, and then update the configuration when those services do change. For some applications, all we need to do is keep a configuration file updated. Then, each time the application is run (such as for a page load, or via cron, or whatever), it will re-read the configuration file and get the new database configuration (many PHP applications work this way, for example). Consul-template works with Consul to do just this. It can use a template file and fill in the details from Consul's database of services.
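The "keep a configuration file updated" step can be sketched roughly as follows. This is not how Consul-template is implemented; it is a simplified Python illustration of the same idea, and the file format in `CONFIG_TEMPLATE` and the function names are hypothetical. The list of addresses is assumed to come from whatever service catalog is in use.

```python
import os
import tempfile
from string import Template
from typing import List

# Hypothetical config format; a real application defines its own.
CONFIG_TEMPLATE = Template("[database]\nhosts = $hosts\n")


def render_config(addresses: List[str]) -> str:
    """Fill the template with the currently healthy instances."""
    return CONFIG_TEMPLATE.substitute(hosts=",".join(sorted(addresses)))


def write_if_changed(path: str, content: str) -> bool:
    """Atomically replace `path` with `content`; return True if it changed."""
    try:
        with open(path, "r", encoding="utf-8") as existing:
            if existing.read() == content:
                return False
    except FileNotFoundError:
        pass
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(content)
    # A reader sees either the old file or the new one, never a partial write.
    os.replace(tmp_path, path)
    return True
```

Sorting the addresses keeps the rendered output stable, so the file is rewritten only when the set of instances actually changes.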
Many applications, however, read their configuration at start time and don't re-read it unless given a signal (many Node.js applications work this way). For those applications, you'll need to make sure the application can accept a signal to cause it to re-load its configuration details. For these applications, Consul-template can be set to send the signal for configuration reload after it refreshes the configuration.
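For a long-running process, accepting a reload signal might look something like the sketch below. It is written in Python for consistency with the other examples and assumes a Unix-like system, `SIGHUP` as the reload signal, and a JSON configuration file; the `ReloadableConfig` class is hypothetical. The handler only sets a flag, and the application re-reads the file at a safe point between units of work.

```python
import json
import signal


class ReloadableConfig:
    """Hypothetical holder for configuration that is reloaded on a signal."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.reload_requested = False
        self.values = self._read()

    def _read(self) -> dict:
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def install(self, signum: int = signal.SIGHUP) -> None:
        """Register the handler; must be called from the main thread."""
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Keep the handler tiny: only note that a reload was requested.
        self.reload_requested = True

    def maybe_reload(self) -> bool:
        """Call between requests; re-reads the file if a signal arrived."""
        if not self.reload_requested:
            return False
        self.reload_requested = False
        self.values = self._read()
        return True
```

Deferring the actual file read to `maybe_reload` means a request that is already in flight keeps using the configuration it started with.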
This isn't limited to connecting applications to databases. It's easy enough to use this pattern to connect the app to NoSQL databases (think Mongo, Couchbase, or Cassandra) or to its cache (think Redis or Memcached). It can also connect the front-end proxy to the app (think Nginx or HAProxy). And, if you're looking at connecting multiple services in the same app, hopefully you're already looking at this pattern.
As application developers, we now also get much more visibility into failures and control over how to handle them. Imagine we start seeing error rates going up...are those errors specific to the database instance? Is that instance still getting updates, is it not actually healthy? Do the errors follow a pattern, a write to one instance followed by a read on another, for example? Perhaps it's because of replication delays? We can have visibility to these errors because our application knows exactly what host it's connecting to for each query. In addition to seeing and being able to trace the errors, we have options for how to handle them.
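One way this visibility and control could look in code is sketched below. `ActiveClient` is a hypothetical class, not part of any library; `discover` stands in for a lookup against the service catalog and `execute` for a real database driver call. The client rotates through the healthy instances, records errors per host, and falls through to the next instance when one fails.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


class ActiveClient:
    """Hypothetical client that chooses its own backend and tracks failures."""

    def __init__(self, discover: Callable[[], Sequence[str]],
                 execute: Callable[[str, str], object]) -> None:
        self.discover = discover   # returns the currently healthy addresses
        self.execute = execute     # runs a query against one address
        self.errors = Counter()    # error count per host
        self._next = 0             # rotation offset for spreading load

    def query(self, sql: str) -> Tuple[str, object]:
        """Try each healthy instance in turn; return (host, result)."""
        hosts = list(self.discover())
        if hosts:
            start = self._next % len(hosts)
            hosts = hosts[start:] + hosts[:start]
            self._next += 1
        for host in hosts:
            try:
                return host, self.execute(host, sql)
            except ConnectionError:
                # We know exactly which instance failed.
                self.errors[host] += 1
        raise ConnectionError(f"no instance answered (tried {len(hosts)})")
```

Because `query` returns the host along with the result, the application can log which instance served each request, which makes patterns such as a write on one instance followed by a stale read on another much easier to spot.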
The difference between the two options that I'm describing is really a matter of where the decisions are made. Passive discovery patterns are those that separate the application from the decisions, leaving the application passive both in the choice of what backends to connect to and in resolving failures that may result. Active discovery patterns move those decisions into the application so it can have an active role in choosing the backend and working around failures it may encounter. By making the application an active participant in the discovery process, we can eliminate a layer of complexity, misdirection, and latency between the application and its backends, and build faster, more reliable, and more resilient applications.
[1] The proxy can be as complex as you want, including one that fronts a multi-master DB with stickiness configured based on the connecting IP, but the application failures end up being effectively the same.
[2] A full discussion of the failure modes of DNS-based discovery methods is worthy of a separate article, but they, too, can be categorized as passive discovery mechanisms and should be viewed with some caution.