Quick Essay on HA Web Applications

I answered this Reddit post and decided I should also post the same answer to the blog.

HA design for web sites is a battle I’ve fought. As with any HA system, you are first looking for any single point of failure. In this case, having a single NFS server behind web nodes would be a failure waiting to happen.

The second thing about running web applications is that you need to engineer cache-ability into your content. This way assets like pictures, CSS files, and videos can be reliably delivered from the reverse proxies in front of your web heads.
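To make "engineering cache-ability" a little more concrete, here is a minimal sketch of a policy that decides which responses a reverse proxy may keep. The extension list, the one-day lifetime, and the function name `cache_control_for` are all assumptions for illustration; a real site would tune them per asset type.

```python
from pathlib import PurePosixPath

# Hypothetical policy: file extensions treated as static, cacheable assets.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".mp4"}


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    if PurePosixPath(path).suffix.lower() in STATIC_EXTENSIONS:
        # Static assets: proxies and browsers may keep them for a day.
        return "public, max-age=86400"
    # Session-bound pages: must come from a web head every time.
    return "private, no-store"
```

Everything that gets the first header can be served by the proxies alone; only the second kind of request needs to reach a web head.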

You only need as many web heads (web application servers) as it takes to interact with live client sessions (un-cacheable content)…and then add as many extra nodes as you determine necessary to account for maintenance and load spikes.

To distribute your application (like PHP files) and your assets, I would start with source control. It is best to have a pair of repos in your application network, and scripts that push to the web heads, or trigger pulls from the web heads.

Roll-back is also an HA topic: some designs encourage twice the number of web heads and twice the number of proxies (depending on load) and astute DNS control in order to do quick roll-backs. In those cases you deploy to only half the nodes (writing over the old pool), and then if your deploy tilts, a DNS change will hop you over to the previous pool.
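The two-pool roll-back described above can be sketched as a tiny state machine. This is only a model under assumptions: the class name `TwoPoolSite`, the pool labels "a" and "b", and the version strings are hypothetical, and the "DNS change" is represented by flipping a single field.

```python
from dataclasses import dataclass, field


@dataclass
class TwoPoolSite:
    # Release currently installed on each pool (hypothetical labels).
    pools: dict = field(default_factory=lambda: {"a": "v1", "b": "v1"})
    live: str = "a"  # the pool DNS currently points at

    def idle(self) -> str:
        return "b" if self.live == "a" else "a"

    def deploy(self, version: str) -> None:
        # Write over the old (idle) pool, then point DNS at it.
        target = self.idle()
        self.pools[target] = version
        self.live = target

    def roll_back(self) -> None:
        # The other pool still holds the previous release: just flip DNS.
        self.live = self.idle()

    def live_version(self) -> str:
        return self.pools[self.live]
```

After `deploy("v2")` the live pool serves v2 while the other pool still holds v1, so `roll_back()` needs no file copies at all. That is the whole point of paying for the second pool.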

Deploying database changes is more arduous, especially for small shops, because you have to keep in mind that if you are rolling out to a subset of your servers (like the two pools mentioned above) you also have to partition your database resources just as equitably.

A schema roll-out is often the first step before an asset roll-out, because you do not want a flood of errors from new application code looking for things like missing columns and tables for the few minutes or hours your asset roll-out takes. This takes shrewd release planning because you must not roll out a schema that breaks the currently running application code before the upcoming application code is rolled out.
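As a rough illustration of a schema change that does not break the running release, here is a sketch using SQLite from the Python standard library as a stand-in for the real database. The `users` table, the `email` column, and both insert functions are hypothetical; the point is only that an additive, nullable column lets old and new code coexist during the roll-out window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def old_app_insert(conn, name):
    # Currently running release: knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_app_insert(conn, name, email):
    # Upcoming release: expects the new column to exist already.
    conn.execute(
        "INSERT INTO users (name, email) VALUES (?, ?)", (name, email)
    )


# Step 1: schema roll-out. Additive and nullable, so old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
old_app_insert(conn, "alice")  # old web heads are still serving traffic

# Step 2: asset roll-out. New code finds the column it needs.
new_app_insert(conn, "bob", "bob@example.com")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

A change in the other direction (dropping or renaming a column the old code still reads) is the kind that has to wait until no old code is left running.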

Doing HA for your databases is a good argument for hiring a dedicated DBA. Doing HA-*staffing* for your HA application implies hiring *two* so that one can take a vacation or go home during a roll out to spell the other when he’s done with his 18 hour day.

You can cluster databases, and you can do trees of replication, and you can round-robin your masters. The simplest thing to do for an HA setup is to dedicate two database nodes to be a master and a next-master. The next-master can be used as a RO slave most of the time. Then you can chain your RO slaves off both of those. When you need to do a fail-over of the write master, you would switch the replication path of the first-tier children from your master to your next-master and de-pool your master.
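The fail-over step can be modelled as a small transformation of the replication map. This is a sketch only: the node names and the `fail_over` function are made up for illustration, and a real fail-over also involves stopping writes and checking replication positions, which this leaves out.

```python
def fail_over(replicates_from: dict, master: str, next_master: str) -> dict:
    """Return the replication map after promoting next_master.

    replicates_from maps each node to the node it replicates from
    (None for the write master).
    """
    new_map = {}
    for node, source in replicates_from.items():
        if node == master:
            continue  # de-pool the old master
        if node == next_master:
            new_map[node] = None  # promoted: now the write master
        elif source == master:
            new_map[node] = next_master  # re-point first-tier children
        else:
            new_map[node] = source  # already chained elsewhere
    return new_map


# Hypothetical topology: db1 is master, db2 is next-master,
# ro1 hangs off the master and ro2 off the next-master.
topology = {"db1": None, "db2": "db1", "ro1": "db1", "ro2": "db2"}
after = fail_over(topology, master="db1", next_master="db2")
```

Slaves that were already chained off the next-master (`ro2` here) are untouched, which is one reason to split the read-only slaves between the two nodes in the first place.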

The notion of database pools is very application specific, mostly because you need to explicitly run multiple database connections inside your application: one for transactions (writes and updates) and another for simple queries (read-only activity). Your application needs to know when a connection has gone stale (after a pool change, for example) and re-open a new connection. And ideally your application needs to explicitly check for failures to connect to the write connection and throw itself into a “sorry, try again later” mode.
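Here is a minimal sketch of that two-connection arrangement. It is not tied to any particular driver: `connect_rw` and `connect_ro` are hypothetical zero-argument factories, connection objects are assumed to expose an `is_alive()` method, and a failed connect is assumed to raise `ConnectionError`.

```python
class TryAgainLater(Exception):
    """Raised when the write connection cannot be opened."""


class SplitConnections:
    def __init__(self, connect_rw, connect_ro):
        # Hypothetical factories that open a fresh connection on each call.
        self._factories = {"rw": connect_rw, "ro": connect_ro}
        self._conns = {"rw": None, "ro": None}

    def _get(self, kind):
        conn = self._conns[kind]
        # Re-open if we never connected or the connection went stale.
        if conn is None or not conn.is_alive():
            conn = self._factories[kind]()
            self._conns[kind] = conn
        return conn

    def for_read(self):
        return self._get("ro")

    def for_write(self):
        try:
            return self._get("rw")
        except ConnectionError as exc:
            # Caller should show a "sorry, try again later" page.
            raise TryAgainLater("writes are unavailable") from exc
```

The application would catch `TryAgainLater` at one place near the top of request handling, so reads can keep working while writes are down.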

Consistency between all nodes in the db pool is a regular challenge. The first question in that challenge is: how far behind in replication is each of my nodes? This, and other criteria, are known as ‘fitness criteria’, and your application, or your pool monitoring agents (which might run on each pool node), need to kick nodes out of a pool as soon as they fall behind in replication, show errors, are at high load, or are in maintenance mode. Ideally you would have something like an ESB (enterprise service bus) or a DNS service that populates hostname-to-IP mappings very quickly (<= 1Hz) to keep unfit pool members out of service until they catch up.
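A fitness check of this kind might look like the sketch below. The thresholds, the field names, and the `NodeStatus` type are assumptions for illustration; how the lag and load numbers are actually collected depends on your monitoring.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to what your application tolerates.
MAX_LAG_SECONDS = 5.0
MAX_LOAD = 8.0


@dataclass
class NodeStatus:
    name: str
    replication_lag_s: float
    load: float
    has_errors: bool = False
    in_maintenance: bool = False


def is_fit(node: NodeStatus) -> bool:
    """A node is fit only if it passes every criterion."""
    return (
        node.replication_lag_s <= MAX_LAG_SECONDS
        and node.load <= MAX_LOAD
        and not node.has_errors
        and not node.in_maintenance
    )


def fit_pool(nodes):
    """Names of the nodes that may stay in the read pool."""
    return [node.name for node in nodes if is_fit(node)]
```

A monitoring agent would run `fit_pool` on each polling cycle and publish the result to whatever the application uses to find pool members.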

You should read up on mysql-proxy. That was the last promising project I read about that seemed to encapsulate much of this logic. Otherwise, as of five years ago, there was no good encapsulation for this kind of solution for MySQL. MySQL Cluster was not the solution I was looking for (and I suggest you read the short book on it) because it is architected for in-memory OLTP transactions. My data set was highly relational and document oriented.