Sunday, June 1, 2014

WTF is Middleware, or, "... the full complexities of the problem ...", or, why knowing "HTML" is not enough to build websites

I'm getting deeper into setting up a modern production open-source webserver, still sort-of following the instructions I like, by George London. However, since that guide is built around Amazon Web Services, and I think that Amazon is not kosher, it's not as simple for me as just following the guide literally.  I'm still (still!) somewhat stuck on figuring out what exactly I want to do.  I want to have a production website with all my blog writing, my songs not in 4/4 database, and my experimental task tracking tool; and I want to get current with state-of-the-art website development. But I still have "Step 2: ????" in the middle of my plan.

Let's focus on setting up web servers. There are a lot of tricks and problems to running a website. The last platform I used, OpenACS, already had a good grasp on many of them in 2001, so I want to see what's really changed in 13 years. The key problems are:
  1. Changes to a website need to be tested before they are put in production
  2. A website usually includes both code and data, so changes to code need to include changes to data structures
  3. Your website is too slow/broken.
There are many solutions to each of these problems.  They basically involve massive quantities of Middleware.  Middleware is the stuff that is completely essential but that nobody outside of software has ever heard of, much less conceived of the need for.

For testing, the basic idea is that changes are made in one environment and then deployed to a production environment when ready.  At a minimum, this includes knowing a version control tool; I know CVS quite well, but everybody is praising the new paradigm of distributed version control (git or Mercurial); I started with git but I think Mercurial is a better fit for my needs.  This also entails having multiple environments, and there are two ways to have multiple environments: manually setting up and configuring each server (aka, the hard way), or learning to use a tool that can automatically set up and configure each server (aka, the even harder way that claims to lead to an easy way).  Fabric, as I understand it, helps with this, but on top of that George London's approach uses Chef. I think I'm going to proceed with manual configuration for now, since I need to understand that anyway before automating, and then I may try Chef or SaltStack or something else.

For managing changes to code and corresponding data structures properly, something that I still see plenty of developers treat as a surprising and unexpected new problem, Django seems to be committing to south, so I'll use that.
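To make the code-plus-data problem concrete, here is a minimal sketch of the general idea behind a migration tool. This is not South's API; it's a toy built on Python's standard sqlite3 module, and the table names, the MIGRATIONS list, and the schema_version table are all made up for illustration.

```python
import sqlite3

# Hypothetical migrations for illustration: each entry is (version, SQL).
# A real tool would generate and track these for you.
MIGRATIONS = [
    (1, "CREATE TABLE song (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"),
    (2, "ALTER TABLE song ADD COLUMN time_signature TEXT"),
]


def current_version(conn):
    """Return the highest migration version already applied (0 if none)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn):
    """Apply every migration newer than the recorded version, in order."""
    applied = []
    version = current_version(conn)
    for number, sql in MIGRATIONS:
        if number > version:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (number,)
            )
            applied.append(number)
    conn.commit()
    return applied
```

The point is that the database records which structural changes it has already seen, so the same migrate() call is safe to run in development, in testing, and in production, and each environment only picks up the changes it is missing.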

For the third thing, web sites being too slow and broken, things have gotten much more complicated.  Content Delivery Networks, CDNs, are far more common and reach, I think, a bit further down-market than they did in 2001.  Load balancing (spreading work over multiple webservers and multiple database servers) used to mean buying an expensive BIG-IP box from F5, but at least at the lower end there seem to be a lot more software solutions (nginx in this case).  But the biggest change is a much stronger emphasis on caching.  That is, after computing what some part of a website should be like, remembering that result and skipping the computation until the result is no longer valid.  Just for that problem alone, the standard Django approach seems to include memcached, celery, and RabbitMQ.
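The caching idea fits in a few lines of Python. This is only a sketch of the concept, assuming a simple time-based expiry; it is not how memcached or Django's cache framework is implemented, and the TTLCache class and its method names are invented for this example.

```python
import time


class TTLCache:
    """Remember computed results and reuse them until they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}    # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only if needed."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still valid: skip the computation
        value = compute()    # missing or expired: do the real work
        self._store[key] = (now + self.ttl, value)
        return value
```

So something like cache.get_or_compute("front-page", render_front_page) would run the (hypothetical) render_front_page function once and then hand back the remembered result until the time limit passes. The hard part in real life is deciding when a result is "no longer valid", which is a lot messier than a timer.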

And with all that, we still haven't gotten to the actual web site code, the part that determines if you have created Twitter or just pets.com.  That code needs a home to run, since most modern web code is written in interpreted languages, not compiled languages; programs written in compiled languages are turned into freestanding programs before use, but interpreted code is always run within an interpreter program.  Django is Python code, so it runs in Python, but for production purposes that should happen within something more industrial-strength, and the suggested option is Gunicorn.  And I've already decided to use (or at least get much further before giving up on) Django itself, which is a web development platform, and Django-cms, which is a module for Django that does articles and directories and page templates and the like (like Wordpress or Blogger, but more and bigger).
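For a sense of what Gunicorn actually runs, here is about the smallest Python web application possible, written against WSGI, the standard interface between Python web code and the server hosting it. Django provides its own, much larger, version of this callable; the one below is a stand-in I made up, not anything from Django or from George London's guide.

```python
def application(environ, start_response):
    """A minimal WSGI app: the server calls this once per request."""
    body = b"Hello from a WSGI app\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    # Tell the server the status line and headers, then return the body
    # as an iterable of byte strings.
    start_response("200 OK", headers)
    return [body]
```

If this were saved in a file called hello.py (a hypothetical name), running gunicorn hello:application should serve it. Gunicorn supplies the industrial-strength part, managing worker processes and connections, and the Python code just answers requests.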

So, to get from the default development Django environment, which is basically just two programs,
- Python
- SQLite

to a full production environment, I need to understand at least the basics of:

- PostgreSQL
- Nginx
- memcached
- gunicorn
- celery
- RabbitMQ
- Mercurial
- fabric
- south

Each of these has the potential to be weird, buggy, idiosyncratic, complicated to install, tough to maintain, abruptly abandoned by its creators, or any number of other bad things.  As of today I'm experienced and comfortable with exactly one of these, PostgreSQL.  So step 1 is to set up a production-ready server inside a virtual server on my own desktop, using all of these tools, adding them one at a time.  Today, I got PostgreSQL working, and nginx, but at that point my static files weren't getting served, and diagnosing that started to suggest that celery and RabbitMQ need to go in at the same time as nginx, and celery has friends, celerycam and celerybeat, and django-celery.  So I started digging into the documentation, where I found this very reassuring line:

"[Celery] is easy to use so that you can get started without learning the full complexities of the problem it solves."

I guess that's the crux of it all, right?  Since each of these ten or twenty or thirty helper programs represents years of work and decades (hopefully) of knowledge of the problem, it's really nice when they just work and solve problems you didn't even know you had.

And so far, they aren't terrible.  The really demoralizing thing in middleware work is to discover that the tool you have committed to is broken in a subtle way, or grossly inadequately documented, or simply doesn't do the thing it's purported to do.  Microsoft's Team Foundation Server does by itself most of the stuff that these dozens of programs collectively do, but the parts that I've had to use (admittedly mostly the non-development side, like SharePoint, the wiki, and other tools) are pretty awful in every regard, from performance to documentation to feature set to style.  So from that perspective, I guess this project is going well.

1 comment :

  1. And guess what... Two years later, and it's still exactly how you described...
