Statsd Admin Interface for Python

Here’s a little Python class for interacting with the statsd admin interface. I wrote it for EnergyHub, where we used it alongside the nifty pystatsd module to clean up obsolete stats as we rotated our servers. Deleting obsolete stats prevents gauges, for example, from reporting their last value until statsd restarts.

It implements the admin commands to list stats (stats, counters, timers, gauges) and the management commands to delete them (delcounters, deltimers, delgauges).
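
A minimal sketch of the idea, assuming statsd’s admin interface is listening on its default TCP port (8126) and terminates each response with an END line (the class and method names here are illustrative):

import socket

class StatsdAdmin(object):
    """Sketch of a client for statsd's admin TCP interface (default port 8126)."""

    def __init__(self, host='localhost', port=8126):
        self.host = host
        self.port = port

    def _command(self, command):
        # Send one admin command and read until the terminating 'END'.
        sock = socket.create_connection((self.host, self.port))
        try:
            sock.sendall((command + "\n").encode())
            response = ""
            while True:
                chunk = sock.recv(4096).decode()
                if not chunk:
                    break
                response += chunk
                if "END" in response:
                    break
            return response
        finally:
            sock.close()

    # List commands
    def stats(self):
        return self._command("stats")

    def counters(self):
        return self._command("counters")

    def timers(self):
        return self._command("timers")

    def gauges(self):
        return self._command("gauges")

    # Delete commands, e.g. delcounters("web.requests", "web.errors")
    def delcounters(self, *names):
        return self._command("delcounters " + " ".join(names))

    def deltimers(self, *names):
        return self._command("deltimers " + " ".join(names))

    def delgauges(self, *names):
        return self._command("delgauges " + " ".join(names))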

Dynamo Session Manager

We (EnergyHub) have just released a session replication plugin for Tomcat 6 using Amazon DynamoDB.

Motivation

We’ve never used the bundled Tomcat clustering solutions, mainly because we run on EC2 and the multicast-based solution doesn’t work there. For about 18 months we’ve been using Memcached Session Manager (m-s-m), which stores the sessions in memcached. It works pretty well, but avoiding a single point of failure with a simple protocol like memcached is hard. m-s-m solves this problem by saving the session to a given memcached node and also to a backup server. This adds a lot of complexity: in our case we counted 6 memcached calls for each web request. In defense of m-s-m, the author is very active on the mailing list and helped us debug a lot of corner-case problems, often pushing out a release within a day or two of us reporting a problem. Also, we were using the non-sticky configuration, which is less well tested.

In the end, the sheer amount of code and the number of steps for each request in m-s-m made it hard to diagnose subtle intermittent bugs. We found David Dawson’s Mongo Tomcat Sessions plugin, which very cleanly loads a session from Mongo at the beginning of the request and saves it back at the end. This smartly moves the replication logic into the database layer. Mongo is great for easy-to-use replication, and we rely on it in production already, but for decent replication we’d be talking about two or three servers, which adds cost and maintenance burden. We figured we could take the same general approach but use Amazon’s DynamoDB on the backend: no worries about deploying or monitoring the storage layer.

Implementation

When a request comes in, we look up its session by ID in a Dynamo table. If it’s not found, we start a new session. After the request, a helper Tomcat valve saves the session back to Dynamo. This approach works well; the only thing to consider before rushing to deploy is that Dynamo must be provisioned for a certain throughput. In our case, the throughput t that must be provisioned is

t = s*r,

where s is the session size rounded up to the nearest kB, and r is the request rate in requests per second. For example, if the vast majority of sessions are between 1 and 2 kB and we have a maximum request rate of 100 req/s, then we must provision the table for 2*100 = 200 read units and 200 write units.
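
A trivial helper makes the arithmetic concrete (the function name is purely illustrative):

import math

def provisioned_throughput(session_size_kb, requests_per_second):
    # t = s * r: session size rounded up to a whole kB, times peak request rate
    return int(math.ceil(session_size_kb)) * requests_per_second

print(provisioned_throughput(1.8, 100))  # 200 read units and 200 write units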

Session Expiration by Table Rotation

Moving to Dynamo costs us one key feature compared to MongoDB: secondary indices. In this case, that means we can't have an index on the session's last-modified time, which could otherwise be used to delete expired sessions. We have two workarounds in Dynamo:

  1. Scan the whole table for sessions whose last-modified time has expired. This is expensive and hard to provision for: if you have a high session turnover you could have millions of sessions to scan through, but only hundreds of provisioned reads per second in which to do so.
  2. Move active sessions to a new table and drop the old one. This is the approach we have taken. For example, if the expiration time is one hour, we start by saving our sessions into table 'A'. After one hour, we create a new table, 'B', and start saving new sessions into 'B'. When loading an existing session, we look in 'B' and, if it's not found, in 'A'. In this manner, active sessions are moved to table 'B' over the next hour. After another hour, we create table 'C' and start saving there. At this point any session that only exists in table 'A' is older than one hour and can be safely discarded, so we delete the whole table. A sketch of this lookup-and-rotation logic follows below.
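
The plugin itself is Java, but the lookup-with-fallback and rotation logic is easy to sketch in Python with boto3; the table names, the 'id' key attribute and the capacity arguments here are illustrative rather than the plugin’s actual configuration:

import boto3

dynamo = boto3.resource('dynamodb')

def load_session(session_id, current, previous):
    """Look in the current table first, then fall back to the previous one."""
    for table_name in (current, previous):
        item = dynamo.Table(table_name).get_item(Key={'id': session_id}).get('Item')
        if item is not None:
            return item
    return None  # not found anywhere: the manager starts a fresh session

def save_session(session, current):
    """Always write to the current table, so an active session migrates
    forward the next time it is used."""
    dynamo.Table(current).put_item(Item=session)

def rotate(new_table, oldest_table, read_units, write_units):
    """Called once per expiration period: create the next table and drop the
    table that is two periods old (it only contains expired sessions)."""
    dynamo.create_table(
        TableName=new_table,
        KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
        AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
        ProvisionedThroughput={'ReadCapacityUnits': read_units,
                               'WriteCapacityUnits': write_units},
    )
    dynamo.Table(oldest_table).delete()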

Extra Features / Advanced Settings

  • We've added optional monitoring via statsd, which we use heavily in production.
  • We auto-provision the read-write capacity when we create tables, based on a given session size and request rate.
  • We've added the ability to ignore requests by URI or based on the presence of certain HTTP headers. This is useful for us because we have a lot of machine-generated traffic that doesn't use sessions.

See the code at https://github.com/werkshy/dynamo-session-manager or use the jars from http://repo1.maven.org/maven2/net/energyhub/dynamo-session-manager/

Running MySQL/InnoDB in-memory for unit tests

I thought I’d take another shot at reducing our build times. When we test our full legacy code, there are a lot of slow integration tests involving MySQL. I looked at using an in-memory database like H2 in MySQL-compatibility mode, but some of our code (e.g. table creation) is too MySQL-specific. Next step: run MySQL on a RAM disk. This is all based on Ubuntu 12.04.

Here’s a script that starts MySQL with parameters that put all of its files on /dev/shm, and bootstraps the root user and system tables. I have verified with iotop and iostat that nothing is written to the actual disk with these settings.
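
The original is a shell script; a rough Python equivalent of the same steps, assuming MySQL 5.5’s mysql_install_db and mysqld are on the PATH and the script runs as an unprivileged user (the data directory and port below are just examples), looks something like this:

#!/usr/bin/env python
import os
import subprocess

DATADIR = '/dev/shm/mysql-test'            # everything lives on the RAM disk
SOCKET = os.path.join(DATADIR, 'mysql.sock')

if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

# Bootstrap the system tables (and the passwordless root user) into /dev/shm
subprocess.check_call(['mysql_install_db', '--datadir=%s' % DATADIR])

# Start mysqld with data, socket and pid file all on /dev/shm, on a side port
# so it doesn't collide with the system MySQL instance.
subprocess.check_call([
    'mysqld',
    '--datadir=%s' % DATADIR,
    '--socket=%s' % SOCKET,
    '--pid-file=%s' % os.path.join(DATADIR, 'mysqld.pid'),
    '--port=3307',
])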

As for performance? A full test run of our main data-access library went from 4:11 to 3:41, about 11% faster. Not much, really!

WordPress Development, Staging and Production Deployment

A.K.A. Keeping Your WordPress in Git

As a web application developer I’m used to having several environments to deploy to: my local workstation, the QA testing environment and our production environment. I’m also accustomed to keeping everything in version control: code, config and deployment scripts. As we prepare a new release it spends time in the QA environment and when testing is complete we move it to production. The method for deploying to QA is very similar to how we deploy to production, since we want to catch bugs in the deployment process itself.

This technique does not obviously apply to WordPress deployment. Over the years I have developed a technique for hosting a WordPress ‘development’ environment that our marketing and frontend webdev people can work on before it is released to the public. We keep all the changes in git and deploy directly from git in one command. I haven’t seen any other great solutions to the problem that a lot of your content lives in the database while a whole bunch of stuff is also in the theme files (PHP and JS), so you need to ‘deploy’ the database changes alongside the file changes. Here’s my take on that.

Caveat

This technique BLOWS AWAY the production database during deployment. It is therefore not useful if you have comments enabled in WordPress. We use WordPress more like a CMS than a blog so we are free to replace the database when we deploy. The technique could probably be adapted to only deploy the essential tables (pages, posts etc) and leave the comments table alone.

Usage

Let’s assume the development environment is at /var/www/dev and the production environment is at /var/www/prod.

To ‘check in’ the dev version:

cd /var/www/dev
dump-n-push

To ‘check out’ the current version into production:

cd /var/www/prod
deploy

Set Up

Download the scripts from https://github.com/werkshy/wp-deploy and copy them to /usr/local/bin, which should be in your $PATH.

Everything is checked into git: WordPress files, themes, plugins, db dumps, everything.

Install WordPress in the dev environment

Download and unzip the WordPress release into /var/www/dev.

You’ll need to set up the dev database.

mysql -uroot -p
mysql> create database wpdev;
mysql> grant all on wpdev.* to wordpress identified by 'wordpress';

Set the db parameters in wp-config.php. THIS WILL NOT BE CHECKED IN.

Edit .gitignore, most importantly to block wp-config.php:

/wp-content/cache/
.DS_Store
/wp-config.php
.htaccess

Set up your webserver to serve PHP from that directory as normal (see the example Apache configs at the end of this post).

Add Everything To Git


cd /var/www/dev
git init
git add -A
git commit -m "initial commit"

Create the ‘origin’ repository

You may keep your site in a remote git repo, or in a git repo on the local machine. In this example we create a bare repository on the local machine:

cd /root/
mkdir wp.git
cd wp.git
git init --bare

Push your dev commit to the origin

cd /var/www/dev
dumpdb
git remote add origin /root/wp.git
git push origin master

Prepare the production environment

Check out the files:

cd /var/www
git clone /root/wp.git prod
cd prod
cp /var/www/dev/wp-config.php .

Create the production database (using the same user as the dev one):

mysql -uroot -p
mysql> create database wpprod;
mysql> grant all on wpprod.* to wordpress;

Set the production db name in wp-config.php

Now try loading the db dump into production:

loaddb

If that all works, you can now dump and push the dev site with

dump-n-push

and you can deploy the production site from git with

deploy
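
Conceptually, dump-n-push and deploy boil down to a database dump plus a handful of git operations. Here is a rough Python sketch of the equivalent steps; the real wp-deploy scripts are shell scripts, and the dump path, database names and credentials below are only illustrative:

import subprocess

DUMP_FILE = 'db/wordpress.sql'   # assumed location of the dump in the working tree

def dump_n_push(db='wpdev'):
    """Dump the dev database into the repo, then commit and push everything."""
    with open(DUMP_FILE, 'w') as out:
        subprocess.check_call(['mysqldump', '-uwordpress', '-pwordpress', db],
                              stdout=out)
    subprocess.check_call(['git', 'add', '-A'])
    subprocess.check_call(['git', 'commit', '-m', 'Site update'])
    subprocess.check_call(['git', 'push', 'origin', 'master'])

def deploy(db='wpprod'):
    """Pull the latest commit and load the dump into the production database."""
    subprocess.check_call(['git', 'pull', 'origin', 'master'])
    with open(DUMP_FILE) as dump:
        subprocess.check_call(['mysql', '-uwordpress', '-pwordpress', db],
                              stdin=dump)

One wrinkle to keep in mind: WordPress stores the site URL in the wp_options table, so when dev and production run on different hostnames the dump has to be adjusted for that as part of the deploy.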

Example Apache Config

Development Environment:

<VirtualHost *:80>
	ServerName dev.energyhub.com
	DocumentRoot /var/www/dev
	<Directory "/var/www/dev">
		AllowOverride All
	</Directory>
</VirtualHost>

Production Environment:

<VirtualHost *:80>
	ServerName www.energyhub.com
	DocumentRoot /var/www/prod
	<Directory "/var/www/prod">
		AllowOverride All
	</Directory>
</VirtualHost>