Monday, October 8, 2007

Can your clustered software platform do this?

Disclaimer: I actually know nothing about cluster software environments. And what I call "threads" below aren't actually threads.

Suppose you're developing some sort of clustered multi-processor system. The sorta thing that runs on many computers, and uses fairly complex inter-node interaction with various bits of shared state and whatnot (i.e., something more involved than parallel processing jobs with a message bus in-between). Say, it's a clustered Instant Messaging server, or some sort of an HA database system.

For development, you're running many processes on your workstation, or have some sort of a test cluster thing. All of that is using sample datasets, debug clients, etc.

Now, the problem is that changing code while developing this system is a nightmare. Most of the time, any single change will require recompiling the thing, restarting parts (or all) of it, and then going through all the motions necessary to get it into the desired state. At which point the freshly-written code is debugged, new issues are found, and the whole cycle needs to be repeated. A nightmare.

Well, today I found myself in just such a situation: developing a clustered system that does something or other in many processes spread across multiple machines. Only (of course) it was all being done in Erlang. And it was a pleasure.

I would be poking at the cluster through the Erlang shell (an interactive yoke similar to what one finds in Python or Perl, only this one was running as part of the cluster, so I could easily examine states of various bits of code running on various machines, send them commands, kill & restart threads, etc.). When this poking found an error, I simply fixed it in code, compiled that, and then issued this command at the shell:

rpc:multicall(nodes(), c, l, [my_module]).

What this says is "on all known Erlang nodes, purge the old version of code for my_module, and replace it with the new version from disk". What it does is, it goes to every connected Erlang node (and since Erlang clusters tend to organise into fully-connected graphs, that means all nodes in the cluster) and loads the new code from disk there; threads running code from my_module switch over to it the next time they make a fully qualified call into the module. And it generally does it without affecting running in-memory state at all. Let me just repeat: I can change code a thread is running without restarting it, while keeping all of its associated in-memory data intact.
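
To make that a bit more concrete, here is a minimal sketch of the sort of thing my_module could contain. The counter, the message shapes and the function names are all made up for illustration; only the module name comes from the command above. The bit that matters is that loop/1 is exported and calls itself as ?MODULE:loop(...): a fully qualified call like that is the point at which a running Erlang process moves to the newest loaded version of a module, carrying its state (Count here) along as an ordinary argument.

```erlang
-module(my_module).
-export([start/0, loop/1]).

%% Hypothetical example: a process that holds a counter in memory.
start() ->
    %% Spawn via module/function/args rather than a fun, so the process
    %% always enters through the exported loop/1.
    spawn(?MODULE, loop, [0]).

loop(Count) ->
    receive
        bump ->
            %% Fully qualified tail call: if a newer version of this
            %% module has been loaded, execution continues in it,
            %% with Count passed along untouched.
            ?MODULE:loop(Count + 1);
        {get, From} ->
            From ! {count, Count},
            ?MODULE:loop(Count)
    end.
```

One caveat with this sketch: Erlang keeps only two versions of a module around (current and old). A process that is still sitting in the old version when the next purge happens gets killed, so a thread that never receives a message between two reloads would not survive the second one.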

This way when I find a bug, or need some new feature, I can add it to my running cluster and have it available straight away, without any sort of restarting or messing necessary. It's very cool.
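
If typing that one-liner gets old, it could be wrapped in a small helper that also looks at what came back. This is only a sketch: cluster_reload and reload/1 are names I made up, while rpc:multicall/4 and c:l/1 are the same calls as in the command above.

```erlang
-module(cluster_reload).
-export([reload/1]).

%% Reload Module on every other connected node and summarise the outcome.
%% Note that nodes() lists the other connected nodes, not the one the
%% shell itself is running on.
%% Returns {LoadedCount, Failures, UnreachableNodes}.
reload(Module) ->
    %% rpc:multicall/4 returns the per-node results plus the nodes
    %% that could not be reached.
    {Results, BadNodes} = rpc:multicall(nodes(), c, l, [Module]),
    %% c:l/1 answers {module, Module} on success; anything else
    %% (an {error, _} or {badrpc, _} tuple) counts as a failure.
    Failures = [R || R <- Results, R =/= {module, Module}],
    {length(Results) - length(Failures), Failures, BadNodes}.
```

At the shell that would be cluster_reload:reload(my_module)., and anything other than an empty failure list and an empty list of unreachable nodes is worth a look.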