Sunday, 19 July 2009

High-availability... or not

It's been so long since my last post I won't even bother apologizing.

We all know single points of failure (SPOFs) are bad. By definition they provide a point in the system where the failure of one component could have a significant impact on availability, functionality or another -ility you care about.

Avoiding SPOFs is what high-end boxes are all about. Redundant power supplies with independent distribution, parallel IO links, segregated CPU boards, multiple backplane fabrics, all that great stuff that works like a champ (usually), makes a single box highly resilient and costs a fortune.

I've two things against such systems, one practical and one possibly more prejudice though it is based on experience. The practical point is that this doesn't scale; fine if you need a few dozen CPUs but what if you need hundreds, thousands? Or terabytes of memory and petabytes of disk? That takes you outside a single box and if you have to cluster boxes then I see very little value in using high-end ones. More below.

The prejudice is that I've found that HA boxes are wonderful while they work, but if they do fail then they are substantially harder to recover than a vanilla box. At work we had an HA system that wasn't all that high-end but did use some funky hardware-level clustering. The vendor had 2 people in the country who understood the setup so if the system ever choked then better hope neither had resigned, was on leave or busy with another customer's dead HA system. When one of those babies dies you don't do a simple reboot, or at least you don't when the failure has taken part of the hardware heartbeat/failover logic with it. Ugly.

If you need a cluster then you could use many high-end systems that in isolation are HA. Surely then the overall cluster availability will be much higher than if you used simple off-the-shelf boxes? Well yes, but mathematics gets in the way. Mean time between failure (MTBF) and availability metrics fall off really quickly. Assume a cluster of 100 machines each of which has 99.9% uptime. Compare this with one with systems with a mere 95% availability. How much better is the first?

If the cluster only counts as up when every machine is up, and failures are independent, then the first has an availability figure just over 90%, the second just under 0.6%. On the surface this would seem to be evidence for big iron but think about it -- 90% isn't very good especially as we were paying for 99.9%. With 10% downtime over the time period then we have to put in place the processes and mechanisms for recovery/repair/replacement. That means sparing, monitoring, training, all that good stuff. If instead the failure rate is the higher figure for the commodity system then we need to... do exactly the same thing, if more of it. Real-life figures are going to be somewhere between these ones as most systems are much better than 95% but don't get near 99.9% real-world uptime.
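For anyone who wants to check the arithmetic, here is a minimal sketch in Java. It assumes machine failures are independent and that the cluster only counts as available while every machine is up at the same moment; the class and method names are mine, purely for illustration.

```java
// Sketch only: assumes independent failures and that the cluster is
// "available" only while every single machine is up at once.
class ClusterAvailability {

    // Probability that all `machines` nodes are up simultaneously.
    static double allUp(double perMachineAvailability, int machines) {
        return Math.pow(perMachineAvailability, machines);
    }

    public static void main(String[] args) {
        int machines = 100;
        for (double a : new double[] {0.999, 0.95}) {
            System.out.printf("%.1f%% per machine, %d machines -> %.2f%% with every machine up%n",
                    a * 100, machines, allUp(a, machines) * 100);
        }
    }
}
```

For 100 machines at 99.9% each this comes out at roughly 90%. Changing the per-machine figure or the machine count shows how quickly the all-machines-up number falls away as the cluster grows.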

This is the inflexion point; at some point the system scale means that failures become the everyday case and we have to deal with them. If that's the case then what is the benefit of the higher-spec machines? If you put in place the processes required to repair a machine a week will it really be much more work to do 3 instead? If the unit cost for the less expensive units is 1/10 the higher ones then where's the best investment point?

This is why I'm so convinced that for higher-scale systems then software-level HA is the way to go. Don't rely on hardware clustering and big RAID arrays, use multiple replication and software clustering. Make each box as anonymous as possible, utilize N+1 type sparing and as boxes fail spin up a new one and give it the needed personality. And you never have to call one of the 2 vendor experts to do it.

Monday, 23 March 2009

Too much of a good thing

Occam's razor says that things should be as simple as possible. The dubiously quoted Einstein corollary states that they should be as simple as possible but no more so. Both are design principles that infrastructure people should have burned into their brains.

I don't want to remain purely in the language space but as hinted at previously the Java language is, in my humble opinion, an exemplar of undesirable complexity. Actually that's untrue. The Java language started out as a beautifully simple and compact language and the changes over the years have only partially damaged those qualities (non-reified generics aside), what I'm really talking about are the libraries.

There is certainly a tension here; as also mentioned previously, the ability to leverage (how I loathe that word) an existing language runtime/library is a massive advantage for any new language. But I think most would agree that the Java libraries have become very unwieldy over time. I believe there are 2 main reasons for this, one philosophical and one decidedly practical.

The philosophical reason is simple -- backwards compatibility is in Sun's DNA. The company that is proud of modern binaries running on 15-year-old boxes is going to be a real advocate for library longevity and future-proofing. I think there are large sections of the library that could easily be removed at this point with practically no one noticing but it won't happen because of Sun's demand for backward compatibility. (As an aside this position also inflicted type erasure generics on the world). I do have hope the proposals for a modular JDK gain traction as that is likely the sensible way forward.

I suspect though that the practical reason for library bloat is the really interesting one we can learn from. I'd argue that the biggest millstone around the Java library's neck is the legacy of the java.* vs javax.* package split. For those who weren't using Java at the time the original model was that core JDK packages would be in the java.* namespace, additional libraries existing as distinct downloads would be in the javax.* namespace.

Except that several javax.* APIs were rolled into the core JDK. Swing is the best example, with cryptography, XML and web services support not far behind. Suddenly there were parallel namespace package hierarchies within the language with really no strong principles driving some of the split. Why are some classes in java.sql and others javax.sql? Why is text processing a 'core' java.util.* package but XML processing is in the hinterlands of javax.*? It gets even worse when you consider the 3rd party libraries from W3C and OMG that have also been sitting in the core library for a very long time now.

On one level who cares but I think there are a few real lessons here. Firstly, namespaces matter! As someone who has had to define naming strategies and namespace management across multiple cooperating projects I know firsthand how hard it is to get right. The Java libraries I believe show the consequences not so much of getting it wrong but rather how good decisions can have really nasty consequences. Who would have thought that an approach to namespace segregation combined with a desire for backwards compatibility could combine to drive a massive explosion in library complexity? The law of unintended consequences at work.

Very nice, but what can we do about it? For infrastructure providers the lesson I would argue is clear: don't approach version transition and support as operational problems. High-level and conceptual design decisions can have a direct measurable impact on how well the system can transition from one release to the next. Pollute the namespace and you have trouble. Push too hard for multiple supported versions and you can easily multiply support complexity and cost. Do both and you may have Java on your hands.

Thursday, 26 February 2009

The programming language toread list

It is mildly embarrassing to be reminded how long it is since you blogged by reading the URL when opening up your resume to touch it up. I can't believe I haven't posted in two months. If anyone actually reads this -- sorry, I've been busy, not that it's a real excuse.

I've been doing a lot of programming of late, both at work and on an external project I may be in a position to inflict on the world soon. So my lack of posting was not due to disillusionment after broaching the subject of programming languages last time.

I mentioned the new programming languages I planned to learn and unfortunately that activity is still in the future tense. But I'd like to explore which languages and why. My reading list currently has books on Erlang, Clojure and Scala on it. I never set out to be interested in these types of languages, it just sort of happened but on reflection I can certainly see why. I see three themes in the languages chosen; functional programming, concurrency and JVM compatibility.

I'll deal with the functional aspect first. I'll doubtless lose all credibility as a computer scientist by saying this but I've never been that excited by functional languages. I suspect it's because my first introduction was to Lisp in a computer theory course and it was used more as a demonstration of functional theory than as a means for getting things done. Ahh, my practical streak. I either wasn't taught, or didn't appreciate aspects such as the incredible power of side-effect free functions. Years later I start to appreciate this power and I want to learn more. But in all honesty I see the functional aspects of these languages as more of a means to an end than the reason they interest me. Which leads to...

Concurrency is, in my opinion, the real paydirt here. I long ago came to the conclusion that there are two sorts of people when it comes to hand-crafted multithreaded code; those who know their best is at best barely good enough and those who haven't realized how bad they are at it yet. There are a few people who are true masters of the craft but the number is so small I discard it as a rounding error.

We hear much about the multicore challenge and I think we need to be slightly careful as little more than a decade ago there were similar fears that SMP was impossible above 12-way boxes. But hype aside the fact is that next month there'll be 2-socket boards with 8 real and 8 virtual cores thanks to the new Xeons. Yes the virtual cores are artifacts of hyperthreading but if the OS scheduler sees it as a thread execution engine then to the application the box is 16 core. Ten years ago a 16-CPU box was serious high-end iron and few apart from Oracle developers (as in those who write Oracle) had to worry about coding for the platform. But everyone has to think concurrency now. And crude abstractions such as hand-managed processes or threads are going to fail us in this arena.

Ironically this is an area where Java will probably boast much better performance than C++, at least until the C++0x libraries appear in anger. Single thread performance will become much less important, a language runtime will instead live or die by its scalability. And thanks to the concurrency utilities mentioned last time Java I think has given itself a few years' breathing space. Having high-level concepts such as managed thread pools, blocking queues and countdown latches really makes high thread count programming much easier.

But the abstractions still require a reasonable amount of appreciation of concurrency. And if you're wondering if volatile will do instead of synchronized then you are in trouble. This is why I love Erlang's actor model, where every process is logically distinct and tied together by message queues. Say goodbye to worries about deadlock and synchronization, instead write the logic for the tasks to be performed and let the runtime do the rest. I've been lucky of late to write several concurrent Java apps where I could put worker threads in a pool on the end of a BlockingQueue and I think it gave me just a flash of the power behind this model. I look forward to learning more and getting a better appreciation for what the other languages provide here also.
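To make the worker-pool-on-a-queue pattern concrete, here is a minimal sketch using only java.util.concurrent. It is not the code from those apps; the class name, the sum-of-squares task and the stop marker are all invented for illustration, it assumes real task values are never negative, and it uses lambda syntax from later Java versions for brevity.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: names and the toy task are illustrative assumptions.
class QueueWorkers {

    // Sentinel telling a worker to stop; assumes real tasks are never negative.
    private static final int STOP = -1;

    // Sums the squares of `tasks` using `workers` threads fed from one shared queue.
    static long sumOfSquares(int[] tasks, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        int n = queue.take();          // blocks until work arrives
                        if (n == STOP) {
                            return;                    // this worker is done
                        }
                        total.addAndGet((long) n * n);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            for (int t : tasks) {
                queue.put(t);
            }
            for (int i = 0; i < workers; i++) {
                queue.put(STOP);                       // one stop marker per worker
            }
            pool.shutdown();                           // accept no new workers, let queued ones finish
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding workers", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new int[] {1, 2, 3, 4}, 3)); // prints 30
    }
}
```

The only shared state is the queue and one atomic counter, so there are no locks to get wrong; the workers communicate solely by what arrives on the queue, which is the small taste of the message-passing style described above.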

And I do think that Scala and Clojure have made a great move by being hosted on the JVM. Because of the factors I discussed last time I know that I need broad solid libraries from a language, almost more than anything else. If a language is going to demand I learn a whole set of new libraries then it better offer something damn compelling. I suspect Erlang does that but we'll see. For the other languages though they wave a get out of jail free card by allowing the standard Java libraries to be called from within the new code.

True this sort of hybrid programming will probably hit problems if the whole purpose for the new language is to offer something Java doesn't but again, this is all about trade-offs and I'll likely take that one. But the standard Java libraries are huge (too much so but I'll talk about that another time) and providing a mechanism to call out to them gives a new language a huge advantage; most new languages suffer for years because of a lack of libraries and indeed new language innovation is I suspect impeded by this constraint. Using the JVM is effectively a means of language bootstrapping.

So those are why I want to learn these languages and why I find them compelling. Now I just need to find the time...

Tuesday, 23 December 2008

The infrastructure of programming languages

So far I've talked about infrastructure from a traditional systems perspective; even if very dynamic and yes agile it is generally physical equipment, operating systems and software services. There's also a finer-grained perspective on infrastructure which I think is critical but is rarely identified as an infrastructure issue -- programming languages.

For the first time in a few years I had to learn some new programming languages this year. Well, to be precise I needed to learn Python for work and chose to learn Ruby for an external project. I strongly recommend the experience to any other technologists who have been in the business for a while, I plan to continue the trend. But that's for another post.

What struck me about the experience of learning a new language was of course the new way of doing things. For me in particular finally biting the dynamic language bullet after 20-odd years with static languages was a really educational experience. But after the newness wore off and once I became comfortable with the new way of doing things, even after I started to internalize the new ways of working and productivity gains I was struck by something: how little I cared.

That is of course exaggerating for effect. What really struck me was the realization that it wasn't language features that would colour my overall view of the language... or at least not outside the productivity question. What I really cared about was the available libraries to enable me to do interesting things without starting from scratch every time. Textbook coding examples are all very nice but in the real world I'll want to have network clients and servers, access various datastores and process structured and unstructured data. In a highly concurrent way of course. The quality of the sockets and thread libraries in a language are likely to have a more direct impact on my development experience than any language feature, even if it is one as compelling as closures. In other words I want a firm infrastructure delivered as the standard libraries that come with the language.

The adoption of standard libraries is an interesting case study in how developers will, if the infrastructure is good enough, jump at the opportunity not to re-invent a wheel that isn't core to their business. Sure many standard libraries in any language have some 3rd party alternatives but I'd argue that in most of those cases the external alternatives exist because of flaws in the standard libraries. Java is actually a really good example here. Threading was, prior to the addition of the java.util.concurrent packages, damn hard to get right. Easier by far than C or C++ but that wasn't a high bar to cross. Many people looked at external libraries such as those built by Doug Lea and those very libraries became folded into the base Java SE platform as the aforementioned java.util.concurrent package. And the exact same thing has happened again with his fork/join libraries that seem destined for addition in JDK 7.

Conversely the java.util.logging classes have not been universally loved and alternatives such as the Apache Log4J have arguably gained significantly more traction. The situation here gets muddy as not only do you have different logging frameworks but you have several of them acting as facades into which specific implementations can be plugged. You could probably have code using the SLF4J libraries that talked to the JDK logging atop Log4J. Good luck with that.

The lesson here isn't to throw stones, rather it's to highlight the recurrence of patterns previously seen on the systems side. If a piece of infrastructure is good enough then it will be used. If it's not and there are alternatives then the best thing the infrastructure providers can do is look at the weaknesses in their offerings highlighted by the alternative and plug the gap. If they don't then the result will be the mishmash of standard products and blessed alternatives that the users (developers in this case) hate and the forbidden alternatives being used undercover to actually get things done. For systems infrastructure the providers are operations staff in many cases, for languages it's standards bodies or spec consortiums (yes I do distinguish between the two). In many cases both share many of the same curses, from a perception of sloth by a demanding user community to a responsibility for overall product health that is not fully appreciated outside their small groups.

Tuesday, 9 December 2008

Team sizes and inertia to change

There's a subtle aspect to team sizes that really only becomes noticeable when the need to radically change the ways of working arises.

In a simple view of the world a team either has enough people for its responsibilities, too many or not enough. But when they need to significantly modernize their processes this model becomes overly simplistic.

For the team with extra resources it's not really a challenge. They have the people with extra cycles who can be deployed on developing and putting in place the new ways of working. So for example while the rest of the team is still building servers manually the forward-looking ones are starting to stand-up provisioning servers.

At the other end of the spectrum the under-resourced team will not have fun when the new needs emerge. But because they are already stretched they basically have no option but to work to find ways of doing things smarter and that likely aligns closely with the new imperatives. Continuing the above example the small under-resourced team would probably already have some sort of provisioning service in place; they just don't have enough people to do their jobs without this sort of tool-led automation and productivity gain.

The interesting case is the team that is ticking along with the current workload. When operational and other realities demand a more responsive solution then the team has two options; start the modernization as with the other teams or do some shuffling of tasks to get the current jobs done more quickly. If this continues then the team can get into a rut of not having enough free cycles to start on the new processes because they are 100% consumed with lurching from crisis to crisis. Often individuals perform near-mythical feats; doing things the old way but making up for it with long hours and cutting corners elsewhere for example. Counter-intuitive as it sounds reducing the team size in such a situation is one of the most effective ways to engender the needed change.

The imperative here is similar to that imposed on IT organizations within large private companies who are given a real-terms decrease in budget every year. The cost of IT has to be driven down, there is simply no excuse. But the danger being highlighted here is that when at the point of barely managing today's workload the approach of accepting technical debt for current success may significantly impair the ability to make the changes needed.

The developer perspective on the root access question

I don't want to come across as being overly critical of operations and support staff. Being the one who gets called in at 3am does change one's perspective somewhat. I have enormous respect for the support staff; they are often placed in impossible situations and get it from all sides.

In the interests of fairness therefore I want to re-visit the idea of trust and perspectives on who is responsible for the infrastructure and who will break it. As mentioned in the last post operations staff often severely limit who has privileged access to servers due to a fear of unknown changes and the impact thereof.

But a developer who is asking for complete freedom on a managed server will likely give a very different response when asked if he minds sharing a machine with another developer who also has root access. The strength of the response increases in direct proportion to the size of the respective teams, the deployed systems and their criticality. I still have the scars from once having to try and convince two teams to share the same WebLogic cluster. It wasn't pretty. The same developer who believes himself (using pronouns in a gender-independent way here) to be worthy of privileged access may well not assume the same for other 'colleagues' in different teams. The unspoken assumption is that the other developer is at best careless, at worst an idiot who will deploy awful code which will bring down the entire machine. Which is basically the same concern as that held by operations staff, the only difference is who is playing the respective role of virtuous defender and incompetent hacker.

The moral of this story is that the perceived chasm between operations and development staff is frequently not as wide as it first appears. Though they may have different perspectives on specific issues they in general want the same thing; a working platform that doesn't get messed up by someone else. Also yet again we see the value of the same solution set (virtualization, rapid rebuild processes) that we've touched on before.

I strongly believe that the traditional model of locking down access is the wrong approach. Instead clearly define what is the responsibility of operations, i.e. the bits they will give full support for. For the rest provide a more liberal access policy but combine it with tools to quickly restore the system to a known state. Instead of not giving out root access, let the developers experiment in the understanding that if they really screw up then they need to rebuild the system to a known state before even thinking of picking up the phone to support. Obviously there are different facets depending on whether the system is development, reference or operational but the principle is, I believe, sound. It is after all what any virtual host provider does.

Friday, 5 December 2008

Amazon EC2 and controlling privileged access to IT resources

One of the traditional reasons for tightly locking down IT infrastructure is the fear of the damage that will result from unmanaged changes. The classic manifestation of this fear is the restrictions placed upon who gets administrator/root access to servers. In many cases the answer is simple -- no one except the IT support staff.

As someone who has worked in a support role I feel their pain. The risk being mitigated is the work required to fix things when someone else does something stupid. From a purely selfish perspective therefore it makes complete sense to simply restrict all access as opposed to trying to develop finer-grained processes that distinguish the careful from the deranged.

There are a few underlying assumptions here, perhaps the main one being that the roll-back of dumb changes is hard. Another is that a badly configured machine may have a negative impact on the rest of the IT estate. Anyone who has worked in IT for more than a few months will have war stories of the disasters caused by seemingly innocuous changes and this provides the evidentiary basis for the fears.

This all came to mind yesterday when using Amazon's Elastic Compute Cloud, better known as EC2. Basically it's a remote virtual server bank; you wave a credit card and then get charged on an hourly basis for the instances you start up. A new customer can get up to 20 virtual instances, if you want more (potentially up to thousands) then you have to ask nicely. I haven't had cause to do that but suspect that for any reasonable use case Amazon will gladly take your money. They're clever that way.

But glancing at the main EC2 page I saw something that hadn't registered with me before. Amazon have partitioned EC2 instances into availability zones and regions. There are multiple availability zones in a region and when starting instances you can spread them across zones or regions to improve resilience. Amazon guarantee 99.95% uptime on each region -- but since all availability zones are currently within a single region (US East) then the whole service SLA is effectively 99.95%.

Think about it. They will give any stranger with a credit card root access to 20 machines with no notice, approvals or vetting. Contrast that with the traditional approach described above and one of two things is happening; either Amazon is crazy or they have built an infrastructure that doesn't care what individual server root users do. Given the success of EC2 I think the latter is the case.

Virtualization is a key enabling technology here. All EC2 machines are Xen-hosted virtual machines run from images hosted in Amazon's Simple Storage Service (S3). If you completely screw up an instance you just terminate it and start another. Instance storage survives across reboots but not shutdowns; once an instance is shut down it effectively ceases to exist. (For the curious you use S3 or the Amazon Elastic Block Store for persistent data storage, the latter presents block devices to the EC2 instances that you just put a file system atop.) This at a stroke removes the fear of remedial action being human effort heavy. It also highlights the criticality of carefully separating responsibilities, what Amazon is saying is "you break it, you fix it, or just start again". No Amazon sysadmin effort is consumed by fixing hosed instances, the company delivers the instance hosting service, intra-instance configuration is the responsibility of the owner.

As for the other concern of a hosed machine killing other parts of the infrastructure, I have no idea what capabilities Amazon have deployed to address the problem and don't feel inclined to try anything too exotic in an attempt to discover them. But the 99.95% availability SLA certainly implies that they have high-class monitoring services. And, in another inversion of traditional models, the company also includes charges not just for server uptime but also bandwidth utilization. I suspect therefore that if you really wanted to fire up 20 EC2 instances and start pushing enormous traffic across all network interfaces then two things would happen. Firstly, my guess is that virtual networking (i.e. VLANs) will be segregating the traffic on the internal Amazon network; secondly, the credit card bill for the external traffic you receive at month's end will prove a great incentive to curb similar future excesses. It's the sharp end of pay-per-use.