Monthly Archives: April 2011

But Can You Failback?

Your disaster recovery plans need to address not only failing over to standby resources, but also how you’re going to make your way back when the time comes.

Many organizations practice failover testing, wherein they engage a backup / secondary site in the event of trouble with their primary location / infrastructure. They run disaster recovery tests, and if they are able to continue their processing with their backup plan, they think they’re in pretty good shape. In a way, they are. However, few organizations think much about failing back – that is, coming back from their standby location / alternate hardware and resuming production operations using their normal set-up. It is a non-trivial thing to do, as it requires data flows and perhaps even operational process adjustments that the organization almost never gets to practice. Further, it’s almost certainly part of what you’ll have to do (eventually) in the event of a mishap. Yet, most disaster recovery plans don’t even address failback – it’s just assumed that the organization will be able to do it once the primary site is ready again. It’s worth making this return trip part of your disaster recovery plan.

I once worked for an organization that was big on disaster recovery / business continuity. They were pretty good at the business continuity part, and could move the work done (paying people) from one location to another pretty seamlessly in the event of crippling snowstorms, fires, and the like. (If you think that sort of thing is easy, then you probably haven’t done it.) However, their ability to do this was facilitated by the fact that all sites shared a common data center in a separate location.

Given this dependency, they also spent a fair amount of time planning for data center outages. They had regular back-ups to a remote location equipped with standby hardware, and would run twice-yearly simulations of losing the primary site and continuing operations using the backup site. After a few attempts, they had most of the kinks worked out, and understood the timings and dependencies associated with such a failover pretty well. Here I should mention that failing over a data center is much more complicated than failing over a single server, just as swapping out one piece of an engine is much easier than swapping out all of its parts and making sure they still align with one another.

One weekend, the feared “big event” arrived – our primary data center was flooded by some kind of pipe failure. Operations stopped dead. Yet, we didn’t fail over to our backup site, as almost everyone would have expected, because somebody asked a simple question – how do we come back?

The secondary site was clearly a temporary arrangement – understaffed and underpowered for what we needed to do on a long-term basis. Using it was always assumed to be a sprint – a temporary burst of effort that would soon subside. (Imagine the cost of having it be otherwise, and the meter on said cost going up year after uneventful year.) Such an arrangement requires some plan for ending the sprint and coming back to the normal production site as quickly as possible.

We had never practiced coming back.

Fortunately for us, the outage happened at a low period in our monthly cycle, when we could skate by with a few days of downtime. (If it had happened a week earlier or a week later, then millions of people wouldn’t have gotten paid, and you would have heard about it on the six o’clock news.) So, rather than moving things over to a temporary platform and then moving them back just in time for our big rush, we simply waited for the primary site to be made available again. In many ways, this was bad, but the higher-ups decided that it was better than the risks we would run by failing back without strong, tested processes right before our busy time. (The risks would have been nasty things like overpayments, missed payments, erroneous payments, etc.)

So, we sat tight, twiddled some thumbs, and waited out the outage while the people at the primary data center worked like lunatics in order to restore it in time. Our often-practiced disaster recovery plan had proven to be a failure, not because we couldn’t handle a disaster, but because we couldn’t handle returning to normal in a reliable and predictable way.

So, if your disaster recovery plans include failing over to an alternate location or even alternate hardware, make sure they also specify what happens after the clouds lift, and practice those things too. The middle of a disaster is not the time to be sorting out those kinds of details.

Don’t Innovate in Your User Interface

Innovative user interfaces are probably lousy, no matter how sensible or well thought-out they may be, because by definition they break the cardinal rule of usability – analogy to something the user already knows.

It’s much better to bring a better conceptual metaphor to a problem set than to build a better UI widget that doesn’t quite work like all of those other not-so-innovative UI widgets.

You’re a plumber, not Picasso. Your UIs aren’t a canvas – their point is to move crap around.

A few special apps can violate this rule – odds are, yours isn’t one of those.

How Systems Get Exploited – Content Becomes Structure

Don’t let content become structure.

There are many different ways in which an information system can be exploited, including buffer overflows, SQL injection, and cross-site scripting. However, the vast majority of common exploits can be avoided by adherence to a single principle:

Don’t let content become structure.

In the categories of exploits listed above, maliciously crafted content breaks out of its proper role and becomes structure (instructions) that the system follows. If you can ensure that the values your system manipulates never become instructions for the system to execute, then you’ll probably be okay in terms of exploits in the products you build yourself. (Password management and platform hardening are different stories.) The ways in which you keep content from becoming structure are technology-specific (sanitizing form inputs, using JDBC parameters, etc.), but the underlying principle applies to most of the security holes you’re likely to create / avoid.
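To make the SQL injection case concrete, here’s a minimal JDBC sketch of the principle. The table and column names are made up for illustration; the point is the difference between concatenating content into the query and binding it as a parameter.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UserLookup {

    // Vulnerable: user-supplied content is concatenated into the SQL text, so
    // crafted input like "' OR '1'='1" becomes structure the database executes.
    static ResultSet findUserUnsafe(Connection conn, String username) throws Exception {
        String sql = "SELECT id, name FROM users WHERE name = '" + username + "'";
        return conn.createStatement().executeQuery(sql);
    }

    // Safer: the query's structure is fixed up front, and the user-supplied
    // value is bound as a parameter – it can only ever be treated as data.
    static ResultSet findUserSafe(Connection conn, String username) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT id, name FROM users WHERE name = ?");
        ps.setString(1, username);
        return ps.executeQuery();
    }
}
```

In the first version, the content participates in building the statement itself; in the second, the structure is locked down before the content ever arrives.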

How to Estimate a Software Development Schedule

At first, the only way to estimate is badly. Then, the only way to estimate well is to keep estimating.

If you listen to what different camps are saying about software development schedule estimation, you’ll find that much of the discussion can be viewed as riffs on the “agile” vs “traditional” divide. Undoubtedly, such discussion raises many instructive points about estimation techniques, but it frequently overlooks a basic truth about estimation:

Everybody is terrible at estimating the first few times around, no matter how they do it.

This principle is often misapplied as a condemnation of traditional estimation techniques. I frequently hear people dismissing traditional techniques, such as Gantt charts, because they found that some attempt to use the technique was largely unsuccessful. That’s a bit like dismissing the piano because you can’t play like Mozart the first time out.

Any estimation technique you employ will require practice. The first time you estimate a project of non-trivial duration, your estimates will likely be horrifically wrong. The same might go for the second and third times as well. However, sooner or later you’ll catch on to certain things (e.g., sign-offs always seem to take a week, Joe low-balls everything, Susan always forgets to budget time for her testing), and you’ll revise your estimates accordingly. Over time, the quality and fidelity of those estimates will undoubtedly improve.

So-called “agile” estimation techniques are smart enough to formulate this truth more explicitly, by emphasizing that you need multiple sprints consisting of the same team members in order to establish a reliable velocity. However, you’re not just collecting velocity data over those multiple sprints, as though you were dipping a thermometer in different parts of a pool in order to establish an average water temperature. You’re also honing the team’s ability to estimate things by giving them estimation practice coupled with near-immediate feedback on how good their estimates were. Over the course of those sprints, the team is getting their hands dirty in the nitty-gritty of estimation, and learning how to do it better each time out.

The punchline here is that somebody working with an agile method is likely to learn how to estimate faster/better because of short cycle times, but this doesn’t mean that traditional methods don’t work – it just takes longer to get good enough at using the traditional methods, and many folks give up after the first or second time out.

Much more important than any specific method of estimation is the determination to keep applying your chosen method until you become competent with it.

Piwik Reports Chrome Frame as Chrome

Piwik 1.2.1 reports IE with Chrome Frame as pure Chrome. Just sayin’.

I’m playing with Piwik 1.2.1 – it’s a very nice tool, but it has a browser identification quirk that differs from the behavior of Google Analytics. I was running a copy of IE 7 with Chrome Frame (http://www.google.com/chromeframe), and Piwik was reporting those visits as coming from Chrome. That is, Piwik seemed to recognize / record no difference between true Chrome visits and visits using IE with Chrome Frame – all were lumped in the same bucket.

In contrast, Google Analytics reports Chrome Frame usage as coming from “IE with Chrome Frame.” Use of full-on Chrome is reported as “Chrome,” so that you can really differentiate between the two configurations.

By the way, you should also bear in mind that both Piwik and Google Analytics are just reading the agent string header passed by the browser. If a user has Chrome Frame installed, it shows up in the agent string, even if the add-on isn’t currently active (that is, even if the visitor isn’t actually seeing things rendered by Chrome Frame). The only way to get Chrome out of the agent string, and have the browser present to Piwik / Google Analytics as plain old Internet Explorer, is to uninstall Chrome Frame.
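If you’re curious what that distinction looks like in practice, here’s a rough sketch of the kind of check an analytics tool has to make on the agent string. The “chromeframe” token and the sample agent string below are illustrative assumptions, not something pulled from the Piwik or Google Analytics source.

```java
public class AgentClassifier {

    // Classify a raw User-Agent header roughly the way Google Analytics appears to:
    // IE carrying the Chrome Frame token is reported separately from real Chrome.
    // (Token name and sample string are assumptions for illustration.)
    static String classify(String userAgent) {
        String ua = userAgent.toLowerCase();
        boolean isIe = ua.contains("msie");
        boolean hasChromeFrame = ua.contains("chromeframe");

        if (isIe && hasChromeFrame) {
            return "IE with Chrome Frame";
        } else if (isIe) {
            return "Internet Explorer";
        } else if (ua.contains("chrome/")) {
            return "Chrome";
        }
        return "Other";
    }

    public static void main(String[] args) {
        // An abbreviated IE 7 agent string with Chrome Frame installed
        System.out.println(classify(
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; chromeframe/11.0.696.57)"));
    }
}
```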

What Between Means in SQL (Not What You’d Think)

In SQL, “BETWEEN X AND Y” means “greater than or equal to X and less than or equal to Y” – not necessarily what you might infer from its plain-English counterpart.

Is 6 between 3 and 9? Yes.

Is 6 between 9 and 3? Yes, but not in SQL.

In SQL, we use the BETWEEN operator to determine whether a value falls between two other values. We might think that using a clause like “BETWEEN X AND Y” establishes a range between X and Y, and that value comparisons will be true as long as the value being compared falls into that range.

However, the BETWEEN operator works differently. “BETWEEN X AND Y” actually translates into “greater than or equal to X and less than or equal to Y.” This means that X and Y have to be placed in ascending order for the statement to work correctly – X must be the lower of the two values.

Here’s a screenshot demonstrating this in Oracle, but virtually any database works the same way. I’ve included examples for both numbers and characters.

[Screenshot of using BETWEEN in Oracle: arguments to BETWEEN must be in ascending order]

Usually, this behavior isn’t a problem. If you are working with literals, as in the screenshot above, you’re likely to put the lower value first just as a matter of habit. However, if you are performing logic based on variables or column values, you need to make sure that the BETWEEN clause gets the lower end of the range first, or else your results may not be what you expect.
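When the endpoints arrive as variables, the simplest defense is to sort them yourself before the query ever sees them. Here’s a minimal JDBC sketch of that idea; the orders table and amount column are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RangeQuery {

    // Bind the lower value to the first slot of BETWEEN and the higher value
    // to the second, regardless of the order the caller supplied them in.
    static ResultSet ordersInRange(Connection conn, int boundA, int boundB) throws Exception {
        int low  = Math.min(boundA, boundB);
        int high = Math.max(boundA, boundB);

        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM orders WHERE amount BETWEEN ? AND ?");
        ps.setInt(1, low);   // BETWEEN low ...
        ps.setInt(2, high);  // ... AND high
        return ps.executeQuery();
    }
}
```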