Google's smooth show hides frantic action behind the scenes

WIRED: WHAT MAKES Google successful? Ask the older generation of internet users, and they'll say, "because its search was better than everyone else's". Ask someone who has just joined the internet, and they might say, "because it's fast". Ask someone using its e-mail and applications, and they might say, "because they're reliable", writes DANNY O'BRIEN

I think what has made Google successful lies behind the scenes: the backend software that made Google’s search so good in the beginning. But that code also covers up the fact that, despite appearances, Google’s systems are not as fast or as reliable as we think.

GoogleFS, BigTable and MapReduce: those are the three ingredients of what techies outside Google know about the software inside Google. They are key parts of how Google whips hundreds of thousands of computers into shape and uses them to serve millions of requests every second.

What do these strangely capitalised names mean?

GoogleFS is the easiest to understand. “FS” stands for “file system”. A file system is the mechanism that lets you save files in a certain folder on your computer’s hard drive and recover them later. But Google’s file system, GoogleFS, doesn’t just work on one hard drive. It works across Google’s hundreds of thousands of servers, in dozens of data centres. Most importantly, it can recover from any one of those machines dying in harness.
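
GoogleFS itself has never been released, so here is only a toy sketch of the idea in Python, with every name my own invention: each chunk of a file lives on several machines, and a read simply falls back to another copy when one of them has died.

```python
import random

# Toy sketch (all names hypothetical): each chunk of a file is stored
# on several "chunkservers"; a read tries replicas until one answers.
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.chunks = {}                      # chunk_id -> bytes

    def read(self, chunk_id):
        if not self.alive:
            raise IOError(self.name + " is down")
        return self.chunks[chunk_id]

def read_chunk(chunk_id, replicas):
    """Try the replicas holding this chunk, in random order, until one answers."""
    for server in random.sample(replicas, len(replicas)):
        try:
            return server.read(chunk_id)
        except IOError:
            continue                          # that machine is dead; try the next
    raise IOError("all replicas of chunk %d are lost" % chunk_id)

a, b, c = ChunkServer("rack1-pc7"), ChunkServer("rack2-pc3"), ChunkServer("rack9-pc1")
for s in (a, b, c):
    s.chunks[42] = b"part of a web index"
a.alive = False                               # a cheap PC overheats and dies...
print(read_chunk(42, [a, b, c]))              # ...and nobody notices
```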

That’s good, because Google’s early success was built on highly unreliable computers. Instead of spending money on expensive but guaranteed stable server hardware, Google used ordinary PCs with ordinary components. They overheated, crashed and died frequently but, when they did, Google just yanked the broken machine out and slotted in a new, cheapo PC.

GoogleFS covered up the problem by working around it, as it was designed to do. And it turns out that’s a great idea when you grow fast. When you have thousands of machines, one of them will die eventually. When you have hundreds of thousands, one of them will probably die in the next few minutes.
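
To put rough numbers on that (the figures here are my own back-of-envelope assumptions, not Google's), suppose a single cheap PC fails, on average, once every three years:

```python
# Back-of-envelope sketch with assumed figures (not Google's own):
# if one cheap PC fails on average every three years, how often
# does *some* machine in a big fleet fail?
HOURS_BETWEEN_FAILURES = 3 * 365 * 24     # one PC: roughly 26,000 hours

for fleet_size in (1_000, 100_000, 500_000):
    minutes = HOURS_BETWEEN_FAILURES / fleet_size * 60
    print(f"{fleet_size:>7} machines: a failure roughly every {minutes:,.0f} minutes")
```

At a thousand machines, something breaks about once a day; at half a million, every few minutes.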

If GoogleFS was Google’s way of taking advantage of all of those terabytes of unreliable but affordable hard-drive storage across the planet, MapReduce was its strategy for conveniently using all of the processing power locked away in those PCs. Writing a program that runs on one computer is (fairly) easy. But how do you write one that can work on thousands, and keep scaling as you add new machines?

Programmers using MapReduce manage it by breaking their tasks into two phases. The first, the “map”, takes an input and chops it into lots of subproblems, each with its own result. The second, the “reduce”, takes all of the subproblems’ results and reduces them down into a much smaller answer.

The trick is that the subproblems can number in the millions, and can be spread across as few or as many PCs in Google’s server armoury as are available; and the final result can be as simple as a single figure, delivered to just one person. So it scales for the computers, but stays comprehensible for the mere humans trying to gather the results.
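
Google’s own MapReduce library is proprietary C++, but the shape of the idea fits in a few lines of Python. Here is the canonical word-counting example, a sketch rather than Google’s code, with the two phases spelled out:

```python
from collections import defaultdict

# Phase one ("map"): chop the input into subproblems. Each document is
# handled independently, so this could run on as many machines as there
# are documents.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Phase two ("reduce"): boil all the subproblems' results down into one
# much smaller answer.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the web is big", "the web is unreliable"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'the': 2, 'web': 2, 'is': 2, 'big': 1, 'unreliable': 1}
```

In the real system the map calls run on thousands of machines at once and the framework shuffles the intermediate pairs between them; the logic the programmer actually writes, though, is no bigger than this.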

Finally, we have BigTable. If GoogleFS is for saving files, and MapReduce is for running programs, BigTable is Google’s solution to storing lots of data.

Like its siblings, BigTable does not quite work like a standard database such as Microsoft Access or Oracle. As with MapReduce, it demands that its programmers use a far simpler model for handling their data. But as the name suggests, BigTable scales – well enough that the billions of e-mails in Google’s Gmail service and the millions of videos on YouTube depend on it.
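
Google’s published paper describes BigTable as a sparse, distributed, sorted map: every cell is addressed by a row key, a column name and a timestamp. A toy, single-machine sketch of that model (mine, not Google’s) looks like this:

```python
import time

# Toy sketch of BigTable's data model: a map from (row key, column name,
# timestamp) to an uninterpreted string. Real BigTable shards rows across
# thousands of machines; this is one dict on one machine.
class TinyTable:
    def __init__(self):
        self.cells = {}           # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value):
        self.cells.setdefault((row, column), []).append((time.time(), value))

    def get(self, row, column):
        versions = self.cells.get((row, column), [])
        return max(versions)[1] if versions else None   # newest version wins

table = TinyTable()
table.put("com.example.www", "contents:", "<html>...</html>")
print(table.get("com.example.www", "contents:"))
```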

It is software like this that gives Google the leeway to be quite haphazard behind its calm exterior. All three systems are designed to fail relatively gracefully. If you haven’t seen a database error or a “file missing” error when accessing Google, it’s not because things haven’t been going wrong behind the scenes.

When the application handling your Gmail requests dies horribly (and it does), you don’t see a crash because Google instantly switches you to another computer, running in another data centre.

That computer is pulling data from the same file system and the same fault-resilient database.

And because the lag in switching machines is smaller than the giant pauses between you clicking a button on Gmail and getting an answer across the web, you barely notice it.
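
Conceptually, the client-side half of that trick is no more than a retry loop. A hypothetical sketch, with made-up addresses (the real routing happens inside Google’s front-end servers, not your browser):

```python
# Hypothetical sketch of fail-over: try each data centre in turn.
# The retry takes milliseconds; the round trip across the web takes far
# longer, so the user never learns which machine actually answered.
import urllib.request

FRONTENDS = [
    "https://dc1.example.com/mail",   # hypothetical addresses
    "https://dc2.example.com/mail",
    "https://dc3.example.com/mail",
]

def fetch_inbox():
    for url in FRONTENDS:
        try:
            with urllib.request.urlopen(url, timeout=2) as reply:
                return reply.read()
        except OSError:
            continue                  # that data centre failed; ask the next
    raise RuntimeError("every data centre is unreachable")
```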

There have been many attempts by enterprising coders to reproduce these programs outside Google, the best known being the open-source Hadoop project.

But if you want to mimic Google’s success, perhaps you shouldn’t try to copy the software, but rather understand what led to that software and what Google has learned from it.

The internet is not about reliability and it’s not about speed. It’s about coping with the unreliability of computers and taking advantage of the relative sloth of modern internet connections.

Google may look great from the outside but, like a magician, if you want to know how it pulls off that trick, you need to stop watching the results and start looking at the frantic action behind the curtain.

When you look into that, this internet giant’s monopoly starts looking rather more vulnerable.