From Obama to Osama and back to Obama, Microsoft’s spellcheck function is constantly updating its dictionary and grammar rules
FOR BARACK Obama, April 2007 was a key moment in his nascent presidential campaign. He was scheduled to speak at the Democratic National Convention, a chance to get his name out to the US electorate.
But before he could convince them to “turn the page for hope”, there was an unexpected foe to conquer. Forget Hillary, never mind McCain, this was an even more indefatigable opponent: the red squiggle.
Microsoft’s spellchecker was impervious to his existence. Worse, it was suggesting an alternative that was unintended fodder for racists everywhere: Osama.
“The speller is smarter than some people give it credit for, but it’s also dumber than some people give it credit for,” says Mike Calcagno, the general manager of Microsoft’s Natural Language Group and the man responsible for “fixing” the Obama/Osama problem.
“If a word is not in the dictionary, we will algorithmically suggest the nearest word in our dictionary, as long as that word is not offensive.” (Type siht or fcuk by mistake and the spellchecker won’t for a minute contemplate the idea that you were being rude.)
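The behaviour Calcagno describes can be sketched in a few lines of Python. This is only an illustration, not Microsoft's algorithm: "nearest" is assumed here to mean the smallest edit distance (counting swapped adjacent letters as one edit), and the word lists, the `suggest` function and the `max_distance` cut-off are all invented for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Count the inserts, deletes, substitutions and adjacent swaps
    needed to turn a into b (optimal string alignment distance)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a letter
                          d[i][j - 1] + 1,         # insert a letter
                          d[i - 1][j - 1] + cost)  # substitute a letter
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbours
    return d[-1][-1]


# Toy word lists, invented for illustration only.
DICTIONARY = {"osama", "the", "sit", "shift", "shit"}
OFFENSIVE = {"shit"}


def suggest(word, dictionary=DICTIONARY, offensive=OFFENSIVE, max_distance=2):
    """Return the nearest inoffensive dictionary word, or None."""
    word = word.lower()
    if word in dictionary:
        return None  # correctly spelled, nothing to suggest
    scored = [(edit_distance(word, w), w) for w in dictionary if w not in offensive]
    scored = [(dist, w) for dist, w in scored if dist <= max_distance]
    return min(scored)[1] if scored else None
```

With these toy lists, `suggest("Obama")` returns `"osama"`, one substituted letter away, while `suggest("siht")` skips the rude word and falls back on `"sit"`.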
“The word Osama is not offensive in our dictionary; it’s one of the most common names worldwide,” Calcagno says. “It just so happens to also be the first name of the world’s most famous terrorist, the most wanted man in the world.”
Still, it was massively embarrassing for Microsoft and so Obama swiftly made his dictionary debut and updates were dispatched with haste. The story would return to haunt Calcagno and his team of 50 computational linguists and software developers at Microsoft HQ in Redmond, Washington. In the second wave of “fixes”, the problem had to be corrected on other products that use the Office Spellchecker, like Hotmail, while users with older versions of Office took longer to get the updates.
“Cases of Obama and Osama kept popping up,” says Calcagno. “I don’t think there were many people out there who thought Microsoft was taking some kind of editorial decision on Obama, but that theory was certainly proffered out there in the blogosphere.”
Squiggle intolerance – that feeling of indignation when those reprimanding corrugated lines appear – will be familiar to most Microsoft Office users.
For Calcagno, though, maintenance of the spellchecker dictionary and the auto-correct and grammar check functions is all about striking the delicate balance between being helpful and being obtrusive, as well as keeping track of the fast-evolving nature of more than 100 languages, from Irish to Urdu.
The group keeps track in two ways: by studying how users customise their dictionaries – words such as “taekwondo” have been picked up this way – and by analysing the frequency with which unfound words crop up in collections of available texts, such as internet texts and the texts of the Wall Street Journal and Associated Press.
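The second of those methods, counting how often unfound words turn up in a body of text, might look something like the Python sketch below. The function names, the sample sentences and the `min_count` threshold are assumptions made for illustration; the article does not say what threshold Microsoft applies.

```python
import re
from collections import Counter


def unfound_word_counts(texts, dictionary):
    """Count every word in the texts that the dictionary does not contain."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in dictionary:
                counts[token] += 1
    return counts


def candidates_for_inclusion(texts, dictionary, min_count=3):
    """List unfound words seen at least min_count times, most frequent first."""
    counts = unfound_word_counts(texts, dictionary)
    return [word for word, n in counts.most_common() if n >= min_count]


# Invented sample corpus and dictionary.
texts = [
    "She took up taekwondo.",
    "Taekwondo classes start Monday.",
    "The cubmaster watched taekwondo.",
]
dictionary = {"she", "took", "up", "classes", "start", "monday", "the", "watched"}
print(candidates_for_inclusion(texts, dictionary))
```

In this invented sample, taekwondo appears three times and clears the bar; cubmaster appears once and does not.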
Calcagno’s group also gets direct requests, though many are turned down.
“I got a request the other day from someone who wanted us to accept cubmaster the same way we would accept scoutmaster, as a master of cub scouts. The reason cubmaster is not there is that we don’t find that term very often.”
As the arbiter of language in the computer age, Microsoft certainly has much stricter entry requirements than the Oxford English Dictionary, which in recent years has added such delights as bahookie, crunk, girlcott, anyhoo and puh-leeze into its lists.
Unlike the Oxford English Dictionary, the spellchecker has no truck with hoodies or asbos and it doesn’t look too kindly on the new yogilates trend. Individual words can provoke passionate debate.
Maura Molloy, who heads Microsoft’s 20-person-strong international language technology team at its European Development Centre in Sandyford, Dublin, says there is no such thing as an indifferent linguist. Whether a single word is in or out of the dictionary, though, is the “really easy part” of the work done by Calcagno, Molloy and the rest of the Natural Language Group.
“How do you suggest the word that they meant?
“That’s where it gets a little bit more sophisticated and we will use contextual information in order to rank candidates that you might have meant when you misspelled the word. And the ideal situation for us is where we are sure what you meant. Then we will use auto-correct.”
Ah, auto-correct. Another bête noire of Word users.
There are two main complaints about Office’s spelling and grammar functions. The first of them, as educationalists point out, is that people become overly dependent on the spellchecker, which means writers sign off on documents littered with all kinds of homonyms and homophones.
(The internet poem Candidate for a Pullet Surprise has been paying, um, tribute to this effect since the mid-1990s.)
Some of these mistakes may now be caught with the “blue squiggle”, introduced in Microsoft Office 2007. Type “a pear of shoes” into this version of Word and the word “pear” will be highlighted in blue.
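How a checker might catch a correctly spelled word in the wrong place can be sketched as follows. This is a guess at the general idea, not a description of how Word does it: the confusion sets, the word-pair counts and the `ratio` threshold are all invented, and a real system would draw its counts from a large body of text.

```python
# Words that are easily mistaken for one another (invented examples).
CONFUSION_SETS = [{"pear", "pair"}, {"their", "there"}]

# Invented counts of how often one word is followed by another.
BIGRAM_COUNTS = {
    ("pair", "of"): 900,
    ("pear", "of"): 2,
    ("pear", "tree"): 40,
}


def contextual_flags(sentence, ratio=10):
    """Flag a word when a sound-alike is far more common before the next word.

    Returns a list of (position, word_as_typed, likelier_word) tuples.
    """
    words = sentence.lower().split()
    flags = []
    for i, word in enumerate(words[:-1]):
        following = words[i + 1]
        for group in CONFUSION_SETS:
            if word not in group:
                continue
            typed = BIGRAM_COUNTS.get((word, following), 0)
            best = max(sorted(group),
                       key=lambda w: BIGRAM_COUNTS.get((w, following), 0))
            likelier = BIGRAM_COUNTS.get((best, following), 0)
            # Only flag when the alternative wins by a wide margin.
            if best != word and likelier > ratio * max(typed, 1):
                flags.append((i, word, best))
    return flags
```

On these made-up numbers, `contextual_flags("a pear of shoes")` flags "pear" and proposes "pair", while "a pear tree" passes untouched.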
The second complaint about the spellchecker is that it introduces mistakes, via auto-correct.
“I think when the auto-correct is correct, users really appreciate it, because you don’t have to interrupt your writing,” Calcagno says. “You can keep on typing. Every time you have to go back and look at a red squiggle and right click it and decide which of these words you really meant, well, you’re not typing and writing at that point.”
That’s the plus point of auto-correct, one that is perhaps taken for granted: it makes us all seem like fantastic typists. You may really have punched in “I collceted teh cinmea tcikets”, but even before you know it, the auto-correct functions will have changed it to “I collected the cinema tickets”.
The most ambitious part of the Microsoft language function though is the grammar “critiques”, as Calcagno calls them.
“We also do the green squiggles, yeah,” he says, which is a very modest way of describing the activities of a computational linguist with degrees in mathematics and computer science.
Calcagno joined the Natural Language Group as a software developer 10 years ago, at about the same point that users were grappling with Microsoft’s new “fragment (consider revising)” and “passive (consider revising)” critiques of their syntax and style.
“It’s very difficult to improve that technology,” he says. “If you improve part of the system, you tend to make other parts of the system worse, but the green squiggle is something that we’re looking hard at and we’ll try to make that better.”
Because grammar is a very personal thing, right?
“Yeah, absolutely.”
More data analysis on the critiques that people accept and the ones they simply ignore with a giant harrumph will eventually lead to a product that is “less noisy and more accurate”, he believes.
“My opinion is that the ‘fragment (consider revising)’ critique is not very well appreciated and that’s backed up by the data for how often the critique is ignored. People have the ability to turn the critique off, you know, but obviously we shouldn’t have turned it on in the first place if it’s something that causes that much dissatisfaction among our users.
“We have grown better over time at making the default critiques correspond to actual errors that people would usually want corrected, like repeated words or obvious subject-verb disagreement errors,” he says.
In other words, mistakes that people makes makes from time to time (okay, that last one was on purpose).
Incidentally, The Irish Times would like to point out that all errors in the above article are deliberate and not, in fact, ironic.
Squiggle intolerance
FRIENDED
Issue: Friended is a good example of the evolution of the way in which we use language. Friend used to be only a noun but thanks to the friend functions on MySpace (no squiggle) and Facebook (squiggled, with Face book suggested), you can now friend someone as a verb, which means people now use words such as friended and friending, plus defriended, defriending (or, if you prefer, unfriended and unfriending).
Status: Still underlined (on my Word 2007 anyway), but judging from Calcagno's remarks, it sounds like that won't always be the case.
CALENDER
Issue: This is an example of what Microsoft calls the "masking problem". Calender with an –er is a word. A calender is a series of rollers used to smooth out paper in the papermaking process. (With thanks to Wikipedia for that explanation.)
But calender is also what Calcagno calls a “classic misspelling of the grade school spelling bee” for calendar. “So many people mean calendar with an –ar that it would be . . . I don’t want to say socially irresponsible . . . but it wouldn’t be the best choice not to squiggle it.”
Status: Despite being a bona fide word, calender gets the red squiggle.
SOYFOOD
Issue: One of the most common types of requests for dictionary inclusion comes from people in a particular industry who want words that are usually hyphenated or spelled apart spelled as one word. "The other day someone from the soy foods industry wrote in and asked us to accept soyfood as a single word," says Calcagno.
Status: Remains outside the dictionary. To include soyfood would open the floodgates to all manner of industry-speak. Soyfood industry types will have to lump it or satisfy themselves with adding it to their custom dictionaries.
NAMA
Issue: There is no escape these days from the National Asset Management Agency (Nama), in Ireland at least, but naturally the big red Microsoft marker will underline it, suggesting Name, Mama or Napa, the Californian wine valley where property prices are presumably holding up better than here.
NAMA in capital letters is fine, but The Irish Times style guide says that acronyms should be lower-cased. Nasa, which is how The Irish Times would refer to the US space agency of somewhat longer standing than Nama, is also squiggled, as it’s more frequently referred to as NASA.
Status: Destined to remain forever squiggled, unless there's a President Nama in our future.
TEH
Issue: There's no issue . . . yet. "Teh" to "the" is the most common auto-correct function performed by the Spellchecker. Type fast enough and you may not realise that this is even happening, but this could all change.
“If ‘teh’ ever becomes a common word in English, that will no longer work and we will have a major software update to do at that point,” says Calcagno, paling at the thought. “But right now that’s a typo that will just correct behind the scenes.”
Status: Always auto-corrected.