Google project a winner for DCU team

A pioneering venture developing web technology for searching handwritten texts has attracted investment from Google, writes Gabrielle Monaghan

The Book of Kells, George Washington's personal diaries and other rare historical documents may soon be available on the web, thanks to research Dublin City University is carrying out for Google.

Up to now, these kinds of documents were kept behind closed doors or were only accessible by searching digital libraries one page at a time. DCU's adaptive information cluster (AIC) has linked up with two American universities to change all that, in a rare outsourcing of research by Google.

DCU researchers have designed algorithms that can detect the differences in shapes of handwritten words, a technique that could make handwritten manuscripts available on Google within a few years.

"Just as your DNA is unique, so too is your handwriting," says Alan Smeaton, professor of computing at DCU and one of two leaders of the Google project.

"When you get Christmas cards, you can recognise the handwriting. That's because of the shape of the words. We can take a handwritten word, digitise it by representing it based on its shape, and match one shape of a word with another," Prof Smeaton says.

"There's a huge volume of handwritten articles all over the world that are digitised, but we can't search through them."

Last November, the Library of Congress in Washington DC said it would create a world digital library, an online collection of manuscripts, rare books and other materials that will be freely available for viewing by anyone with internet access.

Google gave the library, which is the world's largest, $3 million (€2.5 million) towards the project.

Google, owner of the world's biggest search engine, has a library digitisation project known as Google Book Search that aims to put the entire contents of the libraries of the world's most prominent universities, including Harvard and Oxford, on the internet.

When book publishers rallied against Google's book scanning project last year, they accused the technology giant of stealing. In a lawsuit filed in a New York federal court, the publishers claimed that if Google made digital copies of library books available online for search purposes, the company would be committing massive copyright infringement.

DCU's technique could, for instance, enable an academic interested in Washington's view on the death penalty to search his personal diaries and find out how he felt about the US shooting deserters from the army, Prof Smeaton says.

The same search would help a 12-year-old school child with a history project, he adds. "If you search the word 'death' in the Washington manuscript, our technology will find all occurrences of the word."
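The idea of matching one word shape against another can be illustrated with a minimal sketch. It is not DCU's patented method, whose details the article does not give. It assumes a toy setting in which each word is a tiny grid of ink ('#') and blank ('.') cells, reduces each word to the amount of ink in each column, and compares two words with dynamic time warping, a standard technique that tolerates one writer stretching a word more than another. The function names, the sample grids and the threshold are all hypothetical.

```python
from typing import List, Sequence


def column_profile(image: Sequence[str]) -> List[int]:
    """Count ink cells ('#') in each column of a tiny binary word image."""
    width = len(image[0])
    return [sum(1 for row in image if row[x] == "#") for x in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two profiles (lower means more alike)."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def find_matches(query: Sequence[str], words: Sequence[Sequence[str]],
                 threshold: float) -> List[int]:
    """Return the positions of words whose shape is within `threshold` of the query."""
    q = column_profile(query)
    return [i for i, w in enumerate(words)
            if dtw_distance(q, column_profile(w)) <= threshold]


# Hypothetical data: a query word, the same shape written wider, and a different shape.
query = ["#..#",
         "####",
         "#..#"]
stretched = ["#...#",
             "#####",
             "#...#"]
different = ["####",
             "#...",
             "####"]

page_words = [stretched, different]
matches = find_matches(query, page_words, threshold=1.0)
print(matches)
```

Here the wider copy of the query word is found and the differently shaped word is not. A real manuscript would first need its pages cut into individual word images, and would call for far richer shape features than a single ink count per column.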

Enterprise Ireland funded research into this technique, which DCU originally developed to analyse pictures and videos and has since patented.

Prof Smeaton then met US academics, who suggested DCU apply its technique to handwriting in manuscripts, such as Washington's diaries.

Following this, the university approached Google with a proposal to work on the technique for the company's search index, in co-operation with the University of Buffalo and the University of Massachusetts at Amherst.

Google, founded by Larry Page and Sergey Brin in a Stanford University dorm room, typically buys companies that have developed technologies it wants, such as tools for managing personal photographs or mapping satellite images, rather than outsourcing research, Prof Smeaton points out.

When it does outsource activities to research groups, it usually hires the researchers themselves.

"We were more difficult to hire because we are scattered across different regions," the professor says. Instead, Google gave DCU's AIC an undisclosed sum to fund a team of up to five research staff to work on the project for a year.

AIC was set up two years ago and is funded by Science Foundation Ireland.

The research group comprises leading researchers from DCU and UCD working in sensor science, software engineering, electronic engineering, and computer science. The group works closely with industry and state bodies to develop applications for this research.

DCU is also involved with the Dublin Institute for Advanced Studies in a project that is digitising old manuscripts written in Irish.

The Irish Script on Screen programme has scanned thousands of images to make them searchable.

The system is based on methods for detecting and identifying images of people, cars or other objects in different video frames, applied to the differing slants and shapes of words in handwriting.

The group will present its findings to Google in a year. If the project is successful with large manuscripts, Google will apply the research to a search engine such as www.scholar.google.com, Prof Smeaton says.

"They might, in future, provide a search engine solely for old manuscripts or museum articles," he says.

"Most people are familiar with Google's search web page, but they have a dozen or more search engines, such as for images or news."

The research group now has to choose which material it will use to develop the handwriting technique and then seek permission to use it. Media coverage of DCU's research for Google has led to offers of material from all over the world.

"I've been inundated with cold calls from people who have online material that they are interested in us using," Prof Smeaton says.

"It has varied from the curator of manuscripts for the witches of Salem to more local, smaller museums. Someone also offered the births and deaths register for Co Wicklow for a period in the 17th and 18th centuries."

Google's funding of the project has focused greater attention on university research in Ireland, he says. "It's a great brand name to have," Prof Smeaton concludes. "Google gets you attention."