Don't discard these wrappers

The Internet is a great way to spend a few hours perusing the latest Monica Lewinsky fan clubs. But sometimes we'd like a concise answer to a simple question. A famished tourist, for example, might want a map showing the Italian restaurant nearest the Eiffel Tower.

Simple enough. But keyword-based search will never be able to answer such questions; at best we'll retrieve maps of Paris or Italian restaurant reviews.

All the more frustrating is the realisation that the Internet contains all the information needed to answer the request. Various Internet sites list restaurant guides, telephone directories, maps, tourist attractions - surely there's a way to automatically merge this information to give us what we want.

The field of "Information Integration" has addressed these issues for several years. Ideally, we'd like a system that takes questions from the user, merges relevant information from across the Internet, and produces a customised report.

To see an information integration system (IIS) in action, try out Ariadne (www.isi.edu/ariadne/demo), developed by researchers at the Information Sciences Institute in California. Ariadne's "Restaurant Locator" application, for example, helps you find restaurants in Los Angeles. Tell Ariadne the cuisine you prefer, how much you want to spend, etc. and the system displays a map showing where to go. To generate the map, Ariadne draws on three Internet sites: restaurant details from the Zagat tourist guide (www.pathfinder.com/travel/zagat); latitude/longitude calculations from the ETAK company (www.etak.com); and maps from the US Census Bureau (tiger.census.gov).

How do Ariadne and similar IIS's work? The idea is to conjure up the illusion that the Internet is nothing more than an enormous database containing millions of potentially relevant facts. The IIS stitches this "virtual" database together on the fly.
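As a rough picture of that stitching, the sketch below chains three stand-in sources to answer one question. Everything in it is an assumption made for illustration: the function names, the restaurants and the coordinates are invented, and a real IIS would fetch each answer from a live site rather than from tables in memory.

```python
# Three stand-ins for wrapped Internet sites. All names and data are made up.
RESTAURANTS = [
    {"name": "Luigi", "cuisine": "Italian", "price": 20, "address": "12 Main St"},
    {"name": "Casa Roma", "cuisine": "Italian", "price": 45, "address": "9 Oak Ave"},
    {"name": "Sakura", "cuisine": "Japanese", "price": 25, "address": "40 Elm Rd"},
]
COORDINATES = {
    "12 Main St": (34.05, -118.25),
    "9 Oak Ave": (34.10, -118.30),
    "40 Elm Rd": (34.00, -118.40),
}

def find_restaurants(cuisine, max_price):
    """Stand-in for a restaurant-guide wrapper."""
    return [r for r in RESTAURANTS
            if r["cuisine"] == cuisine and r["price"] <= max_price]

def geocode(address):
    """Stand-in for a latitude/longitude service."""
    return COORDINATES[address]

def draw_map(points):
    """Stand-in for a map service: describes the markers instead of drawing them."""
    return ["marker at %.2f, %.2f for %s" % (lat, lon, name)
            for name, (lat, lon) in points]

def restaurant_locator(cuisine, max_price):
    """Answer one user question by chaining the three sources on the fly."""
    matches = find_restaurants(cuisine, max_price)
    points = [(r["name"], geocode(r["address"])) for r in matches]
    return draw_map(points)

report = restaurant_locator("Italian", 30)
```

The user sees only the final report; the joins between the guide, the geocoder and the map happen behind the scenes, as if all three were tables in one database.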

How does an IIS manage to interpret Web sites as if they were databases?

Even a state-of-the-art IIS can handle only sites that are rather rigidly formatted. Inspect Zagat's pages, for example, and you'll notice that each restaurant is described exactly the same way: the name of the restaurant is rendered with a large font, the address is in italics, and so forth.

Because the formatting is constant, it's straightforward to write a program that extracts the relevant information from a Zagat page, while discarding irrelevant junk such as HTML tags and advertisements. These little programs are called "wrappers", because they "wrap" an Internet site, giving it a uniform appearance from the IIS's point of view.
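A minimal sketch of such a wrapper might look like the following. The HTML is invented for the example (it is not Zagat's actual markup), and the names SAMPLE_PAGE, RECORD and wrap are hypothetical; the only point is that constant formatting lets one fixed pattern pull out every record.

```python
import re

# Made-up page: every restaurant follows the same pattern, with the name in a
# large font and the address in italics. This is not Zagat's actual HTML.
SAMPLE_PAGE = """
<html><body>
<img src="advert.gif">
<font size="+2">Luigi</font> <i>12 Main St</i><br>
<font size="+2">Casa Roma</font> <i>9 Oak Ave</i><br>
</body></html>
"""

# The wrapper's entire knowledge of the site lives in this one pattern.
RECORD = re.compile(r'<font size="\+2">(.*?)</font>\s*<i>(.*?)</i>')

def wrap(page):
    """Return the restaurants on a page as uniform, database-like rows."""
    return [{"name": name, "address": address}
            for name, address in RECORD.findall(page)]

rows = wrap(SAMPLE_PAGE)
```

The tags and the advertisement are thrown away, and what remains is a list of name and address rows that the IIS can treat like any other database table.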

But now we run into a big problem. Programming the Zagat wrapper is easy. But the Internet contains dozens of sites listing restaurant information, and a special-purpose wrapper must be written for each. That's just for restaurant information, one tiny piece of the puzzle. A realistic IIS might need wrappers for at least hundreds of Internet sites.

Worst of all, Internet businesses are constantly adjusting their sites to make them more user-friendly or attractive, which often breaks wrappers. If the Zagat wrapper expects italicised addresses, but Zagat suddenly switches to bold, the wrapper won't work anymore. So not only does an IIS require many wrappers, but, like an online vegetable garden, the wrappers require constant nurturing and maintenance.
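The failure is easy to reproduce. In this small illustration the pattern and both page fragments are invented; the redesigned page carries exactly the same address, but the hand-written rule no longer finds it.

```python
import re

# Hypothetical wrapper rule: the address is whatever sits between <i> and </i>.
ADDRESS = re.compile(r"<i>(.*?)</i>")

old_page = "<b>Luigi</b> <i>12 Main St</i>"
new_page = "<b>Luigi</b> <b>12 Main St</b>"  # same content, redesigned in bold

old_rows = ADDRESS.findall(old_page)  # the wrapper finds the address
new_rows = ADDRESS.findall(new_page)  # the wrapper silently finds nothing
```

Note that nothing crashes: the wrapper simply returns an empty result, so the breakage can go unnoticed until someone asks why the restaurants have vanished.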

This wrapper bottleneck is a classic example of the "scaling problem" in computer science. If you design a system so that some external task (changing the wrapper programs) must be performed repeatedly, then you won't be able to "scale up" your system for more complex tasks. For an IIS, the solution is to generate wrappers automatically. If we can eliminate the need for frequent re-programming, we can build larger and more robust IIS's.

IIS researchers have investigated this "wrapper construction" problem and have come up with a possible solution: the user firstly gives the wrapper-construction system a series of sample pages from the Internet site in question - for example, the user might fetch from Zagat the pages describing restaurants in Los Angeles, Detroit, Philadelphia and Atlanta.

Secondly, the user tells the system the relevant information on each page - restaurant names and addresses, price and quality details, and so forth. Finally, the system takes over. Inspecting the documents, it automatically generates rules for extracting the desired information. Instead of having a programmer determine that "italics means address", the wrapper-construction system handles these tedious details.
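One simple way to generate such rules, assumed here purely for illustration and much cruder than a real wrapper-construction system, is to look at the text surrounding each labelled value and keep whatever is common to all of them. In the sketch below the pages, the labels and the function names learn_rule and extract are all invented, and each labelled value is assumed to appear only once on its page.

```python
import os
import re

def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(strings)

def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_rule(examples):
    """Learn a (left, right) delimiter pair from labelled sample pages.

    examples is a list of (page, values) pairs, where values are the strings
    the user marked as relevant on that page.
    """
    befores, afters = [], []
    for page, values in examples:
        for value in values:
            start = page.index(value)  # assumes the value occurs once
            befores.append(page[:start])
            afters.append(page[start + len(value):])
    return common_suffix(befores), common_prefix(afters)

def extract(page, rule):
    """Apply a learned rule to a page the system has never seen."""
    left, right = rule
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), page)

# Made-up sample pages with the addresses labelled by the user.
page_a = "<b>Luigi</b> <i>12 Main St</i><br><b>Casa Roma</b> <i>9 Oak Ave</i><br>"
page_b = "<b>Bella Notte</b> <i>40 Elm Rd</i><br>"
address_rule = learn_rule([
    (page_a, ["12 Main St", "9 Oak Ave"]),
    (page_b, ["40 Elm Rd"]),
])

unseen = "<b>Trattoria Uno</b> <i>7 Pine St</i><br><b>Il Forno</b> <i>3 Lake Dr</i><br>"
addresses = extract(unseen, address_rule)
```

Nobody told the program that italics means address; it worked out the delimiters from the examples. If the site later switched to bold, relabelling a few fresh pages and calling learn_rule again would produce a new rule without any re-programming.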

As far as our hypothetical Parisian tourist is concerned, an IIS can barely scratch the surface. Many thorny issues have been ignored. An IIS usually treats all information as equally trustworthy - but would you prefer information about Mercedes automobiles from www.mercedes.com or from www.bmw.com? Even trustworthy information may become out-of-date.

Finally, IIS's are essentially parasites, exploiting the sites from which they gather information. Some parasites are beneficial to the host. But a site providing free restaurant listings might not want to have its content fetched by an IIS, because users would miss out on the advertisements that pay the site's bills.

Nicholas Kushmerick: nick@ucd.ie and www.cs.ucd.ie/staff/nick/itr