What kind of performance do you think Internet search engines must deliver in the future?
Weikum: In the future, search engines must be able to search a gigantic variety of collected data. After all, databases are growing at breathtaking speed in all areas – in the business world, the academic world, and in day-to-day life. The search for information should be as intelligent as possible. It should produce a list of hits that a human expert would also regard as the best possible answer. And, despite the enormous quantities of data, it should be fast; it should have the same speed that we’re used to from Google.
What kinds of queries push today’s technologies to the limit and why?
Weikum: Google and other search engines can’t be beaten when it comes to simple queries. Their weakness lies in complex queries: those that can’t be expressed with one or two keywords or that produce only a few good hits. For example, when IT experts look for tips on a specific software problem or when scientists search for the newest findings and specialized literature, Google offers only limited help right now. Success often depends on a lucky choice of keywords. And users find that a search on Google can take them close to a good site, but they have to click through several additional sites until they find the desired information.
How can search engines become fit for complex queries?
Weikum: Search engines need more context. They need something like an interest and experience profile of the user, and they also need a profile of the data. The profiles must have a clear structure, a notation system, and relationships between related terms. Such relationships would capture, for example, that witches in literature are typically women or that Scotland is part of Great Britain. Computers can be taught this kind of context awareness and background knowledge, which can lead to a search engine capable of giving a satisfactory answer to a tricky query.
You’re developing software that will enable a new type of Internet search engine. What’s new about it, and what makes it special?
Weikum: We’re combining ontologies with methods from statistical learning and forms of knowledge representation from the area of artificial intelligence. We’re also using search algorithms from the area of databases. As part of software development, we first worked with techniques that accelerate searching XML data. XML data has very informative markup, such as concept-value pairs like “location = Berlin” or “person = Lady Macbeth.” Unlike a completely structured database, this data doesn’t possess a uniform schema for names and types because cross-document and content-oriented specifications are also possible. Searching such data is thus much more expressive, but it also involves much more effort.
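To make the idea of concept-value pairs concrete, here is a minimal sketch in Python. It assumes that element names play the role of concepts and element text the role of values; the sample document, its nesting, and the function name `find_pairs` are hypothetical and are not taken from the software discussed in the interview. The point it illustrates is the missing uniform schema: a “person” can appear at any depth, so the search has to look everywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The same concept ("person") appears at
# different depths because there is no uniform schema.
SAMPLE = """
<play>
  <scene>
    <location>Berlin</location>
    <person>Lady Macbeth</person>
  </scene>
  <act>
    <speaker><person>Third Witch</person></speaker>
  </act>
</play>
"""


def find_pairs(xml_text, concept, value=None):
    """Return the text of every element named `concept`, at any depth.

    If `value` is given, keep only elements whose text equals it exactly.
    """
    root = ET.fromstring(xml_text.strip())
    hits = []
    for elem in root.iter(concept):  # document-order traversal of all levels
        text = (elem.text or "").strip()
        if value is None or text == value:
            hits.append(text)
    return hits


print(find_pairs(SAMPLE, "person"))
print(find_pairs(SAMPLE, "location", "Berlin"))
```

This exact-match lookup is only the baseline; it finds nothing for a query such as “person = woman,” which is where background knowledge comes in.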
Using a powerful ontology adds value when compared to a traditional search engine. An ontology is a collection of concepts that models semantic relationships. The most important types of relationships are synonymy (same or closely related meaning), hypernymy (generalization), and hyponymy (specialization). Consider the following example. “Lady” is a hyponym of “woman.” If we use a similarity operator to search for “person = woman,” we find terms that are closely related ontologically, such as “Lady Macbeth” or “the third witch.” As a database of background knowledge linked to the search engine, the ontology bridges the gap between the user’s terminology and the terminology of the data being searched.
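A similarity operator of this kind could be sketched as follows. This is an illustration under stated assumptions, not the operator used in the software: the toy ontology `HYPERNYM`, the annotated values in `DATA`, the decay factor, and the score threshold are all hypothetical. Each step up the hypernym chain halves the score, so closer specializations rank higher.

```python
# Hypothetical toy ontology: each term maps to its direct hypernym.
HYPERNYM = {
    "lady": "woman",
    "witch": "woman",
    "woman": "person",
    "man": "person",
}

# Hypothetical annotated values, each tagged with the ontology term it belongs to.
DATA = [
    ("Lady Macbeth", "lady"),
    ("the third witch", "witch"),
    ("Macbeth", "man"),
]


def similarity(term, target, decay=0.5):
    """Score how closely `term` specializes `target`.

    Returns 1.0 for the same term, multiplied by `decay` for every hypernym
    step needed to reach `target`, and 0.0 if `target` is never reached.
    """
    score = 1.0
    while term is not None:
        if term == target:
            return score
        term = HYPERNYM.get(term)  # climb one level toward the root
        score *= decay
    return 0.0


def similar_search(target, min_score=0.25):
    """Return (value, score) pairs ontologically close to `target`, best first."""
    hits = [(value, similarity(concept, target)) for value, concept in DATA]
    return sorted((h for h in hits if h[1] >= min_score), key=lambda h: -h[1])


print(similar_search("woman"))
```

With this toy data, a search for “woman” returns Lady Macbeth and the third witch but not Macbeth, because “man” never reaches “woman” on its way up the hierarchy.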
The same techniques can be used for Web data so that Internet and intranet searching form a single unit. Searching also benefits from generating concept-value pairs as annotations to the HTML pages prevalent on the Web. Here we use heuristics and tools that recognize proper names: named-entity recognition tools. We can use them to mark important captions, persons, or locations. We’ve also developed a focused crawler that uses a trained classifier to automatically organize the Web sites it finds into a personal hierarchy of topics. We use support vector machines (SVMs) for classification. They sort data by recognizing patterns that they have learned. For focused crawling, the classifier learns from examples. After the learning phase, we start the focused crawl with a few good Web sites and from there find appropriate content on the Web.
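The interplay of classifier and crawler can be sketched roughly as follows. Everything here is an assumption made for illustration: the in-memory `WEB` dictionary stands in for real pages, the training texts are invented, and the small pure-Python trainer (hinge loss with subgradient descent) is a stand-in for whatever SVM implementation the actual crawler uses. The rule that links are followed only from pages classified as on-topic is likewise one possible design, not a description of the real system.

```python
import heapq
from collections import Counter


def features(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def score(model, text):
    """Linear decision value; positive means on-topic."""
    weights, bias = model
    return sum(weights[t] * n for t, n in features(text).items()) + bias


def train_linear_svm(examples, epochs=50, lr=0.1, reg=0.01):
    """Fit a linear SVM (hinge loss plus L2 penalty) by subgradient descent.

    `examples` is a list of (text, label) pairs with label +1 or -1.
    """
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for text, label in examples:
            margin = label * score((weights, bias), text)
            for term in list(weights):      # L2 shrinkage
                weights[term] *= 1 - lr * reg
            if margin < 1:                  # inside the margin: hinge update
                for term, n in features(text).items():
                    weights[term] += lr * label * n
                bias += lr * label
    return weights, bias


def focused_crawl(web, seeds, model, limit=10):
    """Visit pages best-first and expand links only from on-topic pages."""
    frontier = [(0.0, url) for url in seeds]   # (negated priority, url)
    heapq.heapify(frontier)
    seen, accepted = set(seeds), []
    while frontier and len(accepted) < limit:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        s = score(model, text)
        if s <= 0:
            continue                           # off-topic: do not follow links
        accepted.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-s, link))
    return accepted


# Hypothetical training examples: +1 is on-topic, -1 is off-topic.
TRAIN = [
    ("search engine query index", 1),
    ("xml search ranking", 1),
    ("football goal match", -1),
    ("recipe cooking dinner", -1),
]

# Hypothetical miniature Web: page name -> (page text, outgoing links).
WEB = {
    "seed": ("xml search index query", ["sports", "xmlsearch"]),
    "sports": ("football match goal", ["engines"]),
    "xmlsearch": ("ontology query search ranking", ["engines", "cooking"]),
    "engines": ("search engine index crawler", []),
    "cooking": ("dinner recipe", []),
}

MODEL = train_linear_svm(TRAIN)
print(focused_crawl(WEB, ["seed"], MODEL))
```

Links inherit the score of the page that points to them as their priority, so pages reached from strongly on-topic sites are visited first.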
How is an ontology-supported, concept-based search better than previous approaches to the Semantic Web, which so far have shown only a limited ability to function?
Weikum: Most advocates of the Semantic Web attempt to represent all data, metadata, and ontologies with purely logic-based formalisms. That works for clearly delineated application areas. But at the scale of the Web or very large intranets, I see a need to deal with contradictory and ambiguous terminology and terminology networks at the data and ontology level. By using statistical methods, we can separate strongly correlated, frequently used pairs (such as “woman” and “lady”) from more exotic terms so that a search for a woman avoids inappropriate keywords like “matriarch” or “femme fatale.”
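One simple way to picture this statistical filtering is a co-occurrence measure over a document collection. The sketch below uses the Dice coefficient, which is an assumption chosen for illustration; the interview does not say which statistic is actually used. The mini-corpus, the candidate list, and the threshold are hypothetical as well.

```python
# Hypothetical mini-corpus; each string stands for one document.
CORPUS = [
    "the lady spoke to the woman at court",
    "a woman and a lady walked by",
    "the lady is a woman of rank",
    "the matriarch ruled the clan",
    "a woman entered the room",
]

# Hypothetical ontology neighbors of "woman" before statistical filtering.
CANDIDATES = ["lady", "matriarch", "femme fatale"]


def contains(doc, term):
    """Whole-word (or whole-phrase) match, so 'man' does not match 'woman'."""
    return f" {term} " in f" {doc} "


def dice(term_a, term_b, corpus):
    """Dice coefficient over documents: 2 * |A and B| / (|A| + |B|)."""
    in_a = sum(contains(doc, term_a) for doc in corpus)
    in_b = sum(contains(doc, term_b) for doc in corpus)
    both = sum(contains(doc, term_a) and contains(doc, term_b) for doc in corpus)
    return 2 * both / (in_a + in_b) if in_a + in_b else 0.0


def expand(term, candidates, corpus, threshold=0.5):
    """Keep only the candidates that co-occur strongly with `term`."""
    return [c for c in candidates if dice(term, c, corpus) >= threshold]


print(expand("woman", CANDIDATES, CORPUS))
```

In this toy corpus, “lady” appears together with “woman” in three documents and survives the filter, while “matriarch” and “femme fatale” never co-occur with it and are dropped from the query expansion.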
Who’s interested in this software? Wouldn’t customers rather buy a completed search engine right away?
Weikum: Search engine technology should be embedded in the infrastructure of each application. Examples would include embedding the technology in a digital database for scientists, a consulting center for medical personnel, or a business intelligence application so that managers are better prepared to make important business decisions. For the Web user at home, we imagine an intelligent search assistant: a software solution that can be installed on every PC. The software should contain a complete search engine that traverses the Internet at night for data, analyzes it, and sets up a local index of it.
The advantage over traditional, central search engines like Google would be that the local search engine can be tailored to individual users. Most of a user’s queries would then be answered locally by the user’s own search engine. If the user isn’t satisfied with the answer, the assistant contacts other search engines of equal standing. This approach produces a peer-to-peer system for collaborative searching. Working together in this kind of information network, users could take advantage of everyone’s training input. Examples include the bookmark collections of millions of users or the relevance feedback that users provide implicitly with their clicks or explicitly by rating Web sites. This approach significantly improves the quality of the search results, especially when topically related peers merge dynamically and organize themselves. It also averts the danger of de facto monopolies of large search engines and leads to a democratization of searching for information.
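The local-first behavior with peer fallback might look like the following sketch. The `Peer` class, its method names, the peer names, and the example.org URLs are hypothetical, and the `min_hits` threshold is only a crude stand-in for “the user isn’t satisfied with the answer.” A real system would also need peer discovery and result ranking, which are left out here.

```python
class Peer:
    """A hypothetical personal search engine holding its own small index."""

    def __init__(self, name, index):
        self.name = name
        self.index = index      # maps keyword -> list of URLs
        self.neighbors = []     # other peers this one may contact

    def local_search(self, keyword):
        """Look up the keyword in this peer's own index only."""
        return list(self.index.get(keyword, []))

    def search(self, keyword, min_hits=1):
        """Answer locally when possible; otherwise ask neighboring peers.

        Returns (hits, asked), where `asked` lists the peers consulted.
        """
        hits = self.local_search(keyword)
        asked = [self.name]
        for peer in self.neighbors:
            if len(hits) >= min_hits:
                break                       # good enough, stop asking
            asked.append(peer.name)
            for url in peer.local_search(keyword):
                if url not in hits:         # merge without duplicates
                    hits.append(url)
        return hits, asked


alice = Peer("alice", {"xml": ["https://example.org/xml-search"]})
bob = Peer("bob", {"svm": ["https://example.org/svm-intro"]})
alice.neighbors.append(bob)
bob.neighbors.append(alice)

print(alice.search("xml"))   # answered from the local index
print(alice.search("svm"))   # falls back to the neighboring peer
```

Because every peer exposes the same two operations, no node is privileged, which is the structural property behind the equal standing of the participating search engines.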