Exploring the Web: Beyond document search

With the prevalence of the Web, we are living in an age of information explosion. Today we rely on Internet search engines such as Google or Bing to find relevant documents among the trillions or quadrillions of HTML documents on the Web.

While such documents are easily accessible, digging out the exact information and knowledge we are looking for can take a great deal of time. For instance, we may want to find a smartphone with good battery life, plan a July trip to a top-ranked beach in America, identify a relevant conference to which to submit a manuscript, or find an expert with whom we’d like to collaborate.

Internet search engines cannot provide good answers to such queries because it is very hard to “understand” HTML documents and to correlate multiple documents to produce the desired information. Beyond these easily searchable documents, however, an enormous amount of valuable information is stored in “web databases,” which cover a wide range of topics including products, travel arrangements, publications, financial and public records, science, and government. The rich “metadata” in web databases offers great potential to provide users with information and knowledge that goes beyond documents.

The mission of the School of Computing, Informatics, and Decision Systems Engineering (CIDSE) Associate Professor Yi Chen and her group is to enable web users to easily retrieve not just documents, but also the information and knowledge they want from the Web, using simple keyword-based queries.

This mission demands solutions to many technical challenges.

For instance, while the valuable information in databases could improve search quality, databases generally do not support keyword-based queries.

Additionally, how can we leverage the information in documents together with the information in databases? How should we extract valuable knowledge, such as objects and their relationships, from documents?

Furthermore, how can we handle the wide variation in presentation format and information quality across different sources? Besides searching data, how can we support “social search,” i.e., finding relevant people on the Web who can answer one’s questions?

This research has been funded by several grants from the National Science Foundation, including a CAREER award, as well as by the Science Foundation of Arizona and an IBM faculty award. These grants will enable this ambitious work, which promises to be “tremendously useful to everyone because search engines can no longer be restricted to a subset of information available in HTML documents,” explains Chen.