Bharath ruminates: Nuggetize: Faceted search for the web through dynamic categorization

The web has immense data on any topic. Traditionally, search engines return only lists of links. It's up to the user to open those links and look for information, which causes search fatigue. One approach to search retrieval, called faceted search, aims to reduce this pain.

What is faceted search?

Also called faceted navigation, exploratory search, faceted browsing, guided navigation and sometimes parametric search, it refers to a search engine showing its results from multiple points of view.

Faceted search is already wildly popular on eCommerce sites, thanks to companies like Endeca and open source implementations like Solr, Sphinx and Lucene. Try a search on Amazon for "baby", and you'll notice the search results neatly classified into Baby, Clothing & Accessories, Toys & Games, Health & Personal Care, Books, etc.

This is so intuitive and useful that users hardly made any noise about the feature. They just embraced it. eCommerce sites have been very happy, since it helped users reach products much better in the absence of search ranking systems like PageRank. It also contributed to a better browsing experience, and exposed more of the store's products to users.
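To make the idea concrete, here is a minimal sketch of facet counting over structured product metadata, in the spirit of the Amazon example above. The catalog, the field names and the substring "search" are all made up for illustration; this is not how Amazon, Endeca or Solr implement it.

```python
from collections import Counter

# Hypothetical mini-catalog; the field names and values are invented for illustration.
CATALOG = [
    {"title": "Baby onesie", "department": "Clothing & Accessories"},
    {"title": "Baby rattle", "department": "Toys & Games"},
    {"title": "Baby shampoo", "department": "Health & Personal Care"},
    {"title": "Baby names book", "department": "Books"},
    {"title": "Baby socks", "department": "Clothing & Accessories"},
    {"title": "Garden hose", "department": "Home & Garden"},
]


def search(catalog, query):
    """Return the products whose title contains the query (case-insensitive)."""
    q = query.lower()
    return [p for p in catalog if q in p["title"].lower()]


def facet_counts(results, field):
    """Count how many results fall under each value of a metadata field."""
    return Counter(p[field] for p in results)


results = search(CATALOG, "baby")
facets = facet_counts(results, "department")
for department, count in facets.most_common():
    print(f"{department} ({count})")
```

The point to notice is that the facets fall out of a metadata field that already exists on every product. That is exactly what the open web lacks.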

Wildly successful in eCommerce, faceted search has hit hurdles in general web search. Traditional algorithms exploit the structured metadata available in product catalogs to provide good results. On the web, where unstructured data rules, it is a challenge to present the available information and pages as well-organized categories. There have been several efforts toward this, using different approaches. I'll use two sample queries, michael jackson and hurricane proof housing, to compare the results and speculate on the approaches taken by the different products.

Clusty, probably one of the first on the web to concentrate on exploratory search, organizes the pages in its search results by clustering the content inside them. They had the hard problem of naming a cluster, and mined for keywords to name each one appropriately.
Kosmix, a hyper-aggregator for the web, classifies and organizes content around popular queries. Their recall is quite low, and performance drops when the user tries tail queries.
Cuil uses Wikipedia for extensive query analysis, matches other concepts related to the query to content available in the search results, and presents facets. Since they start with the query, and not the content, they work well when the query alternatives match the concepts available in the content. If the match is not strong, there's topic drift. There also seem to be issues in allotting weight between query terms.
Nuggetize, a learning and discovery engine, presents the information available on the web for a topic as nuggets. It uses Wikipedia's ontology to classify nuggets into categories dynamically. Nuggetize's facets lead to nuggets, and not pages, since there can be facts along several dimensions in a given document. The dynamic categorization is also driven by the content and not the query.

Here are comparative results. You are invited to try more among these, or suggest other faceted search products on the web.

Michael Jackson:

i) Clusty:

ii) Cuil:

iii) Kosmix:

iv) Nuggetize:

Hurricane proof housing:

i) Clusty:

ii) Cuil:

iii) Kosmix: (no results)

iv) Nuggetize:

Do you find the results from Nuggetize more contextual and appropriate? The main reason is that Nuggetize relies on the page content, and maps the concepts present in the pages onto the Wikipedia ontology. From there, a proprietary tree aggregation algorithm figures out the most relevant and informative categorization to display. The dynamic categorization does not stop there. You can click on any category or topic, or even run a search within these nuggets, and see the categories change!
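Nuggetize's actual tree aggregation algorithm is proprietary, so what follows is only a rough sketch of the general idea under my own assumptions: roll nugget counts up a category tree, keep the categories that cover a useful share of the nuggets, and recompute them on whatever subset the user drills into. The toy ontology, the nugget tags, the function names and the min_share/max_share thresholds are all hypothetical.

```python
from collections import defaultdict

# Toy ontology: child category -> parent category (None marks the root).
# These names are invented for illustration; this is not Wikipedia's real category graph.
PARENT = {
    "Hurricanes": "Weather hazards",
    "Floods": "Weather hazards",
    "Weather hazards": "Topics",
    "Roofing": "Design",
    "Concrete domes": "Design",
    "Design": "Topics",
    "Topics": None,
}

# Each nugget (a snippet of page content) is tagged with the concepts found in it.
NUGGETS = {
    "n1": {"Hurricanes", "Roofing"},
    "n2": {"Hurricanes", "Concrete domes"},
    "n3": {"Floods"},
    "n4": {"Roofing"},
}


def ancestors(category):
    """Yield the category itself and every ancestor up to the root."""
    while category is not None:
        yield category
        category = PARENT[category]


def rollup(nuggets):
    """Map each category to the set of nugget ids that fall under it."""
    coverage = defaultdict(set)
    for nugget_id, concepts in nuggets.items():
        for concept in concepts:
            for category in ancestors(concept):
                coverage[category].add(nugget_id)
    return coverage


def pick_categories(nuggets, min_share=0.5, max_share=0.9):
    """Keep categories covering a useful share of nuggets: not too rare, not catch-all."""
    coverage = rollup(nuggets)
    total = len(nuggets)
    chosen = [
        (category, len(ids))
        for category, ids in coverage.items()
        if min_share <= len(ids) / total <= max_share
    ]
    # Largest coverage first; ties broken alphabetically for stable output.
    return sorted(chosen, key=lambda item: (-item[1], item[0]))


def drill_down(nuggets, category):
    """Restrict to the nuggets under a category, so facets can be recomputed on them."""
    keep = rollup(nuggets)[category]
    return {nid: concepts for nid, concepts in nuggets.items() if nid in keep}


top_level = pick_categories(NUGGETS)
sub_results = drill_down(NUGGETS, "Weather hazards")
sub_level = pick_categories(sub_results)
print(top_level)
print(sub_level)
```

In this toy run the catch-all root "Topics" is dropped because it covers everything. After drilling into "Weather hazards", that category itself disappears for the same reason, and the facets are recomputed from only the nuggets that remain.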

Notice the categories returned when we drill down on the category, "Weather hazards", for hurricane proof housing.

Next, the categories when I drill down along "Design".

The categories you see are entirely decided by the context of results (or sub-results).

Faceted search promises to be a great way of managing overload on the web, and of exposing multiple points of view, or dimensions of facts, to users. Getting to those facts by opening links one after the other takes up a considerable amount of time.

Here's another example, for the query Yosemite winter activities.

Notice how Skiing, Snow, Travel and Roads come up. A winter tourist to Yosemite is obviously very interested in knowing which roads are open and which are closed.

Now contrast this with the categories that come up for Yosemite summer activities.

Do you see how Locomotion, Exercise and Wilderness (involving rock climbing, swimming, running) show up? And skiing is pushed down? The content has this information, and the dynamic categorization gets its pulse!

So what is the catch? The catch is that this depends on the Wikipedia ontology, and can only be as good as that. The Wikipedia ontology also has quite a bit of noise, and needs a lot of pruning before it can be used for such purposes. But as long as we have people committing extraordinary work to Wikipedia, Information Retrieval scientists can deliver relevance to users on the web. Did you donate to Wikipedia at the end of the year? ;)

Saturday, May 1, 2010