Monday, September 10, 2007

Harvesting information from the web for your personal Gmail knowledge base

Objective

The purpose of this post is to examine how to gather information into your personal Gmail knowledge base using today's popular Web 2.0 social networking sites, namely Digg, StumbleUpon, del.icio.us and reddit.

Background

With the advent of Web 2.0 and the explosion of information on the internet, many of us operate in constant information-overload mode and have turned into news junkies. As they say, too much of a good thing is bad for you. So how do we sift through the mass of information that the internet offers to find what we need? It was in this environment that search engines were born. Today, search engines such as Google let us run keyword searches, but these still often yield large result sets to wade through. The results are sorted algorithmically by relevance or PageRank, which does help somewhat.

However, with the arrival of social-network or community-based Web 2.0 sites, we now have human, community-based ratings that deliver webpages to us according to our preferences. So which of these sites should we use: Digg, StumbleUpon, del.icio.us, or reddit?

Analysis and Comparison

First, we really need to sit down and understand our own needs. Do we merely want to keep up with the latest happenings in the news? Or are we more interested in studying specific topics and concepts? In the former case, timing is important because news becomes less relevant over time. In the latter, content matters more than when the information was first made available. Or perhaps we need a combination of both: we want some news in order to stay current, and we also need to target specific information for our personal or work needs. Digg and Reddit are more socially driven news and content sites, whereas StumbleUpon and del.icio.us succeed more at delivering targeted and specific content.

The need for news

Many of us are news junkies even though we might not admit it. We crave news. With Digg and Reddit, we can specify the broad categories of news that we would like to receive. The community then finds and submits news items in the form of links and comments pointing to websites, blogs, or other materials. The rest of the community then votes on which items are more "worthy" or "newsy" within the various categories. Reddit even allows users to vote news articles down. Digg, on the other hand, allows the community to "bury" some of the more irrelevant news. At the end of the day, what users get are the news items that float to the top of the heap; these are the items that the social network deems relevant and worthy through the mechanism of "digging" or voting. As users, we then cast our own votes on these items, and Digg and Reddit remember the items that we voted for. At some later stage we can come back to review or study those items, or we may use RSS feeds to push them into our personal knowledge base.

Now if we participate fully in the Digg and Reddit communities, we should also be good citizens of these sites by giving accurate and relevant reviews or comments about the news items that we voted for. In so doing, we also leave ourselves comments that will be useful to us at some later stage. If you have read my previous post on using Yahoo Alerts to feed information into your personalised Gmail knowledge base (KB) system, then all we need to do is use the news RSS feeds from Digg and Reddit to pump our selected news into our KB. We can then use the Gmail search features to mine our news for nuggets of information as and when we need them. Super, isn't it?
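To make the "pump RSS into the KB" step a little more concrete, here is a rough Python sketch of one way it could work: each item in an RSS feed becomes one email message addressed to the KB inbox. This is only an illustration under my own assumptions, not how Digg, Reddit or Gmail do it. The sample feed, the address me@example.com, the function name feed_to_messages and the "[KB:source]" subject prefix are all made up for the example, and actually sending the messages is left out.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage

# Hypothetical sample feed. A real feed would be downloaded from the
# RSS address that the site gives you for your voted or saved items.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
      <description>My comment on the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
      <description>My comment on the second story.</description>
    </item>
  </channel>
</rss>"""


def feed_to_messages(rss_text, kb_address, source):
    """Turn each RSS <item> into one email message for the KB inbox."""
    root = ET.fromstring(rss_text)
    messages = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="")
        summary = item.findtext("description", default="")

        msg = EmailMessage()
        msg["To"] = kb_address
        msg["From"] = kb_address
        # Assumed convention: a fixed subject prefix per source, so a
        # later mailbox search has something predictable to match on.
        msg["Subject"] = f"[KB:{source}] {title}"
        msg.set_content(f"{summary}\n\n{link}\n")
        messages.append(msg)
    return messages


messages = feed_to_messages(SAMPLE_RSS, "me@example.com", "digg")
for m in messages:
    print(m["Subject"])
```

Keeping the source name in the subject line is just one possible design choice; it means the news from each site can be pulled up again later with a simple subject search.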

The relevance of targeted information

StumbleUpon and del.icio.us are more successful at delivering directly relevant content to us than at serving up news. With StumbleUpon and del.icio.us, we have the option of specifying in slightly more detail what kind of information we wish to consume. In StumbleUpon, users are directed to websites which are more relevant to their interests and those of their network of friends. The more you Stumble and review websites, the higher your rating will be. Again, it is the social network that determines whether your reviews are useful and relevant. Wikipedia states that "StumbleUpon uses collaborative filtering (an automated process combining human opinions with machine learning of personal preference) to create virtual communities of like-minded websurfers. Rating websites updates a personal profile (a blog-style record of rated sites) and generates peer networks of websurfers linked by common interest."

All the sites that you review using the "thumbs up" or "thumbs down" function are tracked by the system. As before, you can then push RSS feeds of these reviews back to your personalised Gmail knowledge base (KB). What you get are highly personalised and relevant websites that are served to you and pushed into your KB for future mining.

del.icio.us operates in a slightly different manner in that it does not have any voting mechanism to determine how relevant a website is to your specific requirements. Instead, you can use it to create streams of bookmarks to websites that you have encountered (both yours and your friends') and use RSS to push these into your KB.
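If you combine your own bookmark stream with those of your friends, the same website will often turn up more than once. The small Python sketch below shows one possible way to merge several streams and keep each URL only once before anything is pushed into the KB. The de-duplication step is my own assumption, not a del.icio.us feature, and the function name merge_bookmarks and the sample bookmarks are invented for illustration.

```python
def merge_bookmarks(*streams):
    """Merge bookmark streams, keeping the first entry seen for each URL."""
    seen = set()
    merged = []
    for stream in streams:
        for bookmark in stream:
            # Treat "http://example.com/a/" and "http://example.com/a"
            # as the same page (a simplifying assumption).
            url = bookmark["url"].rstrip("/")
            if url in seen:
                continue
            seen.add(url)
            merged.append(bookmark)
    return merged


# Hypothetical sample data standing in for parsed bookmark feeds.
my_bookmarks = [
    {"url": "http://example.com/a", "title": "Article A"},
    {"url": "http://example.com/b", "title": "Article B"},
]
friend_bookmarks = [
    {"url": "http://example.com/b/", "title": "Article B (friend's copy)"},
    {"url": "http://example.com/c", "title": "Article C"},
]

merged = merge_bookmarks(my_bookmarks, friend_bookmarks)
for bookmark in merged:
    print(bookmark["title"])
```

Because your own stream is passed in first, your own title wins whenever you and a friend have bookmarked the same page.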

Impact on our lives

So what is the impact of these technologies on our lives today? By carefully using Digg and Reddit to deliver relevant news, and StumbleUpon and del.icio.us to serve up personalised and relevant websites, and by pushing this information via RSS into our personalised Gmail knowledge base, we now have very refined information that we can mine at our fingertips. This information comes to us through a process of social collaborative filtering, which makes it more relevant to us, certainly more relevant than a list of results dished out to us by Google every time we do a keyword search. As a result, information harvesting becomes less painful and more relevant.






