Pandora For The Internet?

Before we get started, if you haven't already, read about the Music Genome Project. I'll wait. Wouldn't it be great if we had such a categorization and recommendation engine for information on the web? The Music Genome Project categorizes music based on nearly 400 attributes. Surely, if we can do it for music, we can do it for text.

To narrow down the task a little, we could consider only blogs and leave out everything else. We can begin by analyzing text on two basic attributes - form and function. By function, I refer to the content of the text, the subject that it tackles. And by form, I refer to the structure of the text, the emotion and the style of writing.

Function can be ascertained by diligent categorization based on the words used in the article. The overall content of the blog would help determine the general topic (blogs on technology or entertainment would generally carry articles related to those areas). Parsing the text for specific terms and applying Bayesian filters would give the probability that an article belongs to a certain narrow topic (programming in Python, iPhone games or reviews of movies starring Salma Hayek). Another way of narrowing down the topic is to follow outgoing links and determine the general content of those webpages. The system would also group related topics into hierarchies.
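To make the Bayesian filter idea a little more concrete, here is a minimal sketch of a naive Bayes topic classifier using only the Python standard library. The class name `NaiveBayesTopics`, the tokenizer and the add-one smoothing are all assumptions made for illustration; a real system would need far more training text and better tokenization.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTopics:
    """Hypothetical multinomial naive Bayes classifier over topics."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training articles
        self.vocab = set()

    def train(self, topic, text):
        words = tokenize(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def probabilities(self, text):
        """Return {topic: probability} for the given article text."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior for the topic
            for word in words:
                # Add-one smoothing so unseen words do not zero out a topic.
                score += math.log((counts[word] + 1) / denom)
            log_scores[topic] = score
        # Convert log scores to probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {t: math.exp(s - top) for t, s in log_scores.items()}
        total = sum(weights.values())
        return {t: w / total for t, w in weights.items()}
```

After training it on a few labelled articles per topic, `probabilities(article_text)` gives one probability per topic, which is the kind of number the narrow-topic categorization above would work from.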

Form is a little more difficult to categorize. We can divide form into two components: structure and sentiment. Structure consists of characteristics such as the length of the article, the number of words in a sentence, the type of words or phrases used, the amount of dialogue, the density of the text, the use of language constructs like active vs passive voice or 1st person vs 3rd person, the use of punctuation, the number of sentences in a paragraph and the rhythm of the text as determined by its syllables. We can then derive various metrics from these numbers. This would be one manner of categorization. Sentiment, on the other hand, is harder to determine. It is slightly easier in music (you can use the rhythm, the type of instruments used, the raaga, etc. to home in on the sentiment), but it is much harder in English. It is very difficult to determine characteristics such as irony, sarcasm or humor by parsing text. Moreover, different cultures have different ways of expressing sentiments, so no single algorithm is likely to work universally. Having said that, some sentiments are a little easier to gauge - like anger, joy, sadness or want (by the words, phrases, sentence lengths and punctuation used).
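A few of the structural characteristics above are easy to count. The sketch below is only an illustration: the function name `structure_metrics`, the choice of metrics and the regex-based splitting of sentences and paragraphs are assumptions, and they will miscount on abbreviations, quoted dialogue and the like.

```python
import re

# Assumed list of first-person pronouns, used as a rough 1st-person signal.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def structure_metrics(text):
    """Return a few crude structure metrics for an article."""
    # Paragraphs are separated by blank lines; sentences end in . ! or ?
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    punctuation = re.findall(r'[,;:!?()"-]', text)

    n_words = len(words) or 1  # avoid division by zero on empty text
    first_person = sum(w.lower() in FIRST_PERSON for w in words)
    return {
        "word_count": len(words),
        "words_per_sentence": len(words) / (len(sentences) or 1),
        "sentences_per_paragraph": len(sentences) / (len(paragraphs) or 1),
        "punctuation_per_word": len(punctuation) / n_words,
        "first_person_ratio": first_person / n_words,
    }
```

Metrics like these say nothing about sentiment, but they give each article a small numeric profile of its structure that can be compared with other articles.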

How does all this help? It would help us wade through the garbage of the internet to find the little gems that we care about. If you liked an article, it would help you find other articles - perhaps on similar topics and with a similar writing style - that you would be eager to read. It would also provide an opportunity to serendipitously discover great articles or pieces of information. Don't community-powered recommendation engines already do this? Well, I would argue that they don't. The articles submitted there are mostly what the submitter liked and not necessarily what you might enjoy (even if you are interested in the topic). A fully automated system could tailor content to exactly what you want.
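One simple way to find "other articles like this one" is to describe every article as a set of numeric attributes and rank candidates by cosine similarity to an article you liked. This is a sketch under stated assumptions: the attribute names in the usage note are made up, the attributes are assumed to be scaled to comparable ranges, and `recommend` is a hypothetical helper, not a description of how Pandora or any existing service works.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse attribute dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an article with no attributes matches nothing
    return dot / (norm_a * norm_b)


def recommend(liked, candidates, top_n=3):
    """Return the titles of the top_n candidates most similar to `liked`.

    `liked` is an attribute dict; `candidates` maps title -> attribute dict.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(liked, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:top_n]]
```

For example, an article could be described as `{"topic_python": 1.0, "words_per_sentence": 0.4}`, mixing topic probabilities with scaled structure metrics, so that both function and form contribute to the ranking.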

The system could sneak in a few surprises - like showing you an article with a similar style on a completely unrelated topic to gauge your response and learn from it. It could also learn your reading patterns - what kind of articles you like reading in the morning vs at night, during the week vs over the weekend - and tune itself accordingly. It could show you articles based on your current mood. Over time, it would know what you want to read better than you do. As scary as that sounds, I think it's a beautiful thing.

I think it's time we stopped manually adding feeds to our feedreader. I think it's time we have an Internet Genome Project.

4 Comments

Jul 19, 2009
Tyam said...
Have you read about 'topic models'? They're very often used to classify spam/non-spam, and the same principles are used to classify various blog topics - like political, satire etc.
I haven't seen if anyone has stepped in to capitalize on this, but it's still a hot research topic.
Jul 19, 2009
Santhosh said...
Very nice post. I'm as scared as I'm excited for the day when the network knows me better than myself.
Jul 20, 2009
@Tyam - Thanks for the pointer. Seems interesting!
@Santhosh - We have to get used to it :)
Aug 21, 2009
Alan said...
Over the years I have amassed what I consider to be an enormous number of songs in my iTunes library. It is an eclectic mix that I am proud of. Now that I've discovered Pandora, I don't listen to anything else.

I LOVE Pandora.
