Google-News-Like Headline Grouping Algorithm
In my (rare) spare time I've been working on a news aggregator that gathers headlines from various RSS sources and inserts them into a MySQL database. I love the way Google News groups stories together, and from what I've read around the web it's all done algorithmically. What I'm wondering is what process they actually use to decide that two headlines match.
With that in mind, I've been trying to come up with a method of comparing strings (headline and/or story text) for common terms and grouping them together based on the results, much like Google News. So far the process works like this. It grabs the headline and compares it with those already in the database, counting the words they have in common while ignoring an array of commonly used terms. Then it runs a Levenshtein comparison on the two strings to get a matched-character percentage. If the headline is still not a match, it re-evaluates the common terms that were thrown away in the first pass. From those figures it comes up with a point total based on the percentage of the match.
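To make the steps above concrete, here is a rough Python sketch of that kind of scoring, not the actual code from the aggregator. The stop-word list, the equal weighting of the figures, the `threshold` of 60, and all the function names are assumptions made up for illustration; the Levenshtein function is a plain textbook implementation.

```python
import re

# Hypothetical stop-word list; the real array of common terms is not shown in the post.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "at", "by", "with"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def levenshtein(a, b):
    """Edit distance between two strings, computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_pct(a, b):
    """Matched-character percentage derived from the edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def overlap_pct(words_a, words_b):
    """Percentage of shared distinct words, relative to the shorter word set."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    return 100.0 * len(set_a & set_b) / min(len(set_a), len(set_b))


def match_score(headline_a, headline_b, threshold=60.0):
    """Point total from 0 to 100; the weights and threshold are assumed values."""
    tokens_a, tokens_b = tokenize(headline_a), tokenize(headline_b)
    content_a = [w for w in tokens_a if w not in STOP_WORDS]
    content_b = [w for w in tokens_b if w not in STOP_WORDS]

    word_pct = overlap_pct(content_a, content_b)                         # step 1: common words
    char_pct = similarity_pct(headline_a.lower(), headline_b.lower())   # step 2: Levenshtein
    score = (word_pct + char_pct) / 2

    if score < threshold:
        # Step 3: second pass that puts the discarded common terms back in.
        full_pct = overlap_pct(tokens_a, tokens_b)
        score = max(score, (word_pct + char_pct + full_pct) / 3)
    return score
```

Calling `match_score("Senate passes budget bill", "Senate approves budget bill")` gives a much higher total than comparing the first headline with an unrelated one, and a new headline would join a group when its total clears whatever cutoff is chosen.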
What I'm finding is that it only works about half the time. Sometimes obvious matches are left out, and other times obvious non-matches are grouped together. Has anyone seen articles on this kind of comparison that could point me in a different or better direction? Any suggestions? TIA