In my spare time (rare) I've been working on a news aggregator that gathers headlines from various RSS sources and inserts them into a mysql db. I love the way Google News groups stories together and I've seen around the web how it's all algorithmic. But I'm wondering what the actual process is they use to evaluate a matched headline.
With that in mind, I've been trying to come up with a method of comparing strings (headline and/or story text) for common terms and grouping them together based on the results, much like Google News.
So far the process I have is like the lists below. It grabs the headline and compares it with those in the database for a common word count, ignoring an array of commonly used terms. Then it runs a Levenshtein comparison on the strings for a matched character percentage. If the headline is still not a match, it then re-evaluates the common but thrown away terms. Looking at those figures it comes up with a point total based on the percentage of the match.
Grab the headline from the RSS source.
Compare the headline with the headlines in the database:
Word Matches Not In Common Term Array
Over 30% match on words gives 1 points.
Over 50% on words gives another 2 points.
Over 70% on words gives 3 more points.
Over 90% matched words gives 5 points.
Character Match Count (Lev)
2 points for a match of 85% or better
Word Matches That Were In Common Term Array
1 point for each word if the number of words were over 40% of the total words in the largest string
If the total points betters 10 points, it calls it a match.
Save into database as parent of the matching headline and call all children of that headline as children of this new headline
If no match is found, insert it on it's own with no parent
What I'm finding is that it only works about half the time. Sometimes obvious matches are left out and other times obvious non-matches are matched. I'm wondering if anyone has seen articles on doing comparisons such as this that can point me in a different or better direction. Any suggestions? TIA