TalkPHP
 
 
02-29-2008, 05:21 PM   #1
serversphere
Google-News-Like Headline Grouping Algorithm

In my (rare) spare time I've been working on a news aggregator that gathers headlines from various RSS sources and inserts them into a MySQL database. I love the way Google News groups stories together, and I've read around the web that it's all algorithmic. But I'm wondering what process they actually use to decide that two headlines match.

With that in mind, I've been trying to come up with a method of comparing strings (headline and/or story text) for common terms and grouping them together based on the results, much like Google News.

So far, my process is the one outlined in the list below. It takes the new headline and compares it with those in the database, counting the words they have in common while ignoring an array of commonly used terms. Then it runs a Levenshtein comparison on the two strings to get a matched-character percentage. If the headline is still not a match, it re-evaluates the common terms that were thrown away earlier. From those figures it comes up with a point total based on the match percentages.
  1. Grab the headline from the RSS source.
  2. Compare the headline with the headlines in the database:

    Word Matches Not In Common Term Array
    • Over 30% match on words gives 1 point.
    • Over 50% on words gives another 2 points.
    • Over 70% on words gives 3 more points.
    • Over 90% matched words gives 5 points.

    Character Match Count (Lev)
    • 2 points for a match of 85% or better

    Word Matches That Were In Common Term Array
    • 1 point for each matched word, if the number of matched words is over 40% of the total words in the longer string

  3. If the total exceeds 10 points, it calls it a match.
    • Save into the database as the parent of the matching headline, and make all children of that headline children of this new headline
  4. If no match is found, insert it on its own with no parent
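
To make the scoring steps above concrete, here is a minimal sketch of how they might fit together. It is written in Python for brevity (PHP has a built-in levenshtein() that could replace the hand-rolled one). Several details are assumptions, because the list does not pin them down: the stop list is made up, the word-match percentage is taken relative to the larger of the two word sets, the four word tiers are treated as cumulative, and the Levenshtein percentage is computed as 1 minus the distance divided by the longer string's length.

```python
import re

# Hypothetical stop list; the post does not give the actual common-term array.
COMMON_TERMS = {"a", "an", "and", "as", "at", "for", "in", "is",
                "of", "on", "the", "to", "with"}
MATCH_THRESHOLD = 10  # the post calls it a match when the total exceeds 10


def words(headline):
    """Lowercase a headline and split it into words."""
    return re.findall(r"[a-z0-9']+", headline.lower())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score(new_headline, stored_headline):
    """Return the point total for one pair of headlines."""
    new_words, stored_words = words(new_headline), words(stored_headline)
    new_rare = {w for w in new_words if w not in COMMON_TERMS}
    stored_rare = {w for w in stored_words if w not in COMMON_TERMS}
    points = 0

    # Word matches not in the common-term array.
    # Assumption: percentage is relative to the larger word set, tiers stack.
    larger = max(len(new_rare), len(stored_rare))
    ratio = len(new_rare & stored_rare) / larger if larger else 0.0
    for threshold, award in ((0.30, 1), (0.50, 2), (0.70, 3), (0.90, 5)):
        if ratio > threshold:
            points += award

    # Character match (Levenshtein): 2 points at 85% or better.
    a, b = new_headline.lower(), stored_headline.lower()
    longest = max(len(a), len(b))
    similarity = 1 - levenshtein(a, b) / longest if longest else 1.0
    if similarity >= 0.85:
        points += 2

    # Common terms: 1 point per shared common word, but only if they make up
    # over 40% of the word count of the longer headline.
    shared_common = set(new_words) & set(stored_words) & COMMON_TERMS
    most_words = max(len(new_words), len(stored_words))
    if most_words and len(shared_common) > 0.40 * most_words:
        points += len(shared_common)

    return points


def is_match(new_headline, stored_headline):
    return score(new_headline, stored_headline) > MATCH_THRESHOLD
```

Under these assumptions, score("Obama wins Ohio primary", "Obama takes Ohio primary election") comes to 3: three of five distinct words match (60%), which earns the first two tiers and nothing else. If the tiers really are cumulative, anything short of the 90% tier tops out at 1 + 2 + 3 + 2 = 8 points before common terms are counted, so a total above 10 needs near-identical wording or at least three shared common words. That may be part of why obvious matches get left out.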

What I'm finding is that it only works about half the time. Sometimes obvious matches are left out and other times obvious non-matches are matched. I'm wondering if anyone has seen articles on doing comparisons such as this that can point me in a different or better direction. Any suggestions? TIA
All times are GMT.

 
     
