TalkPHP


TheOnly92 09-09-2009 12:35 PM

Nitrospirae
 
Nitrospirae is a bot I wrote to crawl more than 150 RSS feeds and collect them on a single site.

Those feeds are mostly IT news, PHP information, and other news I personally favor, but the list can of course be expanded to grab much more information from the internet.

To put it simply, the purpose of this site is to gather the best of the best in one place and make it searchable. It can also act as a sort of IT "newspaper" that you can read every day to get a grasp of what's going on. You may find several articles on the same news story, written by different sites.

http://nitrospirae.net

Please do give some suggestions or criticism.

ETbyrne 09-09-2009 08:08 PM

Neat. I've been thinking about doing something along these lines for a while, but never translated those thoughts into actions.

One thing you need to do is make sure you don't get any duplicate articles on the website. When I visited, I saw the same article posted under both Slashdot Science and Slashdot Games.

TheOnly92 09-10-2009 07:50 AM

Yeah, I have been thinking of some way to get rid of them. Some duplicates are updates made just after the script grabbed the feed, so I'm still looking for a way to remove them.

adamdecaf 09-10-2009 09:28 PM

You could compare the URIs, or flag (server side) any articles that have a high enough similarity according to the similar_text() function, then check the flagged articles manually if they are not too numerous.
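
A minimal sketch of that idea, assuming each article is an array with 'url' and 'title' keys. The helper names normalizeUrl() and looksDuplicate() and the 80% threshold are made up for illustration, not anything Nitrospirae actually uses:

```php
<?php
// Hypothetical helper: reduce a URL to a comparable key so that
// differences in scheme, host case, or a trailing slash do not matter.
function normalizeUrl(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'])) {
        // Not a full URL: fall back to a simple lowercase comparison.
        return strtolower(trim($url));
    }
    $host  = strtolower($parts['host']);
    $path  = rtrim($parts['path'] ?? '', '/');
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $host . $path . $query;
}

// Returns true when two articles look like duplicates.
// The 80.0 threshold is an arbitrary assumption and would need tuning.
function looksDuplicate(array $a, array $b, float $threshold = 80.0): bool
{
    // Cheap check first: same URL after normalisation.
    if (normalizeUrl($a['url']) === normalizeUrl($b['url'])) {
        return true;
    }
    // Slower check: similar_text() writes the similarity (0-100) into $percent.
    similar_text($a['title'], $b['title'], $percent);
    return $percent >= $threshold;
}
```

Comparing titles rather than full article bodies keeps the similar_text() input short. Anything this flags would still go through the manual check.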

TheOnly92 09-11-2009 09:46 AM

Cool, I didn't know there was such a function. I will look into it. Thanks for the tip!

adamdecaf 09-12-2009 06:53 PM

Just a note: the comparison algorithm is very slow, O(N**3), where N is the length of the longer string. This means that comparing strings 5 characters long takes on the order of 125 [units]. For comparison, a bubble sort has a complexity of O(N**2), and a quick sort has O(N * log(N)) [best and average case].
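
To put rough numbers on that, here is a small sketch. worstCaseUnits() just cubes the length, following the estimate above, and cappedSimilarity() is one possible way to bound the cost by comparing only the first few hundred bytes. Both function names and the 300-byte limit are assumptions for illustration:

```php
<?php
// Hypothetical cap on how many bytes of each text are compared.
const COMPARE_LIMIT = 300;

// Rough worst-case cost in abstract "units", following the N**3 estimate.
function worstCaseUnits(int $length): int
{
    return $length ** 3;
}

// Compare only the first $limit bytes of each string so the
// worst-case cost stays bounded regardless of article length.
function cappedSimilarity(string $a, string $b, int $limit = COMPARE_LIMIT): float
{
    $a = substr($a, 0, $limit);
    $b = substr($b, 0, $limit);
    if ($a === '' && $b === '') {
        return 100.0; // two empty strings are treated as identical here
    }
    similar_text($a, $b, $percent);
    return $percent;
}

echo worstCaseUnits(5), "\n";    // 125
echo worstCaseUnits(300), "\n";  // 27000000
echo worstCaseUnits(5000), "\n"; // 125000000000
```

The trade-off is that two articles which only differ after the cut-off will look identical, so the cap suits flagging for review more than automatic deletion.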

What I'm saying is that the comparison will take a _long_ time, so making users wait for an entire article to be scanned and compared would be silly. I would suggest a cron job running every 1, 2, 6, 12, or 24 hours (depending on server load and processing power) to check and flag/delete any similar articles.
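
Something along these lines could be the core of such a cron script. It is only a sketch: the flagDuplicates() name, the 'id' and 'title' keys, the 80% threshold, and the script path in the crontab comment are all placeholders, and loading and saving the articles is left out:

```php
<?php
// Hypothetical batch step for a cron-driven script.
// Example crontab entry (path is a placeholder), running every 6 hours:
//   0 */6 * * * php /path/to/flag_duplicates.php

// Takes articles ordered oldest first and returns a map of
// duplicate article id => id of the earlier article it resembles.
function flagDuplicates(array $articles, float $threshold = 80.0): array
{
    $articles = array_values($articles); // ensure 0..n-1 indexing
    $flagged  = [];
    $count    = count($articles);

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $laterId = $articles[$j]['id'];
            if (isset($flagged[$laterId])) {
                continue; // already matched to an earlier article
            }
            similar_text($articles[$i]['title'], $articles[$j]['title'], $percent);
            if ($percent >= $threshold) {
                $flagged[$laterId] = $articles[$i]['id'];
            }
        }
    }
    return $flagged;
}
```

The pairwise loop is quadratic in the number of articles, so in practice it would probably make sense to pass in only the articles fetched since the last run plus a recent window, rather than the whole archive.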

TheOnly92 09-13-2009 02:03 AM

Yes, that's what I have in mind now.

