TalkPHP
 
 
09-09-2009, 12:35 PM   #1
TheOnly92 (The Contributor)
Nitrospirae

Nitrospirae is a bot I have written that crawls more than 150 RSS feeds and collects them on a single site.

The feeds are mostly IT news, PHP information, and other news I personally favor, but the list can of course be expanded to pull in much more from around the internet.

Put simply, the purpose of the site is to gather the best of the best in one place and make it searchable. It can also act as a sort of IT "newspaper" that you can read every day to get a grasp of what's going on. You may find several articles on the same story written by different sites.

http://nitrospirae.net

Please do give some suggestions or criticism.
__________________
There are no noobs and pros in this world, only people who know how to use Google and those who don't.
09-09-2009, 08:08 PM   #2
ETbyrne (how quixotic are you?)

Neat. I've been thinking about doing something along these lines for a while, but never translated those thoughts into actions.

One thing you need to do is make sure you don't get any duplicate articles on the website. When I visited, I saw the same article posted under both Slashdot Science and Slashdot Games.
__________________
Dingo Web Systems > http://www.dingocode.com
My Website > http://www.evanbot.com
09-10-2009, 07:50 AM   #3
TheOnly92 (The Contributor)

Yeah, I have been thinking about how to get rid of them. Some of the duplicates are updates that were published just after the script grabbed the feed. I'm still looking for a way to remove them.
09-10-2009, 09:28 PM   #4
adamdecaf (The Addict)

You could compare the URIs, or use the similar_text() function to flag (server side) any articles that are similar enough, then check the flagged articles manually if there aren't too many of them.
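
A minimal sketch of both checks, for illustration only. similar_text() is the real PHP function; the helper name, the `url` and `title` array keys, and the 85% threshold are assumptions and not Nitrospirae's actual schema.

```php
<?php
// Hypothetical helper: true when $candidate looks like a duplicate of $existing.
// The 'url'/'title' keys and the default threshold are illustrative assumptions.
function isProbableDuplicate(array $candidate, array $existing, float $threshold = 85.0): bool
{
    // Cheap check first: same URI after a crude normalisation
    // (lowercased, trailing slash removed).
    $a = rtrim(strtolower($candidate['url']), '/');
    $b = rtrim(strtolower($existing['url']), '/');
    if ($a === $b) {
        return true;
    }

    // Expensive check: similar_text() writes the similarity, as a
    // percentage, into its third (by-reference) argument.
    similar_text(strtolower($candidate['title']), strtolower($existing['title']), $percent);

    return $percent >= $threshold;
}
```

Anything this returns true for would only be flagged for a manual look, since a high title similarity does not prove two articles are the same story.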
__________________
My Site
09-11-2009, 09:46 AM   #5
TheOnly92 (The Contributor)

Cool, I didn't know there was such a function. I will look into it. Thanks for the tip!
09-12-2009, 06:53 PM   #6
adamdecaf (The Addict)

Just a note: the comparison algorithm is very slow, O(N**3), where N is the string length. That means a string 5 characters long costs on the order of 125 units of work. For comparison, a bubble sort has a complexity of O(N**2), and a quick sort has O(N * log(N)) in the best case.

What I'm saying is that the comparison will take a _long_ time, so making users wait while an entire article is scanned and compared would be silly. I would suggest a cron job that runs every 1, 2, 6, 12, or 24 hours (depending on server load and processing power) to check for similar articles and flag or delete them.
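
A rough sketch of what such a batch job could look like. The function name, the `id` and `title` keys, the threshold, and the script path in the crontab comment are all hypothetical. Comparing titles instead of whole articles is an extra assumption made here to keep each similar_text() call cheap.

```php
<?php
// Hypothetical batch job meant to be run by cron, for example every 6 hours:
//   0 */6 * * * php /path/to/flag_duplicates.php
// The script path, array keys, and threshold are illustrative assumptions.

// Returns [idA, idB] pairs whose titles are at least $threshold percent similar.
function findSimilarPairs(array $articles, float $threshold = 85.0): array
{
    $articles = array_values($articles); // make sure keys are 0..n-1
    $pairs = [];
    $count = count($articles);

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            // Titles only: short strings keep the per-pair cost small.
            similar_text($articles[$i]['title'], $articles[$j]['title'], $percent);
            if ($percent >= $threshold) {
                $pairs[] = [$articles[$i]['id'], $articles[$j]['id']];
            }
        }
    }

    return $pairs;
}
```

The number of pairs grows quadratically with the number of articles, so the job would probably only look at articles fetched since the previous run. Loading the articles and storing the flags is left out because the storage layer is not described in this thread.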
09-13-2009, 02:03 AM   #7
TheOnly92 (The Contributor)

Yes, that's what I have in mind now.
All times are GMT.