TalkPHP
 
 
Old 02-26-2008, 05:18 PM   #1 (permalink)
The Wanderer
 
Join Date: Nov 2007
Location: Mumbai, India
Posts: 24
Thanks: 0
sunilbhatia79 is on a distinguished road
Tutorial on writing Website Scrapers

This article discusses how to write a website scraper in PHP for extracting data from web sites. The concepts can be applied in Java, C#, or basically any language with powerful string processing capabilities. It covers the basics of website scraping, then walks through a tutorial that finds a web ranking from the Yahoo.com search engine.

Steps involved to write a scraping program
  1. Visit the URL
  2. Understand the pattern
  3. Validate the structure of pattern on different URLs
  4. Write the program
  5. Test the program using various input parameters
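The steps above can be sketched roughly as follows. This is only a minimal illustration of the string-processing approach, not the code from the full post: the helper name extractBetween() and the sample markup are made up, and a hard-coded string stands in for a page fetched from a real URL.

```php
<?php
// Hypothetical helper (not from the original tutorial): returns every
// substring found between the $start and $end markers, scanning left to right.
function extractBetween(string $html, string $start, string $end): array
{
    $results = [];
    $pos = 0;

    while (($pos = strpos($html, $start, $pos)) !== false) {
        $pos += strlen($start);               // step past the opening marker
        $endPos = strpos($html, $end, $pos);  // find the closing marker
        if ($endPos === false) {
            break;                            // unterminated match: stop scanning
        }
        $results[] = substr($html, $pos, $endPos - $pos);
        $pos = $endPos + strlen($end);        // continue after the closing marker
    }

    return $results;
}

// Made-up sample markup standing in for a downloaded page.
$sampleHtml = '<ul><li class="item">PHP</li><li class="item">XPath</li></ul>';

$items = extractBetween($sampleHtml, '<li class="item">', '</li>');
echo implode("\n", $items), "\n";
```

Steps 2 and 3 (understanding and validating the pattern) come down to choosing start and end markers that stay stable across the pages you want to scrape; step 5 is a matter of feeding the helper different inputs, including ones where the markers are missing.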

Full post:

Writing Website Scrapers in PHP | Geek Files
__________________
Sunil Bhatia www.twitter.com/sunilbhatia79 - Follow me on Twitter
PHP5 Tutorials
Career Articles

Last edited by Wildhoney : 02-26-2008 at 06:25 PM.
The Following 3 Users Say Thank You to sunilbhatia79 For This Useful Post:
Alan @ CIT (02-26-2008), Rendair (02-26-2008), ReSpawN (02-28-2008)
Old 02-27-2008, 04:54 AM   #2 (permalink)
The Contributor
 
 
Join Date: Jan 2008
Location: Brazil
Posts: 77
Thanks: 14
DeMo is on a distinguished road

If you're scraping content from websites (that is, HTML), I'd say string processing via strpos() and regular expressions is a thing of the past.

If you're using PHP5, it's very easy to scrape content using the DOM functions. All you need is a DOMDocument object: call its DOMDocument->loadHTML() method and you can then navigate the DOM using methods like getElementById and getElementsByTagName, just like in JavaScript.
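As a rough sketch of that approach (assuming PHP's DOM extension is enabled), something like the following should work. The sample markup is made up for illustration; only the popsearchbd id is borrowed from the tutorial's Yahoo example.

```php
<?php
// Made-up sample markup; a real scraper would download the page first.
$html = '<html><body>'
      . '<div id="popsearchbd"><a href="/a">First</a><a href="/b">Second</a></div>'
      . '<div id="other"><a href="/c">Ignored</a></div>'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$titles = [];
$container = $doc->getElementById('popsearchbd');   // null if no such element
if ($container !== null) {
    // Only links inside the container are visited, not those in other divs.
    foreach ($container->getElementsByTagName('a') as $link) {
        $titles[] = trim($link->textContent);
    }
}

print_r($titles);
```

Compared with strpos(), the parser deals with attribute order, quoting and whitespace, so the code depends on the document structure rather than on the exact bytes of the markup.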
Old 02-27-2008, 10:19 AM   #3 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
 
Join Date: Apr 2007
Posts: 1,381
Thanks: 5
Salathe is on a distinguished road

A nice little start on the subject, and as DeMo said there is always more than one way to skin a cat (as the saying goes).

Using DOM & XPath, we could condense:
PHP Code:
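$pos = strpos($str, "<div id=\"popsearchbd\"");

if ($pos === false) {
    echo "No information available";
}
else {
    // Skip past the opening marker. This is done after the check above:
    // adding to $pos first would hide a failed strpos().
    $pos = $pos + strlen("<div id=\"popsearchbd\"");

    while (1) {
        // Find the next link carrying the fp-buzzmod class
        $pos = strpos($str, "fp-buzzmod\">", $pos);

        if ($pos === false) {
            break;
        }

        // The link text runs from the end of the marker to the closing </a>
        $pos = $pos + strlen("fp-buzzmod\">");
        $temppos = $pos;
        $pos = strpos($str, "</a>", $pos);

        $datalength = $pos - $temppos;

        $data = substr($str, $temppos, $datalength);
        echo $data;
        echo "\n";
    }
}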

into:
PHP Code:
$links = $xpath->query('//div[@id="popsearchbd"]//a');
Now, depending on your skill level or experience with XPath (and/or string functions!), the latter might be even scarier than Sunil's version! One thing to note is that (at least for the Yahoo site in this tutorial) a User-Agent header is required, else Yahoo will send back different HTML (not containing the top searches!). However, in the tutorial Sunil sends along a UA string in the headers, so that's OK.
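For anyone who wants to try the XPath route end to end, here is a minimal sketch. The helper names fetchHtml() and topSearches(), the User-Agent string and the sample markup are all made up for illustration; the tutorial's own request code and UA string are not reproduced here, and fetchHtml() assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: downloads a page while sending a User-Agent header.
// The default UA string is a placeholder, not the one used in the tutorial.
function fetchHtml(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)')
{
    $context = stream_context_create([
        'http' => ['user_agent' => $userAgent, 'timeout' => 10],
    ]);

    return file_get_contents($url, false, $context);   // string, or false on failure
}

// Hypothetical helper: returns the text of every <a> inside div#popsearchbd.
function topSearches(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//div[@id="popsearchbd"]//a');

    $texts = [];
    foreach ($links as $link) {
        $texts[] = trim($link->textContent);
    }

    return $texts;
}

// Made-up markup that imitates the structure the query expects.
$sample = '<div id="popsearchbd"><ol>'
        . '<li><a class="fp-buzzmod" href="#">php</a></li>'
        . '<li><a class="fp-buzzmod" href="#">xpath</a></li>'
        . '</ol></div>'
        . '<div id="footer"><a href="#">about</a></div>';

print_r(topSearches($sample));

// Against a live page, check the download before parsing it:
// $html = fetchHtml('http://www.yahoo.com/');
// if ($html !== false && $html !== '') { print_r(topSearches($html)); }
```

The query itself is the only part that is specific to the page; everything else is plumbing that can be reused for other sites by swapping the XPath expression.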
All times are GMT.

 
     
