TalkPHP


sunilbhatia79 02-26-2008 06:18 PM

Tutorial on writing Website Scrapers
 
This article discusses how to write a website scraper in PHP for extracting data from web sites. The concepts can be applied in Java, C#, or basically any language with powerful string-processing capabilities. The article teaches the basics of website scraping and then walks through a tutorial that finds web rankings from the Yahoo.com search engine.

Steps involved to write a scraping program
  1. Visit the URL
  2. Understand the pattern
  3. Validate the structure of pattern on different URLs
  4. Write the program
  5. Test the program using various input parameters
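
As a rough illustration of steps 2 to 5, here is a minimal sketch using plain PHP string functions (PHP 7 or later assumed for the type declarations). The helper name extract_between() and the sample HTML are made up for this example and do not come from the article; step 1 (fetching the page) is left out so the snippet runs without a network connection.

```php
<?php
// Hypothetical helper (not from the article): collect every substring that
// sits between $start and $end in $html, scanning left to right.
function extract_between(string $html, string $start, string $end): array
{
    $results = [];
    $offset = 0;

    while (($from = strpos($html, $start, $offset)) !== false) {
        $from += strlen($start);            // skip past the opening marker
        $to = strpos($html, $end, $from);   // find the closing marker
        if ($to === false) {
            break;                          // unterminated match: stop scanning
        }
        $results[] = substr($html, $from, $to - $from);
        $offset = $to + strlen($end);       // continue after the closing marker
    }

    return $results;
}

// Made-up sample standing in for a page fetched in step 1.
$sample = '<ul><li class="item">PHP</li><li class="item">Java</li></ul>';

// Steps 2-4: the pattern here is <li class="item">...</li>.
$items = extract_between($sample, '<li class="item">', '</li>');

// Step 5: try other inputs, including one where the pattern is absent.
$none = extract_between('<p>nothing here</p>', '<li class="item">', '</li>');

print_r($items);
```

Comparing strpos() results with !== false matters here, because a match at position 0 would otherwise be mistaken for "not found".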

Full post:

Writing Website Scrapers in PHP | Geek Files

DeMo 02-27-2008 05:54 AM

If you're scraping content from websites (that is, HTML), I guess string processing via strpos() and regular expressions is a thing of the past.

If you're using PHP 5, it's very easy to scrape content using the DOM functions. All you need is a DOMDocument object: call its loadHTML() method and then navigate the DOM with methods like getElementById() and getElementsByTagName(), just like in JavaScript. :-)
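
A minimal sketch of that approach might look like the following. The markup, the "news" id and the variable names are assumptions invented for this example; only the DOMDocument methods themselves come from the post.

```php
<?php
// Made-up sample markup; the id "news" and the links are assumptions.
$html = '<html><body><div id="news">'
      . '<a href="/a">First</a><a href="/b">Second</a>'
      . '</div></body></html>';

$doc = new DOMDocument();

// Real-world HTML is rarely valid, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$titles = [];
$container = $doc->getElementById('news');
if ($container !== null) {
    // Walk every <a> element below the container, much like in JavaScript.
    foreach ($container->getElementsByTagName('a') as $a) {
        $titles[] = $a->textContent . ' => ' . $a->getAttribute('href');
    }
}

print_r($titles);
```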

Salathe 02-27-2008 11:19 AM

A nice little start on the subject, and as DeMo said there is always more than one way to skin a cat (as the saying goes).

Using DOM & XPath, we could condense:
PHP Code:

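// Locate the container that holds the popular searches.
$marker = "<div id=\"popsearchbd\"";
$pos = strpos($str, $marker);

// Check before adding the marker length; otherwise the test can never be true.
if ($pos === false) {
    echo "No information available";
}
else {
    $pos = $pos + strlen($marker);

    while (1) {
        // Find the next link inside the container.
        $pos = strpos($str, "fp-buzzmod\">", $pos);

        if ($pos === false) {
            break;
        }

        $pos = $pos + strlen("fp-buzzmod\">");
        $temppos = $pos;
        $pos = strpos($str, "</a>", $pos);

        // The link text runs from $temppos up to the closing </a>.
        $datalength = $pos - $temppos;

        $data = substr($str, $temppos, $datalength);
        echo $data;
        echo "\n";
    }
}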

into:
PHP Code:

$links = $xpath->query('//div[@id="popsearchbd"]//a');

Now, depending on your skill level or experience with XPath (and/or string functions!), the latter might be even scarier than Sunil's version! One thing to note is that (at least for the Yahoo site in this tutorial) a User-Agent is required, else Yahoo will send back different HTML (not containing the top searches!). However, in the tutorial Sunil sends along a UA string in the headers, so that's OK. :-)
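
For anyone wondering where $xpath comes from, here is a minimal sketch of the surrounding setup. The function names fetch_html() and top_search_links(), the placeholder User-Agent string and the sample markup are all assumptions made for illustration; they are not taken from the tutorial, and Yahoo's real markup may well differ.

```php
<?php
// Hypothetical helper: fetch a page while sending a User-Agent header.
// The default UA string is a placeholder, not the one used in the tutorial.
function fetch_html(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)')
{
    $context = stream_context_create([
        'http' => ['user_agent' => $userAgent, 'timeout' => 10],
    ]);

    // Returns the body as a string, or false on failure.
    return file_get_contents($url, false, $context);
}

// Hypothetical helper: run the XPath query on an HTML string
// and return the text of each matching link.
function top_search_links(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//div[@id="popsearchbd"]//a');

    $texts = [];
    foreach ($links as $link) {
        $texts[] = trim($link->textContent);
    }

    return $texts;
}

// Made-up markup imitating the structure the query expects.
$sample = '<div id="popsearchbd"><ol>'
        . '<li><a class="fp-buzzmod" href="#">Example One</a></li>'
        . '<li><a class="fp-buzzmod" href="#">Example Two</a></li>'
        . '</ol></div>';

$found = top_search_links($sample);
print_r($found);
```

Against a live page, the result of fetch_html() would need to be checked for false before being handed to top_search_links().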

