06-05-2009, 01:44 AM   #1 (permalink)
Wildhoney
ZF's Zend_Dom - The DOMDocument Wrapper

When scraping websites I've always used regular expressions, and I am not sure why. I've always been aware of DOMDocument, and after reading about Zend_Dom today, which is essentially a DOMDocument wrapper with a very useful query function added, I've decided it's much more elegant in terms of lucidity and readability. I think from now on I will use Zend_Dom for website scraping.



For example, to get all the links from the forum page of TalkPHP, we could use the following code, bearing in mind that query returns a Zend_Dom_Query_Result object that holds the matching DOMElement objects. DOMDocument, I feel, never had the best documentation on PHP.net, and I think that's the reason I never became acquainted with it well enough to consider using it over regular expressions.

php Code:
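// Fetch the raw HTML of the forum index
$szHtml = file_get_contents('http://www.talkphp.com/forums.php');

// Wrap it in Zend_Dom_Query and select the links with a CSS-style selector
$pDom = new Zend_Dom_Query($szHtml);
$pResults = $pDom->query('.alt1Active div a');

// Each result is a DOMElement, so the usual DOM methods and properties apply
foreach ($pResults as $pResult)
{
    printf('<a href="%s">%s</a><br />', $pResult->getAttribute('href'), $pResult->nodeValue);
}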

This to me is much easier to read than regular expressions. It is clear what it's doing, and with the CSS-style selector for the query function, it's ever so flexible. Incidentally, the query function also accepts the XPath syntax for locating elements.

All in all it reminds me very much of JavaScript, and I think that's the predominant reason why I've fallen head over heels for it, especially now Zend has extended on it in their framework.

Getting at such things as the innerHTML (as in JavaScript) is not as easy, because there is no ready-made property or function for it. Instead you have properties such as nodeValue, nodeName and nodeType, amongst others, as described on this page.
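As a rough illustration of the point about innerHTML, here is a minimal sketch that rebuilds it by serialising each child node. The helper name innerHtml and the sample markup are made up for this example, and passing a node to saveHTML assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: approximates JavaScript's innerHTML by
// serialising every child of the given node and joining the pieces.
function innerHtml(DOMNode $pNode)
{
    $szHtml = '';
    foreach ($pNode->childNodes as $pChild) {
        // saveHTML() with a node argument serialises only that node
        $szHtml .= $pNode->ownerDocument->saveHTML($pChild);
    }
    return $szHtml;
}

// Sample markup invented for the example
$pDom = new DOMDocument;
$pDom->loadHTML('<div>Hello <b>world</b></div>');
$pDiv = $pDom->getElementsByTagName('div')->item(0);

echo $pDiv->nodeName, "\n";   // the element's tag name
echo $pDiv->nodeValue, "\n";  // the text content, with the markup stripped
echo innerHtml($pDiv), "\n";  // the markup inside the element
```

The built-in properties only give you the name and the flattened text, which is why the markup has to be pieced together from the child nodes.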

It's also easy enough to extract the attributes with getAttribute.

I am sure most would use regular expressions to scrape content, but it's worth asking, I suppose, if anybody uses DOMDocument already on TalkPHP? And if not, would you agree that it's more lucid and readable than regular expressions?

I think from now on Zend_Dom will be my baby.
06-05-2009, 07:39 AM   #2 (permalink)
maZtah

Nice article!

But somehow I'm still afraid to go with the Zend Framework. I don't know why, really. Maybe we should make a list of starter tutorials.

Anyways, thanks for this post!
06-05-2009, 10:25 AM   #3 (permalink)
Salathe

Quote:
Originally Posted by Wildhoney View Post
I am sure most would use regular expressions to scrape content, but it's worth asking, I suppose, if anybody uses DOMDocument already on TalkPHP? And if not, would you agree that it's more lucid and readable than regular expressions?
Like everyone, I've used regular expressions in the past (and probably will in the future) to scrape particular content. I've also used basic string functions (strpos, substr and friends) to do the same.

For XML (and HTML) documents, I've been using DOM for many years (PHP4's support left a lot to be desired though) and on other forums (sorry!) where people ask about scraping with regular expressions, I've often pushed them in the direction of DOM where it seemed suitable.

Just to go through Wildhoney's example minus the Zend wrapper, a close translation would be:
PHP Code:
$szHtml   = file_get_contents('http://www.talkphp.com/forums.php');
$pDom     = new DOMDocument;
@$pDom->loadHTML($szHtml);
$pQuery   = new DOMXPath($pDom);
$pResults = $pQuery->query('//*[@class="alt1Active"]//div//a'); // XPath equivalent to CSS selector ".alt1Active div a"
foreach ($pResults as $pResult)
{
    printf('<a href="%s">%s</a><br />', $pResult->getAttribute('href'), $pResult->nodeValue);
}
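One caveat on the XPath in the translation above: @class="alt1Active" compares the whole class attribute, whereas the CSS selector .alt1Active matches any element that has alt1Active as one of its classes. A minimal sketch of the difference, using inline markup invented for the example (the variable names are also just for illustration):

```php
<?php
// Two cells: one with exactly the class, one with an extra class as well
$szHtml = '<div class="alt1Active"><div><a href="/a">Exact</a></div></div>'
        . '<div class="alt1Active first"><div><a href="/b">Token</a></div></div>';

$pDom = new DOMDocument;
$pDom->loadHTML($szHtml);
$pQuery = new DOMXPath($pDom);

// Exact attribute comparison, as in the translation above
$szExact = '//*[@class="alt1Active"]//div//a';

// Token-style comparison, closer to how a CSS class selector behaves
$szToken = '//*[contains(concat(" ", normalize-space(@class), " "), " alt1Active ")]//div//a';

$nExact = $pQuery->query($szExact)->length;
$nToken = $pQuery->query($szToken)->length;

printf("exact match: %d link(s), token match: %d link(s)\n", $nExact, $nToken);
```

For a page where the element only ever carries the one class, the shorter expression is perfectly adequate.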

Notes: Firstly, Zend_Dom uses the error-control operator (@) when loading the HTML (as I have above in the translation) but it would be better to use the libxml_*_errors functions to temporarily disable error reporting for libxml (which DOM uses) rather than the entire system. Secondly, one would normally opt for DOMDocument::loadHTMLFile rather than loading the string using file_get_contents then passing it to the DOM object.
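As a sketch of the first note, this is roughly how the libxml_*_errors functions can stand in for the @ operator. The function name loadHtmlQuietly and the deliberately sloppy inline markup are assumptions for the example. It uses loadHTML on a string rather than loadHTMLFile only so that it runs without network access.

```php
<?php
// Hypothetical helper: loads HTML while collecting libxml's complaints
// internally instead of silencing every error with @.
function loadHtmlQuietly($szHtml)
{
    // Switch libxml to internal error handling, remembering the old setting
    $bPrevious = libxml_use_internal_errors(true);

    $pDom = new DOMDocument;
    $pDom->loadHTML($szHtml);

    // Keep the collected errors for inspection, then reset libxml's buffer
    $aErrors = libxml_get_errors();
    libxml_clear_errors();

    // Restore whatever error handling mode was active before
    libxml_use_internal_errors($bPrevious);

    return array($pDom, $aErrors);
}

// Sample markup invented for the example, with an unknown tag at the end
$szHtml = '<div class="alt1Active"><div><a href="/forum-a">Forum A</a></div></div><bogus>';

list($pDom, $aErrors) = loadHtmlQuietly($szHtml);

$pQuery   = new DOMXPath($pDom);
$pResults = $pQuery->query('//*[@class="alt1Active"]//div//a');

foreach ($pResults as $pResult)
{
    printf("%s => %s\n", $pResult->getAttribute('href'), $pResult->nodeValue);
}
printf("libxml reported %d problem(s)\n", count($aErrors));
```

Only libxml's own parser messages are held back this way, so a mistake elsewhere on the line is still reported as usual.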
06-06-2009, 12:42 PM   #4 (permalink)
Orc

Why does it use the error suppression operator which is supposedly slow...
06-06-2009, 01:08 PM   #5 (permalink)
Salathe

Quote:
Originally Posted by Orc View Post
Why does it use the error suppression operator which is supposedly slow...
There are any number of reasons why the author decided to use the error suppression operator. It is very quick and painless to write (just one extra character and all the noise goes away!), they may think that is the only way to keep the errors from popping up, they may not care about any tiny, tiny, tiny performance hit incurred when the line is called (perhaps only once or a few times per page-load for most uses), etc.
06-06-2009, 01:49 PM   #6 (permalink)
Orc

Quote:
Originally Posted by Salathe View Post
There are any number of reasons why the author decided to use the error suppression operator. It is very quick and painless to write (just one extra character and all the noise goes away!), they may think that is the only way to keep the errors from popping up, they may not care about any tiny, tiny, tiny performance hit incurred when the line is called (perhaps only once or a few times per page-load for most uses), etc..
Well, what happens if you need a bunch of these DOMDocument instances and have to use the error-suppression operator again and again? Then you might run into a performance hit.
06-06-2009, 01:55 PM   #7 (permalink)
Salathe

If you were worried about the performance hit of the @ operator, you wouldn't be using DOMDocument, especially if you were using a bunch of DOMDocuments!
06-06-2009, 01:59 PM   #8 (permalink)
Orc

Quote:
Originally Posted by Salathe View Post
If you were worried about the performance hit of the @ operator, you wouldn't be using DOMDocument, especially if you were using a bunch of DOMDocuments!
Then what would you use?
06-06-2009, 02:11 PM   #9 (permalink)
Salathe

I'm not entirely sure that this is the place for such a discussion (performance of XML parsers). Either way, the DOM extension should suffice for most uses, most of the time. My point was simply that parsing, storing and manipulating a document using the DOM extension has far, far more impact on the performance of a script than prepending an expression with @, or conversely that factoring out the @-operator purely for performance reasons would be silly.

A much better reason for not using the @-operator would be to avoid silencing errors which you would actually want to see. For instance, you could write @$pDom->loadHMTL($szHtml); by accident and no error would be reported (should be Fatal Error: Call to undefined method DOMDocument::loadHMTL()) yet your script would not work as expected since the HTML string was never loaded into the DOM document.
06-06-2009, 02:26 PM   #10 (permalink)
Orc

Fine. But here's a good question, would it be good to use it in Freelancing projects or Open Source projects?
06-06-2009, 02:38 PM   #11 (permalink)
Salathe

As far as I'm aware, the nature of the gig (Freelancing) or license on the software (Open Source) has nothing to do with choosing whether or not to use a particular operator. I think perhaps you are over-thinking things Orc.
06-06-2009, 02:43 PM   #12 (permalink)
Orc

Quote:
Originally Posted by Salathe View Post
As far as I'm aware, the nature of the gig (Freelancing) or license on the software (Open Source) has nothing to do with choosing whether or not to use a particular operator. I think perhaps you are over-thinking things Orc.
Well, what I meant was: is it compatible with PHP 4, considering there are still a handful of people out there running PHP 4 environments? Then I found DOM XML at http://us2.php.net/manual/en/ref.domxml.php

So I could just do a hybrid of this and DOMDocument, depending on the server's PHP version.
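A minimal sketch of how that check might look, assuming you detect the extension that is actually loaded rather than comparing version numbers. The function name detectDomBackend and its return values are made up for the example. domxml_open_mem is one of the functions from the PHP 4 DOM XML extension linked above.

```php
<?php
// Hypothetical helper: reports which DOM implementation is available.
function detectDomBackend()
{
    // PHP 5 and later: the DOM extension provides DOMDocument
    if (class_exists('DOMDocument')) {
        return 'dom';
    }

    // PHP 4: the DOM XML extension provides the domxml_* functions
    if (function_exists('domxml_open_mem')) {
        return 'domxml';
    }

    // Neither extension is loaded
    return 'none';
}

$szBackend = detectDomBackend();
printf("Available DOM backend: %s\n", $szBackend);
```

The two extensions have different APIs, so each backend would still need its own scraping code behind a check like this.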