When scraping websites I've always used regular expressions, and I am not sure why. I've always been aware of DOMDocument, and after reading about Zend_Dom today, which is essentially a DOMDocument wrapper with a very useful query function added, I've decided it's much more elegant in terms of lucidity and readability. I think from now on I will use Zend_Dom for website scraping.
The DOMDocument documentation on PHP.net was, I feel, never the best, and I think that's the reason I never became acquainted with it well enough to consider using it over regular expressions.
For example, to get all the links from the forum page of TalkPHP, we could use the following code, bearing in mind that query returns a Zend_Dom_Query_Result object that contains the matching DOMElement objects (possibly none).
php Code:
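// Assumes Zend Framework 1 is on the include path.
require_once 'Zend/Dom/Query.php';
// Fetch the raw HTML of the forum index.
$szHtml = file_get_contents('http://www.talkphp.com/forums.php');
// Select the forum links with a CSS-style selector.
$pDom = new Zend_Dom_Query($szHtml);
$pResults = $pDom->query('.alt1Active div a');
// Each result is a DOMElement; escape the values before printing them back out as HTML.
foreach ($pResults as $pResult) {
    printf('<a href="%s">%s</a><br />',
        htmlspecialchars($pResult->getAttribute('href')),
        htmlspecialchars($pResult->nodeValue));
}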
This to me is much easier to read than regular expressions. It is clear what it's doing, and with CSS-style selectors for the query function, it's ever so flexible. Incidentally, Zend_Dom_Query also accepts XPath syntax for locating elements, through its separate queryXpath function.
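For anyone without Zend Framework to hand, here is a rough sketch of the same selection using only PHP's built-in DOMDocument and DOMXPath. The markup is made up to stand in for the forum page, and the XPath expression is my own approximation of the `.alt1Active div a` selector, not what Zend_Dom generates internally.

```php
<?php
// Hypothetical markup standing in for the forum page.
$szHtml = <<<HTML
<html><body><table><tr>
<td class="alt1Active"><div><a href="/forum-1.html">General</a></div></td>
<td class="alt1Active"><div><a href="/forum-2.html">PHP Help</a></div></td>
<td class="alt2"><div><a href="/ignored.html">Ignored</a></div></td>
</tr></table></body></html>
HTML;

// Real-world pages are rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$pDocument = new DOMDocument();
$pDocument->loadHTML($szHtml);
libxml_clear_errors();

// Approximate XPath equivalent of the CSS selector ".alt1Active div a".
$szQuery = "//*[contains(concat(' ', normalize-space(@class), ' '), ' alt1Active ')]//div//a";

$pXpath = new DOMXPath($pDocument);
$aLinks = array();
foreach ($pXpath->query($szQuery) as $pLink) {
    // Map each href to its link text.
    $aLinks[$pLink->getAttribute('href')] = $pLink->nodeValue;
}

print_r($aLinks);
```

The XPath is noticeably wordier than the CSS selector, which is a fair illustration of why the CSS-style query function is the nicer of the two for simple cases.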
All in all it reminds me very much of JavaScript, and I think that's the predominant reason why I've fallen head over heels for it, especially now that Zend has built on it in their framework.
Getting at something like innerHTML (as in JavaScript) is not as easy, because there is no direct equivalent. Instead you have properties, rather than functions, such as nodeValue, nodeName and nodeType, amongst others, as described on this page.
It's also easy enough to extract the attributes with getAttribute.
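To make those properties a little more concrete, here is a small sketch using plain DOMDocument on a made-up snippet of HTML. The getInnerHtml function is a hypothetical helper of my own, not part of DOMDocument or Zend_Dom; it approximates innerHTML by serialising each child node, and it assumes a PHP version where saveHTML accepts a node argument (5.3.6 or later).

```php
<?php
// Hypothetical helper: approximate JavaScript's innerHTML by serialising each child node in turn.
function getInnerHtml(DOMNode $pNode)
{
    $szInner = '';
    foreach ($pNode->childNodes as $pChild) {
        $szInner .= $pNode->ownerDocument->saveHTML($pChild);
    }
    return $szInner;
}

// Made-up markup for illustration only.
$pDocument = new DOMDocument();
$pDocument->loadHTML('<html><body><div id="post" class="alt1">Hello <b>world</b></div></body></html>');

$pDiv = $pDocument->getElementsByTagName('div')->item(0);

$szName  = $pDiv->nodeName;               // tag name of the element
$iType   = $pDiv->nodeType;               // compare against constants such as XML_ELEMENT_NODE
$szText  = $pDiv->nodeValue;              // text content, with the child tags stripped
$szClass = $pDiv->getAttribute('class');  // attribute lookup
$szInner = getInnerHtml($pDiv);           // markup of the children, innerHTML-style

var_dump($szName, $iType, $szText, $szClass, $szInner);
```

Note that nodeValue only gives you the text, so the helper is needed whenever you want the child markup preserved.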
I am sure most would use regular expressions to scrape content, but it's worth asking, I suppose, if anybody uses DOMDocument already on TalkPHP? And if not, would you agree that it's more lucid and readable than regular expressions?
I think from now on Zend_Dom will be my baby.