Quote:
Originally Posted by Wildhoney
I am sure most would use regular expressions to scrape content, but it's worth asking, I suppose, if anybody uses DOMDocument already on TalkPHP? And if not, would you agree that it's more lucid and readable than regular expressions?
Like everyone, I've used regular expressions in the past (and probably will in the future) to scrape particular content. I've also used basic string functions (strpos, substr and friends) to do the same.
For XML (and HTML) documents, I've been using DOM for many years (though PHP 4's support left a lot to be desired), and on other forums (sorry!) where people ask about scraping with regular expressions, I've often pushed them in the direction of DOM where it seemed suitable.
Just to go through Wildhoney's example minus the Zend wrapper, a close translation would be:
PHP Code:
$szHtml = file_get_contents('http://www.talkphp.com/forums.php');
$pDom = new DOMDocument;
@$pDom->loadHTML($szHtml);
$pQuery = new DOMXPath($pDom);
$pResults = $pQuery->query('//*[@class="alt1Active"]//div//a'); // XPath equivalent to CSS selector ".alt1Active div a"
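foreach ($pResults as $pResult)
{
    // nodeValue and getAttribute return decoded text, so escape before printing as HTML
    printf('<a href="%s">%s</a><br />', htmlspecialchars($pResult->getAttribute('href')), htmlspecialchars($pResult->nodeValue));
}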
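One caveat on the selector translation: @class="alt1Active" only matches when the class attribute is exactly that string, whereas the CSS selector .alt1Active also matches elements that carry additional classes. A rough sketch of a closer equivalent follows. The classTokenTest helper and the sample markup are my own inventions for illustration, not part of Wildhoney's example or of Zend_Dom.

```php
<?php
// Hypothetical helper: builds an XPath test that matches one class token
// among several, the way the CSS selector ".name" does.
function classTokenTest(string $class): string
{
    return sprintf(
        "contains(concat(' ', normalize-space(@class), ' '), ' %s ')",
        $class
    );
}

// Made-up sample markup: the first block has an extra class, the second does not.
$html = '<div class="alt1Active first"><div><a href="/a">A</a></div></div>'
      . '<div class="alt1Active"><div><a href="/b">B</a></div></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact attribute comparison, as in the translation above.
$exact = $xpath->query('//*[@class="alt1Active"]//div//a');

// Token-based comparison, closer to what the CSS selector means.
$token = $xpath->query('//*[' . classTokenTest('alt1Active') . ']//div//a');

printf("exact: %d, token: %d\n", $exact->length, $token->length);
```

With this sample markup the exact comparison finds only the link in the second block, while the token test finds both. If the page never puts more than one class on those elements, the simpler expression is fine.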
Notes: Firstly, Zend_Dom uses the error-control operator (@) when loading the HTML (as I have above in the translation), but it would be better to use the libxml_*_errors functions to temporarily disable error reporting for libxml (which DOM uses) rather than the entire system. Secondly, one would normally opt for DOMDocument::loadHTMLFile rather than loading the string using file_get_contents and then passing it to the DOM object.
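To make the first note concrete, here is a minimal sketch of loading sloppy HTML with libxml_use_internal_errors and libxml_clear_errors instead of @. The loadHtmlQuietly function name and the sample markup are assumptions of mine for illustration. The sample is an inline string only so that it runs without a network connection.

```php
<?php
// Hypothetical helper (not from the original example): loads HTML while
// libxml collects parse warnings internally instead of emitting them.
function loadHtmlQuietly(string $html): DOMDocument
{
    $dom = new DOMDocument;

    $previous = libxml_use_internal_errors(true); // remember the old setting
    $dom->loadHTML($html);
    libxml_clear_errors();                         // discard the collected parse errors
    libxml_use_internal_errors($previous);         // restore the old setting

    return $dom;
}

// Deliberately sloppy, made-up markup: unclosed <p> and an unknown tag.
$html = '<div class="alt1Active"><div><p><a href="/forum">Forum</a><foo></div></div>';

$dom   = loadHtmlQuietly($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*[@class="alt1Active"]//div//a');

foreach ($links as $link) {
    printf("%s => %s\n", $link->getAttribute('href'), $link->nodeValue);
}
```

Unlike @, this only affects libxml's own warnings, so unrelated errors still surface. For the second note, the same wrapping should work around $dom->loadHTMLFile('http://www.talkphp.com/forums.php') in place of loadHTML, which removes the separate file_get_contents step.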