06-05-2009, 10:25 AM   #3
Salathe

Quote:
Originally Posted by Wildhoney
I am sure most would use regular expressions to scrape content, but it's worth asking, I suppose, if anybody uses DOMDocument already on TalkPHP? And if not, would you agree that it's more lucid and readable than regular expressions?
Like everyone, I've used regular expressions in the past (and probably will in the future) to scrape particular content. I've also used basic string functions (strpos, substr and friends) to do the same.

For XML (and HTML) documents, I've been using DOM for many years (though PHP 4's support left a lot to be desired). On other forums (sorry!), when people ask about scraping with regular expressions, I've often pointed them towards DOM where it seemed suitable.

Just to go through Wildhoney's example minus the Zend wrapper, a close translation would be:
PHP Code:
$szHtml   = file_get_contents('http://www.talkphp.com/forums.php');
$pDom     = new DOMDocument;
@$pDom->loadHTML($szHtml);
$pQuery   = new DOMXPath($pDom);
$pResults = $pQuery->query('//*[@class="alt1Active"]//div//a'); // XPath equivalent to CSS selector ".alt1Active div a"
foreach ($pResults as $pResult)
{
    printf('<a href="%s">%s</a><br />', $pResult->getAttribute('href'), $pResult->nodeValue);
}

Notes: Firstly, Zend_Dom uses the error-control operator (@) when loading the HTML (as I have in the translation above), but it would be better to use the libxml_*_errors functions (libxml_use_internal_errors and friends) to temporarily disable error reporting for libxml alone (which DOM uses) rather than suppressing every PHP error raised during the call. Secondly, one would normally opt for DOMDocument::loadHTMLFile rather than fetching the page as a string with file_get_contents and then passing it to the DOM object.
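
For what it's worth, here is a rough sketch of how the first note might look in practice. It is only an illustration, not how Zend_Dom does it: the markup in $szHtml is made-up sample data standing in for the forum page (so the snippet runs without a network connection), and $bPrevious and $aLinks are names I've invented for the example. With a real URL or file you would call loadHTMLFile in place of loadHTML, as per the second note.

```php
<?php
// Hypothetical sample markup standing in for the fetched forum page.
// The unclosed <p> is deliberate: real-world HTML is rarely tidy.
$szHtml = '<html><body><table>'
        . '<tr><td class="alt1Active"><div><a href="forumdisplay.php?f=1">General</a></div></td></tr>'
        . '<tr><td class="alt2"><div><a href="other.php">Ignored</a></div></td></tr>'
        . '</table><p>Unclosed paragraph</body></html>';

// Ask libxml to collect its errors internally rather than raising PHP warnings.
// The function returns the previous setting so it can be restored afterwards.
$bPrevious = libxml_use_internal_errors(true);

$pDom = new DOMDocument;
$pDom->loadHTML($szHtml); // with a URL or path: $pDom->loadHTMLFile($szUrl);

// Throw away whatever libxml collected and put the old setting back.
libxml_clear_errors();
libxml_use_internal_errors($bPrevious);

$pQuery = new DOMXPath($pDom);
$aLinks = array();
foreach ($pQuery->query('//*[@class="alt1Active"]//div//a') as $pResult)
{
    $aLinks[] = sprintf('<a href="%s">%s</a>', $pResult->getAttribute('href'), $pResult->nodeValue);
}

echo implode("<br />\n", $aLinks), "\n";
```

Only libxml's own parse errors are silenced here, so anything else that goes wrong is still reported as usual. If you want to inspect the parse errors rather than discard them, libxml_get_errors() returns them as an array before you call libxml_clear_errors().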