When scraping websites I've always used regular expressions, and I'm not sure why. I've long been aware of DOMDocument, and after reading about Zend_Dom today (essentially a DOMDocument wrapper with a very useful query() method added), I've decided it's much more elegant in terms of lucidity and readability. From now on, I think I'll use Zend_Dom for website scraping.
For example, to get all the links from the forum page of TalkPHP, we could use the following code, bearing in mind that query() returns a Zend_Dom_Query_Result object containing one or more DOMElement objects. DOMDocument, I feel, has never had the best documentation on PHP.net, and I think that's why I never became acquainted with it well enough to consider using it over regular expressions.
$szHtml = file_get_contents('http://www.talkphp.com/forums.php');
$pDom = new Zend_Dom_Query($szHtml);
$pResults = $pDom->query('.alt1Active div a');
This, to me, is much easier to read than regular expressions. It is clear what it's doing, and with the CSS-style selector for the query() method, it's ever so flexible. Incidentally, Zend_Dom_Query also accepts XPath syntax for locating elements.
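As a sketch, the same selection could be expressed in XPath via the queryXpath() method (the exact class markup on TalkPHP's page is an assumption here, as is the surrounding setup):

$szHtml = file_get_contents('http://www.talkphp.com/forums.php');
$pDom = new Zend_Dom_Query($szHtml);

// Equivalent XPath form of the CSS selector '.alt1Active div a'
$pResults = $pDom->queryXpath('//*[contains(@class, "alt1Active")]//div//a');

Which syntax to use is largely taste; the CSS selector is terser, while XPath offers more expressive conditions when you need them.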
It's also easy enough to extract the attributes with getAttribute.
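A minimal sketch of that, assuming the $pResults variable from the snippet above (Zend_Dom_Query_Result is iterable, and each item is a plain DOMElement):

// Loop over the matched anchor elements and pull out each link's URL
foreach ($pResults as $pElement) {
    echo $pElement->getAttribute('href'), "\n";
}

The same approach works for any attribute: getAttribute() simply returns an empty string when the attribute isn't present on the element.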
I am sure most people use regular expressions to scrape content, but it's worth asking, I suppose: does anybody on TalkPHP already use DOMDocument? And if not, would you agree that it's more lucid and readable than regular expressions?
I think from now on Zend_Dom will be my baby.