TalkPHP


aristoworks 02-28-2008 02:42 AM

Crawling Web Pages
 
I'm trying to build a website caching system but want to simply cache the html code from specific websites that I target.

I'm not talking about anything shady - rather to do backups and such.

Is PHP able to "suck" the HTML from a remote URL?

Thanks

TlcAndres 02-28-2008 02:48 AM

file_get_contents would do the trick just fine

aristoworks 02-28-2008 03:21 AM

Thanks for the tip but that was the first thing I tried and it wasn't returning anything. Any ideas?

Thanks
jw

Wildhoney 02-28-2008 03:59 AM

Some websites check that the User-Agent HTTP header is set. Every browser sends a user agent, unless it's home-made, so a request without one is a telltale sign of a robot rather than a person using a browser. To get around that, use cURL and set the user agent.

php Code:
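$szURL = 'http://www.talkphp.com/';
$szUserAgent = 'Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101';

$pCurl = curl_init($szURL);

// Return the response as a string instead of printing it directly.
curl_setopt($pCurl, CURLOPT_RETURNTRANSFER, true);
// Follow any redirects the server sends.
curl_setopt($pCurl, CURLOPT_FOLLOWLOCATION, true);
// Send a browser-like user agent with the request.
curl_setopt($pCurl, CURLOPT_USERAGENT, $szUserAgent);

$szContent = curl_exec($pCurl);

// curl_exec() returns false on failure; show the error rather than nothing.
if ($szContent === false) {
    $szContent = 'cURL error: ' . curl_error($pCurl);
}

curl_close($pCurl);

die($szContent);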

Salathe 02-28-2008 02:37 PM

You can also set the user agent through the user_agent setting in php.ini (or at runtime with the ini_set() function). Alternatively, create a stream context with stream_context_create() and specify the User-Agent header there. Both methods make the standard file functions (fopen/fread, file_get_contents, etc.) send the UA without needing the cURL extension.
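
For reference, here is a rough sketch of what those two approaches might look like. The user agent string 'MyBackupBot/1.0', the example.com URL, and the helper names buildContext() and fetchHtml() are made up for illustration and are not from this thread. It also assumes allow_url_fopen is enabled in php.ini, since file_get_contents cannot open remote URLs without it.

```php
<?php
// Hypothetical user agent string, chosen only for this example.
$szUserAgent = 'MyBackupBot/1.0';

// Approach 1: set the user_agent ini setting for the rest of the script.
// The standard file functions then send it on HTTP requests.
ini_set('user_agent', $szUserAgent);

// Approach 2: build a stream context that carries the user agent explicitly.
// buildContext() is an illustrative helper, not a built-in function.
function buildContext(string $szUserAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent'      => $szUserAgent,
            'follow_location' => 1,   // follow redirects
            'timeout'         => 10,  // seconds
        ],
    ]);
}

// Fetch a URL with the context above; returns null if the request fails.
// fetchHtml() is likewise an illustrative helper.
function fetchHtml(string $szURL, string $szUserAgent): ?string
{
    $szContent = @file_get_contents($szURL, false, buildContext($szUserAgent));

    return $szContent === false ? null : $szContent;
}

// Example call (commented out so the sketch makes no network request):
// $szContent = fetchHtml('http://www.example.com/', $szUserAgent);
```

Checking for null (or false from file_get_contents directly) at least tells you whether the request failed outright or the server simply returned an empty page.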

