TalkPHP
02-28-2008, 02:42 AM   #1
aristoworks (The Contributor)
Crawling Web Pages

I'm trying to build a website caching system that simply caches the HTML from specific websites that I target.

I'm not talking about anything shady; it's for backups and such.

Is PHP able to "suck" the HTML from a remote URL?

Thanks
02-28-2008, 02:48 AM   #2
TlcAndres (The Addict)

file_get_contents() would do the trick just fine.
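
A minimal sketch of that suggestion, assuming allow_url_fopen is enabled in php.ini (remote URLs will not open without it). The fetchHtml() helper name is hypothetical and not something from this thread:

```php
<?php
// Hypothetical helper: return the body at $szURL, or null if the read fails.
function fetchHtml(string $szURL): ?string
{
    // file_get_contents() returns false and raises a warning on failure;
    // the @ operator suppresses the warning so the caller can check for null.
    $szContent = @file_get_contents($szURL);

    return $szContent === false ? null : $szContent;
}

// Remote http:// URLs only work when allow_url_fopen is switched on.
if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
    echo "allow_url_fopen is disabled; remote fetches will fail.\n";
}
```

Calling fetchHtml('http://www.talkphp.com/') should then give back the page source as a string, or null when the request could not be completed.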
02-28-2008, 03:21 AM   #3
aristoworks (The Contributor)

Thanks for the tip, but that was the first thing I tried and it wasn't returning anything. Any ideas?

Thanks
jw
02-28-2008, 03:59 AM   #4
Wildhoney (La Vida es Sueño)

Some websites check that the User-Agent HTTP header is set. Every browser sends one, unless it is home-made, so a missing User-Agent is a telltale sign of a robot rather than a person using a browser. To get around that, use cURL and set the user agent.

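php Code:
<?php
$szURL = 'http://www.talkphp.com/';
// An example browser-like User-Agent string.
$szUserAgent = 'Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101';

$pCurl = curl_init($szURL);

// Return the response as a string instead of printing it directly.
curl_setopt($pCurl, CURLOPT_RETURNTRANSFER, true);
// Follow any HTTP redirects.
curl_setopt($pCurl, CURLOPT_FOLLOWLOCATION, true);
// Send the User-Agent header that some sites check for.
curl_setopt($pCurl, CURLOPT_USERAGENT, $szUserAgent);

$szContent = curl_exec($pCurl);

// curl_exec() returns false on failure; show the reason instead of nothing.
if ($szContent === false) {
    $szContent = 'cURL error: ' . curl_error($pCurl);
}

curl_close($pCurl);

die($szContent);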
02-28-2008, 02:37 PM   #5
Salathe (Moderateur)

You can also set the User Agent by specifying a value for the user_agent setting in php.ini (the ini_set() function will work too). Alternatively, create a stream context with stream_context_create() and specify the User-Agent header there. Both methods let the standard file functions (fopen/fread, file_get_contents, etc.) send the UA without using the cURL extension.
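
Both approaches might look roughly like the sketch below. The UA string 'MyBackupBot/1.0' and the buildUserAgentContext() helper are made up for illustration, and the context uses the http wrapper's user_agent option rather than a raw header line:

```php
<?php
// Hypothetical User-Agent string, used only for illustration.
$szUserAgent = 'MyBackupBot/1.0';

// Option 1: set the default UA for every URL-aware file function in this script.
ini_set('user_agent', $szUserAgent);

// Option 2: build a stream context that carries the UA for a single request.
function buildUserAgentContext(string $szUserAgent)
{
    return stream_context_create([
        'http' => [
            // Sent as the User-Agent: request header by the http wrapper.
            'user_agent' => $szUserAgent,
        ],
    ]);
}

$pContext = buildUserAgentContext($szUserAgent);
```

The context is passed as the third argument, e.g. file_get_contents('http://www.talkphp.com/', false, $pContext). With option 1 alone, a plain file_get_contents($szURL) call should pick up the UA without any context.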