01-20-2008, 07:32 AM
|
#1 (permalink)
|
|
The Addict
Join Date: Jan 2008
Location: los angeles
Posts: 309
Thanks: 44
|
How to parse the source code of a webpage
How do I grab the source code of a webpage and display it in, say, a textbox?
I'm trying to parse the source code of a webpage and, perhaps using preg_match, grab a portion of the code and display it.
|
|
|
|
01-20-2008, 10:11 AM
|
#2 (permalink)
|
|
The Frequenter
Join Date: Apr 2005
Location: South UK
Posts: 483
Thanks: 51
|
You're looking for what's known as "screen scraping" - essentially loading the webpage and "scraping" it for the data you want.
Take a look at some of these Google results; there are many tutorials out there that should get you started:
php screen scrape - Google Search
Alan
|
|
|
01-20-2008, 10:56 AM
|
#3 (permalink)
|
|
The Frequenter
Join Date: Dec 2007
Location: Bucharest, Romania
Posts: 438
Thanks: 3
|
cURL is the way to go. Here's some stuff to get you started:
Using curl to Query Remote Servers - PHP Tutorials
cURL and libcurl
You do need some patience to understand how this works; after that, it's a piece of cake :) Just ask if you don't understand something and I'll try to help you out, as I have a little experience with it.
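For the original question (grab a page's source and show it in a textbox), a minimal sketch along these lines might be a starting point. The helper names fetch_source() and source_as_textarea() are made up for illustration, and the timeout and textarea size are arbitrary choices; the cURL options themselves are standard PHP ones.

```php
<?php
// Hypothetical helper: fetch the raw source of a page with cURL.
// Returns the source as a string, or null if the request failed.
function fetch_source(string $url): ?string
{
    $ch = curl_init($url);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow redirects, and give up after 10 seconds
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $source = curl_exec($ch);
    return $source === false ? null : $source;
}

// Hypothetical helper: wrap source code in a read-only <textarea>.
// htmlspecialchars() escapes the markup so the browser shows it as text.
function source_as_textarea(string $source): string
{
    return '<textarea rows="20" cols="80" readonly>'
        . htmlspecialchars($source, ENT_QUOTES, 'UTF-8')
        . '</textarea>';
}
```

Calling fetch_source() with the address of the page and echoing source_as_textarea() on the result (after checking for null) should display the markup as text rather than rendering it.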
__________________
I have optimistic thoughts, even though sometimes (if not always) life's a bitch.
|
|
|
|
11-06-2008, 12:11 AM
|
#4 (permalink)
|
|
The Acquainted
Join Date: Jan 2008
Posts: 119
Thanks: 21
|
I have searched for cURL information online, but I am having trouble finding out how to single out the part you want to scrape. For example, I would like to get the 24-hour new snowfall amount from here:
Quote:
<div id="squeeze">
<div class="header-bar"></div>
<h1 class="title">Snowfall Tracker 2008-09</h1>
<div class="node ntype-story" id="node-3193">
<div class="content">
<div style="text-align: center;">
<table border="1" cellpadding="1" cellspacing="0" width="100%">
<tbody>
<tr>
<td class="table-header-a" bgcolor="#ccccff" width="21%"><div style="text-align: center;"><font size="-2"> <strong>REPORT
DATE</strong> </font></div></td>
<td colspan="2" class="table-header-a" bgcolor="#ccccff"><div style="text-align: center;"><font size="-2"> <strong>24 hr NEW
SNOW<br>
(inches - as of 6am)</strong> </font></div></td>
<td colspan="2" class="table-header-a" bgcolor="#ccccff"><div style="text-align: center;"><font size="-2"> <strong>SEASON
CUMULATIVE TOTALS</strong> (inches)</font></div></td>
</tr>
<tr class="table1Head">
<td width="21%"><div style="text-align: center;"><font size="-1"> <strong> </strong> </font></div></td>
<td width="21%"><div style="text-align: center;"><font size="-2"> <strong>6200'</strong> </font></div></td>
<td width="21%"><div style="text-align: center;"><font size="-2"> <strong>8200'</strong> </font></div></td>
<td width="19%"><div style="text-align: center;"><font size="-2"> <strong>6200'</strong> </font></div></td>
<td width="18%"><div style="text-align: center;"><font size="-2"> <strong>8200'</strong> </font></div></td>
</tr>
<tr class="alternateRow">
<td> Nov 4, 2008 </td>
<td> 4-6" </td>
<td> 8-10"</td>
<td> 6 </td>
<td> 18 </td>
</tr>
<tr>
<td> Nov 3, 2008 </td>
<td> 0-0" </td>
<td> 2-4"</td>
<td> 0 </td>
<td> 8 </td>
</tr>
<tr class="alternateRow">
<td> Nov 2, 2008 </td>
<td> 0" </td>
<td> 2-4"</td>
<td> 0 </td>
<td> 4 </td>
</tr>
<tr>
<td> Oct 10, 2008 </td>
<td> trace </td>
<td> trace </td>
<td> 0 </td>
<td> 0 </td>
</tr>
</tbody>
</table>
</div>
<p><strong>Note: Cumulative Totals are the total of new recorded snowfall and do not reflect base amounts.</strong> These figures approximate and are for natural snowfall and do not reflect snowmaking unless noted. </p>
</div>
</div>
</div>
|
How do I use cURL to zero in on what I am looking for? If you know of a good tutorial or site that covers this, I'll read it. I haven't come across one.
Thanks
|
|
|
|
11-06-2008, 12:51 AM
|
#5 (permalink)
|
|
The Contributor
Join Date: Nov 2008
Location: Norway
Posts: 58
Thanks: 20
|
Hello,
I made an attempt to solve your problem and, thanks to the manual, came up with the following solution. I am sure there are better ways to do this, but this seems to work:
PHP Code:
<?php
// Create a new cURL resource
$ch = curl_init();
// Set URL and other appropriate options
curl_setopt( $ch, CURLOPT_URL, 'http://localhost/test.html' );
curl_setopt( $ch, CURLOPT_HEADER, 0 );
// Ask cURL to return the page source instead of printing it
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
// Fetch the page source into a variable (curl_exec() returns false on failure)
$cache = curl_exec( $ch );
if ( $cache === false ) {
    die( 'cURL error: ' . curl_error( $ch ) );
}
// Close cURL resource, and free up system resources
curl_close( $ch );
// Perform an expression match
preg_match( "/<tr class=\"alternateRow\">[\s]+<td>([\w\d ,]+)<\/td>[\s]+<td>([\d\"\- ]+)<\/td>[\s]+<td>([\d\"\- ]+)<\/td>/im", $cache, $matches );
// Print the matches
print_r( $matches );
?>
It finds the following code and puts the three captured values into an array (the array actually has four elements, because element 0 holds the entire matched string):
Code:
<tr class="alternateRow">
<td> Nov 4, 2008 </td>
<td> 4-6" </td>
<td> 8-10"</td>
We will end up with this (at least I did):
Code:
Array
(
[0] => <tr class="alternateRow">
<td> Nov 4, 2008 </td>
<td> 4-6" </td>
<td> 8-10"</td>
[1] => Nov 4, 2008
[2] => 4-6"
[3] => 8-10"
)
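To pull every dated row rather than only the first one, preg_match_all could be used in much the same way. The following is a minimal sketch, assuming the table markup quoted earlier in the thread (two of its rows are embedded here as sample input). The looser pattern and the array keys at_6200ft and at_8200ft are my own choices for illustration, not part of the solution above.

```php
<?php
// Sample markup copied from the table quoted earlier in the thread
$html = <<<'HTML'
<tr class="alternateRow">
<td> Nov 4, 2008 </td>
<td> 4-6" </td>
<td> 8-10"</td>
<td> 6 </td>
<td> 18 </td>
</tr>
<tr>
<td> Nov 3, 2008 </td>
<td> 0-0" </td>
<td> 2-4"</td>
<td> 0 </td>
<td> 8 </td>
</tr>
HTML;

// Match any <tr> (with or without attributes) and capture its first three cells.
// The \s* parts absorb the padding around each value, so captures come out trimmed.
$pattern = '/<tr[^>]*>\s*<td>\s*(.*?)\s*<\/td>\s*<td>\s*(.*?)\s*<\/td>\s*<td>\s*(.*?)\s*<\/td>/is';

// PREG_SET_ORDER groups the captures row by row
preg_match_all($pattern, $html, $rows, PREG_SET_ORDER);

$snowfall = array();
foreach ($rows as $row) {
    $snowfall[] = array(
        'date'      => $row[1],
        'at_6200ft' => $row[2],
        'at_8200ft' => $row[3],
    );
}
print_r($snowfall);
```

Because the pattern insists on a plain <td> directly after the <tr>, the two header rows in the quoted table (whose cells carry attributes) should be skipped, while the "trace" row would still be captured.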
Let me know what you think, if it works and if there is anything you would do another way!
Yours,
Runar
|
|
|
|
The Following User Says Thank You to Runar For This Useful Post:
|
|
11-06-2008, 01:58 AM
|
#6 (permalink)
|
|
Moderateur
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
|
As much as I love regular expressions, I think that using the DOM extension is more suited to screen scraping in general. For the HTML snippet that buildakicker provided, perhaps something along the lines of the following might be useful.
PHP Code:
<?php
// Always use error reporting when developing
error_reporting(E_ALL | E_STRICT);
// Load in HTML string
$html = file_get_contents('snow.txt', false);
// Create new DOM document and load in our HTML fragment
$dom = new DOMDocument;
$dom->loadHTML($html);
// Create a new XPath handler for our HTML
$xpath = new DOMXPath($dom);
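// Find the first <tr class="alternateRow"> element
$block = $xpath->query('//tr[@class="alternateRow"][1]')->item(0);
// Stop early if the page contains no such row
if ($block === null) {
    exit('No <tr class="alternateRow"> row found.');
}
// Find the <td> elements inside our <tr>
$tds = $block->getElementsByTagName('td');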
// Create array holding our required snowfall data
$result = array
(
'date' => trim($tds->item(0)->nodeValue),
'24hrs' => array
(
'6200' => trim($tds->item(1)->nodeValue),
'8200' => trim($tds->item(2)->nodeValue)
),
'totals' => array
(
'6200' => trim($tds->item(3)->nodeValue),
'8200' => trim($tds->item(4)->nodeValue)
)
);
// Tell them something interesting!
printf('Snowfall for %s at 6200 feet was %s.', $result['date'], $result['24hrs']['6200']);
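The same DOM approach can be extended to walk every data row instead of only the first. This is a sketch under a few assumptions: the sample markup is a cut-down copy of the table quoted earlier in the thread, the XPath filter (five cells, not the table1Head row) is my own guess at what separates data rows from header rows, and the array keys are made up for illustration.

```php
<?php
// Sample markup: a cut-down copy of the table quoted earlier in the thread
$html = <<<'HTML'
<table>
<tr><td class="table-header-a">REPORT DATE</td><td colspan="2">24 hr NEW SNOW</td><td colspan="2">SEASON CUMULATIVE TOTALS</td></tr>
<tr class="table1Head"><td></td><td>6200'</td><td>8200'</td><td>6200'</td><td>8200'</td></tr>
<tr class="alternateRow"><td> Nov 4, 2008 </td><td> 4-6" </td><td> 8-10"</td><td> 6 </td><td> 18 </td></tr>
<tr><td> Nov 3, 2008 </td><td> 0-0" </td><td> 2-4"</td><td> 0 </td><td> 8 </td></tr>
</table>
HTML;

// Collect libxml parse warnings quietly instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Select rows with exactly five cells, skipping the elevation header row
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr[count(td) = 5 and not(@class="table1Head")]');

$snowfall = array();
foreach ($rows as $row) {
    $tds = $row->getElementsByTagName('td');
    $snowfall[] = array(
        'date'         => trim($tds->item(0)->nodeValue),
        'new_6200ft'   => trim($tds->item(1)->nodeValue),
        'new_8200ft'   => trim($tds->item(2)->nodeValue),
        'total_6200ft' => trim($tds->item(3)->nodeValue),
        'total_8200ft' => trim($tds->item(4)->nodeValue),
    );
}

foreach ($snowfall as $day) {
    printf("Snowfall for %s at 6200 feet was %s.\n", $day['date'], $day['new_6200ft']);
}
```

Filtering on structure rather than on the alternateRow class means rows are picked up whether or not they carry the class, which matters here because only every other data row has it.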
|
|
|
|
|
The Following User Says Thank You to Salathe For This Useful Post:
|
|
11-06-2008, 03:53 PM
|
#7 (permalink)
|
|
The Acquainted
Join Date: Jan 2008
Posts: 119
Thanks: 21
|
I have tried the DOM before, but I got these same errors. Is my server set up wrong or something?
Quote:
Warning: domdocument::domdocument() expects at least 1 parameter, 0 given in G:\xampp\htdocs\AIR\scrape.php on line 10
Fatal error: Call to undefined method domdocument::loadHTML() in G:\xampp\htdocs\AIR\scrape.php on line 11
|
|
|
|
|
11-06-2008, 03:57 PM
|
#8 (permalink)
|
|
The Acquainted
Join Date: Jan 2008
Posts: 119
Thanks: 21
|
Runar - thanks for the reply. I haven't got the hang of regular expressions yet, so I appreciate you showing some explanation!
|
|
|
|
11-06-2008, 04:24 PM
|
#9 (permalink)
|
|
The Acquainted
Join Date: Jan 2008
Posts: 119
Thanks: 21
|
So do you have to traverse all of the markup to get to the spot you want to scrape?
I.e., can I start anywhere with preg_match? What if the <tr> does not have a class or id? Do I start at the top and work down to it?
Quote:
// Perform an expression match
preg_match( "/<tr class=\"alternateRow\">[\s]+<td>([\w\d ,]+)<\/td>[\s]+<td>([\d\"\- ]+)<\/td>[\s]+<td>([\d\"\- ]+)<\/td>/im", $cache, $matches );
|
|
|
|
|
11-06-2008, 04:47 PM
|
#10 (permalink)
|
|
The Contributor
Join Date: Nov 2008
Location: Norway
Posts: 58
Thanks: 20
|
This is why it is a very bad idea to build entire sites using tables, without any ids and classes.
Yes, you may start anywhere using preg_match, but if there are lots of unnamed table rows or cells (or divs, for that matter), preg_match is probably not the best solution. If you insist on using regular expressions, then you should know that [\s]+ matches one or more whitespace characters, including spaces and line breaks, which is useful when the markup you are matching spans more than one line.
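To illustrate both points: a row with no class or id can still be targeted by anchoring the pattern on text known to appear in it, and \s+ bridges the line breaks between cells. This is a minimal sketch, assuming the markup quoted earlier in the thread and using the "Nov 3, 2008" row as the target; the patterns are illustrative only.

```php
<?php
// One unclassed row, copied from the table quoted earlier in the thread
$html = <<<'HTML'
<tr>
<td> Nov 3, 2008 </td>
<td> 0-0" </td>
<td> 2-4"</td>
</tr>
HTML;

// Without \s+ the pattern cannot get past the line break between the cells
$without = preg_match('/<td> Nov 3, 2008 <\/td><td>/', $html);

// \s+ matches one or more whitespace characters, including line breaks,
// so the pattern can start at the date and continue into the next cell
$with = preg_match('/<td>\s*Nov 3, 2008\s*<\/td>\s+<td>\s*([^<]*?)\s*<\/td>/', $html, $m);

var_dump($without); // int(0)
var_dump($with);    // int(1)
echo $m[1], "\n";   // 0-0"
```

The pattern starts at the date cell, so nothing above it in the page has to be matched first.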
|
|
|
11-06-2008, 05:19 PM
|
#11 (permalink)
|
|
The Acquainted
Join Date: Jan 2008
Posts: 119
Thanks: 21
|
so dom?
So using the DOM would be better? Is there a way to do it with jQuery or another JS library that would be better than using PHP? I have not been able to get the PHP method to work correctly.
EDIT!
Sweet... so DOMDocument it is. I am running this on my local server and hadn't commented out the domxml extension in php.ini.
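A quick check along these lines might help confirm which extension is answering to the DOMDocument name. This is a sketch: dom_status() is a made-up helper name, and the messages it returns are only suggestions based on the fix described in this post.

```php
<?php
// Hypothetical helper: report whether the DOM extension is usable.
function dom_status(): string
{
    // The old domxml extension defines its own, incompatible DomDocument class
    if (extension_loaded('domxml')) {
        return 'The old domxml extension is loaded; disable it in php.ini.';
    }
    // The DOM extension must be loaded and must provide loadHTML()
    if (!extension_loaded('dom') || !method_exists('DOMDocument', 'loadHTML')) {
        return 'The DOM extension is not available.';
    }
    return 'DOMDocument::loadHTML() is available.';
}

echo dom_status(), "\n";
```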
Last edited by buildakicker : 11-06-2008 at 07:15 PM.
Reason: typing faster than my brain can work.
|
|
|
|
|