TalkPHP
 
 
10-01-2009, 10:04 PM   #1
pixelgod
Parsing HTML

Hi guys,
I don't even know where to start with this one. I have about a hundred HTML directory pages, and I want to convert them into a MySQL database. All of the listings are enclosed in <li> and <hr> tags, so I think splitting them up should be fairly easy. The hard part will probably be extracting the information out of each listing.

I found lots of resources for scraping links, but I'm not experienced enough to convert that to my application. This is where I'm at so far:
Code:
$url = "#";
$input = @file_get_contents($url) or die('Could not access file!');
$regexp = "regex"; // <- what do I put for this?
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // process $matches here
} else {
    echo "No matches found!";
}
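
One possible answer to the $regexp question, offered as a sketch rather than a tested solution: capture everything from each opening <li> up to the next <hr>. The sample string in $input is made up for illustration. Because the original call already uses the U (ungreedy) modifier, .* stops at the first <hr> it meets instead of the last one.

```php
<?php
// Hypothetical sample input; the real pages would come from file_get_contents().
$input = "<LI>First listing<br>Line two<hr>\n<LI>Second listing<hr>";

// Capture everything between an opening <li> and the next <hr>.
// s: dot also matches newlines, i: case-insensitive, U: quantifiers are ungreedy.
$regexp = '<li\b[^>]*>(.*)<hr\b[^>]*>';

$listings = array();
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // $matches[1] holds the captured body of each listing.
    foreach ($matches[1] as $body) {
        $listings[] = trim($body);
    }
} else {
    echo "No matches found!";
}

print_r($listings);
```

Each entry in $listings is then the raw HTML of one listing, which can be taken apart separately. A listing with no <hr> after it would be skipped by this pattern, so it is worth checking the last entry of every page by hand.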
10-01-2009, 10:32 PM   #2
etoolbox

Take a look at the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/

It makes extracting information from HTML pages very easy.
__________________
Chris Hope's LAMP Blog: http://www.electrictoolbox.com/
10-01-2009, 11:15 PM   #3
pixelgod

Quote:
Originally Posted by etoolbox
Take a look at the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/

It makes extracting information from HTML pages very easy.

Thanks, I already looked into that, but I still don't know how to make it work with what I have. How can I get it to parse just from the <li> tag, stop at the <hr> tag, and repeat?
10-02-2009, 07:18 AM   #4
hjalmar

So would it work to just get the node values? With DOMDocument, perhaps?

PHP Code:
class Dom
{
    // The parsed document.
    public $dom;

    public function __construct($url)
    {
        $content = file_get_contents($url);

        $this->dom = new DOMDocument();
        // Suppress the warnings libxml raises on malformed HTML.
        @$this->dom->loadHTML($content);
    }

    // Return every element in the document with the given tag name.
    public function getElements($element)
    {
        return $this->dom->getElementsByTagName($element);
    }
}

$dom  = new Dom("http://se2.php.net/manual/en/domdocument.loadhtml.php");
$list = $dom->getElements("li");

// Print the text content of each <li>, one per line.
foreach ($list as $li) {
    print "{$li->nodeValue}<br />";
}

I'm still learning myself and very new to DOMDocument; hopefully this is of some help and is what you wanted.

cheers
10-03-2009, 05:49 AM   #5
pixelgod

Thank you very much guys, but I still don't know how to get it to stop at the <hr> and repeat at the next <li>
10-03-2009, 06:29 AM   #6
hjalmar

Quote:
Originally Posted by pixelgod
Thank you very much guys, but I still don't know how to get it to stop at the <hr> and repeat at the next <li>
Is it even valid markup? What does the markup look like? Share the markup you want to traverse/parse and you will get a much better response.
10-04-2009, 06:06 PM   #7
pixelgod

Okay, this is how they're all structured. This one has all of the information; some of the others are missing parts.

Code:
<LI>
<table><tr>
<td><IMG SRC="#"></td>
<td><a href="#">[Name] - [Location]</a> -- [Short Description]<br>
[Company]<br>
<a href="mailto:[email]">[email]</a><br>
[address1]<br>
[address2]<br>
[phone1]<br>
[phone2]<br>
[phone3]</td>
<td><img src="#"><!--image2--></td></tr></table>
<hr>
EDIT: I'm almost thinking this is impossible, because some listings are missing a phone number, for example, and those are also missing the <br> at the end.

If it is impossible, then I guess I could just copy the entire code to the database as a listing, and not allow the old users to have an edit feature.
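
Missing lines need not make this impossible. Here is a rough sketch, assuming every listing keeps the name link first and puts the email in a mailto: link: pick those out by what they are rather than by position, and keep whatever lines remain as an ordered list. The function name parseListing and the sample data are made up for illustration and do not come from the real pages.

```php
<?php
// Hypothetical helper: split the HTML of one listing into named fields.
function parseListing($html)
{
    $fields = array(
        'name' => null, 'location' => null, 'description' => null,
        'email' => null, 'other' => array(),
    );

    // The email is identified by its mailto: link, not by its position.
    if (preg_match('/<a\s+href="mailto:([^"]+)"/i', $html, $m)) {
        $fields['email'] = trim($m[1]);
    }

    // Drop the mailto link so the address is not repeated in 'other'.
    $html = preg_replace('/<a\s+href="mailto:[^"]*"[^>]*>.*?<\/a>/is', '', $html);

    // Turn each <br> into a newline, strip the remaining tags and comments,
    // then keep only the non-empty lines.
    $text  = preg_replace('/<br\s*\/?>/i', "\n", $html);
    $text  = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $lines = array_values(array_filter(array_map('trim', explode("\n", $text)), 'strlen'));

    // First line has the form "[Name] - [Location] -- [Short Description]".
    if ($lines) {
        $first   = array_shift($lines);
        $parts   = explode(' -- ', $first, 2);
        $nameLoc = explode(' - ', $parts[0], 2);
        $fields['name']        = trim($nameLoc[0]);
        $fields['location']    = isset($nameLoc[1]) ? trim($nameLoc[1]) : null;
        $fields['description'] = isset($parts[1]) ? trim($parts[1]) : null;
    }

    // Company, address and phone lines in page order; absent ones are simply not there.
    $fields['other'] = $lines;

    return $fields;
}

// Made-up listing with one address line and one phone number missing.
$listing = '<table><tr>
<td><IMG SRC="#"></td>
<td><a href="#">Jane Doe - Springfield</a> -- Web design<br>
Example Co<br>
<a href="mailto:jane@example.com">jane@example.com</a><br>
12 Main St<br>
555-0100</td>
<td><img src="#"><!--image2--></td></tr></table>';

$result = parseListing($listing);
print_r($result);
```

Telling an address line from a phone line inside 'other' would still need a further rule, for example treating lines made up mostly of digits as phone numbers. A name that itself contains " - " would also be split in the wrong place, so the results are worth spot-checking before they go into the database.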


All times are GMT.

 
     
