TalkPHP


pixelgod 10-01-2009 10:04 PM

Parsing HTML
 
Hi guys,
I don't even know where to start with this one. I have about a hundred HTML directory pages, and I want to convert them into a MySQL database. All of the listings are enclosed in <li> and <hr> tags, so I think this should be fairly easy. The hard part will probably be extracting the information out of each listing.

I found lots of resources for scraping links, but I'm not experienced enough to convert that to my application. This is where I'm at so far:
Code:

$url = "#";
$input = @file_get_contents($url) or die('Could not access file!');
$regexp = "regex"; // <- what do I put for this?
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // process $matches here
} else {
    echo "No matches found!";
}
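
For the $regexp question above, here is a minimal sketch of one possible pattern. It assumes each listing starts at an <li> tag and ends at the next <hr> tag, as described in the post. The sample $input string and the $listings variable are made up for illustration; real pages may need a looser pattern.

PHP Code:

```php
<?php
// Hypothetical sample input; the real pages would come from file_get_contents().
$input = "<LI>First listing<br>Phone 1<hr>\n<LI>Second listing<hr>";

// Capture everything from an opening <li> tag up to the next <hr> tag.
// s: "." also matches newlines, i: case-insensitive, U: quantifiers are lazy by default.
$regexp = '<li\b[^>]*>(.*)<hr\b[^>]*>';

$listings = [];
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    // $matches[1] holds the captured text of each listing.
    $listings = array_map('trim', $matches[1]);
}

print_r($listings);
```

Each entry of $listings is still raw HTML, so the individual fields have to be pulled out of it in a second step.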


etoolbox 10-01-2009 10:32 PM

Take a look at the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/

It makes extracting information from HTML pages very easy.

pixelgod 10-01-2009 11:15 PM

Quote:

Originally Posted by etoolbox (Post 28627)
Take a look at the PHP Simple HTML DOM Parser:
http://simplehtmldom.sourceforge.net/

It makes extracting information from HTML pages very easy.


Thanks, I already looked into that, but I still don't know how to make it work with what I have. How can I get it to start parsing at the <li> tag, stop at the <hr> tag, and repeat?

hjalmar 10-02-2009 07:18 AM

So would it work to just get the node values? With DOMDocument?

PHP Code:

class Dom
{
    public $dom;

    public function __construct($url)
    {
        // Fetch the page source.
        $content = file_get_contents($url);

        $this->dom = new DOMDocument();
        // @ suppresses the warnings loadHTML() emits for invalid markup.
        @$this->dom->loadHTML($content);
    }

    // Return all elements with the given tag name.
    public function getElements($element)
    {
        return $this->dom->getElementsByTagName($element);
    }
}

$dom  = new Dom("http://se2.php.net/manual/en/domdocument.loadhtml.php");
$list = $dom->getElements("li");

foreach ($list as $li) {
    print "{$li->nodeValue}<br />";
}


I'm still learning myself and very new to DOMDocument; hopefully this was of some help and is what you wanted.

cheers

pixelgod 10-03-2009 05:49 AM

Thank you very much guys, but I still don't know how to get it to stop at the <hr> and repeat at the next <li>
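
One way to look at it, as a rough sketch rather than a definitive answer: with DOMDocument there may be no need to stop at the <hr> explicitly, because the parser starts a new <li> element whenever it meets the next <li> tag, so each listing ends up as its own node. The sample markup and variable names below are assumptions for illustration; badly malformed pages may parse differently.

PHP Code:

```php
<?php
// Hypothetical sample: two listings, each opened by <li> and followed by <hr>.
$html = '<ul><li>Alice - Boston<br>555-0100<hr><li>Bob - Denver<hr></ul>';

// Collect parser warnings quietly instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$listings = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    // textContent is the text of the <li> and everything nested inside it.
    $listings[] = trim($li->textContent);
}

print_r($listings);
```

Note that textContent runs the <br>-separated lines together with no separator, so keeping the fields apart would mean walking the child nodes or splitting the raw HTML instead.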

hjalmar 10-03-2009 06:29 AM

Quote:

Originally Posted by pixelgod (Post 28643)
Thank you very much guys, but I still don't know how to get it to stop at the <hr> and repeat at the next <li>

Is it even valid markup? What does the markup look like? Share the markup you want to traverse/parse and you will get a much better response.

pixelgod 10-04-2009 06:06 PM

Okay, this is how they're all structured. This one has all of the information; some of the others are missing parts.

Code:


<LI>
<table><tr>
<td><IMG SRC="#"></td>
<td><a href="#">[Name] - [Location]</a> -- [Short Description]<br>
[Company]<br>
<a href="mailto:[email]">[email]</a><br>
[address1]<br>
[address2]<br>
[phone1]<br>
[phone2]<br>
[phone3]</td>
<td><img src="#"><!--image2--></td></tr></table>
<hr>

EDIT: I'm almost thinking this is impossible, because some listings are missing a phone number, for example, and they're also missing the <br> at the end.

If it is impossible, then I guess I could just copy the entire code to the database as a listing, and not allow the old users to have an edit feature.
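
It may not be impossible. As a hedged sketch: if the fields are classified by what they look like rather than by their position, missing phone numbers or a missing final <br> should not matter much. The sample listing, the $record keys, and the classification rules below are all assumptions for illustration, not something taken from the real pages.

PHP Code:

```php
<?php
// Hypothetical listing with the second address line and two phone numbers missing,
// and no <br> after the last field.
$listing = '<td><a href="#">Jane Doe - Springfield</a> -- Plumber<br>
Doe Plumbing<br>
<a href="mailto:jane@example.com">jane@example.com</a><br>
12 Main St<br>
555-0100</td>';

// Split on <br> (any case, with or without a slash), then strip tags and whitespace.
$lines = preg_split('/<br\s*\/?>/i', $listing);
$lines = array_map(function ($line) {
    return trim(html_entity_decode(strip_tags($line)));
}, $lines);
$lines = array_values(array_filter($lines, 'strlen'));

// The first line is assumed to be "[Name] - [Location] -- [Short Description]".
$record = ['heading' => array_shift($lines), 'company' => null,
           'email' => null, 'address' => [], 'phones' => []];

foreach ($lines as $line) {
    if (filter_var($line, FILTER_VALIDATE_EMAIL)) {
        $record['email'] = $line;
    } elseif (preg_match('/^[\d\s().+-]{7,}$/', $line)) {
        // Assumption: a line made only of digits and phone punctuation is a phone number.
        $record['phones'][] = $line;
    } elseif ($record['company'] === null && $record['email'] === null) {
        // Assumption: the company, when present, comes before the email.
        $record['company'] = $line;
    } else {
        $record['address'][] = $line;
    }
}

print_r($record);
```

These rules are only heuristics. For example, a listing with neither a company nor an email would have its first address line stored as the company, so the results would still need a manual check before going into the database.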

