TalkPHP
 
 
03-21-2008, 04:29 AM   #1
delayedinsanity
Building an array of matches...

I want to take a page of HTML and put everything in paragraph tags into an array. So far I've failed, obviously... Can this be done easily enough with a regular expression match, or is there a better way? This is the short piece of code I've been testing with so far:

PHP Code:
$match = "<p>test1</p> <p>test2</p>";
$worked = preg_match_all("/(\<p\>.*\<\/p\>)/", $match, $matches);
The problem is that it returns the whole string as a single match. Augh!
03-21-2008, 05:07 AM   #2
delayedinsanity

Hmm, feeling a little silly now, I changed it to

PHP Code:
$worked = preg_match_all("/(\<p\>[a-z0-9_]*\<\/p\>)/i", $match, $matches);
Now, I just have to figure out how to allow other tags inside the P tags... question still stands though, is there a better method for doing this, or should I just keep with the regular expression till I get it?
03-21-2008, 01:17 PM   #3
Wildhoney

You're definitely going down the right path with the regular expressions -- but what precisely are you trying to do, just get everything between the 2 P tags? How about something like the following:

PHP Code:
preg_match_all('~<p>(.+?)</p>~i', $match, $matches);
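
To see why the lazy quantifier matters, here is a minimal sketch that runs a greedy pattern and the lazy one above against the test string from post #1. The variable names `$greedy` and `$lazy` are only illustrative, and the greedy pattern is a simplified form of the original attempt.

```php
<?php
$match = "<p>test1</p> <p>test2</p>";

// Greedy: .* runs to the end of the string, then backtracks to the LAST </p>
preg_match_all('~<p>(.*)</p>~i', $match, $greedy);

// Lazy: .+? stops at the FIRST </p> it can reach
preg_match_all('~<p>(.+?)</p>~i', $match, $lazy);

print_r($greedy[1]); // one element: "test1</p> <p>test2"
print_r($lazy[1]);   // two elements: "test1" and "test2"
```

The captured text ends up in index 1 of the matches array; index 0 holds the full matches including the tags.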
03-21-2008, 03:58 PM   #4
Salathe

Why not use DOM in this instance? It will provide a far more reliable means of grabbing the paragraph elements than trying to delve into the intricacies of a suitable regular expression.

For example:

PHP Code:
<?php

/*
    Load the HTML document. It is a good idea 
    to cache the remote document rather than
    load it from the remote server every time 
    the script is called 
*/
$dom = new DOMDocument();
// loadHTMLFile() is an instance method; calling it statically fails on PHP 8.
// The @ suppresses warnings about malformed markup.
@$dom->loadHTMLFile('http://lipsum.com/feed/html');

/*
    Grab all paragraph elements in the document.
    $nodes is a DOMNodeList object
*/
$nodes = $dom->getElementsByTagName('p');

/*
    Quick debugging to see what we've got 
*/
header('Content-Type: text/plain; charset=utf-8');
foreach ($nodes as $p)
{
    // Could use $p->textContent if we only wanted
    // the text content (no HTML tags)
    var_dump($dom->saveXML($p));
}
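
For trying this without a network request, the same idea can be sketched against an HTML string. This is only an illustration: the sample markup is shortened from the thread, and `$paragraphs` is a hypothetical name for the result array. It collects each paragraph both with its inner tags and as plain text.

```php
<?php
// Sample markup; a real page would be loaded the same way.
$html = '<p>Fusce porta pede nec eros.</p>'
      . '<p>Maecenas viverra. <a href="">Sed feugiat.</a> Donec mattis.</p>';

$dom = new DOMDocument();
// The @ hides parser warnings about imperfect markup
@$dom->loadHTML($html);

$paragraphs = []; // hypothetical name for the result array
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = [
        'html' => $dom->saveHTML($p), // the element with its tags and children
        'text' => $p->textContent,    // text only, inner tags stripped
    ];
}

print_r($paragraphs);
```

Each entry keeps the paragraph element and everything inside it, so inner tags such as the link survive in the `html` value.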
03-21-2008, 04:01 PM   #5
delayedinsanity

Yeah, everything between an opening and closing P including other tags, etc. So that the following,

HTML Code:
<p>Fusce porta pede nec eros. Maecenas ipsum sem, interdum non, aliquam vitae, interdum nec, metus. Maecenas ornare lobortis risus. Etiam placerat varius mauris.</p>

<p>Maecenas viverra. <a href="">Sed feugiat.</a> Donec mattis quam aliquam risus. Proin quis massa semper felis euismod ultricies.</p>
...for example, would return two matches.
03-23-2008, 03:15 PM   #6
Geert

I worked out Wildhoney's regex a bit further. It now also allows newlines inside p elements, as well as HTML attributes on the opening tag.

Code:
#<p\b[^>]*+>(.+?)</p>#is
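
As a quick check of that pattern, here is a small sketch using made-up sample markup (the `$html` string is an assumption, not from the thread) with an attribute on the first paragraph, a newline inside it, and a link inside the second.

```php
<?php
$html = "<p class=\"intro\">First\nparagraph</p>\n<p>Second <a href=\"\">link</a></p>";

// \b stops <p from also matching tags such as <pre>;
// [^>]*+ skips any attributes on the opening tag;
// the s modifier lets . match newlines, i ignores case
preg_match_all('#<p\b[^>]*+>(.+?)</p>#is', $html, $matches);

print_r($matches[1]); // two elements: the inner content of each paragraph
```

Like any regex approach to HTML, this is still likely to trip over unclosed or unusual markup, which is where the DOM route tends to be safer.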
03-24-2008, 07:50 PM   #7
delayedinsanity

Salathe: I'll look more into that. Does it grab the elements and everything in between them, though, or does it just go through and match the elements themselves?

Geert: Thank you, I've actually gotten slowed down working on the design again and less on the coding, but I should get back into it in the next day or two here and I'll give that a try.