TalkPHP


bluesaga 12-12-2007 12:16 AM

Majorly Advanced Regex
 
Looking for some clever clogs to figure out something for me:

Currently the strip_tags PHP function is rather rubbish, and simple folk must have written it! It doesn't account for angle brackets inside tag attributes.

So, for example, say you have the HTML code:
Code:

<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly');">Boo Boo</a>
and you run it through strip_tags, PHP will return '>>>>>e brackets are smelly');">Boo Boo'

What I am requesting is some regex that will handle it as it should, returning 'Boo Boo'. I've been fiddling with lookaheads, lookbehinds and lookarounds and just can't get it to match the whole tag!

Salathe 12-12-2007 01:28 AM

I've arrived at a pattern (it's not majorly advanced, even if it has that appearance), starting from one offered elsewhere in answer to the same question.
#<.*?(?:\s+[\w\W]+?(?:\s*=\s*([\'"]?).*?(?<!\\\\)\\1))*?\>#s
It basically looks for tags (items surrounded by <>), with the bulk of the pattern catering for optional content within the tag (attributes or other random junk).
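
For anyone who finds the one-liner hard to read, here is the same pattern laid out with the x (extended) modifier so that each piece can carry a comment. Treat it as a reading aid rather than a new solution. Two things are swapped in purely for presentation and are not in the original: the ~ delimiter (so that # can start a comment) and a nowdoc (so no PHP-level escaping is needed). The \> is also written as a plain >.

```php
<?php

// Same pattern as above, reformatted for reading. Assumptions: "~" replaces
// "#" as the delimiter, and a nowdoc (PHP 5.3+) replaces the quoted string.
$pattern = <<<'REGEX'
~
<                   # opening angle bracket
.*?                 # tag name (or anything else), as little as possible
(?:                 # optional attribute-like chunk, repeated lazily
    \s+ [\w\W]+?    #   whitespace, then a name made of any characters
    (?:
        \s* = \s*   #   equals sign with optional whitespace around it
        (['"]?)     #   optional opening quote, captured as group 1
        .*?         #   attribute value, as little as possible
        (?<!\\) \1  #   the same quote again, not preceded by a backslash
    )
)*?
>                   # closing angle bracket
~sx
REGEX;

// bluesaga's example; inside single quotes, \" stays as a literal backslash + quote.
$html = '<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly\');">Boo Boo</a>';

echo preg_replace($pattern, '', $html), "\n"; // prints: Boo Boo
```

The lookbehind is what lets the match run past the \" in the onclick value: a quote preceded by a backslash is not accepted as the closing quote.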

In tuning the pattern, I also wrote up a quick series of tests (generally something I do with all code snippets) which you can try out yourself.

php Code:
<?php

$tests = array(
    '<aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>moo',
    '<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly\');">Boo Boo</a>',
    '<img src="image.gif" onload="if (this.width<50) {this.src=\'image2.gif\'; this.width=\'120\'; this.height=\'90\'}">
<p>This is some text</p>'
,
    '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>'
,
    file_get_contents('http://example.org/')
);

// Tweaked from http://forums.devnetwork.net/viewtopic.php?t=25494
$pattern = '#<.*?(?:\s+[\w\W]+?(?:\s*=\s*([\'"]?).*?(?<!\\\\)\\1))*?\>#s';

foreach ($tests as $id => $test)
{
    $start  = microtime(true);
    $result = preg_replace($pattern, '', $test);
    $time   = round((microtime(true) - $start) * 1000, 6);
    printf('<h4>TEST %d (%s ms)</h4><pre>%s</pre>', $id + 1, $time, $result);
    echo "\n";
}

It's only a very quick solution so there could well be huge flaws!! I'll look over it for potential problems when it's not so late and my eyes aren't struggling to focus on the screen. 8-)

Geert 12-12-2007 05:17 PM

Quote:

Originally Posted by bluesaga (Post 6344)
So for example you have the html code:
Code:

<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly');">Boo Boo</a>

Are you aware that this is actually invalid HTML? As far as I know, HTML does not allow embedded quotes to be escaped with a backslash. Put that link in a file and open it in a browser: the JavaScript won't work, and all you'll see is the part that follows the first \"> in the code below:
Code:

<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly');">Boo Boo</a>
So the question is whether you really want to match HTML strings like this, because when you match your opening tag beyond the \", you're carrying on past the point where normal HTML browsers stop.
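
To illustrate the difference, here is a rough sketch of a stripper that follows the browser's rule instead: a quoted attribute value ends at the next matching quote, backslash or not. The strip_tags_quoted() name and the pattern are made up for this example and do not come from the posts above. It also makes no attempt to handle comments, CDATA or script bodies.

```php
<?php

// Hypothetical helper (name and pattern are illustrative assumptions).
// A tag is "<", then any run of double-quoted chunks, single-quoted chunks,
// or characters that are neither a quote nor ">", then ">".
// Backslashes get no special treatment, matching how HTML reads attributes.
function strip_tags_quoted($html)
{
    return preg_replace('~<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>~', '', $html);
}

// A "<" or ">" inside a properly quoted attribute no longer ends the tag.
echo strip_tags_quoted('<img onload="if (this.width<50) this.border=1" alt="x">text'), "\n";
// prints: text

// bluesaga's example: the \" closes the attribute, so the tag ends right after it.
echo strip_tags_quoted('<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly\');">Boo Boo</a>'), "\n";
// prints: these angl>>>>>>e brackets are smelly');">Boo Boo
```

So a browser-style match leaves the stray text behind, while a pattern that honours \" removes it. Which one you want depends on whether you are cleaning up real HTML or input written with backslash escapes.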

