I've arrived at a pattern (it's not as advanced as it looks), starting from one offered
in another thread where the same question was asked.
#<.*?(?:\s+[\w\W]+?(?:\s*=\s*([\'"]?).*?(?<!\\\\)\\1))*?\>#s
It basically looks for tags (items surrounded by <>), with the bulk of the pattern catering for optional content within the tag (attributes or other random junk).
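To make the pieces easier to see, here is a rough annotated sketch of the same pattern written in extended (x) mode. The ~ delimiter, the nowdoc wrapper, the variable names and the sample string are my own choices for illustration and are not part of the original script. The nowdoc holds the regex as the engine sees it, i.e. after PHP's single-quote unescaping, so \\\\ becomes \\ and \\1 becomes \1.

```php
<?php
// The pattern exactly as it is written in the test script.
$original = '#<.*?(?:\s+[\w\W]+?(?:\s*=\s*([\'"]?).*?(?<!\\\\)\\1))*?\>#s';

// The same pattern spread out in extended mode, with comments.
// A different delimiter is used because a hash starts a comment in x mode.
$verbose = <<<'REGEX'
~
<                     # opening angle bracket
.*?                   # tag name (or anything else), as little as possible
(?:                   # one attribute-like chunk, repeated lazily
    \s+ [\w\W]+?      # whitespace, then the attribute name or other junk
    (?:
        \s* = \s*     # equals sign with optional whitespace around it
        (['"]?)       # optional opening quote, captured as group 1
        .*?           # the attribute value, as little as possible
        (?<!\\) \1    # the same quote again, not preceded by a backslash
    )
)*?
\>                    # closing angle bracket
~sx
REGEX;

// Hypothetical sample input, not one of the original tests.
$sample = '<a title="x > y">link</a> and <b>bold</b>';

// Both forms should strip the same tags from the sample.
var_dump(preg_replace($original, '', $sample) === preg_replace($verbose, '', $sample));
```

The quoted-value part is what lets a > inside an attribute value survive without ending the tag early, which is the case the sample string is meant to exercise.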
In tuning the pattern, I also wrote up a quick series of tests (generally something I do with all code snippets) which you can try out yourself.
php Code:
<?php
$tests = array(
    '<aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>moo',
    '<a onclick="javascript.writeln(\">these angl>>>>>>e brackets are smelly\');">Boo Boo</a>',
    '<img src="image.gif" onload="if (this.width<50) {this.src=\'image2.gif\'; this.width=\'120\'; this.height=\'90\'}">
<p>This is some text</p>',
    '<TD WIDTH="14%" BACKGROUND="images.jpg"><A HREF="http://something.xxx">
<IMG SRC="image.gif" BORDER="0" ONLOAD="if (this.width>50) this.border=1" ALT="Preview by Thumbshots"
WIDTH="45">testestets>blah</A></TD>',
    file_get_contents('http://example.org/')
);

// Tweaked from http://forums.devnetwork.net/viewtopic.php?t=25494
$pattern = '#<.*?(?:\s+[\w\W]+?(?:\s*=\s*([\'"]?).*?(?<!\\\\)\\1))*?\>#s';

foreach ($tests as $id => $test) {
    // Time each replacement in milliseconds
    $start = microtime(true);
    $result = preg_replace($pattern, '', $test);
    $time = round((microtime(true) - $start) * 1000, 6);

    printf('<h4>TEST %d (%s ms)</h4><pre>%s</pre>', $id + 1, $time, $result);
    echo "\n";
}
It's only a very quick solution, so there could well be huge flaws! I'll look over it for potential problems when it's not so late and my eyes aren't struggling to focus on the screen.
