TalkPHP


xenon 12-11-2007 05:38 PM

RegEx
 
Hello people. Over time, I've run into some issues with regex (among other things) that still aren't clear to me. Either it's advanced stuff, or I simply couldn't find an article that answers my questions. So, here they go:

1. I have some HTML-formatted text:
Code:

<p>some <span>text</span></p><span>another</span>
...and a rule that *should* match the text in the first span (between the opening and its corresponding closing tag):
Code:

/<span(?.*)>(.*)<\/span>/
That should do (although it's pretty crude, since only one pair of tags will be matched), but the question is: how do I know that $2 will match only "text" and not "text</span></p><span>another"? How can I tell the regex engine to "stop when you run into the first </span>"? I believe the answer is related to the next question...

2. I'd be very thankful if someone who knows regular expressions well and has a little time could explain to me (and others, of course) the difference between "greedy" and "non-greedy" matches, with some simple examples. I've read about these in some books and gone through them a couple of times, but I still can't fully understand the technique.

3. What is ?: supposed to do? I know that ? makes the preceding token optional, meaning that the following expression will match both "a-bc" and "abc".
Code:

/[a-z]+\-?[a-z]+/
Thanks in advance.

Geert 12-11-2007 06:37 PM

Alright, questions 1 and 2 about greediness are strongly related. Regex quantifiers are greedy by default. This means that quantifiers like *, + and ? will always try to match as much as possible while still letting the rest of the pattern match.

Let's take this example string: John said: "I like octopuses." Jeff added: "Especially orange ones."

When you apply the regex ".*" to that string the following happens:
  1. The regex starts looking for the first ".
  2. As soon as it finds one, the .* part kicks in.
  3. . matches any character and races through to the end of the string.
  4. Then it starts backing up, one character at a time, until it encounters another ". This process is called backtracking.
  5. This is the part of the original string that gets matched: "I like octopuses." Jeff added: "Especially orange ones."

Now, when you add a question mark after the quantifier (change .* to .*?), you make it ungreedy (aka lazy).

.*? won't race through to the end of the string. Instead, it first checks whether the next character is the closing ", takes one more character only if it isn't, and stops as soon as it encounters ". Thus it only matches "I like octopuses.".
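For anyone who wants to see the difference side by side, here is a minimal sketch using the example string from this post. The variable names ($text, $greedy, $lazy) are just for illustration.

```php
<?php
// The example string from the post above.
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// Greedy: .* grabs as much as it can, then backtracks to the LAST quote.
preg_match('/".*"/', $text, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST closing quote.
preg_match('/".*?"/', $text, $lazy);

echo $greedy[0], "\n"; // "I like octopuses." Jeff added: "Especially orange ones."
echo $lazy[0], "\n";   // "I like octopuses."
```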


Back to your original target string: <p>some <span>text</span></p><span>another</span>. Basically what you need to do is replace the " from my example with <span> and </span>.

The regex becomes: <span>(.*?)</span>.
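As a rough sketch of how that looks in PHP, here is the target string from the question run through both the greedy and the lazy version. The / delimiters (and therefore the escaped \/) and the variable names are my own choices for the example.

```php
<?php
$html = '<p>some <span>text</span></p><span>another</span>';

// Greedy version: group 1 runs on to the last </span> in the string.
preg_match('/<span>(.*)<\/span>/', $html, $greedy);

// Lazy version: group 1 stops at the first </span>.
preg_match('/<span>(.*?)<\/span>/', $html, $lazy);

// preg_match_all collects every non-overlapping match, one per span pair.
preg_match_all('/<span>(.*?)<\/span>/', $html, $all);

echo $greedy[1], "\n";             // text</span></p><span>another
echo $lazy[1], "\n";               // text
echo implode(', ', $all[1]), "\n"; // text, another
```

Note that this only handles spans that don't contain other spans; nested tags need more than a lazy quantifier.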

Try it. Play with it. Experiment. I hope I did a somewhat decent job in explaining this stuff.

I'll leave the non-capturing parentheses stuff for Salathe. ;-)

wGEric 12-11-2007 06:42 PM

For speed you probably shouldn't be matching every character within the attributes for the span tag.
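One common way to do that (an assumption about what is meant here, not something spelled out in the post) is a negated character class: [^>]* can never run past the end of the opening tag, so the engine has much less to backtrack over than with .* in that position. The input string with attributes is made up for the example.

```php
<?php
// Hypothetical input with attributes on the span tags (not from the thread).
$html = '<p>some <span class="a">text</span></p><span id="b">another</span>';

// \b keeps the pattern from matching longer tag names that start with "span".
// [^>]* consumes attribute characters only up to the tag's closing >.
preg_match_all('/<span\b[^>]*>(.*?)<\/span>/', $html, $matches);

echo implode(', ', $matches[1]), "\n"; // text, another
```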

Salathe 12-11-2007 06:53 PM

Thanks Geert. The two items ?: and ? are not related. When you specify a group such as (abc), that group is returned as part of the matches array -- it captures whatever is inside as a special group. If you use (?:abc) then the group becomes non-capturing, which simply means that the group is not returned with the matches.

For those who like code snippets, here's what happens:
php Code:
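// Capturing group: whatever (es) matches is returned as $matches[1].
preg_match('/t(es)t/', 'test', $matches);
print_r($matches);
/*
Array
(
    [0] => test
    [1] => es
)
*/

// Non-capturing group: (?:es) still groups, but adds nothing to $matches.
preg_match('/t(?:es)t/', 'test', $matches);
print_r($matches);
/*
Array
(
    [0] => test
)
*/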

To extend Geert's post above: the U (PCRE_UNGREEDY) modifier can be added at the end of the regex to reverse the normal behaviour. By default quantifiers are greedy, but with the U modifier they become non-greedy. In that case, adding a ? (e.g. (.*?)) makes the preceding quantifier greedy again. That's probably confusing to deal with at the moment, so I just mention it as an aside.
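A small sketch of that flip, reusing Geert's example string (the variable names are only for illustration):

```php
<?php
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// With U, a plain .* behaves lazily.
preg_match('/".*"/U', $text, $plainWithU);

// With U, adding ? switches the quantifier back to greedy.
preg_match('/".*?"/U', $text, $questionWithU);

echo $plainWithU[0], "\n";    // "I like octopuses."
echo $questionWithU[0], "\n"; // "I like octopuses." Jeff added: "Especially orange ones."
```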

Geert 12-11-2007 07:05 PM

About the U modifier: in my opinion, its only purpose is to add confusion. I recommend never using it.

xenon 12-11-2007 09:26 PM

Thanks a lot, Geert. Your explanations really enlightened me, and a lot of stuff makes sense now :-D

And thank you, too, Salathe. It's good to know that. Now it's time to rewrite some of my regex rules, as they are pretty primitive :-)

xenon 12-12-2007 10:38 AM

One more thing, though. How about recursion in a regex rule? How does that work? I'm trying to make a simple BB code parser (just playing around), and something doesn't add up. Example:

Code:

/\[(b|i|u)(?:.+?)?\](.+?)\[\/\1\]/
would successfully match:

Code:

[b ]some text[/b ]
(without the spaces, of course), but the following expression...

Code:

[b ]some [i ]text[/i ][/b ]
would leave me with:

Code:

some text[/i ]
That's when recursion popped into my mind and I did a Google search. But I believe I'm not using it properly, or the back-references don't stay the same during a recursive operation, because this rule:

Code:

/\[(((b|i|u)(?:.+?)?)|(?R))\](.+?)\[\/\1\]/
applied to the nested example above, leaves me with:

Code:

some [i ]text[/i ]
Spooky :-/
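The thread has no answer to this one, so the following is only a hedged sketch of one way nested tags could be handled, not a fix confirmed by anyone here. It assumes plain [b], [i] and [u] tags without attributes (the attribute part of the rule above is dropped, since .+? in that position can also run across a ]). BB_PATTERN and bbToHtml are names made up for the example. The body of a tag is allowed to contain ordinary text or, through (?R), another complete tag pair, and a callback then converts the body recursively.

```php
<?php
// Assumption: only [b], [i], [u] with no attributes.
// Body alternatives: a non-[ character, a [ that does not start one of our
// tags, or (via (?R)) a complete nested tag pair.
const BB_PATTERN = '/\[(b|i|u)\]((?:[^\[]|\[(?!\/?(?:b|i|u)\])|(?R))*)\[\/\1\]/';

function bbToHtml(string $text): string
{
    $result = preg_replace_callback(
        BB_PATTERN,
        function (array $m): string {
            // $m[1] is the tag name, $m[2] the (possibly nested) body.
            return "<{$m[1]}>" . bbToHtml($m[2]) . "</{$m[1]}>";
        },
        $text
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $text;
}

echo bbToHtml('[b]some [i]text[/i][/b]'), "\n"; // <b>some <i>text</i></b>
```

Text with mismatched tags, such as [b]x[/i], is left as it is, because the closing tag has to match the back-reference \1 at the same nesting level.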


All times are GMT.