Hello people. Over time, I've run into some issues with regex (amongst other stuff) which still aren't clear to me. I guess either it's advanced stuff, or I simply couldn't find an article or something that would answer my questions. So, here they go:
1. I have some HTML-formatted text: <p>some <span>text</span></p><span>another</span>
...and a rule that *should* match the text in the first span (between the opening and its corresponding closing tag):
That should do (although it's too lame, since only one pair of tags will be matched), but the question is: how do I know that $2 will match only "text" and not "text</span></p><span>another"? How can I tell the regex engine to "stop when you run into the first </span>"? I believe the answer is related to the next question...
2. I'd be very thankful to someone who knows regular expressions well and has a little time to explain to me (and others, of course) what the difference is between "greedy" and "non-greedy" matches (with some simple examples). I've read about these in some books, and gone through them a couple of times, but I still can't fully understand this technique.
3. What is ?: supposed to do? I know that ? will make the previous match optional, meaning that the following expression will match both "a-bc" and "abc".
Thanks in advance.
Alright, questions 1 and 2 are closely related, since both come down to greediness. Regex quantifiers are greedy by default. This means that quantifiers like *, + and ? will always try to match as much as possible.
Let's take this example string: John said: "I like octopuses." Jeff added: "Especially orange ones."
When you apply the regex ".*" to that string the following happens:
The regex starts looking for the first ".
As soon as it finds one, the .* part kicks in.
The . matches any character (other than a newline, by default), so .* races through to the end of the string.
Then it steps back, one character at a time, until it encounters another ". This process is called backtracking.
This is the part of the original string that gets matched: "I like octopuses." Jeff added: "Especially orange ones."
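For anyone who wants to see this for themselves, here is a minimal PHP sketch of the greedy case. The variable names ($text, $matches) are just picked for illustration; the string and the pattern are the ones from the walkthrough above.

```php
<?php
// The example string from the explanation above.
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// Greedy: .* grabs everything up to the end of the string,
// then backtracks until the pattern's final " can match.
preg_match('/".*"/', $text, $matches);

echo $matches[0], PHP_EOL;
// prints: "I like octopuses." Jeff added: "Especially orange ones."
```

Note that the match runs from the first quote all the way to the last one, swallowing everything in between.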
Now, when you add a question mark after the quantifier (change .* to .*?), you make it ungreedy (aka lazy).
.*? won't race through to the end of the string. Instead, it matches as little as possible: it checks whether the next character is a " and only consumes another character if it isn't. It stops as soon as it encounters ". Thus it only matches "I like octopuses.".
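Here is a small PHP sketch of the lazy version on the same string. The variable names ($text, $first, $all) are made up for the example; preg_match_all is used only to show that each quotation now becomes its own match.

```php
<?php
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// Lazy: .*? grows one character at a time and stops at the first closing quote.
preg_match('/".*?"/', $text, $first);
echo $first[0], PHP_EOL;
// prints: "I like octopuses."

// With preg_match_all, every quotation is returned as a separate match.
preg_match_all('/".*?"/', $text, $all);
print_r($all[0]);
// $all[0] holds two entries: "I like octopuses." and "Especially orange ones."
```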
Back to your original target string: <p>some <span>text</span></p><span>another</span>. Basically what you need to do is replace the " from my example with <span> and </span>.
The regex becomes: <span>(.*?)</span>.
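If you want to try it right away, here is a rough PHP sketch that runs both the lazy and the greedy variant against the target string. The # delimiter is used only so the slash in </span> doesn't need escaping, and the variable names are illustrative.

```php
<?php
$html = '<p>some <span>text</span></p><span>another</span>';

// Lazy: the group stops at the first </span>.
preg_match('#<span>(.*?)</span>#', $html, $lazy);
echo $lazy[1], PHP_EOL;
// prints: text

// Greedy, for comparison: the group runs on to the last </span>.
preg_match('#<span>(.*)</span>#', $html, $greedy);
echo $greedy[1], PHP_EOL;
// prints: text</span></p><span>another
```

Keep in mind this only handles simple, non-nested spans without attributes; for anything more involved, a proper HTML parser is usually the safer route.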
Try it. Play with it. Experiment. I hope I did a somewhat decent job in explaining this stuff.
I'll leave the non-capturing parentheses stuff for Salathe.
Thanks Geert. The two items ?: and ? are not related. When you specify a group such as (abc), that group is returned as part of the matches array -- it captures whatever is inside as a special group. If you use (?:abc) then the group becomes non-capturing, which simply means that the group is not returned with the matches.
For those who like code snippets, here's what happens:
preg_match('/t(es)t/', 'test', $matches); /* Array ( [0] => test [1] => es ) */
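To make the difference visible, here is a short side-by-side sketch in PHP. The variable names ($capturing, $nonCapturing) are made up for the example.

```php
<?php
// Capturing group: the text matched by (es) is stored at index 1.
preg_match('/t(es)t/', 'test', $capturing);
print_r($capturing);
// Array ( [0] => test [1] => es )

// Non-capturing group: (?:es) still groups, but nothing is stored for it.
preg_match('/t(?:es)t/', 'test', $nonCapturing);
print_r($nonCapturing);
// Array ( [0] => test )
```

Non-capturing groups are handy when you need the parentheses only for grouping (for example around an alternation) and don't want them to shift the numbering of the groups you actually care about.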
As an addition to Geert's post above, the U (PCRE_UNGREEDY) modifier can be used at the end of the regex in order to reverse the normal behaviour. By default, patterns are greedy, but with the U modifier they become non-greedy. In this case, if the ? is used (e.g. (.*?)), it will make the preceding quantifier greedy. That's probably confusing to deal with at the moment, so I just mention it as an aside.
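For the curious, a quick PHP sketch of the U modifier, reusing Geert's example string. The variable names are only illustrative.

```php
<?php
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// With U, a plain .* behaves lazily and stops at the first closing quote.
preg_match('/".*"/U', $text, $ungreedy);
echo $ungreedy[0], PHP_EOL;
// prints: "I like octopuses."

// With U, adding ? flips the quantifier back to greedy.
preg_match('/".*?"/U', $text, $greedyAgain);
echo $greedyAgain[0], PHP_EOL;
// prints: "I like octopuses." Jeff added: "Especially orange ones."
```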