RegEx
Hello ppl. Over time, I've run into some issues with regex (amongst other stuff) which still aren't clear to me. I guess either it's advanced stuff, or I simply couldn't find an article that answers my questions. So, here they go:
1. I have an HTML-formatted text:
Code:
<p>some <span>text</span></p><span>another</span>
Code:
/<span(?.*)>(.*)<\/span>/
2. I'd be very thankful to someone who knows regular expressions well and has a little time to explain to me (and others, of course) what the difference is between "greedy" and "non-greedy" matches (with some simple examples). I've read about these in some books, and gone through them a couple of times, but I still can't fully understand this technique.
3. What is ?: supposed to do? I know that ? will make the previous match optional, meaning that the following expression will match both "a-bc" and "abc".
Code:
/[a-z]+\-?[a-z]+/
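The claim about the optional ? can be checked with a minimal sketch. It uses Python's re module purely for illustration (an assumption on my part; the patterns in this thread are written PCRE-style, but this particular syntax behaves the same way). The helper name matches_whole is made up for the example.

```python
import re

# Pattern from the post, without the /.../ delimiters.
PATTERN = re.compile(r"[a-z]+\-?[a-z]+")


def matches_whole(s: str) -> bool:
    """Hypothetical helper: True if the entire string matches the pattern."""
    return PATTERN.fullmatch(s) is not None


print(matches_whole("a-bc"))   # hyphen present
print(matches_whole("abc"))    # hyphen absent, still matches
print(matches_whole("a--bc"))  # two hyphens: ? allows at most one
```

Both "a-bc" and "abc" are accepted, while "a--bc" is rejected because ? means zero or one occurrence of the preceding item.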
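Alright, questions 1 and 2 about greediness strongly relate to each other. Regex quantifiers are greedy by default. This means that quantifiers like *, + and ? will always try to match as much as possible.
Let's take this example string:
John said: "I like octopuses." Jeff added: "Especially orange ones."
When you apply the regex ".*" to that string, the following happens: the first " matches the quote before I, then .* races through to the end of the string, and finally the engine backtracks until the closing " can match. That is the very last quote in the string, so the match is "I like octopuses." Jeff added: "Especially orange ones." rather than just the first quotation.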
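The greedy case can be tried out with a short sketch. Python's re module is used here only as a convenient way to run the pattern (an assumption; the thread itself uses PCRE-style patterns, and greedy quantifiers behave the same in both).

```python
import re

text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."'

# Greedy: .* runs to the end of the string, then backs off to the last quote.
greedy = re.search(r'".*"', text).group(0)
print(greedy)
```

The match runs from the first quote in the string all the way to the last one, swallowing everything in between.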
Now, when you add a question mark after the quantifier (change .* to .*?), you make it ungreedy (aka lazy).
.*? won't race through to the end of the string. Instead, it expands one character at a time and stops as soon as the next character is ". Thus it only matches "I like octopuses.".
Back to your original target string: <p>some <span>text</span></p><span>another</span>. Basically, what you need to do is replace the " from my example with <span> and </span>.
The regex becomes: <span>(.*?)</span>.
Try it. Play with it. Experiment. I hope I did a somewhat decent job of explaining this stuff. I'll leave the non-capturing parentheses stuff for Salathe. ;-)
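For anyone who wants to experiment, here is a minimal sketch of the lazy versions. It is written with Python's re module as an assumption for illustration; the .*? syntax is the same in PCRE. The variable names are made up for the example.

```python
import re

text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."'
html = "<p>some <span>text</span></p><span>another</span>"

# Lazy: .*? stops at the first closing quote it can reach.
first_quote = re.search(r'".*?"', text).group(0)
all_quotes = re.findall(r'".*?"', text)

# The same idea applied to the span tags, next to the greedy version.
lazy_spans = re.findall(r"<span>(.*?)</span>", html)
greedy_spans = re.findall(r"<span>(.*)</span>", html)

print(first_quote)
print(all_quotes)
print(lazy_spans)
print(greedy_spans)
```

The lazy pattern yields each span's content separately ("text" and "another"), while the greedy one returns a single capture that runs from the first <span> to the last </span>.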
For speed you probably shouldn't be matching every character within the attributes for the span tag.
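One common way to avoid that, assuming this is what is meant here, is a negated character class such as [^>]* for the attributes, so the engine can never run past the end of the opening tag. A small sketch in Python's re module (chosen for illustration; the class="x" attribute is invented for the example):

```python
import re

# Sample input; the class attribute is an assumption added for illustration.
html = '<p>some <span class="x">text</span></p><span>another</span>'

# [^>]* consumes attribute characters but stops at the first '>'.
span_contents = re.findall(r"<span[^>]*>(.*?)</span>", html)
print(span_contents)
```

Because [^>] cannot match the closing angle bracket, the attribute part needs far less backtracking than .* would.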
Thanks Geert.
The two items ?: and ? are not related. When you specify a group such as (abc), that group is returned as part of the matches array: it captures whatever is inside as a special group. If you use (?:abc), the group becomes non-capturing, which simply means that the group is not returned with the matches.
For those who like code snippets, here's what happens:
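A minimal sketch of the difference, written with Python's re module rather than PHP's preg functions (an assumption made for illustration; the (?:...) syntax is the same in both). The subject string "abcdef" is made up for the example.

```python
import re

subject = "abcdef"

# Two capturing groups: both show up in the result.
capturing = re.search(r"(abc)(def)", subject)

# First group is non-capturing: it still has to match, but is not returned.
non_capturing = re.search(r"(?:abc)(def)", subject)

print(capturing.groups())
print(non_capturing.groups())
print(non_capturing.group(0))  # the overall match is unchanged
```

The overall match is "abcdef" in both cases; only the list of returned groups differs.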
In extension to Geert's post above, the U (PCRE_UNGREEDY) modifier can be used at the end of the regex to reverse the normal behaviour. By default, patterns are greedy, but with the U modifier they become non-greedy. In this case, if the ? is used (e.g. (.*?)), it will make the preceding quantifier greedy. That's probably confusing to deal with at the moment, so I just mention it as an aside.
About the U modifier: in my opinion, its only goal is to add confusion. I recommend never using it.
Thanks a lot Geert. Your explanations really enlightened me, and a lot of stuff makes sense now :-D
And thank you, too, Salathe. It's good to know that. Now it's time I rewrote some of my regex rules, as they are pretty primitive :-)
One more thing, though. How about recursion in a regex rule? How does that work? I'm trying to make a simple BB code parser (just playing around), and something doesn't add up. Example:
Code:
/\[(b|i|u)(?:.+?)?\](.+?)\[\/\1\]/
Code:
[b ]some text[/b ]
(without the spaces of course), but the following expression...
Code:
[b ]some [i ]text[/i ][/b ]
Code:
some text[/i ]
Code:
/\[(((b|i|u)(?:.+?)?)|(?R))\](.+?)\[\/\1\]/
Code:
some [i ]text[/i ]