TalkPHP
 
 
Old 12-11-2007, 05:38 PM   #1 (permalink)
xenon
The Frequenter, Newcomer
Join Date: Dec 2007
Location: Bucharest, Romania
Posts: 438
Thanks: 3
RegEx

Hello people. Over time, I've run into some issues with regex (among other things) that still aren't clear to me. I guess either it's advanced stuff, or I simply couldn't find an article that answers my questions. So, here they go:

1. I have some HTML-formatted text:
Code:
<p>some <span>text</span></p><span>another</span>
...and a rule that *should* match the text in the first span (between the opening and its corresponding closing tag):
Code:
/<span(?.*)>(.*)<\/span>/
That should do (although it's too lame, since only one pair of tags will be matched), but the question is: how do I know that $2 will match only "text" and not "text</span></p><span>another"? How can I tell the regex engine to "stop when you run into the first </span>"? I believe the answer is related to the next question...

2. I'd be very thankful if someone who knows regular expressions well and has a little time could explain to me (and others, of course) the difference between "greedy" and "non-greedy" matches (with some simple examples). I've read about these in some books, and gone through them a couple of times, but I still can't fully understand this technique.

3. What is ?: supposed to do? I know that ? will make the previous match optional, meaning that the following expression will match both "a-bc" and "abc".
Code:
/[a-z]+\-?[a-z]+/
Thanks in advance.
__________________
I have optimistic thoughts, even though sometimes (if not always) life's a bitch.
Old 12-11-2007, 06:37 PM   #2 (permalink)
Geert
The Contributor, RegEx Guru
Join Date: Dec 2007
Location: Belgium
Posts: 60
Thanks: 6

Alright, questions 1 and 2 about greediness are closely related. Regex quantifiers are greedy by default. This means that quantifiers like *, + and ? will always try to match as much as possible, giving characters back only when the rest of the pattern would otherwise fail.

Let's take this example string: John said: "I like octopuses." Jeff added: "Especially orange ones."

When you apply the regex ".*" to that string the following happens:
  1. The regex starts looking for the first ".
  2. As soon as it finds one, the .* part kicks in.
  3. . matches any character and races through to the end of the string.
  4. Then it starts stepping back, one character at a time, until it encounters another ". This process is called backtracking.
  5. This is the part of the original string that gets matched: "I like octopuses." Jeff added: "Especially orange ones."

Now, when you add a question mark after the quantifier (change .* to .*?), you make it ungreedy (aka lazy).

.*? won't race through to the end of the string. Instead, it matches as few characters as it can: before consuming each character, it checks whether the rest of the pattern can match, so it stops as soon as it encounters the next ". Thus it only matches "I like octopuses.".
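
To see the difference for yourself, here is a minimal PHP sketch that runs both patterns against the example string. The variable names are only for illustration.

```php
<?php
$text = 'John said: "I like octopuses." Jeff added: "Especially orange ones."';

// Greedy: .* grabs everything up to the LAST double quote.
preg_match('/".*"/', $text, $greedy);

// Lazy: .*? stops at the FIRST double quote that lets the match succeed.
preg_match('/".*?"/', $text, $lazy);

echo $greedy[0], "\n"; // "I like octopuses." Jeff added: "Especially orange ones."
echo $lazy[0], "\n";   // "I like octopuses."
```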


Back to your original target string: <p>some <span>text</span></p><span>another</span>. Basically what you need to do is replace the " from my example with <span> and </span>.

The regex becomes: <span>(.*?)</span>.
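
In PHP that could look roughly like the sketch below. The variable names are just for illustration, and I use # as the delimiter (my choice here) so the / in </span> needs no escaping.

```php
<?php
$html = '<p>some <span>text</span></p><span>another</span>';

// Greedy version: the group runs on to the last </span>.
preg_match('#<span>(.*)</span>#', $html, $greedy);

// Lazy version: the group stops at the first </span>.
preg_match('#<span>(.*?)</span>#', $html, $lazy);

// preg_match_all() collects every span pair, not just the first one.
preg_match_all('#<span>(.*?)</span>#', $html, $all);

echo $greedy[1], "\n";          // text</span></p><span>another
echo $lazy[1], "\n";            // text
echo implode(', ', $all[1]), "\n"; // text, another
```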

Try it. Play with it. Experiment. I hope I did a somewhat decent job in explaining this stuff.

I'll leave the non-capturing parentheses stuff for Salathe.
__________________
Kohana - PHP5 framework
The Following 4 Users Say Thank You to Geert For This Useful Post:
Karl (12-11-2007), victorius (12-12-2007), Wildhoney (12-11-2007), xenon (12-12-2007)
Old 12-11-2007, 06:42 PM   #3 (permalink)
wGEric
The Acquainted
Join Date: Nov 2007
Posts: 166
Thanks: 0

For speed you probably shouldn't be matching every character within the attributes for the span tag.
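
One common way to do that, as a sketch: use a negated character class, [^>]*, for the attributes, so the engine can never run past the closing > of the opening tag. The sample HTML with attributes is made up for illustration.

```php
<?php
// Hypothetical input: the same markup, but with attributes on the spans.
$html = '<p>some <span class="a">text</span></p><span id="b">another</span>';

// [^>]* matches attribute characters only; it stops at the first >.
preg_match_all('#<span[^>]*>(.*?)</span>#', $html, $matches);

echo implode(', ', $matches[1]), "\n"; // text, another
```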
__________________
Eric
Old 12-11-2007, 06:53 PM   #4 (permalink)
Salathe
Moderateur, RegEx Guru, PHP Guru, Top Contributor, Advanced Programmer
Join Date: Apr 2007
Posts: 1,393
Thanks: 5

Thanks Geert. The two items ?: and ? are not related. When you specify a group such as (abc), that group is returned as part of the matches array -- it captures whatever is inside as a special group. If you use (?:abc) then the group becomes non-capturing, which simply means that the group is not returned with the matches.

For those who like code snippets, here's what happens:
php Code:
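// Capturing group: the text matched by (es) is stored in $matches[1].
preg_match('/t(es)t/', 'test', $matches);
print_r($matches);
/*
Array
(
    [0] => test
    [1] => es
)
*/


// Non-capturing group: (?:es) groups the pattern but stores nothing.
preg_match('/t(?:es)t/', 'test', $matches);
print_r($matches);
/*
Array
(
    [0] => test
)
*/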
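
A practical use, sketched with a made-up input string: when you need parentheses only for alternation, a non-capturing group keeps the numbering of the groups you actually care about from shifting.

```php
<?php
// Hypothetical input, used only for illustration.
$text = 'Ms Jones';

// Capturing alternation: the title takes slot 1, the name moves to slot 2.
preg_match('/(Mr|Ms) (\w+)/', $text, $capturing);

// Non-capturing alternation: the name is the first (and only) group.
preg_match('/(?:Mr|Ms) (\w+)/', $text, $nonCapturing);

echo $capturing[2], "\n";    // Jones
echo $nonCapturing[1], "\n"; // Jones
```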

As an addition to Geert's post above, the U (PCRE_UNGREEDY) modifier can be used at the end of the regex to reverse the normal behaviour. By default, patterns are greedy, but with the U modifier they become non-greedy. In that case, adding the ? (e.g. (.*?)) makes the preceding quantifier greedy. That's probably confusing to deal with at the moment, so I just mention it as an aside.
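
For the curious, a small sketch of the U modifier in action. The test string is made up for illustration.

```php
<?php
$text = '"one" and "two"';

// Default behaviour: .* is greedy.
preg_match('/".*"/', $text, $default);

// With U, the same .* becomes lazy.
preg_match('/".*"/U', $text, $swapped);

// With U, adding ? flips the quantifier back to greedy.
preg_match('/".*?"/U', $text, $flipped);

echo $default[0], "\n"; // "one" and "two"
echo $swapped[0], "\n"; // "one"
echo $flipped[0], "\n"; // "one" and "two"
```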
The Following User Says Thank You to Salathe For This Useful Post:
Matt83 (12-11-2007)
Old 12-11-2007, 07:05 PM   #5 (permalink)
Geert

About the U modifier: in my opinion, its only goal is to add confusion. I recommend never using it.
Old 12-11-2007, 09:26 PM   #6 (permalink)
xenon

Thanks a lot, Geert. Your explanations really enlightened me, and a lot of stuff makes sense now.

And thank you, too, Salathe. It's good to know that. Now it's time to rewrite some of my regex rules, as they are pretty primitive.
Old 12-12-2007, 10:38 AM   #7 (permalink)
xenon

One more thing, though. How about recursion in a regex rule? How does that work? I'm trying to make a simple BB code parser (just playing around), and something doesn't add up. Example:

Code:
/\[(b|i|u)(?:.+?)?\](.+?)\[\/\1\]/
would successfully match:

Code:
[b ]some text[/b ]
(without the spaces, of course), but the following expression...

Code:
[b ]some [i ]text[/i ][/b ]
would leave me with:

Code:
some text[/i ]
That's when recursion came to mind and I did a Google search. But I believe I'm not using it properly, or the back-references don't stay the same during a recursive operation, because this rule:

Code:
/\[(((b|i|u)(?:.+?)?)|(?R))\](.+?)\[\/\1\]/
applied to the text above, leaves me with:

Code:
some [i ]text[/i ]
Spooky