Thursday, January 26, 2012

Weird .NET Regex

I was working on a test for SimpleXml and encountered a really weird regex behavior.

I was trying to have a multiline regex match some XML to verify that it had been updated correctly. I chose regex just because I thought it would be simpler than using an XML parsing library (other than SimpleXml, and I wasn't sure I liked the idea of using SimpleXml in the SimpleXml tests...).

For example, I was trying to match xml like this:
<root>
  <node>test1</node>
  <node>test2</node>
</root>
with a regex like this:
Regex.IsMatch(xmlString, "<node>test1</node>.*<node>test2</node>", RegexOptions.Multiline);
It should have matched, but it didn't. I tried all kinds of variations, throwing in end-of-line and start-of-line matchers, etc., and nothing worked until I found this:
Regex.IsMatch(xmlString, @"<node>test1</node>.*\s*<node>test2</node>", RegexOptions.Multiline);
For giggles I tried it with .*.* but that doesn't work.  The only pattern I found that worked was .*\s* and I really don't understand why.  So if you can explain why, I'd love to hear it!

update:
Thanks commenters!

Turns out there were 3 things I thought I understood about regex that I didn't:
#1: As explained on regexlib.com, \s matches any white-space character, including \n and \r. So that's actually all I needed. No .* required, and no Multiline option required.
#2: Multiline doesn't change the behavior of . to make it match newlines like I thought. It only affects $ and ^, which then match at the start and end of every line instead of only at the start and end of the whole string, as explained on MSDN.
#3: Singleline is the option that changes the behavior of . (and therefore .*) to make it match \n.
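
To make those three points concrete, here is a minimal sketch that runs each variation against the same XML. The class name RegexNewlineDemo, the Matches helper, and the plain \n line endings are my own assumptions for illustration, not part of the original test.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexNewlineDemo
{
    // Same shape as the XML above; "\n" line endings are an assumption.
    public const string Xml =
        "<root>\n  <node>test1</node>\n  <node>test2</node>\n</root>";

    // Hypothetical helper: runs a pattern against the sample XML.
    public static bool Matches(string pattern, RegexOptions options = RegexOptions.None)
    {
        return Regex.IsMatch(Xml, pattern, options);
    }

    public static void Main()
    {
        const string dotStar = @"<node>test1</node>.*<node>test2</node>";

        // '.' stops at '\n', and Multiline does not change that.
        Console.WriteLine(Matches(dotStar, RegexOptions.Multiline));   // False

        // Singleline lets '.' match '\n', so '.*' can span both lines.
        Console.WriteLine(Matches(dotStar, RegexOptions.Singleline));  // True

        // '\s' matches the newline and the indentation with no options at all.
        Console.WriteLine(Matches(@"<node>test1</node>\s*<node>test2</node>")); // True

        // Multiline only changes where '^' and '$' are allowed to match.
        const string anchored = @"^  <node>test2</node>$";
        Console.WriteLine(Matches(anchored));                          // False
        Console.WriteLine(Matches(anchored, RegexOptions.Multiline));  // True
    }
}
```

This also shows why .*\s* happened to work: .* consumed nothing useful on the first line, and \s* did all the work of crossing the newline and the indentation.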

So, the final regex I needed was simply:
Regex.IsMatch(xmlString, @"<node>test1</node>\s*<node>test2</node>");

2 comments:

  1. '.' will not match a newline, but '\s' will (regardless of Multiline mode)

    See http://regexlib.com/CheatSheet.aspx

  2. You meant "RegexOptions.Singleline."
    source