How to use a Perl regex to detect a <p> inside another <p>
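I am trying to parse "wrong HTML" in order to fix it using a Perl regex. The wrong HTML is the following: <p>foo<p>bar</p>foo</p>
I would like the Perl regex to return: <p>foo<p>
I tried something like: '|(<p\b[^>]*>(?!</p>)*?<p[^>]*>)|'
with no success, because I cannot repeat (?!</p>)*? (a lookahead consumes no characters, so quantifying it does nothing useful).
Is there a way in a Perl regex to say "any character except the following sequence" (in this case </p>)?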
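Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick breakdown:
<p>(?:(?!</?p>).)*</p>
matches a <p> ... </p> pair whose content does not contain either <p> or </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
is "true" when, looking ahead ((?! ... )), there is no <p> or end of input ((<p>|$)) that can be reached without passing a <p> or </p> in between ((?:(?!</?p>).)*). In other words, the next tag after the match must be a closing </p>, which means the matched paragraph sits inside another one.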
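The (?:(?!</p>).)* construct is also a direct answer to the "any character except the following sequence" part of the question. A minimal sketch, assuming the goal is the <p>foo<p> result asked for; the variable names are illustrative only, and since . does not match a newline by default, multi-line input would need the /s modifier:

```perl
use strict;
use warnings;

my $html = '<p>foo<p>bar</p>foo</p>';

# (?:(?!</p>).)*? consumes one character at a time, but only while
# the text at the current position does not start with </p>.
my ($nested_open) = $html =~ m{(<p\b[^>]*>(?:(?!</p>).)*?<p\b[^>]*>)};

print defined $nested_open ? "found: $nested_open\n" : "no nested <p>\n";
```

For the sample input this prints found: <p>foo<p>, because a second opening <p> is reached before any </p>.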
A demo:
my $txt = "<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
while ($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
    print "found: $1\n";
}
prints:
found: <p>bar</p>
found: <p>x</p>
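Note that this regex trickery only works for <p>baz</p> in the string:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched, because the next tag after it is an opening <p>, which the lookahead rejects. After replacing <p>baz</p>, you could do a 2nd run on the input, in which case <p>bar</p> will be matched.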
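The repeated runs can be sketched as a loop that keeps substituting until nothing matches any more. This is only an illustration: it assumes that "replacing" means cutting the matched paragraph out of the string, which may not be the fix you actually want, and the names $inner and @found are made up for the example.

```perl
use strict;
use warnings;

my $txt = '<p>foo <p>bar</p> <p>baz</p> foo</p>';

# Same pattern as in the demo, with the final group made non-capturing.
my $inner = qr{(<p>(?:(?!</?p>).)*</p>)(?!(?:(?!</?p>).)*(?:<p>|$))};

my @found;
# Assumption: each matched paragraph is simply removed; the right
# replacement depends on how the HTML should be repaired.
while ($txt =~ s/$inner//) {
    push @found, $1;    # $1 still holds the paragraph just removed
}

print "found: $_\n" for @found;
```

The first pass removes <p>baz</p>, which leaves <p>bar</p> followed directly by a closing </p>, so the second pass picks it up. The loop stops once only the outer paragraph remains.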