regex - Matching text within HTML with PHP's Regexp functions


Possible duplicates:
Preg match text in PHP between HTML tags
RegEx match open tags except XHTML self-contained tags

I have a large amount of text formatted in the following way:

    <p><b>1- title</b>
    <p>
    <dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text
    </dl><p>
    <p><b>2 - title 2</b>
    <p>
    <dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text text text text text text text text text text text text text text text
    <br><i>additional irrelevant information</i>
    </dl><p>

I'm trying to use PHP's regexp functions to retrieve the title-text value pairs while stripping out the &nbsp; characters and the irrelevant info that follows some of the text blocks. Preferably I'd like to:

Grab everything between <p><b> and </b> as the title

Grab the text between

    <dl><dd>&nbsp;&nbsp;&nbsp;

and the next HTML tag (<) as the text, and somehow keep the two associated for further processing. Any idea how to do this with PHP's regexp functions?

As the comments on the question suggest, many questions along the same lines have been asked on Stack Overflow, and the right answer is "don't try to parse HTML with regular expressions". Having made that point, however, I think it's useful to have an example in the answer showing how one might take the suggested approach. For the case in question, one could do:

    <?php

    $html = <<<EOF
    <p><b>1- title</b>
    <p>
    <dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text
    </dl><p>
    <p><b>2 - title 2</b>
    <p>
    <dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text text text text text text text text text text text text text text text
    <br><i>additional irrelevant information</i>
    </dl><p>
    EOF;

    $d = new DOMDocument;
    $d->loadHTML($html);

    $xp = new DOMXPath($d);

    // Each <b> directly inside a <p> is a title.
    $matches = $xp->query("//p/b", $d);
    foreach ($matches as $dn) {
        echo "title is: " . $dn->nodeValue . "\n";
        // Walk from the title's <p> to the content that follows it.
        $dl = $dn->parentNode->nextSibling->nextSibling->firstChild;
        $dd = $dl->firstChild;
        echo "content is: " . $dd->nodeValue . "\n";
    }
    ?>

Depending on how robust you need this to be, you may want to check that the nextSiblings and children are tags with the names you expect, but this shows the idea anyway.
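For what it's worth, here is a rough sketch of a slightly more defensive variant. Instead of chaining nextSibling calls, it asks XPath for the first <dd> that follows each title and takes that element's first text node, so it depends less on how the parser nests the unclosed <p> tags. The helper name extractPairs, the array shape it returns, and the way it strips the non-breaking spaces are my own assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns a list of ['title' => ..., 'text' => ...] pairs.
function extractPairs(string $html): array
{
    // The sample markup has unclosed tags, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $pairs = [];
    foreach ($xp->query('//p/b') as $b) {
        // First text node of the first <dd> that follows this title.
        $nodes = $xp->query('following::dd[1]/text()[1]', $b);
        if ($nodes === false || $nodes->length === 0) {
            continue; // no <dd> text after this title, so skip it
        }
        // &nbsp; is decoded to U+00A0, which the DOM returns as UTF-8 (C2 A0).
        $text = str_replace("\xC2\xA0", '', $nodes->item(0)->nodeValue);
        $pairs[] = ['title' => trim($b->textContent), 'text' => trim($text)];
    }
    return $pairs;
}

$html = <<<EOF
<p><b>1- title</b>
<p>
<dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text
</dl><p>
<p><b>2 - title 2</b>
<p>
<dl><dd>&nbsp;&nbsp;&nbsp; text text text text text text text text text text text text text text text text text text text text text
<br><i>additional irrelevant information</i>
</dl><p>
EOF;

$pairs = extractPairs($html);
foreach ($pairs as $pair) {
    echo $pair['title'] . ' => ' . $pair['text'] . "\n";
}
```

Because only the first text node of each <dd> is read, anything after the <br> (the irrelevant italic note) is left out without any special handling. If your real data puts markup inside the text itself, you would need to collect more than the first text node.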

