Matching text within HTML with PHP's regexp functions
Possible duplicates:
Preg match text in PHP between HTML tags
Regex match open tags except XHTML self-contained tags
I have a large amount of text formatted in the following way:

    <p><b>1- title</b>
    <p>
    <dl><dd>
    text text text text text text text
    </dl><p>

    <p><b>2 - title 2</b>
    <p>
    <dl><dd>
    text text text text text text text text text text text text text text text text text text text text text
    <br><i>additional irrelevant information</i>
    </dl><p>
I'm trying to use PHP's regexp functions to retrieve the title-text value pairs while stripping out the characters and irrelevant info that follow some of the text blocks. Preferably I'd like to:
grab whatever is between <p><b> and </b> as the title,
grab the text between <dl><dd> and the next HTML tag (<) as the text,
and somehow keep the two associated for further processing.
Any idea how to do this with PHP's regexp functions?
As the comments on your question suggest, questions along the same lines are asked often on Stack Overflow, and the right answer is "don't try to parse HTML with regular expressions". As well as making that point, however, I think it's useful to have an example in the answer showing how one might take the suggested approach. For the case in the question, one could do:

    <?php

    $html = <<<EOF
    <p><b>1- title</b>
    <p>
    <dl><dd>
    text text text text text text text
    </dl><p>

    <p><b>2 - title 2</b>
    <p>
    <dl><dd>
    text text text text text text text text text text text text text text text text text text text text text
    <br><i>additional irrelevant information</i>
    </dl><p>
    EOF;

    // Parse the markup into a DOM tree instead of matching it with a regex.
    $d = new DOMDocument;
    $d->loadHTML($html);

    $xp = new DOMXPath($d);

    // Every <b> that is a direct child of a <p> is treated as a title.
    $matches = $xp->query("//p/b", $d);
    foreach ($matches as $dn) {
        echo "title is: " . $dn->nodeValue . "\n";
        // Step from the title's <p> across its siblings to the content block.
        $dl = $dn->parentNode->nextSibling->nextSibling->firstChild;
        $dd = $dl->firstChild;
        echo "content is: " . $dd->nodeValue . "\n";
    }

    ?>
Depending on how robust you need this to be, you might want to check that the nextSibling nodes and their children are tags with the names you expect, but this shows the idea anyway.
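One way to add that kind of check is sketched below. It swaps the fixed nextSibling chain for an XPath lookup by tag name and skips any title that has no matching node. The function name extractPairs and the variable $sample are made up for this illustration, and the sketch assumes the parser keeps the wanted text as the first non-blank text node directly inside each <dd>.

```php
<?php

// Hypothetical helper (not from the answer above): collects title/text pairs.
function extractPairs(string $html): array
{
    // Collect parser warnings internally; the sample markup has unclosed tags.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $pairs = [];

    foreach ($xp->query('//p/b') as $b) {
        // First <dd> after this title in document order, looked up by name
        // rather than by counting siblings.
        $dd = $xp->query('following::dd[1]', $b)->item(0);
        if ($dd === null) {
            continue; // no content block follows this title
        }

        // First non-blank text node directly inside the <dd>; this leaves
        // out the trailing <br><i>...</i> part.
        $text = $xp->query('text()[normalize-space()][1]', $dd)->item(0);
        if ($text === null) {
            continue; // the <dd> holds no direct text
        }

        $pairs[] = [
            'title' => trim($b->nodeValue),
            'text'  => trim($text->nodeValue),
        ];
    }

    return $pairs;
}

$sample = <<<EOF
<p><b>1- title</b>
<p>
<dl><dd>
text text text text text text text
</dl><p>

<p><b>2 - title 2</b>
<p>
<dl><dd>
text text text text text text text text text text text text text text text text text text text text text
<br><i>additional irrelevant information</i>
</dl><p>
EOF;

$pairs = extractPairs($sample);
foreach ($pairs as $pair) {
    echo "title is: " . $pair['title'] . "\n";
    echo "content is: " . $pair['text'] . "\n";
}
```

Returning an array of pairs keeps each title associated with its text for further processing. One limitation of this sketch: a title with no <dd> of its own would be paired with the next title's <dd>, so stricter input would need an extra check.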