What are some good ways to parse HTML and CSS in Perl?
I have a project whose input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this cleanly and with as few code changes as possible. I've been using XML::LibXML to parse the XML files, but since we're moving to HTML and CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee-deep into silly decisions I'll regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds style information in style tags and attributes instead of in separate XML attributes.
An example of the old XML is:
<text font="timesnewroman,bolditalic" size="11.04" x="59" y="405" w="52" h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12" o_size="11.04" o_cs="4.6"> text </text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;"> <span class="ft19" > text </span></nobr> </div>
where "ft19" refers to a CSS style rule at the top of the page, of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px; font-family:times;color:#000000;x-pdf-color:#000000;font-style:italic; x-pdf-letter-spacing:0.83px;}
Basically, I want a parser that can read the stylistic elements of each node as attributes, like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');
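as I'm able to do with XML. Does anything like this exist for parsing HTML? I'd like to make sure I start off the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was a module way better suited to what I'm trying to do.
Any ideas?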
The basic one I am aware of is HTML::Parser.
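To give a feel for how low-level HTML::Parser is, here is a minimal sketch that only records start tags and their attributes. The sample markup is adapted from the question, and the `@starts` array is just an illustrative name; this is not a full replacement for the `findnodes`/`getAttribute` style of access.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample input modelled on the HTML in the question.
my $html = '<div o="9ka" style="position:absolute;top:145;left:89;">'
         . ' <span class="ft19"> text </span> </div>';

my @starts;   # one record per start tag, in document order

my $parser = HTML::Parser->new(
    api_version => 3,
    # The argspec string asks for the tag name and a hash ref of attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            push @starts, { tag => $tagname, attr => { %$attr } };
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;   # signal end of document so buffered input is flushed

for my $start (@starts) {
    my $attr = $start->{attr};
    print "$start->{tag}: ",
        join(', ', map { "$_=$attr->{$_}" } sort keys %$attr), "\n";
}
```

Since this is event-based, you have to keep track of nesting yourself, which is why the tree-building modules are probably closer to what you are used to with XML::LibXML.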
There is also a project that works with it, Marpa::HTML, which is the work of the larger parser project Marpa, which parses any language that can be described in BNF. It is documented on the author's blog, which is interesting, but it is newer and experimental.
I see that the wildly successful WWW::Mechanize uses HTML::TokeParser, and it, in turn, uses HTML::PullParser, so there's that too.
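For comparison, a rough sketch of the token-pulling style with HTML::TokeParser might look like this. The input is adapted from the question and `@spans` is an illustrative name. Note the simplifying assumption that every div contains exactly one span, because `get_tag('span')` just skips ahead to the next span wherever it is.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input modelled on the HTML in the question.
my $html = '<div o="9ka" style="position:absolute;top:145;left:89;">'
         . ' <span class="ft19"> text </span> </div>';

# A scalar reference is treated as the document text itself.
my $parser = HTML::TokeParser->new(\$html)
    or die "could not create parser";

my @spans;
while (my $div = $parser->get_tag('div')) {
    # For start tags, get_tag returns [$tag, \%attr, \@attrseq, $text].
    my $div_attr = $div->[1];
    my $span     = $parser->get_tag('span') or last;
    my $text     = $parser->get_trimmed_text('/span');
    push @spans, {
        style => $div_attr->{style},
        class => $span->[1]{class},
        text  => $text,
    };
}

print "$_->{class}: '$_->{text}' ($_->{style})\n" for @spans;
```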
If you need something more generic (and evil), you can look into "writing" your own using Text::Balanced (which has some nice methods for tags, though I'm not sure about tag properties) or Regexp::Grammars, but again this means reinventing the wheel somewhat, so only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a little literature search for you, but maybe one of these will work better for you than the others.
Edit: One more parser for you, which seems like it might do what you need: HTML::Tree. Look at methods like look_down in HTML::Element, which act on the tree. I saw an example here.
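As a minimal sketch of that approach, something like the following gets close to the attribute-style access from the question. It assumes HTML::TreeBuilder (part of the HTML::Tree distribution) and a trimmed-down version of the sample page. `parse_declarations` is a hypothetical helper of my own, and the regex over the style block only copes with simple `.class{...}` rules, so treat it as a stand-in for a real CSS parser.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input modelled on the question's HTML (trimmed for brevity).
my $html = <<'HTML';
<html><head><style>
.ft19{ vertical-align:top;font-size:14px;font-family:times;font-style:italic; }
</style></head><body>
<div o="9ka" style="position:absolute;top:145;left:89;"> <span class="ft19"> text </span> </div>
</body></html>
HTML

# Hypothetical helper: split "name:value;name:value" into a hash ref.
sub parse_declarations {
    my ($css) = @_;
    my %decl;
    for my $pair (split /;/, $css // '') {
        my ($name, $value) = split /:/, $pair, 2;
        next unless defined $value;          # skips blank trailing pieces
        s/^\s+|\s+$//g for $name, $value;    # trim surrounding whitespace
        $decl{lc $name} = $value;
    }
    return \%decl;
}

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect simple ".class{...}" rules from every <style> element.
my %class_rules;
for my $style ($tree->look_down(_tag => 'style')) {
    my $css = join '', grep { !ref } $style->content_list;
    while ($css =~ /\.([\w-]+)\s*\{([^}]*)\}/g) {
        $class_rules{$1} = parse_declarations($2);
    }
}

# Walk the divs, merging inline position info with the span's class rule.
my @report;
for my $div ($tree->look_down(_tag => 'div')) {
    my $pos  = parse_declarations($div->attr('style'));
    my $span = $div->look_down(_tag => 'span') or next;
    my $font = $class_rules{ $span->attr('class') // '' } || {};
    push @report, {
        text       => $span->as_trimmed_text,
        top        => $pos->{top},
        font_style => $font->{'font-style'},
    };
}
$tree->delete;

for my $r (@report) {
    print "text: $r->{text}, top: $r->{top}, font-style: ",
        $r->{font_style} // 'unknown', "\n";
}
```

In list context look_down returns every matching element and in scalar context only the first, which is roughly how findnodes was being used in the XML version.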