What are some good ways to parse HTML and CSS in Perl?

January 15, 2010

I have a project whose input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll regret, I wanted to ask here: what do you guys use for this kind of task?

The structures of the old XML and the new HTML input files are pretty similar, both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds style information in style tags and attributes instead of in separate XML attributes.

An example of the old XML is:

<text font="timesnewroman,bolditalic" size="11.04" x="59" y="405" w="52"
      h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
      o_size="11.04" o_cs="4.6"> text </text>

An example of the new HTML is:

<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
  <nobr><span class="ft19" >
    text
  </span></nobr>
</div>

where "ft19" refers to a CSS style element at the top of the page, of the format:

.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
       font-family:times;color:#000000;x-pdf-color:#000000;font-style:italic;
       x-pdf-letter-spacing:0.83px;}

Basically, I want a parser that can read the stylistic elements of each node as attributes, like:

my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');

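as I'm able to do with XML. Does something like this exist for parsing HTML? I'd like to make sure I start out the right way, instead of finding something on CPAN that does sort of what I want and realizing two months later that there is a module that is way better for what I'm trying to do.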
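Whichever HTML parser ends up being used, the style strings themselves still have to be turned into attribute-like lookups. A minimal sketch in core Perl, assuming simple `property:value;` declarations like the ones shown above; `parse_declarations` is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;

# Split a CSS declaration block ("prop:value;prop:value;") into a hash ref.
# Naive on purpose: it assumes no ';' appears inside a value.
sub parse_declarations {
    my ($css) = @_;
    my %style;
    for my $decl (split /;/, $css) {
        # Split on the first colon only, so values may contain later colons.
        my ($prop, $value) = split /:/, $decl, 2;
        next unless defined $value;
        s/^\s+|\s+$//g for $prop, $value;    # trim surrounding whitespace
        $style{lc $prop} = $value if length $prop;
    }
    return \%style;
}

# An inline style attribute and a class rule body, taken from the examples.
my $inline = parse_declarations(
    'position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;'
);
my $ft19 = parse_declarations(
    'vertical-align:top;font-size:14px;font-family:times;font-style:italic;'
);

print "top is: $inline->{top}\n";
print "font-style is: $ft19->{'font-style'}\n";
```

With the declarations in a hash, a check such as `$ft19->{'font-style'} eq 'italic'` can stand in for the old `italic="yes"` attribute.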

Any ideas?

The basic one I am aware of is HTML::Parser.

There is also a project that works with it, Marpa::HTML, which is part of a larger parser project called Marpa. Marpa parses any language that can be described in BNF and is documented on the author's blog. It is interesting, but much newer and experimental.

I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
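As a rough illustration of the token-based style, here is a minimal sketch using HTML::TokeParser on a trimmed copy of the sample HTML. It assumes the HTML::Parser distribution is installed, and that every div of interest contains exactly one span, which is an assumption about the input rather than something the module enforces:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'HTML';
<div o="9ka" style="position:absolute;top:145;left:89;">
  <nobr><span class="ft19">text</span></nobr>
</div>
HTML

my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";

my @found;
# get_tag('div') skips ahead to the next <div> start tag and returns
# an array ref: [ $tagname, \%attr, \@attrseq, $raw_text ].
while (my $div = $p->get_tag('div')) {
    my $span = $p->get_tag('span') or last;
    # Read the text up to the closing </span>, with whitespace trimmed.
    my $text = $p->get_trimmed_text('/span');
    push @found, {
        style => $div->[1]{style},
        class => $span->[1]{class},
        text  => $text,
    };
}

printf "%s [%s]: %s\n", $_->{class}, $_->{style}, $_->{text} for @found;
```

This reads the document as a stream, so it needs no tree in memory, but it also gives no node objects to query later.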

If you need something even more generic (and evil), you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, though I'm not sure about tag properties) or Regexp::Grammars. Again, this means reinventing the wheel somewhat, so only choose these routes if the modules above don't do what you need.

Perhaps I haven't helped. Perhaps I have at least done a little literature search for you, and maybe one of these will work better than the others.

Edit: one more parser for you. It seems you might need HTML::Tree. Look at methods such as look_down in HTML::Element, which act on the tree. I saw an example here.
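A minimal sketch of the look_down approach, assuming HTML::Tree (which provides HTML::TreeBuilder and HTML::Element) is installed. The `%rules` hash and the regex that fills it are illustrative only and assume simple single-class selectors like `.ft19{...}`; this is not a general CSS parser:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<'HTML';
<html><head><style type="text/css">
.ft19{vertical-align:top;font-size:14px;font-family:times;font-style:italic;}
</style></head><body>
<div o="9ka" style="position:absolute;top:145;left:89;">
  <nobr><span class="ft19">text</span></nobr>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect ".class{...}" rules from every <style> element into
# $rules{class}{property} = value.
my %rules;
for my $style ($tree->look_down(_tag => 'style')) {
    my $css = join '', grep { !ref } $style->content_list;
    while ($css =~ /\.([\w-]+)\s*\{([^}]*)\}/g) {
        my ($class, $body) = ($1, $2);
        for my $decl (split /;/, $body) {
            my ($prop, $value) = split /:/, $decl, 2;
            next unless defined $value;
            s/^\s+|\s+$//g for $prop, $value;
            $rules{$class}{$prop} = $value;
        }
    }
}

# In list context, look_down returns every element matching the criteria.
my @resolved;
for my $span ($tree->look_down(_tag => 'span')) {
    my $class = $span->attr('class');
    next unless defined $class;
    my $font_style = $rules{$class}{'font-style'} // 'normal';
    push @resolved, {
        text   => $span->as_trimmed_text,
        italic => ($font_style eq 'italic' ? 'yes' : 'no'),
    };
}

print "node's italic value is: $_->{italic} ($_->{text})\n" for @resolved;

$tree->delete;    # release the tree explicitly
```

The `attr` call plays roughly the role that `getAttribute` plays in XML::LibXML, so code that walks nodes and reads attributes may carry over with modest changes.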
