mysql - HTML::TableExtract: how to run the right argument [see live example]
I have a question regarding a parser. Is there a chance to catch separators within the separate table cells? The parser script already runs nicely. Note: I want to store the data in a MySQL database, so it would be great to have separators (commas, tabs, or something else). Tab-separated values or comma-separated values are handy formats to work with.
(Here is the data from the following site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20 )
lfd. nr. schul- nummer schulname straße plz ort telefon fax schulart webseite
1 0401 mädchenrealschule marienburg, abenberg, der diözese eichstätt marienburg 1 91183 abenberg 09178/509210 realschulen mrs-marienburg.homepage.t-online.de
2 6581 volksschule abenberg (grundschule) güssübelstr. 2 91183 abenberg 09178/215 09178/905060 volksschulen home.t-online.de/home/vs-abenberg
6 3074 private berufsschule zur sonderpäd. förderung, förderschwerpunkt lernen, abensberg regensburger straße 60 93326 abensberg 09443/709191 09443/709193 berufsschulen zur sonderpädog. förderung www.berufsschule-abensberg.de
Well, I need to have the lines divided into at least three columns. Take this record, for example:
name: volksschule abenberg (grundschule)
street: güssübelstr. 2
postal code and town: 91183 abenberg
fax and telephone: 09178/215 09178/905060
type of school: volksschulen
website: home.t-online.de/home/vs-abenberg
Or even better: have the postal code and the town divided into two separate columns. Question: is that possible?
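A minimal sketch of how the combined field could be split, assuming every value starts with a five-digit German postal code followed by whitespace and the town name. The helper name split_plz_ort is hypothetical and not part of the script in this post; values that do not match the assumed pattern are returned unchanged as the town.

```perl
use strict;
use warnings;

# Hypothetical helper: split a combined "PLZ Ort" string into two fields.
# Assumption: a five-digit postal code, then whitespace, then the town name.
sub split_plz_ort {
    my ($combined) = @_;
    return unless defined $combined;
    if ( $combined =~ /^\s*(\d{5})\s+(.+?)\s*$/ ) {
        return ( $1, $2 );
    }
    # No postal code recognised: keep the whole text as the town.
    return ( '', $combined );
}

my ( $plz, $ort ) = split_plz_ort('91183 abenberg');
print "$plz\t$ort\n";    # prints the two values separated by a tab
```

Keeping the postal code as a string rather than a number preserves any leading zero.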
By the way, look at these records (here I show the names of the schools):
1 0401 mädchenrealschule marienburg, abenberg,
6 3074 private berufsschule zur sonderpäd. förderung, förderschwerpunkt lernen, abensberg
Those have commas inside the name. Does that make it difficult to create a parser that produces CSV format?
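Commas inside a value need not break CSV as long as the field is quoted. The following is only a sketch of the usual quoting convention, written with core Perl so that it runs without extra modules; the helper names csv_field and csv_line are hypothetical. It assumes the fields have already been separated (for example, one table cell per field). In practice a module such as Text::CSV handles this quoting for you.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a single field for CSV output.
# A field containing a comma, a double quote, or a line break is wrapped
# in double quotes, and embedded double quotes are doubled.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Hypothetical helper: join already-separated fields into one CSV line.
sub csv_line {
    return join ',', map { csv_field($_) } @_;
}

# The school name contains commas, so only that field gets quoted.
print csv_line( 1, '0401', 'maedchenrealschule marienburg, abenberg' ), "\n";
# prints: 1,0401,"maedchenrealschule marienburg, abenberg"
```

The name is written without the umlaut here only to keep the sketch independent of the source file's encoding.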
Any idea how to do this in Perl? If it is possible, that would be great! Many thanks for any hint regarding this little issue; that aside, this is great and fascinating!
By the way, if you want, I can add my code. No problem, here it is:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $te             = HTML::TableExtract->new;
my $total_records  = 0;
my $suchbegriffe   = "e";     # search term
my $treffer        = 50;      # hits per page
my $range          = 0;       # offset of the first hit on the page
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir     = "processing";
my $counter        = 50;
my $displaydate    = "";
my $percent        = 0;

workdir();
chdir $processdir or die "Cannot change to $processdir: $!";
processurl();
print "\nPress <ENTER> to continue\n";
<STDIN>;
$displaydate = strftime( '%Y%m%d%H%M%S', localtime );
my $outname = "webdata_for_${suchbegriffe}_$displaydate.txt";
open my $outfile, '>', $outname or die "Cannot open $outname: $!";
processdata();
close $outfile;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/$outname\n";
unlink 'processing.html';
exit;

# Fetch the first result page and read the total number of hits from it.
sub processurl {
    my $url = "$url_to_process$suchbegriffe&a=$treffer&s=$range";
    print "\nProcessing $url\n";
    # getstore returns an HTTP status code, so test it with is_success
    is_success( getstore( $url, 'tempfile.html' ) ) or die 'Unable to get page';
    open my $fh, '<', 'tempfile.html' or die "Cannot open tempfile.html: $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /Treffer <b>(\d+) - (\d+)<\/b> \w+ \w+ <b>(\d+)/i ) {
            $total_records = $3;
            print "Total records to process is $total_records\n";
        }
    }
    close $fh;
    unlink 'tempfile.html';
}

# Fetch every result page and write its table rows to the output file.
sub processdata {
    $| = 1;
    while ( $range <= $total_records ) {
        my $url = "$url_to_process$suchbegriffe&a=$treffer&s=$range";
        is_success( getstore( $url, 'processing.html' ) ) or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        last unless $table;
        for my $row ( $table->rows ) {
            cleanup(@$row);
            # cells are joined with single spaces, so no separators survive
            print {$outfile} join( ' ', map { defined $_ ? $_ : '' } @$row ), "\n";
        }
        print "Processed records $range to $counter\r";
        $counter += 50;
        $range   += 50;
        $te = HTML::TableExtract->new;    # fresh parser for the next page
    }
}

# Collapse runs of whitespace in each cell (modifies the caller's values).
sub cleanup {
    for (@_) {
        next unless defined;
        s/\s+/ /g;
    }
}

sub workdir {
    # Use the home directory to process data
    chdir or die "$!";
    if ( !-d $processdir ) {
        mkdir( "$ENV{HOME}/$processdir", 0755 )
            or die "Cannot make directory $processdir: $!";
    }
}
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
defined $html or die 'Unable to get page';
$html =~ tr/\r//d;        # strip carriage returns
$html =~ s/&nbsp;/ /g;    # expand spaces

my $te = HTML::TableExtract->new;
$te->parse($html);

# columns as they appear in the table, and fields as they should be written
my @cols   = qw( rownum number name phone type website );
my @fields = qw( rownum number name street postal town phone fax type website );

my $csv = Text::CSV->new( { binary => 1 } );

foreach my $ts ( $te->table_states ) {
    foreach my $row ( $ts->rows ) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for grep { defined } @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, defined $h{name} ? $h{name} : '';
        @h{qw/phone fax/} = split /\n+/, defined $h{phone} ? $h{phone} : '';

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for grep { defined } @h{qw/name street postal town/};

        $csv->combine( @h{@fields} );
        print $csv->string, "\n";
    }
}