html parsing

Hi,
I'm trying to extract information from html like this...

http://www.rafb.net/paste/results/Ze4RTm27.html

I've tried modifying examples from the man pages for HTML::TokeParser,
and HTML::TreeBuilder without much success.

I just want to identify such blocks of html by the attributes in the
child nodes; extract the text node under the first '<td>',
extract the text node under the second '<td>' as well as the href
attribute in the enclosed '<a>' node,
store the output in a hash which I can pass to other functions or
print to a csv file.

If anyone can suggest anything while I read the docs and relevant
hacks in "Spidering Hacks" more carefully it would be appreciated.

Regards,
Malcolm.
malcolm.mill [ Mon, 02 May 2005 17:29 ] [ ID #774501 ]

RE: html parsing

You should consult O'Reilly's Perl and LWP for a good explanation of how to
use the toke parser. Here's some code that I wrote.

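The important part is that $token->[0] holds the token type ('S' for a
start tag, 'E' for an end tag, 'T' for text).
For a text token, $token->[1] holds the text; for a start or end tag it
holds the tag name.
For a start tag, $token->[4] holds the original source text of the tag.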
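
To make the token layout concrete, here is a minimal sketch that walks a tiny document and records what each token holds. The sample markup is made up for illustration; only new() and get_token from HTML::TokeParser are used.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, made up for this illustration.
my $html = '<p class="intro">Hello <a href="/x">world</a></p>';

my $stream = HTML::TokeParser->new(\$html)
    or die "Couldn't create parser";

my @seen;
while (my $token = $stream->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: [ 'S', $tag, \%attr, \@attrseq, $source_text ]
        push @seen, "S $token->[1] $token->[4]";
    }
    elsif ($type eq 'E') {
        # End tag: [ 'E', $tag, $source_text ]
        push @seen, "E $token->[1]";
    }
    elsif ($type eq 'T') {
        # Text: [ 'T', $text, $is_data ]
        push @seen, "T $token->[1]";
    }
}

print "$_\n" for @seen;
```

Attribute values of a start tag are in the hash at $token->[2], so the href of the link above would be $token->[2]{href}.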

Andrew Johnson
Marketing Writer
Elias/Savion Advertising
Phone: 412.642.7700 Fax 412.642.2277
www.elias-savion.com
andrew.johnson [at] elias-savion.com

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;

# $browser was not defined in the snippet as posted; an LWP::UserAgent is assumed.
my $browser = LWP::UserAgent->new;

# Return a response header made safe for a CSV field, or 'N/A' if it is missing.
sub clean_header
{
    my ($response, $name) = @_;
    my $header = $response->header($name);
    return 'N/A' unless $header;
    $header =~ s/,/;/g;
    $header =~ s/\n/ /g;
    return $header;
}

# Read one URL per line from the file named in $_[0];
# write one row per URL to data.csv.
sub Report
{
    my ($listfile) = @_;
    open(my $articles, '<', $listfile) or die "Can't open $listfile: $!";
    open(my $data, '>', 'data.csv') or die "Can't write data.csv: $!";
    while (my $url = <$articles>)
    {
        chomp $url;
        my $count = 0;       # words in the text tokens that matched
        my $numtokens = 0;   # text tokens that matched
        my $response = $browser->get($url);
        die "Error getting: ", $response->status_line, "\n",
            $response->headers_as_string
            if $response->is_error;
        my $content = $response->content;
        my $stream = HTML::TokeParser->new(\$content)
            or die "Couldn't parse HTML from $url";
        my @row = (
            clean_header($response, 'X-META-PUB-DATE'),
            'BusinessWeek',
            clean_header($response, 'X-META-AUTHOR'),
            clean_header($response, 'X-META-HEADLINE'),
        );
        my %keyfinds;
        while (my $token = $stream->get_token)
        {
            next unless $token->[0] eq 'T';   # text tokens only
            my $text = $token->[1];
            next unless $text =~ /(BUSH|CLINTON)/;
            $keyfinds{$1} += 1;
            $numtokens++;
            # count the non-empty whitespace-separated words in this token
            $count += grep { $_ ne '' } split(/\s/, $text);
        }
        my $value = $count * 3885 / 20;   # constants kept from the original post
        push @row, $count, $numtokens, 977128, $value;
        # find the keyword that matched most often
        # (the posted loop iterated over %keyfinds itself, which also visits the values)
        my $highest = 0;
        my $highstring;
        foreach my $key (sort keys %keyfinds)
        {
            if ($keyfinds{$key} > $highest)
            {
                $highest = $keyfinds{$key};
                $highstring = $key;
            }
        }
        push @row, ($highstring ? ($highstring, $highest) : ('', '')), $url;
        print {$data} join(',', @row), "\n";
    }
    close $data;
    close $articles;
}
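
For the original question (the text of the first '<td>', plus the text and href of the '<a>' in the second '<td>'), something along these lines might work. This is only a sketch: the pasted page is not reproduced in this thread, so the markup and the class names 'name' and 'link' are made up, and csv_field is a hypothetical helper, not part of any module. It uses get_tag and get_trimmed_text from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup standing in for the pasted page, whose real
# structure is not shown in this thread.
my $html = <<'HTML';
<table>
<tr><td class="name">Widget</td><td class="link"><a href="http://example.com/widget">Widget page</a></td></tr>
<tr><td class="other">ignored</td><td class="other">ignored too</td></tr>
<tr><td class="name">Gadget, large</td><td class="link"><a href="http://example.com/gadget">Gadget page</a></td></tr>
</table>
HTML

my $stream = HTML::TokeParser->new(\$html)
    or die "Couldn't create parser";

my @records;
while (my $td = $stream->get_tag('td')) {
    # get_tag returns [ $tag, \%attr, \@attrseq, $source_text ] for a start tag.
    # Identify the block by an attribute of the first cell (assumed class name).
    next unless ($td->[1]{class} || '') eq 'name';
    my $name = $stream->get_trimmed_text('/td');

    # The second cell must follow and carry the other assumed class.
    my $second = $stream->get_tag('td') or last;
    next unless ($second->[1]{class} || '') eq 'link';

    # Stop at the cell's end tag so a missing link cannot match a later row.
    my $anchor = $stream->get_tag('a', '/td') or last;
    next unless $anchor->[0] eq 'a';
    my $href  = $anchor->[1]{href};
    my $label = $stream->get_trimmed_text('/a');

    push @records, { name => $name, label => $label, href => $href };
}

# Hypothetical helper: quote a CSV field if it contains a comma, quote or newline.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\n]/) {
        $value =~ s/"/""/g;
        $value = qq{"$value"};
    }
    return $value;
}

my @lines;
foreach my $record (@records) {
    push @lines, join(',', map { csv_field($record->{$_}) } qw(name label href));
}
print "$_\n" for @lines;
```

Each element of @records is a plain hash reference, so it can be passed to other functions as is; the CSV step is kept separate for that reason.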


Andrew.Johnson [ Mon, 02 May 2005 17:36 ] [ ID #774502 ]