Help help! Writer trying to program!

Hello there,

I wrote a script to scrape BusinessWeek's search results. It worked
fine, but now I am trying to authenticate my agent to BusinessWeek
first, before I run my search, so that the search results don't point
at registration pages and I can access the results and parse them. I
realize my code is ghetto, but that's because I did not understand the
better Perl HTML parsing modules.



The first script below is the one that works.

The second is my mangled attempt to authenticate.

Any help would be much appreciated.



use LWP::Simple;
use HTML::SimpleParse;
use Win32API::File 0.08 qw( :ALL );
use LWP::UserAgent;
use Win32::OLE;
use Win32::SAM;
use Win32::Slingshot;

$| = 1;
my @words = ('Different',
             '"key+words"');

my $ref = -1;
foreach (@words) {
    $ref++;
    $index[$ref] = get("http://search.businessweek.com/Search?searchTerm=$words[$ref]&skin=BusinessWeek&x=9&y=5");
    $p = new HTML::SimpleParse( $index[$ref] );
    open(OUTFILE, ">output[$ref].txt") or die "Can't open output.txt: $!";

    $flag = 0;
    $test = 0;

    foreach ($p->tree) {
        if ($p->execute($_) =~ /Results /) {
            $flag = 1;
        }
        if ($flag == 1) {
            $test++;
            print OUTFILE $p->execute($_);
            if ($p->execute($_) =~ /Result page/) {
                $flag = 0;
            }
        }
    }

    print "There were $test lines saved for parsing for $words[$ref]\n";
    close OUTFILE;
    open(INFILE, "output[$ref].txt") or die "Can't open output.txt: $!";
    open(OUTFILE, ">goodies[$ref].txt") or die "Can't open goodies.txt: $!";

    while (<INFILE>) {
        if ($_ =~ /<a href/) {
            ($url, $BetweenTheBold) = $_ =~ /.*'(.*)'.*<b>(.*)<\/b>/;
            print OUTFILE "$url\n";
            print OUTFILE "$BetweenTheBold\n";
        }
        elsif ($_ =~ /\d{2}/) {
            ($date) = $_ =~ /-.*((January|February|September|November|December|March|April|May|June|July|August|October).{2}.*\d{4}).*/;
            print OUTFILE "$date\n\n";
        }
    }
    close INFILE;
    close OUTFILE;
}

my $var = -1;
open(OUTFILE, ">total.txt") or die "Can't open total.txt: $!";
while ($var < $ref) {
    $var++;
    open(INFILE, "goodies[$var].txt") or die "Can't open goodies.txt: $!";
    while (<INFILE>) {
        if ($_ =~ /\w/) {
            print OUTFILE $_;
        }
    }
    close INFILE;
    DeleteFile("goodies[$var].txt");
    DeleteFile("output[$var].txt");
}
close OUTFILE;

AND WITH AUTHENTICATION



use LWP::Simple;
use HTML::SimpleParse;
use Win32API::File 0.08 qw( :ALL );
use LWP::UserAgent;
use Win32::OLE;
use Win32::SAM;
use Win32::Slingshot;

$| = 1;
my @words = ('Different',
             '"key+words"');

#AUTHENTICATE

my $browser = LWP::UserAgent->new;
$browser->credentials(
    'www-secure.businessweek.com',
    '',
    'andrewljohnson' => 'hermit85'
);

my $ref = -1;
foreach (@words) {
    $ref++;
    # $browser->get returns an HTTP::Response object, so keep its body text
    $index[$ref] = $browser->get("http://search.businessweek.com/Search?searchTerm=$words[$ref]&skin=BusinessWeek&x=9&y=5")->content;
    $p = new HTML::SimpleParse( $index[$ref] );
    open(OUTFILE, ">output[$ref].txt") or die "Can't open output.txt: $!";

    $flag = 0;
    $test = 0;

    foreach ($p->tree) {
        if ($p->execute($_) =~ /Results /) {
            $flag = 1;
        }
        if ($flag == 1) {
            $test++;
            print OUTFILE $p->execute($_);
            if ($p->execute($_) =~ /Result page/) {
                $flag = 0;
            }
        }
    }

    print "There were $test lines saved for parsing for $words[$ref]\n";
    close OUTFILE;
    open(INFILE, "output[$ref].txt") or die "Can't open output.txt: $!";
    open(OUTFILE, ">goodies[$ref].txt") or die "Can't open goodies.txt: $!";

    while (<INFILE>) {
        if ($_ =~ /<a href/) {
            ($url, $BetweenTheBold) = $_ =~ /.*'(.*)'.*<b>(.*)<\/b>/;
            print OUTFILE "$url\n";
            print OUTFILE "$BetweenTheBold\n";
        }
        elsif ($_ =~ /\d{2}/) {
            ($date) = $_ =~ /-.*((January|February|September|November|December|March|April|May|June|July|August|October).{2}.*\d{4}).*/;
            print OUTFILE "$date\n\n";
        }
    }
    close INFILE;
    close OUTFILE;
}

my $var = -1;
open(OUTFILE, ">total.txt") or die "Can't open total.txt: $!";
while ($var < $ref) {
    $var++;
    open(INFILE, "goodies[$var].txt") or die "Can't open goodies.txt: $!";
    while (<INFILE>) {
        if ($_ =~ /\w/) {
            print OUTFILE $_;
        }
    }
    close INFILE;
    DeleteFile("goodies[$var].txt");
    DeleteFile("output[$var].txt");
}
close OUTFILE;

( Andrew Johnson )
) Marketing Writer (
( Elias/Savion Advertising )
( Phone: 412.642.7700 Fax 412.642.2277 )
) www.elias-savion.com (
( andrew.johnson [at] elias-savion.com )

Andrew.Johnson [ Wed, 23 February 2005 17:41 ] [ ID #660539 ]

Re: Help help! Writer trying to program!

On Wed, Feb 23, 2005 at 11:41:58AM -0500, Andrew Johnson (Andrew.Johnson [at] elias-savion.com) wrote:
> code is ghetto, but that's because I did not understand the better Perl HTML
> parsing modules.

Go take a look at WWW::Mechanize first. Much of your parsing for links
is handled for you.
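
To make the WWW::Mechanize suggestion concrete, here is a minimal sketch of link extraction without a hand-written regex. It is only an illustration: the sample HTML, its example.com URLs, and the results_links helper are made up, and a local temporary file stands in for a real results page so the snippet runs without touching the network. The login sub is defined but never called, and its URL and form field names are guesses, not BusinessWeek's real ones.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical login helper: the URL and the field names are placeholders.
sub login {
    my ($mech, $user, $pass) = @_;
    $mech->get('https://www-secure.businessweek.com/');    # placeholder URL
    # with_fields selects the form that contains these (assumed) field names
    $mech->submit_form(with_fields => { username => $user, password => $pass });
    return $mech->success;
}

# Return [url, text] pairs for every link whose URL matches $pattern.
sub results_links {
    my ($mech, $pattern) = @_;
    return map { [ $_->url_abs->as_string, $_->text ] }
           $mech->find_all_links(url_regex => $pattern);
}

# Made-up sample page standing in for a search results page.
my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><body>
<a href='http://www.example.com/story-one.htm'><b>First sample headline</b></a>
<a href='http://www.example.com/story-two.htm'><b>Second sample headline</b></a>
<a href='/register.htm'>Register</a>
</body></html>
HTML
close $fh;

# Load the local file through Mechanize, exactly as a remote page would be.
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get(URI::file->new($file));

my @links = results_links($mech, qr/story/);
print "$_->[0]\n$_->[1]\n\n" for @links;
```

Against the real site, one would presumably call login first and then get the search URL on the same $mech object, since Mechanize keeps cookies between requests by default.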

Also, make sure you have "use warnings;" and "use strict;" at the top of
every program.
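
As an illustration of what the two pragmas buy you, here is a small sketch that runs the link regex from the posted script under strict and warnings. The extract_link helper and the sample line are invented for the example. Under strict every variable has to be declared with my, so a misspelled name becomes a compile-time error instead of a silently empty value. Warnings also flag things like a one-element array slice (@list[0]) that is better written as a scalar element ($list[0]).

```perl
use strict;
use warnings;

# Pull the quoted URL and the bold title out of one line of saved results.
# Returns an empty list when the line does not match.
sub extract_link {
    my ($line) = @_;
    my ($url, $title) = $line =~ /.*'(.*)'.*<b>(.*)<\/b>/;
    return defined $url ? ($url, $title) : ();
}

# Invented sample line in the shape the regex expects.
my $sample = "<a href='http://www.example.com/story.htm'><b>Sample headline</b></a>";

my ($url, $title) = extract_link($sample);
print "$url\n$title\n";

# With strict on, a typo such as $titel in place of $title would stop
# compilation instead of quietly printing an empty string.
```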

xoa


--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Wed, 23 February 2005 17:54 ] [ ID #660540 ]