Problem with blocking

Hello,

I am using HTML::Parser to extract hyperlinks from a web page. I wrote a
module, MyParser, which is based on that Perl module.

Now I want to ensure that the parser has finished its work before I query the
links. I have a member function, getHyperlinks, which returns a hash containing
the links. Nothing complicated:

$p->parse;
my %links = $p->getHyperlinks;

I thought that the parser might still be parsing, so I added a boolean variable,
$isParsing, which is set to true by the start_document handler and set to
false by the end_document handler.

sub getHyperlinks {
    my $self = shift;
    while ($isParsing) { }
    return %{ $self->{HYPERLINKS} };
}

Now the program actually seems to block. What is the best way to ensure
that the parser has finished extracting all links without blocking the whole
thing?

Best Regards,

Oliver
oliver.block [ Tue, 25 April 2006 02:00 ] [ ID #1289332 ]

Re: Problem with blocking

> sub getHyperlinks {
>     my $self = shift;
>     while ($isParsing) { }
>     return %{ $self->{HYPERLINKS} };
> }

The while loop is empty. Nothing can change the value of $isParsing.
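
To see why, here is a minimal sketch that records when each handler runs. It assumes HTML::Parser 3.x and an ordinary single-threaded script; the @trace array, the $log helper, and the sample HTML are made up for illustration. Handlers are invoked only from inside the parse and eof calls, so a flag that a handler clears cannot change while some other piece of code is spinning in a loop.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @trace;    # hypothetical log of the order in which things happen

# One shared handler: records the event name (and tag name, if passed).
my $log = sub { push @trace, join ':', @_ };

my $p = HTML::Parser->new(
    api_version      => 3,
    start_document_h => [ $log, 'event' ],
    start_h          => [ $log, 'event,tagname' ],
    end_document_h   => [ $log, 'event' ],
);

$p->parse('<a href="http://example.com/">x</a>');
push @trace, 'parse returned';

$p->eof;      # end_document is only triggered by eof
push @trace, 'eof returned';

print "$_\n" for @trace;
```

The trace should list start_document and start:a before "parse returned", and end_document only after eof has been called. If that holds, a script that calls parse but never eof would leave $isParsing true forever.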

You may want to investigate any of the number of extant link
extracting modules on CPAN.

For that matter, if you want to just fetch a page and return a list
of links, it can be as simple as:

use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( "http://myurl" );
for my $link ( $mech->links ) {
print $link->url, "\n";
}

xoxo,
Andy


--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Tue, 25 April 2006 04:58 ] [ ID #1289333 ]

Re: Problem with blocking

On Tuesday, 25 April 2006 04:58, Andy Lester wrote:
> > sub getHyperlinks {
> >     my $self = shift;
> >     while ($isParsing) { }
> >     return %{ $self->{HYPERLINKS} };
> > }
>
> The while loop is empty. Nothing can change the value of $isParsing.

That's right! At least not in the while loop.

The value is changed in the start_document_handler (true) and
end_document_handler (false). The loop is just to ensure the parser is not
parsing.

But possibly I got confused because I have been working with the ithreads
module, and with locking and semaphores, over the last few days!? :)

...
$p->parse;
my %links = $p->getHyperlinks;
...

Is there a chance that the calling program calls $p->getHyperlinks while the
parser is still parsing the page? Or is $p->getHyperlinks only called after
$p->parse has returned (though without a value)?

I am sorry if I am confusing anyone.

Best Regards,

Oliver
oliver.block [ Tue, 25 April 2006 15:45 ] [ ID #1289334 ]

Re: Problem with blocking

>>> sub getHyperlinks {
>>>     my $self = shift;
>>>     while ($isParsing) { }
>>>     return %{ $self->{HYPERLINKS} };
>>> }
>>
>> The while loop is empty. Nothing can change the value of $isParsing.
>
> That's right! At least not in the while loop.
>

Once you get into that while loop, you can never get out.
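
A minimal sketch of the same idea without the flag, assuming a single-threaded script and HTML::Parser 3.x. The _start handler, the sample HTML, and the choice to store a count per URL are illustrative assumptions, not the original MyParser. Because parse and eof return only after the handlers have run, getHyperlinks can simply return the hash.

```perl
package MyParser;
use strict;
use warnings;
use parent 'HTML::Parser';

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(
        api_version => 3,
        start_h     => [ \&_start, 'self,tagname,attr' ],
        %args,
    );
    $self->{HYPERLINKS} = {};
    return $self;
}

# Called by HTML::Parser for every start tag; keeps the href of <a> tags.
sub _start {
    my ($self, $tagname, $attr) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};
    $self->{HYPERLINKS}{ $attr->{href} }++;    # value = occurrence count
}

# No wait loop: by the time this is called, parse/eof have returned.
sub getHyperlinks {
    my $self = shift;
    return %{ $self->{HYPERLINKS} };
}

package main;

my $p = MyParser->new;
$p->parse('<p><a href="http://example.com/a">A</a> <a href="/b">B</a></p>');
$p->eof;    # flush anything still buffered and end the document

my %links = $p->getHyperlinks;
print "$_\n" for sort keys %links;
```

If the page arrives in chunks, parse can be called once per chunk, with eof called once at the end and before getHyperlinks.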

Andy [ Tue, 25 April 2006 16:16 ] [ ID #1289335 ]