WWW::Mechanize -- bug in find_all_images

Hi Andy,

This little script:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get ( "http://news.google.com" );
my @tImages = $mech->find_all_images( url_regex => qr/imgurl=/ );

Produces the following output:

Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
...


This patch to v1.30 fixes the problem:

--- Mechanize.pm-1.30 2007-06-16 22:42:27.000000000 +0200
+++ Mechanize.pm 2007-06-16 22:59:21.000000000 +0200
@@ -1049,10 +1049,11 @@
# No conditions, anything matches
return 1 unless keys %$p;

- return if defined $p->{url} && !($image->url eq $p->{url} );
- return if defined $p->{url_regex} && !($image->url =~ $p->{url_regex} );
- return if defined $p->{url_abs} && !($image->url_abs eq $p->{url_abs} );
- return if defined $p->{url_abs_regex} && !($image->url_abs =~ $p->{url_abs_regex} );
+ return if defined $p->{url} && !($image->url && $image->url eq $p->{url} ); #[1]
+ return if defined $p->{url_regex} && !($image->url && $image->url =~ $p->{url_regex} );
+ return if defined $p->{url_abs} && !($image->url_abs && $image->url_abs eq $p->{url_abs} );
+ return if defined $p->{url_abs_regex} && !($image->url_abs && $image->url_abs =~ $p->{url_abs_regex} );
+
return if defined $p->{alt} && !(defined($image->alt) && $image->alt eq $p->{alt} );
return if defined $p->{alt_regex} && !(defined($image->alt) && $image->alt =~ $p->{alt_regex} );
return if defined $p->{tag} && !($image->tag && $image->tag eq $p->{tag} );


I'm not sure if all 4 lines really need the change - the second line
would fix my problem - but I put them in to be safe :-)
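The guard idea in the patch can be tried in isolation. What follows is only a minimal sketch, not the Mechanize.pm code itself: `My::Image` and `image_matches` are hypothetical names made up for illustration, and the sketch tests with `defined` (as the existing `alt` lines in the same hunk do) rather than with plain truthiness as in the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an image object; only the url accessor matters here.
{
    package My::Image;
    sub new { my ( $class, %args ) = @_; return bless { %args }, $class }
    sub url { return $_[0]{url} }    # undef when the tag had no src
}

# Guarded url / url_regex checks, modelled on the patched lines.
sub image_matches {
    my ( $image, $p ) = @_;

    # No conditions, anything matches
    return 1 unless keys %$p;

    my $url = $image->url;
    return 0 if defined $p->{url}
        && !( defined $url && $url eq $p->{url} );
    return 0 if defined $p->{url_regex}
        && !( defined $url && $url =~ $p->{url_regex} );
    return 1;
}

my $with_src    = My::Image->new( url => '/imgres?imgurl=http://example.com/a.jpg' );
my $without_src = My::Image->new();    # no url at all

for my $image ( $with_src, $without_src ) {
    my $label   = defined( $image->url ) ? $image->url : '(no src)';
    my $verdict = image_matches( $image, { url_regex => qr/imgurl=/ } ) ? 'match' : 'no match';
    print "$label: $verdict\n";
}
```

With the guard in place, an image without a url simply fails the filter instead of reaching the pattern match with an undefined value. Using `defined` also keeps a url of "0" from being treated as missing, which the truthiness test in the patch would do.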

Cheers,

Peter


peter.stevens [ Sat, 16 June 2007 23:13 ] [ ID #1740113 ]

Re: WWW::Mechanize -- bug in find_all_images

On Jun 16, 2007, at 4:13 PM, Peter Stevens wrote:

> + return if defined $p->{url} && !($image->url && $image->url eq $p->{url} ); #[1]
> + return if defined $p->{url_regex} && !($image->url && $image->url =~ $p->{url_regex} );
> + return if defined $p->{url_abs} && !($image->url_abs && $image->url_abs eq $p->{url_abs} );
> + return if defined $p->{url_abs_regex} && !($image->url_abs && $image->url_abs =~ $p->{url_abs_regex} );

But why would $image->url come back as undef? That should be the
real thing to check.

--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Sat, 16 June 2007 23:17 ] [ ID #1740114 ]

Re: WWW::Mechanize -- bug in find_all_images

Andy Lester wrote:
> But why would $image->url come back as undef? That should be the real
> thing to check.
>
Quite simple really. There are images which are not contained in an
<a>...</a> block. Again from news.google.com, here are two examples:

<img width=1 height=1 alt="">
<img src=/images/cleardot.gif width=1 height=2>

I like the first example because it is a pure placeholder. No real image
at all :-)

Cheers,
Peter

peter.stevens [ Sun, 17 June 2007 05:27 ] [ ID #1740615 ]

Re: WWW::Mechanize -- bug in find_all_images

On Jun 16, 2007, at 10:27 PM, Peter Stevens wrote:

> Quite simple really. There are images which are not contained in an
> <a>...</a> block. Again from news.google.com, here are two examples:
>
> <img width=1 height=1 alt="">
> <img src=/images/cleardot.gif width=1 height=2>

It's not that they're not in <a> tags. It's that the first one
doesn't have an src. That's bizarre. I'm not sure it's a behavior
I'm too worried about.
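The warning itself is easy to reproduce without Mechanize. This is a minimal sketch, assuming that the url of a src-less image comes back as undef, as discussed above; the variable names are made up for illustration, and `$SIG{__WARN__}` is used only to capture the warning text.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $url;                      # undef: stands in for the url of <img width=1 height=1 alt="">
my $pattern = qr/imgurl=/;

# Unguarded match: Perl warns "Use of uninitialized value ... in pattern match (m//)".
my $unguarded = ( $url =~ $pattern );
my $unguarded_warnings = scalar @warnings;

# Guarded match: the defined test short-circuits, so the match never sees undef.
my $guarded = ( defined $url && $url =~ $pattern );

print "warnings from unguarded match: $unguarded_warnings\n";
print "warnings in total:             ", scalar(@warnings), "\n";
print "first warning: $warnings[0]" if @warnings;
```

Both matches come out false; the only difference is whether a warning is emitted along the way.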

--
Andy Lester => andy [at] petdance.com => www.petdance.com => AIM:petdance
Andy [ Sun, 17 June 2007 06:07 ] [ ID #1740616 ]

Re: WWW::Mechanize -- bug in find_all_images

>> <img width=1 height=1 alt="">
>
> It's not that they're not in <a> tags. It's that the first one
> doesn't have an src. That's bizarre. I'm not sure it's a behavior
> I'm too worried about.
Sorry, you're right.

news.google.com uses src-less imgs 16 times, and that is exactly how many
warnings mech reports.
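A count like this can be checked with a crude scan of the page source. The following is a rough sketch only: it uses a regex rather than a real HTML parser, so it can be fooled by comments or unusual markup, and the sample HTML is the two tags quoted earlier plus one invented third tag.

```perl
use strict;
use warnings;

# Sample markup: the first two tags are from the thread, the third is made up.
my $html = <<'HTML';
<img width=1 height=1 alt="">
<img src=/images/cleardot.gif width=1 height=2>
<a href="/x"><img alt="logo"></a>
HTML

# Crude scan: grab every <img ...> tag, then keep the ones with no src attribute.
my @img_tags = $html =~ /<img\b[^>]*>/gi;
my @srcless  = grep { !/\bsrc\s*=/i } @img_tags;

printf "%d of %d img tags have no src\n", scalar @srcless, scalar @img_tags;
print "  $_\n" for @srcless;
```

If the assumption holds that each src-less tag triggers one warning per `find_all_images` call with a url filter, the number printed here should equal the number of warnings in the log.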

The patch keeps my log files much smaller. :-)

I do hope you will add it to the standard release.

Thanks

Peter

peter.stevens [ Sun, 17 June 2007 07:35 ] [ ID #1740617 ]