Patch for WWW::RobotRules.pm
I've got a spider that uses LWP::RobotUA (WWW::RobotRules), and a few
users of the spider have complained that the warning messages are
not obvious enough. I agree: when they are spidering multiple hosts,
the message doesn't tell them which robots.txt had the problem.
So maybe something like:
--- RobotRules.pm.old 2004-04-09 08:37:08.000000000 -0700
+++ RobotRules.pm 2004-09-16 09:46:03.000000000 -0700
@@ -70,7 +70,7 @@
}
elsif (/^\s*Disallow\s*:\s*(.*)/i) {
unless (defined $ua) {
- warn "RobotRules: Disallow without preceding User-agent\n";
+ warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
$is_anon = 1; # assume that User-agent: * was intended
}
my $disallow = $1;
@@ -97,7 +97,7 @@
}
}
else {
- warn "RobotRules: Unexpected line: $_\n";
+ warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
}
}
--
Bill Moseley
moseley [at] hank.org
Re: Patch for WWW::RobotRules.pm
Bill Moseley <moseley [at] hank.org> writes:
> I've got a spider that uses LWP::RobotUA (WWW::RobotRules), and a few
> users of the spider have complained that the warning messages are
> not obvious enough. I agree: when they are spidering multiple hosts,
> the message doesn't tell them which robots.txt had the problem.
The patch I've now applied is this one:
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.31
retrieving revision 1.32
diff -u -p -u -r1.31 -r1.32
--- lib/WWW/RobotRules.pm 12 Nov 2004 16:05:09 -0000 1.31
+++ lib/WWW/RobotRules.pm 12 Nov 2004 16:14:25 -0000 1.32
@@ -1,8 +1,8 @@
package WWW::RobotRules;
-# $Id: RobotRules.pm,v 1.31 2004/11/12 16:05:09 gisle Exp $
+# $Id: RobotRules.pm,v 1.32 2004/11/12 16:14:25 gisle Exp $
-$VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.32 $ =~ /(\d+)\.(\d+)/);
sub Version { $VERSION; }
use strict;
@@ -70,7 +70,7 @@ sub parse {
}
elsif (/^\s*Disallow\s*:\s*(.*)/i) {
unless (defined $ua) {
- warn "RobotRules: Disallow without preceding User-agent\n";
+ warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n" if $^W;
$is_anon = 1; # assume that User-agent: * was intended
}
my $disallow = $1;
@@ -97,7 +97,7 @@ sub parse {
}
}
else {
- warn "RobotRules: Unexpected line: $_\n";
+ warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
}
}
> So maybe something like:
>
> --- RobotRules.pm.old 2004-04-09 08:37:08.000000000 -0700
> +++ RobotRules.pm 2004-09-16 09:46:03.000000000 -0700
> @@ -70,7 +70,7 @@
> }
> elsif (/^\s*Disallow\s*:\s*(.*)/i) {
> unless (defined $ua) {
> - warn "RobotRules: Disallow without preceding User-agent\n";
> + warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
> $is_anon = 1; # assume that User-agent: * was intended
> }
> my $disallow = $1;
> @@ -97,7 +97,7 @@
> }
> }
> else {
> - warn "RobotRules: Unexpected line: $_\n";
> + warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
> }
> }
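
For anyone wondering what the applied version means for a spider in practice, here is a rough, self-contained sketch. Note that check_robots_txt() is a hypothetical stand-in written for this example, not the real WWW::RobotRules::parse; it only mimics the two patched branches. The example.* URIs and the __WARN__ handler are likewise assumptions for illustration. The point it tries to show: each warning now names the robots.txt URI, and because of the added "if $^W", nothing is emitted unless warnings are switched on (perl -w or $^W = 1).

```perl
#!/usr/bin/perl
use strict;

# Hypothetical stand-in for the patched branches of WWW::RobotRules::parse.
# It is NOT the module's real parser; it only reproduces the warning logic.
sub check_robots_txt {
    my ($robot_txt_uri, $txt) = @_;
    my $ua;    # set once a User-agent line has been seen
    for (split /\r?\n/, $txt) {
        next if /^\s*(?:#.*)?$/;    # skip blank lines and comments
        if (/^\s*User-Agent\s*:\s*(.*)/i) {
            $ua = $1;
        }
        elsif (/^\s*Disallow\s*:\s*(.*)/i) {
            unless (defined $ua) {
                warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n" if $^W;
                $ua = '*';    # simplification: assume User-agent: * was intended
            }
        }
        else {
            warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
        }
    }
}

# Collect warnings instead of printing them, so a spider could log them per host.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $bad = "Disallow: /private\nFoo bar\n";

{
    local $^W = 1;    # same effect as running under perl -w
    check_robots_txt('http://a.example/robots.txt', $bad);
}
{
    local $^W = 0;    # warnings off: the patched code stays quiet
    check_robots_txt('http://b.example/robots.txt', $bad);
}

print @warnings;
```

Only the two warnings for a.example should end up in @warnings; the b.example run is silent because $^W is off there. So a spider that wants these messages has to run with warnings enabled.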