libwww utf8 woes

LWP seems to have issues with fetching pages that are utf-8 encoded.

Using a simple script like

use LWP::UserAgent;
use Encode;

my $ua = LWP::UserAgent->new();
my $resp = $ua->get("http://bild.de");

if (Encode::is_utf8($resp->content)) {
    print "utf8\n";
} else {
    print "no utf8\n";
}

shows

"no utf8"

(meaning that although the page is utf-8 encoded, the resulting Perl string isn't)
and it prints the warning

Parsing of undecoded UTF-8 will give garbage when decoding entities
at .../LWP/Protocol.pm line 114.

which seems to be related to a message I posted last year:

http://www.nntp.perl.org/group/perl.libwww/2006/08/msg6801.html

although there were no responses at the time.

Verified with perl 5.8.5, HTML::Parser 3.56 and libwww 5.805.

Are there any known workarounds or fixes?

-- Mike

Mike Schilli
libwww [at] perlmeister.com
libwww [ Tue, 20 February 2007 22:41 ] [ ID #1635210 ]

Re: libwww utf8 woes

libwww [at] perlmeister.com writes:

> LWP seems to have issues with fetching pages that are utf-8 encoded.
>
> Using a simple script like
>
> use LWP::UserAgent;
> use Encode;
>
> my $ua = LWP::UserAgent->new();
> my $resp = $ua->get("http://bild.de");
>
> if (Encode::is_utf8($resp->content)) {
>     print "utf8\n";
> } else {
>     print "no utf8\n";
> }
>
> shows
>
> "no utf8"
>
> (meaning that although the page is utf-8 encoded, the resulting Perl string isn't)
> and it prints the warning
>
> Parsing of undecoded UTF-8 will give garbage when decoding entities
> at .../LWP/Protocol.pm line 114.
>
> which seems to be related to a message I posted last year:
>
> http://www.nntp.perl.org/group/perl.libwww/2006/08/msg6801.html
>
> although there were no responses at the time.

This is indeed an outstanding bug that I don't have any good fix for
yet, but you can work around the warning by setting the 'parse_head'
attribute to FALSE:

my $ua = LWP::UserAgent->new(parse_head => 0);

Extracting the content using the:

$resp->decoded_content

method will decode the UTF-8 if there is a proper header to be found
in the HTTP response.
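
To illustrate the difference between the two accessors, here is a minimal
sketch that runs without network access. The hand-built HTTP::Response, its
Content-Type header and the sample string are assumptions made up for the
illustration; they are not taken from the page in the original report.

```perl
use strict;
use warnings;
use Encode;
use HTTP::Headers;
use HTTP::Response;

# Hypothetical sample body: the UTF-8 octets for "Gr", u-umlaut, sharp s, "e".
my $bytes = Encode::encode("UTF-8", "Gr\x{fc}\x{df}e");

# Build the response by hand instead of fetching it, so the charset
# in the Content-Type header is known (assumed for this sketch).
my $headers = HTTP::Headers->new("Content-Type" => "text/html; charset=UTF-8");
my $resp    = HTTP::Response->new(200, "OK", $headers, $bytes);

my $raw  = $resp->content;          # undecoded octets as received
my $text = $resp->decoded_content;  # characters, decoded per the charset header

printf "content:         utf8 flag %s, length %d\n",
    (Encode::is_utf8($raw)  ? "on" : "off"), length $raw;
printf "decoded_content: utf8 flag %s, length %d\n",
    (Encode::is_utf8($text) ? "on" : "off"), length $text;
```

With a real fetch, the same two calls would be made on the object returned by
$ua->get(...). If the server sends no usable charset, decoded_content has
nothing to go on, and calling Encode::decode on $resp->content with a charset
you choose yourself is one possible fallback.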

--Gisle
gisle [ Wed, 21 February 2007 14:26 ] [ ID #1639123 ]