LWP: Warning with utf8 data in HTML head section

There seems to be a bug in LWP that triggers a warning in
HTML::HeadParser when a fetched web document contains UTF-8 encoded
data in its HTML head section.

Example:

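use 5.008;
use strict;
use LWP;    # also loads LWP::UserAgent

# Page whose <head> contains UTF-8 encoded data
my $url = 'http://perlmeister.com/test/utf8.html';
my $ua  = LWP::UserAgent->new();

# The warning shows up during this call
my $res = $ua->get($url);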

Running this snippet produces the warning

Parsing of undecoded UTF-8 will give garbage when decoding
entities at /home/y/lib/perl5/site_perl/5.8/LWP/Protocol.pm line
114.

with LWP-5.805 and HTML-Parser-3.55.

HTML::HeadParser issues this warning if it finds UTF-8 encoded data
but the string handed in doesn't have perl's UTF-8 flag set.
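
To illustrate the distinction, here is a minimal sketch (not taken from LWP or HTML::Parser) that compares raw UTF-8 octets with the same data after decoding. The sample string is made up, and utf8::is_utf8 assumes perl 5.8.1 or later.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "ü" as two raw UTF-8 octets, the way it arrives off the wire.
my $octets = "\xC3\xBC";

# Decoding yields a one-character string that perl marks with the UTF-8 flag.
my $chars = decode('UTF-8', $octets);

printf "octets: length=%d flag=%s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length=%d flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
```

The undecoded string reports length 2 with the flag off, the decoded one length 1 with the flag on.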

Setting the UTF-8 flag on web server responses whose Content-Type
header indicates UTF-8 content, as in 'text/html; charset=utf-8',
seems to be one possible solution. But the same declaration might
instead appear in the HTML head section, which HTML::HeadParser is
supposed to parse:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

in which case the warning probably needs to be suppressed until
HTML::HeadParser is done and has verified that there's no such setting
in the HTML head.
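
A rough sketch of that order of operations, outside of LWP: parse the head of the undecoded octets first, then decode once the charset is known. The sample document is made up, and calling utf8_mode(1) (available in HTML-Parser 3.40 and later) to keep the parser quiet on raw octets is an assumption of this sketch, not necessarily what LWP should do.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::HeadParser;

# Hypothetical stand-in for a fetched body: raw octets, with the charset
# declared only in a <meta> tag and a UTF-8 encoded "ü" in the title.
my $octets = qq{<html><head>\n}
           . qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n}
           . qq{<title>M\xC3\xBCnchen</title>\n}
           . qq{</head><body></body></html>\n};

my $p = HTML::HeadParser->new;
$p->utf8_mode(1);    # treat the input as undecoded UTF-8 octets
$p->parse($octets);

# HTML::HeadParser exposes <meta http-equiv=...> values as headers.
my $ctype = $p->header('Content-Type') || '';
my ($charset) = $ctype =~ /charset=["']?([\w-]+)/i;

# Decode only after the head has told us which charset to use.
my $text = defined $charset ? decode($charset, $octets) : $octets;

print "charset from head: ", (defined $charset ? $charset : 'none'), "\n";
```

The charset lookup and the decoding are kept separate here so the decision can also take the HTTP Content-Type header into account before falling back to the meta tag.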

-- Mike

Mike Schilli
m [at] perlmeister.com
libwww [ Thu, 03 August 2006 00:47 ] [ ID #1415167 ]