libwww-perl-5.801

on 17.11.2004 15:58:00 by gisle

Eventually I found time to fix the problem with code references as
content that was introduced by 5.800 and integrate some more patches.
I will probably make a 5.802 later this week, so if there are new or
old patches you really want applied, this might be a good time to
speak up.

The changes since 5.800 are:


HTTP::Message improved content/content_ref interaction. Fixes
DYNAMIC_FILE_UPLOAD and other uses of code content in requests.
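
For anyone wondering what "code content" means here: with
DYNAMIC_FILE_UPLOAD the multipart body is built lazily through a code
reference at send time instead of being read into memory up front. A
minimal sketch (the URL and file path are just placeholders):

  use LWP::UserAgent;
  use HTTP::Request::Common qw(POST);

  # Make POST() hand back a request whose content is a code reference
  $HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

  my $ua  = LWP::UserAgent->new;
  my $req = POST "http://www.example.com/upload",
      Content_Type => "form-data",
      Content      => [ file => ["/path/to/large-file"] ];
  my $res = $ua->request($req);   # body is streamed from the file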

HTML::Form:
- Handle clicking on nameless image.
- Don't let $form->click invoke a disabled submit button.
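
For illustration, the kind of form both of these fixes affect (the HTML
and URL are made up):

  use HTML::Form;

  my $html = '<form action="/search">'
           . '<input name="q" value="libwww">'
           . '<input type="submit" disabled>'
           . '<input type="image" src="go.gif">'   # nameless image
           . '</form>';

  my ($form) = HTML::Form->parse($html, "http://www.example.com/");
  my $req = $form->click;   # should no longer pick the disabled submit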

HTTP::Cookies could not handle an "old-style" cookie named
"Expires".

HTTP::Headers work-around for thread safety issue in perl <= 5.8.4.

HTTP::Request::Common improved documentation.

LWP::Protocol: Check that we can write to the file specified in
$ua->request(..., $file) or $ua->mirror.
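
For reference, these are the call forms affected (URL and file names
are just placeholders):

  use LWP::UserAgent;
  use HTTP::Request;

  my $ua = LWP::UserAgent->new;

  # Save the response content straight to a file
  my $req = HTTP::Request->new(GET => "http://www.example.com/big.tar.gz");
  my $res = $ua->request($req, "/tmp/big.tar.gz");

  # Conditional GET that only rewrites the file if the document changed
  my $res2 = $ua->mirror("http://www.example.com/index.html", "/tmp/index.html");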

LWP::UserAgent clone() died if a proxy was not set. Patch by
Andy Lester.

HTTP::Methods now avoids a "use of uninitialized" warning when the
server replies with an incomplete status line.

lwp-download will now actually tell you why it aborts if it runs
out of disk space or fails to write for some other reason.

WWW::RobotRules: Warnings are now only displayed when running under
'perl -w', and they show which robots.txt file they correspond to.
Based on a patch by Bill Moseley.

WWW::RobotRules: Don't empty the cache when agent() is called if the
agent name does not change. Patch by Ville Skyttä.


Enjoy!

Regards,
Gisle

libwww-perl-5.802

on 01.12.2004 10:57:45 by gisle

libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:

The HTTP::Message object now has a decoded_content() method.
This will return the content after any Content-Encodings and
charsets have been decoded.
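
A minimal usage sketch (the URL is just a placeholder):

  use LWP::UserAgent;

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->get("http://www.example.com/");

  my $raw  = $res->content;          # octets exactly as received
  my $text = $res->decoded_content;  # Content-Encoding and charset decoded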

Compress::Zlib is now a prerequisite module.

HTTP::Request::Common: The POST() function created an invalid
Content-Type header for file uploads with no parameters.
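
If I read the entry right, this is the degenerate form-data case with
no parts at all, e.g. (the URL is a placeholder):

  use HTTP::Request::Common qw(POST);

  # A form-data POST with an empty parameter list
  my $req = POST "http://www.example.com/upload",
      Content_Type => "form-data",
      Content      => [];

  print $req->header("Content-Type"), "\n";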

Net::HTTP: Allow Transfer-Encoding with trailing whitespace.


Net::HTTP: Don't allow empty content to be treated as a valid
HTTP/0.9 response.

LWP::Protocol::file: Fix up directory links in the HTML generated
for directories. Patch by Moshe Kaminsky.

Makefile.PL will try to discover misconfigured systems that
can't talk to themselves and disable tests that depend on this.

Makefile.PL will now default to 'n' when asking whether to
install the "GET", "HEAD", "POST" programs. There have been
too many name clashes with these common names.


Enjoy!

decoded_content

on 01.12.2004 11:56:35 by gisle

Gisle Aas writes:

> The HTTP::Message object now has a decoded_content() method.
> This will return the content after any Content-Encodings and
> charsets have been decoded.

The current $mess->decoded_content implementation is quite naïve in
its mapping of charsets. It needs to either start using Björn's
HTML::Encoding module or start doing similar sniffing to better guess
the charset when the Content-Type header does not provide any.

I also plan to expose a $mess->charset method that would just return
the guessed charset, i.e. something similar to
encoding_from_http_message() provided by HTML::Encoding.

Another problem is that I have no idea how well the charset names
found in HTTP/HTML map to the encoding names that the perl Encode
module supports. Does anybody know what the state here is?
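
One crude way to probe this is Encode's alias resolution (the names
below are just examples):

  use Encode ();

  # Check whether a charset name seen in a Content-Type header is
  # something Encode can decode
  for my $name ("iso-8859-1", "utf-8", "x-sjis", "windows-1252") {
      my $canon = Encode::resolve_alias($name);
      printf "%-15s => %s\n", $name, $canon || "(unknown to Encode)";
  }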

When this works the next step is to figure out the best way to do
streamed decoding. This is needed for the HeadParser that LWP
invokes.

The main motivation for decoded_content is that HTML::Parser now works
better if properly decoded Unicode can be provided to it, but it still
fails here:

$ lwp-request -d www.microsoft.com
Parsing of undecoded UTF-8 will give garbage when decoding entities
at lib/LWP/Protocol.pm line 114.

Here decoded_content needs to sniff the BOM to be able to guess that
they use UTF-8 so that a properly decoded string can be provided to
HTML::HeadParser.
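
The sniffing itself is not much code; a rough sketch of such a BOM
check (not what decoded_content currently does):

  sub sniff_bom {
      my $octets = shift;
      return "UTF-8"    if $octets =~ /^\xEF\xBB\xBF/;
      return "UTF-16BE" if $octets =~ /^\xFE\xFF/;
      return "UTF-16LE" if $octets =~ /^\xFF\xFE/;
      return undef;   # no BOM found
  }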

The decoded_content also solves the frequent request for supporting
compressed content. Just do something like this:

$ua = LWP::UserAgent->new;
$ua->default_header("Accept-Encoding" => "gzip, deflate");

$res = $ua->get("http://www.example.com");
print $res->decoded_content(charset => "none");

Regards,
Gisle

Re: libwww-perl-5.802

on 04.12.2004 20:49:17 by kaminsky

* Gisle Aas [01/12/04 12:02]:
> libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:
>
> The HTTP::Message object now has a decoded_content() method.
> This will return the content after any Content-Encodings and
> charsets have been decoded.
>

For some reason, the original content is killed in the response object
when I use this method - the content() method returns an empty string
after calling decoded_content. The reason appears to be passing
$$content_ref to Encode::decode in line 220 of HTTP/Message.pm. I guess
it's probably some problem with decode(),
but in any case, replacing that line with

my $cont = $$content_ref;
$content_ref = \Encode::decode($charset, $cont, Encode::FB_CROAK());

solved the problem. This is with HTTP::Message version 1.52, perl
version 5.8.6, Encode version 2.08 on Linux.

Also, I would like to suggest adding a flag, which will cause the
content() method to return the output of decoded_content(). This will
allow scripts which ignored the charset to automatically do the right
thing by simply setting this flag.

Thanks,
Moshe


Re: decoded_content

on 04.12.2004 22:54:56 by derhoermi

* Gisle Aas wrote:
>The current $mess->decoded_content implementation is quite naïve in
>its mapping of charsets. It needs to either start using Björn's
>HTML::Encoding module or start doing similar sniffing to better guess
>the charset when the Content-Type header does not provide any.

I very much welcome ideas and patches that would help here. The module
is currently just good enough to replace the custom detection code in
the W3C Markup Validator check script (which has been the basic
motivation for the module ever since), and it is pretty much ad hoc at
that... I do indeed think that the libwww-perl modules would be a
better place for much of the functionality.

>I also plan to expose a $mess->charset method that would just return
>the guessed charset, i.e. something similar to
>encoding_from_http_message() provided by HTML::Encoding.

A $mess->header_charset might be a good start here which just gives the
charset parameter in the content-type header. This would be what

  HTML::Encoding::encoding_from_content_type($mess->header('Content-Type'))

does. HTTP::Message would be a better place for that code as the charset
parameter is far more common than just HTML/XML (all text/* types have
one, for example). The same probably goes for other things as well,
such as the BOM detection code in HTML::Encoding.
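
Something like this hypothetical helper, sketched on top of
HTTP::Headers::Util (the method name and fallback behaviour are just
assumptions):

  use HTTP::Headers::Util qw(split_header_words);

  # Hypothetical $mess->header_charset: return the charset parameter
  # from the Content-Type header, if there is one
  sub header_charset {
      my $mess = shift;
      my ($ct) = split_header_words($mess->header("Content-Type") || "");
      return undef unless $ct;
      my %param = @$ct;
      for my $k (keys %param) {
          return $param{$k} if lc($k) eq "charset";
      }
      return undef;
  }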

>Another problem is that I have no idea how well the charset names
>found in HTTP/HTML map to the encoding names that the perl Encode
>module supports. Does anybody know what the state here is?

Things might work out in common cases, but it's not quite where I think
it should be. I've recently started a thread on perl-unicode about it;
I found that using I18N::Charset is needed in addition to Encode and
that I18N::Charset (still) lacks quite a number of mappings (see the
comments in the source of the module).

>When this works the next step is to figure out the best way to do
>streamed decoding. This is needed for the HeadParser that LWP
>invokes.

One problem here is stateful encodings such as UTF-7 or the ISO-2022
family of encodings, as Encode::PerlIO notes (and attempts to work
around for many encodings). For example, the code you posted to
perl-unicode (re incomplete sequences) would fail for the UTF-7 string
"Bj+APY-rn" if it happens to split the string after "Bj+APY", which
would be a complete sequence, but the meaning of the following "-rn"
depends on the current state of the decoder, which decode() does not
maintain. So it might sometimes decode to "Bjö-rn" and sometimes to
"Björn", which is not desirable (it might have security implications,
for example).
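
A rough way to see the effect (assuming Encode's UTF-7 support; not
code from the thread):

  use Encode qw(decode);

  # Decoding the whole string vs. decoding it in two arbitrary chunks
  my $whole = decode("UTF-7", "Bj+APY-rn");                       # "Björn"
  my $split = decode("UTF-7", "Bj+APY") . decode("UTF-7", "-rn"); # "Bjö-rn"

  print "$whole\n$split\n";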

I am not sure whether there is an easy way to use the PerlIO workarounds
without using PerlIO. I've tried using PerlIO::scalar in HTML::Encoding,
but it modifies the
scalar on some encoding errors and I did not investigate this further.
Maybe Encode should provide a simpler means for decoding possibly
incomplete sequences...

Also, HTML::Parser might be the best place to deal at least with the
case where the (or an) encoding is already known, so it would decode the
bytes passed to it itself. I would then probably replace my poor custom
HTML::Encoding::encoding_from_meta_element with HTML::HeadParser looping
through possible encodings (probably giving up once one worked out; it
would currently decode with UTF-8 and ISO-8859-1 for most cases, which
is quite unlikely to return different results...)
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: libwww-perl-5.802

on 06.12.2004 14:47:51 by gisle

Moshe Kaminsky writes:

> * Gisle Aas [01/12/04 12:02]:
> > libwww-perl-5.802 is available from CPAN. The changes since 5.801 are:
> >
> > The HTTP::Message object now has a decoded_content() method.
> > This will return the content after any Content-Encodings and
> > charsets have been decoded.
> >
>
> For some reason, the original content is killed in the response object
> when I use this method - the content() method returns an empty string
> after calling decoded_content. The reason appears to be passing
> $$content_ref to Encode::decode in line 220 of HTTP/Message.pm. I guess
> it's probably some problem with decode(),
> but in any case, replacing that line with
>
> my $cont = $$content_ref;
> $content_ref = \Encode::decode($charset, $cont, Encode::FB_CROAK());
>
> solved the problem. This is with HTTP::Message version 1.52, perl
> version 5.8.6, Encode version 2.08 on Linux.

Thanks for your report. There was a similar issue with memGunzip, and
the patch I applied for it will also fix this problem.

> Also, I would like to suggest adding a flag, which will cause the
> content() method to return the output of decoded_content(). This will
> allow scripts which ignored the charset to automatically do the right
> thing by simply setting this flag.

I'm not too happy about this suggestion as is. One option is to
introduce a '$mess->decode_content' method and then make
LWP::UserAgent grow some option that makes it automatically call this
for all responses it receives. The 'decode_content' would be like

$resp->content(encode_utf8($resp->decoded_content));

but would also fix up the Content-Encoding and Content-Type headers.
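
Roughly, such a decode_content() might look like this sketch
(hypothetical; only meant to show the idea of fixing up the headers as
well):

  use Encode qw(encode_utf8);

  # Hypothetical decode_content(): replace the raw content with the
  # decoded version and adjust the headers to match
  sub decode_content {
      my $self = shift;
      my $text = $self->decoded_content;    # gunzip/inflate + charset decode
      return unless defined $text;

      $self->content(encode_utf8($text));   # store as UTF-8 octets
      $self->remove_header("Content-Encoding");

      # Rewrite the charset parameter to reflect the new encoding
      my $ct = $self->header("Content-Type");
      if (defined $ct) {
          $ct =~ s/;\s*charset=[^;]*//i;
          $self->header("Content-Type" => "$ct; charset=UTF-8");
      }
      return 1;
  }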

Regards,
Gisle