"David Hofmann" <mor4321 [at] hotmail.com> writes:
> I'm currently using your perl module for processing input from a
> spider I wrote.
>
> The problem I'm encountering is some pages have <> in the title.
>
> Example HTML:
>
> <TITLE>274500 - XL: "Save Changes in <Bookname>" Prompt Even If No
> Changes Are Made</TITLE>
>
> The result I get back is "XL: "Save Changes in ". Also the
> description, keywords and last-modified come back blank on these pages
> if they were after the title on the page.

It looks like most browsers parse the content of <title> in what the
HTML::Parser sources call literal mode.  I've now applied the
following patch to my sources, but I'm not really sure this is a good
idea.  I might still decide to revert it before release.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c 11 Nov 2004 10:12:51 -0000 2.98
+++ hparser.c 15 Nov 2004 22:19:49 -0000 2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
*
* Copyright 1999-2002, Gisle Aas
* Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
{5, "style", 1},
{3, "xmp", 1},
{9, "plaintext", 1},
+ {5, "title", 0},
{8, "textarea", 0},
{0, 0, 0}
};

The problem here is that other browsers seem to switch into a mode
where tags inside <title> are still recognized if no </title> end tag
is found in the document.  HTML-Parser does not have the brains to do
something like this.  It tries to parse the document in a stream-like
fashion, and buffering all of it to figure out which quirk mode to
parse in does not seem attractive.

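To make the trade-off concrete, here is a rough sketch of what literal
mode amounts to.  It is not the actual hparser.c code: the struct name,
the helper functions, and the reading of the third table field as an
"is CDATA" flag are all assumptions made for illustration, and the
table lists only the entries visible in the patch.  Once a start tag
matches a table entry, everything up to the matching end tag is
treated as text.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the table touched by the patch. */
struct literal_elem {
    size_t len;        /* length of the element name */
    const char *name;  /* element name in lower case */
    int is_cdata;      /* assumed meaning: 1 = entities are not decoded */
};

/* Only the entries visible in the patch are listed here. */
static const struct literal_elem literal_elems[] = {
    {5, "style", 1},
    {3, "xmp", 1},
    {9, "plaintext", 1},
    {5, "title", 0},
    {8, "textarea", 0},
    {0, 0, 0}
};

/* Compare n bytes of s against a lower-case name, ignoring case. */
static int name_eq(const char *s, const char *lower, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)s[i]) != lower[i])
            return 0;
    }
    return 1;
}

/* Return the table entry for a tag name, or NULL if the element
 * is not parsed in literal mode. */
static const struct literal_elem *find_literal(const char *tag, size_t len)
{
    const struct literal_elem *e;
    for (e = literal_elems; e->len; e++) {
        if (e->len == len && name_eq(tag, e->name, len))
            return e;
    }
    return NULL;
}

/* Scan buf from pos for "</name" (case-insensitive).  Returns the
 * offset of the '<', or -1 if the buffer holds no such end tag.
 * Simplification: the character after the name is not checked. */
static long find_end_tag(const char *buf, size_t buflen, size_t pos,
                         const struct literal_elem *e)
{
    for (; pos + 2 + e->len <= buflen; pos++) {
        if (buf[pos] == '<' && buf[pos + 1] == '/' &&
            name_eq(buf + pos + 2, e->name, e->len))
            return (long)pos;
    }
    return -1;
}
```

With "title" in the table, the <Bookname> in the example title is just
text, because the scan only stops at </TITLE>.  The -1 case is the
awkward one: a streaming parser cannot tell whether the end tag is
missing or simply has not arrived yet, so it would have to buffer the
rest of the document before it could decide to go back and recognize
tags after all.
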
Regards,
Gisle
