Autoclose for <script> and <style> in HTML::Parser

Good day to all!

When HTML::Parser encounters an unclosed <script> or <style> element, it
emits an artificial end tag just before the next opening tag.

Indeed, this behaviour is described in Changes for Release 3.39_91.

Why is it so? It breaks my tests, so I'd like to know the rationale
behind this :) Can someone enlighten me?

Thanks a lot in advance!

--
Alex Kapranoff.
kappa [ Thu, 08 June 2006 14:12 ] [ ID #1348215 ]

Re: Autoclose for <script> and <style> in HTML::Parser

Alex Kapranoff <kappa [at] rambler-co.ru> writes:

> Why is it so? It breaks my tests, so I'd like to know the rationale
> behind this :) Can someone enlighten me?

Probably just that I found some other browsers that appeared to behave
like that. What behaviour do you suggest is the sane one?

--Gisle
gisle [ Thu, 08 June 2006 16:23 ] [ ID #1348217 ]

Re: Autoclose for <script> and <style> in HTML::Parser

* Gisle Aas <gisle [at] ActiveState.com> [June 08 2006, 18:23]:
> > Why is it so? It breaks my tests, so I'd like to know the rationale
> > behind this :) Can someone enlighten me?
>
> Probably just that I found some other browsers that appeared to behave
> like that. What behaviour do you suggest is the sane one?

I'm OK with this explanation. Browsers on my machine display this case
in different ways, but none of them seems to actually interpret
anything inside these unclosed elements (anything after the opening tag).

It would, to my mind, be more logical NOT to create an artificial end_tag
event, so that the element lasts until EOF (all other elements do so).
MSIE 6 seems to do this, but Firefox and Opera do not -- they interpret
the lone '<script>' as '<script></script>', i.e. they close it immediately.

I do not know which way is best. But HTML::Parser appears to
have its own -- it closes '<script>' at the next opening tag.
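
The three behaviours mentioned above differ only in where the implied
end tag lands. A minimal sketch in C, assuming a deliberately naive notion
of "next opening tag" ('<' followed by an ASCII letter). The enum and the
function name are hypothetical and this is not HTML::Parser's code:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the three behaviours discussed in this thread. */
enum close_policy {
    CLOSE_AT_EOF,           /* element lasts until end of input */
    CLOSE_IMMEDIATELY,      /* element is treated as empty */
    CLOSE_AT_NEXT_OPEN_TAG  /* end tag is implied before the next opening tag */
};

/*
 * Given a document and the offset just past an unclosed literal start tag,
 * return the offset at which the implied end tag would be placed.
 * Assumption: an "opening tag" is simply '<' followed by an ASCII letter.
 */
size_t implied_end_offset(const char *doc, size_t after_open,
                          enum close_policy policy)
{
    size_t len, i;

    assert(doc != NULL);
    len = strlen(doc);
    assert(after_open <= len);

    switch (policy) {
    case CLOSE_IMMEDIATELY:
        return after_open;
    case CLOSE_AT_NEXT_OPEN_TAG:
        for (i = after_open; i + 1 < len; i++) {
            if (doc[i] == '<' && isalpha((unsigned char)doc[i + 1]))
                return i;
        }
        return len; /* no further opening tag: falls back to EOF */
    case CLOSE_AT_EOF:
    default:
        return len;
    }
}
```

For the input "<script>a=1;<p>x</p>" with after_open = 8, the three policies
place the implied end tag at three different offsets, which is why tests
written against one behaviour break under another.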

--
Alex Kapranoff.
kappa [ Thu, 08 June 2006 17:27 ] [ ID #1348222 ]

Re: Autoclose for <script> and <style> in HTML::Parser

Alex Kapranoff <kappa [at] rambler-co.ru> writes:

> * Gisle Aas <gisle [at] ActiveState.com> [June 08 2006, 18:23]:
> > > Why is it so? It breaks my tests, so I'd like to know the rationale
> > > behind this :) Can someone enlighten me?
> >
> > Probably just that I found some other browsers that appeared to behave
> > like that. What behaviour do you suggest is the sane one?
>
> I'm OK with this explanation. Browsers on my machine display this case
> in different ways, but none of them seems to actually interpret
> anything inside these unclosed elements (anything after the opening tag).
>
> It would, to my mind, be more logical NOT to create an artificial end_tag
> event, so that the element lasts until EOF (all other elements do so).
> MSIE 6 seems to do this, but Firefox and Opera do not -- they interpret
> the lone '<script>' as '<script></script>', i.e. they close it immediately.
>
> I do not know which way is best. But HTML::Parser appears to
> have its own -- it closes '<script>' at the next opening tag.

HTML::Parser basically treats all the literal tags (<script>, <style>,
<xmp>, <plaintext>, <title> and <textarea>) the same. I did some
experiments with Firefox and found that:

  Unclosed <script> and <style> are treated as empty elements.
  Unclosed <xmp>, <plaintext> and <textarea> treat the rest of the
    file as text.
  Unclosed <title> closes at the next tag.
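
The Firefox observations above amount to a per-element decision. A minimal
sketch of that mapping in C, assuming lower-case tag names; the enum and
function names are hypothetical and do not appear in hparser.c:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical labels for what to do with an unclosed literal element. */
enum unclosed_action {
    TREAT_AS_EMPTY,     /* <script>, <style> */
    REST_IS_TEXT,       /* <xmp>, <plaintext>, <textarea> */
    CLOSE_AT_NEXT_TAG,  /* <title> */
    NOT_LITERAL         /* any other element name */
};

/* Map a lower-case element name to the behaviour observed in Firefox. */
enum unclosed_action unclosed_action_for(const char *tag)
{
    assert(tag != NULL);

    if (strcmp(tag, "script") == 0 || strcmp(tag, "style") == 0)
        return TREAT_AS_EMPTY;
    if (strcmp(tag, "xmp") == 0 ||
        strcmp(tag, "plaintext") == 0 ||
        strcmp(tag, "textarea") == 0)
        return REST_IS_TEXT;
    if (strcmp(tag, "title") == 0)
        return CLOSE_AT_NEXT_TAG;
    return NOT_LITERAL;
}
```

Keeping the mapping in one function makes it easy to compare against what
another browser does for the same six elements.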

Konqueror simply ignored any following text for all of these except
<plaintext>. I don't have MSIE to test on right now, but I'm
considering applying this patch that implements the Firefox behaviour.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.131
diff -u -p -u -r2.131 hparser.c
--- hparser.c 9 Jun 2006 07:59:37 -0000 2.131
+++ hparser.c 9 Jun 2006 08:34:43 -0000
@@ -1760,9 +1760,26 @@ parse(pTHX_
 
     while (s < end) {
         if (p_state->literal_mode) {
-            if (strEQ(p_state->literal_mode, "plaintext") && !p_state->closing_plaintext)
+            if (strEQ(p_state->literal_mode, "plaintext") ||
+                strEQ(p_state->literal_mode, "xmp") ||
+                strEQ(p_state->literal_mode, "textarea"))
+            {
+                /* rest is considered text */
                 break;
-            p_state->pending_end_tag = p_state->literal_mode;
+            }
+            if (strEQ(p_state->literal_mode, "script") ||
+                strEQ(p_state->literal_mode, "style"))
+            {
+                /* effectively make it an empty element */
+                token_pos_t t;
+                char dummy;
+                t.beg = p_state->literal_mode;
+                t.end = p_state->literal_mode + strlen(p_state->literal_mode);
+                report_event(p_state, E_END, &dummy, &dummy, 0, &t, 1, self);
+            }
+            else {
+                p_state->pending_end_tag = p_state->literal_mode;
+            }
             p_state->literal_mode = 0;
             s = parse_buf(aTHX_ p_state, s, end, utf8, self);
             continue;
gisle [ Fri, 09 June 2006 10:50 ] [ ID #1349911 ]