Autoclose for <script> and <style> in HTML::Parser
Good day to all!
When HTML::Parser encounters an unclosed <script> or <style> tag, it
emits a synthetic end tag just before the next opening tag.
Indeed, this behaviour is described in Changes for Release 3.39_91.
Why is it so? It breaks my tests so I'd like to know the rationale
behind this :) Can someone enlighten me?
Thanks a lot in advance!
--
Alex Kapranoff.
Re: Autoclose for <script> and <style> in HTML::Parser
Alex Kapranoff <kappa [at] rambler-co.ru> writes:
> Why is it so? It breaks my tests so I'd like to know the rationale
> behind this :) Can someone enlighten me?
Probably just that I found some other browsers that appeared to behave
like that. What behaviour do you suggest is the sane one?
--Gisle
Re: Autoclose for <script> and <style> in HTML::Parser
* Gisle Aas <gisle [at] ActiveState.com> [June 08 2006, 18:23]:
> > Why is it so? It breaks my tests so I'd like to know the rationale
> > behind this :) Can someone enlighten me?
>
> Probably just that I found some other browsers that appeared to behave
> like that. What behaviour do you suggest is the sane one?
I'm ok with this explanation. Browsers on my machine display this case
in different ways but none of them seems to actually interpret
anything inside these unclosed elements (anything after the opening tag).
It would, to my mind, be more logical NOT to create an artificial end_tag
event, so that the element lasts until EOF (all the other elements do so).
MSIE 6 seems to do so, but not Firefox & Opera -- they interpret the
lonely '<script>' as '<script></script>' -- they close it immediately.
I do not know which way is the best. But HTML::Parser appears to
have its own -- it closes '<script>' at the next opening tag.
--
Alex Kapranoff.
Re: Autoclose for <script> and <style> in HTML::Parser
Alex Kapranoff <kappa [at] rambler-co.ru> writes:
> * Gisle Aas <gisle [at] ActiveState.com> [June 08 2006, 18:23]:
> > > Why is it so? It breaks my tests so I'd like to know the rationale
> > > behind this :) Can someone enlighten me?
> >
> > Probably just that I found some other browsers that appeared to behave
> > like that. What behaviour do you suggest is the sane one?
>
> I'm ok with this explanation. Browsers on my machine display this case
> in different ways but none of them seems to actually interpret
> anything inside these unclosed elements (anything after the opening tag).
>
> It would, to my mind, be more logical NOT to create an artificial end_tag
> event, so that the element lasts until EOF (all the other elements do so).
> MSIE 6 seems to do so, but not Firefox & Opera -- they interpret the
> lonely '<script>' as '<script></script>' -- they close it immediately.
>
> I do not know which way is the best. But HTML::Parser appears to
> have its own -- it closes '<script>' at the next opening tag.
HTML::Parser basically treats all the literal tags (<script>, <style>,
<xmp>, <plaintext>, <title> and <textarea>) the same. I did some
experiments with Firefox and found that:
  - Unclosed <script>, <style> are treated as empty tags.
  - Unclosed <xmp>, <plaintext>, <textarea> treat the rest of the
    file as text.
  - Unclosed <title> closes at the next tag.
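As a rough illustration only (this is not code from HTML::Parser), the Firefox observations listed above could be written down as a small lookup in C. The enum and function names are hypothetical, and the tag name is assumed to be already lowercased:

```c
#include <string.h>

/* Hypothetical names for illustration; none of these exist in hparser.c. */
enum unclosed_policy {
    UNCLOSED_EMPTY,        /* treated as an empty element: <script>, <style> */
    UNCLOSED_REST_IS_TEXT, /* rest of the file is text: <xmp>, <plaintext>, <textarea> */
    UNCLOSED_NEXT_TAG,     /* closed at the next tag: <title> */
    UNCLOSED_NOT_LITERAL   /* not one of the literal elements */
};

/* Map a lowercased tag name to the behaviour observed in Firefox
 * when the element is never closed. */
static enum unclosed_policy unclosed_policy_for(const char *tag)
{
    if (strcmp(tag, "script") == 0 || strcmp(tag, "style") == 0)
        return UNCLOSED_EMPTY;
    if (strcmp(tag, "xmp") == 0 ||
        strcmp(tag, "plaintext") == 0 ||
        strcmp(tag, "textarea") == 0)
        return UNCLOSED_REST_IS_TEXT;
    if (strcmp(tag, "title") == 0)
        return UNCLOSED_NEXT_TAG;
    return UNCLOSED_NOT_LITERAL;
}
```

The comparison is case-sensitive, so a caller would need to normalize the tag name first.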
Konqueror simply ignored any following text for all of these except
<plaintext>. I don't have MSIE to test on right now, but I'm
considering applying this patch that implements the Firefox behaviour.
Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.131
diff -u -p -u -r2.131 hparser.c
--- hparser.c 9 Jun 2006 07:59:37 -0000 2.131
+++ hparser.c 9 Jun 2006 08:34:43 -0000
@@ -1760,9 +1760,26 @@ parse(pTHX_
     while (s < end) {
         if (p_state->literal_mode) {
-            if (strEQ(p_state->literal_mode, "plaintext") && !p_state->closing_plaintext)
+            if (strEQ(p_state->literal_mode, "plaintext") ||
+                strEQ(p_state->literal_mode, "xmp") ||
+                strEQ(p_state->literal_mode, "textarea"))
+            {
+                /* rest is considered text */
                 break;
-            p_state->pending_end_tag = p_state->literal_mode;
+            }
+            if (strEQ(p_state->literal_mode, "script") ||
+                strEQ(p_state->literal_mode, "style"))
+            {
+                /* effectively make it an empty element */
+                token_pos_t t;
+                char dummy;
+                t.beg = p_state->literal_mode;
+                t.end = p_state->literal_mode + strlen(p_state->literal_mode);
+                report_event(p_state, E_END, &dummy, &dummy, 0, &t, 1, self);
+            }
+            else {
+                p_state->pending_end_tag = p_state->literal_mode;
+            }
             p_state->literal_mode = 0;
             s = parse_buf(aTHX_ p_state, s, end, utf8, self);
             continue;