HTML::Parser & <plaintext> tag
Good day to all!
As far as I can tell, HTML::Parser simply ignores the closing
</plaintext> tag. I read the tests and Changes, so I see that this is
intended behaviour and that <plaintext> is special-cased among the CDATA
elements.
Does anyone know the reasoning behind this decision? :) It is just plain
interesting. Does HTML::Parser imitate some old browser here? It
produces weird effects for me, as I am writing an HTML sanitizer for
WebMail.
--
Alex Kapranoff,
#!/usr/bin/perl -w
$SIG{__WARN__}=sub{print substr("@_",-43+ord$_,1)for
'6.823O1US90:350:739OJ;0:*'=~m}.}g},$}='PJlshrk';reset$}+43;
Re: HTML::Parser & <plaintext> tag
Alex Kapranoff <kappa [at] rambler-co.ru> writes:
> As far as I can tell, HTML::Parser simply ignores the closing
> </plaintext> tag. I read the tests and Changes, so I see that this is
> intended behaviour and that <plaintext> is special-cased among the CDATA
> elements.
>
> Does anyone know the reasoning behind this decision? :) It is just plain
> interesting.
A long time ago the HTTP protocol did not have MIME-like headers. The
client sent a "GET foo" line and the server responded with HTML and
then closed the connection. Since there was no way for the server to
indicate any Content-Type other than text/html, the <plaintext> tag was
introduced so that text files could be served by just prefixing the
file content with this tag.
This was before the <img> tag was invented so luckily we don't have a
similar unclosed <gif> tag :)
> Does HTML::Parser imitate some old browser here?
Yes, it was there in the beginning but still seems well supported. Of
my current browsers, both Konqueror and MSIE support this. Firefox
supports it in the same way as <xmp>, i.e. it allows you to escape out
of it with </plaintext>.
The <plaintext> tag is described in this historic document:
http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html#7
> It produces weird effects for me, as I am writing an HTML sanitizer for
> WebMail.
How come? Do you have a need to suppress this behaviour in HTML::Parser?
Regards,
Gisle
Re: HTML::Parser & <plaintext> tag
* Gisle Aas <gisle [at] activestate.com> [November 10 2004, 21:25]:
> then closed the connection. Since there was no way for the server to
> indicate any other Content-Type than text/html the <plaintext> tag was
> introduced so that text files could be served by just prefixing the
> file content with this tag.
>
> This was before the <img> tag was invented so luckily we don't have a
> similar unclosed <gif> tag :)
Thank you very much for this enlightenment! It explains everything!
BTW, by that time I had even seen computers once or twice from far
away :)
> my current browsers, both Konqueror and MSIE support this. Firefox
> supports it in the same way as <xmp>, i.e. it allows you to escape out
> of it with </plaintext>.
This Firefox behaviour is probably what confused me. Look, what if
I've got HTML like this: `<plaintext></plaintext><script>nasties;</script>'?
HTML::Parser stops parsing after `<plaintext>', so no event is
triggered on the `<script>' tag and my sanitizer has no chance to
rip out the nasties. Firefox (the first browser I tested) happily resumes
parsing after `</plaintext>', and that's the problem. Maybe it is the
Gecko people who are at fault.
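
The effect described above can be reproduced with a short script. This is
only a sketch: the helper name start_tags is made up for illustration, and
it assumes HTML::Parser is installed and used with its default settings. It
records the name of every start tag the parser reports for the sample input.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the names of all start tags that
# HTML::Parser reports for the given HTML, using default options.
sub start_tags {
    my ($html) = @_;
    my @seen;
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'tagname' passes the tag name as the only handler argument.
        start_h     => [ sub { push @seen, shift }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return @seen;
}

my @tags = start_tags('<plaintext></plaintext><script>nasties;</script>');
print join(',', @tags), "\n";
```

With the default behaviour only `plaintext' should show up in the list,
because everything after the opening tag is reported as text. A sanitizer
driven by start-tag events therefore never sees the `<script>' element.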
> > It produces weird effects for me, as I am writing an HTML sanitizer for
> > WebMail.
> How come? Do you have a need to suppress this behaviour in HTML::Parser?
Yes, I'd like to have an option to resume parsing after `</plaintext>'
just as Firefox does. Now that I understand the original intentions, I'll
try to produce a patch.
--
Alex Kapranoff,
Re: HTML::Parser & <plaintext> tag
Alex Kapranoff <kappa [at] rambler-co.ru> writes:
> * Alex Kapranoff <kappa [at] rambler-co.ru> [November 11 2004, 11:11]:
> > > > It produces weird effects for me, as I am writing an HTML sanitizer for
> > > > WebMail.
> > > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> > Yes, I'd like to have an option to resume parsing after `</plaintext>'
> > just as Firefox does. Now that I understand the original intentions, I'll
> > try to produce a patch.
>
> I've filed ticket 8362 at rt.cpan.org with the patch. It adds a
> boolean attribute, `closing_plaintext'. Not that I insist on the
> name.
Seems good; and I've just uploaded HTML-Parser-3.38 with this patch.
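
For readers who want to try the new attribute, here is a minimal sketch. It
assumes HTML-Parser 3.38 or later is installed; the helper name
start_tags_closing is invented for this example. The attribute is switched
on through its accessor method after the parser is constructed.

```perl
use strict;
use warnings;
use HTML::Parser 3.38 ();    # closing_plaintext first appeared in 3.38

# Hypothetical helper: like a plain start-tag collector, but with
# closing_plaintext enabled so that </plaintext> ends the element.
sub start_tags_closing {
    my ($html) = @_;
    my @seen;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @seen, shift }, 'tagname' ],
    );
    $p->closing_plaintext(1);    # resume normal parsing after </plaintext>
    $p->parse($html);
    $p->eof;
    return @seen;
}

my @tags = start_tags_closing(
    '<plaintext></plaintext><script>nasties;</script>'
);
print join(',', @tags), "\n";
```

With the attribute enabled, the `<script>' start tag should be reported
after `plaintext', which gives a sanitizer the chance to remove it.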
Re: HTML::Parser & <plaintext> tag
* Alex Kapranoff <kappa [at] rambler-co.ru> [November 11 2004, 11:11]:
> > > It produces weird effects for me, as I am writing an HTML sanitizer for
> > > WebMail.
> > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> Yes, I'd like to have an option to resume parsing after `</plaintext>'
> just as Firefox does. Now that I understand the original intentions, I'll
> try to produce a patch.
I've filed ticket 8362 at rt.cpan.org with the patch. It adds a
boolean attribute, `closing_plaintext'. Not that I insist on the
name.
--
Alex Kapranoff,