Weird loadHTML behaviour
Hi all,
I'm in the process of setting up a PHP script that reads an HTML file,
does a character conversion and then displays the contents of a single
HTML tag as follows:
$str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
'HTML-ENTITIES', 'ISO-8859-1');
file_put_contents ('dmp.htm', $str);
$dom = DOMDocument::loadHTML ($str);
$elem = $dom->getElementsByTagName ('h5');
if ($elem->length) {
    $n = $elem->item (0)->nodeValue;
    var_dump (bin2hex ($n));
}
What's interesting is that the source HTML file is properly ISO-8859-1
encoded (which the contents of "dmp.htm" verifies). The trouble starts
when I retrieve the contents of the first <h5> tag that has an umlaut
in it. In this case, the umlaut is screwed up - what used to be a
"Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c
as the var_dump confirms). What surprises me are two things: that
somehow the character changes and that the umlaut is not HTML-encoded
as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
box.
Any thoughts?
Cheers, Christoph
Re: Weird loadHTML behaviour
On May 9, 1:16 pm, monochro... [at] gmail.com wrote:
> Hi all,
>
> I'm in the process of setting up a PHP script that reads an HTML file,
> does a character conversion and then displays the contents of a single
> HTML tag as follows:
>
> $str = mb_convert_encoding (file_get_contents ('aktuel.htm'),
> 'HTML-ENTITIES', 'ISO-8859-1');
>
> file_put_contents ('dmp.htm', $str);
>
> $dom = DOMDocument::loadHTML ($str);
> $elem = $dom->getElementsByTagName ('h5');
> if ($elem->length) {
>     $n = $elem->item (0)->nodeValue;
>     var_dump (bin2hex ($n));
> }
>
> What's interesting is that the source HTML file is properly ISO-8859-1
> encoded (which the contents of "dmp.htm" verifies). The trouble starts
> when I retrieve the contents of the first <h5> tag that has an umlaut
> in it. In this case, the umlaut is screwed up - what used to be a
> "Ü" (capital U umlaut, ISO-8859-1 0xdc) has now become "Ãœ" (0xc3 0x9c
> as the var_dump confirms). What surprises me are two things: that
> somehow the character changes and that the umlaut is not HTML-encoded
> as HTML-ENTITIES would suggest. I use PHP version 5.2.1 on a Linux
> box.
>
> Any thoughts?
>
> Cheers, Christoph
After some :-) research, it turns out that the encoding of the
contents of the first <h5> tag has actually changed to UTF-8, hence
the strange byte sequence: 0xc3 0x9c is the UTF-8 encoding of "Ü".
This raises the question of whether UTF-8 is the default encoding for
parsed HTML strings in the DOM package (given that the input was
HTML-ENTITIES-encoded to begin with). Is this a bug in DOMDocument or
a feature?
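For anyone who wants to reproduce this, here is a minimal sketch of
what I am seeing. It is not the original script: the inline sample
string stands in for the entity-encoded contents of aktuel.htm, the
non-static loadHTML call is my choice for the sketch, and Windows-1252
is only my guess at the encoding a viewer would apply to produce the
"Ãœ" display.

```php
<?php
// Assumed stand-in for the entity-encoded contents of aktuel.htm.
$html = '<html><body><h5>&Uuml;ber</h5></body></html>';

libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

// The parser resolves the entity; the DOM hands the text back as UTF-8.
$value = $dom->getElementsByTagName('h5')->item(0)->nodeValue;
echo bin2hex($value), "\n";     // c39c626572: "Ü" as UTF-8 (c3 9c), then "ber"

// Reading those UTF-8 bytes as Windows-1252 yields the "Ãœ" display.
$mojibake = mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
echo bin2hex($mojibake), "\n";  // c383c593626572

// Converting back from UTF-8 restores the ISO-8859-1 byte 0xdc.
$latin1 = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n";    // dc626572
```

So the entity is resolved during parsing, which is why nodeValue holds
no "&Uuml;", and the value can be converted back to ISO-8859-1 after
it has been read from the DOM.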
Cheers, Christoph