Broken xml output

The following code produces XML which is not valid UTF-8 according to
xmllint. To make testing easier and for confidentiality reasons I've
simplified it considerably; the real script encodes several megabytes
of text from a MySQL table.

<?php
$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true);
// $xml->startDocument('1.0','ISO-8859-1');
$xml->startDocument('1.0','UTF-8');
$xml->startElement('dataroot');
$xml->writeAttribute('xmlns:od', 'urn:schemas-microsoft-com:officedata');

$xml->startElement('VACATURE');
$xml->startElement("bla");
$xml->text("commerciële");
$xml->endElement(); // </bla>
$xml->endElement(); // </VACATURE>
$xml->endElement(); // </dataroot>
$xml->endDocument();
header("Content-type: application/xml");
print $xml->outputMemory(true);
?>

That's because the text is actually ISO-8859-1. If I manually change the
encoding attribute in the output's XML declaration to ISO-8859-1, the XML
validates.

Changing the encoding argument to startDocument in the script to
'ISO-8859-1' results in a conversion error:

Warning: XMLWriter::outputMemory() function.XMLWriter-outputMemory:
output conversion failed due to conv error, bytes 0xEB 0x6C 0x65 0x3C in
broken-xml-output.txt on line 18

Doesn't make any sense to me, because it doesn't even need to convert
the string. Same trouble on two machines with PHP 5.1.2 and 5.2.0. I
could always replace it in the output with a regexp on the first line,
but that's just plain Bad and Wrong.

Steven
Steven Mocking [ Thu, 18 January 2007 14:08 ] [ ID #1600325 ]

Re: Broken xml output

Having no experience with XMLWriter, I thought of these questions:
- maybe the writer does try to convert the input text into whatever
charset you supply?
- maybe it tries to save the input text as whatever charset you supply?
- the e-umlaut may be part of ISO-8859-1, but is it part of UTF-8,
too? Or would you need a Unicode number?
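
A quick way to explore these questions is to look at the raw bytes. The
sketch below is not from the original script; it assumes the word was
stored as ISO-8859-1, which is what the bytes 0xEB 0x6C 0x65 in the
warning suggest, and the variable names $latin1 and $utf8 are only for
illustration. The ë is part of Unicode, so UTF-8 can represent it, but as
two bytes rather than one:

```php
<?php
// "commerciële" as ISO-8859-1 bytes: the ë is the single byte 0xEB.
// Hex escapes are used so the encoding of this file does not matter.
$latin1 = "commerci\xEBle";

// The same word as UTF-8: the ë (U+00EB) becomes the two bytes 0xC3 0xAB.
$utf8 = "commerci\xC3\xABle";

// The tail of the Latin-1 string matches the bytes named in the warning.
echo bin2hex(substr($latin1, 8)), "\n"; // eb6c65
echo bin2hex(substr($utf8, 8)), "\n";   // c3ab6c65

// preg_match() with the /u modifier only returns 1 for valid UTF-8
// subjects, so it doubles as a simple validity check.
var_dump(preg_match('//u', $latin1) === 1); // bool(false)
var_dump(preg_match('//u', $utf8) === 1);   // bool(true)
```

A lone 0xEB followed by an ASCII letter is not a valid UTF-8 sequence,
which would explain why xmllint rejects the output when the declaration
says UTF-8.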

Hope this helps!

OmegaJunior [ Sun, 21 January 2007 12:02 ] [ ID #1603139 ]

Re: Broken xml output

OmegaJunior wrote:
> On Thu, 18 Jan 2007 14:08:01 +0100, Steven Mocking
> <ufo [at] quicknet.youmightwanttogetridofthis.nl> wrote:
>
>> Changing the encoding argument to startDocument in the script to
>> 'ISO-8859-1' results in a conversion error:
>>
>> Warning: XMLWriter::outputMemory() function.XMLWriter-outputMemory:
>> output conversion failed due to conv error, bytes 0xEB 0x6C 0x65 0x3C in
>> broken-xml-output.txt on line 18
>>
>> Doesn't make any sense to me, because it doesn't even need to convert
>> the string. Same trouble on two machines with PHP 5.1.2 and 5.2.0. I
>> could always replace it in the output with a regexp on the first line,
>> but that's just plain Bad and Wrong.
>
> Having no experience with XMLWriter, I thought of these questions:
> - maybe the writer does try to convert the input text into whatever
> charset you supply?

Agreed, that's what it looks like. How do I prevent it from doing that
without rewriting the builtin class myself?

> - maybe it tries to save the input text as whatever charset you supply?
> - the e-umlaut may be part of ISO-8859-1, but is it part of UTF-8,
> too? Or would you need a Unicode number?

True.
Steven Mocking [ Tue, 23 January 2007 17:24 ] [ ID #1605336 ]
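
Follow-up note: as far as I know, XMLWriter sits on top of libxml2, which
treats every string it is given as UTF-8 and only converts on output, to
the encoding named in startDocument(). If that is right, the usual way
around the problem is to convert the Latin-1 data to UTF-8 before it
reaches text(), rather than patching the declaration afterwards. A minimal
sketch, assuming the source data is ISO-8859-1 and the iconv extension is
available; build_xml() is a hypothetical helper, not part of the original
script:

```php
<?php
// Hypothetical helper: wraps the structure from the original post.
// $text must already be UTF-8; $encoding is what the declaration will name.
function build_xml($text, $encoding)
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', $encoding);
    $xml->startElement('dataroot');
    $xml->writeAttribute('xmlns:od', 'urn:schemas-microsoft-com:officedata');
    $xml->startElement('VACATURE');
    $xml->startElement('bla');
    $xml->text($text);
    $xml->endElement(); // </bla>
    $xml->endElement(); // </VACATURE>
    $xml->endElement(); // </dataroot>
    $xml->endDocument();
    return $xml->outputMemory(true);
}

// Assumed input: ISO-8859-1 bytes, e.g. from a Latin-1 MySQL column.
$latin1 = "commerci\xEBle";

// Convert once, up front, so XMLWriter receives valid UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

$out = build_xml($utf8, 'UTF-8');
echo $out;
```

With UTF-8 going in, passing 'ISO-8859-1' as the second argument instead
should let the writer do the output conversion itself, so no regexp on the
first line is needed either way.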