How to detect text charset (UTF-8 or Latin-1)

Hi.

I'm writing a Perl script that extracts text from a web page using LWP,
and I want to check whether the text is UTF-8 or Latin-1 encoded.

Is there a known function for this? I don't know whether "use utf8;" is
enough.

Thank you very much in advance.
tarmstrong [ Tue, 15 January 2008 17:59 ] [ ID #1908783 ]

Re: How to detect text charset (UTF-8 or Latin-1)

Thomas Armstrong <tarmstrong [at] gmail.com> writes:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?

Considering that each and every series of UTF-8 octets is also a valid
(if nonsensical) series of Latin-1 octets, you are asking a question
that cannot be answered rigorously.

However, there are series of valid Latin-1 octets that are NOT valid
UTF-8, so that can be used as a heuristic guide.
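A minimal sketch of that guide, assuming the page body is available as
raw octets. The names is_valid_utf8 and guess_charset are made up for
illustration; only the core Encode module is used. Octets that fail a
strict UTF-8 decode are taken to be Latin-1, and pure ASCII is reported
separately because it proves nothing either way.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# True if the octet string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    # Strict 'UTF-8' rejects malformed sequences; FB_CROAK makes
    # decode() die on the first bad octet, which eval turns into false.
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: a guess, not a proof.
sub guess_charset {
    my ($octets) = @_;
    return 'ascii'      if $octets !~ /[\x80-\xFF]/;   # no evidence at all
    return 'utf-8'      if is_valid_utf8($octets);     # valid in both, UTF-8 more likely
    return 'iso-8859-1';                               # cannot be UTF-8
}
```

For example, guess_charset("caf\xE9") falls through to 'iso-8859-1'
because a lone 0xE9 is an incomplete UTF-8 sequence.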

Finally, you can devise a heuristic to give a confidence level that a
given series of octets is LIKELY to be UTF-8 or Latin-1 even in the
cases where they are valid in both codings.
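To be clear, the following is NOT the encumbered code mentioned below,
just a crude sketch of what a confidence score could look like. The
function name utf8_confidence and the scoring rule are assumptions: it
returns the fraction of high-bit octets that sit inside well-formed
UTF-8 multi-byte sequences.

```perl
use strict;
use warnings;

# Well-formed UTF-8 multi-byte sequences (no overlongs, no surrogates).
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical scoring: 1 means every high-bit octet fits UTF-8,
# 0 means none does, undef means pure ASCII (no evidence).
sub utf8_confidence {
    my ($octets) = @_;
    my $high = () = $octets =~ /[\x80-\xFF]/g;   # count high-bit octets
    return if $high == 0;
    my $covered = 0;
    $covered += length($1) while $octets =~ /($utf8_seq)/g;
    return $covered / $high;
}
```

A real heuristic would probably also weigh how many sequences were seen
and whether the decoded characters are plausible for the language.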

The sad news is, I wrote one that works wonders, but the code is
encumbered.

--
Lawrence Statton - lawrenabae [at] abaluon.abaom s/aba/c/g
Computer software consists of only two components: ones and
zeros, in roughly equal proportions. All that is required is to
place them into the correct order.
Lawrence Statton [ Tue, 15 January 2008 18:22 ] [ ID #1908784 ]

Re: How to detect text charset (UTF-8 or Latin-1)

On Jan 15, 11:59 am, Thomas Armstrong <tarmstr... [at] gmail.com> wrote:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Is there any known function? I don't know if "use utf8;" is enough
>
> Thank you very much in advance.


Parse the Content-Type header, for example:
content="text/html; charset=UTF-8"

Web pages that lie about or omit the Content-Type are not rare,
unfortunately.
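A minimal sketch of the parsing step, assuming you already have the
header value as a string (with LWP it would come from
$response->header('Content-Type')). The function name
charset_from_content_type is made up for illustration.

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type value.
# Returns the lowercased charset name, or undef if none is declared.
sub charset_from_content_type {
    my ($header) = @_;
    return unless defined $header;
    # The parameter name is case-insensitive and the value may be quoted.
    if ($header =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return lc $1;
    }
    return;
}
```

Given the caveat above, an undef result (or a declared charset that
fails to decode) still leaves you with guessing from the octets.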
smallpond [ Tue, 15 January 2008 18:45 ] [ ID #1908785 ]

Re: How to detect text charset (UTF-8 or Latin-1)

Thomas Armstrong wrote:

> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> I don't know if "use utf8;" is enough

It is not the right tool for this.

The pragma "use utf8" tells the Perl interpreter that your program's
source file is encoded in UTF-8, which matters for the string literals
in it. It does *not* affect how Perl handles data from external sources.
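A small illustration of the difference, assuming the script itself is
saved as UTF-8. The variable $octets is a stand-in for raw bytes as they
would arrive from the network; decoding them is a separate, explicit
step via the core Encode module.

```perl
use strict;
use warnings;
use utf8;                    # applies only to literals in this source file
use Encode qw(decode);

my $literal = "é";           # one character, thanks to "use utf8"
my $octets  = "\xC3\xA9";    # two raw octets, as fetched data would be

# External data stays octets until it is decoded explicitly.
my $text = decode('UTF-8', $octets);

printf "%d %d\n", length $octets, length $text;   # prints "2 1"
```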

-Joe
Joe Smith [ Tue, 15 January 2008 20:16 ] [ ID #1908795 ]

Re: How to detect text charset (UTF-8 or Latin-1)

>Thomas Armstrong wrote:
>> I'm creating a Perl script extracting text from a webpage using LWP,
>> and want to check if text is UTF-8 or Latin-1 encoded?

Check the <META Charset='...'> tag.
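A rough sketch of such a check, assuming the raw page is in a scalar.
The function name charset_from_meta is made up, and a regex is not an
HTML parser, so treat this as a first approximation only. It covers both
the short <meta charset=...> form and the http-equiv form.

```perl
use strict;
use warnings;

# Look for a charset declaration inside a <meta> tag.
# Returns the lowercased charset name, or undef if none is found.
sub charset_from_meta {
    my ($html) = @_;
    if ($html =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i) {
        return lc $1;
    }
    return;
}
```

This works on undecoded octets because the declaration itself is plain
ASCII in both UTF-8 and Latin-1.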

jue
jurgenex [ Tue, 15 January 2008 21:53 ] [ ID #1908806 ]

Re: How to detect text charset (UTF-8 or Latin-1)

Jürgen Exner <jurgenex [at] hotmail.com> writes:

>>Thomas Armstrong wrote:
>>> I'm creating a Perl script extracting text from a webpage using LWP,
>>> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Check the <META Charset='...'> tag.
>
> jue

or the xml prolog (if there is one)
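A minimal sketch for that case, assuming the raw document is in a
scalar. The function name charset_from_xml_prolog is made up for
illustration. The declaration has to be at the very start of the
document, optionally after a UTF-8 byte order mark.

```perl
use strict;
use warnings;

# Read the encoding pseudo-attribute from an XML declaration.
# Returns the lowercased encoding name, or undef if none is declared.
sub charset_from_xml_prolog {
    my ($doc) = @_;
    if ($doc =~ /\A(?:\xEF\xBB\xBF)?<\?xml\b[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/) {
        return lc $1;
    }
    return;
}
```

If there is no declaration, or it names no encoding, XML itself
defaults to UTF-8 (or UTF-16 when a matching byte order mark is
present).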

Joost.
Joost Diepenmaat [ Tue, 15 January 2008 23:11 ] [ ID #1908815 ]