unicode weirdness

I have code that reads the clipboard, expecting text copied out of Firefox
from mint.com. However, when I copy the lines I end up with a bunch of
Unicode characters mixed in. The n-dash is particularly irritating, and I
want to change it to a regular hyphen.

When I "paste" into a BBEdit UTF8 window, BBEdit says the n-dash is
character x2013, so I thought this code would work:

**************************************
my ($date,$comment,$mcat,$amt) = split /\t/;
my @pd = parseDate($date);
my $ds = dateStamp(@pd);

# make sure $amt looks like a real amount
$amt =~ tr/$,//d;
$amt =~ s/\x{2013}/-/g;
**************************************

... but it does nothing; the substitution doesn't find the n-dash.

So I went in and added this code to test it:

**************************************
print $amt, "\n";
print $_.": ord(".ord($_).") chr(".chr(ord($_)).")\n" for split(//,$amt);
exit;
**************************************

And here's what it prints:

**************************************
–16.58
?: ord(226) chr(?)
?: ord(128) chr(?)
?: ord(147) chr(?)
1: ord(49) chr(1)
6: ord(54) chr(6)
.: ord(46) chr(.)
5: ord(53) chr(5)
8: ord(56) chr(8)
**************************************

Uh -- I thought perl treated unicode characters as regular characters (as
opposed to bytes)? Why does the n-dash come up as three separate
characters? How do I change that n-dash into a hyphen?

TIA.

- Bryan



--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Bryan Harris [ Sat, 24 July 2010 21:57 ] [ ID #2045074 ]

Re: unicode weirdness

On Saturday 24 Jul 2010 22:57:00 Bryan Harris wrote:
> Uh -- I thought perl treated unicode characters as regular characters (as
> opposed to bytes)? Why does the n-dash come up as three separate
> characters? How do I change that n-dash into a hyphen?
>

You probably should use the Encode module and read
http://perldoc.perl.org/perlunitut.html .
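
As a rough sketch of what that advice could look like for the amount field: decode the bytes first, then run the original substitution on the decoded string. The helper name `clean_amount` is made up for illustration, and the sketch assumes the clipboard delivers UTF-8 encoded bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): takes the raw bytes
# of an amount field and returns a plain-ASCII amount string.
sub clean_amount {
    my ($bytes) = @_;

    # Assumption: the input is UTF-8 encoded bytes, as in the ord() dump.
    my $amt = decode('UTF-8', $bytes);

    $amt =~ tr/$,//d;          # drop dollar signs and thousands separators
    $amt =~ s/\x{2013}/-/g;    # U+2013 EN DASH -> ASCII hyphen-minus
    return $amt;
}

# "\xE2\x80\x93" is the three-byte UTF-8 encoding of U+2013.
print clean_amount("\xE2\x80\x9316.58"), "\n";
```

Once the string has been decoded, the en dash is a single character, so the `\x{2013}` pattern has something to match.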

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Optimising Code for Speed - http://shlom.in/optimise

God considered inflicting XSLT as the tenth plague of Egypt, but then
decided against it because he thought it would be too evil.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Shlomi Fish [ Sun, 25 July 2010 11:11 ] [ ID #2045076 ]

Re: unicode weirdness

On Sat, Jul 24, 2010 at 15:57, Bryan Harris <bryanslists [at] gmail.com> wrote:
snip
> ?: ord(226) chr(?)
> ?: ord(128) chr(?)
> ?: ord(147) chr(?)
snip

In UTF-8, the bytes 226, 128, 147 represent U+2013 EN DASH.
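
The arithmetic behind that statement can be checked by hand. The sketch below is only an illustration of the three-byte UTF-8 layout; the helper name `three_byte_code_point` is invented here, and real code should use Encode instead.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one three-byte UTF-8 sequence by hand.
# Layout: 1110xxxx 10yyyyyy 10zzzzzz -> code point xxxxyyyyyyzzzzzz
sub three_byte_code_point {
    my ($b1, $b2, $b3) = @_;
    return (($b1 & 0x0F) << 12)    # low 4 bits of the lead byte
         | (($b2 & 0x3F) << 6)     # low 6 bits of the first continuation byte
         |  ($b3 & 0x3F);          # low 6 bits of the second continuation byte
}

# 226 = 0xE2, 128 = 0x80, 147 = 0x93
printf "U+%04X\n", three_byte_code_point(226, 128, 147);   # prints U+2013
```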

snip
> Uh -- I thought perl treated unicode characters as regular characters (as
> opposed to bytes)? Why does the n-dash come up as three separate
> characters? How do I change that n-dash into a hyphen?
snip

Perl 5 does deal with characters (rather than bytes) by default, but
the encoding of a string depends on where you read it from. For
instance, what if I had a binary data file with the numbers 226, 128,
and 147 in it? Should Perl 5 convert those into one character? What
if I am using a different encoding than UTF-8? Perl 5 cannot guess, so
data read from outside the program arrives as undecoded bytes. The
solution is to know what encoding a source provides and to use the
[Encode][0] module to tell Perl 5 what that encoding is:

#!/usr/bin/perl

use strict;
use warnings;

use Encode;

binmode STDOUT, ":utf8";

my $bin = "\x{E2}\x{80}\x{93}";

print "raw: [$bin]\nUTF-8: [", decode("utf8", $bin), "]\n";


You can skip using the Encode module directly if you are reading the
data in through a filehandle. In that case you can just set the
[filehandle's encoding][1]:

#!/usr/bin/perl

use strict;
use warnings;

# using the OS X clipboard; replace with xclip or the like for other OSes
# (pbcopy writes to the clipboard, pbpaste reads from it)
my $copy  = "/usr/bin/pbcopy";
my $paste = "/usr/bin/pbpaste";

binmode STDOUT, ":utf8";

# put an en dash on the clipboard
system "/bin/echo -n \x{2013} | $copy";

# read the clipboard as raw bytes
my $raw = do {
    local $/;
    open my $fh, "-|", $paste
        or die "couldn't open clipboard: $!\n";
    <$fh>;
};

# read the clipboard again, decoding the bytes as UTF-8
my $utf8 = do {
    local $/;
    open my $fh, "-|:encoding(utf8)", $paste
        or die "couldn't open clipboard: $!\n";
    <$fh>;
};

print "raw: [$raw]\nUTF-8: [$utf8]\n";

Note, there are subtle differences between :utf8 and :encoding(utf8)
that I don't fully understand, but :encoding(utf8) is recommended for
reading input. I believe that :encoding(utf8) will error if it is
handed malformed UTF-8 on ASCII platforms or UTF-EBCDIC on EBCDIC
platforms.
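
If rejecting malformed input matters, one option that does not depend on how either I/O layer behaves is to decode by hand and ask Encode to die on bad bytes. This is a minimal sketch, assuming the data is already in a scalar as raw bytes; the helper name `decode_strict` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the bytes
# are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place while it checks.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = decode_strict("\xE2\x80\x93");   # well-formed en dash
my $bad  = decode_strict("\xFF");           # 0xFF never occurs in UTF-8

print "good: ", (defined $good ? "decoded" : "rejected"), "\n";
print "bad:  ", (defined $bad  ? "decoded" : "rejected"), "\n";
```

Wrapping the call in `eval` turns the exception into an undef return, which leaves the caller free to decide whether to skip the line or stop.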


[0]: http://perldoc.perl.org/Encode.html
[1]: http://perldoc.perl.org/PerlIO.html
--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

chas.owens [ Sun, 25 July 2010 15:24 ] [ ID #2045077 ]

Re: unicode weirdness

You could also try playing with the Text::Unidecode module:
http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
Jason Lewis



Jason Lewis [ Sun, 25 July 2010 19:36 ] [ ID #2045078 ]