unicode weirdness
I have code that reads the clipboard, expecting text copied out of Firefox
from mint.com. However, when I copy the lines I end up with a bunch of
Unicode characters mixed in. The n-dash is particularly irritating, and I
want to change it to a regular hyphen.
When I "paste" into a BBEdit UTF8 window, BBEdit says the n-dash is
character x2013, so I thought this code would work:
**************************************
my ($date,$comment,$mcat,$amt) = split /\t/;
my @pd = parseDate($date);
my $ds = dateStamp(@pd);
# make sure $amt looks like a real amount
$amt =~ tr/$,//d;
$amt =~ s/\x{2013}/-/g;
**************************************
... but it does nothing, the substitution doesn't find the n-dash.
So I went in and added this code to test it:
**************************************
print $amt, "\n";
print $_.": ord(".ord($_).") chr(".chr(ord($_)).")\n" for split(//,$amt);
exit;
**************************************
And here's what it prints:
**************************************
–16.58
?: ord(226) chr(?)
?: ord(128) chr(?)
?: ord(147) chr(?)
1: ord(49) chr(1)
6: ord(54) chr(6)
.: ord(46) chr(.)
5: ord(53) chr(5)
8: ord(56) chr(8)
**************************************
Uh -- I thought perl treated unicode characters as regular characters (as
opposed to bytes)? Why does the n-dash come up as three separate
characters? How do I change that n-dash into a hyphen?
TIA.
- Bryan
--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Re: unicode weirdness
On Saturday 24 Jul 2010 22:57:00 Bryan Harris wrote:
> Uh -- I thought perl treated unicode characters as regular characters (as
> opposed to bytes)? Why does the n-dash come up as three separate
> characters? How do I change that n-dash into a hyphen?
>
You probably should use the Encode module and read
http://perldoc.perl.org/perlunitut.html .
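As a minimal sketch of that advice: assuming the clipboard text arrives as UTF-8 encoded bytes (which the ord() values 226, 128, 147 suggest), decoding it with Encode's decode() before the substitution lets \x{2013} match. The variable names and the sample value are illustrative only, not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: $raw_amt holds UTF-8 *bytes*, as read from the clipboard.
my $raw_amt = "\xE2\x80\x9316.58";

# Turn the bytes into a string of characters.
my $amt = decode('UTF-8', $raw_amt);

# The en dash is now a single character, so the substitution matches.
$amt =~ s/\x{2013}/-/g;

binmode STDOUT, ':encoding(UTF-8)';
print "$amt\n";    # prints -16.58
```

Without the decode() step the string still contains three separate bytes, which is why the original substitution found nothing.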
Regards,
Shlomi Fish
--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Optimising Code for Speed - http://shlom.in/optimise
God considered inflicting XSLT as the tenth plague of Egypt, but then
decided against it because he thought it would be too evil.
Please reply to list if it's a mailing list post - http://shlom.in/reply .
Re: unicode weirdness
On Sat, Jul 24, 2010 at 15:57, Bryan Harris <bryanslists [at] gmail.com> wrote:
snip
> ?: ord(226) chr(?)
> ?: ord(128) chr(?)
> ?: ord(147) chr(?)
snip
In UTF-8, the bytes 226, 128, 147 represent U+2013 EN DASH.
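This can be checked directly. The sketch below encodes U+2013 with Encode's encode() and lists the resulting byte values; it is only an illustration and not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+2013 EN DASH.
my $dash = "\x{2013}";

# Encode it to UTF-8 and look at each byte.
my $bytes = encode('UTF-8', $dash);
my @ords  = map { ord } split //, $bytes;

# prints: 1 character(s), 3 byte(s): 226, 128, 147
printf "%d character(s), %d byte(s): %s\n",
    length($dash), length($bytes), join(', ', @ords);
```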
snip
> Uh -- I thought perl treated unicode characters as regular characters (as
> opposed to bytes)? Why does the n-dash come up as three separate
> characters? How do I change that n-dash into a hyphen?
snip
Perl 5 does deal with characters (rather than bytes) inside its strings,
but it cannot know how the data it reads is encoded; that depends on
where you read it from. For instance, what if I had a binary data file
with the bytes 226, 128, and 147 in it? Should Perl 5 convert that into
one character? What if I am using a different encoding than UTF-8? So,
unless told otherwise, Perl 5 hands you the bytes as they are. The
solution is to know what a source provides and to use the [Encode][0]
module to tell Perl 5 what the encoding is:
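#!/usr/bin/perl
use strict;
use warnings;
use Encode;
# encode output as UTF-8 so the decoded character prints cleanly
binmode STDOUT, ":utf8";
# the three UTF-8 bytes of U+2013 EN DASH, still undecoded
my $bin = "\x{E2}\x{80}\x{93}";
print "raw: [$bin]\nUTF-8: [", decode("utf8", $bin), "]\n";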
You can skip using the Encode module directly if you are reading the
data through a filehandle. In that case you can just set the
[filehandle's encoding][1]:
#!/usr/bin/perl
use strict;
use warnings;
# using the OS X clipboard; replace with xclip or the like for other OSes
my $copy  = "/usr/bin/pbcopy";   # reads STDIN, writes to the clipboard
my $paste = "/usr/bin/pbpaste";  # reads the clipboard, writes to STDOUT
binmode STDOUT, ":utf8";
# replace the clipboard contents with an en dash
system "/bin/echo -n \x{2013} | $copy";
# read the clipboard back as raw bytes
my $raw = do {
    local $/;
    open my $fh, "-|", $paste
        or die "couldn't open clipboard: $!\n";
    <$fh>;
};
# read it again, decoding UTF-8 as it comes in
my $utf8 = do {
    local $/;
    open my $fh, "-|:encoding(utf8)", $paste
        or die "couldn't open clipboard: $!\n";
    <$fh>;
};
print "raw: [$raw]\nUTF-8: [$utf8]\n";
Note, there are differences between :utf8 and :encoding(utf8) that I
don't fully understand, but :encoding(utf8) is recommended for reading
input. As far as I know, :utf8 only marks the incoming data as UTF-8
without checking it, while :encoding(utf8) passes it through Encode and
complains if it is handed malformed UTF-8 (on ASCII platforms) or
malformed UTF-EBCDIC (on EBCDIC platforms).
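To see strict checking in isolation, here is a small sketch that uses Encode's decode() with the FB_CROAK check flag instead of a PerlIO layer, so malformed input raises an exception that can be caught. The decode_strict helper is a made-up name for this illustration, and this shows decode()'s behavior, which is not necessarily what the :encoding layer does by default.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode strictly, returning undef on malformed input.
sub decode_strict {
    my ($bytes) = @_;    # work on a copy; decode() may modify its argument
    # FB_CROAK makes decode() die on malformed UTF-8; eval catches that.
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $chars;
}

my $good = decode_strict("\xE2\x80\x93");    # well-formed en dash
my $bad  = decode_strict("\xFF16.58");       # 0xFF never occurs in valid UTF-8

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Returning undef keeps the caller in control of what to do with bad input, for example skipping the line or reporting it.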
[0]: http://perldoc.perl.org/Encode.html
[1]: http://perldoc.perl.org/PerlIO.html
--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.
Re: unicode weirdness
You could also try playing with the Text::Unidecode module:
http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
Jason Lewis
Email jasonlewis.x [at] gmail.com
Mobile 410.428.0253
AIM canweriotnow
Facebook http://www.facebook.com/canweriotnow