utf-8

I run Perl v5.8.7 and my regular expression is ($txt =~ m/(\w+|é\w+)/g),
which does not match every UTF-8 word. How can I make this regular
expression match every UTF-8 word?
julia_2683 [ Mon, 31 December 2007 20:33 ] [ ID #1896844 ]

Re: utf-8

julia_2683 [at] hotmail.com writes:

> I run Perl v5.8.7 and my regular expression is ($txt =~ m/(\w+|é\w+)/g),
> which does not match every UTF-8 word. How can I make this regular
> expression match every UTF-8 word?

Just \w should work, provided you're handling your encodings correctly *and*
your $txt is actually utf-8 encoded internally. That \w depends on the
internal encoding at all is IMO a bug.
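
To illustrate "handling your encodings correctly": a minimal sketch, assuming
$txt arrives as raw UTF-8 bytes (the byte string below is a made-up sample,
not the poster's data). Decoding with the core Encode module turns the bytes
into characters, after which plain \w+ should be enough:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for three French words, as they
# would arrive from a file or socket opened without an encoding layer.
my $bytes = "\xc3\xa9t\xc3\xa9 tr\xc3\xa8s \xc3\xa9l\xc3\xa9gant";

# Turn the bytes into a character string. After this, \w sees each
# accented letter as one word character instead of two unrelated bytes.
my $txt = decode('UTF-8', $bytes);

# Collect every run of word characters; no special case for "é" needed.
my @words = $txt =~ m/(\w+)/g;

# Encode the characters back to UTF-8 on output.
binmode STDOUT, ':utf8';
print scalar(@words), " words: @words\n";
```

If the text comes from a file, opening it with an ':encoding(UTF-8)' layer
should have the same effect as the explicit decode() call.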

Note that if your script itself is utf8 encoded you need to "use utf8"
somewhere at the top of your script.

For instance:

#!/usr/bin/perl -w
use strict;

# set output stream as utf-8 encoded (I have a utf-8 enabled terminal)
binmode STDOUT,":utf8";

my $str="\x{e9}"; # "é", not necessarily as utf-8 - very likely latin-1
utf8::upgrade($str); # force utf-8 encoding

print "$str was ",($str =~ /\w+/ ? "" : "not "),"matched\n";
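
The "use utf8" note can be sketched the same way. This assumes the script
file itself is saved as UTF-8; the sample text is invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # tells perl that this source file is UTF-8 encoded

binmode STDOUT, ':utf8';

# With "use utf8" the literal below is read as characters, not bytes.
my $txt = "café crème brûlée";

# Plain \w+ picks up the accented letters as part of each word.
my @words = $txt =~ m/(\w+)/g;

print join('|', @words), "\n";
```

Without the "use utf8" line, the same literal would be read as a string of
bytes, and \w+ would likely split the words at the accented letters.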

Joost.
Joost Diepenmaat [ Mon, 31 December 2007 20:45 ] [ ID #1896845 ]