Help with Regular Expression

I am running Active Perl 5.8.8.

I am converting a large enterprise database into a new system and have
run across a free-form text field in which users have entered all manner
of garbage.

One scenario is where two sentences have been run together with no
ending '.' or space. Here are some examples:

madeStyle
facilitatedOne
Anti-magneticQuality

As you can see, the new sentence begins with an upper-case letter, so if
I can just break apart the construct like this I'll be OK: "madeStyle"
should become "made. Style".

Difficulty: the fields contain hundreds of words both preceding and
following the "bad" words, so I have to be able to pick out the
lower-case words that contain one embedded upper-case character.

Any ideas?

Barry Brevik
_______________________________________________
ActivePerl mailing list
ActivePerl [at] listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Barry Brevik [ Sat, 16 May 2009 00:18 ] [ ID #2001547 ]

Re: Help with Regular Expression

On Fri, May 15, 2009 at 11:18 PM, Barry Brevik <BBrevik [at] stellarmicro.com> wrote:
> I am running Active Perl 5.8.8.
>
> I am converting a large enterprise database into a new system and have
> run across a free-form text field in which users have entered all manner
> of garbage.
>
> One scenario is where two sentences have been run together with no
> ending '.' or space. Here are some examples:
>
>    madeStyle
>    facilitatedOne
>    Anti-magneticQuality
>
> As you can see, the new sentence begins with an upper-case letter, so if
> I can just break apart the construct like this I'll be OK: "madeStyle"
> should become "made. Style".
>
> Difficulty: the fields contain hundreds of words both preceding and
> following the "bad" words, so I have to be able to pick out the
> lower-case words that contain one embedded upper-case character.
>
> Any ideas?
>
> Barry Brevik

Hi Barry,

Maybe something like this would help:

$ cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality

$ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
made. Style
facilitated. One
Anti-magnetic. Quality

Regards,
Ari Constancio
Ari Constancio [ Sat, 16 May 2009 01:20 ] [ ID #2001548 ]

Re: Help with Regular Expression

hi ari and barry --

In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time,
ari.constancio [at] gmail.com writes:

> On Fri, May 15, 2009 at 11:18 PM, Barry Brevik <BBrevik [at] stellarmicro.com> wrote:
>
> > I am running Active Perl 5.8.8.
> > ...
> > Difficulty: the fields contain hundreds of words both preceding and
> > following the "bad" words, so I have to be able to pick out the
> > lower-case words that contain one embedded upper-case character.
> > ...
> > Barry Brevik
>
> Hi Barry,
>
> Maybe something like this would help:
>
> $ cat test.txt
> madeStyle
> facilitatedOne
> Anti-magneticQuality
>
> $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
> made. Style
> facilitated. One
> Anti-magnetic. Quality
>
> Regards, Ari Constancio

the replacement string in a s/// should use capture variables rather
than backreferences; perl warns about this if warnings are on (always
a good idea). a '.' (period) character in a replacement string is not
a metacharacter and needs no escape.

also, the regex used, /(\w+)([A-Z])/, will allow any number greater than
zero of upper case letters, digits or underscores to precede the uc letter
that is supposed to be the initial letter of a new sentence: probably not
what is intended.

>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO

>cat test.txt | perl -wMstrict -pe
"s/(\w+)([A-Z])/\1\. \2/g"
\1 better written as $1 at -e line 1.
\2 better written as $2 at -e line 1.
made. Style
facilitated. One
Anti-magnetic. Quality
123FO. O

a better approach might be something like:

>cat test.txt | perl -wMstrict -pe
"s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
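
For a field with many words around the "bad" ones, the same substitution can be wrapped in a small sub and applied to the whole string. This is only a sketch: the sub name split_run_ons and the sample field are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): insert ". " wherever
# a lower-case letter is directly followed by an upper-case letter and
# then another lower-case letter.
sub split_run_ons {
    my ($text) = @_;
    $text =~ s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg;
    return $text;
}

# Made-up sample field with ordinary words before and after the run-on.
my $field = 'The part was madeStyle is anti-magnetic; see 123FOO.';
print split_run_ons($field), "\n";
# prints: The part was made. Style is anti-magnetic; see 123FOO.
```

Only the run-together word changes; the surrounding words and 123FOO are left as they were.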

hth -- bill walters
Williamawalters [ Sat, 16 May 2009 03:55 ] [ ID #2001549 ]

Re: Help with Regular Expression

hi guys --

In a message dated 5/15/2009 8:55:30 PM Eastern Standard Time,
Williamawalters [at] aol.com writes:

> In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time,
> ari.constancio [at] gmail.com writes:
>
> > On Fri, May 15, 2009 at 11:18 PM, Barry Brevik <BBrevik [at] stellarmicro.com>
> > wrote:
> >
> > > I am running Active Perl 5.8.8.
> > > ...
> > > Difficulty: the fields contain hundreds of words both preceding and
> > > following the "bad" words, so I have to be able to pick out the
> > > lower-case words that contain one embedded upper-case character.
> > > ...
> > > Barry Brevik
> >
> > Hi Barry,
> >
> > Maybe something like this would help:
> >
> > $ cat test.txt
> > madeStyle
> > facilitatedOne
> > Anti-magneticQuality
> >
> > $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
> > made. Style
> > facilitated. One
> > Anti-magnetic. Quality
> >
> > Regards, Ari Constancio
>
> ...
>
> a better approach might be something like:
>
> >cat test.txt | perl -wMstrict -pe
> "s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg"
> made. Style
> facilitated. One
> Anti-magnetic. Quality
> 123FOO
>
> hth -- bill walters

well, english is a complicated thing, as, i guess, are all natural
languages.

it occurred to me that the solution i suggested, that a new sentence begins
with a uc letter and at least one lc letter (which was how i interpreted the
original 'lower-case words that contain one embedded upper-case character'
spec), fails for a very common word. the approach below makes separate
regex definitions for end-of-sentence and beginning-of-sentence patterns;
these are more easily adapted as requirements mature.

of course, the new approach fails for BiCapitalized words. sigh.
using separate regex definitions might come into play here: one might,
for instance, define a list of bi-capitalized words that would be used with
a look-around to avoid improper substitutions.

(i cannot think of a case in which a proper sentence ends with
anything other than an lc letter before the period. if there is such,
the separate regex approach could, i think, be easily adapted to handle
it.)

>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO
the endA new
PowerPoint

>cat test.txt | perl -wMstrict -pe
"INIT {
my $sen_end = qr{ [[:lower:]] }xms;
my $new_sen = qr{ [[:upper:]] }xms;
sub S { s{ ($sen_end) ($new_sen) }{$1. $2}xmsg }
}
S;
"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
the end. A new
Power. Point
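
One way the list of bi-capitalized words could be put to use is sketched below. Note that it uses an alternation rather than the look-around mentioned above: a known word is matched whole and put back unchanged, so the sentence-splitting alternative never sees the boundary inside it. The exception list @bicaps and the sub name split_run_ons are assumptions for illustration only.

```perl
use strict;
use warnings;

# Assumed exception list; a real one would have to come from the data.
my @bicaps  = qw(PowerPoint ActiveState);
my $known   = join '|', map { quotemeta } @bicaps;

my $sen_end = qr{ [[:lower:]] }xms;
my $new_sen = qr{ [[:upper:]] }xms;

# Hypothetical helper: split run-together sentences, skipping known words.
sub split_run_ons {
    my ($text) = @_;
    # First alternative: a known bi-capitalized word, returned as-is.
    # Second alternative: lower-case letter followed by upper-case letter.
    $text =~ s{ \b ($known) \b | ($sen_end) ($new_sen) }
              { defined $1 ? $1 : "$2. $3" }xmsge;
    return $text;
}

print split_run_ons($_), "\n" for 'madeStyle', 'the endA new', 'PowerPoint';
# prints:
#   made. Style
#   the end. A new
#   PowerPoint
```

The alternation form avoids needing a variable-length look-behind; whether it is preferable to a look-around depends on how long the exception list gets.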

again, hth -- bill walters
Williamawalters [ Sat, 16 May 2009 12:37 ] [ ID #2001550 ]

RE: Help with Regular Expression

Here's something a bit simpler based on the original example Barry sent.
It looks for a single upper-case letter with a single non-upper-case,
non-white-space character before it. \w doesn't do that. We also don't
need the "+" quantifier, since all we care about is matching a single
character. (Performance should be better when not searching for a
variable-length string.)

perl -we 'my $t="madeStyle\nfacilitatedOne\nAnti-magneticQuality\n123FOO BAR";
$t=~s/([^A-Z\s])([A-Z])/$1. $2/g;
print "----------\n$t\n";'
----------
made. Style
facilitated. One
Anti-magnetic. Quality
123. FOO BAR


Curtis


Curtis Leach [ Sat, 16 May 2009 14:55 ] [ ID #2001551 ]

Re: Help with Regular Expression

hi curtis --

In a message dated 5/16/2009 7:56:26 AM Eastern Standard Time,
cleach [at] harrahs.com writes:

> 123FOO BAR
> ...
> ----------
> ...
> 123. FOO BAR

but i was thinking that 123FOO was *not* something
that would need punctuation: it's probably not the end of
one sentence and the beginning of the next.
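
The difference between the two character classes can be seen side by side. This is just a sketch; the sample string and the variable names are made up for the comparison.

```perl
use strict;
use warnings;

# Made-up input combining a genuine run-on with the 123FOO case.
my $input = "madeStyle 123FOO BAR";

# "Anything that is not A-Z and not white space" before the capital.
(my $any_non_upper = $input) =~ s/([^A-Z\s])([A-Z])/$1. $2/g;

# Lower-case letter only before the capital.
(my $lower_only = $input) =~ s/([[:lower:]])([[:upper:]])/$1. $2/g;

print "$any_non_upper\n";   # made. Style 123. FOO BAR
print "$lower_only\n";      # made. Style 123FOO BAR
```

The negated class treats the digit 3 as a possible sentence end; the lower-case class does not.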

br -- bill walters
Williamawalters [ Sat, 16 May 2009 16:53 ] [ ID #2001552 ]

Re: Help with Regular Expression

$ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
made. Style
facilitated. One
Anti-magnetic. Quality


RE pedanticism: \1 et alia are only supposed to be used on the LHS of the
subst cmd. You'd want:
cat test.txt |perl -pe 's/(\w+)([A-Z])/$1. $2/g'

or no need for cat (ye olde pipeline debate ;-):
perl -pe 's/(\w+)([A-Z])/$1. $2/g' test.txt


You're supposed to use the \1 format to match a current match, like a
duplicated word
$ echo "her here hear hear hop hip hip ho!" | perl -pe
's/(\w+)\s+\1\s+/double "${1}s" /g;'

her here double "hears" hop double "hips" ho!

Might you need to worry about 2 capital letters?
perl -pe 's/([a-z])([A-Z])/$1. $2/g' test.txt

Non-ascii text (ranges like 'a-z' are only true ranges in ascii)? Use
POSIX class shorthand names (if your Perl is new enough):
perl -pe 's/([[:lower:]])([[:upper:]])/$1. $2/g' test.txt
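
A small sketch of why the range and the POSIX class can differ on non-ASCII text. The sample string is an assumption (the Polish word "był" glued to "Style"), written with an escape so the script itself stays plain ASCII.

```perl
use strict;
use warnings;

# U+0142 is a lower-case letter outside ASCII (assumed sample data).
my $text = "by\x{142}Style";

# The a-z range only covers ASCII lower-case letters.
(my $ascii_range = $text) =~ s/([a-z])([A-Z])/$1. $2/g;

# The POSIX class also matches the non-ASCII lower-case letter here.
(my $posix_class = $text) =~ s/([[:lower:]])([[:upper:]])/$1. $2/g;

binmode STDOUT, ':encoding(UTF-8)';
print "$ascii_range\n";   # unchanged: no a-z letter precedes the capital
print "$posix_class\n";   # ". " inserted before "Style"
```

How non-ASCII characters in the 128-255 range are treated depends on how the data was decoded, so real database text would need checking.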

a
----------------------
Andy Bach
Andy_Bach [ Mon, 18 May 2009 19:09 ] [ ID #2001638 ]

RE: Help with Regular Expression

>> You're supposed to use the \1 format to match a current match,
>> like a duplicated word
>> $ echo "her here hear hear hop hip hip ho!" | perl \
>>   -pe 's/(\w+)\s+\1\s+/double "${1}s" /g;'

Barry B wrote:
> I am confused about this. I thought that a back-reference looks like
> "$1", not "\1". Is there a difference?

Yeah, mostly w/ placement. Back refs on the left hand side (LHS) of the
subst:
s/(\w+)\s+\1\s+/

are backslash digit. Backreferences to the captured match on the RHS use
$1 as they do outside the subst command. This got made more concrete
somewhere in early v5, I believe. As noted, warnings will tell you:
\1 better written as $1 at -e line 1

if you had tried:
$ echo "her here hear hear hop hip hip ho!" | perl \
-w -pe 's/(\w+)\s+\1\s+/double "\1s" /g;'


though it still works. But the idea is the \1 version can be used during
the course of the matching phase, but $1 version is used during the
replacement phase. In a sense, the \1 'magic var' is supposed to be
localized to the LHS:
s/ ... /

context, while $1 et alia are actual globals so you can do:
-pe 'if ( s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a $1\n"; }'

and have a value outside the subst command. Trying "\1" in the warn():
warn "found a \1\n";

would get you ... well you get the "001" char ;->

$ echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a $1\n" };'
found a hear
her here double "hears" hop hip hip ho

but (note, I dropped the "/g"):
$ echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a \1\n" };'
found a <unprintable>
her here double "hears" hop hip hip ho
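
The same points as a standalone script, for anyone who wants to try it outside a one-liner. This is only a sketch; the variable names are made up.

```perl
use strict;
use warnings;

my $line = 'her here hear hear hop';
my $found;

# \1 inside the pattern refers back to what group 1 matched;
# $1 holds that text in the replacement and after the substitution.
if ( $line =~ s/(\w+)\s+\1\s+/double "${1}s" / ) {
    $found = $1;
}

print "found a $found\n";   # found a hear
print "$line\n";            # her here double "hears" hop

# In an ordinary double-quoted string, "\1" is an octal escape,
# i.e. the single character with code 1.
printf "%d\n", ord "\1";    # 1
```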

Interesting, in a way, is how with the '/g' you get:
echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" /g ) { warn "found a $1\n" };'
found a h
her here double "hears" hop double "hips" ho

I think what happens here is the capture parens matched the 'h' of the
final 'ho' but there's no match for the \1 part. So no subst is done.
However, $1 keeps the captured value (it did match a \w+ char). Not
exactly what I expected, to be honest - I'd've thought if the LHS RE
failed, $1 wouldn't be 'updated' but would keep the last full match (i.e.
'hip').

Wrong again ...

a
----------------------
Andy Bach
Systems Mangler
Internet: andy_bach [at] wiwb.uscourts.gov
Voice: (608) 261-5738;
Cell: (608) 658-1890

Civilization advances by the number of important operations
which we can perform without thinking about them.
--Alfred North Whitehead
Andy_Bach [ Tue, 19 May 2009 19:17 ] [ ID #2001778 ]

RE: Help with Regular Expression

Sorry, I really don't do the concept justice - a more comprehensive answer
is to say: if you want help w/ REs, get "Mastering Regular Expressions"
by J. Friedl
http://oreilly.com/catalog/9780596528126/

It is a great, great book, up there w/ the Camel and "Perl Best
Practices". It covers more than just Perl REs too. It's worth it even if
you just go to the book store and read the appropriate parts. But buy it
and read it - you'll solve lots of your RE problems.

Wait until Perl6 REs/regexes get here ;->

http://search.cpan.org/~dconway/Perl6-Rules-0.03/Rules.pm
http://www.ibm.com/developerworks/linux/library/l-cpregex.html?ca=dgr-lnxw01Perl6Gram

I know there are better links, but I don't have them at the moment.

a

----------------------
Andy Bach
Systems Mangler
Internet: andy_bach [at] wiwb.uscourts.gov
Voice: (608) 261-5738;
Cell: (608) 658-1890

Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.
-- Jamie Zawinski
Andy_Bach [ Tue, 19 May 2009 19:49 ] [ ID #2001779 ]