Help with regular expressions

Hi List,

I am trying to write a small script to parse bibliographic references like
this:

Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on
reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.

What I want to be able to do eventually is parse each name separately and
associate that with the title. I am not sure how yet, but I haven't even got
there.

Right now I am just trying to see if I can parse the names, so I came up
with this:

foreach (@entries){
    if (/((\w)*, (([A-Z].)*),){1,}/){
        my $name = "$&";
        $name =~ s/\.,/\. /g;
        push @names, $name;
    }
}

It works fine for the first name, but as expected, if @entries contains
several strings with authors' names (I did that by matching the year and
storing $` in @entries) it will match the first author and then go to
the next entry. Is there a way to match the pattern more than once, but
to store each match separately? For example, would I be able to store
Morgan, M.J. as one item in an array and Wilson, C.E. as another one?

As always, any help is much appreciated.

Cheers,

Tiago
--
"Education is not to be used to promote obscurantism." - Theodonius
Dobzhansky.

"Gracias a la vida que me ha dado tanto
Me ha dado el sonido y el abecedario
Con él, las palabras que pienso y declaro
Madre, amigo, hermano
Y luz alumbrando la ruta del alma del que estoy amando

Gracias a la vida que me ha dado tanto
Me ha dado la marcha de mis pies cansados
Con ellos anduve ciudades y charcos
Playas y desiertos, montañas y llanos
Y la casa tuya, tu calle y tu patio"

Violeta Parra - Gracias a la Vida

Tiago S. F. Hori
PhD Candidate - Ocean Science Center-Memorial University of Newfoundland

Tiago Hori [ Mon, 09 May 2011 20:14 ] [ ID #2059294 ]

Re: Help with regular expressions

On Mon, May 9, 2011 at 11:44 PM, Tiago Hori <tiago.hori [at] gmail.com> wrote:
> I am trying to write a small script to parse bibliographic references like
> this:
>
> Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on
> reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.
>
> What I want to be able to do eventually is parse each name separately and
> associate that with the title. I am not sure how yet, but I haven't even got
> there.

I took a stab at this. It might not be perfect or catch all possible
variations. But in any case, unless you have rules for the text in
these entries, it is very difficult to catch them all.

=========================================================
#!/usr/bin/perl
#

use strict;
use warnings;

my $text = <<END;
Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on
reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.
END

my @authors = ();

# Extract authors
# Assuming each author is composed of one or more matches of:
# <SPACE>* WORD, <SPACE>* (ALPHABET PERIOD)+
# Both capture groups are returned for every match, so @matches holds
# pairs: the full "Surname, Initials" text, then the last initial.
if (my @matches = $text =~ m/(\s*\w+,\s*(\w\.)+),/gs) {
    while (@matches) {
        my $match = shift @matches;
        my @comps = map { s/^ +//; s/ +$//; $_ } (split ",", $match);
        push @authors, join " ", @comps[1,0];
        shift @matches;    # discard the inner (\w\.) capture
    }
}

# Extract title
# Everything from the first period followed by a space to the next period.
# Authors should have periods followed by either a letter or a comma
# for this to work
if ($text =~ m/\. (.*?)\./s) {
    my $title = $1;
    $title =~ s/\n/ /g;
    foreach (@authors) {
        print "$title: $_\n";
    }
}
=========================================================

$ ./match_2.pl
The effect of stress on reproduction in Atlantic cod: M.J. Morgan
The effect of stress on reproduction in Atlantic cod: C.E. Wilson
The effect of stress on reproduction in Atlantic cod: L.W. Crim

All, please let me know if there is a way to combine both the regexes.
I had a brain coredump before I gave up.

Thanks,
Sandip

Sandip Bhattacharya [ Mon, 09 May 2011 21:04 ] [ ID #2059295 ]

Re: Help with regular expressions


Hasn't someone already solved this problem? Isn't there a CPAN module to
perform standardized bibliographic reference formatting/parsing? I haven't
looked at CPAN; did either of you? If a CPAN module doesn't exist, one
should!

Ken Wolcott

Kenneth Wolcott [ Mon, 09 May 2011 23:35 ] [ ID #2059296 ]

Re: Help with regular expressions

On Mon, May 9, 2011 at 6:35 PM, Kenneth Wolcott <kennethwolcott [at] gmail.com> wrote:

> Hasn't someone already solved this problem? Isn't there a CPAN module to
> perform standardized bibliographic reference formatting/parsing? I haven't
> looked at CPAN; did either of you? If a CPAN module doesn't exist, one
> should!
>

What standard?

Kalthoff K (2001) Analysis of biological development. McGraw-Hill, NY.


Or


> Manning JT, Barley L, Walton J, Lewis-Jones DI, Trivers RL, Singh D,
> Thornhill R, Rohde P, Bereczkei T, Henzi P, Soler M, Szwed A. (2000) The
> 2nd:4th digit ratio, sexual dimorphism, population differences, and
> reproductive success. evidence for sexually antagonistic genes? Evol Hum
> Behav. 21(3):163-183.


Or


> Berger, M., Lawrence, M., Demichelis, F., Drier, Y., Cibulskis, K.,
> Sivachenko, A., Sboner, A., Esgueva, R., Pflueger, D., Sougnez, C., Onofrio,
> R., Carter, S., Park, K., Habegger, L., Ambrogio, L., Fennell, T., Parkin,
> M., Saksena, G., Voet, D., Ramos, A., Pugh, T., Wilkinson, J., Fisher, S.,
> Winckler, W., Mahan, S., Ardlie, K., Baldwin, J., Simons, J., Kitabayashi,
> N., MacDonald, T., Kantoff, P., Chin, L., Gabriel, S., Gerstein, M., Golub,
> T., Meyerson, M., Tewari, A., Lander, E., Getz, G., Rubin, M., & Garraway,
> L. (2011). The genomic complexity of primary human prostate cancer Nature,
> 470 (7333), 214-220 DOI: 10.1038/nature09744


?

If there's a standard, then sure, someone has probably put that into CPAN.
The problem is that I don't think that there is, though I'd be glad to be
proven wrong.

On Mon, May 9, 2011 at 3:14 PM, Tiago Hori <tiago.hori [at] gmail.com> wrote:

> Hi List,
>
>
Howdy.



> What I want to be able to do eventually is parse each name separately and
> associate that with the title. I am not sure how yet, but I haven't even
> got
> there.
>
>
That can range from pretty simple to fairly complex, depending on how much
you want to squeeze out of that relationship. If you just want to be able to
say "Morgan, M.J wrote an article for X journal, titled Y", then that's just
a hash (of hashes), and you need to look no further than this mail. But if
you also want to say, "Journal X has these authors. One of them is Wilson,
C.E, who co-wrote article Y, where Crim, L.W. was also a collaborator, and
whose primary author is Morgan, M.J.", then hashes will probably not cut it
anymore (a cyclical hash of hashes might do, but that's pretty tough to
handle, and _very_ rough on the eyes). You'll probably want an object model
there, or some database interaction.
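
As a rough illustration of the simple case, here is a minimal sketch of one reference stored in a hash of hashes keyed by title. The variable %articles and the field names (authors, journal, year) are made up for this example; nothing in the thread prescribes this layout.

```perl
use strict;
use warnings;

# Hypothetical layout: title => { authors => [...], journal => ..., year => ... }
my %articles;

my $title = 'The effect of stress on reproduction in Atlantic cod';
$articles{$title} = {
    authors => [ 'Morgan, M.J.', 'Wilson, C.E.', 'Crim, L.W.' ],
    journal => 'J. Fish Biol.',
    year    => 1999,
};

# "Author wrote an article for journal X, titled Y"
for my $t (sort keys %articles) {
    my $rec = $articles{$t};
    for my $author (@{ $rec->{authors} }) {
        print "$author wrote an article for $rec->{journal}, titled $t\n";
    }
}
```

Going from a title to its authors is a single lookup here. Going from an author to all of their articles means scanning every record, which is where a second index, objects, or a database starts to pay off.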

But we are getting ahead of ourselves for now :)


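> foreach (@entries){
> if (/((\w)*, (([A-Z].)*),){1,}/){
>
>
You probably want something like
my @names = /( \w+ , \s* (?: [A-Z] \. )+ ) ,/xg
instead. Under /x, whitespace in the pattern is ignored, so the space after
the comma has to be written as \s*.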


> my $name = "$&";
>
>
Try not to use $& and $` - there's a program-wide speed penalty if you do.
Capturing groups should do the job.
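
A minimal sketch of that advice, assuming a single entry string like the one in the original question: the parentheses put the matched name into $1, so $& is never touched. The variable names are illustrative only. (The speed penalty applies to perls older than 5.20; newer versions no longer have it, but captures are still the clearer choice.)

```perl
use strict;
use warnings;

my $entry = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999.';

# The outer parentheses capture the first "Surname, Initials" into $1;
# (?: ... ) groups the initials without creating a second capture.
my $first_name;
if ($entry =~ /(\w+,\s*(?:[A-Z]\.)+),/) {
    $first_name = $1;
}

print "$first_name\n";    # prints "Morgan, M.J."
```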


> It works fine for the first name, but as expected if @entries contain
> several strings with authors names (I did that by matching the year and
> storing $` in the @entries) it will match the first author and it will go
> to the next $entries. Is there a way to match the pattern more than once, but
> to store each match separately?
>

You are looking for the /g switch. You can look it up in perlretut[0].
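
A short sketch of the two common ways to use /g with m//, on a sample entry assumed for illustration. In list context the match returns every capture at once; in scalar context each evaluation continues from where the previous match ended.

```perl
use strict;
use warnings;

my $entry = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999.';

# List context: m//g returns the captured text of every match.
my @names = $entry =~ /(\w+,\s*(?:[A-Z]\.)+),/g;

# Scalar context: the loop runs once per match, resuming after the last one.
my @surnames;
while ($entry =~ /(\w+),\s*(?:[A-Z]\.)+,/g) {
    push @surnames, $1;
}

print "$_\n" for @names;
print "@surnames\n";
```

The list form is the shortest way to fill an array; the while form is handy when each match needs its own processing.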


> For example, would I be able to store
> Morgan, M.J. as one item in an array and Wilson, C.E. as another one?
>
>
>
Sure. The my @names = ... from above will suffice for that. But chances are
you want more than that. In general, you have two options: either you make
several small regexes to extract the data piece by piece, or you create a
grammar to do the job for you. For the latter, there are two main options: a
(?(DEFINE)) pattern, which is pure Perl and in the language since 5.010, or
you pull in Regexp::Grammars from CPAN. They are pretty similar, but
Regexp::Grammars is much more powerful, letting you access the full parse
tree, so what I'll have to do in two steps in the next snippet, R::G would
do in one.

Here's my stab at it, using (?(DEFINE))[1], named captures[2], Unicode
character properties[3], and a probably unnecessary lookbehind[1] in the
split at the end. I made some arbitrary assumptions about the data, like saying
that a title can't be longer than 52 characters, or can't have a period in
it, or that the journal's name can't have digits in it, which I suppose is a
tad disingenuous, but take it as an example, not a solution :P

use 5.010;

$_ = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. The effect of stress on
reproduction in Atlantic cod. J. Fish Biol. 54, 477-488.';

/
(?<all_names> (?&ALL_NAMES) )
(?<year> (?&YEAR) )\. \s+
(?<title> (?&TITLE) )\. \s+
(?<journal> (?&JOURNAL) )\. \s*
(?<edition> (?&NUM)+ ), \s*
(?<pages> (?&NUM)+-(?&NUM)+ )\.


(?(DEFINE)
(?<ALL_NAMES> ( (?&FULL_NAME), \s+)+ )
(?<FULL_NAME> (?&SURNAME), \s* (?&INITIALS) )
(?<SURNAME> \p{Lu}\p{L}* )
(?<INITIALS> (?:\p{Lu}\.)+ )
(?<YEAR> \p{PosixDigit}{4} )
(?<TITLE> [^.]{1,52} ) #Article title
(?<JOURNAL> \P{PosixDigit}+ ) #Journal name
(?<NUM> \p{PosixDigit} ) #A generic number. Maybe just Digit?
)
/x;
# Assuming the match succeeded, the results are in the %+ hash:
my @names = split /(?<=\.),\s*/, $+{all_names};

# One name per line; a bare "say @names" would run them together.
say for @names;

(The same plus a small aggregation & dumping of the results:
http://ideone.com/Od3L7)
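
For comparison, here is a rough sketch of the other option mentioned above, several small regexes applied one after another. It relies on the same kind of arbitrary assumptions (a four-digit year followed by a period, no period inside the title), and all variable names are invented for the example.

```perl
use strict;
use warnings;

my $ref = 'Morgan, M.J., Wilson, C.E., Crim, L.W., 1999. '
        . 'The effect of stress on reproduction in Atlantic cod. '
        . 'J. Fish Biol. 54, 477-488.';

# Step 1: split the reference at the four-digit year followed by a period.
my ($author_part, $year, $rest) = $ref =~ /^(.*?)\b(\d{4})\.\s+(.*)$/s
    or die "no year found";

# Step 2: pull each "Surname, Initials" out of the author part.
my @authors = $author_part =~ /(\w+,\s*(?:[A-Z]\.)+),/g;

# Step 3: the title runs up to the first period of the remainder.
my ($title) = $rest =~ /^([^.]+)\./;

print "$year: $title\n";
print "  $_\n" for @authors;
```

Each step is easy to test and adjust on its own, at the cost of repeating some of the pattern knowledge that a grammar would keep in one place.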

Brian.

[0] http://perldoc.perl.org/perlretut.html
[1] http://perldoc.perl.org/perlre.html#Extended-Patterns
[2] http://perldoc.perl.org/perlretut.html#Named-backreferences and
http://perldoc.perl.org/perlvar.html#%25%2b
[3] http://perldoc.perl.org/perluniprops.html

Brian Fraser [ Tue, 10 May 2011 04:03 ] [ ID #2059308 ]

Re: Help with regular expressions

> > What I want to be able to do eventually is parse each name separately and
> > associate that with the title. I am not sure how yet, but I haven't even
> > got there.
> >
> That can range from pretty simple to fairly complex, depending on how much
> you want to squeeze out of that relationship. If you just want to be able to
> say "Morgan, M.J wrote an article for X journal, titled Y", then that's just
> a hash (of hashes), and you need to look no further than this mail. But if
> you also want to say, "Journal X has these authors. One of them is Wilson,
> C.E, who co-wrote article Y, where Crim, L.W. was also a collaborator, and
> whose primary author is Morgan, M.J.", then hashes will probably not cut it
> anymore (a cyclical hash of hashes might do, but that's pretty tough to
> handle, and _very_ rough on the eyes). You'll probably want an object model
> there, or some database interaction.
>
> But we are getting ahead of ourselves for now :)

I figured that eventually it would be easier to somehow pass the results
into MySQL tables, but I left that bridge to be crossed once I get there.


> > It works fine for the first name, but as expected if @entries contain
> > several strings with authors names (I did that by matching the year and
> > storing $` in the @entries) it will match the first author and it will go
> > to the next $entries. Is there a way to match the pattern more than once,
> > but to store each match separately?
>
> You are looking for the /g switch. You can look it up in perlretut[0].

I actually remember reading in the Llama book that the /g modifier could be
used with m// too, not only with s///, and thinking: but when would you
need it with m//? :)


> > For example, would I be able to store
> > Morgan, M.J. as one item in an array and Wilson, C.E. as another one?
>
> Sure. the my @names = ... from above will suffice for that. But chances are
> you want more than that - In general, you have two options. Either you make
> several small regexes to extract the data piece by piece, or you create a
> grammar to do the job for you. For the latter, there's two main options: a
> (?(DEFINE)) pattern, which is Pure Perl and in the language since 5.010, or
> you pull out Regexp::Grammars from CPAN. They are pretty similar, but
> Regexp::Grammars is much more powerful, letting you access the full parse
> tree - so what I'll have to do in two steps in the next snippet, R::G would
> do in one.
>
> Here's my stab at it, using (?(DEFINE))[1], named captures[2], Unicode
> character properties[3], and a probably unnecessary lookbehind[1] in the
> split by the end. I made some arbitrary assumptions on the data, like saying
> that a title can't be longer than 52 characters, or can't have a period in
> it, or that the journal's name can't have digits in it, which I suppose is a
> tad disingenuous, but take it as an example, not a solution : P

Thanks! This gives me a lot to read up on.

Cheers,

T.



Tiago Hori [ Tue, 10 May 2011 14:07 ] [ ID #2059311 ]