Parse transcripts on speaker's name and grab subsequent paragraphs

Here's the sort of text I'm looking at that's driving me nuts.

####

JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.

####

I'd like to parse the transcripts into an ordered hash that would have

[speaker => name,
statement => concatenation of multiple lines of text spoken by that
person
order => For instance, Joe's first statement is 1, Jane's 2, et
cetera.
]

I've tried stepping through the text file with a foreach $line, or as
a total string, with split()'s and regexes built around /[A-Z]+:/ but
I can't get it to line up. I fear the regex is beyond me. Can anyone
help?

Thanks.
Perchance [ Sat, 26 January 2008 23:26 ] [ ID #1917010 ]

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

perchance <totalbadfaith [at] gmail.com> wrote:


> I'd like to parse the transcripts into an ordered hash that would have


There is no such thing as an "ordered hash"...


> [speaker => name,
> statement => concatenation of multiple lines of text spoken by that
> person
> order => For instance, Joe's first statement is 1, Jane's 2, et
> cetera.
> ]
>
> I've tried stepping through the text file with a foreach $line, or as
> a total string, with split()'s and regexes built around /[A-Z]+:/ but


BILLY BOB: But what about matching my name Perchance?


> I can't get it to line up. I fear the regex is beyond me.


The regex is of "Hello World" complexity, it must be something
else that is beyond you.

:-)


> Can anyone
> help?


You simply need a better data structure.

If you want ordering, then you want an array.

If you want to save several attributes in each array element,
then you want a hash.

If you want ordering and named attributes, you want a LoH.

(List of Hashes, really an array containing hash references.)

See:
perldoc perlreftut
etc...
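
To make the LoH idea concrete, here is a minimal sketch of the data structure on its own. The sample values come from the OP's transcript; the 1-based numbering is only an assumption about what "order" is supposed to mean.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An array of hash references: the array keeps the order,
# each hash holds the named attributes of one statement.
my @stmts = (
    { speaker => 'JOE',  stmt => 'Hello, Jane.' },
    { speaker => 'JANE', stmt => 'Hey, Joe.'    },
);

# Append another record; its position in the array is its order.
push @stmts, { speaker => 'JOE', stmt => 'Great.' };

for my $i ( 0 .. $#stmts ) {
    my $order = $i + 1;    # assumed: the OP wants 1-based numbering
    print "$order: $stmts[$i]{speaker}: $stmts[$i]{stmt}\n";
}
```

No separate "order" key is needed, since the array index already carries it.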

--------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($speaker, $stmt);
my @stmts;
while ( <DATA> ) {
    next if /^\s+$/;

    if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
        push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
        $speaker = $1;
        $stmt = $2;
    }
    else { # more dialog
        chomp;
        $stmt .= " $_";
    }
}
push @stmts, { speaker => $speaker, stmt => $stmt};

foreach ( 0 .. $#stmts ) { # Hash Slice to get attributes out
    my($speaker, $stmt) = @{ $stmts[$_] }{ qw/ speaker stmt / };
    print "$_: $speaker\n $stmt\n\n";
}

__DATA__
JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.
--------------------------------



--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
Tad J McClellan [ Sun, 27 January 2008 01:28 ] [ ID #1917642 ]

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

[A complimentary Cc of this posting was sent to
Tad J McClellan
<tadmc [at] seesig.invalid>], who wrote in article <slrnfpnk1f.pvm.tadmc [at] tadmc30.sbcglobal.net>:
> my($speaker, $stmt);
> my @stmts;
> while ( <DATA> ) {
> next if /^\s+$/;

I do not see a switch to paragraph mode.

>
> if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
> push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
> $speaker = $1;
> $stmt = $2;
> }
> else { # more dialog
> chomp;
> $stmt .= " $_";

Chomp()ing looks suspicious... I would remove NL from each paragraph,
and would separate same-speaker paragraphs by a double-NL (if this is
what the OP wanted).
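
A rough sketch of what that could look like, assuming the transcript is separated by blank lines as in the OP's sample. The in-memory filehandle and the variable names are only for illustration, and paragraphs that appear before any speaker label are silently dropped here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text from the original post, read via an in-memory filehandle.
my $text = <<'END';
JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.
END

my @stmts;
{
    local $/ = '';    # paragraph mode: one record per blank-line-separated block
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while ( my $para = <$fh> ) {
        $para =~ s/\n+\z//;    # drop the newlines that end the record
        $para =~ tr/\n/ /;     # unwrap line breaks inside the paragraph
        if ( $para =~ /^([A-Z ]+):\s+(.*)/s ) {    # new speaker
            push @stmts, { speaker => $1, stmt => $2 };
        }
        elsif (@stmts) {       # same speaker continues
            $stmts[-1]{stmt} .= "\n\n$para";
        }
    }
    close $fh;
}

for my $i ( 0 .. $#stmts ) {
    printf "%d: %s\n%s\n\n", $i + 1, @{ $stmts[$i] }{qw/ speaker stmt /};
}
```

Setting $/ to the empty string is what switches readline to paragraph mode; localizing it in a block keeps the change from leaking into the rest of the program.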

Hope this helps,
Ilya
Ilya Zakharevich [ Sun, 27 January 2008 03:31 ] [ ID #1917647 ]

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

Tad J McClellan wrote:

> __DATA__
> JOE: Hello, Jane.
>
> How are you?
>
> Has it been a good day?
>
> JANE: Hey, Joe.
>
> It's been good for me.
>
> JOE: Great.

Yesterday I asked

BOB: How are you?

;)
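
A small sketch of the ambiguity, in case it is not obvious. The two paragraphs are hypothetical input, and the pattern is the one from the script quoted above; nothing here fixes the problem, it only shows it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the script quoted above.
my $speaker_re = qr/^([A-Z ]+):\s+(.*)/;

# Hypothetical paragraphs: the second is JOE quoting himself,
# not a new speaker called BOB.
my @paras = (
    'JOE: Yesterday I asked',
    'BOB: How are you?',
);

my @speakers;
for my $para (@paras) {
    push @speakers, $1 if $para =~ $speaker_re;
}

# Both paragraphs match, so a parser built on this pattern sees two
# speakers where a human reader sees one statement by JOE.
print scalar(@speakers), " speaker labels found: @speakers\n";
```

Whether that matters depends on the real transcripts; a list of known speaker names would be one way to narrow the match.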

--
Affijn, Ruud

"Gewoon is een tijger."
rvtol+news [ Sun, 27 January 2008 13:35 ] [ ID #1917651 ]