Shrink large file according to REG_EXP

Hello,
I have a problem to solve, and I need some help, please.
My input is a large text file (up to 5 GB) that I need to filter
according to some regular expressions, writing the filtered
(hopefully smaller) output to another file.
The filtering works row by row: each row is split into pieces according
to some rules, some of the pieces are checked against the regular
expressions, and if a match is found, the whole line is written to
the output.

The problem is that this solution is slow.
Right now I read the whole file line by line and apply the regular
expressions to each line, but it is very slow.
I've noticed that the time to read and write the file without doing
anything else is very small, so I'm losing most of the time in my
regular expressions.

OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. I then separately load files with the rules
(the regular expressions) used to filter them.
Anyway, my idea was to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on one chunk of the
file. Can somebody tell me how to do this? Or is there a better way?

Any help is really appreciated.

Best regards,
Davide
thellper [ Wed, 16 January 2008 18:28 ] [ ID #1910002 ]

Re: Shrink large file according to REG_EXP

thellper <thellper [at] gmail.com> wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according to some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is split according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm losing a lot of time for my
> reg_exps... .

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
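
One way to do that first step is to time each rule separately over a sample of the input. This is only a minimal sketch: the rule names and patterns in %rules and the sample lines are hypothetical stand-ins for the OP's real rule files, and timing every single match adds overhead of its own, so the numbers are only useful for comparing rules against each other.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical rules; the real ones come from the OP's rule files.
my %rules = (
    error_code => qr/\bERR\d{3}\b/,
    ip_address => qr/\b\d{1,3}(?:\.\d{1,3}){3}\b/,
);

# Accumulate wall-clock time and match counts for each rule.
sub profile_rules {
    my ($rules, $lines) = @_;
    my %stats = map { $_ => { seconds => 0, matches => 0 } } keys %$rules;
    for my $line (@$lines) {
        for my $name (keys %$rules) {
            my $start = Time::HiRes::time();
            my $hit   = ($line =~ $rules->{$name}) ? 1 : 0;
            $stats{$name}{seconds} += Time::HiRes::time() - $start;
            $stats{$name}{matches} += $hit;
        }
    }
    return \%stats;
}

# Hypothetical sample input.
my @sample = (
    "ok 10.0.0.1 ERR404\n",
    "nothing here\n",
    "host 192.168.1.5 up\n",
);

my $stats = profile_rules(\%rules, \@sample);

# Slowest rule first.
for my $name (sort { $stats->{$b}{seconds} <=> $stats->{$a}{seconds} } keys %$stats) {
    printf "%-12s %.6fs %d matches\n",
        $name, $stats->{$name}{seconds}, $stats->{$name}{matches};
}
```

The rule at the top of the printed list, together with a few of the lines it was run against, is the thing worth posting.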

> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (see CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

I'd try to make the single-threaded version faster first, and turn to
parallelization only as a last resort. Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it. Once started, your
threads (or processes) really don't need to communicate with each other
(as long as you write independent output files to be combined later),
so a simpler solution would do, like Parallel::ForkManager or just
calling fork yourself. Or just start the jobs as separate processes in
the first place.

If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to $num_job - 1) and then have
each job process only those lines where
$token == $. % $num_job
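
The token idea can be sketched with plain fork as below. This is an illustration under assumptions, not a tested production script: the sub names run_job and filter_in_parallel, the output file naming, and the single $regex argument are all made up for the example. Note that every job still reads the whole input; only the matching work is divided.

```perl
use strict;
use warnings;

# Each job keeps only the lines whose line number ($.) maps to its token.
sub run_job {
    my ($token, $num_job, $in_path, $out_path, $regex) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        next unless $token == $. % $num_job;    # another job's line
        print {$out} $line if $line =~ $regex;
    }
    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Fork one child per token, wait for all of them, and return the
# names of the per-job output files (to be combined afterwards).
sub filter_in_parallel {
    my ($num_job, $in_path, $out_prefix, $regex) = @_;
    my @pids;
    for my $token (0 .. $num_job - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child process: handle one share of the lines, then exit.
            run_job($token, $num_job, $in_path, "$out_prefix.$token", $regex);
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;
    return map { "$out_prefix.$_" } 0 .. $num_job - 1;
}
```

A call such as filter_in_parallel(4, 'big.log', 'filtered', qr/ERROR/) would leave filtered.0 to filtered.3 on disk; the file names here are placeholders.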

Xho

xhoster [ Wed, 16 January 2008 18:54 ] [ ID #1910003 ]

Re: Shrink large file according to REG_EXP

On Wed, 16 Jan 2008 09:28:26 -0800 (PST) thellper <thellper [at] gmail.com> wrote:

t> The problem is that this solution is slow. I'm now reading line by
t> line the whole file, and then I'm applying the reg_exp... but it is
t> very slow. I've noticed that the time to read and write the file
t> without doing anything is very small, so I'm losing a lot of time
t> for my reg_exps... .

t> Ok, the whole program is more complicated: the files may have
t> different syntax, and I have syntax files which tell me how to split
t> each line in its fields. Then I load separately files with the rules
t> (the reg_exps) used to filter them.... . Anyway, my idea was to try
t> to use the FORKS.pm module (see CPAN) to split the file in chunks and
t> let each thread work on a chunk of the file: can somebody tell me how
t> to do this ? Or a better way?

Please post a practical example of what's slow (with sample input) so we
can see, comment on, and test it. There's a Benchmark module that will
measure the performance of a function well.
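
As a rough illustration of the Benchmark module, the sketch below compares two ways of testing the same line. The sample line and the two candidate subs are hypothetical; the point is only the shape of a cmpthese call, not a claim about which candidate wins on real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; replace it with a real row from the input file.
my $line = "host01\t2008-01-16\tERROR disk full\n";

# Candidate 1: a plain regex match.
sub regex_match { return $line =~ /ERROR/ ? 1 : 0 }

# Candidate 2: a fixed-string search with index().
sub index_match { return index($line, 'ERROR') >= 0 ? 1 : 0 }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => \&regex_match,
    index => \&index_match,
});
```

A negative count tells Benchmark to run each sub for at least that many CPU seconds instead of a fixed number of iterations.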

Ted
Ted Zlatanov [ Wed, 16 January 2008 19:00 ] [ ID #1910005 ]

Re: Shrink large file according to REG_EXP

In article
<ab9782ce-07b5-4841-84e2-88cff0dee2b5 [at] v67g2000hse.googlegroups.com>,
thellper <thellper [at] gmail.com> wrote:

> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according to some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is split according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm losing a lot of time for my
> reg_exps... .
>
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (see CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

If your program is I/O bound, then it might be faster to work on
different parts simultaneously. However, you are going to suffer some
head thrashing as your multiple processes attempt to read different
parts of the same file at the same time.

If your program is cpu bound, then splitting up the work won't help
unless you are using a multi-processor system.

If, as you say, reading the file without doing any processing is quick
enough, then it is the processing of the data that is the bottleneck.
You should concentrate on improving that part of your program. People
here can help, if you post short examples of what you are trying to do.
Show us some of your regexes, at least, and samples of these "syntax
files".

--
Jim Gibson

Jim Gibson [ Wed, 16 January 2008 19:02 ] [ ID #1910007 ]

Re: Shrink large file according to REG_EXP

On Jan 16, 12:28 pm, thellper <thell... [at] gmail.com> wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according to some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is split according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm losing a lot of time for my
> reg_exps... .
>
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (see CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?
>

check out /REGEX/o

and qr/REGEX/

...also, if you keep a history of which filters get used the most,
stick those at the top. this will speed up the file processing if the
trend does not change. may want to do this periodically in case it
does change.
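
A minimal sketch of both suggestions, assuming the rules are plain pattern strings and that a line is kept as soon as any one rule matches (so the order of the rules affects speed, not the result). The patterns, the sample lines, and the reorder interval are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical filter rules, each compiled once with qr// before the loop.
my @filters = map { +{ name => $_, regex => qr/$_/, hits => 0 } }
    ('^DEBUG', 'timeout', 'ERR\d+');

# Return true if any filter matches, and count which filter fired.
sub line_matches {
    my ($line) = @_;
    for my $filter (@filters) {
        if ($line =~ $filter->{regex}) {
            $filter->{hits}++;
            return 1;
        }
    }
    return 0;
}

# Move the most frequently matching filters to the front.
sub reorder_filters {
    @filters = sort { $b->{hits} <=> $a->{hits} } @filters;
    return;
}

# Hypothetical sample input.
my @lines = (
    "ERR42 disk\n",
    "all fine\n",
    "ERR7 net\n",
    "request timeout\n",
    "ERR9 cpu\n",
);

my @kept;
my $seen = 0;
for my $line (@lines) {
    push @kept, $line if line_matches($line);
    reorder_filters() if ++$seen % 3 == 0;    # tiny interval, for the demo only
}
print @kept;
```

On a real file the reorder interval would be much larger, since sorting the rule list after every few lines would cost more than it saves.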
it_says_BALLS_on_your [ Wed, 16 January 2008 19:13 ] [ ID #1910008 ]

Re: Shrink large file according to REG_EXP

>>>>> "nc" == nolo contendere <simon.chao [at] fmr.com> writes:

nc> check out /REGEX/o

obsolete and probably useless.

nc> and qr/REGEX/

we still haven't seen his code so that is not a solution. more likely
his loops are clunky and slow and his regexes are worse.

nc> ...also, if you keep a history of which filters get used the most,
nc> stick those at the top. this will speed up the file processing if the
nc> trend does not change. may want to do this periodically in case it
nc> does change.

or which are the slowest regexes and speed those up. there are too many
ways to optimize unknown code. let's see if the OP will actually post
some data and code.

uri

--
Uri Guttman ------ uri [at] stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
Uri Guttman [ Wed, 16 January 2008 20:17 ] [ ID #1910013 ]

Re: Shrink large file according to REG_EXP

On Jan 16, 2:17 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
>
>   nc> check out /REGEX/o
>
> obsolete and probably useless.
>

really? is this since 5.10?

>   nc> and qr/REGEX/
>
> we still haven't seen his code so that is not a solution. more likely
> his loops are clunky and slow and his regexes are worse.
>
>   nc> ...also, if you keep a history of which filters get used the most,
>   nc> stick those at the top. this will speed up the file processing if the
>   nc> trend does not change. may want to do this periodically in case it
>   nc> does change.
>
> or which are the slowest regexes and speed those up. there are too many
> ways to optimize unknown code. let's see if the OP will actually post
> some data and code.
>

yeah, Xho already suggested the speed-up-the-slowest-regex solution,
so I was going for something different. you're right though, code +
data would help enormously.
it_says_BALLS_on_your [ Wed, 16 January 2008 20:22 ] [ ID #1910014 ]

Re: Shrink large file according to REG_EXP

>>>>> "nc" == nolo contendere <simon.chao [at] fmr.com> writes:

nc> On Jan 16, 2:17 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
>> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
>>
>>   nc> check out /REGEX/o
>>
>> obsolete and probably useless.
>>

nc> really? is this since 5.10?

since at least when qr// came in. also dynamic regexes (those with
interpolation) are not recompiled unless some variable in them
changes. this is what /o was all about in the early days of perl5. so
its purpose of not recompiling has been moot for eons. and qr//
makes it even more useless.

uri

Uri Guttman [ Wed, 16 January 2008 20:54 ] [ ID #1910018 ]

Re: Shrink large file according to REG_EXP

On Jan 16, 2:54 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
>
>   nc> On Jan 16, 2:17 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
>   >> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
>   >>
>   >>   nc> check out /REGEX/o
>   >>
>   >> obsolete and probably useless.
>   >>
>
>   nc> really? is this since 5.10?
>
> since at least when qr// came in. also dynamic regexes (those with
> interpolation) are not recompiled unless some variable in them
> changes. this is what /o was all about in the early days of perl5. so
> its purpose of not recompiling has been moot for eons. and qr//
> makes it even more useless.

Ok, thanks for the info.
it_says_BALLS_on_your [ Wed, 16 January 2008 21:00 ] [ ID #1910019 ]

Re: Shrink large file according to REG_EXP

On Wed, 16 Jan 2008 10:02:20 -0800,
Jim Gibson <jimsgibson [at] gmail.com> wrote:
> In article
><ab9782ce-07b5-4841-84e2-88cff0dee2b5 [at] v67g2000hse.googlegroups.com>,
> thellper <thellper [at] gmail.com> wrote:
>

>> The problem is that this solution is slow.
>> I'm now reading line by line the whole file, and then I'm applying the
>> reg_exp... but it is very slow.
>> I've noticed that the time to read and write the file without doing
>> anything is very small, so I'm losing a lot of time for my
>> reg_exps... .

> If your program is I/O bound, then it might be faster to work on
> different parts simultaneously.

If the process is I/O bound, then it's unlikely that it'll speed up if
you work on multiple parts simultaneously, unless you can guarantee that
those multiple parts are going to come from different parts of your I/O
subsystem, i.e. ones that don't compete with each other for resources.
Given that it's one single file as input, it's very unlikely that you'll
be able to pick your parts to work on in such a way that you avoid I/O
contention.

You might see some improvement if you're lucky, but you could also see a
marked decrease in total I/O speed, if you're unlucky.

Splitting a process into multiple worker processes is generally only
better if each worker process can then utilise a piece of hardware that
wasn't used before, like another I/O system, or another CPU.

> However, you are going to suffer some
> head thrashing as your multiple processes attempt to read different
> parts of the same file at the same time.

Indeed, at least, if your file is on a single disk. If it's on a RAID
system, the O/S might be able to avoid contention on disks. Or not. For
linear access patterns you generally do get some improvement.

> If your program is cpu bound, then splitting up the work won't help
> unless you are using a multi-processor system.

Indeed.

But CPU bound processes can benefit from algorithm improvements, or even
small tweaks to code if that code is in a place that gets executed a
lot.

Profiling would be able to identify that.

> If, as you say, reading the file without doing any processing is quick
> enough, then it is the processing of the data that is the bottleneck.

Agree :) It also is really the only bit which is likely to be
Perl-specific. All the previous is not.

Martien
--
Martien Verbruggen | The Second Law of Thermodenial: In any closed
                   | mind the quantity of ignorance remains
                   | constant or increases.
Martien Verbruggen [ Wed, 16 January 2008 21:43 ] [ ID #1910022 ]

Re: Shrink large file according to REG_EXP

Uri Guttman wrote:
>>>>>> "nc" == nolo contendere <simon.chao [at] fmr.com> writes:
>
> nc> On Jan 16, 2:17 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
> >> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
> >>
> >> nc> check out /REGEX/o
> >>
> >> obsolete and probably useless.
> >>
>
> nc> really? is this since 5.10?
>
> since at least when qr// came in. also dynamic regexes (those with
> interpolation) are not recompiled unless some variable in them
> changes. this is what /o was all about in the early days of perl5. so
> its purpose of not recompiling has been moot for eons. and qr//
> makes it even more useless.

I won't ask you lots of questions - but do you have a link
to this info that I can read - it's of (substantial) interest
to me.

BugBear
bugbear [ Thu, 17 January 2008 10:17 ] [ ID #1910933 ]

Re: Shrink large file according to REG_EXP

>>>>> "b" == bugbear <bugbear [at] trim_papermule.co.uk_trim> writes:

b> Uri Guttman wrote:
>>>>>>> "nc" == nolo contendere <simon.chao [at] fmr.com> writes:
nc> On Jan 16, 2:17 pm, Uri Guttman <u... [at] stemsystems.com> wrote:
>> >> >>>>> "nc" == nolo contendere <simon.c... [at] fmr.com> writes:
>> >> >> nc> check out /REGEX/o
>> >> >> obsolete and probably useless.
>> >> nc> really? is this since 5.10?
>> since at least when qr// came in. also dynamic regexes (those with
>> interpolation) are not recompiled unless some variable in them
>> changes. this is what /o was all about in the early days of perl5. so
>> its purpose of not recompiling has been moot for eons. and qr//
>> makes it even more useless.

b> I won't ask you lots of questions - but do you have a link
b> to this info that I can read - it's of (substantial) interest
b> to me.

this should be in perlop under the regexp quote-like operators, but it
doesn't mention that /o is useless now. the faq covers it. and 5.6 is
pretty old, so /o has been useless for years.


perlfaq6: What is /o really for? (code snipped)

The /o option for regular expressions (documented in perlop and
perlreref) tells Perl to compile the regular expression only once. This
is only useful when the pattern contains a variable. Perls 5.6 and later
handle this automatically if the pattern does not change.

Since the match operator m//, the substitution operator s///, and the
regular expression quoting operator qr// are double-quotish constructs,
you can interpolate variables into the pattern. See the answer to "How
can I quote a variable to use in a regex?" for more details.

Versions of Perl prior to 5.6 would recompile the regular expression for
each iteration, even if $pattern had not changed. The /o would prevent
this by telling Perl to compile the pattern the first time, then reuse
that for subsequent iterations:

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the /o
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if the
variable changes (thus, only using its initial value), you still need
the /o.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The use re 'debug'
pragma (comes with Perl 5.005 and later) shows the details. With Perls
before 5.6, you should see re reporting that it's compiling the regular
expression on each iteration. With Perl 5.6 or later, you should only
see re report that for the first iteration.
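
The FAQ's own code was snipped above; the following is a separate minimal sketch (not the FAQ's example) of the one behaviour of /o that still matters: the pattern is fixed at first use, so later values of the variable are ignored. The sub names are made up for the illustration.

```perl
use strict;
use warnings;

# With /o the pattern is compiled on first use and later changes are ignored.
sub match_once {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/o ? 1 : 0;
}

# Without /o the current value of $pattern is used on every call.
sub match_current {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

print match_once('foo', 'foo'), "\n";       # 1: compiled with 'foo'
print match_once('bar', 'bar'), "\n";       # 0: still matching against 'foo'
print match_current('foo', 'foo'), "\n";    # 1
print match_current('bar', 'bar'), "\n";    # 1: pattern follows the variable

# A qr// object gives compile-once behaviour without that surprise.
my $compiled = qr/foo/;
print(("foobar" =~ $compiled ? 1 : 0), "\n");    # 1
```

Adding use re 'debug'; at the top of a script like this makes Perl report each regex compilation on STDERR, which is the check the FAQ describes.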

uri

Uri Guttman [ Thu, 17 January 2008 10:38 ] [ ID #1910934 ]

Re: Shrink large file according to REG_EXP

Uri Guttman wrote:
>
> b> I won't ask you lots of questions - but do you have a link
> b> to this info that I can read - it's of (substantial) interest
> b> to me.

(helpful stuff snipped)

Thank you for that - most helpful.

BugBear
bugbear [ Thu, 17 January 2008 12:10 ] [ ID #1910935 ]

Re: Shrink large file according to REG_EXP

On Jan 16, 9:28 am, thellper <thell... [at] gmail.com> wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according to some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is split according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>...

Just a guess, but splitting into pieces and then applying the regex
to each piece may well be a significant slowdown. Have you considered
tweaking the regex to avoid the split and the resulting copies?
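
To make the idea concrete, here is a small sketch under assumed conditions: tab-separated fields and a rule that applies to the third field. The OP's real syntax files are unknown, so the delimiter, the field position, and the ERROR rule are all hypothetical.

```perl
use strict;
use warnings;

# Assumed layout: tab-separated fields, rule applies to the third field.
my $field_rule = qr/^ERROR/;

# Baseline: split the row into a list of copies, then test one piece.
sub keep_with_split {
    my ($line) = @_;
    my @fields = split /\t/, $line;
    return (defined $fields[2] && $fields[2] =~ $field_rule) ? 1 : 0;
}

# Alternative: skip the first two fields inside the pattern itself,
# so no list of field copies is built.
my $line_rule = qr/^(?:[^\t]*\t){2}ERROR/;

sub keep_without_split {
    my ($line) = @_;
    return $line =~ $line_rule ? 1 : 0;
}

# Hypothetical sample rows.
my @rows = (
    "host01\t2008-01-16\tERROR disk full\n",
    "host02\t2008-01-16\tINFO started\n",
    "host03\tERROR in wrong column\tINFO\n",
);

for my $row (@rows) {
    print $row if keep_without_split($row);
}
```

Whether the combined pattern is actually faster depends on the real rules and data, so it is worth measuring both versions before switching.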


--
Charles DeRykus
Charles DeRykus [ Thu, 17 January 2008 21:59 ] [ ID #1910973 ]

Re: Shrink large file according to REG_EXP

[A complimentary Cc of this posting was sent to
Uri Guttman
<uri [at] stemsystems.com>], who wrote in article <x7abn4u70f.fsf [at] mail.sysarch.com>:
> In versions 5.6 and later, Perl won't recompile the regular expression
> if the variable hasn't changed, so you probably don't need the /o
> option. It doesn't hurt, but it doesn't help either.

Yet another case of broken documentation. Still, //o helps (though
nowhere as dramatically as before). It avoids CHECKING that the
pattern did not change.

Hope this helps,
Ilya
Ilya Zakharevich [ Thu, 17 January 2008 23:52 ] [ ID #1910976 ]

Re: Shrink large file according to REG_EXP

Ilya Zakharevich <nospam-abuse [at] ilyaz.org> wrote:

> Yet another case of broken documentation.

Important question: how can this be fixed?

Preferably both:

- the documentation itself,
- and a way to make the fixing process easier (wiki?)

--
John

http://johnbokma.com/mexit/
John Bokma [ Fri, 18 January 2008 02:36 ] [ ID #1911638 ]