Recursively filter an array

Hi,

I'm very much a beginner.

Could anyone point me in the right direction on how to accomplish the
following, please?

I have a fairly long log file call it file A, it has around 20,000
lines of three element space separated variables.

File A looks like:-

55223 jimmy smith
55224 davy crocket
55227 walter mitty
63256 mickey mouse
...

I also have a .txt file with around 600 of the numbers found in file
A. Cal that File B.

File B looks like:-

63256
55223
...

I need to (quickly as possible) remove all the lines from File A,
whose numbers can be found in File B.

Now, I have this working just fine in Windows Powershell, but it
really slow, as I am foreaching file b into a where-object filter, it
works but takes minutes to run. Based on my limited exposure to perl,
I'm pretty certain that TMTOWTDI !!

Many thanks,
Stuart



I need to remove frok


--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Kryten [ Mi, 01 September 2010 13:55 ] [ ID #2047000 ]

Re: Recursively filter an array

Hi Stuart,

On Wednesday 01 September 2010 14:55:21 Kryten wrote:
> Hi,
>
> I'm very much a beginner.
>
> Could anyone point me in the right direction on how to accomplish the
> following, please?
>
> I have a fairly long log file call it file A, it has around 20,000
> lines of three element space separated variables.
>
> File A looks like:-
>
> 55223 jimmy smith
> 55224 davy crocket
> 55227 walter mitty
> 63256 mickey mouse
> ..
>
> I also have a .txt file with around 600 of the numbers found in file
> A. Cal that File B.
>
> File B looks like:-
>
> 63256
> 55223
> ..
>
> I need to (quickly as possible) remove all the lines from File A,
> whose numbers can be found in File B.
>
> Now, I have this working just fine in Windows Powershell, but it
> really slow, as I am foreaching file b into a where-object filter, it
> works but takes minutes to run. Based on my limited exposure to perl,
> I'm pretty certain that TMTOWTDI !!

Use a hash:

http://perl-begin.org/topics/hashes/

Populate the keys of the hash with the number of file B, read file A line by
line and check if they exist in the hash to filter them out, while writing a
new temporary file.

Regards,

Shlomi Fish

--
------------------------------------------------------------ -----
Shlomi Fish http://www.shlomifish.org/
Parody on "The Fountainhead" - http://shlom.in/towtf

God considered inflicting XSLT as the tenth plague of Egypt, but then
decided against it because he thought it would be too evil.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Shlomi Fish [ Mi, 01 September 2010 14:42 ] [ ID #2047001 ]

Re: Recursively filter an array

On Wed, Sep 1, 2010 at 07:55, Kryten <kryten68 [at] googlemail.com> wrote:
> Hi,
>
> I'm very much a beginner.
>
> Could anyone point me in the right direction on how to accomplish the
> following, please?
>
> I have a fairly long log file call it file A, it has around 20,000
> lines of three element space separated variables.
>
> File A looks like:-
>
> 55223 jimmy smith
> 55224 davy crocket
> 55227 walter mitty
> 63256 mickey mouse
> ..
>
> I also have a .txt file with around 600 of the numbers found in file
> A. Cal that File B.
>
> File B looks like:-
>
> 63256
> 55223
> ..
>
> I need to (quickly as possible) remove all the lines from File A,
> whose numbers can be found in File B.
>
> Now, I have this working just fine in Windows Powershell, but it
> really slow, as I am foreaching file b into a where-object filter, it
> works but takes minutes to run. Based on my limited exposure to perl,
> I'm pretty certain that TMTOWTDI !!
snip

What you need is a hash set. A hash set is a hash where you only care
about the existence of the keys, not the values associated with those
keys. If you create a hash set of the items in file B, you can read
each line from file A and see if the field is in the hash, if it is,
don't print it. This is the fastest it can possibly run; the hash
lookup is amortized O(1) and each file is read only once (which means
it has a runtime of O(m+n) where m is the number of lines in m and n
is the number of lines in n).

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 id_file data_file\n" unless [at] ARGV == 2;

my ($id_file, $data_file) = [at] ARGV;

open my $ids, "<", $id_file
or die "could not open $id_file: $!\n";

open my $data, "<", $data_file
or die "could not open $data_file: $!\n";

my %remove;
while (my $id = <$ids>) {
chomp $id;
#using undef instead of 1 because it takes up less room
$remove{$id} = undef;
}

while (<$data>) {
my ($id) = split;
print unless exists $remove{$id};
}


--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
chas.owens [ Mi, 01 September 2010 14:45 ] [ ID #2047002 ]

Re: Recursively filter an array

On 10-09-01 07:55 AM, Kryten wrote:
> Hi,
>
> I'm very much a beginner.
>
> Could anyone point me in the right direction on how to accomplish the
> following, please?
>
> I have a fairly long log file call it file A, it has around 20,000
> lines of three element space separated variables.
>
> File A looks like:-
>
> 55223 jimmy smith
> 55224 davy crocket
> 55227 walter mitty
> 63256 mickey mouse
> ..
>
> I also have a .txt file with around 600 of the numbers found in file
> A. Cal that File B.
>
> File B looks like:-
>
> 63256
> 55223
> ..
>
> I need to (quickly as possible) remove all the lines from File A,
> whose numbers can be found in File B.
>
> Now, I have this working just fine in Windows Powershell, but it
> really slow, as I am foreaching file b into a where-object filter, it
> works but takes minutes to run. Based on my limited exposure to perl,
> I'm pretty certain that TMTOWTDI !!
>
> Many thanks,
> Stuart

Load file B into a hash. Then for each line in A, separate the ID and
check if it's in the hash. If it is, skip it; else print it to the new
file.


--
Just my 0.00000002 million dollars worth,
Shawn

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Shawn H Corey [ Mi, 01 September 2010 14:54 ] [ ID #2047003 ]

Re: Recursively filter an array

Wow. Thank you Shlomi, Thank you Chas and Thank you Shawn.

Hash sets seem to be the way to go here. Much quickness too!

Here is what I have ( the least I can do is give you all a chance to
laugh
at my code! ):-


#!/usr/bin/perl
use warnings ;

my $names_file = 'C:\names.log' ;
my $exclude_list = 'C:\exclude.txt' ;
my %names_hash ;

open ( NAMES, $names_file ) ;
open ( EXCLUDE, $exclude_list ) ;

[at] allnames = <NAMES> ;
[at] allexcludes = <EXCLUDE> ;

#---> Create the hash set #--->
foreach $exclusion ( [at] allexcludes ) {
chomp( $exclusion ) ;
$names_hash{ $exclusion } = 1
}

#---> Walk the names array #--->
foreach $name ( [at] allnames ) {
chomp( $name ) ;
if ( ($dn) = $name =~ /(\d{5,6})/ ) {
if ( $names_hash{ $dn } ) {
print "This would be excluded: $name\n"
} # Close inner IF
} # Close outer IF
}

This seems to work; I'm aghast at how quickly it tears through the
data.
My Powershell script was taking 2-3 minutes. This method completes in
about 4 seconds. Though to be fair I'm not using an associative array
in
my Powershell code, so it's not this same.

This is an order of magnitude better than what I am doing now, so big,
big
thanks and I'd be delighted for any pointers for how to make my code
more .. perl'ish.

Thanks,
Stuart


--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Kryten [ Mi, 01 September 2010 20:18 ] [ ID #2047075 ]

Re: Recursively filter an array

>>>>> "K" == Kryten <kryten68 [at] googlemail.com> writes:

K> Here is what I have ( the least I can do is give you all a chance to
K> laugh
K> at my code! ):-

here comes the laughter! :)


K> #!/usr/bin/perl
K> use warnings ;

put use strict in there too. you are declaring some vars, strict
enforces that you declare all of them.

also i changed some names to use _ chars which makes them more readable.

K> my $names_file = 'C:\names.log' ;
K> my $exclude_list = 'C:\exclude.txt' ;
K> my %names_hash ;

K> open ( NAMES, $names_file ) ;
K> open ( EXCLUDE, $exclude_list ) ;

always check open calls for success. use lexical handles and not global
ones.

K> [at] allnames = <NAMES> ;
K> [at] allexcludes = <EXCLUDE> ;

if you want speed, that is not the best way to read in the file
lines. File::Slurp (on cpan) can do that for you and is cleaner as well:

use File::Slurp ;

my [at] all_names = read_file( $exclude_file ) ;
my [at] all_excludes = read_file( $names_file ) ;

and you can chomp an array which is faster than doing it for each line:

chomp [at] all_names ;
chomp [at] all_excludes ;

K> #---> Create the hash set #--->
K> foreach $exclusion ( [at] allexcludes ) {
K> chomp( $exclusion ) ;
K> $names_hash{ $exclusion } = 1
K> }

this may be more advanced than you want but you can use a slice for
this:


my %excluded_names ;
[at] excluded_names{ [at] all_excludes } = () ;

notice i didn't assign anything, all the entries will be undef. but they will
also i picked a better name. names_hash tells me nothing about the hash

K> #---> Walk the names array #--->
K> foreach $name ( [at] allnames ) {
K> chomp( $name ) ;
K> if ( ($dn) = $name =~ /(\d{5,6})/ ) {
K> if ( $names_hash{ $dn } ) {


if ( exists( $excluded_names{ $dn } ) ) {
K> print "This would be excluded: $name\n"
K> } # Close inner IF

useless comment. the indent tells you that. and any decent code editor
will handle checking of paired chars.

K> } # Close outer IF
K> }

K> This seems to work; I'm aghast at how quickly it tears through the
K> data.
K> My Powershell script was taking 2-3 minutes. This method completes in
K> about 4 seconds. Though to be fair I'm not using an associative array
K> in
K> my Powershell code, so it's not this same.

it isn't the language but the algorithm. a double loop as you likely
wrote in powershell is VERY slow. it would be N * M checks. with the
hash and one loop it is M times. so this is N times faster, easily
speeding up to 4 seconds vs 120-180 seconds.

K> This is an order of magnitude better than what I am doing now, so
K> big, big thanks and I'd be delighted for any pointers for how to
K> make my code more .. perl'ish.

my improvements will make your code more solid and even do a little more
speedup.

uri

--
Uri Guttman ------ uri [at] stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Uri Guttman [ Do, 02 September 2010 10:39 ] [ ID #2047079 ]

Re: Recursively filter an array

Hi Stuart,

a few comments on your code.

On Wednesday 01 September 2010 21:18:10 Kryten wrote:
> Wow. Thank you Shlomi, Thank you Chas and Thank you Shawn.
>
> Hash sets seem to be the way to go here. Much quickness too!
>
> Here is what I have ( the least I can do is give you all a chance to
> laugh
> at my code! ):-
>
>
> #!/usr/bin/perl
> use warnings ;

Add "use strict;" too. See:

http://perl-begin.org/tutorials/bad-elements/

>
> my $names_file = 'C:\names.log' ;

Don't call variables "file". See:

http://perl-begin.org/tutorials/bad-elements/

> my $exclude_list = 'C:\exclude.txt' ;

It's probably safer to use 'C:/exclude.txt' because the backslash has special
meaning in perls.

> my %names_hash ;

No need to call a hash variable with a name trailing with "_hash".

>
> open ( NAMES, $names_file ) ;
> open ( EXCLUDE, $exclude_list ) ;

See:

http://perl-begin.org/tutorials/bad-elements/#open-function- style

>
> [at] allnames = <NAMES> ;
> [at] allexcludes = <EXCLUDE> ;

Make sure you declare all variables using "my". Also see:

http://perl-begin.org/tutorials/bad-elements/#identifiers-wi thout-underscores

>
> #---> Create the hash set #--->
> foreach $exclusion ( [at] allexcludes ) {
> chomp( $exclusion ) ;
> $names_hash{ $exclusion } = 1
> }

This would be done faster, by saying:

chomp( [at] allexcludes);
my %names_hash = (map { $_ => 1 } [at] allexcludes).

>
> #---> Walk the names array #--->
> foreach $name ( [at] allnames ) {

In this case it would consume less memory not to slurp a file needlessly and
use:

http://perl-begin.org/tutorials/bad-elements/#foreach-lines

> chomp( $name ) ;
> if ( ($dn) = $name =~ /(\d{5,6})/ ) {
> if ( $names_hash{ $dn } ) {
> print "This would be excluded: $name\n"
> } # Close inner IF
> } # Close outer IF

Why do you have comments such as "#Close IF"? Everything can see this is the
case by following the indentation and by using their editor's "go-to-matching
brace" feature (e.g: the "%" key in http://www.vim.org/ ).

Regards,

Shlomi Fish

> }
>
> This seems to work; I'm aghast at how quickly it tears through the
> data.
> My Powershell script was taking 2-3 minutes. This method completes in
> about 4 seconds. Though to be fair I'm not using an associative array
> in
> my Powershell code, so it's not this same.
>
> This is an order of magnitude better than what I am doing now, so big,
> big
> thanks and I'd be delighted for any pointers for how to make my code
> more .. perl'ish.
>
> Thanks,
> Stuart

--
------------------------------------------------------------ -----
Shlomi Fish http://www.shlomifish.org/
Interview with Ben Collins-Sussman - http://shlom.in/sussman

God considered inflicting XSLT as the tenth plague of Egypt, but then
decided against it because he thought it would be too evil.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Shlomi Fish [ Do, 02 September 2010 10:45 ] [ ID #2047080 ]

Re: Recursively filter an array

On Thu, Sep 2, 2010 at 04:39, Uri Guttman <uri [at] stemsystems.com> wrote:
snip
> if you want speed, that is not the best way to read in the file
> lines. File::Slurp (on cpan) can do that for you and is cleaner as well:
snip

If there was one thing I could change about this list, it would be
that to ban people from saying something is faster or slower without
providing the Big O Notation or a benchmark. File::Slurp is written
in Perl 5; there is no magical speed up from using it. In fact, there
is a slow down (see benchmark at the end).

And that is ignoring the fact that slurping the file is a terrible
idea. Further ignoring the fact that we are filling (and possibly
overfilling) memory uselessly, we are then looping over the lines.
This means we must loop once to read in the file in File::Slurp, and
once to read each line. This is wasted time.

The safest and fastest way to read a file line by line is the trusty
old while loop. If you are using a filesystem that provides a
readahead buffer, you can't beat it in terms of speed and memory
usage.

Note, there is a flaw in this benchmark: the filesystem plays an
important part in the read speed. In this case the fact that file is
already buffered in memory may play a part in while's speediness, but
everyone has that advantage. A similar test using around five hundred
seven megabyte files (on a box with two gigabytes of RAM) had similar
results
: while took 220 seconds, array took 233 seconds, and slurp took 293
seconds. The run order was array, while, slurp.

array: 65
while: 65
slurp: 65
Rate slurp array while
slurp 6280/s -- -35% -43%
array 9660/s 54% -- -12%
while 10960/s 75% 13% --

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;
use File::Slurp;

my %subs = (
slurp => sub {
my [at] lines = read_file "/etc/passwd";
return scalar [at] lines;
},
array => sub {
open my $fh, "<", "/etc/passwd" or die $!;
my [at] lines = <$fh>;
return scalar [at] lines;
},
while => sub {
open my $fh, "<", "/etc/passwd" or die $!;
my [at] lines;
while (my $line = <$fh>) {
push [at] lines, $line;
}
return scalar [at] lines;
},
);

for my $sub (keys %subs) {
print "$sub: ", $subs{$sub}(), "\n";
}

Benchmark::cmpthese -1, \%subs;




--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
chas.owens [ Do, 02 September 2010 15:56 ] [ ID #2047084 ]

Re: Recursively filter an array

>>>>> "CO" == Chas Owens <chas.owens [at] gmail.com> writes:

CO> On Thu, Sep 2, 2010 at 04:39, Uri Guttman <uri [at] stemsystems.com> wrote:
CO> snip
>> if you want speed, that is not the best way to read in the file
>> lines. File::Slurp (on cpan) can do that for you and is cleaner as well:
CO> snip

CO> If there was one thing I could change about this list, it would be
CO> that to ban people from saying something is faster or slower
CO> without providing the Big O Notation or a benchmark. File::Slurp
CO> is written in Perl 5; there is no magical speed up from using it.
CO> In fact, there is a slow down (see benchmark at the end).

there is no need for O() for a beginner. and on winblows the speed gain
is less for text files slurped into an array due to the line ending
issues. but even slower, it is better coding than an open with no
checking and then slurping with <>.


CO> And that is ignoring the fact that slurping the file is a terrible
CO> idea. Further ignoring the fact that we are filling (and possibly
CO> overfilling) memory uselessly, we are then looping over the lines.
CO> This means we must loop once to read in the file in File::Slurp, and
CO> once to read each line. This is wasted time.

for the excluded hash, it is simpler and probably much faster than line
by line. the latter way needs to run much more perl code which is slower
than a single slice. i won't benchmark it because it is also better
coding which is more important. the other file could be line by line but
i stuck with this paradigm for the moment.

uri

--
Uri Guttman ------ uri [at] stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Uri Guttman [ Do, 02 September 2010 19:08 ] [ ID #2047092 ]

Re: Recursively filter an array

On Thu, Sep 2, 2010 at 13:08, Uri Guttman <uri [at] stemsystems.com> wrote:
snip
> for the excluded hash, it is simpler and probably much faster than line
> by line. the latter way needs to run much more perl code which is slower
> than a single slice. i won't benchmark it because it is also better
> coding which is more important. the other file could be line by line but
> i stuck with this paradigm for the moment.
snip

You may think there is more code in the while loop version, but really
it there is less. File::Slurp is a pure Perl module. That means that
whatever loop it is using to get the data must happen in Perl. Then
you copy that data to an array, then the hash slice has to iterate
over the array. You have three loops (two of which are in C), but the
while loop has one. Using File::Slurp to get an array is slower than
dumping the file handle into an array. File::Slurp may be faster than
a straight slurp when slurping to a scalar, but I doubt it.
File::Slurp's only real advantage is that

my $string = read_file "filename";

is easier to understand than

my $string = do {
open my $fh, "<", "filename" or die $!;
local $/;
<$fh>;
};


--
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
chas.owens [ Do, 02 September 2010 21:15 ] [ ID #2047094 ]

Re: Recursively filter an array

>>>>> "CO" == Chas Owens <chas.owens [at] gmail.com> writes:

CO> You may think there is more code in the while loop version, but really
CO> it there is less. File::Slurp is a pure Perl module. That means that
CO> whatever loop it is using to get the data must happen in Perl. Then
CO> you copy that data to an array, then the hash slice has to iterate
CO> over the array. You have three loops (two of which are in C), but the
CO> while loop has one. Using File::Slurp to get an array is slower than
CO> dumping the file handle into an array. File::Slurp may be faster than
CO> a straight slurp when slurping to a scalar, but I doubt it.
CO> File::Slurp's only real advantage is that

you haven't seen all the deeper benchmarks. it is faster than most any
other slurp method and handles all errors correctly and with the style
of your choice. also returning an array ref is faster still but i didn't
want to overload the newbie with that as well. in the git repo there is
a newer benchmark script which is more informative and allows for
multiple size tests as well. don't go off on speed things when you don't
have all the facts. :)

i am not going to debate speed issues of slurp here as it is the wrong
forum.

uri

--
Uri Guttman ------ uri [at] stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Uri Guttman [ Do, 02 September 2010 21:47 ] [ ID #2047095 ]

Re: Recursively filter an array

On 2010-09-02 21:15, Chas. Owens wrote:

> my $string = do {
> open my $fh, "<", "filename" or die $!;
> local $/;
> <$fh>;
> };

To make it use less memory, write it like this:

my $string;
{
open my $fh, "<", "filename" or die $!;
local $/;
$string = <$fh>;
};

--
Ruud

--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
rvtol+usenet [ Do, 02 September 2010 21:47 ] [ ID #2047205 ]
Perl » gmane.comp.lang.perl.beginners » Recursively filter an array

Vorheriges Thema: Flip-flop conversations using Socket
Nächstes Thema: usage of Split on pattern matching