
how to parse complex table
Hi,
I'm trying to parse a table containing information about genes in a
bacterial chromosome. Below is a sample for one gene, and there's about 4500
such blocks in a file:
gene_oid    Locus Tag    Source    Cluster Information    Gene Information    E-value
642745051    SeSA_B0001    COG_category    [T] Signal transduction mechanisms
642745051    SeSA_B0001    COG_category    [K] Transcription
642745051    SeSA_B0001    COG1974    SOS-response transcriptional repressors (RecA-mediated autopeptidases)        2.0e-29
642745051    SeSA_B0001    pfam00717    Peptidase_S24        1.7e-13
642745051    SeSA_B0001    EC:3.4.21.-    Hydrolases. Acting on peptide bonds (peptide hydrolases). Serine endopeptidases.
642745051    SeSA_B0001    KO:K03503    DNA polymerase V [EC:3.4.21.-]        0.0e+00
642745051    SeSA_B0001    ITERM:03797    SOS response UmuD protein. Serine peptidase. MEROPS family S24
642745051    SeSA_B0001    Locus_type        CDS
642745051    SeSA_B0001    NCBI_accession        YP_002112883
642745051    SeSA_B0001    Product_name        protein SamA
642745051    SeSA_B0001    Scaffold        NC_011092
642745051    SeSA_B0001    Coordinates        34..459(+)
642745051    SeSA_B0001    DNA_length        426bp
642745051    SeSA_B0001    Protein_length        141aa
642745051    SeSA_B0001    GC        .52
I want to parse information for Locus_Tag, Source, and Cluster Info for each
gene so that the output table looks like this
locus    COG_category    COG_category    COGID    Cluster_Information
SeSA_B0001    [T] Signal transduction mechanisms    [K] Transcription    COG1974    SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002    "\t" [L] Replication, recombination and repair    COG0389    Nucleotidyltransferase/DNA polymerase involved in DNA repair
My problem is that some genes have 2 entries for COG_category, some only one
and others none. I took a look at perldsc and tried to fit the table into
one of the complex structures but didn't get far. Below is the code I came
up with so far:
#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict; use warnings;

open( IN, "<", @ARGV ) or die "Failed to open: $!\n";

print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";

my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info, $e );

while( <IN> ) {
    if( $_=~ /COG_category/ ) {
        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        $cog_cat{ $locus } = $cluster_info;
        push( @cogs, { %cog_cat } );
    } elsif ( $_=~ /COG\d+/ ) {
        ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        $cog_id{ $locus } = $cluster_info;
    }
}
close IN;

#print scalar @cogs, "\n";

for my $test( sort keys %cog_cat ) {
    print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
}
print "\n";
Your insight is greatly appreciated!
galeb
Re: how to parse complex table
now, here's a homework question!! :)
On Sat, Apr 23, 2011 at 10:27 AM, galeb abu-ali <abualiga2 [at] gmail.com> wrote:
> <snip>
> #!/usr/bin/perl
> # parse_IMG_gene_info.pl
> use strict; use warnings;
good, but no need to save space - you have a return key, put different
things on different lines unless you *really* feel it looks / reads
better to do otherwise.
>
>
> open( IN, "<", @ARGV ) or die "Failed to open: $!\n";
open( my $file, "<", $ARGV[ 0 ]) or die ".... $!\n";
>
> print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
>
> my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source, $cluster_info, $e );
>
> while( <IN> ) {
>     if( $_=~ /COG_category/ ) {
>         ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>         $cog_cat{ $locus } = $cluster_info;
>         push( @cogs, { %cog_cat } );
>     } elsif ( $_=~ /COG\d+/ ) {
>         ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
>         $cog_id{ $locus } = $cluster_info;
>     }
> }
>
i don't really have the knowledge to help here, nor really want to
parse this. instead, i'll suggest using Text::CSV_XS, it's much easier
and will give you a good data structure. to check whether a column is
there, all you do is 'if( $csv->[ $col ] ) { ..column has data.. }'
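The column check described above can be sketched roughly as follows. This is only an illustration: it uses core split instead of Text::CSV_XS so that it runs without any CPAN module, and the names parse_row and has_column are made up for the example. One thing it shows is that a plain truth test would treat a field containing "0" as missing, so the sketch tests for a non-empty string instead.

```perl
use strict;
use warnings;

# Split one tab-separated record into an array reference of fields.
# The -1 limit keeps trailing empty fields instead of dropping them.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    return [ split /\t/, $line, -1 ];
}

# True if column $col exists and is not the empty string.
# A plain truth test would wrongly treat a field of "0" as missing.
sub has_column {
    my ( $row, $col ) = @_;
    return ( defined $row->[$col] && length $row->[$col] ) ? 1 : 0;
}

# Sample record from the thread: the Cluster Information column is empty.
my $row = parse_row("642745051\tSeSA_B0001\tLocus_type\t\tCDS\n");
for my $col ( 0 .. 5 ) {
    printf "column %d: %s\n", $col,
        has_column( $row, $col ) ? $row->[$col] : '(empty)';
}
```

With real data, Text::CSV_XS would take the place of parse_row; the per-column check stays the same.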
> close IN;
close $file;
or just let it go out of scope and close on its own.
>
> #print scalar @cogs, "\n";
>
> for my $test( sort keys %cog_cat ) {
>     print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
> }
> print "\n";
can i suggest a database? it isn't that hard and will help tons in
future processing of the data and manipulation. also, a quick google
brought up some interesting results on your field:
http://oreilly.com/catalog/begperlbio/chapter/ch10.html
http://search.cpan.org/~mingyiliu/Bio-ASN1-EntrezGene-1.10-withoutworldwriteables/lib/Bio/ASN1/EntrezGene.pm
it might help to look at this (though, i think that Text::CSV will
suit your needs just fine):
http://oreilly.com/catalog/perlsysadm/chapter/ch09.html
--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Re: how to parse complex table
thank you for the advice Shawn, I'll try what you suggest!
BTW, it's not homework, It's supporting metadata for my research and I'm
trying to parse it in a format that will be easier to lookup later.
thanks again
galeb
On Sat, Apr 23, 2011 at 11:10 AM, shawn wilson <ag4ve.us [at] gmail.com> wrote:
> now, here's a homework question!! :)
Re: how to parse complex table
On Sat, Apr 23, 2011 at 11:56 AM, galeb abu-ali <abualiga2 [at] gmail.com> wrote:
> BTW, it's not homework, It's supporting metadata for my research and I'm
> trying to parse it in a format that will be easier to lookup later.
>
it was a joke. i figured you were either a graduate student,
researcher, or hacking around with large data sets for the heck of it.
Re: how to parse complex table
i guess the sensitivity stemmed from reading random snippets from the thread
'the nature of this list' and how using it to do homework is considered abuse,
so i didn't want to fall in that category.
thanks again
galeb
On Sat, Apr 23, 2011 at 12:51 PM, shawn wilson <ag4ve.us [at] gmail.com> wrote:
> it was a joke. i figured you were either a graduate student,
> researcher, or hacking around with large data sets for the heck of it.
Re: how to parse complex table
On 11-04-23 02:54 PM, galeb abu-ali wrote:
> using it to do homework is considered abuse
Using it to do your homework is abuse. Using it to ask about homework
isn't. The difference is how much effort you put in. Some guidelines:
1. Say it's homework from the start.
2. Include the code you have so far.
3. Include some input data.
4. State what you want for the output. Those who are interested can
run the code with the data to see what goes wrong.
5. Don't expect an immediate response. Everyone here is a volunteer
and is often busy doing other things.
Please note that everything posted is advice; it may not work and you
are not forced to take it.
--
Just my 0.00000002 million dollars worth,
Shawn
Confusion is the first step of understanding.
Programming is as much about organization and communication
as it is about coding.
The secret to great software: Fail early & often.
Eliminate software piracy: use only FLOSS.
Re: how to parse complex table
On Sat, Apr 23, 2011 at 10:27:07AM -0400, galeb abu-ali wrote:
> Hi,
>
> I'm trying to parse a table containing information about genes in a
> bacterial chromosome. Below is a sample for one gene, and there's about 4500
> such blocks in a file:
<snip>
> My problem is that some genes have 2 entries for COG_category, some only one
> and others none. I took a look at perldsc and tried to fit the table into
> one of the complex structures but didn't get far. Below is the code I came
> up with so far:
<snip>
#!/usr/bin/perl
# parse_IMG_gene_info.pl
use strict; use warnings;
# '##' marks the lines of your code I changed/removed.
# One of these days you're going to want this program
# to report on more than one file.
# So wrap 'for my $file ( @ARGV ) { ... }' around it
# And open( my $IN, '<', $file ) or die "Failed to open $file: $!\n";
open( IN, "<", @ARGV ) or die "Failed to open: $!\n";
# Put this print statement down by the other. Keeping functional groups
# together aids understanding and it'll all be there when you realize
# the task has grown to the point you want an output routine.
print "locus\tCOG_category\tCOG_category\tCOGID\tCluster_Information\n\n";
# In Perl it's customary to declare your variables just before you
# need them, which makes it easier to notice that %locus & $e aren't used
my( %locus, @cogs, %cog_cat, %cog_id, $oid, $locus, $source,
    $cluster_info, $e);
# Keeping functional groups together suggests:
# open...; while(...); close;
# with nothing else intervening. Again $IN.
# And while( my $line = <$IN> ) protects you from the many things
# that change $_.
# If you don't get out of the habit of using $_ it 'will' bite you.
# BTW, chomp, split, /COG_category/ and several other functions
# act on $_ by default so 'split "\t", $_' and 'split "\t"'
# are equivalent. Note also that 'split /\t/' is preferable.
while( <IN> ) {
    chomp;  # remove linefeeds from $cluster_info
    if( $_=~ /COG_category/ ) {
##      ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        # the tabs got lost in the email and it was easier for me to
        # change the split (to runs of two or more spaces) than change
        # the data file.
        ( $oid, $locus, $source, $cluster_info ) = split / {2,}/, $_;
        # When you found the 2nd cog_cat you overwrote the first
##      $cog_cat{ $locus } = $cluster_info;
        push @{ $cog_cat{ $locus } }, $cluster_info if($cluster_info);
        # You never used this and it's an array of hashes when what I think
        # you need is an array in your hash
##      push( @cogs, { %cog_cat } );
    } elsif ( $_=~ /COG\d+/ ) {
#       ( $oid, $locus, $source, $cluster_info ) = split "\t", $_;
        ( $oid, $locus, $source, $cluster_info ) = split / {2,}/, $_;
        $cog_id{ $locus } = $cluster_info;
    }
}
close IN;

#print scalar @cogs, "\n";
# Uncomment and take a look at the output of the next 2 lines
# and I think you'll see where you could use a single hash
# use Data::Dumper;
# print Dumper \%cog_cat, \%cog_id ;
for my $test( sort keys %cog_cat ) {
##  print "$test\t$cog_cat{ $test }\t$cog_id{ $test }\n";
    print $test, map {"\t$_"} @{$cog_cat{ $test }}, "\t$cog_id{ $test }\n";
}
print "\n";
__END__
This is the output I got running this against data from one of your
earlier posts. I think it's what you're looking for except for
an extra tab whose source I don't see.
Challenge for the student? :)

locus    COG_category    COG_category    COGID    Cluster_Information

SeSA_B0001    [T] Signal transduction mechanisms    [K] Transcription        SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002    [L] Replication, recombination and repair        Nucleotidyltransferase/DNA polymerase involved in DNA repair
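
A likely source of the extra tab is the way map takes its arguments: it consumes every item to its right, so the final "\t$cog_id{ $test }\n" string is also passed through the block and gets a second leading tab. The sketch below uses shortened, made-up values and hypothetical sub names (row_with_extra_tab, row_fixed) to show the effect and one way around it.

```perl
use strict;
use warnings;

# Shortened sample values, for illustration only.
my @cats = ( '[T] Signal', '[K] Transcription' );
my $id   = 'COG1974';

# map consumes every argument to its right, so the final string is also
# passed through the block and gains a second leading tab.
sub row_with_extra_tab {
    return join '', 'SeSA_B0001', map {"\t$_"} @cats, "\t$id\n";
}

# Parenthesising the map call ends its argument list before the last string.
sub row_fixed {
    return join '', 'SeSA_B0001', ( map {"\t$_"} @cats ), "\t$id\n";
}

print row_with_extra_tab();    # two tabs before COG1974
print row_fixed();             # one tab before COG1974
```

Building the row with join "\t", $locus, @cats, $id would avoid the question entirely, since join only puts separators between items.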
HTH,
Mike
--
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
Fwd: how to parse complex table
Mike,
many thanks for the help, particularly your comments which were as valuable
as your code!
I revised the code to the following:
#!/usr/bin/perl
# parse_IMG_gene_info4.pl
use strict; use warnings;

for my $file( @ARGV ) {
    open( my $IN, "<", $file ) or die "Failed to open: $!\n";

    my( %cog_cat, %cog_id, @cogs, $oid, $locus, $source, $cluster_info );

    while( my $line = <$IN> ) {
        chomp $line;

        if( $line=~ /COG_category/ ) {
            ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
            push @{ $cog_cat{ $locus } }, $cluster_info if( $cluster_info );
        } elsif ( $line=~ /COG\d+/ ) {
            ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
            push @{ $cog_id{ $locus } }, $source if( $source );
            push @{ $cog_id{ $locus } }, $cluster_info if( $cluster_info );
        }
    }
    close $IN;

    #use Data::Dumper;
    #print Dumper \%cog_cat, \%cog_id;

    print "Locus\tCOG_category\tCOG_category\tCOG_ID\tCluster_information\tCOG_ID\tCluster_information\n\n";

    for my $test( sort keys %cog_cat ) {
        print $test, map {"\t$_"} @{ $cog_cat{ $test } }; #\t$cog_id{ $test }
        if( scalar @{ $cog_cat{ $test } } > 1 ) {
            print map {"\t$_"} @{ $cog_id{ $test } }, "\n";
        } else {
            print "\t", map {"\t$_"} @{ $cog_id{ $test } }, "\n";
        }
    }
    print "\n";
}
I ended up using 2 hashes instead of 1 to include more information, but
you're right that 1 would suffice in the initial design. The extra tab went
away, no idea how. I also added a tab after COG_category, so that it looks
like this now:
Locus    COG_category    COG_category    COG_ID    Cluster_information    COG_ID    Cluster_information

SeSA_B0001    [T] Signal transduction mechanisms    [K] Transcription    COG1974    SOS-response transcriptional repressors (RecA-mediated autopeptidases)
SeSA_B0002    [L] Replication, recombination and repair        COG0389    Nucleotidyltransferase/DNA polymerase involved in DNA repair
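
For comparison, here is a rough sketch of the single-hash layout mentioned above, with one entry per locus holding both the categories and the COG id/description pairs. The sub names (collect, format_row), the key names (cats, cogs) and the SeSA_B0002 gene_oid and E-value are made up for the example. Padding the category list on the right to two columns is also an assumption; a gene with more than two categories would still push the later columns over.

```perl
use strict;
use warnings;

# Collect everything about a locus under one key:
#   $gene{$locus}{cats} = [ category strings ]
#   $gene{$locus}{cogs} = [ [ COG id, description ], ... ]
sub collect {
    my (@lines) = @_;
    my %gene;
    for my $line (@lines) {
        chomp $line;
        my ( $oid, $locus, $source, $info ) = split /\t/, $line;
        next unless defined $source;
        if ( $source eq 'COG_category' ) {
            push @{ $gene{$locus}{cats} }, $info;
        }
        elsif ( $source =~ /^COG\d+$/ ) {
            push @{ $gene{$locus}{cogs} }, [ $source, $info ];
        }
    }
    return \%gene;
}

# Build one output row; pad to two category columns so that genes with
# zero, one or two categories still line up.
sub format_row {
    my ( $locus, $rec ) = @_;
    my @cats = @{ $rec->{cats} || [] };
    push @cats, '' while @cats < 2;
    my @cogs = map { @$_ } @{ $rec->{cogs} || [] };
    return join( "\t", $locus, @cats, @cogs ) . "\n";
}

# Sample lines; the second gene's gene_oid and E-value are invented.
my $gene = collect(
    "642745051\tSeSA_B0001\tCOG_category\t[T] Signal transduction mechanisms\n",
    "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription\n",
    "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors\t\t2.0e-29\n",
    "642745052\tSeSA_B0002\tCOG_category\t[L] Replication, recombination and repair\n",
    "642745052\tSeSA_B0002\tCOG0389\tNucleotidyltransferase/DNA polymerase\t\t1.0e-10\n",
);
print format_row( $_, $gene->{$_} ) for sort keys %$gene;
```

Because every locus that has either kind of row ends up in the same hash, a gene with a COG id but no COG_category row is printed as well, with both category columns empty.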
Many thanks again! Must've spent ~ 4 days on this. I've been flirting with
Perl less than a year, it's so seductive I find myself debating whether to
go back to school.
cheers
galeb
On Sat, Apr 23, 2011 at 3:30 PM, Mike McClain <mike.junk [at] cox.net> wrote:
> <snip>
Re: Fwd: how to parse complex table
On Sun, 24 Apr 2011 18:14:52 -0400, galeb abu-ali wrote:
> I revised the code to the following:
[...]
> for my $file( @ARGV ) {
>     open( my $IN, "<", $file ) or die "Failed to open: $!\n";
>
>     my( %cog_cat, %cog_id, @cogs, $oid, $locus, $source, $cluster_info );
>
>     while( my $line = <$IN> ) {
>         chomp $line;
>
>         if( $line=~ /COG_category/ ) {
>             ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
>
>             push @{ $cog_cat{ $locus } }, $cluster_info if( $cluster_info );
>
>         } elsif ( $line=~ /COG\d+/ ) {
>             ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;
>             push @{ $cog_id{ $locus } }, $source if( $source );
>             push @{ $cog_id{ $locus } }, $cluster_info if( $cluster_info );
>         }
>     }
Be ruthless about removing duplication. The more unnecessary code you
can prune, the more what's left reveals its true intention, like
chiseling away everything that is not David from a block of marble :-)
So in the above, you can take the line

    ( $oid, $locus, $source, $cluster_info ) = split /\t/, $line;

from both clauses and put it before the if statement. Also, Perl's topic
variable allows you to eliminate the use of certain variables that
otherwise serve no purpose, like $line in your code, which is only there
to get at its contents. So instead you can say

    while (<$IN>) {
        chomp;
        my ( $oid, $locus, $source, $cluster_info ) = split /\t/;
        if ( /COG_category/ ) {
See also how I declared those four variables right there? Remove their
declaration from before the while loop now and you've gotten rid of some
more duplication and unnecessarily wide scoping.
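
Put together, the loop might look something like the sketch below. It is only one possible reading of the advice: the sub name read_cogs and the in-memory sample data are made up, and it tests the Source field directly rather than matching the whole line, which is a small change from the code in the thread. The local $_ is there because while (<$IN>) assigns to the global $_ and would otherwise overwrite the caller's value.

```perl
use strict;
use warnings;

# Read records from an open filehandle and return two hashes of arrays,
# one for COG categories and one for COG id/description pairs.
sub read_cogs {
    my ($IN) = @_;
    my ( %cog_cat, %cog_id );
    local $_;    # keep the caller's $_ intact
    while (<$IN>) {
        chomp;
        # One split serves both branches; the variables live only in the loop.
        my ( $oid, $locus, $source, $cluster_info ) = split /\t/;
        next unless defined $cluster_info && length $cluster_info;
        if ( $source eq 'COG_category' ) {
            push @{ $cog_cat{$locus} }, $cluster_info;
        }
        elsif ( $source =~ /^COG\d+/ ) {
            push @{ $cog_id{$locus} }, $source, $cluster_info;
        }
    }
    return ( \%cog_cat, \%cog_id );
}

# Sample data held in a string and read through an in-memory filehandle.
my $data = "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription\n"
         . "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors\n"
         . "642745051\tSeSA_B0001\tLocus_type\t\tCDS\n";
open my $IN, '<', \$data or die "Failed to open in-memory file: $!\n";
my ( $cat, $id ) = read_cogs($IN);
close $IN;
print "$_\t@{ $cat->{$_} }\t@{ $id->{$_} }\n" for sort keys %$cat;
```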
> Many thanks again! Must've spent ~ 4 days on this. I've been flirting
> with Perl less than a year, it's so seductive I find myself debating
> whether to go back to school.
Heh, camels can be like that :-)
--
Peter Scott
http://www.perlmedic.com/ http://www.perldebugged.com/
http://www.informit.com/store/product.aspx?isbn=0137001274
http://www.oreillyschool.com/courses/perl3/
Re: Fwd: how to parse complex table
thanks Peter,
code looks elegant now!
"like chiseling away everything that is not David from a block of marble :-)"
very cool reference!
Re: how to parse complex table
On Apr 23, 11:56 am, abuali... [at] gmail.com (galeb abu-ali) wrote:
> thank you for the advice Shawn, I'll try what you suggest!
> BTW, it's not homework, It's supporting metadata for my research and I'm
> trying to parse it in a format that will be easier to lookup later.
>
> thanks again
>
> galeb
Hi Galeb,
Might I offer a suggestion? Instead of parsing the lines into data
structures to be printed at the end of each file, just print out the
whole (matching) line and defer the parsing of certain fields until a
later time. It will greatly simplify your code, which then just acts as a
filter that prints only the tab-separated data of interest.
#!/usr/bin/perl
use strict;
use warnings;
# perl parse_IMG_gene_info4.pl your_input_file_here01.txt your_input_file_here02.txt (etc.)
while( <> ) {
    print if /COG_category/ || /COG\d+/;
    print "\n" if eof;
}
Just offering an idea. :-)
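
To show what deferring the parsing could look like, here is a small sketch. The sub names is_cog_line and field are invented for the example: the first applies the same test as the filter above, and the second splits a kept line only when a particular column is actually needed.

```perl
use strict;
use warnings;

# Same test as the one-line filter: keep COG_category rows and COGnnnn rows.
sub is_cog_line {
    my ($line) = @_;
    return ( $line =~ /COG_category/ || $line =~ /COG\d+/ ) ? 1 : 0;
}

# Later, when a particular field is needed, split the kept line on demand.
sub field {
    my ( $line, $index ) = @_;
    chomp $line;
    return ( split /\t/, $line, -1 )[$index];
}

# Three sample lines taken from the data shown earlier in the thread.
my @lines = (
    "642745051\tSeSA_B0001\tCOG_category\t[K] Transcription\n",
    "642745051\tSeSA_B0001\tpfam00717\tPeptidase_S24\t\t1.7e-13\n",
    "642745051\tSeSA_B0001\tCOG1974\tSOS-response transcriptional repressors (RecA-mediated autopeptidases)\t\t2.0e-29\n",
);
my @kept = grep { is_cog_line($_) } @lines;
print @kept;
print field( $kept[1], 2 ), "\n";    # Source column of the second kept line
```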
Chris