
help me with a parsing script please
HI,
I have this file format
chr start end strand
x 12 24 1
x 24 48 1
1 100 124 -1
1 124 148 -1
Basically I would like to create a new file by grouping the start of the
first line (12) with the end of the second line (48) and so on
the output should look like this:
x 12 48 1
1 100 148 -1
I have this script to split and iterate over each line, but I don't
know how to group 2 lines together, and take the start of the first line
and the end of the second line. Could you please advise? Thanks
unless (open(FH, $file)){
print "Cannot open file \"$file\"\n\n";
}
my @list = <FH>;
close FH;
open(OUTFILE, ">grouped.txt");
foreach my $line( @list){
chomp $line;
my @coordinates = split(/' '/, $region);
my $chromosome = $coordinates[0];
my $start = $coordinates[1];
my $end = $coordinates[2];
my $strand = $coordinates[3];
....???
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
--
To unsubscribe, e-mail: beginners-unsubscribe [at] perl.org
For additional commands, e-mail: beginners-help [at] perl.org
http://learn.perl.org/
Re: help me with a parsing script please
You are almost there :-)
my ($helper1, $helper2);
my $counter = 1;
# Assumes @list holds only data lines (no header line).
foreach my $line (@list) {
    chomp $line;
    my @coordinates = split ' ', $line;
    my $chromosome = $coordinates[0];
    my $start = $coordinates[1];
    my $end = $coordinates[2];
    my $strand = $coordinates[3];
    # A simple modulo operation ($counter % 2 returns 1 if the counter is
    # an odd number and 0 otherwise) lets you tell the odd lines from the
    # even lines. On the odd lines you capture the chromosome and start,
    # and on the even lines you capture the end and the strand, as well as
    # printing out the beginning of the previous line and the end of the
    # current line.
    #
    # Using a for loop instead of a foreach loop will result in a nicer
    # looking loop, and it might be a little bit faster as well (I never
    # actually tested this, so benchmark it to be sure), which on large
    # amounts of data such as you are likely to be processing might save
    # you a decent bit of time.
    if ( $counter % 2 ) { $helper1 = "$chromosome $start"; }
    else { $helper2 = "$end $strand"; print "$helper1 $helper2\n"; }
    $counter++;
}
Hope that helps,
Rob
On Thu, May 12, 2011 at 11:23 AM, Nathalie Conte <nac [at] sanger.ac.uk> wrote:
>
> HI,
>
> I have this file format
> chr start end strand
> x 12 24 1
> x 24 48 1
> 1 100 124 -1
> 1 124 148 -1
>
> Basically I would like to create a new file by grouping the start of the
> first line (12) with the end of the second line (48) and so on
> the output should look like this:
> x 12 48 1
> 1 100 148 -1
>
> I have this script to split and iterate over each line, but I don't know
> how to group 2 lines together, and take the start of the firt line and the
> end on the second line? could you please advise? thanks
>
> unless (open(FH, $file)){
> print "Cannot open file \"$file\"\n\n";
> }
>
> my @list = <FH>;
> close FH;
>
> open(OUTFILE, ">grouped.txt");
>
>
> foreach my $line( @list){
> chomp $line;
> my @coordinates = split(/' '/, $region);
> my $chromosome = $coordinates[0];
> my $start = $coordinates[1];
> my $end = $coordinates[2];
> my $strand = $coordinates[3];
> ...???
Re: help me with a parsing script please
Nathalie Conte wrote:
>
> HI,
Hello,
> I have this file format
> chr start end strand
> x 12 24 1
> x 24 48 1
> 1 100 124 -1
> 1 124 148 -1
>
> Basically I would like to create a new file by grouping the start of the
> first line (12) with the end of the second line (48) and so on
> the output should look like this:
> x 12 48 1
> 1 100 148 -1
>
> I have this script to split and iterate over each line, but I don't know
> how to group 2 lines together, and take the start of the firt line and
> the end on the second line? could you please advise? thanks
>
> unless (open(FH, $file)){
> print "Cannot open file \"$file\"\n\n";
> }
What you are saying is "if the file doesn't open print a message but use
the filehandle anyway".
You should not try to use an invalid filehandle. The usual way to exit
the program if open fails:
open FH, '<', $file or die qq[Cannot open file "$file" because: $!\n\n];
> my @list = <FH>;
Is there a good reason to read the entire file into memory instead of
just processing one line at a time?
> close FH;
>
> open(OUTFILE, ">grouped.txt");
You should always verify that the file opened correctly:
open OUTFILE, '>', 'grouped.txt' or die "Cannot open 'grouped.txt' because: $!";
> foreach my $line( @list){
> chomp $line;
> my @coordinates = split(/' '/, $region);
Your regular expression says to match a single quote character followed
by a space character followed by a single quote character. It looks
like you meant either:
my @coordinates = split ' ', $region;
Or:
my @coordinates = split /\s+/, $region;
Or possibly:
my @coordinates = split / +/, $region;
The first two would mean that the chomp() on the previous line is redundant.
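To make the difference between the three forms concrete, here is a small sketch. The sample string is hypothetical: leading spaces and the trailing newline are added on purpose to show where the forms diverge.

```perl
use strict;
use warnings;

# Hypothetical sample line with leading spaces and a trailing newline.
my $region = "  x 12 24 1\n";

# split ' ' skips leading whitespace and drops the trailing newline.
my @awk_mode = split ' ', $region;

# split /\s+/ also drops the newline, but leading whitespace
# produces an empty first field.
my @whitespace = split /\s+/, $region;

# split / +/ only splits on spaces, so the newline stays on the last field.
my @spaces = split / +/, $region;

print scalar(@awk_mode), " fields with ' '\n";
print scalar(@whitespace), " fields with /\\s+/\n";
print scalar(@spaces), " fields with / +/\n";
```

Only the last form still needs the chomp(), and only the first form is safe if a line can start with whitespace.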
> my $chromosome = $coordinates[0];
> my $start = $coordinates[1];
> my $end = $coordinates[2];
> my $strand = $coordinates[3];
There is no reason for the array @coordinates:
my ( $chromosome, $start, $end, $strand ) = split ' ', $region;
> ...???
John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein
Re: help me with a parsing script please
On Thu, May 12, 2011 at 6:23 AM, Nathalie Conte <nac [at] sanger.ac.uk> wrote:
> I have this script to split and iterate over each line, but I don't know
> how to group 2 lines together, and take the start of the firt line and the
> end on the second line? could you please advise? thanks
>
>
You have a couple of options for handling two lines at a time. Generally,
there are two main ones: handle one line at a time, keeping the previous line
in memory and having a conditional inside the loop decide what to do, or fetch
both lines that you need at once, which keeps the loop simpler. What
you are doing right now, reading the entire file into an array, makes the
latter somewhat simpler, but with a big enough file it just won't be viable
and you'll have to modify the script. Still, here's the nice and easy way to
do it with what you already have, using splice[0] (and, just for
completeness, I also did a small mock-up of the body of the loop, so the
program also uses autodie[1], chomp LIST[2], hash slices[3], and
smart-matching[4]):
use strict;
use warnings;
use 5.010;
use autodie;
my $header = <DATA>;
my @list = <DATA>;
open my $OUTFILE, '>', "grouped.txt";
chomp @list;
while ( my ($one, $two) = splice @list, 0, 2 ) {
    my (%first_line, %second_line);
    @first_line{ qw/ chromosome start end strand / } = split /\s+/, $one;
    @second_line{ qw/ chromosome start end strand / } = split /\s+/, $two;
    # Wrap the slices in array references so that both fields are compared.
    if ( [ @first_line{ qw/ chromosome strand / } ] ~~ [ @second_line{ qw/ chromosome strand / } ] ) {
        say { $OUTFILE } join "\t", $first_line{chromosome}, $first_line{start},
            $second_line{end}, $first_line{strand};
    } else {
        die "Something weird is going on";
    }
}
__DATA__
chr start end strand
x 12 24 1
x 24 48 1
1 100 124 -1
1 124 148 -1
(Without gmail's screwy indentation: http://ideone.com/SnRKp)
But reading an entire file into memory isn't advisable. So you either need
to drop that @list array and change the condition in the while loop to
something like this:
# Or put that inside a function
while ( defined( my $one = <DATA> ) and defined( my $two = <DATA> ) ) {
    ...
}
Or you keep the @list array, but do some magic with Tie::File[5]:
use Tie::File;
tie @list, 'Tie::File', $file or die ...;
my $index = 1; # To skip the header
while ( $index < $#list ) { # Or use a traditional for loop.
    my $one = $list[$index++];
    my $two = $list[$index++];
    ...
}
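For the "put that inside a function" variant of the two-lines-at-a-time loop, a minimal sketch might look like the following. The helper name read_pair and the in-memory sample data are made up for illustration; a real script would open the actual input file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the next two lines from a filehandle,
# or an empty list once fewer than two lines remain.
sub read_pair {
    my ($fh) = @_;
    my $one = <$fh>;
    return unless defined $one;
    my $two = <$fh>;
    return unless defined $two;
    chomp( $one, $two );
    return ( $one, $two );
}

# In-memory filehandle standing in for the real input file (no header line).
my $sample = "x 12 24 1\nx 24 48 1\n1 100 124 -1\n1 124 148 -1\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";

my @merged;
# The list assignment yields 0 in scalar context when read_pair
# returns an empty list, which ends the loop.
while ( my ( $one, $two ) = read_pair($in) ) {
    my ( $chr, $start, undef, $strand ) = split ' ', $one;
    my $end = ( split ' ', $two )[2];
    push @merged, "$chr $start $end $strand";
}
print "$_\n" for @merged;
```

Because the helper returns an empty list at end of file, the loop condition stays a plain list assignment and only two lines are held in memory at any time.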
Brian.
[0] http://perldoc.perl.org/functions/splice.html
[1] http://perldoc.perl.org/autodie.html
[2] http://perldoc.perl.org/functions/chomp.html
[3] http://perldoc.perl.org/perldata.html#Slices
[4] http://perldoc.perl.org/perlsyn.html#Switch-statements
[5] http://perldoc.perl.org/Tie/File.html
Re: help me with a parsing script please
On Thu, May 12, 2011 at 10:23:29AM +0100, Nathalie Conte wrote:
<snip>
> I have this file format
> chr start end strand
> x 12 24 1
<snip>
> I have this script to split and iterate over each line, but I don't
> know how to group 2 lines together, and take the start of the firt line
> and the end on the second line? could you please advise? thanks
my $file = 'chrome.dat';
my %hash;
open my $FH, '<', $file or die "Unable to open $file: $!, stopped ";
while( my $line = <$FH> )
{   my ( $chr, $start, $end, $strand ) = split /\s+/, $line;
    # on the assumption there might be a chrom. x, strand 2
    my $key = $chr . ':' . $strand;
    if( exists $hash{$key} )
    {   $hash{$key}{'start'} = $start
            if( $start < $hash{$key}{'start'} );
        $hash{$key}{'end'} = $end
            if( $hash{$key}{'end'} < $end );
    }
    else
    {   $hash{$key}{'start'} = $start;
        $hash{$key}{'end'} = $end;
    }
}
close $FH or die "Unable to close $file: $!, stopped ";
my $outfile = 'chrome_out.dat';
open my $OFH, '>', $outfile or die "Unable to open $outfile: $!, stopped ";
for my $k (sort keys %hash)
{   my ( $chr, $strand) = split /:/, $k;
    printf $OFH "%s\t%s\t%s\t%s\n",
        $chr, $hash{$k}{'start'}, $hash{$k}{'end'}, $strand;
}
close $OFH or die "Unable to close $outfile: $!, stopped ";
HTH,
Mike
--
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
Re: help me with a parsing script please
On Thu, May 12, 2011 at 11:23 AM, Nathalie Conte <nac [at] sanger.ac.uk> wrote:
>> HI,
>>
>> I have this file format
>> chr start end strand
>> x 12 24 1
>> x 24 48 1
>> 1 100 124 -1
>> 1 124 148 -1
>>
>> Basically I would like to create a new file by grouping the start of the
>> first line (12) with the end of the second line (48) and so on
>> the output should look like this:
>> x 12 48 1
>> 1 100 148 -1
>>
>> I have this script to split and iterate over each line, but I don't know
>> how to group 2 lines together, and take the start of the firt line and the
>> end on the second line? could you please advise? thanks
>>
use strict;
use warnings;
my %first_start;
while (<>) {
    next if /^chr/; # skip header line
    chomp;
    my ($chr, $start, $end, $strand) = split;
    if ( exists $first_start{$chr} ) {
        print "$chr $first_start{$chr} $end $strand\n";
        # forget the stored start so the next pair on this chromosome starts fresh
        delete $first_start{$chr};
    }
    else {
        $first_start{$chr} = $start;
    }
}
Put this in a file, "combine-chromosomes" for example
and then:
perl combine-chromosomes infile.txt >grouped.txt
Hope it helps
- Pete
Re: help me with a parsing script please
On 12/05/2011 10:23, Nathalie Conte wrote:
>
> HI,
>
> I have this file format
> chr start end strand
> x 12 24 1
> x 24 48 1
> 1 100 124 -1
> 1 124 148 -1
>
> Basically I would like to create a new file by grouping the start of the
> first line (12) with the end of the second line (48) and so on
> the output should look like this:
> x 12 48 1
> 1 100 148 -1
>
> I have this script to split and iterate over each line, but I don't know
> how to group 2 lines together, and take the start of the firt line and
> the end on the second line? could you please advise? thanks
>
> unless (open(FH, $file)){
> print "Cannot open file \"$file\"\n\n";
> }
>
> my @list = <FH>;
> close FH;
>
> open(OUTFILE, ">grouped.txt");
>
>
> foreach my $line( @list){
> chomp $line;
> my @coordinates = split(/' '/, $region);
> my $chromosome = $coordinates[0];
> my $start = $coordinates[1];
> my $end = $coordinates[2];
> my $strand = $coordinates[3];
> ...???
Hi Nathalie
I have written something that should work for you. It includes basic
checks (that the chromosome and strand fields in the two lines match,
and that the end field of the first line matches the start field of the
second line). You may want to add more, depending on how much you trust your
data.
HTH,
Rob
use strict;
use warnings;
while (my $line1 = <DATA>) {
    my $line2 = <DATA>;
    last unless defined $line2;
    my @data = (
        [ split ' ', $line1 ],
        [ split ' ', $line2 ],
    );
    die unless $data[0][0] eq $data[1][0];
    die unless $data[0][3] == $data[1][3];
    die unless $data[0][2] == $data[1][1];
    $data[0][2] = $data[1][2];
    print "@{$data[0]}\n";
}
__DATA__
x 12 24 1
x 24 48 1
1 100 124 -1
1 124 148 -1
**OUTPUT**
x 12 48 1
1 100 148 -1
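If you do add more checks, it may help to have each one report what went wrong rather than using a bare die. The following is only a sketch along those lines: the function name merge_pair, the field-count check and the message wording are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two whitespace-separated records into one,
# dying with a message that names the failed check.
sub merge_pair {
    my ( $line1, $line2 ) = @_;
    my @first  = split ' ', $line1;
    my @second = split ' ', $line2;
    die "Expected 4 fields per line\n"
        unless @first == 4 && @second == 4;
    die "Chromosome mismatch: $first[0] vs $second[0]\n"
        unless $first[0] eq $second[0];
    die "Strand mismatch: $first[3] vs $second[3]\n"
        unless $first[3] == $second[3];
    die "End $first[2] does not match next start $second[1]\n"
        unless $first[2] == $second[1];
    # Start of the first line, end of the second line.
    return "$first[0] $first[1] $second[2] $first[3]";
}

print merge_pair( "x 12 24 1", "x 24 48 1" ), "\n";
```

The caller can wrap merge_pair in eval if a bad pair should be reported and skipped instead of stopping the whole run.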