How can I use the until loop properly in Perl?

I have a file that I want to filter. It looks like this:
##matrix=axtChain 16 91,-114,-31,-123,-114,100,-125,-31,-31,-125,100,-114,-123,-31,-114,91
##gapPenalties=axtChain O=400 E=30
chain 21455232 chr20 14302601 + 37457 14119338 chr22 14786829 + 3573 14759345 1
189 159 123
24 30 22
165 21 20
231 105 0
171 17 19
261 0 2231
222 2 0
253 56 48
chain 164224 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
221 352 334
24 100 112
34 56 56
26 50 47
…………………….
chain 143824 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
So, briefly, there are blocks separated by a blank line.
Each block begins with a line "chain xxxxx" and continues with lines of numbers.
I want to filter the file and keep just the blocks in which the number that follows chain is greater than 3000.
I wrote the following script to do that:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX;
my $chain = $ARGV[0];
#It filters the chains with chains >= 3000.
open my $chain_file, $chain or die "Could not open $chain: $!";
my @array;
while( my $cline = <$chain_file>) {
    #next if /^\s*#/;
    chomp $cline;
    #my @lines = split (/ /, $cline);
    if ($cline =~/^chain/) {
        my @lines = split (/\s/, $cline);
        if ($lines[1] >= 3000) {
            #print $lines[1];
            #my @lines = split (/ /, $cline);
            #print "$cline\n";
            push (@array, $cline);
        }
    }
    until ($cline ne ' ') {
        push (@array, $cline);
    }
    foreach (@array) {
        print "$_\n";
    }
    undef(@array);
}
The problem is that I can print just the headers (chain XXXXX…..) and not the numbers that follow on the next lines of each block.
I'm using the until loop to read until it finds the blank line, but it doesn't work.
If someone could help me with that….
Thank you very much in advance,
Vasilis.

The first problem here is that ' ' is a single space, not a blank line ("" or '' should be fine since you've already chomp-ed the line).
The second problem is that
until ( $cline ne "" )
is the same as
while ( $cline eq "" )
which is the opposite of what you need to push lines to @array.
That said, the flip-flop operator is probably a more suitable construct for what you're after:
my @array;
while ( <$chain_file> ) {    # Using $_ instead of $cline
    chomp;
    if ( do { /^chain\s+(\d+)/ && $1 >= 3000 } .. /^$/ ) {
        # Accumulate lines in @array
        push @array, $_;     # False until LHS evaluates to true ...
    }                        # ... then true until RHS evaluates to true
    else {
        for ( @array ) {
            print $_, "\n";  # Print matches
        }
        @array = ();         # Reset/clear out @array
    }
}
print "$_\n" for @array;     # Print a final block that ran to end of file

It's usually best not to use until instead of while. It negates the boolean expression, often leaving you with a double negative to solve. Here's an example:
while ( $foo ne $bar ) {
Even though this is a negative expression, I can pretty easily figure out when to exit my loop. However:
until ( $foo eq $bar ) {
Just takes time to figure out.
Also, ' ' does not make a blank line: Use the regular expression $cline =~ /^\s*$/. However, even beyond that the loop:
until ($cline ne ' ') {
    push (@array, $cline);
}
will go on forever if $cline does equal a single space, because the body never changes the value of $cline.
You can use what I used to call state variables (until Perl actually created a variable type called state, and now I have no idea what to call them). This variable tracks where you are in your file. Are you inside a chain section of the file? Do you want these lines or not? This way, you only have a single loop. You set your state variables and then process your loop.
In this example, I have a state variable called $keep_lines, which tracks whether or not I want to keep the lines I read in. If the line starts with chain and the second field is greater than 3000, I want to keep the entire block (if I understand what you're attempting to do). (By the way, I'm keeping blank lines. Is that okay?)
my $keep_lines = 0;  # Aren't in lines I want to keep
my @good_lines;      # Array where you want to keep the good lines
while ( my $line = <$chain_file> ) {
    chomp $line;     # Good habit to always chomp input as soon as it's read.
    if ( $line =~ /^chain/ ) {   # This is a chain line. Do I want to keep this group?
        my @fields = split /\s+/, $line;
        if ( $fields[1] > 3000 ) {   # Yes, if the second field is greater than 3000
            $keep_lines = 1;     # Keep this and all other lines until the next chain line
        }
        else {
            $keep_lines = 0;     # Skip until the next chain line
        }
    }
    if ( $keep_lines ) {
        push @good_lines, $line;
    }
}
I also smell a function here: Instead of the tangle of if clauses, I would probably make this a function that returns the value I set $keep_lines to.
while ( my $line = <$chain_file> ) {
    chomp $line;     # Good habit to always chomp input as soon as it's read.
    $keep_lines = keep_this_section( $line );
    if ( $keep_lines ) {
        push @good_lines, $line;
    }
}
Simplifies the code quite a bit.
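Here is a minimal sketch of what such a function could look like. The name keep_this_section comes from the loop above; everything else (the closure variable $keeping, the sample lines) is an assumption made for illustration. Because non-chain lines have to inherit the decision made at the last chain line, the function remembers that decision between calls:

```perl
use strict;
use warnings;

{
    my $keeping = 0;    # decision made at the most recent chain line

    # Assumed helper: true while the current line belongs to a block we keep.
    sub keep_this_section {
        my ($line) = @_;
        if ( $line =~ /^chain\s+(\d+)/ ) {
            $keeping = $1 > 3000 ? 1 : 0;    # 3000 is the limit from the question
        }
        return $keeping;
    }
}

# Made-up sample lines standing in for the file.
my @sample = ( 'chain 21455232 chr20', '189 159 123', 'chain 1642 chr20', '221 352 334' );
my @good_lines = grep { keep_this_section($_) } @sample;
print "$_\n" for @good_lines;    # the first chain line and the numbers below it
```

Keeping the state inside the function means the calling loop no longer needs to know how a section is recognised.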
I would also declare some constants to remove those Mysterious Moes. Those are things like 3000 and /^chain/ that have some sort of mysterious, but important meaning in your program. You can use constant as a pragma to define Perl constants. It's part of standard Perl:
use constant {
KEEP_LIMIT => 3000,
SECTION_HEADER => qr/^chain/,
};
Now, I can do things like this:
if ( $line =~ SECTION_HEADER ) {
instead of:
if ( $line =~ /^chain/ ) {
and
if ( $fields[1] > KEEP_LIMIT ) {
instead of
if ( $fields[1] > 3000 ) {
There are problems with the constant pragma. The biggest is that it just doesn't interpolate in places where Perl will normally interpolate variables. This includes double-quoted strings and hash keys. If I have $foo{KEEP_LIMIT}, Perl will interpret the key as the string KEEP_LIMIT and not as the constant KEEP_LIMIT.
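A small sketch of the usual workarounds, in case it helps. The hash %label_for and the helper lookup_demo are made up for illustration; the +KEEP_LIMIT and KEEP_LIMIT() forms are the ones the constant documentation suggests for hash keys, and @{[ ... ]} is a common idiom for getting an expression into a double-quoted string:

```perl
use strict;
use warnings;

use constant KEEP_LIMIT => 3000;

# Hypothetical hash keyed by the numeric limit.
my %label_for = ( 3000 => 'limit' );

sub lookup_demo {
    return (
        bare   => ( exists $label_for{KEEP_LIMIT}     ? 1 : 0 ),  # key is the string "KEEP_LIMIT"
        plus   => ( exists $label_for{ +KEEP_LIMIT }  ? 1 : 0 ),  # unary plus forces an expression
        call   => ( exists $label_for{ KEEP_LIMIT() } ? 1 : 0 ),  # explicit call also works
        string => "limit is @{[ KEEP_LIMIT ]}",                   # interpolate via an array ref
    );
}

my %result = lookup_demo();
print "$_: $result{$_}\n" for sort keys %result;
```

Only the bare form fails to find the key, which is exactly the trap described above.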
Many developers use Readonly which is just so much better in so many ways. Unfortunately, Readonly isn't a standard Perl module, so you have to install it via CPAN, and that's sometimes not possible or desirable to do. So, I tend to use constant.

Related

Extract and filter a range of lines from the input using Perl

I'm quite new to Perl and I have some problems in skipping lines using a foreach loop. I want to copy some lines of a text file to a new one.
When the first words of a line are FIRST ITERATION, skip two more lines and print everything following until the end of the file or an empty line is encountered.
I've tried to find out a similar post but nobody talks about working with text files.
This is the form I thought of
use 5.010;
use strict;
use warnings;
open( INPUT, "xxx.txt" ) or die("Could not open log file.");
open( OUT, ">>yyy.txt" );
foreach my $line (<INPUT>) {
if ( $line =~ m/^FIRST ITERATION/ ) {
# print OUT
}
}
close(OUT);
close(INFO);
I tried using next and $line++ but my program prints only the line that begins with FIRST ITERATION.
I may try to use a for loop but I don't know how many lines my file may have, nor do I know how many lines there are between "First Iteration" and the next empty line.
The simplest way is to process the file a line at a time and keep a state flag, which is set to 1 if the current line begins with FIRST ITERATION and to 0 if it is blank. Otherwise it is incremented if it is already positive, so that it provides a count of the line number within the current block.
This solution expects the path to the input file as a parameter on the command line and prints its output to STDOUT, so you will need to redirect the output to a file on the command line as necessary.
Note that the regex pattern /\S/ checks whether there is a non-blank character anywhere in the current line, so not /\S/ is true if the line is empty or consists only of blank characters.
use strict;
use warnings;
my $lines = 0;
while ( <> ) {
if ( /^FIRST ITERATION/ ) {
$lines = 1;
}
elsif ( not /\S/ ) {
$lines = 0;
}
elsif ( $lines > 0 ) {
++$lines;
}
print if $lines > 3;
}
This can be simplified substantially by using Perl's built-in range operator, which keeps its own internal state and returns the number of times it has been evaluated. So the above may be written
use strict;
use warnings;
while ( <> ) {
my $s = /^FIRST ITERATION/ ... not /\S/;
print if $s and $s > 3;
}
And the last can be rewritten as a one-line command line program like this
$ perl -ne '$s = /^FIRST ITERATION/ ... not /\S/; print if $s and $s > 3' myfile.txt
Use an additional counter that decides when a line should be printed. Something like this:
$skipCounter = 3;
And in the foreach loop:
if ($skipCounter == 2) {
    # print OUT
}
if ( $line =~ m/^FIRST ITERATION/) {
    $skipCounter = 0;
}
$skipCounter++;
Advice: Use STDIN and STDOUT instead of files; this allows you to change them without modifying the script.
Code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
open(INPUT, "xxx.txt" ) or die "Could not open log file: $!.";
open(OUT, ">yyy.txt") or die "Could not open output file: $!";
while( my $line = <INPUT> )
{
if ( $line =~ m/^FIRST ITERATION/) {
<INPUT>; # skip line
<INPUT>; # skip line
while( $line = <INPUT>) # print till empty line
{
last if $line eq "\n";
print OUT $line;
}
};
};
close (OUT);
close (INPUT);
You're on the right track. What you need to use is the flip-flop operator (which is basically the range operator) ... It will toggle for you between two matches, so you get everything in between. After that, it's a matter of keeping track of the lines you want to skip.
So basically we are checking for FIRST ITERATION and for an empty line, and grab everything in between those. $skip is used to remember how many lines were skipped. It starts at 0 and gets incremented for the first two lines after we start being in the flip-flop if block. In the else case, where we are after the flip-flop, it gets reset to 0 so we can start over with the next block.
Since you know how to open and write files, I'll skip that.
use strict;
use warnings;
my $skip = 0;
while (<DATA>) {
    if (/^FIRST ITERATION$/ .. /^$/) {
        next if $skip++ <= 2;
        print $_;
    } else {
        $skip = 0;
    }
}
__DATA__
FIRST ITERATION
skip1
skip2
foo
bar
baz

don't print this
The output of this is:
foo
bar
baz
To stick with your own code, here's a very verbose solution that uses a foreach and no flip-flop. It does the same thing, just with a lot more words.
my $skip = 0;  # skip lines
my $match = 0; # keep track of if we're in between the borders
foreach my $line (<DATA>) {
    if ( $line =~ m/^FIRST ITERATION/ ) {
        $match = 1; # we are inside the match
        next;
    }
    if ($line =~ m/^$/) {
        $match = 0; # we are done matching
        $skip = 0;  # start counting again for the next block
        next;
    }
    if ($match) {
        $skip++; # count skip-lines
        if ($skip <= 2) {
            next; # ... and skip the first two
        }
        print $line; # this is the content we want
    }
}
Using paragraph mode (which returns blocks separated by blank lines rather than lines):
local $/ = ""; # Paragraph mode.
while (<>) {
s/\n\K\n+//; # Get rid of trailing empty lines.
print /^FIRST ITERATION\n[^\n]*\n[^\n]*\n(.*)/ms;
}
Using the flip-flop operator:
while (<>) {
if (my $line_num = /^FIRST ITERATION$/ .. /^$/) {
print if $line_num > 3 && $line_num !~ /E0/;
}
}
$line_num !~ /E0/ is true when the flip-flop is flopping (i.e. for the first empty line after FIRST ITERATION). This is checked to avoid printing the blank line.
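To see what the flip-flop actually returns, here is a small sketch. The helper name flip_flop_values and the sample lines are made up for illustration; the behaviour shown (an empty string while false, a sequence number while true, and "E0" appended on the line that ends the range) is what perlop documents for the scalar range operator:

```perl
use strict;
use warnings;

# Return the flip-flop's value for each (already chomped) line.
sub flip_flop_values {
    my @lines = @_;
    my @values;
    for (@lines) {
        my $n = /^FIRST ITERATION$/ .. /^$/;    # scalar context, so this is a flip-flop
        push @values, $n;
    }
    return @values;
}

my @values = flip_flop_values( 'before', 'FIRST ITERATION', 'skip1', 'skip2', 'foo', '', 'after' );
print join( ',', map { "[$_]" } @values ), "\n";    # prints [],[1],[2],[3],[4],[5E0],[]
```

Compared numerically, "5E0" is still 5, which is why the code above needs the separate !~ /E0/ test to leave out the closing blank line.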

Passing multiple values to a Perl program in a single command line argument

I have this Perl program which picks data from specific columns starting from a certain row.
#!/usr/bin/perl
# This script is to pick the specific columns from a file, starting from a specific row
# FILE -> Name of the file to be passed at run time.
# rn -> Number of the row from where the data has to be picked.
use strict;
use warnings;
my $file = shift || "FILE";
my $rn = shift;
my $cols = shift;
open(my $fh, "<", $file) or die "Could not open file '$file' : $!\n";
while (<$fh>) {
$. <= $rn and next;
my @fields = split(/\t/);
print "$fields[$cols]\n";
}
My problem is that I am only able to get one column at a time. I want to be able to specify a selection of indices like this
0, 1, 3..6, 21..33
but it's giving me only the first column.
I am running this command to execute the script
perl extract.pl FILE 3 0, 1, 3..6, 21..33
In the absence of any other solutions I am posting some code that I have been messing with. It works with your command line as you have described it by concatenating all of the fields after the first and removing all spaces and tabs.
The column set is converted to a list of integers using eval, after first making sure that it consists of a comma-separated list of either single integers or start-end ranges separated by two or three full stops.
use strict;
use warnings;
use 5.014; # For non-destructive substitution and \h regex item
my $file = shift || "FILE";
my $rn = shift || 0;
my $cols = join('', @ARGV) =~ s/\h+//gr;
my $item_re = qr/ \d+ (?: \.\.\.? \d+)? /ax;
my $set_re = qr/ $item_re (?: , $item_re )* /x;
die qq{Invalid column set "$cols"} unless $cols =~ / \A $set_re \z /x;
my @cols = eval $cols;
open my $fh, '<', $file or die qq{Couldn't open "$file": $!};
while (<$fh>) {
    next if $. <= $rn;
    my @fields = split /\t/;
    print "@fields[@cols]\n";
}
My problem is that I am only able to get one column at a time
You don't understand what perl is passing to your program from the command line:
use strict;
use warnings;
use 5.016;
my $str = "1..3";
my $x = shift @ARGV; # $ perl myprog.pl 1..3
if ($str eq $x) {
    say "It's a string";
}
else {
    say "It's a range";
}
my @cols = (0, 1, 2, 3, 4);
say for @cols[$str];
--output:--
$ perl myprog.pl 1..3
Scalar value @cols[$str] better written as $cols[$str] at 1.pl line 16.
It's a string
Argument "1..3" isn't numeric in array slice at 1.pl line 16.
1
Anything you write on the command line will be passed to your program as a string, and perl won't automatically convert the string "1..3" into the range 1..3 (in fact your string would be the strange looking "1..3,"). After throwing some errors, perl sees a number on the front of the string "1..3", so perl converts the string to the integer 1. So, you need to process the string yourself:
use strict;
use warnings;
use 5.016;
my @fields = (0, 1, 2, 3, 4);
my $str = shift @ARGV; # perl myprog.pl 0,1..3 => $str = "0,1..3"
my @cols = split /,/, $str;
for my $col (@cols) {
    if($col =~ /(\d+) [.]{2} (\d+)/xms) {
        say @fields[$1..$2]; # $1 and $2 are strings but perl will convert them to integers
    }
    else {
        say $fields[$col];
    }
}
--output:--
$ perl myprog.pl 0,1..3
0
123
Perl presents the parameters entered on the command line in an array called @ARGV. Since this is an ordinary array, you can use the number of elements in it to get additional information. Outside a subroutine, shift takes values from the beginning of @ARGV when you don't give it any arguments.
You could do something like this:
my $file = shift; # Adding || "FILE" doesn't work. See below
my $rn = shift;
my @cols = @ARGV;
Instead of cols being a scalar variable, it's now an array that can hold all of the columns you want. In other words, the first parameter is the file name, the second parameter is the row, and the last set of parameters are the columns you want:
while (<$fh>) {
    next if $. <= $rn;
    my @fields = split(/\t/);
    for my $column ( @cols ) {
        printf "%-10.10s", $fields[$column];
    }
    print "\n";
    last; # You printed the row. Do you want to stop?
}
Now, this isn't as fancy pants as your way of doing it where you can give ranges, etc, but it's fairly straight forward:
$ perl extract.pl FILE 3 0 1 3 4 5 6 21 22 23 24 25 26 27 28 29 30 31 32 33
Note I used printf instead of print so all of the fields will be the same width (assuming that they're strings and none is longer than 10 characters).
I tried looking for a Perl module that will handle range input like you want. I'm sure one exists, but I couldn't find it. You would still accept the range specification in @cols as I showed above, and then parse @cols to get the actual columns.
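Parsing the ranges by hand is not much code, though. Below is a rough sketch that avoids eval; the function name expand_columns and the decision to die on anything unexpected are my assumptions, not part of the original script. It accepts single integers and start..end ranges separated by commas:

```perl
use strict;
use warnings;

# Turn a specification such as "0,1,3..6" into a flat list of column indices.
sub expand_columns {
    my ($spec) = @_;
    $spec =~ s/\s+//g;                       # tolerate "0, 1, 3..6"
    my @cols;
    for my $item ( split /,/, $spec ) {
        if ( $item =~ /\A(\d+)\.\.(\d+)\z/ ) {
            push @cols, $1 .. $2;            # numeric range
        }
        elsif ( $item =~ /\A\d+\z/ ) {
            push @cols, $item;               # single column
        }
        else {
            die qq{Invalid column item "$item"\n};
        }
    }
    return @cols;
}

# The shell splits "0, 1, 3..6, 21..23" into separate arguments, so join them first.
my @cols = expand_columns( join '', '0,', '1,', '3..6,', '21..23' );
print "@cols\n";    # prints 0 1 3 4 5 6 21 22 23
```

Each row can then be printed with a slice such as @fields[@cols].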
What's wrong with my $file = shift || "FILE";?
In your program, you're assuming three parameters. That means you need a file, a row, and at least one column parameter. You will never have a situation where not giving a file name will work since it means you don't have a row or a set of columns to print out.
So, you need to look at $#ARGV (the last index of @ARGV) and verify that @ARGV has at least three values in it. If it doesn't have three values, you need to decide what to do at that point. The easy solution is to just abort the program with a little message telling you the correct usage. You could verify if there are one, two, or three parameters and decide what to do there.
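As a sketch of the "easy solution", assuming the argument order used above (file, row, then one or more columns). The helper name parse_args and the wording of the usage message are made up:

```perl
use strict;
use warnings;

# Validate the argument list and split it into its three parts.
sub parse_args {
    my @args = @_;
    die "Usage: $0 FILE ROW COLUMN [COLUMN ...]\n" if @args < 3;
    my ( $file, $row, @cols ) = @args;
    return ( $file, $row, \@cols );
}

# In the real script you would pass @ARGV; a fixed list keeps the sketch self-contained.
my ( $file, $row, $cols ) = parse_args( 'FILE', 3, 0, 1 );
print "file=$file row=$row cols=@$cols\n";
```

Comparing @args (the element count) against 3 says the same thing as checking that $#ARGV is at least 2.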
Another idea is to use Getopt::Long which will allow you to use named parameters. You can load the parameters with pre-defined defaults, and then change when you read in the parameters:
...
use Getopt::Long;
my $file = "FILE";     # File has a default
my ( $rows, @cols );   # No default values
my $help;              # Allow user to request help
GetOptions (
    "file=s" => \$file,
    "rows=i" => \$rows,
    "cols=i" => \@cols,
    "help"   => \$help,
);
if ( $help ) {
    print_help();
}
if ( not defined $rows ) {
    error_out ( "Need to define which row to fetch" );
}
if ( not @cols ) {
    error_out ( "Need to define which columns" );
}
The user could call this via:
$ perl extract.pl -file FILE -row 3 -col 0 -col 1 3 4 5 6 21 22 23 24 25 26 27 28 29 30 31 32 33
Note that with "cols=i", each -col takes a single value, so values that follow without their own -col are left in @ARGV. To let one -col take several values, declare the option as "cols=i{1,}" (Getopt::Long documents this repeat specifier as experimental). Also note I could, if I want, repeat -col for each column.
By the way, if you use Getopt::Long, you might as well use Pod::Usage. POD stands for Plain Old Documentation, which is Perl's way of documenting how a program is used. Might as well make this educational. Read up on POD Documentation, the POD Specifications, and the standard POD Style. This is how you document your Perl programming. You can use the perldoc command (betcha you didn't know it existed) to print out the embedded Perl POD documentation, and use Pod::Usage to print it out for the user.

Efficient way to read columns in a file using Perl

I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to count the columns (vertically), first with first, second with second etc etc.
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<@seq_prot>){
    chomp;
    my @sequence = map substr (@seq_prot, 1, 1), $start .. $end;
    @sequence = split;
}
My idea was to use the substring to get the first letter of the input (A in this case), and it would cycle for all the other letters (The second A and B). Then I would increment the cycle number + 1 so as to get the next line, until I reached the end. Of course I can't seem to get the first part going, so any help is greatly appreciated, am stumped on this one.
Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;
die "Usage: $0 filename\n" if @ARGV != 1;
for (transpose([map {chomp; [split //]} <>])) {
    print join("", map {$_ // " "} @$_), "\n"
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S
You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my $line = "ABC";
my @split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter and appended it to the matching element of a @vertical_array?
my @vertical_array;
for my $index ( 0..$#split_line ) {
    $vertical_array[$index] .= $split_line[$index];
}
Now let's do this with the next line:
$line = "123";
@split_line = split //, $line;
for my $index ( 0..$#split_line ) {
    $vertical_array[$index] .= $split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building @vertical_array with each iteration:
use strict;
use warnings;
use autodie;
use feature qw(say);
my @vertical_array;
while ( my $line = <DATA> ) {
    chomp $line;
    my @split_line = split //, $line;
    for my $index ( 0..$#split_line ) {
        $vertical_array[$index] .= $split_line[$index];
    }
}
#
# Print out your vertical lines
#
for my $line ( @vertical_array ) {
    say $line;
}
__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.
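For what it's worth, here is a rough sketch of that array-of-arrays variant. The function name transpose_lines and the padding character are my own choices; padding is just one way to cope with lines of different lengths:

```perl
use strict;
use warnings;

# Column N of the input strings becomes row N of the output.
# Lines shorter than the longest one are padded with $pad.
sub transpose_lines {
    my ( $pad, @lines ) = @_;
    my @rows = map { [ split //, $_ ] } @lines;    # array of arrays, one per line
    my $width = 0;
    for my $row (@rows) {
        $width = @$row if @$row > $width;          # length of the longest line
    }
    my @out;
    for my $col ( 0 .. $width - 1 ) {
        push @out, join '', map { $_->[$col] // $pad } @rows;
    }
    return @out;
}

print "$_\n" for transpose_lines( ' ', qw(ABC 12 XYZ) );
```

With the three sample strings this prints A1X, B2Y and then "C Z", the gap coming from the short middle line.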

Could i search between keys of a hash and assign its value to a variable in Perl?

I want to use the substr function to extract some nucleotides from a set of sequences. Here I have the FASTA format of those sequences:
>dvex28051
AAAACAAAAACATTCGCTAGAAAGTAATCAGCTGGTCATTTATTTGAAATGTTAATGATATATTTCATGTTGCTAATTTTTTATGAAAAAAATCATTGCTTATTTAATTACTCTTGGTTCTTGACCAACTATAAAAGCATTGTTTAGTATCAAGTGTCCAGGTATCAGCAGTTTTGTTTGAAAACAAACTTTTATTCATGCAGTCAGTGGCGGATCCAGGTAGAGTGCAGAGGCAGCACCCTCCGTCAGAAAACCAAAAAAAGAAGAAATGAAAAATTATAAAAAAAATTTCTAAACGTTGGTGCACTTAAGTGTAGCAAAAAATTCCTGTTTAGATATTCAGTGGGGAGCGACACCTTTTGGGGCCTATAGCTTCAAATCTTACTTGGTGACCTAAAATCGCTTTTTCGTTGGATCTGCGAAAGCTAGAATTTGGTTGCTGCAAATCGAATCGGTGCATCAACTGCATCAATATCAACGATGTGGTGACTGGTGGTATATTTTGGGTTCGTGCAATGCTACATTTATTTCAATCATATTTCAAGGCAGAAAGGGAAAGAAAACATCAGGTCAAGACAGTGGCGTAGCGAGGGAAGGGGGGCATACGTCCCCGGGCGCAACACGATGTCTTTTTTTTTAATCATCTGCGAAATTCAGACATTTTTTAGAGACTAAATGAAACTATGGAAAACCGGGCCCTTATAAAAGTTGAGACCAAGTGAAAAACTGGGGATAAAACATGAAAATCGGGCTCCAAAAGAATGAGAGTCCGCCCTTGGTCTGTACCAGCATGATTTGAGCGCAAATTTCATTAAGCCCCCGGGCGCAAGACACTCACGCTACGCCCCTGGGTAAAGACAAACAGAGTAGTTTTTCTTATAAACACAAGCATGCACAAACAACATAAAAACAAAACACAGTTTTTTTTAAGACGATGTGCTGCGTGCACCCGCTCAATGTTTTTTTTTTTTTTTTATAGAAAAGCAAAACTTTGAAAGGTTAACGTCAACTCATTTTACAACAATTTGTGGCAAATGGTATCAAGGTATCAAGCAATTAACTAAATGTCTTCCACTAGAACGCAGAACACCATTTTGCAATTATTTATTTGATGTAAACCAGTGTGTTAGATCAAAATCACTTCGACGCCGTTTTTTGACTCCGTGAAAATCTTGGTATTCTTCTCGCATTGCATAATGATGGTTTGTTGAAATAAAATTAAACGCTTAACGTTCTTAAAATGAGCGCGATACTACTTTTCTTTGTAGATTTTCTGCATGCGCTCCTTTTAAGTTGATCCCGAGCTACAAACTTCTTTATGAACGTTTTGGATTTCTCCAAAATAAAGCCTGCAAGCAGTTTTCTAAAAACACCGCACCCCCCATTAGGAATTTCTAGATCCGCCCCTGCATACAGTATTTGTTAATTATTAAAACCAACCAGCAGCAATTGTTTATTCAATGACTATTAAACCAACCTGGATAGTGCGTTTGGTCTTGATTGAAGCGATTGCTGCATTGACGTCTTTCGGAACCACATCACC
>dvex294195
GAATCAGTGGAAAAGTCACAACGCAGCTTGCCGAATTACTGCAGATTCTTTACACTTTTTTTTCTACATTATCACTGTTTTGCTTAATTTTCAATTATAGAAATCAAAATTAATAACTGGTATGTAGTTGGTCGGTGCTTCGAGAAAGTAGCCTACTCAATGATTTCTCAGAATGTTACAGTACTTCAAAAAAACAGACTACCCATTTCAAAAAATATAAACCTAGTA
I want to compare each key of the hash with the Hit column (dvex\d++) of this table:
#Query Hit sense start end star_q end_q lenght_q # this line is informative don't make part of the code.
miRNA1 dvex28051 + 205 232 11 38 51
miRNA1 dvex202016 - 75 106 17 48 51
miRNA1 dvex294195 + 55 85 11 48 51
If the key exists, I want to assign its hash value to a variable (i.e. $sequence) in order to apply the substr function:
my $fragment = substr $sequence, $start, $length_sequence;
I made an array with the sequences, and tried to read it two values at a time and compare them:
while (my $line1 = <$MYINPUTFILE>){ #Entry of the sequences Fasta file
    chomp $line1;
    push @array_lines, $line1;
}
while (my $line2 = <$IN>){ #Entry of the table
    chomp $line2;
    push @database_lines, $line2;
}
foreach my $database_line (@database_lines){ #each value of the table
    my @entry = split /\s++/,$database_line;
    $pattern = $entry[1];
    $query = $entry[0];
    $start = $entry[3];
    $l_pattern = length $pattern;
    $end = $entry[4];
    $lng_sequence = ($end - $start) + 1;
    $sense = $entry[2];
    $l_query = $entry[7];
    my $count = 2;
    for (my $i = 0; $i <= $#array_lines; $i +=$count){
        chomp $array_lines[$i-2];
        chomp $array_lines[$i-1];
        $seq = $array_lines[$i-1];
        $header = $array_lines[$i-2];
        if($new_header =~ /$pattern/ && $l_header == $l_pattern){
            if(($end+$right_diff+$increment) > $l_query){
                $clean_seq = substr $seq, $start, $l_query;
            } else {;}
        }
The problem with my code is that Perl treats $seq as the last sequence, and always applies the substr function to that $seq. I need to look up $pattern among those sequences and, if it exists, assign the matching sequence to $seq and then apply the substr function.
Any suggestions?
I see two significant problems with your code. First, in the loop:
for (my $i = 0; $i <= $#array_lines; $i +=$count){
chomp $array_lines[$i-2];
chomp $array_lines[$i-1];
$seq = $array_lines[$i-1];
$i is set to zero the first time through, but you access array elements $i-1 and $i-2. Element -1 will be the last element of the array, and -2 will be the second to the last element. So it looks like $seq and $header will have incorrect values the first time through your loop. Maybe you need to start $i at $count instead of zero?
Secondly, in this line:
if(($end+$right_diff+$increment) > $l_query){
$increment appears only here in your code. It is never set to anything. Did you mean to use $i here?
A few other suggestions:
Make sure you use warnings; use strict; This will catch errors such as the $increment variable above.
Here is a simpler way to read a file into an array:
my @array_lines = <$MYINPUTFILE>;
chomp @array_lines;
Within regexes, ++ is a special quantifier that disables backtracking. If you want to split on one or more whitespace characters, it is more typical to use split /\s+/, or the similar split ' ' (which also skips leading whitespace).
With this line, you appear to be simply checking that two strings are equal:
if($new_header =~ /$pattern/ && $l_header == $l_pattern)
You could just do this instead:
if($new_header eq $pattern)
When you have multiple conditions, it is clearer to put them all in one if statement instead of using nested statements. If you have many conditions, you can put them on multiple lines for clarity.
It isn't necessary to use else {;} If you don't need to do anything there, just omit the else clause altogether.
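Since the title asks about looking things up in a hash, here is a rough sketch of that approach: read the FASTA records into a hash keyed by identifier, then use the Hit column of each table row as the key. The function names, the short made-up sequences, and the assumption that start and end are 1-based inclusive coordinates are all mine; adjust them to your data:

```perl
use strict;
use warnings;

# Build a hash of id => sequence from FASTA lines (headers start with '>').
sub parse_fasta {
    my @lines = @_;
    my ( %seq_for, $id );
    for my $line (@lines) {
        chomp $line;
        if ( $line =~ /^>(\S+)/ ) {
            $id = $1;
            $seq_for{$id} = '';
        }
        elsif ( defined $id ) {
            $seq_for{$id} .= $line;    # sequences may span several lines
        }
    }
    return %seq_for;
}

# Return the fragment for one table row, or undef if the Hit is not in the hash.
sub fragment_for {
    my ( $seq_for, $row ) = @_;
    my ( $query, $hit, $sense, $start, $end ) = split ' ', $row;
    return undef unless exists $seq_for->{$hit};
    return substr $seq_for->{$hit}, $start - 1, $end - $start + 1;    # assumes 1-based coordinates
}

my %seq_for = parse_fasta( '>dvexA', 'AAACCCGGGTTT', '>dvexB', 'TTTT' );    # made-up records
for my $row ( 'miRNA1 dvexA + 4 6 11 38 51', 'miRNA1 dvexZ - 1 2 17 48 51' ) {
    my $fragment = fragment_for( \%seq_for, $row );
    print defined $fragment ? "$fragment\n" : "no sequence for this hit\n";
}
```

The hash lookup replaces the inner for loop entirely, so there is no index arithmetic left to get wrong.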

Perl: Removing duplicates from a large set of data

I'm using Perl to generate a list of unique exons (which are the units of genes).
I've generated a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Position 1 is the chromosome, position 2 is the starting coordinate of the exon, position 3 is the ending coordinate of the exon, and position 4 is the gene name.
Because genes are often constructed of different arrangements of exons, you have the same exon in multiple genes (see the first and fourth lines). I want to remove these "duplicates", i.e. delete gene1 or gene4 (it's not important which one gets removed).
I've bashed my head against the wall for hours trying to do what (I think) is a simple task. Could anyone point me in the right direction(s)? I know people often use hashes to remove duplicate elements, but these aren't exactly duplicates (since the gene names are different). It's important that I don't lose the gene name, also. Otherwise this would be simpler.
Here's a totally non-functional loop I've tried. The "exons" array has each line stored as a scalar, hence the subroutine. Don't laugh. I know it doesn't work but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar @exons; $i++) {
    my @temp_line = line_splitter($exons[$i]); # runs subroutine turning scalar into array
    for (my $j = 0; $j < scalar @exons_dup; $j++) {
        my @inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
        unless (($temp_line[1] == $inner_temp_line[1]) && # this loop ensures that the the loop
                ($temp_line[3] eq $inner_temp_line[3])) { # below skips the identical lines
            if (($temp_line[1] == $inner_temp_line[1]) && # if the coordinates are the same
                ($temp_line[2] == $inner_temp_line[2])) { # between the comparisons
                splice(@exons, $i, 1); # delete the first one
            }
        }
    }
}
my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4'
);
my %unique_exons = map {
    my ($chro, $scoor, $ecoor, $gene) = (split(/\s+/, $_));
    "$chro $scoor $ecoor" => $gene
} @exons;
print "$_ $unique_exons{$_} \n" for keys %unique_exons;
This will give you uniqueness, and the last gene name will be included. This results in:
chr1 1000 2000 gene4
chr1 5000 6000 gene3
chr1 3000 4000 gene2
You can use a hash to dedup en passant, but you need a way to join the parts you want to use to detect duplicates into a single string.
sub extract_dup_check_string {
    my $exon = shift;
    my @parts = line_splitter($exon);
    # modify to suit:
    my $dup_check_string = join( ';', @parts[0..2] );
    return $dup_check_string;
}
my %seen;
my @deduped_exons = grep { !$seen{ extract_dup_check_string($_) }++ } @exons;
You can use a hash to keep track of duplicates you've already seen and then skip them. This example assumes the fields in your input file are space-delimited:
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
while (my $line = <>) {
my($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
my $key = join ':', $chromosome, $exon_start, $exon_end;
if ($seen{$key}) {
next;
}
else {
$seen{$key}++;
print $line;
}
}
As simple as it comes. I tried to use as little magic as possible.
my %exoms = ();
my $input;
open( $input, '<', "lines.in" ) or die $!;
while( <$input> )
{
    if( $_ =~ /^(\w+\s+){3}(\w+)$/ ) #ignore lines that are not in expected format
    {
        my @splits = split( /\s+/, $_ ); #split line in $_ on multiple spaces
        my $key = $splits[1] . '_' . $splits[2];
        if( !exists( $exoms{$key} ) )
        {
            #could output or write to a new file here, probably output to a file
            #for large sets.
            $exoms{$key} = \@splits;
        }
    }
}
#demo to show what was parsed from demo input
while( my ($key, $value) = each(%exoms) )
{
    my @splits = @{$value};
    foreach my $position (@splits)
    {
        print( "$position " );
    }
    print( "\n" );
}