The system() of Perl "paused". Caused by $ARGV[]?

I got stuck when combining a BLAST command into a Perl script. The problem is that the command line pauses when PART II begins.
PART I crops the FASTA sequences.
PART II runs BLAST on the files generated by PART I.
Both parts run well individually, but hit the "pause" problem when combined.
I guess it is because the files named by $ARGV[1] and $ARGV[3], generated by PART I, cannot be used in PART II. I don't know how to fix it, though I have tried a lot.
Thanks!
#! /usr/bin/perl -w
use strict;
#### PART I
die "usage:4files fasta1 out1 fasta2 out2\n" unless @ARGV==4;
open (S, "$ARGV[0]") || die "cannot open FASTA file to read: $!";
open OUT,">$ARGV[1]" || die "no out\n";
open (S2, "$ARGV[2]") || die "cannot open FASTA file to read: $!";
open OUT2,">$ARGV[3]" || die "no out2\n";
my %s;# a hash of arrays, to hold each line of sequence
my %seq; #a hash to hold the AA sequences.
my $key;
print "how long is the N-terminal(give number,e.g. 30. whole length input \"0\") \n";
chomp(my $nl=<STDIN>);
##delete "\n" for seq.
local $/ = ">";
<S>;
while (<S>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT @amac,"\n";
}
else { # extract initial aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT @amac,"\n";
}
}
<S2>;
while (<S2>){ #Read the FASTA file.
chomp;
my @line=split/\n/;
print OUT2 ">",$line[0],"\n";
splice @line,0,1;
#print OUT join ("",@line),"\n";
#@line = join("",@line);
#print @line,"\n";
if ($nl == 0){ #whole length
my $seq=join("",@line);
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
#push @{$s{$key}},@amac;
print OUT2 @amac,"\n";
}
else { # extract initial aa by number ##Guanhua
my $seq=join("",@line);
#print $seq,"\n";
my @amac = split(//,$seq);
splice @amac,0,1; # delete first "MM"
splice @amac,$nl; ##delete from the N to end
#print @amac,"\n";
#push (@{$s{$key}}, @amac);
print OUT2 @amac,"\n";
}
}
##### PART II
print "nucl or prot?\n";
chomp(my $tp = <STDIN>);
system ("makeblastdb -in $ARGV[1] -dbtype prot");
system ("makeblastdb -in $ARGV[3] -dbtype $tp");
print "blast type? (blastp,blastn)\n";
chomp(my $cmd = <STDIN>);
system ("blastp -query $ARGV[1] -db $ARGV[3] -outfmt 6 -evalue 1e-3 -out 12.out ");
system ("$cmd -db $ARGV[1] -query $ARGV[3] -outfmt 6 -evalue 1e-3 -out 21.out ");

You changed the way perl reads from 'STDIN' when you set '$/' in this line:
local $/ = ">";
The easiest way to fix this is to add an opening brace right before that line and a closing brace just before the '##### PART II' comment, so that the change to '$/' only applies inside that block:
{
local $/ = ">";
...
...
}
##### PART II
(In theory you could instead type a ">" at the end of each answer you enter, but that seems strange, so I wouldn't do it.)
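To see the effect of the braces in isolation, here is a minimal sketch. It uses in-memory strings as stand-ins for the FASTA file and for STDIN (both made up for illustration), so it runs without any input files:

```perl
use strict;
use warnings;

# In-memory stand-ins for the FASTA file and for STDIN (both hypothetical).
my $fasta = ">seq1\nMKV\nLLA\n>seq2\nMTT\n";
my $typed = "prot\n";

open my $fa, '<', \$fasta or die "cannot open in-memory FASTA: $!";
open my $in, '<', \$typed or die "cannot open in-memory input: $!";

my @records;
{
    local $/ = ">";          # records end at ">" only inside this block
    <$fa>;                   # discard the empty chunk before the first ">"
    while (my $rec = <$fa>) {
        chomp $rec;          # chomp removes the trailing ">" here
        push @records, $rec;
    }
}                            # $/ is "\n" again from here on

chomp(my $tp = <$in>);       # reads one line, as PART II expects
print scalar(@records), " records, type '$tp'\n";
```

Without the braces, the last read would wait for a ">" instead of a newline, which is the "pause" described in the question.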
That will fix your problem. Some of your style choices should also be addressed. The two big chunks of code in the middle look identical as far as I can tell, so they should probably be put into a subroutine that is called twice. This eliminates duplication and is less error-prone.
You should also use the three-argument form of open to open files.
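As a rough sketch of both suggestions, the two loops could be folded into one subroutine that uses three-argument open. The name crop_fasta and its argument order are my own invention, and the cropping rule (drop the first residue, then keep $nl residues) is my reading of the code in the question:

```perl
use strict;
use warnings;

# Hypothetical helper: crop every record of one FASTA file.
# $nl == 0 keeps the whole sequence (minus the first residue, as in the question).
sub crop_fasta {
    my ($in_path, $out_path, $nl) = @_;
    open my $in,  '<', $in_path  or die "cannot read $in_path: $!";
    open my $out, '>', $out_path or die "cannot write $out_path: $!";

    local $/ = ">";                       # restored when the sub returns
    <$in>;                                # skip the chunk before the first ">"
    while (my $rec = <$in>) {
        chomp $rec;
        my ($header, @lines) = split /\n/, $rec;
        next unless defined $header;
        my $seq = join '', @lines;
        $seq = substr($seq, 1) if length $seq;    # drop the leading residue
        $seq = substr($seq, 0, $nl) if $nl;       # keep only the first $nl
        print {$out} ">$header\n$seq\n";
    }
    close $out or die "cannot close $out_path: $!";
    return;
}

# Intended use with the four command-line arguments from the question:
# crop_fasta($ARGV[0], $ARGV[1], $nl);
# crop_fasta($ARGV[2], $ARGV[3], $nl);
```

Because the output handle is closed inside the subroutine, the cropped files are flushed to disk before makeblastdb is started. Since local $/ sits inside the sub, the later reads from STDIN are not affected either.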

Related

print hashes with values from different files

I want to create output file that has values from file 1 and file 2.
The line from file 1:
chr1 Cufflinks exon 708356 708487 1000 - .
gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "5"; FPKM
"3.1300591420"; frac "1.000000"; conf_lo "2.502470"; conf_hi
"3.757648"; cov "7.589085"; chr1Cufflinks exon 708356
708487 . - . gene_id "XLOC_001284"; transcript_id
"TCONS_00007667"; exon_number "7"; gene_name "LOC100288069"; oId
"CUFF.15.2"; nearest_ref "NR_033908"; class_code "j"; tss_id
"TSS2981";
The line from file 2:
CUFF.48557
chr4:160253850-160259462:160259621-160260265:160260507-160262715
The second column from this file is unique id (uniq_id).
I want to get output file in the following format:
transcript_id(CUFF_id) uniq_id gene_id(XLOC_ID) FPKM
My script takes XLOC_ID and FPKM values from first file and print them together with two columns from the second file.
#!/usr/bin/perl -w
use strict;
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
my %fpkm_hash;
my %xloc_hash;
open (FILE, "$v_merge_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
if ($array[2] eq 'exon') {
my $id = $array[8];
if ($id =~ /transcript_id \"(CUFF\S+)/) {
$id = $1;
$id =~ s/\"//g;
$id =~ s/;//;
}
my $fpkm = $array[8];
if ($fpkm =~ /FPKM \"(\S+)/) {
$fpkm = $1;
$fpkm =~ s/\"//g;
$fpkm =~ s/;//;
}
my $xloc = $array[17];
if ($xloc =~ /gene_id \"(XLOC\S+)/) {
$xloc = $1;
$xloc =~ s/\"//g;
$xloc =~ s/;//;
}
$fpkm_hash{$id} = $fpkm;
$xloc_hash{$id} = $xloc;
}
}
}
close FILE;
open (FILE, "$unique_gtf") or die $!;
while (<FILE>) {
my $line = $_;
chomp $line;
if ($line =~ /[a-z]/) {
my @array = split("\t", $line);
my $id = $array[0];
my $uniq = $array[1];
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
}
}
close FILE;
I initialized the hashes outside of the loops, but I get the following warnings for each CUFF value:
CUFF.24093
chr17:3533641-3539345:3527526-3533498:3526786-3527341:3524707-3526632
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
Use of uninitialized value in concatenation (.) or string at ex_1.pl
line 55, line 9343.
How can I fix this issue?
Thank you!

I think the warning message is because the $id key, (CUFF.24093), you get on line 9343 of the second file isn't contained in the hashes you created in the first file.
Is it possible that an ID in the second file isn't contained in the first file? That seems to be the case here.
If so, and you just want to skip over this unknown ID, you could add a line to your program like:
my $id = $array[0];
my $uniq = $array[1];
next unless exists $fpkm_hash{$id}; # add this line
print $id . "\t" . $uniq . "\t" . $xloc_hash{$id} . "\t" . $fpkm_hash{$id} . "\n";
This will bypass the following print statement and go back to the top of the while loop and read in the next line and continue processing.
It depends on what action you want to take if you encounter an unknown ID.
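Here is a small self-contained sketch of that guard. The hash contents and the first uniq_id are made-up stand-ins for what the two files would provide:

```perl
use strict;
use warnings;

# Hypothetical data standing in for the hashes built from the first file.
my %fpkm_hash = ('CUFF.3' => '3.1300591420');
my %xloc_hash = ('CUFF.3' => 'XLOC_001284');

# Two rows standing in for the second file; CUFF.24093 has no entry above.
my @rows = (
    [ 'CUFF.3',     'chr1:708356-708487' ],
    [ 'CUFF.24093', 'chr17:3533641-3539345' ],
);

my @printed;
for my $row (@rows) {
    my ($id, $uniq) = @$row;
    unless (exists $fpkm_hash{$id}) {
        warn "no FPKM/XLOC entry for $id, skipping\n";
        next;
    }
    push @printed, join "\t", $id, $uniq, $xloc_hash{$id}, $fpkm_hash{$id};
}
print "$_\n" for @printed;
```

Reporting the skipped IDs with warn is optional, but it makes it easier to check whether the missing entries are expected.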
Update: I thought I might make some observations/improvements to your code.
my $v_merge_gtf = shift @ARGV or die $!;
my $unique_gtf = shift @ARGV or die $!;
The error variable $! serves no purpose here (a fact I only recently discovered, even after 14 years of using Perl). $! is only set by system calls (where you are involving the operating system). The most common are open and close for files, and opendir and closedir for directories. If an error occurs in opening or closing a file or a directory, $! will contain the error message. (See in my included code how I handled this: I created a message, $usage, to be printed if the shift didn't succeed.)
Instead of using 2 hashes to store the information, I used 1 hash, %data. The advantage is that it uses less memory (because it's only storing 1 set of keys instead of 2), though you could use the 2 if you like.
I used the recommended 3-argument (filehandle, mode, filename) form for opening the files. The 2-argument approach you used is outdated and less safe (for reasons I won't go into here). Also, the lexical filehandles I used, my $mrg and my $unique, are the newer way to create filehandles (instead of using FILE for your 2 opens).
You can assign directly to $line in your while loop, like while (my $line = <FILE>), instead of the way you did it. In my sample program, I didn't assign to $line, but instead relied on the default variable $_. (It simplifies the 2 following statements, next unless /\S/; my @array = split /\t/;). I didn't chomp for the first file because you're only parsing inside the string and aren't using anything from the end of the string. chomp is necessary for the second while loop because the second variable, my $uniq = ..., would have a newline at its end if it wasn't removed by chomp.
I didn't know what you meant by this statement, if ($line =~ /[a-z]/). I am assuming you wanted to check for empty lines and only process lines with non-space data. That's why I wrote next unless /\S/; instead. (It says to skip the following statements, go to the top of the while loop and read the next record.)
Your first while loop worked because you had no errors in your input file. If there had been errors, the way you wrote the code could have been a problem.
The statement my $id = $array[8]; gives $id a value that would have been wrongly used if the following if statement had been false. (The same goes for the 2 other variables you want to capture, $fpkm and $xloc.) You can see in my code example how I handled this.
In my code, I died if the match didn't succeed. You might not want to die, but instead say match or next to try the next line of data. It depends on how you would want to handle a failed match.
And in this line, $array[8] =~ /gene_id "(CUFF\S+)";/, note that I put the "; following the captured data, so there is no need to remove it from the captured data (as you did in your substitutions).
Well, I know this is a long comment on your code, but I hope you get some good ideas about why I recommended the changes given.
or die "Could not find ID in $v_merge_gtf (line# $.)";
$. is the line number of the file being read.
#!/usr/bin/perl
use warnings;
use strict;
my $usage = "USAGE: perl $0 merge_gtf_file unique_gtf_file\n";
my $v_merge_gtf = shift @ARGV or die $usage;
my $unique_gtf = shift @ARGV or die $usage;
my %data;
open my $mrg, '<', $v_merge_gtf or die $!;
while (<$mrg>) {
next unless /\S/;
my @array = split /\t/;
if ($array[2] eq 'exon') {
$array[8] =~ /gene_id "(CUFF\S+)";/
or die "Could not find ID in $v_merge_gtf (line# $.)";
my $id = $1;
$array[8] =~ /FPKM "(\S+)";/
or die "Could not find FPKM in $v_merge_gtf (line# $.)";
my $fpkm = $1;
$array[17] =~ /gene_id "(XLOC\S+)";/
or die "Could not find XLOC in $v_merge_gtf (line# $.)";
my $xloc = $1;
$data{$id}{fpkm} = $fpkm;
$data{$id}{xloc} = $xloc;
}
}
close $mrg or die $!;
open my $unique, '<', $unique_gtf or die $!;
while (<$unique>) {
next unless /\S/;
chomp;
my ($id, $uniq) = split /\t/;
# column order follows the requested format: id, uniq_id, XLOC id, FPKM
print join("\t", $id, $uniq, $data{$id}{xloc}, $data{$id}{fpkm}), "\n";
}
close $unique or die $!;
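If dying on a failed match is too strict, the "match or next" variant mentioned earlier could look roughly like this. The attribute strings are shortened, made-up examples, and the second one deliberately lacks an FPKM field:

```perl
use strict;
use warnings;

# Hypothetical attribute strings; the second one lacks an FPKM field.
my @attributes = (
    'gene_id "CUFF.3"; transcript_id "CUFF.3.1"; FPKM "3.1300591420";',
    'gene_id "CUFF.4"; transcript_id "CUFF.4.1";',
);

my %fpkm;
for my $attr (@attributes) {
    # A list-context match returns its captures, or an empty list on failure.
    my ($id)    = $attr =~ /gene_id "(CUFF\S+)";/ or next;
    my ($value) = $attr =~ /FPKM "(\S+)";/        or next;
    $fpkm{$id} = $value;
}
print "$_\t$fpkm{$_}\n" for sort keys %fpkm;
```

Capturing straight into a variable also avoids reusing a stale $1 from an earlier successful match.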

Push, big file. Correct and improvement

dummy.pepmasses
YCL049C 1 511.2465 0 0 MFSK
YCL049C 2 4422.3098 0 0 YLVTASSLFVA
YCL049C 3 1131.5600 0 0 DFYQVSFVK
YCL049C 4 1911.0213 0 0 SIAPAIVNSSVIFHDVSR
YCL049C 5 774.4059 0 0 GVAMGNVK
YCL049C 6 261.1437 0 0 SR
my $dummyfile = "dummy.pepmasses"; #filename defined here
my @mzco = ();
open (IFILE, $dummyfile) or die "unable to open file $dummyfile\n ";
while (my $line = $dummyfile){
#read each line in file
chomp $line;
my $mz_value = (split/\s+/,$line)[3]; #pick column 3rd at every line
$mz_value = join "\n"; # add "\n" for data
push (@mzco,$mz_value); #add them all in one array @mzco
}
print "@mzco";
close IFILE;
There should be a better way to express this. What would it be?
I want to pick the third column and push it into an array. Are there better methods?

I'll just go through your code and comment
open (IFILE, $dummyfile) or die "unable to open file $dummyfile\n ";
You should use 3-argument open with explicit mode, and a lexical file handle. Also, you should not include newline in the die message unless you want to suppress line number. You should also include the error, $!.
open my $fh, "<", $dummyfile or die "Unable to open $dummyfile: $!";
while (my $line = $dummyfile){
#read each line in file
No, this just copies the file name. To read from the file handle, do this:
while (my $line = <IFILE>) {
Or <$fh> if you use a lexical file handle.
chomp $line;
my $mz_value = (split/\s+/,$line)[3]; #pick column 3rd at every line
This is actually the 4th column, since indexes start at zero. For the third column, use index 2.
$mz_value = join "\n"; # add "\n" for data
join does not work that way. It is join EXPR, LIST to join a list of values into a string. You want the concatenation operator .:
$mz_value = $mz_value . "\n";
Or more appropriately:
$mz_value .= "\n";
But why do it that way? It is simpler to just add the newline when you print.
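A short sketch of the difference between join and concatenation, using three of the masses from the sample data as strings:

```perl
use strict;
use warnings;

my @values = ('511.2465', '4422.3098', '1131.5600');

# join EXPR, LIST: the separator goes between the elements only.
my $joined = join "\n", @values;

# The concatenation operator appends to a single string.
my $one = $values[0];
$one .= "\n";

print "$joined\n";
```

Note that join does not add a separator after the last element, so the final newline still has to be printed separately.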
print "@mzco";
You can do this:
print "$_\n" for @mzco;
Or if you are feeling daring:
use feature 'say';
say for @mzco;
And just to show you the power of Perl, this program can be reduced to a one-liner, using a lot of built-in features:
perl -lane ' print $F[3] ' dummy.pepmasses
-l chomp lines, add newline (by default) to print
-n put while (<>) loop around code: read input file or stdin
-a autosplit each line into @F.
The program as a file would look like this:
$\ = $/; # set output record separator to input record separator
while (<>) {
chomp;
my @F = split;
print $F[3];
}
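If you would rather keep the values in an array, as in the question, the pieces above can be put together along these lines. The subroutine name column_values is made up for this sketch. It takes a 0-based index, so the third column is index 2:

```perl
use strict;
use warnings;

# Return one column (0-based index) from every non-blank line of a file.
sub column_values {
    my ($path, $index) = @_;
    open my $fh, '<', $path or die "Unable to open $path: $!";
    my @values;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;
        my @fields = split ' ', $line;   # split on whitespace, skip leading blanks
        push @values, $fields[$index] if defined $fields[$index];
    }
    close $fh;
    return @values;
}

# The masses are the third column, i.e. index 2.
# my @mzco = column_values('dummy.pepmasses', 2);
# print "$_\n" for @mzco;
```

The newline is added only when printing, so the array holds clean values that can be used for further calculations.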

Append a new column to file in perl

I've got the follow function inside a perl script:
sub fileSize {
my $file = shift;
my $opt = shift;
open (FILE, $file) or die "Could not open file $file: $!";
$/ = ">";
my $junk = <FILE>;
my $g_size = 0;
while ( my $rec = <FILE> ) {
chomp $rec;
my ($name, @seqLines) = split /\n/, $rec;
my $sec = join('',@seqLines);
$g_size+=length($sec);
if ( $opt == 1 ) {
open TMP, ">>", "tmp" or die "Could not open chr_sizes.log: $!\n";
print TMP "$name\t", length($sec), "\n";
}
}
if ( $opt == 0 ) {
PrintLog( "file_size: $g_size", 0 );
}
else {
print TMP "file_size: $g_size\n";
close TMP;
}
$/ = "\n";
close FILE;
}
Input file format:
>one
AAAAA
>two
BBB
>three
C
I have several input files with that format. The line beginning with ">" is the same but the other lines can be of different length. The output of the function with only one file is:
one 5
two 3
three 1
I want to execute the function in a loop with this for each file:
foreach my $file ( @refs ) {
fileSize( $file, 1 );
}
When running the next iteration, let's say with this file:
>one
AAAAABB
>two
BBBVFVF
>three
CS
I'd like to obtain this output:
one 5 7
two 3 7
three 1 2
How can I modify the function or the script to get this? As can be seen, my function appends the text to the file.
Thanks!

I've left out your options and the file IO operations and have concentrated on showing a way to do this with an array of arrays from the command line. I hope it helps. I'll leave wiring it up to your own script and subroutines mostly up to you :-)
Running this one liner against your first data file:
perl -lne ' $name = s/>//r if /^>/ ;
push @strings , [$name, length $_] if !/^>/ ;
END { print "@{$_ } " for @strings }' datafile1.txt
gives this output:
one 5
two 3
three 1
Substituting the second version or instance of the data file (i.e where record one contains AAAAABB) gives the expected results as well.
one 7
two 7
three 2
In your script above, you save to an output file in this format. So, to append columns to each row in your output file, we can just munge each of your data files in the same way (with any luck this might mean things can be converted into a function that will work in a foreach loop). If we save the transformed data to be output into an array of arrays (AoA), then we can just push the length values we get for each data file string onto the corresponding anonymous array element and then print out the array. Voilà! Now let's hope it works ;-)
You might want to install Data::Printer which can be used from the command line as -MDDP to visualize data structures.
First - run the above script and redirect the output to a file with > /tmp/output.txt
Next - try this longish one-liner that uses DDP and p to show the structure of the array we create:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift;
@tmp = map { [split] } <>; p @tmp }
$name = s/>//r if /^>/ ;
push @out , [ $name, length $_ ] if !/^>/ ;
END{ p @out ; }' /tmp/output.txt datafile2.txt
In the BEGIN block we local-ize @ARGV and shift off the first file (our version of your TMP file); {local @ARGV=shift} is almost a perl idiom for handling multiple input files. We then split each line inside an anonymous array constructor ([]) and map { } that into the @tmp array, which we display with DDP's p() function. Once we are out of the BEGIN block, the implicit while (<>){ ... } that we get with perl's -n command line switch takes over and reads in the remaining file from @ARGV. We process lines starting with > by stripping the leading character and assigning the string that follows to the $name variable. The while continues and we push $name and the length of any line that does not start with > (if !/^>/), wrapped as elements of an anonymous array [], into the @out array, which we display with p() as well (in the END{} block so it doesn't print inside our implicit while() loop). Phew!!
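Outside of a one-liner, the local @ARGV idiom can be wrapped in a small helper. This is only a sketch: the name slurp_lines is invented, and a temporary file stands in for the log file so the example is self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file through <> without disturbing the caller's @ARGV.
sub slurp_lines {
    my ($path) = @_;
    local @ARGV = ($path);    # <> reads the files named in @ARGV
    my @lines = <>;
    chomp @lines;
    return @lines;
}

# Hypothetical log file with the name/length rows from the first run.
my ($fh, $log) = tempfile(UNLINK => 1);
print {$fh} "one 5\ntwo 3\nthree 1\n";
close $fh or die "cannot close $log: $!";

# One anonymous array per row: [ name, length ]
my @rows = map { [split] } slurp_lines($log);
print "@$_\n" for @rows;
```

When the subroutine returns, @ARGV has its old contents again, so any files still waiting on the command line are untouched.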
See the AoA that results as a gist on GitHub.
Finally - building on that, and now we have munged things nicely - we can change a few things in our END{...} block (add a nested for loop to push things around) and put this all together to produce the output we want.
This one liner:
perl -MDDP -lne 'BEGIN{ local @ARGV=shift; @tmp = map {[split]} <>; }
$name = s/>//r if /^>/ ; push @out, [ $name, length $_ ] if !/^>/ ;
END{ foreach $row (0..$#tmp) { push @{ $tmp[$row] } , $out[$row][-1]} ;
print "@$_" for @tmp }' output.txt datafile2.txt
produces:
one 5 7
two 3 7
three 1 2
We'll have to convert that into a script :-)
The script consists of three rather wordy subroutines that read the log file, parse the datafile, and merge them. We run them in order. The first one checks to see if there is an existing log; if there is none, it creates one and then does an exit to skip any further parsing/merging steps.
You should be able to wrap them in a loop of some kind that feeds files to the subroutines from an array instead of fetching them from STDIN. One caution - I'm using IO::All because it's fun and easy!
use 5.14.0 ;
use IO::All;
my @file = io(shift)->slurp ;
my $log = "output.txt" ;
&readlog;
&parsedatafile;
&mergetolog;
####### subs #######
sub readlog {
if (! -R $log) {
print "creating first log entry\n";
my @newlog = &parsedatafile ;
open(my $fh, '>', $log) or die "I CAN HAZ WHA????" ;
print $fh "@$_ \n" for @newlog ;
exit;
}
else {
map { [split] } io($log)->slurp ;
}
}
sub parsedatafile {
my (@out, $name) ;
foreach (@file) { # loop over the slurped lines
chomp ;
$name = s/>//r if /^>/;
push @out, [$name, length $_] if !/^>/ ;
}
@out;
}
sub mergetolog {
my @tmp = readlog ;
my @data = parsedatafile ;
foreach my $row (0 .. $#tmp) {
# explicit dereference; push on a bare reference no longer works in newer perls
push @{ $tmp[$row] }, $data[$row][-1]
}
open(my $fh, '>', $log) or die "Foobar!!!" ;
print $fh "@$_ \n" for @tmp ;
}
The subroutines do all the work here - you can likely find ways to shorten; combine; improve them. Is this a useful approach for you?
I hope this explanation is clear and useful to someone - corrections and comments welcome. Probably the same thing could be done with in-place editing (i.e. with perl -pi -e '...'), which is left as an exercise to those that follow ...

You need to open the output file itself. First in read mode, then in write mode.
I have written a script that does what you are asking. What really matters is the part that appends new data to old data. Adapt that to your fileSize function.
So you have the output file, output.txt
Of the form,
one 5
two 3
three 1
And an array of input files, input1.txt, input2.txt, etc, saved in the @inputfiles variable.
Of the form,
>one
AAAAA
>two
BBB
>three
C
>four
DAS
and
>one
AAAAABB
>two
BBBVFVF
>three
CS
Respectively.
After running the following perl script,
# First read previous output file.
open OUT, '<', "output.txt" or die $!;
my @outlines;
while (my $line = <OUT> ) {
chomp $line;
push @outlines, $line;
}
close OUT;
my $outsize = scalar @outlines;
# Suppose you have your array of input file names already prepared
my @inputfiles = ("input1.txt", "input2.txt");
foreach my $file (@inputfiles) {
open IN, '<', $file or die $!;
my $counter = 1; # Used to compare against output size
while (my $line = <IN>) {
chomp $line;
$line =~ m/^>(.*)$/;
my $name = $1;
my $sequence = <IN>;
chomp $sequence;
my $seqsize = length($sequence);
# Here is where I append a column to output data.
if($counter <= $outsize) {
$outlines[$counter - 1] .= " $seqsize";
} else {
$outlines[$counter - 1] = "$name $seqsize";
}
$counter++;
}
close IN;
}
# Now rewrite the results to output.txt
open OUT, '>', "output.txt" or die $!;
foreach (@outlines) {
print OUT "$_\n";
}
close OUT;
You generate the output,
one 5 5 7
two 3 3 7
three 1 1 2
four 3
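Another way to get one column per input file, without re-reading the output file each time, is to collect the lengths in a hash of arrays and print the table once at the end. This is a sketch under my own assumptions: the names add_fasta and report are invented, and a record that is missing from one of the files would shift the later columns for that name.

```perl
use strict;
use warnings;

# Collect one length per record name and per input file.
# %sizes maps a name to its list of lengths; @order remembers first-seen order.
my (%sizes, @order);

sub add_fasta {
    my ($path) = @_;
    open my $in, '<', $path or die "Could not open $path: $!";
    local $/ = ">";                        # one FASTA record per read
    <$in>;                                 # skip the chunk before the first ">"
    while (my $rec = <$in>) {
        chomp $rec;
        my ($name, @seq_lines) = split /\n/, $rec;
        next unless defined $name;
        push @order, $name unless exists $sizes{$name};
        push @{ $sizes{$name} }, length(join '', @seq_lines);
    }
    close $in;
    return;
}

# One tab-separated row per record name: name, then one length per file.
sub report {
    return map { join "\t", $_, @{ $sizes{$_} } } @order;
}

# Intended use with the loop from the question (@refs holds the file names):
# add_fasta($_) for @refs;
# print "$_\n" for report();
```

Keeping the table in memory until all files are read avoids having to open the output file in read mode and then again in write mode.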

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file which contains a string, using another file that holds the start and end of each substring that I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my @cline = <INFILE>;
foreach my $cline (@cline) {
print "$cline\n";
my @data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
When I run it, it seems to do only one iteration and prints the start of the first file.
It seems like the foreach loop doesn't work.
substr doesn't seem to work either.
When I add an exit and print $cline to check the loop, it prints all the lines of the coordinates file.
I am sorry if I am becoming annoying, but I must finish this and I am a little bit desperate...
Thank you again.

This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your @cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\@cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
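The same effect can be reproduced without any files. In this sketch an in-memory string stands in for the coordinates file, and the same data is read in list context once with the default separator and once with $/ undefined:

```perl
use strict;
use warnings;

# Stand-in for the first three lines of the coordinates file.
my $coords = "44801988 44802104\n44846151 44846312\n45620133 45620274\n";

# Default $/ ("\n"): one list element per line.
open my $by_line, '<', \$coords or die "cannot open in-memory data: $!";
my @lines = <$by_line>;
close $by_line;

# $/ undefined (slurp mode): the whole input arrives as a single element.
my @slurped;
{
    local $/;
    open my $whole, '<', \$coords or die "cannot open in-memory data: $!";
    @slurped = <$whole>;
    close $whole;
}

printf "%d elements by line, %d element when slurped\n",
    scalar @lines, scalar @slurped;
```

A foreach over the slurped array therefore runs exactly once, which matches what the question describes.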
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my @cline = <INFILE>;
# Since @cline only has one element, the loop only has one iteration
foreach my $cline (@cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my @cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}

As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
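On the open question in the comments above: substr takes a 0-based offset and a length. If the coordinates are 1-based and both ends are inclusive (an assumption on my part, so check how the coordinates file was produced), the conversion could be sketched like this. The name extract_region is made up:

```perl
use strict;
use warnings;

# Extract the bases from $start to $end, both 1-based and inclusive
# (an assumption; check how the coordinates file was produced).
sub extract_region {
    my ($seq, $start, $end) = @_;
    # Out-of-range or reversed coordinates give no result.
    return if $start < 1 || $end < $start || $end > length($seq);
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = 'CTTAGCAGCAGC';
print extract_region($seq, 1, 4), "\n";    # first four bases
print extract_region($seq, 4, 6), "\n";    # bases 4 to 6
```

The range check matters for the sample data: the header line says scaffold30 has 24194 bases, while the coordinates shown are in the tens of millions, so substr would have nothing to return for them.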

How do I use variables to do substitution in Perl?

I have several text files, that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy, once I get them into a usable form. The first file, "keys.text" is just a list of labels, inconsistently formatted. Like:
Sa 1 #
Sa 2
U 328 #*
It's always letter(s), [space], number(s), [space], and sometimes symbol(s). The text files that match these keys are the same, then followed by a line of text, also separated, or delimited, by a SPACE.
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
What I'm trying to do in the code below, is match the key from "keys.text", with the same key in the .txt files, and put a tab between the key, and the text. I'm sure I'm overlooking something very basic, but the result I'm getting, looks identical to the source .txt file.
Thanks in advance for any leads or assistance!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open(IN1, "keys.text");
my $key;
# Read each line one at a time
while ($key = <IN1>) {
# For each txt file in the current directory
foreach my $file (<*.txt>) {
open(IN, $file) or die("Cannot open TXT file for reading: $!");
open(OUT, ">temp.txt") or die("Cannot open output file: $!");
# Add temp modified file into directory
my $newFilename = "modified\/keyed_" . $file;
my $line;
# Read each line one at a time
while ($line = <IN>) {
$line =~ s/"\$key"/"\$key" . "\/t"/;
print(OUT "$line");
}
rename("temp.txt", "$newFilename");
}
}
EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...

The regex seems quoted rather oddly to me. Wouldn't
$line =~ s/$key/$key\t/;
work better?
Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.
And don't put parentheses around your print args, esp when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.
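One more thing to watch for: keys such as "U 328 #*" contain characters that are special inside a regex, so the key should be quoted with \Q...\E. A minimal sketch, using one key and one line from the question and assuming the space after the key should be replaced by the tab:

```perl
use strict;
use warnings;

my $key  = "U 328 #*\n";     # as read from keys.text, newline included
my $line = "U 328 #* Continuing text...\n";

chomp $key;
# \Q...\E quotes regex metacharacters such as "*" in the key.
# The trailing \s+ consumes the space that separated key and text.
(my $tabbed = $line) =~ s/^\Q$key\E\s+/$key\t/;

print $tabbed;
```

Anchoring with ^ and requiring whitespace after the key also keeps a key like "Sa 1" from matching a line that starts with "Sa 10".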

If Perl is not a must, you can use this awk one-liner:
$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*
$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...

Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.
Then for all files named on the command line, available in the special Perl array @ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone, but otherwise insert a TAB between the key and the text.
Note that we edit the files in-place thanks to Perl's handy -i option:
-i[extension]
specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …
The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.
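A quick illustration of what the limit does, using one row from the question with an extra space added inside the text:

```perl
use strict;
use warnings;

my $row = "Sa 2 This text  is just as random.\n";

# With a limit of 3, everything after the second field stays in one piece.
my ($text, $num, $rest) = split " ", $row, 3;

print "[$text] [$num] [$rest]";
```

The third field keeps its internal spacing and its trailing newline, which is why the script can print the rebuilt line without adding one.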
#! /usr/bin/perl -i.bak
use warnings;
use strict;
sub usage { "Usage: $0 text-file\n" }
sub read_keys {
my $path = "keys.text";
open my $fh, "<", $path
or die "$0: open $path: $!";
my %key;
while (<$fh>) {
my($text,$num) = split;
++$key{$text}{$num} if defined $text && defined $num;
}
wantarray ? %key : \%key;
}
die usage unless @ARGV;
my %key = read_keys;
while (<>) {
my($text,$num,$line) = split " ", $_, 3;
$_ = "$text $num\t$line" if defined $text &&
defined $num &&
$key{$text}{$num};
print;
}
Sample run:
$ ./add-tab input
$ diff -u input.bak input
--- input.bak 2010-07-20 20:47:38.688916978 -0500
+++ input 2010-07-20 21:00:21.119531937 -0500
@@ -1,3 +1,3 @@
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1 # Random line of text follows.
+Sa 2 This text is just as random.
+U 328 #* Continuing text...

Fun answers:
$line =~ s/(?<=$key)/\t/;
Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.
And:
$line =~ s/$key/$key . "\t"/e;
Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.
Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)

How about reading each file in a separate pass? For the first file, you open the keys and create a preliminary hash. For the second file, all you then need to do is add the text to the hash.
use strict;
use warnings;
my $keys_file = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file = "path to output.txt";
my %hash = ();
my $keys_regex = '^([a-zA-Z]+)\s*(\d+)\s*([^\da-zA-Z\s]*)';
open my $fh, '<', $keys_file or die "could not open $keys_file: $!";
while(<$fh>){
my $line = $_;
if ($line =~ /$keys_regex/){
my $key = $1;
my $number = $2;
my $symbol = $3;
$hash{$key}{'number'} = $number;
$hash{$key}{'symbol'} = $symbol;
}
}
close $fh;
open $fh, '<', $content_file or die "could not open $content_file: $!";
while(<$fh>){
my $line = $_;
if ($line =~ /^([a-zA-Z]+)/){
my $key = $1;
# strip the key/number/symbol from the content_file line to leave the text
$line =~ s/^\Q$key\E//;
$line =~ s/\s*\Q$hash{$key}{'number'}\E//;
$line =~ s/\s*\Q$hash{$key}{'symbol'}\E//;
$line =~ s/^\s+//g;
$hash{$key}{'text'} = $line;
}
}
close $fh;
open $fh, '>', $output_file or die "could not open $output_file: $!";
for my $key (keys %hash){
print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;
I haven't had a chance to test it yet and the solution seems a little hacky with all the regex but might give you an idea of something else you can try.

This looks like the perfect place for the map function in Perl! Read the entire text file into an array, then apply the map function across the entire array. The only other thing you might want to do is use the quotemeta function to escape any regular-expression metacharacters in your keys.
Using map is very efficient. I also read the keys into an array in order to not have to keep opening and closing the keys file in my loop. Every key is tried against every line, so the work grows with (number of keys) x (number of lines), but if your keys file isn't that big, it shouldn't be too bad.
#! /usr/bin/env perl
use strict;
use vars;
use warnings;
open (KEYS, "keys.text")
or die "Cannot open 'keys.text' for reading\n";
my @keys = <KEYS>;
close (KEYS);
foreach my $file (glob("*.txt")) {
open (TEXT, "$file")
or die "Cannot open '$file' for reading\n";
my @textArray = <TEXT>;
close (TEXT);
foreach my $line (@keys) {
chomp $line;
map($_ =~ s/^$line/$line\t/, @textArray);
}
open (NEW_TEXT, ">$file.new") or
die qq(Can't open file "$file.new" for writing\n);
# the lines still carry their newlines, so print them as they are
print NEW_TEXT @textArray;
close (NEW_TEXT);
}
