Why are no output files written in a prinseq-lite Perl loop? - perl

I am completely new to this type of coding/command line work, so I am sorry if I am asking this question in the wrong way.
I want to loop over all files in a directory (I am quality trimming DNA sequencing files in .fastq format).
I have written this loop:
for i in *.fastq; do
perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq $i -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq -out_bad null; done
The code itself seems to work: I can see in my terminal that it is taking the right files and doing the trimming (it writes a summary log to the terminal as it goes), but no output files are generated, i.e. these ones:
-out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq
If I run the code in a non-loop way, on just one file, it works (= the output is generated), like this example:
prinseq-lite.pl -fastq 60782_merged_rRNA.fastq -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good 60782_merged_rRNA_filt_codeTEST.fastq -out_bad null
Is there a simple reason/answer to this?

This problem has nothing to do with Perl at all.
/proj/forhot/qfiltered/looptest/$i_filtered.fastq is read by the shell as interpolating the contents of i_filtered. There is no such shell variable, so this argument turns into /proj/forhot/qfiltered/looptest/.fastq ($i_filtered turns into nothing).
Therefore all of your prinseq-lite.pl executions place their output in the same file, which (because its name starts with a .) is "hidden": You need to use ls -a to see it, not just ls.
Fix
... -out_good /proj/forhot/qfiltered/looptest/${i}_filtered.fastq
Note that this would give you e.g. 60782_merged_rRNA.fastq_filtered.fastq for an input file of 60782_merged_rRNA.fastq. If you want to get rid of the duplicate .fastq part, you need something like:
... -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq
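To see the difference without running prinseq-lite at all, here is a minimal sketch in plain sh. The helper name out_name is made up for illustration (it is not part of prinseq-lite); it only builds the output file name.

```shell
#!/bin/sh
# out_name is a hypothetical helper (not part of prinseq-lite):
# it prints the output file name for one input file.
out_name() {
    # ${1%.fastq} removes a trailing ".fastq" from the first argument
    printf '%s_filtered.fastq\n' "${1%.fastq}"
}

i=60782_merged_rRNA.fastq

# The shell looks up a variable named "i_filtered", which is unset,
# so only ".fastq" is left.
echo "broken: $i_filtered.fastq"

# Braces end the variable name after "i".
echo "braces: ${i}_filtered.fastq"

# Suffix removal avoids the doubled extension.
echo "suffix: $(out_name "$i")"
```

Running this with echo first is a cheap way to check what the shell will actually pass to -out_good before starting the real loop.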

Related

Issues with fopen understanding the filepath I am giving it

Basically, I build a list of files using ls, then want to loop through that list, read in a file, and do some stuff. But when I try to read the file in it fails.
Here is an example
r=ls(['Event_2006_334_21_20_11' '/*.r'])
Event_2006_334_21_20_11/IU.OTAV_1.0.i.r
which is a 1x80 char
fopen(r(1,:))
-1
but
fopen('Event_2006_334_21_20_11/IU.OTAV_1.0.i.r')
12 (or whatever its on)
works. I've tried string(r), char(r) and sprintf('%s',r). If I just build the string like r = ['Event_2006_334_21_20_11' '/IU.OTAV_1.0.i.r'] it works. So it seems that something about combining the different variable types messes it up, but I can't find a workaround. I'm probably missing something obvious.
Any suggestions?
ls returns a matrix of characters, which means each row contains the same number of characters. To indicate the problem, try:
['-' r(1,:) '-']
You will probably notice some whitespaces in front of the -. Unless you want to print the output to the command line, ls is not really useful. As mentioned by Alex, use dir instead.
A further tip regarding your last comment: concatenate file paths using fullfile. It makes sure you get exactly one file separator between the parts:
>> fullfile('myfolder','mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m
>> fullfile('myfolder/','mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m
>> fullfile('myfolder/','/mysubfolder','myfile.m')
ans = myfolder/mysubfolder/myfile.m

Variable not being recognized after "read"

-- Edit : Resolved. See answer.
Background:
I'm writing a shell that will perform some extra actions required on our system when someone resizes a database.
The shell is written in ksh (requirement), the OS is Solaris 5.10 .
The problem is with one of the checks, which verifies there's enough free space on the underlying OS.
Problem:
The check reads the df -k line for root, which is what I check in this step, and prints it to a file. I then "read" the contents into variables which I use in calculations.
Unfortunately, when I try to run an arithmetic operation on one of the variables, I get an error indicating it is null. A debug output line I've placed after that line confirms that it is null... It lost its value...
I've tried every method of doing this I could find online, they work when I run it manually, but not inside the shell file.
(* The file does have #!/usr/bin/ksh)
Code:
df -k | grep "rpool/ROOT" > dftest.out
RPOOL_NAME=""; declare -i TOTAL_SIZE=0; USED_SPACE=0; AVAILABLE_SPACE=0; AVAILABLE_PERCENT=0; RSIGN=""
read RPOOL_NAME TOTAL_SIZE USED_SPACE AVAILABLE_SPACE AVAILABLE_PERCENT RSIGN < dftest.out
\rm dftest.out
echo $RPOOL_NAME $TOTAL_SIZE $USED_SPACE $AVAILABLE_SPACE $AVAILABLE_PERCENT $RSIGN
((TOTAL_SIZE=$TOTAL_SIZE/1024))
This is the result:
DBResize.sh[11]: TOTAL_SIZE=/1024: syntax error
I'm pulling hairs at this point, any help would be appreciated.
The code you posted cannot produce the output you posted. Most obviously, the error is signalled at line 11 but you posted fewer than 11 lines of code. The previous lines may matter. Always post complete code when you ask for help.
More concretely, the declare command doesn't exist in ksh, it's a bash thing. You can achieve the same result with typeset (declare is a bash equivalent to typeset, but not all options are the same). Either you're executing this script with bash, or there's another error message about declare, or you've defined some additional commands including declare which may change the behavior of this code.
None of this should have an impact on the particular problem that you're posting about, however. The variables created by read remain assigned until the end of the subshell, i.e. until the code hits a ), the end of a pipe (left-hand side of the pipe only in ksh), etc.
About the use of declare or typeset, note that you're only declaring TOTAL_SIZE as an integer. For the other variables, you're just assigning a value which happens to consist exclusively of digits. It doesn't matter for the code you posted, but it's probably not what you meant.
One thing that may be happening is that grep matches nothing, and therefore read reads an empty line. You should check for errors. Use set -e in scripts to exit at the first error. (There are cases where set -e doesn't catch errors, but it's a good start.)
Another thing that may be happening is that df is splitting its output onto multiple lines because the first column containing the filesystem name is too large. To prevent this splitting, pass the option -P.
Using a temporary file is fragile: the code may be executed in a read-only directory, another process may want to access the same file at the same time... Here a temporary file is useless. Just pipe directly into read. In ksh (unlike most other sh variants including bash), the right-hand side of a pipe runs in the main shell, so assignments to variables in the right-hand side of a pipe remain available in the following commands.
It doesn't matter in this particular script, but you can use a variable without $ in an arithmetic expression. Using $ substitutes a string, which can have confusing results: e.g. after a='1+2', $(($a*3)) expands to 7. Not using $ uses the numerical value (in ksh, after a='1+2', $((a*3)) expands to 9; in some sh implementations you get an error because a's value is not numeric).
#!/usr/bin/ksh
set -e
typeset -i TOTAL_SIZE=0 USED_SPACE=0 AVAILABLE_SPACE=0 AVAILABLE_PERCENT=0
df -Pk | grep "rpool/ROOT" | read RPOOL_NAME TOTAL_SIZE USED_SPACE AVAILABLE_SPACE AVAILABLE_PERCENT RSIGN
echo $RPOOL_NAME $TOTAL_SIZE $USED_SPACE $AVAILABLE_SPACE $AVAILABLE_PERCENT $RSIGN
((TOTAL_SIZE=TOTAL_SIZE/1024))
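If the script might ever run under a shell where the right-hand side of a pipe is a subshell (bash, for example), one alternative is to feed the line to read through a here-document, which keeps read in the current shell. The following is only a sketch under that assumption: total_mb is a made-up helper name, and the sample line is invented rather than real df output.

```shell
#!/bin/sh
# total_mb is a hypothetical helper: it takes one data line in the
# "df -Pk" column layout and prints the total size in megabytes.
total_mb() {
    # The here-document is read in the current shell (no pipe, no temp file),
    # so the variables are still set after read returns.
    read -r fs total used avail pct mount <<EOF
$1
EOF
    # An empty line (e.g. grep matched nothing) leaves total empty: report failure.
    [ -n "$total" ] || return 1
    # No $ needed inside the arithmetic expression.
    echo $((total / 1024))
}

# Invented sample line; in the real script it could come from
# "$(df -Pk | grep 'rpool/ROOT')"
total_mb "rpool/ROOT/s10 32962416 5732492 25552588 19% /"
```

The explicit check for an empty value gives a clearer failure than the "TOTAL_SIZE=/1024: syntax error" message when grep finds no matching line.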
Strange... when I get rid of your "declare" line, your original code seems to work perfectly well (at least with ksh on Linux).
The code :
#!/bin/ksh
df -k | grep "/home" > dftest.out
read RPOOL_NAME TOTAL_SIZE USED_SPACE AVAILABLE_SPACE AVAILABLE_PERCENT RSIGN < dftest.out
\rm dftest.out
echo $RPOOL_NAME $TOTAL_SIZE $USED_SPACE $AVAILABLE_SPACE $AVAILABLE_PERCENT $RSIGN
((TOTAL_SIZE=$TOTAL_SIZE/1024))
print $TOTAL_SIZE
The result :
32962416 5732492 25552588 19% /home
5598
These are the values that a plain df -k returns. The variables seem to last.
For those interested, I have figured out that it is not possible to use "read" the way I was using it.
The variable values assigned by "read" simply "do not last".
To remedy this, I have applied the less than ideal solution of using the standard "while read" format, and inside the loop, echo selected variables into a variable file.
Once said file was created, I just "loaded" it.
(pseudo code:)
LOOP START
echo "VAR_A="$VAR_A"; VAR_B="$VAR_B";" > somefile.out
LOOP END
. somefile.out

Perl Code : Output not displayed properly

I have a perl code where I access multiple txt files and produce output for them.
While I run the code, the output lines on the console are overwritten.
2015-04-21:12-04-54|getFilesInInputDir| ********** name : PEPORT **********
PEPORT4-21:12-04-54|readNFormOutputFile| name :
PEPORT" is : -04-54|readNFormOutputFile| Frequency for name "
Please note that the second and third lines should have looked like this:
2015-04-21:12-04-54|readNFormOutputFile| name : PEPORT
2015-04-21:12-04-54|readNFormOutputFile| Frequency for name "PEPORT"
Also, after this the code stops processing my files. The code seems fine. May I know what may be the possible cause for this.
Thanks.
Seems like CR/LF versus LF issue. Convert your input from MSWin to Linux by running dos2unix or fromdos, or remove the "\r" characters from within the Perl code.
As choroba says, I guess you are reading a file on Linux that has been generated on Windows. The easiest fix is to replace chomp with s/\s+\z// or s/\p{cntrl}+\z//
Or, if trailing spaces are significant, you can use s/[\r\n]+\z// or, if you are running version 10 or later of Perl 5, s/\R\z//
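If dos2unix or fromdos is not installed, stripping the carriage returns with tr before the Perl script reads the file is another option. This is only a sketch: strip_cr is a made-up helper name, and note that it removes every carriage return in the input, not just the ones at line ends.

```shell
#!/bin/sh
# strip_cr is a hypothetical helper: it copies standard input to
# standard output with all carriage return characters removed.
strip_cr() {
    tr -d '\r'
}

# Simulated Windows-style line; od -c makes any remaining \r visible.
printf 'PEPORT\r\n' | strip_cr | od -c
```

A real file could be converted with something like strip_cr < input.txt > input_unix.txt (both file names are placeholders).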

creating a hash with regex matches in perl

Let's say I have a file like the one below, and I want to store all the decimal numbers in a hash:
hello world 10 20
world 10 10 10 10 hello 20
hello 30 20 10 world 10
i was looking at this
and this worked fine:
> perl -lne 'push @a,/\d+/g;END{print "@a"}' temp
10 20 10 10 10 10 20 30 20 10 10
Then what i need was to count number of occurrences of each regex.
for this i think it would be better to store all the matches in a hash and assign an incrementing value for each and every key.
so i tried :
perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
which gives me an output of:
> perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
10.4
20.7
Can anybody correct me wherever I went wrong?
the output i expect is:
10.7
20.3
30.1
Although I can do this in awk, I would like to do it only in Perl.
Also order of the output is not a concern for me.
$a{$1}++ for ($_=~/(\d+)/g);
This should be
$a{$_}++ for ($_=~/(\d+)/g);
and can be simplified to
$a{$_}++ for /\d+/g;
The reason for this is that /\d+/g creates a list of matches, which is then iterated over by for. The current element is in $_. I imagine $1 would contain whatever was left in there by the last match, but it's definitely not what you want to use in this case.
Another option would be this:
$a{$1}++ while ($_=~/(\d+)/g);
This does what I think you expected your code to do: loop over each successful match as the matches happen. Thus the $1 will be what you think it is.
Just to be clear about the difference:
The single argument for loop in Perl means "do something for each element of a list":
for (@array)
{
#do something to each array element
}
So in your code, a list of matches was built first, and only after the whole list of matches was found did you have the opportunity to do something with the results. $1 got reset on each match as the list was being built, but by the time your code was run, it was set to the last match on that line. That is why your results didn't make sense.
On the other hand, a while loop means "check if this condition is true each time, and keep going until the condition is false". Therefore, the code in a while loop will be executed on each match of a regex, and $1 has the value for that match.
Another time this difference is important in Perl is file processing. for (<FILE>) { ... } reads the entire file into memory first, which is wasteful. It is recommended to use while (<FILE>) instead, because then you go through the file line by line and keep only the information you want.

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
1. Open the input file.
2. Read and parse the data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky, someone has already done the hard part for you.
3. Compute whatever you're trying to compute.
4. Print to screen (standard out) or to another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#this function prints one tab-separated line: gene name, G percentage, C percentage
#(nothing is printed if no bases have been counted yet)
function report() {
    if (total > 0)
        printf "%s\t%.2f\t%.2f\n", name, 100*g/total, 100*c/total
}
#header row of the tab-delimited output
BEGIN{
    print "Gene name\tG content (%)\tC content (%)"
}
#a line starting with ">" is a label: finish the previous gene, then reset the counters
/^>/{
    report()
    name=substr($0,2)
    total=0; g=0; c=0
    next
}
#any other line is sequence: add its length to the total number of bases, then count
#g/G and c/C (some bases are lowercase in some fasta files); gsub replaces each match
#with itself ("&") and returns the number of matches
{
    total+=length($0)
    g+=gsub(/[Gg]/,"&")
    c+=gsub(/[Cc]/,"&")
}
END{
    #this "END-block" reports the last gene in the file
    report()
}