Merge Fasta and Qual files with different header order in FASTQ - perl

I am trying to merge a fasta file and a qual file into a new fastq file, bearing in mind that the two files might list their sequence IDs in different orders. To handle that, the first step of my script sorts the sequences, which works perfectly when I test it as a separate script. The same goes for the rest: when I run the part that combines the files into a fastq on its own, it runs perfectly. But now that I am trying to combine the two methods in one script it doesn't work, and I don't know what else to do! I would appreciate it if you can help me.
Here is my script so far. It creates the new fastq file but the content is messed up and not what I want. I run it from the terminal like this:
$ perl script.pl reads.fasta reads.qual > reads.fq
Script:
#!/usr/bin/env perl
use strict;
use warnings;
die("Usage: script.pl reads.fasta reads.qual > reads.fq") unless (scalar @ARGV) == 2;
open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL,  $ARGV[1] or die "cannot open qual: $!\n";
my $offset = 33;
my $count = 0;
local($/) = "\n>";
my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (@sort_q, @sort_f);
while (<FASTA>) {
    chomp;
    if ($_ =~ /^>(.+)/) {
        $id = $1;
    } else {
        $id2seq{$id} .= $_;
    }
}
for $id (sort keys %id2seq)
{
    @sort_f = "$id\n$id2seq{$id}\n\n";
    print @sort_f;
}
while (<QUAL>) {
    chomp;
    if ($_ =~ /^>(.+)/) {
        $idq = $1;
    } else {
        $idq2seq{$idq} .= $_;
    }
}
for $idq (sort keys %idq2seq)
{
    @sort_q = "$idq\n$idq2seq{$idq}\n\n";
    print "@sort_q";
}
while (my @sort_f) {
    chomp @sort_f;
    my ($fid, @seq) = split "\n", @sort_f;
    my $seq = join "", @seq; $seq =~ s/\s//g;
    my $sortq = @sort_q;
    chomp my @sortq;
    my ($qid, @qual) = split "\n", @sortq;
    @qual = split /\s+/, (join(" ", @qual));
    # convert score to character code:
    my @qual2 = map { chr($_ + $offset) } @qual;
    my $quals = join "", @qual2;
    die "mismatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
    $fid =~ s/^\>//;
    print STDOUT (join("\n", "@" . $fid, $seq, "+$fid", $quals), "\n");
    $count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";
Thank you in advance

It's been a while since I have used Perl, but I would approach this using a hash of key/value pairs for both the fasta and quality input, then write out all the pairs by looping over the fasta hash and pulling out the corresponding quality string.
I have written something in Python that will do what you need.
It assumes that your input looks like this:
reads.fasta
>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG
reads.qual
>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42
output
@fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
@fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
@fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A#:?#F,2:'KF*20CC:I7F9J.,:E8&?F
import sys

# Check there are enough arguments
if len(sys.argv) < 3:
    print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
    sys.exit(1)

# Initialise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()
fa_input = sys.argv[1]
qual_input = sys.argv[2]

# Read in fasta input
with open(fa_input, 'r') as fa:
    for line in fa:
        line = line.strip()
        if line[0] == '>':
            read_dict[line[1:]] = next(fa).strip()
        else:
            next(fa)

# Read in quality input
with open(qual_input, 'r') as qual:
    for line in qual:
        line = line.strip()
        if line[0] == '>':
            qual_dict[line[1:]] = next(qual).strip()
        else:
            next(qual)

count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
    # Check if read header is in the qualities data
    if key in qual_dict.keys():
        # There's both sequence and quality data, so write to stdout
        read_str = '@{header}\n{seq}\n+\n{qual}'.format(
            header=key,
            seq=seq,
            qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
        print(read_str, file=sys.stdout)
        count += 1
    else:  # not found
        # Write error to stderr
        print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)

# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)
If you need to use an offset for the quality score, edit
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]) to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')]) and define an offset variable before this.
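For example, with the common Phred+33 offset (an assumption; your data may use a different encoding such as Phred+64), the conversion looks like this:

```python
# Hedged sketch: convert space-separated quality scores to a FASTQ
# quality string, assuming the common Phred+33 ASCII offset.
offset = 33
scores = "40 38 30 2"  # hypothetical quality values
qual = ''.join(chr(int(x) + offset) for x in scores.split())
print(qual)  # -> IG?#
```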

Related

File's absolute values after subtraction in perl

I have the below table (a txt file with \t separators) and I want to write a script that subtracts each line's values from the previous line's and then takes the absolute value of each difference.
43 402 51 360 66
61 63 67 66 65
63 60 69 63 58
65 53 89 55 57
103 138 135 135 85
For example:
abs(61-43) abs(63-402) abs(67-51) abs(66-360) abs(65-66)
abs(63-61) abs(60-63) abs(69-67) abs(63-66) abs(58-65) etc..
This is what I wrote.
#!/usr/bin/perl -w
use strict;
use warnings;
my $filename = '/home/kgee/Desktop/gene_gangs/table_gangs/table_ony_numbers';
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
$newtable=0;
$i=0;
$j=1;
while (my $row = <$fh>) {
chomp $row;
print "$row\n";
my @numbertable = split//,$filename;
for (@numbertable){
$newtable[$j]= ($i+2) -($i+1);
$temp[$i]= abs($newtable[$j]);
$newtable=$tepm[$i];
my @newtable= split//,$newtable;
print("@newtable","\n");
$i=$i+1;
}
}
I got many errors, all of them of the form "Global symbol XXX requires explicit package name (did you forget to declare "my XXX"?) at line XXX".
I read online that to get past this issue you can remove the use warnings; line (from the beginning), which is not recommended, or declare the variable outside the block (and not inside!). I tried both but I still get some warnings.
OK first off: Do not "turn off warnings" ever. Would you turn off the warnings on your car, and expect that to end well? Warnings are there because there's something you need to fix, not ignore.
my is used to declare a variable the first time you use it.
This is as simple as
my $newtable = 0;
You've also added some routes of confusion into your code, and I'd suggest you sort:
Indent it properly
Don't use $newtable and @newtable - they're different variables, and it's way too easy to mix up $newtable and $newtable[0].
You've got $temp and $tepm - this is exactly the sort of thing that use warnings helps you identify.
Splitting $filename to get @numbertable - I'm pretty sure that doesn't do what you want, because it's splitting the string '/home/kgee/Desktop/gene_gangs/table_gangs/table_ony_numbers' into characters. Did you perhaps mean split /\t/, $row;?
Likewise my @newtable= split//,$newtable; ... I don't think that does what you think it does either, because $newtable is 'just' zero as instantiated earlier in your program, and you never modify it.
for (@numbertable) iterates each element in that table (the split row?) but you don't ever use the iterator - $_ is set to the current element on each iteration. It doesn't have anything to do with $i and $j, and you don't actually seem to modify $j at all, so it keeps its initial value.
perl -w and use warnings; are redundant together - you should probably stick with one or the other. (I favour use warnings; along with use strict;, personally.)
Actually, the more I look at the code, I'm afraid it becomes clear it doesn't actually work, and your problems run a bit deeper than your initial warnings.
How about:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @previous_row;
my @new_table;

# iterate line by line - use your file handle here, <DATA> is a special case.
while ( <DATA> ) {
    # strip trailing linefeed.
    chomp;
    # split this row on any whitespace (which includes tabs)
    my @row = split;
    # Handle first iteration - can't subtract 'previous row' if there isn't one.
    if ( @previous_row ) {
        my @new_row;
        # iterate the current row
        foreach my $element ( @row ) {
            # grab the elements off the previous row - note "shift" modifies it, and this will
            # break if you've got uneven length rows. (But you don't, and I'll leave it to you
            # to handle that if you do.)
            my $previous_row_element = shift @previous_row;
            # calculate new value
            my $value = abs( $element - $previous_row_element );
            # stash new value into new row.
            push @new_row, $value;
        }
        # new row is complete, so push it into the new table.
        push @new_table, \@new_row;
    }
    # Update 'previous row' with the contents of the current row.
    @previous_row = @row;
}

# lazy mode output. Iterating the array and printing values in the format you want
# is up to you.
print Dumper \@new_table;
__DATA__
43 402 51 360 66
61 63 67 66 65
63 60 69 63 58
65 53 89 55 57
103 138 135 135 85
See @Sobrique's answer for an excellent explanation of some problems with your script. I just present an alternative approach here:
use feature qw(say);
use strict;
use warnings;

my $filename = 'table_only_numbers';
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename': $!";
my @row1;
while (my $row = <$fh>) {
    chomp $row;
    my @row2 = split " ", $row;
    if (@row1) {
        die "Bad format!" if @row1 != @row2;
        my @new = map { abs($row2[$_] - $row1[$_]) } 0..$#row2;
        say join " ", @new;
    }
    @row1 = @row2;
}
close $fh;

How to extract line after line with matching pattern in perl

I am new to Perl. I am trying the code below to extract the line after each line that contains the string "length" in my text file merge.txt. I want the result shown below. Can someone please tell me what I need to add to my code? Thanks
my @array;
While (@array = '/Users/Desktop/merged.txt') {
foreach my $line (@array) {
if ($line =~ m/length/) {
my $a = $array[$i+1];
push (@array, $a);
}
my $wanted_type = "$1";
print "$wanted_type\n";
}
}
}
merge.txt file has
owner@owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/merged.fastq 32 32
length number A C G T
32 14824945 1992856 1576607 2413263 8756583
owner@owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/merged.fastq 33 33
length number A C G T
33 58619575 1415093 3274505 5499169 48070172
owner@owner-HP-Z840-Workstation ~/SLM/Desktop/Documents
% ./count-smRNAs.pl /media/owner/c92ed9e9-3d94-497c-bb2e-514a4806bbcd/test.fastq 34 34
length number A C G T
34 13018196 1047476 903554 1695778 9296236
The result I want is shown below. I also want to grab the filename from the line that has /c92ed9e9-3d94-497c-bb2e-514a4806bbcd/ in it, as shown below.
merged.fastq
32 14824945 1992856 1576607 2413263 8756583
33 58619575 1415093 3274505 5499169 48070172
test.fastq
34 13018196 1047476 903554 1695778 9296236
my $print;
while (<>) {
    if ( my ($qfn) = /^\s*%\s*\S+\s+(\S+)/ ) {
        ( my $fn = $qfn ) =~ s{^.*/}{}s;
        print("$fn\n");
    }
    print if $print;
    $print = /length/;
}

Awk - Get rows from a file that contain values within ranges described in another file

I have 2 tab-delimited files formatted similar to this:
file 1
A 100 90 PASS
B 89 80 PASS
C 79 70 PASS
D 69 60 FAIL
F 59 0 FAIL
file 2
Randy 80
Denis 44
Earl 97
I want to take the values from column 2 in file 2 and compare them with the ranges given between columns 2 and 3 of file 1. Then I want to create a new file that combines this data, printing columns 1 and 2 from file 2 and columns 1 and 4 from file 1:
file 3
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
I want to implement this using awk or perl.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"}
FNR==NR {
a[$0] = $2
next
}
{
for (i in a)
if ($2>=a[i] && $3<=a[i])
print i, $1, $4
}' file2 file1
Earl 97 A PASS
Randy 80 B PASS
Denis 44 F FAIL
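The awk program checks, for each name in file 2, which row of file 1 brackets its score. The same range lookup as a hedged Python sketch (data inlined rather than read from the two files):

```python
# Grade table: (letter, high, low, result), mirroring file 1.
grades = [('A', 100, 90, 'PASS'), ('B', 89, 80, 'PASS'),
          ('C', 79, 70, 'PASS'), ('D', 69, 60, 'FAIL'),
          ('F', 59, 0, 'FAIL')]
# Names and scores, mirroring file 2.
people = [('Randy', 80), ('Denis', 44), ('Earl', 97)]

for name, score in people:
    for letter, high, low, result in grades:
        if low <= score <= high:
            print(name, score, letter, result)
            break
```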
In perl, I'd probably do something like this:
#!/usr/bin/env perl
use strict;
use warnings 'all';
use Data::Dumper;

open( my $grades_in, '<', "file1.txt" ) or die $!;
my @grade_lookup = map { [split] } <$grades_in>;
print Dumper \@grade_lookup;
close($grades_in);

open( my $people, '<', "file2.txt" ) or die $!;
while (<$people>) {
    chomp;
    my ( $person, $score ) = split;
    my ($grade) = grep { $_->[1] >= $score
                     and $_->[2] <= $score } @grade_lookup;
    print join " ", $person, $score, $grade->[0], $grade->[3], "\n";
}
close($people);
output:
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
In Perl
use strict;
use warnings 'all';
use autodie;
use List::Util 'first';

my @grades = do {
    open my $fh, '<', 'file1.txt';
    map [ split ], <$fh>;
};

open my $fh, '<', 'file2.txt';
while ( <$fh> ) {
    my ($name, $score) = split;
    my $grade = first { $_->[2] <= $score } @grades;
    print "$name $score @$grade[0,3]\n";
}
output
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS

perl text::csv - filtering specific columns in a csv document and discarding others

I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only the columns whose names begin with "cpu.usagemhz.average".
Is there a particular function of Text::CSV that will help me do a quick check of the column names?
Thanks!
JD
** Update **
I tried jimtut's answer and it is extremely close to what I am looking for. Thanks again, everyone!
Here is the code from jimtut with one small edit to the print statement at the bottom. I added the print of $colCount just to see what was going on with the data.
use Text::CSV;
my $file = "foo.csv";
my $pattern = ".*In";
open(F, $file) or warn "Warning! Unable to open $file\n";
my $lineCount = 0;
my %desiredColumns;
while(<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                print "$colCount\t$field\n";
            }
        }
    }
}
close(F);
Here are the results:
colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. What do you think? Something like...
foreach column {
check to see if a hash already exists with that column number. If not then create hash.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
No, not a specific function in Text::CSV. I would do something like this:
use Text::CSV;
my $file = "foo.csv";
my $pattern = "cpu.usage.mhz.average.*";
open(F, $file) or die "Unable to open $file: $!\n";
my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while(<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                push(@{$columnContents{$colCount}}, $field);
            }
        }
    }
}
close(F);
foreach my $key (sort keys %columnContents) {
    print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at index 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them but you can do whatever you like to them.
use Text::CSV;                               # load the module
my $csv = Text::CSV->new();                  # instantiate
open my $fh, '<', 'somefile';                # open the input
while ( my $fields = $csv->getline($fh) ) {  # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";        # print the fields of interest
}
close($fh);                                  # close when done
WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?
If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.
If the former, there are many methods, some are more efficient than the others.
Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there's the column_names() method, provided you previously set the column names using column_names($csv->getline($fh)).
If you want to return only specific columns in a hash, to avoid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method.
To roll your own, you can do something like:
my $headers = $csv->getline( $fh );    # First line is headers.
my @headers_keep = map { /^cpu.usage.mhz.average/ ? 1 : 0 } @$headers;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row;
    push @rows, \@row_new;
}
Alternatively, you can use a "feature" of getline_hr(), which doesn't assign values into the hash for a duplicate column name - only the LAST occurrence gets assigned.
In your case, with column names date, mem_total, cpu.usagemhz.average_0, cpu.usagemhz.average_1, cpu.usagemhz.average_2, merely set the column_names array to contain the value "cpu.usagemhz.average_0" in the first two elements as well - those columns will then NOT be saved by getline_hr().
In general, you can go over the list of columns, find each consecutive range of "not needed" columns, and replace their names with the name of the first needed column following that range. The only sticking point is when the un-needed range is at the very end of the columns - replace those with "JUNK" or something.
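The overwrite behaviour this trick relies on is easy to demonstrate: when several columns share one name, only the last value assigned to that key survives. A hedged Python sketch of the same idea (the column names and row values are illustrative):

```python
headers = ['date', 'mem_total',
           'cpu.usagemhz.average_0', 'cpu.usagemhz.average_1']
row = ['2020-01-01', '8192', '565', '73']

# Rename the unwanted leading columns to the first wanted column's name;
# building a dict keeps only the LAST value assigned to each key.
masked = ['cpu.usagemhz.average_0', 'cpu.usagemhz.average_0'] + headers[2:]
record = dict(zip(masked, row))
print(record)  # only the cpu columns' real values survive
```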

How to print/extract information listed under a column from two dimensional array in Perl?

I have an output file which is a two-dimensional array (this file was generated by a script written to produce a 2D array), and I have to read the information under a particular column, say column 1. In other words, how do I read and print the information listed under column 1 for all rows?
Any suggestions?
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
From the above data I want to extract information column-wise; say, if I want information from column 1, I should be able to list only the following output.
OUTPUT:
1
A
93
Final version after all corrections:
#!/usr/bin/perl
use strict;
use warnings;

my $column_to_show = 0;

while ( <DATA> ) {
    last unless /\S/;
    print +(split)[$column_to_show], "\n";
}

__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
Output:
C:\Temp> u
1
A
93
Explanation of print +(split)[$column_to_show], "\n";:
perldoc -f split:
Splits the string EXPR into a list of strings and returns that list.
...
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace).
So: (split)[3] selects the fourth element of the list returned by split. The + in front of (split) is necessary to help perl parse the expression correctly. See perldoc -f print:
Also be careful not to follow the print keyword with a left parenthesis unless you want the corresponding right parenthesis to terminate the arguments to the print — interpose a + or put parentheses around all the arguments.
I thoroughly recommend that every Perl programmer occasionally skim through all of the documentation's table of contents: perldoc perltoc. It is on your computer.
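The select-one-field-per-line idea translates directly to other languages; a minimal Python sketch of the same loop:

```python
data = """1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91"""

column_to_show = 0
for line in data.splitlines():
    # split on whitespace, keep only the chosen column
    print(line.split()[column_to_show])
```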
my $line;
foreach $line (@DATA)
{
    my @DATA1 = split( / +/, "$line" );
    print "first element of array is $DATA1[0]";
}
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
OUTPUT:-
1
A
93
Try playing with this code. Basically I load the data into an array of arrays
Each line is a reference to a row.
#!/usr/bin/perl
use strict;
use warnings;

my $TwoDimArray;
while (my $line = <DATA>) {
    push @$TwoDimArray, [split(/,/, $line)];
};

for my $column (0..2) {
    print "[$column,0] : " . $TwoDimArray->[0]->[$column] . "\n";
    print "[$column,1] : " . $TwoDimArray->[1]->[$column] . "\n";
    print "\n";
}

__DATA__
1,2,3,04,05,06
7,8,9,10,11,12
The map function is your friend:
open FILE, "data.txt";
while ($line = <FILE>) {
    chomp($line);
    push @data, [split /[, ]+/, $line];
}
close FILE;

@column1 = map {$$_[0]} @data;
print "@column1\n";
And in data.txt something like:
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
13, 14, 15, 16
perl -lne '@F = split /\s+/ and print $F[1]'
This might be what you want:
use English qw<$OS_ERROR>;    # Or just use $!
use IO::Handle;

my @columns;
open my $fh, '<', 'columns.dat' or die "I'm dead. $OS_ERROR";
while ( my $line = <$fh> ) {
    my @cols = split /\s+/, $line;
    $columns[$_][ $fh->input_line_number() - 1 ] = $cols[$_] foreach 0..$#cols;
}
$fh->close();
You can access them directly by element.
$arrays[0][0] = 1;
$arrays[0][1] = 2;
$arrays[1][0] = 3;
$arrays[1][1] = 4;
for (my $i = 0; $i <= $#{$arrays[1]}; $i++) {
    print "row for $i\n";
    print "\tfrom first array: " . $arrays[0][$i] . "\n";
    print "\tfrom second array: " . $arrays[1][$i] . "\n";
}
prints
row for 0
from first array: 1
from second array: 3
row for 1
from first array: 2
from second array: 4