I am trying to merge a fasta file and a qual file into a new fastq file, bearing in mind that the two files might be provided with their sequence IDs in a different order. To handle that, I made the first step of my script the sorting of the sequences, which works perfectly when I test it as a separate script. The same goes for the rest: when I run the part that combines the files into a fastq on its own, it runs perfectly. But now that I am trying to combine the two methods in one script it doesn't work, and I don't know what else to do! I would appreciate it if you can help me.
Here is my script so far. It creates the new fastq file, but the content is messed up and not what I want. I run it from the terminal like this:
$ perl script.pl reads.fasta reads.qual > reads.fq
Script:
#!/usr/bin/env perl
use strict;
use warnings;

die("Usage: script.pl reads.fasta reads.qual > reads.fq") unless (scalar @ARGV) == 2;

open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL, $ARGV[1] or die "cannot open qual: $!\n";

my $offset = 33;
my $count = 0;
local($/) = "\n>";
my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (@sort_q, @sort_f);

while(<FASTA>){
    chomp;
    if($_ =~ /^>(.+)/){
        $id = $1;
    }else{
        $id2seq{$id} .= $_;
    }
}
for $id (sort keys %id2seq)
{
    @sort_f = "$id\n$id2seq{$id}\n\n";
    print @sort_f;
}

while(<QUAL>){
    chomp;
    if($_ =~ /^>(.+)/){
        $idq = $1;
    }else{
        $idq2seq{$idq} .= $_;
    }
}
for $idq (sort keys %idq2seq)
{
    @sort_q = "$idq\n$idq2seq{$idq}\n\n";
    print "@sort_q";
}

while (my @sort_f) {
    chomp @sort_f;
    my ($fid, @seq) = split "\n", @sort_f;
    my $seq = join "", @seq; $seq =~ s/\s//g;
    my $sortq = @sort_q;
    chomp my @sortq;
    my ($qid, @qual) = split "\n", @sortq;
    @qual = split /\s+/, (join( " ", @qual));
    # convert score to character code:
    my @qual2 = map {chr($_+$offset)} @qual;
    my $quals = join "", @qual2;
    die "mismatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
    $fid =~ s/^\>//;
    print STDOUT (join( "\n", "@".$fid, $seq, "+$fid", $quals), "\n");
    $count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";
Thank you in advance
It's been a while since I have used perl, but I would approach this using a hash of key/value pairs for both the fasta and quality input. Then write out all the pairs by looping over the fasta hash and pulling out the corresponding quality string.
I have written something in Python that will do what you need.
It assumes that your input looks like this:
reads.fasta
>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG
reads.qual
>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42
output
@fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
@fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
@fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A#:?#F,2:'KF*20CC:I7F9J.,:E8&?F
import sys

# Check there are enough arguments
if len(sys.argv) < 3:
    print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
    sys.exit(1)

# Initialise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()
fa_input = sys.argv[1]
qual_input = sys.argv[2]

# Read in fasta input
with open(fa_input, 'r') as fa:
    for line in fa:
        line = line.strip()
        if line[0] == '>':
            read_dict[line[1:]] = next(fa).strip()
        else:
            next(fa)

# Read in quality input
with open(qual_input, 'r') as qual:
    for line in qual:
        line = line.strip()
        if line[0] == '>':
            qual_dict[line[1:]] = next(qual).strip()
        else:
            next(qual)

count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
    # Check if read header is in the qualities data
    if key in qual_dict.keys():
        # There's both sequence and quality data so write stdout
        read_str = '@{header}\n{seq}\n+\n{qual}'.format(
            header=key,
            seq=seq,
            qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
        print(read_str, file=sys.stdout)
        count += 1
    else:  # not found
        # Write error to stderr
        print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)

# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)
If you need to use an offset for the quality score, change
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')])) to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')])) and define an offset variable before this.
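For example, with the common Phred+33 convention (the offset value and the sample scores here are assumptions; your data may use a different offset), the conversion looks like this:

```python
# Sketch of the offset conversion, assuming a Phred+33 style offset.
offset = 33
scores = "59 37 38 51"  # hypothetical quality line

# Each score becomes one printable character: score + offset -> chr()
encoded = ''.join(chr(int(x) + offset) for x in scores.split(' '))
print(encoded)
```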
My file looks like this:
1 15
2 16
3 18
4 19
5 25
6 30
7 55
8 45
9 34
10 52
If the matched pattern is 30 in line 6, I would like to grab N lines before and M lines after the line 6, for example if N=3 and M=4 so the result is expected to be like this:
3 18
4 19
5 25
6 30
7 55
8 45
9 34
10 52
I am a very new beginner in Perl and any advice would be appreciated.
UPDATE
Many thanks for all this helpful advice below; I really appreciate it.
Here is my updated code for this and any suggestions are welcome!
my $num;
while (<>)
{
    if ( /pattern/ )
    { $num = $.; }
}
open (,"") || die ("Can't open the file");
while (<>)
{
    if ( $. >= $num-N and $. <= $num+M )
    {
        print OUT "$_ \r";
    }
}
Maintain an array (I'll call it @preceding) of the last N lines read. When the pattern is matched, stop updating this array and start inserting lines into another array (@following). Do this until @following has M lines in it.
It should look something like this (fixed now thanks to ikegami):
my $matched = 0;
my @preceding;
my @following;
while (<>) {
    if ($matched) {
        push(@following, $_);
        last if @following == M;
        next;
    }
    else {
        push(@preceding, $_);
        shift(@preceding) if @preceding > N;
    }
    $matched = 1 if /pattern/;
}
my @lines = <>;
foreach my $idx (grep { $lines[$_] =~ /pattern/ } 0..$#lines) {
    print join('', map { $lines[$_] } grep { $_ >= $idx - $B && $_ <= $idx + $A } 0..$#lines), "\n";
}
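For comparison, the same slurp-then-slice idea sketched in Python (the pattern and the N/M values are placeholders chosen to match the question's example):

```python
import re

N, M = 3, 4                       # context lines before and after the match
pattern = re.compile(r'\b30\b')   # hypothetical pattern

lines = ["1 15", "2 16", "3 18", "4 19", "5 25",
         "6 30", "7 55", "8 45", "9 34", "10 52"]

for idx, line in enumerate(lines):
    if pattern.search(line):
        # Slice out N lines before and M after, clamping at the start of the file
        window = lines[max(idx - N, 0):idx + M + 1]
        print('\n'.join(window))
```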
You can also use the GNU grep command, with -A,-B flags for that exact purpose.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing -- between contiguous groups of
matches.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
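For the sample data above (saved here under a hypothetical file name nums.txt), that looks like:

```shell
# Write the sample data from the question to a file (hypothetical name)
printf '%s\n' '1 15' '2 16' '3 18' '4 19' '5 25' '6 30' '7 55' '8 45' '9 34' '10 52' > nums.txt

# N=3 lines of leading context, M=4 lines of trailing context
# around each line ending in 30
grep -B 3 -A 4 '30$' nums.txt
```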
I have a PDB file with two parts separated by a TER line; I call the part before TER part 1. I want to take the x,y,z of ATOM 1 of the first part (i.e., before TER) and find its distance to all x,y,z coordinates after TER, then do the same for the second ATOM of part one against all ATOMs of part two. This has to be repeated for all ATOMs of the first part against all ATOMs of the second part. I have to automate it for 20 files, named 1_0.pdb, 2_0.pdb, ..., 20_0.pdb.
This is a distance calculation. I have tried something in Perl, but it's very rough. Can someone help a bit?
The File looks like:
----long file (I truncated it)----
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
The code is below; in the end it finds the maximum distance and its coordinates:
my @points = ();
open(IN, $ARGV[0]) or die "$!";
while (my $line = <IN>) {
    chomp($line);
    my @array = (split (/\s+/, $line))[5, 6, 7];
    print "@array\n";
    push @points, [ @array ];
}
close(IN);

$max = 0;
for my $i1 ( 0 .. $#points )
{
    my ( $x1, $y1, $z1 ) = @{ $points[$i1] };
    my $dist = sqrt( ($x1+1.925)**2 + ($y1+11.270)**2 + ($z1-1.404)**2 );
    print "distance from (-1.925 -11.270 1.404) to ( $x1, $y1, $z1 ) is $dist\n";
    if ( $dist > $max )
    {
        $max = $dist;
        $x = $x1;
        $y = $y1;
        $z = $z1;
    }
}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
Not sure I clearly understand what you want, but how about:
#!/usr/local/bin/perl
use strict;
use warnings;

my (@refer, @points);
my $part = 0;
while (my $line = <DATA>) {
    chomp($line);
    if ($line =~ /^TER/) {
        $part++;
        next;
    }
    my @array = (split (/\s+/, $line))[5, 6, 7];
    if ($part == 0) {
        push @refer, [ @array ];
    } else {
        push @points, [ @array ];
    }
}

my %max = (val=>0, x=>0, y=>0, z=>0);
foreach my $ref (@refer) {
    my ($x1, $y1, $z1) = @{$ref};
    foreach my $atom (@points) {
        my ($x, $y, $z) = @{$atom};
        my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
        if ($dist > $max{val}) {
            $max{val} = $dist;
            $max{x} = $x;
            $max{y} = $y;
            $max{z} = $z;
        }
    }
}
print "max is $max{val}; coord: x=$max{x}, y=$max{y}, z=$max{z}\n";
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
output:
max is 35.9813670807545; coord: x=30.175, y=4.389, z=5.768
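As a rough cross-check of that result, here is the same maximum computed in Python, with the coordinates copied from the sample data above (math.dist requires Python 3.8+):

```python
import math

# Coordinates copied from the sample PDB data: part 1 before TER, part 2 after
part1 = [(-1.925, -11.270, 1.404), (-0.279, 9.355, 15.557), (-2.188, 10.341, 15.346)]
part2 = [(29.632, 5.205, 5.525), (30.175, 4.389, 5.768), (28.816, 4.910, 5.008)]

# Maximum Euclidean distance over all cross-part pairs
max_dist = max(math.dist(a, b) for a in part1 for b in part2)
print(max_dist)  # ~35.98, matching the Perl output above
```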
The main issue here is reading the data. First, note that one cannot use split with PDB text files since the fields are defined by position and not by separators. See Coordinate File Description (PDB Format).
To separate the ATOM records of different polymer chains, you can start with a simplified version like:
my $iblock = 0;
my @atoms = ();
while (my $line = <IN>) {
    chomp($line);
    # Switch blocks at TER lines
    if ($line =~ /^TER/) {
        $iblock++;
    # Read ATOM lines
    } elsif ($line =~ m/^ATOM/) {
        my @xyz = (substr($line,7-1,9), substr($line,16-1,9), substr($line,25-1,9));
        printf "Block %d: atom at (%s)\n", $iblock, join(",", @xyz);
        push @{$atoms[$iblock]}, \@xyz;
    # Parse additional line types (if needed)
    } else {
        ...
    }
}
Followed by a loop over all pairs of coordinates from different blocks, structured as follows:
# 1st block
for my $iblock1 (0..$#atoms) {
    # 2nd block
    for my $iblock2 ($iblock1+1..$#atoms) {
        # Compare all pairs of atoms
        ...
        for my $xyz1 (@{$atoms[$iblock1]}) {
            for my $xyz2 (@{$atoms[$iblock2]}) {
                # Calculate distance and compare with $max_dist
                ...
            }
        }
        # Print the maximal distance between these two blocks
        ...
    }
}
Of course, the code could be more general if a more elaborate data structure is used or by applying one of the available PDB parsers, such as Bioperl's.
With proper encapsulation, this is pretty simple, and requires minor modifications of your code.
ETA: Added a fixed-width solution I had on hand. It would probably be best to read all the fields instead of discarding the first 31 chars, and then return them all in a hash reference. That way, you could process all the lines with the same subroutine, and simply switch between parts when the first field turns out to be TER. It should be easy for you to extrapolate this from the given code.
You'll note that the reference values are read in with a loop, because we need to break the loop at the break point. The rest of the values are slurped up with a map statement. Then we simply feed the data to the subroutine we made from your initial code (with some improvements). I used the same names for the lexical variables to make it easier to read the code.
use strict;
use warnings;

my @points;
while (<DATA>) {
    last if /^TER$/;
    push @points, getpoints($_);
}
my @ref = map getpoints($_), <DATA>;

for my $p (@points) {
    getcoords($p, \@ref);
}

sub getpoints {
    my $line = shift;
    my @data = unpack "A31 A8 A8 A8", $line;
    shift @data;
    return \@data;
}

sub getcoords {
    my ($p, $ref) = @_;
    my ($p1, $p2, $p3) = @$p;
    my $max = 0;
    my ($x, $y, $z);
    for my $aref ( @$ref ) {
        my ( $x1, $y1, $z1 ) = @$aref;
        my $dist = sqrt(
            ($x1-$p1)**2 +
            ($y1-$p2)**2 +
            ($z1-$p3)**2
        );
        print "distance from ($p1 $p2 $p3) to ( $x1, $y1, $z1 ) is $dist\n";
        if ( $dist > $max ) {
            $max = $dist;
            $x = $x1;
            $y = $y1;
            $z = $z1;
        }
    }
    print "maximum value is : $max\n";
    print "co ordinates are : $x $y $z\n";
}
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
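For what it's worth, the fixed-width unpack above has a direct analogue in Python slicing. The field widths below mirror the "A31 A8 A8 A8" template and are assumptions about this particular file layout, not real PDB column positions:

```python
# Build a fixed-width record: a 31-char prefix followed by three 8-char fields,
# mirroring the Perl template unpack "A31 A8 A8 A8" (widths are assumptions).
line = "%-31s%8s%8s%8s" % ("ATOM   1282  N   THR    82", "29.632", "5.205", "5.525")

# Slice by position instead of splitting on whitespace
fields = [line[31:39], line[39:47], line[47:55]]
coords = [float(f) for f in fields]
print(coords)
```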
I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only the columns whose names begin with "cpu.usagemhz.average".
Is there a particular function of Text::CSV that will help me do a quick check of the column names?
Thanks!
JD
** Update **
I tried jimtut's answer and it is extremely close to what I am looking for. Thanks again, everyone!
Here is the code from jimtut with one small edit to the print statement at the bottom. I added $colCount to the print just to see what was going on with the data.
use Text::CSV;

my $file = "foo.csv";
my $pattern = ".*In";
open(F, $file) or warn "Warning! Unable to open $file\n";
my $lineCount = 0;
my %desiredColumns;
while (<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                print "$colCount\t$field\n";
            }
        }
    }
}
close(F);
Here are the results:
colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. What do you think? Something like...
foreach column {
check to see if a hash already exists with that column number. If not then create hash.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
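That is essentially the right idea. As a language-neutral sketch in Python (the header names and row values here are hypothetical stand-ins for the foo.csv data), a dict keyed by column number, with one list of values per column, looks like this:

```python
# Hypothetical header and rows standing in for the parsed CSV data
header = ["date", "mem_total", "cpu.usagemhz.average_0", "cpu.usagemhz.average_1"]
rows = [
    ["2021-01-01", "4096", "565", "73"],
    ["2021-01-02", "4096", "464", "62"],
]

# Column indexes whose heading matches the desired prefix
wanted = [i for i, name in enumerate(header) if name.startswith("cpu.usagemhz.average")]

columns = {}                          # column index -> list of field values
for row in rows:
    for i in wanted:
        # Create the list for this column on first use, then append
        columns.setdefault(i, []).append(row[i])

print(columns)
```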
No, not a specific function in Text::CSV. I would do something like this:
use Text::CSV;

my $file = "foo.csv";
my $pattern = "cpu.usagemhz.average.*";
open(F, $file) or die "Unable to open $file: $!\n";
my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while (<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                push(@{$columnContents{$colCount}}, $field);
            }
        }
    }
}
close(F);

foreach my $key (sort keys %columnContents) {
    print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at index 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them but you can do whatever you like to them.
use Text::CSV;                               # load the module
my $csv = Text::CSV->new();                  # instantiate
open my $fh, "<", "somefile" or die $!;      # open the input
while ( my $fields = $csv->getline($fh) ) {  # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";        # print the fields of interest
}
close($fh);                                  # close when done
WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?
If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.
If the former, there are many methods; some are more efficient than others.
Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there's the column_names() method, provided you previously set the column names using column_names(getline($fh)).
If you want to only return specific column names in a hash, to avoid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method:
For the former (roll your own), you can do something like:
my $headers = $csv->getline( $fh );   # First line is headers.
my @headers_keep = map { /^cpu.usagemhz.average/ ? 1 : 0 } @$headers;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row;
    push @rows, \@row_new;
}
You can also use a "feature" of getline_hr(), which doesn't assign values into a hash if the column name is a duplicate (only the LAST version gets assigned).
In your case, for column names date, mem_total, cpu.usagemhz.average_0, cpu.usagemhz.average_1, cpu.usagemhz.average_2, merely set the column_names array to contain the "cpu.usagemhz.average_0" value in the first 2 elements of the array; those columns will NOT then be saved by getline_hr().
You can go over the list of columns, find the consecutive range of "not needed" columns, and replace their names with the name of the first needed column following that range. The only sticking point is if the "un-needed" range is at the very end of the columns; replace those names with "JUNK" or something.
I have an output file containing a two-dimensional array (this file was generated by a script written to produce a 2D array), and I have to read the information under a particular column, say column 1. In other words, how do I read and print out the information listed, corresponding to all the rows, under column 1?
Any suggestions?
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
From the above data I want to extract information column-wise; say, if I want information from column 1, I should be able to list only the following output.
OUTPUT:
1
A
93
Final version after all corrections:
#!/usr/bin/perl
use strict;
use warnings;

my $column_to_show = 0;
while ( <DATA> ) {
    last unless /\S/;
    print +(split)[$column_to_show], "\n";
}
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
Output:
C:\Temp> u
1
A
93
Explanation of print +(split)[$column_to_show], "\n";:
perldoc -f split:
Splits the string EXPR into a list of strings and returns that list.
...
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace).
So: (split)[3] selects the fourth element of the list returned by split. The + in front of (split) is necessary to help perl parse the expression correctly. See perldoc -f print:
Also be careful not to follow the
print keyword with a left parenthesis
unless you want the corresponding
right parenthesis to terminate the
arguments to the print — interpose a +
or put parentheses around all the
arguments.
I thoroughly recommend that every Perl programmer occasionally skim through all of the documentation: perldoc perltoc. It is on your computer.
while ( my $line = <DATA> )
{
    my @DATA1 = split( / +/, $line );
    print "$DATA1[0]\n";
}
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
OUTPUT:-
1
A
93
Try playing with this code. Basically I load the data into an array of arrays
Each line is a reference to a row.
#!/usr/bin/perl
use strict;
use warnings;

my $TwoDimArray;
while (my $line = <DATA>) {
    push @$TwoDimArray, [split(/,/, $line)];
}

for my $column (0..2) {
    print "[$column,0] : " . $TwoDimArray->[0]->[$column] . "\n";
    print "[$column,1] : " . $TwoDimArray->[1]->[$column] . "\n";
    print "\n";
}
__DATA__
1,2,3,04,05,06
7,8,9,10,11,12
The map function is your friend:
open FILE, "data.txt";
while ($line = <FILE>) {
    chomp($line);
    push @data, [split /[, ]+/, $line];
}
close FILE;

@column1 = map { $$_[0] } @data;
print "@column1\n";
And in data.txt something like:
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
13, 14, 15, 16
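The same load-then-pluck pattern, sketched in Python with the data.txt contents above inlined as a string:

```python
# The sample data.txt contents, inlined so the sketch is self-contained
text = """1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
13, 14, 15, 16"""

# Build the 2D array, then take the first element of every row
data = [line.split(", ") for line in text.splitlines()]
column1 = [row[0] for row in data]
print(column1)
```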
perl -lne '@F = split /\s+/ and print $F[1]'
This might be what you want:
use English qw<$OS_ERROR>;  # Or just use $!
use IO::Handle;

my @columns;
open my $fh, '<', 'columns.dat' or die "I'm dead. $OS_ERROR";
while ( my $line = <$fh> ) {
    my @cols = split /\s+/, $line;
    $columns[$_][$fh->input_line_number()-1] = $cols[$_] foreach 0..$#cols;
}
$fh->close();
You can access them directly by element.
$arrays[0][0] = 1;
$arrays[0][1] = 2;
$arrays[1][0] = 3;
$arrays[1][1] = 4;

for (my $i = 0; $i <= $#{$arrays[1]}; $i++) {
    print "row for $i\n";
    print "\tfrom first array: " . $arrays[0][$i] . "\n";
    print "\tfrom second array: " . $arrays[1][$i] . "\n";
}
prints
row for 0
        from first array: 1
        from second array: 3
row for 1
        from first array: 2
        from second array: 4