Mapping SNP coordinates to gene coordinates is too slow - perl
I have two tab-delimited files like these
File 1 (these are Single Nucleotide Polymorphism (SNP) positions)
Chr1 26690
Chr1 33667
Chr1 75049
.
.
Chr2 12342
Chr2 32642
Chr2 424421
.
.
File 2 (these are gene start and end coordinates)
Chr1 2903 10817 LOC_Os01g01010
Chr1 2984 10562 LOC_Os01g01010
Chr1 11218 12435 LOC_Os01g01019
Chr1 12648 15915 LOC_Os01g01030
Chr1 16292 18304 LOC_Os01g01040
Chr1 16292 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 22841 26971 LOC_Os01g01050
Chr1 22841 26971 LOC_Os01g01050
.
.
What I want is to match SNPs in file 1 to genes in file 2. The script should compare the string in the first column of the two files, and where it matches, find which gene in file 2 contains the corresponding SNP and return the locus ID from the fourth column of file 2.
Here's the script I have written
use strict;
my $i1 = $ARGV[0]; # SNP
my $i2 = $ARGV[1]; # gene coordinate
open(I1, $i1);
open(I2, $i2);
my @snp = ();
my @coor = ();
while ( <I1> ) {
    push(@snp, $_);
}
while ( <I2> ) {
    push(@coor, $_);
}
for ( my $i = 0; $i <= $#snp; $i++ ) {
    my @snp_line = split "\t", $snp[$i];
    for ( my $j = 0; $j <= $#coor; $j++ ) {
        my @coor_line = split "\t", $coor[$i];
        if ( $snp_line[0] eq $coor_line[0] ) {
            if ( $snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2] ) {
                print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
                goto a;
            }
        }
    }
    a:
}
The problem is that this is obviously not the best way to do it, since it iterates over all ~60,000 lines in file 2 for each SNP in file 1. It also ran overnight and did not get past Chr1, and we have up to Chr12.
You could work with these files when reformatted as UCSC BED format, using a toolkit like BEDOPS that does efficient set operations on sorted BED files.
Convert your first file of SNPs to a sorted BED file:
$ awk -v OFS="\t" '{ print $1, $2, ($2+1); }' snps.txt | sort-bed - > snps.bed
Sort the genes ("file 2"):
$ sort-bed genes.unsorted.txt > genes.bed
Map SNPs to genes:
$ bedmap --echo --echo-map-id-uniq --delim '\t' snps.bed genes.bed > answer.bed
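With the sample data above, each line of answer.bed pairs a SNP's interval with the IDs of the genes that contain it (multiple IDs get joined with semicolons). For the first Chr1 SNP the result would look roughly like this:

Chr1    26690   26691   LOC_Os01g01050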
If you need to, you can strip the end position of the SNP from the answer:
$ cut -f1,2,4 answer.bed > answer.txt
These tools will run very fast, usually within a few moments.
I would not use Perl or Python to do these kinds of set operations, unless I was doing some kind of academic exercise.
Here is a working script; the one posted above had bugs.
use strict;
my $i1 = $ARGV[0]; # SNP
my $i2 = $ARGV[1]; # gene coordinate
open(I1, $i1);
open(I2, $i2);
my @snp = ();
my @coor = ();
while ( <I1> ) {
    push(@snp, $_);
}
while ( <I2> ) {
    push(@coor, $_);
}
for ( my $i = 0; $i <= $#snp; $i++ ) {
    my @snp_line = split "\t", $snp[$i];
    for ( my $j = 0; $j <= $#coor; $j++ ) {
        my @coor_line = split "\t", $coor[$j];
        if ( $snp_line[0] eq $coor_line[0] ) {
            if ( $snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2] ) {
                print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
            }
        }
    }
}
This one does the job.
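For reference, even a pure-Perl approach need not be quadratic: indexing genes by chromosome and sorting them by start coordinate lets each SNP scan only a small, relevant slice of file 2. A minimal sketch, assuming both files are tab-delimited as shown and fit in memory (the script and variable names are illustrative):

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl snp2gene.pl snps.txt genes.txt
my ( $snp_file, $gene_file ) = @ARGV;

# Index genes by chromosome so each SNP only scans genes on its own chromosome.
my %genes_by_chr;
open my $genes, '<', $gene_file or die "Cannot open $gene_file: $!";
while ( <$genes> ) {
    chomp;
    my ( $chr, $start, $end, $locus ) = split /\t/;
    push @{ $genes_by_chr{$chr} }, [ $start, $end, $locus ];
}
close $genes;

# Sort each chromosome's genes by start coordinate so we can stop early.
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %genes_by_chr;

open my $snps, '<', $snp_file or die "Cannot open $snp_file: $!";
while ( <$snps> ) {
    chomp;
    my ( $chr, $pos ) = split /\t/;
    for my $gene ( @{ $genes_by_chr{$chr} || [] } ) {
        last if $gene->[0] > $pos;    # genes are sorted by start: no later gene can contain $pos
        if ( $pos <= $gene->[1] ) {
            print "$chr\t$pos\t$gene->[2]\n";
            last;                     # report the first containing gene only, as the question's script did
        }
    }
}
close $snps;

Drop the final last if you want every gene containing a SNP, as the working script above reports.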
Related
Insert null values in missing rows of file
I have a text file containing 24 hours of data, timestamped at 10-minute intervals:

2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1

As you can see in the sample above, at a 10-minute interval there should be a total of 143 rows in 24 hours in this file, but after the second-to-last line, which has time 2016-02-06,23:40:00, the data for 2016-02-06,23:50:00 is missing. Similarly, after 2016-02-06,22:40:00 the data for 2016-02-06,22:50:00 is missing. Can we insert the missing date and time followed by 6 nulls separated by commas, e.g.

2016-02-06,22:50:00,null,null,null,null,null,null

wherever any data is missing in the rows of this file, based on the expected count of 143 rows and a timestamp comparison against the range 2016-02-06,00:00:00 to 2016-02-06,23:50:00, which is also 143 in count?

Here is what I have tried: I created a file with 143 entries of date and time as 2.csv and used the command below:

join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.1,2.1,2.2 <(sort -k2 1.csv) <(sort -k2 2.csv) | grep "2016-02-06,21:30:00" | sort -u | sed "s/\t//g" > 3.txt

Part of the output is repetitive, like this:

2016-02-06,21:30:00 2016-02-06,21:30:00 2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00 2016-02-06,21:30:00 2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00 2016-02-06,21:30:00 2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00 2016-02-06,21:30:00 2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,21:30:00 2016-02-06,21:30:00 2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,21:30:00

Any suggestions?
I'd actually not cross-reference a new CSV file, and would instead do it like this:

#!/usr/bin/env perl
use strict;
use warnings;
use Time::Piece;

my $last_timestamp;
my $interval = 600;

# read stdin line by line
while ( <> ) {
    # extract date and time from this line
    my ( $date, $time, @fields ) = split /,/;

    # parse the timestamp
    my $timestamp = Time::Piece->strptime( $date . $time, "%Y-%m-%d%H:%M:%S" );

    # set last if undefined
    $last_timestamp //= $timestamp;

    # if there's a gap, print "null" rows to fill it (one per missing interval)
    # print "GAP detected at $timestamp: ", $timestamp - $last_timestamp, "\n";
    while ( $last_timestamp + $interval < $timestamp ) {
        $last_timestamp += $interval;
        print join( ",", $last_timestamp->strftime("%Y-%m-%d,%H:%M:%S"), ("null") x 6 ), "\n";
    }

    $last_timestamp = $timestamp;
    print;
}

Which for your sample gives me lines (snipped for brevity):

2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2

Note: this assumes the timestamps are exactly 600s apart. You can adjust the logic a little if that isn't a valid assumption, but it depends on exactly what you're trying to get at that point.
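It reads from standard input and writes to standard output, so (with an illustrative script name) it could be run as:

perl fill_gaps.pl < 1.csv > filled.csv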
Here's another Perl solution. It initialises $date to the date contained in the first line of the file, and a time of 00:00:00. It then fills the %values hash with records using the value of $date as a key, incrementing the value by ten minutes until the day of month changes; these form the "default" values. Then the contents of the file are used to overwrite all elements of %values for which we have an actual value, so any gaps will remain set to their default from the previous step. Then the hash is simply printed in sorted order, resulting in a full set of data with defaults inserted as necessary.

use strict;
use warnings 'all';

use Time::Piece;
use Time::Seconds 'ONE_MINUTE';
use Fcntl ':seek';

my $delta = 10 * ONE_MINUTE;
my $date  = Time::Piece->strptime(<ARGV> =~ /^([\d-]+)/, '%Y-%m-%d');

my %values;
for ( my $day = $date->mday; $date->mday == $day; $date += $delta ) {
    my $ds = $date->strftime('%Y-%m-%d,%H:%M:%S');
    $values{$ds} = $ds . ',null' x 6 . "\n";
}

seek ARGV, 0, SEEK_SET;

while ( <ARGV> ) {
    my ($ds) = /^([\d-]+,[\d:]+)/;
    $values{$ds} = $_;
}

print $values{$_} for sort keys %values;
Here is the answer:

cat 1.csv 2.csv | sort -u -t, -k2,2
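Note that this relies on 2.csv being a template that already contains one row for every 10-minute slot of the day, padded with null fields. A minimal sketch to generate such a template (the date is hard-coded to match the sample):

perl -e 'for my $h (0..23) { for my $m (0, 10, 20, 30, 40, 50) { printf "2016-02-06,%02d:%02d:00,null,null,null,null,null,null\n", $h, $m } }' > 2.csv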
...or a shell script:

#!/bin/bash

set -e
file=$1
today=$(head -1 $file | cut -d, -f1)
line=0

for (( h = 0 ; h < 24 ; h++ ))
do
    for (( m = 0 ; m < 60 ; m += 10 ))
    do
        stamp=$(printf "%02d:%02d:00" $h $m)
        if [ $line -eq 0 ]; then
            IFS=',' read date time data
        fi
        if [ "$time" = "$stamp" ]; then
            echo $date,$time,$data
            line=0
        else
            echo $today,$stamp,null,null,null,null,null,null
            line=1
        fi
    done
done <$file
I would write it like this in Perl. This program expects the name of the input file as a parameter on the command line, and prints its output to STDOUT, which may be redirected as normal.

use strict;
use warnings 'all';
use feature 'say';

use Time::Piece;
use Time::Seconds 'ONE_MINUTE';

my $format = '%Y-%m-%d,%H:%M:%S';
my $delta  = 10 * ONE_MINUTE;

my $next;

our @ARGV = 'mydates.txt';

while ( <> ) {
    my $new = Time::Piece->strptime(/^([\d-]+,[\d:]+)/, $format);
    while ( $next and $next < $new ) {
        say $next->strftime($format) . ',null' x 6;
        $next += $delta;
    }
    print;
    $next = $new + $delta;
}

while ( $next and $next->hms('') > 0 ) {
    say $next->strftime($format) . ',null' x 6;
    $next += $delta;
}

output

2016-02-06,00:00:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:10:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:20:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:30:00,null,null,null,null,null,null
2016-02-06,00:40:00,ujjawal,36072-2,MT,37,0,1
2016-02-06,00:50:00,ujjawal,36072-2,MT,42,0,2
2016-02-06,01:00:00,ujjawal,36072-2,MT,55,0,2
2016-02-06,01:10:00,ujjawal,36072-2,MT,41,0,2
2016-02-06,01:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,01:30:00,ujjawal,36072-2,MT,56,0,3
2016-02-06,01:40:00,ujjawal,36072-2,MT,38,0,2
2016-02-06,01:50:00,ujjawal,36072-2,MT,49,0,2
2016-02-06,02:00:00,ujjawal,36072-2,MT,58,0,4
2016-02-06,02:10:00,ujjawal,36072-2,MT,43,0,2
2016-02-06,02:20:00,ujjawal,36072-2,MT,46,0,2
2016-02-06,02:30:00,ujjawal,36072-2,MT,61,0,2
2016-02-06,02:40:00,ujjawal,36072-2,MT,57,0,3
2016-02-06,02:50:00,ujjawal,36072-2,MT,45,0,2
2016-02-06,03:00:00,ujjawal,36072-2,MT,45,0,3
2016-02-06,03:10:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:20:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:30:00,ujjawal,36072-2,MT,51,0,2
2016-02-06,03:40:00,ujjawal,36072-2,MT,68,0,3
2016-02-06,03:50:00,ujjawal,36072-2,MT,67,0,3
2016-02-06,04:00:00,ujjawal,36072-2,MT,82,0,8
2016-02-06,04:10:00,ujjawal,36072-2,MT,82,0,5
2016-02-06,04:20:00,ujjawal,36072-2,MT,122,0,4
2016-02-06,04:30:00,ujjawal,36072-2,MT,133,0,3
2016-02-06,04:40:00,ujjawal,36072-2,MT,142,0,3
2016-02-06,04:50:00,ujjawal,36072-2,MT,202,0,1
2016-02-06,05:00:00,ujjawal,36072-2,MT,731,1,3
2016-02-06,05:10:00,ujjawal,36072-2,MT,372,0,7
2016-02-06,05:20:00,ujjawal,36072-2,MT,303,0,2
2016-02-06,05:30:00,ujjawal,36072-2,MT,389,0,3
2016-02-06,05:40:00,ujjawal,36072-2,MT,454,0,1
2016-02-06,05:50:00,ujjawal,36072-2,MT,406,0,6
2016-02-06,06:00:00,ujjawal,36072-2,MT,377,0,1
2016-02-06,06:10:00,ujjawal,36072-2,MT,343,0,5
2016-02-06,06:20:00,ujjawal,36072-2,MT,370,0,2
2016-02-06,06:30:00,ujjawal,36072-2,MT,343,0,9
2016-02-06,06:40:00,ujjawal,36072-2,MT,315,0,8
2016-02-06,06:50:00,ujjawal,36072-2,MT,458,0,3
2016-02-06,07:00:00,ujjawal,36072-2,MT,756,1,3
2016-02-06,07:10:00,ujjawal,36072-2,MT,913,1,3
2016-02-06,07:20:00,ujjawal,36072-2,MT,522,0,3
2016-02-06,07:30:00,ujjawal,36072-2,MT,350,0,7
2016-02-06,07:40:00,ujjawal,36072-2,MT,328,0,6
2016-02-06,07:50:00,ujjawal,36072-2,MT,775,1,3
2016-02-06,08:00:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,08:10:00,ujjawal,36072-2,MT,308,0,6
2016-02-06,08:20:00,ujjawal,36072-2,MT,738,1,3
2016-02-06,08:30:00,ujjawal,36072-2,MT,294,0,6
2016-02-06,08:40:00,ujjawal,36072-2,MT,345,0,1
2016-02-06,08:50:00,ujjawal,36072-2,MT,367,0,6
2016-02-06,09:00:00,ujjawal,36072-2,MT,480,0,3
2016-02-06,09:10:00,ujjawal,36072-2,MT,390,0,3
2016-02-06,09:20:00,ujjawal,36072-2,MT,436,0,3
2016-02-06,09:30:00,ujjawal,36072-2,MT,1404,2,3
2016-02-06,09:40:00,ujjawal,36072-2,MT,346,0,3
2016-02-06,09:50:00,ujjawal,36072-2,MT,388,0,3
2016-02-06,10:00:00,ujjawal,36072-2,MT,456,0,2
2016-02-06,10:10:00,ujjawal,36072-2,MT,273,0,7
2016-02-06,10:20:00,ujjawal,36072-2,MT,310,0,3
2016-02-06,10:30:00,ujjawal,36072-2,MT,256,0,7
2016-02-06,10:40:00,ujjawal,36072-2,MT,283,0,3
2016-02-06,10:50:00,ujjawal,36072-2,MT,276,0,3
2016-02-06,11:00:00,ujjawal,36072-2,MT,305,0,1
2016-02-06,11:10:00,ujjawal,36072-2,MT,310,0,9
2016-02-06,11:20:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:30:00,ujjawal,36072-2,MT,286,0,3
2016-02-06,11:40:00,ujjawal,36072-2,MT,247,0,7
2016-02-06,11:50:00,ujjawal,36072-2,MT,366,0,2
2016-02-06,12:00:00,ujjawal,36072-2,MT,294,0,2
2016-02-06,12:10:00,ujjawal,36072-2,MT,216,0,5
2016-02-06,12:20:00,ujjawal,36072-2,MT,233,0,1
2016-02-06,12:30:00,ujjawal,36072-2,MT,785,1,2
2016-02-06,12:40:00,ujjawal,36072-2,MT,466,0,1
2016-02-06,12:50:00,ujjawal,36072-2,MT,219,0,9
2016-02-06,13:00:00,ujjawal,36072-2,MT,248,0,6
2016-02-06,13:10:00,ujjawal,36072-2,MT,223,0,7
2016-02-06,13:20:00,ujjawal,36072-2,MT,276,0,8
2016-02-06,13:30:00,ujjawal,36072-2,MT,219,0,6
2016-02-06,13:40:00,ujjawal,36072-2,MT,699,1,2
2016-02-06,13:50:00,ujjawal,36072-2,MT,439,0,2
2016-02-06,14:00:00,ujjawal,36072-2,MT,1752,2,3
2016-02-06,14:10:00,ujjawal,36072-2,MT,203,0,5
2016-02-06,14:20:00,ujjawal,36072-2,MT,230,0,7
2016-02-06,14:30:00,ujjawal,36072-2,MT,226,0,1
2016-02-06,14:40:00,ujjawal,36072-2,MT,195,0,6
2016-02-06,14:50:00,ujjawal,36072-2,MT,314,0,1
2016-02-06,15:00:00,ujjawal,36072-2,MT,357,0,2
2016-02-06,15:10:00,ujjawal,36072-2,MT,387,0,9
2016-02-06,15:20:00,ujjawal,36072-2,MT,1084,1,3
2016-02-06,15:30:00,ujjawal,36072-2,MT,1295,2,3
2016-02-06,15:40:00,ujjawal,36072-2,MT,223,0,8
2016-02-06,15:50:00,ujjawal,36072-2,MT,254,0,1
2016-02-06,16:00:00,ujjawal,36072-2,MT,252,0,7
2016-02-06,16:10:00,ujjawal,36072-2,MT,268,0,1
2016-02-06,16:20:00,ujjawal,36072-2,MT,242,0,1
2016-02-06,16:30:00,ujjawal,36072-2,MT,254,0,9
2016-02-06,16:40:00,ujjawal,36072-2,MT,271,0,3
2016-02-06,16:50:00,ujjawal,36072-2,MT,244,0,7
2016-02-06,17:00:00,ujjawal,36072-2,MT,281,0,1
2016-02-06,17:10:00,ujjawal,36072-2,MT,190,0,8
2016-02-06,17:20:00,ujjawal,36072-2,MT,187,0,1
2016-02-06,17:30:00,ujjawal,36072-2,MT,173,0,9
2016-02-06,17:40:00,ujjawal,36072-2,MT,140,0,5
2016-02-06,17:50:00,ujjawal,36072-2,MT,147,0,6
2016-02-06,18:00:00,ujjawal,36072-2,MT,109,0,4
2016-02-06,18:10:00,ujjawal,36072-2,MT,99,0,1
2016-02-06,18:20:00,ujjawal,36072-2,MT,66,0,6
2016-02-06,18:30:00,ujjawal,36072-2,MT,67,0,4
2016-02-06,18:40:00,ujjawal,36072-2,MT,40,0,2
2016-02-06,18:50:00,ujjawal,36072-2,MT,52,0,3
2016-02-06,19:00:00,ujjawal,36072-2,MT,40,0,3
2016-02-06,19:10:00,ujjawal,36072-2,MT,30,0,2
2016-02-06,19:20:00,ujjawal,36072-2,MT,25,0,3
2016-02-06,19:30:00,ujjawal,36072-2,MT,35,0,4
2016-02-06,19:40:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,19:50:00,ujjawal,36072-2,MT,97,0,7
2016-02-06,20:00:00,ujjawal,36072-2,MT,14,0,1
2016-02-06,20:10:00,ujjawal,36072-2,MT,12,0,4
2016-02-06,20:20:00,ujjawal,36072-2,MT,11,0,2
2016-02-06,20:30:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,20:40:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,20:50:00,ujjawal,36072-2,MT,13,0,2
2016-02-06,21:00:00,ujjawal,36072-2,MT,5,0,1
2016-02-06,21:10:00,ujjawal,36072-2,MT,12,0,2
2016-02-06,21:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,21:30:00,ujjawal,36072-2,MT,21,0,2
2016-02-06,21:40:00,null,null,null,null,null,null
2016-02-06,21:50:00,ujjawal,36072-2,MT,9,0,3
2016-02-06,22:00:00,ujjawal,36072-2,MT,2,0,1
2016-02-06,22:10:00,ujjawal,36072-2,MT,12,0,5
2016-02-06,22:20:00,ujjawal,36072-2,MT,1,0,1
2016-02-06,22:30:00,ujjawal,36072-2,MT,9,0,1
2016-02-06,22:40:00,ujjawal,36072-2,MT,13,0,1
2016-02-06,22:50:00,null,null,null,null,null,null
2016-02-06,23:00:00,ujjawal,36072-2,MT,20,0,2
2016-02-06,23:10:00,ujjawal,36072-2,MT,10,0,3
2016-02-06,23:20:00,ujjawal,36072-2,MT,10,0,1
2016-02-06,23:30:00,ujjawal,36072-2,MT,6,0,1
2016-02-06,23:40:00,ujjawal,36072-2,MT,12,0,1
2016-02-06,23:50:00,null,null,null,null,null,null
How to calculate inverse log2 ratio of a UCSC wiggle file using perl?
I have 2 separate files, A and B, containing the same header lines but 2 and 1 data columns respectively. I want to take the inverse log2 of the 2nd column (file A) or the 1st column (file B) but keep the other description lines intact. In file A, $1 and $2 are separated by a tab delimiter.

file A

track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831

file B

track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831

I expect an output like this.

file A

track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2

for file B (in this file the values are just copied from file A)

track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:

awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file

This matches lines that contain only digits, periods and space characters, replacing the last field $NF with 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.

Testing it out on your new files:

$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620

$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:

use strict;

open IN, $ARGV[0];
while (<IN>) {
    chomp;
    if (/^(.*)[\t ]*(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = 2 ** $2;
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) {         # format "m.mmmmm"
        my $power = 2 ** $1;
        print($power . "\n");
    } else {                           # echo all other stuff
        print;
        print("\n");
    }
}
close IN;

If you run <file>.pl <datafile> (replace with appropriate names) it will convert one file so that the lines carry 2**<2nd value>. It simply echoes the lines that do not match the number pattern.
This is the modified little script of @ThomasKilian. Thanks to him for providing the framework.

use strict;

open IN, $ARGV[0];
while (<IN>) {
    chomp;
    if (/^(\d*)[\t ]*(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = 2 ** $2;
        $power = sprintf("%.12f", $power);
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) {          # format "m.mmmmm"
        my $power = 2 ** $1;
        $power = sprintf("%.12f", $power);
        print($power . "\n");
    } else {                            # echo all other stuff
        print;
        print("\n");
    }
}
close IN;
How to convert PHYLIP format to FASTA
I have just started working with Perl and I have a question. I have a PHYLIP file and I need to convert it into FASTA. I have started writing a script. First, I removed the spaces in the lines; now I need to reflow the alignment so that every line holds 60 amino acids, with each sequence identifier printed on its own line. Maybe someone could give me some advice?
The BioPerl Bio::AlignIO module might help. It supports the PHYLIP sequence format:

phylip2fasta.pl

use strict;
use warnings;

use Bio::AlignIO;
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format

my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;

my $in  = Bio::AlignIO->new(-file => $inputfilename, -format => 'phylip', -interleaved => 1);
my $out = Bio::AlignIO->new(-fh => \*STDOUT, -format => 'fasta');

while ( my $aln = $in->next_aln() ) {
    $out->write_aln($aln);
}

$ perl phylip2fasta.pl test.phylip
>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

test.phylip (from http://evolution.genetics.washington.edu/phylip/doc/sequence.html):

5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
If you have access to BioPerl, I suggest using that (see the other answer). If not, here is a quick script I used in an old homework assignment a few years ago. It may work for you. One thing to note: it prints each whole FASTA sequence on one line, so you have to edit the print statement at the end to print 70 AA per line.

#!/usr/bin/perl
use warnings;
use strict;

<DATA> =~ /(\d+)/;    # first number is number of species
my $num_species = $1;

my $i = 0;
my @species;
my @acids;

# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {
    my @line = split /\s+/, <DATA>;
    chomp @line;
    push @species, shift(@line);
    push @acids, join("", @line);
}

# Get the rest of the AAs
$i = 0;
while (<DATA>) {
    chomp;
    $_ =~ s/\r//g;     # remove \r
    next if !$_;
    $_ =~ s/\s+//g;    # remove spaces
    $acids[$i] .= $_;
    $i = ++$i % $num_species;
}

# Print them
for ($i = 0; $i < $num_species; $i++) {
    print "> ", $species[$i], "\n";
    # uncomment next line if you want to remove the gaps ("-")
    $acids[$i] =~ s/-//g;
    print $acids[$i], "\n\n";
}

# Simple PHYLIP Amino Acid file
__DATA__
10 234
Cow       MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp      MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken   MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human     MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach     MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse     MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat       MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal      MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale     MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog      MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL

THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI

GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT

RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC

GSNHSFMPIV LELVPLKYFE KWSASML--- ----
GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
GANHSFMPIV LELIPLKIFE M-------GP VFTL
GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--

Output:

> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML

> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS

> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS

> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL

> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS

> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML

> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML

> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
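Since the question asks for 60 amino acids per line rather than 70, the final print loop can wrap each sequence before printing. A minimal sketch of that change, replacing the last print in the loop above (the {1,60} quantifier also handles the short final chunk):

    my $seq = $acids[$i];
    $seq =~ s/(.{1,60})/$1\n/g;    # insert a newline after every 60 characters
    print "> ", $species[$i], "\n", $seq, "\n";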
Adding columns to a file based on existing columns
I am trying to modify a file which is set up like this:

chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A

I want to modify it such that, if column "ref" holds a string longer than 1 character (e.g. line 2), I generate 2 new columns where:

first new column = start coordinate - 1
second new column = start coordinate + (length of string in ref) + 1

Therefore, for line 2 the output would look like:

chr1 135419 TATACA T 135418 135426

Or, if the string in "ref" has length 1 and column "alt" holds a string longer than 1 (e.g. line 1), then:

first new column = start coordinate
second new column = start coordinate + 2

So the output for line 1 would be:

chr1 18884 C CAAAA 18884 18886

I have tried to do this in awk but without success. My Perl is non-existent, but would this be the best way? Or maybe in R?
Perl solution. Note that your specification does not mention what to do if both strings have length 1.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);
#use Data::Dumper;

<DATA>;    # Skip the header.

while (<DATA>) {
    my ($chr, $start, $ref, $alt) = split;
    my @cols;
    if (1 < length $ref) {
        @cols = ($start - 1, $start + 1 + length $ref);
    } elsif (1 < length $alt) {
        @cols = ($start, $start + 2);
    } else {
        warn "Don't know what to do at $.\n";
    }
    say join "\t", $chr, $start, $ref, $alt, @cols;
}

__DATA__
chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A
Here's one way using awk. Run like:

awk -f script.awk file | column -t

Contents of script.awk:

NR==1 {
    next
}

length($3)>1 && length($4)==1 {
    print $0, $2-1, $2+length($3)+1
    next
}

length($3)==1 && length($4)>1 {
    print $0, $2, $2+2
    next
}1

Results:

chr1  18884   C       CAAAA    18884   18886
chr1  135419  TATACA  T        135418  135426
chr1  332045  T       TTG      332045  332047
chr1  453838  T       TAC      453838  453840
chr1  567652  T       TG       567652  567654
chr1  602541  TTTA    T        602540  602546
chr1  614937  C       CTCTCTG  614937  614939
chr1  654889  C       CA       654889  654891
chr1  736800  AC      A        736799  736803

Alternatively, here's the one-liner:

awk 'NR==1 { next } length($3)>1 && length($4)==1 { print $0, $2-1, $2+length($3)+1; next } length($3)==1 && length($4)>1 { print $0, $2, $2+2; next }1' file | column -t

The code should be pretty self-explanatory. The 1 on the end of the script simply enables default printing (i.e. '1' returns true) of each line. HTH.
Doing it in Perl is trivial (but so is doing it in awk):

#!/usr/bin/perl
while (<>) {
    chomp;
    my ($chr, $start, $ref, $alt) = split(/\s+/, $_);
    if (length($ref) > 1) {
        print STDOUT "$chr\t$start\t$ref\t$alt\t", $start - 1, "\t", $start + length($ref) + 1, "\n";
    } elsif (length($alt) > 1) {
        print STDOUT "$chr\t$start\t$ref\t$alt\t", $start, "\t", $start + 2, "\n";
    } else {
        print STDERR "ERROR: ???\n";    # both strings length 1
    }
}

Stick it in a file morecols.pl, chmod +x morecols.pl, and run ./morecols.pl file. (Beware, there are lots of assumptions in this code/these instructions.) I have a feeling your actual problem is more with programming/text processing than with tools or languages. If so, this code is just a stopgap solution... Cheers.
How to extract part of a line in a text file and print it to an output file in Perl (Code half-written)
I have a large .txt file, a part of which is shown below:

ID SNP
FT SNP 102
FT /note="refAllele: C SNPstrains: 4395_8_10=A 4395_8_7=A 4395_8_9=A "
FT /colour=1
FT SNP 1299
FT /note="refAllele: A SNPstrains: 6437_8_6=T (non-synonymous) (AA Gin->His) "
FT /colour=2
FT SNP 2134
FT /note="refAllele: C SNPstrains: 4395_8_12=T "
FT /colour=1
FT SNP 3205
FT /note="refAllele: C SNPstrains: 6437_8_12=T (synonymous) "

I have this script as well (which I did not write):

$cod{1} = "Int";
$cod{2} = "non";
$cod{3} = "syn";
$cod{4} = "stop";

$file = "Whole.pl";
open IN, "$file";
open OUT, ">whole2";

print OUT "Coordinate Type Strains\n";

while (<IN>) {
    if (m/^FT\s+SNP\s+(\d+)/) {
        $SNP = $1;
    } elsif (m/^FT\s+\/note="(.*)"/) {
        $line = $1;
        $count = ($line =~ tr/=/=/);
    } elsif (m/^FT\s+\/colour=(\d+)/) {
        if ($cod{$1}) {
            print OUT "$SNP $cod{$1} $count\n";
        } elsif (!$cod{$1}) {
            print OUT "$SNP colour $1 $count\n";
        }
    }
}

It creates a new file. For the above data it would create this:

Coordinate Type Strains
102 Int 3
1299 non 1
2134 Int 1
3205 syn 1

I am very new to Perl and programming in general, and I think I just about understand what this script is doing. However, for strains that show a non-synonymous mutation (such as the second one in the .txt file), I would like to have a fourth column in the output file which details the amino acid change (e.g. (AA Gin->His), at the end of the sixth line in the .txt file). Also, I would ideally like to have just the non-synonymous mutations shown in the output, leaving out the "syn" and "Int" rows altogether. I have tried numerous ways to do this but none have worked. Please can you show me a way to do this? Many thanks in advance. Max
Assumptions: your /note may contain an amino acid change as its last element. It must be enclosed in parens and start with the letters AA, followed by a sequence of one or more letters, followed by ->, followed by another sequence of one or more letters. You are only interested in the non type.

In your first elsif, we have to match $line against a possible amino acid change (using a conditional so that $change is empty when there is no match, rather than keeping a stale $1):

} elsif (m/^FT\s+\/note="(.*)"/) {
    $line = $1;
    $change = ($line =~ m/\((AA \w+->\w+)\)\s*$/) ? $1 : "";
    ...;

In your second elsif, we print only if $cod{$1} is equal to non:

} elsif (m/^FT\s+\/colour=(\d+)/) {
    print OUT "$SNP $count $change\n" if $cod{$1} eq "non";
    # inner if/else not needed any longer
}

Also, the table headers at the top have to change:

print OUT "Coordinate Strains Change\n";

You will have to re-align the columns manually. This would print something like

Coordinate Strains Change
1299 1 AA Gin->His

on the example input.
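If you'd rather not re-align the columns by hand, a printf with fixed field widths keeps them lined up; a small sketch of that variation (the widths are arbitrary):

printf OUT "%-12s %-8s %s\n", $SNP, $count, $change;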