Adding columns to a file based on existing columns - perl

I am trying to modify a file which is set up like this:
chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A
I want to modify it such that:
if column "ref" contains a string of length >1 (e.g. line 2), then I generate 2 new columns where:
first new column = start coordinate-1
second new column = start coordinate+(length of string in ref)+1
therefore, for line 2 output would look like:
chr1 135419 TATACA T 135418 135426
or:
if the length of the string in "ref" is 1 and column "alt" is a string of length >1 (e.g. line 1), then
first new column = start coordinate
second new column = start coordinate+2
so, output for line 1 would be:
chr1 18884 C CAAAA 18884 18886
I have tried to do this in awk but without success.
My Perl is non-existent, but would Perl be the best way? Or maybe R?

Perl solution. Note that your specification does not mention what to do if both strings are length 1.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);
#use Data::Dumper;

<DATA>; # Skip the header.
while (<DATA>) {
    my ($chr, $start, $ref, $alt) = split;
    my @cols;
    if (1 < length $ref) {
        @cols = ($start - 1, $start + 1 + length $ref);
    } elsif (1 < length $alt) {
        @cols = ($start, $start + 2);
    } else {
        warn "Don't know what to do at line $.\n";
    }
    say join "\t", $chr, $start, $ref, $alt, @cols;
}
__DATA__
chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A

Here's one way using awk. Run like:
awk -f script.awk file | column -t
Contents of script.awk:
NR==1 {
    next
}
length($3)>1 && length($4)==1 {
    print $0, $2-1, $2+length($3)+1
    next
}
length($3)==1 && length($4)>1 {
    print $0, $2, $2+2
    next
}1
Results:
chr1 18884 C CAAAA 18884 18886
chr1 135419 TATACA T 135418 135426
chr1 332045 T TTG 332045 332047
chr1 453838 T TAC 453838 453840
chr1 567652 T TG 567652 567654
chr1 602541 TTTA T 602540 602546
chr1 614937 C CTCTCTG 614937 614939
chr1 654889 C CA 654889 654891
chr1 736800 AC A 736799 736803
Alternatively, here's the one-liner:
awk 'NR==1 { next } length($3)>1 && length($4)==1 { print $0, $2-1, $2+length($3)+1; next } length($3)==1 && length($4)>1 { print $0, $2, $2+2; next }1' file | column -t
The code should be pretty self-explanatory. The 1 on the end of the script simply enables default printing (i.e. '1' returns true) of each line. HTH.
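As a tiny illustration of that trailing 1 idiom, on throwaway input:

```shell
# '1' is a pattern that is always true with no action attached,
# so awk runs the default action { print } on every line:
printf 'a\nb\n' | awk '1'
```

This is why unmatched lines (here, the header-skipping and both length conditions falling through) still get printed unchanged.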

Doing it in perl is trivial (but so is in awk):
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    my ($chr, $start, $ref, $alt) = split(/\s+/, $_);
    if (length($ref) > 1) {
        print STDOUT
            "$chr\t$start\t$ref\t$alt\t",
            $start - 1, "\t", $start + length($ref) + 1, "\n";
    } elsif (length($ref) == 1) {
        print STDOUT
            "$chr\t$start\t$ref\t$alt\t",
            $start, "\t", $start + 2, "\n";
    } else {
        print STDERR "ERROR: ???\n"; # actually impossible
    }
}
Stick it in a file morecols.pl, chmod +x morecols.pl, then run ./morecols.pl yourfile. (Beware: lots of assumptions in this code/instructions.) I have a feeling your actual problem is more with programming/text processing than with tools or languages. If so, this code is just a stopgap solution...
Cheers.

Related

Perl command not giving expected output

Command:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
Input:
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Current Output:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
1234 "012345" "0"
0.000 "01234" "0"
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
0asdf 0we23-1
Expected Output:
1234 "012345" 0
0.000 "01234" 0
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
"0asdf" "0we23-1"
Conditions the output should follow:
Any string starting with 0 and containing further characters (but not / or .) should be double-quoted.
A string consisting of just the single character 0 should not be quoted.
Spacing between strings should be preserved.
This appears to do what you want:
#!/usr/bin/env perl
use strict;
use warnings;
while ( <DATA> ) {
    my @fields = split;
    s/^(0[^\.\/]+)$/"$1"/ for @fields;
    print join( " ", @fields ), "\n";
}
__DATA__
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Note: it doesn't strictly preserve whitespace like you asked; it just removes it and reinserts a single space. That seems to meet your spec, but you could instead use:
my @fields = split /(\s+)/;
as this would capture the spaces too, and then rejoin with:
join "", @fields;
This is reducible to a one liner using -a for autosplitting:
perl -lane 's/^(0[^\.\/]+)$/"$1"/ for @F; print join " ", @F'
If you wanted to do the second bit (preserving whitespace strictly) then you'd need to drop the -a and use split yourself.

Mapping SNP coordinates to gene coordinates is too slow

I have two tab-delimited files like these
File 1 (these are Single Nucleotide Polymorphism (SNP) positions)
Chr1 26690
Chr1 33667
Chr1 75049
.
.
Chr2 12342
Chr2 32642
Chr2 424421
.
.
File 2 (these are gene start and end coordinates)
Chr1 2903 10817 LOC_Os01g01010
Chr1 2984 10562 LOC_Os01g01010
Chr1 11218 12435 LOC_Os01g01019
Chr1 12648 15915 LOC_Os01g01030
Chr1 16292 18304 LOC_Os01g01040
Chr1 16292 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 22841 26971 LOC_Os01g01050
Chr1 22841 26971 LOC_Os01g01050
.
.
What I want is to match SNPs in file 1 to genes in file 2. The script should match the string in the first column of the files, and if they match it should then find which gene in the file 2 contains the corresponding SNP and return the locus ID from the fourth column of File 2.
Here's the script I have written
use strict;
my $i1 = $ARGV[0]; # SNP
my $i2 = $ARGV[1]; # gene coordinate
open(I1, $i1);
open(I2, $i2);
my @snp  = ();
my @coor = ();
while ( <I1> ) {
    push(@snp, $_);
}
while ( <I2> ) {
    push(@coor, $_);
}
for ( my $i = 0; $i <= $#snp; $i++ ) {
    my @snp_line = split "\t", $snp[$i];
    for ( my $j = 0; $j <= $#coor; $j++ ) {
        my @coor_line = split "\t", $coor[$i];
        if ( $snp_line[0] eq $coor_line[0] ) {
            if ( $snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2] ) {
                print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
                goto a;
            }
        }
    }
    a:
}
The problem is that obviously this is not the best way to do it, as it iterates over all the ~60,000 lines in file 2 for each SNP in file 1. Also, it ran overnight and did not get past Chr1; we have up to Chr12.
You could work with these files when reformatted as UCSC BED format, using a toolkit like BEDOPS that does efficient set operations on sorted BED files.
Convert your first file of SNPs to a sorted BED file:
$ awk -v OFS="\t" '{ print $1, $2, ($2+1); }' snps.txt | sort-bed - > snps.bed
Sort the genes ("file 2"):
$ sort-bed genes.unsorted.txt > genes.bed
Map SNPs to genes:
$ bedmap --echo --echo-map-id-uniq --delim '\t' snps.bed genes.bed > answer.bed
If you need to, you can strip the end position of the SNP from the answer:
$ cut -f1,2,4 answer.bed > answer.txt
These tools will run very fast, usually within a few moments.
I would not use Perl or Python to do these kinds of set operations, unless I was doing some kind of academic exercise.
Here is a working script, the one posted above had bugs
use strict;
my $i1 = $ARGV[0]; # SNP
my $i2 = $ARGV[1]; # gene coordinate
open(I1, $i1);
open(I2, $i2);
my @snp  = ();
my @coor = ();
while ( <I1> ) {
    push(@snp, $_);
}
while ( <I2> ) {
    push(@coor, $_);
}
for ( my $i = 0; $i <= $#snp; $i++ ) {
    my @snp_line = split "\t", $snp[$i];
    for ( my $j = 0; $j <= $#coor; $j++ ) {
        my @coor_line = split "\t", $coor[$j];
        if ( $snp_line[0] eq $coor_line[0] ) {
            if ( $snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2] ) {
                print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
            }
        }
    }
}
This one does the job.
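Both scripts are still quadratic over the whole gene list, though. If BEDOPS is not an option, one cheap improvement is to index the genes per chromosome so each SNP only scans the genes on its own chromosome. A rough awk sketch (the file names genes.txt and snps.txt are placeholders for your two files):

```shell
# First pass (NR==FNR) loads genes.txt into per-chromosome arrays;
# the second pass scans only the genes recorded for the SNP's chromosome.
awk 'NR==FNR { n = ++cnt[$1]; s[$1,n] = $2; e[$1,n] = $3; id[$1,n] = $4; next }
     { for (i = 1; i <= cnt[$1]; i++)
         if ($2 >= s[$1,i] && $2 <= e[$1,i])
             print $1 "\t" $2 "\t" id[$1,i]
     }' genes.txt snps.txt
```

This keeps all matching gene IDs per SNP, like the fixed script above; drop-in sorting or interval trees would be needed to go faster still.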

How to repeat a sequence of numbers to the end of a column?

I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
Note that this prints 0 instead of 5 on every fifth line.
$. is the line number.
% is the modulo operator.
the /e modifier tells the substitution to evaluate the replacement part as code
i.e. end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
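For example, with a nine-line input this writes file_1 through file_5 directly, with no leftover (run it in a scratch directory; data.txt is a stand-in for your file):

```shell
printf 'aa\nbb\ncc\ndd\nff\nnn\nww\ntt\npp\n' > data.txt
# Round-robin: line NR goes to file_(((NR-1) mod 5) + 1)
awk '{print > ("file_" ((NR-1)%5)+1)}' data.txt
# file_1 now holds lines 1 and 6 (aa, nn), file_2 holds bb and ww, and so on.
```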
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5 so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
Since you already have the line's position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;

my $datafilename = shift @ARGV;

# open filehandles and store them in an array
my @fhs;
foreach my $i ( 0 .. 4 ) {
    open my $fh, '>', "${datafilename}_$i"
        or die "$!";
    $fhs[$i] = $fh;
}

# open the datafile
open my $datafile_fh, '<', $datafilename
    or die "$!";

my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
    print { $fhs[$row_number++ % @fhs] } $datarow;
}

# close resources
foreach my $fh ( @fhs ) {
    close $fh;
}

How to extract a particular column of data in Perl?

I have some data from a unix commandline call
1 ab 45 1234
2 abc 5
4 yy 999 2
3 987 11
I'll use the system() function for the call.
How can I extract the second column of data into an array in Perl? Also, the array size has to be dependent on the number of rows that I have (it will not necessarily be 4).
I want the array to have ("ab", "abc", "yy", 987).
use strict;
use warnings;
my $data = "1 ab 45 1234
2 abc 5
2 abc 5
2 abc 5
4 yy 999 2
3 987 11";
my @second_col = map { (split)[1] } split /\n/, $data;
To get unique values, see perlfaq4. Here's part of the answer provided there:
my %seen;
my @unique = grep { ! $seen{ $_ }++ } @second_col;
You can chain a Perl cmd-line call (aka: one-liner) to your unix script:
perl -lane 'print $F[1]' data.dat
instead of data.dat, use a pipe from your command line tool
cat data.dat | perl -lane 'print $F[1]'
Addendum:
The extension for unique-ness of the resulting column is straightforward:
cat data.dat | perl -lane 'print $F[1] unless $seen{$F[1]}++'
or, if you are lazy (employing %_):
cat data.dat | perl -lane 'print unless $_{$_=$F[1]}++'

How do I split up a line and rearrange its elements?

I have some data on a single line like below
abc edf xyz rfg yeg udh
I want to present the data as below
abc
xyz
yeg
edf
rfg
udh
so that alternate fields are printed with newline separated.
Are there any one liners for this?
The following awk script can do it:
> echo 'abc edf xyz rfg yeg udh' | awk '{
for (i = 1;i<=NF;i+=2){print $i}
print "";
for (i = 2;i<=NF;i+=2){print $i}
}'
abc
xyz
yeg
edf
rfg
udh
Python in the same spirit as the above awk (4 lines):
$ echo 'abc edf xyz rfg yeg udh' | python -c 'f=raw_input().split()
> for x in f[::2]: print x
> print
> for x in f[1::2]: print x'
Python 1-liner (omitting the pipe to it which is identical):
$ python -c 'f=raw_input().split(); print "\n".join(f[::2] + [""] + f[1::2])'
Another Perl 5 version:
#!/usr/bin/env perl
use Modern::Perl;
use List::MoreUtils qw(part);

my $line = 'abc edf xyz rfg yeg udh';
my @fields = split /\s+/, $line; # split on whitespace

# Divide into odd- and even-indexed elements
my $i = 0;
my ($first, $second) = part { $i++ % 2 } @fields;

# print them out
say for @$first;
say ''; # Newline
say for @$second;
A shame that the previous perl answers are so long. Here are two perl one-liners:
echo 'abc edf xyz rfg yeg udh'|
perl -naE '++$i%2 and say for @F; ++$j%2 and say for "",@F'
On older versions of perl (without "say"), you may use this:
echo 'abc edf xyz rfg yeg udh'|
perl -nae 'push @{$a[++$i%2]},"$_\n" for "",@F; print map{@$_}@a;'
Just for comparison, here's a few Perl scripts to do it (TMTOWTDI, after all). A rather functional style:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print join("\n", @a[@i]), "\n\n",
      join("\n", @a[map { $_ + 1 } @i]), "\n";
We could also do it closer to the AWK script:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print "$a[$_]\n" for @i;
print "\n";
print "$a[$_+1]\n" for @i;
I've run out of ways to do it, so if any other clever Perlers come up with another method, feel free to add it.
Another Perl solution:
use strict;
use warnings;
while (<>)
{
    my @a = split;
    my @b = map { $a[2 * ($_ % (@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);
    print join("\n", @a[0..((@b/2)-1)], '', @a[(@b/2)..@b-1], '');
}
You could even condense it into a real one-liner:
perl -nwle'my @a = split;my @b = map { $a[2 * ($_%(@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);print join("\n", @a[0..((@b/2)-1)], "", @a[(@b/2)..@b-1], "");'
Here's the too-literal, non-scalable, ultra-short awk version:
awk '{printf "%s\n%s\n%s\n\n%s\n%s\n%s\n",$1,$3,$5,$2,$4,$6}'
Slightly longer (two more characters), using nested loops (prints an extra newline at the end):
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;print ""}}'
Doesn't print an extra newline:
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;if(i==1)print ""}}'
For comparison, paxdiablo's version with all unnecessary characters removed (1, 9 or 11 more characters):
awk '{for(i=1;i<=NF;i+=2)print $i;print "";for(i=2;i<=NF;i+=2)print $i}'
Here's an all-Bash version:
d=(abc edf xyz rfg yeg udh)
i="0 2 4 1 3 5"
for w in $i
do
echo ${d[$w]}
[[ $w == 4 ]] && echo
done
My attempt in haskell:
Prelude> (\(x,y) -> putStr $ unlines $ map snd (x ++ [(True, "")] ++ y)) $ List.partition fst $ zip (cycle [True, False]) (words "abc edf xyz rfg yeg udh")
abc
xyz
yeg
edf
rfg
udh
Prelude>
you could also just use tr:
echo "abc edf xyz rfg yeg udh" | tr ' ' '\n'
Ruby versions for comparison:
ARGF.each do |line|
groups = line.split
0.step(groups.length-1, 2) { |x| puts groups[x] }
puts
1.step(groups.length-1, 2) { |x| puts groups[x] }
end
ARGF.each do |line|
groups = line.split
puts groups.select { |x| groups.index(x) % 2 == 0 }
puts
puts groups.select { |x| groups.index(x) % 2 != 0 }
end
$ echo 'abc edf xyz rfg yeg udh' |awk -vRS=" " 'NR%2;NR%2==0{_[++d]=$0}END{for(i=1;i<=d;i++)print _[i]}'
abc
xyz
yeg
edf
rfg
udh
As for the blank line between the two groups, I leave that to you.
Here is yet another way, using Bash, to manually rearrange words in a line - with previous conversion to an array:
echo 'abc edf xyz rfg yeg udh' | while read tline; do twrds=($(echo $tline)); echo -e "${twrds[0]} \n${twrds[2]} \n${twrds[4]} \n\n${twrds[1]} \n${twrds[3]} \n${twrds[5]}"; done
Cheers!