Read Tab delimited file and count the occurrences and delete row - perl

I am fairly new to programming and am trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 77 T C T T T T T
tg93 79 C - C C C - -
tg93 79 C G C C C C G C
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 105 A G A A A A A G A
tg93 108 A G A A A A G A A
tg93 114 T C T T T T T C T
tg93 131 A C A A A A A A A
tg93 136 G C C G C C G G G
tg93 150 CTCTC - CTCTC - CTCTC CTCTC
In this file, the header columns are:
CHROM - name
POS - position
REF - reference
ALT - alternate
10_sample.bam to 16_sample.bam - samples
Now I want to see how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete that row.
For example:
In the first row, I have 'T' in REF and 'C' in ALT. In the 7 samples there are 5 T's, 2 blanks and no C, so I need to delete this row.
In the second row, REF is 'C' and ALT is '-'. In the seven samples we have 3 C's, 2 '-'s and 2 blanks, so we keep this row, as C and - each occur at least 2 times.
Blanks are always ignored while counting.
The final file after filtering is
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
I am able to read the columns into arrays and display them in the code, but I am not sure how to start the loops that read the bases, count their occurrences and keep the row. Can anyone tell me how I should proceed? It would also be helpful if you have any example code I can modify.

#!/usr/bin/env perl
use strict;
use warnings;
print scalar(<>);                       # Read and output the header.
while (<>) {                            # Read a line.
    chomp;                              # Remove the newline from the line.
    my ($chrom, $pos, $ref, $alt, @samples) =
        split /\t/;                     # Parse the line into fields.
    my %counts;                         # Count the occurrences of sample values.
    ++$counts{$_} for @samples;         # e.g. Might end up with $counts{"G"} = 3.
    print "$_\n"                        # Print line if we want to keep it.
        if ($counts{$ref} || 0) >= 2    # ("|| 0" avoids a spurious warning.)
        && ($counts{$alt} || 0) >= 2;
}
Output:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
You included 108 in your desired output, but it only has one instance of ALT in the seven samples.
Usage:
perl script.pl file.in >file.out
Or in-place:
perl -i script.pl file
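To make the counts behind the remark about position 108 visible, here is a small sketch that applies the same rule to single rows. The helper name keep_row and its $min parameter are my own and not part of the script above. The rows are typed in on the assumption that blank samples are empty tab-separated fields, and the positions of the two blanks in row 79 are a guess.

```perl
use strict;
use warnings;

# Hypothetical helper: true if both REF and ALT occur at least $min
# times among the sample columns of one tab-separated row.
sub keep_row {
    my ($line, $min) = @_;
    # The -1 limit keeps trailing empty fields (blank samples at the end).
    my ($chrom, $pos, $ref, $alt, @samples) = split /\t/, $line, -1;
    my %counts;
    $counts{$_}++ for grep { length } @samples;   # blanks are not counted
    my $ref_n = $counts{$ref} // 0;
    my $alt_n = $counts{$alt} // 0;
    return $ref_n >= $min && $alt_n >= $min;
}

# Row 108 from the question: A occurs 6 times, G only once.
my $row_108 = join "\t", qw(tg93 108 A G A A A A G A A);
# Row 79: three C, two '-', two blanks (blank positions assumed).
my $row_79  = join "\t", 'tg93', 79, 'C', '-', 'C', 'C', 'C', '', '-', '', '-';

for my $row ($row_108, $row_79) {
    my $pos     = (split /\t/, $row)[1];
    my $verdict = keep_row($row, 2) ? "keep" : "drop";
    print "$pos: $verdict\n";
}
```

With these inputs, row 108 is dropped and row 79 is kept, which agrees with the output listed above.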

Here's an approach that does not assume tab separation between fields:
use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach (@lines) {
    my %letters;
    # use a regex with capture groups to extract the data - this method does not depend on tab-separated fields
    if (/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {
        # initialize hash counts
        $letters{$1} = 0;
        $letters{$2} = 0;
        # loop through the samples and increment the counter when matches are found
        foreach ($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                ++$letters{$1};
            }
            if ($_ eq $2) {
                ++$letters{$2};
            }
        }
        # if the counts for both REF and ALT are greater than or equal to 2, print the line
        if ($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}

Related

How to vectorize this for-loop in Matlab?

This is perhaps a simple question. I have a vector and a matrix and want to build a new matrix from them. I constructed the new matrix with a for loop, and I would like to know how to write it with vectorized operations, which are likely to be faster.
d=[n x 1];
t= [n x n];
I want the new Delta matrix which is [n x n] as follows:
for i=1:39
for j=1:39
Delta(i,j)=d(i)-d(j)-t(i,j);
end
end
The result
[d (1) - d (1) - t( 1 ,1),d (1) - d (2) - t( 1 ,2), ... d(1) - d (39) - t( 1 ,39)
d (2) - d (1) - t( 2 ,1),d (2) - d (2) - t( 2 ,2), .... ,d (2) - d (39) - t( 2 ,39)
.
.
.
d (38) - d (1) - t( 38 ,1),d (38) - d (2) - t( 38 ,2), ... , d(38) -d (39)-t(38,39)
d (39) - d (1) - t( 39 ,1),d (39) - d (2) - t( 39 ,2), ..., d(39)- d (39)- t(39 ,39)]
You can use the efficient bsxfun (this relies on d being a column vector, as in the question):
Delta = bsxfun(@minus,d,d.') - t

Consolidation of intervals

I'm working with biological data (copy number variations) which is shown as intervals (tab separated file):
File 1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
I intersected them with intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) in order to consolidate the overlapping events (my condition is 50% overlap). The command and its result are:
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File 2
File 2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
Events A and C overlap with event B, events E and F overlap with each other, as do G and H, and event D has no overlapping partners. Knowing this, the list of consolidated CNVs should be:
File 3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
I tried to use the merge option of the HDCNV Java software (http://daleylab.org/lab/?page_id=125), but the output is not what I needed. I also tried to write Perl code, but I'm a beginner, so this problem is, at the moment, beyond me.
I would appreciate it if you could help me with Perl or awk code that takes File 2 as input and outputs File 3.
Thanks in advance
I'm assuming that the columns have the following meanings:
col 1: chromosome number
col 2: start position of genomic region
col 3: end position of genomic region
col 4: text identifier
This script looks for the areas of overlap between the named regions. It assumes that the input text is sorted by col 1 then col 2. I have put the input text in a string, but you will probably be reading it in from a file (and outputting your data to a file, too). I will leave you to work out how to do that--it is pretty easy, and there is lots of documentation on the perl website.
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my @lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my @range;
foreach (@lines) {
    # the chromosome number has changed or
    # minimum is greater than current maximum:
    # start a new interval
    if ($col_0 != $_->[0] || $_->[1] > $max) {
        if (@range) {
            # print out the range, and restart the stack
            say join("\t",
                $col_0,
                ( $min || $_->[1] ),
                ( $max || $_->[2] ),
                join(", ", @range)
            );
        }
        @range = ( $_->[3] );
        # set the min and max
        $col_0 = $_->[0];
        $min = $_->[1];
        $max = $_->[2];
    }
    else {
        # the minimum is lower than our current maximum.
        # check whether the max is greater than our current
        # max and increase it if so. Add the letter to the
        # current range.
        if ($_->[2] > $max) {
            $max = $_->[2];
        }
        push @range, $_->[3];
    }
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", @range) );
Output:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
I have just calculated simple overlap - this doesn't do 50% overlap. Using this script as a start, you can figure out how to do that. We're not doing your PhD for you! ;)
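For the 50% condition, one possible starting point is a reciprocal overlap test, which as far as I understand is what -f 0.50 -r asks intersectBed for: the shared stretch has to cover at least half of each interval. The function name reciprocal_overlap is made up for this sketch, and it assumes half-open coordinates (length = end - start), so adjust the arithmetic if your positions are inclusive.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: returns 1 if the overlap between [s1, e1) and
# [s2, e2) is at least $frac of the length of BOTH intervals, else 0.
sub reciprocal_overlap {
    my ($s1, $e1, $s2, $e2, $frac) = @_;
    my $overlap = min($e1, $e2) - max($s1, $s2);
    return 0 if $overlap <= 0;                      # no shared positions
    my $ok = $overlap >= $frac * ($e1 - $s1)
          && $overlap >= $frac * ($e2 - $s2);
    return $ok ? 1 : 0;
}

# Intervals A (1-10), B (3-12) and C (7-15) from File 1.
my $a_vs_b = reciprocal_overlap(1, 10, 3, 12, 0.5);  # overlap 7, lengths 9 and 9
my $a_vs_c = reciprocal_overlap(1, 10, 7, 15, 0.5);  # overlap 3, lengths 9 and 8
print "A vs B: $a_vs_b\n";
print "A vs C: $a_vs_c\n";
```

Under these assumptions A pairs with B but not with C, which is consistent with the pairs listed in File 2.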
awk '
    $2 > end && NR>1 {
        print "1", start, end, pair;
        start=end=pair=0
    }
    {
        if (!start) { start = $2 };
        end = $3;
        pair = (pair ? pair "," $4 : $4)
    }
    END {
        print "1", start, end, pair
    }' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
Assuming ordered data, the following stub should handle merging the records.
Would just have to modify it to load and output to a file.
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
    my @fields = split;
    if ( !$last ) {
        $last = \@fields;
    } elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
        $last->[1] = min( $last->[1], $fields[1] );
        $last->[2] = max( $last->[2], $fields[2] );
        $last->[3] .= ",$fields[3]";
    } else {
        print join( "\t", @$last ), "\n";
        $last = \@fields;
    }
}
print join( "\t", @$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
Outputs:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P
My take:
awk -F "\t" -v OFS="\t" '
    function emit() {print chrom, start, end, annot}
    $1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
        annot = annot "," $4
        if ($2 < start) start = $2
        if ($3 > end) end = $3
        next
    }
    chrom {emit()}
    {chrom=$1; start=$2; end=$3; annot=$4}
    END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
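The answers above work from File 1. If you really want to start from File 2, one way to sketch it is to treat every reported pair as an edge and collect the connected groups. The sub name consolidate and the composite key are my own choices, and the sketch assumes each line holds exactly 8 whitespace-separated fields. Note that it merges transitively: A and C end up together through B, which is what File 3 shows.

```perl
use strict;
use warnings;

# Sketch: group intervals that are linked, directly or via other
# intervals, by the pairs in File 2. Returns one tab-joined line per group.
sub consolidate {
    my @pair_lines = @_;
    my (%parent, %info);

    # Follow parent links until we reach the representative of a group.
    my $find = sub {
        my ($x) = @_;
        $x = $parent{$x} while $parent{$x} ne $x;
        return $x;
    };

    for my $line (@pair_lines) {
        my @f = split ' ', $line;
        next unless @f == 8;                 # skip malformed lines
        my $ka = join "\t", @f[0 .. 3];      # key for the A interval
        my $kb = join "\t", @f[4 .. 7];      # key for the B interval
        $parent{$_} //= $_ for $ka, $kb;
        $info{$ka} //= [ @f[0 .. 3] ];
        $info{$kb} //= [ @f[4 .. 7] ];
        my ($ra, $rb) = ($find->($ka), $find->($kb));
        $parent{$rb} = $ra if $ra ne $rb;    # union the two groups
    }

    my %group;
    push @{ $group{ $find->($_) } }, $info{$_} for keys %parent;

    my @out;
    for my $members (values %group) {
        my @sorted = sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] } @$members;
        my ($start, $end) = @{ $sorted[0] }[1, 2];
        for my $m (@sorted) {
            $end = $m->[2] if $m->[2] > $end;
        }
        push @out, [ $sorted[0][0], $start, $end, join(",", map { $_->[3] } @sorted) ];
    }
    # Order by chromosome (compared as a string), then by start.
    @out = sort { $a->[0] cmp $b->[0] || $a->[1] <=> $b->[1] } @out;
    return map { join "\t", @$_ } @out;
}

# A subset of the File 2 lines from the question.
my @file2 = split /\n/, <<'END';
1 1 10 A 1 3 12 B
1 3 12 B 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 37 45 F
1 50 60 G 1 50 65 H
END
print "$_\n" for consolidate(@file2);
```

For a real file you would read the lines from a filehandle instead of the here-document.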

How to print the array element values in array order?

How can I print the array's values in the order of the elements of another array?
@ab = <DATA>;
print "@ab\n";
@a = qw(a b c d);
foreach $s (@ab) {
    foreach $m (@a) {
        $z =~ m/$m/g;
        print "$z";
    }
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
This program gives some output, but the output I expect is:
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
It should first print the values matching the first array element, then those matching the second, and so on. How can I do this?
Your outer loop should loop over @a and the inner one over @ab:
my @ab = <DATA>;
my @a = qw(a b c d);
foreach my $m (@a) {
    foreach my $s (@ab) {
        print $s if $s =~ /^$m/;
    }
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
output
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
It looks like you're trying to sort the array alphabetically but only based on the first letter. I think this does what you want:
use strict;
use warnings;
print sort { (substr $a, 0, 1) cmp (substr $b, 0, 1) } <DATA>;
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Output:
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
Try:
@ab = <DATA>;
@a = qw(a b c d);
print map { $tmp = $_; grep { $tmp eq (split(" ", $_))[0] } @ab } @a;
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Explanation
map { ... } @a: loop through each element of array @a, in order.
$tmp = $_: save map's current value of $_ in the variable $tmp.
grep { $tmp eq (split(" ", $_))[0] } @ab: for each element of array @ab, split it to get only the first field and compare that with the current $tmp value. If they are equal, grep returns that element.
Group the sections, and then print
use strict;
use warnings;
my %group;
while (<DATA>) {
    my ($key) = split ' ';
    push @{$group{$key}}, $_;
}
for my $key (sort keys %group) {
    print @{$group{$key}};
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Outputs:
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
This scales better than Сухой27's answer: O(S) instead of O(ID*S).
my @ids = qw( a b c d );
my %s_by_id;
while (my $s = <DATA>) {
    my ($id) = $s =~ /^(\S+)/
        or next;
    push @{ $s_by_id{$id} }, $s;
}
for my $id (@ids) {
    print @{ $s_by_id{$id} } if $s_by_id{$id};
}

How to double the columns in a data frame in perl

I have a big data frame that looks like this:
name1 A A G
name2 C C T
name3 A G G
name4 H G G
name5 C - T
name6 C C C
name7 A G G
name8 G G A
I expect the data frame changed to:
name1 A A A A G G
name2 C C C C T T
name3 A A G G G G
name4 H H G G G G
name5 C C - - T T
name6 C C C C C C
name7 A A G G G G
name8 G G G G A A
I tried to do this in R, but the memory limit does not allow it. Please help me with a Perl solution. I don't know how to write a Perl script. Thanks.
perl -lane'
  BEGIN { $, ="\t" }
  print shift(@F), map{ ($_)x2 } @F
' file
output
name1 A A A A G G
name2 C C C C T T
name3 A A G G G G
name4 H H G G G G
name5 C C - - T T
name6 C C C C C C
name7 A A G G G G
name8 G G G G A A
Using a perl one-liner
perl -lane 'print join "\t", shift(@F), map {($_) x 2} @F' data.txt
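Since you say you don't write Perl, here is roughly what those one-liners do, spelled out as a script. In the one-liners, -n loops over the input lines, -a splits each line on whitespace into @F, and -l takes care of the line endings. The sub name double_columns is just an illustrative choice.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first field as it is and repeat every
# other field twice, joining the result with tabs.
sub double_columns {
    my ($line) = @_;
    my ($name, @values) = split ' ', $line;   # split on runs of whitespace
    return join "\t", $name, map { ($_) x 2 } @values;
}

my $doubled = double_columns("name5 C - T");
print "$doubled\n";

# To process a whole file one line at a time (so memory use stays small):
#   while (my $line = <>) { print double_columns($line), "\n" }
```

Because only one line is held at a time, the size of the file should not matter.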

Taking only values which form a continuous range

I have a file with 3 columns ->
A1 0 9
A1 4 14
A1 16 24
A1 25 54
A1 64 84
A1 74 84
A2 15 20
A2 19 50
I want to check if each line (value in col2 and 3) is present already or is in between the range of previous line, if col1 value is equal.
The desired output is ->
A1 0 14
A1 16 54
A1 64 84
A2 15 50
I have tried ->
@ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
#open $second,'<', $ARGV[1] or die "Unable to open input file: $!";
$k=0;
while (<$first>)
{
    if($k==0)
    {
        @cols = split /\s+/;
        $p0=$cols[0];
        $p1=$cols[1];
        $p2=$cols[2];
        $p3=$cols[2]+1;
    }
    else{
        @new = split /\s+/;
        if ($new[0] eq $p0){
            if ($new[1]>$p3)
            {
                print join("\t", @new),"\n";
                $p0=$new[0];
                $p1=$new[1];
                $p2=$new[2];
                $p3=$new[2]+1;
            }
            elsif ($new[2]>=$p2)
            {
                print $p0,"\t",$p1,"\t",$new[2],"\n";
                $p2=$new[2];
                $p3=$new[2]+1;
            }
            else
            {
                $p5=1;
            }
        }
        else
        {
            print join("\t", @new),"\n";
            $p0=$new[0];
            $p1=$new[1];
            $p2=$new[2];
            $p3=$new[2]+1;
        }
    }
    $k=1;
}
and output I am getting is ->
A1 0 14
A1 16 24
A1 16 54
A1 64 84
A1 64 84
A2 15 20
A2 22 50
I am not able to understand why I am getting this wrong output. Also if there is any way that I can erase(or overwrite) the last printed line, then it will be very easy.
First of all, it would be much simpler to help you if you
used strict and warnings, and declared all your variables close to first use with my
indented your code properly to show the structure
The reason your code fails is that you are printing data under too many conditions. For example you output A1 16 24 when you find it cannot be joined with the previous range A1 4 14 without waiting for it to be extended by the subsequent A1 25 54 (when you correctly extend the range and print it again). A1 64 84 is output twice for the same reason: first because it cannot be merged with A1 25 54, and again because it has been "extended" with A1 74 84. Finally A2 15 20 is output straight away because it has a new first column, even though it is merged with the next line and output again.
You need to output a range only when you have found that it cannot be extended again. That happens when
a new record is found that doesn't overlap the existing data
the end of the file is reached
This code prints output only in those cases and appears to do what you need.
use strict;
use warnings;
my @data;
while (<DATA>) {
    my @new = split;
    if (not @data) {
        @data = @new;
    }
    elsif ($new[0] eq $data[0] and $new[1] <= $data[2] + 1) {
        # overlapping or adjacent: extend the range, never shrink it
        $data[2] = $new[2] if $new[2] > $data[2];
    }
    else {
        print join("\t", @data), "\n";
        @data = @new;
    }
}
# end of input: flush the last range
print join("\t", @data), "\n" if @data;
__DATA__
A1 0 9
A1 4 14
A1 16 24
A1 25 54
A1 52 57
A1 59 62
A1 64 84
A1 74 84
A2 15 20
A2 19 50
OUTPUT
A1 0 14
A1 16 57
A1 59 62
A1 64 84
A2 15 50
You need some variables describing the currently accumulated contiguous region. For each line of input, flush the previously accumulated region if the new input has a new column 1 label, or has the same label but is non-contiguous, or if you have reached end-of-file. If it has the same label and is contiguous, you update the min and max values.
This assumes that columns 1 and 2 are sorted.
The rest is left as an exercise for the reader.
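In case a starting point helps with that exercise, here is one way the flush-on-break idea could look. It is only a sketch: the sub name merge_ranges is made up, rows are passed in as [label, start, end] array references, and "contiguous" is taken to mean that the new start is at most one past the current end, which is what the desired output in the question suggests.

```perl
use strict;
use warnings;

# Sketch: merge rows that share a label and are contiguous.
# Assumes the rows are sorted by label, then by start.
sub merge_ranges {
    my @rows = @_;
    my (@merged, @cur);
    for my $row (@rows) {
        my ($label, $start, $end) = @$row;
        if (@cur && $label eq $cur[0] && $start <= $cur[2] + 1) {
            # Same label and contiguous: extend the accumulated region.
            $cur[2] = $end if $end > $cur[2];
        }
        else {
            # New label or a gap: flush the old region, start a new one.
            push @merged, [@cur] if @cur;
            @cur = ($label, $start, $end);
        }
    }
    push @merged, [@cur] if @cur;    # end of input: flush the last region
    return @merged;
}

# The sample data from the question.
my @rows = map { [ split ' ' ] } (
    "A1 0 9",   "A1 4 14",  "A1 16 24", "A1 25 54",
    "A1 64 84", "A1 74 84", "A2 15 20", "A2 19 50",
);
print join("\t", @$_), "\n" for merge_ranges(@rows);
```

On the question's data this gives the four desired lines: A1 0 14, A1 16 54, A1 64 84 and A2 15 50.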