Consolidation of intervals - perl

I'm working with biological data (copy number variations) which is shown as intervals (tab separated file):
File 1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
I intersected them in order to consolidate the overlapping events (50% of overlap is my condition), the result is this:
I used intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html):
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File 2
File 2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
Event A and the event C overlaps with the event B, event E and F overlaps with each other like G and H, finally the event D has no overlapping partners. Knowing this, the list of consolidated CNV should be:
File 3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
I was trying to use the merge option of the HDCNV java software (http://daleylab.org/lab/?page_id=125) but the output is not what I needed. I was trying to write a perl code but I'm a beginner so this problem is, at the moment, out of my limits.
I would appreciate if you can help me with a nice perl or awk code which take File 2 as input and outputs File 3.
Thanks in advance

I'm assuming that the columns have the following meanings:
col 1: chromosome number
col 2: start position of genomic region
col 3: end position of genomic region
col 4: text identifier
This script looks for the areas of overlap between the named regions. It assumes that the input text is sorted by col 1 then col 2. I have put the input text in a string, but you will probably be reading it in from a file (and outputting your data to a file, too). I will leave you to work out how to do that--it is pretty easy, and there is lots of documentation on the perl website.
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my #lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my #range;
foreach (#lines) {
# the chromosome number has changed or
# minimum is greater than current maximum:
# start a new interval
if ($col_0 != $_->[0] || $_->[1] > $max) {
if (#range) {
# print out the range, and restart the stack
say join("\t",
$col_0,
( $min || $_->[1] ),
( $max || $_->[2] ),
join(", ", #range)
);
}
#range = ( $_->[3] );
# set the min and max
$col_0 = $_->[0];
$min = $_->[1];
$max = $_->[2];
}
else {
# the minimum is lower than our current maximum.
# check whether the max is greater than our current
# max and increase it if so. Add the letter to the
# current range.
if ($_->[2] > $max) {
$max = $_->[2];
}
push #range, $_->[3];
}
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", #range) );
Output:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
I have just calculated simple overlap - this doesn't do 50% overlap. Using this script as a start, you can figure out how to do that. We're not doing your PhD for you! ;)

awk '
$2 > end && NR>1 {
print "1", start, end, pair;
start=end=pair=0
}
{
if (!start) { start = $2 };
end = $3;
pair = (pair ? pair "," $4 : $4)
}
END {
print "1", start, end, pair
}' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

Assuming ordered data, the following stub should handle merging the records.
Would just have to modify it to load and output to a file.
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
my #fields = split;
if ( !$last ) {
$last = \#fields;
} elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
$last->[1] = min( $last->[1], $fields[1] );
$last->[2] = max( $last->[2], $fields[2] );
$last->[3] .= ",$fields[3]";
} else {
print join( "\t", #$last ), "\n";
$last = \#fields;
}
}
print join( "\t", #$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
Outputs:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P

My take:
awk -F "\t" -v OFS="\t" '
function emit() {print chrom, start, end, annot}
$1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
annot = annot "," $4
if ($2 < start) start = $2
if ($3 > end) end = $3
next
}
chrom {emit()}
{chrom=$1; start=$2; end=$3; annot=$4}
END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

Related

Dividing a matrix into two parts

I am trying to classify my dataset. To do this, I will use the 4th column of my dataset. If the 4th column of the dataset is equal to 1, that row will added in new matrix called Q1. If the 4th column of the dataset is equal to 2, that row will be added to matrix Q2.
My code:
i = input('Enter a start row: ');
j = input('Enter a end row: ');
search = importfiledataset('search-queries-features.csv',i,j);
[n, p] = size(search);
if j>n
disp('Please enter a smaller number!');
end
for s = i:j
class_id = search(s,4);
if class_id == 1
Q1 = search(s,1:4)
elseif class_id ==2
Q2 = search(s,1:4)
end
end
This calculates the Q1 and Q2 matrices, but they all are 1x4 and when it gives new Q1 the old one is deleted. I need to add new row and make it 2x4 if conditions are true. I need to expand my Q1 matrix.
Briefly I am trying to divide my dataset into two parts using for loops and if statements.
Dataset:
I need outcome like:
Q1 = [30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 58 30 1
34 60 1 1
34 61 10 1]
Q2 = [34 59 0 2
34 66 9 2]
How can I prevent my code from deleting previous rows of Q1 and Q2 and obtain the entire matrices?
The main problem in your calculation is that you overwrite Q1 and Q2 each loop iteration. Best solution: get rid of the loops and use logical indexing.
You can use logical indexing to quickly determine where a column is equal to 1 or 2:
search = [
30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 59 0 2
34 66 9 2
34 58 30 1
34 60 1 1
34 61 10 1
];
Q1 = search(search(:,4)==1,:) % == compares each entry in the fourth column to 1
Q2 = search(search(:,4)==2,:)
Q1 =
30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 58 30 1
34 60 1 1
34 61 10 1
Q2 =
34 59 0 2
34 66 9 2
Warning: Slow solution!
If you are hell bent on using loops, make sure to not overwrite your variables. Either extend them each iteration (which is very, very slow):
Q1=[];
Q2=[];
for ii = 1:size(search,1) % loop over all rows
if search(ii,4)==1
Q1 = [Q1;search(ii,:)];
end
if search(ii,4)==2
Q2 = [Q2;search(ii,:)];
end
end
MATLAB will put orange wiggles beneath Q1 and Q2, because it's a bad idea to grow arrays in-place. Alternatively, you can preallocate them as large as search and strip off the excess:
Q1 = zeros(size(search)); % Initialise to be as large as search
Q2 = zeros(size(search));
Q1kk = 1; % Intialiase counters
Q2kk = 1;
for ii = 1:size(search,1) % loop over all rows
if search(ii,4)==1
Q1(Q1kk,:) = search(ii,:); % store
Q1kk = Q1kk + 1; % Increase row counter
end
if search(ii,4)==2
Q2(Q2kk,:) = search(ii,:);
Q2kk = Q2kk + 1;
end
end
Q1 = Q1(1:Q1kk-1,:); % strip off excess rows
Q2 = Q2(1:Q2kk-1,:);
Another option using accumarray, if Q is your original matrix:
Q = accumarray(Q(:,4),1:size(Q,1),[],#(x){Q(x,:)});
You can access the result with Q{1} (for class_id = 1), Q{2} (for class_id = 2) and so on...

Matching 1st and 2nd columns of first file with 1st column of second file

Want to compare the first and second column of first file with the 1st column of second file.
I do the following:
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $2,$1,$3,$4}' file 1 file 2
Result of my Script is
3 b 33 44
2 c 44 11
5 d 12 21
I got the result for first column only and it does not print the repeated value. I want my output with repeated value also.
File1
a c 0.1
a b 0.4
b d 0.1
b a 0.8
b c 0.8
c b 0.7
c a 0.6
c d 0.9
d c 0.3
File2
b 3 33 44
c 2 44 11
d 5 12 21
Expected output:
3 b 33 44 5 d 12 21 0.1
3 b 33 44 2 c 44 11 0.8
2 c 44 11 3 b 33 44 0.7
2 c 44 11 5 d 12 21 0.9
5 d 12 21 2 c 44 11 0.3
Reading file2 before file1 is more practical.
$ awk 'NR==FNR {
a[$1]=($2 OFS $1 OFS $3 OFS $4)
next
}
($1 in a)&&($2 in a) {
print a[$1],a[$2],$3
}' file2 file1
3 b 33 44 5 d 12 21 0.1
3 b 33 44 2 c 44 11 0.8
2 c 44 11 3 b 33 44 0.7
2 c 44 11 5 d 12 21 0.9
5 d 12 21 2 c 44 11 0.3

Perl: find sum and average of specific columns

I want to calculate the average over all itemsX (where X is a digit) for each row in Perl on windows.
I have file in format:
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Want to find sum of item blocks and their average.
Any help on this?
Here's what I've tried so far:
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($item1, $item2, ) = split;
$total += $numbers;
$count += 1;
}
For the first line of input (the column names), we store the indices of the columns that start with item. For each subsequent line, we sum the columns referenced by the array slice derived from #indices.
use strict;
use warnings;
use List::Util qw(sum);
my #indices;
while (<DATA>) {
my #fields = split;
if ($. == 1) {
#indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
next;
}
my $sum = sum(#fields[#indices]);
my $avg = $sum / scalar(#indices);
printf("Row %d stats: sum=%d, avg=%.2f\n", $., $sum, $avg);
}
__DATA__
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Output:
Row 2 stats: sum=77, avg=25.67
Row 3 stats: sum=165, avg=55.00

Modifying Script to include the Count of a each time a name appears from a table

I have a script below that takes my FILE1 and parses out FILE2 only if the first column of FILE1 matches column number 10 of FILE2. So it will print out the rows I need. This part works great. The part I am having a tad bit of difficulty is inserting a sort of count for the output. The goal of the script is take column 10 at the end and produce an output. In my list there are 12 names and I want to get the count of each name. For the example below, I have used four names.
FILE1:
name1 15
name2 15
name2 30
name5 15
name4 10
name2 5
name2 5
FILE2:
23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1
23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1
23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1
23 20 5.4 1.3 7 55 128 7998 + 14 name5 1 77 2 1
23 20 5.4 1.3 6 55 128 994 + 14 name1 1 77 3
23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1
23 20 5.4 1.3 5 55 128 99 + 5 name8 1 77 5 1
Expected Output
$VAR1 = {
'name1' => 1,
'name2' => 4,
'name4' => 1,
'name5' => 1,
};
5 55 128 21799 32 name2 77 0 1
5 55 128 7998 18 name4 77 0 1
6 55 128 9984 13 name4 77 1 1
7 55 128 7998 14 name5 77 2 1
6 55 128 994 14 name1 77 3 1
name1 1
name2 1
name4 2
name5 1
You can test the script it works. The part I am having difficulty with is inserting the count of each name based on the output. The print \%x is a way of checking if my original list was truly used as I am working with a much larger set of data. If someone could point me the right direction on how to modify my script without changing it drastically that would be great. I feel like this script fulfills the majority of my needs even if it is not the most efficient way of doing it.
use strict;
use Data::Dumper;
my %x;
open(FILE1, $ARGV[0]) or die "Cannot open the file: $!";
while (my $line = <FILE1>) {
my #array = split(" ", $line);
$x{$array[0]}++;
}
close FILE1;
print Dumper( \%x );
my %count;
open(FILE2, $ARGV[1]) or die "Cannot open the file: $!";
while (my $line = <FILE2>) {
my #name = split(" ", $line);
my $y = $name[9];
if ( $x{ $y } ) {
print join(" ", #name[4,5,6,7,9,11,12,13]), "\n";
$count{#name[9]}++;
}
}
print Dumper (\%count);
close FILE2;
exit;
Script now counts. Just need to debug.
the "minimal" change would be to set the elements of %x to 0 in the FILE1 loop, then check for exists $x{$y} in the FILE2 loop and do ++$x{$y} inside the condition body. Now at the end %x has the counts of all the occurrences.
The usual way (as mentioned in the comments of the question) would be to declare an additional %count and perform the same ++$count{$y} inside the if block as in the above method.
The first has the advantage and disadvantage (depending on your needs) of reporting the count even when the name has zero found occurrences.

concat columns in perl

Each iteration in my perl code generates a vector of 5.
Output of first iteration is
out1
1
2
3
4
5
The second iterations generates same length of vector.
out2
10
20
30
40
50
and then it runs for nth time
out n
100
200
300
400
500
I want to merge these columns and have the final output in a tabular format or matrix format if you like:
out1 out2 ... outn
1 10 100
2 20 200
3 30 300
4 40 400
5 50 500
I tried splitting and then using the push but it prints "(101" and only do it once and not for all 20. I also have no idea where the "(101" comes from.
Any suggestions?
First, put all those output lists to a list. Second, iterate on that list: output every first element of each element-list in the first iteration, output every second element of each element-list in the second iteration, and so on.
For example
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #lists;
for my $i (1..10) {
my #list;
push #list, $_ * $i for (1..5);
push #lists, \#list;
}
$Data::Dumper::Indent = 0;
print Dumper(\#lists), "\n\n";
while (#{$lists[0]}) {
for my $list (#lists) {
print shift #$list, "\t";
}
print "\n";
}
Output:
$ perl t.pl
$VAR1 = [
[1,2,3,4,5],
[2,4,6,8,10],
[3,6,9,12,15],
[4,8,12,16,20],
[5,10,15,20,25],
[6,12,18,24,30],
[7,14,21,28,35],
[8,16,24,32,40],
[9,18,27,36,45],
[10,20,30,40,50]
];
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: The output of Data::Dumper has been edited to make it more compact.
Save your vector information to an array of array as you do your processing. Then you can output the rows using a simple join:
use strict;
use warnings;
my #rows;
for my $i (1..10) {
my #vector = map {$i * $_} (1..5);
push #{$rows[$_]}, $vector[$_] for (0..$#vector);
}
for my $row (#rows) {
print join(" ", map {sprintf "%-3s", $_} #$row), "\n";
}
Outputs:
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: It'd be a lot easier to advise if you provided code and actual data.