I have a tab-seperated table (table1) with 4 columns, looking like this
A + 1 1
A + 2 2
A + 3 3
B - 2 4
B - 3 5
B - 4 6
B - 5 7
C + 1 8
C + 2 9
C + 3 10
D - 1 11
D - 2 12
D - 3 13
The letters in col1 define groups. Now, if there is a "+" in col2, I just want to keep the line as it is. If there i a "-" in col2, I want to reverse the values in col3, for all lines belonging to the group defined by col1. For this example the output should look like this
A + 1 1
A + 2 2
A + 3 3
B - 5 4
B - 4 5
B - 3 6
B - 2 7
C + 1 8
C + 2 9
C + 3 10
D - 3 11
D - 2 12
D - 1 13
So for group A, everything stays the same. But for group B the original values in col3 are: 2, 3, 4, 5. They should become 5, 4, 3, 2. The rest of the columns stays the same.
How should I work this out in Perl?
Keep the lines of a - group in a buffer, output it when the group changes.
#!/usr/bin/perl
use warnings;
use strict;
sub output {
my $buffer = shift;
my #rev = map $_->[2], #$buffer;
$_->[2] = pop #rev for #$buffer; # Reverse the 3rd column.
print join("\t", #$_) for #$buffer;
#$buffer = ();
}
my #buffer;
my $group;
print scalar <>; # header
while (<>) {
my #cols = split /\t/;
if (#buffer and $cols[0] ne $group) {
output(\#buffer);
}
if ('+' eq $cols[1]) {
print;
} else {
$group = $cols[0];
push #buffer, \#cols;
}
}
output(\#buffer) if #buffer; # Don't forget to output the last buffer.
Here's an alternative solution that makes use of the advantage of random access if you read the data into memory
use strict;
use warnings;
my #data = map { chomp; [ split /\t/ ]; } <DATA>;
my %ranges;
for ( grep $_->[1] eq '-', #data ) {
push #{ $ranges{$_->[0]} }, $_;
}
for my $range ( values %ranges ) {
for ( my $i = 0; $i*2 < $#$range; ++$i ) {
my ($from, $to) = #{$range}[$i, $#$range-$i];
( $from->[2], $to->[2] ) = ( $to->[2], $from->[2] );
}
}
print join("\t", #$_), "\n" for #data;
__DATA__
col1 col2 col3 col4
A + 1 10
A + 2 20
A + 3 35
B - 2 5
B - 3 21
B - 4 23
B - 5 36
output
col1 col2 col3 col4
A + 1 10
A + 2 20
A + 3 35
B - 5 5
B - 4 21
B - 3 23
B - 2 36
Related
I want to calculate the average over all itemsX (where X is a digit) for each row in Perl on windows.
I have file in format:
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Want to find sum of item blocks and their average.
Any help on this?
Here's what I've tried so far:
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($item1, $item2, ) = split;
$total += $numbers;
$count += 1;
}
For the first line of input (the column names), we store the indices of the columns that start with item. For each subsequent line, we sum the columns referenced by the array slice derived from #indices.
use strict;
use warnings;
use List::Util qw(sum);
my #indices;
while (<DATA>) {
my #fields = split;
if ($. == 1) {
#indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
next;
}
my $sum = sum(#fields[#indices]);
my $avg = $sum / scalar(#indices);
printf("Row %d stats: sum=%d, avg=%.2f\n", $., $sum, $avg);
}
__DATA__
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Output:
Row 2 stats: sum=77, avg=25.67
Row 3 stats: sum=165, avg=55.00
I have the following code. I am trying to print all common values from #arr2 and #arr4 without repetition. The expected output should be 5,6,7,8,9,13,14,15,16,17,18. I am not getting how to put a condition in a loop to avoid repetition and why $i is not printing in this code.
#!/usr/bin/perl
my #arr2 = ( 1 .. 10, 5 .. 15, 10 .. 20 );
my #arr4 = ( 5 .. 9, 13 .. 18 );
foreach my $line1 (#arr2) {
my ( $from1, $to1 ) = split( /\.\./, $line1 );
#print "$to1\n";
foreach my $line2 (#arr4) {
my ( $from2, $to2 ) = split( /\.\./, $line2 );
for ( my $i = $from1; $i <= $to1; $i++ ) {
for ( my $j = $from2; $j <= $to2; $j++ ) {
if ( $i == $j ) {
print "$i \n";
}
}
}
}
}
As Jonathan has pointed out, you appear to misunderstanding the nature of your data because you don't recognize the Range Operator .. used to construct lists.
my #array = (1 .. 10);
print "#array\n";
Outputs
1 2 3 4 5 6 7 8 9 10
Once you recognize that, then you just need to be pointed to the following:
perlfaq4 - How can I remove duplicate elements from a list or array?
perlfaq4 - How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Combined to form:
#!/usr/bin/perl
use strict;
use warnings;
my #arr2 = ( 1 .. 10, 5 .. 15, 10 .. 20 );
my #arr4 = ( 5 .. 9, 13 .. 18 );
my %seen;
$seen{$_}++ for uniq(#arr2), uniq(#arr4);
my #intersection = sort { $a <=> $b } grep { $seen{$_} == 2 } keys %seen;
print "#intersection\n";
sub uniq {
my %seen;
$seen{$_}++ for #_;
return keys %seen;
}
Outputs:
5 6 7 8 9 13 14 15 16 17 18
The first step to understanding your problem is to understand your data — the arrays do not hold what you think they hold.
#!/usr/bin/perl
my #arr2=(1..10,5..15,10..20);
my #arr4=(5..9,13..18);
print "arr2: #arr2\n";
print "arr4: #arr4\n";
The output from this is:
arr2: 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15 16 17 18 19 20
arr4: 5 6 7 8 9 13 14 15 16 17 18
This shows that your code trying to split a string on .. is going to fail horribly.
One of the most basic debugging techniques is printing out the data you've actually got to ensure it matches what you think you should have. Here, that basic printing would have shown that the input data is not in the format you expected.
I'm working with biological data (copy number variations) which is shown as intervals (tab separated file):
File 1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
I intersected them in order to consolidate the overlapping events (50% of overlap is my condition), the result is this:
I used intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html):
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File 2
File 2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
Event A and the event C overlaps with the event B, event E and F overlaps with each other like G and H, finally the event D has no overlapping partners. Knowing this, the list of consolidated CNV should be:
File 3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
I was trying to use the merge option of the HDCNV java software (http://daleylab.org/lab/?page_id=125) but the output is not what I needed. I was trying to write a perl code but I'm a beginner so this problem is, at the moment, out of my limits.
I would appreciate if you can help me with a nice perl or awk code which take File 2 as input and outputs File 3.
Thanks in advance
I'm assuming that the columns have the following meanings:
col 1: chromosome number
col 2: start position of genomic region
col 3: end position of genomic region
col 4: text identifier
This script looks for the areas of overlap between the named regions. It assumes that the input text is sorted by col 1 then col 2. I have put the input text in a string, but you will probably be reading it in from a file (and outputting your data to a file, too). I will leave you to work out how to do that--it is pretty easy, and there is lots of documentation on the perl website.
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my #lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my #range;
foreach (#lines) {
# the chromosome number has changed or
# minimum is greater than current maximum:
# start a new interval
if ($col_0 != $_->[0] || $_->[1] > $max) {
if (#range) {
# print out the range, and restart the stack
say join("\t",
$col_0,
( $min || $_->[1] ),
( $max || $_->[2] ),
join(", ", #range)
);
}
#range = ( $_->[3] );
# set the min and max
$col_0 = $_->[0];
$min = $_->[1];
$max = $_->[2];
}
else {
# the minimum is lower than our current maximum.
# check whether the max is greater than our current
# max and increase it if so. Add the letter to the
# current range.
if ($_->[2] > $max) {
$max = $_->[2];
}
push #range, $_->[3];
}
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", #range) );
Output:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
I have just calculated simple overlap - this doesn't do 50% overlap. Using this script as a start, you can figure out how to do that. We're not doing your PhD for you! ;)
awk '
$2 > end && NR>1 {
print "1", start, end, pair;
start=end=pair=0
}
{
if (!start) { start = $2 };
end = $3;
pair = (pair ? pair "," $4 : $4)
}
END {
print "1", start, end, pair
}' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
Assuming ordered data, the following stub should handle merging the records.
Would just have to modify it to load and output to a file.
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
my #fields = split;
if ( !$last ) {
$last = \#fields;
} elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
$last->[1] = min( $last->[1], $fields[1] );
$last->[2] = max( $last->[2], $fields[2] );
$last->[3] .= ",$fields[3]";
} else {
print join( "\t", #$last ), "\n";
$last = \#fields;
}
}
print join( "\t", #$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
Outputs:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P
My take:
awk -F "\t" -v OFS="\t" '
function emit() {print chrom, start, end, annot}
$1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
annot = annot "," $4
if ($2 < start) start = $2
if ($3 > end) end = $3
next
}
chrom {emit()}
{chrom=$1; start=$2; end=$3; annot=$4}
END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
How to print the array's values in order of array element?
#ab= <DATA>;
print "#ab\n";
#a = qw(a b c d);
foreach $s(#ab){
foreach $m (#a){
$z =~m/$m/g;
print "$z";
}
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
In this program gives outputs but i expect the outputs is
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
First prints the first array element's matching value then second and so on.. How can i do this?
Your outer loop should loop over #a and inner over #ab
my #ab= <DATA>;
my #a = qw(a b c d);
foreach my $m (#a) {
foreach my $s (#ab) {
print $s if $s =~ /^$m/;
}
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
output
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
It looks like you're trying to sort the array alphabetically but only based on the first letter. I think this does what you want:
use strict;
use warnings;
print sort { (substr $a, 0, 1) cmp (substr $b, 0, 1) } <DATA>;
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Output:
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
Try:
#ab= <DATA>;
#a = qw(a b c d);
print map { $tmp = $_; grep { $tmp eq (split(" ", $_))[0] } #ab } #a;
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Explanation
map { ... } #a: loop through each element in array #a, orderly.
$tmp = $_: save the current value of $_ of map function to variable $tmp.
grep { $tmp eq (split(" ", $_))[0] } #ab: with each element in array #ab, we split it to get only the first character, compare with current $tmp value. If equal, grep return that element.
Group the sections, and then print
use strict;
use warnings;
my %group;
while (<DATA>) {
my ($key) = split ' ';
push #{$group{$key}}, $_;
}
for my $key (sort keys %group) {
print #{$group{$key}};
}
__DATA__
d 43
a 5
b 24
d 4
a 12
b 54
c 11
a 1
d 1
a 32
Outputs:
a 5
a 12
a 1
a 32
b 24
b 54
c 11
d 43
d 4
d 1
This scales better than Сухой27's answer: O(S) instead of O(ID*S).
my #ids = qw( a b c d );
my %s_by_id;
while (my $s = <DATA>) {
my ($id) = $s =~ /^(\S+)/
or next;
push #{ $s_by_id{$id} }, $s;
}
for my $id (#ids) {
print #{ $s_by_id{$id} } if $s_by_id{$id};
}
I am trying to study an algorithm in which, given a list of numbers I must calculate a coefficient given by the ratio between the number of triangles founded in the data list and the minimum number of neighbors that a number has; for example, given the first two rows of the file:
1 2 3 4 5 6 9
2 1 3
...
if the first element of a row appears in the other rows and if the first element of the subsequent rows appear in the row taken in exam then I found a link;
if the "link" exists, then I want to count how many times the other elements in the row taken in exam appear in the row where the link is present and print "I have found z triangles".
For example in this case when the program compare the first row and the second row and find that "the link 1 2" exists and found that there is 1 triangle made by the vertex 1,2,3.
In the algorithm I have to divide the number of triangles + 1 by the minimum number of element in each row - 2 ( in this case the minimum number come from the second line and the value is 3-2=1); the coefficient that I am looking for is then (1+1)/1 = 2;
The output file will be written as:
1 2 1
in which in the first two columns I find the element that makes a link and in the 3rd column the value of the coefficient;
Here is the code I've written so far:
use strict;
use warnings;
use 5.010;
use List::Util;
my $filename = "data";
open my $fh, '<', $filename or die "Cannot open $filename: $!";
my $output_file = "output_example";
open my $fi, ">", $output_file or die "Error during $output_file opening: $!";
my %vector;
while (<$fh>) {
my #fields = split;
my $root = shift #fields;
$vector{$root} = { map { $_ => 1} #fields };
}
my #roots = sort { $a <=> $b } keys %vector;
for my $i (0 .. $#roots) {
my $aa = $roots[$i];
my $n_element_a = scalar (keys %{$vector{$aa}})-1;
for my $j ($i+1 .. $#roots) {
my $minimum;
my $bb = $roots[$j];
my $n_element_b = scalar (keys %{$vector{$bb}})-1;
next unless $vector{$aa}{$bb} and $vector{$bb}{$aa};
if ($n_element_a < $n_element_b){
$minimum = $n_element_a;
}else {
$minimum = $n_element_b;
}
my $triangles = 0;
for my $cc ( keys %{$vector{$aa}} ) {
next if $cc == $aa or $cc == $bb;
if ($vector{$bb}{$cc}) {
$triangles++;
}
}
my $coeff;
my #minimum_list;
if ($minimum == 0){
$coeff = ($triangles +1)/($minimum+0.00000000001);
} else {
$coeff = ($triangles +1)/($minimum);
}
say $fi "$aa $bb $coeff";
}
}
__DATA__
1 2 3 4 5 6 9
2 1 3
3 1 2
4 1 5
5 1 4
6 1 7 8
8 6 7
9 1 10 11
10 9 11 12 14
11 9 10 12 13
12 10 13 14
13 11 12
14 10 12 15
15 14
I put the entire dataset at the end. The output file gives:
__OUTPUT__
1 2 2
1 3 2
1 4 2
1 5 2
1 6 0.5
1 9 0.5
2 3 2
4 5 2
6 8 2
9 10 1
9 11 1
10 11 1
10 12 1
10 14 1
11 13 2
12 13 1
12 14 1
14 15 100000000000
Now I would like to find the minimum value of the coefficient, identify the link(s) that present this lower value, erase this elements in the original dataset and repeat the same program on the "new" dataset.
For example in this case the links that present the minimum values are the 1 6 and the 1 9 with a coefficient of 0.5. So now the program should delete in the file data the element "6" in the row that start with "1" and vice-versa and the same with the 9. So now the "new" dataset would be:
1 2 3 4 5
2 1 3
3 1 2
4 1 5
5 1 4
6 7 8
8 6 7
9 10 11
10 9 11 12 14
11 9 10 12 13
12 10 13 14
13 11 12
14 10 12 15
15 14
What I am looking for is:
How can I erase the elements that present the minimum coefficient's value from the dataset contained in the data file?
How can I iterate the processes N times?
To find the minimum from the output file I thought to add at the end of the program these lines:
my $file1 = "output_example";
open my $fg, "<", $file1 or die "Error during $file1 opening: $!";
my #minimum_vector;
while (<$fg>) {
push #minimum_vector, [ split ];
}
my $minima=$minimum_vector[0][2];
for my $i (0 .. $#minimum_vector){
if($minima >= $minimum_vector[$i][2] ){
$minima=$minimum_vector[$i][2];
}
}
say $minima;
close $file1;
But it gives me an error with the $minima because I think I can't read from the same file that I have just created (in this case the output_example file). It runs if I compile in a different program.
The best way to iterate would probably be to break your code up into subroutines. This will also help clarify the code and track down exactly where problems might be occuring.
use strict;
use warnings;
use 5.010;
use List::Utils qw/min/;
sub load_initial_data {
# open and read file, load it into an arrayref and return it.
}
sub find_coefficients {
my $data = shift;
my #results;
foreach my $row (#$data) {
# do stuff to calculate $aa, $bb, $coeff
push #results, [$aa, $bb, $coeff];
}
return \#results;
}
sub filter_data {
my $data = shift;
my $coefficients = shift;
my $min = min map { $_->[2] } #$coefficients;
my #deletions = grep { $min == $_->[2] } #$coefficients;
foreach my $del (#deletions) {
delete( $data->{$del->[0]}{$del->[1]} );
}
}
# doing the actual work:
my $data = load_initial_data();
my $coeffs;
foreach my $pass (0 .. $N) {
$coeffs = find_coefficients( $data );
$data = filter_data( $data, $coeffs );
# You could write $data and/or $coeffs out to a file here
# if you need to keep the intermediate stages
}