split one column by ',' and use the values in calculations - perl

I'm writing a script in which I'm using a text file, where in one column there can be two letters (A,B,C or D) seperated by a ",". This column can also just contain one of those letters. I have to use both letters for further calculations in the rest of the script. This is a simplified example of my input file (here $variants):
C1 C2 C3 C4 C5 C6 ... C9
text 2 A D values and text in the other columns
text 4 B C values and text in the other columns
text 5 A B,D values and text in the other columns
So in line 3 of C4 there is a B and D. After C4 there are still a lot of columns, which cannot be changed since I need them in other parts of my script.
I have a second input file from which, based on the letters present in C3 and C4, some values are extracted. This is how this second input file looks like (here $frequency)
C1 C2 A a B b C c D d
text 1 0 1 0 0 0 0 0 0
text 2 1 0 5 4 0 0 0 0
text 3 0 0 0 0 10 11 3 6
text 4 1 0 9 4 0 2 0 0
text 5 5 3 0 0 6 7 4 0
This is how my output should look like:
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
text 2 A D 1 0 0 0 empty
text 4 B C 9 4 0 2 empty
text 5 A B,D 5 3 0 0 4 0
So for line 1, there is A in C3, then the script extracts the values for A and a from $frequency and puts them in C5 and C6. The values from C4 are then put in C7 and C8 from the output file. Now in the 3rd line there is B,D in C4. So what the script needs to do now is putting the corresponding values from B and b in C7 and C8 and the values for D and d in C9 and C10.
The only thing where I have still problems in my script is in splitting up this C4 when there is a ','. The rest is working.
This is how the problematic part of my script looks like
while(<$variants>){
next if /^\s*#/;
next if /^\s*"/;
chomp;
my ($chr, $pos, $refall, #altall) = split /\t/; # How should I specify here the C4, as an array? So that I don't know
my #ref_data = #{$frequency_data[$pos]}{$refall, lc($refall)};
my #alt_data = #{$frequency_data[$pos]}{$altall, lc($altall)}; # this works for C3 ($refall), but not for C4 when there are two letters
$pos = $#genes if $circular and $pos > $#genes; # adding annotation # this can be ignored here, since this line isn't part of my question
print join("\t","$_ ", $genes[$pos] // q(), #ref_data, #alt_data), "\n"; # printing annotation
}
So could someone help me with splitting of this C4 by ',' and still use the information for extracting values from $variants

I think the easiest would be treating columns 3 and 4 as lists from the get-go:
while(<$variants>){
next if /^\s*#/;
next if /^\s*"/;
chomp;
my ($chr, $pos, $refall_string, $altall_string, #other) = split /\t/;
my #refall = split(",", $refall_string);
my #altall = split(",", $altall_string);
my #ref_data_all = (); # Treat C3 as array just in case...
foreach my $refall (#refall) {
push #ref_data_all, #{$frequency_data[$pos]}{ $refall, lc($refall) };
}
my #alt_data_all = ();
foreach my $altall (#altall) {
push #alt_data_all, #{$frequency_data[$pos]}{ $altall, lc($altall) };
}
$pos = $#genes if $circular and $pos > $#genes;
print join("\t","$_ ", $genes[$pos] // q(),
#ref_data_all, #alt_data_all), "\n";
}
I didn't test this but the approach should be clear even if there's some minor bugs.

All you need is a couple of map calls.
If you write
map { $_, lc } split /,/, $refall
then you have split the field at any commas and duplicated each letter as upper case and lower case.
This is the complete loop (tested).
while (<$variants>) {
next if /^\s*#/;
next if /^\s*"/;
chomp;
my ($chr, $pos, $refall, $altall) = split /\t/;
my $entry = $frequency_data[$pos];
my #ref_data = map { $entry->{$_} } map { $_, lc } split /,/, $refall;
my #alt_data = map { $entry->{$_} } map { $_, lc } split /,/, $altall;
$pos = $#genes if $circular and $pos > $#genes;
print join("\t","$_ ", $genes[$pos] // q(), #ref_data, #alt_data), "\n";
}

Related

horizontal absolute values of every line

I am trying to calculate the absolute values of line 2 - values of line 1
and then the horizontal absolute values of every line in my input file. Here's a part of that input.
43 402 51 360
63 60 69 63
65 53 89 55
103 138 135 135
109 36 123 38
To be more precise about what im trying to do I made the following example
initial data
0 2 0 0
0 1 1 1
next stage (absolute value after subscription the second line minus the first line)
2 2 0
1 0 0
final stage (horizontal application of abs values until one column remained)
0
1
The below code was a failed attempt to obtain the final stage of the single column. My problem here is that I don't know how to obtain the final (desired) stage by using subroutine, as I believe that it is a better way to solving my problem. Of course, every idea or better approach is welcome.
#!/usr/bin/perl
use feature qw(say);
use strict;
use warnings;
use Data::Dumper;
my #rows = 'table_only_numbers';
open(my $fh, '<:encoding(UTF-8)', $rows)
sub ori {
for ($num_cols=#{ $rows[$r-1]}; $num_cols=1; $num_cols-- ){
my #diff_diffs = map { abs($diffs[$_-1] - $diffs[$_]) } 1..$num_cols-1;
#final=#diff_diffs;
say join ' ',#final;
return (final) }
my $num_cols = #{ $rows[0] };
for my $r (1..$#rows) {
die "Bad format!" if #{ $rows[$r] } != $num_cols;
my #diffs = map { abs($rows[$r-1][$_] - $rows[$r][$_]) } 0..$num_cols-1;
while ($num_cols>1)
{
$final_output = ori(#{ $rows[0] })
say "final_output";
}
}
close $fh;
Finally, I figure it by myself without subroutines!!! Im posting it in case someone face the same issue in the future.I know that it is an easiest way to do it but as I am newbie in Perl it is the easiest way for me.
So I used:
for the first abs of the line 2 minus the line 1
my #data = map { abs($current[$_]-$previous[$_]) } 0..$#current;
push #final, \#data;
To obtain the absolute value of row 2 minus the row 1
And after I used 3 times as I had 3 columns left (in my case) the following coding line and each time I Substituted the #xxx with a new variable. and I have desired output of I column.
foreach my $row (#XXX) {
my #data = map { abs(#{$row}[$_]-#{$row}[$_+1]) } 0..$#{$row}-1;
say join ' ', #data;
push #XXX, \#data;}

Comparing two files, where one piece of information can be flexible

Comparing two files. So easy, but comparing two files where one piece of information can be flexible is proving to be very challenging for me.
fileA
4 "dup" 37036335 37044984
3 "dup" 100146708 100147504
7 "del" 100 203
2 "dup" 34 89
fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109
Anticipated output:
output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735
fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504
output_file2 (found in fileA, but not in FileB including non-overlap)
7 "del" 100 203
2 "dup" 34 89
output_file3 (found in fileB, but not in FileA including non-overlap)
4 "dup" 68 109
The credentials are...
I need field 1 and field 2 in the first file to exactly match the second file and the coordinates in field 3 to be exact or overlap.
This would mean these are the same.
fileA :4 "dup" 37036335 37044984
fileB :4 "dup" 37036335 37036735
I also need to find differences between the two files. (no-overlap, 1 row isn't present in one file, but not in the other, etc)
Here's the gist of what I've tried. I've written this code probably 4 different ways, alas, still no success. I've put both files into arrays (I've tried a hash too...idk)
## if no hits in original, but hits in calculated
if((! #ori) && (#calc)){}
## if CNV calls in original, but none in calculated
if((#ori) && (! #calc)){}
## if CNV calls in both
if((#ori) && (#calc)){
## compare calls with double 'for' loop
foreach my $l (#ori){
my #l = split(/\s/,$l);
my $Ochromosome = $l[0];
my $Ostart = $l[2];
my $Oend = $l[3];
my $Otype = $l[1];
foreach my $l (#calc){
my #l = split(/\s/,$l);
my $Cchromosome = $l[0];
my $Cstart = $l[2];
my $Cend = $l[3];
my $Ctype = $l[1];
## check chromosome and type here
if(($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)){ ## what if there are two duplications on the same chromosome?
## check coordinates
if(($Ostart <= $Cend) && ($Cstart <= $Oend)){
## overlap
}else{
## noOverlap
}
}else{
## what if there is something found in one, but not in the other and they both have calls?
## ahhhh
}
}
}
Here is a simple solution which is also fairly efficient.
Iterate over lines of one file, checking each against all lines of the other (until a match is found). This is the very least we must do complexity wise, given all information that needs to be gathered.
If a line from A is not found in B, it is added to #not_in_B. To determine which lines in B are not in A, we prepare a hash where each element of B is a key with a value 0. Once/if an element of B is found, the value of its key in the hash is set to 1. Those that are not 1 at the end have never been found by elements of A, and so are the extra ones. They go in #not_in_A.
Both files are first read into arrays for simplicity (but this is needed for the inner loop).
use warnings;
use strict;
use feature 'say';
my $f1 = 'f1.txt';
my $f2 = 'f2.txt';
open my $fh, '<', $f1;
my #a1 = <$fh>; chomp(#a1);
open $fh, '<', $f2;
my #a2 = <$fh>; chomp(#a2);
close $fh;
my (#not_in_A, #not_in_B);
my %Bs_in_A = map { $_ => 0 } #a2;
foreach my $e1 (#a1)
{
my $match = 0;
foreach my $e2 (#a2)
{
if ( lines_match($e1, $e2) ) {
$match = 1;
say "Match:\n\tf1: $e1\n\tf2: $e2";
$Bs_in_A{$e2} = 1;
last;
}
}
push #not_in_B, $e1 if not $match;
}
#not_in_A = grep { $Bs_in_A{$_} == 0 } keys %Bs_in_A;
say '---';
say "Elements of A that are not in B:";
say "\t$_" for #not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for #not_in_A;
sub lines_match
{
my ($l1, $l2) = #_;
my #t1 = split ' ', $l1;
my #t2 = split ' ', $l2;
# First two fields must be the same
return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
# Third-to-fourth-field ranges must overlap
return
if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);
return 1; # match
}
Output
Match:
f1: 4 "dup" 37036335 37044984
f2: 4 "dup" 37036335 37036735
Match:
f1: 3 "dup" 100146708 100147504
f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
7 "del" 100 203
2 "dup" 34 89
Elements of B that are not in A:
4 "dup" 68 109
Note that I've used 1 in place of A and 2 in place of B.

Perl - How do I add up (calculate) numbers if they have same string in array

I want to add up the following numbers if they have same string in array.
Example input:
AA 1
AA 3
AA 2
BB 4
BB 6
Desired output:
AA 6
BB 10
Hashes are useful for grouping.
my %sums;
while (<>) {
my #F = split;
$sums{$F[0]} += $F[1];
}
print("$_ $sums{$_}\n") for sort keys %sums;

perl add contents of a column of a file

Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8
Basically I have a file as shown above and would like to add the contents of Column B i.e 8+60+10+60. To be frank I'm not sure if need to remove the first line being text and if I can use the split function and put it in a hash something along the lines:
my %hash = map {split/\s+/,$_,4} <$file>;
Thanks in advance for the help.
If you just want to sum up the second column, a hash is overkill. You can do something like this and calculate the sum directly in the map.
my $sum;
$sum += (split /\s+/, $_)[1] while <$file>;
Edit: If you have header rows or other rows with non-numeric values in column 2, then as the comments below indicate, you will run into problems. You can avoid this by trading split for a regular expression, like so:
my $sum = 0;
while (<STDIN>)
{
$sum += $1 if $_ =~ /^\S+\s+(\d+)/;
}
If it's possible that column 1 has no text (ie. the line starts with a single blank and the first non-blank represents the second column), then change the first part of the pattern from ^\S+ to ^\S*.
This is an example based on your data:
use strict;
use warnings;
my $sum_column_b = 0;
<DATA>; #drop header
while( my $line = <DATA>) {
$line =~ m/\s+(\d+)/; #regexpr to catch second column values
$sum_column_b += $1;
}
print $sum_column_b, "\n"; #<-- prints: 138
__DATA__
Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8

Delete the first character of array elements in Perl

I would like to remove the first character from a the elements of an array in a Perl script.
I have this line of script:
#dash = split /\s+/, $dash;
The variable "dash" is read from a particular row of my file: Example
21 A10 A11 A12 A13 ..
Then I have tried to push these values to my hash called "flowers"
for $i (1..$#dash) {
push(#flowers, $line[$i]);
}
This seems to work for what I need in my subsequent lines of script but I have found out that $dash contains unwanted character in front of each values:
A10 A11 A12 A13 ..
instead of
10 11 12 13 .....
but I wanted #flowers to contain:
10 11 12 13 ....
How can I delete the first character Before I pushed it to my hash (#flowers)
chop(#flowers);
could have worked but it only chops out the last character. When I tried to use
substr($dash, 0, 2)
It does produce 10, but all the rest of the values A11 A12 A13 is no longer in my #flowers.
Any help is appreciated.
This will operate on each element of the #dash array :
#dash = split /\s+/, $dash;
shift #dash;
#dash = map { substr($_, 1) } #dash;
Your substr($dash, 0, 2) was operating on the line as one string, not each element of it.
And, unless you need the index for some other operation :
push #flowers, #dash
That will push all elements of #dash onto #flowers. Which looks like what you're doing.
Why not just change the regex in the split?
split /\s+\D?/, $dash;
Adding them to #flowers this way if you want:
push( #flowers, split(/\s+\D?/, $dash) );
You need some kind of loop, since you want to do something to each element of #dash other than the first. map is convenient here.
my #flowers = map substr($dash[$_], 1), 1..$#dash;
which is the short way of writing
my #flowers;
for (1..$#dash) {
push #flowers, substr($dash[$_], 1);
}
I suggest that you just pull out all the digit sequences from $dash, like this:
my $dash = '21 A10 A11 A12 A13 .. ';
my #flowers = $dash =~ /\d+/g;
shift #flowers;
print "#flowers";
output
10 11 12 13
This is a possible solution:
use strict;
use warnings;
my $dash = "21 A10 A11 A12 A13"; #test data
my #dash = split /\s+/, $dash; #split into #dash array
shift #dash; #delete first array value
$_ = substr($_,1) for #dash; #for each item in array, remove the first character
print "#dash\n"; #prints: 10 11 12 13