match columns on different lines and sum - perl

I have a csv with about 160,000 lines, it looks like this:
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
I would like to pair lines where column 1 is the same, column 3 of one line matches column 2 of the next line, and column 6 is a '+' on one line and a '-' on the other. When this is true I would like to sum column 4 and column 5.
My desired output would be:
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,-
The best solution I can think of is to duplicate the file and then match columns between the file and its duplicate with Perl:
#!/usr/bin/perl
use strict;
use warnings;

open my $firstfile,  '<', $ARGV[0] or die "$!";
open my $secondfile, '<', $ARGV[1] or die "$!";

my ($chr_a, $chr_b, $start, $end, $begin, $finish, $sum_a, $sum_b,
    $total_a, $total_b, $sign_a, $sign_b);

while (<$firstfile>) {
    my @col = split /,/;
    $chr_a   = $col[0];
    $start   = $col[1];
    $end     = $col[2];
    $sum_a   = $col[3];
    $total_a = $col[4];
    $sign_a  = $col[5];

    seek($secondfile, 0, 0);

    while (<$secondfile>) {
        my @seccol = split /,/;
        $chr_b   = $seccol[0];
        $begin   = $seccol[1];
        $finish  = $seccol[2];
        $sum_b   = $seccol[3];
        $total_b = $seccol[4];
        $sign_b  = $seccol[5];

        print join("\t", $col[0], $col[1], $col[2], $col[3] += $seccol[3],
            $col[4] += $seccol[4], $col[5]), "\n"
            if ($chr_a eq $chr_b and $end == $begin and $sign_a ne $sign_b);
    }
}
And that works fine, but ideally I'd like to do this within the file itself, without having to duplicate it. I have many files, so I would like to run a less time-consuming script over all of them.
Thanks.

In the absence of a response to my comment, this program will do as you ask with the data you provide.
use strict;
use warnings;

my @last;

while (<DATA>) {
    s/\s+\z//;
    my @line = split /,/;
    if (@last
            and $last[0] eq $line[0]
            and $last[2] eq $line[1]
            and $last[5] eq '+' and $line[5] eq '-') {
        $last[3] += $line[3];
        $last[4] += $line[4];
        print join(',', @last), "\n";
        @last = ();
    }
    else {
        print join(',', @last), "\n" if @last;
        @last = @line;
    }
}

print join(',', @last), "\n" if @last;
__DATA__
chr1,160,161,3,0.333333333333333,+
chr1,161,162,4,0.5,-
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,18,0.5,+
chr2,511,512,6,0.333333333333333,-
output
chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+
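Since you say you have many files and want to avoid making duplicates, here is a minimal sketch of the same single-pass logic reading the CSV files named on the command line instead of the __DATA__ section (the loop over @ARGV and printing to STDOUT are my own choices; redirect or adapt the output as you need):
#!/usr/bin/perl
use strict;
use warnings;

# Same pairing logic as above, applied to every file given on the command line
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    my @last;
    while (<$fh>) {
        s/\s+\z//;
        my @line = split /,/;
        if (@last
                and $last[0] eq $line[0]
                and $last[2] eq $line[1]
                and $last[5] eq '+' and $line[5] eq '-') {
            $last[3] += $line[3];
            $last[4] += $line[4];
            print join(',', @last), "\n";
            @last = ();
        }
        else {
            print join(',', @last), "\n" if @last;
            @last = @line;
        }
    }
    print join(',', @last), "\n" if @last;
    close $fh;
}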

Related

Range overlapping number and percentage calculation

I want to calculate the overlap number (#) and percentage (%) from a series of range values distributed across four different files, where each record starts with a specific identifier (id) such as NP_111111.4. The initial list of ids is taken from file1.txt (the starting file), and if an id matches an id in another file, the overlaps are calculated. Suppose my files are like this:
file1.txt
NP_111111.4: 1-9 12-20 30-41
YP_222222.2: 3-30 40-80
file2.txt
NP_111111.4: 1-6, 13-22, 31-35, 36-52
NP_414690.4: 360-367, 749-755
YP_222222.2: 19-24, 22-40
file3.txt
NP_418214.2: 1-133, 135-187, 195-272
YP_222222.2: 1-10
file4.txt
NP_418119.2
YP_222222.2 GO:0016878, GO:0051108
NP_111111.4 GO:0005887
From these input files, I want to create a .csv or Excel output with separate columns and this header:
id overlap_file1_file2(#) overlap_file1_file2(%) overlap_file1_file3(#) overlap_file1_file3(%) overlap_file1_file2_file3(#) overlap_file1_file2_file3(%) Go_Terms(File4)
I am learning Perl and found the Perl module "strictures" for this type of range comparison. I am calculating the overlapping number and percentage from two ranges as:
#!/usr/bin/perl
use strictures;
use Number::Range;

my $seq1 = Number::Range->new(8..356);    # Start and stop for file1.txt
my $seq2 = Number::Range->new(156..267);  # Start and stop for file2.txt

my $overlap = 0;
my $sseq1   = $seq1->size;
my $percent = (($seq2->size * 100) / $seq1->size);

foreach my $int ($seq2->range) {
    if ( $seq1->inrange($int) ) {
        $overlap++;
    }
    else {
        next;
    }
}

print "Total size= $sseq1 Number overlapped= $overlap Percentage overlap= $percent \n";
But I could not find a way to match the ids of file1.txt with the other files, extract the specific information, and print it to an output CSV file.
Please help. Thanks for your consideration.
This is a fragile solution in that it can only check 3 files for overlaps. If more files are involved, the code would need to be restructured. It uses Set::IntSpan to calculate the overlaps (and the percentage of overlap).
#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use autodie;

my $file1 = 'file1';
my @files = qw/file2 file3/;

my %data;
my %ids;

open my $fh1, '<', $file1;
while (<$fh1>) {
    chomp;
    my ($id, $list) = split /:\s/;
    $ids{$id}++;
    $data{$file1}{$id} = Set::IntSpan->new(split ' ', $list);
}
close $fh1;

for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        chomp;
        my ($id, $list) = split /:\s/;
        next unless exists $ids{$id};
        $data{$file}{$id} = Set::IntSpan->new(split /,\s/, $list);
    }
    close $fh;
}

my %go_terms;
open my $go, '<', 'file4';
while (<$go>) {
    chomp;
    my ($id, $terms) = split ' ', $_, 2;
    $go_terms{$id} = $terms =~ tr/,//dr;
}
close $go;

my %output;
for my $file (@files) {
    for my $id (keys %ids) {
        my $count   = ($data{$file1}{$id} * $data{$file}{$id})->size;
        my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
        $output{$id}{$file} = [$count, $percent];
    }
}

for my $id (keys %ids) {
    my $count   = ($data{$file1}{$id} * $data{$files[0]}{$id} * $data{$files[1]}{$id})->size;
    my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
    $output{$id}{all_files} = [$count, $percent];
}

# output saved as f2.csv
print join(",", qw/ID f1f2_overlap f1f2_%overlap
                   f1f3_overlap f1f3_%overlap
                   f1f2f3_overlap f1f2f3_%overlap Go_terms/), "\n";

for my $id (keys %output) {
    print "$id,";
    for my $file (@files, 'all_files') {
        my $aref = $output{$id}{$file};
        print join(",", @$aref), ",";
    }
    print +($go_terms{$id} // ''), "\n";
}
The Excel sheet looks like this.
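For what it's worth, here is a minimal sketch of how the overlap counts above are obtained, using the NP_111111.4 ranges from file1.txt and file2.txt written as run-list strings (the overloaded * operator and the size method are the same ones used in the program above):
use strict;
use warnings;
use Set::IntSpan;

# NP_111111.4 ranges from file1.txt and file2.txt as run-list strings
my $f1 = Set::IntSpan->new('1-9,12-20,30-41');
my $f2 = Set::IntSpan->new('1-6,13-22,31-35,36-52');

my $overlap = $f1 * $f2;    # intersection of the two sets
my $percent = sprintf "%.0f", 100 * $overlap->size / $f1->size;
print $overlap->size, " overlapping positions ($percent%)\n";   # should print: 25 overlapping positions (83%)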

Selecting records from a file based on keys from a second file

My first file looks like:
CHR id position
1 rs58108140 10583
1 rs189107123 10611
1 rs180734498 13302
1 rs144762171 13327
1 chr1:13957:D 13957
And my second file looks like:
CHR SNP POS RiskAl OTHER_ALLELE RAF logOR Pval
10 rs1999138 110140096 T C 0.449034245446375 0.0924443 1.09e-06
6 rs7741604 20839503 C A 0.138318264238111 0.127947 1.1e-06
8 rs1486006 82553172 G C 0.833130882716561 0.147456 1.12727730194884e-06
My script reads in the first file and stores it in an array, and then I would like to find rsIDs from column 2 of the first file that are in column 2 of the second file. I think I am having a problem with how I'm matching the expressions. Here is my script:
#! perl -w
use strict;
use warnings;

my $F = shift @ARGV;
my @snps;

open IN, "$F";
while (<IN>) {
    next if m/CHR/;
    my @L = split;
    push @snps, [$L[0], $L[1], $L[2]] if $L[0] !~ m/[XY]/;
}
close IN;

open IN, "DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt";
while (<IN>) {
    my @L = split;
    next if m/CHR/;
    foreach (@snps) {
        next if ($L[0] != ${$_}[0]);
        # if not on same chromosome
        if ($L[0] = ${$_}[0]) {
            # if on same chromosome
            if ($L[1] =~ ${$_}[1]) {
                print "$L[0] $L[1] ${$_}[2]\n";
                last;
            }
        }
    }
}
Your code doesn't seem to correspond to your description. You are comparing both the first and second columns of the file rather than just the second.
The main problems are:
You use $L[0] = ${$_}[0] to compare the first columns. This will do an assignment instead of a comparison. You should use $L[0] == ${$_}[0] instead or, better, $L[0] == $_->[0].
You use $L[1] =~ ${$_}[1] to compare the second columns. This will check whether ${$_}[1] is a substring of $L[1]. You could use anchors like $L[1] =~ /^${$_}[1]$/ but it's much better to just do a string comparison as $L[1] eq $_->[1]
The easiest way is to read the second file first so as to build a list of values that you want included from the first file. I have written it so that it does what your code looks like it's supposed to do, i.e. match the first two columns.
That would look like this
use strict;
use warnings;
use autodie;

my ($f1, $f2) = @ARGV;

my %include;

open my $fh2, '<', $f2;
while (<$fh2>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    ++$include{$key};
}
close $fh2;

open my $fh1, '<', $f1;
while (<$fh1>) {
    my @fields = split;
    my $key = join '|', @fields[0,1];
    print "@fields[0,1,2]\n" if $include{$key};
}
close $fh1;
output
Unfortunately your choice of sample data doesn't include any records in the first file that have matching keys in the second, so there is no output!
Update
This is a corrected version of your own program. It should work, but it would be far more efficient and concise to use hashes, as above.
use strict;
use warnings;
use autodie;

my ($filename) = @ARGV;

my @snps;

open my $in_fh, '<', $filename;
<$in_fh>;    # Discard header line

while (<$in_fh>) {
    my @fields = split;
    push @snps, \@fields unless $fields[0] =~ /[XY]/;
}
close $in_fh;

open $in_fh, '<', 'DIAGRAMv3sansWTCCCqc0clumpd_noTCF7L2regOrLeadOrPlt1em6clumps- CHR_SNP_POS_RiskAl_OtherAl_RAF_logOR_Pval.txt';
<$in_fh>;    # Discard header line

while (<$in_fh>) {
    my @fields = split;
    for my $snp (@snps) {
        next unless $fields[0] == $snp->[0] and $fields[1] eq $snp->[1];
        print "$fields[0] $fields[1] $snp->[2]\n";
        last;
    }
}
close $in_fh;

Degeneracy of characters when searching for a specific sub-string

I have the following script which searches for specified substrings within an input string (a DNA sequence). I was wondering if anybody could help out with being able to specify degeneracy of specific characters. For example, instead of searching for GATC (or anything consisting solely of G's, T's, A's and C's), I could instead search for GRTNA where R = A or G and where N = A, G, C or T. I would need to be able to specify quite a few of these in a long list within the script. Many thanks for any help or tips!
use warnings;
use strict;

# User Input
my $usage = "Usage (OSX Terminal): perl <$0> <FASTA File> <Results Directory + Filename>\n";

# Reading formatted FASTA/FA files
sub read_fasta {
    my ($in) = @_;
    my $sequence = "";
    while (<$in>) {
        my $line = $_;
        chomp($line);
        if ($line =~ /^>/) { next }
        else               { $sequence .= $line }
    }
    return (\$sequence);
}

# Scanning for restriction sites and length-output
open(my $in,  "<", shift);
open(my $out, ">", shift);

my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";

my $len = length($$DNA);
print "\n DNA Length is: $len \n";

my @pats = qw( GTTAAC );

for (@pats) {
    my $m = () = $$DNA =~ /$_/gi;
    print "\n Total DNA matches to $_ are: $m \n";
}

my $pat = join("|", @pats);
my @cutarr = split(/$pat/, $$DNA);

for (@cutarr) {
    my $len = length($_);
    print $out "$len \n";
}

close($out);
close($in);
GRTNA would correspond to the pattern G[AG]T[AGCT]A.
It looks like you could do this by writing
for (@pats) {
    s/R/[AG]/g;
    s/N/[AGCT]/g;
}
before
my $pat = join '|', @pats;
my @cutarr = split /$pat/, $$DNA;
but I'm not sure I can help you with the requirement that "I would need to be able to specify quite a few of these in a long list within the script". I think it would be best to put your sequences in a separate text file rather than embed the list directly into the program.
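If you do end up needing the full IUPAC alphabet rather than just R and N, one way (just a sketch; the %iupac hash below is the standard IUPAC nucleotide code table) is to expand every degenerate code through a lookup before building the alternation:
my %iupac = (
    R => '[AG]',  Y => '[CT]',  S => '[GC]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

for (@pats) {
    s/([RYSWKMBDHVN])/$iupac{$1}/g;    # replace each degenerate code with its character class
}
my $pat = join '|', @pats;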
By the way, wouldn't it be simpler just to
return $sequence
from your read_fasta subroutine? Returning a reference just means you have to dereference it everywhere with $$DNA. I suggest that it should look like this
sub read_fasta {
    my ($fh) = @_;
    my $sequence;
    while (<$fh>) {
        unless (/^>/) {
            chomp;
            $sequence .= $_;
        }
    }
    return $sequence;
}
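With that change the call site needs no dereferencing at all, for example:
my $DNA = read_fasta($in);
print "DNA is: \n $DNA \n";
my $len = length $DNA;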

find partial matching two files in perl

I want to write a Perl program. The first input file has 2 columns of text: the first column is a label and the second column is the search string. The second input file also has 2 columns: the first column is a label and the second column is the text to be searched. For example, comparing the second columns, John (in file1) is more similar to Johni in file2 than to John in file2.
file1
John AABBBCCCDEE
Jam WWQQQQQQQERRRTTTTTT
file2
Jami EWWQQQQQQQERRRTTTTTTTTTT
Johni AAAAABBBCCCDEEEEEEHHHHHH
Mark WWWCCVVVVVVFFFFFFFTTTTTT
ROB #############VVVVVVVVVVV
John WWADFRWSSSSSSDDDDDqqqqqq
output
Jami EWWQQQQQQQERRRTTTTTTTTTT Jam WWQQQQQQQERRRTTTTTT
Johni AAAAABBBCCCDEEEEEEHHHHHH John AABBBCCCDEE
I tried the following code but it doesn't work the way I want.
#!/user/bin/perl
use warnings;
use strict;

my ($infile1) = $ARGV[0];
my ($infile2) = $ARGV[1];

open(my $fh1, "<$infile1");
while (my $file1 = <$fh1>) {
    my @file1 = split("\t| ", $file1);
    my $name_file1 = $file1[0];
    my $ID_file1   = $file1[1];
    my @matchline_file2 = `cat $infile2 | grep $name_file1`;
    for my $ID_file1 (@file1) {
        if (grep my $ID_file2 eq $ID_file1, @matchline_file2) {
            print "found\n";
        }
        else {
            print "not_found\n";
        }
    }
}
This doesn't print the results in reverse order like your output. I'm not sure if that was intentional. You could store the results in an array and reverse or sort the order if you like. Your example is very limited and this is just a best estimate of what you're trying to do.
#!/usr/bin/perl
use warnings;
use strict;

my ($infile1) = $ARGV[0];
my ($infile2) = $ARGV[1];

my $search_file = "";

open(my $fh2, "<$infile2");
while (my $line = <$fh2>) {
    $search_file .= $line;
}

open(my $fh1, "<$infile1");
while (my $line = <$fh1>) {
    chomp($line);
    if ($line =~ m/\w+\s+(.*)/) {
        my $search_string = quotemeta("$1");
        if ($search_file =~ m/(.*$search_string.*)/) {
            print "$1\t$line\n";
        }
        else {
            print "Could not find: $line\n";
        }
    }
    else {
        print "Invalid line: $line\n";
    }
}
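For what it's worth, run against the two sample files above this should print something like the following, tab-separated and in file1 order rather than the order shown in your expected output, as noted (script.pl is just whatever name you save the program under):
$ perl script.pl file1 file2
Johni AAAAABBBCCCDEEEEEEHHHHHH	John AABBBCCCDEE
Jami EWWQQQQQQQERRRTTTTTTTTTT	Jam WWQQQQQQQERRRTTTTTT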

Argument is not numeric error while comparing hash values based on keys

#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;

my @mdsum;

open (IN1, "$ARGV[0]") || die "counldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");

my %knomexl = ();
my %knomemdsum = ();
my @arrfile;
my $tempkey;
my $tempval;
my @values;
my $val;
my $i;
my @newarra;
my $testxl;
my $testmdsum;

while (<IN1>) {
    next if /barcode/;
    @arrfile = split('\t', $_);
    $knomexl{$arrfile[0]} = $arrfile[2];
}

while (<IN2>) {
    chomp $_;
    @newarra = split(/ {1,}/, $_);
    $tempval = $newarra[0];
    $tempkey = $newarra[1];
    $tempkey =~ s/\t*$//g;
    $tempval =~ s/\s*$//g;
    $tempkey =~ s/.tar.gz//g;
    $knomemdsum{$tempkey} = $tempval;
}

@values = keys %knomexl;

foreach $i (@values) {
    $testxl = $knomexl{$values[$i]};
    print $testxl . "\n";
    $testmdsum = $knomemdsum{$values[$i]};
    print $testmdsum . "\n";
    if ( $testxl ne $testmdsum ) {
        if ($testxl ne "") {
            print MYFILE "Files hasving md5sum issue $i\n";
        }
    }
}

close (MYFILE);
I have two files, both containing file names and md5sum values, and I need to check which files' md5sum values do not match. I understand that in some cases a value and its corresponding value will not both be there, and I want those cases only. Any workaround on this code, please? The code is pretty simple but I don't know why it's not working!! :(
@values = keys %knomexl;
foreach $i (@values) {
    #print Dumper $knomexl{$values[$i]};
    $testxl = $knomexl{$i};
    print $testxl . "\n";
    $testmdsum = $knomemdsum{$i};
    print $testmdsum . "\n";
$i is an element of @values because of the foreach, not an index, so you shouldn't use $values[$i].
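Put another way, a minimal corrected version of that final loop (keeping your variable names; the // '' defaults are only there to avoid warnings when a key exists in one hash but not the other, which you said can happen, and the // operator needs Perl 5.10 or later) would be:
foreach my $i (keys %knomexl) {
    my $testxl    = $knomexl{$i}    // '';
    my $testmdsum = $knomemdsum{$i} // '';
    if ($testxl ne '' and $testxl ne $testmdsum) {
        print MYFILE "File having md5sum issue $i\n";
    }
}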