Compare fields of two VCF files - perl

I would like to ask you for help with an apparently easy script I am trying to work on.
Basically I would like to compare each field of two tab-delimited files:
if the second field of the two files matches --> compare all the remaining fields of the line.
Wherever the field of the first file is "NA", print the field of the second file at the same position.
Now I have written this small script, but the problems I am having are:
1- how to keep the first 9 fields from the first file
2- how to tell Perl to print out the line with the changed fields from the second file.
Here is an example in case I was not clear:
File 1:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1
File2:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Desired tab-delimited output:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Thank you in advance for any comment and help!
use strict;
use warnings;
my $frameshift_file = <>;
my $monomorphic_file = <>;
my @split_file1 = split "\t", $frameshift_file; # splits the line on tabs
my @split_file2 = split "\t", $monomorphic_file; # splits line on tab-delimited fields
if ($split_file1[1] eq $split_file2[1]) {
    for (my $i = 0; $i < scalar(@split_file1); $i++) {
        if ($split_file1[$i] eq "NA") {
            print $split_file2[$i], "\t";
        } else {
            print $split_file1[$i], "\t";
        }
    }
}

Try something like this (replace "\s+" with "\t" to split only on tabs):
use strict;
use warnings;
my (@split_file1, @split_file2, $frameshift_file, $monomorphic_file, $x);
$frameshift_file = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1";
$monomorphic_file = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1";
(@split_file1) = split('\s+', $frameshift_file); # splits the line on whitespace
(@split_file2) = split('\s+', $monomorphic_file);
if ("$split_file1[1]" eq "$split_file2[1]") { # 2nd field of files match
    for ($x = 2; $x <= $#split_file1; $x++) {
        if ($split_file1[$x] eq "NA") { # if file1 shows "NA", print file2's equivalent array element
            print "split_file1[$x] = \"NA\" .. split_file2[$x] = $split_file2[$x]\n";
        }
    }
}


awk or perl to update/replace pattern with specific value

Trying to use perl or awk to update a specific pattern in a file to a specific text value. The code does execute but does not produce the desired output; rather, it seems to repeat $3 instead of updating it. Thank you!
So when 0/0 or 1/1 is found in $3 it should be updated/replaced with hom, and when 0/1 is found in $3 it should be replaced/updated to het.
file space delimited
chr1 115252242 0/0
chr1 247587408 1/1
chr5 35873605 0/1
perl
perl -plae '
BEGIN{ %h = qw(0/0 hom 0/1 het 1/1 hom 1/2 het 2/2 hom) } # define patterns
/^.*([0-2]\/[0-2])/ # define hash values to use with patterns and
$_ .= join "\t", ("", $1, $2, $h{$3})' file # print updated output
current
chr1 115252242 0/0 0/0
chr1 247587408 1/1 1/1
chr5 35873605 0/1 0/1
desired tab-delimited
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 het
Your Perl script seems to be trying to use $1, $2 as if they are awk fields:
/^.*([0-2]\/[0-2])/ and
$_ .= join "\t", ("", $1, $2, $h{$3})' file
However, in Perl these variables refer to capture groups from a regex match, and your regex has only a single capture group. This becomes clearer if you modify the join as:
/^.*([0-2]\/[0-2])/ and
$_ .= join "#", ("", ":", $1, "::", $2, ":::", $h{$3})' file
Your output will be:
chr1 115252242 0/0#:#0/0#::##:::#
chr1 247587408 1/1#:#1/1#::##:::#
chr5 35873605 0/1#:#0/1#::##:::#
In addition .= appends to the existing value; it does not replace it.
A modification to your script to fix this is to add an extra capture group:
/^(.*)([0-2]\/[0-2])/
Now you can refer to $1 and $2:
$_ = join("\t", $1, $h{$2});
However, as $1 ends with whitespace already, and you seem to want to end up with TSV, you can extract leading/trailing whitespace by:
/^\s*(.*)([0-2]\/[0-2])\s*$/
and then replace all remaining runs of whitespace with a single tab:
s/\s+/\t/g
Finally, instead of m// (match) followed by join, you can simply use s/// (find and replace). We use && so that the second replacement only happens if the first one did anything:
s/^\s*(.*)([0-2]\/[0-2])\s*$/$1$h{$2}/ && s/\s+/\t/g
The final program is:
perl -plae '
BEGIN{ %h = qw( 0/0 hom 0/1 het 1/1 hom 1/2 het 2/2 hom ) }
s/^\s*(.*)([0-2]\/[0-2])\s*$/$1$h{$2}/ && s/\s+/\t/g
' file
An awk equivalent might be:
awk -v OFS="\t" '
$3 ~ /^[0-9]+\/[0-9]+$/ {
split($3,n,"/")
$3 = n[1]==n[2] ? "hom" : "het"
}
1
' file
This processes lines with whitespace-delimited columns. It checks if the 3rd column has the appropriate form. If so, it is split into two numbers. If the numbers match, it sets the 3rd column to "hom"; if not, to "het". 1 is a shorter way to write {print}.
Could you please try following.
awk '
BEGIN{
OFS="\t"
}
$NF=="0/0" || $NF=="1/1"{
$NF="hom"
}
$NF=="0/1"{
$NF="het"
}
1
' Input_file
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section of this program here.
OFS="\t" ##Setting OFS as \t(tab) here.
} ##Closing BLOCK for BEGIN section of this command here.
$NF=="0/0" || $NF=="1/1"{ ##Checking condition if last field is either equal to 0/0 OR 1/1 then do following.
$NF="hom" ##Set last field value of as string hom string here.
} ##Closing BLOCK for this above condition here.
$NF=="0/1"{ ##Checking condition if last field value is 0/1 then do following.
$NF="het" ##Setting last field value as het string here.
} ##Closing BLOCK for condition here.
1 ##mentioning 1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
Following code should give desired result
use strict;
use warnings;
my $file = do { local $/; <DATA> };
$file =~ s#(0/0|1/1)#hom#g;
$file =~ s#0/1#het#g;
$file =~ s# +#\t#g; # replace space separators to tab
print $file;
__DATA__
chr1 115252242 0/0
chr1 247587408 1/1
chr5 35873605 0/1
output
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 het
$ awk 'BEGIN{map["0/0"]=map["1/1"]="hom"; map["0/1"]="het"} $3 in map{$3=map[$3]} 1' file
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 het

Perl File with newline to HASH then to CSV

Hello Perl experts,
I'm sorry if I'm asking too much, but I just started learning Perl and want to know more about going from a hash to CSV. I tried breaking my head over this for a few days but didn't get far.
My requirement is to convert the file below and print it as CSV, without using custom modules from CPAN, since custom modules are not allowed here.
The logic I tried was to read the input file into a hash and print it like a CSV.
First problem: something is wrong with my code.
There is a complicated part: I need to search using the first keyword before the space, and if I don't find it, I need to add an empty value (''). E.g. ovpa_type is not present in the second node.
The file has about 5L (500,000) lines like this.
I want to learn about adding lines into a hash of arrays, like PowerShell does, and converting to CSV wherever I want.
My input file contains the below data.
begin node
name ccaita23.contoso.com
on_off on
group SYSTEM_PING_UNIX
suppress no
auto_delete yes
read_community public
write_community public
address 2.1.52.36
port 161
ovpa_type router
trace off
snmp_version 1
engineid 0
auth_protocol 1
is_key_ok 0
error_status 0
security_level 0
v3_user 0
priv_protocol 0
end node
begin node
name ccaidi7c.contoso.com
on_off on
group SYSTEM_PING_UNIX
suppress no
auto_delete yes
read_community public
write_community public
address 1.1.210.76
port 161
trace off
snmp_version 1
engineid 0
auth_protocol 1
is_key_ok 0
error_status 0
security_level 0
v3_user 0
priv_protocol 0
end node
Output required
ccaita23.contoso.com,on,SYSTEM_PING_UNIX,no,yes,public,public,2.1.52.36,161,router,off,1,0,1,0,0,0,0,0
ccaidi7c.contoso.com,on,SYSTEM_PING_UNIX,no,yes,public,public,1.1.210.76,161,,off,1,0,1,0,0,0,0,0
open FILE1, "File.node" or die;
my %hash;
while (my $line = <FILE1>) {
    chomp($line);
    (my $key, my $value) = split / /, $line;
    $hash{$key} .= $value;
}
my $name  = $hash{'name'};
my $group = $hash{'group'};
my $csv   = "$name,$group\n";
print $csv;
my @fields = qw(
    name on_off group suppress auto_delete read_community write_community
    address port ovpa_type trace snmp_version engineid auth_protocol
    is_key_ok error_status security_level v3_user priv_protocol
);
open my $FILE1, "<", "File.node" or die $!;
local $/ = ""; # paragraph reading mode: reads a whole node at once
while (my $rec = <$FILE1>) {
    # $rec =~ s/(?:begin|end)\s+node//g; # get rid of begin/end
    my %hash = split ' ', $rec; # split node on whitespace and feed into hash, key/value fashion
    # hash slice for @fields keys; map undef to "" so join won't warn under warnings
    print join(",", map { $_ // "" } @hash{@fields}), "\n";
}
}

Split a column then sum and push sum onto array in perl

I have a file that looks like this:
LOCUS POS ALIAS VAR TEST P I DESC
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0
I want to split the last field by "/", then sum those numbers, and push another column on with the sum. For example, I would want the output to look like:
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1 2
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0 1
I have this code, but it doesn't work:
#! perl -w
my $file1 = shift @ARGV;
my $NVAR = 0;
my @vars;
open (IN, $file1) or die "couldn't read file one";
while (<IN>) {
    my @L = split;
    next if ($L[0] =~ m/LOCUS/);
    my @counts = split /\//, $L[7];
    foreach (@counts) {
        $NVAR = ${$_}[0] + ${$_}[1];
    }
    push @vars, [$L[0], $L[1], $L[2], $L[3], $L[4], $L[5], $L[6], $L[7], $NVAR];
}
close IN;
print "LOCUS POS ALIAS NVAR TEST P I DESC SUM\n";
foreach (@vars) {
    print "@{$_}\n";
}
Any help is appreciated.
Always include use strict; and use warnings; at the top of EVERY script.
Limit your variables to the smallest scope possible, as declaring $NVAR outside of the while loop introduced a bug. Your summation can be fixed by the following:
my $NVAR = 0;
foreach (@counts) {
    # $NVAR = ${$_}[0] + ${$_}[1]; <-- this was bad.
    $NVAR += $_;
}
However, this can be solved using a perl oneliner
perl -MList::Util=sum -lane 'push @F, sum split "/", $F[-1]; print "@F"' file.txt
Or if you have a header row:
perl -MList::Util=sum -lane '
    push @F, $. == 1 ? "SUM" : sum split "/", $F[-1];
    print "@F"
' file.txt
Note, you can also utilize List::Util sum in your script as well.

Insert the highest value among the number of times it occurs

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search string X as shown below. Let's call this file as search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
What I have already done: take the first search term (from the search_string file, column X) and check if it occurs in the first column of the reference file. For example, the first search term is NF-E2. I checked whether this string occurs in the first column of the reference file: if it occurs, give a score of 1, else 0. I have also counted the number of times it matches the pattern. My output is of the format:
Keyword Keyword in file? Number of times keyword occurs in file
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Now, in addition to this, what I would like to add is the highest Ncor value for each string in each file. Say for example: while I search for NF-E2 in NF-E2.txt, the Ncor values present are: 3.02, 2.87 and 4.59. Then I want the value 4.59 to be printed in the next column. So now my output should look like:
Keyword Keyword in file? Number of times keyword occurs in file Ncor
NF-E2 1 3 4.59
NRSF 0 0
NFATC1 0 0
NFKB 1 7 1.66
TCF3 0 0
Please note: I need to search each string in different files i.e. The first string (Nf-E2) should be searched in file NF-E2.tab; the second string (NRSF) should be searched in file NRSF.tab and so on.
Here is my code:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
Please help!
This should work:
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    my $keyword = (split /\s+/)[1];
    my $file = "/home/$keyword/list/${keyword}.txt";
    open my $reference, '<', $file or die "Cannot open $file: $!";
    my $key_cnt  = 0;
    my $max_ncor = 0;
    while (my $line = <$reference>) {
        my ($string, undef, $ncor) = split /\s+|=/, $line;
        if ($string =~ $keyword) {
            $key_cnt++;
            $max_ncor = $ncor if ($max_ncor < $ncor);
        }
    }
    print join("\t", $keyword, $key_cnt ? 1 : 0, $key_cnt, $key_cnt ? $max_ncor : ''), "\n";
}
Run it like this:
perl t.pl search_string.txt

grep variables and give informative ouput

I want to see how many times a specific word is mentioned in a file, and in how many of its lines.
My dummy examples looks like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
I am doing this:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
But what I actually want to get is:
1. The word that was used as the variable;
2. In how many lines (in addition to total hits) the word was found.
Preferable output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - variable that was grep'ed
$2 - how many times variable was found in the text
$3 - in how many lines variable was found
I hope someone can help me do this with grep, awk or sed, as they are fast enough for a large data set, but a Perl one-liner would help me too.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it kind of looks nice, but some of the words are longer than 300 characters, so I can't create a file named after the word.
You can use the grep option -o which print only the matched parts of a matching line, with each match on a separate output line.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" $wordcount $linecount
done < words | column -t
You can put it all in one line to make it a one liner.
If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace the 20 with your max word length and the other numbers as well if you need to.
Here is a similar Perl solution; but rather written as a complete script.
#!/usr/bin/perl
use 5.012;

die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;

my $wordsfile = shift @ARGV;

my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map { chomp; length() ? $_ : () } <$words_fh>;
};

my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1 & !!$cnt; # trick to force 1 or 0
    }
}

# sorts output by frequency; remove `sort {...}` to get unsorted output
for my $key (sort { $words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b } keys %words) {
    say join "\t", $key, @{ $words{$key} };
}
Example output:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over bash script: every file is only read once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text
This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
an awk (gawk) one-liner could save you from the grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
format the code a bit:
awk 'NR==FNR{n[$0];l[$0];next;}
{for(w in n){ s=$0;
t=gsub(w,"#",s);
n[w]+=t;l[w]+=t>0?1:0;}
}END{for(x in n)print x,n[x],l[x]}' words text
test with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
if you want to format your output, you could just pipe the awk output to column -t
so it looks like:
yellow 0 0
red 0 0
green 1 1
blue 3 2
awk '
NR==FNR { words[$0]; next }
{
for (word in words) {
count = gsub(word,word)
if (count) {
counts[word] += count
lines[word]++
}
}
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' file
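To sanity-check that final awk program, here it is run on the question's sample words/text data (output piped through sort, since `for (word in words)` iterates in an unspecified order):

```shell
printf 'blue\nred\ngreen\nyellow\n' > words
printf 'TEXTTEXTblueTEXTTEXTblue\nTEXTTEXTgreenblueTEXTTEXT\nTEXTTEXyeowTTEXTTEXTTEXT\n' > text

# gsub(word,word) replaces each match with itself and returns the match count.
awk '
NR==FNR { words[$0]; next }
{
    for (word in words) {
        count = gsub(word,word)
        if (count) {
            counts[word] += count
            lines[word]++
        }
    }
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' words text | sort > wordcounts.txt
cat wordcounts.txt
```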