awk or perl to update/replace pattern with specific value - perl

Trying to use perl or awk to update a specific pattern in a file to a specific text value. The code does execute but does not produce the desired output, rather it seems to repeat $3 instead of update it. Thank you :).
So when 0/0 or 1/1 is found in $3 it is updated/replaced with hom or when 0/1 is found in $3 it is replaced/updated to het.
file space delimited
chr1 115252242 0/0
chr1 247587408 1/1
chr5 35873605 0/1
perl
perl -plae '
BEGIN{ %h = qw(0/0 hom 0/1 het 1/1 hom 1/2 het 2/2 hom) } # define patterns
/^.*([0-2]\/[0-2])/ # define hash values to use with patterns and
$_ .= join "\t", ("", $1, $2, $h{$3})' file # print updated output
current
chr1 115252242 0/0 0/0
chr1 247587408 1/1 1/1
chr5 35873605 0/1 0/1
desired tab-delimited
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 het

Your Perl script seems to be trying to use $1, $2 as if they are awk fields:
/^.*([0-2]\/[0-2])/ and
$_ .= join "\t", ("", $1, $2, $h{$3})' file
However, in Perl, these variables refer to capture groups from a regex match. In your regex, there is only a single capture group. This is clearer, if you modify the join as:
/^.*([0-2]\/[0-2])/ and
$_ .= join "#", ("", ":", $1, "::", $2, ":::", $h{$3})' file
Your output will be:
chr1 115252242 0/0#:#0/0#::##:::#
chr1 247587408 1/1#:#1/1#::##:::#
chr5 35873605 0/1#:#0/1#::##:::#
In addition .= appends to the existing value; it does not replace it.
A modification to your script to fix this is to add an extra capture group:
/^(.*)([0-2]\/[0-2])/
Now you can refer to $1 and $2:
$_ = join("\t", $1, $h{$2});
However, as $1 ends with whitespace already, and you seem to want to end up with TSV, you can extract leading/trailing whitespace by:
/^\s*(.*)([0-2]\/[0-2])\s*$/
and then replace all remaining runs of whitespace with a single tab:
s/\s+/\t/g
Finally, instead of m// (match) followed by join, you can simply use s/// (find and replace). We use && so that the second replacement only happens if the first one did anything:
s/^\s*(.*)([0-2]\/[0-2])\s*$/$1$h{$2}/ && s/\s+/\t/g
The final program is:
perl -plae '
BEGIN{ %h = qw( 0/0 hom 0/1 het 1/1 hom 1/2 het 2/2 hom ) }
s/^\s*(.*)([0-2]\/[0-2])\s*$/$1$h{$2}/ && s/\s+/\t/g
' file
An awk equivalent might be:
awk -v OFS="\t" '
$3 ~ /^[0-9]+\/[0-9]+/$/ {
split($3,n,"/")
$3 = n[1]==n[2] ? "hom" : "het"
}
1
' file
This processes lines with whitespace-delimited columns. It checks if the 3rd column has the appropriate form. If so, it is split into two numbers. If the numbers match, it sets the 3rd column to "hom"; if not, to "het". 1 is a shorter way to write {print}.

Could you please try following.
awk '
BEGIN{
OFS="\t"
}
$NF=="0/0" || $NF=="1/1"{
$NF="hom"
}
$NF=="0/1"{
$NF="het"
}
1
' Input_file
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section of this program here.
OFS="\t" ##Setting OFS as \t(tab) here.
} ##Closing BLOCK for BEGIN section of this command here.
$NF=="0/0" || $NF=="1/1"{ ##Checking condition if last field is either equal to 0/0 OR 1/1 then do following.
$NF="hom" ##Set last field value of as string hom string here.
} ##Closing BLOCK for this above condition here.
$NF=="0/1"{ ##Checking condition if last field value is 0/1 then do following.
$NF="het" ##Setting last field value as het string here.
} ##Closing BLOCK for condition here.
1 ##mentioning 1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.

Following code should give desired result
use strict;
use warnings;
my $file = do { local $/; <DATA> };
$file =~ s#(0/0|1/1)#hom#g;
$file =~ s#0/1#het#g;
$file =~ s# +#\t#g; # replace space separators to tab
print $file;
__DATA__
chr1 115252242 0/0
chr1 247587408 1/1
chr5 35873605 0/1
output
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 het

$ awk 'BEGIN{map["0/0"]=map["1/1"]="hom"; map["0/1"]="net"} $3 in map{$3=map[$3]} 1' file
chr1 115252242 hom
chr1 247587408 hom
chr5 35873605 net

Related

Perl command not giving expected output

Command:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
Input:
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Current Output:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
1234 "012345" "0"
0.000 "01234" "0"
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
0asdf 0we23-1
Excepted Output:
1234 "012345" 0
0.000 "01234" 0
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
"0asdf" "0we23-1"
Conditions to follow in output:
All strings starting with 0 having alphanumeric character except / and . should be double quoted.
if string starting with 0 have only 0 character should not be quoted.
Spacing between strings should be preserved.
This appears to do what you want:
#!/usr/bin/env perl
use strict;
use warnings;
while ( <DATA> ) {
my #fields = split;
s/^(0[^\.\/]+)$/"$1"/ for #fields;
print join " ", #fields, "\n";
}
__DATA__
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Note - it doesn't strictly preserve whitespace like you asked though - it just removes it and reinserts a single space. That seems to meet your spec, but you could instead:
my #fields = split /(\s+)/;
as this would capture the spaces too.
join "", #fields;
This is reducible to a one liner using -a for autosplitting:
perl -lane 's/^(0[^\.\/]+)$/"$1"/ for #F; print join " ", #F'
If you wanted to do the second bit (preserving whitespace strictly) then you'd need to drop the -a and use split yourself.

select specific columns from complex lines

I have a file that contains lines with the following format. I would like to keep only the first column and the column containing the string with the following format NC_XXXX.1
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
.....
....
...
output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Something like this in perl:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my ( $id, $nc ) = m/^([\d\-]+).*(NC_[\d\.]+)/;
print "$id $nc\n";
}
__DATA__
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
Output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Which reduces to a one liner of:
perl -ne 'm/^([\d\-]+).*(NC_[\d\.]+)/ and print "$1 $2\n"' yourfile
Note - this specifically matches a first column made up of number and dash - you could do this with a wider regex match.
awk to the rescue!
$ awk -F' +|\\|' '{for(i=2;i<=NF;i++) if($i ~ /^NC_[0-9.]+$/) {print $1,$i; next}}' file
484-2117 NC_002188.1
787-1087 NC_004068.1
if the space is a tab char, need to add to the delimiter list
$ awk -F' +|\\||\t' ...
With perl:
perl -anE'say "$F[0] ",(split /\|/, $F[2])[3]' file
or awk:
awk -F'\\|| +' '{print $1,$6}' file
Using gnu-awk below could be solution:
awk '{printf "%s %s\n",$1,gensub(/.*(NC_.*\.1).*/,"\\1",1,$0)}' file
Output
484-2117 NC_002188.1
787-1087 NC_004068.1
A more restrictive version would be
awk '{printf "%s %s\n",$1,gensub(/.*(NC_[[:digit:]]*\.1).*/,"\\1",1,$0)}' file
awk -F'[ |]' '{print $1,$10}' file
484-2117 NC_002188.1
787-1087 NC_004068.1

Compare fields of two VCF files

I would like to ask you for some help for an apparently easy script I am trying to work on.
Basically I would like to compare each fields of two tab delimited files.
if the second field of the files match --> compare all the rest of the fields of the line.
In the case the field of the first file is "NA" print the field of the second file for the same locations.
Now I have wrote this small script but one of the problem I am having is:
1- how to keep the first field of the first 9 fields from the first file
2- how to tell Perl to print out the line with the changed field from the second file.
Here is an example if I was not clear:
File 1:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1
File2:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Desidered tab delimited output:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Thank you in advance for any comment and help!
use strict;
use warnings;
my $frameshift_file = <>;
my $monomorphic_file = <>;
my #split_file1 = split "\t", $frameshift_file; #splits the file on tabs
my #split_file2 = split "\t", $monomorphic_file; #splits line on tab delimeted fields
if ($split_file1[1] eq $split_file2[1] {
for (my $i=0; $i<scalar(#split_file1); $i++) {
if ($split_file1[$i] eq "NA") {
print $split_file2[$i],"\t";
} else { print $split_file1[$i],"\t";
}
}
}
Try something like this.. (replace "\s+" with "\t" to split only on tabs).
use strict;
use warnings;
my (#split_file1, #split_file2, $frameshift_file, $monomorphic_file, $x);
$frameshift_file = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1";
$monomorphic_file = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1";
(#split_file1) = split('\s+', $frameshift_file); #splits the file on tabs
(#split_file2) = split('\s+', $monomorphic_file); #splits line on tab delimeted fields
if ("$split_file1[1]" eq "$split_file2[1]"){ # 2nd field of files match
for($x = 2; $x <= $#split_file1; $x++){
if ($split_file1[$x] eq "NA"){ # If file1 shows "NA", print file2 equivalent array element.
print "split_file1[$x] = \"NA\" .. split_file2[$x] = $split_file2[$x]\n";
}
}
}

Split a column then sum and push sum onto array in perl

I have a file that looks like this:
LOCUS POS ALIAS VAR TEST P I DESC
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0
I want to split the last field by "/", then sum those numbers, and push another column on with the sum. For example, I would want the output to look like:
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1 2
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0 1
I have this code, but it doesn't work:
#! perl -w
my $file1 = shift#ARGV;
my $NVAR=0;
my #vars;
open (IN, $file1) or die "couldn't read file one";
while(<IN>){
my#L=split;
next if ($L[0] =~ m/LOCUS/);
my#counts=split /\//, $L[7];
foreach (#counts){
$NVAR=${$_}[0] + ${$_}[1];
}
push #vars,[$L[0],$L[1],$L[2],$L[3],$L[4],$L[5],$L[6],$L[7],$NVAR];
}
close IN;
print "LOCUS POS ALIAS NVAR TEST P I DESC SUM\n";
foreach(#vars){
print "#{$_}\n";
}
Any help is appreciated.
Always include use strict; and use warnings; at the top of EVERY script.
Limit your variables to the smallest scope possible, as declaring $NVAR outside of the while loop introduced a bug. Your summation can be fixed by the following:
my $NVAR = 0;
foreach (#counts){
#$NVAR=${$_}[0] + ${$_}[1]; <-- this was bad.
$NVAR += $_;
}
However, this can be solved using a perl oneliner
perl -MList::Util=sum -lane 'push #F, sum split "/", $F[-1]; print "#F"' file.txt
Or if you have a header row:
perl -MList::Util=sum -lane '
push #F, $. == 1 ? "SUM" : sum split "/", $F[-1];
print "#F"
' file.txt
Note, you can also utilize List::Util sum in your script as well.

Adding columns to a file based on existing columns

I am trying to modify a file which is set up like this:
chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A
I want to modify it such that:
if column "ref" is a string >1 (i.e line 2) then I generate 2 new columns where:
first new column = start coordinate-1
second new column = start coordinate+(length of string in ref)+1
therefore, for line 2 output would look like:
chr1 135419 TATACA T 135418 135426
or:
if length of string in "ref" = 1 and column "alt"=string of length>1 (i.e. line 1) then
first new column = start coordinate
second new column = start coordinate+2
so, output for line 1 would be:
chr1 18884 C CAAAA 18884 18886
I have tried to this in awk but without success
My perl is non-existent but would this be the best way? Or maybe in R?
Perl solution. Note that your specification does not mention what to do if both strings are length 1.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);
#use Data::Dumper;
<DATA>; # Skip the header;
while (<DATA>) {
my ($chr, $start, $ref, $alt) = split;
my #cols;
if (1 < length $ref) {
#cols = ( $start - 1, $start + 1 + length $ref);
} elsif (1 < length $alt) {
#cols = ($start, $start + 2);
} else {
warn "Don't know what to do at $.\n";
}
say join "\t", $chr, $start, $ref, $alt, #cols;
}
__DATA__
chr start ref alt
chr1 18884 C CAAAA
chr1 135419 TATACA T
chr1 332045 T TTG
chr1 453838 T TAC
chr1 567652 T TG
chr1 602541 TTTA T
chr1 614937 C CTCTCTG
chr1 654889 C CA
chr1 736800 AC A
Here's one way using awk. Run like:
awk -f script.awk file | column -t
Contents of script.awk:
NR==1 {
next
}
length($3)>1 && length($4)==1 {
print $0, $2-1, $2+length($3)+1
next
}
length($3)==1 && length($4)>1 {
print $0, $2, $2+2
next
}1
Results:
chr1 18884 C CAAAA 18884 18886
chr1 135419 TATACA T 135418 135426
chr1 332045 T TTG 332045 332047
chr1 453838 T TAC 453838 453840
chr1 567652 T TG 567652 567654
chr1 602541 TTTA T 602540 602546
chr1 614937 C CTCTCTG 614937 614939
chr1 654889 C CA 654889 654891
chr1 736800 AC A 736799 736803
Alternatively, here's the one-liner:
awk 'NR==1 { next } length($3)>1 && length($4)==1 { print $0, $2-1, $2+length($3)+1; next } length($3)==1 && length($4)>1 { print $0, $2, $2+2; next }1' filem | column -t
The code should be pretty self-explanatory. The 1 on the end of the script simply enables default printing (i.e. '1' returns true) of each line. HTH.
Doing it in perl is trivial (but so is in awk):
#!/usr/bin/perl
while (<>) {
chmop;
my ($chr,$start,$ref,$alt)=split(/\s+/,$_);
if (len($ref) > 1) {
print STDOUT
"$chr\t$start\t$ref\t$alt\t",
$start+len($ref)+1,"\n";
} elsif (len($ref)==1) {
print STDOUT
"$chr\t$start\t$ref\t$alt\t",
$start+2,"\n";
} else {
print STDERR "ERROR: ???\n"; #actually impossible
}
}
Stick it in a file morecols.pl , chmod +x morecols.pl, run more morecols.pl . (Beware, lots of assumptions in this code/instructions). I have a feeling your actual problem is more with programming/text processing then tools or languages. If so, this code is just a stopgap solution....
Cheers.