How to implement this in awk or shell?

Input File1:
5 5 NA
NA NA 1
2 NA 2
Input File2:
1 1 1
2 NA 2
3 NA NA
NA 4 4
5 5 5
NA NA 6
Output:
3 NA NA
NA 4 4
NA NA 6
The purpose is: collect every field of each line in file1 that is not NA into a set, then eliminate from file2 every line that contains a field belonging to that set. Does anyone have ideas about this?

Collecting every item that is not 'NA' from file1, then filtering file2:
awk -f script.awk file1 file2
Contents of script.awk:
FNR==NR {
    for (i=1; i<=NF; i++) {
        if ($i != "NA") {
            a[$i]++
        }
    }
    next
}
{
    for (j=1; j<=NF; j++) {
        if ($j in a) {
            next
        }
    }
}
1
Results:
3 NA NA
NA 4 4
NA NA 6
Alternatively, here's the one-liner:
awk 'FNR==NR { for (i=1;i<=NF;i++) if ($i != "NA") a[$i]++; next } { for (j=1;j<=NF;j++) if ($j in a) next }1' file1 file2
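As a sanity check, here is the same one-liner run end to end on the sample data, with the inputs recreated under a temporary directory (the directory and file names are chosen only for this demo):

```shell
# Recreate the sample inputs in a throwaway directory (demo-only paths)
dir=$(mktemp -d)
printf '5 5 NA\nNA NA 1\n2 NA 2\n' > "$dir/file1"
printf '1 1 1\n2 NA 2\n3 NA NA\nNA 4 4\n5 5 5\nNA NA 6\n' > "$dir/file2"

# First pass (FNR==NR) collects every non-NA field of file1 into set a;
# second pass skips any file2 line containing a member of that set.
out=$(awk 'FNR==NR { for (i=1;i<=NF;i++) if ($i != "NA") a[$i]++; next }
           { for (j=1;j<=NF;j++) if ($j in a) next }1' "$dir/file1" "$dir/file2")
printf '%s\n' "$out"
```

It prints the three file2 lines that share no non-NA value with file1.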

You could do this with grep:
$ egrep -o '[0-9]+' file1 | fgrep -wvf - file2
3 NA NA
NA 4 4
NA NA 6
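Note that this relies on every non-NA value being numeric, since the `[0-9]+` pattern is what harvests the set from file1. A self-contained rerun using the modern `grep -E`/`grep -F` spellings (egrep and fgrep are deprecated aliases) and demo-only temporary paths:

```shell
dir=$(mktemp -d)
printf '5 5 NA\nNA NA 1\n2 NA 2\n' > "$dir/file1"
printf '1 1 1\n2 NA 2\n3 NA NA\nNA 4 4\n5 5 5\nNA NA 6\n' > "$dir/file2"

# -Eo emits each number from file1 on its own line; then -F takes those as
# fixed-string patterns from the pipe (-f -), -w restricts matches to whole
# words, and -v inverts the match to keep only the non-matching lines.
out=$(grep -Eo '[0-9]+' "$dir/file1" | grep -Fwvf - "$dir/file2")
printf '%s\n' "$out"
```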

awk one-liner:
awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA")a[$i];next}{for(i=1;i<=NF;i++)if($i in a)next}1' file1 file2
with your data:
kent$ awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA")a[$i];next}{for(i=1;i<=NF;i++)if($i in a)next}1' file1 file2
3 NA NA
NA 4 4
NA NA 6

If the column position of the values matters:
awk '
NR==FNR {
    for (i=1; i<=NF; i++) if ($i != "NA") A[i,$i] = 1
    next
}
{
    for (i=1; i<=NF; i++) if ($i != "NA" && A[i,$i]) next
    print
}
' file1 file2

What does this code do in perl

The input to perl is like this:
ID NALT NMIN NHET NVAR SING TITV QUAL DP G|DP NRG|DP
PT-1RTW 1 1 1 4573 1 NA 919.41 376548 23.469 58
PT-1RTX 0 0 0 4566 0 NA NA NA 34.5866 NA
PT-1RTY 1 1 1 4592 1 NA 195.49 189549 24.0416 18
PT-1RTZ 0 0 0 4616 0 NA NA NA 44.1474 NA
PT-1RU1 0 0 0 4609 0 NA NA NA 28.2893 NA
PT-1RU2 2 2 2 4568 2 0 575.41 330262 28.2108 49
PT-1RU3 0 0 0 4617 0 NA NA NA 35.9204 NA
PT-1RU4 0 0 0 4615 0 NA NA NA 30.5878 NA
PT-1RU5 0 0 0 4591 0 NA NA NA 26.2729 NA
This is the code:
perl -pe 'if($.==1){@L=split;foreach(@L){$_="SING.$_";}$_="@L\n"}'
I sort of guessed that it processes the first line, prefixing SING. to each word, but what does the last part, $_="@L\n", do? Without it, the code doesn't work.
The -p command line switch makes perl process its input (or the files listed on the command line) line by line and print each processed line. The current line's content is stored in the $_ variable. $_="@L\n" assigns a new value to $_ before it is printed.
Shorter version: perl -pe 'if($.==1){s/(^| )/$1SING./g}'
Deparsed (more readable) one-liner above:
perl -MO=Deparse -pe 'if($.==1){@L=split;foreach(@L){$_="SING.$_";}$_="@L\n"}'
LINE: while (defined($_ = readline ARGV)) {
    if ($. == 1) {
        @L = split(' ', $_, 0);
        foreach $_ (@L) {
            $_ = "SING.$_";
        }
        $_ = "@L\n";
    }
}
continue {
    die "-p destination: $!\n" unless print $_;
}
The last line combines the modified words back into a full line (interpolating the array @L in a double-quoted string joins its elements with spaces) and assigns it to $_, which is what gets printed after each line when -p is used. (You might have inferred this from the perlrun manual section on -p.)

Merge files based on a mapping in another file

I have written a script in Perl that merges files based on a mapping in a third file; I am not using join because the lines won't always match. The code works, but gives a warning that doesn't appear to affect the output: Use of uninitialized value in join or string at join.pl line 43, <$fh> line 21. As I am relatively new to Perl, I have been unable to work out what is causing it. Any help resolving this warning, or advice about my code, would be greatly appreciated. I have provided example input and output below.
join.pl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Tie::File;
use Scalar::Util qw(looks_like_number);

chomp( my $infile  = $ARGV[0] );
chomp( my $infile1 = $ARGV[1] );
chomp( my $infile2 = $ARGV[2] );
chomp( my $outfile = $ARGV[3] );

open my $mapfile, '<', $infile  or die "Could not open $infile: $!";
open my $file1,   '<', $infile1 or die "Could not open $infile1: $!";
open my $file2,   '<', $infile2 or die "Could not open $infile2: $!";
tie my @tieFile1, 'Tie::File', $infile1 or die "Could not open $infile1: $!";
tie my @tieFile2, 'Tie::File', $infile2 or die "Could not open $infile2: $!";
open my $output, '>', $outfile or die "Could not open $outfile: $!";

my %map1;
my %map2;
# This loop reads both input files and populates two hashes
# keyed on the coordinates (field 2), with the current line number as value
while ( my $line1 = <$file1>, my $line2 = <$file2> ) {
    my @row1 = split( "\t", $line1 );
    my @row2 = split( "\t", $line2 );
    # $. holds the line number
    $map1{$row1[1]} = $.;
    $map2{$row2[1]} = $.;
}
close($file1);
close($file2);

while ( my $line = <$mapfile> ) {
    chomp $line;
    my @row = split( "\t", $line );
    my $species1   = $row[1];
    my $reference1 = $map1{$species1};
    my $species2   = $row[3];
    my $reference2 = $map2{$species2};
    my @nomatch = ("NA", "", "NA", "", "", "", "", "NA", "NA");
    # test numeric
    if ( looks_like_number($reference1) && looks_like_number($reference2) ) {
        # do the do using the maps
        print $output join("\t", $tieFile1[$reference1], $tieFile2[$reference2]), "\n";
    }
    elsif ( looks_like_number($reference1) ) {
        print $output join("\t", $tieFile1[$reference1], @nomatch), "\n";
    }
    elsif ( looks_like_number($reference2) ) {
        print $output join("\t", @nomatch, $tieFile2[$reference2]), "\n";
    }
}
close($output);
untie @tieFile1;
untie @tieFile2;
input_1:
Scf_3L 12798910 T 0 41 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA
input_2:
3L 12559896 T 0 31 0 0 NA NA
3L 12559897 C 0 0 33 0 NA NA
3L 12559898 A 34 0 0 0 NA NA
3L 12559899 G 0 0 0 33 NA NA
3L 12559900 T 0 34 0 0 NA NA
3L 12559901 G 0 0 0 33 NA NA
3L 12559902 T 0 33 0 0 NA NA
3L 12559903 A 33 0 0 0 NA NA
3L 12559904 G 0 0 0 33 NA NA
3L 12559905 T 0 34 0 0 NA NA
3L 12559906 T 0 33 0 0 NA NA
map:
3L 12798910 T 12559896 T
3L 12798911 C 12559897 C
3L 12798912 A 12559898 A
3L 12798913 G 12559899 G
3L 12798914 T 12559900 T
3L 12798915 G 12559901 G
3L 12798916 T 12559902 T
3L 12798917 A 12559903 A
3L 12798918 G 12559904 G
3L 12798919 T 12559905 T
3L 12798920 T 12559906 T
output:
Scf_3L 12798910 T 0 41 0 0 NA NA 3L 12559896 T 0 31 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA 3L 12559897 C 0 0 33 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA 3L 12559898 A 34 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA 3L 12559899 G 0 0 0 33 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA 3L 12559900 T 0 34 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA 3L 12559901 G 0 0 0 33 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA 3L 12559902 T 0 33 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA 3L 12559903 A 33 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA 3L 12559904 G 0 0 0 33 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA 3L 12559905 T 0 34 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA 3L 12559906 T 0 33 0 0 NA NA
The immediate problem is that the indices of the tied arrays start at zero, while the line numbers in $. start at 1, so you need to subtract one from $. (or from the $reference variables) before using them as indices. That means your resulting data was never correct in the first place, and you might never have noticed had it not been for the warning!
I fixed that and also tidied up your code a little. I mostly added use autodie so that there's no need to check the status of I/O operations (except for Tie::File), changed to list assignments, and wrapped the work in a bare block so that the lexical file handles are closed automatically.
I also used the tied arrays to build the %map hashes instead of opening the files separately, which means their values are already zero-based, as they need to be.
Oh, and I removed looks_like_number, because the $reference variables must be either numeric or undef; that's all we ever put into the hashes. The correct way to check that a value isn't undef is with the defined operator.
#!/usr/bin/perl
use strict;
use warnings 'all';
use autodie;
use Fcntl 'O_RDONLY';
use Tie::File;

my ( $mapfile, $infile1, $infile2, $outfile ) = @ARGV;

{
    tie my @file1, 'Tie::File' => $infile1, mode => O_RDONLY
        or die "Could not open $infile1: $!";
    tie my @file2, 'Tie::File' => $infile2, mode => O_RDONLY
        or die "Could not open $infile2: $!";

    my %map1 = map { (split /\t/, $file1[$_], 3)[1] => $_ } 0 .. $#file1;
    my %map2 = map { (split /\t/, $file2[$_], 3)[1] => $_ } 0 .. $#file2;

    open my $map_fh, '<', $mapfile;
    open my $out_fh, '>', $outfile;

    while ( <$map_fh> ) {
        chomp;
        my @row = split /\t/;
        my ( $species1, $species2 ) = @row[1,3];
        my $reference1 = $map1{$species1};
        my $reference2 = $map2{$species2};
        my @nomatch = ( "NA", "", "NA", "", "", "", "", "NA", "NA" );
        my @fields = (
            ( defined $reference1 ? $file1[$reference1] : @nomatch ),
            ( defined $reference2 ? $file2[$reference2] : @nomatch ),
        );
        print $out_fh join( "\t", @fields ), "\n";
    }
}
output
Scf_3L 12798910 T 0 41 0 0 NA NA NA NA NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA NA NA NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA NA NA NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA NA NA NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA NA NA NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA NA NA NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA NA NA NA NA

I need to print columns from two files with different numbers of rows into one file

File1.txt
123 321 231
234 432 342
345 543 453
file2.txt
abc bca cba
def efd fed
ghi hig ihg
jkl klj lkj
mno nom onm
pqr qrp rqp
I want output file like
Outfile.txt
123 321 231 abc bca cba
234 432 342 def efd fed
345 543 453 ghi hig ihg
jkl klj lkj
mno nom onm
pqr qrp rqp
Most simply:
sed 's/$/ /' file1 | paste -d '' - file2
This appends spaces to the end of lines in file1 and pastes the output of that together with file2 without a delimiter.
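Run end to end on the sample data (inputs recreated under a temporary directory for the demo; note that the empty delimiter -d '' works with GNU paste but is left unspecified by POSIX):

```shell
dir=$(mktemp -d)
printf '123 321 231\n234 432 342\n345 543 453\n' > "$dir/file1"
printf 'abc bca cba\ndef efd fed\nghi hig ihg\njkl klj lkj\nmno nom onm\npqr qrp rqp\n' \
  > "$dir/file2"

# sed appends a space to every file1 line; paste -d '' then glues the two
# files line by line with no extra delimiter, so once file1 runs out, the
# remaining file2 lines come through untouched.
out=$(sed 's/$/ /' "$dir/file1" | paste -d '' - "$dir/file2")
printf '%s\n' "$out"
```

The first three output lines are glued pairs; the last three are file2's extras, passed through unchanged.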
Alternatively, if you know that file2 is longer than file1,
awk 'NR == FNR { line1[NR] = $0 " "; next } { print line1[FNR] $0 }' file1 file2
or if you don't know it,
awk 'NR == FNR { n = NR; line1[n] = $0 " "; next } { print line1[FNR] $0 } END { for(i = FNR + 1; i <= n; ++i) print line1[i]; }' file1 file2
also works.

bash merge files by matching columns

I do have two files:
File1
12 abc
34 cde
42 dfg
11 df
9 e
File2
23 abc
24 gjr
12 dfg
8 df
I want to merge the files column by column (matching on column 2), for output like this:
File1 File2
12 23 abc
42 12 dfg
11 8 df
34 NA cde
9 NA e
NA 24 gjr
How can I do this?
I tried it like this:
cat File* >> tmp; sort tmp | uniq -c | awk '{print $2}' > column2
for i in $(cat column2); do grep -w "$i" File*
But this is where I am stuck...
I don't know how, after grepping, I should combine the files column by column and write NA where a value is missing.
Hope someone could help me with this.
Since I was testing with bash 3.2 running as sh (which does not support process substitution when invoked as sh), I used two temporary files to get the data ready for use with join:
$ sort -k2b File2 > f2.sort
$ sort -k2b File1 > f1.sort
$ cat f1.sort
12 abc
34 cde
11 df
42 dfg
9 e
$ cat f2.sort
23 abc
8 df
12 dfg
24 gjr
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort
12 23 abc
34 NA cde
11 8 df
42 12 dfg
9 NA e
NA 24 gjr
$
With process substitution, you could write:
join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA <(sort -k2b File1) <(sort -k2b File2)
If you want the data formatted differently, use awk to post-process the output:
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort |
> awk '{ printf "%-5s %-5s %s\n", $1, $2, $3 }'
12 23 abc
34 NA cde
11 8 df
42 12 dfg
9 NA e
NA 24 gjr
$

Read and parse multiple text files

Can anyone suggest a simple way of achieving this? I have several files with the extension .vcf; I will give an example with two files.
In the files below, we are interested in the 4th (Ref) and 6th (Alt) columns.
File 1:
38 107 C 3 T 6 C/T
38 241 C 4 T 5 C/T
38 247 T 4 C 5 T/C
38 259 T 3 C 6 T/C
38 275 G 3 A 5 G/A
38 304 C 4 T 5 C/T
38 323 T 3 A 5 T/A
File2:
38 107 C 8 T 8 C/T
38 222 - 6 A 7 -/A
38 241 C 7 T 10 C/T
38 247 T 7 C 10 T/C
38 259 T 7 C 10 T/C
38 275 G 6 A 11 G/A
38 304 C 5 T 12 C/T
38 323 T 4 A 12 T/A
38 343 G 13 A 5 G/A
Index file :
107
222
241
247
259
275
304
323
343
The index file is created from the unique positions in file 1 and file 2; I have it ready. Now I need to read all the files, parse the data according to the positions in the index, and write it out in columns.
From the above files, we are interested in the 4th (Ref) and 6th (Alt) columns.
Another challenge is to name the headers accordingly. So the output should be something like this.
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343 13 5
You can do this using the join command:
# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged
# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged
# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report
Output:
$ cat report
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
323 4 12 4 12
343 13 5 13 5
Update:
Here is a script which can be used to combine multiple files.
The following assumptions have been made:
The index file is sorted
The vcf files are sorted on their second column
There are no spaces (or any other special characters) in filenames
Save the following script to a file e.g. report.sh and run it without any arguments from the directory containing your files.
#!/bin/bash
INDEX_FILE=index    # the name of the file containing the index data
REPORT_FILE=report  # the file to write the report to
TMP_FILE=$(mktemp)  # a temporary file
header="Position"   # the report header
num_processed=0     # the number of files processed so far

# loop over all files beginning with "file".
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
    echo "Processing $file"
    if [[ $num_processed -eq 0 ]]
    then
        # it's the first file so use the INDEX file in the join
        join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    else
        # work out the output fields
        for ((outputFields="0", j=2; j < $((2 + num_processed * 2)); j++))
        do
            outputFields="$outputFields,1.$j"
        done
        outputFields="$outputFields,2.4,2.6"
        # join this file with the current report
        join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    fi
    ((num_processed++))
    header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
    mv "$TMP_FILE" "$REPORT_FILE"
done

# add the header to the report
echo "$header" | cat - "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
This Perl solution will handle one or more (e.g. 50) files.
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;

my $path = '.';
my @file = qw/ o33.txt o44.txt /;
my @position = slurp('index.txt') =~ /\d+/g;
my %data;

for my $filename (@file) {
    open my $fh, '<', "$path/$filename" or die "Can't open $filename $!";
    while (<$fh>) {
        my ($pos, $ref, $alt) = (split)[1, 3, 5];
        $data{$pos}{$filename} = [$ref, $alt];
    }
    close $fh or die "Can't close $filename $!";
}

my @head;
for my $file (@file) {
    push @head, "${file}_Ref", "${file}_Alt";
}

my $tb = Text::Table->new( map {title => $_}, "Position", @head );

for my $pos (@position) {
    $tb->load( [
        $pos,
        map $data{$pos}{$_} ? @{ $data{$pos}{$_} } : ('', ''), @file
    ] );
}
print $tb;