Merge files based on a mapping in another file - perl

I have written a script in Perl that merges files based on a mapping in a third file; the reason I am not using join is because lines won't always match. The code works, but gives an error that doesn't appear to affect output: Use of uninitialized value in join or string at join.pl line 43, <$fh> line 21. As I am relatively new to Perl I have been unable to understand what is causing this error. Any help resolving this error or advice about my code would be greatly appreciated. I have provided example input and output below.
join.pl
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Tie::File;
use Scalar::Util qw(looks_like_number);

chomp( my $infile  = $ARGV[0] );
chomp( my $infile1 = $ARGV[1] );
chomp( my $infile2 = $ARGV[2] );
chomp( my $outfile = $ARGV[3] );

open my $mapfile, '<', $infile  or die "Could not open $infile: $!";
open my $file1,   '<', $infile1 or die "Could not open $infile1: $!";
open my $file2,   '<', $infile2 or die "Could not open $infile2: $!";
tie my @tieFile1, 'Tie::File', $infile1 or die "Could not open $infile1: $!";
tie my @tieFile2, 'Tie::File', $infile2 or die "Could not open $infile2: $!";
open my $output, '>', $outfile or die "Could not open $outfile: $!";

my %map1;
my %map2;

# This loop will read two input files and populate two hashes
# using the coordinates (field 2) and the current line number
while ( my $line1 = <$file1>, my $line2 = <$file2> ) {
    my @row1 = split( "\t", $line1 );
    my @row2 = split( "\t", $line2 );
    # $. holds the line number
    $map1{$row1[1]} = $.;
    $map2{$row2[1]} = $.;
}
close($file1);
close($file2);

while ( my $line = <$mapfile> ) {
    chomp $line;
    my @row = split( "\t", $line );
    my $species1   = $row[1];
    my $reference1 = $map1{$species1};
    my $species2   = $row[3];
    my $reference2 = $map2{$species2};
    my @nomatch = ("NA", "", "NA", "", "", "", "", "NA", "NA");
    # test numeric
    if ( looks_like_number($reference1) && looks_like_number($reference2) ) {
        # do the do using the maps
        print $output join("\t", $tieFile1[$reference1], $tieFile2[$reference2]), "\n";
    }
    elsif ( looks_like_number($reference1) ) {
        print $output join("\t", $tieFile1[$reference1], @nomatch), "\n";
    }
    elsif ( looks_like_number($reference2) ) {
        print $output join("\t", @nomatch, $tieFile2[$reference2]), "\n";
    }
}
close($output);

untie @tieFile1;
untie @tieFile2;
input_1:
Scf_3L 12798910 T 0 41 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA
input_2:
3L 12559896 T 0 31 0 0 NA NA
3L 12559897 C 0 0 33 0 NA NA
3L 12559898 A 34 0 0 0 NA NA
3L 12559899 G 0 0 0 33 NA NA
3L 12559900 T 0 34 0 0 NA NA
3L 12559901 G 0 0 0 33 NA NA
3L 12559902 T 0 33 0 0 NA NA
3L 12559903 A 33 0 0 0 NA NA
3L 12559904 G 0 0 0 33 NA NA
3L 12559905 T 0 34 0 0 NA NA
3L 12559906 T 0 33 0 0 NA NA
map:
3L 12798910 T 12559896 T
3L 12798911 C 12559897 C
3L 12798912 A 12559898 A
3L 12798913 G 12559899 G
3L 12798914 T 12559900 T
3L 12798915 G 12559901 G
3L 12798916 T 12559902 T
3L 12798917 A 12559903 A
3L 12798918 G 12559904 G
3L 12798919 T 12559905 T
3L 12798920 T 12559906 T
output:
Scf_3L 12798910 T 0 41 0 0 NA NA 3L 12559896 T 0 31 0 0 NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA 3L 12559897 C 0 0 33 0 NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA 3L 12559898 A 34 0 0 0 NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA 3L 12559899 G 0 0 0 33 NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA 3L 12559900 T 0 34 0 0 NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA 3L 12559901 G 0 0 0 33 NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA 3L 12559902 T 0 33 0 0 NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA 3L 12559903 A 33 0 0 0 NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA 3L 12559904 G 0 0 0 33 NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA 3L 12559905 T 0 34 0 0 NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA 3L 12559906 T 0 33 0 0 NA NA

The immediate problem is that the indices of the tied arrays start at zero, while the line numbers in $. start at 1. That means you need to subtract one from $. or from the $reference variables before using them as indices. So your resulting data was never correct in the first place, and you might never have noticed if it weren't for the warning!
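The smallest change that fixes this is to store zero-based values in the hashes (or, equivalently, to subtract one at lookup time), for example:
    # store zero-based line numbers so they can be used directly as Tie::File indices
    $map1{$row1[1]} = $. - 1;
    $map2{$row2[1]} = $. - 1;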
I fixed that and also tidied up your code a little. I mostly added use autodie so that there's no need to check the status of I/O operations (except for Tie::File), changed to a list assignment for the command-line arguments, and wrapped the work in a bare block so that the lexical file handles are closed automatically.
I also used the tied arrays to build the %map hashes instead of opening the files separately, which means their values are already zero-based, as they need to be.
Oh, and I removed looks_like_number: the $reference variables can only ever be numeric or undef, because that's all we put into the hashes. The correct way to check that a value isn't undef is with the defined operator.
#!/usr/bin/perl
use strict;
use warnings 'all';
use autodie;
use Fcntl 'O_RDONLY';
use Tie::File;

my ( $mapfile, $infile1, $infile2, $outfile ) = @ARGV;

{
    tie my @file1, 'Tie::File' => $infile1, mode => O_RDONLY
            or die "Could not open $infile1: $!";
    tie my @file2, 'Tie::File' => $infile2, mode => O_RDONLY
            or die "Could not open $infile2: $!";

    # Build the lookup hashes straight from the tied arrays:
    # coordinate (field 2) => zero-based line index
    my %map1 = map { (split /\t/, $file1[$_], 3)[1] => $_ } 0 .. $#file1;
    my %map2 = map { (split /\t/, $file2[$_], 3)[1] => $_ } 0 .. $#file2;

    open my $map_fh, '<', $mapfile;
    open my $out_fh, '>', $outfile;

    while ( <$map_fh> ) {
        chomp;
        my @row = split /\t/;
        my ( $species1, $species2 ) = @row[1,3];
        my $reference1 = $map1{$species1};
        my $reference2 = $map2{$species2};
        my @nomatch = ( "NA", "", "NA", "", "", "", "", "NA", "NA" );
        my @fields = (
            ( defined $reference1 ? $file1[$reference1] : @nomatch ),
            ( defined $reference2 ? $file2[$reference2] : @nomatch ),
        );
        print $out_fh join( "\t", @fields ), "\n";
    }
}
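Run it with the map file first, then the two data files, then the output file name, e.g. (assuming the script is saved as join.pl and using the file names from the question):
    perl join.pl map input_1 input_2 output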
output
Scf_3L 12798910 T 0 41 0 0 NA NA NA NA NA NA
Scf_3L 12798911 C 0 0 43 0 NA NA NA NA NA NA
Scf_3L 12798912 A 42 0 0 0 NA NA NA NA NA NA
Scf_3L 12798913 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798914 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798915 G 0 0 0 44 NA NA NA NA NA NA
Scf_3L 12798916 T 0 42 0 0 NA NA NA NA NA NA
Scf_3L 12798917 A 41 0 0 0 NA NA NA NA NA NA
Scf_3L 12798918 G 0 0 0 43 NA NA NA NA NA NA
Scf_3L 12798919 T 0 43 0 0 NA NA NA NA NA NA
Scf_3L 12798920 T 0 41 0 0 NA NA NA NA NA NA

Related

What does this code do in perl

The input to perl is like this:
ID NALT NMIN NHET NVAR SING TITV QUAL DP G|DP NRG|DP
PT-1RTW 1 1 1 4573 1 NA 919.41 376548 23.469 58
PT-1RTX 0 0 0 4566 0 NA NA NA 34.5866 NA
PT-1RTY 1 1 1 4592 1 NA 195.49 189549 24.0416 18
PT-1RTZ 0 0 0 4616 0 NA NA NA 44.1474 NA
PT-1RU1 0 0 0 4609 0 NA NA NA 28.2893 NA
PT-1RU2 2 2 2 4568 2 0 575.41 330262 28.2108 49
PT-1RU3 0 0 0 4617 0 NA NA NA 35.9204 NA
PT-1RU4 0 0 0 4615 0 NA NA NA 30.5878 NA
PT-1RU5 0 0 0 4591 0 NA NA NA 26.2729 NA
This is the code:
perl -pe 'if($.==1){@L=split;foreach(@L){$_="SING.$_";}$_="@L\n"}'
I sort of guessed that it processes the first line to add SING. to each word, but what does the last part, $_="@L\n", do? Without it the code doesn't work.
The -p command-line switch makes perl process its input (or the files listed on the command line) line by line and print each processed line. The content of the current line is stored in the $_ variable, so $_="@L\n" assigns the new value to $_ before it is printed.
Shorter version: perl -pe 'if($.==1){s/(^| )/$1SING./g}'
Deparsed (more readable) one-liner above:
perl -MO=Deparse -pe 'if($.==1){@L=split;foreach(@L){$_="SING.$_";}$_="@L\n"}'
LINE: while (defined($_ = readline ARGV)) {
    if ($. == 1) {
        @L = split(' ', $_, 0);
        foreach $_ (@L) {
            $_ = "SING.$_";
        }
        $_ = "@L\n";
    }
}
continue {
    die "-p destination: $!\n" unless print $_;
}
The last line combines the modified words into a full line and assigns it to $_, which is what will be printed after processing each line when -p is used. (You might have inferred this from the perlrun manual section on -p.)
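For example, applied to the sample input, the header becomes the line below while the data lines pass through unchanged (input.txt is a placeholder file name):
    $ perl -pe 'if($.==1){@L=split;foreach(@L){$_="SING.$_";}$_="@L\n"}' input.txt | head -1
    SING.ID SING.NALT SING.NMIN SING.NHET SING.NVAR SING.SING SING.TITV SING.QUAL SING.DP SING.G|DP SING.NRG|DP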

I want output with 85 characters on each line; could you please explain how I should use print to achieve this?

I used a command to get a specific format, and its output is all on one line:
MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91
I want output with 85 characters on each line; could someone explain how I should use print here?
You can use a regular expression with a quantifier:
$_ = 'MASH P 0 3.64 NAMD P 0 3.79 AGHA P 0 4.50 SARG P 0 4.71 BENG P 0 5.47 BANR P 0 6.75 ABZA P 0 6.25 KALI P 0 6.91';
print $&, "\n" while /.{1,85}/g;
or, if it's a part of a larger program and you don't want to suffer the performance penalty, use ${^MATCH} instead of $&:
use Syntax::Construct qw{ /p };
print ${^MATCH}, "\n" while /.{1,85}/gp;
You can also use the four-argument substr:
print substr($_, 0, 85, q()), "\n" while $_;
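If the long line comes straight from another command, the same idea works as a filter in a pipeline (a sketch; your_command is a placeholder for whatever produces the line):
    your_command | perl -ne 'print "$1\n" while /(.{1,85})/g'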

delete specific columns when other column has specific value (perl or awk)

I have a file with 16 different columns (tab-separated values):
22 51169729 G 39 A 0 0 C 0 0 G 38 0.974359 T 1 0.025641
22 51169730 A 36 A 36 1 C 0 0 G 0 0 T 0 0
22 51169731 C 39 A 0 0 C 39 1 G 0 0 T 0 0
22 51169732 G 37 A 0 0 C 0 0 G 37 1 T 0 0
22 51169733 G 33 A 0 0 C 0 0 G 33 1 T 0 0
22 51169734 C 35 A 0 0 C 35 1 G 0 0 T 0 0
22 51169735 A 32 A 32 1 C 0 0 G 0 0 T 0 0
22 51169736 G 32 A 0 0 C 0 0 G 32 1 T 0 0
22 51169737 C 30 A 0 0 C 30 1 G 0 0 T 0 0
22 51169738 T 27 A 0 0 C 0 0 G 0 0 T 27 1
22 51169739 G 26 A 0 0 C 0 0 G 26 1 T 0 0
22 51169740 A 25 A 25 1 C 0 0 G 0 0 T 0 0
22 51169741 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169742 G 23 A 0 0 C 0 0 G 23 1 T 0 0
22 51169743 C 21 A 0 0 C 21 1 G 0 0 T 0 0
22 51169744 C 22 A 0 0 C 22 1 G 0 0 T 0 0
22 51169745 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169746 C 19 A 0 0 C 19 1 G 0 0 T 0 0
22 51169747 A 15 A 14 0.933333 C 1 0.0666667 G 0 0 T 0 0
22 51169748 C 20 A 0 0 C 20 1 G 0 0 T 0 0
The third column can be A, G, C or T.
I would like to:
remove columns 5, 6 and 7 when column 3 is an 'A' OR when $7=='0'.
Similarly, remove columns 8, 9, 10 when $3== 'C' OR when $10=='0'.
remove columns 11, 12, 13 when $3=='G' OR when $13=='0'.
and remove columns 14, 15, 16 when $3=='T' OR when $16=='0'.
When this is done for the entire file, there would only be 4 columns left in some cases and 7 columns in other cases, like in the following example:
22 51169729 G 39 T 1 0.025641
22 51169730 A 36
22 51169731 C 39
22 51169732 G 37
22 51169733 G 33
22 51169734 C 35
22 51169735 A 32
22 51169736 G 32
22 51169737 C 30
22 51169738 T 27
22 51169739 G 26
22 51169740 A 25
22 51169741 C 22
22 51169742 G 23
22 51169743 C 21
22 51169744 C 22
22 51169745 C 19
22 51169746 C 19
22 51169747 A 15 C 2 0.133333
22 51169748 C 20
Any suggestions?
Perl solution for the first part:
#!/usr/bin/perl
use warnings;
use strict;

my %remove = ( A => 4,     # Where to start removing the columns
               C => 7,     # for a given character in column #3.
               G => 10,
               T => 13,
             );

$\ = "\n";    # Add newline to prints.
$, = "\t";    # Separate values by tabs.

while (<>) {                           # Read input line by line;
    chomp;                             # Remove newline.
    my @F = split /\t/;                # Split on tabs, populate an array.
    splice @F, $remove{ $F[2] }, 3;    # Remove the columns.
    print @F;                          # Output.
}
Once you clarify the second requirement, I can try to add more code. What values do you want to remove? Can you show more examples?
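In the meantime, here is a rough sketch of one reading of both rules, assuming the $7/$10/$13 tests refer to the original (pre-deletion) column numbers and mean "the block's frequency value is 0":
#!/usr/bin/perl
use warnings;
use strict;

while (<>) {
    chomp;
    my @F = split /\t/;
    my @keep = @F[0 .. 3];               # the first four columns always stay
    # The four base blocks (letter, count, frequency) start at
    # zero-based indices 4, 7, 10 and 13.
    for my $start (4, 7, 10, 13) {
        my ($base, $count, $freq) = @F[$start .. $start + 2];
        # Drop the block if it is the reference base (column 3)
        # or if its frequency is zero.
        next if $base eq $F[2] or $freq == 0;
        push @keep, $base, $count, $freq;
    }
    print join("\t", @keep), "\n";
}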
Here's one way to do the first part, assuming no empty fields:
$ cat tst.awk
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ gsub(/[[:space:]]+/,"\t"); print }
$ awk -f tst.awk file
1 957584 C 157 A 1 0.006 G 0 0 T 0 0
I don't really understand what you're trying to do in the 2nd part, but it sounds like this might be what you want, if the test on $7/$10/$13 refers to the modified field numbers after the first phase:
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
{ $0=$0 }
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); print }
or this if the test on $7/10/13 is the original field numbers:
$7 ~ /0/ { c++ }
$10 ~ /0/ { c++ }
$13 ~ /0/ { c++ }
$3 == "A" { $5=$6=$7="" }
$3 == "C" { $8=$9=$10="" }
$3 == "G" { $11=$12=$13="" }
$3 == "T" { $14=$15=$16="" }
c > 1 { $8=$9=$10="" }
{ c=0; gsub(/[[:space:]]+/,"\t"); print }
If not, edit your question to clarify with a better example.

How to implement this in awk or shell?

Input File1:
5 5 NA
NA NA 1
2 NA 2
Input File2:
1 1 1
2 NA 2
3 NA NA
NA 4 4
5 5 5
NA NA 6
Output:
3 NA NA
NA 4 4
NA NA 6
The purpose is: collect every item in file1 that is not NA into a set, then eliminate any line of file2 that contains a value from that set. Does anyone have ideas about this?
To collect every item that is not 'NA' and filter file2 against that set:
awk -f script.awk file1 file2
Contents of script.awk:
FNR==NR {
    for (i=1;i<=NF;i++) {
        if ($i != "NA") {
            a[$i]++
        }
    }
    next
}
{
    for (j=1;j<=NF;j++) {
        if ($j in a) {
            next
        }
    }
}1    # the trailing 1 is an always-true condition with no action: print any line not skipped
Results:
3 NA NA
NA 4 4
NA NA 6
Alternatively, here's the one-liner:
awk 'FNR==NR { for (i=1;i<=NF;i++) if ($i != "NA") a[$i]++; next } { for (j=1;j<=NF;j++) if ($j in a) next }1' file1 file2
You could do this with grep:
$ egrep -o '[0-9]+' file1 | fgrep -wvf - file2
3 NA NA
NA 4 4
NA NA 6
awk one-liner:
 
awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA"){a[$i];break} next}{for(i=1;i<=NF;i++)if($i in a)next;}1' file1 file2
with your data:
kent$ awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA"){a[$i];break;} next}{for(i=1;i<=NF;i++)if($i in a)next;}1' file1 file2
3 NA NA
NA 4 4
NA NA 6
If the column position of the values matters:
awk '
NR==FNR{
    for(i=1; i<=NF; i++) if($i!="NA") A[i,$i]=1
    next
}
{
    for(i=1; i<=NF; i++) if($i!="NA" && A[i,$i]) next
    print
}
' file1 file2

count no. of occurrences per unique id

I am new to the command line. I have a long, space-delimited text file (samp.txt) with the following columns. awk/sed/perl help appreciated.
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
For each unique id I want: the number of occurrences of that id, a count from the 6th column (where the 6th column = Pass), a count of He (8th column), and a count of Ho (9th column). I would like to get a result like this:
Id CountId Countpass CountHe CountHO
cm|371443199 3 3 2 1
cm|367079424 2 2 0 2
awk '{ids[$1]++; pass[$1] = "?"; he[$1] += $8; ho[$1] += $9} END {OFS = "\t"; print "Id", "CountId", "Countpass", "CountHe", "CountHO"; for (id in ids) {print id, ids[id], pass[id], he[id], ho[id]}}' inputfile
Broken out onto multiple lines:
awk '{
    ids[$1]++;
    pass[$1] = "?"; # I'm not sure what you want here
    he[$1] += $8;
    ho[$1] += $9
}
END {
    OFS = "\t";
    print "Id", "CountId", "Countpass", "CountHe", "CountHO";
    for (id in ids) {
        print id, ids[id], pass[id], he[id], ho[id]
    }
}' inputfile
You seem to have a typo in your input, where you put ...98 instead of ...99. Assuming that is the case, your other information and expected output make sense.
This uses an array to store the ids so that their original order is preserved.
use strict;
use warnings;
use feature 'say';    # to enable say()

my $hdr = <DATA>;     # remove header

my %hash;
my @keys;

while (<DATA>) {
    my ($id,$pos,$re,$va,$cn,$sf,$sr,$he,$ho,$nc) = split;
    $id =~ s/^c\K/m/;              # turn c|... into cm|... as in the expected output
    $hash{$id}{he}   += $he;
    $hash{$id}{ho}   += $ho;
    $hash{$id}{pass}{$sf}++;
    $hash{$id}{count}++;
    push @keys, $id if $hash{$id}{count} == 1;
}

say join "\t", qw(Id CountId Countpass CountHe CountHO);
for my $id (@keys) {
    say join "\t", $id,
        $hash{$id}{count},                    # occurrences of the id
        scalar keys %{ $hash{$id}{pass} },    # the number of unique passes
        @{ $hash{$id} }{qw(he ho)};           # the He and Ho totals
}

__DATA__
Id Pos Re Va Cn SF:R1 SR He Ho NC
c|371443199 22 G A R Pass:8 0 1 0 0
c|371443199 25 C A M Pass:13 0 0 1 0
c|371443199 22 G A R Pass:8 0 1 0 0
c|367079424 17 C G S Pass:19 0 0 1 0
c|371443198 17 G A R Pass:18 0 1 0 0
c|367079424 17 G A R Pass:18 0 0 1 0
Output:
Id CountId Countpass CountHe CountHO
cm|371443199 3 2 2 1
cm|367079424 2 2 0 2
cm|371443198 1 1 1 0
Note: I made the output tab-delimited for easier post-processing. If you want it pretty instead, use printf to get some fixed width fields.
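For example, a minimal printf version of the final say might look like this (the field widths are arbitrary, and the %hash structure is the one built above):
    printf "%-15s %8d %10d %8d %8d\n",
        $id,
        $hash{$id}{count},
        scalar keys %{ $hash{$id}{pass} },
        @{ $hash{$id} }{qw(he ho)};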