Unix join on more than two files - perl

I have three files, each with an ID and a value.
sdt5z@fir-s:~/test$ ls
a.txt b.txt c.txt
sdt5z@fir-s:~/test$ cat a.txt
id1 1
id2 2
id3 3
sdt5z@fir-s:~/test$ cat b.txt
id1 4
id2 5
id3 6
sdt5z@fir-s:~/test$ cat c.txt
id1 7
id2 8
id3 9
I want to create a file that looks like this...
id1 1 4 7
id2 2 5 8
id3 3 6 9
...preferably using a single command.
I'm aware of the join and paste commands. Paste will duplicate the id column each time:
sdt5z@fir-s:~/test$ paste a.txt b.txt c.txt
id1 1 id1 4 id1 7
id2 2 id2 5 id2 8
id3 3 id3 6 id3 9
Join works well, but for only two files at a time:
sdt5z@fir-s:~/test$ join a.txt b.txt
id1 1 4
id2 2 5
id3 3 6
sdt5z@fir-s:~/test$ join a.txt b.txt c.txt
join: extra operand `c.txt'
Try `join --help' for more information.
I'm also aware that paste can take STDIN as one of the arguments by using "-". E.g., I can replicate the join command using:
sdt5z@fir-s:~/test$ cut -f2 b.txt | paste a.txt -
id1 1 4
id2 2 5
id3 3 6
But I'm still not sure how to modify this to accommodate three files.
Since I'm doing this inside a Perl script, I know I could put this inside a foreach loop: join file1 file2 > tmp1, then join tmp1 file3 > tmp2, and so on. But that gets messy, and I would like to do this with a one-liner.

join a.txt b.txt|join - c.txt
should be sufficient
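The same pattern chains for any number of files by joining each new file against the previous result via "-"; for example, with a hypothetical fourth file d.txt:
join a.txt b.txt | join - c.txt | join - d.txt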

Since you're doing it inside a Perl script, is there any specific reason you're NOT doing the work in Perl as opposed to spawning shell commands?
Something like (NOT TESTED! caveat emptor):
use File::Slurp; # Slurp the files in if they aren't too big

my @files = qw(a.txt b.txt c.txt);
my %file_data = map { ($_ => [ read_file($_) ]) } @files;

my @id_orders;
my %data = ();
my $first_file = 1;

foreach my $file (@files) {
    foreach my $line (@{ $file_data{$file} }) {
        my ($id, $value) = split(/\s+/, $line);
        push @id_orders, $id if $first_file;
        $data{$id} ||= [];
        push @{ $data{$id} }, $value;
    }
    $first_file = 0;
}

foreach my $id (@id_orders) {
    print "$id " . join(" ", @{ $data{$id} }) . "\n";
}
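Run against the a.txt, b.txt and c.txt above, this should print:
id1 1 4 7
id2 2 5 8
id3 3 6 9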

perl -lanE'$h{$F[0]} .= " $F[1]"; END{say $_.$h{$_} foreach keys %h}' *.txt
Should work; I can't test it as I'm answering from my mobile. You could also sort the output by putting sort between foreach and keys.
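For example, the sorted variant (the same untested sketch with sort added):
perl -lanE'$h{$F[0]} .= " $F[1]"; END{say $_.$h{$_} foreach sort keys %h}' *.txt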

pr -m -t -s\ file1.txt file2.txt|gawk '{print $1"\t"$2"\t"$3"\t"$4}'> finalfile.txt
Here file1 and file2 each have two columns: $1 and $2 are the columns from file1, and $3 and $4 are the columns from file2.
You can print any column from each file this way, and pr will take any number of files as input. If file1 had five columns, for example, then $6 would be the first column of file2.
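Applied to the three two-column files from the question, a sketch that keeps the ID once plus each value column ($1 is the ID; $2, $4 and $6 are the values):
pr -m -t -s' ' a.txt b.txt c.txt | gawk '{print $1"\t"$2"\t"$4"\t"$6}' > finalfile.txt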

Related

Script merging two files

I'm fairly inexperienced with coding, but I often use Perl to merge files and match IDs and information between two files. I have just tried matching two files using a program I have used many times previously, but this time it's not working and I don't understand why.
Here is the code:
use strict;
use warnings;
use vars qw($damID $damF $damAHC $prog $hash1 %hash1 $info1 $ID $sire $dam $F $FB $AHC $FA $hash2 %hash2 $info2);
open (FILE1, "<damF.txt") || die "$!\n Couldn't open damF.txt\n";
my $N = 1;
while (<FILE1>){
    chomp (my $line=$_);
    next if 1..$N==$.;
    my ($damID, $damF, $damAHC, $prog) = split (/\t/, $line);
    if ($prog){
        $hash1 -> {$prog} -> {info1} = "$damID\t$damF\t$damAHC";
    }
    open (FILE2, "<whole pedigree_F.txt") || die "$!\n whole pedigree_F.txt \n";
    open (Output, ">Output.txt")||die "Can't Open Output file";
    while (<FILE2>){
        chomp (my $line=$_);
        next if 1..$N==$.;
        my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = split (/\t/, $line);
        if ($ID){
            $hash2 -> {$ID} -> {info2} = "$F\t$AHC";
        }
        if ($ID && ($hash1->{$prog})){
            $info1 = $hash1 -> {$prog} -> {info1};
            $info2 = $hash2 -> {$ID} -> {info2};
            print "$ID\t$info2\t$info1\n";
        }
    }
}
close(FILE1);
close FILE2;
close Output;
print "Done!\n";
And here are snippets of the two input file formats:
File 1:
501093 0 0 3162
2958 0 0 3163
1895 0 0 3164
1382 0 0 3165
2869 0 0 3166
2361 0 0 3167
754 0 0 3168
3163 0 0 3169
File 2:
49327 20543 49325 0.077 0.4899 0.808 0.0484
49328 15247 49326 0.0755 0.5232 0.8972 0.0499
49329 27823 49327 0.0834 0.5138 0.8738 0.0541
I want to match the values from column 4 in file 1, with column 1 in file 2.
Then I also want to print the matching values from columns 2 and 3 in file 1 and columns 3 and 5 in file 2.
Also, it is probably worth mentioning that there are about 500,000 entries in each file.
This is the output I am getting:
11476 0.0362 0.3237 501093 0 0
11477 0.0673 0.4768 501093 0 0
11478 0.0443 0.2619 501093 0 0
Note that it isn’t looping through the first hash that I created.
Create two tables in SQLite. Load the TSVs into them. Do a SQL join. It will be simpler and faster.
Refer to this answer about how to load data into SQLite. In your case you want .mode tabs.
sqlite> create table file1 ( col1 int, col2 int, col3 int, col4 int );
sqlite> create table file2 ( col1 int, col2 int, col3 int, col4 numeric, col5 numeric, col6 numeric, col7 numeric );
sqlite> .mode tabs
sqlite> .import /path/to/file1 file1
sqlite> .import /path/to/file2 file2
There are any number of ways to improve those tables, but I don't know what your data is; use better names in your own schema. You'll also want to declare things like primary and foreign keys, as well as indexes, to speed things up.
Now you have your data in an easy-to-manipulate format using a well-known query language, not a bunch of custom code.
I want to match the values from column 4 in file 1, with column 1 in file 2.
Then I also want to print the matching values from columns 2 and 3 in file 1 and columns 3 and 5 in file 2.
You can do this with a SQL join between the two tables.
select file1.col2, file1.col3, file2.col3, file2.col5
from file1
join file2 on file1.col4 = file2.col1
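To write the result back out as a tab-separated file, a sketch using the sqlite3 shell's .output command to redirect query results to a file:
sqlite> .mode tabs
sqlite> .output output.txt
sqlite> select file1.col2, file1.col3, file2.col3, file2.col5
   ...>   from file1
   ...>   join file2 on file1.col4 = file2.col1;
sqlite> .output stdout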

sed - substitute between patterns on different lines

I have a csv file exported from a spreadsheet which sometimes has a list of names in the last column. The file comes out like this:
ag,bd,cj,dy,"ss"
aa,bs,cs,fg,"name1
name2
name3
"
ff,ce,sd,de,
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4
name5
"
and so on.
I would like to remove the line feed in the last column between quotes so that the output is:
ag,bd,cj,dy,ss
aa,bs,cs,fg,"name1 name2 name3"
ff,ce,sd,de,
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4 name5"
Thanks
This awk may be one solution for you:
awk '/\"/ {s=!s} {printf "%s"(s?FS:RS),$0}' file
ag,bd,cj,dy,ss
aa,bs,cs,fg,"name1 name2 name3 "
ff,ce,sd,de,df
New solution
awk -F\" 'NF==3; NF==2 {s++} s==1 {printf "%s ",$0} s==2 {print;s=0}' | awk '{sub(/ "/,"\"")}1' file
ag,bd,cj,dy,"ss"
aa,bs,cs,fg,"name1 name2 name3"
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4 name5"

grep and replace

I want to grep a string at the first occurrence ONLY in a file (file.dat) and replace it by reading from another file (output). As an example, I have a file called "output" that contains "AAA T 0001".
#!/bin/bash
procdir=`pwd`
cat output | while read lin1 lin2 lin3
do
srt2=$(echo $lin1 $lin2 $lin3 | awk '{print $1,$2,$3}')
grep -m 1 $lin1 $procdir/file.dat | xargs -r0 perl -pi -e 's/$lin1/$srt2/g'
done
Basically what I want is: whenever the string "AAA" is grepped from file.dat at the first instance, I want to replace the second and third columns next to "AAA" with "T 0001" while keeping the first column "AAA" as it is. The above script does not work: the $lin1 and $srt2 variables are not expanded inside 's/$lin1/$srt2/g'.
Example:
in my file.dat I have a row
AAA D ---- CITY COUNTRY
What I want is :
AAA T 0001 CITY COUNTRY
Any comments are much appreciated.
If you have an output file like this:
$ cat output
AAA T 0001
Your file.dat file contains information like:
$ cat file.dat
AAA D ---- CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY
You can try something like this with awk:
$ awk '
NR==FNR {
a[$1]=$0
next
}
$1 in a {
printf "%s ", a[$1]
delete a[$1]
for (i=4;i<=NF;i++) {
printf "%s ", $i
}
print ""
next
}1' output file.dat
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY
Say you place the string for which to search in $s and the string with which to replace in $r, wouldn't the following do?
perl -i -pe'
BEGIN { ($s,$r)=splice(@ARGV,0,2) }
$done ||= s/\Q$s/$r/;
' "$s" "$r" file.dat
(Replaces the first instance if present)
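For example, with the strings from this question (a sketch; $s is the exact row text to replace and $r its replacement):
s='AAA D ----'
r='AAA T 0001'
perl -i -pe'
BEGIN { ($s,$r)=splice(@ARGV,0,2) }
$done ||= s/\Q$s/$r/;
' "$s" "$r" file.dat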
This will only change the first match in the file:
#!/bin/bash
procdir=`pwd`
while read line; do
set $line
sed '0,/'"$1"'/s/\([^ ]* \)\([^ ]* [^ ]*\)/\1'"$2 $3"'/' $procdir/file.dat
done < output
To change all matching lines:
sed '/'"$1"'/s/\([^ ]* \)\([^ ]* [^ ]*\)/\1'"$2 $3"'/' $procdir/file.dat
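Against the sample file.dat above, the first-match-only version should print:
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY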

Joining two files based on two fields

I posted a question a week ago and the answer was simply (use join):
join <(sort file1) <(sort file2) >output
to join files that have something in common, which is usually the first field.
I have the following two files:
genes.txt
ENSG001 ENSG002
ENSG002 ENSG001
ENSG003 ENSG004
features.txt
ENSG001 400
ENSG002 350
ENSG003 210
ENSG004 100
I need to join these two files to be like this:
output.txt
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100
I know the answer is the join command, but I can't figure out how to join based on two fields. I tried
join -j 1 <(sort genes.txt) <(sort features.txt) >attempt1.txt
but the result will looks like this:
attempt1.txt
ENSG001 ENSG002 400
ENSG002 ENSG001 350
ENSG003 ENSG004 210
I then tried
join -j 2 <(sort -k 2 genes.txt) <(sort -k 2 features.txt) >attempt2.txt
attempt2.txt is empty
Does (join) have the ability to join two files based on two fields ? If no then how can I do it ?
my %features;

open my $fd, '<', 'features.txt' or die $!;
while (<$fd>) {
    my ($k, $v) = split;
    $features{$k} = $v;
}
close $fd or die $!;

open $fd, '<', 'genes.txt' or die $!;
while (<$fd>) {
    s/(\w+)/$1 $features{$1}/g;
    print;
}
close $fd or die $!;
Thank you all; I managed to answer it by working around the problem.
First I joined the files normally, then swapped the first and second fields, joined the modified output with features.txt a second time, and finally switched the field positions back.
join <(sort genes.txt) <(sort features.txt) >tmp
cat tmp | awk '{ print $2, $1, $3 }' >tmp2
join <(sort tmp2) <(sort features.txt) >tmp3
cat tmp3 | awk '{ print $2, $3, $1, $4 }' >output.txt
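Tracing the first gene pair through the intermediate files shows how the field swapping works:
tmp:        ENSG001 ENSG002 400
tmp2:       ENSG002 ENSG001 400
tmp3:       ENSG002 ENSG001 400 350
output.txt: ENSG001 400 ENSG002 350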
To the best of my knowledge, join does NOT support this. See the join manpage.
However, you can accomplish this in two ways:
Turn the first space/tab in each file into a character you will never see in the data (here #), then use join as before, which will treat the first two fields as one field:
perl -pi -e 's/^(\S+)\s+/$1#/' file1
perl -pi -e 's/^(\S+)\s+/$1#/' file2
join <(sort file1) <(sort file2) >output
tr "#" " " < output > output.final
Do it in Perl. You can take:
the blunt approach (perreal's answer: slurp in both files at once); this takes a lot of memory if both files are large
the more memory-conserving approach (cdtits's answer: slurp in the smaller file, store it in a hash, then apply the lookups to a line-by-line read of the second file)
for really ginormous files, a linear approach: sort both files, then read one line of each; if the IDs match, print the match; if not, skip a line in the file whose ID is smaller (see the sketch below)
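A minimal sketch of that linear merge, assuming two hypothetical input files sorted1.txt and sorted2.txt already sorted on a unique first-column ID:
#!/usr/bin/perl
use strict;
use warnings;

# Merge-join two pre-sorted key/value files on their first column.
open my $fa, '<', 'sorted1.txt' or die $!;
open my $fb, '<', 'sorted2.txt' or die $!;

my $la = <$fa>;
my $lb = <$fb>;
while (defined $la && defined $lb) {
    my ($ka) = split ' ', $la;
    my ($kb) = split ' ', $lb;
    if ($ka lt $kb) {
        $la = <$fa>;    # skip a line in the file with the smaller ID
    } elsif ($ka gt $kb) {
        $lb = <$fb>;
    } else {
        chomp(my $left = $la);    # IDs match: print the joined row
        chomp(my $right = $lb);
        print "$left $right\n";
        $la = <$fa>;
        $lb = <$fb>;
    }
}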
Assuming the "ENST" in features.txt should be "ENSG", here is an awk solution that works well on the given example:
awk 'BEGIN {while(getline <"features.txt") f[$1]=$2} {print $1,f[$1],$2,f[$2]}' < genes.txt
I can explain in detail if you need to.
Using perl:
use strict;
use warnings;
open GIN, "<genes.txt" or die("genes");
open FIN, "<features.txt" or die("features");
my %relations;
my %values;
while (<GIN>) {
my ($r1, $r2) = split;
$relations{$r1} = $r2;
}
while (<FIN>) {
my ($k, $v) = split;
$values{$k} = $v;
}
for my $r1 (sort keys %relations) {
my $r2 = $relations{$r1};
print "$r1 $values{$r1} $r2 $values{$r2}\n";
}
close FIN; close GIN;
Your approach is generally right. It should be achievable by something like
join -o '1.1 2.2 1.2 1.3' <(
join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort
) <(sort features.txt)
If I place ENSG004 instead of ENST004 into features.txt I will get exactly what you are looking for:
$ join -o '1.1 2.2 1.2 1.3' <(
join -o '1.1 1.2 2.2' -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort
) <(sort features.txt)
ENSG001 400 ENSG002 350
ENSG002 350 ENSG001 400
ENSG003 210 ENSG004 100
There is a less verbose version, but it is harder to keep track of the fields:
join -o '1.2 2.2 1.1 1.3' -1 2 <(
join -1 2 <(sort -k 2 genes.txt) <(sort features.txt) |
sort -k 2
) <(sort features.txt)
If you are going to process really big data, this should work effectively up to tens of GB (and should also beat most RDBMSs if features.txt and genes.txt are comparable in size):
TMP=`mktemp`
sort features.txt > "$TMP"
sort -k 2 genes.txt | join -o '1.1 1.2 2.2' -1 2 - "$TMP" | sort |
join -o '1.1 2.2 1.2 1.3' - "$TMP"
rm "$TMP"

Compare semicolon-separated data in 2 files using shell script

I have some data (separated by semicolon) with close to 240 rows in a text file temp1.
temp2.txt stores 204 rows of data (separated by semicolon).
I want to:
Sort the data in both files by field1, i.e. the first data field in every row.
Compare the data in both files and redirect the rows that do not match into separate files.
Sample data:
temp1.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
1000xyz430200xyzA00651xyz0;146.70;0.00;0.00;0.00;0.00;0.00
temp2.txt
1000xyz400100xyzA00680xyz0;19722.83;19565.7;157.13;11;2.74;11.00
1000xyz400100xyzA00682xyz0;7210.68;4111.53;3099.15;216.95;1.21;216.94
The sort command I'm using:
sort -k1,1 temp1 -o temp1.tmp
sort -k1,1 temp2 -o temp2.tmp
I'd appreciate it if someone could show me how to redirect only the missing/mismatching rows into two separate files for analysis.
Try
cat temp1 temp2 | sort -k1,1 -o tmp
# mis-matching/missing rows:
uniq -u tmp
# matching rows:
uniq -d tmp
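With the sample rows above, uniq -u tmp should print the one row that appears only in temp1:
1000xyz430200xyzA00651xyz0;146.70;0.00;0.00;0.00;0.00;0.00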
You want the difference as described at http://www.pixelbeat.org/cmdline.html#sets
sort -t';' -k1,1 temp1 temp1 temp2 | uniq -u > only_in_temp2
sort -t';' -k1,1 temp1 temp2 temp2 | uniq -u > only_in_temp1
Notes:
Use join rather than uniq, as shown at the link above if you want to compare only particular fields
If the first field is fixed width then you don't need the -t';' -k1,1 params above
Look at the comm command.
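For example, a sketch using comm on the two sorted files (comm compares whole lines and requires sorted input):
comm -23 <(sort temp1) <(sort temp2) > only_in_temp1
comm -13 <(sort temp1) <(sort temp2) > only_in_temp2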
Using gawk, outputting lines in file1 that are not in file2:
awk -F";" 'FNR==NR{ a[$1]=$0;next }
( ! ( $1 in a) ) { print $0 > "afile.txt" }' file2 file1
Interchange the order of file2 and file1 to output lines in file2 that are not in file1, as in the sketch below.
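A sketch of the reversed direction (writing to a hypothetical bfile.txt):
awk -F";" 'FNR==NR{ a[$1]=$0;next }
( ! ( $1 in a) ) { print $0 > "bfile.txt" }' file1 file2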