comparing columns of multiple files using shell - sh

I have to compare two files based on the first column; if they match, print the second columns of file1 and file2 on the same line.
file 1
1cu2 pf00959
3nnr pf00440
3nnr pf13972
2v89 pf13341
4aqb pf00431
4aqb pf00431
4aqb pf07645
4aqb pf00084
2liv pf13458
2liv pf01094
file 2
1cu2 d.2.1.3
2v89 g.50.1.2
2v89 g.50.1.2
2liv c.93.1.1
2liv c.93.1.1
1q2w b.47.1.4
1q2w b.47.1.4
1rgh d.1.1.2
1rgh d.1.1.2
1zxl c.2.1.2
output
1cu2 pf00959 d.2.1.3
2v89 pf13341 g.50.1.2
2liv pf13458 c.93.1.1

Assuming you're using more than the plain Bourne shell (/bin/sh), you can do this in a one-liner:
$ join <(sort -u file1) <(sort -u file2)
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2liv pf13458 c.93.1.1
2v89 pf13341 g.50.1.2
If you're actually writing a shell script for /bin/sh, you'll need temporary files, e.g.
$ sort file1 > file1-sorted
$ sort file2 > file2-sorted
$ join file1-sorted file2-sorted
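In a real script you would typically also create and clean up the temporary files; here is a minimal sketch, assuming mktemp is available (otherwise it is the same join as above):
#!/bin/sh
# Sketch: same join as above, with throw-away temp files removed on exit.
tmp1=$(mktemp) || exit 1
tmp2=$(mktemp) || exit 1
trap 'rm -f "$tmp1" "$tmp2"' EXIT
sort file1 > "$tmp1"
sort file2 > "$tmp2"
join "$tmp1" "$tmp2"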
Update: your example output has only one hit per key, even though 2liv has two values in file1. To accomplish this, you need to run the result through a post-processor that notes the duplicates:
$ join <(sort -u file1) <(sort -u file2) |awk '!done[$1,$3] { print; done[$1,$3] = 1 }'
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2v89 pf13341 g.50.1.2
This uses a simple hash in awk. The sort -u invocations already removed the duplicate lines from file1 (which supplies the second column of the final output), so we're merely looking for the first unique pairing of a key with a value from file2 (the first and third columns). When we find a new pair, the line is printed and the pair is saved so it won't print on its next hit.
Note that this is not sorted the way your sample output was. That would be nontrivial (you'd need a third job just to determine the original order and then map things to it.)
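One possible shape for that third job, as a rough sketch (untested beyond the sample data; joined is just a scratch file): number the keys by their first appearance in file1, decorate the joined output with that number, sort on it, then strip it again.
join <(sort -u file1) <(sort -u file2) |
    awk '!done[$1,$3] { print; done[$1,$3] = 1 }' > joined
awk 'NR==FNR { if (!($1 in first)) first[$1] = NR; next }
     { print first[$1], $0 }' file1 joined |
    sort -n | cut -d' ' -f2-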

Related

bash while loop failing while using sed

I am facing an issue with sed in a while loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string is matched, I want to replace the matched string in file1 with the file2 string.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. With the while loop you read each line of file1: that is N operations. For every line you process in the while loop, sed reprocesses the full file1, making it an N*N process. Moreover, inside the sed you grep file2 every time; if file2 has M lines, this becomes an N*N*M process. This is very inefficient.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update does not produce any output, so file3 will be empty.
You are reading file1 with the while loop and at the same time updating it with sed. I don't know how this will behave, but I don't believe it is healthy.
If $b is not in file2 you would, according to your logic, end up with a line containing only a single column. This is not what you expect.
A fix of your script would be:
while read -r a b; do
    c=$(grep "$b" file2)
    [[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
which is still not efficient, but it is already only M*N. The best way is to use awk, as the next answer shows.
note: as a novice, always parse your script with http://www.shellcheck.net
note: as a professional, always parse your script with http://www.shellcheck.net
Could you please try the following:
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one-liner form of the solution:
awk '
FNR==NR{
  a[$2]=$1
  next
}
{
  for(i in a){
    if(match($0,"^"i)){
      print a[i],$0
      continue
    }
  }
}
' Input_file1 Input_file2
Explanation: adding a detailed explanation of the above code.
awk '                      ##Starting awk code from here.
FNR==NR{                   ##Checking condition: if FNR==NR then do the following.
  a[$2]=$1                 ##Creating array a whose index is $2 and value is $1.
  next                     ##next will skip all further statements from here.
}
{                          ##Statements from here will run for the 2nd Input_file only.
  for(i in a){             ##Traversing through all elements of array a here.
    if(match($0,"^"i)){    ##Checking condition: if the current line starts with the index of the current item from array a, then do the following.
      print a[i],$0        ##Printing the value of array a whose index is i, followed by the current line.
      continue             ##Again take the cursor to the for loop.
    }
  }
}
' Input_file1 Input_file2  ##Mentioning all Input_file names here.
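For the record, with the sample file1 and file2 from this question, either form should print the expected output:
1 12345
2 87654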

search for occurrence of a string in another file in a particular column

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search strings (column X) as shown below. Let's call this file search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
What I would like to do is take the first search term (from the search_string file, column X) and check whether it occurs in the first column of the reference file.
Example: the first search term is NF-E2. I would need to check if this string occurs in the first column of the reference file. If it occurs, give a score of 1, else 0. I would also like to count the number of times it matches the pattern.
I want the output to be created as follows:
X X in file? number of times it occurs
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Please note: I need to search each string in a different file, i.e. the first string (NF-E2) should be searched in file NF-E2.tab, the second string (NRSF) in file NRSF.tab, and so on. Also, I would like to program this using either R or Perl scripts only.
Please help!!
Here is a one-liner that you can play with and alter to suit:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
It assumes your second file is called file2. Here is some sample output from input files I made up on my machine:
NF-E2 0 0
NRSF 1 1
NFATC1 1 2
TCF3 1 3
MEF2A 0 0
It just uses grep -c to count the occurrences and stores that in variable $c. chomp() removes the linefeed from the output of grep. $x is set to zero and incremented if the count ($c) is greater than zero. Then the result is printed using say.
I'll get you started with the search string and the name of the file to search in...
$ perl -lanE '$str=$F[1];$f=$str.".txt";print "$str $f"' file2
NF-E2 NF-E2.txt
NRSF NRSF.txt
NFATC1 NFATC1.txt
NFKB NFKB.txt
TCF3 TCF3.txt
MEF2A MEF2A.txt
Explanation
Perl command-line switches used:
-l Perl takes care of line endings for us, saving us the trouble - thanks, Perl!
-a split the fields of the input file into an array @F (elements are accessed as $F[0], $F[1], ...)
-n put an implicit loop around our code to process each line of the input file (file2)
-E execute the code that follows inside single quotes and enable the say feature
Then the actual code inside the single quotes ('')... assign the value of the second field, i.e. $F[1] (because fields start at 0), to the variable $str. Assign the value of $str with ".txt" appended to the variable $f - which is the name of the file to search in. Then print the search string $str and the filename $f.
EDITED
If you find Bash easier to understand, here is a Bash version.
#!/bin/bash
# Set tabs to align output columns
tabs -12
# Output headers
echo -e "X\tPresent?\tCount"
# Extract second column of file2
awk '{print $2}' file2 | while read item
do
    # Work out name of file to search in
    FILE="/home/${item}/list/${item}.txt"
    # Count occurrences of $item in $FILE
    COUNT=$(grep -cw "$item" "$FILE")
    # If COUNT>0 the value is present
    PRESENT=0
    [ $COUNT -gt 0 ] && PRESENT=1
    echo -e "$item\t$PRESENT\t$COUNT"
done
Save the file as go, then run like this:
chmod +x go # Only necessary for the first run
./go

Assigning data from one file to another

I have a job where I need to add row data from one file to another file having the same id. Which program can I use to complete this job?
Input file1
481063384 PBPb
481063384 PBPb
481063384 LT_GEWL
481063384 lysozyme_like
481063384 SLT
481063384 emtA
481063406 Hsp33
481063406 Hsp33
481063406 COG1281
481063406 HSP33
Input file2
481063384 putative soluble lytic transglycosylase
481063406 chaperonin HslO
Desired Output
481063384 putative soluble lytic transglycosylase PBPb
481063406 chaperonin Hsp33
Condition: first I need to extract the first line of each repeating number, and then assign/add it.
Please help me.
I am thinking awk will be useful, but I am not good at awk programming.
You can try this:
awk 'NR==FNR{ if( $1 in a) next;a[$1]=$2;next}{$0=$0" "a[$1]}1' file1 file2
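Spelled out over several lines with comments (same logic as the one-liner above):
awk '
NR==FNR {               # first pass: file1
  if ($1 in a) next     # keep only the first value seen for each id
  a[$1] = $2
  next
}
{ $0 = $0 " " a[$1] }   # second pass: file2 - append the stored value
1                       # print the extended line
' file1 file2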
You can use join, and if you want the first line of each group, use awk:
join file2 file1 | awk '{if(a!=$1) print}{a=$1}'
which gives:
481063384 putative soluble lytic transglycosylase PBPb
481063406 chaperonin HslO Hsp33
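Note that join expects its inputs to be sorted on the join field. The sample files already are; if yours are not, sort them first (using bash process substitution, as in the first answer at the top):
join <(sort -k1,1 file2) <(sort -k1,1 file1) | awk '{if(a!=$1) print}{a=$1}'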
Another way with awk:
awk 'NR==FNR{!seen[$1]++&&line[$1]=$2;next}$0=$0 FS line[$1]' file{1,2}

Awk: Match data in 2 files with duplicate keys

I have 2 files
file1
a^b=-123
a^3=-124
c^b=-129
a^b=-130
and file2
a^b=-523
a^3=-524
a^b=-530
I want to lookup the key using '=' as delimiter and get the following output
a^b^-123^-523
a^b^-130^-530
a^3^-124^-524
When there were no duplicate keys, it was easy to do in awk by mapping the first file and looping over the second; with the duplicates, however, it's slightly more difficult. I tried something like this:
awk -F"=" '
FNR == NR {
arr[$1 "^" $2] = $2;
next;
}
FNR < NR {
for (i in arr) {
match(i, , /^(.*\^.*)\^([-0-9]*)$/, , ar);
if ($1 == ar[1]) {
if ($2 in load == 0) {
if (ar[2] in l2 == 0) {
l2[ar[2]] = ar[2];
load[$2] = $2;
print i "^" $2
}
}
}
}
}
' file1 file2
This works just fine; however, not surprisingly, it's extremely slow. On a file with about 600K records, it ran for 4 hours.
Is there a better and more efficient way to do this in awk or perl? If possible, a one-liner would be a great help.
Thanks.
You might want to look at the join command, which does something very much like what you're doing here, but generates a full database-style join. For example, assuming file1 and file2 contain the data you show above, then the commands
$ sort -o file1.out -t = -k 1,1 file1
$ sort -o file2.out -t = -k 1,1 file2
$ join -t = file1.out file2.out
produces the output
a^3=-124=-524
a^b=-123=-523
a^b=-123=-530
a^b=-130=-523
a^b=-130=-530
The sorts are necessary because, to be efficient, join requires the input files to be sorted on the keys being compared. Note though that this generates the full cross-product join, which appears not to be what you want.
(Note: The following is a very shell-heavy solution, but you could cast it fairly easily into any programming language with dynamic arrays and a built-in sort primitive. Unfortunately, awk isn't one of those, but perl and python are, as is, I'm sure, just about every newer scripting language.)
It seems that you really want each instance of a key to be consumed the first time it's emitted in any output. You can get this as follows, again starting with the original contents of file1 and file2.
$ nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
$ nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
This decorates each line with the original line number so that we can recover the original order later, and then sorts them on the key for join. The remainder of the work is a short pipeline, which I've broken up into multiple blocks so it can be explained as we go.
join -t = -1 2 -2 2 file1.out file2.out |
This command joins on the key names, now in field two, and emits records like those shown from the earlier output of join, except that each line now includes the line number where the key was found in file1 and file2. Next, we want to re-establish the search order your original algorithm would have used, so we continue the pipeline with
sort -t = -k 2,2 -k 4,4 |
which sorts first on the file1 line number and then on the file2 line numbers. Finally, we need to efficiently emulate the assumption that a particular key, once consumed, cannot be re-used, in order to eliminate the unwanted matches in the original join output.
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'
This skips every joined record that reuses a line (from either file) that has already been consumed, and otherwise prints the following
a^b=-123=-523
a^3=-124=-524
a^b=-130=-530
This should be uniformly efficient even for quite large inputs, because the sorts are O(n log n), and everything else is O(n).
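For reference, here are the same commands strung together into a single runnable pipeline (nothing new, just the pieces above assembled):
nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
join -t = -1 2 -2 2 file1.out file2.out |
    sort -t = -k 2,2 -k 4,4 |
    awk '
    BEGIN { OFS="="; FS="=" }
    $2 in seen2 || $4 in seen4 { next }
    { seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
    '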
Try this awk code and see if it is faster than yours (it could be a one-liner if you joined all the lines, but I think with formatting it is easier to read):
awk -F'=' -v OFS="^" 'NR==FNR{sub(/=/,"^");a[NR]=$0;t=NR;next}
{
    s=$1
    sub(/\^/,"\\^",s)
    for(i=1;i<=t;i++){
        if(a[i]~s){
            print a[i],$2
            delete a[i]
            break
        }
    }
}' file1 file2
With your example, it outputs the expected result:
a^b^-123^-523
a^3^-124^-524
a^b^-130^-530
But I think the key here is performance, so give it a try.

Sort a file with unordered columns of integers

I have an input file with two columns of integer values. I would like to chop the input file in this way:
input file:
...
...
12312 565456
565456 12312
...
...
#
output file:
...
...
12312 565456
...
...
Namely, if two numbers appear as a couple more than once, write a single line in the output file where the first number is the smaller of the two.
How can this be done with sort or a perl script?
You can try:
perl -nale '@F=reverse @F if($F[0]>$F[1]);
$x=$F[0]." ".$F[1]; if(!$h{$x}){print $x;$h{$x}=1;}'
You could combine perl and sort:
perl -lne 'BEGIN { $, = " " } print sort split' infile | sort -u
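Note that sort split compares the fields as strings, which happens to work for the sample values; if your numbers can have different digit counts, a numeric comparison is safer - a small variation on the same one-liner:
perl -lne 'BEGIN { $, = " " } print sort { $a <=> $b } split' infile | sort -u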
awk -vOFS="\t" '$2<$1 {print $2,$1} $1<=$2 {print $1,$2}' | sort -u
would also work