Assigning data from one file to another - sed

I have a task: add data from a row in one file to another file that has the same id. Which program can I use to complete this job?
Input file1
481063384 PBPb
481063384 PBPb
481063384 LT_GEWL
481063384 lysozyme_like
481063384 SLT
481063384 emtA
481063406 Hsp33
481063406 Hsp33
481063406 COG1281
481063406 HSP33
Input file2
481063384 putative soluble lytic transglycosylase
481063406 chaperonin HslO
Desired Output
481063384 putative soluble lytic transglycosylase PBPb
481063406 chaperonin HslO Hsp33
Condition: for each repeating id I need to take only its first line from file1 and append that value to the matching line of file2.
Please help me.
I am thinking awk will be useful, but I am not good at awk programming.

You can try this:
awk 'NR==FNR{ if( $1 in a) next;a[$1]=$2;next}{$0=$0" "a[$1]}1' file1 file2
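Spelled out with comments, that one-liner reads roughly like this (same logic, just reformatted):
awk '
NR==FNR {                 # first file (file1)
    if ($1 in a) next     # keep only the first value seen for each id
    a[$1] = $2
    next
}
{ $0 = $0 " " a[$1] }     # second file (file2): append the stored value
1                         # print the modified line
' file1 file2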

You can use join, and if you want only the first line of each id, filter with awk:
join file2 file1 | awk '{if(a!=$1) print}{a=$1}'
gives:
481063384 putative soluble lytic transglycosylase PBPb
481063406 chaperonin HslO Hsp33

Another way with awk:
awk 'NR==FNR{!seen[$1]++&&line[$1]=$2;next}$0=$0 FS line[$1]' file{1,2}
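Written out, the intent of that one-liner is roughly the following (file{1,2} is just shell brace expansion for file1 file2):
awk '
NR==FNR {                            # file1: remember the first value per id
    if (!seen[$1]++) line[$1] = $2
    next
}
{ print $0 FS line[$1] }             # file2: append that value and print
' file1 file2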

Related

bash while loop failing while using sed

I am facing an issue with sed in a while-loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string matches, I want to replace the matched string of file1 with the file2 string.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. With the while-loop you read each line of file1: that is N iterations. In every iteration the sed command reprocesses the full file1, which already makes it an N*N process, and each iteration also greps through file2 (M lines), so the total work is roughly N*(N+M). This is very expensive.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update writes nothing to standard output, so file3 will be empty.
You are reading file1 with the while-loop and at the same time you update it with sed. I don't know exactly how this will behave, but I don't believe it is healthy.
If $b is not in file2 you would, according to your logic, end up with a line containing only a single column. This is probably not what you expect.
A fixed version of your script would be this:
while read -r a b; do
    c=$(grep "$b" file2)
    [[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
which is still not efficient, but it is already down to M*N. The best way is using awk (see the sketch after the notes below).
note: as a novice, always parse your script with http://www.shellcheck.net
note: as a professional, always parse your script with http://www.shellcheck.net
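For example, if (as in your sample) the second column of file1 is always a fixed-length prefix of the matching file2 line, a single awk pass over both files does the whole job without any repeated grep. This is only a sketch: the prefix length 4 matches your sample data and would need adjusting for your real input.
awk 'NR==FNR { id[$2] = $1; next }          # file1: remember key -> first column
     substr($0, 1, 4) in id {               # file2: 4-character prefix lookup
         print id[substr($0, 1, 4)], $0
     }' file1 file2 > file3
With the sample data this writes 1 12345 and 2 87654 to file3.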
Could you please try the following.
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one-liner form of the solution:
awk '
FNR==NR{
  a[$2]=$1
  next
}
{
  for(i in a){
    if(match($0,"^"i)){
      print a[i],$0
      continue
    }
  }
}
' Input_file1 Input_file2
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk code from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
a[$2]=$1 ##Creating array a whose index is $2 and value is $1.
next ##next will skip all further statements from here.
}
{ ##Statements from here will run for 2nd Input_file only.
for(i in a){ ##Traversing through array a all elements here.
if(match($0,"^"i)){ ##Checking condition if current line matches index of current item from array a then do following.
print a[i],$0 ##Printing array a whose index is i and current line here.
continue ##Again take cursor to for loop.
}
}
}
' Input_file1 Input_file2 ##Mentioning all Input_file names here.
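With the sample file1 and file2 from the question, this should print:
1 12345
2 87654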

Extracting fasta ids after string match

I have a list of fasta sequences as follows:
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
The original fasta file is much longer than the subset posted here. I wanted to extract the 10 characters after the pattern "TCAT" into a separate file and did this:
grep -oP "(?<=TCAT).{10}"
I do get the needed result as:
CTCACCTACT
TGATAAGGGG
I would like their corresponding fasta ids as the first column and the extracted pattern as the second column, like this:
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Try this one-liner:
perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' file
With your given inputs:
$ cat fasta.txt
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
$ perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' fasta.txt
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
$
Another way is to use an awk command like this:
cat <your_file> | awk -F"_" '/Product/{printf "%s", $0; next} 1' | awk -F"TCAT" '{ print substr($1,2,35) "\t" substr($2,1,10) }'
the output:
Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Hope it helps.
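A third option (assuming, as in your sample, that every sequence line contains exactly one TCAT match, so headers and matches stay in step) is to pair your own grep output with the header lines, using paste and bash process substitution:
paste -d' ' <(grep '^>' fasta.txt) <(grep -oP '(?<=TCAT).{10}' fasta.txt)
This prints each header and its extracted 10 characters side by side, matching the desired two-column output above.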

comparing columns of multiple files using shell

I have to compare two files based on the first column; if they match, print the second columns of file1 and file2 on the same line.
file 1
1cu2 pf00959
3nnr pf00440
3nnr pf13972
2v89 pf13341
4aqb pf00431
4aqb pf00431
4aqb pf07645
4aqb pf00084
2liv pf13458
2liv pf01094
file 2
1cu2 d.2.1.3
2v89 g.50.1.2
2v89 g.50.1.2
2liv c.93.1.1
2liv c.93.1.1
1q2w b.47.1.4
1q2w b.47.1.4
1rgh d.1.1.2
1rgh d.1.1.2
1zxl c.2.1.2
output
1cu2 pf00959 d.2.1.3
2v89 pf13341 g.50.1.2
2liv pf13458 c.93.1.1
Assuming you're using something more capable than the plain Bourne shell (/bin/sh), such as bash, you can do this in a one-liner:
$ join <(sort -u file1) <(sort -u file2)
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2liv pf13458 c.93.1.1
2v89 pf13341 g.50.1.2
If you're actually writing a shell script for /bin/sh, you'll need temporary files, e.g.
$ sort file1 > file1-sorted
$ sort file2 > file2-sorted
$ join file1-sorted file2-sorted
Update: your example output has only one hit per key, even though 2liv has two values in file1. To accomplish this, pipe the result through a post-processor that skips the duplicates:
$ join <(sort -u file1) <(sort -u file2) |awk '!done[$1,$3] { print; done[$1,$3] = 1 }'
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2v89 pf13341 g.50.1.2
This uses a simple hash in awk. The sort -u items already got rid of the duplicates from file1 (the second column of the final output), so we're merely looking for the first unique pairing of the keys with the values from file2 (the first and third columns). If we find a new pair, the line is printed and the pair is saved so it won't print on its next hit.
Note that this is not sorted the way your sample output was. That would be nontrivial (you'd need a third job just to determine the original order and then map things to it.)
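If you want to reproduce the sample output exactly (the first file1 value per key, in file2's original order), here is a pure-awk sketch that needs no sorting:
awk 'NR==FNR { if (!($1 in v)) v[$1] = $2; next }        # file1: first value per key
     ($1 in v) && !done[$1]++ { print $1, v[$1], $2 }' file1 file2
For the data above this prints 1cu2 pf00959 d.2.1.3, 2v89 pf13341 g.50.1.2 and 2liv pf13458 c.93.1.1.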

Awk: Match data in 2 files with duplicate keys

I have 2 files
file1
a^b=-123
a^3=-124
c^b=-129
a^b=-130
and file2
a^b=-523
a^3=-524
a^b=-530
I want to look up the key, using '=' as the delimiter, and get the following output:
a^b^-123^-523
a^b^-130^-530
a^3^-124^-524
When there were no duplicate keys, it was easy to do in awk by mapping the first file and looping over the second; with the duplicates, however, it's slightly more difficult. I tried something like this:
awk -F"=" '
FNR == NR {
arr[$1 "^" $2] = $2;
next;
}
FNR < NR {
for (i in arr) {
match(i, /^(.*\^.*)\^([-0-9]*)$/, ar);
if ($1 == ar[1]) {
if ($2 in load == 0) {
if (ar[2] in l2 == 0) {
l2[ar[2]] = ar[2];
load[$2] = $2;
print i "^" $2
}
}
}
}
}
' file1 file2
This works just fine; however, not surprisingly, it's extremely slow. On a file with about 600K records, it ran for 4 hours.
Is there a better and more efficient way to do this in awk or perl? If possible, a one-liner would be a great help.
Thanks.
You might want to look at the join command, which does something very much like what you're doing here but generates a full database-style join. For example, assuming file1 and file2 contain the data shown above, the commands
$ sort -o file1.out -t = -k 1,1 file1
$ sort -o file2.out -t = -k 1,1 file2
$ join -t = file1.out file2.out
produce the output
a^3=-124=-524
a^b=-123=-523
a^b=-123=-530
a^b=-130=-523
a^b=-130=-530
The sorts are necessary because join requires its input files to be sorted on the keys being compared. Note though that this generates the full cross product of matching lines for each key, which appears not to be what you want.
(Note: the following is a very shell-heavy solution, but you could cast it fairly easily into any programming language with dynamic arrays and a built-in sort primitive. Unfortunately, awk isn't one of those, but perl and python are, as is, I'm sure, just about every newer scripting language.)
It seems that you really want each instance of a key to be consumed the first time it's emitted in any output. You can get this as follows, again starting with the original contents of file1 and file2.
$ nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
$ nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
This decorates each line with the original line number so that we can recover the original order later, and then sorts them on the key for join. The remainder of the work is a short pipeline, which I've broken up into multiple blocks so it can be explained as we go.
join -t = -1 2 -2 2 file1.out file2.out |
This command joins on the key names, now in field two, and emits records like those shown from the earlier output of join, except that each line now includes the line number where the key was found in file1 and file2. Next, we want to re-establish the search order your original algorithm would have used, so we continue the pipeline with
sort -t = -k 2,2 -k 4,4 |
which sorts first on the file1 line number and then on the file2 line numbers. Finally, we need to efficiently emulate the assumption that a particular key, once consumed, cannot be re-used, in order to eliminate the unwanted matches in the original join output.
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'
This skips every joined record that reuses an input line (identified by its original line number) already consumed from either file, and otherwise prints the following:
a^b=-123=-523
a^3=-124=-524
a^b=-130=-530
This should be uniformly efficient even for quite large inputs, because the sorts are O(n log n), and everything else is O(n).
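For reference, here is the whole pipeline assembled in one place, exactly as described piece by piece above:
nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
join -t = -1 2 -2 2 file1.out file2.out |
    sort -t = -k 2,2 -k 4,4 |
    awk '
    BEGIN { OFS="="; FS="=" }
    $2 in seen2 || $4 in seen4 { next }
    { seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
    '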
Try this awk code and see if it is faster than yours (it could be a one-liner if you joined all the lines, but I think it is easier to read with some formatting):
awk -F'=' -v OFS="^" '
NR==FNR { sub(/=/,"^"); a[NR]=$0; t=NR; next }   # file1: store each line as key^value
{
    s = $1                                       # key from the current file2 line
    sub(/\^/, "\\^", s)                          # escape ^ so it matches literally
    for (i = 1; i <= t; i++) {                   # first unused file1 line with this key
        if (a[i] ~ s) {
            print a[i], $2                       # key^value1^value2
            delete a[i]                          # consume it so it cannot match again
            break
        }
    }
}' file1 file2
With your example, it outputs the expected result:
a^b^-123^-523
a^3^-124^-524
a^b^-130^-530
But I think the key issue here is performance, so give it a try.

Brocade alishow merge two consecutive lines awk sed

How would you join two consecutive lines using awk or sed?
For example, I have data like below:
abcd
12:12:12:12:12:12:12:12
efgh001_01
45:45:45:45:45:45:45:45
ijkl7464746
78:78:78:78:78:78:78:78
and I need output like below:
abcd 12:12:12:12:12:12:12:12
efgh001_01 45:45:45:45:45:45:45:45
ijkl7464746 78:78:78:78:78:78:78:78
This almost works, but I need a space or tab between the joined fields:
awk '!(NR%2){print$0p}{p=$0}'
You're almost there:
awk '(NR % 2 == 0) {print p, $0} {p = $0}'
With sed you can do that as follows:
sed -n 'N;s/\n/ /p' file
where:
N appends the next input line to the pattern space
s replaces the newline character with a space, joining the two lines
p prints the result
This might work for you:
sed '$!N;s/\n/ /' file
or this:
paste -sd' \n' file
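A related paste idiom reads the same input twice, so consecutive lines are paired up with a single space as the delimiter:
paste -d' ' - - < file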