Awk: Match data in 2 files with duplicate keys - perl

I have 2 files
file1
a^b=-123
a^3=-124
c^b=-129
a^b=-130
and file2
a^b=-523
a^3=-524
a^b=-530
I want to look up the key using '=' as the delimiter and get the following output:
a^b^-123^-523
a^b^-130^-530
a^3^-124^-524
When there were no duplicate keys, it was easy to do in awk by mapping the first file and looping over the second; with duplicates, however, it's slightly more difficult. I tried something like this:
awk -F"=" '
FNR == NR {
arr[$1 "^" $2] = $2;
next;
}
FNR < NR {
for (i in arr) {
match(i, , /^(.*\^.*)\^([-0-9]*)$/, , ar);
if ($1 == ar[1]) {
if ($2 in load == 0) {
if (ar[2] in l2 == 0) {
l2[ar[2]] = ar[2];
load[$2] = $2;
print i "^" $2
}
}
}
}
}
' file1 file2
This works just fine; however, not surprisingly, it's extremely slow. On a file with about 600K records, it ran for 4 hours.
Is there a better, more efficient way to do this in awk or perl? If possible, a one-liner would be a great help.
Thanks.

You might want to look at the join command, which does something very much like what you're doing here, but generates a full database-style join. For example, assuming file1 and file2 contain the data you show above, the commands
$ sort -o file1.out -t = -k 1,1 file1
$ sort -o file2.out -t = -k 1,1 file2
$ join -t = file1.out file2.out
produce the output
a^3=-124=-524
a^b=-123=-523
a^b=-123=-530
a^b=-130=-523
a^b=-130=-530
The sorts are necessary because, to be efficient, join requires its input files to be sorted on the keys being compared. Note, though, that this generates the full cross-product join, which appears not to be what you want.
(Note: The following is a very shell-heavy solution, but you could cast it fairly easily into any programming language with dynamic arrays and a built-in sort primitive. Unfortunately, awk isn't one of those, but perl and python are, as are, I'm sure, just about all newer scripting languages.)
It seems that you really want each instance of a key to be consumed the first time it's emitted in any output. You can get this as follows, again starting with the original contents of file1 and file2.
$ nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
$ nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
This decorates each line with its original line number so that we can recover the original order later, and then sorts on the key for join. In the C locale, file1.out now looks roughly like this (line number first, then the key and value):
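000002=a^3=-124
000001=a^b=-123
000004=a^b=-130
000003=c^b=-129
The remainder of the work is a short pipeline, which I've broken up into multiple blocks so it can be explained as we go.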
join -t = -1 2 -2 2 file1.out file2.out |
This command joins on the key names, now in field two, and emits records like those shown in the earlier join output, except that each line now also carries the line numbers where the key was found in file1 and file2. Next, we want to re-establish the search order your original algorithm would have used, so we continue the pipeline with
sort -t = -k 2,2 -k 4,4 |
which sorts first on the file1 line number and then on the file2 line number. Finally, we need to efficiently emulate the assumption that a particular key, once consumed, cannot be reused, in order to eliminate the unwanted matches in the original join output.
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'
This ignores every line that references a previously seen key from either file, and otherwise prints the following:
a^b=-123=-523
a^3=-124=-524
a^b=-130=-530
This should be uniformly efficient even for quite large inputs, because the sorts are O(n log n), and everything else is O(n).
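For convenience, here is the whole solution in one place, assembled from exactly the pieces above:
nl -s = -n rz file1 | sort -t = -k 2,2 > file1.out
nl -s = -n rz file2 | sort -t = -k 2,2 > file2.out
join -t = -1 2 -2 2 file1.out file2.out |
sort -t = -k 2,2 -k 4,4 |
awk '
BEGIN { OFS="="; FS="=" }
$2 in seen2 || $4 in seen4 { next }
{ seen2[$2]++; seen4[$4]++; print $1,$3,$5 }
'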

Try this awk code and see if it is faster than yours (it could be a one-liner if you joined all the lines, but I think with formatting it is easier to read):
awk -F'=' -v OFS="^" '
NR==FNR {                     # first file: turn "key=value" into "key^value"
    sub(/=/,"^"); a[NR]=$0; t=NR; next
}
{
    s=$1                      # key from the second file
    sub(/\^/,"\\^",s)         # escape the ^ so it is literal in the regex
    for(i=1;i<=t;i++){        # scan file1 records in their original order
        if(a[i]~s){           # first still-unconsumed record with this key
            print a[i],$2
            delete a[i]       # consume it so duplicate keys pair up in order
            break
        }
    }
}' file1 file2
With your example, it outputs the expected result:
a^b^-123^-523
a^3^-124^-524
a^b^-130^-530
But I think the key here is performance, so give it a try. If the linear scan is still too slow at 600K records, see the sketch below.
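A possible refinement (my sketch, not part of the original answer): index file1 by key plus an occurrence counter, so each file2 line becomes a direct hash lookup instead of a scan, making the whole job O(n+m):
awk -F'=' -v OFS='^' '
NR==FNR { a[$1, ++c1[$1]] = $2; next }     # file1: value per (key, occurrence)
($1, ++c2[$1]) in a { print $1, a[$1, c2[$1]], $2 }
' file1 file2
With the sample data this pairs the duplicate keys up in order of appearance, just like the loop above.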

Related

bash while loop failing while using sed

I am facing an issue with sed in a while loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string matches, replace the matched string in file1 with the file2 string.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. The while loop reads each line of file1: that is N operations. Then, for every line it processes, the sed -i rewrites the full file1, making it an N*N process. And inside the sed command you grep file2 every time; if file2 has M lines, this becomes an N*N*M process. That is very inefficient.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update writes nothing to standard output, so file3 will be empty.
You are reading file1 with the while loop and at the same time updating file1 with sed. I don't know how these will interact, but I don't believe it is healthy.
If $b is not in file2, then according to your logic you would end up with a line containing only a single column. This is not what you expect.
A fix for your script would be this:
while read -r a b; do
c=$(grep "$b" file2)
[[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
which is still not efficient, but it is already down to M*N. The best way is to use awk; a sketch follows.
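A minimal awk version of that idea (my sketch, not part of the original answer): load file2 into a lookup set once, then for each file1 line print the first file2 line that contains $b, emulating the grep. Note that awk's for (s in ...) order is unspecified, so if several file2 lines matched, an arbitrary one would win:
awk '
NR == FNR { lines[$0]; next }        # pass 1: remember every file2 line
{                                    # pass 2: each file1 line
    for (s in lines)                 # emulate grep "$b" file2
        if (index(s, $2)) { print $1, s; next }
}' file2 file1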
note: as a novice, always parse your script with http://www.shellcheck.net
note: as a professional, always parse your script with http://www.shellcheck.net
Could you please try the following:
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one-liner form of the solution:
awk '
FNR==NR{
a[$2]=$1
next
}
{
for(i in a){
if(match($0,"^"i)){
print a[i],$0
continue
}
}
}
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk code from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
a[$2]=$1 ##Creating array a whose index is $2 and value is $1.
next ##next will skip all further statements from here.
}
{ ##Statements from here will run for 2nd Input_file only.
for(i in a){ ##Traversing through array a all elements here.
if(match($0,"^"i)){ ##Checking condition if current line matches index of current item from array a then do following.
print a[i],$0 ##Printing array a whose index is i and current line here.
continue ##Again take cursor to for loop.
}
}
}
' Input_file1 Input_file2 ##Mentioning all Input_file names here.

Required suggestions to optimize a piece of unix ksh code

I'm new to shell scripting and am hoping for some guidance on how to optimize the following piece of code to avoid unnecessary loops.
The file "DD.$BUS_DT.dat" is a pipe-delimited file containing 4 columns. Sample data in DD.2015-05-19.dat looks as follows:
cust portal|10|10|0
sys-b|10|10|0
Code
i=0;
sed 's/|//g;s/[0-9]//g' ./DD.$BUS_DT.dat > ./temp-processed.dat
set -A sourceList
while read line
do
#echo $line
case $line in
'cust portal') sourceList[$i]=custportal;;
*) sourceList[$i]=${line};;
esac
(( i += 1));
done < ./temp-processed.dat;
echo ${sourceList[@]};
i=0;
while [[ i -lt ${#sourceList[@]} ]]; do
print ${sourceList[i]} >> ./processed-$BUS_DT.dat
(( i += 1))
done
My goal is to read the data from the first column of the file, without spaces, so that the output looks like ...
custportal
sys-b
Your help will be appreciated.
I haven't gone through all of your script, but if you just want to get the first column of the |-separated data, stripping any spaces it contains, you can use awk like this:
$ awk -F"|" '{gsub(" ","",$1); print $1}' file
custportal
sys-b
This uses | as the field separator and replaces all the spaces in the first field with an empty string. Then, it prints it.
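Applied to your actual script, the whole read loop could collapse to a single line (a sketch using your filenames):
awk -F'|' '{gsub(/ /,"",$1); print $1}' "./DD.$BUS_DT.dat" > "./processed-$BUS_DT.dat"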

comparing columns of multiple files using shell

I have to compare two files based on the first column; if they match, print the second columns of file1 and file2 on the same line.
file 1
1cu2 pf00959
3nnr pf00440
3nnr pf13972
2v89 pf13341
4aqb pf00431
4aqb pf00431
4aqb pf07645
4aqb pf00084
2liv pf13458
2liv pf01094
file 2
1cu2 d.2.1.3
2v89 g.50.1.2
2v89 g.50.1.2
2liv c.93.1.1
2liv c.93.1.1
1q2w b.47.1.4
1q2w b.47.1.4
1rgh d.1.1.2
1rgh d.1.1.2
1zxl c.2.1.2
output
1cu2 pf00959 d.2.1.3
2v89 pf13341 g.50.1.2
2liv pf13458 c.93.1.1
Assuming you're using more than plain Bourne (/bin/sh), you can do this in a one-liner:
$ join <(sort -u file1) <(sort -u file2)
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2liv pf13458 c.93.1.1
2v89 pf13341 g.50.1.2
If you're actually writing a shell script for /bin/sh, you'll need temporary files, e.g.
$ sort file1 > file1-sorted
$ sort file2 > file2-sorted
$ join file1-sorted file2-sorted
Update: Your example output has one hit per key, even though 2liv has two values in file1. To accomplish this, you need to run the output through a post-processor that notes the duplicates:
$ join <(sort -u file1) <(sort -u file2) |awk '!done[$1,$3] { print; done[$1,$3] = 1 }'
1cu2 pf00959 d.2.1.3
2liv pf01094 c.93.1.1
2v89 pf13341 g.50.1.2
This uses a simple hash in awk. The sort -u items already got rid of the duplicates from file1 (the second column of the final output), so we're merely looking for the first unique pairing of the keys with the values from file2 (the first and third columns). If we find a new pair, the line is printed and the pair is saved so it won't print on its next hit.
Note that this is not sorted the way your sample output was. That would be nontrivial (you'd need a third job just to determine the original order and then map things to it).

Using the bash sort command with variable-length filenames

I am trying to numerically sort, by the '789' field, a series of files output by the ls command that match either the pattern ABCDE1234A1789.RST.txt or ABCDE12345A1789.RST.txt.
In the example patterns above, ABCDE is the same for all files; 1234 or 12345 are digits that vary but are always either 4 or 5 digits in length. A1 is the same length for all files, but its value can vary, so unfortunately it can't be used as a delimiter. Everything after the first . is the same for all files. Something like:
ls -l *.RST.txt | sort -k +9.13 | awk '{print $9} ' > file-list.txt
will match the shorter filenames but not the longer ones, because of the variable number of characters before the field I want to sort by.
Is there a way to sort all the files without first padding the shorter filenames to make them all the same length?
Perl to the rescue!
perl -e 'print "$_\n" for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
If your perl is more recent (5.10 or newer), you can shorten it to
perl -E 'say for sort { substr($a, -11, 3) cmp substr($b, -11, 3) } glob "*.RST.txt"'
Because of the parts of the filename which you've identified as unchanging, you can actually build a key which sort will use:
$ echo ABCDE{99999,8765,9876,345,654,23,21,2,3}A1789.RST.txt \
| fmt -w1 \
| sort -tE -k2,2n --debug
ABCDE2A1789.RST.txt
_
___________________
ABCDE3A1789.RST.txt
_
___________________
ABCDE21A1789.RST.txt
__
etc.
What this does is tell sort to split fields on the character E and then sort on the 2nd field numerically. --debug arrived in coreutils 8.6 and can be very helpful in seeing exactly what sort is doing.
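Applied to the real file list, the same key specification might look like this (a sketch; assumes GNU sort and no newlines in the filenames):
printf '%s\n' *.RST.txt | sort -t E -k 2,2n > file-list.txt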
The conventional way to do this in bash is to extract your sort field. Except for the sort command itself, the following is implemented in pure bash:
sort_names_by_first_num() {
    shopt -s extglob
    for f; do
        first_num="${f##+([^0-9])}"        # strip the leading non-digits
        first_num=${first_num%%[^0-9]*}    # strip everything after the first number
        [[ $first_num ]] && printf '%s\t%s\n' "$first_num" "$f"
    done | sort -n | while IFS='' read -r name; do
        name=${name#*$'\t'}                # remove the sort key again
        printf '%s\n' "$name"
    done
}
sort_names_by_first_num *.RST.txt
That said, newline-delimiting filenames (as this question seems to call for) is a bad practice: Filenames on UNIX filesystems are allowed to contain newlines within their names, so separating them by newlines within a list means your list is unable to contain a substantial subset of the range of valid names. It's much better practice to NUL-delimit your lists. Doing that would look like so:
sort_names_by_first_num() {
    shopt -s extglob
    for f; do
        first_num="${f##+([^0-9])}"        # strip the leading non-digits
        first_num=${first_num%%[^0-9]*}    # strip everything after the first number
        [[ $first_num ]] && printf '%s\t%s\0' "$first_num" "$f"
    done | sort -n -z | while IFS='' read -r -d '' name; do
        name=${name#*$'\t'}                # remove the sort key again
        printf '%s\0' "$name"
    done
}
sort_names_by_first_num *.RST.txt
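To consume that NUL-delimited output safely later, read it back with the same -d '' convention, for example:
while IFS= read -r -d '' name; do
    printf 'got: %q\n' "$name"             # ...or whatever per-file processing you need
done < <(sort_names_by_first_num *.RST.txt)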

Swap two columns - awk, sed, python, perl

I've got data in a large file (280 columns wide, 7 million lines long!) and I need to swap the first two columns. I think I could do this with some kind of awk for loop that prints $2, $1, then a range to the end of the line - but I don't know how to do the range part, and I can't write out $2, $1, $3...$280! Most of the column-swap answers I've seen here are specific to small files with a manageable number of columns, so I need something that doesn't depend on specifying every column number.
The file is tab delimited:
Affy-id chr 0 pos NA06984 NA06985 NA06986 NA06989
You can do this by swapping the values of the first two fields:
awk ' { t = $1; $1 = $2; $2 = t; print; } ' input_file
I tried perreal's answer with Cygwin on a Windows system with a tab-separated file. It didn't work, because the default separator is the space.
If you encounter the same problem, try this instead:
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file
The input separator is defined by -F $'\t' and the output separator by OFS=$'\t'.
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file > output_file
Try this, more relevant to your question:
awk '{printf("%s\t%s\n", $2, $1)}' inputfile
This might work for you (GNU sed):
sed -i 's/^\([^\t]*\t\)\([^\t]*\t\)/\2\1/' file
Have you tried using the cut command? E.g.
cat myhugefile | cut -c10-20,1-9,21- > myrearrangedhugefile
This is also easy in perl:
perl -pe 's/^(\S+)\t(\S+)/$2\t$1/;' file > outputfile
You could do this in Perl:
perl -F\\t -nlae 'print join("\t", @F[1,0,2..$#F])' inputfile
The -F option specifies the field delimiter. In most shells you need to precede a backslash with another backslash to escape it. In recent perls (5.20 and newer), -F automatically implies -n and -a, so they can be dropped.
For your problem you wouldn't need to use -l, because the last column appears last in the output. But in a different situation, if the last column needs to appear between other columns, the newline character must be removed. The -l switch takes care of this.
The "\t" in join can be changed to anything else to produce a different delimiter in the output.
2..$#F specifies a range from the third column (index 2) through the last. As you might have guessed, inside the square brackets you can put any single column or range of columns in the desired order.
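For instance (a hypothetical example, not from the original answer), rotating the last column to the front is a case where -l matters, since the old last field would otherwise drag its newline into the middle of the line:
perl -F'\t' -lane 'print join("\t", @F[$#F, 0..$#F-1])' inputfile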
No need to call anything else but your shell:
bash> while read col1 col2 rest; do
echo $col2 $col1 $rest
done <input_file
Test:
bash> echo "first second a c d e f g" |
while read col1 col2 rest; do
echo $col2 $col1 $rest
done
second first a b c d e f g
Maybe even with "inlined" Python - as in a Python script within a shell script - but only if you want to do some more scripting with Bash beforehand or afterwards... Otherwise it is unnecessarily complex.
Content of script file process.sh:
#!/bin/bash
# inline Python script
read -r -d '' PYSCR << EOSCR
from __future__ import print_function
import codecs
import sys
encoding = "utf-8"
fn_in = sys.argv[1]
fn_out = sys.argv[2]
# print("Input:", fn_in)
# print("Output:", fn_out)
with codecs.open(fn_in, "r", encoding) as fp_in, \
        codecs.open(fn_out, "w", encoding) as fp_out:
    for line in fp_in:
        # split into two columns and rest
        col1, col2, rest = line.split("\t", 2)
        # swap columns in output
        fp_out.write("{}\t{}\t{}".format(col2, col1, rest))
EOSCR
# ---------------------
# do setup work?
# e. g. list files for processing
inputfile="$1"     # assumption: the original never set these; take them from the command line
outputfile="$2"
# call python script with params
python3 -c "$PYSCR" "$inputfile" "$outputfile"
# do some more processing
# e. g. rename outputfile to inputfile, ...
If you only need to swap the columns for a single file, then you can also just create a single Python script and statically define the filenames. Or just use an answer above.
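With the command-line parameters assumed above, the call would be something like:
bash process.sh input.tsv output.tsv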
awk swapping sans temp variable:
echo '777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541' |
mawk '1; ($1 = $2 substr(_, ($2 = $1)^_))^_' FS=':' OFS=':'
777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541
317 647 14423 262927714037 :777777744444444464449: 0x2A29D5A1BAA7A95541