hash first column using sha1sum in awk

I have a pipe-separated file that has about 20 columns. I want to hash just the first column, which is a number (such as an account number), using sha1sum, and return the rest of the columns as is.
What's the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...

What the Best Way™ is, is up for debate. One way to do it with awk is:
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\  -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
    OFS = FS  # set output field separator to field separator; we will use
              # it because we meddle with the fields.
}
NR == 1 {     # first line: just print headers.
    print
}
NR != 1 {     # from there on, do the hash/replace.
    # This constructs a shell command (and runs it) that echoes the field
    # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the
    # hash and gets it back into awk with getline (into the variable hash).
    # The gsub bit is to prevent the shell from barfing if there's an
    # apostrophe in one of the fields.
    gsub(/'/, "'\\''", $1)
    command = ("echo '" $1 "' | sha1sum -b | cut -d\\  -f 1")
    command | getline hash
    close(command)

    # Then replace the field and print the result.
    $1 = hash
    print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell command (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.

Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow breakdown:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function breakdown:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after the hash in sha1sum's output, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data if you use bash).
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }

function encodeData( fld,      cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed; the additional arguments in the function declaration make cmd and output local, and so always null/zero on each invocation; the if on getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline); and the rest is maybe more style/preference, with a bit of a performance improvement.
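If the f[] header-mapping idiom is new to you, here it is in isolation; a minimal sketch (the file name f.awk and the choice of the Time column are just for illustration):
$ cat f.awk
BEGIN { FS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }   # map column names to positions
{ print $(f["Time"]) }                            # address a column by name, not number
$ awk -f f.awk filename
20140101021301
20140101041903
20140101050303
This way the script keeps working even if the columns are reordered, as long as the header names stay the same.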

search for a key value pair and append the value to other keys in unix

I need to search for a key and append the value to every key:value pair in a Unix file
Input file data:
1A:trans_ref_id|10:account_no|20:cust_name|30:trans_amt|40:addr
1A:trans_ref_id|10A:ccard_no|20:cust_name|30:trans_amt|40:addr
My desired Output:
account_no|1A:trans_ref_id
account_no|10:account_no
account_no|20:cust_name
account_no|30:trans_amt
account_no|40:addr
ccard_no|1A:trans_ref_id
ccard_no|10A:ccard_no
ccard_no|20:cust_name
ccard_no|30:trans_amt
ccard_no|40:addr
Basically, I need the value of 10 or 10A appended to every key:value pair and split into new lines. To be clear, this won't always be the second field.
I am new to sed, awk and perl. I started with extracting the value using awk:
awk -v FS="|" -v key="59" '$2 == key {print $2}' target.txt
I need the value of 10 or 10A appended to every key:value pair
Going by these requirements, you may try this awk:
awk '
BEGIN{FS=OFS="|"}
match($0, /\|10A?:[^|]+/) {
    s = substr($0, RSTART, RLENGTH)
    sub(/.*:/, "", s)
}
{
    for (i=1; i<=NF; ++i)
        print s, $i
}' file
account_no|1A:trans_ref_id
account_no|10:account_no
account_no|20:cust_name
account_no|30:trans_amt
account_no|40:addr
ccard_no|1A:trans_ref_id
ccard_no|10A:ccard_no
ccard_no|20:cust_name
ccard_no|30:trans_amt
ccard_no|40:addr
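In case match() is unfamiliar: on success it sets RSTART (the position where the match starts) and RLENGTH (its length), which is exactly what feeds substr() above. A minimal illustration (the sample string is hypothetical):
$ echo 'x|10A:ccard_no|y' | awk 'match($0, /\|10A?:[^|]+/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
2 14 |10A:ccard_no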
# Looks for 10 or 10A
perl -F'\|' -lane'my ($id) = map /^10A?:(.*)/s, @F; print "$id|$_" for @F'
# Looks for 10 or 10<non-digit><maybe more>
perl -F'\|' -lane'my ($id) = map /^10(?:\D[^:]*)?:(.*)/s, @F; print "$id|$_" for @F'
-n executes the program for each line of input.
-l removes LF on read and adds it on print.
-a splits the line on | (specified by -F) into @F.
The first statement extracts what follows : in the field with id 10 or 10-plus-something.
The second statement prints a line for each field.
See also: Specifying file to process to Perl one-liner.
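In short, the input file name simply goes after the one-liner; assuming the data is in a file named file:
perl -F'\|' -lane'my ($id) = map /^10A?:(.*)/s, @F; print "$id|$_" for @F' file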
If you are still stuck on where to get started, you will use a field-separator and output-field-separator (FS and OFS) set equal to '|' that will split each record into fields at each '|'. Your fields are available as $1, $2, ... $NF. You care about getting, e.g. account_no from field two ($2) so you split() field two with the separator ':' saving the split fields in an array (a used below). You want the second part from field two which will be in the 2nd array element a[2] to use as the new field-1 in output.
The rest is just looping over each field and outputting a[2], a separator, and then the current field. You can do that with:
awk 'BEGIN{FS=OFS="|"} {split ($2,a,":"); for(i=1;i<=NF;i++) print a[2],$i}' file
Example Use/Output
With your example input in file, the result would be:
account_no|1A:trans_ref_id
account_no|10:account_no
account_no|20:cust_name
account_no|30:trans_amt
account_no|40:addr
ccard_no|1A:trans_ref_id
ccard_no|10A:ccard_no
ccard_no|20:cust_name
ccard_no|30:trans_amt
ccard_no|40:addr
Which appears to be what you are after. Let me know if you have further questions.
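If split() is unfamiliar, a quick illustration of what it returns and fills in (using a field from your sample input):
$ echo '10:account_no' | awk '{ n = split($0, a, ":"); print n, a[1], a[2] }'
2 10 account_no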
"10" or "10A" at Unknown Field
You can handle the fields containing "10" and "10A" in any order. Just add a loop over the fields to determine which one holds "10" or "10A", and save the 2nd element of the array produced by split() for that field. The rest is the same, e.g.
awk '
BEGIN { FS=OFS="|" }
{
    for (i=1; i<=NF; i++) {
        split($i, a, ":")
        if (a[1]=="10" || a[1]=="10A") {
            key = a[2]
            break
        }
    }
    for (i=1; i<=NF; i++)
        print key, $i
}
' file1
Example Input
1A:trans_ref_id|10:account_no|20:cust_name|30:trans_amt|40:addr
1A:trans_ref_id|20:cust_name|30:trans_amt|10A:ccard_no|40:addr
Example Use/Output
awk '
> BEGIN { FS=OFS="|" }
> { for (i=1;i<=NF;i++){
> split ($i,a,":")
> if (a[1]=="10"||a[1]=="10A"){
> key=a[2]
> break
> }
> }
> for (i=1;i<=NF;i++)
> print key, $i
> }
> ' file1
account_no|1A:trans_ref_id
account_no|10:account_no
account_no|20:cust_name
account_no|30:trans_amt
account_no|40:addr
ccard_no|1A:trans_ref_id
ccard_no|20:cust_name
ccard_no|30:trans_amt
ccard_no|10A:ccard_no
ccard_no|40:addr
This picks up the proper new field 1 for the second line above from its 4th field, the one containing "10A".
Let me know if this is what you needed.
EDIT: To find the 10 or 10A value anywhere in the line and then print accordingly, try the following.
awk '
BEGIN{
    FS=OFS="|"
}
match($0,/(10|10A):[^|]*/){
    split(substr($0,RSTART,RLENGTH),arr,":")
}
{
    for(i=1;i<=NF;i++){
        print arr[2],$i
    }
}' Input_file
Explanation: Adding a detailed explanation for the above.
awk '                                       ##Starting awk program from here.
BEGIN{                                      ##Starting BEGIN section of this program.
  FS=OFS="|"                                ##Setting FS and OFS to | here.
}
match($0,/(10|10A):[^|]*/){                 ##Using match function to match either 10: till | OR 10A: till | here.
  split(substr($0,RSTART,RLENGTH),arr,":")  ##Splitting matched substring into array arr with delimiter of : here.
}
{
  for(i=1;i<=NF;i++){                       ##Running for loop for each field for each line.
    print arr[2],$i                         ##Printing 2nd element of arr, along with current field.
  }
}' Input_file                               ##Mentioning Input_file name here.
With your shown samples, please try the following (this simpler variant assumes the key:value pair to extract is always in field two):
awk '
BEGIN{
    FS=OFS="|"
}
{
    split($2,arr,":")
    print arr[2],$1
    for(i=2;i<=NF;i++){
        print arr[2],$i
    }
}
' Input_file
Perl script implementation
use strict;
use warnings;
use feature 'say';

my $fname = shift || die "run as 'script.pl input_file key0 key1 ... key#'";

open my $fh, '<', $fname or die $!;

while( <$fh> ) {
    chomp;
    my %data = split(/[:\|]/, $_);
    for my $key (@ARGV) {
        if( $data{$key} ) {
            say "$data{$key}|$_" for split(/\|/,$_);
        }
    }
}

close $fh;
Run as script.pl input_file 10 10A
Output
account_no|1A:trans_ref_id
account_no|10:account_no
account_no|20:cust_name
account_no|30:trans_amt
account_no|40:addr
ccard_no|1A:trans_ref_id
ccard_no|10A:ccard_no
ccard_no|20:cust_name
ccard_no|30:trans_amt
ccard_no|40:addr
Here's an alternate perl solution:
perl -pe '($id) = /(?<![^|])10A?:([^|]+)/; s/([^|]+)[|\n]/$id|$1\n/g'
($id) = /(?<![^|])10A?:([^|]+)/ captures the string after 10: or 10A: (only at the start of a field, thanks to the lookbehind) and saves it in the $id variable; the first such match in the line is the one captured.
s/([^|]+)[|\n]/$id|$1\n/g then prefixes every field with the value in $id and a | character, putting each field on its own line.

arrange the values in columns as per the value in the 1st column

I have a file with the following data:
cat text.txt
281475473926267,46,47
281474985385546,310,311
281474984889537,248,249
281475473926267,16,17
281474985385546,20,28
281474984889537,112,68
The values in the 1st column are duplicated in some places.
I want output as given below:
cat output.txt
281475473926267 16,17,46,47
281474985385546 20,28,310,311
281474984889537 68,112,248,249
It should print the unique values of column 1, then a space, then the respective values from the other columns on one line, arranged in ascending order.
I tried the below:
cat text.txt | perl -F, -lane ' $kv{$F[0]}{$F[1]}++; END { while(my($x,$y) = each(%kv)) { print "$x ",join(",",keys %$y) }}'
281474984889537 112,248
281474985385546 310,20
281475473926267 46,16
Here I am not able to print all the values in front of the value in the 1st column:
for 281474984889537 it should print 68,112,248,249, but it's printing only 112,248.
Also I am not sure how to arrange them in ascending order.
multi-step
$ awk -F, '{print $1,$2; print $1,$3}' file |
sort -k1n -k2n |
awk 'p!=$1{if(p) print p,a[p]; a[$1]=$2; p=$1; next}
{a[$1]=a[$1] "," $2}
END {print p,a[p]}' |
sort -k2n
281475473926267 16,17,46,47
281474985385546 20,28,310,311
281474984889537 68,112,248,249
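To see what the final awk stage consumes, here is the intermediate stream after the first awk and sort (deterministic for the sample input):
$ awk -F, '{print $1,$2; print $1,$3}' file | sort -k1n -k2n
281474984889537 68
281474984889537 112
281474984889537 248
281474984889537 249
281474985385546 20
281474985385546 28
281474985385546 310
281474985385546 311
281475473926267 16
281475473926267 17
281475473926267 46
281475473926267 47
The second awk collapses consecutive lines sharing a key into one comma-joined line, and the trailing sort -k2n orders the result by the leading value of each joined list, which is what produces the requested order.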
With GNU awk for true multi-dimensional arrays and sorted_in:
$ cat tst.awk
BEGIN { FS="," }
{
    for (i=2; i<=NF; i++) {
        keyVals[$1][$i]
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (key in keyVals) {
        vals = ""
        for (val in keyVals[key]) {
            vals = (vals == "" ? "" : vals ",") val
        }
        print key, vals
    }
}
$ awk -f tst.awk file
281474984889537 68,112,248,249
281474985385546 20,28,310,311
281475473926267 16,17,46,47
The above will work no matter how many fields you have on each line and it will remove duplicate values when they occur on multiple lines for the same key value.
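A standalone illustration of the GNU-awk-only sorted_in mechanism used above:
$ gawk 'BEGIN { a[10]; a[2]; a[1]; PROCINFO["sorted_in"] = "@ind_num_asc"; for (i in a) print i }'
1
2
10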
This might work for you (GNU sed):
sed -r 'H;x;s/((\n[^\n,]*),[^\n]*)(.*)\2([^\n]*)\n?/\1\4\3/;x;$!d;x;s/.//;:b;h;s/\n.*//;s/[^,]*,//;s/,/\n/g;s/.*/echo "&"|sort -n|paste -sd,/e;G;s/^([^\n]*)\n([^\n,]*),[^\n]*/\2 \1/;P;:c;tc;s/[^\n]*\n//;tb;d' file
The script works in two parts. In the first part, the lines of the file are held in memory and reduced in size by appending values with the same key to a single key line. At end-of-file the second part of the processing is enacted: each line is broken in two, the appended values are sorted and re-attached to the key, then printed and removed, until all the lines have been processed.
To correct your Perl-oneliner, use this.
$ cat text.txt
281475473926267,46,47
281474985385546,310,311
281474984889537,248,249
281475473926267,16,17
281474985385546,20,28
281474984889537,112,68
$ cat text.txt | perl -F, -lanE ' @t1=@{$kv{$F[0]}}; push(@t1,@F[1..2]); $kv{$F[0]}=[@t1]; END { while(my($x,$y) = each(%kv)) { print "$x ",join(",",@{$y}) }}'
281474985385546 310,311,20,28
281475473926267 46,47,16,17
281474984889537 248,249,112,68
$
When you have more columns, a small change on the above one-liner from 1..2 to 1..$#F will do the trick. Check this out
$ cat > text2.txt
281475473926267,46,47,49
281474985385546,310,311
281474984889537,248,249,311,677,213
281475473926267,16,17
281474985385546,20,28
281474984889537,112,68,54,78,324,67
$ cat text2.txt | perl -F, -lanE ' @t1=@{$kv{$F[0]}}; push(@t1,@F[1..$#F]); $kv{$F[0]}=[@t1]; END { while(my($x,$y) = each(%kv)) { print "$x ",join(",",@{$y}) }}'
281474984889537 248,249,311,677,213,112,68,54,78,324,67
281474985385546 310,311,20,28
281475473926267 46,47,49,16,17
$

Piping output from awk to perl

I want to make an array in Perl with the values obtained from my awk script. Then I can do math on them in Perl.
Here is my Perl, which runs a program, which saves a text file:
my $unix_command_dsc = (`./program -s test.fasta saved_file.txt`);
my $dsc_run = qx($unix_command_dsc);
Now I have some Awk that parses that data saved in the text file:
#!/usr/bin/awk -f
BEGIN{ # Initialize the values to zero. Note, done automatically also.
    sumc4 = 0
    sumc5 = 0
    sumc6 = 0
}
/^[1-9][0-9]* residue/ {next} # Match line that begins with number and has word 'residue', skip it.
/^[1-9]/ { # Match line that begins with number.
    sumc4 += $4 # Add up the values of the nth column into the variables.
    sumc5 += $5
    sumc6 += $6
    print $4 "\t" $5 "\t" $6 # This will show the whole columns.
}
END{
    print "sum H" "\t" "sum E" "\t" "sum C"
    print sumc4 "\t" sumc5 "\t" sumc6
}
I run this Awk from terminal with the following commands:
./awk_program.txt saved_file.txt
Any ideas how I would gather this data from the print statements in awk into arrays in perl?
What I've tried is to just run that awk script in perl:
my $unix_command_awk = (`./awk_program.txt saved_file.txt`);
my $awk_run = qx($unix_command_awk);
But perl gives me errors and "command not found" messages, as if it thinks the data are commands. Should there be a STDOUT in the awk that I'm missing, rather than print?
It should just be:
my $awk_run = `./awk_program.txt saved_file.txt`;
Backticks tell perl to run the command and return the output. So your assignment to $unix_command_awk is running the command, and then qx($unix_command_awk) executes the output as a new command.
Pipe from awk to your perl script:
./awk_program file.txt | perl perl-script.pl
Then read from stdin inside the perl:
while (<>) {
    # do stuff with $_
    my @cols = split(/\t/);
}

Call a Perl script in AWK

I have a problem where I need to call a Perl script, passing in parameters, and get the Perl script's return value in an AWK BEGIN block, just like below.
I have a Perl script util.pl
#!/usr/bin/perl -w
$res=`$exe_cmd`;
print $res;
Now in the AWK BEGIN block (ksh) I need to call the script and get the return value.
BEGIN { print "in awk, application type is " type;
} \
{call perl script here;}
How do I call the Perl script with parameters and get the return value $res?
res = util.pl a b c;
Pipe the script into getline:
awk 'BEGIN {
    cmd = "util.pl a b c";
    cmd | getline res;
    close(cmd);
    print "in awk, application type is " res
}'
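If the arguments come from awk variables rather than literals (e.g. the type variable from your BEGIN block), build the command string by concatenation; a minimal sketch, assuming type holds no shell metacharacters:
awk -v type=b 'BEGIN {
    cmd = "util.pl a " type " c"   # concatenate the variable into the command
    cmd | getline res
    close(cmd)
    print "in awk, application type is " res
}'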
Part of an AWK script I use for extracting data from an ldap query. Perhaps you can find some inspiration from how I do the base64 decoding below...
/^dn:/{
    if($0 ~ /^dn: /){
        split($0, a, "[:=,]")
        name=a[3]
    }
    else if($0 ~ /^dn::/){
        # Special handling needed since ldap apparently
        # uses base64 encoded strings for *some* users
        cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
        while ( ( cmd | getline result ) > 0 ) { }
        close(cmd)
        split(result, a, "[:=,]")
        name=a[2]
    }
}
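Note the while ( ( cmd | getline result ) > 0 ) loop: getline returns 1 per line read, 0 at end of input, and -1 on error, so the loop drains the command's whole output and result is left holding the last line. A minimal sketch of the idiom on its own (the printf command is just a stand-in):
awk 'BEGIN {
    cmd = "printf \"a\\nb\\nc\\n\""
    while ( ( cmd | getline line ) > 0 )
        print "got: " line
    close(cmd)
}'
got: a
got: b
got: c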

join 2 lines only if field-1 values are equal, with sed or awk

input file:
$ cat t.txt
id1;value1_1
id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1
id4;value4_2
id5;value5_1
result would be:
id1;value1_1;id1;value1_2
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
using sed or awk. Please give your opinion.
Here's one way to do it:
awk -F';' 'BEGIN { getline; id=$1; line=$0 } { if ($1 != id) { print line; line = $0; } else { line = line ";" $0; } id=$1; } END { print line; }' t.txt
Explanation:
Set field separator to ;:
-F';'
Start by reading the first line of input (getline), save the first field ($1) as id, and the first line ($0) as line:
BEGIN { getline; id=$1; line=$0 }
For each line of input, check if the first field differs from the stored id:
if ($1 != id)
If it does, then print the saved line and store the new one ($0):
print line; line = $0;
Otherwise, append the new line to the stored line(s):
line = line ";" $0;
And save the new id:
id=$1
At the end, print whatever is left in line:
END { print line; }
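Run against t.txt this prints the following (note that the id2 line is kept; see the next answer about it apparently being missing from your expected result):
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1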
I guess in your result example, the id2; line is missing by mistake, right?
Anyway, you could try the awk line below:
awk -F';' '{a[$1]=($1 in a)?a[$1]";"$0:$0}END{for(x in a)print a[x]}' yourFile|sort
output would be:
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1
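The heart of this is the append-or-initialize ternary; in isolation (with a hypothetical two-line input):
$ printf 'k;1\nk;2\n' | awk -F';' '{a[$1]=($1 in a)?a[$1]";"$0:$0} END{print a["k"]}'
k;1;k;2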
This might work for you:
sed -e '1{h;d};H;${x;:a;s/\(\([^;]*;\)\([^\n]*\)\)\n\2/\1;\2/;ta;p};d' t.txt
Explanation:
Slurp the file into the hold space (HS), then on end-of-file swap to the HS and, using substitution, concatenate lines with duplicate keys and print. N.B. the lines that would normally be printed are all deleted.
EDIT:
The above solution works (as far as I know) but for large volumes is not very fast (read incredibly slow). This solution is better:
# cat -A /tmp/t.txt
id1;value1_1$
id1;value1_2$
id2;value2_1$
id3;value3_1$
id4;value4_1$
id4;value4_2$
id5;value5_1$
# for x in {1..1000};do cat /tmp/t.txt;done |
> sed ':a;$!N;/^\([^;]*;\).*\n\1/s/\n//;ta;P;D'| sort | uniq
id1;value1_1;id1;value1_2
id2;value2_1
id3;value3_1
id4;value4_1;id4;value4_2
id5;value5_1