Piping output from awk to perl

I want to make an array in Perl with the values obtained from my awk script. Then I can do math on them in Perl.
Here is my Perl, which runs a program, which saves a text file:
my $unix_command_dsc = (`./program -s test.fasta saved_file.txt`);
my $dsc_run = qx($unix_command_dsc);
Now I have some Awk that parses that data saved in the text file:
#!/usr/bin/awk -f
BEGIN{ # Initialize the values to zero. Note, done automatically also.
sumc4 = 0
sumc5 = 0
sumc6 = 0
}
/^[1-9][0-9]* residue/ {next} #Match line that begins with number and has word 'residue', skip it.
/^[1-9]/ { #Match line that begins with number.
sumc4 += $4 #Add up the values of the nth column into the variables.
sumc5 += $5
sumc6 += $6
print $4 "\t" $5 "\t" $6 #This will show the whole columns.
}
END{
print "sum H" "\t" "sum E" "\t" "sum C"
print sumc4 "\t" sumc5 "\t" sumc6
}
I run this awk script from the terminal with the following command:
./awk_program.txt saved_file.txt
Any ideas how I would gather this data from the print statements in awk into arrays in perl?
What I've tried is to just run that awk script in perl:
my $unix_command_awk = (`./awk_program.txt saved_file.txt`);
my $awk_run = qx($unix_command_awk);
But perl gives me errors and commands not found, like it thinks the data are commands. Should there be a STDOUT in the awk that I'm missing, rather than print?

It should just be:
my $awk_run = `./awk_program.txt saved_file.txt`;
Backticks tell perl to run the command and return the output. So your assignment to $unix_command_awk is running the command, and then qx($unix_command_awk) executes the output as a new command.
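If the goal is just to get those per-line columns into Perl arrays so you can do math on them, a minimal sketch along these lines should work (the array names, and stopping before the summary block that the awk END rule prints, are my additions, not part of the original scripts):
use strict;
use warnings;
use List::Util qw(sum);

# Run the awk script once and capture everything it prints.
my $awk_output = `./awk_program.txt saved_file.txt`;

my (@col_h, @col_e, @col_c);            # hypothetical names for columns 4, 5 and 6
for my $line (split /\n/, $awk_output) {
    last if $line =~ /^sum H/;          # stop before the summary lines from the awk END rule
    my @cols = split /\t/, $line;
    next unless @cols == 3;
    push @col_h, $cols[0];
    push @col_e, $cols[1];
    push @col_c, $cols[2];
}

# Do whatever math you need on the arrays, e.g. an average of column 4.
my $avg_h = @col_h ? sum(@col_h) / @col_h : 0;
print "average of column 4: $avg_h\n";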

Pipe from awk to your perl script:
./awk_program file.txt | perl perl-script.pl
Then read from stdin inside the perl:
while (<>) {
# do stuff with $_
my @cols = split(/\t/);
}

Related

Perl script throws syntax error for awk command

I have a file which contains each user's userid and password. I need to fetch the userid and password from that file by passing the userid as a search element, using an awk command.
user101,smith,smith#123
user102,jones,passj#007
user103,albert,albpass#01
I am using an awk command inside my perl script like this:
...
...
my $userid = $ARGV[0];
my $user_report_file = "report_file.txt";
my $data = `awk -F, '$1 ~ /$userid/ {print $2, $3}' $user_report_file`;
my ($user,$pw) = split(" ",$data);
...
...
Here I am getting the error:
awk: ~ /user101/ {print , }
awk: ^ syntax error
But if I run same command in terminal window its able to give result like below:
$] awk -F, '$1 ~ /user101/ {print $2, $3}' report_file.txt
smith smith#123
What could be the issue here?
The backticks are a double-quoted context, so you need to escape any literal $ that you want awk to interpret.
my $data = `awk -F, '\$1 ~ /$userid/ {print \$2, \$3}' $user_report_file`;
If you don't do that, you're interpolating the capture variables from the last successful Perl match.
When I have these sorts of problems, I try the command as a string first to see if it is what I expect:
my $data = "awk -F, '\$1 ~ /$userid/ {print \$2, \$3}' $user_report_file";
say $data;
Here's the Perl equivalent of that command:
$ perl -aF, -ne '$F[0]=~/101/ && print "@F[1,2]"' report_file
But, this is something you probably want to do in Perl instead of creating another process:
Interpolating data into external commands can go wrong, such as a filename that is foo.txt; rm -rf /.
The awk you run is the first one in the path, so someone can make that a completely different program (so use the full path, like /usr/bin/awk).
Taint checking can tell you when you are passing unsanitized data to the shell.
Inside a program you don't get all the shortcuts, but if this is the part of your program that is slow, you probably want to rethink how you are accessing this data because scanning the entire file with any tool isn't going to be that fast:
open my $fh, '<', $user_report_file or die;
while( <$fh> ) {
chomp;
my @F = split /,/;
next unless $F[0] =~ /\Q$userid/;
print "#F[1,2]";
last; # if you only want the first one
}

Merge two lines into one within a configuration file

I have several AIX systems with a configuration file, let's call it /etc/bar/config. The file may or may not have a line declaring values for foo. An example would be:
foo = A_1,GROUP_1,USER_1,USER_2,USER_3
The foo line may or may not be the same on all systems. Different systems may have different values and a different number of values. My task is to add "bare minimum" values to the config file on all systems. The bare minimum line will look like this.
foo = A_1,USER_1,SYS_1,SYS_2
If the line does not exist, I must create it. If the line does exist, I must merge the two lines. Using my examples, the result would be this. The order of the values does not matter.
foo = A_1,GROUP_1,USER_1,USER_3,USER_2,SYS_1,SYS_2
Obviously I want a script to do my work. I have the standard sh, ksh, awk, sed, grep, perl, cut, etc. Since this is AIX, I do not have access to the GNU versions of these utilities.
Originally, I had a script with these commands to replace the entire foo line.
cp /etc/bar/config /etc/bar/config.$$
sed "s/foo = .*/foo = A_1,USER_1,SYS_1,SYS_2/" /etc/bar/config.$$ > /etc/bar/config
But this simply replaces the line. It does not take into consideration any pre-existing configuration, nor does it handle the case where the line is missing entirely. And I'm doing other configuration modifications in the script, such as adding completely unique lines to other files and restarting a process, so I'd prefer this be some type of shell-based code snippet I can add to my change script. I am open to other options, especially if the solution is simpler.
Some dirty bash/sed:
#!/usr/bin/bash
input_file="some_filename"
v=$(grep -n '^foo *=' "$input_file")
lineno=$(cut -d: -f1 <<< "${v}0:")
base="A_1,USER_1,SYS_1,SYS_2,"
if [[ "$lineno" == 0 ]]; then
echo "foo = A_1,USER_1,SYS_1,SYS_2" >> "$input_file"
else
all=$(sed -n ${lineno}'s/^foo *= */'"$base"'/p' "$input_file" | \
tr ',' '\n' | sort | uniq | tr '\n' ',' | \
sed -e 's/^/foo = /' -e 's/, *$//' -e 's/ */ /g')
sed -i "${lineno}"'s/.*/'"$all"'/' "$input_file"
fi
Untested bash, etc.
config=/etc/bar/config
default=A_1,USER_1,SYS_1,SYS_2
pattern='^foo[[:blank:]]*=[[:blank:]]*' # shared with grep and sed
if current=$( grep "$pattern" "$config" | sed "s/$pattern//" ) && [ -n "$current" ]
then
new=$( echo "$current,$default" | tr ',' '\n' | sort | uniq | paste -sd, )
sed "s/$pattern.*/foo = $new/" "$config" > "$config.$$.tmp" &&
mv "$config.$$.tmp" "$config"
else
echo "foo = $default" >> "$config"
fi
A vanilla perl solution:
perl -i -lpe '
BEGIN {%foo = map {$_ => 1} qw/A_1 USER_1 SYS_1 SYS_2/}
if (s/^foo\s*=\s*//) {
$found=1;
$foo{$_}=1 for split /,/;
$_ = "foo = " . join(",", keys %foo);
}
END {print "foo = " . join(",", keys %foo) unless $found}
' /etc/bar/config
This Perl code will do as you ask. It expects the path to the file to be modified as a parameter on the command line.
Note that it reads the entire input file into the array @config and then overwrites the same file with the modified data.
It works by building a hash %values from a combination of the items already present in the foo = line and the list of default items in @defaults. The combination is sorted in alphabetical order and joined with commas.
use strict;
use warnings;
my @defaults = qw/ A_1 USER_1 SYS_1 SYS_2 /;
my ($file) = @ARGV;
my @config = <>;
open my $out_fh, '>', $file or die $!;
select $out_fh;
for ( @config ) {
if ( my ($pfx, $vals) = /^(foo \s* = \s* ) (.+) /x ) {
my %values;
++$values{$_} for $vals =~ /[^,\s]+/g;
++$values{$_} for #defaults;
print $pfx, join(',', sort keys %values), "\n";
}
else {
print;
}
}
close $out_fh;
output
foo = A_1,GROUP_1,SYS_1,SYS_2,USER_1,USER_2,USER_3
Since you didn't provide sample input and expected output I couldn't test this but this is the right approach:
awk '
/foo = / { old = ","$3; next }
{ print }
END {
split("A_1,USER_1,SYS_1,SYS_2"old,all,/,/)
for (i in all)
if (!seen[all[i]]++)
new = (new ? new "," : "") all[i]
print "foo =", new
}
' /etc/bar/config > tmp && mv tmp /etc/bar/config

hash using sha1sum using awk

I have a "pipe-separated" file that has about 20 columns. I want to just hash the first column which is a number like account number using sha1sum and return the rest of the columns as is.
Whats the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is is up for debate. One way to do it with awk is
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
OFS = FS # set output field separator to field separator; we will use
# it because we meddle with the fields.
}
NR == 1 { # first line: just print headers.
print
}
NR != 1 { # from there on do the hash/replace
# this constructs a shell command (and runs it) that echoes the field
# (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
# and gets it back into awk with getline (into the variable hash)
# the gsub bit is to prevent the shell from barfing if there's an apostrophe
# in one of the fields.
gsub(/'/, "'\\''", $1);
command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
command | getline hash
close(command)
# then replace the field and print the result.
$1 = hash
print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
cmd = sprintf( "echo %s | sha1sum", fld )
cmd | getline output
close( cmd )
split( output, arr, " " )
return arr[1]
}
Here's the flow break down:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function break down:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after sha1sum, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it with ./awk.script data (after making the file executable), or with awk -f awk.script data.
EDIT by EdMorton:
sorry for the edit, but your script above is the right approach but needs some tweaks to make it more robust and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }
function encodeData( fld, cmd, output ) {
cmd = "echo \047" fld "\047 | sha1sum"
if ( (cmd | getline output) > 0 ) {
sub(/ .*/,"",output)
}
else {
print "failed to hash " fld | "cat>&2"
output = fld
}
close( cmd )
return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed, the additional args for your function make them local and so always null/zero on each invocation, the if on getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline) and the rest is maybe more style/preference with a bit of a performance improvement.
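As an aside that none of the answers above use: if Perl is acceptable, the whole job can be done in one process with the core Digest::SHA module instead of shelling out to sha1sum for every line. A rough sketch, assuming the same pipe-separated layout with the account id in the first field (the script name is made up):
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

while (<>) {
    chomp;
    my @fields = split /\|/, $_, -1;     # -1 keeps trailing empty fields
    if ($. > 1) {
        # The shell versions hash the output of echo, i.e. the value plus a
        # trailing newline, so append "\n" here to get identical digests.
        $fields[0] = sha1_hex("$fields[0]\n");
    }
    print join('|', @fields), "\n";
}
Run it as perl hash_first_col.pl data > data.hashed (the filenames are placeholders).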

Call a Perl script in AWK

I have a problem where I need to call a Perl script, passing in parameters, and get the return value of the Perl script in an AWK BEGIN block, just like below.
I have a Perl script util.pl
#!/usr/bin/perl -w
$res=`$exe_cmd`;
print $res;
Now in the AWK BEGIN block (ksh) I need to call the script and get the return value.
BEGIN { print "in awk, application type is " type;
} \
{call perl script here;}
How do I call the Perl script with parameters and get the return value of $res?
res = util.pl a b c;
Pipe the script into getline:
awk 'BEGIN {
cmd = "util.pl a b c";
cmd | getline res;
close(cmd);
print "in awk, application type is " res
}'
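For completeness, here is one hypothetical shape util.pl could take so that the getline call above has something to read; the way it builds $exe_cmd from its arguments is purely an assumption:
#!/usr/bin/perl -w
use strict;

# Build the external command from the parameters awk passed in (a b c in the
# example). "echo @ARGV" is only a placeholder for the real command.
my $exe_cmd = "echo @ARGV";
my $res = `$exe_cmd`;
chomp $res;
print "$res\n";   # this single line of output is what awk's getline reads into res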
Part of an AWK script I use for extracting data from an ldap query. Perhaps you can find some inspiration from how I do the base64 decoding below...
/^dn:/{
if($0 ~ /^dn: /){
split($0, a, "[:=,]")
name=a[3]
}
else if($0 ~ /^dn::/){
# Special handling needed since ldap apparently
# uses base64 encoded strings for *some* users
cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
while ( ( cmd | getline result ) > 0 ) { }
close(cmd)
split(result, a, "[:=,]")
name=a[2]
}
}

variable for field separator in perl

In awk I can write: awk -F: 'BEGIN {OFS = FS} ...'
In Perl, what's the equivalent of FS? I'd like to write
perl -F: -lane 'BEGIN {$, = [what?]} ...'
update with an example:
echo a:b:c:d | awk -F: 'BEGIN {OFS = FS} {$2 = 42; print}'
echo a:b:c:d | perl -F: -ane 'BEGIN {$, = ":"} $F[1] = 42; print @F'
Both output a:42:c:d
I would prefer not to hard-code the : in the Perl BEGIN block, but refer to wherever the -F option saves its argument.
To sum up, what I'm looking for does not exist:
there's no variable that holds the argument for -F, and more importantly
Perl's "FS" is fundamentally a different data type (regular expression) than the "OFS" (string) -- it does not make sense to join a list of strings using a regex.
Note that the same holds true in awk: FS is a string but acts as regex:
echo a:b,c:d | awk -F'[:,]' 'BEGIN {OFS=FS} {$2=42; print}'
outputs "a[:,]42[:,]c[:,]d"
Thanks for the insight and workarounds though.
You can use perl's -s (similar to awk's -v) to pass a "FS" variable, but the split becomes manual:
echo a:b:c:d | perl -sne '
BEGIN {$, = $FS}
@F = split $FS;
$F[1] = 42;
print @F;
' -- -FS=":"
If you know the exact length of input, you could do this:
echo a:b:c:d | perl -F'(:)' -ane '$, = $F[1]; @F = @F[0,2,4,6]; $F[1] = 42; print @F'
If the input is of variable lengths, you'll need something more sophisticated than @F[0,2,4,6].
EDIT: -F seems to simply provide input to an automatic split() call, which takes a complete RE as an expression. You may be able to find something more suitable by reading the perldoc entries for split, perlre, and perlvar.
You can sort of cheat it, because perl is actually using the split function with your -F argument, and you can tell split to preserve what it splits on by including capturing parens in the regex:
$ echo a:b:c:d | perl -F'(:)' -ane 'print join("/", @F);'
a/:/b/:/c/:/d
You can see what perl's doing with some of these "magic" command-line arguments by using -MO=Deparse, like this:
$ perl -MO=Deparse -F'(:)' -ane 'print join("/", @F);'
LINE: while (defined($_ = <ARGV>)) {
our(@F) = split(/(:)/, $_, 0);
print join('/', @F);
}
-e syntax OK
You'd have to change your @F subscripts to double what they'd normally be ($F[2] = 42).
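If the number of fields varies, one workaround (my sketch, not part of the answer above) is to leave the captured separators in place: the data sits at the even indices, so you double the subscript of the field you want and then glue the list back together:
$ echo a:b:c:d:e | perl -F'(:)' -ane '$F[2] = 42; print join("", @F)'
a:42:c:d:e
Because the separators themselves are elements of @F, join("", @F) reassembles the line with the original delimiters no matter how many fields there are.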
Darnit...
The best I can do is:
echo a:b:c:d | perl -ne '$v=":"; @F = split("$v"); $F[1] = 42; print join("$v", @F) . "\n";'
You don't need the -F: this way, and you're only stating the colon once. I was hoping there was some way of setting variables on the command line like you can with Awk's -v switch.
For one liners, Perl is usually not as clean as Awk, but I remember using Awk before I knew of Perl and writing 1000+ line Awk scripts.
Trying things like this made people think Awk was either named after the sound someone made when they tried to decipher such a script, or stood for AWKward.
There is no variable in Perl that holds the -F field separator. You're basically emulating awk by using the -a and -F flags. If you really don't want to hard-code the value, then why not just use an environment variable?
$ export SPLIT=":"
$ perl -F$SPLIT -lane 'BEGIN { $, = $ENV{SPLIT}; } ...'