Call a Perl script in AWK

I need to call a Perl script, passing in parameters, and capture its output in an AWK BEGIN block, like below.
I have a Perl script, util.pl:
#!/usr/bin/perl -w
$res=`$exe_cmd`;
print $res;
Now, in the AWK BEGIN block (this is inside a ksh script), I need to call the script and capture its output.
BEGIN {
    print "in awk, application type is " type
    # call the Perl script here
}
How do I call the Perl script with parameters and capture the value of $res? Something like:
res = util.pl a b c;

Pipe the script into getline:
awk 'BEGIN {
    cmd = "util.pl a b c";
    cmd | getline res;
    close(cmd);
    print "in awk, application type is " res
}'
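If the arguments come from awk variables rather than string literals, build the command by concatenation, and check getline's return value so a failed run doesn't go unnoticed. A sketch along the same lines (the -v type=... variable and the /dev/stderr redirection assume a reasonably modern awk such as gawk):

awk -v type="sometype" 'BEGIN {
    cmd = "util.pl " type " b c"
    if ((cmd | getline res) > 0)
        print "in awk, application type is " res
    else
        print "util.pl produced no output" > "/dev/stderr"
    close(cmd)
}'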

Part of an AWK script I use for extracting data from an LDAP query. Perhaps you can find some inspiration in how I do the base64 decoding below...
/^dn:/ {
    if ($0 ~ /^dn: /) {
        split($0, a, "[:=,]")
        name = a[3]
    }
    else if ($0 ~ /^dn::/) {
        # Special handling needed since ldap apparently
        # uses base64 encoded strings for *some* users
        cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
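        # NB: <<< (here-string) needs a command shell that supports it
        # (bash/ksh/zsh); the empty while loop below drains all of cmd's
        # output, leaving the last line read in result.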
        while ( ( cmd | getline result ) > 0 ) { }
        close(cmd)
        split(result, a, "[:=,]")
        name = a[2]
    }
}

Get value of autosplit delimiter?

If I run a script with perl -Fsomething, is that something value saved anywhere in the Perl environment where the script can find it? I'd like to write a script that by default reuses the input delimiter (if it's a string and not a regular expression) as the output delimiter.
Looking at the source, I don't think the delimiter is saved anywhere. When you run
perl -F, -an
the lexer actually generates the code
LINE: while (<>) {our @F=split(q\0,\0);
and parses it. At this point, any information about the delimiter is lost.
Your best option is to split by hand:
perl -ne'BEGIN { $F="," } @F=split(/$F/); print join($F, @F)' foo.csv
or to pass the delimiter as an argument to your script:
F=,; perl -F$F -sane'print join($F, @F)' -- -F=$F foo.csv
or to pass the delimiter as an environment variable:
export F=,; perl -F$F -ane'print join($ENV{F}, @F)' foo.csv
As @ThisSuitIsBlackNot says, it looks like the delimiter is not saved anywhere.
This is how perl.c stores the -F parameter:
case 'F':
    PL_minus_a = TRUE;
    PL_minus_F = TRUE;
    PL_minus_n = TRUE;
    PL_splitstr = ++s;
    while (*s && !isSPACE(*s)) ++s;
    PL_splitstr = savepvn(PL_splitstr, s - PL_splitstr);
    return s;
And then the lexer generates the code
LINE: while (<>) {our @F=split(q\0,\0);
However, this is of course compiled, and if you run it through B::Deparse you can see what is stored:
$ perl -MO=Deparse -F/e/ -e ''
LINE: while (defined($_ = <ARGV>)) {
    our(@F) = split(/e/, $_, 0);
}
-e syntax OK
Being Perl, there is always a way, however ugly (and this is some of the ugliest code I have written in a while):
use B::Deparse;
use Capture::Tiny qw/capture_stdout/;

our $f_var;   # package variable, so it survives the implicit while (<>) loop added by -n
unless ($f_var) {
    my $stdout = capture_stdout {
        my $sub = B::Deparse::compile();
        &{$sub}; # Have to capture stdout, since I won't bother to set up compile to return the text instead of printing
    };
    my (undef, $split_line, undef) = split(/\n/, $stdout, 3);
    ($f_var) = $split_line =~ /our\(\@F\) = split\((.*)\, \$\_\, 0\);/;
    print $f_var, "\n";
}
Output:
$ perl -Fe/\\\(\\[\\\<\\{\"e testy.pl
m#e/\(\[\<\{"e#
You could possibly traverse the bytecode instead, since the start will probably be identical every time until you reach the pattern.
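If you only need the pattern and not a reusable string, the same Deparse scraping can be done from the shell without the Capture::Tiny contortions. A rough sketch (the grep pattern is only a loose match for the shape of Deparse's output; the "syntax OK" message goes to stderr):

$ perl -MO=Deparse -F, -ane '' 2>/dev/null | grep -o 'split(.*, \$_, 0)'
split(/,/, $_, 0)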

sed command not working ...how to strip a pipe out

I have a data file I am trying to import into Redshift (a Postgres-based MPP database) with a '|' delimiter. But some of the data has '|' in the string data itself, for example:
73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"
So I tried this sed command:
sed -i -E "s/(.+|)(.+|)|/\1\2\\|/g" inputfile.txt >outputfile.txt
Any ideas what is wrong with the sed command? I want to replace the | in the last string with a \| escape so that Redshift will not view it as a delimiter. Any help is appreciated.
This might work for you (GNU sed):
sed -r ':a;s/^([^"]*("[^"|]*"[^"]*)*"[^"|]*)\|/\1/g;ta' file
This removes | within double quotes; however, it does not cater for quoted quotes, so beware!
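For example, applied to the sample record from the question:

$ echo '73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"' | sed -r ':a;s/^([^"]*("[^"|]*"[^"]*)*"[^"|]*)\|/\1/g;ta'
73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking  SuPra"

(note the doubled space where the quoted pipe was removed).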
There are some things you don't use sed for, and I'd say this is one of them. Try a Python script with the re library, or just plain string manipulation.
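The same plain string manipulation can also be sketched in awk, the tool already in play here: walk each line character by character, toggle an in-quotes flag on every double quote, and escape only the pipes that fall inside quotes (this assumes fields contain no escaped quotes):

awk '{
    out = ""
    inq = 0
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\"") inq = !inq
        if (c == "|" && inq) out = out "\\"
        out = out c
    }
    print out
}' inputfile.txt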
I think this C++ code does what you need.
// $ g++ -Wall -Wextra -std=c++11 main.cpp
#include <iostream>

int main(int, char*[]) {
    bool str = false;
    char c;
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    while (std::cin.get(c)) {
        if (c == '|') {
            if (str) {
                std::cout << '\\';
            }
        } else if (c == '"') {
            // Toggle string parsing.
            str = !str;
        } else if (c == '\\') {
            // Skip escaped chars.
            std::cout << c;
            std::cin.get(c);
        }
        std::cout << c;
    }
    return 0;
}
The problem with sed in this example is that you need to know more than the basics in order to track which state you are in (inside a string or not).
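If the code above is saved as main.cpp, compiling and running it might look like this (escapepipes is just a name made up for the binary):

$ g++ -Wall -Wextra -std=c++11 -o escapepipes main.cpp
$ ./escapepipes < inputfile.txt > outputfile.txt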
Here's a script for translating a file with pipe-separated values, as described, to one following the simpler conventions of a TSV file. It assumes the availability of a PHP interpreter. If the script is saved as psv2tsv and made executable in a Mac or Linux environment, then psv2tsv -h should offer more details.
Example usage (using <TAB> to indicate a TAB in the output):
$ psv2tsv <<< $'73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"'
73779087<TAB>UCGr4c0_zShyHbctxJJrJ03w<TAB>ItsMattSpeaking | SuPra<TAB>
$ psv2tsv <<< $'a|"b|c\t\d"|"e\n"'
a<TAB>b|c\t\d<TAB>e\n<TAB>
The script:
#!/usr/bin/env php
<?php
# Author: pkoppstein at gmail.com 12/2015
# Use at your own risk.
# Input - pipe-separated values along the lines of CSV.
# Translate embedded newline and tab characters.
function help() {
    global $argv;
    echo <<<EOT
Syntax: {$argv[0]} [filepath]
Convert a file or stream of records with pipe-separated values to the
TSV (tab-separated value) format. If no argument is specified, or if
filepath is specified as -, then input is taken from stdin.
The input is assumed to be like a CSV file but with pipe characters
(|) used instead of commas. The output follows the simpler
conventions of TSV files.
Note that each tab in the input is translated to "\\t", and each
embedded newline is translated to "\\n". Each translated record is
then written to stdout. See PHP's fgetcsv for further details.
EOT;
}

$file = ($argc > 1) ? $argv[1] : 'php://stdin';
if ($file == "-h" or $file == "--help") {
    help();
    exit;
}
if ($file == "-") $file = 'php://stdin';

$handle = @fopen($file, "r");
if ($handle) {
    while (($data = fgetcsv($handle, 0, "|")) !== FALSE) {
        $num = count($data);
        for ($c = 0; $c < $num; $c++) {
            # str_replace( mixed $search , mixed $replace , mixed $subject [, int &$count ] )
            echo str_replace("\t", "\\t", str_replace("\n", "\\n", $data[$c])) . "\t";
        }
        echo "\n";
    }
    fclose($handle);
}
else {
    echo "{$argv[0]}: unable to fopen $argv[1]\n";
    exit(1);
}
?>

hash using sha1sum using awk

I have a "pipe-separated" file that has about 20 columns. I want to hash just the first column, which is a number (an account number), using sha1sum, and return the rest of the columns as is.
What's the best way to do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is is up for debate. One way to do it with awk is
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
    OFS = FS  # set output field separator to field separator; we will use
              # it because we meddle with the fields.
}
NR == 1 {  # first line: just print headers.
    print
}
NR != 1 {  # from there on do the hash/replace
    # this constructs a shell command (and runs it) that echoes the field
    # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
    # and gets it back into awk with getline (into the variable hash)
    # the gsub bit is to prevent the shell from barfing if there's an apostrophe
    # in one of the fields.
    gsub(/'/, "'\\''", $1);
    command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
    command | getline hash
    close(command)

    # then replace the field and print the result.
    $1 = hash
    print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow breakdown:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function breakdown:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after the hash in the sha1sum output, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data if you use bash).
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }

function encodeData( fld, cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/, "", output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed, the additional arguments to your function make them local and so always null/zero on each invocation, the if on getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline), and the rest is maybe more style/preference with a bit of a performance improvement.
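One caveat that applies to both versions: echo appends a newline, so what actually gets hashed is the field plus "\n". If the consumer expects the SHA-1 of the bare string, the command can be built with printf instead; a one-line variation on the script above:

cmd = "printf '%s' \047" fld "\047 | sha1sum"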

Flags inside sed command file

I have a sed file that replaces all occurrences of a string in a file with another string.
I want to do it in place, but without passing -i on the command line.
What changes need to be made to the .sed file?
#!/bin/sed
s/include/\#include/
Just use awk:
{ sub(/include/,"#include"); rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
or if you want to operate strictly on strings:
BEGIN{ old="include"; new="#include" }
s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
{ rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
which can be simplified to:
s = index($0,"include") {$0 = substr($0,1,s-1) "#" substr($0,s)
{ rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
in this particular case of just prepending # to a string.
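Assuming the script above is saved as, say, commentout.awk (a name made up here), it would be run as:

awk -f commentout.awk file

Note that it buffers the whole file in rec before overwriting it, and since FILENAME is only consulted in the END block, it should be run on one file at a time.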
I don't think it will work, because the -i and -f options can usually both have arguments, but you could be lucky.
The shebang line can contain one option group (aka cluster). You would need it to contain -f (which is missing from your example), so the cluster could look like
#!/bin/sed -if
provided that your sed dialect doesn't require an argument to the -i option, and permits clustering like this in the first place (-if for -i -f).
The obvious workaround is to change the script to a shell script; the -f option is no longer required because the script is not in a file.
#!/bin/sh
exec sed -i 's/include/\#include/' "$@"

Piping output from awk to perl

I want to make an array in Perl with the values obtained from my awk script. Then I can do math on them in Perl.
Here is my Perl, which runs a program that saves a text file:
my $unix_command_dsc = (`./program -s test.fasta saved_file.txt`);
my $dsc_run = qx($unix_command_dsc);
Now I have some awk that parses the data saved in that text file:
#!/usr/bin/awk -f
BEGIN { # Initialize the values to zero. Note, done automatically also.
    sumc4 = 0
    sumc5 = 0
    sumc6 = 0
}
/^[1-9][0-9]* residue/ { next }  # Match line that begins with number and has word 'residue', skip it.
/^[1-9]/ {      # Match line that begins with number.
    sumc4 += $4 # Add up the values of the nth column into the variables.
    sumc5 += $5
    sumc6 += $6
    print $4 "\t" $5 "\t" $6  # This will show the whole columns.
}
END {
    print "sum H" "\t" "sum E" "\t" "sum C"
    print sumc4 "\t" sumc5 "\t" sumc6
}
I run this awk script from the terminal with the following command:
./awk_program.txt saved_file.txt
Any ideas how I would gather this data from the print statements in awk into arrays in perl?
What I've tried is to just run that awk script in perl:
my $unix_command_awk = (`./awk_program.txt saved_file.txt`);
my $awk_run = qx($unix_command_awk);
But perl gives me errors and "command not found" messages, as if it thinks the data are commands. Should there be a STDOUT in the awk that I'm missing, rather than print?
It should just be:
my $awk_run = `./awk_program.txt saved_file.txt`;
Backticks tell perl to run the command and return the output. So your assignment to $unix_command_awk is running the command, and then qx($unix_command_awk) executes the output as a new command.
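In list context, backticks also split the output into one element per line, which is the easiest way to get the awk results into a Perl array. A sketch (the index of the sums line follows from the awk END block above, which prints the totals last):

my @lines = `./awk_program.txt saved_file.txt`;
chomp @lines;
my ($sum_h, $sum_e, $sum_c) = split /\t/, $lines[-1];  # last line holds the three sums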
Pipe from awk to your perl script:
./awk_program file.txt | perl perl-script.pl
Then read from stdin inside the perl:
while (<>) {
    # do stuff with $_
    my @cols = split(/\t/);
}
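For instance, to collect every row for later math, the loop body might push one arrayref per line (a sketch; @rows is a name made up here):

while (<>) {
    chomp;                       # strip the trailing newline
    push @rows, [ split /\t/ ];  # each element is a reference to one row's columns
}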