I have a data file I am trying to import into Redshift (a Postgres-based MPP database) using '|' as the delimiter. But some of the data has the '|' in the string data itself, for example:
73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"
So I tried this sed command:
sed -i -E "s/(.+|)(.+|)|/\1\2\\|/g" inputfile.txt >outputfile.txt
Any ideas on what is wrong with the sed command? The goal is to replace the | inside the last string with a \| escape so that Redshift will not view it as a delimiter. Any help is appreciated.
This might work for you (GNU sed):
sed -r ':a;s/^([^"]*("[^"|]*"[^"]*)*"[^"|]*)\|/\1/g;ta' file
This removes | within double quotes; however, it does not cater for quoted quotes, so beware!
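For example, fed the sample line from the question, the command should behave like this (an untested sketch of the expected transcript; note the two spaces left behind where the pipe was removed):
$ echo '73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"' |
  sed -r ':a;s/^([^"]*("[^"|]*"[^"]*)*"[^"|]*)\|/\1/g;ta'
73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking  SuPra"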
There are some things you don't use sed for, and I'd say this is one of those things. Try using a Python script with the re library, or just plain string manipulation.
I think this C++ code does what you need.
// $ g++ -Wall -Wextra -std=c++11 main.cpp
#include <iostream>

int main(int, char*[]) {
    bool str = false;
    char c;
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);
    while (std::cin.get(c)) {
        if (c == '|') {
            if (str) {
                std::cout << '\\';
            }
        } else if (c == '"') {
            // Toggle string parsing.
            str = !str;
        } else if (c == '\\') {
            // Skip escaped chars.
            std::cout << c;
            std::cin.get(c);
        }
        std::cout << c;
    }
    return 0;
}
The problem with sed in this example is that you need to know more than the basics in order to track which state you are in (inside a string or not).
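A possible session, assuming the source is saved as main.cpp and compiled to pipe_escape (both hypothetical names):
$ g++ -Wall -Wextra -std=c++11 -o pipe_escape main.cpp
$ echo '73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"' | ./pipe_escape
73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking \| SuPra"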
Here's a script for translating a file with pipe-separated values, as described, to one following the simpler conventions of a TSV file. It assumes the availability of a PHP interpreter. If the script is saved as psv2tsv and made executable in a Mac or Linux environment, then psv2tsv -h should offer more details.
Example usage (using <TAB> to indicate a TAB in the output):
$ psv2tsv <<< $'73779087|"UCGr4c0_zShyHbctxJJrJ03w"|"ItsMattSpeaking | SuPra"'
73779087<TAB>UCGr4c0_zShyHbctxJJrJ03w<TAB>ItsMattSpeaking | SuPra<TAB>
$ psv2tsv <<< $'a|"b|c\t\d"|"e\n"'
a<TAB>b|c\t\d<TAB>e\n<TAB>
The script:
#!/usr/bin/env php
<?php
# Author: pkoppstein at gmail.com 12/2015
# Use at your own risk.
# Input - pipe-separated values along the lines of CSV.
# Translate embedded newline and tab characters.
function help() {
global $argv;
echo <<<EOT
Syntax: {$argv[0]} [filepath]
Convert a file or stream of records with pipe-separated values to the
TSV (tab-separated value) format. If no argument is specified, or if
filepath is specified as -, then input is taken from stdin.
The input is assumed to be like a CSV file but with pipe characters
(|) used instead of commas. The output follows the simpler
conventions of TSV files.
Note that each tab in the input is translated to "\\t", and each
embedded newline is translated to "\\n". Each translated record is
then written to stdout. See PHP's fgetcsv for further details.
EOT;
}
$file = ($argc > 1) ? $argv[1] : 'php://stdin';

if ($file == "-h" or $file == "--help") {
    help();
    exit;
}

if ($file == "-") $file = 'php://stdin';

$handle = @fopen($file, "r");
if ($handle) {
    while (($data = fgetcsv($handle, 0, "|")) !== FALSE) {
        $num = count($data);
        for ($c = 0; $c < $num; $c++) {
            # str_replace( mixed $search , mixed $replace , mixed $subject [, int &$count ] )
            echo str_replace("\t", "\\t", str_replace("\n", "\\n", $data[$c])) . "\t";
        }
        echo "\n";
    }
    fclose($handle);
} else {
    echo "{$argv[0]}: unable to fopen $argv[1]\n";
    exit(1);
}
?>
I want to execute some commands in the terminal. I create them in Swift 3.0 and write them to a command file. But some special characters cause problems, e.g. a single quote:
mv 'Don't do it.txt' 'Don_t do it.txt'
I use single quotes to cover other special characters. But what about single quotes themselves? How can I convert them in a way that every possible filename is handled correctly?
Your question is strange:
In this case we would be writing a shell script rather than a text file.
You are replacing single quotes in the output file name, but not spaces, which should also be replaced.
Here is a solution that gives proper escaping for the input files, and proper replacing (read: spaces too) for the output files:
#!/usr/bin/awk -f
BEGIN {
    mi = "\47"                        # single quote
    no = "[^[:alnum:]%+,./:=@_-]"     # characters that need quoting/replacing
    print "#!/bin/sh"
    while (++os < ARGC) {
        # split each file name on single quotes
        pa = split(ARGV[os], qu, mi)
        printf "mv "
        for (ro in qu) {
            # quote a piece only if it contains unsafe characters
            printf "%s", match(qu[ro], no) ? mi qu[ro] mi : qu[ro]
            # re-join the pieces with an escaped single quote
            if (ro < pa) printf "\\" mi
        }
        # destination name: unsafe characters become underscores
        gsub(no, "_", ARGV[os])
        print FS ARGV[os]
    }
}
Result:
#!/bin/sh
mv 'dont do it!.txt' dont_do_it_.txt
mv Don\''t do it.txt' Don_t_do_it.txt
mv dont-do-it.txt dont-do-it.txt
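The key trick above is the standard shell rule that a single quote can never appear inside a single-quoted string: you close the quoted section, emit an escaped quote (\' or '\''), and reopen. A minimal stand-alone sketch of the same idea (shellquote is a hypothetical helper, not part of the awk script):
# Wrap a string in single quotes, turning each embedded single
# quote into the close-escape-reopen sequence '\''.
shellquote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

shellquote "Don't do it.txt"    # -> 'Don'\''t do it.txt'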
I have several AIX systems with a configuration file, let's call it /etc/bar/config. The file may or may not have a line declaring values for foo. An example would be:
foo = A_1,GROUP_1,USER_1,USER_2,USER_3
The foo line may or may not be the same on all systems. Different systems may have different values and a different number of values. My task is to add "bare minimum" values to the config file on all systems. The bare minimum line will look like this:
foo = A_1,USER_1,SYS_1,SYS_2
If the line does not exist, I must create it. If the line does exist, I must merge the two lines. Using my examples, the result would be this. The order of the values does not matter.
foo = A_1,GROUP_1,USER_1,USER_3,USER_2,SYS_1,SYS_2
Obviously I want a script to do my work. I have the standard sh, ksh, awk, sed, grep, perl, cut, etc. Since this is AIX, I do not have access to the GNU versions of these utilities.
Originally, I had a script with these commands to replace the entire foo line.
cp /etc/bar/config /etc/bar/config.$$
sed "s/foo = .*/foo = A_1,USER_1,SYS_1,SYS_2/" /etc/bar/config.$$ > /etc/bar/config
But this simply replaces the line. It does not take any pre-existing configuration into consideration, nor handle the case where the line is missing entirely. And I'm doing other configuration modifications in the script, such as adding completely unique lines to other files and restarting a process, so I'd prefer this be some type of shell-based code snippet I can add to my change script. I am open to other options, especially if the solution is simpler.
Some dirty bash/sed:
#!/usr/bin/bash
input_file="some_filename"
# line number of the foo line, or 0 if there is none
v=$(grep -n '^foo *=' "$input_file")
lineno=$(cut -d: -f1 <<< "${v}0:")
base="A_1,USER_1,SYS_1,SYS_2,"
if [[ "$lineno" == 0 ]]; then
    echo "foo = A_1,USER_1,SYS_1,SYS_2" >> "$input_file"
else
    # merge the existing values with the defaults, dedupe, and rebuild the line
    all=$(sed -n ${lineno}'s/^foo *= */'"$base"'/p' "$input_file" | \
        tr ',' '\n' | sort | uniq | tr '\n' ',' | \
        sed -e 's/^/foo = /' -e 's/, *$//' -e 's/  */ /g')
    sed -i "${lineno}"'s/.*/'"$all"'/' "$input_file"
fi
Untested bash, etc.
config=/etc/bar/config
default=A_1,USER_1,SYS_1,SYS_2
pattern='^foo[[:blank:]]*=[[:blank:]]*' # shared with grep and sed
if current=$( grep "$pattern" "$config" )   # grep's exit status tells us if the line exists
then
    current=$( echo "$current" | sed "s/$pattern//" )
    new=$( echo "$current,$default" | tr ',' '\n' | sort | uniq | paste -sd, - )
    sed "s/$pattern.*/foo = $new/" "$config" > "$config.$$.tmp" &&
    mv "$config.$$.tmp" "$config"
else
    echo "foo = $default" >> "$config"
fi
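If it works as intended, a run against the sample from the question should look like this (untested, as advertised; sort puts the merged values in alphabetical order; merge_foo.sh is a hypothetical name for the snippet above):
$ cat /etc/bar/config
foo = A_1,GROUP_1,USER_1,USER_2,USER_3
$ ./merge_foo.sh
$ cat /etc/bar/config
foo = A_1,GROUP_1,SYS_1,SYS_2,USER_1,USER_2,USER_3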
A vanilla perl solution:
perl -i -lpe '
BEGIN {%foo = map {$_ => 1} qw/A_1 USER_1 SYS_1 SYS_2/}
if (s/^foo\s*=\s*//) {
$found=1;
$foo{$_}=1 for split /,/;
$_ = "foo = " . join(",", keys %foo);
}
END {print "foo = " . join(",", keys %foo) unless $found}
' /etc/bar/config
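For example, on the sample config from the question (an untested sketch; the merged order will vary from run to run because keys %foo is unordered):
$ cat /etc/bar/config
foo = A_1,GROUP_1,USER_1,USER_2,USER_3
$ perl -i -lpe '...' /etc/bar/config    # the one-liner above
$ cat /etc/bar/config
foo = SYS_2,A_1,USER_3,GROUP_1,USER_1,USER_2,SYS_1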
This Perl code will do as you ask. It expects the path to the file to be modified as a parameter on the command line.
Note that it reads the entire input file into the array @config and then overwrites the same file with the modified data.
It works by building a hash %values from a combination of the items already present in the foo = line and the list of default items in @defaults. The combination is sorted in alphabetical order and joined with commas.
use strict;
use warnings;

my @defaults = qw/ A_1 USER_1 SYS_1 SYS_2 /;

my ($file) = @ARGV;
my @config = <>;

open my $out_fh, '>', $file or die $!;
select $out_fh;

for ( @config ) {
    if ( my ($pfx, $vals) = /^(foo \s* = \s* ) (.+) /x ) {
        my %values;
        ++$values{$_} for $vals =~ /[^,\s]+/g;
        ++$values{$_} for @defaults;
        print $pfx, join(',', sort keys %values), "\n";
    }
    else {
        print;
    }
}

close $out_fh;
output
foo = A_1,GROUP_1,SYS_1,SYS_2,USER_1,USER_2,USER_3
Since you didn't provide sample input and expected output, I couldn't test this, but this is the right approach:
awk '
/foo = / { old = ","$3; next }
{ print }
END {
split("A_1,USER_1,SYS_1,SYS_2"old,all,/,/)
for (i in all)
if (!seen[all[i]]++)
new = (new ? new "," : "") all[i]
print "foo =", new
}
' /etc/bar/config > tmp && mv tmp /etc/bar/config
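For what it's worth, with the sample from the question the merged line should come out along these lines (untested; for (i in all) visits elements in an unspecified order, so the value order may differ):
$ cat /etc/bar/config
foo = A_1,USER_1,SYS_1,SYS_2,GROUP_1,USER_2,USER_3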
I have a "pipe-separated" file that has about 20 columns. I want to just hash the first column which is a number like account number using sha1sum and return the rest of the columns as is.
What's the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is, is up for debate. One way to do it with awk is:
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
OFS = FS # set output field separator to field separator; we will use
# it because we meddle with the fields.
}
NR == 1 { # first line: just print headers.
print
}
NR != 1 { # from there on do the hash/replace
# this constructs a shell command (and runs it) that echoes the field
# (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
# and gets it back into awk with getline (into the variable hash)
# the gsub bit is to prevent the shell from barfing if there's an apostrophe
# in one of the fields.
gsub(/'/, "'\\''", $1);
command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
command | getline hash
close(command)
# then replace the field and print the result.
$1 = hash
print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47
function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow break down:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function break down:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after the hash in the sha1sum output, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk.script data (or ./awk.script data if you use bash)
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }
function encodeData( fld,   cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed; the additional args for your function make them local and so always null/zero on each invocation; the if on getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline); and the rest is maybe more style/preference with a bit of a performance improvement.
I have a sed file that replaces all occurrences of a string in a file with another string.
I want to do it in place, but without using -i from the terminal.
What changes need to be made to the .sed file?
#!/bin/sed
s/include/\#include/
Just use awk:
{ sub(/include/,"#include"); rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
or if you want to operate strictly on strings:
BEGIN{ old="include"; new="#include" }
s = index($0,old) { $0 = substr($0,1,s-1) new substr($0,s+length(old)) }
{ rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
which can be simplified to:
s = index($0,"include") {$0 = substr($0,1,s-1) "#" substr($0,s)
{ rec = rec $0 RS }
END{ printf "%s", rec > FILENAME }
in this particular case of just prepending # to a string.
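Either variant is invoked with -f; the END block writes the buffered lines back over the input file, so no -i is needed. A sketch, assuming the program is saved as inplace.awk and the target is file.c (both hypothetical names):
$ awk -f inplace.awk file.c
Since the whole file is buffered in rec before being rewritten, this is only suitable for files that fit in memory.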
I don't think it will work, because the -i and -f options can usually both have arguments, but you could be lucky.
The shebang line can contain one option group (aka cluster). You would need it to contain -f (which is missing from your example), so the cluster could look like
#!/bin/sed -if
provided that your dialect doesn't require an argument to the -i option, and permits clustering like this in the first place (-if for -i -f).
The obvious workaround is to change the script to a shell script; the -f option is no longer required because the sed program is given inline rather than read from a file.
#!/bin/sh
exec sed -i 's/include/\#include/' "$@"
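This assumes GNU sed, where -i takes no separate argument. If the wrapper is saved as fixinc (a hypothetical name), usage looks like:
$ chmod +x fixinc
$ ./fixinc file1.c file2.c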
I have a problem: I need to call a Perl script with parameters passed in and get the return value of the Perl script in an AWK BEGIN block. Just like below.
I have a Perl script util.pl
#!/usr/bin/perl -w
$res=`$exe_cmd`;
print $res;
Now in the AWK BEGIN block (ksh) I need to call the script and get the return value.
BEGIN { print "in awk, application type is " type;
} \
{call per script here;}
How do I call the Perl script with parameters and get the return value of $res?
res = util.pl a b c;
Pipe the script into getline:
awk 'BEGIN {
cmd = "util.pl a b c";
cmd | getline res;
close(cmd);
print "in awk, application type is " res
}'
Part of an AWK script I use for extracting data from an LDAP query. Perhaps you can find some inspiration from how I do the base64 decoding below...
/^dn:/{
    if($0 ~ /^dn: /){
        split($0, a, "[:=,]")
        name=a[3]
    }
    else if($0 ~ /^dn::/){
        # Special handling needed since ldap apparently
        # uses base64 encoded strings for *some* users
        cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
        while ( ( cmd | getline result ) > 0 ) { }
        close(cmd)
        split(result, a, "[:=,]")
        name=a[2]
    }
}
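For illustration, here is what the decode step produces for a made-up dn:: value (the base64 string is hypothetical; note too that the <<< here-string assumes the shell awk spawns for cmd understands it, i.e. bash rather than a plain POSIX sh):
$ /usr/bin/base64 -d <<< 'Y249VGVzdCBVc2VyLG91PXBlb3BsZQ=='
cn=Test User,ou=people
split(result, a, "[:=,]") then gives a[1]="cn" and a[2]="Test User", which is why the encoded branch reads the name from a[2], while in the plain dn: branch the leading "dn" occupies a[1] and the value lands in a[3].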