Using command-line tools to extract and replace text for translations - sed

For an application, I have a language file of the form
first_identifier = English words
second_identifier = more English words
and need to translate it into further languages. As a first step, I need to extract the right-hand side of those lines, resulting in a file like ...
English words
more English words
... How can I achieve that? Using grep maybe?
Next I'd use a translation tool and receive something like
German words
more German words
that needs to be inserted back into the first file (replacing the English words with the German ones). I was thinking of using sed, but I don't know how to use it for this purpose. Or do you have other recommendations?

To do it as you describe would be:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' 0
sed 's/[^ =]* = //' "${@:--}" > "$tmp" &&
tr 'a-z' 'A-Z' < "$tmp" |
awk '
    BEGIN { OFS = " = " }
    NR == FNR {
        ger[NR] = $0
        next
    }
    {
        sub(/ = .*/,"")
        print $0, ger[FNR]
    }
' - "$tmp"
$ ./tst.sh file
English words = ENGLISH WORDS
more English words = MORE ENGLISH WORDS
but you don't need a temp file for that:
$ cat tst.sh
#!/usr/bin/env bash
sed 's/[^ =]* = //' "$@" |
tr 'a-z' 'A-Z' |
awk '
    BEGIN { OFS = " = " }
    NR == FNR {
        ger[NR] = $0
        next
    }
    {
        sub(/ = .*/,"")
        print $0, ger[FNR]
    }
' - "$@"
$ ./tst.sh file
first_identifier = ENGLISH WORDS
second_identifier = MORE ENGLISH WORDS
and I think this might be what you really want anyway, so your translation tool can translate one line at a time instead of the whole input at once, which might produce different results:
$ cat tst.sh
#!/usr/bin/env bash
while IFS= read -r line; do
    id="${line%% = *}"
    eng="${line#* = }"
    ger="$(tr 'a-z' 'A-Z' <<<"$eng")"
    printf '%s = %s\n' "$id" "$ger"
done < "${@:--}"
$ ./tst.sh file
first_identifier = ENGLISH WORDS
second_identifier = MORE ENGLISH WORDS
Just replace tr 'a-z' 'A-Z' < "$tmp" or tr 'a-z' 'A-Z' <<<"$eng" with the call to whatever translation tool you have in mind.
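For example, if your translation tool is a command-line program that takes an English string as an argument and prints the translated string, the last script would become something like the sketch below. The command name translate_de is purely a hypothetical placeholder for whatever tool you actually have:
#!/usr/bin/env bash
# "translate_de" is a hypothetical stand-in: any command that reads an English
# string as its argument and prints the German translation will do.
while IFS= read -r line; do
    id="${line%% = *}"
    eng="${line#* = }"
    ger="$(translate_de "$eng")"
    printf '%s = %s\n' "$id" "$ger"
done < "$1"
Run it as ./tst.sh file > file.de to produce the translated language file.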

Related

Merge two lines into one within a configuration file

I have several AIX systems with a configuration file, let's call it /etc/bar/config. The file may or may not have a line declaring values for foo. An example would be:
foo = A_1,GROUP_1,USER_1,USER_2,USER_3
The foo line may or may not be the same on all systems. Different systems may have different values and a different number of values. My task is to add "bare minimum" values to the config file on all systems. The bare minimum line will look like this.
foo = A_1,USER_1,SYS_1,SYS_2
If the line does not exist, I must create it. If the line does exist, I must merge the two lines. Using my examples, the result would be this. The order of the values does not matter.
foo = A_1,GROUP_1,USER_1,USER_3,USER_2,SYS_1,SYS_2
Obviously I want a script to do my work. I have the standard sh, ksh, awk, sed, grep, perl, cut, etc. Since this is AIX, I do not have access to the GNU versions of these utilities.
Originally, I had a script with these commands to replace the entire foo line.
cp /etc/bar/config /etc/bar/config.$$
sed "s/foo = .*/foo = A_1,USER_1,SYS_1,SYS_2/" /etc/bar/config.$$ > /etc/bar/config
But this simply replaces the line. It does not take into account any pre-existing configuration, nor the case where the line is missing entirely. And I'm doing other configuration modifications in the script, such as adding completely unique lines to other files and restarting a process, so I'd prefer this to be some type of shell-based code snippet I can add to my change script. I am open to other options, especially if the solution is simpler.
Some dirty bash/sed:
#!/usr/bin/bash
input_file="some_filename"
v=$(grep -n '^foo *=' "$input_file")
lineno=$(cut -d: -f1 <<< "${v}0:")
base="A_1,USER_1,SYS_1,SYS_2,"
if [[ "$lineno" == 0 ]]; then
    echo "foo = A_1,USER_1,SYS_1,SYS_2" >> "$input_file"
else
    all=$(sed -n ${lineno}'s/^foo *= */'"$base"'/p' "$input_file" | \
        tr ',' '\n' | sort | uniq | tr '\n' ',' | \
        sed -e 's/^/foo = /' -e 's/, *$//' -e 's/  */ /g')
    sed -i "${lineno}"'s/.*/'"$all"'/' "$input_file"
fi
Untested bash, etc.
config=/etc/bar/config
default=A_1,USER_1,SYS_1,SYS_2
pattern='^foo[[:blank:]]*=[[:blank:]]*'  # shared with grep and sed
if current=$( grep "$pattern" "$config" )
then
    current=$( sed "s/$pattern//" <<< "$current" )
    new=$( echo "$current,$default" | tr ',' '\n' | sort | uniq | paste -sd, )
    sed "s/$pattern.*/foo = $new/" "$config" > "$config.$$.tmp" &&
        mv "$config.$$.tmp" "$config"
else
    echo "foo = $default" >> "$config"
fi
A vanilla perl solution:
perl -i -lpe '
    BEGIN {%foo = map {$_ => 1} qw/A_1 USER_1 SYS_1 SYS_2/}
    if (s/^foo\s*=\s*//) {
        $found=1;
        $foo{$_}=1 for split /,/;
        $_ = "foo = " . join(",", keys %foo);
    }
    END {print "foo = " . join(",", keys %foo) unless $found}
' /etc/bar/config
This Perl code will do as you ask. It expects the path to the file to be modified as a parameter on the command line.
Note that it reads the entire input file into the array @config and then overwrites the same file with the modified data.
It works by building a hash %values from a combination of the items already present in the foo = line and the list of default items in @defaults. The combination is sorted in alphabetical order and joined with a comma.
use strict;
use warnings;

my @defaults = qw/ A_1 USER_1 SYS_1 SYS_2 /;

my ($file) = @ARGV;
my @config = <>;

open my $out_fh, '>', $file or die $!;
select $out_fh;

for ( @config ) {
    if ( my ($pfx, $vals) = /^(foo \s* = \s* ) (.+) /x ) {
        my %values;
        ++$values{$_} for $vals =~ /[^,\s]+/g;
        ++$values{$_} for @defaults;
        print $pfx, join(',', sort keys %values), "\n";
    }
    else {
        print;
    }
}

close $out_fh;
output
foo = A_1,GROUP_1,SYS_1,SYS_2,USER_1,USER_2,USER_3
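Assuming the script above is saved as merge_foo.pl (the name is just for illustration), it would be run like this, overwriting the config file with the merged result:
perl merge_foo.pl /etc/bar/config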
Since you didn't provide sample input and expected output I couldn't test this but this is the right approach:
awk '
/foo = / { old = ","$3; next }
{ print }
END {
    split("A_1,USER_1,SYS_1,SYS_2" old, all, /,/)
    for (i in all)
        if (!seen[all[i]]++)
            new = (new ? new "," : "") all[i]
    print "foo =", new
}
' /etc/bar/config > tmp && mv tmp /etc/bar/config
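A note on ordering: for (i in all) visits the array in an unspecified order, which is fine here since the order of the values does not matter. If you do want a predictable order (the defaults first, then whatever extra values were already on the line), a small variation of the END block along these lines should work:
END {
    n = split("A_1,USER_1,SYS_1,SYS_2" old, all, /,/)
    for (i = 1; i <= n; i++)          # visit the values in their original order
        if (!seen[all[i]]++)
            new = (new ? new "," : "") all[i]
    print "foo =", new
}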

Make some replacements on a bunch of files depending the number of columns per line

I'm having a problem dealing with some files. I need to count the columns of every line in a file and, depending on the number of columns, append several ',' to the end of each line. All lines should have 36 columns separated by ','.
This line solves my problem, but how do I run it on a folder with several files in an automated way?
awk ' BEGIN { FS = "," } ;
{if (NF == 32) { print $0",,,," } else if (NF==31) { print $0",,,,," }
}' <SOURCE_FILE> > <DESTINATION_FILE>
Thank you for all your support
R&P
The answer depends on your OS, which you haven't told us. On UNIX and assuming you want to modify each original file, it'd be:
for file in *
do
    awk '...' "$file" > tmp$$ && mv tmp$$ "$file"
done
Also, in general to get all records in a file to have the same number of fields you can do this without needing to specify what that number of fields is (though you can if appropriate):
$ cat tst.awk
BEGIN { FS=OFS=","; ARGV[ARGC++] = ARGV[ARGC-1] }   # queue the input file a second time so it is read twice
NR==FNR { nf = (NF > nf ? NF : nf); next }          # 1st pass: find the maximum field count (or keep a larger nf set with -v)
{
    tail = sprintf("%*s",nf-NF,"")                  # 2nd pass: build a string of (nf-NF) spaces ...
    gsub(/ /,OFS,tail)                              # ... turn each space into a field separator ...
    print $0 tail                                   # ... and append it to the record
}
$
$ cat file
a,b,c
a,b
a,b,c,d,e
$
$ awk -f tst.awk file
a,b,c,,
a,b,,,
a,b,c,d,e
$
$ awk -v nf=10 -f tst.awk file
a,b,c,,,,,,,
a,b,,,,,,,,
a,b,c,d,e,,,,,
It's a short one-liner with Perl:
perl -i.bak -F, -alpe '$_ .= "," x (36-@F)' *
If this is only a single folder without subfolders, use:
for oldfile in /path/to/files/*
do
    newfile="${oldfile}.new"
    awk '...' "${oldfile}" > "${newfile}"
done
If you also want to include subdirectories recursively, it's probably easiest to put the awk call plus redirection into a small shell script, like this:
#!/bin/bash
oldfile=$1
newfile="${oldfile}.new"
awk '...' "${oldfile}" > "${newfile}"
and then run this script (let's call it runawk.sh) via find:
find /path/to/files/ -type f -not -name "*.new" -exec runawk.sh \{\} \;
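Note that for find to be able to run it, runawk.sh has to be executable and given by a path find can resolve, for example:
chmod +x /path/to/runawk.sh
find /path/to/files/ -type f -not -name "*.new" -exec /path/to/runawk.sh {} \;
(Here /path/to/runawk.sh is simply wherever you saved the script; alternatively invoke it as -exec sh /path/to/runawk.sh {} \; without the chmod.)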

Print all line between the search pattern into different files using perl or any method

Could someone help out with this?
I want to print all lines between the search patterns (START & END) to different files (new_file_name can be any incremental name provided).
But the search pattern repeats in the file, hence each time it finds the pattern it should dump the lines between them into a different file.
The file is something like this
START --- ./body1/b1
##########################
123body1
abcbody1
##########################
END --- ./body1/b1
START --- ./body2/b2
##########################
123body2
defbody2
##########################
END --- ./body2/b2
A Perl solution:
perl -MFile::Basename -MFile::Path -ne '
    ($a) = /^START.+?(\S+)$/;
    $b = /^END/;
    $a..$b or next;
    if ($a){ mkpath(dirname $a); open STDOUT,">",$a; }
    $a||$b or print;
' file
Here is my awk solution:
# print_between_patterns.awk
/^START/ { filename = $NF ; next } # On START, use the last field as file name
/^END/ { next } # On END, skip
{ print > filename } # For the rest of the lines, print to file
Assuming your data file is called data.txt, the following will do what you want:
awk -f print_between_patterns.awk data.txt
Discussion
After the script runs, you will have ./body1, ./body2, and so on.
If you don't want to skip the START and END lines, remove the next statements.
Update
If you want to control the output filename in a sequential way:
/^START/ { filename = sprintf("out%04d.txt", ++count) ; next }
/^END/ { next }
{ print > filename }
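After replacing the rules in print_between_patterns.awk with the ones above, running it the same way on the sample data from the question (two START/END blocks) should produce sequentially numbered output files:
$ awk -f print_between_patterns.awk data.txt
$ ls out*
out0001.txt  out0002.txt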
To get automatically generated incremental file names:
awk '
/^END/ { inBlock=0 }
inBlock { print > outfile }
/^START/ { inBlock=1; outfile = "outfile" ++count }
' file
To use the file names from your input:
awk '
/^END/  { inBlock=0 }
inBlock { print > outfile }
/^START/ {
    inBlock=1
    outdir = outfile = $NF
    sub(/\/[^\/]+$/,"",outdir)
    system("mkdir -p \"" outdir "\"")
}
' file
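If your real input contains very many START/END blocks, it is worth closing each output file once its END line is reached; otherwise some awk implementations eventually fail with a "too many open files" error. A minimal variation of the script above (only the /^END/ rule changes):
awk '
/^END/  { if (outfile != "") close(outfile); inBlock=0 }
inBlock { print > outfile }
/^START/ {
    inBlock=1
    outdir = outfile = $NF
    sub(/\/[^\/]+$/,"",outdir)
    system("mkdir -p \"" outdir "\"")
}
' file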
The problem @JamesBond was having below was that I wasn't escaping the "/" within the character list in the sub(), so I've updated my answer above to do that now. There's absolutely no reason why that should need to be escaped, but apparently both nawk and /usr/xpg4/bin/awk require it:
$ cat file
the
quick/brown
dog
$ gawk '/[/]/' file
quick/brown
$ nawk '/[/]/' file
nawk: nonterminated character class [
source line number 1
context is
>>> /[/ <<< ]/
$ /usr/xpg4/bin/awk '/[/]/' file
/usr/xpg4/bin/awk: /[/: [ ] imbalance or syntax error Context is:
>>> /[/ <<<
and gawk doesn't care either way:
$ gawk --lint --posix '/[/]/' file
quick/brown
$ gawk --lint '/[/]/' file
quick/brown
$ gawk --lint --posix '/[\/]/' file
quick/brown
$ gawk --lint '/[\/]/' file
quick/brown
They all work just fine if I escape the backslash without putting it in a character list:
$ /usr/xpg4/bin/awk '/\//' file
quick/brown
$ nawk '/\//' file
quick/brown
$ gawk '/\//' file
quick/brown
So I guess that's something worth remembering for portability in future!
Using awk:
awk 'sub(/^START/, ""){out=sprintf("out%d", c++); p=1}
     sub(/^END/, ""){print > out; p=0}
     p{print > out}' file
This will find and store each match between START and END into separate files named out0, out1, etc.
This is one way to do it in Bash.
#!/bin/bash
[ -n "$BASH_VERSION" ] || {
    echo "You need Bash to run this script."
    exit 1
}
shopt -s extglob || {
    echo "Unable to enable extglob shell option."
    exit 1
}
IFS=$' \t\n' ## Use default.
while read KEY DASH FILENAME; do
    if [[ $KEY == START && $DASH == --- && -n $FILENAME ]]; then
        CURRENT_FILENAME=$FILENAME
        DIRNAME=${FILENAME%%+([^/])}
        if [[ -n $DIRNAME ]]; then
            mkdir -p "$DIRNAME" || {
                echo "Unable to create directory $DIRNAME."
                exit 1
            }
        fi
        exec 4>"$CURRENT_FILENAME" || {
            echo "Unable to open $CURRENT_FILENAME for output."
            exit 1
        }
        for (( ;; )); do
            IFS= read -r LINE || {
                echo "End of file reached finding END block of $CURRENT_FILENAME."
                exec 4>&-
                exit 1
            }
            read -r KEY DASH FILENAME <<< "$LINE"
            if [[ $KEY == END && $DASH == --- && $FILENAME == "$CURRENT_FILENAME" ]]; then
                break
            else
                echo "$LINE" >&4
            fi
        done
        exec 4>&-
    fi
done
Make sure you save the script in UNIX file format, then run it as bash script.sh < file.
I guess you need to see this.
perl -lne 'print if((/START/../END/) and ($_!~/START/ and $_!~/END/))' your_file
Tested below:
> cat temp
START --- ./body1
##########################
123body1
abcbody1
##########################
END --- ./body1
START --- ./body2
##########################
123body2
defbody2
##########################
END --- ./body2
> perl -lne 'print if((/START/../END/) and ($_!~/START/ and $_!~/END/))' temp
##########################
123body1
abcbody1
##########################
##########################
123body2
defbody2
##########################
>
This might work for you:
csplit -z file '/^START/' '{*}'
Files will be named xx00, xx01, xx02, and so on.
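If you'd rather have meaningful names than xx00, xx01, ..., GNU csplit lets you choose the prefix and suffix format (the -f, -b, and -z options are GNU extensions and may not be available elsewhere):
csplit -z -f block_ -b '%02d.txt' file '/^START/' '{*}'
which should produce block_00.txt, block_01.txt, and so on.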

Awk/Perl/Sed column substitution based on a text code

I have a text file with the following content
L,4m,06/03/2013
L,33GJm,06/03/2013,G
L,44Bm,06/03/2013,B
L,4q,08/03/2013
J,4m,04/03/2013
J,3GU,04/03/2013,G
J,3jm,04/03/2013
J,3GJ,04/03/2013,G
J,44Bm,06/03/2013,B
J,34Bq,08/03/2013,B
M,4v,12/03/2013
D,3GU,12/03/2013,G
D,4B,11/03/2013,B
D,4m,12/03/2013
D,3GJ,13/03/2013,G
D,3GU,13/03/2013,G
D,4B,14/03/2013,B
D,4B,14/03/2013,B
D,34Bm,14/03/2013,B
L,33BUq,11/03/2013,B
L,3BJUq,11/03/2013,B
L,44Bq,14/03/2013,B
L,44Bq,14/03/2013,B
L,3Bq,15/03/2013,B
L,3q,15/03/2013
J,34Bjq,11/03/2013,B
J,33GUm,12/03/2013,G
J,4q,13/03/2013
J,33GUq,13/03/2013,G
J,33GUq,13/03/2013,G
J,4q,13/03/2013
M,3BU,18/03/2013,B
M,4B,18/03/2013,B
M,4B,18/03/2013,B
M,3GJ,19/03/2013,G
M,3GJ,19/03/2013,G
D,4B,22/03/2013,B
D,3BU,22/03/2013,B
L,34Bv,18/03/2013,B
L,3jm,19/03/2013
L,4m,19/03/2013
L,33GJm,19/03/2013,G
L,33GUm,19/03/2013,G
J,33BUm,18/03/2013,B
J,4m,18/03/2013
J,4B,18/03/2013,B
J,33BUm,18/03/2013,B
J,4q,22/03/2013
J,4q,22/03/2013
A,3GJ,28/03/2013,G
M,4B,27/03/2013,B
D,4B,25/03/2013,B
L,44Bq,25/03/2013,B
L,34Bq,25/03/2013,B
L,34Bq,25/03/2013,B
L,33BUa,26/03/2013,B
L,33BUq,26/03/2013,B
L,33BUq,26/03/2013,B
L,34Bq,27/03/2013,B
L,34Bq,27/03/2013,B
L,4B,27/03/2013,B
L,34Bq,27/03/2013,B
L,4a,28/03/2013
I want to translate the second column based on the following coding system.
If $2 starts with a 1 or 2 - Change $2 to Excellent
If $2 contains 3BU or 3GU - Change $2 to Good
If $2 contains 3BJ or 3GJ - Change $2 to OK
If $2 starts with a 4 - Change $2 to Poor
If $2 starts with a 5 - Change $2 to Terrible
I can find and change the 3BUs to Good easy enough using the following command
awk 'BEGIN{FS=",";OFS=","} {if ($2~ /3(B|G)U/)print $1,"Good",$3}' file | sponge file
Though I lose all the other, non-3(B|G)U lines. I could use if/else logic, though this seems inelegant. I have tried to use gensub to solve the problem
awk -F, '{gensub(/3(B|G)U/,Good,"",2)}1' file
But this prints the file contents without any substitution. Any hints?
Desired output
L,Poor,06/03/2013
L,Ok,06/03/2013,G
L,Poor,06/03/2013,B
L,Poor,08/03/2013
J,Poor,04/03/2013
J,Good,04/03/2013,G
A Perl or sed one-liner would also be helpful, as this code forms part of a bash shell script.
If you want to stick with shell:
(
    IFS=,
    while read -ra f; do  # pick more appropriate variable names
        case ${f[1]} in
            [12]*)    f[1]=Excellent ;;
            *3[BG]U*) f[1]=Good ;;
            *3[BG]J*) f[1]=OK ;;
            4*)       f[1]=Poor ;;
            5*)       f[1]=Terrible ;;
        esac
        echo "${f[*]}"
    done < file
) > tmp && mv tmp file
I ran that in a subshell to localize changes to $IFS
A sed solution too:
sed -e 's/\(^.,\)\(1\|2\)[^,]*/\1Excellent/g' -e 's/\(^.,\)3[BG]U[^,]*/\1Good/g' -e 's/\(^.,\)3[BG]J[^,]*/\1OK/g' -e 's/\(^.,\)4[^,]*/\1Poor/g' -e 's/\(^.,\)5[^,]*/\1Terrible/g' <filename>
$ awk '
BEGIN { FS=OFS="," }
$2 ~ /^(1|2)/ { $2 = "Excellent" }
$2 ~ /3(B|G)U/ { $2 = "Good" }
$2 ~ /3(B|G)J/ { $2 = "OK" }
$2 ~ /^4/ { $2 = "Poor" }
$2 ~ /^5/ { $2 = "Terrible" }
1
' foo.txt | head -n 10
L,Poor,06/03/2013
L,OK,06/03/2013,G
L,Poor,06/03/2013,B
L,Poor,08/03/2013
J,Poor,04/03/2013
J,Good,04/03/2013,G
J,3jm,04/03/2013
J,OK,04/03/2013,G
J,Poor,06/03/2013,B
J,34Bq,08/03/2013,B
perl -pe 's{,(\w+)}{ $_ = /^[12]/ ?"Excellent" :/3[BG]U/ ?"Good" :/3[BG]J/ ?"OK" :/^4/ ?"Poor" :/^5/ ?"Terrible" :$_ for $v=$1; ",$v" }e'
A more readable version:
s{,(\w+)}{
    for ($v = $1) {
        $_ = /^[12]/  ? "Excellent"
           : /3[BG]U/ ? "Good"
           : /3[BG]J/ ? "OK"
           : /^4/     ? "Poor"
           : /^5/     ? "Terrible"
           : $_;
    }
    ",$v";
}e;

How to delete multiple empty lines with SED?

I'm trying to compress a text document by deleting duplicated empty lines with sed. This is what I'm doing (to no avail):
sed -i -E 's/\n{3,}/\n/g' file.txt
I understand that it's not correct, according to this manual, but I can't figure out how to do it correctly. Thanks.
I think you want to replace spans of multiple blank lines with a single blank line, even though your example replaces runs of three or more \n with a single \n rather than with \n\n. With that in mind, here are two solutions:
sed '/^$/{ :l
N; s/^\n$//; t l
p; d; }' input
In many implementations of sed, that can be all on one line, with the embedded newlines replaced by ;.
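With GNU sed, for example, the one-line form would be the following (stricter seds, such as BSD sed, want the label and branch terminated by real newlines, so there the multi-line form above is safer):
sed '/^$/{:l;N;s/^\n$//;t l;p;d;}' input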
awk 't || !/^$/; { t = !/^$/ }'
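The same awk one-liner, spelled out with comments:
awk '
t || !/^$/        # print if the previous line was non-empty, or this one is
{ t = !/^$/ }     # remember whether this line was non-empty, for the next line
' input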
As tripleee suggested above, I'm using Perl instead of sed:
perl -0777pi -e 's/\n{3,}/\n\n/g' file.txt
Use tr, the translate command:
tr -s '\n'
The -s or --squeeze-repeats option reduces a sequence of a repeated character to a single instance.
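For example, to squeeze the runs of newlines in the question's file (note that this removes the blank lines entirely rather than leaving a single one, and tr cannot edit in place, so redirect to a new file; the output name is just illustrative):
tr -s '\n' < file.txt > compressed.txt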
This is much better handled by tr -s '\n' or cat -s, but if you insist on sed, here's an example from section 4.17 of the GNU sed manual:
#!/usr/bin/sed -f
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ {
N
bx
}
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
I am not sure this is what the OP wanted, but using the awk solution by William Pursell, here is the approach if you want to delete ALL empty lines in the file:
awk '!/^$/' file.txt
Explanation:
The awk pattern
'!/^$/'
is true when the current line is not empty: /^$/ matches a line consisting only of the beginning of the line (symbolised by '^') immediately followed by the end of the line (symbolised by '$'), in other words an empty line, and the leading '!' negates that.
If this pattern is true, awk applies its default action and prints the current line.
HTH
I think the OP wants to compress empty lines, e.g. where there are 9 consecutive empty lines, he wants to have just three.
I have written a little bash script that does just that:
#! /bin/bash
TOTALLINES="$(cat file.txt|wc -l)"
CURRENTLINE=1
while [ $CURRENTLINE -le $TOTALLINES ]
do
    L1=$CURRENTLINE
    L2=$(($L1 + 1))
    L3=$(($L1 + 2))
    if [[ $(cat file.txt|head -n $L1|tail -n +$L1) == "" ]]||[[ $(cat file.txt|head -n $L1|tail -n +$L1) == " " ]]
    then
        L1EMPTY=true
    else
        L1EMPTY=false
    fi
    if [[ $(cat file.txt|head -n $L2|tail -n +$L2) == "" ]]||[[ $(cat file.txt|head -n $L2|tail -n +$L2) == " " ]]
    then
        L2EMPTY=true
    else
        L2EMPTY=false
    fi
    if [[ $(cat file.txt|head -n $L3|tail -n +$L3) == "" ]]||[[ $(cat file.txt|head -n $L3|tail -n +$L3) == " " ]]
    then
        L3EMPTY=true
    else
        L3EMPTY=false
    fi
    if [ $L1EMPTY = true ]&&[ $L2EMPTY = true ]&&[ $L3EMPTY = true ]
    then
        # do not copy this line to the temp file
        echo "Skipping line "$CURRENTLINE
    else
        echo "$(cat file.txt|head -n $CURRENTLINE|tail -n +$CURRENTLINE)">>temp.txt
        echo "Writing line " $CURRENTLINE
    fi
    ((CURRENTLINE++))
done
cat temp.txt>file.txt
rm temp.txt
FINALTOTALLINES="$(cat file.txt|wc -l)"
EMPTYLINECOUNT=$(( $TOTALLINES - $FINALTOTALLINES ))
echo "Deleted " $EMPTYLINECOUNT " empty lines."