Awk or Sed: File Annotation

Hello, my SO friend, my question is:
Specification: insert the fields of FILE_2 at the corresponding positions in FILE_1.
A field is marked, and hence identified, by a delimiter pair.
I did this job in Python before I knew awk and sed, with a couple hundred lines of code.
Now I want to see how powerful and efficient awk and sed can be.
Show me some masterpiece of awk or sed, please!
The delimiter pairs can be configured in FILE_3, but let's assume the delimiters in pair number i are Marker_i_start and Marker_i_done.
Example:
|-----------------FILE_1------------------|
text text text
text blabla
Marker_1_start
Marker_1_done
any text
in between blabla
Marker_2_start
Marker_2_done
text text
|-----------------FILE_2------------------|
Marker_1_start
11
1111
Marker_1_done
Marker_2_start
2222
22
Marker_2_done
Expected Output:
|-----------------FILE_Out------------------|
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text

awk '
    # First file (FILE_2): a done marker ends the current block, so reset the separator
    FNR == NR && /Marker_.*_done/ { sep = ""; next }
    # A start marker names the block the following lines belong to
    FNR == NR && /Marker_.*_start/ { marker = $0; next }
    # Any other FILE_2 line is appended to the text of the current marker
    FNR == NR { marker_text[marker] = marker_text[marker] sep $0; sep = "\n"; next }
    # Second file (FILE_1): print every line ...
    1 { print }
    # ... and after each start marker, the block collected from FILE_2
    /Marker_.*_start/ { print marker_text[$0] }
' file_2 file_1
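The question also mentions making the delimiter pairs configurable in FILE_3. As a minimal sketch of that generalization, assuming a hypothetical FILE_3 format with one pair per line (start marker, then done marker, whitespace-separated), the same idea extends to three passes:
awk '
    # Pass 1 (FILE_3): record the configured start and done markers
    FILENAME == ARGV[1] { is_start[$1]; is_done[$2]; next }
    # Pass 2 (FILE_2): collect the text between each configured pair
    FILENAME == ARGV[2] && ($0 in is_done) { sep = ""; next }
    FILENAME == ARGV[2] && ($0 in is_start) { marker = $0; next }
    FILENAME == ARGV[2] { text[marker] = text[marker] sep $0; sep = "\n"; next }
    # Pass 3 (FILE_1): print every line, plus the stored block after each start marker
    { print }
    ($0 in is_start) && ($0 in text) { print text[$0] }
' file_3 file_2 file_1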

There are several ways to approach this. I'm assuming that FILE_2 is smaller than FILE_1 and of a reasonable size.
#!/usr/bin/awk -f
FNR == NR {
    if ($0 ~ /^Marker.*start$/) {
        flag = 1
        idx = $0
        next
    }
    if ($0 ~ /^Marker.*done$/) {
        flag = 0
        nl = ""
        next
    }
    if (flag) lines[idx] = lines[idx] nl $0
    nl = "\n"
    next
}
{
    print
    if (lines[$0]) print lines[$0]
}
To run it:
./script.awk FILE_2 FILE_1

Now I want to see how powerful and efficient awk and sed can be
For this type of problem, very efficient. I'm sure my code can be further reduced.
#!/bin/bash
awk '
FNR == NR {
    if ($0 ~ /Marker_1_start/) { m1 = 1; next }
    if ($0 ~ /Marker_2_start/) { m2 = 1; next }
    if ($0 ~ /Marker_1_done/)  { m1 = 0 }
    if ($0 ~ /Marker_2_done/)  { m2 = 0 }
    if (m1) { a[i++] = $0 }
    if (m2) { b[j++] = $0 }
}
FNR != NR {
    if ($0 ~ /Marker_1_start/) { print; n1 = 1 }
    if ($0 ~ /Marker_2_start/) { print; n2 = 1 }
    if ($0 ~ /Marker_1_done/)  { n1 = 0 }
    if ($0 ~ /Marker_2_done/)  { n2 = 0 }
    if (n1)
        for (k = 0; k < i; k++)
            print a[k]
    else if (n2)
        for (l = 0; l < j; l++)
            print b[l]
    else
        print
}' ./file_2 ./file_1
Output
$ ./filemerge.sh
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text

Related

using command line tools to extract and replace texts for translations

For an application, I have a language file of the form
first_identifier = English words
second_identifier = more English words
and need to translate it into further languages. As a first step, I'm required to extract the right-hand side of those texts, resulting in a file like ...
English words
more English words
... How can I achieve that? Using grep maybe?
Next I'd use a translation tool and receive something like
German words
more German words
that needs to be inserted into the first file again (replacing the English words with the German ones). I was thinking about using sed maybe, but I don't know how to use it for this purpose. Or do you have other recommendations?
To do it as you describe would be:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' 0
sed 's/[^ =]* = //' "${@:--}" > "$tmp" &&
tr 'a-z' 'A-Z' < "$tmp" |
awk '
    BEGIN { OFS = " = " }
    NR == FNR {
        ger[NR] = $0
        next
    }
    {
        sub(/ = .*/,"")
        print $0, ger[FNR]
    }
' - "$tmp"
$ ./tst.sh file
English words = ENGLISH WORDS
more English words = MORE ENGLISH WORDS
but you don't need a temp file for that:
$ cat tst.sh
#!/usr/bin/env bash
sed 's/[^ =]* = //' "$@" |
tr 'a-z' 'A-Z' |
awk '
    BEGIN { OFS = " = " }
    NR == FNR {
        ger[NR] = $0
        next
    }
    {
        sub(/ = .*/,"")
        print $0, ger[FNR]
    }
' - "$@"
$ ./tst.sh file
first_identifier = ENGLISH WORDS
second_identifier = MORE ENGLISH WORDS
and I think this might be what you really want anyway, so your translation tool can translate one line at a time instead of the whole input at once, which might produce different results:
$ cat tst.sh
#!/usr/bin/env bash
while IFS= read -r line; do
    id="${line%% = *}"
    eng="${line#* = }"
    ger="$(tr 'a-z' 'A-Z' <<<"$eng")"
    printf '%s = %s\n' "$id" "$ger"
done < "${@:--}"
$ ./tst.sh file
first_identifier = ENGLISH WORDS
second_identifier = MORE ENGLISH WORDS
Just replace tr 'a-z' 'A-Z' < "$tmp" or tr 'a-z' 'A-Z' <<<"$eng" with the call to whatever translation tool you have in mind.
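For instance, assuming a hypothetical translate command that reads English text on stdin and writes the German translation on stdout (substitute whatever tool you actually have), the last script would become:
#!/usr/bin/env bash
while IFS= read -r line; do
    id="${line%% = *}"
    eng="${line#* = }"
    ger="$(translate <<<"$eng")"   # hypothetical translator: English in, German out
    printf '%s = %s\n' "$id" "$ger"
done < "${@:--}"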

print field number and field

I want to print the field number and field like this... Is awk the best way? If so, how?
The number of fields in the input line may vary.
input_line = "a|b|c|d"
expected result:
1 a
2 b
3 c
4 d
I'm able to print the fields, but need help printing the field numbers. Here's what I have
echo "a|b|c|d" |awk -F"|" '{for (i=1; i<=NF; i++) print $i}'
a
b
c
d
You can use an awk command like:
echo "a|b|c|d" | awk -F"|" '{for(i=1; i<=NF; i++) print i, $i}'
awk with a while loop should do the trick:
awk -F '|' '{ i = 1; while (i <= NF) { print i " " $i; i++; } }' <<< "a|b|c|d"
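If you want a different output format, printf gives you explicit control; for example, to right-align the field numbers:
$ echo "a|b|c|d" | awk -F'|' '{ for (i = 1; i <= NF; i++) printf "%2d: %s\n", i, $i }'
 1: a
 2: b
 3: c
 4: d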

Gather the data with similar columns

I want to filter the data from a text file in Unix.
I have a text file as below:
A 200
B 300
C 400
A 100
B 600
B 700
How could I create the data below from the data above, in awk?
A 200 100
B 300 600 700
C 400
I am not that good with awk, and I believe awk/perl is best for this.
awk 'END {
    for (R in r)
        print R, r[R]
}
{
    r[$1] = $1 in r ? r[$1] OFS $2 : $2
}' infile
If the order of the values in the first field is important,
more code will be needed.
The solution will depend on your awk implementation and version.
Explanation:
r[$1] = $1 in r ? r[$1] OFS $2 : $2
Set the value of the array element r[$1] as follows:
if the key $1 is already present ($1 in r), append OFS $2 to the existing value;
otherwise set it to the value of $2.
condition ? value-if-true : value-if-false is the ternary operator.
See ternary operation for more.
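As a quick stand-alone illustration of the ternary in awk:
$ echo 5 | awk '{ print ($1 > 3 ? "big" : "small") }'
big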
You could do it like this, but with Perl there's always more than one way to do it:
my %hash;
while (<>) {
    my ($letter, $int) = split(" ");
    push @{ $hash{$letter} }, $int;
}
for my $key (sort keys %hash) {
    print "$key " . join(" ", @{ $hash{$key} }) . "\n";
}
It should work like this:
$ cat data.txt | perl script.pl
A 200 100
B 300 600 700
C 400
Not language-specific. More like pseudocode, but here's the idea:
- Get all lines in an array
- Set up a target dictionary of arrays
- Go through the array:
  - Split the string using ' ' (space) as the delimiter, into array parts
  - Check whether there is already a dictionary entry for `parts[0]` (e.g. 'A'); if not, create it
  - Add `parts[1]` (e.g. 100) to `dictionary(parts[0])`
And that's it! :-)
I'd do it, probably in Python, but that's rather a matter of taste.
Using awk (GNU awk, since asort is a gawk extension), sorting the output inside it:
awk '
    { data[$1] = (data[$1] ? data[$1] " " : "") $2 }
    END {
        for (i in data) {
            idx[++j] = i
        }
        n = asort(idx)
        for (i = 1; i <= n; i++) {
            print idx[i] " " data[idx[i]]
        }
    }
' infile
Using the external program sort:
awk '
    { data[$1] = (data[$1] ? data[$1] " " : "") $2 }
    END {
        for (i in data) {
            print i " " data[i]
        }
    }
' infile | sort
For both commands, the output is:
A 200 100
B 300 600 700
C 400
Using sed:
Content of script.sed:
## First line. Newline will separate data, so add it after the content.
## Save it in 'hold space' and read next one.
1 {
    s/$/\n/
    h
    b
}
## Append content of 'hold space' to current line.
G
## Search if first char (\1) in line was saved in 'hold space' (\4) and add
## the number (\2) after it.
s/^\(.\)\( *[0-9]\+\)\n\(.*\)\(\1[^\n]*\)/\3\4\2/
## If last substitution succeed, goto label 'a'.
ta
## Here last substitution failed, so it is the first appearance of the
## letter, add it at the end of the content.
s/^\([^\n]*\n\)\(.*\)$/\2\1/
## Label 'a'.
:a
## Save content to 'hold space'.
h
## In last line, get content of 'hold space', remove last newline and print.
$ {
    x
    s/\n*$//
    p
}
Run it like:
sed -nf script.sed infile
And result:
A 200 100
B 300 600 700
C 400
This might work for you (a stable sort groups the lines by key, then sed joins adjacent lines that share the same first field):
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]*\)\( .*\)\n\1/\1\2/;ta;P;D'
A 200 100
B 300 600 700
C 400

awk join multiple lines

I want to join the lines between the opening tag and closing tag of a span with the class "content_subhd".
For example:
<span class="content_subhd">1
2
3 </span>
<span class="xyz">1
2
3</span>
Output should be:
<span class="content_subhd">123</span>
<span class="xyz">1
2
3
</span>
How can this be achieved? Any suggestions?
awk '/<span class="content_subhd">/, /<\/span>/ {
    # Accumulate the lines of the range on a single line
    r = r ? r $0 : $0
    if (/<\/span>/) {
        # Closing tag reached: print the joined line and reset r
        # (x is an unset variable, i.e. the empty string)
        print r; r = x
    }
    next
}1' infile
If you want to replace the content of your existing file:
awk > _new_ '/<span class="content_subhd">/, /<\/span>/ {
    r = r ? r $0 : $0
    if (/<\/span>/) {
        print r; r = x
    }
    next
}1' your_file &&
mv -- _new_ your_file
Added solution for mass replacement (as per OP request):
find <your arguments here> |
while IFS= read -r; do
    awk > _new_ '/<span class="content_subhd">/, /<\/span>/ {
        r = r ? r $0 : $0
        if (/<\/span>/) {
            print r; r = x
        }
        next
    }1' "$REPLY" &&
    mv -- _new_ "$REPLY"
done
As sed is tagged in this question, here is a one-liner:
sed '/<span class="content_subhd">/,/<\/span>/{H;/<\/span>/{s/.*//;x;s/\n//g;p;};d}' source
All lines are passed through except in the special "span class" case. Those lines are collected in the hold space; when the closing tag arrives, the newlines are removed and what would have been several lines is printed as one.

Splitting file based on variable

I have a file with several lines of the following:
DELIMITER ;
I want to create a separate file for each of these sections.
The man page of the split command does not seem to offer such an option.
The split command only splits a file into blocks of equal size (except possibly the last one).
However, awk is perfect for this type of problem. Here's an example solution.
Sample input
1
2
3
DELIMITER ;
4
5
6
7
DELIMITER ;
8
9
10
11
awk script split.awk
#!/usr/bin/awk -f
BEGIN {
    n = 1
    outfile = n
}
{
    # FILENAME is undefined inside the BEGIN block
    if (outfile == n) {
        outfile = FILENAME n
    }
    if ($0 ~ /DELIMITER ;/) {
        n++
        outfile = FILENAME n
    } else {
        print $0 >> outfile
    }
}
As pointed out by glenn jackman, the code can also be written as:
#!/usr/bin/awk -f
BEGIN {
    n = 1
}
$0 ~ /DELIMITER ;/ {
    n++
    next
}
{
    print $0 >> (FILENAME n)
}
If you only need it once, the same thing works as a one-liner on the command prompt:
awk -v x="DELIMITER ;" -v n=1 '$0 ~ x {n++; next} {print > (FILENAME n)}' input
but you can save it in a script file as well.
Test run
$ ls input*
input
$ chmod +x split.awk
$ ./split.awk input
$ ls input*
input input1 input2 input3
$ cat input1
1
2
3
$ cat input2
4
5
6
7
$ cat input3
8
9
10
11
The script is just a starting point. You probably have to adapt it to your personal needs and environment.
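As an aside, if you have GNU csplit available, this kind of context-based splitting is exactly what it was made for, though its conventions differ from the script above: each piece after the first keeps its leading DELIMITER line, the output files are named xx00, xx01, and so on, and the repeat count {*} is a GNU extension:
csplit input '/DELIMITER ;/' '{*}'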