sed specific - perform many substitutions inside double quoted multiline strings

I have a sed(1) script that performs many step-by-step transformations (substitutions) on an
input stream, and it works well for the task itself. What I need now is to limit these
operations to the inside of "/"-quoted multiline strings only. The input stream is a simple text
file containing multiline "/"-quoted strings on which I need to run my
sequence of s/// commands. I know this is quite hard to achieve in sed(1), but
I still hope somebody knows how. The script I have so far follows; it works correctly, but only on a single-line basis.
The sed(1) "tricks" are at the beginning and at the
end of the script; the rest is just a sequence of s/// expressions, and it is correct:
#! /bin/sed -f
# Convert /PinYin/ strings to /UTF-8 PinYin/ strings.
# Notice: /PinYin/ strings MUST NOT be multiline (to do).
/\/.*\// {
s/\//\
/g
:a
h
s/[^\n]*\n//
s/\n.*//
s/ang1/||aq||ng/g
s/ang2/||aw||ng/g
s/ang3/||ae||ng/g
s/ang4/||ar||ng/g
s/eng1/||eq||ng/g
s/eng2/||ew||ng/g
s/eng3/||ee||ng/g
s/eng4/||er||ng/g
s/ing1/||iq||ng/g
s/ing2/||iw||ng/g
s/ing3/||ie||ng/g
s/ing4/||ir||ng/g
s/ong1/||oq||ng/g
s/ong2/||ow||ng/g
s/ong3/||oe||ng/g
s/ong4/||or||ng/g
s/an1/||aq||n/g
s/an2/||aw||n/g
s/an3/||ae||n/g
s/an4/||ar||n/g
s/en1/||eq||n/g
s/en2/||ew||n/g
s/en3/||ee||n/g
s/en4/||er||n/g
s/in1/||iq||n/g
s/in2/||iw||n/g
s/in3/||ie||n/g
s/in4/||ir||n/g
s/un1/||uq||n/g
s/un2/||uw||n/g
s/un3/||ue||n/g
s/un4/||ur||n/g
s/ao1/||aq||o/g
s/ao2/||aw||o/g
s/ao3/||ae||o/g
s/ao4/||ar||o/g
s/ou1/||oq||u/g
s/ou2/||ow||u/g
s/ou3/||oe||u/g
s/ou4/||or||u/g
s/ai1/||aq||i/g
s/ai2/||aw||i/g
s/ai3/||ae||i/g
s/ai4/||ar||i/g
s/ei1/||eq||i/g
s/ei2/||ew||i/g
s/ei3/||ee||i/g
s/ei4/||er||i/g
s/a1/||aq||/g
s/a2/||aw||/g
s/a3/||ae||/g
s/a4/||ar||/g
s/a1/||aq||/g
s/a2/||aw||/g
s/a3/||ae||/g
s/a4/||ar||/g
s/er2/||ew||r/g
s/er3/||ee||r/g
s/er4/||er||r/g
s/lyue/l||u:||e/g
s/nyue/n||u:||e/g
s/e1/||eq||/g
s/e2/||ew||/g
s/e3/||ee||/g
s/e4/||er||/g
s/o1/||oq||/g
s/o2/||ow||/g
s/o3/||oe||/g
s/o4/||or||/g
s/i1/||iq||/g
s/i2/||iw||/g
s/i3/||ie||/g
s/i4/||ir||/g
s/nyu3/n||u:e||/g
s/lyu/l||u:||/g
s/u:1/||u:q||/g
s/u:2/||u:w||/g
s/u:3/||u:e||/g
s/u:4/||u:r||/g
s/u:0/||u:s||/g
s/u1/||uq||/g
s/u2/||uw||/g
s/u3/||ue||/g
s/u4/||ur||/g
s/||aq||/ā/g
s/||aw||/á/g
s/||ae||/ǎ/g
s/||ar||/à/g
s/||eq||/ē/g
s/||ew||/é/g
s/||ee||/ě/g
s/||er||/è/g
s/||iq||/ī/g
s/||iw||/í/g
s/||ie||/ǐ/g
s/||ir||/ì/g
s/||oq||/ō/g
s/||ow||/ó/g
s/||oe||/ǒ/g
s/||or||/ò/g
s/||uq||/ū/g
s/||uw||/ú/g
s/||ue||/ǔ/g
s/||ur||/ù/g
s/||u:q||/ǖ/g
s/||u:w||/ǘ/g
s/||u:e||/ǚ/g
s/||u:r||/ǜ/g
s/||u:s||/ü/g
G
s/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//
/\n/ b a
}
Sample input:
Some text containing for instance Chinese greeting /ni3
hao3/ and perhaps some other Chinese sentence, say /ni2
kan4, .../
Expected output:
Some text containing for instance Chinese greeting /nǐ
hǎo/ and perhaps some other Chinese sentence, say /ní
kàn, .../
My knowledge of sed(1) is not good enough to solve this problem on my own, so I am asking for your help. Thank you.

From what I understand in your question, you need to specify the address range for the sed commands:
sed '/\//,/\// {command1; command2; ...}'
However, this will in turn break when the /../ pattern is not multi-line. That means that you'll need to make all of them multi-line. To make sure there's only one / per line do:
sed 's_/_\n/_g' | sed {main sed command}
This also gives one the idea that you could treat multi-line quotes as one-line if you joined all lines to one in the first place:
cat myfile | tr '\n' ' ' | sed {your current commands}
P.S. Also I'd like to note that your "trick" in the beginning is a little flawed:
/\/.*\//
This is greedy, so it won't process multiple patterns on the same line correctly. For this reason the second approach probably won't work as it is.
Edit: okay, this turned out more complex than I thought (or I'm too tired to see an easier way).
To get the lines back together you need to split them in a "unique" way, so that later you can tell which of the newlines were introduced by your script. I suggest doing it like this:
sed 's_/_\n/\n_g'
so that each / gets its own line. If you see a line that consists only of a / character, you know you should join it to the previous line and the next one. So first you run the above sed command on the file, then do the substitutions with the address range /\//,/\//, and finally you put the lines back together. This can be done with
sed ':a $!{N;ba};s/\n\/\n/\//g'
so I suggest you finally pipe to this. I wouldn't be happy about having to use this myself, but you can always hide it inside a shell function or something like that.
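The three stages described above can be put together roughly as follows. This is only a sketch: it assumes GNU sed (for \n in the replacement), the function name convert_quoted is made up for illustration, and the two s/// commands in the middle stage stand in for the full conversion table.

```shell
#!/bin/sh
# convert_quoted is a hypothetical helper name (not from the question).
# Stage 1 puts every "/" on a line of its own (GNU sed: \n in replacement).
# Stage 2 substitutes only between an opening and a closing "/" line;
#         its two s/// commands are placeholders for the real list.
# Stage 3 slurps the whole stream and glues the lone "/" lines back in.
convert_quoted() {
    sed 's_/_\n/\n_g' |
    sed '/^\/$/,/^\/$/ { s/i3/ǐ/g; s/ao3/ǎo/g; }' |
    sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n\/\n/\//g'
}

# Demo: only the text between the two "/" characters is converted.
printf 'ni3 hao3 /ni3\nhao3/ ni3 hao3\n' | convert_quoted
```

The last stage is written with several -e options so that it does not depend on how a particular sed parses a label followed by other commands on the same line.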

In the end it was quite easy to achieve with only a small improvement to the original sed(1) code. Perhaps it could be done better, but since the conversion code already worked in "line scope", I left it as it was (apart from minor improvements that are not important to the essence of this question). Instead, I read the whole file into the pattern space, replace the newlines with \001 (^A) characters, let the original code do its work, and at the end replace the ^A characters with newlines again. Here it is:
#! /bin/sed -f
# pinyin2utf8.sed -- Convert US-ASCII Pinyin to UTF-8
# Copyright (C) 2012 Matous J. Fialka, <http://mjf.cz/>
# Released under the terms of The MIT License
#
# DESCRIPTION
# Script converts all occurrences of US-ASCII encoded Pinyin text
# enclosed by the solidus characters pairs to UTF-8 encoded text.
#
# USAGE
# pinyin2utf8.sed filename [ > filename.out ]
#
# WARNINGS
# Script contains the ^A control character, usually displayed as
# such in most text editors, which can usually be entered by
# pressing the ^V ^A key sequence. The ^A control character thus
# MUST NOT occur in the input stream. To find the sequences in
# the script, look up the y/// commands in the code, please.
#
# In the US-ASCII encoded Pinyin to UTF-8 Pinyin conversion code
# special delimiting sequences of left and right parentheses are
# used and those two delimiting sequences of left or right parens
# SHOULD NOT be used in the input stream.
: 0
$! {
N
b 0
}
# HERE BE DRAGONS
y/\n/^A/
y/\//\
/
: a
h
s/[^\n]*\n//
s/\n.*//
# CONVERSION CODE BEGINNING
s/ang1/(((aq)))ng/g
s/ang2/(((aw)))ng/g
s/ang3/(((ae)))ng/g
s/ang4/(((ar)))ng/g
s/eng1/(((eq)))ng/g
s/eng2/(((ew)))ng/g
s/eng3/(((ee)))ng/g
s/eng4/(((er)))ng/g
s/ing1/(((iq)))ng/g
s/ing2/(((iw)))ng/g
s/ing3/(((ie)))ng/g
s/ing4/(((ir)))ng/g
s/ong1/(((oq)))ng/g
s/ong2/(((ow)))ng/g
s/ong3/(((oe)))ng/g
s/ong4/(((or)))ng/g
s/an1/(((aq)))n/g
s/an2/(((aw)))n/g
s/an3/(((ae)))n/g
s/an4/(((ar)))n/g
s/en1/(((eq)))n/g
s/en2/(((ew)))n/g
s/en3/(((ee)))n/g
s/en4/(((er)))n/g
s/in1/(((iq)))n/g
s/in2/(((iw)))n/g
s/in3/(((ie)))n/g
s/in4/(((ir)))n/g
s/un1/(((uq)))n/g
s/un2/(((uw)))n/g
s/un3/(((ue)))n/g
s/un4/(((ur)))n/g
s/ao1/(((aq)))o/g
s/ao2/(((aw)))o/g
s/ao3/(((ae)))o/g
s/ao4/(((ar)))o/g
s/ou1/(((oq)))u/g
s/ou2/(((ow)))u/g
s/ou3/(((oe)))u/g
s/ou4/(((or)))u/g
s/ai1/(((aq)))i/g
s/ai2/(((aw)))i/g
s/ai3/(((ae)))i/g
s/ai4/(((ar)))i/g
s/ei1/(((eq)))i/g
s/ei2/(((ew)))i/g
s/ei3/(((ee)))i/g
s/ei4/(((er)))i/g
s/a1/(((aq)))/g
s/a2/(((aw)))/g
s/a3/(((ae)))/g
s/a4/(((ar)))/g
s/a1/(((aq)))/g
s/a2/(((aw)))/g
s/a3/(((ae)))/g
s/a4/(((ar)))/g
s/er2/(((ew)))r/g
s/er3/(((ee)))r/g
s/er4/(((er)))r/g
s/lyue/l(((u:)))e/g
s/nyue/n(((u:)))e/g
s/e1/(((eq)))/g
s/e2/(((ew)))/g
s/e3/(((ee)))/g
s/e4/(((er)))/g
s/o1/(((oq)))/g
s/o2/(((ow)))/g
s/o3/(((oe)))/g
s/o4/(((or)))/g
s/i1/(((iq)))/g
s/i2/(((iw)))/g
s/i3/(((ie)))/g
s/i4/(((ir)))/g
s/nyu3/n(((u:e)))/g
s/lyu/l(((u:)))/g
s/u:1/(((u:q)))/g
s/u:2/(((u:w)))/g
s/u:3/(((u:e)))/g
s/u:4/(((u:r)))/g
s/u:0/(((u:s)))/g
s/u1/(((uq)))/g
s/u2/(((uw)))/g
s/u3/(((ue)))/g
s/u4/(((ur)))/g
s/(((aq)))/ā/g
s/(((aw)))/á/g
s/(((ae)))/ǎ/g
s/(((ar)))/à/g
s/(((eq)))/ē/g
s/(((ew)))/é/g
s/(((ee)))/ě/g
s/(((er)))/è/g
s/(((iq)))/ī/g
s/(((iw)))/í/g
s/(((ie)))/ǐ/g
s/(((ir)))/ì/g
s/(((oq)))/ō/g
s/(((ow)))/ó/g
s/(((oe)))/ǒ/g
s/(((or)))/ò/g
s/(((uq)))/ū/g
s/(((uw)))/ú/g
s/(((ue)))/ǔ/g
s/(((ur)))/ù/g
s/(((u:q)))/ǖ/g
s/(((u:w)))/ǘ/g
s/(((u:e)))/ǚ/g
s/(((u:r)))/ǜ/g
s/(((u:s)))/ü/g
# CONVERSION CODE END
G
s/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//
/\n/ b a
# HERE BE DRAGONS
y/^A/\
/
Sample input text:
$ cat test.in
ni3 hao3
/ni3 hao3/
ni3 hao3 /ni3 hao3/
/ni3 hao3/ ni3 hao3
ni3 hao3 /ni3 hao3/ ni3 hao3
ni3 hao3 /ni3 hao3/ ni3 hao3 /ni3 hao3/
/ni3 hao3/ ni3 hao3 /ni3 hao3/
/ni3 hao3/ ni3 hao3 /ni3 hao3/ ni3 hao3
ni3 hao3 /ni3 hao3/ ni3 hao3 /ni3 hao3/ ni3 hao3
/ni3 hao3/ ni3 hao3 /ni3 hao3/ ni3 hao3 /ni3 hao3/
ni3 hao3 /ni3
hao3/ ni3 hao3
/ni3 hao3
ni3
hao3
ni3 hao3/ ni3 hao3
Sample run:
$ pinyin2utf8.sed test.in
ni3 hao3
/nǐ hǎo/
ni3 hao3 /nǐ hǎo/
/nǐ hǎo/ ni3 hao3
ni3 hao3 /nǐ hǎo/ ni3 hao3
ni3 hao3 /nǐ hǎo/ ni3 hao3 /nǐ hǎo/
/nǐ hǎo/ ni3 hao3 /nǐ hǎo/
/nǐ hǎo/ ni3 hao3 /nǐ hǎo/ ni3 hao3
ni3 hao3 /nǐ hǎo/ ni3 hao3 /nǐ hǎo/ ni3 hao3
/nǐ hǎo/ ni3 hao3 /nǐ hǎo/ ni3 hao3 /nǐ hǎo/
ni3 hao3 /nǐ
hǎo/ ni3 hao3
/nǐ hǎo
nǐ
hǎo
nǐ hǎo/ ni3 hao3
It seems to work just fine (at least well enough to suit my needs), and thus I consider this issue closed. Many thanks to everyone involved, especially Mr. Lev Levitsky!
P.S.: I also placed the code here (GitHub) where you can track some possible future changes.
P.S. 2: The ^A characters were lost while saving this answer. They are now replaced with their ASCII representation here. You have to replace them with their binary representation (in vi(1) press ^V ^A in insert mode) or use the GitHub version instead.
P.S. 3: I still find the ^A "hack" quite ugly. If anybody knows how to avoid it in this case while keeping the middle conversion code as simple as it is now, please share your ideas.
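One possible way to keep the literal ^A byte out of the script file is to do the newline swap outside sed with tr(1), which understands the octal escape \001. The following is only a sketch under several assumptions: GNU sed (it must cope with input that lacks a trailing newline and accept \n inside brackets), balanced solidus pairs, a made-up function name pinyin_via_tr, and two s/// commands standing in for the whole conversion table.

```shell
#!/bin/sh
# pinyin_via_tr is a hypothetical name; the s/i3/.../ and s/ao3/.../ lines
# are placeholders for the full conversion table.
# tr turns every newline into ^A, so sed sees the file as one record and
# the ^A byte never has to be typed into the script itself.
pinyin_via_tr() {
    tr '\n' '\001' |
    sed -e 'y/\//\n/' \
        -e '/\n/!b' \
        -e ':a' \
        -e 'h' \
        -e 's/[^\n]*\n//' \
        -e 's/\n.*//' \
        -e 's/i3/ǐ/g' \
        -e 's/ao3/ǎo/g' \
        -e 'G' \
        -e 's/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//' \
        -e '/\n/ba' |
    tr '\001' '\n'
}

# Demo: a quoted string that spans two lines.
printf 'a /ni3\nhao3/ b\n' | pinyin_via_tr
```

Since tr already joins the file into a single record, the N loop at the top of the script is not needed in this variant. The /\n/!b guard passes input without any solidus through untouched. The ^A character still MUST NOT occur in the input stream.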

Related

How to find text pattern in string for postgresql SQL

For this sample string: "... Key Match extra text..."
How do I get the value "Match", which is the string between blank spaces after "Key"?
Is there a better way than this:
find the position of "Key " -> p1, find the position of the first blank space after p1 -> p2, then substring(string, p1, p2)?
This is not working as I expected:
Select substring('Key Match extra text', 'Key (.+) ');
---
Match extra

You can make the regex "non-greedy", so that .+ matches as few characters as possible:
Select substring('Key Match extra text', 'Key (.+?) ');
Or you can change . to something that won't match spaces:
Select substring('Key Match extra text', 'Key (\S+) ');

Postgres regex stop creating double space

I have a query that looks like this:
select regexp_replace('john (junior) jones','\([^)]*\)','','g');
regexp_replace
------------------
john  jones
As you can see, this query removes the values in brackets but it results in a double space remaining.
Is there an easy way around this?
So far I have this, which works to an extent:
select regexp_replace((regexp_replace('john (junior) jones','\([^)]*\)','','g')),'\s','');
regexp_replace
------------------
john jones
The above works but not when I pass through something like this:
select regexp_replace((regexp_replace('john (junior) jones (hughes) smith','\([^)]*\)','','g')),'\s','');
regexp_replace
---------------------
john jones  smith

SELECT regexp_replace(
'john (junior) jones (hughes) smith',
' *\([^)]*\) *',
' ',
'g'
);
regexp_replace
══════════════════
john jones smith
(1 row)
To explain the regular expression:
an arbitrary number of spaces, followed by an opening parenthesis ( *\()
an arbitrary number of characters that are not a closing parenthesis ([^)]*)
a closing parenthesis and arbitrarily many spaces (\) *)
That is replaced with a single space.
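The same pattern can also be tried outside the database. A minimal sketch with sed, assuming basic regular expressions (where unescaped parentheses are literal); the function name strip_parens is made up for illustration:

```shell
#!/bin/sh
# strip_parens is a hypothetical helper name.
# " *" on both sides swallows the spaces around each parenthesised part,
# and the whole match is replaced with exactly one space.
strip_parens() {
    sed 's/ *([^)]*) */ /g'
}

printf '%s\n' 'john (junior) jones (hughes) smith' | strip_parens
```

Note that a parenthesised part at the very start or end of the string would still leave one leading or trailing space behind.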

Linux sed remove two patterns

I hope you're having a great day,
I want to remove the entries that contain the word images from some text.
In the file test1 I have this:
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
I need to remove APP:Server2:images and APP:Server8:images-v2, so that I get this output:
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs
I'm trying this:
cat test1 | sed 's/ .*images.* / /g'

You need to make sure that your wildcards do not allow spaces:
cat data | sed 's/ [^ ]*image[^ ]* / /g'
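A pattern that requires a space on both sides cannot match an entry at the very end of the line, such as APP:Server8:images-v2 in the sample. A possible variant, given here only as a sketch (the function name drop_images is made up), removes each entry together with the space in front of it and handles an entry at the start of the line separately:

```shell
#!/bin/sh
# drop_images is a hypothetical helper name.
# First expression: remove " <entry>" wherever the entry contains "images".
# Second expression: remove such an entry if it is the first one on the line.
drop_images() {
    sed -e 's/ [^ ]*images[^ ]*//g' \
        -e 's/^[^ ]*images[^ ]* *//'
}

printf '%s\n' 'APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2' | drop_images
```

Because [^ ]* cannot cross a space, each match stays within a single entry.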

This should work for you
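sed 's/\w\{1,\}:Server[2|8]:[^ ]* *//g'
\w matches word characters (letters, numbers, _); it is a GNU sed extension
\{1,\} matches one or more of the preceding item (\w); in a basic regular expression the braces must be escaped
[2|8] matches 2 or 8 (and, strictly, a literal |; [28] would be enough)
[^ ]* * matches the rest of the entry, including a suffix such as -v2, and any spaces after it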
cat test.file
APP:Server1:files APP:Server2:images APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs APP:Server8:images-v2
The below command removes the matching lines and leaves blanks in their place
tr ' ' '\n' < test.file |sed 's/\w\{1,\}:Server[2|8]:\w\{1,\}.*$//'
APP:Server1:files
APP:Server3:misc
APP:Server4:xml
APP:Server5:json
APP:Server6:stats
APP:Server7:graphs
To remove the blank lines, just add a second command to the sed script, and paste the contents back together
tr ' ' '\n' < test.file |sed 's/\w\{1,\}:Server[2|8]:\w\{1,\}.*$//;/^$/d'|paste -sd ' ' -
APP:Server1:files APP:Server3:misc APP:Server4:xml APP:Server5:json APP:Server6:stats APP:Server7:graphs

GNU awk alternative:
awk 'BEGIN { RS="APP:" } $0=="" { next } { split($0,map,":");if (map[2] ~ /images/ ) { next } OFS=RS;printf " %s%s",OFS,$0 }'
Set the record separator to "APP:" and process the text in between as separate records. If the record is empty, skip to the next record. Split the record into the array map using ":" as the delimiter, then check whether the second element contains images. If it does, skip to the next record; otherwise print the record, prefixed with the record separator.

sed - substitute between pattern on different lines

I have a csv file exported from spreadsheet which has, in the last column, sometimes a list of names. The file comes out like this:
ag,bd,cj,dy,"ss"
aa,bs,cs,fg,"name1
name2
name3
"
ff,ce,sd,de,
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4
name5
"
and so on.
I would like to remove the line feed in the last column between quotes so that the output is:
ag,bd,cj,dy,ss
aa,bs,cs,fg,"name1 name2 name3"
ff,ce,sd,de,
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4 name5"
Thanks

This awk may be one solution for you:
awk '/\"/ {s=!s} {printf "%s"(s?FS:RS),$0}'
ag,bd,cj,dy,ss
aa,bs,cs,fg,"name1 name2 name3 "
ff,ce,sd,de,df
New solution
awk -F\" 'NF==3; NF==2 {s++} s==1 {printf "%s ",$0} s==2 {print;s=0}' file | awk '{sub(/ "/,"\"")}1'
ag,bd,cj,dy,"ss"
aa,bs,cs,fg,"name1 name2 name3"
ag,bd,jj,ds,"ds"
fs,ee,sd,ee,"name4 name5"
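Another way to look at the problem is to count the double quotes: a record is complete once the quotes seen so far are balanced. The following is only a sketch of that idea (the function name join_quoted is made up, and it assumes the fields contain no escaped or doubled quotes); unlike the output above, it also keeps lines that contain no quotes at all, such as ff,ce,sd,de,.

```shell
#!/bin/sh
# join_quoted is a hypothetical helper name.
join_quoted() {
    awk '
    {
        # Append to the pending record while a quote is still open.
        buf = (open ? buf " " : "") $0
        # gsub returns the number of quotes on this line; an odd count
        # flips the open/closed state.
        if (gsub(/"/, "&") % 2) open = !open
        if (!open) {
            # Drop the space left in front of a closing quote, then print.
            sub(/ "$/, "\"", buf)
            print buf
        }
    }'
}

printf '%s\n' 'ag,bd,cj,dy,"ss"' 'aa,bs,cs,fg,"name1' 'name2' 'name3' '"' 'ff,ce,sd,de,' | join_quoted
```

Usage would be join_quoted < file, with file being the exported csv.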

grep and replace

I want to grep for a string, at its first occurrence ONLY, in a file (file.dat) and replace it with values read from another file (output). As an example, my file called "output" contains "AAA T 0001".
#!/bin/bash
procdir=`pwd`
cat output | while read lin1 lin2 lin3
do
srt2=$(echo $lin1 $lin2 $lin3 | awk '{print $1,$2,$3}')
grep -m 1 $lin1 $procdir/file.dat | xargs -r0 perl -pi -e 's/$lin1/$srt2/g'
done
Basically, what I want is this: whenever the string "AAA" is found in "file.dat" for the first time, I want to replace the second and third columns next to "AAA" with "T 0001" but keep the first column "AAA" as it is. The above script does not work: the $lin1 and $srt2 variables are not expanded inside 's/$lin1/$srt2/g'.
Example:
in my file.dat I have a row
AAA D ---- CITY COUNTRY
What I want is :
AAA T 0001 CITY COUNTRY
Any comments are very appreciated.

If you have output file like this:
$ cat output
AAA T 0001
Your file.dat file contains information like:
$ cat file.dat
AAA D ---- CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY
You can try something like this with awk:
$ awk '
NR==FNR {
a[$1]=$0
next
}
$1 in a {
printf "%s ", a[$1]
delete a[$1]
for (i=4;i<=NF;i++) {
printf "%s ", $i
}
print ""
next
}1' output file.dat
AAA T 0001 CITY COUNTRY
BBB C ---- CITY COUNTRY
AAA D ---- CITY COUNTRY

Say you place the string for which to search in $s and the string with which to replace in $r, wouldn't the following do?
perl -i -pe'
BEGIN { ($s,$r)=splice(@ARGV,0,2) }
$done ||= s/\Q$s/$r/;
' "$s" "$r" file.dat
(Replaces the first instance if present)

This will only change the first match in the file:
#!/bin/bash
procdir=`pwd`
while read line; do
set $line
sed '0,/'"$1"'/s/\([^ ]* \)\([^ ]* [^ ]*\)/\1'"$2 $3"'/' $procdir/file.dat
done < output
To change all matching lines:
sed '/'"$1"'/s/\([^ ]* \)\([^ ]* [^ ]*\)/\1'"$2 $3"'/' $procdir/file.dat