sed copy substring in following line - sed

I have a .po file in which I need to copy the msgid value into the msgstr value whenever msgstr is empty.
For example:
msgid "Hello"
msgstr ""
msgid "Dog"
msgstr "Cane"
Should become
msgid "Hello"
msgstr "Hello"
msgid "Dog"
msgstr "Cane"
Currently, for testing purposes, I'm writing to another file, but the final script will edit the file in place.
#!/bin/bash
rm it2.po
sed $'s/^msgid.*/&\\\n---&/' it.po > it2.po
sed -i '/^msgstr/d' it2.po
sed -i 's/^---msgid/msgstr/' it2.po
This script has (at least) 2 problems:
it copies msgid into msgstr even when msgstr is not empty;
I'm pretty sure that a one-liner or a more elegant solution exists.
Any help would be appreciated. Thanks in advance.

You may consider a better tool for this, GNU awk, instead of sed:
awk -i inplace -v FPAT='"[^"]*"|\\S+' 'id != "" && $1 == "msgstr" && (NF==1 || $2 == "\"\"") {$2=id} $1 == "msgid" {id=$2} 1' file
msgid "Hello"
msgstr "Hello"
msgid "Dog"
msgstr "Cane"
-v FPAT='"[^"]*"|\\S+' makes a quoted string or any non-whitespace field an individual field.
A more readable form:
awk -i inplace -v FPAT='"[^"]*"|\\S+' '
id != "" && $1 == "msgstr" && (NF==1 || $2 == "\"\"") {$2=id}
$1 == "msgid" {id=$2}
1' file

This might work for you (GNU sed):
sed -E 'N;s/(msgid "(.*)".*msgstr )""/\1"\2"/;P;D' file
Open a two line window and if the first line contains msgid and the second msgstr "", replace the msgstr value by the msgid value. Print/delete the first line and repeat.
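A minimal sketch of that two-line window, run on the sample from the question. It assumes a sed that accepts -E; the $!N guard (so the last line is not lost on seds that quit when N finds no more input) and the ^...$ anchors are additions that are not in the command above, and the variable name window_out is only illustrative.

```shell
# $!N : append the next line unless we are already on the last one
# s   : fill an empty msgstr with the msgid value captured from the first line
# P;D : print and drop the first line of the window, then restart with the rest
window_out=$(printf '%s\n' 'msgid "Hello"' 'msgstr ""' 'msgid "Dog"' 'msgstr "Cane"' |
  sed -E '$!N;s/^(msgid "(.*)"\nmsgstr )""$/\1"\2"/;P;D')
printf '%s\n' "$window_out"
```

Because D restarts the cycle with the remainder of the window, every line gets a turn as the first line, so a msgid/msgstr pair is found no matter where it starts.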

Since the structure of the input file is so simple and consistent, I think the following should be enough (it works with the 3 examples you've provided):
sed -zE 's/(msgid "([^"]+)"\nmsgstr ")"/\1\2"/g' your_file
-z makes the file be a long string of input with embedded \ns, so we don't need commands like N, D, or others, because the whole file is already in the pattern space;
-E lets us use (, ), and + instead of \(, \), and \+ (and also other similar things)
the outermost () captures msgid "Hello"\nmsgstr " (the closing " is matched but not captured);
the innermost () captures the first double-quoted string;
\1\2" concatenates the matched text (except the final ", as I noted above), with the text between the first two "s, and a closing ",
the flag g will apply the substitution across the whole file.
If the leading strings are not that important (e.g. they are always the same, and the lines always appear as msgid followed by msgstr), you can squeeze the command above a bit more:
sed -zE 's/(([^"]+)"\n[^\n]*")"/\1\2"/g' your_file

You can use the hold space:
sed '
/^msgid[\t ]*/ {
p
s///
x
d
}
/^msgstr[\t ]*""/ {
x
s/^/msgstr /
}
' <in.po >out.po
if line starts with msgid
print it
delete the keyword
save string to hold
go to next line
else if lines starts with msgstr and has empty value
retrieve string from hold
prepend the keyword
implicit print
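For comparison, the steps listed above can be sketched in awk, with an ordinary variable playing the role of the hold space. This is not the sed script itself, only the same logic under the assumption that each value sits on one line; the function name fill_empty_msgstr and the variable held are made up for the example.

```shell
# Hypothetical helper: same steps as the list above, with "held" as the hold space.
fill_empty_msgstr() {
  awk '
    /^msgid[ \t]/ {                    # line starts with msgid
      held = $0
      sub(/^msgid[ \t]*/, "", held)    # delete the keyword, keep the quoted value
    }
    /^msgstr[ \t]*""/ && held != "" {  # msgstr with an empty value
      $0 = "msgstr " held              # retrieve the value, prepend the keyword
    }
    { print }                          # explicit here, implicit in sed
  ' "$@"
}

held_out=$(printf '%s\n' 'msgid "Hello"' 'msgstr ""' 'msgid "Dog"' 'msgstr "Cane"' |
  fill_empty_msgstr)
printf '%s\n' "$held_out"
```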

Here's a simple sed script which keeps the latest msgid in the hold space (h) then brings it back (x) and changes it to msgstr if it sees an empty msgstr.
sed -e '/^msgid "/h' -e '/^msgstr ""/!b' \
-e x -e 's/^msgid/msgstr/' it.po >it2.po
Notice also how you would typically combine multiple sed statements with -e rather than create a new file and then repeatedly run sed -i on it. sed is a scripting language; learn it if you want to use it.
(Some sed variants don't tolerate this arrangement; maybe combine the script into a single string with semicolons between the statements if you have trouble with this one.)
Having said that, sed is very much a write-only language. Perhaps you'd be better off with a simple Awk (or Python, or etc) solution.
awk '/^msgid "/ { s=$0; sub(/^msgid/, "", s) }
/^msgstr ""/ { $0 = $1 s } 1' it.po >it2.po

With GNU awk and shown samples only, we could try following.
awk -v RS='"[^"]*"|\n+' '
RT=="\n"{ next }
$0~/^msgstr/{
if(RT=="\"\""){ $0=$0 val }
else { $0=$0 RT }
}
$0~/^msgid/ { val=RT
$0=$0 RT }
RT
' Input_file
2nd solution: a slight variation on the above. The first solution assumes each value is a single "..." pair; this one treats everything from the first " to the end of the line as the value. Again, written and tested with the shown samples only.
awk -v RS='"[^\n]*|\n+' '
RT=="\n"{ next }
$0~/^msgstr/{
if(RT=="\"\""){ $0=$0 val }
else { $0=$0 RT }
}
$0~/^msgid/ { val=RT
$0=$0 RT }
RT
' Input_file
Explanation: Adding detailed explanation for above.
awk -v RS='"[^"]*"|\n+' ' ##Starting awk program from here and setting record separator as " till " comes or new lines.
RT=="\n"{ next } ##If RT is newline then take cursor to next line.
$0~/^msgstr/{ ##Checking if line starts from msgstr then:
if(RT=="\"\""){ $0=$0 val } ##Checking if RT is "" then add val to current line.
else { $0=$0 RT } ##Else simply add RT.
}
$0~/^msgid/ { val=RT ##Checking if line starts from msgid then make val to RT
$0=$0 RT } ##Adding RT to $0.
RT ##Printing line if RT is not null.
' Input_file ##Mentioning Input_file name here.

Keep it simple and use awk, e.g. using any awk in any shell on every Unix box:
$ awk '$2~/""/{$2=p} {p=$2} 1' it.po
msgid "Hello"
msgstr "Hello"
msgid "Dog"
msgstr "Cane"
If that isn't all you need then edit your question to provide more comprehensive sample input/output, including cases that this doesn't work for.
Since you have GNU sed for -i you also have or can install GNU awk for -i inplace if you want "inplace" editing, or just do tmp=$(mktemp) && awk 'script' file > "$tmp" && mv "$tmp" file like you would for any other command.
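The temp-file pattern can be wrapped in a small function. This is only a sketch: awk_inplace is a hypothetical name, not a standard utility, and it copies the result back with cat instead of mv, which is an assumption about what you want (the file keeps its permissions and owner, but the update is not atomic).

```shell
# Hypothetical helper: run an awk program on a file and write the result back.
awk_inplace() {
  awk_prog=$1
  target=$2
  tmp=$(mktemp) || return 1
  if awk "$awk_prog" "$target" > "$tmp"; then
    # Copy the bytes back rather than mv, because mktemp creates the
    # temporary file with restrictive permissions of its own.
    cat "$tmp" > "$target"
    rm -f "$tmp"
  else
    rm -f "$tmp"          # leave the original untouched if awk failed
    return 1
  fi
}

demo_po=$(mktemp)
printf '%s\n' 'msgid "Hello"' 'msgstr ""' 'msgid "Dog"' 'msgstr "Cane"' > "$demo_po"
awk_inplace '$2 ~ /""/ { $2 = p } { p = $2 } 1' "$demo_po"
inplace_result=$(cat "$demo_po")
rm -f "$demo_po"
printf '%s\n' "$inplace_result"
```

Use mv instead of cat if an atomic replacement matters more than keeping the original file's metadata.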

Related

How to apply one command into another sed command?

I have one command which is used to extract lines between two string patterns 'string1' and 'string2'. This is stored in variable called 'var1'.
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
This command works well and the output is a set of lines.
Do you hear the people sing?
Singing a song of angry men?
It is the music of a people
Who will not be slaves again
I want the output of the above command to be inserted after a string pattern 'string3' in another file called stat.txt. I used sed as follows
sed '/string3/a'$var1'' stat.txt
I am having trouble getting the new output. Here, the $var1 seems to be working partially i.e. only one line -
string3
Do you hear the people sing?
Any other suggestions to solve this?
I would be tempted to use sed to extract the lines, and awk to insert them into the other text:
lines=$(sed -n '/string1/,/string2/ p' text.txt)
awk -v new="$lines" '{print} /string3/ {print new}' stat.txt
or perhaps both tasks in a single awk call
awk '
NR == FNR && /string1/ {flag = 1}
NR == FNR && /string2/ {flag = 0}
NR == FNR && flag {lines = lines $0 ORS}
NR == FNR {next}
{print}
/string3/ {printf "%s", lines} # it already ends with a newline
' text.txt stat.txt
It's a data format problem...
Appending a multi-line block of text with the sed append command requires every line of the block, except the last one, to end with a \. The variable must also be expanded inside double quotes, otherwise the shell splits it into words and sed receives only the first one as part of the script. So if we take the two lines of code that didn't work in the question, reformat the text as required by the append command, and quote the expansion, the code works as expected:
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' text.txt)
var1="$(sed '$!s/$/\\/' <<< "$var1")"
sed "/string3/a\\
$var1" stat.txt
Note that the 2nd line above contains a bashism. A more portable version would be:
var1="$(echo "$var1" | sed '$!s/$/\\/')"
Either variant would convert $var1 to:
Do you hear the people sing?\
Singing a song of angry men?\
It is the music of a people\
Who will not be slaves again
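Here is a self-contained sketch of the whole round trip on throwaway files, so the effect of the trailing backslashes can be checked. The file contents and the names workdir, escaped and inserted are invented for the demo; the a\ form followed by a newline is used because it is the form POSIX sed specifies.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'intro' 'string1' 'Do you hear the people sing?' \
  'Singing a song of angry men?' 'string2' 'outro' > "$workdir/text.txt"
printf '%s\n' 'before' 'string3' 'after' > "$workdir/stat.txt"

# Extract the lines strictly between string1 and string2.
var1=$(awk '/string1/{flag=1; next} /string2/{flag=0} flag' "$workdir/text.txt")

# Add a trailing backslash to every line except the last.
escaped=$(printf '%s\n' "$var1" | sed '$!s/$/\\/')

# Double quotes keep the whole block as one sed script argument.
inserted=$(sed "/string3/a\\
$escaped" "$workdir/stat.txt")
printf '%s\n' "$inserted"
rm -rf "$workdir"
```

Text that itself contains backslashes would need extra escaping before being handed to the append command; the sketch does not cover that case.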

Using sed to remove embedded newlines

What is a sed script that will remove the "\n" character but only if it is inside "" characters (delimited string), not the \n that is actually at the end of the (virtual) line?
For example, I want to turn this file
"lalala","lalalslalsa"
"lalalala","lkjasjdf
asdfasfd"
"lalala","dasdf"
(line 2 has an embedded \n ) into this one
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
(Line 2 and 3 are now joined, and the real line feed was replaced with the character string \\n (or any other easy to spot character string, I'm not picky))
I don't just want to remove every other newline as a previous question asked, nor do I want to remove ALL newlines, just those that are inside quotes. I'm not wedded to sed, if awk would work, that's fine too.
The file being operated on is too large to fit in memory all at once.
sed is an excellent tool for simple substitutions on a single line, but for anything else you should use awk, e.g.:
$ cat tst.awk
{
if (/"$/) {
print prev $0
prev = ""
}
else {
prev = prev $0 " \\\\n "
}
}
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
Below was my original answer but after seeing #NeronLeVelu's approach of just testing for a quote at the end of the line I realized I was doing this in a much too complicated way. You could just replace gsub(/"/,"&") % 2 below with /"$/ and it'd work the same but the above code is a simpler implementation of the same functionality and will now handle embedded escaped double quotes as long as they aren't at the end of a line.
$ cat tst.awk
{ $0 = saved $0; saved="" }
gsub(/"/,"&") % 2 { saved = $0 " \\\\n "; next }
{ print }
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
The above only stores 1 output line in memory at a time. It just keeps building up an output line from input lines while the number of double quotes in that output line is an odd number, then prints the output line when it eventually contains an even number of double quotes.
It will fail if you can have double quotes inside your quoted strings escaped as \", not "", but you don't show that in your posted sample input so hopefully you don't have that situation. If you have that situation you need to write/use a real CSV parser.
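A quick way to check the quote-counting behaviour described above is to feed it a field that spans three lines and a field containing ""-escaped quotes. The sample input and the function name join_quoted are made up for this sketch; the awk program is the quote-counting script from this answer.

```shell
# Hypothetical wrapper around the quote-counting script.
join_quoted() {
  awk '
    { $0 = saved $0; saved = "" }                       # prepend any pending text
    gsub(/"/, "&") % 2 { saved = $0 " \\\\n "; next }   # odd quote count: keep reading
    { print }                                           # even quote count: record is complete
  ' "$@"
}

joined=$(printf '%s\n' '"a","b' 'c' 'd"' '"e ""q"" f","g"' | join_quoted)
printf '%s\n' "$joined"
```

The doubled quotes in the last record add an even number to the count, which is why that style of escaping does not confuse the script.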
sed -n ':load
/"$/ !{N
b load
}
:cycle
s/^\(\([^"]*"[^"]*"\)*\)\([^"]*"[^"]*\)\n/\1\3 \\\\n /
t cycle
p' YourFile
load lines into the pattern space until a closing line (one ending with ") is found or the end of the file is reached
replace a \n that follows zero or more complete "..." pairs, then an opening " and any non-quote characters, with the escaped form \\n (in other words, replace the leading string plus \n with the leading string plus an escaped newline)
if a substitution occurred, try another one (:cycle and t cycle)
print the result
continue until the end of the file
Thanks to #Ed Morton for the remark about the escaped newline.

Find common line in two files and replace the next line of first file with the next line of other file

I want to find common lines in two files and replace the next line of the first file with the next line of the second file. Sed, awk, Perl, Bash: any solution is welcome.
The comparison is case-insensitive and there can be multiple occurrences of the same line.
File 1:
hgacdavd
sndm,ACNMSDC
msgid "Rome"
msgstr ""
kgcksdcgfkdsb
msgid ""
hsdvchgsdvc
msgstr ""
dhshfjksdfhmd
msgid "Vidya"
msgstr ""
sdjhcbnd
dcndnv
cfnkdndvrknvkf
dfkvrnkdfnk
snfvrkng
msgid "Rome"
msgstr ""
wdbhkjbcfj
#dmcdmf
f,nvdf,
fvnfnvk
vfmf,mv
vfn
msgid "vid"
msgstr ""
dmcbdmbcvmfbvmkhsdk
File 2:
dfhkvgjbfrvkf
msgid "Rome"
msgstr "new bie"
sdbsjbcdcbwoido
fjcdcvnm
msgid "vidya"
msgstr "expert"
dvnjfkdvhnkfvnknsbdjh
msgid "vid"
msgstr "newton"
dfenfjdbrfjbvlfnvl
dcnkncvkdfvknfv
fcndkbvknfkv
vfdnkvnfknbvkfn
Later, file 1 should be:
hgacdavd
sndm,ACNMSDC
msgid "Rome"
msgstr "new bie"
kgcksdcgfkdsb
msgid ""
hsdvchgsdvc
msgstr ""
dhshfjksdfhmd
msgid "Vidya"
msgstr "expert"
sdjhcbnd
dcndnv
cfnkdndvrknvkf
dfkvrnkdfnk
snfvrkng
msgid "Rome"
msgstr "new bie"
wdbhkjbcfj
#dmcdmf
f,nvdf,
fvnfnvk
vfmf,mv
vfn
msgid "vid"
msgstr "newton"
dmcbdmbcvmfbvmkhsdk
Version 1 of question:
In this version of the question, the lines with msgid and msgstr appear in pairs and are separated from other lines with a blank line. Here is a one (long) line solution for this case:
$ awk -F'"' 'BEGIN{RS="\n\n";OFS=""} NR==FNR {c[tolower($2)]=$4; next} {print $1,"\"",$2,"\"",$3,"\"",c[tolower($2)],"\"\n"}' file2 file1
msgid "Rome"
msgstr "new bie"
msgid "Vidya"
msgstr "expert"
msgid "Rome"
msgstr "new bie"
msgid "vid"
msgstr "newton"
MORE: The version below updates file1 with the new information:
$ awk -F'"' 'BEGIN{RS="\n\n";OFS=""} NR==FNR {c[tolower($2)]=$4; next} {print $1,"\"",$2,"\"",$3,"\"",c[tolower($2)],"\"\n"}' file2 file1 >tmp && mv tmp file1
How it works: Let's break it down into parts. The first part is just setup:
$ awk -F'"' 'BEGIN{RS="\n\n";OFS=""}
To understand the above, one needs to know that awk breaks a file up into 'records', and then it breaks records up into 'fields'. The above says that every time a blank line appears (two newline characters in a row), treat what follows as a new 'record'. In other words the record separator is two newlines: RS="\n\n". It also says that records should be broken up into fields according to the appearance of a double-quote: -F'"'. Finally, it says that, when printing our output, do not add any additional spaces to what we have. In other words, the output field separator is the empty string: OFS=""
The next part is:
... NR==FNR {c[tolower($2)]=$4; next} ...
This says that, while reading the first file named on the command line (file2), create an associative array (like a dictionary) called c. The keys of the array are the msgids and the values are the msgstrs. Thus, c[rome]=new bie. We apply tolower to the msgids so that all the keys are consistently lower case. The next command tells awk to skip the remaining commands and move on to the next record.
The NR==FNR part above probably looks obscure. To understand it, one needs to know that awk counts the number of records that it has seen and assigns that value to NR. It also counts the number of records that it has seen from the file that it is currently reading and assigns that to FNR. So, when we are reading the first file, it follows that the two are equal: NR==FNR. When awk starts reading the second file, then NR>FNR and the block of code will be skipped.
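A tiny experiment makes the two counters visible. The file names first and second, their contents, and the labels printed are invented for this sketch.

```shell
nr_dir=$(mktemp -d)
printf '%s\n' a b   > "$nr_dir/first"
printf '%s\n' c d e > "$nr_dir/second"

# Print a label, NR and FNR for every record of both files.
nr_fnr=$(awk '{ tag = (NR == FNR) ? "first-file" : "later-file"; print tag, NR, FNR }' \
  "$nr_dir/first" "$nr_dir/second")
printf '%s\n' "$nr_fnr"
rm -rf "$nr_dir"
```

One caveat of the idiom: if the first file is empty, NR==FNR is also true while the second file is being read, so its records would be treated as if they came from the first.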
The last part is:
... {print $1,"\"",$2,"\"",$3,"\"",c[tolower($2)],"\"\n"}' file2 file1
This part is executed when we start reading the second file (file1) and consists of just one print statement. It prints the msgid line, including the quotes around the msgid: $1,"\"",$2,"\"". It then prints the msgstr line, looking up the value of the msgstr in our associative array c and putting quotes around it and a newline character at the end: $3,"\"",c[tolower($2)],"\"\n".
Version 2 of question:
In this version, the msgid and msgstr lines are not necessarily adjacent. Consequently, every time we run across an msgid line, we save its value to the variable id. When going through file2, we look for msgstr lines and store their value in the associative array c. Then, when processing file1, we substitute c[id] in msgstr lines:
awk -F'"' 'NR==FNR && $1=="msgid " {id=tolower($2)} NR==FNR && $1=="msgstr " {c[id]=$2} NR==FNR {next} $1=="msgid " {id=tolower($2)} {if ($1=="msgstr ") print "msgstr \"" c[id] "\""; else print $0}' file2 file1 >tmp && mv tmp file1
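The one-liner is easier to follow when laid out over several lines. The sketch below is meant to be the same logic, wrapped in a function whose name (merge_msgstr) is made up here, and exercised on a shortened version of the sample files.

```shell
# Hypothetical helper: merge_msgstr DICTIONARY_FILE TARGET_FILE
merge_msgstr() {
  awk -F'"' '
    $1 == "msgid " { id = tolower($2) }                  # remember the latest msgid (both files)
    NR == FNR { if ($1 == "msgstr ") c[id] = $2; next }  # first file: fill the lookup table
    $1 == "msgstr " { print "msgstr \"" c[id] "\""; next }  # second file: substitute
    { print }                                            # everything else unchanged
  ' "$1" "$2"
}

po_dir=$(mktemp -d)
printf '%s\n' 'msgid "Rome"' 'msgstr "new bie"' 'junk' 'msgid "vidya"' 'msgstr "expert"' \
  > "$po_dir/file2"
printf '%s\n' 'x' 'msgid "Rome"' 'msgstr ""' 'msgid ""' 'y' 'msgstr ""' \
  'msgid "Vidya"' 'msgstr ""' > "$po_dir/file1"
merged=$(merge_msgstr "$po_dir/file2" "$po_dir/file1")
printf '%s\n' "$merged"
rm -rf "$po_dir"
```

Note that every msgstr line of the target is rewritten, so a translation already present in file1 is replaced by whatever the lookup returns, including the empty string when there is no match.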
This might work for you (GNU sed):
sed -r '/msgid/{$!N;s|(.*)\n(.*)|/\1/I{n;s/.*/\2/}|}' file2 | sed -rf - file1
Turn file2 into a sed script which is run against file1.
The script turns the line beginning msgid and the following line into a command that matches the msgid and then prints that line and replaces the next with the contents of the second line from the script file.
You can use this app I wrote in PHP. It solves the same problem.
Input: it needs 2 files:
file 1: acts as a dictionary for the 2nd file
file 2: the destination, which receives the "msgid/msgstr" pair from file 1 if it exists.
Output: as you described
Check my source code here!
https://github.com/NguyenDuyPhong/merge_two_po_files
Connect to main page by the link: .../merge_two_po_files/web (Ex: http://localhost/merge_two_po_files/web )

Remove newline depending on the format of the next line

I have a special file with this kind of format :
title1
_1 texthere
title2
_2 texthere
I would like every line starting with "_" to be placed as a second column on the line before it.
I tried to do that using sed with this command :
sed 's/_\n/ /g' filename
but it is not giving me what I want to do (doing nothing basically)
Can anyone point me to the right way of doing it ?
Thanks
Try following solution:
In sed, the loop is built by creating a label (:a); while the current line is not the last one ($!), append the next line (N) and branch back to label a:
:a
$! {
N
b a
}
After this we have the whole file into memory, so do a global substitution for each _ preceded by a newline:
s/\n_/ _/g
p
All together is:
sed -ne ':a ; $! { N ; ba }; s/\n_/ _/g ; p' infile
That yields:
title1 _1 texthere
title2 _2 texthere
If your whole file is like your sample (pairs of lines), then the simplest answer is
paste - - < file
Otherwise
awk '
NR > 1 && /^_/ {printf "%s", OFS}
NR > 1 && !/^_/ {print ""}
{printf "%s", $0}
END {print ""}
' file
This might work for you (GNU sed):
sed ':a;N;s/\n_/ /;ta;P;D' file
This avoids slurping the file into memory.
or:
sed -e ':a' -e 'N' -e 's/\n_/ /' -e 'ta' -e 'P' -e 'D' file
A Perl approach:
perl -00pe 's/\n_/ /g' file
Here, the -00 causes perl to read the file in paragraph mode where a "line" is defined by two consecutive newlines. In your example, it will read the entire file into memory and therefore, a simple global substitution of \n_ with a space will work.
That is not very efficient for very large files though. If your data is too large to fit in memory, use this:
perl -ne 'chomp;
s/^_// ? print "$l " : print "$l\n" if $. > 1;
$l=$_;
END{print "$l\n"}' file
Here, the file is read line by line (-n) and the trailing newline removed from all lines (chomp). At the end of each iteration, the current line is saved as $l ($l=$_). At each line, if the substitution is successful and a _ was removed from the beginning of the line (s/^_//), then the previous line is printed with a space in place of a newline print "$l ". If the substitution failed, the previous line is printed with a newline. The END{} block just prints the final line of the file.

How to delete multiple empty lines with SED?

I'm trying to compress a text document by deleting duplicated empty lines with sed. This is what I'm doing (to no avail):
sed -i -E 's/\n{3,}/\n/g' file.txt
I understand that it's not correct, according to this manual, but I can't figure out how to do it correctly. Thanks.
I think you want to replace spans of multiple blank lines with a single blank line, even though your example replaces multiple runs of \n with a single \n instead of \n\n. With that in mind, here are two solutions:
sed '/^$/{ :l
N; s/^\n$//; t l
p; d; }' input
In many implementations of sed, that can be all on one line, with the embedded newlines replaced by ;.
awk 't || !/^$/; { t = !/^$/ }'
As tripleee suggested above, I'm using Perl instead of sed:
perl -0777pi -e 's/\n{3,}/\n\n/g'
Use the translate utility, tr:
tr -s '\n'
The -s or --squeeze-repeats option reduces each run of a repeated character to a single instance. With '\n' that means every empty line is removed, not only the duplicated ones.
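A short comparison shows the difference between squeezing newlines with tr and collapsing runs of blank lines with the awk one-liner from the earlier answer. The sample text and variable names are invented for this sketch.

```shell
sample=$(mktemp)
printf 'a\n\n\n\nb\n' > "$sample"    # "a", three empty lines, "b"

# tr squeezes the newline characters themselves, so no empty line survives.
tr_out=$(tr -s '\n' < "$sample")

# The awk version prints an empty line only when the previous line was not empty,
# so each run of empty lines collapses to one.
awk_out=$(awk 't || !/^$/; { t = !/^$/ }' "$sample")

printf 'tr:\n%s\nawk:\n%s\n' "$tr_out" "$awk_out"
rm -f "$sample"
```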
This is much better handled by tr -s '\n' or cat -s, but if you insist on sed, here's an example from section 4.17 of the GNU sed manual:
#!/usr/bin/sed -f
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ {
N
bx
}
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
I am not sure this is what the OP wanted but using the awk solution by William Pursell here is the approach if you want to delete ALL empty lines in the file:
awk '!/^$/' file.txt
Explanation:
The awk pattern
'!/^$/'
tests whether the current line is not empty: /^$/ matches a line that consists only of the beginning of a line (symbolised by '^') and the end of a line (symbolised by '$'), and the leading ! negates that match.
If this pattern is true, awk applies its default action and prints the current line.
HTH
I think the OP wants to compress empty lines, e.g. where there are 9 consecutive empty lines, he wants to have just three.
I have written a little bash script that does just that:
#! /bin/bash
TOTALLINES="$(cat file.txt|wc -l)"
CURRENTLINE=1
while [ $CURRENTLINE -le $TOTALLINES ]
do
L1=$CURRENTLINE
L2=$(($L1 + 1))
L3=$(($L1 +2))
if [[ $(head -n "$L1" file.txt | tail -n +"$L1") == "" ]] || [[ $(head -n "$L1" file.txt | tail -n +"$L1") == " " ]]
then
L1EMPTY=true
else
L1EMPTY=false
fi
if [[ $(head -n "$L2" file.txt | tail -n +"$L2") == "" ]] || [[ $(head -n "$L2" file.txt | tail -n +"$L2") == " " ]]
then
L2EMPTY=true
else
L2EMPTY=false
fi
if [[ $(head -n "$L3" file.txt | tail -n +"$L3") == "" ]] || [[ $(head -n "$L3" file.txt | tail -n +"$L3") == " " ]]
then
L3EMPTY=true
else
L3EMPTY=false
fi
if [ $L1EMPTY = true ]&&[ $L2EMPTY = true ]&&[ $L3EMPTY = true ]
then
#do not cat line to temp file
echo "Skipping line "$CURRENTLINE
else
echo "$(head -n "$CURRENTLINE" file.txt | tail -n +"$CURRENTLINE")" >> temp.txt
echo "Writing line " $CURRENTLINE
fi
((CURRENTLINE++))
done
cat temp.txt>file.txt
rm -r temp.txt
FINALTOTALLINES="$(cat file.txt|wc -l)"
EMPTYLINELINT=$(( $CURRENTLINE - $FINALTOTALLINES ))
echo "Deleted " $EMPTYLINELINT " empty lines."