Fix improper quotes in text - sed

I receive text from some writers that has a string like: string "string "string.
I want it to read string "string" string.
I've tried various sed tricks but none work.
Here is one failed attempt:
sed 's/.* "/.*"/g'

Your attempt fails for multiple reasons.
The wildcard .* will consume as much as it can in the string, meaning it will only ever allow a single substitution to happen (the final double quote in the string).
You cannot use .* in the substitution part -- what you substitute with is just a string, not a regular expression. The way to handle "whatever (part of) the regex matched" is through backreferences.
So here is a slightly less broken attempt:
sed 's/"\([^"]*\) "/"\1"/g' file
This will find a double quote, then find and capture anything which is not a double quote, then find a space and a double quote; and substitute the entire match with a double quote, the first captured expression (aka back reference, or backref), and another double quote. This should fix strings where the only problem is a number of spaces inside the closing double quotes, but not missing whitespace after the closing double quote, nor strings with leading spaces inside the double quotes or unpaired double quotes.
The lack of spaces after can easily be added;
sed 's/"\([^"]*\) " */"\1" /g;s/ $//' file
This will add a space after every closing double quote, and finally trim any space at end of line to fix up this corner case.
Now, you could either try to update the regex for leading spaces, or just do another pass with a similar regex for those. I would go with the latter approach, even though the former is feasible as well (but will require a much more complex regex, and the corner cases are harder to keep in your head).
sed 's/"\([^"]*\) " */"\1" /g;s/ $//;
s/ *" \([^"]*\)"/ "\1"/g;s/^ //' file
This will still fail for inputs with unbalanced double quotes, which are darn near impossible to handle completely automatically anyway (how do you postulate where to add a missing double quote?)

This may work for some cases but may fail with unbalanced quotes:
sed 's/"\([^"]*\S\)\s\s*"/"\1"/g'
to also add space after a quoted phrase, if a space is missing:
sed -e 's/"\([^"]*\S\)\s\s*"/"\1"/g' -e 's/\("[^"]*"\)\([^"]\)/\1 \2/g'

Here is an awk solution:
echo 'string "string "string.' | awk -F' "' '{for (i=1;i<=NF;i++) printf (i%2==0?"\"":"")"%s"(i%2==0?"\"":"")(i!=NF?" ":""),$i;print ""}'
string "string" string.
It looks at numbers of quotes, and every second quotes should be behind text.

Related

Output is not generating while running the bash script [duplicate]

In Bash, what are the differences between single quotes ('') and double quotes ("")?
Single quotes won't interpolate anything, but double quotes will. For example: variables, backticks, certain \ escapes, etc.
Example:
$ echo "$(echo "upg")"
upg
$ echo '$(echo "upg")'
$(echo "upg")
The Bash manual has this to say:
3.1.2.2 Single Quotes
Enclosing characters in single quotes (') preserves the literal value of each character within the quotes. A single quote may not occur between single quotes, even when preceded by a backslash.
3.1.2.3 Double Quotes
Enclosing characters in double quotes (") preserves the literal value of all characters within the quotes, with the exception of $, `, \, and, when history expansion is enabled, !. The characters $ and ` retain their special meaning within double quotes (see Shell Expansions). The backslash retains its special meaning only when followed by one of the following characters: $, `, ", \, or newline. Within double quotes, backslashes that are followed by one of these characters are removed. Backslashes preceding characters without a special meaning are left unmodified. A double quote may be quoted within double quotes by preceding it with a backslash. If enabled, history expansion will be performed unless an ! appearing in double quotes is escaped using a backslash. The backslash preceding the ! is not removed.
The special parameters * and # have special meaning when in double quotes (see Shell Parameter Expansion).
The accepted answer is great. I am making a table that helps in quick comprehension of the topic. The explanation involves a simple variable a as well as an indexed array arr.
If we set
a=apple # a simple variable
arr=(apple) # an indexed array with a single element
and then echo the expression in the second column, we would get the result / behavior shown in the third column. The fourth column explains the behavior.
#
Expression
Result
Comments
1
"$a"
apple
variables are expanded inside ""
2
'$a'
$a
variables are not expanded inside ''
3
"'$a'"
'apple'
'' has no special meaning inside ""
4
'"$a"'
"$a"
"" is treated literally inside ''
5
'\''
invalid
can not escape a ' within ''; use "'" or $'\'' (ANSI-C quoting)
6
"red$arocks"
red
$arocks does not expand $a; use ${a}rocks to preserve $a
7
"redapple$"
redapple$
$ followed by no variable name evaluates to $
8
'\"'
\"
\ has no special meaning inside ''
9
"\'"
\'
\' is interpreted inside "" but has no significance for '
10
"\""
"
\" is interpreted inside ""
11
"*"
*
glob does not work inside "" or ''
12
"\t\n"
\t\n
\t and \n have no special meaning inside "" or ''; use ANSI-C quoting
13
"`echo hi`"
hi
`` and $() are evaluated inside "" (backquotes are retained in actual output)
14
'`echo hi`'
`echo hi`
`` and $() are not evaluated inside '' (backquotes are retained in actual output)
15
'${arr[0]}'
${arr[0]}
array access not possible inside ''
16
"${arr[0]}"
apple
array access works inside ""
17
$'$a\''
$a'
single quotes can be escaped inside ANSI-C quoting
18
"$'\t'"
$'\t'
ANSI-C quoting is not interpreted inside ""
19
'!cmd'
!cmd
history expansion character '!' is ignored inside ''
20
"!cmd"
cmd args
expands to the most recent command matching "cmd"
21
$'!cmd'
!cmd
history expansion character '!' is ignored inside ANSI-C quotes
See also:
ANSI-C quoting with $'' - GNU Bash Manual
Locale translation with $"" - GNU Bash Manual
A three-point formula for quotes
If you're referring to what happens when you echo something, the single quotes will literally echo what you have between them, while the double quotes will evaluate variables between them and output the value of the variable.
For example, this
#!/bin/sh
MYVAR=sometext
echo "double quotes gives you $MYVAR"
echo 'single quotes gives you $MYVAR'
will give this:
double quotes gives you sometext
single quotes gives you $MYVAR
Others explained it very well, and I just want to give something with simple examples.
Single quotes can be used around text to prevent the shell from interpreting any special characters. Dollar signs, spaces, ampersands, asterisks and other special characters are all ignored when enclosed within single quotes.
echo 'All sorts of things are ignored in single quotes, like $ & * ; |.'
It will give this:
All sorts of things are ignored in single quotes, like $ & * ; |.
The only thing that cannot be put within single quotes is a single quote.
Double quotes act similarly to single quotes, except double quotes still allow the shell to interpret dollar signs, back quotes and backslashes. It is already known that backslashes prevent a single special character from being interpreted. This can be useful within double quotes if a dollar sign needs to be used as text instead of for a variable. It also allows double quotes to be escaped so they are not interpreted as the end of a quoted string.
echo "Here's how we can use single ' and double \" quotes within double quotes"
It will give this:
Here's how we can use single ' and double " quotes within double quotes
It may also be noticed that the apostrophe, which would otherwise be interpreted as the beginning of a quoted string, is ignored within double quotes. Variables, however, are interpreted and substituted with their values within double quotes.
echo "The current Oracle SID is $ORACLE_SID"
It will give this:
The current Oracle SID is test
Back quotes are wholly unlike single or double quotes. Instead of being used to prevent the interpretation of special characters, back quotes actually force the execution of the commands they enclose. After the enclosed commands are executed, their output is substituted in place of the back quotes in the original line. This will be clearer with an example.
today=`date '+%A, %B %d, %Y'`
echo $today
It will give this:
Monday, September 28, 2015
Since this is the de facto answer when dealing with quotes in Bash, I'll add upon one more point missed in the answers above, when dealing with the arithmetic operators in the shell.
The Bash shell supports two ways to do arithmetic operation, one defined by the built-in let command and the other the $((..)) operator. The former evaluates an arithmetic expression while the latter is more of a compound statement.
It is important to understand that the arithmetic expression used with let undergoes word-splitting, pathname expansion just like any other shell commands. So proper quoting and escaping need to be done.
See this example when using let:
let 'foo = 2 + 1'
echo $foo
3
Using single quotes here is absolutely fine here, as there isn't any need for variable expansions here. Consider a case of
bar=1
let 'foo = $bar + 1'
It would fail miserably, as the $bar under single quotes would not expand and needs to be double-quoted as
let 'foo = '"$bar"' + 1'
This should be one of the reasons, the $((..)) should always be considered over using let. Because inside it, the contents aren't subject to word-splitting. The previous example using let can be simply written as
(( bar=1, foo = bar + 1 ))
Always remember to use $((..)) without single quotes
Though the $((..)) can be used with double quotes, there isn't any purpose to it as the result of it cannot contain content that would need the double quote. Just ensure it is not single quoted.
printf '%d\n' '$((1+1))'
-bash: printf: $((1+1)): invalid number
printf '%d\n' $((1+1))
2
printf '%d\n' "$((1+1))"
2
Maybe in some special cases of using the $((..)) operator inside a single quoted string, you need to interpolate quotes in a way that the operator either is left unquoted or under double quotes. E.g., consider a case, when you are tying to use the operator inside a curl statement to pass a counter every time a request is made, do
curl http://myurl.com --data-binary '{"requestCounter":'"$((reqcnt++))"'}'
Notice the use of nested double quotes inside, without which the literal string $((reqcnt++)) is passed to the requestCounter field.
There is a clear distinction between the usage of ' ' and " ".
When ' ' is used around anything, there is no "transformation or translation" done. It is printed as it is.
With " ", whatever it surrounds, is "translated or transformed" into its value.
By translation/ transformation I mean the following:
Anything within the single quotes will not be "translated" to their values. They will be taken as they are inside quotes. Example: a=23, then echo '$a' will produce $a on standard output. Whereas echo "$a" will produce 23 on standard output.
A minimal answer is needed for people to get going without spending a lot of time as I had to.
The following is, surprisingly (to those looking for an answer), a complete command:
$ echo '\'
whose output is:
\
Backslashes, surprisingly to even long-time users of bash, do not have any meaning inside single quotes. Nor does anything else.

Append specific caracter at the end of each line

I have a file and I want to append a specific text, \0A, to the end of each of its lines.
I used this command,
sed -i s/$/\0A/ file.txt
but that didn't work with backslash \0A.
In its default operations, sed cyclically appends a line from input, less it's terminating <newline>-character, into the pattern space of sed.
The OP wants to use sed to append the character \0A at the end of a line. This is the hexadecimal representation of the <newline>-character (cfr. http://www.asciitable.com/). So from this perspective, the OP attempts to double space a files. This can be easilly done using:
sed G file
The G command, appends a newline followed by the content of the hold space to the pattern space. Since the hold space is always empty, it just appends a newline character to the pattern space. The default action of sed is to print the line. So this just double-spaces a file.
Your command should be fixed by simply enclosing s/$/\0A/ in single quotes (') and escaping the backslash (with another backslash):
sed -i 's/$/\\0A/' file.txt
Notice that the surrounding 's protect that string from being processed by the shell, but the bashslash still needed escape in order to protect it from SED itself.
Obviously, it's still possible to avoid the single quotes if you escape enough:
sed -i s/$/\\\\0A/ file.txt
In this case there are no single quotes to protect the string, so we need write \\ in the shell to get SED fed with \, but we need two of those \\, i.e. \\\\, so that SED is fed with \\, which is an escaped \.
Move obviously, I'd never ever suggest the second alternative.

confused about what must be escaped for sed

I want to replace specific strings in php files automatically using sed. Some work, and some do not. I already investigated this is not an issue with the replacement string but with the string that is to be replaced. I already tried to escape [ and ] with no success. It seems to be the whitespace within the () - not whitespaces in general. The first whitespaces (around the = ) do not have any problems. Please can someone point me to the problem:
sed -e "1,\$s/$adm = substr($path . rawurlencode($upload['name']) , 16);/$adm = rawurlencode($upload['name']); # fix 23/g" -i administration/identify.php
I already tried to shorten the string which should be replaced and the result was if I cut it directly behind $path it works, with the following whitespace it does not. Escaping whitespace has no effect...
what must be escaped for sed
The following characters have special meaning in sed and have to be escaped with \ for the regex to be taken literally:
\
[
the character used in separating s command parts, ie. / here
.
*
& only replacement string
Newline character is handled specially as the end of the string, but can be replaced for \n.
So first escape all special characters in input and then pass it to sed:
rgx="$adm = substr($path . rawurlencode($upload['name']) , 16);"
rgx_escaped=$(sed 's/[\\\[\.\*\/&]/\\&/g' <<<"$rgx")
sed "s/$rgx_escaped/ etc."
See Escape a string for a sed replace pattern for a generic escaping solution.
You may use
sed -i 's/\$adm = substr(\$path \. rawurlencode(\$upload\['"'"'name'"'"']) , 16);/$adm = rawurlencode($upload['"'"'name'"'"']); # fix 23/g' administration/identify.php
Note:
the sed command is basically wrapped in single quotes, the variable expansion won't occur inside single quotes
In the POSIX BRE syntax, ( matches a literal (, you do not need to escape ) either, but you need to escape [ and . that must match themselves
The single quotes require additional quoting with concatenation.

how to use sed/awk to remove words with multiple pattern count

I have a file of string records where one of the fields - delimited by "," - can contain one or more "-" inside it.
The goal is to delete the field value if it contains more than two "-".
i am trying to recoup my past knowledge of sed/awk but can't make much headway
==========
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
==========
expected outcome:
info,whitepaper,Data-Centers
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers
thanks
Try
sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g'
or if you're into slashes
sed 's/\(^\|,\)\([^,-]\+-\)\{3,\}[^,]\+\(,\|$\)/\3/g'
Explanation:
I'm using the most basic sed command: substitution. The syntax is: s/pattern/replacement/flags.
Here pattern is (^|,)([^,-]+-){3,}[^,]+(,|$), replacement is \3, flags is g.
The g flag means global replacement (all matching parts are replaced, not only the first in line).
In pattern:
brackets () create a group. Somewhat like in math. They also allow to refer to a group with a number later.
^ and $ mean beginning and end of the string.
| means "or", so (^|,) means "comma or beginning of the string".
square brackets [] mean a character class, ^ inside means negation. So [^,-] means "anything but comma or hyphen". Not that usually the hyphen has a special meaning in character classes: [a-z] means all lowercase letters. But here it's just a hyphen because it's not in the middle.
+ after an expression means "match it 1 or more times" (like * means match it 0 or more times).
{N} means "match it exactly N times. {N,M} is "from N to M times". {3,} means "three times or more". + is equivalent to {1,}.
So this is it. The replacement is just \3. This refers to the third group in (), in this case (,|$). This will be the only thing left after the substitution.
P.S. the -r option just changes what characters need to be escaped: without it all of ()-{}| are treated as regular chars unless you escape them with \. Conversely, to match literal ( with -r option you'll need to escape it.
P.P.S. Here's a reference for sed. man sed is your friend as well.
Let me know if you have further questions.
You could try perl instead of sed or awk:
perl -F, -lane 'print join ",", grep { !/-.*-.*-/ } #F' < file.txt
This might work for you:
sed 's/,\{,1\}[^,-]*\(-[^,]*\)\{3,\}//g file
sed 's/\(^\|,\)\([^,]*-\)\{3\}[^,]*\(,\|$\)//g'
This should work in more cases:
sed 's/,$/\n/g;s/\(^\|,\|\n\)\([^,\n]*-\)\{3\}[^,\n]*\(,\|\n\|$\)/\3/g;s/,$//;s/\n/,/g'

Confining Substitution to Match Space Using sed?

Is there a way to substitute only within the match space using sed?
I.e. given the following line, is there a way to substitute only the "." chars that are contained within the matching single quotes and protect the "." chars that are not enclosed by single quotes?
Input:
'ECJ-4YF1H10.6Z' ! 'CAP' ! '10.0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Desired result:
'ECJ-4YF1H10-6Z' ! 'CAP' ! '10_0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Or is this just a job to which perl or awk might be better suited?
Thanks for your help,
Mark
Give the following a try which uses the divide-and-conquer technique:
sed "s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g" inputfile
Explanation:
s/\('[^']*'\)/\n&\n/g - Add newlines before and after each pair of single quotes with their contents
s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g - Using a newline and the single quotes to key on, replace the dot with a dash for strings that end in "Z"
s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g - Using a newline and the single quotes to key on, replace the dot with a dash for strings that end in "uF"
s/\n//g - Remove the newlines added in the first step
You can restrict the command to acting only on certain lines:
sed "/foo/{s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g}" inputfile
where you would substitute some regex in place of "foo".
Some versions of sed like to be spoon fed (instead of semicolons between commands, use -e):
sed -e "/foo/{s/\('[^']*'\)/\n&\n/g" -e "s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g" -e "s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g" -e "s/\n//g}" inputfile
$ cat phoo1234567_sedFix.sed
#! /bin/sed -f
/'[0-9][0-9]\.[0-9][a-zA-Z][a-zA-Z]'/s/'\([0-9][0-9]\)\.\([0-9][a-zA-Z][a-zA-Z]\)'/\1_\2/
This answers your specific question. If the pattern you need to fix isn't always like the example you provided, they you'll need multiple copies of this line, with reg-expressions modified to match your new change targets.
Note that the cmd is in 2 parts, "/'[0-9][0-9].[0-9][a-zA-Z][a-zA-Z]'/" says, must match lines with this pattern, while the trailing "s/'([0-9][0-9]).([0-9][a-zA-Z][a-zA-Z])'/\1_\2/", is the part that does the substitution. You can add a 'g' after the final '/' to make this substitution happen on all instances of this pattern in each line.
The \(\) pairs in match pattern get converted into the numbered buffers on the substitution side of the command (i.e. \1 \2). This is what gives sed power that awk doesn't have.
If your going to do much of this kind of work, I highly recommend O'Rielly's Sed And Awk book. The time spent going thru how sed works will be paid back many times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer.
this is a job most suitable for awk or any language that supports breaking/splitting strings.
IMO, using sed for this task, which is regex based , while doable, is difficult to read and debug, hence not the most appropriate tool for the job. No offense to sed fanatics.
awk '{
for(i=1;i<=NF;i++) {
if ($i ~ /\047/ ){
gsub(".","_",$i)
}
}
}1' file
The above says for each field (field seperator by default is white space), check to see if there is a single quote, and if there is , substitute the "." to "_". This method is simple and doesn't need complicated regex.