sed: replace a pattern between two similar patterns, e.g. replace " with ' between two ","

I have a CSV file whose fields are enclosed in double quotes (") and separated by commas (,), e.g.:
"123","4"5""6","789"
However, the data itself may contain double quotes ("), e.g. 4"5""6, which I need to turn into single quotes ('), i.e. I need to transform
"123","4"5""6","789"
to
"123","4'5''6","789"
I've tried something like
sed "s/\(\",\"\)\(\"\|[^\(","\)]\)*\(\",\"\)/\1'\3/"g
but (\"\|[^\(","\)]\)* only expresses
match " OR not ","
whereas I probably need something like
match " AND not ","
Another approach might be to run sed in sequence, i.e.
find and match 4"5""6 first
pass the result to the next statement and replace it with 4'5''6
But I don't know exactly how to do either.
I could replace every " with ' first and then restore the CSV delimiters, but that seems costly, i.e.
sed -i -e "s/\"/'/g" -e "s/','/\",\"/g" -e "s/^'/\"/g" -e "s/'$/\"/g" myFile.csv

Try this:
$ sed ':a;s/\("[^,"]*\)"\([^,].*\)/\1'\''\2/;ta' <<< '"1"23","4"5""6","78"9"'
"1'23","4'5''6","78'9"
An opening double quote and the characters that follow it, up to (but excluding) the next ", are captured. If that next " is followed by anything other than a comma, it is an inner quote and is replaced by a single quote.
If the replacement succeeds, ta loops back to label a for further replacements.

echo '"123","4"5""6","789"'|sed -r ':a;s/^([^,]+,"[^,]*)"([^"]*",)/\1\x27\2/;ta'

You may use the following awk approach:
echo '"123","4"5""6","789"' | awk -F, '{OFS=","; $2="\""gensub(/\042/, "\047","g", substr($2, 2, length($2)-2))"\"";}1'
The output:
"123","4'5''6","789"
Explanation:
-F, (OFS=",") - treating , as field separator
"\042" - double quote ASCII octal code
"\047" - single quote ASCII octal code
substr($2, 2, length($2)-2) - extracting the second field without its enclosing double quotes, i.e. 4"5""6
gensub(/\042/, "\047","g", [target]) - substitutes all double quotes with single quotes within the target string (gensub is a GNU awk function)
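The awk command above rewrites only the second field and depends on gensub. As a rough sketch, the same idea can be applied to every field using only POSIX awk features. The function name fix_inner_quotes is made up for illustration. It assumes that every field is quoted and that no field contains the three-character sequence "," itself (such input would be ambiguous anyway).

```shell
# Hypothetical helper: turn double quotes inside quoted CSV fields into single quotes.
fix_inner_quotes() {
  awk '{
    line = $0
    # Drop the outermost quotes, then split on the "," delimiter.
    sub(/^"/, "", line)
    sub(/"$/, "", line)
    n = split(line, f, /","/)
    out = ""
    for (i = 1; i <= n; i++) {
      gsub(/"/, "\047", f[i])            # \047 is the single quote
      out = out (i > 1 ? "\",\"" : "") f[i]
    }
    print "\"" out "\""
  }'
}

printf '%s\n' '"123","4"5""6","789"' | fix_inner_quotes
# prints: "123","4'5''6","789"
```

Splitting on the full delimiter rather than on a bare comma is what keeps a field such as 4"5""6 in one piece.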

Another awk proposal:
echo '"123","4"5""6","789"' |awk '{sub(/4"5""/,"4\47"5"\47\47")}1'
"123","4'5''6","789"

find and match 4"5""6 first
pass the result to next statement and replace to 4'5''6
This is possible in perl
$ echo '"123","4"5""6","789"' | perl -pe 's/"\K[^,]+(?=")/$&=~s|"|\x27|gr/ge'
"123","4'5''6","789"
"\K[^,]+(?=") match column content, leave out outer double quotes with use of lookarounds
$&=~s|"|\x27|gr replace the double quotes found within column content with single quotes
The e modifier allows Perl code to be used in place of a plain replacement string
A workaround with sed, which involves messy branching:
$ echo '"123","4"5""6","789"' | sed -E ':a s/("[^,]+)"([^,]+")(,|$)/\1\x27\2\3/; ta'
"123","4'5''6","789"
:a mark label
("[^,]+)"([^,]+")(,|$) matches column content with at least one inner double quote
\1\x27\2\3 replace the inner double quote with single quote
ta branch to label a as long as there is a match

Identify and replace selective space inside given text file

I am new to sed. I need to selectively replace spaces with "," in a file whose content is shown below. Spaces inside "" must be left alone, but all other spaces need to be replaced.
File Content
my data "this is my very first encounter with sed"  "valuable" - - "c l e a r"
Used Pattern
Using sed to replace space with ","; the pattern is 's/ /,/g'
Actual Output
my,data,"this,is,my,very,first,encounter,with,sed",,"valuable",-,-,"c,l,e,a,r"
Expected Output
my,data,"this is my very first encounter with sed",,"valuable",-,-,"c l e a r"
The following commented sed script takes its input from a bash here-string:
<<<'my data "this is my very first encounter with sed"  "valuable" - - "c l e a r"' sed -E '
# Split input with each character on its own line
s/./&\n/g;
# Add a newline on the end to separate output from input
s/$/\n/;
# Each line has one character
# Add a leading character that stores "state"
# There are two states available - in quoting or not in quoting
# The state character is space when we are not in quotes
# The state character is double quote when we are in quotes
s/^/ /;
# For each character in input
:again; {
# Substitute a space that is not in quotes (state space followed by a space) for a comma
s/^  / ,/
# When quotes is encountered and we are not in quotes
/^ "/{
# Change state to quotes
s//""/
b removed_quotes
} ; {
# When quotes is encountered and we are in quotes
# then we are no longer in quotes
s/^""/ "/
} ; : removed_quotes
# Preserve state as the first character
# Add the parsed character to the output on the end
# Preserve the rest
s/^(.)(.)\n(.*)/\1\3\2/;
# If end of input was not reached, then parse another character.
/^.\n/!b again;
};
# Remove the leading state character with the newline
s///;
'
outputs:
my,data,"this is my very first encounter with sed",,"valuable",-,-,"c l e a r"
And a oneliner, because who reads these comments:
sed -E 's/./&\n/g;s/$/\n/;s/^/ /;:a;s/^  / ,/;/^ "/{s//""/;bq;};s/^""/ "/;:q;s/^(.)(.)\n(.*)/\1\3\2/;/^.\n/!ba;s///'
I believe a newline \n in the replacement string of the s command is an extension not required by POSIX. Another unique character could be used instead of the newline to separate the input while parsing. I tested this with GNU sed.
As mentioned in the comments, this is something better suited for an actual CSV parser instead of trying to kludge up something using regular expressions - especially sed's rather basic regular expressions.
A one-liner in perl using the useful Text::AutoCSV module (Install through your OS package manager or favorite CPAN client):
$ perl -MText::AutoCSV -e 'Text::AutoCSV->new(sep_char=>" ", out_sep_char=>",")->write' < input.txt
my,data,"this is my very first encounter with sed",,valuable,-,-,"c l e a r"
With GNU awk for FPAT:
$ awk -v FPAT='[^ ]*|"[^"]+"' -v OFS=',' '{$1=$1} 1' file
my,data,"this is my very first encounter with sed",,"valuable",-,-,"c l e a r"
Your input is a CSV where C in this case means "Character" instead of the traditional "Comma" and where the Character in question is a blank and you're just trying to convert it to a Comma-separated CSV. See What's the most robust way to efficiently parse CSV using awk? for more information on what the above does and on parsing CSVs with awk in general.
awk 'BEGIN {RS=ORS="\""} NR%2 {gsub(" ",",")} {print}' file
At the beginning, set the double quote as the record separator.
For odd records, i.e. outside quotes, replace globally any space with comma.
Print every record.
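One caveat, as far as I can tell: with RS and ORS both set to the double quote, the final record (the newline after the last closing quote) is also printed followed by ORS, so a stray " ends up at the very end of the output. A sketch that avoids this by keeping newline-terminated records and splitting each line on the quote instead is shown below. The function name spaces_to_commas is made up for illustration, and it assumes quotes are balanced on every line.

```shell
# Hypothetical helper: replace spaces with commas, but only outside double quotes.
spaces_to_commas() {
  # Split each line on double quotes: odd-numbered fields are outside quotes,
  # even-numbered fields are the quoted strings and are left untouched.
  awk -F'"' -v OFS='"' '{
    for (i = 1; i <= NF; i += 2) gsub(/ /, ",", $i)
    print
  }'
}

printf '%s\n' 'a b "c d" e' | spaces_to_commas
# prints: a,b,"c d",e
```

Setting OFS to the same character as the field separator means the line is reassembled with its quotes in place.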
This might work for you (GNU sed):
sed -E ':a;s/^((("[^"]*")*[^" ]*)*) /\1,/;ta' file
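Starting from the beginning of the line, this matches zero or more repetitions of (zero or more double-quoted strings followed by zero or more characters that are neither quotes nor spaces), followed by a space, and replaces the match with the captured prefix followed by a comma. The substitution is repeated until it fails.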

sed not working as expected when trying to replace "user='mysql'" with "user=`whoami`"

The following command fails.
sed 's/user=\'mysql\'/user=`whoami`/g' input_file
An example input_file contains the following line
user='mysql'
The corresponding expected output is
user=`whoami`
(Yes, I literally want whoami between backticks, I don't want it to expand my userid.)
This should be what you need:
Use double quotes to enclose the sed command, so that you are free to use single quotes inside it, and escape the backticks to prevent command substitution.
sed "s/user='mysql'/user=\`whoami\`/g" yourfile
I've intentionally omitted the -i option for the simple reason that it is not part of the issue.
To clarify the relation between single quotes and escaping, compare the following two commands
echo 'I didn\'t know'
echo 'I didn'\''t know'
The former will wait for further input as there's an open ', whereas the latter will work fine, as you are concatenating a single quoted string ('I didn'), an escaped single quote (\'), and another single quoted string ('t know').
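The quoting rules above can be checked with a few assignments. This is only a small sketch; the variable names are arbitrary.

```shell
# Ways to get a literal single quote or backtick past the shell.
a='I didn'\''t know'     # close the quote, add an escaped quote, reopen
b="I didn't know"        # inside double quotes a single quote is literal
c="user=\`whoami\`"      # escaped backticks: no command substitution
d='user=`whoami`'        # inside single quotes backticks are literal

printf '%s\n' "$a" "$b" "$c" "$d"

# The substitution from the answer, applied to the sample line.
replaced=$(printf '%s\n' "user='mysql'" | sed "s/user='mysql'/user=\`whoami\`/g")
printf '%s\n' "$replaced"
# prints: user=`whoami`
```

Here a and b hold the same string, as do c and d; which form to pick mostly depends on what else the sed expression has to contain.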

sed - Replace comma after first regex match

I'm trying to perform the following substitution on lines of the general format:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......
As you can see, the problem is that it's a comma-separated file in which one field contains a decimal comma. I would like to replace that comma with a dot (.).
I've tried the following to replace the first occurrence of a pattern after a match, but to no avail. Could someone help me?
sed -e '/,"/!b' -e "s/,/./"
sed -e '/"/!b' -e ':a' -e "s/,/\./"
Thanks in advance. An awk or perl solution would help me as well. Here's an awk effort:
gawk -F "," 'substr($10, 0, 3)==3 && length($10)==12 { gsub(/,/,".", $10); print}'
That yielded the same file unchanged.
CSV files should be parsed in awk with a proper FPAT variable (a GNU awk feature) that defines what constitutes a valid field in such a file. Once you do that, you can just iterate over the fields to do the substitution you need:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS="," }
{ for(i=1; i<=NF;i++) if ($i ~ /[,]/) gsub(/[,]/,".",$i);}1' file
See this answer of mine to understand how to define and parse CSV file content with FPAT variable. Also see Save modifications in place with awk to do in-place file modifications like sed -i''.
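Since FPAT is only available in GNU awk, here is a rough sketch of a fallback for other awks. It splits each line on the double quote, so the even-numbered pieces are the quoted fields, and it only touches pieces that look like a number with a decimal comma. The function name fix_decimal_commas is made up for illustration, and balanced quotes with no escaped quotes inside fields are assumed.

```shell
# Hypothetical helper: turn "109,07" into "109.07" inside quoted fields only.
fix_decimal_commas() {
  awk -F'"' -v OFS='"' '{
    # Even-numbered pieces are the contents of quoted fields.
    for (i = 2; i <= NF; i += 2)
      if ($i ~ /^[-+]?[0-9]+,[0-9]+$/) sub(/,/, ".", $i)
    print
  }'
}

printf '%s\n' 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC' |
  fix_decimal_commas
# prints: BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
```

Quoted fields that are not numeric, such as "a,b", are left as they are.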
The following (GNU) sed will convert the decimal comma in all quoted numeric fields:
sed 's/"\([-+]\?[0-9]*\),\([0-9]\+\([eE][-+]\?[0-9]\+\)\?\)"/"\1.\2"/g'
See: https://www.regular-expressions.info/floatingpoint.html
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^",]*"[^"]*)*"[^",]*),/\1./;ta' file
This regexp matches a , inside a pair of "s and replaces it with a dot (.). Because the regexp is anchored to the start of the line, the substitution has to be repeated until nothing more matches. That is what the :a label and the ta command do: they loop for as long as a substitution succeeds.
N.B. The solution expects all double quotes to be paired and none of them to be escaped, i.e. \" must not appear in a line.
If your input always follows that format of only one quoted field containing 1 comma then all you need is:
$ sed 's/\([^"]*"[^"]*\),/\1./' file
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC, .......
If it's more complicated than that then see What's the most robust way to efficiently parse CSV using awk?.
Assuming you have this:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC
Try this:
awk -F',' '{print $1,$2,$3,$4"."$5,$6,$7}' filename | awk '$1=$1' FS=" " OFS=","
Output will be:
BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109.07",DF,CCCCCCCCCCC
You simply need to know the field numbers for replacing the field separator between them.
In order to use Perl-style regular expressions you have to enable extended regular expressions with -r.
So if you want to replace the comma in all such numbers and drop the " signs, you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/g'
If you want to replace the first occurrence only, you can use this:
echo 'BBBBBBB.2018_08,XXXXXXXXXXXXX,01/01/2014,"109,07",DF,CCCCCCCCCCC, .......'|sed -r 's/\"([0-9]+)\,([0-9]+)\"/\1\.\2/1'
https://www.gnu.org/software/sed/manual/sed.txt

sed pattern negation with a comma separated line

I have a text file full of lines looking like:
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
I am trying to change all of the commas , to pipes |, except for the commas within the quotes.
Trying to use sed (which I am new to)... and it is not working. Using:
sed '/".*"/!s/\,/|/g' textfile.csv
Any thoughts?
As a test case, consider this file:
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
foo,foo,"x,y,z",foo,"a,b,c",foo,"yes,no"
"x,y,z",foo,"a,b,c",foo,"yes,no",foo
Here is a sed command to replace non-quoted commas with pipe symbols:
$ sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*),/\1|/g; t a' file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
foo|foo|"x,y,z"|foo|"a,b,c"|foo|"yes,no"
"x,y,z"|foo|"a,b,c"|foo|"yes,no"|foo
Explanation
This looks for commas that appear after pairs of double quotes and replaces them with pipe symbols.
:a
This defines a label a.
s/^([^"]*("[^"]*"[^"]*)*),/\1|/g
If 0, 2, 4, or any other even number of quotes precede a comma on the line, then replace that comma with a pipe symbol.
^
This matches at the start of the line.
(
This starts the main grouping (\1).
[^"]*
This looks for zero or more non-quote characters.
("[^"]*"[^"]*)*
The * outside the parens means that we are looking for zero or more of the pattern inside the parens. The pattern inside the parens consists of a quote, any number of non-quotes, a quote and then any number of non-quotes.
In other words, this grouping only matches pairs of quotes. Because of the * outside the parens, it can match any even number of quotes.
)
This closes the main grouping
,
This requires that the grouping be followed by a comma.
t a
If the previous s command successfully made a substitution, then the test command tells sed to jump back to label a and try again.
If no substitution was made, then we are done.
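To watch the loop at work, the same substitution can be applied one step at a time from the shell. This is only an illustrative sketch with a made-up sample line. It uses -E instead of -r, and the shell while loop stands in for sed's :a / t a pair.

```shell
line='foo,"x,y",bar,baz'

# Apply the substitution once per pass and stop when nothing changes,
# which mirrors what ":a ... t a" does inside a single sed invocation.
while :; do
  next=$(printf '%s\n' "$line" | sed -E 's/^([^"]*("[^"]*"[^"]*)*),/\1|/')
  [ "$next" = "$line" ] && break
  line=$next
  printf '%s\n' "$line"
done
# prints:
# foo,"x,y",bar|baz
# foo,"x,y"|bar|baz
# foo|"x,y"|bar|baz
```

Because the match is anchored at the start and is as long as possible, each pass replaces the last remaining unquoted comma, so the line is converted from right to left.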
Using awk could be easier:
kent$ cat f
foo,foo,"x,y,z",foo,"a,b,c",foo,"yes,no"
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
kent$ awk -F'"' -v OFS='"' '{for(i=1;i<=NF;i++)if(i%2)gsub(",","|",$i)}7' f
foo|foo|"x,y,z"|foo|"a,b,c"|foo|"yes,no"
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
I suggest a language with a proper CSV parser. For example:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
Female|$0 to $25,000|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
Here I would use GNU awk's FPAT. It defines what a field looks like, as opposed to FS, which defines what the separator is. Then you can just set the output separator to |
awk '{$1=$1}1' OFS=\| FPAT="([^,]+)|(\"[^\"]+\")" file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
If your awk does not support FPAT, this can be used:
awk -F, '{for (i=1;i<NF;i++) {c+=gsub(/\"/,"&",$i);printf "%s"(c%2?FS:"|"),$i}print $NF}' file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
sed 's/"\(.*\),\(.*\)"/"\1##HOLD##\2"/g;s/,/|/g;s/##HOLD##/,/g'
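This matches the text in quotes and puts a placeholder in place of the comma, then switches all the other commas to pipes and turns the placeholder back into a comma. You can change the ##HOLD## text to whatever you want. Note that, because .* is greedy, only one comma per line is protected this way, so it fits lines like the example, which have a single quoted field containing a single comma.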

sed: Replace part of a line

How can one replace a part of a line with sed?
The line
DBSERVERNAME xxx
should be replaced with:
DBSERVERNAME yyy
The value xxx can vary and there are two tabs between dbservername and the value. This name-value pair is one of many from a configuration file.
I tried with the following backreference:
echo "DBSERVERNAME xxx" | sed -rne 's/\(dbservername\)[[:blank:]]+\([[:alpha:]]+\)/\1 yyy/gip'
and that resulted in an error: invalid reference \1 on `s' command's RHS.
What's wrong with the expression? Using GNU sed.
This works:
sed -rne 's/(dbservername)\s+\w+/\1 yyy/gip'
(When you use the -r option, you don't have to escape the parens.)
Bit of explanation:
-r is extended regular expressions - makes a difference to how the regex is written.
-n does not print unless specified - sed prints by default otherwise,
-e means what follows it is an expression. Let's break the expression down:
s/// is the command for search-replace, and what's between the first pair is the regex to match, and the second pair the replacement,
gip, which follows the search replace command; g means global, i.e., every match instead of just the first will be replaced in a line; i is case-insensitivity; p means print when done (remember the -n flag from earlier!),
The parentheses define a capture group, which will come up later. So dbservername is the first captured group,
\s is whitespace, + means one or more (vs *, zero or more) occurrences,
\w is a word character, that is any letter, digit or underscore,
\1 is a backreference: in the replacement it inserts the text matched by the first parenthesized group.
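The difference in escaping between basic and extended regular expressions can be seen side by side in the sketch below. It is a variation on the answer, not the original command: it captures the whitespace as well so that the two tabs survive, and it matches the upper-case name directly instead of using the i flag. The variable names are arbitrary.

```shell
input=$(printf 'DBSERVERNAME\t\txxx')

# Basic regular expressions (the default): groups are written \( ... \).
bre=$(printf '%s\n' "$input" |
  sed 's/\(DBSERVERNAME[[:blank:]][[:blank:]]*\)[^[:blank:]]*$/\1yyy/')

# Extended regular expressions (-E, or -r in GNU sed): groups are plain ( ... ).
ere=$(printf '%s\n' "$input" |
  sed -E 's/(DBSERVERNAME[[:blank:]]+)[^[:blank:]]+$/\1yyy/')

printf '%s\n' "$bre" "$ere"
```

Both commands should print the name, the two original tabs, and yyy. Mixing the two syntaxes is what triggers the "invalid reference \1" error, because under -r the escaped \( and \) match literal parentheses and no group is defined.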
Others have already mentioned the escaping of parentheses, but why do you need a back reference at all, if the first part of the line is constant?
You could simply do
sed -e 's/DBSERVERNAME.*$/DBSERVERNAME yyy/'
You're escaping your ( and ). With -r you don't need to do that. Try:
sed -rne 's/(dbservername)[[:blank:]]+([[:alpha:]]+)/\1 yyy/gip'
You shouldn't be escaping the parentheses when you use -r (extended regular expressions), i.e.
echo "DBSERVERNAME xxx" | sed -rne 's/(dbservername[[:blank:]]+)([[:alpha:]]+)/\1yyy/gip'
You shouldn't be escaping your parens. Try:
echo "DBSERVERNAME xxx" | sed -rne 's/(dbservername)[[:blank:]]+([[:alpha:]]+)/\1 yyy/gip'
This might work for you:
echo "DBSERVERNAME xxx" | sed 's/\S*$/yyy/'
DBSERVERNAME yyy
Try this
sed -re 's/(DBSERVERNAME[[:blank:]]+)\S+/\1yyy/ig' temp.txt
or this
awk '{if($1=="DBSERVERNAME") $2 ="yyy"} {print $0;}' temp.txt