In a file, I would like to replace every occurrence of a dot within braces with an underscore.
input
something.dots {test.test} foo.bar
another.line
expected output
something.dots {test_test} foo.bar
another.line
What would be the easiest way to achieve that?
You can choose the least ugly sed from the two options below:
$ cat file
something.dots {test.test} foo.bar {a.a} x
something.dots
$ sed 's|\({[^}]*\)\.\([^}]*}\)|\1_\2|g' file
something.dots {test_test} foo.bar {a_a} x
something.dots
$ sed -E 's|(\{[^}]*)\.([^}]*\})|\1_\2|g' file
something.dots {test_test} foo.bar {a_a} x
something.dots
Explanation (I'll use the last form, but they are equivalent):
(\{[^}]*): Matching group 1 consisting of a {, and any number of non-} characters.
\.: A dot.
([^}]*\}): Matching group 2 consisting of any number of non-} characters followed by a }.
If found, replace the whole expression by [Matching group 1]_[Matching group 2].
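Note that each pass of this substitution replaces one dot per pair of braces, so {a.b.c} would still keep a dot. A minimal sketch of a looping variant, assuming non-nested braces and a sed that supports labels and the t command; the function name braces_dots_loop is made up for illustration:

```shell
# Hypothetical helper: repeats the substitution until no dot is left
# inside any {...} pair on the line.
braces_dots_loop() {
  sed -e ':a' -e 's/\({[^}]*\)\.\([^}]*}\)/\1_\2/' -e 'ta' "$@"
}

printf '%s\n' 'x.y {a.b.c} p.q {d.e}' | braces_dots_loop
# prints: x.y {a_b_c} p.q {d_e}
```

The t command jumps back to label a only when the previous s command changed something, so the loop stops as soon as the braces contain no more dots.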
easiest way
Hold the line, extract the part within braces, do the substitution, then append the held line and rearrange the pieces for the output.
sed 'h;s/.*{//;s/}.*//;s/\./_/g;G;s/^\(.*\)\n\(.*{\).*}/\2\1}/'
#edit - ignore lines without {.*}:
sed '/{.*}/!b; h;s/.*{//;s/}.*//;s/\./_/g;G;s/^\(.*\)\n\(.*{\).*}/\2\1}/'
Tested on repl.
If it's going to be the "easiest way" use AWK instead of sed and then:
awk -F'[{}]' '!/\{.*\}/ {print; next} {gsub(/\./,"_",$2); print $1 "{" $2 "}" $3}' file
This replaces any number of dots, e.g. {test.test.test}, and leaves lines without braces unchanged. It assumes at most one pair of braces per line.
Explanation:
-F'[{}]' Sets the field separator to { or }
!/\{.*\}/ {print; next} Prints lines without a {...} pattern unchanged and skips to the next line
gsub(/\./,"_",$2) Substitutes _ for every . in field 2 (the text inside the braces)
print $1 "{" $2 "}" $3 Reassembles and prints the line after the change
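The awk one-liner above only reassembles the first three fields, so a second pair of braces on the same line is lost. A rough sketch that handles several pairs per line, assuming the braces are balanced and not nested; the function name dots_to_underscores is hypothetical:

```shell
# Hypothetical helper: after splitting on { and }, even-numbered fields
# are the parts that were inside braces.
dots_to_underscores() {
  awk -F'[{}]' '{
    out = $1
    for (i = 2; i <= NF; i++) {
      if (i % 2 == 0) {
        inner = $i
        gsub(/\./, "_", inner)      # dots inside braces only
        out = out "{" inner
      } else {
        out = out "}" $i            # text after the closing brace, untouched
      }
    }
    print out
  }' "$@"
}

printf '%s\n' 'something.dots {test.test} foo.bar {a.b.c} x' | dots_to_underscores
# prints: something.dots {test_test} foo.bar {a_b_c} x
```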
I have a csv file whose fields are delimited by double quote (") and comma (,), e.g:
"123","4"5""6","789"
However, there may be some double quotes (") within the data, i.e. 4"5""6, which I need to transform into single quotes ('), i.e.
I need to transform
"123","4"5""6","789"
to
"123","4'5''6","789"
I've tried something like
sed "s/\(\",\"\)\(\"\|[^\(","\)]\)*\(\",\"\)/\1'\3/"g
but (\"\|[^\(","\)]\)* only
match " OR not ","
but maybe I need something like
match " AND not ","
Another approach may be to run sed sequentially, i.e.
find and match 4"5""6 first
pass the result to next statement and replace it with 4'5''6
But for both ways, I don't know exactly how to do it.
I could replace every " with ' first and then re-format my csv, but that seems costly, i.e.
sed -i -e "s/\"/'/g" -e "s/','/\",\"/g" -e "s/^'/\"/g" -e "s/'$/\"/g" myFile.csv
Try this:
$ sed ':a;s/\("[^,"]*\)"\([^,].*\)/\1'\''\2/;ta' <<< '"1"23","4"5""6","78"9"'
"1'23","4'5''6","78'9"
The opening double quote and the characters up to (but excluding) the next " are captured as group 1; that next ", when it is followed by something other than a comma, is replaced with a single quote.
If replacement succeeds, ta loops to the beginning of the script for further replacements.
echo '"123","4"5""6","789"'|sed -r ':a;s/^([^,]+,"[^,]*)"([^"]*",)/\1\x27\2/;ta'
You may use the following awk approach:
echo '"123","4"5""6","789"' | awk -F, '{OFS=","; $2="\""gensub(/\042/, "\047","g", substr($2, 2, length($2)-2))"\"";}1'
The output:
"123","4'5''6","789"
Explanation:
-F, (OFS=",") - treating , as field separator
"\042" - double quote ASCII octal code
"\047" - single quote ASCII octal code
substr($2, 2, length($2)-2) - extracts the second field without its enclosing double quotes, i.e. 4"5""6
gensub(/\042/, "\047","g", [target]) - substitutes all double quotes with single quotes within the target string (gensub is a GNU awk extension)
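gensub is specific to GNU awk, and the command above only touches the second field. A minimal sketch of a variant that uses only split and gsub and processes every field, assuming each line is non-empty, every field is quoted and fields are separated by exactly ","; the function name fix_inner_quotes is made up for illustration:

```shell
# Hypothetical helper: strips the outermost quotes, splits on the ","
# separator, and turns every remaining double quote into a single quote.
fix_inner_quotes() {
  awk '{
    body = substr($0, 2, length($0) - 2)     # drop first and last quote
    n = split(body, f, /","/)                # separator between fields
    out = ""
    for (i = 1; i <= n; i++) {
      gsub(/"/, "\047", f[i])                # \047 is the single quote
      out = out (i > 1 ? "\",\"" : "") f[i]
    }
    print "\"" out "\""
  }' "$@"
}

printf '%s\n' '"123","4"5""6","789"' | fix_inner_quotes
```

A field whose data itself contains the three characters "," cannot be told apart from a field boundary, which is a limit of the input format rather than of this sketch.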
Another awk proposal:
echo '"123","4"5""6","789"' |awk '{sub(/4"5""/,"4\47"5"\47\47")}1'
"123","4'5''6","789"
find and match 4"5""6 first
pass the result to next statement and replace it with 4'5''6
This is possible in perl
$ echo '"123","4"5""6","789"' | perl -pe 's/"\K[^,]+(?=")/$&=~s|"|\x27|gr/ge'
"123","4'5''6","789"
"\K[^,]+(?=") match column content, leave out outer double quotes with use of lookarounds
$&=~s|"|\x27|gr replace the double quotes found within column content with single quotes
The e modifier used allows this usage of Perl code instead of replacement string
Workaround with sed, involves messy branches
$ echo '"123","4"5""6","789"' | sed -E ':a s/("[^,]+)"([^,]+")(,|$)/\1\x27\2\3/; ta'
"123","4'5''6","789"
:a mark label
("[^,]+)"([^,]+")(,|$) matches column content with at least one inner double quote
\1\x27\2\3 replace the inner double quote with single quote
ta branch to label a as long as there is a match
I have a text file full of lines looking like:
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
I am trying to change all of the commas , to pipes |, except for the commas within the quotes.
Trying to use sed (which I am new to)... and it is not working. Using:
sed '/".*"/!s/\,/|/g' textfile.csv
Any thoughts?
As a test case, consider this file:
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
foo,foo,"x,y,z",foo,"a,b,c",foo,"yes,no"
"x,y,z",foo,"a,b,c",foo,"yes,no",foo
Here is a sed command to replace non-quoted commas with pipe symbols:
$ sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*),/\1|/g; t a' file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
foo|foo|"x,y,z"|foo|"a,b,c"|foo|"yes,no"
"x,y,z"|foo|"a,b,c"|foo|"yes,no"|foo
Explanation
This looks for commas that appear after pairs of double quotes and replaces them with pipe symbols.
:a
This defines a label a.
s/^([^"]*("[^"]*"[^"]*)*),/\1|/g
If 0, 2, 4, or any other even number of quotes precede a comma on the line, then replace that comma with a pipe symbol.
^
This matches at the start of the line.
(
This starts the main grouping (\1).
[^"]*
This looks for zero or more non-quote characters.
("[^"]*"[^"]*)*
The * outside the parens means that we are looking for zero or more of the pattern inside the parens. The pattern inside the parens consists of a quote, any number of non-quotes, a quote and then any number of non-quotes.
In other words, this grouping only matches pairs of quotes. Because of the * outside the parens, it can match any even number of quotes.
)
This closes the main grouping
,
This requires that the grouping be followed by a comma.
t a
If the previous s command successfully made a substitution, then the test command tells sed to jump back to label a and try again.
If no substitution was made, then we are done.
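The "even number of quotes before the comma" rule can also be written as a small state machine, which some may find easier to read than the regex. A sketch in awk, assuming quotes are balanced on each line and there are no escaped quotes inside fields; the function name commas_to_pipes is hypothetical:

```shell
# Hypothetical helper: walks each line character by character and tracks
# whether the current position is inside a double-quoted field.
commas_to_pipes() {
  awk '{
    out = ""; in_quotes = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") in_quotes = !in_quotes        # a quote toggles the state
      else if (c == "," && !in_quotes) c = "|"     # comma outside quotes
      out = out c
    }
    print out
  }' "$@"
}

printf '%s\n' 'foo,"x,y,z",foo' | commas_to_pipes
# prints: foo|"x,y,z"|foo
```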
Using awk could be easier:
kent$ cat f
foo,foo,"x,y,z",foo,"a,b,c",foo,"yes,no"
Female,"$0 to $25,000",Arlington Heights,0,60462,ZD111326,9/18/13 0:21,Disk Drive
kent$ awk -F'"' -v OFS='"' '{for(i=1;i<=NF;i++)if(i%2)gsub(",","|",$i)}7' f
foo|foo|"x,y,z"|foo|"a,b,c"|foo|"yes,no"
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
I suggest a language with a proper CSV parser. For example:
ruby -rcsv -ne 'puts CSV.generate_line(CSV.parse_line($_), :col_sep=>"|")' file
Female|$0 to $25,000|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
Here I would have used GNU awk's FPAT. It defines what a field looks like, whereas FS defines what the separator is. Then you can just set the output separator to |
awk '{$1=$1}1' OFS=\| FPAT="([^,]+)|(\"[^\"]+\")" file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
If your awk does not support FPAT, this can be used:
awk -F, '{for (i=1;i<NF;i++) {c+=gsub(/\"/,"&",$i);printf "%s"(c%2?FS:"|"),$i}print $NF}' file
Female|"$0 to $25,000"|Arlington Heights|0|60462|ZD111326|9/18/13 0:21|Disk Drive
sed 's/"\(.*\),\(.*\)"/"\1##HOLD##\2"/g;s/,/|/g;s/##HOLD##/,/g'
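This matches the text in quotes and puts a placeholder in place of a comma, then switches all the other commas to pipes and turns the placeholder back into a comma. Because .* is greedy, it protects only one comma per line, so it suits input like the example, with a single quoted field containing a single comma. You can change the ##HOLD## text to whatever you want.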
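Since .* is greedy, the single substitution above can protect only one comma per line. A sketch of the same placeholder idea with a loop, so that every comma preceded by an odd number of quotes is protected; it assumes balanced quotes and that ##HOLD## never occurs in the data, and the function name protect_quoted_commas is made up for illustration:

```shell
# Hypothetical helper. The first s command matches zero or more complete
# "..." pairs, then an opening quote, then a comma before the closing
# quote, and swaps that comma for the placeholder; ta repeats until no
# quoted comma is left.
protect_quoted_commas() {
  sed -e ':a' \
      -e 's/^\(\([^"]*"[^"]*"\)*[^"]*"[^",]*\),/\1##HOLD##/' \
      -e 'ta' \
      -e 's/,/|/g' \
      -e 's/##HOLD##/,/g' "$@"
}

printf '%s\n' 'foo,"x,y,z",foo,"a,b,c"' | protect_quoted_commas
# prints: foo|"x,y,z"|foo|"a,b,c"
```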
I have a file of string records where one of the fields - delimited by "," - can contain one or more "-" inside it.
The goal is to delete the field value if it contains more than two "-".
I am trying to recoup my past knowledge of sed/awk but can't make much headway.
==========
info,whitepaper,Data-Centers,yes-the-6-top-problems-in-your-data-center-lane
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner
==========
expected outcome:
info,whitepaper,Data-Centers
info,whitepaper,Data-Centers,the-evolution-center
info,whitepaper,Data-Centers
thanks
Try
sed -r 's/(^|,)([^,-]+-){3,}[^,]+(,|$)/\3/g'
or if you're into slashes
sed 's/\(^\|,\)\([^,-]\+-\)\{3,\}[^,]\+\(,\|$\)/\3/g'
Explanation:
I'm using the most basic sed command: substitution. The syntax is: s/pattern/replacement/flags.
Here pattern is (^|,)([^,-]+-){3,}[^,]+(,|$), replacement is \3, flags is g.
The g flag means global replacement (all matching parts are replaced, not only the first in line).
In pattern:
brackets () create a group. Somewhat like in math. They also allow to refer to a group with a number later.
^ and $ mean beginning and end of the string.
| means "or", so (^|,) means "comma or beginning of the string".
square brackets [] mean a character class, ^ inside means negation. So [^,-] means "anything but comma or hyphen". Note that usually the hyphen has a special meaning in character classes: [a-z] means all lowercase letters. But here it's just a hyphen because it's not in the middle.
+ after an expression means "match it 1 or more times" (like * means match it 0 or more times).
{N} means "match it exactly N times". {N,M} is "from N to M times". {3,} means "three times or more". + is equivalent to {1,}.
So this is it. The replacement is just \3. This refers to the third group in (), in this case (,|$). This will be the only thing left after the substitution.
P.S. the -r option just changes which characters need to be escaped: without it, all of ()+{}| are treated as regular chars unless you escape them with \. Conversely, to match a literal ( with the -r option you'll need to escape it.
P.P.S. Here's a reference for sed. man sed is your friend as well.
Let me know if you have further questions.
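For comparison, the same rule can be expressed by counting hyphens per field instead of matching them with one regex. A rough sketch in awk, relying on gsub returning the number of replacements it made; the function name drop_hyphen_heavy_fields is hypothetical:

```shell
# Hypothetical helper: keeps a field only if it contains at most two hyphens.
drop_hyphen_heavy_fields() {
  awk -F, '{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++) {
      tmp = $i
      if (gsub(/-/, "-", tmp) <= 2) {   # gsub returns the hyphen count
        out = out sep $i
        sep = ","
      }
    }
    print out
  }' "$@"
}

printf '%s\n' 'info,whitepaper,Data-Centers,the-evolution-of-lan-technology-lanner' |
  drop_hyphen_heavy_fields
# prints: info,whitepaper,Data-Centers
```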
You could try perl instead of sed or awk:
perl -F, -lane 'print join ",", grep { !/-.*-.*-/ } @F' < file.txt
This might work for you:
sed 's/,\{,1\}[^,-]*\(-[^,]*\)\{3,\}//g' file
sed 's/\(^\|,\)\([^,]*-\)\{3\}[^,]*\(,\|$\)//g'
This should work in more cases:
sed 's/,$/\n/g;s/\(^\|,\|\n\)\([^,\n]*-\)\{3\}[^,\n]*\(,\|\n\|$\)/\3/g;s/,$//;s/\n/,/g'
Is there a way to substitute only within the match space using sed?
I.e. given the following line, is there a way to substitute only the "." chars that are contained within the matching single quotes and protect the "." chars that are not enclosed by single quotes?
Input:
'ECJ-4YF1H10.6Z' ! 'CAP' ! '10.0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Desired result:
'ECJ-4YF1H10-6Z' ! 'CAP' ! '10_0uF' ! 'TOL' ; MGCDC1008.S1 MGCDC1009.A2
Or is this just a job to which perl or awk might be better suited?
Thanks for your help,
Mark
Give the following a try which uses the divide-and-conquer technique:
sed "s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g" inputfile
Explanation:
s/\('[^']*'\)/\n&\n/g - Add newlines before and after each pair of single quotes with their contents
s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g - Using a newline and the single quotes to key on, replace the dot with a dash for strings that end in "Z"
s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g - Using a newline and the single quotes to key on, replace the dot with an underscore for strings that end in "uF"
s/\n//g - Remove the newlines added in the first step
You can restrict the command to acting only on certain lines:
sed "/foo/{s/\('[^']*'\)/\n&\n/g;s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g;s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g;s/\n//g}" inputfile
where you would substitute some regex in place of "foo".
Some versions of sed like to be spoon fed (instead of semicolons between commands, use -e):
sed -e "/foo/{s/\('[^']*'\)/\n&\n/g" -e "s/\(\n'[^.]*\)\.\([^']*Z'\)/\1-\2/g" -e "s/\(\n'[^.]*\)\.\([^']*uF'\)/\1_\2/g" -e "s/\n//g}" inputfile
$ cat phoo1234567_sedFix.sed
#! /bin/sed -f
/'[0-9][0-9]\.[0-9][a-zA-Z][a-zA-Z]'/s/'\([0-9][0-9]\)\.\([0-9][a-zA-Z][a-zA-Z]\)'/'\1_\2'/
This answers your specific question. If the pattern you need to fix isn't always like the example you provided, then you'll need multiple copies of this line, with the regular expressions modified to match your new change targets.
Note that the cmd is in 2 parts: "/'[0-9][0-9]\.[0-9][a-zA-Z][a-zA-Z]'/" says the line must match this pattern, while the trailing "s/'\([0-9][0-9]\)\.\([0-9][a-zA-Z][a-zA-Z]\)'/'\1_\2'/" is the part that does the substitution. You can add a 'g' after the final '/' to make this substitution happen on all instances of this pattern in each line.
The \(\) pairs in the match pattern get converted into the numbered buffers on the substitution side of the command (i.e. \1 \2). This is what gives sed power that awk's sub and gsub don't have (GNU awk's gensub does support such back-references).
If you're going to do much of this kind of work, I highly recommend O'Reilly's sed & awk book. The time spent going through how sed works will be paid back many times.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer.
this is a job most suitable for awk or any language that supports breaking/splitting strings.
IMO, using sed for this task, which is regex based , while doable, is difficult to read and debug, hence not the most appropriate tool for the job. No offense to sed fanatics.
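awk '{
  for (i = 1; i <= NF; i++) {
    # \047 is the single quote; only touch fields that contain one
    if ($i ~ /\047/) {
      # /\./ matches a literal dot (a bare "." would match any character)
      gsub(/\./, "_", $i)
    }
  }
}1' file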
The above says: for each field (the field separator is white space by default), check whether it contains a single quote and, if it does, substitute "_" for each ".". This method is simple and doesn't need a complicated regex.
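Splitting on whitespace breaks down if a quoted string contains a space. A variation, assuming the single quotes are balanced on each line, is to split on the quote character itself, so that the even-numbered fields are exactly the quoted parts. The function name dots_in_quotes is made up for illustration, and the sketch uses an underscore for every dot, so it does not reproduce the dash in the first field of the desired result:

```shell
# Hypothetical helper: with the single quote as field separator, fields
# 2, 4, 6, ... are the contents of the quoted strings.
dots_in_quotes() {
  awk -F"'" -v OFS="'" '{
    for (i = 2; i <= NF; i += 2) gsub(/\./, "_", $i)
    print
  }' "$@"
}

printf '%s\n' "'10.0uF' ! 'TOL' ; MGCDC1008.S1" | dots_in_quotes
```

Setting OFS to the same quote character means that awk puts the quotes back when it rebuilds the line after a field has changed.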