Replace every " within string - sed

I have lines in a text file which look like this example:
"2009217",2015,3,"N","N","2","UPPER DARBY FIREFIGHTERS "PAC"","","","","7235 WEST CHESTER PIKE","","UPPER DARBY","PA","19082","","6106220269",4245.0100,650.0000,.0000
I want to remove every stray double quote inside quoted fields like "UPPER DARBY FIREFIGHTERS "PAC"" across the whole file.
So the result should be as below for each instance of the recurring double quotes:
"2009217",2015,3,"N","N","2","UPPER DARBY FIREFIGHTERS PAC","","","","7235 WEST CHESTER PIKE","","UPPER DARBY","PA","19082","","6106220269",4245.0100,650.0000,.0000
I came up with this sed line:
cat file.txt | sed "s/\([^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,\)\([^,]*\),\(.*\)/\1\2\3/"
But now I don't know how to replace the double quote within \2.
Is that possible with sed?

I would personally use awk for that because it is more readable:
#!/usr/bin/awk -f
BEGIN {
    # Use ',' as the input and output field delimiter
    FS = OFS = ","
}
{
    # Iterate through all fields. (NF is the number of fields.)
    for (i = 1; i <= NF; i++) {
        # If the field starts and ends with a '"'
        if ($i ~ /^".*"$/) {
            # Remove all '"' from the field
            gsub(/"/, "", $i)
            # Wrap in '"' again
            $i = "\"" $i "\""
        }
    }
    # Print the (possibly modified) record
    print
}
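As a quick check, the same logic can be run as a one-off pipeline on a shortened version of the sample line. This is only a sketch: the function name strip_inner_quotes is made up for illustration, and it assumes that no field contains a comma, because awk splits on every comma.

```shell
# Hypothetical helper: remove stray double quotes inside quoted,
# comma-separated fields read from stdin.
strip_inner_quotes() {
  awk 'BEGIN { FS = OFS = "," }
  {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^".*"$/) {
        gsub(/"/, "", $i)      # drop every quote in the field
        $i = "\"" $i "\""      # put the outer pair back
      }
    print
  }'
}

printf '%s\n' '"2009217",2015,"UPPER DARBY FIREFIGHTERS "PAC"","","PA",.0000' |
  strip_inner_quotes
```

This should print "2009217",2015,"UPPER DARBY FIREFIGHTERS PAC","","PA",.0000 with empty fields ("") and unquoted numbers left as they were.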

This might work for you (GNU sed):
sed -r ':a;s/^((([^",]*,)*("[^",]*",([^",]*,)*)*)"[^",]*)"([^,])/\1\6/;ta' file
This removes extra double quotes from strings that are surrounded by double quotes and delimited by commas.
It works by skipping over properly quoted strings and unquoted strings (numbers, in this example) and then removing a double quote inside a quoted string that is not followed by a comma. The loop (:a ... ta) repeats the substitution until no such quote remains.
[^",]*, # non double quoted strings
"[^",]*", # properly quoted strings
(([^",]*,)*("[^",]*",([^",]*,)*)*) # skip over all properly constructed strings
"[^",]*"([^,]) # improper double quotes: the second " here, the one followed by a non-comma character, is removed
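To watch the loop at work, the command can be wrapped in a small function and fed a shortened sample line. This is a sketch, not a tested general solution: fix_quotes is a made-up name, -E is used in place of -r, and the commands are split over lines so the label does not depend on GNU's handling of ; after :a.

```shell
# Hypothetical wrapper around the substitution loop shown above.
fix_quotes() {
  sed -E ':a
s/^((([^",]*,)*("[^",]*",([^",]*,)*)*)"[^",]*)"([^,])/\1\6/
ta'
}

printf '%s\n' '"2","UPPER DARBY FIREFIGHTERS "PAC"","","PA",.0000' | fix_quotes
```

The first pass removes the quote before PAC, the second removes one of the two quotes after it, and the third finds nothing left to fix, so the line comes out as "2","UPPER DARBY FIREFIGHTERS PAC","","PA",.0000.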

Related

Sed replace strings starts with special characters

I'm trying to use sed to replace the string ; php_value[date.timezone] = Europe/Riga
I tried something like this:
sed -i 's/; php_value[date.timezone] = Europe/\Riga/; php_value[date.timezone] = America/\Sao_Paulo/g' file
Output:
sed: -e expression #1, char 47: extra characters after command
You can use
sed -i 's/; php_value\[date\.timezone] = Europe\/Riga/; php_value[date.timezone] = America\/Sao_Paulo/g' file
NOTE:
[ and . are regex metacharacters and must be escaped to match a literal [ and ., hence \[ and \. in the regex part.
/ is the delimiter of the s command here, so it must also be escaped, as \/. If you use a different delimiter character, there is no need to escape /, e.g.
sed -i 's,; php_value\[date\.timezone] = Europe/Riga,; php_value[date.timezone] = America/Sao_Paulo,g' file
Here the commas act as the delimiters of the s command.
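A small experiment makes both points visible. The sketch below uses | as the delimiter and a made-up helper name, set_timezone; the second command leaves the brackets unescaped on purpose to show why the pattern then fails to match.

```shell
# Hypothetical helper: rewrite the timezone line read from stdin.
set_timezone() {
  sed 's|; php_value\[date\.timezone] = Europe/Riga|; php_value[date.timezone] = America/Sao_Paulo|'
}

line='; php_value[date.timezone] = Europe/Riga'

# Escaped "[" and ".": the pattern matches and the line is rewritten.
printf '%s\n' "$line" | set_timezone

# Unescaped "[date.timezone]" is a bracket expression that matches ONE
# character from the set, so the pattern does not match and the line
# is printed unchanged.
printf '%s\n' "$line" | sed 's|php_value[date.timezone]|X|'
```

The first command prints the America/Sao_Paulo line; the second prints the input unchanged.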

Eliminate duplicate words across lines

I'd like a sed script that eliminates repeated words in a text file on one or more lines. For example:
this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
lines
should transform to:
this is a text file it is littered with duplicate words
on one or more lines
This awk script produces the correct output:
{
for (i = 1; i <= NF; i++) {
word = $i
if (word != last) {
if (i < NF) {
next_word = $(i+1)
if (word != next_word) {
printf("%s ", word)
}
} else {
printf("%s\n", word)
}
}
}
last = word
}
but I'd really like a sed "one-liner".
This works with GNU sed, at least for the example input:
$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
this is a text file it is littered with duplicate words
on one or more lines
The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.
-z treats the input as null byte separated, i.e., as a single line.
The command is then structured as
:a # label
s///g # substitution
ta # jump to label if substitution did something
And the substitution is this:
s/(\<\S+)(\s+)\1\s+/\1\2/g
First capture group: (\<\S+) – a complete word (start-of-word boundary, then one or more non-space characters)
Second capture group: (\s+) – one or more whitespace characters after that first word
\1\s+ – the first word again plus whatever whitespace follows it
This preserves the whitespace after the first word and discards the whitespace after the duplicate.
Note that -E, -z, \<, \S and \s are all GNU extensions to POSIX sed.
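The reason for the :a ... ta loop shows up with a word that is repeated three times. The sketch below assumes GNU sed, since it relies on the same \<, \S and \s extensions, and runs the substitution once and then in a loop on a single line.

```shell
# One pass: the g flag does not rescan text that was just replaced,
# so one duplicate "is" survives.
printf 'is is is a\n' | sed -E 's/(\<\S+)(\s+)\1\s+/\1\2/g'

# Looping with :a ... ta repeats the substitution until nothing changes.
printf 'is is is a\n' | sed -E ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta'
```

With GNU sed the first command prints "is is a" and the second prints "is a".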
With sed, you can use
sed -E 's/([a-z]+) +\1/\1/g'
Note that this works for duplicates, but not for triplicates or for duplicates split across line breaks.
This can be fixed by joining all the lines and looping:
sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'
sed -En '
H
${
g
s/^\n//
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
p
}
' file
this is a text file it is littered with duplicate words
on one or more lines
where
H -- append each line to the hold space
${...} -- on the last line, perform the enclosed commands
g -- replace pattern space with the contents of the hold space
s/^\n// -- remove leading newline (side-effect of H on first line)
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
..1..2............2............1..........................
The key here is to capture the text and the spaces separately so that the back-reference can match with differing whitespace.
Capture group #1 is the first word and its whitespace (which can contain newlines), and capture group #2 is just the word.

Using sed to remove embedded newlines

What is a sed script that will remove the "\n" character but only if it is inside "" characters (delimited string), not the \n that is actually at the end of the (virtual) line?
For example, I want to turn this file
"lalala","lalalslalsa"
"lalalala","lkjasjdf
asdfasfd"
"lalala","dasdf"
(line 2 has an embedded \n ) into this one
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
(Line 2 and 3 are now joined, and the real line feed was replaced with the character string \\n (or any other easy to spot character string, I'm not picky))
I don't just want to remove every other newline as a previous question asked, nor do I want to remove ALL newlines, just those that are inside quotes. I'm not wedded to sed, if awk would work, that's fine too.
The file being operated on is too large to fit in memory all at once.
sed is an excellent tool for simple substitutions on a single line, but for anything else you should use awk, e.g.:
$ cat tst.awk
{
if (/"$/) {
print prev $0
prev = ""
}
else {
prev = prev $0 " \\\\n "
}
}
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
Below was my original answer, but after seeing @NeronLeVelu's approach of just testing for a quote at the end of the line I realized I was doing this in a much too complicated way. You could just replace gsub(/"/,"&") % 2 below with !/"$/ and it would work the same, but the code above is a simpler implementation of the same functionality and also handles embedded escaped double quotes as long as they aren't at the end of a line.
$ cat tst.awk
{ $0 = saved $0; saved="" }
gsub(/"/,"&") % 2 { saved = $0 " \\\\n "; next }
{ print }
$ awk -f tst.awk file
"lalala","lalalslalsa"
"lalalala","lkjasjdf \\n asdfasfd"
"lalala","dasdf"
The above only stores 1 output line in memory at a time. It just keeps building up an output line from input lines while the number of double quotes in that output line is an odd number, then prints the output line when it eventually contains an even number of double quotes.
It will fail if you can have double quotes inside your quoted strings escaped as \", not "", but you don't show that in your posted sample input so hopefully you don't have that situation. If you have that situation you need to write/use a real CSV parser.
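The quote-counting version also copes with a field that spans more than two physical lines, which the sample input does not exercise. A minimal sketch, with join_quoted_newlines as a made-up wrapper name around the same three awk rules:

```shell
# Hypothetical wrapper around the quote-parity script above.
join_quoted_newlines() {
  awk '{ $0 = saved $0; saved = "" }
       gsub(/"/, "&") % 2 { saved = $0 " \\\\n "; next }
       { print }'
}

# The second field is spread over three physical lines.
printf '%s\n' '"a","b' 'c' 'd"' '"e","f"' | join_quoted_newlines
```

The record stays in saved while its quote count is odd, so the first three lines should be printed as the single line "a","b \\n c \\n d", followed by "e","f".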
sed -n ':load
/"$/ !{N
b load
}
:cycle
s/^\(\([^"]*"[^"]*"\)*\)\([^"]*"[^"]*\)\n/\1\3 \\\\n /
t cycle
p' YourFile
Load lines into the pattern space until a closing line (one ending with ") is found or the end of input is reached.
Replace a \n that follows any number of complete open/close " pairs plus one unclosed " (that is, a newline inside a quoted string) with the escaped form of the newline. In practice the substitution replaces "leading text + \n" with "leading text + escaped newline".
If a substitution occurred, try another one (:cycle and t cycle).
Print the result.
Continue until the end of the file.
Thanks to @Ed Morton for the remark about the escaped newline.

Delete everything between the first tab and last semicolon

I have a file whose lines look like this:
EF457507|S000834932 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4
EF457374|S000834799 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp14;Gp14
AJ133184|S000323093 Root;Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family I;GpI
DQ490004|S000686022 Root;Bacteria;"Armatimonadetes";Armatimonadetes_gp7
AF268998|S000340459 Root;Bacteria;TM7;TM7_genera_incertae_sedis
I would like to delete everything between the first tab and the last semicolon, so that the output looks like this:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
I tried to use regex but it doesn't work, is there any way to do it using Linux, awk or Perl?
You could use sed:
sed 's/\t.*;/\t/' file
## This matches a tab character '\t'; followed by any character '.' any number of
## times '*'; followed by a semicolon; and; replaces all of this with a tab
## character '\t'.
sed 's/[^\t]*;//' file
## Things inside square brackets become a character class. For example, '[0-9]'
## is a character class. Obviously, this would match any digit between zero and
## nine. However, when the first character in the character class is a '^', the
## character class becomes negated. So '[^\t]*;' means match anything not a tab
## character any number of times followed by a semicolon.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/.*;/,"",$2) }1' file
awk '{ sub(/[^\t]*;/,"") }1' file
Results:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
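The sample lines in this post show a space where the tab would be, so here is a sketch that builds one input line with a real tab via printf and runs the first awk command on it. The function name keep_last_rank is made up for illustration, and the file is assumed to be tab-separated as described in the question.

```shell
# Hypothetical helper: in the second tab-separated field, delete
# everything up to and including the last semicolon.
keep_last_rank() {
  awk 'BEGIN { FS = OFS = "\t" } { sub(/.*;/, "", $2) } 1'
}

printf 'AJ133184|S000323093\tRoot;Bacteria;Cyanobacteria;Family I;GpI\n' |
  keep_last_rank
```

Because only the tab is a field separator, the space in "Family I" does not split the field, and the output is the ID, a tab, and GpI.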
As per comments below, to 'remove everything after the last semicolon', with sed:
sed 's/[^;]*$//' file
## '[^;]*$' will match anything not a semicolon any number of times anchored to
## the end of the line.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/[^;]*$/,"",$2) }1' file
awk '{ sub(/[^;]*$/,"") }1' file

How can I escape an arbitrary string for use as a command line argument in Bash?

I have a list of strings and I want to pass those strings as arguments in a single Bash command line call. For simple alphanumeric strings it suffices to just pass them verbatim:
> script.pl foo bar baz yes no
foo
bar
baz
yes
no
I understand that if an argument contains spaces or backslashes or double-quotes, I need to backslash-escape the double-quotes and backslashes, and then double-quote the argument.
> script.pl foo bar baz "\"yes\"\\\"no\""
foo
bar
baz
"yes"\"no"
But when an argument contains an exclamation mark, this happens:
> script.pl !foo
-bash: !foo: event not found
Double quoting doesn't work:
> script.pl "!foo"
-bash: !foo: event not found
Nor does backslash-escaping (notice how the literal backslash is present in the output):
> script.pl "\!foo"
\!foo
I don't know much about Bash yet but I know that there are other special characters which do similar things. What is the general procedure for safely escaping an arbitrary string for use as a command line argument in Bash? Let's assume the string can be of arbitrary length and contain arbitrary combinations of special characters. I would like an escape() subroutine that I can use as below (Perl example):
$cmd = join " ", map { escape($_); } @args;
Here are some more example strings which should be safely escaped by this function (I know some of these look Windows-like, that's deliberate):
yes
no
Hello, world [string with a comma and space in it]
C:\Program Files\ [path with backslashes and a space in it]
" [i.e. a double-quote]
\ [backslash]
\\ [two backslashes]
\\\ [three backslashes]
\\\\ [four backslashes]
\\\\\ [five backslashes]
"\ [double-quote, backslash]
"\T [double-quote, backslash, T]
"\\T [double-quote, backslash, backslash, T]
!1
!A
"!\/'" [double-quote, exclamation, backslash, forward slash, apostrophe, double quote]
"Jeff's!" [double-quote, J, e, f, f, apostrophe, s, exclamation, double quote]
$PATH
%PATH%
&
<>|&^
*#$$A$##?-_
EDIT:
Would this do the trick? Escape every unusual character with a backslash, and omit single or double quotes. (Example is in Perl but any language can do this)
sub escape {
$_[0] =~ s/([^a-zA-Z0-9_])/\\$1/g;
return $_[0];
}
If you want to securely quote anything for Bash, you can use its built-in printf %q formatting:
cat strings.txt:
yes
no
Hello, world
C:\Program Files\
"
\
\\
\\\
\\\\
\\\\\
"\
"\T
"\\T
!1
!A
"!\/'"
"Jeff's!"
$PATH
%PATH%
&
<>|&^
*#$$A$##?-_
cat quote.sh:
#!/bin/bash
while IFS= read -r string
do
printf '%q\n' "$string"
done < strings.txt
./quote.sh:
yes
no
Hello\,\ world
C:\\Program\ Files\\
\"
\\
\\\\
\\\\\\
\\\\\\\\
\\\\\\\\\\
\"\\
\"\\T
\"\\\\T
\!1
\!A
\"\!\\/\'\"
\"Jeff\'s\!\"
\$PATH
%PATH%
\&
\<\>\|\&\^
\*#\$\$A\$##\?-_
These strings can be copied verbatim to, for example, echo to output the original strings in strings.txt.
What is the general procedure for safely escaping an arbitrary string for use as a command line argument in Bash?
Replace every occurrence of ' with '\'', then put ' at the beginning and end.
Every character except for a single quote can be used verbatim in a single-quote-delimited string. There's no way to put a single quote inside a single-quote-delimited string, but that's easy enough to work around: end the string ('), then add a single quote by using a backslash to escape it (\'), then begin a new string (').
As far as I know, this will always work, with no exceptions.
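The rule above can be sketched in plain POSIX shell without calling sed. The function name shell_quote and the variables s, out and head are made up for illustration; they are global because POSIX sh has no local.

```shell
# Hypothetical helper: print $1 as a single-quoted shell word.
shell_quote() {
  s=$1
  out=
  while :; do
    case $s in
      *\'*)
        head=${s%%\'*}        # text before the first single quote
        out="$out$head'\\''"  # close the string, add \', reopen the string
        s=${s#*\'}            # continue after that single quote
        ;;
      *)
        out="$out$s"
        break
        ;;
    esac
  done
  printf "'%s'\n" "$out"
}

arg='"Jeff'\''s!" $PATH & <>|'
quoted=$(shell_quote "$arg")
printf '%s\n' "$quoted"

# Round trip: let the shell parse the quoted word back into one argument.
eval "set -- $quoted"
[ "$1" = "$arg" ] && echo "round trip ok"
```

The eval round trip is a convenient way to check the quoting: if the shell parses the quoted word back into exactly the original string, the escaping was safe for that input.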
You can use single quotes to escape strings for Bash. Note however this does not expand variables within quotes as double quotes do. In your example, the following should work:
script.pl '!foo'
From Perl, this depends on the function you are using to spawn the external process. For example, if you use the system function, you can pass arguments as parameters so there's no need to escape them. Of course you'd still need to escape quotes for Perl:
system("/usr/bin/rm", "-fr", "/tmp/CGI_test", "/var/tmp/CGI");
sub text_to_shell_lit(_) {
return $_[0] if $_[0] =~ /^[a-zA-Z0-9_\-]+\z/;
my $s = $_[0];
$s =~ s/'/'\\''/g;
return "'$s'";
}
See this earlier post for an example.
Whenever you see you don't get the desired output, use the following method:
"""\special character"""
where special character may include ! " * ^ % $ # # ....
For instance, if you want a Bash script that generates another Bash file containing a string variable with an assigned value, you can use the following sample scenario:
Area="(1250,600),(1400,750)"
printf "SubArea="""\""""${Area}"""\""""\n" > test.sh
printf "echo """\$"""{SubArea}" >> test.sh
Then test.sh file will have the following code:
SubArea="(1250,600),(1400,750)"
echo ${SubArea}
As a reminder, to get a newline from \n we should use printf.
Bash interprets exclamation marks only in interactive mode.
You can prevent this by doing:
set +o histexpand
Inside double quotes you must escape dollar signs, backticks, double quotes and backslashes, and I would say that's all.
This is not a complete answer, but I find it useful sometimes to combine two types of quote for a single string by concatenating them, for example echo "$HOME"'/foo!?.*' .
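A short sketch of that concatenation trick, with made-up variable names: the double-quoted part is expanded, the single-quoted part is taken literally, and the shell joins the two into one word because there is no space between them.

```shell
name='world'

# Double-quoted part: $name is expanded.
# Single-quoted part: "!" and "$name" are kept literally.
greeting="Hello, $name"'! $name is not expanded here'

printf '%s\n' "$greeting"
```

This prints Hello, world! $name is not expanded here.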
FWIW, I wrote this function that invokes a set of arguments using different credentials. The su command required serializing all the arguments, which required escaping them all, which I did with the printf idiom suggested above.
# Usage: escape_args_then_call_as myname whoami
escape_args_then_call_as() {
    local user=$1
    shift
    local -a args=()
    local arg
    for arg in "$@"; do
        # Quote the substitution so the %q output is not word-split or globbed
        args+=( "$(printf '%q' "$arg")" )
    done
    sudo su "${user}" -c "${args[*]}"
}