How to replace only specific spaces in a file using sed?

I have this content in a file where I want to replace spaces at certain positions with a pipe symbol (|). I used sed for this, but it replaces all the spaces in the string. I don't want to replace the space between the 3rd and 4th strings.
How to achieve this?
Input:
test test test test
My attempt:
sed -e 's/ /|/g' file.txt
Expected Output:
test|test|test test
Actual Output:
test|test|test|test

sed 's/ /\
/3;y/\n / |/'
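Because sed reads input one line at a time, the pattern space never contains a newline when a cycle starts, so a newline makes a safe temporary marker: change the third space to a newline, then use y to turn the newline into a space and the remaining spaces into pipes.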
GNU sed can use \n in the replacement text:
sed 's/ /\n/3;y/\n / |/'
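As a quick check of the newline-marker trick above, here is a minimal sketch that wraps the portable form in a shell function. The function name pipe_except_third and the sample inputs are made up for illustration.

```shell
# Hypothetical helper: turn every space into "|" except the third one.
pipe_except_third() {
  # Step 1: replace the 3rd space with a newline (a temporary marker).
  # Step 2: y maps the marker back to a space and all other spaces to "|".
  sed 's/ /\
/3;y/\n / |/'
}

printf 'test test test test\n' | pipe_except_third   # test|test|test test
printf 'a b c d e\n' | pipe_except_third             # a|b|c d|e
```

Note that the second line of the sed script has to start in the first column, otherwise the indentation becomes part of the replacement text.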

If the original input doesn't contain any pipe characters, you can do
sed -e 's/ /|/g' -e 's/|/ /3' file
to retain the third white space. Otherwise see other answers.
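To see why the "no pipe characters" condition matters, the following sketch (the helper name keep_third_space and the inputs are illustrative assumptions) runs the same two substitutions on a clean line and on a line that already contains a pipe:

```shell
# Hypothetical helper for the two-step approach above.
keep_third_space() {
  sed -e 's/ /|/g' -e 's/|/ /3'
}

printf 'test test test test\n' | keep_third_space   # test|test|test test

# Input that already contains a pipe: after step one, the 3rd pipe is no
# longer the 3rd original space, so the wrong separator is turned back.
printf 'a|b c d e\n' | keep_third_space             # a|b|c d|e
```

In the second case the third original space sits between d and e, but the space that comes back is the one between c and d.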

You could replace the 'first space' twice, e.g.
sed -e 's/ /|/' -e 's/ /|/' file.txt
Or, if you want to specify the positions (e.g. the 2nd and 1st spaces):
sed -e 's/ /|/2' -e 's/ /|/1' file.txt
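The order of the numbered substitutions matters, because each expression counts the spaces that are still left on the line. A small sketch with a made-up input line shows the difference:

```shell
line='a b c d'

# Higher position first: both counts still refer to the original spaces.
printf '%s\n' "$line" | sed -e 's/ /|/2' -e 's/ /|/1'   # a|b|c d

# Lower position first: the 2nd remaining space is the 3rd original one.
printf '%s\n' "$line" | sed -e 's/ /|/1' -e 's/ /|/2'   # a|b c|d
```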

Using GNU sed to replace the first and second runs of one or more whitespace characters:
sed -i -E 's/\s+/|/;s/\s+/|/' file
Details
-i - edits the file in place
-E - enables POSIX ERE syntax
s/\s+/|/ - replaces the first run of one or more whitespace chars
; - and then
s/\s+/|/ - replaces the next run of one or more whitespace chars on each line (if present).
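A variation on the same idea, as a sketch: \s is a GNU extension, so this version swaps in the POSIX class [[:space:]] and reads from standard input instead of editing a file with -i. The helper name pipe_first_two_runs and the sample line are assumptions for illustration.

```shell
# Hypothetical helper: replace the first two whitespace runs with "|".
pipe_first_two_runs() {
  sed -E 's/[[:space:]]+/|/;s/[[:space:]]+/|/'
}

# Three spaces, then a tab, then a single space in the input.
printf 'test   test\ttest test\n' | pipe_first_two_runs   # test|test|test test
```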

Keep it simple and use awk, e.g. using any awk in any shell on every Unix box no matter what other characters your input contains:
$ awk '{for (i=1;i<NF;i++) sub(/ /,"|")} 1' file
test|test|test test
The above replaces all but the last " " on each line. If you want to replace a specific number, e.g. 2, then just change i<NF to i<=2.
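To replace a fixed number of spaces, the loop bound can also be passed in as an awk variable. This is only a sketch; the function name pipe_first_n and the variable n are not part of the answer above.

```shell
# Hypothetical helper: replace the first N single spaces with "|".
pipe_first_n() {
  # $1 is the number of spaces to replace; the trailing 1 prints the line.
  awk -v n="$1" '{ for (i = 1; i <= n; i++) sub(/ /, "|") } 1'
}

printf 'test test test test\n' | pipe_first_n 2   # test|test|test test
printf 'test test test test\n' | pipe_first_n 1   # test|test test test
```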

Related

Need to parse the following sed command: sed -e 's/ /\'$'\n/g'

I stumbled upon the command sed -e 's/ /\'$'\n/g' that supposedly takes an input and replaces every space with a newline. Still, I don't quite get how the '$' works in the command. I know that s stands for substitute, / / stands for the blank space, \n stands for newline and /g is for global replacement, but I'm not sure how \'$' fits into the picture. Anybody who can shed some light here will be much appreciated.
Basically it's meant for platform portability. With GNU sed it would be just
sed -e 's/ /\n/g'
because GNU sed is able to interpret \n as new line.
However, other versions of sed, like the BSD version (that comes with macOS), do not interpret \n in the replacement as a newline.
That's why the command is built out of two parts:
part 1: sed -e 's/ /\'
part 2: $'\n/g'
The $'\n/g' is an ANSI-C quoted string, parsed by the shell before sed is executed. Escape sequences like \n get expanded in such strings. This way, the author of the command passed a literal newline (0xa) to sed rather than the escape sequence \n (0x5c 0x6e).
One more thing: since the newline (0xa) is a command separator in sed, it needs to be escaped. That's why there is a \ at the end of the first part.
Alternatively you could just use a multiline version:
sed -e 's/ /\
/g'
Btw, I would have written the command like
sed -e 's/ /\'$'\n''/g'
that is, putting only the newline, $'\n', into the ANSI-C quoted string. IMO that's easier to understand.
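For shells that lack $'...' quoting, the same literal newline can be spliced in from a variable. This is a sketch of an alternative, not something taken from the answer above; the names nl and split_on_spaces are made up.

```shell
# nl holds one literal newline character.
nl='
'

# Hypothetical helper: the script sed receives is  s/ /\<newline>/g
split_on_spaces() {
  sed -e 's/ /\'"$nl"'/g'
}

printf 'one two three\n' | split_on_spaces
```

As in the original command, the backslash just before the spliced-in newline is what keeps sed from treating it as the end of the command.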

sed to copy part of line to end

I'm trying to copy part of a line to append to the end:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
becomes:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
I have tried:
sed 's/\(.*(GCA_\)\(.*\))/\1\2\2)'
$ f1=$'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'
$ echo "$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\1\2\3\/\2\4/' <<<"$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
sed -E (or -r on some systems) enables extended regex support in sed, so you don't need to escape the group parentheses ( ).
The group (GCA_.[^.]*) means "starting at GCA_, capture all chars up to, but excluding, the first dot found":
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\2/' <<<"$f1"
GCA_900169985
Similarly, (.[^_]*) captures one character (here the dot) and then all chars up to, but excluding, the first _. This is the regex way to emulate a non-greedy/lazy capture in sed (in Perl regex this could be written with a lazy quantifier, something like .*? followed by _).
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\3/' <<<"$f1"
.1
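The difference between a greedy .* and a negated character class can be seen on a shortened, made-up version of the file name. The variable s below is an assumption for illustration only.

```shell
s='GCA_900169985.1_IonXpress_024'

# Greedy: (.*) runs to the LAST underscore before giving anything back.
printf '%s\n' "$s" | sed -E 's/^(.*)_.*$/\1/'      # GCA_900169985.1_IonXpress

# Negated class: ([^_]*) cannot cross an underscore, so it stops at the first.
printf '%s\n' "$s" | sed -E 's/^([^_]*)_.*$/\1/'   # GCA
```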
Short sed approach:
s="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz"
sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/' <<< "$s"
The output:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz

Why is sed not matching excess whitespace between non-whitespace characters

I have a sed oneliner which removes excess whitespace:
sed -e 's/^\s*//' -e 's/\s*$//' -e 's/\s{2,}/ /g'
When I test it on " \tone1 two\t3three\t ", the sed removes the whitespace at the beginning and end of the line but doesn't match the excess whitespace between words, and sed returns \tone1 two\t3three. What I want is \tone1 two 3three, so sed -e 's/[ \t]{2,}/ /g' is not functioning.
regexr.com shows the expression as functional.
My version is GNU sed version 4.2.1.
{ and } need to be escaped as \{ and \} in the basic regex (BRE) mode that sed uses by default; otherwise \s{2,} treats the braces as literal characters.
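However, in extended regex mode you can do this with two substitutions: one that uses alternation to strip leading and trailing blanks, and one that collapses each inner run to a single space:
sed -E -e 's/^[[:blank:]]+|[[:blank:]]+$//g' -e 's/[[:blank:]]{2,}/ /g' file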
The POSIX character class [[:blank:]] matches a space or a tab character.
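For comparison, here is a sketch of the original three-expression one-liner kept in basic regex mode, with the interval braces escaped and \s swapped for [[:blank:]]. The helper name trim_and_squeeze and the test string (which assumes doubled spaces and tabs between the words) are illustrative.

```shell
# Hypothetical helper: trim both ends, then squeeze inner runs of blanks.
trim_and_squeeze() {
  sed -e 's/^[[:blank:]]*//' \
      -e 's/[[:blank:]]*$//' \
      -e 's/[[:blank:]]\{2,\}/ /g'
}

printf ' \tone1  two\t\t3three\t \n' | trim_and_squeeze   # one1 two 3three
```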

Delete line if string between the 4th and 5th delimiter is empty

"text";"text";"text";"text";;"text";"text"
If the 4th delimiter is immediately followed by another delimiter (i.e. the 5th field is empty), the line should be deleted.
Currently I'm doing that by using sed:
sed -n '/;;/!p' input.txt
Is this a reliable solution?
Thanks for help.
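A version that guards against escaped double quotes and ";" inside quoted fields (thanks @SLePort for the remark). It saves the original line in the hold space, simplifies a working copy for the test, and restores the original if the line is kept:
sed -e 'h;s/\\"//g' -e ':c' -e 's/^\(\("[^"]*";\)*"[^"]*\);/\1/;t c' -e '/^\([^;]*;\)\{4\};/d;g'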
sed -r '/^([^;]+;){4}\s*;/d' input.txt
awk -F';' '$5' input.txt
To remove lines where a ; immediately follows the fourth delimiter:
sed '/^\("*[^"]*"*;\)\{4\};/d' input.txt
This might work for you (GNU sed):
sed -r '/^("(\\.|[^"])*";){4};/d' file
If the line starts with four double-quoted fields, each followed by a semicolon (where every character inside the quotes is either a backslash-escaped character or anything other than a double quote), and the fourth is immediately followed by a further semicolon, then delete the line.
A more efficient regexp would be:
sed -r '/^("[^"\\]*(\\.[^"\\]*)*";){4};/d' file
This uses the "unrolled loop" pattern normal*(abnormal normal*)*, where normal is [^"\\] and abnormal is \\. (a backslash-escaped character).
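On the question of whether /;;/ is reliable: it matches an empty field anywhere on the line, not only the 5th one. The sketch below compares it with an anchored pattern on made-up sample data. The helper names sample and drop_empty_fifth are hypothetical, and the anchored pattern deliberately ignores the quoted-semicolon cases handled by the answers above.

```shell
# Hypothetical sample: empty 5th field, empty 2nd field, no empty field.
sample() {
  printf '%s\n' \
    '"a";"b";"c";"d";;"f"' \
    '"a";;"c";"d";"e";"f"' \
    '"a";"b";"c";"d";"e";"f"'
}

# Delete a line only when exactly four fields precede the empty one.
drop_empty_fifth() {
  sed '/^\([^;]*;\)\{4\};/d'
}

sample | sed -n '/;;/!p'    # keeps only the third line
sample | drop_empty_fifth   # keeps the second and third lines
```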

sed to replace non-printable character with printable character

Am running BASH and UNIX utilities on Windows 7.
Have a file that contains a vertical tab. The hex value is 0x0B. The octal value is 013. I need to replace the character with a blank space.
Have tried this sed approach but it fails:
sed -e 's/'$(echo "octal-value")'/replace-word/g'
Specifically:
sed -e 's/'$(echo "\013")'/ /g'
Update:
Following this advice I use GNU sed and this approach:
sed -i 's:\0x0B: :g' file
but the stubborn vertical tab is still in the file.
What is the correct way to replace a non-printable character with a printable character?
GNU sed recognises hexadecimal escapes of the form \xHH, so the vertical tab can be written as \x0b:
sed -e 's/\x0b/ /g'
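A short way to try this without a real file is to generate the vertical tab with printf. The sketch below assumes GNU sed for the \x0b escape and adds tr as a portable alternative that is not part of the answer above; both helper names are made up.

```shell
# GNU sed only: \x0b is a hexadecimal escape for the vertical tab.
vt_to_space_sed() {
  sed -e 's/\x0b/ /g'
}

# Portable alternative (an addition, not from the answer): tr accepts octal.
vt_to_space_tr() {
  tr '\013' ' '
}

# printf expands \013 to a real vertical tab in the test input.
printf 'left\013right\n' | vt_to_space_sed   # left right
printf 'left\013right\n' | vt_to_space_tr    # left right
```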
As for why the -e is there: if you pass more than one sed expression as separate arguments, then each one must be preceded by -e. So, for example:
echo foo bar bas zer | sed -e 's/zer/oh my/g' -e 's/bas/baz/'
would result in:
foo bar baz oh my
thus performing 2 different sed changes ('scripts') with only a single invocation. See the sed man page for more details.
(the above example is, obviously, contrived. I, however, have seen a sed command in a script with 78 individual -e 'scripts'!)
If you only have one 'script', then the -e is optional, obviously.
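As a side note to the above, the same two commands can also be joined with ; inside a single script, in which case no -e is needed. A minimal sketch using the example from above:

```shell
# Two scripts, each introduced by -e.
echo foo bar bas zer | sed -e 's/zer/oh my/g' -e 's/bas/baz/'   # foo bar baz oh my

# One script, commands separated by ";".
echo foo bar bas zer | sed 's/zer/oh my/g;s/bas/baz/'           # foo bar baz oh my
```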