Replace pattern with consecutive strings from list - sed

I would like to find a specific string in one file and replace each occurrence with consecutive strings from another file. The order of replacement should be maintained.
The first file looks like this:
>A1
NNNNNNNNNN
NNNNNNNNNN
>B2
ACGTNNNNNN
NNNGTGTNNN
NNNNNNNNNN
>B3
GGGGGGGGGG
NNNTTTTTTT
NNNNCTGNNN
And the file with strings looks like this:
Name1
Name1
Name2
Name2
Name3
Name4
In the end, I would like to find the lines containing '>' and replace '>' with '>string ', taking the strings from the second file in order, to get this output:
>Name1 A1
NNNNNNNNNN
NNNNNNNNNN
>Name1 B2
ACGTNNNNNN
NNNGTGTNNN
NNNNNNNNNN
>Name2 B3
GGGGGGGGGG
NNNTTTTTTT
NNNNCTGNNN

If you have GNU sed:
sed '/^>/R file_with_strings' first_file | sed '/^>/{N;s/>\(.*\)\n\(.*\)/>\2 \1/;}'

This might work for you (GNU sed):
sed -E '1{x;s/^/cat file2/e;x}
/^>/{G;s/^>(\S+)\n(\S+)/>\2 \1/;P;s/^[^\n]*\n//;h;d}' file1
On the first line, slurp the second file into the hold space.
If a line begins with >, append the hold space and, using pattern matching and back references, build the required header line from the first line of the hold space.
Print the result (the first line of the pattern space), remove that first line, store the remainder back in the hold space and delete the current line.
Repeat.
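The steps above can be tried end to end with a small shell sketch. The temporary directory, the file names first_file and names, and the shortened sequence lines are made up for the demo; GNU sed is assumed because of the e flag and \S.

```shell
#!/bin/sh
# Demo data (hypothetical): three headers, four candidate names.
tmp=$(mktemp -d)
printf '%s\n' '>A1' 'NNNN' '>B2' 'ACGT' '>B3' 'GGGG' > "$tmp/first_file"
printf '%s\n' Name1 Name1 Name2 Name2 > "$tmp/names"

# Line 1: load the names into the hold space via the GNU "e" flag.
# Header lines: append the names, splice the first one into the header,
# print it, then store the remaining names back in the hold space.
result=$(sed -E "1{x;s|^|cat $tmp/names|e;x}
/^>/{G;s/^>(\S+)\n(\S+)/>\2 \1/;P;s/^[^\n]*\n//;h;d}" "$tmp/first_file")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each header consumes exactly one name, so unused names simply stay behind in the hold space.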

Related

Sed: find, replace and then append result to original line

I am on a Mac. I want to find a pattern in a line, replace it with something, then append the resulting string to the end of the original line. Here is what I tried:
echo "test='123'" | sed -E '/([^a-z])/ s/$/ \1/'
sed: 1: "/([^a-z])/ s/$/ \1/": \1 not defined in the RE
What do I need to define \1? I thought I did it with ([^a-z]). No?
Edit: Perhaps this code will represent better what I want:
1) echo "test='123'" | sed 's/[a-zA-Z0-9]//g'
2) I want the new line = original line + line #1 above
In other words:
Before (what I get): test='123'
After (what I want): test='123' =''
You can edit this command this way:
echo "test='123'" | sed -E 'h;s/([a-zA-Z0-9])//g;G;s/(.*)\n(.*)/\2 \1/'
For readability, the script, line by line, reads
h
s/([a-zA-Z0-9])//g
G
s/(.*)\n(.*)/\2 \1/
h stores the current line in the hold space.
Your s command does what it did before.
G appends the content of the hold space, i.e. the original line, to the pattern space, i.e. the current line as you have edited it, putting a newline \n in between.
The last s command reorders the two pieces and replaces the \n that the G command inserted with a space, which gives the test='123' ='' you asked for.
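As a quick check of the h / s / G / s sequence, here is a small sketch (the variable names are only for the demo). The final substitution puts a space between the two parts so that the result matches the desired output from the question; -E is assumed to be available, as it is in macOS sed and GNU sed.

```shell
#!/bin/sh
line="test='123'"

# h: save the original line; s: strip letters and digits;
# G: append the saved original after a newline;
# final s: swap the two parts and turn the newline into a space.
result=$(printf '%s\n' "$line" |
  sed -E 'h;s/[a-zA-Z0-9]//g;G;s/(.*)\n(.*)/\2 \1/')

printf '%s\n' "$result"   # test='123' =''
```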
Comments
Your original attempt sed -E '/([^a-z])/ s/$/ \1/' could not work because \1 refers to what is captured by the leftmost (…) group in the search portion of the s command; it does not "remember" the group(s) you used to address the line.
Once you print the pattern space with p, a newline comes with it, and once it's been printed, there's no way you can remove it within the same sed program.

Delete string after '#' using sed

I have a text file that looks like:
#filelists.txt
a
# aaa
b
#bbb
c #ccc
I want to delete everything from '#' to the end of each line and, if a line starts with '#', delete the whole line.
So I ran this sed command in my shell:
sed -e "s/#*//g" -e "/^$/d" filelists.txt
I expected the result to be:
a
b
c
but the actual result is:
filelists.txt
a
aaa
b
bbb
c ccc
What's wrong with my sed command?
I understood '*' to mean "any", so I thought '#*' meant "the string after #".
Isn't that right?
You may use
sed 's/#.*//;/^$/d' file > outfile
The s/#.*// removes # and all the rest of the line and /^$/d drops empty lines.
See an online test:
s="#filelists.txt
a
# aaa
b
#bbb
c #ccc"
sed 's/#.*//;/^$/d' <<< "$s"
Output:
a
b
c
Another idea: match lines having #, then remove # and the rest of the line there and drop if the line is empty:
sed '/#/{s/#.*//;/^$/d}' file > outfile
This way, you keep the original empty lines.
* does not mean "any" (at least not in regular expression context). * means "zero or more of the preceding pattern element". Which means you are deleting "zero or more #". Since you only have one #, you delete it, and the rest of the line is intact.
You need s/#.*//: "delete # followed by zero or more of any character".
EDIT: was suggesting grep -v, but didn't notice the third example (# in the middle of the line).
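To see the difference between #* and #.* side by side, here is a small sketch on the sample input. The third command is an assumption of mine rather than part of the answers: it also removes whitespace in front of the #, so that "c #ccc" becomes "c" without a trailing space.

```shell
#!/bin/sh
input=$(printf '%s\n' '#filelists.txt' 'a' '# aaa' 'b' '#bbb' 'c #ccc')

# "#*" matches zero or more "#" characters, so only the "#" signs disappear.
only_hashes=$(printf '%s\n' "$input" | sed -e 's/#*//g' -e '/^$/d')

# "#.*" matches "#" plus the rest of the line.
comments=$(printf '%s\n' "$input" | sed -e 's/#.*//' -e '/^$/d')

# Variant (assumption): also strip blanks in front of the "#".
trimmed=$(printf '%s\n' "$input" | sed -e 's/[[:space:]]*#.*//' -e '/^$/d')

printf '%s\n---\n%s\n---\n%s\n' "$only_hashes" "$comments" "$trimmed"
```

With plain s/#.*// the last line keeps the space that preceded the #, which is easy to miss when comparing output by eye.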

sed: replace pattern only if followed by empty line

I need to replace a pattern in a file, only if it is followed by an empty line. Suppose I have following file:
test
test

test

...
The following command would replace all occurrences of test with xxx:
cat file | sed 's/test/xxx/g'
But I need to replace test only if the next line is empty. I have tried matching a hex code, but that does not work:
cat file | sed 's/test\x0a/xxx/g'
The desired output should look like this:
test
xxx

xxx

...
Suggested solutions for sed, perl and awk:
sed
sed -rn '1h;1!H;${g;s/test([^\n]*\n\n)/xxx\1/g;p;}' file
I got the idea from sed multiline search and replace. Basically slurp the entire file into sed's hold space and do global replacement on the whole chunk at once.
perl
$ perl -00 -pe 's/test(?=[^\n]*\n\n)$/xxx/m' file
-00 triggers paragraph mode, which makes perl read chunks separated by one or more empty lines (just what the OP is looking for). The positive lookahead (?=...) anchors the substitution to the last line of the chunk.
Caveat: -00 will squash multiple empty lines into single empty lines.
awk
$ awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
Basically, store the previous line in l and substitute the pattern in l if the current line is empty. Print l. Finally, print the very last line.
Output in all three cases
test
xxx

xxx

...
This might work for you (GNU sed):
sed -r '$!N;s/test(\n\s*)$/xxx\1/;P;D' file
Keep a window of 2 lines throughout the length of the file and if the second line is empty and the first line contains the pattern then make a substitution.
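The two-line window can be watched on a small input. This is only a sketch: the sample lines are made up, and GNU sed is assumed because of \s.

```shell
#!/bin/sh
# Demo input (hypothetical): "test" followed once by text, twice by an empty line.
input=$(printf '%s\n' 'test' 'test' '' 'test' '' 'end')

# $!N : append the next line unless we are on the last one
# s   : replace "test" only when the appended line is blank
# P;D : print and drop the first line, keep the second for the next cycle
result=$(printf '%s\n' "$input" | sed -E '$!N;s/test(\n\s*)$/xxx\1/;P;D')

printf '%s\n' "$result"
```

Because D restarts the cycle without reading new input, every line gets to be the first line of the window exactly once.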
Using sed
sed -r ':a;$!{N;ba};s/test([^\n]*\n(\n|$))/xxx\1/g'
explanation
:a # set label a
$ !{ # if not end of file
N # Add a newline to the pattern space, then append the next line of input to the pattern space
b a # Unconditionally branch to label. The label may be omitted, in which case the next cycle is started.
}
# simply, above command :a;$!{N;ba} is used to read the whole file into the pattern space.
s/test([^\n]*\n(\n|$))/xxx\1/g # replace the keyword if the next line is empty (\n\n) or if the pattern space ends right after the newline (\n then $)
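A short sketch of the slurp-then-substitute approach, run on made-up sample lines and assuming GNU sed (for the one-line label syntax and \n inside a bracket expression):

```shell
#!/bin/sh
# Demo input (hypothetical).
input=$(printf '%s\n' 'test' 'test' '' 'test' '' 'end')

# :a;$!{N;ba} collects every line into the pattern space;
# the substitution then sees the line breaks as literal \n characters.
result=$(printf '%s\n' "$input" |
  sed -E ':a;$!{N;ba};s/test([^\n]*\n(\n|$))/xxx\1/g')

printf '%s\n' "$result"
```

The whole input is held in memory at once, so for very large files the two-line window shown earlier in this thread may be the gentler choice.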

How to find and replace every match except the first using sed?

I am using sed to find and replace text, e.g.:
sed -i 's/a/b/g' ./file.txt
This replaces every instance of a with b in the file. I need to add an exception, such that sed replaces every instance of a with b, except for the first appearance in the file, e.g.:
There lived a bird who liked to eat fish.
One day he fly to a tree.
This becomes:
There lived a bird who liked to ebt fish.
One dby he fly to b tree.
How can I modify my sed script to only replace every instance of a with b, except for the first occurrence?
I have GNU sed version 4.2.1.
This might work for you (GNU sed):
sed 's/a/b/2g' file
or
sed ':a;s/\(a[^a]*\)a/\1b/;ta' file
This can be tailored, e.g.
sed ':a;s/\(\(a[^a]*\)\{5\}\)a/\1b/;ta' file
will start replacing a with b after 5 a's
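The three commands can be compared on single lines with the sketch below (variable names and the aaaaaaaa input are made up; the 2g flag and the one-line label syntax assume GNU sed). Note that all of them work per line.

```shell
#!/bin/sh
line='There lived a bird who liked to eat fish.'

# GNU extension: the "2g" flag replaces the 2nd match and all later ones.
with_flag=$(printf '%s\n' "$line" | sed 's/a/b/2g')

# Loop: turn the second remaining "a" into "b"; "t" repeats while that succeeds.
with_loop=$(printf '%s\n' "$line" | sed ':a;s/\(a[^a]*\)a/\1b/;ta')

# Same loop, but the first five "a"s are left alone.
after_five=$(printf 'aaaaaaaa\n' | sed ':a;s/\(\(a[^a]*\)\{5\}\)a/\1b/;ta')

printf '%s\n%s\n%s\n' "$with_flag" "$with_loop" "$after_five"
```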
You can do a more complete implementation with a script that's more complex:
#!/bin/sed -nf
# Lines before the first "a" are printed unchanged.
/a/! p
/a/ {
    /a.*a/ {
        h
        s/a.*/a/
        x
        s/a/\n/
        s/^[^\n]*\n//
        s/a/b/g
        H
        g
        s/\n//
    }
    : loop
    p
    n
    s/a/b/g
    $! b loop
    # The last line does not branch back to "p", so print it here.
    p
}
The functionality of this is easily explained in pseudo-code
if line does not contain "a"
    print line
end-if
if line contains "a"
    if line contains two "a"s
        tmp = line
        remove everything after the first a in line
        swap tmp and line
        replace the first a with "\n"
        remove everything up to "\n"
        replace all "a"s with "b"s
        tmp = tmp + "\n" + line
        line = tmp
        remove first "\n" from line
    end-if
    loop
        print line
        read next line
        replace all "a"s with "b"s
        repeat loop if we haven't read the last line yet
    end-loop
    print line
end-if
One way is to replace all and then reverse the first replacement (thanks potong):
sed -e 'y/a/\n/' -e 's/\n/a/' -e 'y/\n/b/'
Newline serves as an intermediate so strings beginning with b work correctly.
The above works line-wise; if you want to apply it to the whole file, first turn the whole file into one line:
<infile tr '\n' '^A' | sed 'y/a/\n/; s/\n/a/; y/\n/b/' | tr '^A' '\n'
Or more briefly using the sed command from potong's answer:
<infile tr '\n' '^A' | sed 's/a/b/2g' | tr '^A' '\n'
Note that ^A (ASCII 0x01) can be produced with Ctrl-v Ctrl-a. ^A in tr can be replaced by \001.
This assumes that the file contains no ^A.
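Both whole-file pipelines can be checked against the example from the question. The sketch below writes the separator as \001 instead of a literal ^A and assumes GNU sed (for 2g and for feeding sed a line without a trailing newline); the variable names are made up.

```shell
#!/bin/sh
text=$(printf '%s\n' 'There lived a bird who liked to eat fish.' \
                     'One day he fly to a tree.')

# Join all lines with \001 (assumed not to occur in the data), replace every
# "a" after the first one in the whole input, then restore the line breaks.
via_flag=$(printf '%s\n' "$text" |
  tr '\n' '\001' | sed 's/a/b/2g' | tr '\001' '\n')

# Same idea with the y/s/y trick: all "a" -> newline, first newline back
# to "a", remaining newlines -> "b".
via_y=$(printf '%s\n' "$text" |
  tr '\n' '\001' | sed 'y/a/\n/; s/\n/a/; y/\n/b/' | tr '\001' '\n')

printf '%s\n---\n%s\n' "$via_flag" "$via_y"
```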

The pattern space and hold space of the Sed utility has an initialized value of null or empty string?

From the documentation of sed:
sed maintains two data buffers: the active pattern space, and the
auxiliary hold space. Both are initially empty.
I initially thought that the pattern space and hold space start out as null (nothing). But from the following example, it seems that their initial value is a single newline character (\n).
[root@localhost ~]# cat e.txt
aa
bb
cc
dd
[root@localhost ~]# cat e.txt | sed -r '/c/{x;p;x}'
aa
bb

cc
dd
[root@localhost ~]#
Is my understanding right?
Thanks.
I think the answer is that the p command, like the default print action, is actually adding a newline to the end of the empty pattern space. This is based on this little snippet from the GNU sed documentation (just below that bit you quote, by the way):
sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space.
... blah, blah blah ...
When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed.
In other words, the line being held in the pattern (and hold) space does not have the trailing newline - the aa line is held as aa rather than aa<newline>.
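Of course, the hold space may still contain multiple lines, but that just means that executing h on the first line of your file and H on the second will give you a hold space containing aa<newline>bb, not aa<newline>bb<newline>. (H alone on both lines would give <newline>aa<newline>bb, because H puts a newline in front of whatever it appends, even when the hold space is still empty.)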
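One way to check this without guessing is the l command, which prints the pattern space unambiguously and marks its end with $. The following is a small sketch with made-up variable names, assuming GNU sed.

```shell
#!/bin/sh
# Swap in the untouched hold space on line 1 and list it: only "$" appears,
# so the hold space starts out empty rather than holding a newline.
empty_hold=$(printf '%s\n' aa | sed -n 'x;l')

# h on line 1, H on line 2: the hold space is aa<newline>bb with no
# trailing newline, which "l" shows as aa\nbb$.
two_lines=$(printf '%s\n' aa bb | sed -n '1h;2{H;x;l;}')

# The example from the question: p prints the empty hold space and adds
# the newline itself, which is where the blank line comes from.
blank_line=$(printf '%s\n' aa bb cc dd | sed '/c/{x;p;x}')

printf '%s\n%s\n%s\n' "$empty_hold" "$two_lines" "$blank_line"
```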