Decipher this sed one-liner

Decipher this sed one-liner - sed

I want to remove duplicate lines from a file, without sorting the file.
Example of why this is useful to me: removing duplicates from Bash's $HISTFILE without changing the chronological order.
This page has a one-liner to do that:
http://sed.sourceforge.net/sed1line.txt
Here's the one-liner:
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
I asked a sysadmin and he told me "you just copy the script and it works, don't go philosophising about this", which is fine, so I am asking here as it's a developer forum and I trust people might be like me, suspicious about using things they don't understand:
Could you kindly provide a pseudo-code explanation of what that "black magic" script is doing, please? I tried parsing the incantation in my head but especially the central part is quite hard.

I'll note that this script does not appear to work with my copy of sed (GNU sed 4.1.5) in my current locale. If I run it with LC_ALL=C it works fine.
Here's an annotated version of the script. sed basically has two registers, one is called "pattern space" and is used for (basically) the current input line, and the other, the "hold space", can be used by scripts for temporary storage etc.
sed -n ' # -n: by default, do not print
G # Append hold space to current input line
s/\n/&&/ # Add empty line after current input line
/^\([ -~]*\n\).*\n\1/d # If the current input line is repeated in the hold space, skip this line
# Otherwise, clean up for storing all input in hold space:
s/\n// # Remove empty line after current input line
h # Copy entire pattern space back to hold space
P # Print current input line'
I guess the adding and removal of an empty line is there so that the central pattern can be kept relatively simple (you can count on there being a newline after the current line and before the beginning of the matching line).
So basically, the entire input file (sans duplicates) is kept (in reverse order) in the hold space, and if the first line of the pattern space (the current input line) is found anywhere in the rest of the pattern space (which was copied from the hold space when the script started processing this line), we skip it and start over.
The regex in the conditional can be further decomposed;
^ # Look at beginning of line (i.e. beginning of pattern space)
\( # This starts group \1
[ -~] # Any printable character (in the C locale)
* # Any number of times
\n # Followed by a newline
\) # End of group \1 -- it contains the current input line
.*\n # Skip any amount of lines as necessary
\1 # Another occurrence of the current input line, with newline and all
If this pattern matches, the script discards the pattern space and starts over with the next input line (d).
You can get it to work independently of locale by changing [ -~] to [[:print:]]

The code doesn't work for me, perhaps due to some locale setting, but this does:
vvv
sed -n 'G; s/\n/&&/; /^\([^\n]*\n\).*\n\1/d; s/\n//; h; P'
^^^
Let's first translate this by the book (i.e. sed info page), into something perlish.
# The standard sed loop
my $hold = "";
while ($my pattern = <>) {
chomp $pattern;
$pattern = "$pattern\n$hold"; # G
$pattern =~ s/(\n)/$1$1/; # s/\n/&&/
if ($pattern =~ /^([^\n]*\n).*\n\1/) { # /…/
next; # d
}
$pattern =~ s/\n//; # s/\n//
$hold = $pattern; # h
$pattern =~ /^([^\n]*\n?)/; print $1; # P
}
OK, the basic idea is that the hold space contains all the lines seen so far.
G: At the beginning of each cycle, append that hold space to the current line. Now we have a single string consisting of the current line and all unique lines which preceeded it.
s/\n/&&/: Turn the newline which separates them into a double newline, so that we can match subsequent and non-subsequent duplicates the same, see the next step.
^\([^\n]*\n\).*\n\1/: Look through the current text for the following: at the beginning of all the lines (^) look for a first line including trailing newline (\([^\n]*\n\)), then anything (.*), then a newline (\n), and then that same first line including newline repeated again (\1). If two subsequent lines are the same, then the .* in the regular expression will match the empty string, but the two \n will still match due to the newline duplication in the preceding step. So basically this asks whether the first line appears again among the other lines.
d: If there is a match, this is a duplicate line. We discard this input, keep the hold space as it is as a buffer of all unique lines seen so far, and continue with the next line of input.
s/\n//: Otherwise, we continue and next turn the double newline back into a single newline.
h: We include the current line in our list of all unique lines.
P: And finally print this new unique line, up to the newline character.

For the actual problem to resolve, here is a simpler solution (at least it looks so) with awk:
awk '!_[$0]++' FILE
In short _[$0] is a counter (of appearance) for each unique line, for any line ($0) appearing for the second time _[$0] >= 1, thus !_[$0] evaluates to false, causing it not to be printed except its first time appearance.
See https://gist.github.com/ryenus/5866268 (credit goes to a recent forum I visited.)

Related

Sed command to delete "\" which causes "*** multiple target patterns. Stop." error

In a file, I'm having the lines like this -
a.lo a.o: abc/util.c \
/usr/lib/def.h
b.lo b.o: hash/imp.h \
/usr/lib/toy.c \
c.lo c.o: high/scan.c \
high/scan_f.c
Here you can see one extra \ (back slash) at the end of line number 4 (/usr/lib/toy.c ). How can I use sed command to remove this / (back slash)? Because of this I'm getting "*** multiple target patterns. Stop." error.
P.S. - I'm having this extra \ (back slash) at multiple places in my file. So using sed to delete it by line number won't be feasible. Need something which can check for .lo .o and check a line before, if it finds a \ (back slash) remove it.

Maybe not the simplest but this should work:
sed -nE '${s/\\$//;p;};N;s/\\([^\\]*:)/\1/;P;D' input_file
The main idea is to concatenate input lines in the pattern space (a sed internal text buffer), such that it always contains 2 consecutive lines, separated by a newline character. We then just delete the last \ before a :, if any, print the first of the 2 lines and remove it from the pattern space before continuing with the next line.
sed commands are separated by semi-columns (;) and grouped with curly braces ({...}). They are optionally preceded by a line(s) specification, for instance $ that stands for the last line of the input. So, in our case, ${s/\\$//;p;} applies only to the last line while the rest (N;s/\\([^\\]*:)/\1/;P;D) applies to all lines.
The -n option suppresses the default output. We need this to control the output ourselves with the p (print) command.
The -E option enables the use of extended regular expressions.
Let's first explain the tricky part: N;s/\\([^\\]*:)/\1/;P;D. It is a list of 4 commands that are run for each line of the input because there is no line(s) specification before the commands.
When sed starts processing the input the pattern space already contains the first line (a.lo a.o: abc/util.c \ in your example). This is how sed works: by default it puts the current line in the pattern space, applies the commands and restarts with the next line.
N appends the next input line (/usr/lib/def.h) to the pattern space with a newline character as separator. The pattern space now contains:
a.lo a.o: abc/util.c \
/usr/lib/def.h
N also increments the current line number which becomes 2.
s/\\([^\\]*:)/\1/ deletes the last \ before the first : in the pattern space, if there is one. In our example the only \ is after the first :. The pattern space is not modified.
P prints the first part of the pattern space, up to the first newline character. In our example what is printed is:
a.lo a.o: abc/util.c \
D deletes the first part of the pattern space, up to the first newline character (what has just been printed). The pattern space contains:
/usr/lib/def.h
D also starts a new cycle but different from the normal sed processing, it does not read the next line and leaves the pattern space and current line number unmodified. So when restarting the pattern space contains line number 2 and the the current line number is still 2.
By induction we see that, each time sed restarts executing the list of commands, the pattern space contains the current line, as normal. When processing line number 4 of your example it contains:
/usr/lib/toy.c \
After N it contains:
/usr/lib/toy.c \
c.lo c.o: high/scan.c \
And there, the substitution command (s/\\([^\\]*:)/\1/) matches and deletes the first \:
/usr/lib/toy.c
c.lo c.o: high/scan.c \
It is thus:
/usr/lib/toy.c
that is printed and removed from the pattern space. Exactly what you want.
The last line needs a special treatment. When we start processing it the pattern space contains:
high/scan_f.c
If we don't do anything special N does not change it (there is no next line to concatenate) and terminates the processing. The last line is never printed.
This is why another list of commands is needed, just for the last line: ${s/\\$//;p;}. It applies only to the last line because it is preceded by a line(s) specification ($ for last line). The first command in the list (substitute s/\\$//) removes a trailing \, if there is one. The second (p) prints the pattern space.
Note: if you know that the last line does not end with a trailing backslash you can simplify a bit:
sed -nE '$p;N;s/\\([^\\]*:)/\1/;P;D' input_file

I agree with #G.M. in general, but this will work.
sed captures text before trailing "\" (if present) on lines starting with "\" and prints only that text on those lines. All other text is also printed, of course
sed -e 's/\(.* \)\\$/\1/' input_file

The question is a bit unclear about how to identify the lines from which a trailing backslash should be removed, but inasmuch as the input looks like set of a makefile-format prerequisite lists from which some lines have been removed, I take the objective to be to remove backslashes where they appear after the last (remaining) prerequisite in a list. That requires looking ahead to the next line, so it will be helpful to make use of sed's hold space to store data while you look ahead at the next line to figure out what to do with it.
This would be a pretty robust solution for that problem:
sed -nE 's/\s*(\\){0,1}$/ \\/; :a; /:/ { x; s/\s*\\$//; p; d; }; H; $ { s/.*/:/; b a }' input
That builds up each prerequisite list in the hold space, with backslashes and newlines embedded, then dumps it when the next target list or the end of the input arrives.
Details:
the -n option turns off automatically printing the pattern space after each line
the -E option turns on extended regular expressions
the sed expression contains several sub-expressions, joined by semicolons:
s/\s*(\\){0,1}$/ \\/ : ensure that the current line in the pattern space ends with a space and backslash, without adding a second backslash to lines that already have one
:a : labels that point in the script 'a'
/:/ { x; s/\s*\\$//; p; d; } : on lines that contain a colon, swap the pattern and hold spaces, remove the trailing backslash from (the new contents of) the pattern space, print the result, then start the next cycle
H : (if control reaches this point) append a newline and the contents of the pattern space to the hold space
$ { s/.*/:/; b a } : on the last line of input trigger dumping the hold space by putting a colon in the pattern space and jumping to label 'a'
[end of expression] : read the next line into the pattern space and start over
Alternatively, it would more exactly follow your request, and avoid introducing a leading blank line, to do this:
sed -n ':a; /\\$/! { p; d; }; h; :b; $ { x; s/\\//; p; }; n; /:/ { x; s/\\$//; p; x; b a; }; H; /\\$/ b b; s/.*//; x; p' input
That also assembles pieces in the hold space before ultimately printing them, but it goes about it in a different way:
it starts (at label a) by checking whether the line in the pattern space ends with a backslash. If not (/\\$/!), then it prints the pattern space and starts the next cycle.
otherwise, it replaces the current contents of the hold space with the contents of the pattern space (which must already end with a backslash), then
(at label b) if the current line is the last then it retrieves the contents of the hold space, strips the trailing newline, and prints the result ($ { x; s/\\//; p; }). Either way,
it attempts to read the next input line, and terminates if there are no more (n).
if that results in the pattern space containing a colon within, then the contents of the hold space are printed, less trailing backslash, and control is sent back to label a to process the colon-containing line as a new first line (/:/ { x; s/\\$//; p; x; b a; }).
otherwise, a newline and the contents of the pattern space are appended to the hold space (H).
if the pattern space ends with a backslash then control branches back to label b to consider reading another line (/\\$/ b b).
otherwise, the hold space is printed and cleared (s/.*//; x; p), and
if there are any more lines then the next is read and a new cycle started.
That makes fewer assumptions about the nature of the input, but it is a bit more complicated.

Explain this sed conditional branching behavior

I have the following (gnu) sed script, which is intended to parse another sed script, and output distinct commands on a separate line.
In words, this script should put a newline after each semicolon ;, except semicolons that are inside a matching or substitution command.
Sed script:
#!/bin/sed -rf
# IDEA:
# replace ';' by ';\n' except when it's inside a match expression or subst. expression.
# Ignored patterns:
/^#/b # commented lines
/^$/b # empty lines
# anything in a single line, without semicolon except at the end
/^[^\n;]*;?$/b
# Processed patterns (put on separate lines):
# Any match preceding a semicolon, or the end of the line, or a substitution
s_/^[^/]+/[^;s]*;?_&\n_; t printtopline
s/^\\(.)[^\1]+\1[^;s]*;?/&\n/;t printtopline
# Any substitution (TODO)
# Any other command, separated by semicolon
s/\;/\;\n/; t printtopline;
:printtopline
P;D; # print top line, delete it, start new cycle
For example, I tested it with the following file (actually adapted from an answer of #ctac_ to one of my previous sed questions):
Input file:
#!/bin/sed -f
#/^>/N;
:A;
/\n>/!{s/\n/ /;N;bA}; # join next line if not a sequence label
#h;
#s/\(.*\)\n.*/\1/p;
s/^>//g;P
#x;
#s/.*\n//;
D
bA;
Output
The above script produces the right output, for example, the line /\n>/!{s/\n/ /;N;bA}; # join next line if not a sequence label becomes:
/\n>/!{s/\n/ /;
N;
bA};
# join next line if not a sequence label
Question
However, could you help me understand why this part of the script works:
s/\;/\;\n/; t printtopline;
:printtopline
?
I seems to me that the branching command t printtopline is useless here. I thought whatever the success of the substitution, the next thing to be executed would be :printtopline.
However, if I comment out the t command, or if I replace it with b, the script produces the following output lines:
/\n>/!{s/\n/ /;
N;bA}; # join next line if not a sequence label
From info sed, here is the explanation of t:
't LABEL'
Branch to LABEL only if there has been a successful 's'ubstitution
since the last input line was read or conditional branch was taken.
The LABEL may be omitted, in which case the next cycle is started.
Why isn't the t command immediately followed by its label not behaving like no command at all or the b command?

The crucial part is this:
Branch to label only if there has been a successful substitution since the last input line was read or conditional branch was taken.
I.e. t looks into the past and takes into account the success of all recent substitutions up to the most recent
input, or
conditional branch.
Consider the input line you're asking about. After all the substitutions we have
/\n>/!{s/\n/ /;
N;bA}; # join next line if not a sequence label
in our pattern space when we reach P;D;. The P commands outputs the first line, then D deletes the first line and restarts the main loop. Now we just have
N;bA}; # join next line if not a sequence label
Note that this didn't involve reading any additional lines. No input occurred; D just removed parts of the pattern space.
We process the remaining text (which does nothing because none of the other patterns match) until we reach this part of the code:
s_/^[^/]+/[^;s]*;?_&\n_; t printtopline
The substitution fails (the pattern space doesn't contain /^). But the t command doesn't check the status of just this one s command; it looks at the history of all substitutions since the most recent input or conditional branch taken.
The most recent input occurred when /\n>/!{s/\n/ /;N;bA}; was read.
The most recent conditional branch taken was
s/\;/\;\n/; t printtopline;
:printtopline
in the original version of your code. Since then no other substitution succeeded, so the t command does nothing. The rest of the program continues as expected.
But in the modified version of your code there was no conditional branch at this point (b is an unconditional branch):
s/\;/\;\n/; b printtopline;
:printtopline
That means the t from s_/^[^/]+/[^;s]*;?_&\n_; t printtopline "sees" the s/\;/\;\n/; as having succeeded, so it immediately jumps to the P;D; part. This is what outputs
N;bA}; # join next line if not a sequence label
unmodified.
In summary: t makes a difference here not because of its immediate effect of jumping to a label, but because it serves as a dynamic delimiter for the next t that gets executed. Without t here, the previously executed s command is taken into account for the next t.

Part 1 - how the P;D; sequence works.
Compare this two command's outputs: sed 's/;/;\n/' and sed 's/;/;\n/; P;D;'.
First:
$ sed 's/;/;\n/' <<< 'one;two;three;four'
one;
two;three;four
Second:
$ sed 's/;/;\n/; P;D;' <<< 'one;two;three;four'
one;
two;
three;
four
Why the difference? I will to explain.
The first command substitutes only the first occurrence of the ; character. To substitute all occurrences, the g modifier should be added to the s command: sed 's/;/;\n/g'.
The second command works this way:
sed 's/;/;\n/; - the same as the first command - no difference. Before this command the pattern space is one;two;three;four, after - one\ntwo;three;four.
P; -
from man: "Print up to the first embedded newline of the current pattern space."
That is, it prints up to first newline - one. The pattern space stay unchanged: one\ntwo;three;four
D; -
from man: "If pattern space contains no newline, start a normal new cycle as if the d command was
issued. Otherwise, delete text in the pattern space up to the first newline, and restart
cycle with the resultant pattern space, without reading a new line of input."
In the our case, pattern space has newline - one\ntwo;three;four. The D; removes the one\n part and repeat all commands cycle from the beginning. Now, the pattern space is: two;three;four.
That is, again sed 's/;/;\n/; - pattern space: two\nthree;four, then P; - print two, pattern space unchanged: two\nthree;four, D; - removes two\n, pattern space becomes: three;four. Etc.
Part 2 - what happening with branching.
I looked at the sed source code and found next information:
When the s command is executing and having match, the replaced flag is setting to the true:
/* We found a match, set the 'replaced' flag. */
replaced = true;
The t command is executing, if the replaced flag is true. And it is changing this flag to the false:
case 't':
if (replaced)
{
replaced = false;
So, in the first, s/\;/\;\n/; t printtopline; case, the substitution is successful - therefore, replaced flag is setting to the true. Then, the following t command is running and changing replaced flag back to the false.
In the second case, without t command - s/\;/\;\n/;, substitution is successful, too - therefore, replaced flag is setting to the true.
But now, this flag is stored to the next cycle, initiated by the D command. So, then the first t command appears in the new cycle - s_/^[^/]+/[^;s]*;?_&\n_; t printtopline, it checks the replaced flag, sees, that the flag is true and jumps to the label :printtopline, omitting all other commands before the label.
The pattern space doesn't have newlines, so P;D; sequence just prints pattern space and starts the next cycle with the new line of input.

Can someone break this sed command down for me?

I found this magical command on the unix forum to move the last line of a file to the beginning of the file. I use sed quite a bit but not to this extent. Can someone explain each part to me?
sed '1h;1d;$!H;$!d;G' infile

Yes, it uses exotic commands.
1h: put first line in the "hold" space (sed has 2 spaces: 1 hold space to keep data and the pattern space: actual processed line)
1d: delete first line
$!H: append all lines BUT the last one (and the first one since d command skips to the next line) into the "hold" space
$!d: delete (do not print) all lines except the last one
G: Append a newline to the contents of the pattern space (this is the last line, the only one able to reach that part of the script), and then append the contents of the hold space to that of the pattern space, pattern space which is printed right away. Swap done.
Opinion based comment: I must admit I would never have thought of doing that using sed, and I would have had to make a test to convince me of what this command was doing... in awk, it is much much easier to do that.
But sed has a special place in my heart with it's cryptic commands. I wonder if there are some sed candidates to CodeGolf :)
reference manual: https://www.gnu.org/software/sed/manual/sed.html
some exotic things you can do with sed (my best 1999 read): http://sed.sourceforge.net/grabbag/tutorials/do_it_with_sed.txt

Here is the same command in a more procedural-looking pseudocode:
for line in infile:
# Always do this: Copy the current line to the pattern
pattern = line
# Process the script
if first line:
hold = pattern # 1h
pattern = ""; continue # 1d
elif not last line:
hold = hold + "\n" + pattern # $!H
pattern = ""; continue # $!d
pattern = pattern + "\n" + hold # G
# Always do this after the script is completed.
# Due to the continue statements above, this
# isn't always reached, and in this case
# is only reached for the last line.
print pattern
d clears the pattern space and continues to the next input line without executing the rest of the script.
h copies the pattern space to the hold space.
H appends a newline to the hold space, then appends the pattern space to the hold space.
G is like H, but in the other direction; it copies the hold space to the pattern space.
The overall affect on a file with N lines is to build up a copy of lines 1 through N-1 in the hold space. When the pattern holds line N, append the hold space to the pattern space and print the pattern space to standard output.

sed recipe: how to do stuff between two patterns that can be either on one line or on two lines?

Let's say we want to do some substitutions only between some patterns, let them be <a> and </a> for clarity... (all right, all right, they're start and end!.. Jeez!)
So I know what to do if start and end always occur on the same line: just design a proper regex.
I also know what to do if they're guaranteed to be on different lines and I don't care about anything in the line containing end and I'm also OK with applying all the commands in the line containing start before start: just specify the address range as /start/,/end/.
This, however, doesn't sound very useful. What if I need to do a smarter job, for instance, introduce changes inside a {...} block?
One thing I can think of is breaking the input on { and } before processing and putting it back together afterwards:
sed 's/{\|}/\n/g' input | sed 'main stuff' | sed ':a $!{N;ba}; s/\n\(}\|{\)\n/\1/g'
Another option is the opposite:
cat input | tr '\n' '#' | sed 'whatever; s/#/\n/g'
Both of these are ugly, mainly because the operations are not confined within a single command. The second one is even worse because one has to use some character or substring as a "newline holder" assuming it isn't present in the original text.
So the question is: are there better ways or can the above-mentioned ones be optimized? This is quite a regular task from what I read in recent SO questions, so I'd like to choose the best practice once and for all.
P.S. I'm mostly interested in pure sed solutions: can the job be do with one invocation of sed and nothing else? Please no awk, Perl, etc.: this is more of a theoretical question, not a "need the job done asap" one.

This might work for you:
# create multiline test data
cat <<\! >/tmp/a
> this
> this { this needs
> changing to
> that } that
> that
> !
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/a
this
this { THIS needs
changing to
THAT } that
that
# convert multiline test data to a single line
tr '\n' ' ' </tmp/a >/tmp/b
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/b
this this { THIS needs changing to THAT } that that
Explanation:
Read the data into the pattern space (PS). /{/!b;:a;/}/!{$q;N;ba}
Copy the data into the hold space (HS). h
Strip non-data from front and back of string. s/[^{]*{//;s/}.*//
Convert data e.g. s/this\|that/\U&/g
Swap to HS and append converted data. x;G
Replace old data with converted data.s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/
EDIT:
A more complicated answer which I think caters for more than one block per line.
# slurp file into pattern space (PS)
:a
$! {
N
ba
}
# check for presence of \v if so quit with exit value 1
/\v/q1
# replace original newlines with \v's
y/\n/\v/
# append a newline to PS as a delimiter
G
# copy PS to hold space (HS)
h
# starting from right to left delete everything but blocks
:b
s/\(.*\)\({.*}\).*\n/\1\n\2/
tb
# delete any non-block details form the start of the file
s/.*\n//
# PS contains only block details
# do any block processing here e.g. uppercase this and that
s/th\(is\|at\)/\U&/g
# append ps to hs
H
# swap to HS
x
# replace each original block with its processed one from right to left
:c
s/\(.*\){.*}\(.*\)\n\n\(.*\)\({.*}\)/\1\n\n\4\2\3/
tc
# delete newlines
s/\n//g
# restore original newlines
y/\v/\n/
# done!
N.B. This uses GNU specific options but could be tweaked to work with generic sed's.

How to restrict a find and replace to only one column within a CSV?

I have a 4-column CSV file, e.g.:
0001 # fish # animal # eats worms
I use sed to do a find and replace on the file, but I need to limit this find and replace to only the text found inside column 3.
How can I have a find and replace only occur on this one column?

Are you sure you want to be using sed? What about csvfix? Is your CSV nice and simple with no quotes or embedded commas or other nasties that make regexes...a less than satisfactory way of dealing with a general CSV file? I'm assuming that the # is the 'comma' in your format.
Consider using awk instead of sed:
awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
Arguably, you should have a BEGIN block that sets OFS once. For one line of input, it didn't make any odds (and you'd probably be hard-pressed to measure a difference on a million lines of input, too):
$ echo "pattern # pattern # pattern # pattern" |
> awk -F# '$3 ~ /pattern/ { OFS= "#"; $3 = "replace"; }'
pattern # pattern #replace# pattern
$
If sed still seems appealing, then:
sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
For example (and note the slightly different input and output – you can fix it to handle the same as the awk quite easily if need be):
$ echo "pattern#pattern#pattern#pattern" |
> sed '/^\([^#]*#[^#]*\)#pattern#\(.*\)/ s//\1#replace#\2/'
pattern#pattern#replace#pattern
$
The first regex looks for the start of a line, a field of non-at-signs, an at-sign, another field of non-at-signs and remembers the lot; it looks for an at-sign, the pattern (which must be in the third field since the first two fields have been matched already), another at-sign, and then the residue of the line. When the line matches, then it replaces the line with the first two fields (unchanged, as required), then adds the replacement third field, and the residue of the line (unchanged, as required).
If you need to edit rather than simply replace the third field, then you think about using awk or Perl or Python. If you are still constrained to sed, then you explore using the hold space to hold part of the line while you manipulate the other part in the pattern space, and end up re-integrating your desired output line from the hold space and pattern space before printing the line. That's nearly as messy as it sounds; actually, possibly even messier than it sounds. I'd go with Perl (because I learned it long ago and it does this sort of thing quite easily), but you can use whichever non-sed tool you like.
Perl editing the third field. Note that the default output is $_ which had to be reassembled from the auto-split fields in the array #F.
$ echo "pattern#pattern#pattern#pattern" | sh -x xxx.pl
> perl -pa -F# -e '$F[2] =~ s/\s*pat(\w\w)rn\s*/ prefix-$1-suffix /; $_ = join "#", #F; ' "$#"
pattern#pattern# prefix-te-suffix #pattern
$
An explanation. The -p means 'loop, reading lines into $_ and printing $_ at the end of each iteration'. The -a means 'auto-split $_ into the array #F'. The -F# means the field separator is #. The -e is followed by the Perl program. Arrays are indexed from 0 in Perl, so the third field is split into $F[2] (the sigil — the # or $ — changes depending on whether you're working with a value from the array or the array as a whole. The =~ is a match operator; it applies the regex on the RHS to the value on the LHS. The substitute pattern recognizes zero or more spaces \s* followed by pat then two 'word' characters which are remembered into $1, then rn and zero or more spaces again; maybe there should be a ^ and $ in there to bind to the start and end of the field. The replacement is a space, 'prefix-', the remembered pair of letters, and '-suffix' and a space. The $_ = join "#", #F; reassembles the input line $_ from the possibly modified separate fields, and then the -p prints that out. Not quite as tidy as I'd like (so there's probably a better way to do it), but it works. And you can do arbitrary transforms on arbitrary fields in Perl without much difficulty. Perl also has a module Text::CSV (and a high-speed C version, Text::CSV_XS) which can handle really complex CSV files.

Essentially break the line into three pieces, with the pattern you're looking for in the middle. Then keep the outer pieces and replace the middle.
/\([^#]*#[^#]*#\[^#]*\)pattern\([^#]*#.*\)/s//\1replacement\2/
\([^#]*#[^#]*#\[^#]*\) - gather everything before the pattern, including the 3rd # and any text before the math - this becomes \1
pattern - the thing you're looking for
\([^#]*#.*\) - gather everything after the pattern - this becomes \2
Then change that line into \1 then the replacement, then everything after pattern, which is \2

This might work for you:
echo 0001 # fish # animal # eats worms|
sed 's/#/&\n/2;s/#/\n&/3;h;s/\n#.*//;s/.*\n//;y/a/b/;G;s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/'
0001 # fish # bnimbl # eats worms
Explanation:
Define the field to be worked on (in this case the 3rd) and insert a newline (\n) before it and directly after it. s/#/&\n/2;s/#/\n&/3
Save the line in the hold space. h
Delete the fields either side s/\n#.*//;s/.*\n//
Now process the field i.e. change all a's to b's. y/a/b/
Now append the original line. G
Substitute the new field for the old field (also removing any newlines). s/\([^\n]*\)\n\([^\n]*\).*\n/\2\1/
N.B. That in step 4 the pattern space only contains the defined field, so any number of commands may be carried out here and the result will not affect the rest of the line.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Decipher this sed one-liner - sed

Related

Sed command to delete "\" which causes "*** multiple target patterns. Stop." error

Explain this sed conditional branching behavior

Can someone break this sed command down for me?

sed recipe: how to do stuff between two patterns that can be either on one line or on two lines?

How to restrict a find and replace to only one column within a CSV?

Categories

Resources