Trying to understand why sed emulating rev loops on same line until all reversed - sed

sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
The command above reverses each line, emulating rev.
As I understand it, sed is a line editor: it executes all the commands separated by ; on one line, then reads the next line.
So why, after executing every command once, does it keep looping until all the characters are reversed?
I don't need an explanation of how the command works; I know that. What I want to know is why, after executing all the ;-separated commands once, sed keeps executing them until the whole line is reversed.
Why this loop-like behaviour? I can't understand it.
eg.
echo 'Hey i am fine' | sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
Should be :
Pattern space
-->(1st action) Hey i am fine\n
--> (2nd action) Hey i am fine\ney i am fine\nH
-->(3rd action) ey i am fine\nH
--> (4th action) At this point it should execute s/.// and exit or read the next line
But why is the 1st action repeated after the 3rd action?
It seems that as long as the pattern space is not empty, sed does not read the next line and keeps repeating the cycle on the same pattern space.
It seems the D command is doing this.
But why, after the first D, is s/.// not executed, and the script instead starts again from the beginning?
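A short trace can make the restart visible. The sketch below is the same script with the l command added right after the substitution, so that each pass over the script prints the pattern space unambiguously. The input 'abc' is chosen only for illustration, and a sed that accepts ;-separated commands (such as GNU sed) is assumed.

```shell
# Same script as above, with `l` inserted so that every pass over the
# script prints the pattern space (newline shown as \n, end marked by $).
trace=$(echo 'abc' | sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;l;//D;s/.//')
printf '%s\n' "$trace"
# abc\nbc\na$   pass 1: D deletes "abc\n" and restarts without reading input
# bc\nc\nba$    pass 2: D fires again
# c\n\ncba$     pass 3: D fires again
# \ncba$        pass 4: s fails, so //D is skipped and s/.// finally runs
# cba           end of script reached: the pattern space is auto-printed
```

D never falls through to s/.// while it still finds a newline to delete up to: it jumps back to the first command with whatever is left in the pattern space.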

Related

A way to append the beginning of every line before a pattern to the end of each same line?

I am trying to copy the beginning of every line in a text file before a certain character to the end of the same line.
I've tried duplicating each line to the end of itself and then deleting everything after the character. The trouble is that I can't figure out how to skip the first instance of the character, so the duplicated text gets deleted along with everything beyond the first instance.
I've tried things like
sed '/S/ s/$/ append text/' sample.txt > cleaned.txt
but this only adds a fixed text. I've also tried using:
s/\(.*\)/\1 \1/
to duplicate the line, and then deleting everything after the S, but I can't figure out how to get it to go to the 3rd S not the 1st to start deleting.
What I have to start with:
dog 50_50_S5_Scale
cat 10_RV_17_S76_Scale
mouse 15_EQ_S81_Scale
What I'm trying to get:
dog 50_50_S5_Scale dog 50_50_
cat 10_RV_17_S76_Scale cat 10_RV_17_
mouse 15_EQ_S81_Scale mouse 15_EQ_
Where everything before the first S gets copied to the end of the line.
You may use
sed 's/\([^S]*\)S.*/& \1/' file
Details
\([^S]*\) - Capturing group 1 (\1): zero or more characters other than S
S.* - S and the rest of the line (sed processes input line by line by default).
The replacement is the concatenation of the whole match (&), a space and the value of group 1.
You could try:
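awk '{print $0 " " substr($0, 1, index($0,"S") - 1)}' file
We take the substring from the first character up to but not including the first occurrence of "S". The start index is 1 because awk strings are 1-indexed; a start of 0 can drop the last character in some awk implementations.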
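As a quick check, the sketch below runs the sed approach and an awk equivalent on two sample lines modeled on the question. The awk variant is written with a start index of 1, and the variable names are made up for this example.

```shell
# Sample lines modeled on the question; each has text before its first "S".
input='dog 50_50_S5_Scale
cat 10_RV_17_S76_Scale'

# sed: capture everything before the first S, then append " " + capture.
sed_out=$(printf '%s\n' "$input" | sed 's/\([^S]*\)S.*/& \1/')

# awk: same idea via index(); start index 1 because awk strings are 1-based.
awk_out=$(printf '%s\n' "$input" |
  awk '{print $0 " " substr($0, 1, index($0, "S") - 1)}')

printf '%s\n' "$sed_out"
# dog 50_50_S5_Scale dog 50_50_
# cat 10_RV_17_S76_Scale cat 10_RV_17_
```

Both variants leave a line without any S unchanged in sed's case, while the awk version would append only a trailing space, so they agree only on lines that contain an S.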

Explain this sed conditional branching behavior

I have the following (GNU) sed script, which is intended to parse another sed script and output each command on its own line.
In other words, this script should put a newline after each semicolon ;, except semicolons that are inside a match or substitution command.
Sed script:
#!/bin/sed -rf
# IDEA:
# replace ';' by ';\n' except when it's inside a match expression or subst. expression.
# Ignored patterns:
/^#/b # commented lines
/^$/b # empty lines
# anything in a single line, without semicolon except at the end
/^[^\n;]*;?$/b
# Processed patterns (put on separate lines):
# Any match preceding a semicolon, or the end of the line, or a substitution
s_/^[^/]+/[^;s]*;?_&\n_; t printtopline
s/^\\(.)[^\1]+\1[^;s]*;?/&\n/;t printtopline
# Any substitution (TODO)
# Any other command, separated by semicolon
s/\;/\;\n/; t printtopline;
:printtopline
P;D; # print top line, delete it, start new cycle
For example, I tested it with the following file (actually adapted from an answer of #ctac_ to one of my previous sed questions):
Input file:
#!/bin/sed -f
#/^>/N;
:A;
/\n>/!{s/\n/ /;N;bA}; # join next line if not a sequence label
#h;
#s/\(.*\)\n.*/\1/p;
s/^>//g;P
#x;
#s/.*\n//;
D
bA;
Output
The above script produces the right output, for example, the line /\n>/!{s/\n/ /;N;bA}; # join next line if not a sequence label becomes:
/\n>/!{s/\n/ /;
N;
bA};
# join next line if not a sequence label
Question
However, could you help me understand why this part of the script works:
s/\;/\;\n/; t printtopline;
:printtopline
?
It seems to me that the branching command t printtopline is useless here. I thought that, whether or not the substitution succeeds, the next thing to be executed would be :printtopline.
However, if I comment out the t command, or if I replace it with b, the script produces the following output lines:
/\n>/!{s/\n/ /;
N;bA}; # join next line if not a sequence label
From info sed, here is the explanation of t:
't LABEL'
Branch to LABEL only if there has been a successful 's'ubstitution
since the last input line was read or conditional branch was taken.
The LABEL may be omitted, in which case the next cycle is started.
Why doesn't a t command that is immediately followed by its own label behave like no command at all, or like the b command?
The crucial part is this:
Branch to label only if there has been a successful substitution since the last input line was read or conditional branch was taken.
I.e. t looks into the past and takes into account the success of all recent substitutions, going back to the most recent input or the most recent conditional branch taken, whichever happened later.
Consider the input line you're asking about. After all the substitutions we have
/\n>/!{s/\n/ /;
N;bA}; # join next line if not a sequence label
in our pattern space when we reach P;D;. The P command outputs the first line, then D deletes the first line and restarts the main loop. Now we just have
N;bA}; # join next line if not a sequence label
Note that this didn't involve reading any additional lines. No input occurred; D just removed parts of the pattern space.
We process the remaining text (which does nothing because none of the other patterns match) until we reach this part of the code:
s_/^[^/]+/[^;s]*;?_&\n_; t printtopline
The substitution fails (the pattern space doesn't contain /^). But the t command doesn't check the status of just this one s command; it looks at the history of all substitutions since the most recent input or conditional branch taken.
The most recent input occurred when /\n>/!{s/\n/ /;N;bA}; was read.
The most recent conditional branch taken was
s/\;/\;\n/; t printtopline;
:printtopline
in the original version of your code. Since then no other substitution succeeded, so the t command does nothing. The rest of the program continues as expected.
But in the modified version of your code there was no conditional branch at this point (b is an unconditional branch):
s/\;/\;\n/; b printtopline;
:printtopline
That means the t from s_/^[^/]+/[^;s]*;?_&\n_; t printtopline "sees" the s/\;/\;\n/; as having succeeded, so it immediately jumps to the P;D; part. This is what outputs
N;bA}; # join next line if not a sequence label
unmodified.
In summary: t makes a difference here not because of its immediate effect of jumping to a label, but because it serves as a dynamic delimiter for the next t that gets executed. Without t here, the previously executed s command is taken into account for the next t.
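The effect can be reproduced with a much smaller script. The sketch below uses a hypothetical two-rule script (the KEEP tag and the label name print are invented for illustration) and assumes GNU sed, both for \n in the replacement and for the flag surviving a D restart.

```shell
# Rule 1 strips a KEEP tag; rule 2 puts a newline after the first semicolon.
# Both rules end in "t print", and the label follows immediately.
with_t=$(echo 'a;b;c' | sed -e 's/^KEEP //' -e 't print' \
                            -e 's/;/;\n/'   -e 't print' \
                            -e ':print' -e 'P;D')

# Same script with the seemingly useless second t removed.
without_t=$(echo 'a;b;c' | sed -e 's/^KEEP //' -e 't print' \
                               -e 's/;/;\n/' \
                               -e ':print' -e 'P;D')

printf 'with t:\n%s\nwithout t:\n%s\n' "$with_t" "$without_t"
# with t:     a;  b;  c   (each on its own line)
# without t:  a;  b;c     (rule 1's t fires on the stale flag and skips rule 2)
```

In the second variant the flag set by rule 2 in the first pass is still set after D restarts the script, so the t belonging to rule 1 jumps straight to P;D and b;c is printed unsplit.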
Part 1 - how the P;D; sequence works.
Compare the outputs of these two commands: sed 's/;/;\n/' and sed 's/;/;\n/; P;D;'.
First:
$ sed 's/;/;\n/' <<< 'one;two;three;four'
one;
two;three;four
Second:
$ sed 's/;/;\n/; P;D;' <<< 'one;two;three;four'
one;
two;
three;
four
Why the difference? Let me explain.
The first command substitutes only the first occurrence of the ; character. To substitute all occurrences, the g modifier should be added to the s command: sed 's/;/;\n/g'.
The second command works this way:
s/;/;\n/; - the same as the first command, no difference. Before this command the pattern space is one;two;three;four; after it, one;\ntwo;three;four.
P; - from man: "Print up to the first embedded newline of the current pattern space."
That is, it prints up to the first newline: one;. The pattern space stays unchanged: one;\ntwo;three;four.
D; - from man: "If pattern space contains no newline, start a normal new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the first newline, and restart cycle with the resultant pattern space, without reading a new line of input."
In our case the pattern space has a newline: one;\ntwo;three;four. D; removes the one;\n part and repeats the whole command cycle from the beginning. Now the pattern space is two;three;four.
That is, s/;/;\n/; runs again (pattern space: two;\nthree;four), then P; prints two; (pattern space unchanged), and D; removes two;\n, leaving three;four. And so on.
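To watch the pattern space change from pass to pass, one option is to add the l command between the substitution and P. The sketch below does that for the same input; GNU sed is assumed for \n in the replacement.

```shell
# `l` dumps the pattern space (newline shown as \n, end marked by $)
# right after the substitution, before P prints and D restarts.
trace=$(echo 'one;two;three;four' | sed 's/;/;\n/; l; P; D')
printf '%s\n' "$trace"
# one;\ntwo;three;four$
# one;
# two;\nthree;four$
# two;
# three;\nfour$
# three;
# four$
# four
```

Lines ending in $ come from l, the others from P. Only one line of input is ever read; the three restarts are all caused by D.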
Part 2 - what is happening with branching.
I looked at the sed source code and found the following:
When an s command executes and finds a match, the replaced flag is set to true:
/* We found a match, set the 'replaced' flag. */
replaced = true;
The t command branches only if the replaced flag is true, and when it does it resets the flag to false:
case 't':
if (replaced)
{
replaced = false;
So in the first case, s/\;/\;\n/; t printtopline;, the substitution succeeds and sets the replaced flag to true. The t command that follows then runs and sets the replaced flag back to false.
In the second case, s/\;/\;\n/; without the t command, the substitution succeeds too, so the replaced flag is set to true.
But now the flag carries over into the next cycle, which is started by the D command. When the first t command of the new cycle runs (s_/^[^/]+/[^;s]*;?_&\n_; t printtopline), it checks the replaced flag, sees that it is true, and jumps to the label :printtopline, skipping all the commands before the label.
The pattern space has no newline, so the P;D; sequence just prints the pattern space and starts the next cycle with a new line of input.
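The flag behaviour can also be seen within a single cycle, without D. In the sketch below the labels one and two and the marker text are invented for the example; the only difference between the two runs is t versus b after the first substitution.

```shell
# "t one" consumes the flag set by s/a/A/, so the later "t two" sees
# no successful substitution and the marker text gets appended.
with_t=$(echo 'a' | sed -e 's/a/A/' -e 't one' -e ':one' \
                        -e 's/zzz//' -e 't two' \
                        -e 's/$/ (second t not taken)/' -e ':two')

# "b one" jumps unconditionally and leaves the flag set, so "t two"
# fires even though s/zzz// failed, and the marker is skipped.
with_b=$(echo 'a' | sed -e 's/a/A/' -e 'b one' -e ':one' \
                        -e 's/zzz//' -e 't two' \
                        -e 's/$/ (second t not taken)/' -e ':two')

printf '%s\n%s\n' "$with_t" "$with_b"
# A (second t not taken)
# A
```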

Sed to replace certain number of occurrences

I have the replace sed script below and it works for the first occurrence of every line but I'm trying to make it work for the first 2 occurrences per line instead of one (/1) or the whole line (/g):
sed -r '2,$s/(^ *|, *)([a-z])/\1\U\2/1'
Is there any way to do that either by combining sed commands or creating a script?
The best I can offer is
sed -r '2,$ { s/(^|,) *[a-z]/\U&/; s//\U&/; }'
The \U& trick uses the fact that the upper case version of a space is still a space; this is to make the repetition shorter. Because captures are no longer used, the regex can be simplified a little.
In the second s command, the // is a stand-in for the most recently attempted regex, so the first one is essentially executed a second time (this time matching what was originally the second appearance).
Since /1 doesn't actually do anything (replacing the first occurrence is default), I took the liberty of removing it.
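A small run on made-up data shows the effect. The two input lines are invented for this sketch, GNU sed is assumed (\U and -r are GNU extensions), and LC_ALL=C is set so that [a-z] cannot match uppercase letters in locales with unusual collation.

```shell
# Line 1 is outside the 2,$ address and stays untouched.
# On line 2 the first s capitalizes "a"; the second s, reusing the same
# regex via //, then finds ", b" as the first remaining match.
out=$(printf 'name, city, fruit\napple, banana, cherry\n' |
      LC_ALL=C sed -r '2,$ { s/(^|,) *[a-z]/\U&/; s//\U&/; }')
printf '%s\n' "$out"
# name, city, fruit
# Apple, Banana, cherry
```

The second substitution works only because the first one turned its match into uppercase, so that spot no longer matches [a-z]; the same trick would not carry over to a replacement that leaves the match still matching.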

SED Code Explanation

I have a line of SED, below, that is in a batch command that I run every month. It was written by someone before me, and I am looking to understand the parts of this code. From the two outputs I can tell that it takes one line and deletes another when sequential lines are duplicates, I just don't understand how it is being done with this line.
sed "$!N; /^\(.*\)\n\1$/!P; D" finalish.txt > final.txt
Example of finalish.txt
201408
201409
201409
201409
201409
Example of final.txt
201408
201409
Without going into the basics of sed, here is your sed command broken down:
$!N: If the current line is not the last line of input, append the next line to the pattern space. The two lines are separated by a newline (\n). At this point your pattern space is 201408\n201409.
/^\(.*\)\n\1$/!P: If the pattern space does not consist of two identical lines separated by a newline (\n), print up to the first newline. So this prints 201408 to STDOUT. During the second iteration, though, the pattern space holds 201409\n201409; since that matches the regex, the negated P is skipped, nothing is printed, and we proceed to the next command.
D: Deletes up to the first newline (\n) and restarts the sed script without reading a new line. Remember that during the repeated cycle your pattern space still holds 201409.
So during the first iteration 201408 gets printed, but 201409 doesn't get printed until the last line has been read: then $!N no longer appends anything, the pattern space is a single 201409 that cannot match the regex, and P prints it.
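To confirm the breakdown, the sketch below feeds the sample data through the same script (in single quotes, as on a Unix shell) and compares it with uniq, which removes consecutive duplicate lines. The variable names are made up for the example.

```shell
# The five sample lines from finalish.txt, inlined for the example.
data='201408
201409
201409
201409
201409'

sed_out=$(printf '%s\n' "$data" | sed '$!N; /^\(.*\)\n\1$/!P; D')
uniq_out=$(printf '%s\n' "$data" | uniq)

printf '%s\n' "$sed_out"
# 201408
# 201409
```

On this input the one-liner and uniq give the same result, which is a reasonable way to read the script: keep a line only when the line after it is different.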
If you are inheriting a lot of sed code, I would strongly recommend the sedsed utility, which is written in Python and will help you understand convoluted and cryptic sed that can often become a maintenance nightmare.
Here is a sample run from the sedsed utility. I haven't shown all iterations as it is pretty verbose, but you get the picture. I have added a few comments on what the output really means. Also notice I am using single quotes since I am on Mac (BSD Unix) and not Windows:
$ sedsed.py -d '$!N; /^\(.*\)\n\1$/!P; D' file
PATT:201408$ # This shows your current pattern space
HOLD:$ # This shows your current hold buffer
COMM:$ !N # This shows the command that is going to run
PATT:201408$ # This shows the pattern space after the command has ran
201409$
HOLD:$ # This shows the hold buffer after the command has ran
COMM:/^\(.*\)\n\1$/ !P # This shows the command being ran
201408 # Anything without a <TAG:> is what gets printed to STDOUT
PATT:201408$
201409$
HOLD:$
COMM:D
PATT:201409$
HOLD:$
...
...
...
COMM:$ !N
PATT:201409$
HOLD:$
COMM:/^\(.*\)\n\1$/ !P
201409
PATT:201409$
HOLD:$
COMM:D
I would also suggest that, once you understand what your sed commands were written for, you port them to a friendlier scripting language such as awk, perl or python.
This will not help you understand the sed, but here is an awk command that prints just the unique lines. Note that it drops every repeated line, not only consecutive duplicates.
awk '!seen[$0]++' finalish.txt
201408
201409
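The two commands agree on the sample file but are not interchangeable in general. The sketch below uses a made-up input in which a value reappears after a different line, to show where they diverge.

```shell
# "a" appears, is interrupted by "b", then appears twice in a row.
input='a
b
a
a'

# sed: collapses only runs of identical adjacent lines.
sed_out=$(printf '%s\n' "$input" | sed '$!N; /^\(.*\)\n\1$/!P; D')

# awk: remembers every line seen so far and prints each value once.
awk_out=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf 'sed: %s\n' "$(printf '%s' "$sed_out" | tr '\n' ' ')"
printf 'awk: %s\n' "$(printf '%s' "$awk_out" | tr '\n' ' ')"
# sed: a b a
# awk: a b
```

If the monthly file is always sorted, as the sample suggests, the difference never shows up; on unsorted input the awk version removes more lines than the original sed did.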

What does the 'N' command do in sed?

It looks like the 'N' command works on every other line:
$ cat in.txt
a
b
c
d
$ sed '=;N' in.txt
1
a
b
3
c
d
Maybe that would be natural because command 'N' joins the next line and changes the current line number. But (I saw this here):
$ sed 'N;$!P;$!D;$d' thegeekstuff.txt
The above example deletes the last two lines of a file. This works not only for even-line-numbered files but also for odd-line-numbered files. In this example 'N' command runs on every line. What's the difference?
And could you tell me why I cannot see the last line when I run sed like this:
# sed N odd-lined-file.txt
Excerpt from info sed:
`sed' operates by performing the following cycle on each lines of
input: first, `sed' reads one line from the input stream, removes any
trailing newline, and places it in the pattern space. Then commands
are executed; each command can have an address associated to it:
addresses are a kind of condition code, and a command is only executed
if the condition is verified before the command is to be executed.
...
When the end of the script is reached, unless the `-n' option is in
use, the contents of pattern space are printed out to the output
stream,
...
Unless special commands (like 'D') are used, the pattern space is
deleted between two cycles
...
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then `sed'
exits without processing any more commands.
...
`D'
Delete text in the pattern space up to the first newline. If any
text is left, restart cycle with the resultant pattern space
(without reading a new line of input), otherwise start a normal
new cycle.
This should pretty much answer your question, but I will still try to explain your three cases:
CASE 1:
1. sed reads a line from input. [Now there is 1 line in the pattern space.]
2. = prints the current line number.
3. N reads the next line into the pattern space. [Now there are 2 lines in the pattern space.]
   If there is no next line to read, sed exits here. [I.e. with an odd number of lines sed exits here, and the last line is swallowed without being printed. GNU sed is an exception: it prints the pattern space before exiting unless POSIXLY_CORRECT is set.]
4. sed prints the pattern space and clears it. [The pattern space is empty.]
5. If EOF has been reached, sed exits here; otherwise it restarts the complete cycle from step 1. [I.e. with an even number of lines sed exits here.]
Summary: in this case sed reads 2 lines and prints 2 lines at a time. The last line is swallowed if the number of lines is odd (see step 3).
CASE 2:
1. sed reads a line from input. [Now there is 1 line in the pattern space.]
2. N reads the next line into the pattern space. [Now there are 2 lines in the pattern space.]
   If it fails, sed exits here. This occurs only if the input has just 1 line.
3. If it's not the last line ($!), print the first line (P) of the pattern space. [The first line of the pattern space is printed, but there are still 2 lines in the pattern space.]
4. If it's not the last line ($!), delete the first line (D) from the pattern space [now there is only 1 line (the second one) in the pattern space] and restart the command cycle from step 2. This is because of the D command (see the excerpt above).
5. If it's the last line ($), delete (d) the complete pattern space. [I.e. EOF has been reached. Before this step there were 2 lines in the pattern space, which d now removes; at the end of this step the pattern space is empty.]
6. sed automatically stops at EOF.
Summary: in this case:
sed reads 2 lines first.
If there is a next line available to read, it prints the first line and reads the next line.
Otherwise it deletes both lines from the pattern space. This way it always deletes the last 2 lines.
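The summary can be checked on inputs with an odd and an even number of lines. The sketch below uses numbered lines generated with printf in place of a real file.

```shell
# Five input lines: N has slid the two-line window to "4\n5" when the
# last line is reached, so 4 and 5 are deleted and never printed.
odd=$(printf '1\n2\n3\n4\n5\n' | sed 'N;$!P;$!D;$d')

# Four input lines: the window ends on "3\n4".
even=$(printf '1\n2\n3\n4\n' | sed 'N;$!P;$!D;$d')

printf 'odd:  %s\n' "$(printf '%s' "$odd" | tr '\n' ' ')"
printf 'even: %s\n' "$(printf '%s' "$even" | tr '\n' ' ')"
# odd:  1 2 3
# even: 1 2
```

Because D restarts the script without reading input, N runs once per input line here instead of once per pair, which is why the parity of the line count does not matter.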
CASE 3:
This is the same as CASE 1, just without its step 2 (the = command).
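If the goal is to join lines in pairs without risking the last line, a common idiom is to guard N with $! so that it is never attempted on the last line. The sketch below uses a three-line input; whether a bare N prints or drops the final line depends on the implementation (GNU sed prints it unless POSIXLY_CORRECT is set), so only the guarded form is shown.

```shell
# $!N: run N on every line except the last, so N never hits end of input.
# Cycle 1 holds "a\nb" and auto-prints it; cycle 2 holds "c", skips N,
# and auto-prints "c" as usual.
all=$(printf 'a\nb\nc\n' | sed '$!N')
printf '%s\n' "$all"
# a
# b
# c
```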