Line range in sed matches multiple times in a file - sed

Suppose I have a file my_file:
start
startx
start3
hi
start4
end
done
stop
endagain
And now I try sed -n '/start/,/end/p' < my_file. How will sed interpret this range of lines since start occurs 4 times?

As running your command against your sample input will show you, the first line that contains start through the nearest following line that contains end (inclusively), will match.
sed doesn't support overlapping ranges:
Once the start pattern on a range matches a line, looking for the end of the range will start on the next line[1], and no matching of the start pattern will occur until after the end of the range is found.
The range ends once either the end pattern is matched or the end of the input is encountered.
Looking for the next range then starts on the line after the one that ended the previous one.
Note that I use the term "line" loosely here: while it's the default case to operate on lines, in sed terms it should be called pattern space, which can be something other than a line, depending on how the commands in the script manipulate the lines read.
[1] Note that, by contrast, awk starts looking for the end pattern on the same line (record).
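To check this yourself, here is a minimal sketch that recreates the sample file in a temporary location (the variable names are only illustrative) and runs the range from the question. The second pair of commands uses /start/ as both the start and the end pattern, to show the sed versus awk difference from the footnote.

```shell
# Recreate the sample file from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' start startx start3 hi start4 end done stop endagain > "$sample"

# The command from the question: the range runs from the first "start"
# line through the nearest following "end" line.
range_out=$(sed -n '/start/,/end/p' "$sample")
printf '%s\n' "$range_out"

# Same pattern at both ends of the range.
# sed looks for the end pattern starting on the NEXT line,
# awk already checks the line that opened the range.
sed_same=$(sed -n '/start/,/start/p' "$sample")
awk_same=$(awk '/start/,/start/' "$sample")
printf 'sed:\n%s\nawk:\n%s\n' "$sed_same" "$awk_same"

rm -f "$sample"
```

With sed the second command also prints hi, because the range opened by start3 only closes on start4; with awk every start line is a one-line range of its own.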

Related

How to parse sed regex syntax?

sed -i "0,/test/s//#test/g" file.txt
I do not know how to parse this regex. It is commenting out test by putting #, but my questions are
what is "0," at the beginning?
why is it not like "s/test/#test/g"? In other words, why is the s in the middle?
Any help is appreciated.
Let's break it down into smaller pieces:
https://www.gnu.org/software/sed/manual/sed.html#sed-script-overview
sed commands follow this syntax:
[addr]X[options]
X is a single-letter sed command. [addr] is an optional line address. If [addr] is specified, the command X will be executed only on the matched lines.
And
https://www.gnu.org/software/sed/manual/sed.html#Range-Addresses
An address range can be specified by specifying two addresses separated by a comma (,). An address range matches lines starting from where the first address matches, and continues until the second address matches (inclusively)
In the case of 0,/test/s//#test/g the address part is 0,/test/ because s is the command. An address part of 0,/test/ means the s command is only executed on lines inside that range. If the sed command was s/test/#test/g there wouldn't be an address part and the s command would be attempted on every line in the file.
https://www.gnu.org/software/sed/manual/sed.html#index-addr1_002c_002bN
A line number of 0 can be used in an address specification like 0,/regexp/ so that sed will try to match regexp in the first input line too. In other words, 0,/regexp/ is similar to 1,/regexp/, except that if addr2 matches the very first line of input the 0,/regexp/ form will consider it to end the range, whereas the 1,/regexp/ form will match the beginning of its range and hence make the range span up to the second occurrence of the regular expression.
Note that this is the only place where the 0 address makes sense; there is no 0-th line and commands which are given the 0 address in any other way will give an error.
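So in 0,/test/s//#test/g, the address part 0,/test/ restricts the s command to the lines from the start of the input through the first line that matches /test/, even if that is the first line. Since the substitution only changes lines that contain test, only that first matching line is modified.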
https://www.gnu.org/software/sed/manual/sed.html#index-empty-regular-expression
The empty regular expression ‘//’ repeats the last regular expression match (the same holds if the empty regular expression is passed to the s command).
So 0,/test/s//#test/g is the same as 0,/test/s/test/#test/g because the empty regular expression matches the one that was used in the address part - but it can be left out because writing the same regex twice just makes the whole command less readable.
In conclusion:
s/test/#test/g does the replacement on every line in the file that contains test
0,/test/s//#test/g does the replacement only on the first line in the file that contains test
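The difference between the three forms can be seen with a small made-up input. This is only a sketch: it assumes GNU sed (the 0,/regexp/ address is a GNU extension), and the sample text and variable names are invented for illustration.

```shell
# Made-up input where the very first line already contains "test".
input=$(printf '%s\n' 'test one' 'middle' 'test two' 'last test')

# 0,/test/ : the range may end on line 1, so only the first match changes.
first_only=$(printf '%s\n' "$input" | sed '0,/test/s//#test/g')

# 1,/test/ : the end pattern is first checked on line 2, so the range
# spans up to the second line that contains "test".
from_line_one=$(printf '%s\n' "$input" | sed '1,/test/s/test/#test/g')

# No address: the substitution is attempted on every line.
every_line=$(printf '%s\n' "$input" | sed 's/test/#test/g')

printf '%s\n---\n%s\n---\n%s\n' "$first_only" "$from_line_one" "$every_line"
```

The 1,/test/ variant spells the regex out in the s command instead of using the empty regex, to keep the example independent of how the empty regex is resolved on the first line.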

A way to append the beginning of every line before a pattern to the end of each same line?

I am trying to copy the beginning of every line in a text file before a certain character to the end of the same line.
I've tried duplicating each line to the end of itself, and then deleting everything after the character, but the trouble is I haven't been able to figure out how to skip the first instance of the character so the result is that the duplicated text gets deleted as well as everything beyond the first instance of the character.
I've tried things like
sed '/S/ s/$/ append text/' sample.txt > cleaned.txt
but this only adds a fixed text. I've also tried using:
s/\(.*\)/\1 \1/
to duplicate the line, and then deleting everything after the S, but I can't figure out how to get it to go to the 3rd S not the 1st to start deleting.
What I have to start with:
dog 50_50_S5_Scale
cat 10_RV_17_S76_Scale
mouse 15_EQ_S81_Scale
What I'm trying to get:
dog 50_50_S5_Scale dog 50_50_
cat 10_RV_17_S76_Scale cat 10_RV_17_
mouse 15_EQ_S81_Scale mouse 15_EQ_
Where everything before the first S gets copied to the end of the line.
You may use
sed 's/\([^S]*\)S.*/& \1/' file
Details
\([^S]*\) - Capturing group 1 (\1): any 0+ chars other than S
S.* - S and the rest of the string (actually, line, since sed processes line by line by default).
The replacement is the concatenation of the whole match (&), space and Group 1 value.
You could try:
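awk '{print $0 " " substr($0, 1, index($0,"S") - 1)}' file
We take the substring from the first character up to but not including the first occurrence of "S". awk string positions start at 1, so the start index must be 1; with a start of 0, some awk implementations return one character fewer.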
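Both approaches can be compared side by side. A minimal sketch, using input lines taken from the desired output shown in the question (the temporary file and variable names are only for illustration, and the awk call uses a start index of 1 since awk strings are 1-indexed):

```shell
# Sample lines, taken from the desired output in the question.
sample=$(mktemp)
printf '%s\n' 'dog 50_50_S5_Scale' 'cat 10_RV_17_S76_Scale' 'mouse 15_EQ_S81_Scale' > "$sample"

# sed: capture everything before the first S, then append it after the whole match.
sed_out=$(sed 's/\([^S]*\)S.*/& \1/' "$sample")

# awk: print the line, a space, and the prefix that ends just before the first S.
awk_out=$(awk '{print $0 " " substr($0, 1, index($0, "S") - 1)}' "$sample")

printf '%s\n' "$sed_out"
rm -f "$sample"
```

Note that the two differ on lines without any S: the sed substitution leaves such a line untouched, while the awk version still appends a space.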

Can someone break this sed command down for me?

I found this magical command on the unix forum to move the last line of a file to the beginning of the file. I use sed quite a bit but not to this extent. Can someone explain each part to me?
sed '1h;1d;$!H;$!d;G' infile
Yes, it uses exotic commands.
1h: put first line in the "hold" space (sed has 2 spaces: 1 hold space to keep data and the pattern space: actual processed line)
1d: delete first line
$!H: append all lines BUT the last one (and the first one since d command skips to the next line) into the "hold" space
$!d: delete (do not print) all lines except the last one
G: Append a newline to the contents of the pattern space (this is the last line, the only one able to reach that part of the script), and then append the contents of the hold space to that of the pattern space, pattern space which is printed right away. Swap done.
Opinion based comment: I must admit I would never have thought of doing that using sed, and I would have had to make a test to convince me of what this command was doing... in awk, it is much much easier to do that.
But sed has a special place in my heart with its cryptic commands. I wonder if there are some sed candidates to CodeGolf :)
reference manual: https://www.gnu.org/software/sed/manual/sed.html
some exotic things you can do with sed (my best 1999 read): http://sed.sourceforge.net/grabbag/tutorials/do_it_with_sed.txt
Here is the same command in a more procedural-looking pseudocode:
for line in infile:
    # Always do this: copy the current line to the pattern space
    pattern = line
    # Process the script
    if first line:
        hold = pattern                  # 1h
        pattern = ""; continue          # 1d
    elif not last line:
        hold = hold + "\n" + pattern    # $!H
        pattern = ""; continue          # $!d
    pattern = pattern + "\n" + hold     # G
    # Always do this after the script is completed.
    # Due to the continue statements above, this
    # isn't always reached, and in this case
    # is only reached for the last line.
    print pattern
d clears the pattern space and continues to the next input line without executing the rest of the script.
h copies the pattern space to the hold space.
H appends a newline to the hold space, then appends the pattern space to the hold space.
G is like H, but in the other direction; it appends a newline to the pattern space, then appends the hold space to the pattern space.
The overall effect on a file with N lines is to build up a copy of lines 1 through N-1 in the hold space. When the pattern space holds line N, append the hold space to the pattern space and print the pattern space to standard output.
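To watch the command at work, here is a small sketch on a made-up four-line file (the file contents and variable names are just for illustration). It also runs the command on a one-line file, where line 1 is both the first and the last line, so 1d deletes it before G is ever reached.

```shell
# A made-up four-line input file.
infile=$(mktemp)
printf '%s\n' a b c d > "$infile"

# The last line moves to the top; lines 1..N-1 follow from the hold space.
rotated=$(sed '1h;1d;$!H;$!d;G' "$infile")
printf '%s\n' "$rotated"

# Edge case: with a single line, 1d deletes it and nothing is printed.
printf '%s\n' only > "$infile"
single=$(sed '1h;1d;$!H;$!d;G' "$infile")

rm -f "$infile"
```

So the one-liner is only safe for files with at least two lines.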

Alternatives to grep/sed that treat new lines as just another character

Both grep and sed handle input line-by-line and, as far as I know, getting either of them to handle multiple lines isn't very straightforward. What I'm looking for is an alternative or alternatives to these two programs that treat newlines as just another character. Is there any tool that fits such criteria?
The tool you want is awk. It is record-oriented, not line-oriented, and you can specify your record-separator by setting the builtin variable RS. In particular, GNU awk lets you set RS to any regular expression, not just a single character.
Here is an example where awk uses one blank line to separate every record. If you show us what data you have, we can help you with it.
cat file
first line
second line
third line

fourth line
fifth line
sixth line

seventh line
eight line

more data
Run awk on this, using blank lines as the record separator, to rebuild the data:
awk -v RS= '{$1=$1}1' file
first line second line third line
fourth line fifth line sixth line
seventh line eight line
more data
PS: RS is not being set to file here. RS= sets RS to the empty string (the same as RS=""), which makes awk treat blank lines as record separators.
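For a version that can be pasted straight into a shell, here is a sketch that writes the sample data, including the blank lines between the blocks, to a temporary file first (the file and variable names are only illustrative):

```shell
# Sample data: blocks of lines separated by single blank lines.
data=$(mktemp)
printf '%s\n' 'first line' 'second line' 'third line' '' \
              'fourth line' 'fifth line' 'sixth line' '' \
              'seventh line' 'eight line' '' \
              'more data' > "$data"

# RS= (empty) switches awk to paragraph mode: each blank-line-separated
# block is one record. $1=$1 rebuilds the record with single spaces,
# and the trailing 1 prints it.
joined=$(awk -v RS= '{$1=$1}1' "$data")
printf '%s\n' "$joined"

rm -f "$data"
```

In paragraph mode a newline always acts as a field separator too, which is why the lines of each block end up joined by spaces.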
1) sed can handle a block of lines together; it does not always have to work line by line.
In sed, normally I use :loop; $!{N; b loop}; to get all the lines available in pattern space delimited by newline.
Sample:
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
Result (remove the content between the double quotes):
sed -e ':loop; $!{N; b loop}; s/\"[^\"]*\"//g' thegeekstuff.txt
Productivity
Google Search\
Tips
The article "Unix Sed Tutorial: 6 Examples for Sed Branching Operation" explains in detail how this works:
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-6-examples-for-sed-branching-operation/
2) For grep, check whether your grep supports the -z option, which means it does not have to handle input line by line.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
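The sed loop from point 1 can be tried with the sketch below. It assumes a sed that accepts the one-line :loop; ... syntax (GNU sed does), and the temporary file stands in for thegeekstuff.txt.

```shell
# Recreate the sample text; the quoted 'EOF' keeps the backslash literal.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
EOF

# :loop; $!{N; b loop} appends lines until the last one has been read,
# so the substitution sees the whole file, newlines included.
slurped=$(sed -e ':loop; $!{N; b loop}; s/"[^"]*"//g' "$sample")
printf '%s\n' "$slurped"

rm -f "$sample"
```

Because the whole file sits in the pattern space, [^"]* is free to match across the newlines inside the quoted block. Keep in mind that this reads the entire input into memory.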

What does the 'N' command do in sed?

It looks like the 'N' command works on every other line:
$ cat in.txt
a
b
c
d
$ sed '=;N' in.txt
1
a
b
3
c
d
Maybe that would be natural because command 'N' joins the next line and changes the current line number. But (I saw this here):
$ sed 'N;$!P;$!D;$d' thegeekstuff.txt
The above example deletes the last two lines of a file. This works not only for even-line-numbered files but also for odd-line-numbered files. In this example 'N' command runs on every line. What's the difference?
And could you tell me why I cannot see the last line when I run sed like this:
# sed N odd-lined-file.txt
Excerpt from info sed:
`sed' operates by performing the following cycle on each lines of
input: first, `sed' reads one line from the input stream, removes any
trailing newline, and places it in the pattern space. Then commands
are executed; each command can have an address associated to it:
addresses are a kind of condition code, and a command is only executed
if the condition is verified before the command is to be executed.
...
When the end of the script is reached, unless the `-n' option is in
use, the contents of pattern space are printed out to the output
stream,
...
Unless special commands (like 'D') are used, the pattern space is
deleted between two cycles
...
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then `sed'
exits without processing any more commands.
...
`D'
Delete text in the pattern space up to the first newline. If any
text is left, restart cycle with the resultant pattern space
(without reading a new line of input), otherwise start a normal
new cycle.
This should pretty much resolve your query. But still I will try to explain your three different cases:
CASE 1:
1) sed reads a line from input. [Now there is 1 line in pattern space.]
2) = prints the current line number.
3) N reads the next line into pattern space. [Now there are 2 lines in pattern space.]
   - If there is no next line to read, sed exits here. [i.e. with an odd number of lines, sed exits here, and hence the last line is swallowed without printing. GNU sed is an exception: it prints the pattern space before exiting unless POSIXLY_CORRECT is set.]
4) sed prints the pattern space and cleans it. [Pattern space is empty.]
5) If EOF is reached, sed exits here. Otherwise it restarts the complete cycle from step 1. [i.e. with an even number of lines, sed exits here.]
Summary: In this case sed reads 2 lines and prints 2 lines at a time. The last line is swallowed if there is an odd number of lines (see step 3).
CASE 2:
1) sed reads a line from input. [Now there is 1 line in pattern space.]
2) N reads the next line into pattern space. [Now there are 2 lines in pattern space.]
   - If it fails, sed exits here. This occurs only if the file has just 1 line.
3) If it's not the last line ($!), print the first line (P) from pattern space. [The first line from pattern space is printed, but there are still 2 lines in pattern space.]
4) If it's not the last line ($!), delete the first line (D) from pattern space [now there is only 1 line (the second one) in the pattern space] and restart the command cycle from step 2. This is because of the D command (see the excerpt above).
5) If it's the last line ($), delete (d) the complete pattern space. [i.e. EOF has been reached. Before this step there were 2 lines in the pattern space, which are now cleaned up by d; at the end of this step, the pattern space is empty.]
6) sed automatically stops at EOF.
Summary: In this case:
sed reads 2 lines first.
If there is a next line available to read, it prints the first line and reads the next line.
Otherwise it deletes both lines from the pattern space. This way it always deletes the last 2 lines.
CASE 3:
It's the same as CASE 1, just with step 2 removed.
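The cases can be reproduced with two small made-up files, one with an even and one with an odd number of lines. The last command is not from the question: $!N is a common way to skip N on the last line, so that the result no longer depends on what a given sed does when N runs out of input.

```shell
# Made-up inputs: four lines (even) and five lines (odd).
even=$(mktemp)
odd=$(mktemp)
printf '%s\n' a b c d > "$even"
printf '%s\n' a b c d e > "$odd"

# CASE 1: = runs once per cycle, and each cycle consumes two lines.
numbered=$(sed '=;N' "$even")

# CASE 2: D restarts the cycle without reading a new line, so N
# effectively runs for every line; the last two lines are dropped.
cut_even=$(sed 'N;$!P;$!D;$d' "$even")
cut_odd=$(sed 'N;$!P;$!D;$d' "$odd")

# Assumption (not in the original question): $!N skips N on the
# last line, so an odd-length file is printed completely.
all_odd=$(sed '$!N' "$odd")

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$numbered" "$cut_even" "$cut_odd" "$all_odd"
rm -f "$even" "$odd"
```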