SED Code Explanation - sed

I have a line of sed, below, that is in a batch command that I run every month. It was written by someone before me, and I am looking to understand the parts of this code. From the two outputs I can tell that when sequential lines are duplicates it keeps one and deletes the rest; I just don't understand how this line does it.
sed "$!N; /^\(.*\)\n\1$/!P; D" finalish.txt > final.txt
Example of - Finalish.txt
201408
201409
201409
201409
201409
Example of - Final.txt
201408
201409

Without going into the basics of sed, here is your sed command broken down:
$!N: If the current line is not the last line of input, append the next line to the pattern space. The two lines will be separated by a newline (\n). At this point your pattern space is 201408\n201409.
/^\(.*\)\n\1$/!P: If the pattern space does not consist of two identical lines separated by a newline (\n), then print up to the first newline (\n). So this will print 201408 to STDOUT. During the second iteration, though, the pattern space will be 201409\n201409, and since that matches the regex, the negated (!) P is skipped, nothing gets printed, and we proceed to the next command.
D: Deletes up to and including the first newline (\n) and restarts the sed script without reading a new line of input. Remember that during the repeated cycle your pattern space still holds 201409.
So during the first iteration 201408 gets printed, but 201409 doesn't get printed until the last line is reached. At that point $!N no longer appends anything, the pattern space holds a single 201409 with no newline, the regex fails to match, and the content gets printed.
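To see the cycle in action, the command can be run on the sample data and compared with uniq, which also collapses runs of adjacent identical lines. This is only a sketch: the input is fed through printf rather than finalish.txt, and single quotes are used on the assumption of a POSIX shell rather than a Windows batch file.

```shell
# Sample data from the question, supplied via printf instead of finalish.txt
input=$(printf '201408\n201409\n201409\n201409\n201409\n')

# The original command, single-quoted for a POSIX shell
sed_out=$(printf '%s\n' "$input" | sed '$!N; /^\(.*\)\n\1$/!P; D')

# uniq also drops adjacent duplicate lines
uniq_out=$(printf '%s\n' "$input" | uniq)

printf '%s\n' "$sed_out"
[ "$sed_out" = "$uniq_out" ] && echo "sed and uniq agree"
```

If the two results agree on your real data as well, uniq may be a simpler replacement for this particular line.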
If you are inheriting a lot of sed code, I would strongly recommend the sedsed utility, which is written in Python and will help you understand convoluted and cryptic sed that can often become a maintenance nightmare.
Here is a sample run from the sedsed utility (I haven't shown all iterations, as it is pretty verbose, but you get the picture. I have added a few comments on what the output means. Also notice I am using single quotes since I am on Mac (BSD Unix) and not Windows):
$ sedsed.py -d '$!N; /^\(.*\)\n\1$/!P; D' file
PATT:201408$ # This shows your current pattern space
HOLD:$ # This shows your current hold buffer
COMM:$ !N # This shows the command that is going to run
PATT:201408$ # This shows the pattern space after the command has run
201409$
HOLD:$ # This shows the hold buffer after the command has run
COMM:/^\(.*\)\n\1$/ !P # This shows the command being run
201408 # Anything without a <TAG:> is what gets printed to STDOUT
PATT:201408$
201409$
HOLD:$
COMM:D
PATT:201409$
HOLD:$
...
...
...
COMM:$ !N
PATT:201409$
HOLD:$
COMM:/^\(.*\)\n\1$/ !P
201409
PATT:201409$
HOLD:$
COMM:D
I would also suggest that once you understand what your sed commands were written for, you port them to a friendlier scripting language like awk, Perl or Python.

This will not help you understand the sed, but here is an awk command that prints only the unique lines. Note that it drops every repeated line, not just adjacent duplicates:
awk '!seen[$0]++' finalish.txt
201408
201409
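The awk one-liner and the original sed command give the same result on the sample data, but they are not equivalent when a value repeats non-adjacently. A small sketch with hypothetical input (a, b, b, a) shows the difference:

```shell
# Hypothetical input in which "a" repeats, but not on adjacent lines
input=$(printf 'a\nb\nb\na\n')

# sed collapses only adjacent duplicates, so the second "a" survives
adjacent=$(printf '%s\n' "$input" | sed '$!N; /^\(.*\)\n\1$/!P; D')

# awk remembers every line seen so far, so the second "a" is dropped
global=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf 'sed:\n%s\nawk:\n%s\n' "$adjacent" "$global"
```

Which behaviour is wanted depends on whether the monthly file can contain the same value in separate runs.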

Related

How to avoid the last newline in sed?

I want to remove the last part of a file, starting at a line following a certain pattern and including the preceding newline.
So, stopping at "STOP", the following file:
keep\n
STOP\n
whatever
Should output:
keep
With no trailing newline.
I tried this, and the logic seems to work, but it seems that sed adds a newline every time it prints its buffer. How can I avoid that? When sed doesn't manipulate the buffer, I don't have that problem (i.e. if I remove the STOP, sed outputs 'whatever' at the end of the file without a newline).
printf 'keep
STOP
Whatever' | sed 'N
/\nSTOP/ {
s/\n.*$//
P
Q
}
P
D'
I'm trying to write a git cleaning filter, and I cannot have a new newline appended every time I commit.
$ awk '/^STOP/{exit} {printf "%s%s", ors, $0; ors=RS}' file
keep$
The above prints every line without a trailing newline but preceded by a newline (\n or \r\n - whichever your environment dictates so it'll behave correctly on UNIX or Windows or whatever) for every 2nd and subsequent line. When it finds a STOP line it just exits before printing anything.
Note that the above doesn't keep anything in memory except the current line so it'll work no matter how large your input file is and no matter where the STOP appears in it - it'll even work if STOP is the first line of the file unlike the other answers you have so far.
It will also work using any awk in any shell on every UNIX box.
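One way to check the claim about the missing trailing newline is to count the output bytes. This is a minimal sketch that uses printf input in place of a file; "keep" alone is 4 bytes, so any extra byte would be an unwanted newline.

```shell
# Count output bytes to confirm that nothing follows "keep" (4 bytes)
bytes=$(printf 'keep\nSTOP\nwhatever' |
  awk '/^STOP/{exit} {printf "%s%s", ors, $0; ors=RS}' | wc -c)
echo "bytes: $bytes"

# With STOP on the first line, awk exits before printing anything
first=$(printf 'STOP\nwhatever' |
  awk '/^STOP/{exit} {printf "%s%s", ors, $0; ors=RS}' | wc -c)
echo "bytes when STOP is first: $first"
```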
This might work for you (GNU sed):
sed -z 's/\nSTOP.*//' file
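The -z option makes sed split input on NUL characters instead of newlines, so a file containing no NULs is read into memory as a single record. The substitute command then removes everything from the first newline that is followed by STOP to the end of the file.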
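A quick byte count can confirm the result. This sketch assumes GNU sed (the -z option is a GNU extension) and uses printf input in place of a file. Because the pattern starts with \n, it will not match when STOP is the very first line of the input.

```shell
# Assumes GNU sed: with -z, NUL-free input is handled as one record
bytes=$(printf 'keep\nSTOP\nwhatever' | sed -z 's/\nSTOP.*//' | wc -c)

# "keep" is 4 bytes; a larger count would mean a newline was appended
echo "bytes: $bytes"
```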
Using awk you could:
$ awk '$0=="STOP"{exit} {b=b (b==""?"":ORS) $0} END{printf "%s",b}' file
Output:
keep$
Explained:
$ awk '
$0=="STOP" { exit } # exit at STOP, ie. go to END
{ b=b (b==""?"":ORS) $0 } # gather an output buffer, control \n
END { printf "%s",b } # in the END output output buffer
' file
... more (focusing a bit on the conditional operator):
b=b # appending to b, so b is b and ...
(b==""?"":ORS) # if b was empty, add nothing to it, if not add ORS ie. \n ...
$0 # and the current record

Use sed to take all lines containing regex and append to end of file

I'm trying to come up with a sed script to take all lines containing a pattern and move them to the end of the output. This is an exercise in learning hold vs pattern space and I'm struggling to come up with it (though I feel close).
I'm here:
$ echo -e "hi\nfoo1\nbar\nsomething\nfoo2\nyo" | sed -E '/foo/H; //d; $G'
hi
bar
something
yo

foo1
foo2
But I want the output to be:
hi
bar
something
yo
foo1
foo2
I understand why this is happening. It is because the first time we find foo the hold space is empty so the H appends \n to the blank hold space and then the first foo, which I suppose is fine. But then the $G does it again, namely another append which appends \n plus what is in the hold space to the pattern space.
I tried a final delete command with /^$/d but that didn't remove the blank line (I think this is because this pattern is being matched not against the last line, but against the, now, multiline pattern space which has a \n\n in it).
I'm sure the sed gurus have a fix for me.
This might work for you (GNU sed):
sed '/foo/H;//!p;$!d;x;//s/.//p;d' file
If the line contains the required string append it to the hold space (HS) otherwise print it as normal. If it is not the last line delete it otherwise swap the HS for the pattern space (PS). If the required string(s) is now in the PS (what was the HS); since all such patterns were appended, the first character will be a newline, delete the first character and print. Delete whatever is left.
An alternative, using the -n flag:
sed -n '/foo/H;//!p;$!b;x;//s/.//p' file
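N.B. When the d command, or the b command without a label, is executed, no further sed commands are run in that cycle. With d, the pattern space is deleted; with a bare b, sed jumps to the end of the script (where -n suppresses the automatic print). In both cases a new line is then read into the PS and the script starts again from the first command, i.e. execution does not resume after the d or b.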
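The first command above can be tried on the question's sample input. This is a sketch that swaps echo -e for printf (an assumption made for portability) and captures the result in a variable:

```shell
# Sample input from the question, fed with printf instead of echo -e
out=$(printf 'hi\nfoo1\nbar\nsomething\nfoo2\nyo\n' |
  sed '/foo/H;//!p;$!d;x;//s/.//p;d')

# Non-matching lines come first, then the held foo lines, with no blank line
printf '%s\n' "$out"
```

On the last line, x brings "\nfoo1\nfoo2" into the pattern space, and s/.// strips the leading newline that the first H put there.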
Why? Stuff like this is absolutely trivial in awk, awk is available everywhere that sed is, and the resulting awk script will be simpler, more portable, faster and better in almost every other way than a sed script to do the same task. All that hold space stuff was necessary in sed before the mid-1970s when awk was invented but there's absolutely no use for it now other than as a mental exercise.
$ echo -e "hi\nfoo1\nbar\nsomething\nfoo2\nyo" |
awk '/foo/{buf = buf $0 RS;next} {print} END{printf "%s",buf}'
hi
bar
something
yo
foo1
foo2
The above will work as-is in every awk on every UNIX installation and I bet you can figure out how it works very easily.
This feels like a hack and I think it should be possible to handle this situation more gracefully. The following works on GNU sed:
echo -e "hi\nfoo1\nbar\nsomething\nfoo2\nyo" | sed -r '/foo/{H;d;}; $G; s/\n\n/\n/g'
However, on OSX/BSD sed, results in this odd output:
hi
bar
something
yonfoo1
foo2
Note that the 2 consecutive newlines were replaced with the literal character n.
The OSX/BSD vs GNU sed difference is explained in this article. The following works on both (GNU sed as well):
echo -e "hi\nfoo1\nbar\nsomething\nfoo2\nyo" | sed '/foo/{H;d;}; $G; s/\n\n/\'$'\n''/'
TL;DR: BSD sed does not interpret escape sequences such as \n in the RHS of the replacement expression, so you either have to put a true LF/newline in there at the command line, or do the above: split the sed script string where you need the newline on the RHS and put a dollar sign in front of '\n' so that the shell (one that supports $'...' quoting, such as bash) interprets it as a line feed.
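The "true LF/newline" variant mentioned above can be sketched as follows: a backslash followed by an actual line break inside the replacement. This form does not depend on $'...' quoting. The printf input stands in for echo -e.

```shell
# The replacement is a backslash followed by a real newline, which sed
# treats as a literal newline on the right-hand side of s///
out=$(printf 'hi\nfoo1\nbar\nsomething\nfoo2\nyo\n' |
  sed '/foo/{H;d;}; $G; s/\n\n/\
/')

printf '%s\n' "$out"
```

Note that this script, like the one it is derived from, relies on the last input line not matching foo, since a matching last line is deleted before $G can run.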

Use sed to replace line if pattern is on next line

How do I get sed to replace previous line? I only came across examples of delete, insert lines, but what I actually need is that I only make substitution to current line if a condition on following line is met.
My sample file is like this
$ /bin/cat test
Cygwin
Cygwin is a cool emulator for Linux on Windows.
Unix
Maybe
the coolest environment?
Linux
Is also one of the best environments
Solaris
Why did Sun feel copying Java into Unix would matter?
AIX
Unknown
The output I expect is as below. Prepend ::: to strings having max 25 chars, but only if the string on the next line is longer than 25 chars. Thus, the lines Unix and AIX below should not be prefixed with :::, but others would.
$ # See detailed sed expression in my answer below
:::Cygwin
Cygwin is a cool emulator for Linux on Windows.
Unix
Maybe
the coolest environment?
:::Linux
Is also one of the best environments
:::Solaris
Why did Sun feel copying Java into Unix would matter?
AIX
Unknown
What sed expression can help me do this?
I am inclined to use only sed since this is a part of some other script that has other sed expressions going on, so I do not want to deviate if possible.
Here's one sed expression that gives me the output I desire,
/bin/sed -rne '/^\s*$/{d;};{p;}' test | /bin/sed -rne '/(^.{5,26}$)/{$p;h;n;/^.{5,26}$/{x;p;x;p;D;};{x;s/(^.*$)/:::\1/;p;x;p;D;}};{$p;h;p;}'
Specifically, below two sed expressions are piped together above,
/bin/sed -rne '/^\s*$/{d;};{p;}' test
# Remove any empty-lines (optionally containing spaces)
/bin/sed -rne '/(^.{5,26}$)/{$p;h;n;/^.{5,26}$/{x;p;x;p;D;};{x;s/(^.*$)/:::\1/;p;x;p;D;}};{$p;h;p;}'
# This is the killer sed expression I came up with hunting around with my limited knowledge
# The detailed breakdown of this expression is as below,
/(^.{5,26}$)/ # Get a string of characters at least 5 chars to max 26 chars
{
$p; # Print if it's already on last line (since -n is in effect)
h; # Save it to hold space
n; # Get the next line into pattern space
/^.{5,26}$/ # Check if pattern space (i.e. next line) also has min 5, max 26 chars
{ # if above condition passed, execute inside here
x; # Swap pattern with hold space; i.e. Get current line back
p; # Print it (i.e. the first line)
x; # Swap again; to get back next line
p; # Print it (i.e. the second line)
D; # Stop cycle here, and process the next line in the input file
};
{ # else block for above if-condition
x; # Swap pattern with hold space; i.e. Get current line back
s/(^.*$)/:::\1/; # Prepend ::: to the line
p; # Print it (i.e. the first line)
x; # Swap again; to get back next line
p; # Print it (i.e. the second line)
D; # Stop cycle here, and process the next line in the input file
} # End processing next line
} # End if match
{ # Current line is longer than max 26 chars,
$p; # Print if it's already on last line (since -n is in effect)
h; # Remember it in hold space
p; # Print it (i.e. the current line)
}
With above explanation, I am able to achieve what I need.
But I am still not confident that this could not be written or explained in a more concise, or perhaps better, way.
It's pretty simple in awk if you get tired of trying to use the hammer of sed on this particular screw :-)
awk '{x[NR]=$0} END{for(i=1;i<=NR;i++){if(length(x[i])<26 && length(x[i+1])>25)printf ":::";print x[i]}}' file
Save all the lines in array x[]. At the end, go through the lines printing them but prefixing ones that meet your conditions with :::.
This might work for you (GNU sed):
sed -r '$!N;/^.{1,25}\n.{26,}$/s/^/:::/;P;D' file
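The two-line window technique in the sed command above can be checked on a small input. This sketch uses four hypothetical lines rather than the question's file, and -E in place of -r (an assumption: both GNU and BSD sed accept -E for extended regular expressions).

```shell
# Hypothetical input: only "short" is followed by a line longer than 25 chars
out=$(printf '%s\n' 'short' 'this line is clearly longer than 25 chars' \
                    'tiny' 'also tiny' |
  sed -E '$!N;/^.{1,25}\n.{26,}$/s/^/:::/;P;D')

printf '%s\n' "$out"
```

$!N keeps two lines in the pattern space, the regex tests "short line, then long line", P prints only the first of the pair, and D slides the window forward by one line.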
Perl One-Liner from Command-Line
This perl one-liner will do it (tested just now):
perl -0777 -pe 's/^([^\n]{1,25}$)(?=\n[^\n]{26,}$)/:::$1/smg' yourfile

Alternatives to grep/sed that treat new lines as just another character

Both grep and sed handle input line-by-line and, as far as I know, getting either of them to handle multiple lines isn't very straightforward. What I'm looking for is an alternative or alternatives to these two programs that treat newlines as just another character. Is there any tool that fits such criteria?
The tool you want is awk. It is record-oriented, not line-oriented, and you can specify your record-separator by setting the builtin variable RS. In particular, GNU awk lets you set RS to any regular expression, not just a single character.
Here is an example where awk uses one blank line to separate every record. If you show us what data you have, we can help you with it.
cat file
first line
second line
third line

fourth line
fifth line
sixth line

seventh line
eight line

more data
Running awk on this reconstructs the data, using blank lines as the record separator.
awk -v RS= '{$1=$1}1' file
first line second line third line
fourth line fifth line sixth line
seventh line eight line
more data
PS: RS is not being set to file here. RS= sets RS to the empty string, the same as RS="", which makes awk treat blank lines as record separators.
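A self-contained sketch of the same paragraph-mode idea, with hypothetical two-record input supplied by printf so that the blank line between records is explicit:

```shell
# Two records separated by one blank line; RS= (empty) enables paragraph mode
out=$(printf 'first line\nsecond line\n\nthird line\nfourth line\n' |
  awk -v RS= '{$1=$1}1')

# $1=$1 forces awk to rebuild each record with single spaces between fields
printf '%s\n' "$out"
```

In paragraph mode a newline also acts as a field separator, which is why rebuilding the record joins the lines of each block onto one line.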
1) Sed can handle a block of lines together, not only line by line.
In sed, normally I use :loop; $!{N; b loop}; to get all the lines available in pattern space delimited by newline.
Sample:
Productivity
Google Search\
Tips
"Web Based Time Tracking,
Web Based Todo list and
Reduce Key Stores etc"
Result (remove the content between the double quotes):
sed -e ':loop; $!{N; b loop}; s/\"[^\"]*\"//g' thegeekstuff.txt
Productivity
Google Search\
Tips
You should read this article (Unix Sed Tutorial: 6 Examples for Sed Branching Operation); it explains in detail how this works.
http://www.thegeekstuff.com/2009/12/unix-sed-tutorial-6-examples-for-sed-branching-operation/
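The slurp loop can be tried on a tiny input. This sketch splits the script into separate -e pieces, on the assumption that this is more portable than putting a semicolon after the label, and uses one hypothetical quoted string that spans two lines:

```shell
# Hypothetical input: a double-quoted string that spans two lines
out=$(printf 'keep "drop\nthis" too\n' |
  sed -e ':loop' -e '$!{' -e 'N' -e 'b loop' -e '}' -e 's/"[^"]*"//g')

# After the loop the whole input is in the pattern space, so the
# substitution can match across the embedded newline
printf '%s\n' "$out"
```

Bear in mind that this reads the entire input into memory before the substitution runs.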
2) For grep, check if your grep supports the -z option, which means input need not be handled line by line.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.

How can I remove all non-word characters except the newline?

I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//g'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in the \W? Or is there some other way?
This removes characters that don't match \w or \n:
cat file | perl -C -pe 's/[^\w\n]//g'
@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, thus it loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:
$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters
In Perl, I'd just add the -l switch, which re-adds the newline by appending it to the end of every print():
perl -ple 's/\W//g' file
Notice that you don't need the cat.
The previous response isn't echoing the "ó" character. At least in my case.
sed 's/\W//g' file
Best practices for shell scripting dictate that you should use the tr program for replacing single characters instead of sed, because it's faster and more efficient. Obviously use sed if replacing longer strings.
tr -d '[:blank:][:punct:]' < file
When run with time I get:
real 0m0.003s
user 0m0.000s
sys 0m0.004s
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
real 0m0.003s
user 0m0.004s
sys 0m0.004s
While not a "huge" difference, you'll notice the difference when running against larger data sets. Also please notice how I didn't pipe cat's output into tr, instead using I/O redirection (one less process to spawn).
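A variant of the tr approach keeps an allow-list instead of deleting a deny-list: -c complements the set, so everything except letters, digits, underscore and newline is removed. This is a sketch with ASCII-only input; some tr implementations (GNU tr among them) work on bytes, so a multibyte character such as ó may be dropped rather than kept.

```shell
# ASCII-only sample; -c complements the set and -d deletes what is left
out=$(printf "my line - some words & text\noh look i've got some characters\n" |
  tr -cd '[:alnum:]_\n')

printf '%s\n' "$out"
```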