Sed running recursively and can't re-allocate memory

I am new to sed and have been trying out some commands, but they always seem to end up in some kind of recursion, and in some cases sed says "can't re-allocate memory".
Infinite Recursive Output:
echo -e 'hell\nnautanki\nwtf' | sed -e '1h;1!H;$!d' -e 'x;l;D'
Memory Re-allocation problem:
echo -e 'hell\nnautanki\nwtf'| sed -e '1h;1!H;$!d' -e 'x;D'
Errors out with:
sed: couldn't re-allocate memory

As noted by paulsm4, you've created an infinite loop which allocates memory on each iteration. The first example is just a slower version of the second: because of the printing it takes longer, but it will eventually also fail with couldn't re-allocate memory.
Let's break it down:
1h
1!H
$!d
Together these save all input into the hold space. Note that d starts the next cycle.
The last two commands x; D are only executed when the last line is reached. This is the situation just before these will be run:
PS: wtf
HS: hell\nnautanki\nwtf
x swaps them, and D removes hell\n and restarts the cycle without reading new input because the pattern space isn't empty. sed is still on the last line, so 1!H is executed on the new pattern space and $!d does not fire, resulting in:
PS: nautanki\nwtf
HS: wtf\nnautanki\nwtf
i.e. a slight increase in memory usage on every iteration.
With two lines of input, the situation is a bit different:
PS: nautanki
HS: hell\nnautanki
Becomes:
PS: nautanki
HS: nautanki
And so on to infinity.
One line of input results in:
PS: hell
HS: hell
Then:
PS:
HS: hell
And so terminates.
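If the intent was to collect the whole input in the hold space and print it once at the end (an assumption, since the question does not state the goal), a minimal sketch that avoids the loop is to print with p on the last line instead of restarting the cycle with D. The -n flag and the trailing p are the only changes to the original commands:

```shell
# Assumption: the goal is to gather every line in the hold space and print it once.
# 1h   copies the first line into the hold space
# 1!H  appends every later line to the hold space
# $!d  discards the pattern space on every line except the last
# x;p  (reached only on the last line) swaps the collected text in and prints it
out=$(printf 'hell\nnautanki\nwtf\n' | sed -n -e '1h;1!H;$!d' -e 'x;p')
printf '%s\n' "$out"
```

Because p never restarts the cycle, sed simply finishes once the last line has been processed.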

Related

What's the correct usage of sed with parallel --jobs option?

parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a file delimited by space, where the first column is pattern and the second column is replacement.
The problem is that after I ran the command, not all patterns in file were replaced. When I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it works as expected (but much more slowly).
Is there a necessary parameter missing from my command?
Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU bound, but this one is I/O bound; you are simply creating congestion by having several processes fight over the access to bytes from the disk, and then in this case also fighting over write access back to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
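The sort-by-length hack can be sketched roughly as follows. The sample data and the temporary directory are made up for illustration; the real input and file would take their place, and the same caveats about special characters in patterns apply as for the script generator above.

```shell
# Hypothetical sample data standing in for the real 'input' and 'file'.
tmpdir=$(mktemp -d)
printf 'abc x\nabcdef y\n' > "$tmpdir/input"
printf 'abcdef abc\n' > "$tmpdir/file"

# Prefix each pair with the length of its pattern, sort longest first,
# drop the prefix again, then turn each pair into an s/// command.
awk '{ print length($1), $0 }' "$tmpdir/input" |
  sort -k1,1nr |
  cut -d' ' -f2- |
  sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' > "$tmpdir/script.sed"

# Run the generated script once over the file.
result=$(sed -f "$tmpdir/script.sed" "$tmpdir/file")
echo "$result"
rm -rf "$tmpdir"
```

With the longer pattern first, abcdef becomes y before the shorter abc rule gets a chance to turn it into xdef.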
Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or you can set $TMPDIR to a directory that does.

SED Code Explanation

I have a line of sed, shown below, in a batch command that I run every month. It was written by someone before me, and I would like to understand the parts of this code. From the two outputs I can tell that it keeps one line and deletes the others when sequential lines are duplicates; I just don't understand how this line does it.
sed "$!N; /^\(.*\)\n\1$/!P; D" finalish.txt > final.txt
Example of finalish.txt
201408
201409
201409
201409
201409
Example of final.txt
201408
201409
Without going into the basics of sed, here is your sed command broken down:
$!N: If it is not the last line, append the next line to the pattern space. The two lines will be separated by a newline (\n). At this point your pattern space is 201408\n201409.
/^\(.*\)\n\1$/!P: If the pattern space does not consist of two identical lines separated by a newline (\n), then print up to the first newline (\n). So this will print 201408 to STDOUT. During the second iteration though, the pattern space will be 201409\n201409, and since that matches the regex, the negated P is skipped, nothing gets printed, and we proceed to the next command.
D: Deletes up to the first newline (\n) and restarts the sed script without reading a new line. Remember that during the repeated cycle your pattern space still has the 201409.
So during the first iteration 201408 gets printed, but 201409 doesn't get printed until the end of file is reached. At that point $!N no longer appends anything, so the pattern space holds a single 201409 with no newline, the regex does not match, and the content gets printed.
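To check the walkthrough against the sample data, the same command can be run on the five example lines fed in through printf rather than from finalish.txt. This is only a sketch for a POSIX shell, so it uses single quotes instead of the double quotes of the Windows batch file:

```shell
# Feed the sample finalish.txt contents through the same sed program.
out=$(printf '201408\n201409\n201409\n201409\n201409\n' |
  sed '$!N; /^\(.*\)\n\1$/!P; D')
printf '%s\n' "$out"
```

The result is the two lines shown for final.txt: 201408 followed by a single 201409.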
If you are inheriting a lot of sed code, I would strongly recommend the sedsed utility, which is written in python and will help you understand convoluted and cryptic sed that can often become a maintenance nightmare.
Here is a sample run from the sedsed utility. I haven't shown all iterations as it is pretty verbose, but you get the picture. I have added a few comments on what the output really means. Also notice I am using single quotes since I am on Mac (BSD Unix) and not Windows:
$ sedsed.py -d '$!N; /^\(.*\)\n\1$/!P; D' file
PATT:201408$ # This shows your current pattern space
HOLD:$ # This shows your current hold buffer
COMM:$ !N # This shows the command that is going to run
PATT:201408$ # This shows the pattern space after the command has run
201409$
HOLD:$ # This shows the hold buffer after the command has run
COMM:/^\(.*\)\n\1$/ !P # This shows the command being run
201408 # Anything without a <TAG:> is what gets printed to STDOUT
PATT:201408$
201409$
HOLD:$
COMM:D
PATT:201409$
HOLD:$
...
...
...
COMM:$ !N
PATT:201409$
HOLD:$
COMM:/^\(.*\)\n\1$/ !P
201409
PATT:201409$
HOLD:$
COMM:D
I would also suggest that once you get the idea of what your sed commands were written for, you port them to a friendlier scripting language like awk, perl or python.
This will not help you understand the sed, but here is an awk command that prints just the unique lines. Note that it drops every repeated line, not only consecutive duplicates.
awk '!seen[$0]++' finalish.txt
201408
201409

Does deleting sed pattern space with 'd' erase hold space as well?

Can someone please explain why this is happening?
This is expected:
$ echo -e "foo\nbar" | sed -n 'h; x; p'
foo
bar
I put every line in the hold space, then swap hold space and pattern space, then print the pattern space, so every line is printed. Now, why is the following different?
$ echo -e "foo\nbar" | sed -n 'h; d; x; p'
I didn't expect a difference, because I delete the pattern space before swapping, so the stored line should be put back into the pattern space anyway. It's the hold space that should be empty after x;, right? I delete the pattern space, then swap. Where does the line I've saved go?
When you use d, the pattern space is cleared, the next line is read, and processing starts over from the beginning of the script. Thus, you never actually reach the x and p steps, instead just copying to the hold space and deleting.
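To answer the question in the title directly: d leaves the hold space alone. A small sketch that shows this restricts h and d to the first line, so that x and p are actually reached on the second line:

```shell
# Line 1: copy foo into the hold space, then d ends the cycle early.
# Line 2: x swaps bar for the held foo, and p prints it.
out=$(printf 'foo\nbar\n' | sed -n '1{h;d;}; x; p')
printf '%s\n' "$out"
```

This prints foo while processing the second line, so the text saved before d was still in the hold space.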
I guess it's related to the following line in man sed:
d Delete pattern space. Start next cycle.
The following works as expected:
$ echo -e "foo\nbar" | sed -n 'h; s/.*//; g; p'
foo
bar
Sorry for bothering you guys.

cshell: running cat on a large text file inside backticks gives 'word too long'

I have a file that has fairly long lines. The longest line has length 4609:
% perl -nle 'print length' ~/very_large_file | sort -nu | tail -1
4609
Now, when I just run cat ~/very_large_file it runs fine. But when I put inside backticks, it gives a 'word too long' error
% foreach line (`cat ~/very_large_file`)
Word too long.
% set x = `cat ~/very_large_file`
Word too long.
Is there an alternative to using backticks in csh to process each line of such a file?
Update: My problem was solved by using a different language, but I still couldn't work out why csh was failing. I just came across this page that describes how to find ARG_MAX. In particular, the getconf command is useful. Of course, I am still not sure whether this limit is the root cause, or whether the limit applies to languages other than csh.
I don't mean to beat a dead horse, but if you're scripting, do consider moving to bash, zsh or even the Korn shell. csh has disadvantages.
What you can try without abandoning csh completely:
Move to tcsh if you're on regular old (very old) csh.
Recompile tcsh with a longer word length (the default is 1000 bytes, I think) or with dynamic allocation.
If possible move the line processing to a secondary script or program and write that loop like this:
cat ~/very_large_file | xargs secondary_script
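If moving the loop to a Bourne-style shell is an option, a rough sketch of reading the file line by line without backticks could look like the following. The temporary file is a made-up stand-in for ~/very_large_file, and the loop body just measures line lengths where the real processing would go:

```shell
# Hypothetical stand-in for ~/very_large_file: one 5000-character line and one short line.
tmpfile=$(mktemp)
awk 'BEGIN { for (i = 0; i < 5000; i++) printf "x"; printf "\nshort line\n" }' > "$tmpfile"

count=0
longest=0
# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
  count=$((count + 1))
  if [ "${#line}" -gt "$longest" ]; then longest=${#line}; fi
done < "$tmpfile"

echo "$count lines, longest is $longest characters"
rm -f "$tmpfile"
```

Because each line arrives through read rather than through command substitution, the line is never split into words, so the length of a single line is not a problem here.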

What does the 'N' command do in sed?

It looks like the 'N' command works on every other line:
$ cat in.txt
a
b
c
d
$ sed '=;N' in.txt
1
a
b
3
c
d
Maybe that is natural, because the 'N' command joins the next line and advances the current line number. But (I saw this here):
$ sed 'N;$!P;$!D;$d' thegeekstuff.txt
The above example deletes the last two lines of a file. This works not only for files with an even number of lines but also for files with an odd number of lines. In this example the 'N' command runs on every line. What's the difference?
And could you tell me why I cannot see the last line when I run sed like this:
# sed N odd-lined-file.txt
Excerpt from info sed:
`sed' operates by performing the following cycle on each lines of
input: first, `sed' reads one line from the input stream, removes any
trailing newline, and places it in the pattern space. Then commands
are executed; each command can have an address associated to it:
addresses are a kind of condition code, and a command is only executed
if the condition is verified before the command is to be executed.
...
When the end of the script is reached, unless the `-n' option is in
use, the contents of pattern space are printed out to the output
stream,
...
Unless special commands (like 'D') are used, the pattern space is
deleted between two cycles
...
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then `sed'
exits without processing any more commands.
...
`D'
Delete text in the pattern space up to the first newline. If any
text is left, restart cycle with the resultant pattern space
(without reading a new line of input), otherwise start a normal
new cycle.
This should pretty much resolve your query. But still I will try to explain your three different cases:
CASE 1:
1. sed reads a line from input. [Now there is 1 line in pattern space.]
2. = prints the current line number.
3. N reads the next line into pattern space. [Now there are 2 lines in pattern space.]
   If there is no next line to read then sed exits here. [i.e. with an odd number of lines, sed exits here, and hence the last line is swallowed without being printed. Current GNU sed is an exception: it prints the pattern space before exiting unless POSIXLY_CORRECT is set.]
4. sed prints the pattern space and clears it. [Pattern space is empty.]
5. If EOF is reached, sed exits here. Otherwise it restarts the complete cycle from step 1. [i.e. with an even number of lines, sed exits here.]
Summary: In this case sed reads 2 lines and prints 2 lines at a time. The last line is swallowed if there is an odd number of lines (see step 3).
CASE 2:
1. sed reads a line from input. [Now there is 1 line in pattern space.]
2. N reads the next line into pattern space. [Now there are 2 lines in pattern space.]
   If it fails, sed exits here. This occurs only if there is just 1 line of input.
3. If it's not the last line ($!), print the first line (P) from pattern space. [The first line from pattern space is printed. But there are still 2 lines in pattern space.]
4. If it's not the last line ($!), delete the first line (D) from pattern space [Now there is only 1 line (the second one) in the pattern space.] and restart the command cycle from step 2. This restart happens because of the command D (see the excerpt above).
5. If it's the last line ($), then delete (d) the complete pattern space. [i.e. EOF has been reached.] [Before beginning this step there were 2 lines in the pattern space, which are now cleaned up by d; at the end of this step, the pattern space is empty.]
6. sed automatically stops at EOF.
Summary: In this case:
sed reads 2 lines first.
If there is a next line available to read, print the first line and read the next line.
Otherwise, delete both lines from the pattern space. This way it always deletes the last 2 lines.
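A quick way to check CASE 2 is to run the command on a short input with an odd number of lines. The five letters below are made up for illustration:

```shell
# Hypothetical five-line input; the last two lines (d and e) should be dropped.
out=$(printf 'a\nb\nc\nd\ne\n' | sed 'N;$!P;$!D;$d')
printf '%s\n' "$out"
```

Only a, b and c are printed. The same command on a four-line input prints a and b, so the parity of the line count does not matter.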
CASE 3:
It's the same as CASE 1, just without step 2.
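A common guard against the swallowed last line in CASE 3 is to address N with $! so that it is skipped on the last line. A small sketch, assuming a three-line input made up for illustration:

```shell
# $!N appends the next line only when the current line is not the last one,
# so N is never issued at the end of the input.
out=$(printf 'a\nb\nc\n' | sed '$!N')
printf '%s\n' "$out"
```

All three lines are printed, because on the last line sed skips N and falls through to the normal end-of-script print.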