How many instances did a command affect? - sed

Is there a way, when using sed from the CLI, to return how many lines were affected, or better yet, how many instances were affected by a command that might have multiple effects per line when the global flag is used? For me, that would essentially mean how many substitutions were made.
I suppose one could output to a new file and then run a diff on the two files afterward, but my need to know how many instances a command affects is not great enough to justify that. I just wondered whether there might be a feature native to sed that can be employed.

As far as I know, sed has no native feature for manipulating variables (e.g. incrementing an internal counter). That is one of the features awk brought that sed lacks. So my advice would be to switch to awk, where you can easily use a script such as:
BEGIN { counter = 0 }
/mypattern/ { do-whatever-you-want; counter++ }
END { print counter }
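If what you really want is the number of substitutions rather than a do-whatever placeholder, here is a minimal sketch (old, new and file are placeholders; gsub returns how many replacements it made on the current line, and /dev/stderr, which gawk, mawk and BSD awk all understand, keeps the count out of the main output):
awk '{ count += gsub(/old/, "new"); print } END { print count+0, "substitutions" > "/dev/stderr" }' file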

You ask for a sed solution. Here is a pure sed approach that does some of what you want:
sed 's/old/new/;t change;b;:change w changes'
After executing this, the changed lines, if any, are written to the file changes.
How it works:
s/old/new/;
Replace this with whatever substitution you want to do.
t change;
This tells sed to jump to the label change if the preceding s command made any changes.
b;
If the preceding jump did not happen, then this b command is executed which ends the processing of this line.
:change w changes
This tells sed to write the current line, as changed by your s command, to the file changes.
The Next Step
The next step here would be to count the changes. To this end, sed can do arithmetic but it is not for the faint of heart.
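If counting changed lines (rather than individual substitutions) is good enough, a plain follow-up count of the changes file written above avoids sed arithmetic entirely:
wc -l < changes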
OSX
As I recall, the sed shipped with Mac OS X does not support chaining these commands together with semicolons (labels, branch targets, and w filenames run to the end of the expression). Instead, try:
sed -e 's/old/new/' -e 't change' -e b -e ':change w changes'

Related

What's the correct usage of sed with parallel --jobs option?

parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a space-delimited file where the first column is the pattern and the second column is the replacement.
The problem is that after I ran the command, not all patterns were replaced in file. When I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it will work as expected (but much slower).
Am I missing a necessary parameter in my command?
Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU-bound, but this one is I/O-bound; you are simply creating congestion by having several processes fight over read access to the disk, and in this case also over write access to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
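One quick way to apply that length-sort hack to the input file from the question (input.sorted is a made-up name for the reordered copy):
awk '{ print length($1), $0 }' input | sort -rn | cut -d' ' -f2- > input.sorted
Then generate the sed script from input.sorted instead of input, so the longest source strings are substituted first.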
Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or you need to set $TMPDIR to a directory that does.
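For example, with /scratch standing in for any directory that has enough room (GNU parallel honors $TMPDIR for its temporary files):
TMPDIR=/scratch parallel --pipepart -a file -k sed -f bigsed > replaced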

The perl -pe command

So I've done some research about the perl -pe command and I know that it takes records from a file and produces output from them, for example in the form of another file. Now I'm a bit confused as to how this command line works, since it's slightly modified and I can't really figure out what exactly the role of perl -pe is in it. Here's the command:
cd /usr/kplushome/entities/Standalone/config/webaccess/WebaccessServer/etc
(PATH=/usr/ucb:$PATH; ./checkall.sh;) | perl -pe "s,^, ,g;"
Any idea how it works here?
What's even more confusing in the above statement is this part: "s,^, ,g;"
Any help would be much appreciated. Let me know if you guys need more info. Thank you!
It simply takes an expression given by the -e flag (in this case, s,^, ,g) and performs it on every line of the input, printing the modified line (i.e. the result of the expression) to the output.
The expression itself is something called a regular expression (or "regexp" or "regex") and is a field of learning in and of itself. Quick googles for "regular expression tutorial" and "getting started with regular expressions" turn up tons of results, so that might be a good place to start.
This expression, s,^, ,g, adds ten spaces to the start of the line, and as I said earlier, perl -p applies it to every line.
"s,^, ,g;"
s is used for substitution. The syntax is s/somestring/replacement/.
In your command, the comma is the delimiter instead of /.
g means work globally, i.e. replace all occurrences.
For example:
perl -p -i -e "s/oldstring/newstring/g" file.txt;
In file.txt, every occurrence of oldstring will be replaced with newstring.
i is for in-place file editing.
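For instance, a non-slash delimiter is handy when the pattern itself contains slashes; paths.txt and the two paths here are made up:
perl -pe 's,/usr/local/bin,/opt/bin,g' paths.txt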
See these docs for more information:
perlre
perlretut
perlop

Sed to replace certain number of occurrences

I have the sed replacement script below and it works on the first occurrence in every line, but I'm trying to make it work on the first two occurrences per line instead of just one (/1) or the whole line (/g):
sed -r '2,$s/(^ *|, *)([a-z])/\1\U\2/1'
Is there any way to do that either by combining sed commands or creating a script?
The best I can offer is
sed -r '2,$ { s/(^|,) *[a-z]/\U&/; s//\U&/; }'
The \U& trick uses the fact that the upper case version of a space is still a space; this is to make the repetition shorter. Because captures are no longer used, the regex can be simplified a little.
In the second s command, the // is a stand-in for the most recently attempted regex, so the first one is essentially executed a second time (this time matching what was originally the second appearance).
Since /1 doesn't actually do anything (replacing the first occurrence is default), I took the liberty of removing it.
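As a quick sanity check on some invented data (GNU sed, since \U is a GNU extension):
printf 'name, age, city\nfoo, bar, baz\n' | sed -r '2,$ { s/(^|,) *[a-z]/\U&/; s//\U&/; }'
This should leave the first line untouched and print Foo, Bar, baz for the second line, capitalizing only the first two occurrences.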

Make sed not buffer by lines

I'm not trying to prevent sed from block-buffering! I am looking to get it to not even line-buffer.
I am not sure if this is even possible at all.
Basically there is a big difference between the behavior of sed and that of cat when interacting with them from a raw pseudo-terminal: cat will immediately spit back the inserted characters when it receives them over STDIN, while sed even in raw mode will not.
As a thought experiment: given a simple sed command such as s/abc/zzz/g, sending sed an input stream like 123ab means that, at best, sed can emit the characters 123 on standard output, because it does not yet know whether a c will arrive and turn the result into 123zzz, whereas any other character would let it print exactly what came in (allowing it to "catch up", if you will). So in a way it's obvious why cat responds immediately; it can afford to.
So of course that's how it would work in an ideal world where sed's authors actually cared about this kind of use case.
I suspect that that is not the case. In reality, through my not too terribly exhaustive methods, I see that sed will line buffer no matter what (which allows it to always be able to figure out whether to print the 3 z's or not), unless you tell it that you care about matching your regexes past/over newlines, in which case it will just buffer the whole damn thing before providing any output.
My ideal solution is to find a sed that will spit out all the text that it has already finished parsing, without waiting till the end of line to do so. In my little example above, it would instantly spit back the characters 1, 2, and 3, and while a and b are being entered (typed), it says nothing, till either a c is seen (prints zzz), or any other character X is seen, in which case abX is printed, or in the case of EOF ab is printed.
Am I SOL? Should I just incrementally implement my Perl code with the features I want, or is there still some chance that this sort of magically delicious functionality can be got through some kind of configuration?
See another question of mine for more details on why I want this.
So, one potential workaround for this is to manually establish groups of input to "split" across calls to sed (or, in my case, since I'm already dealing with a Perl script, Perl's regex replacement operators) so that I can do the flushing manually, in a sense. But this cannot achieve the same level of responsiveness, because it would require me to think through the expression and describe the points at which the "buffering" should occur, rather than having a regex parser do it automatically.
There is a tool that matches an input stream against multiple regular expressions in parallel and acts as soon as it decides on a match. It's not sed. It's lex. Or the GNU version, flex.
To make this demonstration work, I had to define a YY_INPUT macro, because flex was line-buffering input by default. Even with no buffering at the stdio level, and even in "interactive" mode, there is an assumption that you don't want to process less than a line at a time.
So this is probably not portable to other versions of lex.
%{
#include <stdio.h>

#define YY_INPUT(buf,result,max_size) \
    { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}
%%
abc    fputs("zzz", stdout); fflush(stdout);
.      fputs(yytext, stdout); fflush(stdout);
%%
int main(void)
{
    setbuf(stdin, 0);
    yylex();
}
Usage: put that program into a file called abczzz.l and run
flex --always-interactive -o abczzz.c abczzz.l
cc abczzz.c -ll -o abczzz
for ch in a b c 1 2 3 ; do echo -n $ch ; sleep 1 ; done | ./abczzz ; echo
You can actually write entire programs in sed.
Here is a way to slurp the whole file into the editing buffer. I added the -n to suppress printing and the $p so it only prints the buffer at the end, after I swap the hold space I have been building up with the current buffer I am editing.
sed -n 'H;$x;$p' FILENAME
You can conditionally build up the hold space based on patterns you encounter:
'/pattern/{H}'
You can conditionally print the buffer as well
'/pattern/{p}'
You can even nest these conditional blocks, if you feel saucy.
You could use a combination of `g' (to copy the hold space to the pattern space, thereby overwriting it) and then s/(.).*/\1/ and such to get at individual characters.
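For example, a small sketch (GNU sed; the input is invented) that accumulates every line in the hold space and then prints just the first character of the accumulated text:
printf 'alpha\nbeta\n' | sed -n 'H; $ { x; s/\n//; s/\(.\).*/\1/; p }'
This should print a single a: the leading newline that H adds is stripped first, then everything after the first remaining character is deleted.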
I hope this was at least informative. I would advise you to write a tool in a different language.

How to use sed to return something from first line which matches and quit early?

I saw from
How to use sed to replace only the first occurrence in a file?
how to do most of what I want:
sed -n '0,/.*\(something\).*/s//\1/p'
This finds the first match of something and extracts it (of course my real example is more complicated). I know sed has a 'quit' command which will make it stop early, but I don't know how to combine the 'q' with the above line to get my desired behavior.
I've tried replacing the 'p' with {p;q;} as I've seen in some examples, but that is clearly not working.
The "print and quit" trick works, if you also put the substitution command into the block (instead of only putting the print and quit there).
Try:
sed -n '/.*\(something\).*/{s//\1/p;q}'
Read this as: match the pattern given between the slashes. If this pattern is found, execute the actions specified in the block: substitute, print, and quit. Typically the match and substitution patterns are equal, but this is not required, so the s command could also contain a different RE.
Actually, this is quite similar to what ghostdog74 answered for gawk.
My specific use case was to find out, via 'git log' on a git-p4 imported tree, which Perforce branch was used for the last commit. git log (when called without -n) will log every commit that ever happened (hundreds of thousands for me).
We can't know a priori what value to give git for '-n'. After posting this, I found my solution, which was:
git log | sed -n '0,/.*\[git-p4:.*\/\/depot\/blah\/\([^\/]*\)\/.*/s//\1/p; /\[git-p4/ q'
I'd still like to know how to do this with a non '/' separator, and without having to specify the 'git-p4' part twice, once for the extraction and once for the quit. There has got to be a way to combine both on the same line...
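One untested way to get both, assuming GNU sed: an address can use a non-'/' delimiter if you prefix it with a backslash (\%...%), and an empty regex in the s command reuses the address's regex, so the git-p4 pattern only needs to be written once. Using the same depot path as above:
git log | sed -n '\%.*\[git-p4:.*//depot/blah/\([^/]*\)/.*%{s%%\1%p;q}'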
Sed has an option to specify more than one expression to execute (IIRC, it's the -e option). That way, you can specify a second expression that quits after the first matching line.
Another approach is to use sed to extract the first line (sed '1q'), then pipe that to a second sed command (what you show above).
Use gawk:
gawk '/MATCH/{
    print "do something with " $0
    exit
}' file
To "return something from first line and quit" you could also use grep:
grep --max-count 1 'something'
grep is supposed to be faster than sed and awk [1]. Even if not, this syntax is easier to remember (and to type: grep -m1 'something').
If you don't want to see the whole line but just the matching part, the --only-matching (-o) option might suffice.
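For example (the pattern and file name are placeholders):
grep -o -m 1 'something' file
This prints only the matched text from the first matching line of file and then stops reading.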