Best scripting tool to profile a log file - perl

I have extracted log files from servers based on my date and time requirement and after extraction it has hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a,b,c,d,e,f,g etc.,
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script to profile this log file in a while loop, grep for different parameters a,b,c,d,e. If they contain respective parameter then what is the value for that parameter, or TRUE or FALSE.
while read line ; do
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'>> output.txt
echo -n -e "\t" >> output.txt
echo " " >> output.txt
done < queries.csv
My problem is that under Cygwin this takes a long time (an hour or so) on a log file containing 70k-80k requests. Is there a better way to write this script so that it runs faster? I'm okay with Perl too, but the script must stay flexible enough to extract arbitrary parameters.

As @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains already by switching from
while read line; do
something >>file
something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
something
something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As also suggested by @reinerpost, not hard-coding the output file would also make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
Because your earlier script was somewhat more involved, I'm suggesting to use Awk instead. Here is a crude skeleton for doing things you seem to be wanting to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
  xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
  # [?] matches a literal question mark, as the ? in the original sed pattern did
  q = $0; sub("^.*[?]q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
  # ....
  # Demonstration of how to count something
  n = split($0, _, "&"); ampersand_count = n-1;
  # ...
  # Done: Now print
  print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big saving here is that you avoid spawning a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.

Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed in every statement, which is extremely inefficient.
You hardcode the name of the output file in your script. This is a bad practice. Your script will be much more versatile if it just writes its output to stdout. Leave it to the caller to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make it more concise and more readable (once you know the language); however, if you don't know it yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language: it makes you much more flexible, allows you to make the script even more readable (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and probably makes it a lot faster, because perl compiles scripts prior to running them. If you'd rather invest your efforts into a more popular language than Perl, try Python.
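For comparison, here is a minimal Python sketch of the same extraction, run in one process instead of several sed calls per line. It mirrors the three sed substitutions from the question; the function names profile_line and profile are made up for illustration, and the sample log layout used to try it out is an assumption, since the real log format is not shown.

```python
import re


def profile_line(line):
    """Extract the three tab-separated fields from one log line."""
    line = line.rstrip("\n")
    # Mirrors: sed 's/^.*XYZ:/ /;s/ms.*//'
    xyz_ms = re.sub(r"ms.*", "", re.sub(r"^.*XYZ:", " ", line, count=1), count=1)
    # Mirrors: sed 's/^.*XYZ:/ /;s/ABC.*//'
    xyz_abc = re.sub(r"ABC.*", "", re.sub(r"^.*XYZ:", " ", line, count=1), count=1)
    # Mirrors: sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'
    # (the ? is a literal character in sed's basic regex syntax, hence \? here)
    q = re.sub(r"AUTH_TYPE:.*", "", re.sub(r"^.*\?q=", " ", line, count=1), count=1)
    return "\t".join((xyz_ms, xyz_abc, q))


def profile(lines):
    """Yield one output row per input line; the caller decides where it goes."""
    for line in lines:
        yield profile_line(line)
```

To use it as a filter, loop over profile(sys.stdin) and print each row, then let the caller redirect standard output to a file.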

Related

What's the correct usage of sed with parallel --jobs option?

parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a file delimited by space, where the first column is pattern and the second column is replacement.
The problem is that after I ran the command, not all patterns were replaced in file. Then I ran the same command again, more patterns were replaced, but still not all.
However, if I change --jobs 100 to --jobs 1, it will work as expected (but much slower).
Is a necessary parameter missing from my command?
Sounds more like you have a race condition. If you have several sed processes writing to the file, one will win, and the other(s) will lose.
Having multiple processes process the same file is hugely suboptimal anyway; just generate a single sed script and then run it once. Or if you really want to parallelize, split the input file into smaller pieces, run the generated sed script on each in parallel, and then concatenate them back when you are done.
Parallel processing helps when your task is CPU bound, but this one is I/O bound; you are simply creating congestion by having several processes fight over the access to bytes from the disk, and then in this case also fighting over write access back to the same file.
There are many examples of how to generate a sed script; here's a quick and dirty one which will however not work on some platforms where sed -f - does not read the script from standard input.
sed 's%^\([^ ]*\) \([^ ]*\)$%s/\1/\2/g%' input |
sed -f - file >temp # or sed -f - -i file
I omitted the -i option so that you can check that this does what you want before plunging ahead and deploying it in production. The commented-out version is what you would use once you are satisfied that this really does what you want.
There is still the question of replacement precedence. If you have s/a/b/ and s/b/c/ then do you want effectively s/a/c/, or the opposite? If you have s/abc/x/ and s/abcdef/y/, should abcdef always become y, or is xdef what you expect? A common hack is to sort the replacements by length so that the longer ones always get executed before the shorter ones; then at least you know what to expect.
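The "longest pattern first" hack can be sketched in Python as follows. This is only an illustration: build_sed_script is a hypothetical helper, and it assumes the patterns and replacements contain no slashes or regex metacharacters that would need escaping.

```python
def build_sed_script(pairs):
    """Return sed 's' commands, one per line, longest pattern first.

    pairs is a list of (pattern, replacement) tuples. Python's sort is
    stable, so patterns of equal length keep their original order.
    """
    ordered = sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)
    return "\n".join(f"s/{pat}/{rep}/g" for pat, rep in ordered) + "\n"
```

With the pairs ("abc", "x") and ("abcdef", "y"), the command for abcdef is emitted first, so abcdef becomes y rather than xdef.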
Let us assume that input is big and file is huge.
You really do not want to read file more than once.
First you need to convert input into a single big sed script.
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this, so the longest source string is first.
Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
/tmp needs enough free space to hold replaced; if it does not, set $TMPDIR to a directory that does.

What is the purpose of filtering a log file using this Perl one-liner before displaying it in the terminal?

I came across this script which was not written by me, but because of an issue I need to know what it does.
What is the purpose of filtering the log file using this Perl one-liner?
cat log.txt | perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g'
The log.txt file contains the output of a series of commands. I do not understand what is being filtered here, and why it might be useful.
It looks like the code is meant to remove ANSI escape codes from the input, i.e. codes that set colors, the window title, and so on. Since some of these codes might cause harm, it might be a security measure in case an attacker managed to get such escape codes into the log file. Since a log file usually does not contain any such escape codes, this would also explain why you don't see any effect of this statement on normal log files.
For more information about this kind of attack see A Blast From the Past: Executing Code in Terminal Emulators via Escape Sequences.
BTW, while your question looks bad at first glance, it actually is not. But you might improve your questions by at least formatting them properly. Otherwise you risk the question getting down-voted fast.
First, the command line suffers from a useless use of cat. perl is fully capable of reading from a file name on the command line.
So,
$ perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g' log.txt
would have done the same thing, but avoided spawning an extra process.
Now, -e is followed by a script for perl to execute. In this case, we have a single global substitution.
\e in a Perl regex pattern corresponds to the escape character, \x1b.
The pattern following \e looks like the author wants to match ANSI escape sequences.
The -p option essentially wraps the script specified with -e in a while loop, so the s/// is executed for each line of the input.
The pattern probably does the job for this simple purpose, but one might benefit from using Regexp::Common::ANSIescape as in:
$ perl -MRegexp::Common::ANSIescape=ANSIescape,no_defaults -pe 's/$RE{ANSIescape}/ /g' log.txt
Of course, if one uses a script like this very often, one might want to either use an alias, or even write a very short script that does this, as in:
#!/usr/bin/env perl
use strict;
use Regexp::Common 'ANSIescape', 'no_defaults';
while (<>) {
s/$RE{ANSIescape}/ /g;
print;
}
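To see what the one-liner's pattern actually matches, here is the same regex transcribed into Python as a rough sketch. The function name strip_ansi is made up for illustration; \x1b stands in for Perl's \e and \x07 for \a.

```python
import re

# ESC followed by one of:
#   a single character other than [ or ]     (short two-character escapes)
#   [ ... up to the first ASCII letter       (CSI sequences such as colors)
#   ] ... up to the first BEL character      (OSC sequences such as window titles)
ANSI_PATTERN = re.compile(r"\x1b(?:[^\[\]]|\[.*?[a-zA-Z]|\].*?\x07)")


def strip_ansi(text):
    """Replace each escape sequence with a single space, as the one-liner does."""
    return ANSI_PATTERN.sub(" ", text)
```

For example, strip_ansi("\x1b[31mred\x1b[0m") returns " red ": each color sequence collapses to one space.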

Make sed not buffer by lines

I'm not trying to prevent sed from block-buffering! I am looking to get it to not even line-buffer.
I am not sure if this is even possible at all.
Basically there is a big difference between the behavior of sed and that of cat when interacting with them from a raw pseudo-terminal: cat will immediately spit back the inserted characters when it receives them over STDIN, while sed even in raw mode will not.
A thought experiment could be carried out: given a simple sed command such as s/abc/zzz/g, sending a stream of input to sed like 123ab means that sed at best can provide over standard output the characters 123, because it does not yet know if a c will arrive and cause the result string to be 123zzz, while any other character would have it print exactly what came in (allowing it to "catch up", if you will). So in a way it's obvious why cat does respond immediately; it can afford to.
So of course that's how it would work in an ideal world where sed's authors actually cared about this kind of a use case.
I suspect that that is not the case. In reality, through my not too terribly exhaustive methods, I see that sed will line buffer no matter what (which allows it to always be able to figure out whether to print the 3 z's or not), unless you tell it that you care about matching your regexes past/over newlines, in which case it will just buffer the whole damn thing before providing any output.
My ideal solution is to find a sed that will spit out all the text that it has already finished parsing, without waiting till the end of line to do so. In my little example above, it would instantly spit back the characters 1, 2, and 3, and while a and b are being entered (typed), it says nothing, till either a c is seen (prints zzz), or any other character X is seen, in which case abX is printed, or in the case of EOF ab is printed.
Am I SOL? Should I just incrementally implement my Perl code with the features I want, or is there still some chance that this sort of magically delicious functionality can be got through some kind of configuration?
See another question of mine for more details on why I want this.
So, one potential workaround is to manually establish groups of input to "split" across calls to sed (or in my case, since I'm already dealing with a Perl script, Perl's regex replacement operators) so that I can do the flushing by hand. But this cannot achieve the same level of responsiveness, because it would require me to think through the expression to describe the points at which the "buffering" is to occur, rather than having a regex parser do it automatically.
There is a tool that matches an input stream against multiple regular expressions in parallel and acts as soon as it decides on a match. It's not sed. It's lex. Or the GNU version, flex.
To make this demonstration work, I had to define a YY_INPUT macro, because flex was line-buffering input by default. Even with no buffering at the stdio level, and even in "interactive" mode, there is an assumption that you don't want to process less than a line at a time.
So this is probably not portable to other versions of lex.
%{
#include <stdio.h>
#define YY_INPUT(buf,result,max_size) \
{ \
int c = getchar(); \
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
}
%}
%%
abc fputs("zzz", stdout); fflush(stdout);
. fputs(yytext, stdout); fflush(stdout);
%%
int main(void)
{
setbuf(stdin, 0);
yylex();
}
Usage: put that program into a file called abczzz.l and run
flex --always-interactive -o abczzz.c abczzz.l
cc abczzz.c -ll -o abczzz
for ch in a b c 1 2 3 ; do echo -n $ch ; sleep 1 ; done | ./abczzz ; echo
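For a single literal pattern, the hold-back behaviour described in the question can also be sketched in a few lines of Python. This is not a general regex engine, only an illustration for a fixed string; stream_replace is a hypothetical name, and the defaults abc and zzz are taken from the example in the question.

```python
def stream_replace(chars, pattern="abc", replacement="zzz"):
    """Yield output as early as possible while replacing a literal pattern.

    Only the longest tail of the input that could still grow into the
    pattern is held back; everything else is released immediately.
    """
    held = ""
    for ch in chars:
        held += ch
        if held == pattern:
            yield replacement
            held = ""
            continue
        # Release leading characters until the rest is a prefix of the pattern
        while held and not pattern.startswith(held):
            yield held[0]
            held = held[1:]
    if held:
        # End of input: whatever is still held can no longer match
        yield held
```

Feeding it the characters of 123ab yields 1, 2 and 3 immediately and ab only at end of input, which is the behaviour the question asks for.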
You can actually write entire programs in sed.
Here is a way to slurp the whole file into the editing buffer. I added the -n to suppress printing and the $p so it only prints the buffer at the end, after I swap the hold space I have been building up with the buffer I am currently editing.
sed -n 'H;$x;$p' FILENAME
You can conditionally build up the hold space based on patterns you encounter:
'/pattern/{H}'
You can conditionally print the buffer as well
'/pattern/{p}'
You can even nest these conditional blocks, if you feel saucy.
You could use a combination of `g' (to copy the hold space to the pattern space, thereby overwriting it) and then s/(.).*/\1/ and such to get at individual characters.
I hope this was at least informative. I would advise you to write a tool in a different language.

I want to print a text file in columns

I have a text file which looks something like this:
jdkjf
kjsdh
jksfs
lksfj
gkfdj
gdfjg
lkjsd
hsfda
gadfl
dfgad
[very many lines, that is]
but would rather like it to look like
jdkjf kjsdh
jksfs lksfj
gkfdj gdfjg
lkjsd hsfda
gadfl dfgad
[and so on]
so I can print the text file on a smaller number of pages.
Of course, this is not a difficult problem, but I'm wondering if there is some excellent tool out there for solving problems like these.
EDIT: I'm not looking for a way to remove every other newline from a text file, but rather a tool which interprets text as "pictures" and then lays these out on the page nicely (by writing the appropriate whitespace symbols).
You can use this python code.
tables = int(input("Enter number of tables "))
matrix = []
with open("test.txt") as file:
    for line in file:
        matrix.append(line.rstrip("\n"))
        if len(matrix) == tables:
            print(matrix)
            matrix = []
# Print any leftover lines that did not fill a complete row
if matrix:
    print(matrix)
(Since you don't name your operating system, I'll simply assume Linux, Mac OS X or some other Unix...)
Your example looks like it can also be described by the expression "joining 2 lines together".
This can be achieved in a shell (with the help of xargs and awk) -- but only for an input file that is structured like your example (the result always puts 2 words on a line, irrespective of how many words each one contains):
cat file.txt | xargs -n 2 | awk '{ print $1" "$2 }'
This can also be achieved with awk alone (this time it really joins 2 full lines, irrespective of how many words each one contains):
awk '{printf "%s ", $0; getline; print $0}' file.txt
Or use sed --
sed 'N;s#\n# #' < file.txt
Also, xargs could do it:
xargs -L 2 < file.txt
I'm sure other people could come up with dozens of other, quite different methods and commandline combinations...
Caveats: You'll have to test for files with an odd number of lines explicitly. The last input line may not be processed correctly in case of odd number of lines.
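A small Python sketch that handles the odd-line case and pads each cell to a common width might look like this. The name columnize and its parameters are made up for illustration.

```python
def columnize(lines, columns=2, gap=1):
    """Lay out lines in rows of `columns` cells, padded to a common width."""
    width = max((len(s) for s in lines), default=0)
    rows = []
    for start in range(0, len(lines), columns):
        # The last slice may be shorter when the line count is not a multiple
        cells = lines[start:start + columns]
        row = (" " * gap).join(cell.ljust(width) for cell in cells)
        rows.append(row.rstrip())
    return rows
```

With three input lines and two columns, the last row simply holds the single leftover line instead of being dropped or duplicated.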

cshell: running cat on a large text file inside backticks gives 'word too long'

I have a file that has fairly long lines. The longest line has length 4609:
% perl -nle 'print length' ~/very_large_file | sort -nu | tail -1
4609
Now, when I just run cat ~/very_large_file it runs fine. But when I put it inside backticks, it gives a 'word too long' error:
% foreach line (`cat ~/very_large_file`)
Word too long.
% set x = `cat ~/very_large_file`
Word too long.
Is there an alternative to using backticks in csh to process each line of such a file?
Update: My problem was solved by using a different language, but I still couldn't find the reason csh failed. I just came across this page that describes how to find ARG_MAX. In particular, the getconf command is useful. Of course, I am still not sure whether this limit is the root cause, or whether the limit applies to languages other than csh.
I don't mean to beat an old horse, but if you're scripting do consider moving to bash, zsh or even Korn. csh has disadvantages.
What you can try without abandoning csh completely:
Move to tcsh if you're with regular old (very old) csh.
Recompile tcsh with a longer word length (the default is 1000 bytes, I think) or with dynamic allocation.
If possible move the line processing to a secondary script or program and write that loop like this:
cat ~/very_large_file | xargs secondary_script
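If the secondary program is written in Python, it can read the file line by line itself, so the shell never has to expand the file contents into words. As a minimal sketch, this hypothetical helper reproduces the line-length check from the question:

```python
def longest_line_length(lines):
    """Return the length of the longest line, ignoring the trailing newline."""
    return max((len(line.rstrip("\n")) for line in lines), default=0)
```

Passing an open file object, for example from open(path), iterates over the file one line at a time without loading it all into memory.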