cshell: running cat on a large text file inside backticks gives 'word too long' - perl

I have a file that has fairly long lines. The longest line has length 4609:
% perl -nle 'print length' ~/very_large_file | sort -nu | tail -1
4609
Now, when I just run cat ~/very_large_file, it runs fine. But when I put it inside backticks, I get a 'word too long' error:
% foreach line (`cat ~/very_large_file`)
Word too long.
% set x = `cat ~/very_large_file`
Word too long.
Is there an alternative to using backticks in csh to process each line of such a file?
Update: My problem was solved by switching to a different language, but I still couldn't pin down the reason csh fails. I just came across this page that describes how to find ARG_MAX; in particular, the getconf command is useful. Of course, I am still not sure whether this limit is the root cause, or whether the limit applies to languages other than csh.
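For example, either of these should print the limit in bytes (the exact value varies by system):
% getconf ARG_MAX
% perl -MPOSIX -le 'print sysconf(_SC_ARG_MAX)'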

I don't mean to beat a dead horse, but if you're scripting, do consider moving to bash, zsh, or even the Korn shell; csh has well-known disadvantages.
What you can try without abandoning csh completely:
Move to tcsh if you're on the regular old (very old) csh.
Recompile tcsh with a longer word length (the default is 1000 bytes, I think) or with dynamic allocation.
If possible, move the line processing to a secondary script or program and write the loop like this:
cat ~/very_large_file | xargs secondary_script
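Since the question is tagged perl, here is a minimal sketch of doing the per-line work in Perl instead, which sidesteps csh's word-length limit entirely (process_line is a hypothetical stand-in for whatever you need to do with each line):
#!/usr/bin/env perl
use strict;
use warnings;

# Read the file one line at a time; no line ever has to fit into a shell word.
open my $fh, '<', "$ENV{HOME}/very_large_file"
    or die "Cannot open very_large_file: $!";
while (my $line = <$fh>) {
    chomp $line;
    process_line($line);    # hypothetical per-line handler
}
close $fh;

sub process_line {
    my ($line) = @_;
    print length($line), "\n";    # e.g. report the length, as in the question
}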

Related

Why does this command for qsub to submit multiple pbs scripts work in the bash shell but not fish?

I have a bunch of .pbs files in one directory.
I can qsub the files with no problem using this command in the bash shell, but in the fish shell I continuously hit enter and it just creates a new input line. Any ideas why it doesn't work in fish?
for file in *.pbs; do qsub $file; done
Fish's syntax for loops and other block constructs is different.
In this case it's
for file in *.pbs
qsub $file
end
or, on one line:
for file in *.pbs; qsub $file; end
Other looping constructs look similar - no introductory "do" and they end with "end".
Other differences here: this behaves like bash with the nullglob option, so if no file matches, the for loop simply won't be executed; there's no need to guard against a literal *.pbs being passed.
Also, $file doesn't need to be quoted, because it is set as one element and so will be passed as one element; no word splitting takes place.
Fish uses a different script syntax from bash and other shells. To read up on it: Fish for bash users is the quick starting point, but the rest of the documentation is worth reading (in my, admittedly biased, opinion).

Best scripting tool to profile a log file

I have extracted log files from servers based on my date and time requirements, and after extraction each file has hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a, b, c, d, e, f, g, etc.
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script to profile this log file in a while loop, grepping for the different parameters a, b, c, d, e: if a request contains a given parameter, the script reports the value for that parameter, or TRUE or FALSE.
while read line ; do
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'>> output.txt
echo -n -e "\t" >> output.txt
echo " " >> output.txt
done < queries.csv
My question is: my Cygwin run takes a lot of time (an hour or so) to work through a log file containing 70k-80k requests. Is there a better way to write this script so that it executes as fast as possible? I'm okay with Perl too, but my concern is that the script remain flexible enough to extract the parameters.
As @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains already by switching from
while read line; do
something >>file
something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
something
something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As @reinerpost also suggested, not hard-coding the output file would make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
Because your earlier script was somewhat more involved, I'm suggesting to use Awk instead. Here is a crude skeleton for doing things you seem to be wanting to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
q = $0; sub("^.*[?]q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
# ....
# Demonstration of how to count something
n = split($0, _, "&"); ampersand_count = n-1;
# ...
# Done: Now print
print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big savings here is avoiding to spawn a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.
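For reference, here is a rough Perl sketch of the same skeleton, assuming the same XYZ/ABC/AUTH_TYPE markers as the original sed commands (adjust the field extraction to your real data):
#!/usr/bin/env perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;

    # Mirror the sed substitutions: strip up to the marker, then strip the tail.
    (my $xyz_ms  = $line) =~ s/^.*XYZ:/ /;   $xyz_ms  =~ s/ms.*$//;
    (my $xyz_abc = $line) =~ s/^.*XYZ:/ /;   $xyz_abc =~ s/ABC.*$//;
    (my $q       = $line) =~ s/^.*\?q=/ /;   $q       =~ s/AUTH_TYPE:.*$//;

    # Example of counting something, as in the awk version.
    my $ampersand_count = () = $line =~ /&/g;

    # Collect everything and print once per input line, tab-delimited.
    print join("\t", $xyz_ms, $xyz_abc, $q, $ampersand_count), "\n";
}
Invoke it as, say, perl profile.pl queries.csv > output.txt (profile.pl being a hypothetical file name), so the output destination stays in the caller's hands.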
Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed in every statement, which is extremely inefficient.
You hardcode the name of the output file in your script. This is a bad practice. Your script will be much more versatile if it just writes its output to stdout. Leave it to the call to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make it more concise and more readable (once you know the language); however, if you don't know it yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language: it makes you far more flexible, allows you to make the code even more readable (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and will probably make it a lot faster, because perl compiles scripts before running them. If you'd rather invest your efforts into a more popular language than Perl, try Python.

What is the purpose of filtering a log file using this Perl one-liner before displaying it in the terminal?

I came across this script which was not written by me, but because of an issue I need to know what it does.
What is the purpose of filtering the log file using this Perl one-liner?
cat log.txt | perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g'
The log.txt file contains the output of a series of commands. I do not understand what is being filtered here, and why it might be useful.
It looks like the code should remove ANSI escape codes from the input, i.e. codes that set colors, the window title, and so on. Since some of these codes might cause harm, it might be a security measure in case some kind of attack was able to include such escape codes in the log file. Since a log file usually does not contain any such escape codes, this would also explain why you don't see any effect of this statement for normal log files.
For more information about this kind of attack see A Blast From the Past: Executing Code in Terminal Emulators via Escape Sequences.
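To see the effect for yourself, you can push a string that does contain a color escape through the same substitution; the \e[31m and \e[0m sequences simply come out as spaces (a quick sanity check, not part of the original script):
$ perl -e '$_ = "\e[31mred\e[0m plain\n"; s/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g; print'
 red  plain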
BTW, while your question looks bad at first view, it actually isn't. But you might try to improve your questions by at least formatting them properly; otherwise you risk that the question gets down-voted quickly.
First, the command line suffers from a useless use of cat; perl is fully capable of reading from a file named on the command line.
So,
$ perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g' log.txt
would have done the same thing, but avoided spawning an extra process.
Now, -e is followed by a script for perl to execute. In this case, we have a single global substitution.
\e in a Perl regex pattern corresponds to the escape character, \x1b.
The pattern following \e looks like the author wants to match ANSI escape sequences.
The -p option essentially wraps the script specified with -e in a while loop, so the s/// is executed for each line of the input.
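If you want to see exactly what that wrapping looks like, B::Deparse will print the generated loop for you (the exact output differs slightly between Perl versions):
$ perl -MO=Deparse -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g'
LINE: while (defined($_ = readline ARGV)) {
    s/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)/ /g;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK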
The pattern probably does the job for this simple purpose, but one might benefit from using Regexp::Common::ANSIescape as in:
$ perl -MRegexp::Common::ANSIescape=ANSIescape,no_defaults -pe 's/$RE{ANSIescape}/ /g' log.txt
Of course, if one uses a script like this very often, one might want to either use an alias, or even write a very short script that does this, as in:
#!/usr/bin/env perl
use strict;
use Regexp::Common 'ANSIescape', 'no_defaults';
while (<>) {
s/$RE{ANSIescape}/ /g;
print;
}

How to reformat a source file to go from 2 space indentations to 3?

This question is nearly identical to this question except that I have to go to three spaces (company coding guidelines) rather than four and the accepted solution will only double the matched pattern. Here was my first attempt:
:%s/^\(\s\s\)\+/\1 /gc
But this does not work, because four spaces get replaced by three. So I think what I need is some way to get the count of how many times the \+ group matched and use that number to build the other side of the substitution, but I suspect this functionality is not available in Vim's regex (let me know if you think it might be possible).
I also tried doing the substitution manually, replacing the largest indents first and then the next smaller indent until it was all converted, but it was hard to keep track of the spaces:
:%s/^ \(\S\)/ \1/gc
I could send it through Perl, as it seems like Perl might be able to do it with its Extended Patterns. But I could not get it to work with my version of Perl. Here was my attempt at counting a's:
:%!perl -pe 'm<(?{ $cnt = 0 })(a(?{ local $cnt = $cnt + 1; }))*aaaa(?{ $res = $cnt })>x; print $res'
My last resort will be to write a Perl script to do the conversion but I was hoping for a more general solution in Vim so that I could reuse the idea to solve other issues in the future.
Let vim do it for you?
:set sw=3<CR>
gg=G
The first command sets the shiftwidth option, which is how much you indent by. The second line says: go to the top of the file (gg), and reindent (=) until the end of the file (G).
Of course, this depends on vim having a good formatter for the language you're using. Something might get messed up if not.
Regexp way... Safer, but less understandable:
:%s#^\(\s\s\)\+#\=repeat(' ',strlen(submatch(0))*3/2)#g
(I had to do some experimentation.)
Two points:
If the replacement starts with \=, it is evaluated as an expression.
You can use many things instead of /, so / is available for division.
The perl version you asked for...
From the command line (edits in-place, no backup):
bash$ perl -pi -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
(in-place, original backed up to "YOUR_FILE.bak"):
bash$ perl -pi.bak -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
From vim while editing YOUR_FILE:
:%!perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
The regex matches the beginning of the line, followed by (the captured set of) one or more "two space" groups. The substitution pattern is a perl expression (hence the 'e' modifier) which counts the number of "two space" groups that were captured and creates a string of that same number of "three space" groups. If an "extra" space was present in the original it is preserved after the substitution. So if you had three spaces before, you'll have four after, five before will turn into seven after, etc.

How do I count the number of rows in a large CSV file with Perl?

I have to use Perl in a Windows environment at work, and I need to be able to find out the number of rows that a large CSV file contains (about 1.4 GB).
Any idea how to do this with minimum waste of resources?
Thanks
PS This must be done within the Perl script and we're not allowed to install any new modules onto the system.
Do you mean lines or rows? A cell may contain line breaks which would add lines to the file, but not rows. If you are guaranteed that no cells contain new lines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.
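For reference, a minimal sketch of the FAQ-style count: it counts newline characters in fixed-size chunks, so it stays memory-cheap even for a 1.4 GB file, but it only equals the row count if no field contains an embedded newline (the file name below is hypothetical):
#!/usr/bin/env perl
use strict;
use warnings;

my $file = 'big.csv';    # hypothetical file name
open my $fh, '<:raw', $file or die "Cannot open $file: $!";

my $lines = 0;
while (sysread $fh, my $buffer, 65536) {
    # tr/// in scalar context returns the number of characters it matched,
    # so this adds the number of newlines found in each 64 KB chunk.
    $lines += ($buffer =~ tr/\n//);
}
close $fh;

print "$lines\n";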
Yes, don't use perl.
Instead, use the simple utility for counting lines: wc.exe.
It's part of a suite of Windows utilities ported from the Unix originals.
http://unxutils.sourceforge.net/
For example;
PS D:\> wc test.pl
12 26 271 test.pl
PS D:\>
Where 12 == number of lines, 26 == number of words, 271 == number of characters.
If you really have to use perl;
D:\>perl -lne "END{print $.;}" < test.pl
12
perl -lne "END { print $. }" myfile.csv
This only reads one line at a time, so it doesn't waste any memory unless each line is enormously long.
This one-liner handles newlines within rows:
It considers lines with an odd number of quotes.
It treats doubled quotes as the way of indicating a quote within a field.
It uses the awesome flip-flop operator.
perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
Consider:
wc is not going to work. It's awesome for counting lines, but not CSV rows
You should install--or fight to install--Text::CSV or some similar standard package for proper handling.
This may get you there, nonetheless.
EDIT: It slipped my mind that this was Windows:
perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"
The weird thing is that The Broken OS's shell interprets && as the OS conditional exec, and I couldn't do anything to change its mind! If I escaped it, it would just pass it through that way to perl.
Upvote for edg's answer; another option is to install Cygwin to get wc and a bunch of other handy utilities on Windows.
I was being idiotic; the simple way to do it in the script is:
open my $extract, '<', $extractFileName or die "Cannot read row count of $extractFileName: $!";
my $rowCount = 0;
while (<$extract>)
{
$rowCount++;
}
close($extract);