Why is this exclusion not working for long sentences? - text-processing

Command
perl -ne 'print unless /.[240,]/' input.txt > output.txt
The output still includes some sentences that are longer than 240 characters. Why?
Example data
Development of World Funny Society program on young people who
are working hard for the sport and social life such that they have
time to go pizzeria every now and then and do some fun, programming
and having great breakfast:| World scenario of free time programs are
often too long such that this star probably makes the program upset
(*)|Again a very long option which is not probably the reason which
makes this program upset|good shorter option which is ok but nice to
write here coffee morning messages|c last option is always good one
because you know that you can soon stop 1
Example data 2
Indications of this program depends on many things and I like much
more this than Lorem ipsum which is too generic and takes too much
time to open:|short option just in case|little longer option so good
to have it here too|shorter which is better too but how much is the
question|shortest is not the least|longer one once again but not too
long 1

You're using the wrong syntax: [] matches a character class, so [240,] matches a single character that is 2, 4, 0 or a comma, and has nothing to do with line length. To match a number of occurrences of ., use {}:
perl -ne 'print unless /.{240,}/' input.txt > output.txt
Also, as suggested by salva in the comments, the pattern can be shortened to .{240}:
perl -ne 'print unless /.{240}/' input.txt > output.txt
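You can see the difference on a short made-up test string: the original pattern matches because the comma is inside the [240,] character class, while the corrected one does not, since the line is far shorter than 240 characters.
echo 'short, line' | perl -ne 'print "character class matched\n" if /.[240,]/'
echo 'short, line' | perl -ne 'print "quantifier matched\n" if /.{240,}/'
Only the first command prints anything.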

Related

Best scripting tool to profile a log file

I have extracted log files from servers based on my date and time requirements, and after extraction each file contains hundreds of HTTP requests (URLs). Each request may or may not contain various parameters a, b, c, d, e, f, g, etc.
For example:
http:///abcd.com/blah/blah/blah%20a=10&b=20ORC
http:///abcd.com/blah/blah/blahsomeotherword%20a=30&b=40ORC%26D
http:///abcd.com/blah/blah/blahORsomeORANDworda=30%20b=40%20C%26D
http:///abcd.com/blah/blah/"blah"%20b=40ORCANDD%20G%20F
I wrote a shell script that profiles this log file in a while loop, grepping for the different parameters a, b, c, d, e. If a request contains a given parameter, the output should show its value; otherwise TRUE or FALSE.
while read line ; do
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ms.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*XYZ:/ /;s/ABC.*//' >> output.txt
echo -n -e "\t" >> output.txt
echo -n -e $line | sed 's/^.*?q=/ /;s/AUTH_TYPE:.*//'>> output.txt
echo -n -e "\t" >> output.txt
echo " " >> output.txt
done < queries.csv
My question is: Cygwin is taking a lot of time (an hour or so) to run this on a log file containing 70k-80k requests. Is there a better way to write this script so that it finishes faster? I'm okay with Perl too, but my concern is that the script should remain flexible enough to extract the parameters.
As @reinerpost already pointed out, the loop-internal redirection is probably the #1 killer issue here. You might be able to reap significant gains already by switching from
while read line; do
something >>file
something else too >>file
done <input
to instead do a single redirection after done:
while read line; do
something
something else too
done <input >file
Notice how this also simplifies the loop body, and allows you to overwrite the file when you (re)start the script, instead of separately needing to clean out any old results. As also suggested by @reinerpost, not hard-coding the output file would also make your script more general; simply print to standard output, and let the invoker decide what to do with the results. So maybe just remove the redirections altogether.
(Incidentally, you should switch to read -r unless you specifically want the shell to interpret backslashes and other slightly weird legacy behavior.)
Additionally, collecting results and doing a single print at the end would probably be a lot more efficient than the repeated unbuffered echo -n -e writes. (And again, printf would probably be preferable to echo for both portability and usability reasons.)
The current script could be reimplemented in sed quite easily. You collect portions of the input URL and write each segment to a separate field. This is easily done in sed with the following logic: Save the input to the hold space. Swap the hold space and the current pattern space, perform the substitution you want, append to the hold space, and swap back the input into the pattern space. Repeat as necessary.
Because your earlier script was somewhat more involved, I suggest using Awk instead. Here is a crude skeleton for doing the things you seem to want to do with your data.
awk '# Make output tab-delimited
BEGIN { OFS="\t" }
{ xyz_ms = $0; sub("^.*XYZ:", " ", xyz_ms); sub("ms.*$", "", xyz_ms);
  xyz_abc = $0; sub("^.*XYZ:", " ", xyz_abc); sub("ABC.*$", "", xyz_abc);
  q = $0; sub("^.*[?]q=", " ", q); sub("AUTH_TYPE:.*$", "", q);
  # ....
  # Demonstration of how to count something
  n = split($0, _, "&"); ampersand_count = n - 1;
  # ...
  # Done: now print
  print xyz_ms, xyz_abc, q, " " }' queries.csv
Notice how we collect stuff in variables and print only at the end. This is less crucial here than it would have been in your earlier shell script, though.
The big savings here is avoiding to spawn a large number of subprocesses for each input line. Awk is also better optimized for doing this sort of processing quickly.
If Perl is more convenient for you, converting the entire script to Perl should produce similar benefits, and be somewhat more compatible with the sed-centric syntax you have already. Perl is bigger and sometimes slower than Awk, but in the grand scheme of things, not by much. If you really need to optimize, do both and measure.
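If you do go the Perl route, a minimal sketch of the same per-line extraction could look like this; the XYZ:, ms, ABC, ?q= and AUTH_TYPE: markers are lifted from your sed commands, so treat the exact field boundaries as assumptions and adjust them to your real log format:
perl -ne '
    chomp;
    # text between "XYZ:" and "ms", and between "XYZ:" and "ABC"
    my ($xyz_ms)  = /XYZ:(.*?)ms/;
    my ($xyz_abc) = /XYZ:(.*?)ABC/;
    # text between a literal "?q=" and "AUTH_TYPE:"
    my ($q)       = /\?q=(.*?)AUTH_TYPE:/;
    print join("\t", $xyz_ms // "", $xyz_abc // "", $q // ""), "\n";
' queries.csv
Like the Awk skeleton, this spawns no per-line subprocesses and writes each record exactly once.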
Problems with your script:
The main problem: you append to a file in every statement. This means the file has to be opened and closed for every statement, which is extremely inefficient.
You hard-code the name of the output file in your script. This is bad practice. Your script will be much more versatile if it just writes its output to stdout; leave it to the caller to specify where to direct the output. That will also get rid of the previous problem.
bash is interpreted and not optimized for text manipulation: it is bound to be slow, and complex text filtering won't be very readable. Using awk instead will probably make the script more concise and more readable (once you know the language); however, if you don't know awk yet, I advise learning Perl instead, which is good at what awk is good at but is also a general-purpose language. It makes you far more flexible, lets the result be even more readable (those who complain about Perl being hard to read have never seen nontrivial shell scripts), and will probably make it a lot faster, because perl compiles scripts before running them. If you'd rather invest your effort in a more popular language than Perl, try Python.

Perl one-liner to extract groups of characters

I am trying to extract a group of characters with a Perl one-liner, but I have been unsuccessful:
echo "hello_95_.txt" | perl -ne 's/.*([0-9]+).*/\1/'
Returns nothing, while I would like it to return 95. How can I do this with Perl?
Update:
Note that, in contrast to the suggested duplicate, I am interested in how to do this from the command-line. Surely this looks like a subtle difference, but it's not straightforward unless you already know how to effectively use Perl one-liners.
Since people are asking, eventually I want to learn to use Perl to write powerful one-liners, but most immediately I need a one-liner to extract consecutive digits from each line in a large text file.
perl -pe's/\D*(\d+).*/$1/'
or
perl -nE'/\d+/&&say$&'
or
perl -nE'say/(\d+)/'
or
perl -ple's/\D//g'
or maybe
perl -nE'$,=" ";say/\d+/g'
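For example, with the input from the question, the first of these prints just the digits:
echo "hello_95_.txt" | perl -pe's/\D*(\d+).*/$1/'
95
The others give the same result here; they differ mainly in how they treat lines with no digits or with several digit groups.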
Well first, you need to use the -p switch rather than -n: -n only wraps your code in a loop that reads each line, while -p additionally prints $_ after each pass, which is what makes the result of the substitution visible.
And you need to amend your regular expression, as in:
echo "hello_95_.txt" | perl -pe "s/^.*?([0-9]+).*$/\1/"
which looks for the shortest (non-greedy) run of characters from the start of the line, followed by one or more digits, followed by any number of characters up to the end of the line.
Note that while '\1' is acceptable as a back-reference and is more familiar to SED/AWK users, '$1' is the more up-to-date form. So, you might wish to use:
echo "hello_95_.txt" | perl -pe "s/^.*?([0-9]+).*$/$1/"
instead.
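If you would rather keep -n, you have to print explicitly; this variant also outputs 95:
echo "hello_95_.txt" | perl -ne 's/^.*?([0-9]+).*$/$1/; print'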

Make sed not buffer by lines

I'm not trying to prevent sed from block-buffering! I am looking to get it to not even line-buffer.
I am not sure if this is even possible at all.
Basically there is a big difference between the behavior of sed and that of cat when interacting with them from a raw pseudo-terminal: cat will immediately spit back the inserted characters when it receives them over STDIN, while sed even in raw mode will not.
A thought experiment: given a simple sed command such as s/abc/zzz/g, sending sed a stream of input like 123ab means that sed can at best emit the characters 123 on standard output, because it does not yet know whether a c will arrive and make the result 123zzz, whereas any other character would let it print exactly what came in (allowing it to "catch up", if you will). So in a way it's obvious why cat can respond immediately: it can afford to.
So of course that's how it would work in an ideal world where sed's authors actually cared about this kind of a use case.
I suspect that that is not the case. In reality, through my not too terribly exhaustive methods, I see that sed will line buffer no matter what (which allows it to always be able to figure out whether to print the 3 z's or not), unless you tell it that you care about matching your regexes past/over newlines, in which case it will just buffer the whole damn thing before providing any output.
My ideal solution is to find a sed that will spit out all the text that it has already finished parsing, without waiting till the end of line to do so. In my little example above, it would instantly spit back the characters 1, 2, and 3, and while a and b are being entered (typed), it says nothing, till either a c is seen (prints zzz), or any other character X is seen, in which case abX is printed, or in the case of EOF ab is printed.
Am I SOL? Should I just incrementally implement my Perl code with the features I want, or is there still some chance that this sort of magically delicious functionality can be got through some kind of configuration?
See another question of mine for more details on why I want this.
So, one potential workaround is to manually establish groups of input to "split" across calls to sed (or, in my case, since I'm already dealing with a Perl script, across Perl's regex replacement operators) so that I can do the flushing by hand. But this cannot achieve the same level of responsiveness, because it would require me to think through the expression to describe the points at which the "buffering" is to occur, rather than having a regex parser do it for me automatically.
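Roughly, this is the kind of incremental matching loop I have in mind for the s/abc/zzz/g example above (just a sketch with the pattern hard-coded; generalizing the "could this still become a match?" test to arbitrary regexes is exactly the part I don't want to reinvent):
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;                      # make our own output unbuffered
my $pat = 'abc';
my $rep = 'zzz';
my $buf = '';                # characters that might still become a match

my $ch;
while (read(STDIN, $ch, 1)) {
    $buf .= $ch;
    if ($buf eq $pat) {
        print $rep;          # complete match: emit the replacement
        $buf = '';
    } else {
        # flush from the front until what remains is still a prefix of the pattern
        while (length $buf && index($pat, $buf) != 0) {
            print substr($buf, 0, 1, '');
        }
    }
}
print $buf;                  # at EOF, emit whatever partial match was pending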
There is a tool that matches an input stream against multiple regular expressions in parallel and acts as soon as it decides on a match. It's not sed. It's lex. Or the GNU version, flex.
To make this demonstration work, I had to define a YY_INPUT macro, because flex was line-buffering input by default. Even with no buffering at the stdio level, and even in "interactive" mode, there is an assumption that you don't want to process less than a line at a time.
So this is probably not portable to other versions of lex.
%{
#include <stdio.h>
#define YY_INPUT(buf,result,max_size) \
    { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}
%%
abc fputs("zzz", stdout); fflush(stdout);
.   fputs(yytext, stdout); fflush(stdout);
%%
int main(void)
{
    setbuf(stdin, 0);
    yylex();
}
Usage: put that program into a file called abczzz.l and run
flex --always-interactive -o abczzz.c abczzz.l
cc abczzz.c -ll -o abczzz
for ch in a b c 1 2 3 ; do echo -n $ch ; sleep 1 ; done | ./abczzz ; echo
You can actually write entire programs in sed.
Here is a way to slurp the whole file into the editing buffer. I added the -n to suppress printing and the $p so it only prints the buffer at the end, after I swap the hold space I have been building up with the pattern space I am editing.
sed -n 'H;$x;$p' FILENAME
You can conditionally build up the hold space based on patterns you encounter:
'/pattern/{H}'
You can conditionally print the buffer as well
'/pattern/{p}'
You can even nest these conditional blocks, if you feel saucy.
You could use a combination of `g' (to copy the hold space to the pattern space, thereby overwriting it) and then s/\(.\).*/\1/ and such to get at individual characters.
I hope this was at least informative. I would advise you to write a tool in a different language.

How to reformat a source file to go from 2 space indentations to 3?

This question is nearly identical to an earlier one, except that I have to go to three spaces (company coding guidelines) rather than four, and the accepted solution there only doubles the matched pattern. Here was my first attempt:
:%s/^\(\s\s\)\+/\1 /gc
But this does not work, because four spaces get replaced by three. So I think that what I need is some way to get the count of how many times the \+ pattern matched and use that number to build the other side of the substitution, but I feel this functionality is probably not available in Vim's regex (let me know if you think it might be possible).
I also tried doing the substitution manually by replacing the largest indents first and then the next smaller indent until I got it all converted but this was hard to keep track of the spaces:
:%s/^ \(\S\)/ \1/gc
I could send it through Perl, as it seems like Perl might be able to do it with its Extended Patterns. But I could not get it to work with my version of Perl. Here was my attempt at counting a's:
:%!perl -pe 'm<(?{ $cnt = 0 })(a(?{ local $cnt = $cnt + 1; }))*aaaa(?{ $res = $cnt })>x; print $res'
My last resort will be to write a Perl script to do the conversion but I was hoping for a more general solution in Vim so that I could reuse the idea to solve other issues in the future.
Let vim do it for you?
:set sw=3<CR>
gg=G
The first command sets the shiftwidth option, which is how much you indent by. The second line says: go to the top of the file (gg), and reindent (=) until the end of the file (G).
Of course, this depends on vim having a good formatter for the language you're using. Something might get messed up if not.
Regexp way... Safer, but less understandable:
:%s#^\(\s\s\)\+#\=repeat(' ',strlen(submatch(0))*3/2)#g
(I had to do some experimentation.)
Two points:
If the replacement starts with \=, it is evaluated as an expression.
You can use many things instead of /, so / is available for division.
The perl version you asked for...
From the command line (edits in-place, no backup):
bash$ perl -pi -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
(in-place, original backed up to "YOUR_FILE.bak"):
bash$ perl -pi.bak -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
From vim while editing YOUR_FILE:
:%!perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
The regex matches the beginning of the line, followed by (the captured set of) one or more "two space" groups. The substitution pattern is a perl expression (hence the 'e' modifier) which counts the number of "two space" groups that were captured and creates a string of that same number of "three space" groups. If an "extra" space was present in the original it is preserved after the substitution. So if you had three spaces before, you'll have four after, five before will turn into seven after, etc.
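A quick way to convince yourself before touching a real file (the sample input is made up, indented two spaces per level; the output comes back indented three per level):
printf 'top\n  one deep\n    two deep\n' | perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
top
   one deep
      two deep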

How do I count the number of rows in a large CSV file with Perl?

I have to use Perl in a Windows environment at work, and I need to be able to find out the number of rows that a large CSV file contains (about 1.4 GB).
Any idea how to do this with minimum waste of resources?
Thanks
PS This must be done within the Perl script and we're not allowed to install any new modules onto the system.
Do you mean lines or rows? A cell may contain line breaks which would add lines to the file, but not rows. If you are guaranteed that no cells contain new lines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.
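If newlines inside cells are not a concern, the FAQ-style line count is roughly this, run from a Cygwin or other Unix-style shell (adjust the quoting for cmd.exe); it reads one line at a time, so memory use stays flat even for a 1.4 GB file, and it needs no extra modules:
perl -e 'open my $fh, "<", shift or die $!; 1 while <$fh>; print "$.\n"' myfile.csv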
Yes, don't use perl.
Instead use the simple utility for counting lines: wc.exe.
It's part of a suite of Windows utilities ported from the Unix originals.
http://unxutils.sourceforge.net/
For example:
PS D:\> wc test.pl
12 26 271 test.pl
PS D:\>
Where 12 == number of lines, 26 == number of words, 271 == number of characters.
If you really have to use perl:
D:\>perl -lne "END{print $.;}" < test.pl
12
perl -lne "END { print $. }" myfile.csv
This only reads one line at a time, so it doesn't waste any memory unless each line is enormously long.
This one-liner handles newlines within the rows. It works by considering lines with an odd number of quotes, and by treating doubled quotes as the way of indicating a quote inside a field. It uses the awesome flip-flop operator:
perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
Consider:
wc is not going to work. It's awesome for counting lines, but not CSV rows
You should install--or fight to install--Text::CSV or some similar standard package for proper handling.
This may get you there, nonetheless.
EDIT: It slipped my mind that this was Windows:
perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"
The weird thing is that The Broken OS's shell interprets && as its own conditional-execution operator, and I couldn't do anything to change its mind! If I escaped it, it would just be passed through that way to perl.
Upvote for edg's answer; another option is to install Cygwin to get wc and a bunch of other handy utilities on Windows.
I was being idiotic; the simple way to do it in the script is:
open my $extract, '<', $extractFileName
    or die "Cannot read row count of $extractFileName: $!";
my $rowCount = 0;
while (<$extract>)
{
    $rowCount++;
}
close($extract);