Extract everything between first and last occurrence of the same pattern in a single iteration - perl

This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I am not quite sure, as regular expressions these days provide lookaheads, etc.

You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;

my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
(-0777 slurps the whole input into $_; (?2) re-applies the pattern of the second capture group, and since (?s:.*) is greedy, that second application lands on the last CAPTURE line.)

I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
(Print from the first CAPTURE to the end of the file, reverse the lines with tac, trim again from what is now the first CAPTURE, and reverse back.)

Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know how much memory this will need for a large file; you would probably prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time, only remember the last line read (by putting it in the hold space).
Now change your tactics.
Append each line to the hold space; you do not know when to flush until the next match.
When you have the next match, recall the hold space and print it.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the hold space with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like sed's hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'

I don't think a regex will be faster than a double scan...
Here is an awk solution (double scan):
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
(file{,} expands to file file, so awk reads the same file twice: the first pass records the line numbers of the matches, the second prints everything from the first to the last of them.)
Alternatively, if you have any a priori information on where the patterns appear in the file, you can use heuristic approaches which will be faster in those special cases.
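On a related note: if the input can be stored uncompressed, the line buffering can also be avoided entirely by recording byte offsets in a single forward pass and copying the matched region out afterwards. An untested sketch of that idea (it will not work on a pipe such as zcat output, which is not seekable):
#!/usr/bin/env perl
# Untested sketch, assuming a seekable (uncompressed) file: one pass records
# the byte offsets of the first and last matching lines, then the region in
# between is copied out in one bulk read instead of being buffered line by line.
use strict;
use warnings;

my $file = shift or die "usage: $0 file\n";
open my $fh, '<', $file or die "$file: $!";

my ( $first, $last );
my $pos = 0;    # byte offset of the start of the current line
while ( my $line = <$fh> ) {
    if ( $line =~ /CAPTURE/ ) {
        $first //= $pos;                 # start of the first matching line
        $last = $pos + length $line;     # end of the latest matching line
    }
    $pos += length $line;
}
exit unless defined $first;

seek $fh, $first, 0 or die "seek: $!";
read $fh, my $buf, $last - $first;
print $buf;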

Here is one more example with a regex (the downside is that if the file is large, it will consume a lot of memory):
#!/usr/bin/perl
{
    local $/ = undef;
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE

This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the last two previous sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the last three previous sentences.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.

At the time of posting this question, the problem I had at hand was that I had several huge gzip-compressed log files generated by a Java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
    at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to that event from these files. The problem could not be solved with a trivial grep for the EventId, simply because the exception lines can be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception, in which case the answers posted here would not print the stack-trace lines. However, it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;

my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /EventId/ or ( $first == 0 and $line !~ /\(AppName\)/ ) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering whether the faster solutions (mainly walter's sed solution or haukex's in-memory perl solution) could be modified to do the same.
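For the record, the in-memory one-liner can probably be adapted as well. An untested sketch, which additionally swallows any lines after the last EventId line that lack (AppName), i.e. the stack-trace continuation lines (with the same caveats as the original one-liner):
zcat file.gz | perl -0777ne '/((^.*EventId.*$)(?s:.*)(?2)(?:\n(?!.*\(AppName\)).*)*(?:\z|\n))/m and print $1'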

Related

Awk's output in Perl doesn't seem to be working properly

I'm writing a simple Perl script which is meant to output the second column of an external text file (columns one and two are separated by a comma).
I'm using AWK because I'm familiar with it.
This is my script:
use v5.10;
use File::Copy;
use POSIX;
$s = `awk -F ',' '\$1==500 {print \$2}' STD`;
say $s;
The contents of the local file "STD" is:
CIR,BS
60,90
70,100
80,120
90,130
100,175
150,120
200,260
300,500
400,600
500,850
600,900
My output is very strange: it prints out the desired "850", but it also prints an extra blank line after it!
ka@man01:$ ./test.pl
850

ka@man01:$
The problem isn't just the printing. I need to use the variable generated by awk (i.e. the $s variable), but the variable is also being stored with the trailing newline!
Could you guys help?
Thank you.
I'd suggest that you're going down a dirty road by trying to inline awk into perl in the first place. Why not instead:
open ( my $input, '<', 'STD' ) or die $!;
while ( <$input> ) {
    s/\s+\z//;
    my @fields = split /,/;
    print $fields[1], "\n" if $fields[0] == 500;
}
But the likely problem is that you're not handling linefeeds, and say is adding an extra one. Try using print instead, or chomp on the resultant string.
perl can do many of the things that awk can do. Here's something similar that replaces your entire Perl program:
$ perl -naF, -le 'chomp; print $F[1] if $F[0]==500' STD
850
The -n creates a while loop around your argument to -e.
The -a splits up each line into @F and -F lets you specify the separator. Since you want to separate the fields on a comma you use -F,.
The -l adds a newline each time you call print.
The -e argument is the program to run (with the added while from -n). The chomp removes the newline from the output. You get a newline in your output because you happen to use the last field in the line. The -l adds a newline when you print; that's important when you want to extract a field in the middle of the line.
The reason you get 2 newlines:
the backtick operator does not remove the trailing newline from the awk output, so $s contains "850\n"
the say function appends a newline to the string, so you have say "850\n", which is the same as print "850\n\n"
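So the minimal fix is to chomp the captured output before printing it; a sketch of the corrected script:
use v5.10;
my $s = `awk -F ',' '\$1==500 {print \$2}' STD`;
chomp $s;   # drop the trailing newline kept by the backticks
say $s;     # say adds exactly one newline back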

How to display lines in a file that contain more than 5 commas in a line, using egrep or awk

I have the lines in the following format:
[image: sample lines; the text in the image appears to be comma-separated values]
Help is required to display only the lines containing more than 5 commas, in a separate file.
perl has a tr (transliterate) operator that returns the number of transliterations that occurred. We can use this to count the occurrences of a character in a string.
cat file.txt | perl -ne 'print if tr/,// > 5'
Using egrep:
egrep '([^,]*,){6,}'
Using awk:
awk -F, 'NF>5{print}'
Using a sed which has an "extended regular expression option" (I'll assume -r here, but it could be -E):
sed -n -r -e '/([^,]*,){6,}/p'
Of course you have to be careful what you ask for. For example, if you have a CSV file with commas embedded within "values", and if you only want lines with more than five "values", then things get a little trickier for tools that are not CSV-aware.
The text in the image looks like CSV.
Then, using awk...
awk -F'","' 'NF>5{print}'
like peak's answer above.
I think you already have answers to your raw question here. However, if what you're really asking is how to find rows with more than five CSV fields, then I think you need something like Perl's Text::CSV module.
An example of this is the following string:
one,two,three,four,five,"six,seven"
This has six commas but only five fields. Do you want to see this line, or do you want to skip it? If you want to see it (as an exception -- a line with more than five commas), then use one of the methods already suggested.
If you don't, then you really want a CSV parser, and Perl's is quite nice -- more lightweight and easier than most languages, in my opinion:
use strict;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } );
open my $IN, "<:encoding(utf8)", "file.csv" or die;
while ( my $row = $csv->getline($IN) ) {
    if ( @$row > 5 ) {
        $csv->combine(@$row);
        print $csv->string(), "\n";
    }
}
close $IN;

Creating CSV of information extracted from filenames in a given format

I have a little script that lists paths to all files in a directory and all subdirectories and parses each path on the list with regex in Perl.
#!/bin/sh
find * -type f | while read j; do
    echo $j | perl -n -e '/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/ && print "\"0\";\"$1$2$3\";\"$4\";\"$5\";$fl\""' >> bss.csv
    echo | readlink -f -n "$j" >> bss.csv
    echo \" >> bss.csv
done
Output:
"0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
I am using the readlink from GNU coreutils: -n suppresses newline at the end, -f performs canonicalization by recursively following symlinks on the path.
The problem is that when the input string does not match the regex, I get only the line with the file path.
How can I add a condition: if the regex matched, show the path, otherwise don't?
I broke my brain with various combinations, but didn't find any that work properly.
Description of solution
In Perl, use if (/…/) {…} else {…} instead of /…/ && …. That way you can execute print if the match is successful and some other code otherwise.
If this is not the problem and you only want to get rid of the readlink output and closing quote, you can call readlink from Perl using backticks.
Resulting code
I turned everything into a single Perl program, used File::Find instead of find command, assumed $fl at the end of print in Perl is a relict (ignored it) and used Cwd::realpath() to find canonical path of the file instead of readlink -f from GNU coreutils. If you still want to use readlink -f, feel free to change Cwd::realpath($_) to `readlink -f '$_'` (including the backticks!), but then it will not work for filenames containing a single-quote.
You should call this script as ./script-name starting-directory > bss.csv. If you put it in the directory you are examining, the output would contain it too, along with the bss.csv.
#!/usr/bin/perl
# Usage: ./$0 [<starting-directory>...]
use strict;
use warnings;
use File::Find;
use Cwd;
no warnings 'File::Find';

sub handleFile() {
    return if not -f;
    if ($File::Find::name =~ /\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/) {
        local $, = ';', $\ = "\n";
        print map "\"$_\"", 0, $1.$2.$3, $4, $5, Cwd::realpath($_);
    } else {
        print STDERR "File $File::Find::name did not match\n";
    }
}

find(\&handleFile, @ARGV ? @ARGV : '.');
For reference I also enclose a polished version of the original program. It calls readlink from Perl as I suggested above and really utilizes the -n option of Perl, avoiding the while read loop.
#!/bin/sh
find . -type f | perl -n -e 'm{/(\d{2})/(\d{2})/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?} && print qq{"0";"$1$2$3";"$4";"$5";"`readlink -f -n '\''$_'\''`"}' > bss.csv
Other remarks on the original code
The echo | before the readlink does nothing and should be removed. Readlink does not read its stdin.
Where does $fl at the end of print in Perl come from? I assume it is a relict.
Use of generic quotes like qq{} and thoughtful use of delimiters (e.g. in regex matching and other quote-like operators) can save you from quoting hell. I already used this tip above: /…/ → m{…} and "…" → qq{…}. Thx, Slade! See perlop manpage for more info.
If I understand you, you want to capture the following parts of the filename:
/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg
                           ~~ ~~ ~              ~~~ ~~~~~~~ ~
                           1  2  3              4   5       6
But your perl regex doesn't do that. Let's break it apart for better understanding.
/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/
Sliced into pieces, this would be...
\/(\d{2}) - a slash then two digits (with the digits captured)
\/(\d{2}) - another slash and two digits
\/(\d+) - one more slash and any number of digits (captured)
.*- - any run of characters until the final hyphen in the input string
([a-zA-Z]+) - one or more alpha characters
(?:_(\d{1}))? - nonsensical (I think) construct matching an optional single digit that won't be captured (because it's inside a (?:...))
If you step through your filename, you'll see that there is nothing here to handle the second-to-last string of digits.
I'd do this using simpler tools. Sed, for example:
[ghoti@pc ~]$ s="/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
[ghoti@pc ~]$ echo "$s" | sed -rne 's/.*/"&"/;h;s:.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.*:"0";"\1\2\3";"\4";"\6":;G;s/\n/;/;p'
"0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
[ghoti@pc ~]$
I'll break up the sed script for easier reading:
s/.*/"&"/; - Put quotes around the filename.
h; - Store the filename in Sed's "hold" space, for future use...
s: - Start the big substitution...
.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.* - This is the pattern we want to match for substitution. Similar to what you did in Perl, obviously, but using ERE instead of PCRE.
:"0";"\1\2\3";"\4";"\6":; - The replacement pattern, with \n being replaced by the bracketed elements of the RE. Note that \5 is skipped in the replace string, as that subexpression is only being used for the match.
G; - Append the "hold" space to the pattern space
s/\n/;/; - and remove the newline between them.
p - Print the result.
Note that this solution, as is, assumes that all input lines match the pattern you're looking for. If that's not the case, then you may get unpredictable output, and should put some pattern matching into the script.

sed / perl regex extremely slow

So, I've got a file called cracked.txt, which contains thousands (80 million+) of lines like this:
dafaa15bec90fba537638998a5fa5085:_BD:zzzzzz12
a8c2e774d406b319e33aca8b38540063:2JB:zzzzzz999
d6d24dfcef852729d10391f186da5b08:WNb:zzzzzzzss
2f1c72ccc940828b5daf4ab98e0f8731:#]9:zzzzzzzz
3b7633b6c19d79e5ab76bdb9cce4fd42:#A9:zzzzzzzz
a3dc9c03ff845776b485fa8337c9625a:yQ,:zzzzzzzz
ade1d43b29674814a16e96098365f956:FZ-:zzzzzzzz
ba93090dfa64d964889f521788aca889:/.g:zzzzzzzz
c3bd6861732affa3a437df46a6295810:m}Z:zzzzzzzz
b31d9f86c28bd1245819817e353ceeb1:>)L:zzzzzzzzzzzz
and my output.txt contains 80 million lines like this:
('chen123','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen1234','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen125','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
(45a36afe044ff58c09dc3cd2ee287164 and f+P change every line)
What I've done is create a simple bash script to match cracked.txt against output.txt and join them.
cat './cracked.txt' | while read LINE; do
    pwd=$(echo "${LINE}" | awk -F ":" '{print $NF}' | sed -e 's/\x27/\\\\\\\x27/g' -e 's/\//\\\x2f/g' -e 's/\x22/\\\\\\\x22/g')
    hash=$(echo "${LINE}" | awk -F ":" '{print $1}')
    lines=$((lines+1))
    echo "${lines} ${pwd}"
    perl -p -i -e "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
    #sed -u -i "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
done
As you can see by the comment, I've tried sed, and perl.
perl seems to be a tad faster than sed
I'm getting something like one line per second.
I've never used perl, so I've got no idea how to use that to my advantage (multi threading, etc)
What would the best way to speed up this process?
Thanks
edit:
I got a suggestion that it would be better to use something like this:
while IFS=: read pwd seed hash; do
    ...
done < cracked.txt
But because a : could appear in between the first and the last : (the fields extracted by awk '{print $1}' and awk '{print $NF}'), it would corrupt the data.
I could use it just for the "hash", but not for the "pwd".
edit again;
The above wouldn't work, because I would have to name all the other fields, which of course would be a problem.
The problem with bash scripting is that, while very flexible and powerful, it creates new processes for nearly anything, and forking is expensive. In each iteration of the loop, you spawn 3×echo, 2×awk, 1×sed and 1×perl. Restricting yourself to one process (and thus, one programming language) will boost performance.
Then, you are re-reading output.txt each time in the call to perl. IO is always slow, so buffering the file would be more efficient, if you have the memory.
Multithreading would work if there were no hash collisions, but is difficult to program. Simply translating to Perl will get you a greater performance increase than transforming Perl to multithreaded Perl.[citation needed]
You would probably write something like
#!/usr/bin/perl
use strict; use warnings;

open my $cracked, "<", "cracked.txt" or die "Can't open cracked";
my @data = do {
    open my $output, "<", "output.txt" or die "Can't open output";
    <$output>;
};
while (<$cracked>) {
    my ($hash, $seed, $pwd) = split /:/, $_, 3;
    chomp $pwd;  # the last field keeps the trailing newline from the input line
    # transform $hash here like "$hash =~ s/foo/bar/g" if really necessary
    # say which line we are at
    print "at line $. with pwd=$pwd\n";
    # do substitutions in @data
    s/\Q$hash\E/$hash ( $pwd )/ for @data;
    # the \Q...\E makes any characters in between non-special,
    # so they are matched literally.
    # (`C++` would match many `C`s, but `\QC++\E` matches the character sequence)
}
# write @data to the output file
(not tested or anything, no guarantees)
While this would still be an O(n²) solution, it would perform better than the bash script. Do note that it can be reduced to O(n) by organizing @data into a hash table, indexed by hash codes:
my %data = map { ...parse the line, return a key-value pair... } @data;
...;
$data{$hash} =~ s/\Q$hash\E/$hash ( $pwd )/;  # instead of the evil for-loop
In reality, you would store a reference to an array containing all lines that contain the hash code in the hash table, so the previous lines would rather be
my %data;
for my $line (@data) {
    my $key = parse_line($line);
    push @{ $data{$key} }, $line;
}
...;
s/\Q$hash\E/$hash ( $pwd )/ for @{ $data{$hash} };  # is still faster!
On the other hand, a hash with 8E7 elements might not exactly perform well. The answer lies in benchmarking.
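A sketch of such a benchmark with the core Benchmark module (linear_scan and hash_lookup are hypothetical subs wrapping the two variants above):
use Benchmark qw(cmpthese);

# Run each variant for about 5 CPU seconds and print a comparison table.
cmpthese( -5, {
    linear_scan => sub { linear_scan() },    # the O(n^2) for-loop version
    hash_lookup => sub { hash_lookup() },    # the %data-indexed version
} );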
When parsing logs at my work, I do this: split the file into N parts (N = number of processors), aligning the split points to \n, and start N workers to process each part in parallel. It works really fast, but the hard drive is the bottleneck.
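In Perl, that idea might look roughly like the following untested sketch. It assumes a seekable, uncompressed file and uses forked worker processes rather than threads; the per-line processing is left as a stub:
#!/usr/bin/env perl
use strict;
use warnings;

my ( $file, $N ) = ( shift, shift // 4 );    # input file and worker count
die "usage: $0 file [workers]\n" unless defined $file;
my $size = -s $file or die "empty or missing $file\n";

# Compute N byte ranges whose boundaries fall on newlines.
open my $fh, '<', $file or die "$file: $!";
my @start = (0);
for my $i ( 1 .. $N - 1 ) {
    seek $fh, int( $size * $i / $N ), 0;
    <$fh>;                                   # skip the partial line at the seek point
    push @start, tell $fh;
}
my @end = ( @start[ 1 .. $#start ], $size );

for my $i ( 0 .. $N - 1 ) {
    next if $start[$i] >= $end[$i];          # empty slice (file smaller than N lines)
    next if fork;                            # parent: keep spawning workers
    open my $in, '<', $file or die "$file: $!";
    seek $in, $start[$i], 0;
    while ( my $line = <$in> ) {
        # ... process $line here ...
        last if tell($in) >= $end[$i];       # reached the end of this slice
    }
    exit 0;
}
wait for 1 .. $N;                            # reap all workers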

What's the use of <> in Perl?

What's the use of <> in Perl? How do you use it?
If we simply write
<>;
and
while(<>)
what is the program doing in each case?
The answers above are all correct, but it might come across more plainly if you understand general UNIX command line usage. It is very common to want a command to work on multiple files. E.g.
ls -l *.c
The command line shell (bash et al) turns this into:
ls -l a.c b.c c.c ...
in other words, ls never sees '*.c' unless the pattern fails to match. Try this at a command prompt (not perl):
echo *
you'll notice that you do not get an *.
So, if the shell is handing you a bunch of file names, and you'd like to go through each one's data in turn, perl's <> operator gives you a nice way of doing that...it puts the next line of the next file (or stdin if no files are named) into $_ (the default scalar).
Here is a poor man's grep:
while(<>) {
    print if m/pattern/;
}
Running this script:
./t.pl *
would print out all of the lines of all of the files that match the given pattern.
cat /etc/passwd | ./t.pl
would use cat to generate some lines of text that would then be checked for the pattern by the loop in perl.
So you see, while(<>) gets you a very standard UNIX command line behavior...process all of the files I give you, or process the thing I piped to you.
<>;
is a short way of writing
readline();
or if you add in the default argument,
readline(*ARGV);
readline is an operator that reads a line from the specified file handle. Reading from the special file handle ARGV will read from STDIN if @ARGV is empty or from the concatenation of the files named by @ARGV if it's not.
As for
while (<>)
It's a syntax error. If you had
while (<>) { ... }
it gets rewritten to
while (defined($_ = <>)) { ... }
And as previously explained, that means the same as
while (defined($_ = readline(*ARGV))) { ... }
That means it will read lines from (previously explained) ARGV until there are no more lines to read.
It is called the diamond operator and feeds data from either stdin if ARGV is empty or each line from the files named in ARGV. This webpage http://docstore.mik.ua/orelly/perl/learn/ch06_02.htm explains it very well.
In many cases of programming with syntactic sugar like this, the Deparse backend of the O module is helpful for finding out what's happening:
$ perl -MO=Deparse -e 'while(<>){print 42}'
while (defined($_ = <ARGV>)) {
    print 42;
}
-e syntax OK
Quoting perldoc perlop:
The null filehandle <> is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes
a list of filenames, doing the same to each line of input from all of
them. Input from <> comes either from standard input, or from each
file listed on the command line.
it reads from STDIN (standard input):
> cat temp.pl
#!/usr/bin/perl
use strict;
use warnings;
my $count=<>;
print "$count"."\n";
>
below is the execution:
> temp.pl
3
3
>
So as soon as you execute the script, it will wait until the user gives some input.
After 3 is given as input, it stores that value in $count and prints that value in the next statement.