sed / perl regex extremely slow - perl

So, I've got a file called cracked.txt, which contains thousands (80 million+) of lines like this:
dafaa15bec90fba537638998a5fa5085:_BD:zzzzzz12
a8c2e774d406b319e33aca8b38540063:2JB:zzzzzz999
d6d24dfcef852729d10391f186da5b08:WNb:zzzzzzzss
2f1c72ccc940828b5daf4ab98e0f8731:#]9:zzzzzzzz
3b7633b6c19d79e5ab76bdb9cce4fd42:#A9:zzzzzzzz
a3dc9c03ff845776b485fa8337c9625a:yQ,:zzzzzzzz
ade1d43b29674814a16e96098365f956:FZ-:zzzzzzzz
ba93090dfa64d964889f521788aca889:/.g:zzzzzzzz
c3bd6861732affa3a437df46a6295810:m}Z:zzzzzzzz
b31d9f86c28bd1245819817e353ceeb1:>)L:zzzzzzzzzzzz
and in my output.txt 80 million lines like this:
('chen123','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen1234','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
('chen125','45a36afe044ff58c09dc3cd2ee287164','','','','f+P',''),
(45a36afe044ff58c09dc3cd2ee287164 and f+P change every line)
What I've done is created a simple bash script to match the cracked.txt to output.txt and join them.
cat './cracked.txt' | while read LINE; do
pwd=$(echo "${LINE}" | awk -F ":" '{print $NF}' | sed -e 's/\x27/\\\\\\\x27/g' -e 's/\//\\\x2f/g' -e 's/\x22/\\\\\\\x22/g' )
hash=$(echo "${LINE}" | awk -F ":" '{print $1}')
lines=$((lines+1))
echo "${lines} ${pwd}"
perl -p -i -e "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
#sed -u -i "s/${hash}/${hash} ( ${pwd} ) /g" output.txt
done
As you can see by the comment, I've tried sed, and perl.
perl seems to be a tad faster than sed
I'm getting something like one line per second.
I've never used perl, so I've got no idea how to use that to my advantage (multithreading, etc.).
What would be the best way to speed up this process?
Thanks
edit:
I got a suggestion that it would be better to use something like this:
while IFS=: read pwd seed hash; do
...
done < cracked.txt
But because a : could also appear between the first and the last occurrence of : (i.e. between the awk '{print $1}' and awk '{print $NF}' fields), that would corrupt the data.
I could use it just for the "hash", but not for the "pwd".
edit again;
The above wouldn't work, because I would have to name all the other fields, which of course would be a problem.

The problem with bash scripting is that, while very flexible and powerful, it creates new processes for nearly anything, and forking is expensive. In each iteration of the loop, you spawn 3×echo, 2×awk, 1×sed and 1×perl. Restricting yourself to one process (and thus, one programming language) will boost performance.
Then, you are re-reading output.txt each time in the call to perl. IO is always slow, so buffering the file would be more efficient, if you have the memory.
Multithreading would work if there were no hash collisions, but is difficult to program. Simply translating to Perl will get you a greater performance increase than transforming Perl to multithreaded Perl.[citation needed]
You would probably write something like
#!/usr/bin/perl
use strict; use warnings;
open my $cracked, "<", "cracked.txt" or die "Can't open cracked";
my @data = do {
open my $output, "<", "output.txt" or die "Can't open output";
<$output>;
};
while(<$cracked>) {
my ($hash, $seed, $pwd) = split /:/, $_, 3;
# transform $hash here like "$hash =~ s/foo/bar/g" if really necessary
# say which line we are at
print "at line $. with pwd=$pwd\n";
# do substitutions in @data
s/\Q$hash\E/$hash ( $pwd )/ for @data;
# the \Q...\E makes any characters in between non-special,
# so they are matched literally.
# (`C++` would match many `C`s, but `\QC++\E` matches the character sequence)
}
# write @data to the output file
(not tested or anything, no guarantees)
While this would still be an O(n²) solution, it would perform better than the bash script. Do note that it can be reduced to O(n) by organizing @data into a hash, indexed by the hash codes:
my %data = map {do magic here to parse the lines, and return a key-value pair} @data;
...;
$data{$hash} =~ s/\Q$hash\E/$hash ( $pwd )/; # instead of evil for-loop
In reality, you would store in the hash a reference to an array containing all lines that contain the hash code, so the previous lines would instead be
my %data;
for my $line (@data) {
my $key = parse_line($line);
push @{ $data{$key} }, $line;
}
...;
s/\Q$hash\E/$hash ( $pwd )/ for @{ $data{$hash} }; # is still faster!
On the other hand, a hash with 8E7 elements might not perform all that well. The answer lies in benchmarking.
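To make that concrete, here is a rough, untested sketch of the O(n) variant; the regex used to pull the hash out of each output.txt line and the output.new.txt file name are my own assumptions for illustration:
#!/usr/bin/perl
use strict; use warnings;
# Read output.txt once and remember which lines mention which hash.
open my $out, "<", "output.txt" or die "Can't open output: $!";
my @data = <$out>;
close $out;
my %index;   # hash code => array ref of indices into @data
for my $i (0 .. $#data) {
    # assumption: the hash is the first 32-hex-digit token on the line
    push @{ $index{$1} }, $i if $data[$i] =~ /\b([0-9a-f]{32})\b/;
}
open my $cracked, "<", "cracked.txt" or die "Can't open cracked: $!";
while (<$cracked>) {
    chomp;
    my @f = split /:/;                    # like the awk above: hash is $1,
    my ($hash, $pwd) = ($f[0], $f[-1]);   # password is $NF
    next unless $index{$hash};
    $data[$_] =~ s/\Q$hash\E/$hash ( $pwd )/ for @{ $index{$hash} };
}
close $cracked;
open my $new, ">", "output.new.txt" or die "Can't write output: $!";
print {$new} @data;
close $new;
Because each hash maps to the list of line indices that contain it, duplicate hashes in output.txt (as in the chen123/chen1234/chen125 sample) all get updated.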

When parsing logs at work I do this: split the file into N parts (N = number of processors), aligning the split points to \n, then start N threads, each working on its own part. It works really fast, but the hard drive is the bottleneck.
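A minimal sketch of that newline alignment in Perl (untested; the file name and chunk count are assumptions for illustration, and the actual workers are left out):
#!/usr/bin/perl
use strict; use warnings;
my ($file, $n) = ("output.txt", 4);      # assumed file name and worker count
my $size = -s $file;
open my $fh, "<", $file or die "Can't open $file: $!";
my @offsets = (0);
for my $i (1 .. $n - 1) {
    seek $fh, int($size * $i / $n), 0;   # jump near the i-th split point
    <$fh>;                               # read on to the next newline to align
    push @offsets, tell $fh;
}
push @offsets, $size;
close $fh;
# Each worker (thread or forked child) would then seek to $offsets[$k]
# and process lines up to $offsets[$k+1] independently.
print "chunk $_: bytes $offsets[$_] .. $offsets[$_+1]\n" for 0 .. $n - 1;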

Related

Awk's output in Perl doesn't seem to be working properly

I'm writing a simple Perl script which is meant to output the second column of an external text file (columns one and two are separated by a comma).
I'm using AWK because I'm familiar with it.
This is my script:
use v5.10;
use File::Copy;
use POSIX;
$s = `awk -F ',' '\$1==500 {print \$2}' STD`;
say $s;
The contents of the local file "STD" is:
CIR,BS
60,90
70,100
80,120
90,130
100,175
150,120
200,260
300,500
400,600
500,850
600,900
My output is very strange: it prints out the desired "850", but it is followed by an extra blank line too!
ka@man01:$ ./test.pl
850

ka@man01:$
The problem isn't just the printing. I need to use the variable generated by awk (i.e. the $s variable), but the variable also comes back with a trailing newline!
Could you guys help?
Thank you.
I'd suggest that you're going down a dirty road by trying to inline awk into perl in the first place. Why not instead:
open ( my $input, '<', 'STD' ) or die $!;
while ( <$input> ) {
s/\s+\z//;
my @fields = split /,/;
print $fields[1], "\n" if $fields[0] == 500;
}
But the likely problem is that you're not handling linefeeds, and say is adding an extra one. Try using print instead, or chomp on the resultant string.
perl can do many of the things that awk can do. Here's something similar that replaces your entire Perl program:
$ perl -naF, -le 'chomp; print $F[1] if $F[0]==500' STD
850
The -n creates a while loop around your argument to -e.
The -a splits up each line into #F and -F lets you specify the separator. Since you want to separate the fields on a comma you use -F,.
The -l adds a newline each time you call print.
The -e argument is the program to run (with the added while from -n). The chomp removes the newline from the output. You get a newline in your output because you happen to use the last field in the line. The -l adds a newline when you print; that's important when you want to extract a field in the middle of the line.
The reason you get 2 newlines:
the backtick operator does not remove the trailing newline from the awk output. $s contains "850\n"
the say function appends a newline to the string. You have say "850\n" which is the same as print "850\n\n"
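So a minimal fix in the original script (which already has use v5.10 for say) is to chomp before printing:
my $s = `awk -F ',' '\$1==500 {print \$2}' STD`;
chomp $s;   # remove the newline awk left behind
say $s;     # now prints just "850" followed by a single newline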

Extract everything between the first and last occurrence of the same pattern in a single iteration

This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, as regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
push @buf, $line unless $first;
if ( $line=~/CAPTURE/ ) {
if ($first) {
@buf = ($line);
$first = 0;
}
print @buf;
@buf = ();
}
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know what memory complications this will cause for a large file; you would probably prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the Hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think a regex will be faster than a double scan...
Here is an awk solution (double scan)
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively if you have any a priori information on where the patterns appear on the file you can have heuristic approaches which will be faster on those special cases.
Here is one more example with a regex (the con is that, if the files are large, it will consume a lot of memory):
#!/usr/bin/perl
{
local $/ = undef;
open FILE, $ARGV[0] or die "Couldn't open file: $!";
binmode FILE;
$string = <FILE>;
close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or with one liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the last two previous sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the last three previous sentences.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to the event from these files. The problem became unsolvable with a trivial grep for EventId just due to the fact that the exception lines could be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception and the answers posted here would not print the stacktrace lines. However it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first=1;
my @buf;
while ( my $line = <> ) {
push @buf, $line unless $first;
if ( $line=~/EventId/ or ($first==0 and $line!~/\(AppName\)/)) {
if ($first) {
@buf = ($line);
$first = 0;
}
print @buf;
@buf = ();
}
else {
$first = 1;
}
}
I am still wondering if the faster solutions (mainly walter's sed solution or haukex's in-memory perl solution) could be modified to do the same.

Find available space in /tmp on a remote machine

I want to find out the available space in /tmp on a remote machine. I can do it with the following command from my machine:
ssh host-name df /tmp | awk '{ print $4 }' | tail +2
It works and gives an output something like this: 9076656, which is the available space in KB.
But when I put this command inside a perl program, I get the error message about the use of uninitialized value for $4.
Use of uninitialized value $4 in concatenation
Here is how I am doing it in the perl code:
my $output = `ssh $server df /tmp | awk '{ print $4 }' | tail +2`;
Any ideas how can I resolve this issue ?
The problem is that $4 is interpolated by Perl, just like you expect $server to be interpolated and not literal. The immediate solution is to escape the dollar sign: \$4. However, it is as Brian Agnew says, pretty redundant to use awk inside perl to do something that perl excels at.
use strict;
use warnings;
use Data::Dumper;
my @output = `ssh $server df /tmp`; # capture output in array
@output = map { ( split )[3] } @output; # take only 4th field
print Dumper \@output; # display data
Then you can use the various array tools to trim @output to your liking, e.g. pop, push, shift, unshift, splice, using slices and subscripts. For example, taking all but the first two lines:
print @output[2 .. $#output];
Removing the first two lines:
splice @output, 0, 2;
Using Perl to spawn off an awk script seems a little redundant (and heavyweight) given Perl's ability to read files, parse out fields etc. I would rather investigate reading a process' output via Perl and splitting the output lines.
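For example, a rough sketch of that idea (untested; the host name is a placeholder, and the 4th df column is taken as the available space, as in the question):
my $server = 'host-name';                 # placeholder for the remote host
open my $df, '-|', "ssh $server df /tmp" or die "Can't run ssh/df: $!";
my $free_kb;
while ( my $line = <$df> ) {
    next if $. == 1;                      # skip df's header line
    $free_kb = ( split ' ', $line )[3];   # 4th column: available space in KB
}
close $df;
print "$free_kb\n";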

How to remove a specific word from a file in perl

A file contains:
rhost=localhost
ruserid=abcdefg_xxx
ldir=
lfile=
rdir=p01
rfile=
pgp=none
mainframe=no
ftpmode=binary
ftpcmd1=
ftpcmd2=
ftpcmd3=
ftpcmd1a=
ftpcmd2a=
notifycc=no
firstfwd=Yes
NOTIFYNYL=
decompress=no
compress=no
I want to write some simple code that removes the "_xxx" in that second line. Keep in mind that there will never be a file that contains the string "_xxx" anywhere else, so that should make it a lot easier. I'm just not too familiar with the syntax. Thanks!
The short answer:
Here's how you can remove just the literal '_xxx'.
perl -pli.bak -e 's/_xxx$//' filename
The detailed explanation:
Since Perl has a reputation for code that is indistinguishable from line noise, here's an explanation of the steps.
-p creates an implicit loop that looks something like this:
while( <> ) {
# Your code goes here.
}
continue {
print or die;
}
-l sort of acts like "auto-chomp", but also places the line ending back on the line before printing it again. It's more complicated than that, but in its simplest use, it changes your implicit loop to look like this:
while( <> ) {
chomp;
# Your code goes here.
}
continue {
print $_, $/;
}
-i tells Perl to "edit in place." Behind the scenes it creates a separate output file and at the end it moves that temporary file to replace the original.
.bak tells Perl that it should create a backup named 'originalfile.bak' so that if you make a mistake it can be reversed easily enough.
Inside the substitution:
s/
_xxx$ # Match (but don't capture) the final '_xxx' in the string.
//x;  # Replace the entire match with nothing.
The reference material:
For future reference, information on the command line switches used in Perl "one-liners" can be obtained in Perl's documentation at perlrun. A quick introduction to Perl's regular expressions can be found at perlrequick. And a quick overview of Perl's syntax is found at perlintro.
This overwrites the original file, getting rid of _xxx in the 2nd line:
use warnings;
use strict;
use Tie::File;
my $filename = shift;
tie my @lines, 'Tie::File', $filename or die $!;
$lines[1] =~ s/_xxx//;
untie @lines;
Maybe this can help
perl -ple 's/_.*// if /^ruserid/' < file
This will remove everything from the first '_' (inclusive) to the end of the line, in the line that starts with "ruserid".
One way using perl. In the second line ($. == 2), delete from the last _ until the end of the line:
perl -lpe 's/_[^_]*\Z// if $. == 2' infile

How to find a solaris process with ___ status

I made the following script, which searches for certain processes, displays the pflags output for each one, and stops when it finds one with the word "pause":
!cat find_pause
#!/usr/bin/perl -W
use warnings;
use strict;
if (open(WCF,
"ps -ef | grep '/transfile' | cut -c10-15 | xargs -n1 pflags 2>&1 |"
)) {
while (<WCF>) {
next if ($_ =~ /cannot/);
print $_;
last if ($_ =~ /pause/);
}
close(WCF);
}
It works, but I wonder if there is a better way to do this.
Update
pause is a low-level system call. Like read, nanosleep, waitid, etc.
With this script I want to find processes that are stuck in the pause call. We are trying to find a bug in our system, and we think it might be related to this.
I don't know what you'd consider a "better way" in this case, but I can offer some technique guidance for the approach you already have:
grep '/[t]ransfile'
A grep against ps output often runs the risk of matching the grep process itself, which is almost never desired. An easy protection against this is simply to introduce a character class of one member in the grep pattern argument.
awk '/\/[t]ransfile/{ print $2 }'
grep + cut, that is, field extraction following a pattern match, is an easy task for a single awk command.
Don't refer to $_
Tighter, more idiomatic perl would omit explicit use of $_. Try next if /cannot/ and the like.
open(my $wcf, ...)
Please use lexical filehandles, otherwise you'll be chided by those old enough to remember when we couldn't use them. :)
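Putting those suggestions together, the body of the original script might look something like this (an untested sketch, keeping the same ps/xargs/pflags pipeline as the question):
if (open(my $wcf,
    "ps -ef | awk '/\\/[t]ransfile/{ print \$2 }' | xargs -n1 pflags 2>&1 |"
)) {
    while (<$wcf>) {
        next if /cannot/;
        print;
        last if /pause/;
    }
    close $wcf;
}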
There are two possible improvements to this, depending on:
Do you actually need to print the exact output of the pflags command, or just some info from it (e.g. a list of PIDs and flags)?
What does "pause" in the pflags output mean? It's nowhere in the "proc" or "pflags" man pages, and all the actual flags are upper case. Depending on its meaning, it might be found in the native Perl implementation of "/proc", Proc::ProcessTable::Process.
For example, that Process object contains all the flags (in a bit vector) and the process status (my suspicion is that "pause" might be a process status).
If the answers to those questions are "Proc::ProcessTable::Process contains enough info for my needs", then a better solution is to use that:
#!/usr/bin/perl -W
use warnings;
use strict;
use Proc::ProcessTable;
my $t = new Proc::ProcessTable;
foreach my $p ( @{$t->table} ) {
my $flags = $p->flags; # This is an integer containing a bit vector.
# Somehow process $flags or $p->status to find "if the process is paused"
print "$flags\n";
last if paused($p); # No clue how to do that without more info from you
# May be : last if $p->status =~ /paused/;
}
However, if the native Perl module does not have enough info for you (unlikely but possible), OR if you actually need to print the exact pflags output as-is for some reason, the best optimization is to construct the list of PIDs for pflags natively - not as big a win, but you still lose a bunch of extra forked-off processes. Something like this:
#!/usr/bin/perl -W
use warnings;
use strict;
use Proc::ProcessTable;
my $t = new Proc::ProcessTable;
my $pids = join " ", map { $_->pid } @{$t->table};
if (open(WCF, "pflags 2>&1 $pids|")) {
while (<WCF>) {
next if ($_ =~ /cannot/);
print $_;
last if ($_ =~ /pause/);
}
close(WCF);
}