Jumping to lines in a file the fastest way in Perl

I have a very large file of size about 300-500 MB. I need to search for String1 in that file first. Then search for String2 starting from the position of String1. Then again search for String3 starting from the position of String2. For example,
String1 = "abc"
String2 = "123"
String3 = "opq"
File :
def
123
opq
opq
123
opq
abc //come here first
blah blah
123 //come here next
blah
opq //read this finally and print
afg
123
blah blah
123
def
Methods I followed,
I tried reading the file line by line and searching for the matching pattern.
It was a very slow method (had to wait for minutes).
Then I stored the whole file into an array and grepped the matching lines to get the final line.
It was quite fast in searching but slower in loading the file into an array. The memory consumed is also high.
Is there an efficient method to perform such a task?

Using a perl one liner and range operators:
perl -ne 'print("$. $_") && exit if (/abc/ .. 1) && (/123/ .. 1) && /opq/' file
Outputs:
11 opq //read this finally and print
Explanation:
Switches:
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
Code:
print("$. $_"): prints the line number $. followed by the current line $_
exit: Terminate processing after desired line is found.
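if (/abc/ .. 1) && (/123/ .. 1) && /opq/: Find patterns in order. A constant operand of the range operator is compared against the line number $., so each /pattern/ .. 1 range turns true at the first matching line and stays true for the rest of the file (unless that first match is on line 1, where the range also ends immediately).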
Addendum - for including functionality inside a script
I would advise against shelling out to another perl process to achieve this functionality. Instead just convert this to a non-commandline version:
use strict;
use warnings;
use autodie;
open my $fh, '<', 'file';
while (<$fh>) {
    if ((/abc/ .. 1) && (/123/ .. 1) && /opq/) {
        print "$. $_";
        last;
    }
}
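The same in-order search can also be written without the flip-flop operator, as a small state machine that looks for one string at a time. This is only a sketch under stated assumptions: the sub name find_in_order is hypothetical, it uses index for plain substring matching instead of a regex, and the sample data is read from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return (line number, line) of the line containing the
# last string, after each earlier string has been seen on an earlier line.
sub find_in_order {
    my ($fh, @strings) = @_;
    my $want = 0;                   # index of the string we are looking for
    my $n    = 0;                   # our own line counter
    while (my $line = <$fh>) {
        $n++;
        next if index($line, $strings[$want]) < 0;
        return ($n, $line) if $want == $#strings;
        $want++;                    # found this one; look for the next
    }
    return;                         # the strings were not all found in order
}

# Sample data modelled on the question, read via an in-memory filehandle.
my $data = join "\n", qw(def 123 opq opq 123 opq abc blah 123 blah opq afg), '';
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my ($n, $line) = find_in_order($fh, 'abc', '123', 'opq');
print "$n $line" if defined $n;     # prints "11 opq"
```

Unlike the flip-flop version, this sketch requires each string to appear on a later line than the previous one, and it stops reading as soon as the last string is found.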

How to print result STDOUT to a temporary blank new file in the same directory in Perl?

I'm new to Perl, so this may be a very basic case that I still can't understand.
Case:
The program asks the user to type a file name.
The user types one or more file names.
The program reads the content of the input file(s).
If a single file is given, it just prints that file's entire content.
If multiple files are given, it combines the contents of each file in sequence.
It then prints the result to a new temporary file located in the same directory as program.pl.
file1.txt:
head
a
b
end
file2.txt:
head
c
d
e
f
end
SINGLE INPUT program ioSingle.pl:
#!/usr/bin/perl
print "File name: ";
$userinput = <STDIN>; chomp ($userinput);
#read content from input file
open ("FILEINPUT", $userinput) or die ("can't open file");
#print content as long as there is any left in the file
while (<FILEINPUT>) {
print ; }
close FILEINPUT;
SINGLE RESULT in cmd:
>perl ioSingle.pl
File name: file1.txt
head
a
b
end
I found tutorial code that combines the content of multiple input files, but I cannot adapt its while condition to the code above:
while ($userinput = <>) {
print ($userinput);
}
I am stuck at making it work for multiple input files.
How should I restructure the code so that my program gives a result like this?
EXPECTED MULTIFILES RESULT in cmd:
>perl ioMulti.pl
File name: file1.txt file2.txt
head
a
b
end
head
c
d
e
f
end
I appreciate your response :)
A good way to start working on a problem like this is to break it down into smaller sections.
Your problem seems to break down to this:
get a list of filenames
for each file in the list
display the file contents
So think about writing subroutines that do each of these tasks. You already have something like a subroutine to display the contents of the file.
sub display_file_contents {
    # filename is the first (and only) argument to the sub
    my $filename = shift;
    # Use a lexical filehandle and three-arg open
    open my $filehandle, '<', $filename or die $!;
    # Shorter version of your code
    print while <$filehandle>;
}
The next task is to get our list of files. You already have some of that too.
sub get_list_of_files {
    print 'File name(s): ';
    my $files = <STDIN>;
    chomp $files;
    # We might have more than one filename. Need to split input.
    # Assume filenames are separated by whitespace
    # (Might need to revisit that assumption - filenames can contain spaces!)
    my @filenames = split /\s+/, $files;
    return @filenames;
}
We can then put all of that together in the main program.
#!/usr/bin/perl
use strict;
use warnings;
my @list_of_files = get_list_of_files();
foreach my $file (@list_of_files) {
    display_file_contents($file);
}
By breaking the task down into smaller tasks, each one becomes easier to deal with. And you don't need to carry the complexity of the whole program in your head at one time.
p.s. But like JRFerguson says, taking the list of files as command line parameters would make this far simpler.
The easy way is to use the diamond operator <> to open and read the files specified on the command line. This would achieve your objective:
while (<>) {
chomp;
print "$_\n";
}
Thus: ioSingle.pl file1.txt file2.txt
If this is the sole objective, you can reduce this to a command line script using the -p or -n switch like:
perl -pe '1' file1.txt file2.txt
perl -ne 'print' file1.txt file2.txt
These switches create implicit loops around the -e commands. The -p switch prints $_ after every loop as if you had written:
LINE:
while (<>) {
# your code...
} continue {
print;
}
Using -n creates:
LINE:
while (<>) {
# your code...
}
Thus, -p adds an implicit print statement.
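If the program must still prompt for the file names, the two approaches can be combined: split the typed names and assign them to @ARGV, and the diamond operator then reads each file in turn. This is a minimal sketch, assuming the names are separated by whitespace; the sub name cat_files and its output-handle parameter are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: print the contents of each named file, in sequence,
# by handing the names to the diamond operator through @ARGV.
sub cat_files {
    my ($out, @names) = @_;
    return unless @names;           # with an empty @ARGV, <> would read STDIN
    local @ARGV = @names;           # <> opens and reads the files in @ARGV
    while (my $line = <>) {
        print {$out} $line;
    }
    return;
}

# Only prompt when run interactively, so the sketch never blocks otherwise.
if (-t STDIN) {
    local $| = 1;                   # show the prompt before waiting for input
    print 'File name(s): ';
    my $input = <STDIN> // '';
    cat_files(\*STDOUT, split ' ', $input);
}
```

Passing the output handle as a parameter is a design choice of this sketch: it lets the same sub write to STDOUT or to a file opened for writing, which is what the question ultimately asks for.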

perl find text from file in certain position

130723,-001,1.14,130725,+002,4.20,130731,+006,1.52,130728
130725,+002,4.20,130731,+006,1.52,130728,-003,0.00,130731
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731
130728,-003,0.00,130731,+003,1.00,130731,+000,0.00,130729
130731,+000,0.00,130729,-002,1.00,130728,-001,0.00,130728
The above is part of a log file. Each line in the log file is always the same length and follows the same pattern, as you can see above. I need to read the file and put into an array all the lines where positions 42 to 46 meet certain criteria. In the case above, we are looking at the following values:
+006
-003
+003
+000
-001
Can someone point me in the right direction?
EDIT :
Thanks to amon for the suggestion.
For future reference, I ended up with this code:
open (FILE, $filename) or die "Couldn't open log: $!";
while (<FILE>) {
    # the field starting at column 42 has index 7; compare strings with eq
    if ((split /,/)[7] eq "+003") {
        push @data, $_;
    }
}
close FILE;
foreach (@data) {
    print "$_\r\n";
}
Thinking ahead: if this file gets really big, what steps should I take to optimise the process for speed?
If you want to do it by column numbers, then substr() is usable with care:
perl -pe '$_ = substr($_, 41, 4) . "\n"' data
Your question asks for columns 42..46, but with an inclusive notation, that selects 5 positions, the last of which is a comma. Specifying 42..46 is perhaps the 1-based half-open range of columns.
The 41 in the code is 'column 42 - 1' (0-based indexes); the 4 is '46 - 42'. So, for columns [N..M), the formula would be:
perl -pe '$_ = substr($_, N-1, M-N) . "\n"' data
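The column formula can be wrapped in a small sub to make the 1-based, half-open convention explicit. This is only an illustrative sketch; the sub name columns is hypothetical, and the sample line is the first line of the log excerpt in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based columns [N..M) of a line, that is,
# starting at column N and stopping just before column M.
sub columns {
    my ($line, $n, $m) = @_;
    return substr($line, $n - 1, $m - $n);
}

my $line = '130723,-001,1.14,130725,+002,4.20,130731,+006,1.52,130728';
print columns($line, 42, 46), "\n";     # prints "+006"
```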
While @amon's answer is elegant, you can just use a regex:
open FILE, "filename.txt" or die $!;
my @data;
while (<FILE>) {
    # group the alternatives so that ^.{41} applies to all of them
    push @data, $_ if $_ =~ /^.{41}(?:\+006|-003|\+003|\+000|-001)/;
}
Try
perl -F, -ane '$F[7] eq "+003" and push @l,$_; END { print for @l }'<<XXX
130723,-001,1.14,130725,+002,4.20,130731,+006,1.52,130728
130725,+002,4.20,130731,+006,1.52,130728,-003,0.00,130731
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731
130728,-003,0.00,130731,+003,1.00,130731,+000,0.00,130729
130731,+000,0.00,130729,-002,1.00,130728,-001,0.00,130728
XXX
Output:
130731,+006,1.52,130728,-003,0.00,130731,+003,1.00,130731

Perl - Stop reading a file if multiple lines match [duplicate]

Possible Duplicate:
How do I break out of a loop in Perl?
I have data that looks like what you see below. I am trying to create a Perl script that will capture selected text. My idea was to say: "if the previous line read was all -'s and the current line read is all ='s, then stop reading the file and don't print those lines with only ='s and -'s."
However, I don't know how to code that. I only started using perl 3 days ago. I don't know if that is the best way of doing it. Let me know if there is a better way.
Either way if you could help with the code, I'd appreciate it.
My code so far:
...
$end_section_flag = "true"; # I was going to use this to signify
                            # when I want to stop reading
                            # ie. when I reached the end of the
                            # data I want to capture
while (<$in_fh>)
{
    my $line = $_;
    chomp $line;
    if ($line eq $string)
    {
        print "Found it\n";
        $end_section_flag = "false";
    }
    if ($end_section_flag eq "false" )
    {
        print $out_fh "$line\n";
        # if you found the end of the section I'm reading
        # don't print the -'s and ='s and exit
    }
}
What my data looks like
-------------------------------------------------------------------------------
===============================================================================
BLAH BLAH
===============================================================================
asdfsad
fasd
fas
df
asdf
a
\n
\n
-------------------------------------------------------------------------------
===============================================================================
BLAH BLAH
===============================================================================
...
What I want to capture
-------------------------------------------------------------------------------
===============================================================================
BLAH BLAH
===============================================================================
asdfsad
fasd
fas
df
asdf
a
\n
\n
Line-wise processing is not so suitable because your boundary crosses line endings. Slurp the file whole, then extract the in-between with the match operator.
use strictures;
use File::Slurp qw(read_file);
my $content = read_file 'so11454427.txt', { binmode => ':raw' };
my $boundary = qr'-{79} \R ={79}'msx;
my (@extract) = $content =~ /$boundary (.*?) $boundary/gmsx;
See if this suits your needs:
perl -ne 'm/^---/...m/^---/ and print' file
Should you want only the first block, change the delimiter from / to ? thusly:
perl -ne 'm?^---?...m?^---? and print' file
See the range operator discussion.
This will print the range of lines bounded by '---'. You can redirect the output into a file of your choice using your shell's redirection:
perl -ne 'm/^---/...m/^---/ and print' file > myoutput
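The idea described in the question (stop when the previous line was all -'s and the current line is all ='s) can also be coded directly by holding back one line before printing it. This is a rough sketch, not a definitive implementation: the sub name copy_section is hypothetical, and it assumes that $start is the section's title line (such as BLAH BLAH) rather than one of the separator lines.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, beginning at the line equal
# to $start and stopping before the next dashes line that is followed by an
# equals line. One line is held back so that dashes line is never printed.
sub copy_section {
    my ($in, $out, $start) = @_;
    my $copying = 0;
    my $held;                               # line waiting to be printed
    while (my $line = <$in>) {
        chomp $line;
        $copying ||= $line eq $start;
        next unless $copying;
        # End of section: drop the held dashes line and the equals line.
        return if defined $held && $held =~ /^-+$/ && $line =~ /^=+$/;
        print {$out} "$held\n" if defined $held;
        $held = $line;
    }
    print {$out} "$held\n" if defined $held;    # reached end of file
    return;
}

my $data = join "\n", qw(--- === TITLE === asdfsad fasd --- === OTHER === more), '';
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
copy_section($in, \*STDOUT, 'TITLE');       # prints TITLE, ===, asdfsad, fasd
```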

Search files and when match is found, store it, then print out 4 lines above, 3 lines below

I have a simple search script that takes user input, searches across directories and files, and lists the files in which the input is found. What I want is that, when a match is found, the script grabs the 4 lines above it and the 3 lines below it and prints them. So, let's say I have:
somefile.html
"a;lskdj a;sdkjfa;klsjdf a aa;ksjd a;kjaf ;;jk;kj asdfjjasdjjfajsd jdjd
jdjajsdf<blah></blah> ok ok okasdfa stes test tes tes test test<br>
blah blah blah ok, I vouch for the sincerity of my post all day long.
Even though I can sometimes be a little crass.
I would only know the blue moon of pandora if I saw it. I heard tales of long ago
times in which .. blah blah
<some html>whatever some number 76854</some html>
running thru files of grass etc.. ===> more info
whatever more "
and let's say I want to find "76854": it would print the match, or store it in an array so I can print all matches found in the directories/files.
*Match found:*
**I would only know the blue moon of pandora if I saw it. I heard tales of long ago
times in which .. blah blah
<some html>whatever whatever</some html>
running thru files of grass etc.. ===> more info
whatever more**
**********************************
Something like that. So far I have the following, which works by printing out the files in which it finds a match:
if ($args->{'keyword'}){
    if($keyword =~ /^\d+$/){
        print "Your Results are as Follows:\n";
        find( sub
        {
            local $/;
            return if ($_ =~ /^\./);
            return unless ($_ =~ /\.html$/i);
            stat $File::Find::name;
            return if -d; # is the current file a directory?
            return unless -r; # is the file readable?
            open(FILE, "< $File::Find::name") or return;
            my $string = <FILE>;
            close (FILE);
            print "$keyword\n";
            if(grep /$keyword/, $string){
                push(@resultholder, $File::Find::name);
            }else{
                return;
            }
        },'/app/docs/');
        print "Results: @resultholder\n";
    }else{
        print "\n\n ERROR\n";
        print "*************************************\n\n";
        print "Seems Your Entry was in the wrong format \n\n";
        print "*************************************\n\n";
    }
    exit;
}
Is perl a prerequisite here? This is trivially easy with grep, you can tell it to print N number of lines before and after a match.
grep <search-term> file.txt -B <# of lines before> -A <# of lines after>
Please disregard if you really want to use perl, just throwing out an alternative.
Are you using Windows or Linux?
If you are on Linux, you would do better to replace your script with:
grep -r -l 'search_string' path_to_search_directory
It lists all files containing search_string. To get 4 lines of context before and 3 lines after each matching line, run:
grep -r -B 4 -A 3 'search_string' path_to_search_directory
If for some reason you cannot or don't want to use grep, you need to improve your script.
First, this construction reads only one record from the file (with local $/ in effect, as in your script, that one record is the entire file):
my $string = <FILE>;
Second, you had better avoid reading the whole file into memory, because you may encounter a file of several GB. Even reading a single line into memory can be a problem, because you may encounter a really long line. Replace it with sequential reads into a small buffer.
Finally, to get 4 lines before and 3 lines after, you need to read backwards from the match (seek to the position one buffer size before the match, read that block, and check whether it contains enough line breaks).
So you need to store at least 8 lines, and output those 8 lines when the 5th line matches your pattern. The shift operator, for removing an element from the front of an array, and the push operator, for adding an element to the end of a list, could be helpful here.
find( sub {
    ... # but don't set $/
    open( FILE, '<', $File::Find::name) or return;
    # 8 placeholder lines: 4 before the match, the match itself, 3 after
    my @buffer = ('') x 8;
    while (<FILE>) {
        shift @buffer;
        push @buffer, $_;
        if ($buffer[4] =~ /\Q$keyword\E/) {
            print "--- Found in $File::Find::name ---\n";
            print @buffer;
            # return?
        }
    }
    close FILE;
    # handle the case where the keyword is in the last ~4 lines of the file.
    while (@buffer > 5) {
        shift @buffer;
        if ($buffer[4] =~ /\Q$keyword\E/) {
            print "--- Found in $File::Find::name ---\n";
            print @buffer;
        }
    }
} );

Perl: extract rows from 1 to n (Windows)

I want to extract rows 1 to n from my .csv file. Using this
perl -ne 'if ($. == 3) {print;exit}' infile.txt
I can extract only one row. How to put a range of rows into this script?
If you have only a single range and a single, possibly concatenated input stream, you can use:
#!/usr/bin/perl -n
if (my $seqno = 1 .. 3) {
print;
exit if $seqno =~ /E/;
}
But if you want it to apply to each input file, you need to catch the end of each file:
#!/usr/bin/perl -n
print if my $seqno = 1 .. 3;
close ARGV if eof || $seqno =~ /E/;
And if you want to be kind to people who forget args, add a nice warning in a BEGIN or INIT clause:
#!/usr/bin/perl -n
BEGIN { warn "$0: reading from stdin\n" if @ARGV == 0 && -t }
print if my $seqno = 1 .. 3;
close ARGV if eof || $seqno =~ /E/;
Notable points include:
You can use -n or -p on the #! line. You could also put some (but not all) other command line switches there, like ‑l or ‑a.
Numeric literals as operands to the scalar flip‐flop operator are each compared against the readline counter, so a scalar 1 .. 3 is really ($. == 1) .. ($. == 3).
Calling eof with neither an argument nor empty parens means the last file read in the magic ARGV list of files. This contrasts with eof(), which is the end of the entire <ARGV> iteration.
A flip‐flop operator’s final sequence number is returned with an "E0" appended to it.
The -t operator, which calls libc’s isatty(3), defaults to the STDIN handle, unlike any of the other filetest operators.
A BEGIN{} block happens during compilation, so if you try to decompile this script with ‑MO=Deparse to see what it really does, that check will execute. With an INIT{}, it will not.
Doing just that will reveal that the implicit input loop has a label called LINE that you perhaps might in other circumstances use to your advantage.
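The sequence numbers mentioned above can be observed with a short script. This is a minimal sketch for illustration only: the sub name flip_flop_values is made up, and the input comes from an in-memory filehandle instead of the -n loop.

```perl
use strict;
use warnings;

# Collect what the scalar flip-flop 1 .. 3 returns for each line read from $fh.
sub flip_flop_values {
    my ($fh) = @_;
    my @values;
    while (my $line = <$fh>) {
        my $seqno = 1 .. 3;         # same as ($. == 1) .. ($. == 3)
        push @values, $seqno;
    }
    return @values;
}

my $data = "a\nb\nc\nd\ne\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print join(',', map { "[$_]" } flip_flop_values($fh)), "\n";
# prints [1],[2],[3E0],[],[]
```

The empty brackets show that the operator returns the empty string once the range has ended, which is why the test $seqno =~ /E/ identifies the last line of the range.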
HTH
What's wrong with:
head -3 infile.txt
If you really must use Perl then this works:
perl -ne 'if ($. <= 3) {print} else {exit}' infile.txt
You can use the range operator:
perl -ne 'if (1 .. 3) { print } else { last }' infile.txt