Perl: Read from file till specified character(s) found - perl

I have a very huge(10 GB) single line file(basically insert statement) which i cant load into memory.
I want to process that line(doing some regex) and taking meaning full values.
The values are in tuples(data is between-> (.*) ).
So i want to just read each tuple from the file and process it.
What i am thinking of doing is using getc like this:
getc FILEHANDLE
So i read each character and check if it matches my tuple ending character(in my case it is ), ).
Is there a more efficient and better way to perform this in optimized way?
Thanks.

You could set the special perl variable INPUT_RECORD_SEPARATOR $/ to match your tuple-ending character.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/ say /;
open( my $fh, '<', 'foo.txt' ) or die;
my $tuple_ending_char = ')';
local $/ = $tuple_ending_char;
while (<$fh>) {
say $_;
}

You can try the following code also but it is not as elegant as davewood's solution.
use strict;
use Data::Dumper;
my $filename='/tmp/sample.txt';
if (open(my $fh, $filename)) {
my #file_stats = stat($fh);
my $bytes_remaining = $file_stats[7];
my $answer = "";
my $buffer_size=1024;
while (1) {
my $bytes_read = read($fh, $answer, $buffer_size);
my #tuples = ($answer =~ /\(.*?\),\s*/g);
print Dumper(\#tuples);
$answer =~ s/.*\)\s*,\s*([^\)]*)$/$1/g;
$bytes_remaining -= $bytes_read;
if ($bytes_remaining < 0) {$bytes_remaining = 0;}
if (($bytes_read == 0) ||($bytes_remaining <= 0)) {
last;
};
};
close($fh);
}

Related

How to use Perl's File::Grep module

I am using the File::Grep module. I have following example:
#!/usr/bin/perl
use strict;
use warnings;
use File::Grep qw( fgrep fmap fdo );
my #matches = fgrep { 1.1.1 } glob "file.csv";
foreach my $str (#matches) {
print "$str\n";
}
But when I try to print $str value it gives me HEX value: GLOB(0xac2e78)
What's wrong with this code?
The documentation doesn't seem to be accurate, but judging from the source-code — http://cpansearch.perl.org/src/MNEYLON/File-Grep-0.02/Grep.pm — the list you get back from fgrep contains one element per file. Each element is a hash of the form
{
filename => $filename,
count => $num_matches_in_that_file,
matches => {
$line_number => $line,
...
}
}
I think it would be simpler to skip fgrep and its complicated return-value that has way more information than you want, in favor of fdo, which lets you just iterate over all lines of a file and do what you want:
fdo { my ( $file, $pos, $line ) = #_;
print $line if $line =~ m/1\.1\.1/;
} 'file.csv';
(Note that I removed the glob, by the way. There's not much point in writing glob "file.csv", since only one file can match that globstring.)
or even just dispense with this module and write:
{
open my $fh, '<', 'file.csv';
while (<$fh>) {
print if m/1\.1\.1/;
}
}
I assume you want to see all the lines in file.csv that contain 1.1.1?
The documentation for File::Grep isn't up to date, but this program will put into #lines all the matching lines from all the files (if there were more than one).
use strict;
use warnings;
use File::Grep qw/ fgrep /;
$File::Grep::SILENT = 0;
my #matches = fgrep { /1\.1\.1/ } 'file.csv';
my #lines = map {
my $matches = $_->{matches};
#{$matches}{ sort { $a <=> $b } keys %$matches};
} #matches;
print for #lines;
Update
The most Perlish way to do this is like so
use strict;
use warnings;
open my $fh, '<', 'file.csv' or die $!;
while (<$fh>) {
print if /1\.1\.1/;
}

perl parsing files for multiple strings

I have been learning perl for the past two weeks. I have been writing some perl scripts for my school project. I need to parse a text file for multiple strings. I searched perl forums and got some information.The below function parses a text file for one string and returns a result. However I need the script to search the file for multiple strings.
use strict;
use warnings;
sub find_string {
my ($file, $string) = #_;
open my $fh, '<', $file;
while (<$fh>) {
return 1 if /\Q$string/;
}
die "Unable to find string: $string";
}
find_string('filename', 'string');
Now for instance if the file contains multiple strings with regular expressions as listed below
"testing"
http://www.yahoo.com =1
http://www.google.com=2
I want the function to search for multiple strings like
find_string('filename', 'string1','string2','string3');
Please can somebody explain me how i need to do that.It would be really helpful
Going through this very quickly here:
You right now pass the name of a file, and one string. What if you pass multiple strings:
if ( find_string ( $file, #strings ) ) {
print "Found a string!\n";
}
else {
print "No string found\n";
}
..
sub find_string {
my $file = shift;
my #strings = #_;
#
# Let's make the strings into a regular expression
#
my $reg_exp = join "|" ,#strings; # Regex is $string1|$string2|$string3...
open my $fh, "<", $file or die qq(Can't open file...);
while ( my $line = <$fh> ) {
chomp $line;
if ( $line =~ $reg_exp ) {
return 1; # Found the string
}
}
return 0; # String not found
}
I am about to go into a meeting, so I haven't really even tested this, but the idea is there. A few things:
You want to handle characters in your strings that could be regular expression characters. You can use either the quotemeta command, or use \Q and \E before and after each string.
Think about using use autodie to handle files that can't be open. Then, you don't have to check your open statement (like I did above).
There are limitations. This would be awful if you were searching for 1,000 different strings, but should be okay with a few.
Note how I use a scalar file handle ($fh). Instead of opening your file via the subroutine, I would pass in a scalar file handle. This would allow you to take care of an invalid file issue in your main program. That's the big advantage of scalar file handles: They can be easily passed to subroutines and stored in class objects.
Tested Program
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
INPUT_FILE => 'test.txt',
};
open my $fh, "<", INPUT_FILE;
my #strings = qw(foo fo+*o bar fubar);
if ( find_string ( $fh, #strings ) ) {
print "Found a string!\n";
}
else {
print "No string found\n";
}
sub find_string {
my $fh = shift; # The file handle
my #strings = #_; # A list of strings to look for
#
# We need to go through each string to make sure there's
# no special re characters
for my $string ( #strings ) {
$string = quotemeta $string;
}
#
# Let's join the stings into one big regular expression
#
my $reg_exp = join '|', #strings; # Regex is $string1|$string2|$string3...
$reg_exp = qr($reg_exp); # This is now a regular expression
while ( my $line = <$fh> ) {
chomp $line;
if ( $line =~ $reg_exp ) {
return 1; # Found the string
}
}
return 0; # String not found
}
autodie handles issues when I can't open a file. No need to check for it.
Notice I have three parameters in my open. This is the preferred way.
My file handle is $fh which allows me to pass it to my find_string subroutine. Open the file in the main program, and I can handle read errors there.
I loop through my #strings and use the quotemeta command to automatically escape special regular expression characters.
Note that when I change $string in my loop, it actually modifies the #strings array.
I use qr to create a regular expression.
My regular expression is /foo|fo\+\*o|bar|fubar/.
There are a few bugs For example, the string fooburberry will match with foo. Do you want that, or do you want your strings to be whole words?
I'm happy to see use strict and use warnings in your script. Here is one basic way to do it.
use strict;
use warnings;
sub find_string {
my ($file, $string1, $string2, $string3) = #_;
my $found1 = 0;
my $found2 = 0;
my $found3 = 0;
open my $fh, '<', $file;
while (<$fh>) {
if ( /$string1/ ) {
$found1 = 1;
}
if ( /$string2/ ) {
$found2 = 1;
}
if ( /$string3/ ) {
$found3 = 1;
}
}
if ( $found1 == 1 and $found2 == 1 and $found3 == 1 ) {
return 1;
} else {
return 0;
}
}
my $result = find_string('filename', 'string1'. 'string2', 'string3');
if ( $result == 1 ) {
print "Found all three strings\n";
} else {
print "Didn't find all three\n";
}
I think you can store the file content in an array first, then grep the input in the array.
use strict;
use warnings;
sub find_multi_string {
my ($file, #strings) = #_;
my $fh;
open ($fh, "<$file");
#store the whole file in an array
my #array = <$fh>;
for my $string (#strings) {
if (grep /$string/, #array) {
next;
} else {
die "Cannot find $string in $file";
}
}
return 1;
}

How can I read from a Perl filehandle that is an array element?

I quickly jotted off a Perl script that would average a few files with just columns of numbers. It involves reading from an array of filehandles. Here is the script:
#!/usr/local/bin/perl
use strict;
use warnings;
use Symbol;
die "Usage: $0 file1 [file2 ...]\n" unless scalar(#ARGV);
my #fhs;
foreach(#ARGV){
my $fh = gensym;
open $fh, $_ or die "Unable to open \"$_\"";
push(#fhs, $fh);
}
while (scalar(#fhs)){
my ($result, $n, $a, $i) = (0,0,0,0);
while ($i <= $#fhs){
if ($a = <$fhs[$i]>){
$result += $a;
$n++;
$i++;
}
else{
$fhs[$i]->close;
splice(#fhs,$i,1);
}
}
if ($n){ print $result/$n . "\n"; }
}
This doesn't work. If I debug the script, after I initialize #fhs it looks like this:
DB<1> x #fhs
0 GLOB(0x10443d80)
-> *Symbol::GEN0
FileHandle({*Symbol::GEN0}) => fileno(6)
1 GLOB(0x10443e60)
-> *Symbol::GEN1
FileHandle({*Symbol::GEN1}) => fileno(7)
So far, so good. But it fails at the part where I try to read from the file:
DB<3> x $fhs[$i]
0 GLOB(0x10443d80)
-> *Symbol::GEN0
FileHandle({*Symbol::GEN0}) => fileno(6)
DB<4> x $a
0 'GLOB(0x10443d80)'
$a is filled with this string rather than something read from the glob. What have I done wrong?
You can only use a simple scalar variable inside <> to read from a filehandle. <$foo> works. <$foo[0]> does not read from a filehandle; it's actually equivalent to glob($foo[0]). You'll have to use the readline builtin, a temporary variable, or use IO::File and OO notation.
$text = readline($foo[0]);
# or
my $fh = $foo[0]; $text = <$fh>;
# or
$text = $foo[0]->getline; # If using IO::File
If you weren't deleting elements from the array inside the loop, you could easily use a temporary variable by changing your while loop to a foreach loop.
Personally, I think using gensym to create filehandles is an ugly hack. You should either use IO::File, or pass an undefined variable to open (which requires at least Perl 5.6.0, but that's almost 10 years old now). (Just say my $fh; instead of my $fh = gensym;, and Perl will automatically create a new filehandle and store it in $fh when you call open.)
If you are willing to use a bit of magic, you can do this very simply:
use strict;
use warnings;
die "Usage: $0 file1 [file2 ...]\n" unless #ARGV;
my $sum = 0;
# The current filehandle is aliased to ARGV
while (<>) {
$sum += $_;
}
continue {
# We have finished a file:
if( eof ARGV ) {
# $. is the current line number.
print $sum/$. , "\n" if $.;
$sum = 0;
# Closing ARGV resets $. because ARGV is
# implicitly reopened for the next file.
close ARGV;
}
}
Unless you are using a very old perl, the messing about with gensym is not necessary. IIRC, perl 5.6 and newer are happy with normal lexical handles: open my $fh, '<', 'foo';
I have trouble understanding your logic. Do you want to read several files, which just contains numbers (one number per line) and print its average?
use strict;
use warnings;
my #fh;
foreach my $f (#ARGV) {
open(my $fh, '<', $f) or die "Cannot open $f: $!";
push #fh, $fh;
}
foreach my $fh (#fh) {
my ($sum, $n) = (0, 0);
while (<$fh>) {
$sum += $_;
$n++;
}
print "$sum / $n: ", $sum / $n, "\n" if $n;
}
Seems like a for loop would work better for you, where you could actually use the standard read (iteration) operator.
for my $fh ( #fhs ) {
while ( defined( my $line = <$fh> )) {
# since we're reading integers we test for *defined*
# so we don't close the file on '0'
#...
}
close $fh;
}
It doesn't look like you want to shortcut the loop at all. Therefore, while seems to be the wrong loop idiom.

How do I push more than one matched groups as same element of array in Perl?

I am need to push all the matched groups into an array.
#!/usr/bin/perl
use strict;
open (FILE, "/home/user/name") || die $!;
my #lines = <FILE>;
close (FILE);
open (FH, ">>/home/user/new") || die $!;
foreach $_(#lines){
if ($_ =~ /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
print FH "$1 $2 $3 $4 $5 $6 $7\n"; #needs to be first element of array
}
elsif ($_ =~ /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/){
print FH "$1 $2 $3 $4 $5 $6\n"; #needs to be second element of array
}
}close (FH);
_ INPUT _
AB_ first--2-45_ Name_ is34_ correct_ OR_ not_W3478.txt
CD_ second_ input_ 89-is_ diffErnt_ 76-from_Wfirst6.txt
Instead of writing matched groups to FILE, I want to push them into array. I can't think of any other command other than push but this function does not accept more than one argument. What is the best way to do the same? The output should look like following after pushing matched groups into array.
_ OUTPUT _
$array[0] = first--2-45 Name is34 correct OR not
$array[1] = second input 89-is diffErnt 76-from
Use the same argument for push that you use for print: A string in double quotes.
push #array, "$1 $2 $3 $4 $5 $6 $7";
Take a look at perldoc -f grep, which returns a list of all elements of a list that match some criterion.
And incidentally, push does take more than one argument: see perldoc -f push.
push #matches, grep { /your regex here/ } #lines;
You didn't include the code leading up to this though.. some of it is a little odd, such as the use of $_ as a function call. Are you sure you want to do that?
If you are using Perl 5.10.1 or later, this is how I would write it.
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1; # or use 5.010;
use autodie;
my #lines = do{
# don't need to check for errors, because of autodie
open( my $file, '<', '/home/user/name' );
grep {chomp} <$file>;
# $file is automatically closed
};
# use 3 arg form of open
open( my $file, '>>', '/home/user/new' );
my #matches;
for( #lines ){
if( /(?:AB|CD)( (?:_[^_]+)+ )_W .+ txt/x ){
my #match = "$1" =~ /_([^_]+)/g;
say {$file} "#match";
push #matches, \#match;
# or
# push #matches, [ "$1" =~ /_([^_]+)/g ];
# if you don't need to print it in this loop.
}
}
close $file;
This is a little bit more permissive of inputs, but the regex should be a little bit more "correct", than the original.
Remember that a capturing match in list context returns the captured fields, if any:
#!/usr/bin/perl
use strict; use warnings;
my $file = '/home/user/name';
open my $in, '<', $file
or die "Cannot open '$file': $!";
my #matched;
while ( <$in> ) {
my #fields;
if (#fields = /AB_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/
or #fields = /CD_(.+)_(.+)_(.+)_(.+)_(.+)_(.+)_W.+txt/)
{
push #matched, "#fields";
}
}
use Data::Dumper;
print Dumper \#matched;
Of course, you could also do
push #matched, \#fields;
depending on what you intend to do with the matches.
I wonder if using push and giant regexes is really the right way to go.
The OP says he wants lines starting with AB at index 0, and those with CD at index 1.
Also, those regexes look like an inside-out split to me.
In the code below I have added some didactic comments that point out why I am doing things differently than the OP and the other solutions offered here.
#!/usr/bin/perl
use strict;
use warnings; # best use warnings too. strict doesn't catch everything
my $filename = "/home/user/name";
# Using 3 argument open addresses some security issues with 2 arg open.
# Lexical filehandles are better than global filehandles, they prevent
# most accidental filehandle name colisions, among other advantages.
# Low precedence or operator helps prevent incorrect binding of die
# with open's args
# Expanded error message is more helpful
open( my $inh, '<', $filename )
or die "Error opening input file '$filename': $!";
my #file_data;
# Process file with a while loop.
# This is VERY important when dealing with large files.
# for will read the whole file into RAM.
# for/foreach is fine for small files.
while( my $line = <$inh> ) {
chmop $line;
# Simple regex captures the data type indicator and the data.
if( $line =~ /(AB|CD)_(.*)_W.+txt/ ) {
# Based on the type indicator we set variables
# used for validation and data access.
my( $index, $required_fields ) = $1 eq 'AB' ? ( 0, 7 )
: $1 eq 'CD' ? ( 1, 6 )
: ();
next unless defined $index;
# Why use a complex regex when a simple split will do the same job?
my #matches = split /_/, $2;
# Here we validate the field count, since split won't check that for us.
unless( #matches == $required_fields ) {
warn "Incorrect field count found in line '$line'\n";
next;
}
# Warn if we have already seen a line with the same data type.
if( defined $file_data[$index] ) {
warn "Overwriting data at index $index: '#{$file[$index]}'\n";
}
# Store the data at the appropriate index.
$file_data[$index] = \#matches;
}
else {
warn "Found non-conformant line: $line\n";
}
}
Be forewarned, I just typed this into the browser window. So, while the code should be correct, there may be typos or missed semicolons lurking--it's untested, use it at your own peril.

Dynamically Change the Key Value based on Delimiter in Perl

I'm reading from a CSV file and populating a Hash based on Key-Value Pairs.
The First Column of the record is the key, and the rest of the record is the value. However, for some file I need to make first 2 columns as Key and the rest of the record is value. I have written it as below based on if loop by checking the number of Key Columns, but I wanted to know if there is any better way to do this?
use strict;
use warnings;
open my $fh, '<:encoding(utf8)', 'Sample.csv'
or die "Couldn't open Sample.csv";
my %hash;
my $KeyCols=2;
while (<$fh>) {
chomp;
if ($KeyCols==1) {
next unless /^(.*?),(.*)$/;
$hash{$1} = $2;
}
elsif ($KeyCols==2) {
next unless /^(.*?),(.*?),(.*)$/;
$hash{$1.$2} = $3;
}
}
Here is one way to allow for any number of key columns (not just 1 or 2), but it uses split instead of a regex:
use warnings;
use strict;
my %hash;
my $KeyCols = 2;
while (<DATA>) {
chomp;
my #cols = split /,/, $_, $KeyCols+1;
next unless #cols > $KeyCols;
my $v = pop #cols;
my $k = join '', #cols;
$hash{$k} = $v;
}
__DATA__
a,b,c,d,e,f
q,w,e,r,t,y
This is a self-contained code example.
A big assumption is that your CSV file does not contain commas in the data itself. You should be using a CSV parser such as Text::CSV anyway.
Perhaps it is better to define variables at first lines of the code -- otherwise you have to jump all over the code.
You can define regex based on your $KeyCols and processing code will be same as before.
use strict;
use warnings;
use feature 'say';
my $KeyCols = 2;
my $fname = 'Sample.csv';
my %hash;
my $re;
if( $KeyCols == 2 ) {
$re = qr/^(.*?,.*?),(.*)$/
} else {
$re = qr/^(.*?),(.*)$/;
}
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while (<$fh>) {
chomp;
next unless /$re/;
$hash{$1} = $2;
}
close $fh;