Building indexes for files in Perl

Building indexes for files in Perl - perl

I'm currently new to Perl, and I've stumbled upon a problem :
My task is to create a simple way to access a line of a big file in Perl, the fastest way possible.
I created a file consisting of 5 million lines with, on each line, the number of the line.
I've then created my main program that will need to be able to print any content of a given line.
To do this, I'm using two methods I've found on the internet :
use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
my $file = "testfile.err";
open(FILE, "< $file") or die "Can't open $file for reading: $!\n";
open(INDEX, "+>$file.idx")
or die "Can't open $file.idx for read/write: $!\n";
build_index(*FILE, *INDEX);
my $line = line_with_index(*FILE, *INDEX, 129);
print "$line";
sub build_index {
my $data_file = shift;
my $index_file = shift;
my $offset = 0;
while (<$data_file>) {
print $index_file pack($off_t, $offset);
$offset = tell($data_file);
}
}
sub line_with_index {
my $data_file = shift;
my $index_file = shift;
my $line_number = shift;
my $size; # size of an index entry
my $i_offset; # offset into the index of the entry
my $entry; # index entry
my $d_offset; # offset into the data file
$size = length(pack($off_t, 0));
$i_offset = $size * ($line_number-1);
seek($index_file, $i_offset, 0) or return;
read($index_file, $entry, $size);
$d_offset = unpack($off_t, $entry);
seek($data_file, $d_offset, 0);
return scalar(<$data_file>);
}
Those methods sometimes work, I get a value once out of ten tries on different set of values, but most of the time I get "Used of uninitialized value $line in string at test2.pl line 10" (when looking for line 566 in the file) or not the right numeric value. Moreover, the indexing seems to work fine on the first two hundred or so lines, but afterwards I get the error. I really don't know what I'm doing wrong..
I know you can use a basic loop that will parse each line, but I really need a way of accessing, at any given time, one line of a file without reparsing it all over again.
Edit : I've tried using a little tip found here : Reading a particular line by line number in a very large file
I've replaced the "N" template for pack with :
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
It makes the process work better, until line 128, where instead of getting 128 , I get a blank string. For 129, I get 3, which doesn't mean much..
Edit2 : Basically what I need is a mechanism that enables me to read the next 2 lines for instance for a file that is already being read, while keeping the read "head" at the current line (and not 2 lines after).
Thanks for your help !

Since you are writing binary data to the index file, you need to set the filehandle to binary mode, especially if you are in Windows:
open(INDEX, "+>$file.idx")
or die "Can't open $file.idx for read/write: $!\n";
binmode(INDEX);
Right now, when you perform something like this in Windows:
print $index_file pack("j", $offset);
Perl will convert any 0x0a's in the packed string to 0x0d0a's. Setting the filehandle to binmode will make sure line feeds are not converted to carriage return-line feeds.

Related

Program argument is 100 but returns the value as 0100

Right now I am trying to do an assignment where I have to
- Extract information from an HTML file
- Save it to a scalar
- Run a regular expression to find the number of seats available in the designated course (the program argument is the course number for example 100 for ICS 100)
- If the course has multiple sessions, I have to find the sum of the seats available and print
- The output is just the number of seats available
The problem here is that when I was debugging and checking to make sure that my variable I have the program arg saved to was storing the correct value, it was storing the values with an extra 0 behind it.
ex.) perl filename.pl 100
ARGV[0] returns as 0100
I've tried storing the True regular expression values to an array, saving using multiple scalar variables, and changing my regular expression but none worked.
die "Usage: perl NameHere_seats.pl course_number" if (#ARGV < 1);
# This variable will store the .html file contents
my $fileContents;
# This variable will store the sum of seats available in the array #seatAvailable
my $sum = 0;
# This variable will store the program argument
my $courseNum = $ARGV[0];
# Open the file to read contents all at once
open (my $fh, "<", "fa19_ics_class_availability.html") or die ("Couldn't open 'fa19_ics_class_availability.html'\n");
# use naked brakets to limit the $/
{
#use local $/ to get <$fh> to read the whole file, and not one line
local $/;
$fileContents = <$fh>;
}
# Close the file handle
close $fh;
# Uncomment the line below to check if you've successfully extracted the text
# print $fileContents;
# Check if the course exists
die "No courses matched...\n" if ($ARGV[0] !~ m/\b(1[0-9]{2}[A-Z]?|2[0-8][0-9][A-Z]?|29[0-3])[A-Z]?\b/);
while ($fileContents =~ m/$courseNum(.+?)align="center">(\d)</) {
my $num = $2;
$sum = $sum + $num;
}
print $sum;
# Use this line as error checking to make sure #ARGV[0] is storing proper number
print $courseNum;
The current output I am receiving when program argument is 100 is just 0, and I assume it's because the regular expression is not catching any values as true therefore the sum remains at a value of 0. The output should be 15...
This is a link to the .html page > https://laulima.hawaii.edu/access/content/user/emeyer/ics/215/FA19/01/perl/fa19_ics_class_availability.html

You're getting "0100" because you have two print() statements.
print $sum;
...
print $courseNum;
And because there are no newlines or other output between them, you get the two values printed out next to each other. $sum is '0' and $courseNum is '100'.
So why is $sum zero? Well, that's because your regex isn't picking up the data you want it to match. Your regex looks like this:
m/$courseNum(.+?)align="center">(\d)</
You're looking for $courseNum followed by a number of other characters, followed by 'align="center">' and then your digit. This doesn't work for a number of reasons.
The string "100" appears many times in your text. Many times it doesn't even mean a course number (e.g. "100%"). Perhaps you should look for something more precise (ICS $coursenum).
The .+? doesn't do what you think it does. The dot doesn't match newline characters unless you use the /s option on the match operator.
But even if you fix those first two problems, it still won't work as there are a number of numeric table cells for each course and you're doing nothing to ensure that you're grabbing the last one. Your current code will get the "Curr. Enrolled" column, not the "Seats Avail" one.
This is a non-trivial HTML parsing problem. It shouldn't be addressed using regexes (HTML should never be parsed using regexes). You should look at one of the HTML parsing modules from CPAN - I think I'd use Web::Query.
Update: An example solution using Web::Query:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Basename;
use Web::Query;
my $course_num = shift
or die 'Usage: perl ' . basename $0 . " course_number\n";
my $source = 'fa19_ics_class_availability.html';
open my $fh, '<', $source
or die "Cannot open '$source': $!\n";
my $html = do { local $/; <$fh> };
my $count_free;
wq($html)
# Get each table row in the table
->find('table.listOfClasses tr')
->each(sub {
my ($i, $elem) = #_;
my #tds;
# Get each <td> in the <tr>
$elem->find('td')->each(sub { push #tds, $_[1] });
# Ignore rows that don't have 13 columns
return if #tds != 13;
# Ignore rows that aren't about the right course
return if $tds[2]->text ne "ICS $course_num";
# Add the number of available places
$count_free += $tds[8]->text;
});
say $count_free;

Save contents of those files which contain a specific known string in an single .txt or .tmp file using perl

I'm trying to write a perl script where I'm trying to save whole contents of those files which contain a specific string 'PYAG_GENERATED', in a single .txt/.tmp file one after another. These file names are in a specific pattern and this pattern is 'output_nnnn.txt' where nnnn is 0001,0002 and so on. But I don't know how many number of files are present with this 'output_nnnn.txt' name.
I'm new in perl and I don't know how I can resolve this issue to get the output correctly. Can anyone help me. Thanks in advance.
I've tried to write perl script in different ways but nothing is coming in output file. I'm giving here one of those I've tried. 'new_1.txt' is the new file where I want to save the expected output and "PYAG_GENERATED" is that specific string I'm finding for in the files.
open(NEW,">>new_1.txt") or die "could not open:$!";
$find2="PYAG_GENERATED";
$n='0001';
while('output_$n.txt'){
if(/find2/){
print NEW;
}
$n++;
}
close NEW;
I expect that the output file 'new_1.txt' will save the whole contents of the the files(with filename pattern 'output_nnnn.txt') which have 'PYAG_GENERATED' string at least once inside.

Well, you tried I guess.
Welcome to the wonderful world of Perl where there are always a dozen ways of doing X :-) One possible way to achieve what you want. I put in a lot of comments I hope are helpful. It's also a bit verbose for the sake of clarity. I'm sure it could be golfed down to 5 lines of code.
use warnings; # Always start your Perl code with these two lines,
use strict; # and Perl will tell you about possible mistakes
use experimental 'signatures';
use File::Slurp;
# this is a subroutine/function, a block of code that can be called from
# somewhere else. it takes to arguments, that the caller must provide
sub find_in_file( $filename, $what_to_look_for )
{
# the open function opens $filename for reading
# (that's what the "<" means, ">" stands for writing)
# if successfull open will return we will have a "file handle" in the variable $in
# if not open will return false ...
open( my $in, "<", $filename )
or die $!; # ... and the program will exit here. The variable $! will contain the error message
# now we read the file using a loop
# readline will give us the next line in the file
# or something false when there is nothing left to read
while ( my $line = readline($in) )
{
# now we test wether the current line contains what
# we are looking for.
# the index function gives us the index of a string within another string.
# for example index("abc", "c") will give us 3
if ( index( $line, $what_to_look_for ) > 0 )
{
# we found what we were looking for
# so we don't need to keep looking in this file anymore
# so we must first close the file
close( $in );
# and then we indicate to the caller the search was a successfull
# this will immedeatly end the subroutine
return 1;
}
}
# If we arrive here the search was unsuccessful
# so we tell that to the caller
return 0;
}
# Here starts the main program
# First we get a list of files
# we want to look at
my #possible_files = glob( "where/your/files/are/output_*.txt" );
# Here we will store the files that we are interested in, aka that contain PYAG_GENERATED
my #wanted_files;
# and now we can loop over the files and see if they contain what we are looking for
foreach my $filename ( #possible_files )
{
# here we use the function we defined earlier
if ( find_in_file( $filename, "PYAG_GENERATED" ) )
{
# with push we can add things to the end of an array
push #wanted_files, $filename;
}
}
# We are finished searching, now we can start adding the files together
# if we found any
if ( scalar #wanted_files > 0 )
{
# Now we could code that us ourselves, open the files, loop trough them and write out
# line by line. But we make life easy for us and just
# use two functions from the module File::Slurp, which comes with Perl I believe
# If not you have to install it
foreach my $filename ( #wanted_files )
{
append_file( "new_1.txt", read_file( $filename ) );
}
print "Output created from " . (scalar #wanted_files) . " files\n";
}
else
{
print "No input files\n";
}

use strict;
use warnings;
my #a;
my $i=1;
my $find1="PYAG_GENERATED";
my $n=1;
my $total_files=47276; #got this no. of files by writing 'ls' command in the terminal
while($n<=$total_files){
open(NEW,"<output_$n.txt") or die "could not open:$!";
my $join=join('',<NEW>);
$a[$i]=$join;
#print "$a[10]";
$n++;
$i++;
}
close NEW;
for($i=1;$i<=$total_files;$i++){
if($a[$i]=~m/$find1/){
open(NEW1,">>new_1.tmp") or die "could not open:$!";
print NEW1 $a[$i];
}
}
close NEW1;

Remove duplicate lines on file by substring - preserve order (PERL)

i m trying to write a perl script to deal with some 3+ gb text files, that are structured like :
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations :
Count the number of delimiters per line and compare it to a static number (ie 5), those lines that exceed said number should be output to a file.control.
Remove duplicates on the file by substring($line, 0, 7) - first 7 numbers, but i want to preserve order. I want the output of that in a file.output.
I have coded this in simple shell script (just bash), but it took too long to process, the same script calling on perl one liners was quicker, but i m interested in a way to do this purely in perl.
The code i have so far is :
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And i want the file.output file to look like
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits : Tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I dont know if i m supposed to post new code in here. But i tried the above, based on your example. What baffles me (i m still very new in perl) is that it doesnt output to either filehandle, but if i redirected from the command line just as you said, it worked perfect. The problem is that i need to output into 2 different files.

It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

perl logic request for an IO scenario

I need help in building a logic for below scenario.
I am using Perl to read 2 files which have 4 and 2 elements as below. I need to read each line from File1 and compare if code exists in both files.
If code exists in both files, I need to display all the line elements of file1 along with its description. Can you please share your thoughts?
File1:
Testname, code, date, file
Test1, 4, 4/11/15, /tmp
Test2, 2, 4/11/15, /log
Test3, 1, 4/11/15, /log
File2:
Code: description
1: Generic user error
2: Error with file location
3: File not found
4: Syntax error

Here's a working solution:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
## read file2 first
my $fh2;
open($fh2, '<', 'file2' ) || die("error: could not open file2: $!");
my $header2 = <$fh2>;
chomp($header2);
my $names2 = split(': ',$header2);
my $descriptionMap = {};
while (my $line = <$fh2>) {
chomp($line);
my $fields = [split(': ',$line)];
$descriptionMap->{$fields->[0]} = $fields->[1];
} ## end while
close($fh2);
## run through error list from file1 and print out all error info, plus description from file2 (if found)
my $fh1;
open($fh1, '<', 'file1' ) || die("error: could not open file1: $!");
my $header1 = <$fh1>;
chomp($header1);
my $names1 = [split(', ',$header1)];
while (my $line = <$fh1>) {
## parse
chomp($line);
my $fields = [split(', ',$line)];
my $error = {};
#$error{#$names1} = #$fields; ## hash slice assignment
## get description
my $description = exists($descriptionMap->{$error->{'code'}}) ? $descriptionMap->{$error->{'code'}} : '(not found)';
## output
print($error->{'Testname'},',',$error->{'code'},',',$error->{'date'},',',$error->{'file'},',',$description,"\n");
} ## end while
close($fh1);
exit 0;
Output:
Test1,4,4/11/15,/tmp,Syntax error
Test2,2,4/11/15,/log,Error with file location
Test3,1,4/11/15,/log,Generic user error
It works by first reading in file2 into a hash keyed on the code, whose value is the description. Next, it reads file1, and for each error, prints out its info, with a lookup on the file2 hash to get the description.

If you are using Perl to read two files, you must certainly have code to show that we could help you with. If you haven't written any code, it's not clear why you would land on one specific language as a means to the end. In the absence of code to help you work through, it's probably best to start with a description of how one might solve the problem.
You could start by reading file2 into a data container that makes it easy to do code lookups. If the code numbers are are sparsely distributed, you would probably use a hash where the keys are the code number, and the values are the codes' descriptions. If the numbers are densely distributed, you might use an array as your structure, where the index represents the code number, and the value is the code's description. Since we don't know the answer to that question, we should probably explore a general solution that would work satisfactorily in either case; a hash is advisable.
The next step would be to iterate over the lines in file1, extract the code from the line using either a regular expression, or split, or Text::CSV, and then using that code to perform the lookup in your code reference hash.
Here is a sample implementation of that strategy:
open my $f2, '<', 'file2' or die $!;
my %codes = map { m/^(\d+):\s*(.+)$/ ? ($1,$2) : () } <$f2>;
open my $f1, '<', 'file1' or die $!;
while( <$f1> ) {
next unless m/^[^,]+,\s*(\d+),/;
exists $codes{$1} && do {
chomp;
print "$_ => $codes{$1}\n";
};
}
So in the first segment of that code we open file2, and extract its lines into a hash called %codes. The keys in the hash are the numeric codes, and the values are the textual descriptions.
In the second segment we iterate over the lines of file1, again extracting the code. We use that code to do a lookup of the code's meaning. If the lookup fails to find meaning, we move on. Otherwise, we go ahead and print the line from file1 and the code's meaning from file2.
The output this produces is:
Test1, 4, 4/11/15, /tmp => Syntax error
Test2, 2, 4/11/15, /log => Error with file location
Test3, 1, 4/11/15, /log => Generic user error
It seems to me it would probably be better to print something even if the code isn't found. If that's preferable to you, you could alter that second block of code to look like this:
open my $f1, '<', 'file1' or die $!;
while( <$f1> ) {
next unless m/^[^,]+,\s*(\d+),/;
chomp;
print "$_ => ", exists $codes{$1} ? "$codes{$1}\n" : "UNKNOWN CODE\n"
}
The two snippets of code differ in that the first one will skip any lines in file1 where a matching code is not found in file2. This seems to fit the description of the goal provided in your question. The second snippet prints every line from file1 that has a code, and for those lines where a matching code isn't found from file2, the response "UNKNOWN CODE" is printed in lieu of a code description.

Dynamic Loop outputs the same on each iteration

I am attempting to write a script to automate some data collection. Initially the script runs a series of commands which are carried out by the system. The output of these commands is stored in two text files. Following data collection, I am attempting to implement a for loop so that a third output file is generated which lists the value of interest from the first line of output 1 and the second line of output one, as well as the relative error. The following code completes the correct number of times, but returns the same values on all four lines. I suspect this has to do with the filehandler variable, but am unsure how to solve the issue.
for($ln = 1; $ln<5;$ln++){
open($fh, '<',"theoretical.dat",<$ln>)
or die "Could not open file 'theoretical.dat' $!";
#line = split(' ',<$fh>);
$v = $line[3];
open($fh2, '<',"actual.dat",<$ln>)
or die "Could not open file 'actual.dat' $!";
#line = split(' ',<$fh2>);
$v0 = $line[3];
$e = abs(($v0-$v)/$v0);
$rms = $rms + $e^2;
my #result = ($v, $v0, $e);
print "#result \n";
}
The output file code has been omitted. It contains an if/else depending upon if output should be piped into results.dat or appended.
Note that the data in question is stored in as 4 numbers per line, only the fourth of which I wish to access with this script. From the output generated it seems that $ln is changing accordingly after each iteration, but the line being read is not despite the argument within the open command which dictates to read line number $ln.
I have tried undefing $fh and $fh2 after each loop, but it still outputs the same.

You can't specify the line number of a file on the open call. In fact reopening a file will cause it to be read again starting from the top.
Without seeing your data files I can't be sure, but I think you want something like this.
Note that you can use autodie instead of coding an explicit test for an open succeeding. You must also use strict and use warnings a the top of every Perl program, and declare all of your variables using my as close as possible to their first point of use. I have declared $rms outside the loop here so that it can accumulate an aggregate sum of squares instead of being destroyed and recreated each time around the loop.
use strict;
use warnings;
use autodie;
open my $theo_fh, '<', 'theoretical.dat';
open my $act_fh, '<', 'actual.dat';
my $rms;
for my $ln (1 .. 5) {
my $v_theo = (split ' ', <$theo_fh>)[3];
my $v_act = (split ' ', <$act_fh>)[3];
my $e = abs(($v_act - $v_theo) / $v_act);
my $rms = $rms + $e ^ 2;
my #result = ($v_theo, $v_act, $e);
print "#result\n";
}