perl logic request for an IO scenario - perl

I need help in building a logic for below scenario.
I am using Perl to read 2 files which have 4 and 2 elements as below. I need to read each line from File1 and compare if code exists in both files.
If code exists in both files, I need to display all the line elements of file1 along with its description. Can you please share your thoughts?
File1:
Testname, code, date, file
Test1, 4, 4/11/15, /tmp
Test2, 2, 4/11/15, /log
Test3, 1, 4/11/15, /log
File2:
Code: description
1: Generic user error
2: Error with file location
3: File not found
4: Syntax error

Here's a working solution:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
## read file2 first
my $fh2;
open($fh2, '<', 'file2' ) || die("error: could not open file2: $!");
my $header2 = <$fh2>;
chomp($header2);
my $names2 = split(': ',$header2);
my $descriptionMap = {};
while (my $line = <$fh2>) {
chomp($line);
my $fields = [split(': ',$line)];
$descriptionMap->{$fields->[0]} = $fields->[1];
} ## end while
close($fh2);
## run through error list from file1 and print out all error info, plus description from file2 (if found)
my $fh1;
open($fh1, '<', 'file1' ) || die("error: could not open file1: $!");
my $header1 = <$fh1>;
chomp($header1);
my $names1 = [split(', ',$header1)];
while (my $line = <$fh1>) {
## parse
chomp($line);
my $fields = [split(', ',$line)];
my $error = {};
#$error{#$names1} = #$fields; ## hash slice assignment
## get description
my $description = exists($descriptionMap->{$error->{'code'}}) ? $descriptionMap->{$error->{'code'}} : '(not found)';
## output
print($error->{'Testname'},',',$error->{'code'},',',$error->{'date'},',',$error->{'file'},',',$description,"\n");
} ## end while
close($fh1);
exit 0;
Output:
Test1,4,4/11/15,/tmp,Syntax error
Test2,2,4/11/15,/log,Error with file location
Test3,1,4/11/15,/log,Generic user error
It works by first reading in file2 into a hash keyed on the code, whose value is the description. Next, it reads file1, and for each error, prints out its info, with a lookup on the file2 hash to get the description.

If you are using Perl to read two files, you must certainly have code to show that we could help you with. If you haven't written any code, it's not clear why you would land on one specific language as a means to the end. In the absence of code to help you work through, it's probably best to start with a description of how one might solve the problem.
You could start by reading file2 into a data container that makes it easy to do code lookups. If the code numbers are are sparsely distributed, you would probably use a hash where the keys are the code number, and the values are the codes' descriptions. If the numbers are densely distributed, you might use an array as your structure, where the index represents the code number, and the value is the code's description. Since we don't know the answer to that question, we should probably explore a general solution that would work satisfactorily in either case; a hash is advisable.
The next step would be to iterate over the lines in file1, extract the code from the line using either a regular expression, or split, or Text::CSV, and then using that code to perform the lookup in your code reference hash.
Here is a sample implementation of that strategy:
open my $f2, '<', 'file2' or die $!;
my %codes = map { m/^(\d+):\s*(.+)$/ ? ($1,$2) : () } <$f2>;
open my $f1, '<', 'file1' or die $!;
while( <$f1> ) {
next unless m/^[^,]+,\s*(\d+),/;
exists $codes{$1} && do {
chomp;
print "$_ => $codes{$1}\n";
};
}
So in the first segment of that code we open file2, and extract its lines into a hash called %codes. The keys in the hash are the numeric codes, and the values are the textual descriptions.
In the second segment we iterate over the lines of file1, again extracting the code. We use that code to do a lookup of the code's meaning. If the lookup fails to find meaning, we move on. Otherwise, we go ahead and print the line from file1 and the code's meaning from file2.
The output this produces is:
Test1, 4, 4/11/15, /tmp => Syntax error
Test2, 2, 4/11/15, /log => Error with file location
Test3, 1, 4/11/15, /log => Generic user error
It seems to me it would probably be better to print something even if the code isn't found. If that's preferable to you, you could alter that second block of code to look like this:
open my $f1, '<', 'file1' or die $!;
while( <$f1> ) {
next unless m/^[^,]+,\s*(\d+),/;
chomp;
print "$_ => ", exists $codes{$1} ? "$codes{$1}\n" : "UNKNOWN CODE\n"
}
The two snippets of code differ in that the first one will skip any lines in file1 where a matching code is not found in file2. This seems to fit the description of the goal provided in your question. The second snippet prints every line from file1 that has a code, and for those lines where a matching code isn't found from file2, the response "UNKNOWN CODE" is printed in lieu of a code description.

Related

Save contents of those files which contain a specific known string in an single .txt or .tmp file using perl

I'm trying to write a perl script where I'm trying to save whole contents of those files which contain a specific string 'PYAG_GENERATED', in a single .txt/.tmp file one after another. These file names are in a specific pattern and this pattern is 'output_nnnn.txt' where nnnn is 0001,0002 and so on. But I don't know how many number of files are present with this 'output_nnnn.txt' name.
I'm new in perl and I don't know how I can resolve this issue to get the output correctly. Can anyone help me. Thanks in advance.
I've tried to write perl script in different ways but nothing is coming in output file. I'm giving here one of those I've tried. 'new_1.txt' is the new file where I want to save the expected output and "PYAG_GENERATED" is that specific string I'm finding for in the files.
open(NEW,">>new_1.txt") or die "could not open:$!";
$find2="PYAG_GENERATED";
$n='0001';
while('output_$n.txt'){
if(/find2/){
print NEW;
}
$n++;
}
close NEW;
I expect that the output file 'new_1.txt' will save the whole contents of the the files(with filename pattern 'output_nnnn.txt') which have 'PYAG_GENERATED' string at least once inside.
Well, you tried I guess.
Welcome to the wonderful world of Perl where there are always a dozen ways of doing X :-) One possible way to achieve what you want. I put in a lot of comments I hope are helpful. It's also a bit verbose for the sake of clarity. I'm sure it could be golfed down to 5 lines of code.
use warnings; # Always start your Perl code with these two lines,
use strict; # and Perl will tell you about possible mistakes
use experimental 'signatures';
use File::Slurp;
# this is a subroutine/function, a block of code that can be called from
# somewhere else. it takes to arguments, that the caller must provide
sub find_in_file( $filename, $what_to_look_for )
{
# the open function opens $filename for reading
# (that's what the "<" means, ">" stands for writing)
# if successfull open will return we will have a "file handle" in the variable $in
# if not open will return false ...
open( my $in, "<", $filename )
or die $!; # ... and the program will exit here. The variable $! will contain the error message
# now we read the file using a loop
# readline will give us the next line in the file
# or something false when there is nothing left to read
while ( my $line = readline($in) )
{
# now we test wether the current line contains what
# we are looking for.
# the index function gives us the index of a string within another string.
# for example index("abc", "c") will give us 3
if ( index( $line, $what_to_look_for ) > 0 )
{
# we found what we were looking for
# so we don't need to keep looking in this file anymore
# so we must first close the file
close( $in );
# and then we indicate to the caller the search was a successfull
# this will immedeatly end the subroutine
return 1;
}
}
# If we arrive here the search was unsuccessful
# so we tell that to the caller
return 0;
}
# Here starts the main program
# First we get a list of files
# we want to look at
my #possible_files = glob( "where/your/files/are/output_*.txt" );
# Here we will store the files that we are interested in, aka that contain PYAG_GENERATED
my #wanted_files;
# and now we can loop over the files and see if they contain what we are looking for
foreach my $filename ( #possible_files )
{
# here we use the function we defined earlier
if ( find_in_file( $filename, "PYAG_GENERATED" ) )
{
# with push we can add things to the end of an array
push #wanted_files, $filename;
}
}
# We are finished searching, now we can start adding the files together
# if we found any
if ( scalar #wanted_files > 0 )
{
# Now we could code that us ourselves, open the files, loop trough them and write out
# line by line. But we make life easy for us and just
# use two functions from the module File::Slurp, which comes with Perl I believe
# If not you have to install it
foreach my $filename ( #wanted_files )
{
append_file( "new_1.txt", read_file( $filename ) );
}
print "Output created from " . (scalar #wanted_files) . " files\n";
}
else
{
print "No input files\n";
}
use strict;
use warnings;
my #a;
my $i=1;
my $find1="PYAG_GENERATED";
my $n=1;
my $total_files=47276; #got this no. of files by writing 'ls' command in the terminal
while($n<=$total_files){
open(NEW,"<output_$n.txt") or die "could not open:$!";
my $join=join('',<NEW>);
$a[$i]=$join;
#print "$a[10]";
$n++;
$i++;
}
close NEW;
for($i=1;$i<=$total_files;$i++){
if($a[$i]=~m/$find1/){
open(NEW1,">>new_1.tmp") or die "could not open:$!";
print NEW1 $a[$i];
}
}
close NEW1;

How to extract unique fields from a CSV file using a Perl script

I have a CSV file with data that looks similar to this:
alpha,a,foo,bar
alpha,b,foo,bar
alpha,c,foo,bar
beta,d,foo,bar
beta,e,foo,bar
I'm able to use the following code to successfully create two new files using the data:
open (my $FH, '<', '/home/<username>/inputs.csv') || die "ERROR Cannot read file\n";
while (my $line = <$FH>) {
chomp $line;
my #fields = split "," , $line;
my $file = "ziggy.$fields[0]";
open (my $FH2, '>>', $file) || die "ERROR Cannot open file\n";
print $FH2 "$fields[1]\n";
print $FH2 "$fields[2]\n";
print $FH2 "$fields[3]\n\n";
close $FH2;
}
Basically, this code reads through the rows in the CSV file and creates content in files that are named based on the first field. So, the "ziggy.alpha" file has nine lines of content, while the "ziggy.beta" file has six lines of content. Note that I'm appending data to these files as the rows are being read via the "while" loop.
My challenge:
Following the data set example cited, I need to create a second pair of files that use the same "first field" naming convention (something like "zaggy.alpha" and "zaggy.beta"). The files will only be created once with static content written to them, and will not have additional data appended to them from the CSV file.
My question:
Is there a way to identify the unique values in the first field ("alpha" and "beta"), store them in a hash, then reference them in a "while" loop in order to create my second set of files while the inputs.csv file is open?
Thanks in advance for any insight that can be provided!
In perl you can a get a list of keys from an associative array like:
my #keys = keys %hash;
So something like this will work;
my %unique_first_values;
Then later in the loop.
$my_unique_first_values{$fields[0]} = 1;
You can then call 'keys' on the hash to get the unique values.
#unique = keys %my_unique_virst_values;
In order to "create my second set of files while the inputs.csv file is open" you're going to want to know if you've seen a value before.
The conventional way to do this in Perl is to create a hash to store previously-seen values, and check-then-set in order to determine whether you've seen it, record that it has been seen, and go on.
if (exists($seen_before{$key})) {
# seen it
}
else {
# new key!
$seen_before{$key} = 1;
}
Given that you're going to be opening files and appending data, it might make sense to store a file handle in the hash instead of a 1. That way, your # new key! code could just be opening the file, and your # seen it code could be a default condition (fall-through) writing the fields out. Something like this:
unless (exists($file_handle{$key})) {
$file_handle{$key} = open ... or die ...
}
# now we know it's in the hash, write the data:
print $file_handle{$key} ...

Remove duplicate lines on file by substring - preserve order (PERL)

i m trying to write a perl script to deal with some 3+ gb text files, that are structured like :
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations :
Count the number of delimiters per line and compare it to a static number (ie 5), those lines that exceed said number should be output to a file.control.
Remove duplicates on the file by substring($line, 0, 7) - first 7 numbers, but i want to preserve order. I want the output of that in a file.output.
I have coded this in simple shell script (just bash), but it took too long to process, the same script calling on perl one liners was quicker, but i m interested in a way to do this purely in perl.
The code i have so far is :
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And i want the file.output file to look like
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits : Tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I dont know if i m supposed to post new code in here. But i tried the above, based on your example. What baffles me (i m still very new in perl) is that it doesnt output to either filehandle, but if i redirected from the command line just as you said, it worked perfect. The problem is that i need to output into 2 different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Building indexes for files in Perl

I'm currently new to Perl, and I've stumbled upon a problem :
My task is to create a simple way to access a line of a big file in Perl, the fastest way possible.
I created a file consisting of 5 million lines with, on each line, the number of the line.
I've then created my main program that will need to be able to print any content of a given line.
To do this, I'm using two methods I've found on the internet :
use Config qw( %Config );
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
my $file = "testfile.err";
open(FILE, "< $file") or die "Can't open $file for reading: $!\n";
open(INDEX, "+>$file.idx")
or die "Can't open $file.idx for read/write: $!\n";
build_index(*FILE, *INDEX);
my $line = line_with_index(*FILE, *INDEX, 129);
print "$line";
sub build_index {
my $data_file = shift;
my $index_file = shift;
my $offset = 0;
while (<$data_file>) {
print $index_file pack($off_t, $offset);
$offset = tell($data_file);
}
}
sub line_with_index {
my $data_file = shift;
my $index_file = shift;
my $line_number = shift;
my $size; # size of an index entry
my $i_offset; # offset into the index of the entry
my $entry; # index entry
my $d_offset; # offset into the data file
$size = length(pack($off_t, 0));
$i_offset = $size * ($line_number-1);
seek($index_file, $i_offset, 0) or return;
read($index_file, $entry, $size);
$d_offset = unpack($off_t, $entry);
seek($data_file, $d_offset, 0);
return scalar(<$data_file>);
}
Those methods sometimes work, I get a value once out of ten tries on different set of values, but most of the time I get "Used of uninitialized value $line in string at test2.pl line 10" (when looking for line 566 in the file) or not the right numeric value. Moreover, the indexing seems to work fine on the first two hundred or so lines, but afterwards I get the error. I really don't know what I'm doing wrong..
I know you can use a basic loop that will parse each line, but I really need a way of accessing, at any given time, one line of a file without reparsing it all over again.
Edit : I've tried using a little tip found here : Reading a particular line by line number in a very large file
I've replaced the "N" template for pack with :
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
It makes the process work better, until line 128, where instead of getting 128 , I get a blank string. For 129, I get 3, which doesn't mean much..
Edit2 : Basically what I need is a mechanism that enables me to read the next 2 lines for instance for a file that is already being read, while keeping the read "head" at the current line (and not 2 lines after).
Thanks for your help !
Since you are writing binary data to the index file, you need to set the filehandle to binary mode, especially if you are in Windows:
open(INDEX, "+>$file.idx")
or die "Can't open $file.idx for read/write: $!\n";
binmode(INDEX);
Right now, when you perform something like this in Windows:
print $index_file pack("j", $offset);
Perl will convert any 0x0a's in the packed string to 0x0d0a's. Setting the filehandle to binmode will make sure line feeds are not converted to carriage return-line feeds.

Perl: How to add a line to sorted text file

I want to add a line to the text file in perl which has data in a sorted form. I have seen examples which show how to append data at the end of the file, but since I want the data in a sorted format.
Please guide me how can it be done.
Basically from what I have tried so far :
(I open a file, grep its content to see if the line which I want to add to the file already exists. If it does than exit else add it to the file (such that the data remains in a sorted format)
open(my $FH, $file) or die "Failed to open file $file \n";
#file_data = <$FH>;
close($FH);
my $line = grep (/$string1/, #file_data);
if($line) {
print "Found\n";
exit(1);
}
else
{
#add the line to the file
print "Not found!\n";
}
Here's an approach using Tie::File so that you can easily treat the file as an array, and List::BinarySearch's bsearch_str_pos function to quickly find the insert point. Once you've found the insert point, you check to see if the element at that point is equal to your insert string. If it's not, splice it into the array. If it is equal, don't splice it in. And finish up with untie so that the file gets closed cleanly.
use strict;
use warnings;
use Tie::File;
use List::BinarySearch qw(bsearch_str_pos);
my $insert_string = 'Whatever!';
my $file = 'something.txt';
my #array;
tie #array, 'Tie::File', $file or die $!;
my $idx = bsearch_str_pos $insert_string, #array;
splice #array, $idx, 0, $insert_string
if $array[$idx] ne $insert_string;
untie #array;
The bsearch_str_pos function from List::BinarySearch is an adaptation of a binary search implementation from Mastering Algorithms with Perl. Its convenient characteristic is that if the search string isn't found, it returns the index point where it could be inserted while maintaining the sort order.
Since you have to read the contents of the text file anyway, how about a different approach?
Read the lines in the file one-by-one, comparing against your target string. If you read a line equal to the target string, then you don't have to do anything.
Otherwise, you eventually read a line 'greater' than your current line according to your sort criteria, or you hit the end of the file. In the former case, you just insert the string at that position, and then copy the rest of the lines. In the latter case, you append the string to the end.
If you don't want to do it that way, you can do a binary search in #file_data to find the spot to add the line without having to examine all of the entries, then insert it into the array before outputting the array to the file.
Here's a simple version that reads from stdin (or filename(s) specified on command line) and appends 'string to append' to the output if it's not found in the input. Outuput is printed on stdout.
#! /usr/bin/perl
$found = 0;
$append='string to append';
while(<>) {
$found = 1 if (m/$append/o);
print
}
print "$append\n" unless ($found);;
Modifying it to edit a file in-place (with perl -i) and taking the append string from the command line would be quite simple.
A 'simple' one-liner to insert a line without using any module could be:
perl -ni -le '$insert="lemon"; $eq=($insert cmp $_); if ($eq == 0){$found++}elsif($eq==-1 && !$found){print$insert} print'
giver a list.txt whose context is:
ananas
apple
banana
pear
the output is:
ananas
apple
banana
lemon
pear
{
local ($^I, #ARGV) = ("", $file); # Enable in-place editing of $file
while (<>) {
# If we found the line exactly, bail out without printing it twice
last if $_ eq $insert;
# If we found the place where the line should be, insert it
if ($_ gt $insert) {
print $insert;
print;
last;
}
print;
}
# We've passed the insertion point, now output the rest of the file
print while <>;
}
Essentially the same answer as pavel's, except with a lot of readability added. Note that $insert should already contain a trailing newline.