Best way to prevent output of a duplicate item in Perl in realtime during a loop

I see a lot of 'related' questions showing up, but none I looked at answer this specific scenario.
During a while/for loop that parses a result set generated from a SQL select statement, what is the best way to prevent a line from being printed if the line before it contains the same field data (whether it be the 1st field or the xth field)?
For example, if two rows were:
('EML-E','jsmith@mail.com','John','Smith')
('EML-E','jsmith2@mail.com','John','Smith')
What is the best way to print only the first row based on the fact that 'EML-E' is the same in both rows?
Right now, I'm doing this:
Storing the first field (specific to my scenario) into a 2-element array ($dupecatch[1])
Checking if $dupecatch[0] equals $dupecatch[1] (duplicate - escape the loop using $s)
After the row is processed, setting $dupecatch[0] = $dupecatch[1]
while ($DBS->SQLFetch() == *PLibdata::RET_OK)
{
    $s = 0;   # s = 1 to escape out of inside loop
    while ($i != $array_len and $s == 0)
    {
        $rowfetch = $DBS->{Row}->GetCharValue($array_col[$i]);
        if ($i == 0) { $dupecatch[1] = $rowfetch; }   # dupecatch prevents duplicate primary key field entries
        if ($dupecatch[0] ne $dupecatch[1])
        {
            dosomething($rowfetch);
        }
        else { $s++; }
        $i++;
    }
    $i = 0;
    $dupecatch[0] = $dupecatch[1];
}

That is the standard way if you only care about duplicates in adjacent rows, but $dupecatch[0] is normally named $old and $dupecatch[1] is normally just the variable in question. You can tell the array is not a good fit because you only ever refer to its indices individually.
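For illustration, a minimal sketch of that adjacent-duplicate pattern, reusing the question's dosomething() and the same placeholder get_data() fetch as the %seen example below:
my $old = '';
while (defined (my $row = get_data())) {
    next if $row->[0] eq $old;   # same key as the previous row, so skip it
    $old = $row->[0];            # remember the key we just accepted
    dosomething($row);           # only the first row of each run gets here
}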
If you want to avoid all duplicates you can use a %seen hash:
my %seen;
while (defined (my $row = get_data())) {
    next if $seen{$row->[0]}++;   # skip all but the first instance of the key
    do_stuff();
}

I suggest using DISTINCT in your SQL statement. That's probably by far the easiest fix.
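If you go that route, a rough DBI sketch (the connection details, table, and column names here are made up for illustration). Note that DISTINCT only collapses rows that are identical across every selected column, so to deduplicate on a single field you would select just that field (or use GROUP BY):
use DBI;

my $dbh = DBI->connect('dbi:mysql:maildb', $user, $pass, { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT DISTINCT list_code FROM subscribers');
$sth->execute;
while (my ($list_code) = $sth->fetchrow_array) {
    dosomething($list_code);   # each key now comes back only once
}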

Related

How to prematurely detect whether it's the last iteration during a while loop in Perl

Similar question (does not solve my question): Is it possible to detect if the current while loop iteration is the last in Perl?
The post above has an answer which solves the issue of detecting whether it's the last iteration, but only when reading from a file.
In a while loop, is it possible to detect if the current iteration is the last one from a mysql query?
while ( my ($id, $name, $email) = $sth->fetchrow_array() )
{
    if (this_is_last_iteration)
    {
        print "last iteration";
    }
}
my $next_row = $sth->fetch();
while (my $row = $next_row) {
    my ($id, $name, $email) = @$row;
    $next_row = $sth->fetch();   # look ahead one row
    if (!$next_row) {
        print "last iteration";
    }
    ...
}
You'll need to verify this compiles, but a rough outline is:
my ($rows) = $sth->rows;
my ($i) = 0;
while ( my ($id, $name, $email) = $sth->fetchrow_array() )
{
    $i++;
    if ($i == $rows)
    {
        print "last iteration";
    }
}
If you give us some more context, there may be other options. For example, your print statement is the last thing in the while loop. If that matches reality, you could simply move this after the loop and do away with the counter.
A couple of commenters have correctly noted that the rows method will not always have the correct value for a SELECT statement (e.g. see here). If you're using SELECT (which seems likely from your code) then this could be an issue. You could perform a COUNT before the SELECT to get the number of rows, provided the data set does not change between the COUNT and the SELECT.
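A rough DBI sketch of that COUNT-then-SELECT approach (the query text is illustrative; as noted above, it is only reliable if the data set doesn't change between the two statements):
my ($rows) = $dbh->selectrow_array('SELECT COUNT(*) FROM users');

my $sth = $dbh->prepare('SELECT id, name, email FROM users');
$sth->execute;
my $i = 0;
while (my ($id, $name, $email) = $sth->fetchrow_array) {
    $i++;
    if ($i == $rows) {
        print "last iteration";
    }
}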
Although you can count rows to tell when you are on the last result from an SQL query, no, in the general case it is not possible to know in advance whether you're on the last iteration of a while loop.
Consider the following example:
while (rand() > 0.05) {
    say "Is this the last iteration?";
}
There is no way to predict in advance what rand() will return, thus the code within the loop has no way of knowing whether it will iterate again until the next iteration starts.
You can keep a counter and compare it to the array length. I'm not familiar with Perl, but that's how I would do it in any other language.
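In Perl that idea might look like the following sketch, which pulls the whole result set into memory first with DBI's fetchall_arrayref so the total is known before the loop starts (only sensible if the result set is reasonably small):
my $rows  = $sth->fetchall_arrayref;   # every remaining row, as arrayrefs
my $count = @$rows;                    # total number of rows

for my $i (0 .. $#$rows) {
    my ($id, $name, $email) = @{ $rows->[$i] };
    if ($i == $count - 1) {
        print "last iteration";
    }
}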

Dynamic Array Inside a Foreach Loop

First time poster and new to Perl, so I'm a little stuck. I'm iterating through a collection of long file names with columns separated by variable amounts of whitespace, for example:
0 19933 12/18/2013 18:00:12 filename1.someextention
1 11912 12/17/2013 18:00:12 filename2.someextention
2 19236 12/16/2013 18:00:12 filename3.someextention
These are generated by multiple servers so I am iterating through multiple collections. That mechanism is simple enough.
I'm focused solely on the date column and need to ensure the date is changing like the above example, as that ensures the file is being created on a daily basis and only once. If the file is created more than once per day I need to do something like send an email to myself and move on to the next server collection. If the date changes from the first file to the second, exit the loop as well.
My issue is that I don't know how to keep the date element of the first file stored so that I can compare it to the next file's date while going through the loop. I thought about keeping the element stored in an array inside the loop until the current collection is finished and then moving on to the next collection, but I don't know the correct way of doing so. Any help would be greatly appreciated. Also, if there is a more elegant way, please enlighten me, since I am willing to learn and not just wanting someone to write my script for me.
@file = `command -h server -secFilePath $secFilePath analyzer -archive -list`;
@array = reverse(@file); # The output from the above command lists the oldest file first

foreach $item (@array) {
    @first = split (/ +/, $item);
    @firstvar = @first[2];
    # if there is a way to save the first date in the @firstvar array and keep it until the date changes
    if @firstvar == @first[2] { # This part isn't designed correctly, I know. }
    elsif @firstvar ne @first[2]; {
        last;
    }
}
One common technique is to use a hash, which is a data structure mapping keys to values. If you key by date, you can check if a given date has been encountered before.
If a date hasn't been encountered, it has no key in the hash.
If a date has been encountered, we insert 1 under that key to mark it.
my %dates;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if ($dates{$date}) {
        # handle duplicate
    } else {
        $dates{$date} = 1;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}
Note that this will check each date against each other date. If for some reason you only want to check if two adjacent dates match, you could just cache the last $date and check against that.
In comments, OP mentioned they might rather perform that second check I mentioned. It's similar. Might look like this:
# we declare the variable OUTSIDE of the loop
# if need be, so that it stays in scope between runs
my $last_date;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if (defined $last_date and $date eq $last_date) { # we use 'eq' for string comparison; the defined check avoids a warning on the first line
        # handle duplicate
    } else {
        $last_date = $date;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}

Comparison to an array of a value [duplicate]

This question already has answers here:
How can I verify that a value is present in an array (list) in Perl?
I'm still feeling my way through Perl, so there's probably a simple way of doing this, but I can't find it. I want to compare a single value, say A or E, to an array that may or may not contain that value, e.g. A B C D, and then perform an action if they match. How should I set this up?
Thanks.
You filter each element of the array to see if it is the element you are looking for and then use the resulting array as a boolean value (not empty = true, empty = false):
@filtered_array = grep { $_ eq 'A' } @array;
if (@filtered_array) {
    print "found it!\n";
}
If you store the list in an array then the only way is to examine each element individually in a loop, using grep, a for loop, or any from List::MoreUtils. (grep is the worst of these, as it searches the entire array, even if a match has been found early on.) This is fine if the array is small, but you will hit performance problems if the array has a significant size and you have to check it frequently.
You can speed things up by representing the same list in a hash, when a check for membership is just a single key lookup.
Alternatively, if the list is enormous, then it is best kept in a database, using SQLite.
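For completeness, the any from List::MoreUtils mentioned above might look like this; unlike grep it stops at the first match:
use List::MoreUtils qw(any);

my @array = qw(A B C D);
if ( any { $_ eq 'A' } @array ) {
    print "found it!\n";
}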
Are you stuck on arrays?
Whenever in Perl you're talking about quickly looking up data, you should think in terms of hashes. A hash is a collection of data like an array, but it is keyed, and looking up a key is a very fast operation in Perl.
There's nothing that says the keys to your hash can't be your data, and it is very common in Perl to index an array with a hash in order to quickly search for values.
This turns your array @array into a hash called %array_index.
use strict;
use warnings;
use feature qw(say);
use autodie;

my @array = qw(Alpha Beta Delta Gamma Ohm);
my %array_index;
for my $entry ( @array ) {
    $array_index{$entry} = 1;   # The value doesn't matter, as long as it isn't blank or zero
}
Now, looking up whether or not your data is in your array is very quick. Just simply see if it's a key in your %array_index:
my $item = "Delta"; # Is this in my initial array?
if ( $array_index{$item} ) {
say "Yes! Item '$item' is in my array.";
}
else {
say "No. Item '$item' isn't in my array. David sad.";
}
This is so common that you'll see a lot of programs use the map command to index the array. Instead of that for loop, I could have done this:
my %array_index = ( map { $_ => 1 } @array );
or
my %array_index;
map { $array_index{$_} = 1 } @array;
You'll see both. The first one is a one-liner. The map command takes each entry in the array and puts it in $_. Then, it returns the results as a list. Thus, the map will return a list with your data in the even positions (0, 2, 4, 6...) and a 1 in the odd positions (1, 3, 5, 7...).
The second one is more literal and easier to understand (or about as easy to understand as a map command gets). Again, each item in your @array is being assigned to $_, and that is being used as the key in my %array_index hash.
Whether or not you want to use hashes depends upon the length of your array and how many items of input you'll be searching for. If you're simply searching for whether a single item is in your array, I'd probably use List::Util or List::MoreUtils, or use a for loop to search each value of my array. If I am doing this for multiple values, I am better off with a hash.

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program reads an input file in which every line contains three strings separated by tabs. The first is the table name, the second is all the column names with commas in between, and the third contains the SQL to be run. As this utility does a comparison of data, there are two rows for every table name, one for each DB, so data needs to be picked from each respective DB and then compared column by column.
The SQL returns an ID in the result set, and if the value comes from the DB then it needs to be translated to a string by reading from an array (that array contains 100K records with key and value separated by ||).
Now I ran this for one set of tables which contains 18K records in each DB. There are 8 columns picked from the DB in each SQL. So for every record out of 18K, and then for every field in that record (i.e. 8), this script is taking a lot of time.
My question is if someone can look and see if it can be improved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
The metadata array contains something like this (there are two, i.e. one for each DB):
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
    my $line1 = <INFILE1>;
    last if not defined $line1;
    my $line2 = <INFILE1>;
    last if not defined $line2;
    ...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying "loop while I can read two lines"?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
    ...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 ne $tablename2;
the rest of the loop;
and save a level of indentation. Better indentation means better readability, which makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
    ...;
    for my $colname (@fieldname) { ... }
}
my $strvalue1 = '';
@val1 = $hash_ref1->{$key1}->{$colname};
if (defined @val1)
{
    my @filtered = grep /@val1/, @metadata;
    my $strvalue1 = substr(@filtered[0], index(@filtered[0], '||') + 2);
}
I don't think this does what you think it does.
From $hash_ref1 you retrieve a single element, then assign that element to an array (a collection of multiple values).
Then you call defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling the defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
    ? do {
        my ($filtered) = grep /\Q$val1/, @metadata;
        substr $filtered, 2 + index $filtered, '||'
    } : '';
But this would be even cheaper if you pre-split @metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unnecessary: the hashrefs are assigned a new value at the next iteration of the loop. Also, it is unnecessary to assist Perl's garbage collection with such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
    chomp;
    my ($key, $value) = split /\|\|/;
    $metadata{$key} = $value;   # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside your inner for loop, there is:
@val1 = $hash_ref1->{$key1}->{$colname};
Did you mean @val1 = @{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined @val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined @val1) will always be true.
Then, you have my @filtered = grep /@val1/, @metadata; Where did @metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(@filtered[0], index(@filtered[0], '||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

Declare and populate a hash table in one step in Perl

Currently when I want to build a look-up table I use:
my $has_field = {};
map { $has_field->{$_} = 1 } @fields;
Is there a way I can do inline initialization in a single step? (i.e. populate it at the same time I'm declaring it?)
Just use your map to create a list, then drop it into a hash reference, like:
my $has_field = { map { $_ => 1 } @fields };
Update: sorry, this doesn't do what you want exactly, as you still have to declare $has_field first.
You could use a hash slice:
@{$has_field}{@fields} = (1) x @fields;
The right-hand side is using the x operator to repeat 1 by the scalar value of @fields (i.e. the number of elements in your array). Another option in the same vein:
@{$has_field}{@fields} = map { 1 } @fields;
Where I've tested it, smart match can be 2 to 5 times as fast as creating a lookup hash and testing for the value once. So unless you're going to reuse the hash a good number of times, it's best to do a smart match:
if ( $cand_field ~~ \@fields ) {
    do_with_field( $cand_field );
}
It's a good thing to remember that since 5.10, Perl has a native way to ask "is this untested value any of these known values?": smart match.