Dynamic Array Inside a Foreach Loop - perl

First time poster and new to Perl so I'm a little stuck. I'm iterating through a collection of long file names with columns separated by variable amounts of whitespace for example:
0 19933 12/18/2013 18:00:12 filename1.someextention
1 11912 12/17/2013 18:00:12 filename2.someextention
2 19236 12/16/2013 18:00:12 filename3.someextention
These are generated by multiple servers so I am iterating through multiple collections. That mechanism is simple enough.
I'm focused solely on the date column and need to ensure the date is changing like the above example as that ensures the file is being created on a daily basis and only once. If the file is created more than once per day I need to do something like send an email to myself and move on to the next server collection. If the date changes from the first file to the second exit the loop as well.
My issue is I don't know how to keep the date element of the first file stored so that I can compare it to the next file's date going through the loop. I thought about keeping the element stored in an array inside the loop until the current collection is finished and then move onto the next collection but I don't know the correct way of doing so. Any help would be greatly appreciated. Also, if there is a more eloquent way please enlighten me since I am willing to learn and not just wanting someone to write my script for me.
#file = `command -h server -secFilePath $secFilePath analyzer -archive -list`;
#array = reverse(#file); # The output from the above command lists the oldest file first
foreach $item (#array) {
#first = split (/ +/, #item);
#firstvar = #first[2];
#if there is a way to save the first date in the #firstvar array and keep it until the date
changes
if #firstvar == #first[2] { # This part isnt designed correctly I know. }
elsif #firstvar ne #first[2]; {
last;
}
}

One common technique is to use a hash, which is a data structure mapping key-value pairs. If you key by date, you can check if a given date has been encountered before.
If a date hasn't been encountered, it has no key in the hash.
If a date has been encountered, we insert 1 under that key to mark it.
my %dates;
foreach my $line (#array) {
my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
if ($dates{$date}) {
#handle duplicate
} else {
$dates{$date} = 1;
#...
#code here will be executed only if the entry's date is unique
}
#...
#code here will be executed for each entry
}
Note that this will check each date against each other date. If for some reason you only want to check if two adjacent dates match, you could just cache the last $date and check against that.
In comments, OP mentioned they might rather perform that second check I mentioned. It's similar. Might look like this:
#we declare the variable OUTSIDE of the loop
#if needs to be, so that it stays in scope between runs
my $last_date;
foreach my $line (#array) {
my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
if ($date eq $last_date) { #we use 'eq' for string comparison
#handle duplicate
} else {
$last_date = $date;
#...
#code here will be executed only if the entry's date is unique
}
#...
#code here will be executed for each entry
}

Related

How to prematurely detect whether it's the last iteration during a while loop in Perl

Similar question(does not solve my question): Is it possible to detect if the current while loop iteration is the last in perl?
Post above has an answer which solves the issue of detecting whether it's the last iteration solely when reading from a file.
In a while loop, is it possible to detect if the current iteration is the last one from a mysql query?
while( my($id, $name, $email) = $sth->fetchrow_array() )
{
if(this_is_last_iteration)
{
print "last iteration";
}
}
my $next_row = $sth->fetch();
while (my $row = $next_row) {
my ($id, $name, $email) = #$row;
$next_row = $sth->fetch();
if (!$next_row) {
print "last iteration";
}
...
}
You'll need to verify this compiles, but a rough outline is:
my($rows) = $sth->rows;
my($i) = 0;
while( my($id, $name, $email) = $sth->fetchrow_array() )
{
$i++;
if ($i == $rows)
{
print "last iteration";
}
}
If you give us some more context, there may be other options. For example, your print statement is the last thing in the while loop. If that matches reality, you could simply move this after the loop and do away with the counter.
A couple of commentators have correctly noted that the rows command will not always have the correct value for a SELECT command (eg. see here). If you're using SELECT (which seems likely from your code) then this could be an issue. You could perform a COUNT before the SELECT to get the number of rows provided the data set does not change between the COUNT and the SELECT.
Although you can count rows to tell when you are on the last result from an SQL query, no, in the general case it is not possible to know in advance whether you're on the last iteration of a while loop.
Consider the following example:
while (rand() > 0.05) {
say "Is this the last iteration?";
}
There is no way to predict in advance what rand() will return, thus the code within the loop has no way of knowing whether it will iterate again until the next iteration starts.
You can keep a counter and compare it to the array length. I'm not familiar with Perl, but that's how I would do it in any other language.

Parsing CSV files, finding columns and remembering them

I am trying to figure out a way to do this, I know it should be possible. A little background first.
I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequences, and finally the location of the the collection. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:
while ($csv) {
foreach ($_) {
if ($_ =! m/table|species|accession/i) {
#csv = split('\t', $csv);
print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}
else {
next;
}
}
}
I know that is messy, and I just typed up something similar to what I have by memory (don't have script on any of my computers at home, only at work).
Now that works for me fine right now because I know which columns the information I need (species, location, and ID number) are in.
But is there a way (there must be) for me to find the columns that are for the needed info dynamically? That is, no matter the order of the columns the correct info from the correct column goes to the right place?
The first row will usually as Table X (where X is the number of the table in the publication), the next row will usually have the column headings of interest and are nearly universal in title. Nearly all tables will have standard headings to search for and I can just use | in my pattern matching.
First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.
That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):
# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.
my $hdr_data = {};
while( <$csv_fh> ) {
if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
$hdr_data->{'table'} = $1;
next;
}
if( ! $hdr_data->{'species'} && /species/ ) {
my $n = 0;
# Takes the column headers as they come, creating
# a map between the column name and column number.
# Assumes that column names are case-insensitively
# unique.
my %columns = map { lc($_) => $n++ } split( /\t/ );
# Now pick out exactly the columns we want.
foreach my $thingy ( qw{ species accession country } ) {
$hdr_data->{$thingy} = $columns{$thingy};
}
last;
}
}
# Now process the rest of the lines.
while( <$csv_fh> ) {
my $col = split( /\t/ );
printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
$col[$hdr_data->{'species'}],
$col[$hdr_data->{'country'}],
$col[$hdr_data->{'accession'}];
}
Some variation of that will get you close to what you need.
Create a hash that maps column headings to column numbers:
my %columns;
...
if (/table|species|accession/i) {
my #headings = split('\t');
my $col = 0;
foreach my $col (#headings) {
$columns{"\L$col"} = $col++;
}
}
Then you can use $csv[$columns{'species'}].

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program is Read input file in which every line contrain three strings seperated by tab. First is TableName, Second is ALL Column Name with commas in between and third contain the sql to be run. As this utlity is doing comparison of data, so there are two rows for every tablename. One for each DB. So data needs to be picked from each respective db's and then compared column by column.
SQL returns as ID in the result set and if the value is coming from db then it needs be translated to a string by reading from a array (that array contains 100K records with Key and value seperated by ||)
Now I ran this for one set of tables which contains 18K records in each db. There are 8 columns picked from db in each sql. So for every record out of 18K, and then for every field in that record i.e. 8, this script is taking a lot of time.
My question is if someone can look and see if it can be imporoved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 eq $tablename2;
the rest of the loop;
and save a level of intendation. And better intendation equals better readability makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (#fieldname) { ... }
}
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrive a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, #metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split #metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unneccessary: The hashrefs are assigned to a new value at the next iteration of the loop. Also, it is unneccessary to assist Perls garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside you inner for loop, there is:
#val1 = $hash_ref1->{$key1}->{$colname};
Did you mean #val1 = #{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined #val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined #val1) will always be true.
Then, you have my #filtered = grep /#val1/, #metadata; Where did #metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

new to Perl - CSV - find a string and print all numbers in that column

I've got a bunch of data in a CSV file, first row is all strings (all text and underscores), all subsequent rows are filled with numbers relating to said strings.
I'm trying to parse through the first line and find particular strings, remember which column that string was in, and then go through the rest of the file and get the data in the same column. I need to do this to three strings.
I've been using Text::CSV but I can't figure out how to get it to increment a counter until it finds the string in the first line and then go to the next line, get the data from that same column, etc. etc. Here's what I've tried so far:
while (<CSV>) {
if ($csv->parse($data)) {
my #field = $csv->fields;
my $count = 0;
for $column (#field) {
print ++$count, " => ", $column, "\n";
}
} else {
my $err = $csv->error_input;
print "Failed to parse line: $err";
}
}
Since $data is in line 1, it prints "1 $data" 25 times (# of lines in CSV file). How do I get it to remember which column it found $data in? Also, since I know all of the strings are in line 1, how do I get it to only parse through line 1, find all of the strings in #data, and then parse through the rest of the file, grabbing data from the necessary columns and putting it into a matrix or array of arrays?
Thanks for the help!
edit: I realized my questions were a bit poorly phrased. I don't know how to get the column number from CSV. How is this done?
Also, once I've got the column number, how do I tell it CSV to run through the subsequent lines and grab data from only that column?
Try something like this:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({binary=>1});
my $thing_to_match = "blah";
my $matched_index;
my #stored_data = ();
while(my $row= $csv->getline(*DATA)) #grabs lines below __DATA__
#(near the end of the script)
{
my #fields = #$row;
#If we haven't found the matched index, yet, search for it.
if(not defined $matched_index)
{
foreach my $i(0..$#fields)
{
$matched_index = $i if($fields[$i] eq $thing_to_match);
}
}
#NOTE: We're pushing a *reference* to an array!
#Look at perldoc perldata
push #stored_data,\#fields;
}
die "Column for '$thing_to_match' not found!" unless defined $matched_index;
foreach my $row(#stored_data)
{
print $row->[$matched_index] . "\n";
}
__DATA__
stuff,more stuff,yet more stuff
"yes, this thing, is one item",blah,blarg
1,2,3
The output is:
more stuff
blah
2
I don't have time to write up a full example, but I wrote a module that might help you do this. Tie::Array::CSV uses some magic to make your csv file act like a Perl array of arrayrefs. In this way you can use your knowledge of Perl to interact with the file.
A word of warning though! One benefit of my module is that it is read/write. Since you only want read, be careful not to assign to it!

Can't make sense out of this Perl code

This snippet basically reads a file line by line, which looks something like:
Album=In Between Dreams
Interpret=Jack Johnson
Titel=Better Together
Titel=Never Know
Titel=Banana Pancakes
Album=Pictures
Interpret=Katie Melua
Titel=Mary Pickford
Titel=It's All in My Head
Titel=If the Lights Go Out
Album=All the Lost Souls
Interpret=James Blunt
Titel=1973
Titel=One of the Brightest Stars
So it somehow connects the "Interpreter" with an album and this album with a list of titles. But what I don't quite get is how:
while ($line = <IN>) {
chomp $line;
if ($line =~ /=/) {
($name, $wert) = split(/=/, $line);
}
else {
next;
}
if ($name eq "Album") {
$album = $wert;
}
if ($name eq "Interpret") {
$interpret = $wert;
$cd{$interpret}{album} = $album; // assigns an album to an interpreter?
$titelnummer = 0;
}
if ($name eq "Titel") {
$cd{$interpret}{titel}[$titelnummer++] = $wert; // assigns titles to an interpreter - WTF? how can this work?
}
}
The while loop keeps running and putting the current line into $line as long as there are new lines in the file handle <IN>. chomp removes the newline at the end of every row.
split splits the line into two parts on the equal sign (/=/ is a regular expression) and puts the first part in $name and the second part in $wert.
%cd is a hash that contains references to other hashes. The first "level" is the name of interpreter.
(Please ask more specific questions if you still do not understand.)
cd is a hash of hashes.
$cd{$interpret}{album} contains album for interpreter.
$cd{$interpret}{titel} contains an array of Titel, which is filled incrementally in the last if.
Perl is a very concise language.
The best way to figure out what's going on is to inspect the data structure. After the while loop, temporarily insert this code:
use Data::Dumper;
print '%cd ', Dumper \%cd;
exit;
This may have a large output if the input is large.