Parsing CSV files, finding columns and remembering them - perl

I am trying to figure out a way to do this, I know it should be possible. A little background first.
I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequence, and finally the location of the collection. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:
while (my $csv = <CSV>) {
    if ($csv !~ m/table|species|accession/i) {
        my @csv = split(/\t/, $csv);
        print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
    }
    else {
        next;
    }
}
I know that is messy; I just typed up something similar to what I have from memory (I don't have the script on any of my computers at home, only at work).
That works fine for me right now because I know which columns hold the information I need (species, location, and ID number).
But is there a way (there must be) for me to find the columns holding the needed info dynamically? That is, no matter the order of the columns, the correct info from the correct column goes to the right place?
The first row will usually read "Table X" (where X is the number of the table in the publication), and the next row will usually have the column headings of interest, which are nearly universal in title. Nearly all tables will have standard headings to search for, so I can just use | in my pattern matching.

First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.
That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):
# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.
my $hdr_data = {};
while( <$csv_fh> ) {
    chomp;  # strip the newline so the last heading maps cleanly
    if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
        $hdr_data->{'table'} = $1;
        next;
    }
    if( ! $hdr_data->{'species'} && /species/i ) {
        my $n = 0;
        # Takes the column headers as they come, creating
        # a map between the column name and column number.
        # Assumes that column names are case-insensitively
        # unique.
        my %columns = map { lc($_) => $n++ } split( /\t/ );
        # Now pick out exactly the columns we want.
        foreach my $thingy ( qw{ species accession country } ) {
            $hdr_data->{$thingy} = $columns{$thingy};
        }
        last;
    }
}
# Now process the rest of the lines.
while( <$csv_fh> ) {
    chomp;  # keep the trailing newline out of the last column
    my @col = split( /\t/ );
    printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        $col[$hdr_data->{'species'}],
        $col[$hdr_data->{'country'}],
        $col[$hdr_data->{'accession'}];
}
Some variation of that will get you close to what you need.
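For reference, a Text::CSV_XS version of the same idea might look like this (a sketch only: it assumes the "Table X" row has already been read past, and reuses the $csv_fh and NEWFILE handles from above):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

# Read the heading row and register lowercased column names.
my $headings = $csv->getline($csv_fh);
$csv->column_names( map { lc } @$headings );

# getline_hr() then returns each data row as a hashref keyed by column name.
while ( my $row = $csv->getline_hr($csv_fh) ) {
    printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        $row->{species}, $row->{country}, $row->{accession};
}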

Create a hash that maps column headings to column numbers:
my %columns;
...
if (/table|species|accession/i) {
    my @headings = split(/\t/);
    my $i = 0;
    foreach my $heading (@headings) {
        $columns{"\L$heading"} = $i++;
    }
}
Then you can use $csv[$columns{'species'}].
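For example, once %columns is filled in, the data rows can be processed like this (a minimal sketch; DATA is a hypothetical input handle standing in for wherever the table is read from):
while (<DATA>) {
    chomp;
    next if /table|species|accession/i;  # skip the title and heading rows
    my @csv = split /\t/;
    print NEWFILE ">[species=$csv[$columns{'species'}]] [molecule=DNA] [moltype=genomic] [country=$csv[$columns{'country'}]] [spec-id=$csv[$columns{'accession'}]]\n";
}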

Related

How to use hash in perl to reduce the use of loops?

For example, first, I have identified several keywords. Then, in another file, I want to find lines that match these keywords exactly.
Using loops, I would read through the file once for each keyword, so with 10 keywords I would loop over the file 10 times. If the file is huge, this method is very time-inefficient.
I have heard that proper use of a hash could ease such a situation, but I have never seen an example. How exactly can a hash achieve the same goal while looping over the file only once? Could you provide an example?
You have a solution whose time taken is proportional to the size of the file and to the number of keywords (O(N*K)). You would like a solution where the time taken is not affected by the number of keywords (O(N)).
You could indeed achieve this using a hash.
my @keywords = ...;
my %keywords = map { $_ => 1 } @keywords;
while (<$fh>) {
    my @words = split ' '; # To be improved.
    if (grep { $keywords{$_} } @words) { # Replace with a foreach loop to exit ASAP.
        ...
    }
}
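The foreach variant hinted at in that comment might look like this (a sketch under the same assumptions; it stops scanning a line at the first keyword hit):
while (<$fh>) {
    my @words = split ' ';
    for my $word (@words) {
        if ($keywords{$word}) {
            ...;  # handle the matching line
            last; # exit as soon as possible
        }
    }
}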
I'd use a regex match.
my @keywords = ...;
my $pat = join '|', map quotemeta, @keywords;
my $re = qr/$pat/;
while (<$fh>) {
    if (/(?:^|\s)$re(?:\s|\z)/) { # To be improved.
        ...
    }
}
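One possible improvement, assuming the keywords are plain \w-style words, is to use \b word boundaries instead of the whitespace anchors, so punctuation next to a keyword still matches:
my $re = qr/\b(?:$pat)\b/;
while (<$fh>) {
    if (/$re/) {
        ...
    }
}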

Dynamic Array Inside a Foreach Loop

First time poster and new to Perl, so I'm a little stuck. I'm iterating through a collection of file listings whose columns are separated by variable amounts of whitespace, for example:
0 19933 12/18/2013 18:00:12 filename1.someextention
1 11912 12/17/2013 18:00:12 filename2.someextention
2 19236 12/16/2013 18:00:12 filename3.someextention
These are generated by multiple servers so I am iterating through multiple collections. That mechanism is simple enough.
I'm focused solely on the date column and need to ensure the date is changing like in the above example, as that ensures the file is being created on a daily basis and only once. If a file is created more than once per day, I need to do something like send an email to myself and move on to the next server collection. If the date changes from the first file to the second, exit the loop as well.
My issue is that I don't know how to keep the date element of the first file stored so that I can compare it to the next file's date while going through the loop. I thought about keeping the element stored in an array inside the loop until the current collection is finished and then moving on to the next collection, but I don't know the correct way of doing so. Any help would be greatly appreciated. Also, if there is a more elegant way, please enlighten me, since I am willing to learn and am not just wanting someone to write my script for me.
@file = `command -h server -secFilePath $secFilePath analyzer -archive -list`;
@array = reverse(@file); # The output from the above command lists the oldest file first
foreach $item (@array) {
    @first = split(/ +/, $item);
    $firstvar = $first[2];
    # if there is a way to save the first date in $firstvar and keep it
    # until the date changes
    if ($firstvar eq $first[2]) {
        # This part isn't designed correctly, I know.
    }
    elsif ($firstvar ne $first[2]) {
        last;
    }
}
One common technique is to use a hash, which is a data structure mapping key-value pairs. If you key by date, you can check if a given date has been encountered before.
If a date hasn't been encountered, it has no key in the hash.
If a date has been encountered, we insert 1 under that key to mark it.
my %dates;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if ($dates{$date}) {
        # handle duplicate
    } else {
        $dates{$date} = 1;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}
Note that this will check each date against each other date. If for some reason you only want to check if two adjacent dates match, you could just cache the last $date and check against that.
In comments, OP mentioned they might rather perform that second check I mentioned. It's similar. Might look like this:
# we declare the variable OUTSIDE of the loop
# if needs be, so that it stays in scope between runs
my $last_date;
foreach my $line (@array) {
    my ($idx, $id, $date, $time, $filename) = split(/\s+/, $line);
    if (defined $last_date && $date eq $last_date) { # 'eq' for string comparison; the defined check guards the first pass
        # handle duplicate
    } else {
        $last_date = $date;
        # ...
        # code here will be executed only if the entry's date is unique
    }
    # ...
    # code here will be executed for each entry
}

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program reads an input file in which every line contains three strings separated by tabs. The first is the table name, the second is all the column names with commas in between, and the third contains the SQL to be run. As this utility compares data, there are two rows for every table name, one for each DB. So data needs to be picked from each respective DB and then compared column by column.
The SQL returns an ID in the result set, and if the value comes from the DB it needs to be translated to a string by looking it up in an array (that array contains 100K records, with key and value separated by ||).
I ran this for one set of tables, which contains 18K records in each DB, with 8 columns picked from the DB in each SQL. So for every record out of 18K, and then for every one of the 8 fields in that record, this script is taking a lot of time.
My question is whether someone can look and see if it can be improved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
The metadata array contains something like this (there are two, i.e. one for each DB):
111||Code 1
222||Code 2
Please suggest improvements.
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
    my $line1 = <INFILE1>;
    last if not defined $line1;
    my $line2 = <INFILE1>;
    last if not defined $line2;
    ...;
}
is overly complicated: not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying "loop while I can read two lines"?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
    ...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 ne $tablename2;
the rest of the loop;
and save a level of indentation. Better indentation equals better readability, which makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...)? Something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables.)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
    ...;
    for my $colname (@fieldname) { ... }
}
my $strvalue1 = '';
@val1 = $hash_ref1->{$key1}->{$colname};
if (defined @val1)
{
    my @filtered = grep /@val1/, @metadata;
    my $strvalue1 = substr(@filtered[0], index(@filtered[0], '||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrieve a single element, then assign that element to an array (a collection of multiple values).
Then you call defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined on an aggregate returns info about memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: the elements of the array are joined with the value of $", usually a space, and the resulting string is treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
    ? do {
        my ($filtered) = grep /\Q$val1/, @metadata;
        substr $filtered, 2 + index $filtered, '||'
    } : '';
But this would be even cheaper if you pre-split @metadata into pairs and tested for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unnecessary: the hashrefs are assigned new values at the next iteration of the loop. Also, it is unnecessary to assist Perl's garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
    chomp;
    my ($key, $value) = split /\|\|/;
    $metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer, but it won't fit well in a comment either, so, until you provide some more information, here are some observations.
Inside your inner for loop, there is:
@val1 = $hash_ref1->{$key1}->{$colname};
Did you mean @val1 = @{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined @val1). What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined @val1) will always be true.
Then, you have my @filtered = grep /@val1/, @metadata; Where did @metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(@filtered[0], index(@filtered[0], '||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

Perl need the right grep operator to match value of variable

I want to see if I have repeated items in my array; there are over 16,000, so I want to automate it.
There may be other ways, but I started with this and would like to finish it unless there is a straightforward command. What I am doing is shifting and pushing from one array into another and, this way, checking the destination array to see whether the element is "in array" (PHP has such a function).
So, I have this subroutine and it works with literals, but it doesn't with variables. It is because of the 'eq', or whatever I should use instead. The 'sourcefile' will contain one or more of the words of the destination array.
# Here I just fetch my file
$listamails = <STDIN>;
# Remove the newline from the filename
chomp $listamails;
# open the file, or exit
unless ( open(MAILS, $listamails) ) {
    print "Cannot open file \"$listamails\"\n\n";
    exit;
}
# Read the list of mails from the file, and store it
# into the array variable @sourcefile
@sourcefile = <MAILS>;
# Close the handle - we've read all the data into @sourcefile now.
close MAILS;
my @destination = ('hi', 'bye');
sub in_array
{
    my ($destination, $search_for) = @_;
    return grep {$search_for eq $_} @$destination;
}
for ($i = 0; $i <= 100; $i++)
{
    $elemento = shift @sourcefile;
    if (in_array(\@destination, $elemento))
    {
        print "it is";
    }
    else
    {
        print "it aint there";
    }
}
Well, if instead of including $elemento in there I put a 'hi', it does work; I have also printed the value of $elemento, which is also 'hi'. But when I put the variable, it does not work, and that is because of the 'eq', but I don't know what else to put. If I put == it complains that 'hi' is not a numeric value.
When you want distinct values, think hash.
my %seen;
@seen{ @array } = ();
if (keys %seen == @array) {
    print "\@array has no duplicate values\n";
}
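To unpack that idiom with some made-up data: @seen{ @array } = () is a hash slice that creates one key per element, so duplicates collapse into a single key and the key count falls short of the element count.
my @array = qw(a b c b);
my %seen;
@seen{ @array } = ();      # keys are now a, b, c
print scalar keys %seen;   # 3, but @array has 4 elements, so there is a duplicate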
It's not clear what you want. If your first sentence is the only one that matters ("I want to see if I have repeated items in my array"), then you could use:
my %seen;
if (grep ++$seen{$_} >= 2, @array) {
    say "Has duplicates";
}
You said you have a large array, so it might be faster to stop as soon as you find a duplicate.
my %seen;
for (@array) {
    if (++$seen{$_} == 2) {
        say "Has duplicates";
        last;
    }
}
By the way, when looking for duplicates in a large number of items, it's much faster to use a strategy based on sorting. After sorting the items, all duplicates will be right next to each other, so to tell if something is a duplicate, all you have to do is compare it with the previous one:
my @sorted = sort @sourcefile;
for (my $i = 1; $i < @sorted; ++$i) { # Start at 1 because we'll check the previous one
    print "$sorted[$i] is a duplicate!\n" if $sorted[$i] eq $sorted[$i - 1];
}
This will print multiple dupe messages if there are multiple dupes, but you can clean it up.
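A cleaned-up variant might report just once and stop at the first duplicate (a sketch of the same sort-based approach):
my @sorted = sort @sourcefile;
for my $i (1 .. $#sorted) {
    if ($sorted[$i] eq $sorted[$i - 1]) {
        print "Found a duplicate: $sorted[$i]\n";
        last; # one report is enough
    }
}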
As eugene y said, hashes are definitely the way to go here. Here's a direct translation of the code you posted to a hash-based method (with a little more Perlishness added along the way):
my @destination = ('hi', 'bye');
my %in_array = map { $_ => 1 } @destination;
for my $i (0 .. 100) {
    $elemento = shift @sourcefile;
    if (exists $in_array{$elemento})
    {
        print "it is";
    }
    else
    {
        print "it aint there";
    }
}
Also, if you mean to check all elements of @sourcefile (as opposed to testing the first 101 elements) against @destination, you should replace the for line with
while (@sourcefile) {
Also also, don't forget to chomp any values read from a file! Lines read from a file have a linebreak at the end of them (the \r\n or \n mentioned in comments on the initial question), which will cause both eq and hash lookups to report that otherwise-matching values are different. This is, most likely, the reason why your code is failing to work correctly in the first place and changing to use sort or hashes won't fix that. First chomp your input to make it work, then use sort or hashes to make it efficient.
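In this code, that fix could be as small as chomping the whole array right after reading it:
chomp(@sourcefile = <MAILS>);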

How is this Perl code selecting two different elements from an array?

I have inherited some code from a guy whose favorite pastime was to shorten every line to its absolute minimum (and sometimes only to make it look cool). His code is hard to understand, but I managed to understand (and rewrite) most of it.
Now I have stumbled on a piece of code which, no matter how hard I try, I cannot understand.
my @heads = grep {s/\.txt$//} OSA::Fast::IO::Ls->ls($SysKey,'fo','osr/tiparlo',qr{^\d+\.txt$}) || ();
my @selected_heads = ();
for my $i (0..1) {
    $selected_heads[$i] = int rand scalar @heads;
    for my $j (0..@heads-1) {
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; # WTF?
    }
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].cache","$recdir/heads/$head_nr.cache");
}
From what I can understand, this is supposed to be some kind of randomizer, but I have never seen a more complex way to achieve randomness. Or are my assumptions wrong? At least, that's what this code is supposed to do: select 2 random files and copy them.
=== NOTES ===
The OSA Framework is a Framework of our own. They are named after their UNIX counterparts and do some basic testing so that the application does not need to bother with that.
This looks like some C code with Perl syntax. Sometimes knowing the language the person is thinking in helps you figure out what's going on. In this case, the person's brain is infected with the inner workings of memory management, pointer arithmetic, and other low level concerns, so he wants to minutely control everything:
my @selected_heads = ();
# a tricky way to make a two element array
for my $i (0..1) {
    # choose a random file
    $selected_heads[$i] = int rand @heads;
    # for all the files (could use $#heads instead)
    for my $j (0..@heads-1) {
        # stop if the chosen file is not already in @selected_heads
        # it's that damned ! in front of the grep that's mind-warping
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        # if we are this far, the two files we selected are the same
        # choose a different file if we're this far
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; # WTF?
    }
    ...
}
This is a lot of work because the original programmer either doesn't understand hashes or doesn't like them.
my %selected_heads;
until( keys %selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if exists $selected_heads{$try};
    $selected_heads{$try}++;
}
my @selected_heads = keys %selected_heads;
If you still hate hashes and have Perl 5.10 or later, you can use smart-matching to check if a value is in an array (though note that smartmatch was deprecated in later versions of Perl):
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try ~~ @selected_heads;
    push @selected_heads, $try;
}
However, you have a special constraint on this problem. Since you know there are only two elements, you just have to check if the element you want to add is the prior element. In the first case it won't be undef, so the first addition always works. In the second case, it just can't be the last element in the array:
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try eq $selected_heads[-1];
    push @selected_heads, $try;
}
Huh. I can't remember the last time I used until when it actually fit the problem. :)
Note that all of these solutions have the problem that they can cause an infinite loop if the number of original files is less than 2. I'd add a guard condition higher up so the zero- and single-file cases throw an error, and perhaps the two-file case doesn't bother to order them.
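Such a guard might be as simple as this (a sketch; the message is made up):
die "Need at least two head files, found " . scalar(@heads) . "\n"
    if @heads < 2;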
Another way you might do this is to shuffle (say, with List::Util) the entire list of original files and just take off the first two files:
use List::Util qw(shuffle);
my @input = 'a' .. 'z';
my @two = ( shuffle( @input ) )[0,1];
print "selected: @two\n";
It selects a random element from @heads.
Then it adds on another random but different element from @heads (if it is the element previously selected, it scrolls through @heads till it finds an element not previously selected).
In summary, it selects N (in your case N=2) different random indexes in the @heads array and then copies the files corresponding to those indexes.
Personally I would write it a bit differently:
# ...
my %selected_previously;
foreach my $i (0..$N-1) { # Generalize for N random files instead of 2
    my $random_head_index = int rand scalar @heads;
    while ($selected_previously{$random_head_index}++) {
        $random_head_index = ($random_head_index + 1) % @heads; # Cache me!!!
    }
    # NOTE: "++" in the while() might be considered a bit of a hack
    # More readable version: $selected_previously{$random_head_index} = 1; here.
    ...
}
The part you labeled "WTF" isn't so troubling: it's just making sure that $selected_heads[$i] remains a valid subscript of @heads. The really troubling part is that it is a pretty inefficient way of making sure he's not selecting the same file.
Then again, if the size of @heads is small, stepping through 0..$#heads is probably more efficient than repeatedly generating a new int rand value and testing whether the two picks are the same.
But basically it copies two files at random (why?) as a '.txt' file and a '.cache' file.
How about just
for my $i (0..1) {
    my $selected = splice( @heads, rand @heads, 1 );
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.cache","$recdir/heads/$head_nr.cache");
}
unless @heads or @selected_heads are used later.
Here's yet another way to select 2 unique random indices:
my @selected_heads = ();
my @indices = 0..$#heads;
for my $i (0..1) {
    # pick a random position among the indices that are still unused
    my $j = int rand (@heads - $i);
    push @selected_heads, $indices[$j];
    # replace the chosen slot with the last unused index, shrinking
    # the candidate pool by one (a partial Fisher-Yates shuffle)
    $indices[$j] = $indices[@heads - $i - 1];
}