How to parse text that includes something like a table using Perl - perl

Tables like this:

Property History
class Rate data
A1 5% 10
B1 3.5% 8

How to parse them into a hash or array?
Thanks a lot.

There are three ways we could structure the parsed data. First, we need to open the file and get at the data. I'm assuming a line holds valid data if its second column ends in a %:
#! /usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
    or die qq(Can't open "data.txt" for reading\n);

my %myHash;
my %dataHash;
my %rateHash;

while (my $line = <$fh>) {
    my ($class, $rate, $data) = split /\s+/, $line;
    next unless defined $rate and $rate =~ /%$/;
That part of the code splits out the three items; the question is then how to structure the hash. We could create two hashes (one for rate and one for data) and use the same key:
$rateHash{$class} = $rate;
$dataHash{$class} = $data;
Or, we could have our hash as a hash of hashes, and put both pieces in the same hash:
$myHash{$class}->{RATE} = $rate;
$myHash{$class}->{DATA} = $data;
You can now pull up either the rate or the data from the same hash. You can also do it in one go:
%{ $myHash{$class} } = ( RATE => $rate, DATA => $data );
I personally prefer the first form (the two separate assignments).
Another possibility is to combine the two into a single scalar:
$myHash{$class} = "$rate:$data"; #Assuming ":" doesn't appear in either.
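To get the two values back out later, split on the same separator (a one-line sketch matching the layout above):
my ($rate, $data) = split /:/, $myHash{$class};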
My preference is to make it a hash of hashes (as in the second example), then create a class to handle the dirty work (simple to do using Moose).
However, depending upon your programming skill, you might feel more comfortable with the two-hash idea, or with combining the two values into a single scalar.
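For completeness, here's a minimal end-to-end sketch of the hash-of-hashes version (assuming the data.txt layout above), with the loop closed and a pass over the results:

#! /usr/bin/env perl
use strict;
use warnings;

open my $fh, '<', 'data.txt'
    or die qq(Can't open "data.txt" for reading\n);

my %myHash;
while (my $line = <$fh>) {
    my ($class, $rate, $data) = split /\s+/, $line;
    next unless defined $rate and $rate =~ /%$/;
    $myHash{$class} = { RATE => $rate, DATA => $data };
}
close $fh;

# Print each class with its rate and data
for my $class (sort keys %myHash) {
    print "$class: rate=$myHash{$class}{RATE} data=$myHash{$class}{DATA}\n";
}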

Related

To increase the performance of a script in perl

I have 2 files here which is newFile and LookupFile (which are huge files).
The contents in newFile will be searched in LookupFile and further processing happens. This script is working fine, however, it is taking more time to execute. Could you please let me know what can be done here to increase the performance? Could you please let me know if we can convert files into hash to increase performance?
My file looks like below
NewFile and LookupFile:
acl sourceipaddress subnet destinationipaddress subnet portnumber
.
.
Script:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp::Tiny 'read_file';
use File::Copy;
use Data::Dumper;
use File::Copy qw(copy);
my %options = (
LookupFile => {
type => "=s",
help => "File name",
variable => 'gitFile',
required => 1,
}, newFile => {
type => "=s",
help => "file containing the acl lines to checked for",
variable => 'newFile',
required => 1,
} );
$opts->addOptions(%options);
$opts->parse();
$opts->validate();
my $newFile = $opts->getOption('newFile');
my $LookupFile = $opts->getOption('LookupFile');
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>)
{
chomp($line);
my @columns = split(' ',$line);
my $var = @columns;
my $fld1;
my $cnt;
my $fld2;
my $fld3;
my $fld4;
my $fld5;
my $dIP;
my $sIP;
my $sHOST;
my $dHOST;
if(....)
if (....) further checks and processing
)
First thing to do before any optimization is to profile your code. Rather than guessing, this will tell you what lines are taking up the most time, and how often they're called. Devel::NYTProf is a good tool for the job.
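A typical run (shell commands; yourscript.pl stands in for your actual script) might look like:

perl -d:NYTProf yourscript.pl
nytprofhtml --open   # builds an HTML report and opens it

With the report in hand you can see exactly where the time goes before touching any code.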
This is a problem.
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
read_file reads the whole file in as one big string (it should be my $contents = read_file(...); using an array here is awkward). Then the code splits the whole thing on newlines, copying everything in the file again. This is very slow, hard on memory, and unnecessary.
Instead, use read_lines. This will split the file into lines as it reads, avoiding a costly copy.
my @lookups = read_lines($LookupFile);
my @new = read_lines($newFile);
The next problem is that $newFile is opened again and iterated through line by line.
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>) {
This is a waste as you've already read that file into memory. Use one or the other. However, in general, it's better to work with files line-by-line than to slurp them all into memory.
The above changes will speed things up, but they don't get at the crux of the problem. This is likely the real problem...
The contents in newFile will be searched in LookupFile and further processing happens.
You didn't show what you're doing, but I'm going to imagine it looks something like this...
for my $line (@lines) {
    for my $thing (@lookups) {
        ...
    }
}
That is, for each line in one file, you're looking at every line in the other. This is what is known as an O(n^2) algorithm, meaning that as you double the size of the files, you quadruple the running time.
If each file has 10 lines, it will take 100 (10^2) turns through the inner loop. If they have 100 lines, it will take 10,000 (100^2). With 1,000 lines it will take 1,000,000 (1000^2) turns.
With O(n^2), as the sizes get bigger, things get very slow very quickly.
Could you please let me know if we can convert files into hash to increase performance?
You've got the right idea. You could convert the lookup file to a hash to speed things up. Let's say they're both lists of words.
# input
foo
bar
biff
up
down
# lookup
foo
bar
baz
And you want to check if any lines in input match any lines in lookup.
First you'd read lookup in and turn it into a hash. Then you'd read input and check if each line is in the hash.
use strict;
use warnings;
use autodie;
use v5.10;

...

# Populate %lookup from the lookup file
my %lookup;
{
    open my $fh, '<', $lookupFile;
    while (my $line = <$fh>) {
        chomp $line;
        $lookup{$line} = 1;
    }
}

# Check if any input lines are in %lookup
open my $fh, '<', $inputFile;
while (my $line = <$fh>) {
    chomp $line;
    say $line if $lookup{$line};
}
This way you only iterate through each file once. This is an O(n) algorithm, meaning it scales linearly, because hash lookups are effectively constant time. If each file has 10 lines, it will only take 10 iterations of each loop. If they have 100 lines, it will take 100 iterations; 1,000 lines, 1,000 iterations.
Finally, what you really want to do is skip all this and create a database for your data and search that. SQLite is a SQL database that requires no server, just a file. Put your data in there and perform SQL queries on it using DBD::SQLite.
While this means you have to learn SQL, and there is a cost to building and maintaining the database, this is fast and, most importantly, very flexible. SQLite can do all sorts of searches quickly without you having to write a bunch of extra code. SQL databases are very common, so learning SQL is a very good investment.
Since you're splitting the file up with my @columns = split(' ',$line); it's probably a file with many fields in it. That will likely map to a SQL table very well.
SQLite can even import files like that for you. See this answer for details on how to do that.
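For illustration, here is a rough sketch of that direction with DBI and DBD::SQLite (the table layout and file name are invented, assuming whitespace-separated fields as in the question):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Open (or create) a SQLite database file
my $dbh = DBI->connect("dbi:SQLite:dbname=acl.db", "", "",
    { RaiseError => 1 });

# Hypothetical table matching the acl line format
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS acl (
        source_ip   TEXT, source_subnet TEXT,
        dest_ip     TEXT, dest_subnet   TEXT,
        port        TEXT
    )
});

my $insert = $dbh->prepare("INSERT INTO acl VALUES (?, ?, ?, ?, ?)");

open my $fh, '<', 'LookupFile.txt' or die "Could not open file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my (undef, @fields) = split ' ', $line;  # drop the leading "acl" token
    $insert->execute(@fields[0 .. 4]);
}
close $fh;

# Lookups become indexed SQL queries instead of nested loops
my $rows = $dbh->selectall_arrayref(
    "SELECT * FROM acl WHERE source_ip = ?", undef, "10.0.0.1");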

Adding multiple values to key in perl hash

I need to create a multi-dimensional hash.
For example, I have done:
$hash{gene} = $mrna;
if (defined $exon) {
$hash{gene}{$mrna} = $exon;
}
if (defined $cds) {
$hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only store a single scalar value per key. However, that scalar can be a reference to another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); # First hash
my %hash2;
$hash2{some_key} = \%hash1; # Reference to %hash1
And there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have as complex and convoluted a structure as you like, with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find it makes more complex structures easier to read. More importantly, it reminds you that these are references and not true multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; # Quotes aren't needed when the key is a valid identifier.
The best thing to do is to think of your hash as a structure. For example:
my $person = {};  # Person is a hash reference.
$person->{NAME}{FIRST} = "Bob";
$person->{NAME}{LAST}  = "Rogers";
$person->{PHONE}{WORK}[0] = "555-1234";  # An array ref. There might be more than one.
$person->{PHONE}{WORK}[1] = "555-4444";
$person->{PHONE}{CELL}[0] = "555-4321";
...
my @people;
push @people, $person;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %person_by_ssn;
$person_by_ssn{$bobs_ssn} = $person; # Now, all of Bob's info is indexed by his SSN.
So, the first thing you need to do is to think of what your structure should look like. What are the fields in your structure? What are the sub-fields? Figure out what your structure should look like, and then setup your hash of hashes to look like that. Figure out exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except perhaps as an uninitialized value warning, since $var{Foo} was never set). The use strict; pragma can't detect typos in your hash keys.
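If mistyped keys are a real concern, the core Hash::Util module can lock a hash's key set (a small sketch; the die-on-access behavior is per Hash::Util's restricted hashes):

use strict;
use warnings;
use Hash::Util qw(lock_keys);

my %var = ( foo => "bar" );
lock_keys(%var);        # freeze the set of allowed keys

print $var{foo}, "\n";  # fine
my $oops = $var{Foo};   # dies: attempt to access a disallowed key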
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All object-oriented Perl does is take your hash references and turn them into objects. Then, it creates subroutines (methods) that help you keep the manipulation of those objects consistent. For example:
sub last_name {
    my $person    = shift;  # The object (a hash reference)
    my $last_name = shift;  # Optional new value
    if ( defined $last_name ) {
        $person->{NAME}{LAST} = $last_name;
    }
    return $person->{NAME}{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME}, $person->{LAST}->{NMAE}, or $person->{last}->{name}.
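Called as a plain subroutine for now (the OO version would be $person->last_name("Rogers")), usage might look like:

my $person = {};
last_name($person, "Rogers");    # set the last name
print last_name($person), "\n";  # get it back: prints "Rogers"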
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items: think about what fields you want, and how you're going to pull up that information.
You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [ $mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
print "Gene: $key\t";
for my $value (@{ $hash{$key} }) {
my ($mrna, $exon, $cds) = @$value; # De-references the array
print "Values: [$mrna], [$exon], [$cds]\n";
}
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match, as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my @fields = split;
# create the normalized representation by lowercasing and sorting the fields
my @normalized_fields = sort map lc, @fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for @normalized_fields;
# if this is an unknown value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as the value in the lookup data structure, but in more complex scenarios the original fields or the line number could be stored instead. For the sake of the example, I used space-separated values here, but you could replace the split with a call to Text::CSV or similar.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push @normalized_fields, ("\x00") x ($fields - @normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push @{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless @$list > 1;
print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
1. Make two hash tables, A and B.
2. Stream through your input one line at a time.
3. For the first line pair x and y, use each field as a key and the other as its value in both hash tables (e.g., $A{x} = y and $B{y} = x).
4. For the second and subsequent line pairs, test whether the second field's value exists as a key in either A or B. If it does, you have a reverse match; if not, repeat the addition from step 3 to add the pair to both tables (see the sketch below).
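Here is a minimal sketch of those steps (assuming comma-separated pairs as in the sample data):

#!/usr/bin/perl
use strict;
use warnings;

my (%A, %B); # field1 => field2, and field2 => field1

while (my $line = <DATA>) {
    chomp $line;
    my ($x, $y) = split /,/, $line;
    # Reverse match: this pair was already seen in the other order
    if ((defined $A{$y} && $A{$y} eq $x) || (defined $B{$x} && $B{$x} eq $y)) {
        print "Line $.: $line matches an earlier line\n";
        next;
    }
    $A{$x} = $y;
    $B{$y} = $x;
}

__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

This prints "Line 3: y1,x1 matches an earlier line" for the sample data.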
To do a version of amon's answer without a hash table, if your data are numerical, you could:
1. Stream through the input line by line, putting fields one and two in a consistent order (numerical ordering, say).
2. Pipe the result to UNIX sort on the first and second fields.
3. Stream through the sorted output line by line, checking whether the current line matches the previous line, and reporting a reverse match if it does (see the sketch below).
This has the advantage of using less memory than hash tables, but may take more time to process.
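A sketch of the normalization step (string order is used for simplicity; normalize.pl is a hypothetical name):

#!/usr/bin/perl
# normalize.pl - emit each line with its two fields in a canonical order
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my ($x, $y) = split /,/, $line;
    # Order the pair so "x1,y1" and "y1,x1" become the same string
    print join(',', sort { $a cmp $b } $x, $y), "\n";
}

Running perl normalize.pl input.csv | sort | uniq -d would then print one copy of each normalized line that occurs more than once.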
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

How do I create a hash of hashes in Perl?

Based on my current understanding of hashes in Perl, I would expect this code to print "hello world." It instead prints nothing.
%a=();
%b=();
$b{str} = "hello";
$a{1}=%b;
$b=();
$b{str} = "world";
$a{2}=%b;
print "$a{1}{str} $a{2}{str}";
I assume that a hash is just like an array, so why can't I make a hash contain another?
You should always use "use strict;" in your program.
Use references and anonymous hashes.
use strict; use warnings;
my %a;
my %b;
$b{str} = "hello";
$a{1}={%b};
%b=();
$b{str} = "world";
$a{2}={%b};
print "$a{1}{str} $a{2}{str}";
{%b} creates a reference to a copy of hash %b. You need a copy here because you empty %b later.
Hashes of hashes are tricky to get right the first time. In this case
$a{1} = { %b };
...
$a{2} = { %b };
will get you where you want to go.
See perldoc perldsc for the gory details about nested data structures in Perl.
Short answer: hash keys can only be associated with a scalar, not a hash. To do what you want, you need to use references.
Rather than re-hash (heh) how to create multi-level data structures, I suggest you read perlreftut. perlref is more complete, but it's a bit overwhelming at first.
Mike, Alexandr's is the right answer.
Also, a tip: if you are just learning hashes, Perl has a module called Data::Dumper that can pretty-print your data structures for you, which is really handy when you'd like to check what values your data structures hold.
use Data::Dumper;
print Dumper(\%a);
When you print this, it shows:
$VAR1 = {
'1' => {
'str' => 'hello'
},
'2' => {
'str' => 'world'
}
};
Perl likes to flatten your data structures. That's often a good thing... for example, (@options, "another option", "yet another") ends up as one flat list.
If you really mean to have one structure inside another, the inner structure needs to be a reference. Like so:
$a{1} = { %b };
The braces denote an anonymous hash, which you're filling with values copied from %b, and getting back as a reference rather than a plain hash.
You could also say
$a{1} = \%b;
but that makes changes to %b change $a{1} as well.
I needed to create 1000 employee records for testing a T&A system. The employee records were stored in a hash where the key was the employee's identity number, and the value was a hash of their name, date of birth, date of hire, etc. Here's how...
# declare an empty hash
my %employees;

# add each employee to the hash
$employees{$identity} = {
    gender   => $gender,
    forename => $forename,
    surname  => $surname,
    dob      => $dob,
    doh      => $doh,
};

# dump the hash as CSV
foreach my $identity ( keys %employees ) {
    print "$identity,$employees{$identity}{forename},$employees{$identity}{surname}\n";
}
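To actually generate the 1000 test records, a loop along these lines works with the %employees hash above (the sample names and EMP ID format are invented for illustration):

my @forenames = qw(Alice Bob Carol Dave);
my @surnames  = qw(Smith Jones Brown Patel);

for my $n (1 .. 1000) {
    my $identity = sprintf "EMP%04d", $n;            # hypothetical ID scheme
    $employees{$identity} = {
        gender   => ( $n % 2 ? 'M' : 'F' ),
        forename => $forenames[ rand @forenames ],   # random pick
        surname  => $surnames[ rand @surnames ],
        dob      => '1980-01-01',                    # placeholder dates
        doh      => '2010-06-15',
    };
}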

Perl Array Question

Never done much programming -- been charged at work with manipulating the data from comment cards. Using perl so far I've got the database to correctly put its daily comments into an array. Comments are each one LINE of text within the database, so I just split the array on the line-break.
my @comments = split("\n", $c_data);
And yes, this being my first time programming, that took me wayyy too long to figure out.
At this point I now need to organize these array elements (is that what I should call them?) into their own separate scalars based on capitalized words (this is a behavior of the database, which was at one point corrupt).
Example of what two elements of the array look like:
print "$comments[0]\n";
This dining experience was GOOD blah blah blah.
or
print "$comments[1]\n";
Overall this was a BAD time and me and my blah blah.
These "good" or "bad" or "best" are already capitalized by the database the data came from.
What's the easiest way in Perl to get these lines into scalars from an array based on these capitalized words?
If I understand you correctly, you want to merge array elements that match a certain word. You can do it like this:
my @bad_comments = grep { /\bBAD\b/ } @comments;
my @good_comments = grep { /\bGOOD\b/ } @comments;
That way all 'good' and 'bad' comments go to each own array.
Now if you need to merge them into a scalar you'd want to join them (opposite of split):
my $bad_comments = join "\n", grep { /\bBAD\b/ } @comments;
my $good_comments = join "\n", grep { /\bGOOD\b/ } @comments;
Think hash table when you want to group data by arbitrary string keys. In this case, you have an array of GOOD comments and an array of BAD comments. What if you had an array of SO-SO comments? A strategy based on having array variables @good, @bad, @soso breaks down fast.
You have some ways to go before you can fully understand the code below:
#!/usr/bin/perl
use strict; use warnings;
use Regex::PreSuf;
my %comments;
my @types = qw( GOOD BAD ); # DRY
my $types_re = presuf @types;
while ( my $comment = <DATA> ) {
chomp $comment;
last unless $comment =~ /\S/;
# capturing match in list context returns captured strings
my ($type) = ( $comment =~ /($types_re)/ );
push @{ $comments{$type} }, $comment;
}
for my $type ( @types ) {
print "$type comments:\n";
for my $comment ( @{ $comments{$type} } ) {
print $comment, "\n";
}
}
__DATA__
This dining experience was GOOD blah blah blah.
Overall this was a BAD time and me and my blah blah.
You could use regular expressions, e.g.:
if ($comments[$i] =~ /GOOD/) {
# good comment
}
or more generally
if ($comments[$i] =~ /\b([A-Z]{2,})\b/) {
print "Comment: $1\n";
}
Here, \b means a word boundary, the parentheses capture the matched text, [A-Z] is a character class of capital letters, and {2,} means there must be two or more characters from the preceding class.
I would store all your comments into a hash-of-arrays data structure, with the key being your capitalized word.
Here is a general solution to grab any capitalized word (assuming only one per comment), not just GOOD and BAD:
use strict;
use warnings;
my @comments = <DATA>;
chomp @comments;
my %data;
for (@comments) {
my $cap;
for (split) {
$cap = $_ if /^[A-Z]+$/;
}
if ($cap) { push @{ $data{$cap} }, $_ }
}
use Data::Dumper; print Dumper(\%data);
__DATA__
This is GOOD stuff
Here's some BAD stuff.
More of the GOOD junk.
Nothing here.
Here is the output:
$VAR1 = {
'BAD' => [
'Here\'s some BAD stuff.'
],
'GOOD' => [
'This is GOOD stuff',
'More of the GOOD junk.'
]
};
In my opinion, your best bet would be to create a disk-based database of some sort (SQLite?) that stores the comments and type as separate data.
Then use one of the other posted solutions to import your existing data into it.
The only problem here is that you need to learn Perl's DBI layer and a bit of SQL to use SQLite with Perl.
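For illustration, a rough sketch of that route with DBD::SQLite (the table layout and database file name are invented):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my @comments = (
    "This dining experience was GOOD blah blah blah.",
    "Overall this was a BAD time and me and my blah blah.",
);

my $dbh = DBI->connect("dbi:SQLite:dbname=comments.db", "", "",
    { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS comments (
        comment TEXT,
        type    TEXT
    )
});

my $insert = $dbh->prepare("INSERT INTO comments VALUES (?, ?)");
for my $comment (@comments) {
    my ($type) = $comment =~ /\b([A-Z]{2,})\b/;  # reuse the capitalized-word idea
    $type = 'UNKNOWN' unless defined $type;
    $insert->execute($comment, $type);
}

# Later: pull all the BAD comments without rescanning the raw text
my $bad = $dbh->selectcol_arrayref(
    "SELECT comment FROM comments WHERE type = ?", undef, 'BAD');
print "$_\n" for @$bad;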
Not sure what you mean by "organize" and "based on".
If you mean produce a list of the capitalized words, each with a list of the lines containing that word (similar to toolic's solution), you could do this:
my %CAPS;
for my $comment (@comments) {
    my ($word) = $comment =~ /(\b[A-Z]+\b)/;
    push @{ $CAPS{$word} }, $comment if defined $word;
}
This will build a mapping of WORDS to things, and the things in this case are going to be lists of lines.
And you can refer to these lists as $CAPS{'GOOD'} or $CAPS{'BAD'}, or $CAPS{whatever}.