speed up perl hashes in foreach loop and better algorithm


I have 2 hashes -> %a and %b.
Hash %a is from temp.txt
my %a = map {
    my $short = substr($_,12);
    $count++ => {$short => $_};
} @a;
my %b = map {
    $_ => $_;
} @b;
%a = (
'1' => {'We go lunch' => 'We go lunch 9 pm'},
'2' => {'We go break' => 'We go break 8 pm'},
'3' => {'We go lunchy' => 'We go lunchy 8 pm'}
);
%b = (
'We go lunch' => 'We go lunch',
'We go break' => 'We go break',
'We go lunchy' => 'We go lunchy'
);
foreach my $key (keys %a){
    foreach my $key2 (keys %{$a{$key}}){
        if(exists $b{$key2}){
            delete $a{$key}{$key2};
            delete $a{$key};
        }
    }
}
my @another;
foreach my $key ( sort {$a<=>$b} keys %a) {
    foreach my $key2 (keys %{$a{$key}}){
        $another[$count] = $a{$key}{$key2};
        $count++;
    }
}
How can I speed this up? Are my hashes weird? It took 30 seconds to output @another from the 25144 lines in temp.txt.
Is it necessary to make %a a hash of hashes?
The reason is that I want every entry in %a whose key exists in %b to be deleted.
I'm still learning Perl, so if you have a better way of doing this, possibly using map and grep, or a better algorithm, it would be very much appreciated.
Previous workaround
Every string in @b is shorter than the corresponding string in @a, but is still contained within it. I tried
foreach (@b) {
    my $source = $_;
    @another = grep !(/$source/i), @a;
}
but it still doesn't work. I was confused, so I came up with this hash of hashes in %a and built %b from @b just to get rid of every instance of a @b value in @a, which comes out as a weird hash. lol

There are a few unknowns here - how is %b built for example.
Otherwise, a few observations:
You should use another array instead of %a:
my @c = map {
    { "".substr($_,12) => $_ }
} @a;
If you already have %b defined, you could further optimize it by:
my @another = grep !exists $b{ substr($_,12) }, @a;
Hope this helps
Also, don't forget to always use strict; and use warnings; in the beginning of your program.
Explanations:
Your code puts everything in %a, traverses it and eliminates what shouldn't be there.
I think you could simply grep and keep in an array only the desired results.
The optimized code should become:
use strict;
use warnings;
my %b = (
    'We go lunch' => 'We go lunch',
    'We go break' => 'We go break',
    'We go lunchy' => 'We go lunchy'
);
my @a;    # add code that initially fills @a
my @another = grep { !exists $b{ substr($_,12) } } @a;
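Whether substr($_,12) produces the right lookup key depends on how the lines are laid out, since it returns everything from offset 12 onward. As a minimal sketch, assuming every line ends in a time such as " 9 pm" (an assumption taken from the sample data), the key can instead be derived by stripping that suffix before the hash lookup. The array @lines and the hash %exclude are hypothetical stand-ins for the contents of temp.txt and for %b:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from temp.txt
my @lines = (
    'We go breakfast 6 am',
    'We go lunch 9 pm',
    'We go break 8 pm',
    'We go lunchy 8 pm',
    'We go supper 7 pm',
);

# Only the keys matter for the lookup, so store a dummy value of 1
my %exclude = map { $_ => 1 } ('We go lunch', 'We go break', 'We go lunchy');

my @another = grep {
    # Assumption: every line ends in a time like " 9 pm"; strip it from a copy
    (my $key = $_) =~ s/\s+\d+\s*[ap]m$//;
    !exists $exclude{$key};
} @lines;

print "$_\n" for @another;
```

Each line costs one substitution and one hash lookup, so the work grows linearly with the number of lines rather than with lines times exclusions.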

It seems you are very confused. First of all, substr $_, 12 returns all the characters after the twelfth in the string, and so doesn't create the data structure you say it does. Secondly, you are using hash of hashes %a as an array of arrays, as the keys are integers without gaps in the sequence, and the value you are storing is a simple string pair.
The biggest problem for us is that you don't explain your goal in all this.
What it looks like is that you want to end up with the array @another containing all the lines from temp.txt that don't begin with any of the strings in @b. Is that about right?
I would do it by building a regular expression from array #b, and checking each line from the file as I read it.
This program demonstrates. I have renamed array @b to @exclude as the former is a terrible name for a variable. The regular expression is built by preceding each element of the array with ^ to anchor the regex at the beginning of the string, and appending \b to force a word boundary (so that, for instance, lunch doesn't match lunchy). Then all the elements are joined together using the | alternation operator, resulting in a regex that matches a string that starts with any of the lines in @exclude.
After that it is a simple matter to read through the file, check each line against the regex, and push onto #another those lines that don't match.
Note that, as it stands, the program reads from the DATA file handle so that I could include some test data in the source. You should change it by uncommenting the open line, and deleting the line my $fh = *DATA.
use strict;
use warnings;
#open my $fh, '<', 'temp.txt' or die $!;
my $fh = *DATA;
my @exclude = (
    'We go lunch',
    'We go lunchy',
    'We go break',
);
my $exclude_re = join '|', map "^$_\\b", @exclude;
my @another;
while (my $line = <$fh>) {
    chomp $line;
    push @another, $line unless $line =~ $exclude_re;
}
print "$_\n" for @another;
__DATA__
We go breakfast 6 am
We go lunch 9 pm
We go break 8 pm
We go lunchy 8 pm
We go supper 7 pm
output
We go breakfast 6 am
We go supper 7 pm
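If the excluded phrases could ever contain regex metacharacters, a slightly more defensive variant of the same idea escapes each phrase and compiles the pattern once. This is only a sketch: quotemeta and qr// are core Perl, while @lines is a hypothetical stand-in for the lines of temp.txt.

```perl
use strict;
use warnings;

my @exclude = ('We go lunch', 'We go lunchy', 'We go break');

# quotemeta escapes any regex metacharacters inside each phrase
my $alternatives = join '|', map { quotemeta($_) } @exclude;

# Compile once: anchored at the start, word boundary after the phrase
my $exclude_re = qr/^(?:$alternatives)\b/;

# Hypothetical stand-in for the lines read from temp.txt
my @lines = (
    'We go breakfast 6 am',
    'We go lunch 9 pm',
    'We go break 8 pm',
    'We go lunchy 8 pm',
    'We go supper 7 pm',
);

# Keep only the lines that do not start with an excluded phrase
my @another = grep { $_ !~ $exclude_re } @lines;

print "$_\n" for @another;
```

Grouping the alternatives inside (?: ... ) means the ^ anchor and the \b boundary are written once instead of once per phrase.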

Related

Perl: reading file into a hash and splitting, retrieving information

I have a file which has data like this:
1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";
I want to read it line by line into a hash, and then split it with regular expressions, so that I can count the number of schools.
so far i have:
my $schools;
open (SCHOOLS, "<$schools") or die ("Cannot open $schools");
while (<SCHOOLS>) {
chomp;
my ($val, $key) = split /(^\d)\s+\w+\s+\W+\s+\d+\s+\d+\s+\d+\.\s+\+\s+\.\s+.. and so on);
}
How do I get the values I've split into the hash, and then manipulate them so produce some basic statistics?

It's a bit unclear what you're after, but I will offer - you are doing things the hard way using a long regex to match the line. Also, for 'other things' it's quite hard to tell exactly what you have in mind. But grep is your friend, as it lets you specify search terms.
Something like this will do the trick. I've used a simplistic example for counting entries matching a particular criterion. Of course, given you've only given us one row, this is a bit of a guess:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my @entries;
my @keys = qw ( id thing state firstnum secondnum );
while ( <DATA> ) {
    my %attributes = m/(\w+) "(\w+)"/g;
    @attributes{@keys} = split;
    push @entries, \%attributes;
}
print Dumper \@entries;
print "count of things: ", scalar @entries, "\n";
print "There are ", (scalar grep { $_ -> {state} eq "state" } @entries), " things with a state of 'state'\n";
__DATA__
1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";
I'll also point out - it's much better form to use lexical filehandles with 3 arg open. E.g.
open ( my $schools, '<', 'schools.txt' ) or die $!;
while ( <$schools> ) {
#etc.
}
I'm using the special DATA filehandle (which reads the text after the __DATA__ marker) for illustrative purposes.
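To get from the parsed attributes to "the number of schools", one option is to tally the school_name attribute in a hash. The following is a rough sketch under assumptions: only the first row comes from the question, the other two rows are invented to have something to count, and %pupils_per_school is a hypothetical name.

```perl
use strict;
use warnings;

# The first row is the sample from the question; the other two are invented
my @rows = (
    '1 unknown state 3204563 3207049 . - . name "gosford"; school_name "gosford"; pupil_id "P15240"; transcript_id "NM_001011874.1"; tss_id "TSS13146";',
    '1 unknown state 3300000 3300500 . - . name "gosford"; school_name "gosford"; pupil_id "P15241";',
    '2 unknown state 4100000 4100900 . + . name "erina"; school_name "erina"; pupil_id "P20001";',
);

my %pupils_per_school;
for my $row (@rows) {
    # Collect every key "value" pair on the line into a hash
    my %attr = $row =~ /(\w+) "([^"]*)"/g;
    next unless defined $attr{school_name};
    $pupils_per_school{ $attr{school_name} }++;
}

printf "%d distinct schools\n", scalar keys %pupils_per_school;
print "$_: $pupils_per_school{$_}\n" for sort keys %pupils_per_school;
```

The value pattern here is [^"]* rather than \w+ so that values containing dots, such as the transcript_id, are captured as well.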

put next lines in an array after finding the matched pattern in Perl

I open a text report inside my Perl script and need to find the specific lines and store them in arrays.
this is my report which I need to process through:
matched pattern 1
line1:10
line2:20
line3:30
next matched pattern 2
line1:5
line2:10
line3:15
next matched pattern 3
lineA:A
lineB:B
lineC:C
.
.
------------------------------------
this part is my script:
@numbers;
@numbers2;
@letters;
while (<FILE>)
{
    if ($_ =~/matched pattern 1/ && $_ ne "\n")
    {
        chomp();
        push (@numbers,$_)
    }
    if ($_ =~/next matched pattern 2/ && $_ ne "\n")
    {
        chomp();
        push (@numbers2,$_)
    }
    if ($_ =~/next matched pattern 3/ && $_ ne "\n")
    {
        chomp();
        push (@letters,$_)
    }
}
then I can use numbers and letters inside the arrays.
this is a part of my report file
Maximum points per Lab
Lab1:10
Lab2:30
Lab3:20
Maximum points per Exam
Exam1:50
Exam2:50
Maximum points on Final
Final:150

What is your program supposed to be doing? Your current program is looking for the lines that have matched pattern and storing THOSE VERY LINEs into three different arrays. All other lines are ignored.
You show some sort of example output, but there's no real relationship between your output and input.
First, learn about references, so you don't need five different arrays. In my example, I use an array of arrays to store all of your separate files. If each file represents something else, you could use an array of hashes or a hash of arrays or a hash of hashes of arrays to represent this data in a unified structure. (Don't get me started on how you really should learn object oriented Perl. Get the hang of references first).
Also get a book on modern Perl and learn the new Perl syntax. It looks like your Perl reference is for Perl 4.0. Perl 5.0 has been out since 1994. There's a big difference between Perl 4 and Perl 5 in the way syntax is done.
use strict;
use warnings;
# Prints out your data structure
use Data::Dumper;
my $array_num;
my @array_of_arrays;
use constant {
    PATTERN => qr/matched pattern/,
};
while (my $line = <DATA>) {
    chomp $line;
    next if $line =~ /^\s*$/; #Skip blank lines
    if ($line =~ PATTERN) {
        if (not defined $array_num) {
            $array_num = 0;
        }
        else {
            $array_num++;
        }
        next;
    }
    push @{ $array_of_arrays[$array_num] }, $line;
}
print Dumper (\@array_of_arrays) . "\n";
__DATA__
matched pattern 1
line1:10
line2:20
line3:30
next matched pattern 2
line1:5
line2:10
line3:15
next matched pattern 3
lineA:A
lineB:B
lineC:C
OUTPUT. Each set of lines is in a different array:
$VAR1 = [
[
'line1:10',
'line2:20',
'line3:30'
],
[
'line1:5',
'line2:10',
'line3:15'
],
[
'lineA:A',
'lineB:B',
'lineC:C'
]
];
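The same idea extends to the "hash of hashes" mentioned above. As a hedged sketch using the report excerpt from the question, and assuming that any line without a colon is a section header (an assumption, since the full report format is not shown), each name:value line can be filed under the most recent header. The names @report and %points are hypothetical:

```perl
use strict;
use warnings;

# Report excerpt from the question, held in an array for the example
my @report = (
    'Maximum points per Lab',
    'Lab1:10',
    'Lab2:30',
    'Lab3:20',
    'Maximum points per Exam',
    'Exam1:50',
    'Exam2:50',
    'Maximum points on Final',
    'Final:150',
);

my %points;     # section header => { item name => value }
my $section;    # header of the section currently being read

for my $line (@report) {
    next if $line =~ /^\s*$/;               # skip blank lines
    if ($line =~ /^([^:]+):(\S+)$/) {
        # A name:value line belongs to the current section, if there is one
        $points{$section}{$1} = $2 if defined $section;
    }
    else {
        # Assumption: a line without a colon starts a new section
        $section = $line;
    }
}

for my $header (sort keys %points) {
    print "$header\n";
    print "  $_ = $points{$header}{$_}\n" for sort keys %{ $points{$header} };
}
```

Keying by header text rather than by position means a value can be looked up directly, for example $points{'Maximum points per Lab'}{Lab2}.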

@numbers;
@letters;
open FILE, "report2.txt" or die $!;
while (<FILE>)
{
    if ($_ =~/:(\d+)/ && $_ ne "\n")
    {
        chomp();
        push (@numbers,$1)
    }elsif ($_ =~/:(\w+)/ && $_ ne "\n")
    {
        chomp();
        push (@letters,$1)
    }
}
print "numbers: ", @numbers, "\n";
print "letters: ", @letters, "\n";

Revised for some best practices and my own style preferences (programmed for extensibility, as I always end up extending code, so I try to program in a generally extensible way):
# Things we search for
my %patterns = (
    foo => qr/^matched pattern 1/,
    bar => qr/^next matched pattern 2/,
    baz => qr/^next matched pattern 3/,
);
# Where we store matches, initialized to empty array refs
my %matches = map { $_ => [] } keys %patterns;
open(my $fh, '<', $file) or die $!;
my $current_match;
LINE: while (my $line = <$fh>) {
    # We never want empty lines, so skip them early
    next if $line eq "\n";
    # Check current line for matches, to note which bucket we are saving into
    for my $matchable (keys %patterns) {
        # Skip to next unless it matches
        next unless $line =~ $patterns{$matchable};
        # Set the current match and jump to next line:
        $current_match = $matchable;
        next LINE;
    }
    # If there's a current match found, save the line
    push( @{ $matches{$current_match} }, $line ) if $current_match;
}

How to count duplicate key and add all values of duplicate key together to make new hash with non duplicate key?

Hi I am new to perl and in a beginners stage Please Help
I am having a hash
%hash = { a => 2 , b=>6, a=>4, f=>2, b=>1, a=>1}
I want output as
a comes 3 times
b comes 2 times
f comes 1 time
a new hash should be
%newhash = { a => 7, b=>7,f =>2}
How can I do this?
To count the frequency of keys in hash i am doing
foreach $element(sort keys %hash) {
my $count = grep /$element/, sort keys %hash;
print "$element comes in $count times \n";
}
But by doing this I am getting the output as:
a comes 1 times
b comes 1 times
a comes 1 times
f comes 1 times
b comes 1 times
a comes 1 times
Which is not what I want.
How can I get the correct number of frequency of the duplicate keys? How can I add the values of those duplicate key and store in a new hash?

By definition, a hash can not have the same hash key in it multiple times. You probably want to store your initial data in a different data structure, such as a two-dimensional array:
use strict;
use warnings;
use Data::Dumper;
my @data = ( [ a => 2 ],
             [ b => 6 ],
             [ a => 4 ],
             [ f => 2 ],
             [ b => 1 ],
             [ a => 1 ],
           );
my %results;
for my $value (@data) {
    $results{$value->[0]} += $value->[1];
}
print Dumper %results;
# $VAR1 = 'a';
# $VAR2 = 7;
# $VAR3 = 'b';
# $VAR4 = 7;
# $VAR5 = 'f';
# $VAR6 = 2;
That said, other wrong things:
%hash = { a => 2 , b=>6, a=>4, f=>2, b=>1, a=>1}
You can't do this, it's assigning a hashref ({}) to a hash. Either use %hash = ( ... ) or $hashref = { ... }.
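To produce both outputs the question asks for (how often each key occurs, and the sum of its values), the pairs can be kept in a flat list and consumed two at a time. This is a small sketch; the names @pairs, %count and %sum are illustrative:

```perl
use strict;
use warnings;

# A flat list can hold repeated keys, which a hash cannot
my @pairs = (a => 2, b => 6, a => 4, f => 2, b => 1, a => 1);

my (%count, %sum);
# splice removes two elements per iteration; the loop ends when none are left
while (my ($key, $value) = splice @pairs, 0, 2) {
    $count{$key}++;          # how many times the key occurred
    $sum{$key} += $value;    # running total of its values
}

for my $key (sort keys %count) {
    printf "%s comes %d time%s, total %d\n",
        $key, $count{$key}, ($count{$key} == 1 ? '' : 's'), $sum{$key};
}
```

Note that splice empties @pairs as it goes; iterate over a copy if the original list is still needed afterwards.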

Sonam:
I've reedited your post in order to help format it for reading. Study the Markdown Editing Help Guide and that'll make your posts clearer and easier to understand. Here are a couple of hints:
Indent your code by four spaces. That tells Markdown to leave it alone and don't reformat it.
When you make a list, put an asterisk followed by a space in front of each item. Markdown understands it's a bulleted list and formats it as such.
Press "Edit" on your original post, and you can see what changes I made.
Now on to your post. I'm not sure I understand your data. If your data was in a hash, the keys would be unique. You can't have duplicate keys in a hash, so where is your data coming from?
For example, if you're reading it in from a file with two numbers on each line, you could do this:
use autodie;
use strict;
use warnings;
open (my $data_fh, "<", "$fileName");
my %hash;
while (my $line = <$data_fh>) {
chomp $line;
my ($key, $value) = split /\s+/, $line;
$hash{$key}++;
}
foreach my $key (sort keys %hash) {
print "$key appears $hash{$key} times\n";
}
The first three lines are Perl pragmas. They change the way Perl operates:
use autodie: This tells the program to die in certain circumstances such as when you try to open a file that doesn't exist. This way, I didn't have to check to see if the open statement worked or not.
use strict: This makes sure you have to declare your variables before using them which helps eliminate 90% of Perl bugs. You declare a variable most of the time using my. Variables declared with my last in the block where they were declared. That's why my %hash had to be declared before the while block. Otherwise, the variable would become undefined once that loop completes.
use warnings: This has Perl generate warnings in certain conditions. For example, you attempt to print out a variable that has no user set value.
The first loop simply goes line by line through my data and counts the number of occurrences of your key. The second loop prints out the results.

Why is first value of captured expression getting stored in fourth element in Perl?

I am storing information captured by a regex into an array. But for some reason the first value is getting stored in element 4 of the array. Any suggestions on what's going wrong and how to store the first value in the first element of the array?
The following is the script:
#!/usr/bin/perl
use strict;
my @value;
my $find= qr/^\s+([0-9]+)\s+([A-Z])/;
open (FILE, "</usr/test")|| die "cant open file";
my @body=<FILE>;
foreach my $line (@body){
    chomp $line;
    push @value, join('', $line =~ /$find/);
}
print "$value[0]\n"; #does not print anything
print "$value[4]\n"; #prints first value i.e 1389E
exit;
DATA
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes

Your second line has more than one space between the number group and the letter, so you probably want \s+ both times rather than \s the second time.
You won't necessarily know how many items you have in the @value array at the end, so you might want to put the printing into a for loop rather than assume you have a fifth item. (Maybe you know you want the first and fifth, however?) Follow-up: based on your edit, you have more than two entries after all. The version that I give below, using split and \s+ captures the number and letter for all the lines. I'll adjust the print part of the script to show you what I mean.
A few other things:
You should always enable warnings.
There's no reason to read the whole file into an array and then process through it line by line. Skip the @body array and just do what you need to in the while loop.
Use the more modern form of open with lexical filehandles and three arguments.
split seems more straightforward here to me, rather than a regular expression with captures. Since you want to capture two specific parts of the line, you can use split with an array slice to grab those two items and feed them to join.
@value is not an especially helpful variable name, but I think you should at least make it plural. It's a good habit to get into, I think, insofar as the array stores your plural records. (That's not a hard and fast rule, but it bugged me here. This point is pretty minor.)
Here's how all this might look:
#!/usr/bin/env perl
use warnings;
use strict;
my @values;
# open my $filehandle, '<', '/usr/test'
# or die "Can't open /usr/test: $!";
while (my $line = <DATA>) {
    chomp $line;
    push @values, join('', (split /\s+/, $line)[1..2]);
}
for my $record (@values) {
    print $record, "\n";
}
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes

I think you should be writing
print "$value[0]\n";
print "$value[4]\n";
to access elements of an array.

You should use lexical file handles and the three argument form of open as well as avoiding slurping files unnecessarily.
In any case, the most likely reason for your problem is a single character missing from your pattern. Compare the one below with the one you have above.
#!/usr/bin/perl
use strict;
use warnings;
my @value;
my $find= qr/^\s+([0-9]+)\s+([A-Z])/;
while ( my $line = <DATA> ) {
    last unless $line =~ /\S/;
    push @value, join '', $line =~ $find;
}
use Data::Dumper;
print Dumper \@value;
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes

Do you have leading whitespace lines, or other leading lines in your data that don't match your regexp? Since you're unconditionally push()-ing onto your output array, regardless of whether your regexp matches, you'll get blank array elements for every non-matching line in your input.
Observe:
#!/usr/bin/perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
    push @lines , ( join( '' , /^\s*(\d+)/ ));
}
foreach ( 0 .. $#lines ) {
    print "$_ -> $lines[$_]\n";
}
__DATA__
FOO
Bar
Baz
1234
456
bargle
Output:
0 ->
1 ->
2 ->
3 -> 1234
4 -> 456
5 ->
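One way to avoid those blank elements is to push only when the pattern actually matched. The following is a minimal sketch; @input is made-up sample data mixing matching and non-matching lines, and the leading whitespace in it is an assumption about the real file:

```perl
use strict;
use warnings;

# Made-up input: header junk, two data lines, a blank line, trailing junk
my @input = ('FOO', '  1389 E not', ' 188  S yes', '', 'bargle');

my @values;
for my $line (@input) {
    # The list assignment is false when the match fails, so nothing is pushed
    if (my ($number, $letter) = $line =~ /^\s*([0-9]+)\s+([A-Z])/) {
        push @values, "$number$letter";
    }
}

print "$_ -> $values[$_]\n" for 0 .. $#values;
```

With this guard the first captured record always lands in element 0, however many non-matching lines precede it.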

How can I convert these strings to a hash in Perl?

I wish to convert a single string with multiple delimiters into a key=>value hash structure. Is there a simple way to accomplish this? My current implementation is:
sub readConfigFile() {
    my %CONFIG;
    my $index = 0;
    open(CON_FILE, "config");
    my @lines = <CON_FILE>;
    close(CON_FILE);
    my @array = split(/>/, $lines[0]);
    my $total = @array;
    while($index < $total) {
        my @arr = split(/=/, $array[$index]);
        chomp($arr[1]);
        $CONFIG{$arr[0]} = $arr[1];
        $index = $index + 1;
    }
    while ( ($k,$v) = each %CONFIG ) {
        print "$k => $v\n";
    }
    return;
}
where 'config' contains:
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
The last digits need to be also removed, and kept in a separate key=>value pair whose name can be 'ip'. (I have not been able to accomplish this without making the code too lengthy and complicated).

What is your configuration data structure supposed to look like? So far the solutions only record the last line because they are stomping on the same hash keys every time they add a record.
Here's something that might get you closer, but you still need to figure out what the data structure should be.
I pass in the file handle as an argument so my subroutine isn't tied to a particular way of getting the data. It can be from a file, a string, a socket, or even the stuff below __DATA__ in this case.
Instead of fixing things up after I parse the string, I fix the string to have the "ip" element before I parse it. Once I do that, the "ip" element isn't a special case and it's just a matter of a double split. This is a very important technique to save a lot of work and code.
I create a hash reference inside the subroutine and return that hash reference when I'm done. I don't need a global variable. :)
use warnings;
use strict;
use Data::Dumper;
sub readConfigFile
{
    my( $fh ) = shift;
    my $hash = {};
    while( <$fh> )
    {
        chomp;
        s/\s+(\d*\.\d+)$/>ip=$1/;
        $hash->{ $. } = { map { split /=/ } split />/ };
    }
    return $hash;
}
my $hash = readConfigFile( \*DATA );
print Dumper( $hash );
__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This gives you a data structure where each line is a separate record. I choose the line number of the record ($.) as the top-level key, but you can use anything that you like.
$VAR1 = {
'1' => {
'ip' => '6.00',
'rec' => '0',
'adv' => '1234 123 4.5',
'pub' => '3',
'size' => '3'
},
'2' => {
'ip' => '.76',
'rec' => '1',
'adv' => '111 22 3456',
'pub' => '1',
'size' => '2'
}
};
If that's not the structure you want, show us what you'd like to end up with and we can adjust our answers.

I am assuming that you want to read and parse more than 1 line. So, I chose to store the values in an AoH.
#!/usr/bin/perl
use strict;
use warnings;
my @config;
while (<DATA>) {
    chomp;
    push @config, { split /[=>]/ };
}
for my $href (@config) {
    while (my ($k, $v) = each %$href) {
        print "$k => $v\n";
    }
}
__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This results in the printout below. (The while loop above reads from DATA.)
rec => 0
adv => 1234 123 4.5 6.00
pub => 3
size => 3
rec => 1
adv => 111 22 3456 .76
pub => 1
size => 2
Chris

The below assumes the delimiter is guaranteed to be a >, and there is no chance of that appearing in the data.
I simply split each line based on '>'. The last value will contain a key=value pair, then a space, then the IP, so split this on / / exactly once (limit 2) and you get the k=v and the IP. Save the IP to the hash and keep the k=v pair in the array, then go through the array and split k=v on '='.
Fill in the hashref and push it to your higher-scoped array. This will then contain your hashrefs when finished.
(Having loaded the config into an array)
my @hashes;
for my $line (@config) {
    my $hash; # config line will end up here
    my @pairs = split />/, $line;
    # Do the ip first. Split the last element of @pairs and put the second half into the
    # hash, overwriting the element with the first half at the same time.
    # This means we don't have to do anything special with the for loop below.
    ($pairs[-1], $hash->{ip}) = (split / /, $pairs[-1], 2);
    for (@pairs) {
        my ($k, $v) = split /=/;
        $hash->{$k} = $v;
    }
    push @hashes, $hash;
}

The config file format is sub-optimal, shall we say. That is, there are easier formats to parse and understand. [Added: but the format is already defined by another program. Perl is flexible enough to deal with that.]
Your code slurps the file when there is no real need.
Your code only pays attention to the last line of data in the file (as Chris Charley noted while I was typing this up).
You also have not allowed for comment lines or blank lines - both are a good idea in any config file and they are easy to support. [Added: again, with the pre-defined format, this is barely relevant, but when you design your own files, do remember it.]
Here's an adaptation of your function into somewhat more idiomatic Perl.
#!/bin/perl -w
use strict;
use constant debug => 0;
sub readConfigFile()
{
my %CONFIG;
open(CON_FILE, "config") or die "failed to open file ($!)\n";
while (my $line = <CON_FILE>)
{
chomp $line;
$line =~ s/#.*//; # Remove comments
next if $line =~ /^\s*$/; # Ignore blank lines
foreach my $field (split(/>/, $line))
{
my @arr = split(/=/, $field);
$CONFIG{$arr[0]} = $arr[1];
print ":: $arr[0] => $arr[1]\n" if debug;
}
}
close(CON_FILE);
while (my($k,$v) = each %CONFIG)
{
print "$k => $v\n";
}
return %CONFIG;
}
readConfigFile; # Ignores returned hash
Now, you need to explain more clearly what the structure of the last field is, and why you have an 'ip' field without the key=value notation. Consistency makes life easier for everybody. You also need to think about how multiple lines are supposed to be handled. And I'd explore using a more orthodox notation, such as:
pub=3;rec=0;size=3;adv=(1234,123,4.5);ip=6.00
Colon or semi-colon as delimiters are fairly conventional; parentheses around comma separated items in a list are not an outrageous convention. Consistency is paramount. Emerson said "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines", but consistency in Computer Science is a great benefit to everyone.
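To illustrate why the more orthodox notation is easier to handle, here is a rough sketch of a parser for the suggested line format. It assumes that fields never contain a semicolon and that a parenthesised value is always a plain comma-separated list; both are assumptions, as no such format is actually defined in the question.

```perl
use strict;
use warnings;

# The suggested notation from above
my $line = 'pub=3;rec=0;size=3;adv=(1234,123,4.5);ip=6.00';

my %config;
for my $field (split /;/, $line) {
    # Limit of 2 so that any '=' inside the value is left alone
    my ($key, $value) = split /=/, $field, 2;

    # Assumption: "(a,b,c)" denotes a list; store it as an array reference
    if ($value =~ /^\((.*)\)$/) {
        $value = [ split /,/, $1 ];
    }
    $config{$key} = $value;
}

for my $key (sort keys %config) {
    my $value = $config{$key};
    my $shown = ref $value ? join(' ', @$value) : $value;
    print "$key => $shown\n";
}
```

Because ip is an ordinary key=value field here, it needs no special handling, which is the consistency argument made above.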

Here's one way.
foreach ( @lines ) {
chomp;
my %CONFIG;
# Extract the last digit first and replace it with an end of
# pair delimiter.
s/\s*([\d\.]+)\s*$/>/;
$CONFIG{ip} = $1;
while ( /([^=]*)=([^>]*)>/g ) {
$CONFIG{$1} = $2;
}
print Dumper ( \%CONFIG );
}
