How to retrieve the values of required keys in Perl [closed] - perl

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet.
Closed 10 years ago.
I have a hash with keys and values. How can I retrieve the values of the desired keys?
%a = qw(genea brain geneb heart genec kidney gened eye);
Now I want to retrieve the value for the keys genec and gened. How can I do this?

To get a list of the values for many keys at once, use a hash slice:
@lots_of_values = @hash{ @lots_of_keys };
Because the result is a list, you use the @ sigil even though you are accessing a hash; the values come back in the order of the keys you specified, with undef for any keys that don't exist in the hash.
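Applied to the hash from the question, that looks like this (a small sketch; the variable name %a follows the original post):
my %a = (genea => 'brain', geneb => 'heart', genec => 'kidney', gened => 'eye');
my @wanted = @a{'genec', 'gened'};   # a hash slice: ('kidney', 'eye')
print "@wanted\n";                   # prints: kidney eye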

It sounds like all you're asking is how to access elements of a hash. As Quentin indicates, this is trivially google-able.
The perldata doc covers basic questions, and perlfaq4 covers many other hash questions.
That said, to answer your question:
print $a{'genec'};
print $a{'gened'};
I also would not declare your hash in that way, as it's unclear what is a key and what is a value. Instead, consider:
my %a = ('genea' => 'brain', 'geneb' => 'heart'); # etc.

$GENEC = $a{genec};
$GENED = $a{gened};
Please get yourself a copy of Learning Perl. You'll be glad you did.

Related

Does Perl's Glob have a limitation?

I am running the following, expecting it to return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
    print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
    print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here? Is there a limit of some sort, or is there a way around this?
If it makes any difference, it returns the same result in both Perl 5.26 and Perl 5.28.
The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 26^5 (11_881_376) strings, each five chars long. So a list of ~12 million strings, with a (naive) total in excess of 56Mb ... plus the overhead for a scalar, which I think is at minimum 12 bytes or so. So on the order of 100Mb, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend writing an algorithm that generates and provides one list element at a time (an "iterator"), and working with that.
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
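For illustration, here is one way to hand-roll such an iterator without any modules (a minimal sketch; the helper name make_string_iter is made up for this example). It steps through the 26^5 combinations odometer-style, one string per call, so the full list never exists in memory:
use strict;
use warnings;

sub make_string_iter {
    my ($alphabet, $len) = @_;
    my @idx  = (0) x $len;    # one "digit" per position
    my $done = 0;
    return sub {
        return undef if $done;
        my $str = join '', @{$alphabet}[@idx];
        # advance the rightmost digit, carrying like an odometer
        my $i = $len - 1;
        while ($i >= 0 && ++$idx[$i] == @$alphabet) {
            $idx[$i--] = 0;
        }
        $done = 1 if $i < 0;
        return $str;
    };
}

my $next = make_string_iter([ 'a' .. 'z' ], 5);
while (defined( my $s = $next->() )) {
    print "$s\n";
}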
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)
† I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.
Update: A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.
Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;
my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );
while( my $item = $set->get ) {
    say join '', @$item;
}

Print 10 elements out of array [duplicate]

This question already has answers here:
What's the best way to get the last N elements of a Perl array?
(6 answers)
Closed 7 years ago.
I have an array that I've filled with 50 random numbers and then sorted numerically. Now let's say I want to print out the 10 biggest numbers (elements 40 to 49). I could say:
print($array[40]); print($array[41]); print($array[42]); # etc.
But is there a neater way to do it? Hope I'm making myself clear.
Cheers
You could loop over the indexes.
say $array[$_] for 40..49;
Offsets from the end make more sense here.
say $array[$_] for -10..-1;
You could also use an array slice.
say for @array[-10..-1];
To print them on one line, you can use join.
say join ', ', @array[-10..-1];
Try this:
print join(',', @array[-10..-1]), "\n";

Extract text, Matlab [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am trying to find a way to extract text in a specific and efficient way
as in this example:
'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234'
I want a way to get the Name, the Number and the Bank ID Number.
I want to write software that deals with different text files and get those required values. I may not know the exact order but it must be a template like a bank statement or something.
Thanks!
It's a bit hard to understand what exactly the problem is. If all you need to do is split strings, here's a possible way to do it:
str = 'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234';
tokenized = strsplit(str,' ');
Name = strjoin([tokenized(3:4)],' ');
Number = tokenized{9};
Account = tokenized{end};
Alternatively, for splitting you could use regexp(...,'split') or regexp(...,'tokens');
I think you want regular expressions for this. Here's an example:
str = 'Hello Mr. Jack Andrew , your number is 894Gfsf , and your Bank ID # 734234';
matches=regexp(str, 'your number is (\w+).*Bank ID # (\d+)', 'tokens');
matches{1}
ans =
'894Gfsf' '734234'
My suggestion would be to make a whole array of strings with sample patterns that you want to match, then build a set of regular expressions that collectively match all of your samples. Try each regexp in sequence until you find one that matches.
To do this, you will need to learn about regular expressions.

Find combinations of numbers that sum to some desired number

I need an algorithm that identifies all possible combinations of a set of numbers that sum to some other number.
For example, given the set {2,3,4,7}, I need to know all possible subsets that sum to x. If x == 12, the answer is {2,3,7}; if x == 7, the answers are {3,4} and {7} (i.e., two possible answers); and if x == 8 there is no answer. Note that, as these examples imply, numbers in the set cannot be reused.
This question was asked on this site a couple years ago but the answer is in C# and I need to do it in Perl and don't know enough to translate the answer.
I know that this problem is hard (see other post for discussion), but I just need a brute-force solution because I am dealing with fairly small sets.
sub Solve
{
    my ($goal, $elements) = @_;

    # For extra speed, you can remove this next line
    # if @$elements is guaranteed to be already sorted:
    $elements = [ sort { $a <=> $b } @$elements ];

    my (@results, $RecursiveSolve, $nextValue);

    $RecursiveSolve = sub {
        my ($currentGoal, $included, $index) = @_;

        for ( ; $index < @$elements; ++$index) {
            $nextValue = $elements->[$index];
            # Since elements are sorted, there's no point in trying a
            # non-final element unless it's less than goal/2:
            if ($currentGoal > 2 * $nextValue) {
                $RecursiveSolve->($currentGoal - $nextValue,
                                  [ @$included, $nextValue ],
                                  $index + 1);
            } else {
                push @results, [ @$included, $nextValue ]
                    if $currentGoal == $nextValue;
                return if $nextValue >= $currentGoal;
            }
        } # end for
    }; # end $RecursiveSolve

    $RecursiveSolve->($goal, [], 0);
    undef $RecursiveSolve; # Avoid memory leak from circular reference

    return @results;
} # end Solve

my @results = Solve(7, [2,3,4,7]);
print "@$_\n" for @results;
This started as a fairly direct translation of the C# version from the question you linked, but I simplified it a bit (and now a bit more, and also removed some unnecessary variable allocations, added some optimizations based on the list of elements being sorted, and rearranged the conditions to be slightly more efficient).
I've also now added another significant optimization. When considering whether to try using an element that doesn't complete the sum, there's no point if the element is greater than or equal to half the current goal. (The next number we add will be even bigger.) Depending on the set you're trying, this can short-circuit quite a bit more. (You could also try adding the next element instead of multiplying by 2, but then you have to worry about running off the end of the list.)
The rough algorithm is as follows:
have a "solve" function that takes in a list of numbers already included and a list of those not yet included.
This function will loop through all the numbers not yet included.
If adding that number hits the goal, record that set of numbers and move on.
If it is less than the target, recursively call the function with the included/excluded lists modified to account for the number you are looking at.
Otherwise, just go to the next step in the loop (since if you are over the target there is no point trying to add more numbers, unless you allow negative ones).
You call this function initially with your included list empty and your yet to be included list with your full list of numbers.
There are optimisations you can do with this such as passing the sum around rather than recalculating each time. Also if you sort your list initially you can do optimisations based on the fact that if adding number k in the list makes you go over target then adding k+1 will also send you over target.
Hopefully that will give you a good enough start. My Perl is unfortunately quite rusty.
Pretty much, though, this is a brute-force algorithm with a few shortcuts in it, so it's never going to be that efficient.
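A compact Perl sketch of the approach just described (the sub name subset_sums is made up for this example; it passes the running sum along rather than re-summing, and assumes all the numbers are positive):
sub subset_sums {
    my ($target, $numbers) = @_;
    my @found;
    my $recurse;
    $recurse = sub {
        my ($sum, $chosen, $rest) = @_;
        if ($sum == $target) {
            push @found, [ @$chosen ];
            return;
        }
        return if $sum > $target || !@$rest;   # overshot, or nothing left to add
        my ($head, @tail) = @$rest;
        $recurse->($sum + $head, [ @$chosen, $head ], \@tail);   # take $head
        $recurse->($sum,         $chosen,             \@tail);   # skip $head
    };
    $recurse->(0, [], $numbers);
    undef $recurse;   # break the closure's circular reference
    return @found;
}

print "@$_\n" for subset_sums(7, [2, 3, 4, 7]);   # prints: 3 4, then 7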
You can make use of the Data::PowerSet module which generates all subsets of a list of elements:
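For instance (a sketch assuming the module's new/next iterator interface; check the Data::PowerSet documentation for the details):
use strict;
use warnings;
use List::Util qw( sum0 );
use Data::PowerSet;

my $target  = 7;
my $subsets = Data::PowerSet->new( 2, 3, 4, 7 );
while ( my $subset = $subsets->next ) {
    print "@$subset\n" if @$subset && sum0(@$subset) == $target;
}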
Use Algorithm::Combinatorics. That way, you can decide ahead of time what size subsets you want to consider and keep memory use to a minimum. Apply some heuristics to return early.
#!/usr/bin/perl
use strict; use warnings;
use List::Util qw( sum );
use Algorithm::Combinatorics qw( combinations );
my @x = (1 .. 10);
my $target_sum = 12;

{
    use integer;
    for my $n ( 1 .. @x ) {
        my $iter = combinations(\@x, $n);
        while ( my $set = $iter->next ) {
            print "@$set\n" if $target_sum == sum @$set;
        }
    }
}
The numbers do blow up fairly rapidly: It would take thousands of days to go through all subsets of a 40 element set. So, you should decide on the interesting sizes of subsets.
Is this a 'do my homework for me' question?
To do this deterministically would need an algorithm of order N! (i.e. (N-0) * (N-1) * (N-2)...) which is going to be very slow with large sets of inputs. But the algorithm is very simple: work out each possible sequence of the inputs in the set and try adding up the inputs in the sequence. If at any point the sum matches, you've got one of the answers, save the result and move on to the next sequence. If at any point the sum is greater than the target, abandon the current sequence and move on to the next.
You could optimize this a little by deleting any of the inputs greater than the target. Another approach for optimization would be to take the first input I in the sequence and create a new sequence S1 without it, deduct I from the target T to get a new target T1, then check if T1 exists in S1; if it does, you've got a match, otherwise repeat the process with S1 and T1. The order is still N!, though.
If you needed to do this with a very large set of numbers then I'd suggest reading up on genetic algorithms.
C.
Someone posted a similar question a while ago and another person showed a neat shell trick to answer it. Here is a shell technique, but I don't think it is as neat a solution as the one I saw before (so I'm not taking credit for this approach). It's cute because it takes advantage of shell expansion:
for i in 0{,+2}{,+3}{,+4}{,+7}; do
    y=$(( $i ));  # evaluate expression
    if [ $y -eq 7 ]; then
        echo $i = $y;
    fi;
done
Outputs:
0+7 = 7
0+3+4 = 7

One million records in a log file for an online shopping website. Find distinct IP addresses [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
It is a technical test question. Given 1 million records in a log file. These are records of website hits of an online shopping website. The records are of type:
TimeStamp:Date,time ; IP address; ProductName
Find the distinct IP addresses and the most popular product. What is the most efficient way to do this? One solution is hashing. If the solution is hashing, please provide an explanation of how to hash this efficiently, since there are one million records.
I recently did something similar for homework; I'm not sure of the total number of lines, but it was quite a lot. The point is that your computer can probably do this very quickly, even if there are a million records.
I agree with you on the hashtable, and I would do the two questions slightly differently.
For the first, I would check every IP against the hashtable; if it exists, do nothing. If it does not exist, add it to the hashtable and increment a counter. At the end of the program, the counter will tell you how many unique IPs there were.
For the second, I would hash the product name and put that in the hashtable, incrementing the value associated with the hash key every time I found a match in the table. At the end, loop through all the keys and values of the hashtable and find the highest value. That is the most popular product.
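In Perl, that two-hash approach might look like this (a rough sketch; 'access.log' is a placeholder filename, and the fields are split on ';' as in the record format above):
use strict;
use warnings;

my (%seen_ip, %product_count);
open my $fh, '<', 'access.log' or die $!;   # placeholder filename
while (my $line = <$fh>) {
    chomp $line;
    my ($timestamp, $ip, $product) = split /;/, $line;
    $seen_ip{$ip}++;             # key existence gives uniqueness
    $product_count{$product}++;  # value gives popularity
}
my ($top) = sort { $product_count{$b} <=> $product_count{$a} } keys %product_count;
print scalar(keys %seen_ip), " distinct IP addresses\n";
print "Most popular product: $top\n";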
A million log records is really a very small number; just read them in and keep a set of the IP addresses and a dict from product names to number of mentions -- you don't mention any specific language constraint so I assume a language that will do (implicitly) excellent hashing of such strings on your behalf is acceptable (Perl, Python, Ruby, Java, C#, etc, all have fine facilities for the purpose).
E.g., in Python:
import collections
import heapq
ipset = set()
prodcount = collections.defaultdict(int)
numlines = 0
for line in open('logfile.txt', 'r'):
    timestamp, ip, product = line.strip().split(';')
    ipset.add(ip)
    prodcount[product] += 1
    numlines += 1
print "%d distinct IP in %d lines" % (len(ipset), numlines)
print "Top 10 products:"
top10 = heapq.nlargest(10, prodcount, key=prodcount.get)
for product in top10:
    print "%6d %s" % (prodcount[product], product)
Distinct IP addresses:
$ cut -f 2 -d \; | sort | uniq
Most popular product:
$ cut -f 3 -d \; | sort | uniq -c | sort -n
If you can do so, shell script it like that.
First of all, one million lines is not at all a huge file.
A simple Perl script can chew through a 2.7 million line file in 6 seconds, without having to think much about the algorithm.
In any case hashing is the way to go and, as shown, there's no need to bother with hashing over an integer representation.
If we were talking about a really huge file, then I/O would become the bottleneck and thus the hashing method gets less and less relevant as the file grows bigger.
Theoretically in a language like C it would probably be faster to hash over an integer than over a string, but I doubt that in a language suited to this task that would really make a difference. Things like how to read the file efficiently would matter much much more.
Code
vinko@parrot:~$ more hash.pl
use strict;
use warnings;
my %ip_hash;
my %product_hash;
open my $fh, "<", "log2.txt" or die $!;
while (<$fh>) {
    my ($timestamp, $ip, $product) = split (/;/,$_); #To fix highlighting
    $ip_hash{$ip} = 1 if (!defined $ip_hash{$ip});
    if (!defined $product_hash{$product}) {
        $product_hash{$product} = 1
    } else {
        $product_hash{$product} = $product_hash{$product} + 1;
    }
}

for my $ip (keys %ip_hash) {
    print "$ip\n";
}

my @pkeys = sort {$product_hash{$b} <=> $product_hash{$a}} keys %product_hash;
print "Top product: $pkeys[0]\n";
Sample
vinko@parrot:~$ wc -l log2.txt
2774720 log2.txt
vinko@parrot:~$ head -1 log2.txt
1;10.0.1.1;DuctTape
vinko@parrot:~$ time perl hash.pl
11.1.3.3
11.1.3.2
10.0.2.2
10.1.2.2
11.1.2.2
10.0.2.1
11.2.3.3
10.0.1.1
Top product: DuctTape
real 0m6.295s
user 0m6.230s
sys 0m0.030s
I also would read the file into a database, and would link it to another table of log filenames and date/time imported.
This is because in the real world you're going to need to do this regularly. The company is going to want to be able to check trends over time, so you're quickly going to be asked questions like "is that more or less unique IP addresses than last month's log file?" and "how are the most popular products changing from week to week".
In my experience the best way to answer these questions in an interview scenario as you describe is to show awareness of real-world situations. A tool to parse the log files (produced daily? weekly? monthly?) and read them into a database where some queries, graphs etc can pull all the data out, especially across multiple log files, will take a bit longer to write but be infinitely more useful and useable.
In the real world, if this is a once-off or occasional thing, I would simply insert the data into a database and run a few basic queries.
Since this is a homework assignment, though, that's probably not what the prof is looking for. An IP address is really just a 32-bit number. I could convert each IP to its 32-bit equivalent instead of creating a hash.
Since this is homework, the rest "is left as an exercise to the reader."
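For the conversion step itself, a common Perl idiom (shown here only as a sketch) packs the four octets into a network-order 32-bit integer:
my $ip     = '10.0.1.1';
my $as_int = unpack 'N', pack 'C4', split /\./, $ip;   # 167772417
printf "%s -> %u\n", $ip, $as_int;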
As other people have written, you just need two hashtables: one for IPs and one for products. You can count the occurrences in both, but you only care about the counts in the latter ("product popularity").
The key to hashing is having an efficient hash key, and hash efficiency means that the keys are evenly distributed. Poor key choice means that there will be many collisions, and performance of the hash table will suffer.
Being lazy, I'd be tempted to create a Dictionary<IPAddress,int> and hope that the implementor of the IPAddress class created the hash key appropriately.
Dictionary<IPAddress, int> ipAddresses = new Dictionary<IPAddress, int>();
Dictionary<string, int> products = new Dictionary<string,int>();
For the Products list, after you've setup the hash table, just use linq to select them out.
var sorted = from o in products
             orderby o.Value descending
             where o.Value > 1
             select new { ProductName = o.Key, Count = o.Value };
Just sort the file by each of the two fields of interest. That avoids any need to worry about hash functions and will work just fine on a million record set.
Sorting the IP addresses this way also makes it easy to extract other interesting information such as accesses from the same subnet.
Unique IPs:
$ awk -F\; '{print $2}' log.file | sort -u
count.awk
{ a[$0]++ }
END {
    for (key in a) {
        print a[key] " " key;
    }
}
Top 10 favorite items:
$ awk -F\; '{print $3}' log.file | awk -f count.awk | sort -r -n | head -10