What is a "hash function generator"?

I have heard and read this term many times, but cannot understand it. The name implies it should "generate a hash function", and I naively imagine it generating source code in C, for example. I searched the web, here on Stack Overflow, and Wikipedia, but cannot find a good definition or any examples.

From Wikipedia:
A perfect hash function for a set S is a hash function that maps
distinct elements in S to a set of integers, with no collisions. A
perfect hash function has many of the same applications as other hash
functions, but with the advantage that no collision resolution has to
be implemented.
If you know your keys in advance, you can construct such a perfect hash function. Programs that do so are called perfect hash function generators.
One example is GNU gperf, which works like you suggested, taking in a list of keys and printing out C source code.

A hash function generator is a tool for finding a hash function meeting certain criteria. Its output can be in any form that unambiguously describes the hash function, usually in the form of source code in some programming language.
Examples
Perfect hash function
Given a set of distinct strings (for example {"banana", "peach", "pineapple", "apple", "microsoft", "pinemicrosoft"}), find a hash function that will map them to distinct integer values. For example:
"banana" => 6
"peach" => 2
"pineapple" => 123
"apple" => 3
"microsoft" => 77
"pinemicrosoft" => 451
There is no restriction on what the hash function may return for an input string that doesn't belong to our predefined set.
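As a rough illustration of what such a generator does internally, the sketch below brute-forces a seed until a seeded hash sends every key to a different slot. It is written in Python for brevity. The names `seeded_hash` and `find_perfect_seed`, the SHA-256-based hash, and the table size of 17 are all assumptions made for this example; real tools such as gperf use far more refined search strategies.

```python
import hashlib


def seeded_hash(text: str, seed: int, table_size: int) -> int:
    """Hash `text` into range(table_size); `seed` picks one function from a family."""
    digest = hashlib.sha256(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def find_perfect_seed(keys, table_size, max_tries=100_000):
    """Return the first seed that maps all keys to distinct slots (hypothetical helper)."""
    for seed in range(max_tries):
        slots = {seeded_hash(key, seed, table_size) for key in keys}
        if len(slots) == len(keys):  # no two keys share a slot
            return seed
    raise ValueError("no collision-free seed found; try a larger table_size")


KEYS = ["banana", "peach", "pineapple", "apple", "microsoft", "pinemicrosoft"]
TABLE_SIZE = 17  # arbitrary choice for this sketch; must be >= len(KEYS)
SEED = find_perfect_seed(KEYS, TABLE_SIZE)

for key in KEYS:
    print(key, "=>", seeded_hash(key, SEED, TABLE_SIZE))
```

Strings outside `KEYS` may well collide with these values, which is consistent with the remark that their results are unrestricted.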
Minimal perfect hash function
Similar to above, but the hash values must form a contiguous range.
"banana" => 1
"peach" => 2
"pineapple" => 3
"apple" => 4
"microsoft" => 5
"pinemicrosoft" => 6
The simplest implementation meeting the functional requirements for a minimal perfect hash function is to store a sorted array of the target strings internally, look up the input value in that array, and return its index.
The drawbacks of such an implementation are that it consumes storage and slows down as the size of the target input set grows. So an additional requirement on the hash function is to minimize its size and running time.
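The sorted-array approach described above can be sketched in a few lines of Python. The function name `minimal_perfect_hash` and the decision to raise `KeyError` for unknown strings are assumptions of this sketch, not part of the definition.

```python
from bisect import bisect_left

# Target strings stored in sorted order so binary search can be used.
SORTED_KEYS = sorted(
    ["banana", "peach", "pineapple", "apple", "microsoft", "pinemicrosoft"]
)


def minimal_perfect_hash(text: str) -> int:
    """Return the 1-based position of `text` in SORTED_KEYS.

    Strings outside the predefined set raise KeyError here; that is a choice
    made for this sketch, since the definition leaves their result open.
    """
    index = bisect_left(SORTED_KEYS, text)
    if index == len(SORTED_KEYS) or SORTED_KEYS[index] != text:
        raise KeyError(text)
    return index + 1


for key in SORTED_KEYS:
    print(key, "=>", minimal_perfect_hash(key))
```

The values follow alphabetical order rather than the order used in the example mapping, but they still cover the contiguous range 1 to 6, which is all the definition requires.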
Classifier
Given a set of distinct strings grouped into non-overlapping subsets, find a hash function that will map each string to the index of the subset it belongs to.
For example:
any of {"banana", "peach", "apple"} => 1 // fruit
any of {"lion", "zebra", "dog", "eagle"} => 2 // animal
any of {"red", "green", "blue", "white"} => 3 // color
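For reference, the behaviour such a classifier must reproduce can be written as a plain lookup table in Python. This is only a specification of the expected results, with the hypothetical helper `build_classifier`; an actual generator would search for a compact function that gives the same answers without storing every string.

```python
GROUPS = {
    1: ["banana", "peach", "apple"],        # fruit
    2: ["lion", "zebra", "dog", "eagle"],   # animal
    3: ["red", "green", "blue", "white"],   # color
}


def build_classifier(groups):
    """Build a string -> group index table, rejecting overlapping groups."""
    table = {}
    for index, members in groups.items():
        for member in members:
            if member in table:
                raise ValueError(f"{member!r} appears in more than one group")
            table[member] = index
    return table


CLASSIFIER = build_classifier(GROUPS)
print(CLASSIFIER["zebra"])
```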

Print the hash elements by grouping their values in Raku

I keep a record of how many times each letter occurs in a word, e.g. 'embeddedss':
my %x := {e => 3, m => 1, b => 1, d => 3, s => 2};
I'd like to print the elements by grouping their values like this:
# e and d 3 times
# m and b 1 times
# s 2 times
How can I do this practically, i.e. without writing explicit loops (if that is possible)?
Optional: Before printing the hash, I'd like to convert it and assign it to a temporary data structure such as ( <3 e d>, <1 m b>, <2 s> ) and then print that. What would be the most practical data structure and way to print it?
Using .categorize as suggested in the comments, you can group them together based on the value.
%x.categorize(*.value)
This produces a Hash whose keys are the values used for categorization and whose values are lists of the Pairs from your original Hash (%x). It can be looped over using for or .map. The letters you originally counted are the keys of those Pairs, which can be neatly extracted using .map.
for %x.categorize(*.value) {
say "{$_.value.map(*.key).join(' and ')} {$_.key} times";
}
Optionally, you can also sort the List by occurrences using .sort. This will sort the lowest number first, but adding a simple .reverse will make the highest value come first.
for %x.categorize(*.value).sort.reverse {
say "{$_.value.map(*.key).join(' and ')} {$_.key} times";
}
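For readers who do not use Raku, the grouping that .categorize performs can be sketched in Python. This is an illustration of the idea only, not a translation of the Raku internals; the variable names are made up for the example.

```python
from collections import defaultdict

counts = {"e": 3, "m": 1, "b": 1, "d": 3, "s": 2}

# Group letters under their count, mirroring what .categorize(*.value) builds.
grouped = defaultdict(list)
for letter, count in counts.items():
    grouped[count].append(letter)

# Highest count first, as in the .sort.reverse variant.
lines = [
    f"{' and '.join(letters)} {count} times"
    for count, letters in sorted(grouped.items(), reverse=True)
]
print("\n".join(lines))
```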

Give value, return field name in matlab structure

I have a Matlab structure like this:
Columns.T21=6;
Columns.ws21=9;
Columns.wd21=10;
Columns.u21=11;
Is there some elegant way I can give the value and get back the field name? For instance, if I give 6, it would return 'T21'. I know that fieldnames() will return all the field names, but I want the field name for a specific value. Many thanks!
Assuming that the structure contains fields with scalar numeric values, you can use this struct2array based approach -
search_num = 6; %// Edit this for a different search number
fns=fieldnames(Columns) %// Get field names
out = fns(struct2array(Columns)==search_num) %// Logically index into names to find
%// the one that matches our search
Goal:
Construct two vectors from your struct, one for the names of the fields and the other for their respective values. This is analogous to the dict in Python or map in C++, where unique keys are mapped to possibly non-unique values.
Simple Solution:
You can do this very simply using the various functions defined for struct in Matlab, namely struct2cell() and cell2mat().
For the particular element of interest, say element 1 of your struct Columns, get the names of all fields in the form of a cell array, using the fieldnames() function:
fields = fieldnames( Columns(1) )
Similarly, get the values of all the fields of that element of Columns, in the form of a matrix
vals = cell2mat( struct2cell( Columns(1) ) )
Next, find the field with the corresponding value, say 6 here, using the find function and convert the resulting 1x1 cell into a char using cell2mat() function :
cell2mat( fields( find( vals == 6 ) ) )
which will yield:
T21
Now, you can define a function that does this for you, e.g.:
function fieldname = getFieldForValue( myStruct, value)
Advanced Solution using Map Container Data Abstraction:
You can also choose to define an object of the containers.Map class using the field names of your struct as the keySet and the values as the valueSet.
myMap = containers.Map( fieldnames( Columns(1) ), struct2cell( Columns(1) ) );
This allows you to get keys and values using corresponding built-in functions:
myMapKeys = keys(myMap);
myMapValues = values(myMap);
Now, you can find all the keys corresponding to a particular value, say 6 in this case:
cell2mat( myMapKeys( find( cell2mat(myMapValues) == 6 ) )' )
which again yields:
T21
Caution: This method, or for that matter any method for doing so, will only work if all the fields have values of the same type, because the matrix into which we are converting vals needs to have a uniform type for all its elements. But I assume from your example that this would always be the case.
Customized function/ logic:
A struct consists of elements that contain fields which have values, all in that order. An element is thus a key for which a field is a value. The essence of "lookup" is to find values (which are non-unique) for specific keys (which are unique), and Matlab has a built-in way of doing so. But what you want is the other way around, i.e. to find keys for specific values. Since it's not a typical use case, you need to write your own logic or function for it.
Suppose your structure is called S. First extract all the field names into an array:
fNames=fieldnames(S);
Now define the following anonymous function in your code:
myfun=@(yourArray,desiredValue) yourArray==desiredValue;
Then you can get the desired field name as:
desiredFieldIndex=myfun(structfun(@(x) x,S),3) %desired value is 3 (say)
desiredFieldName=fNames(desiredFieldIndex)
Alternative using containers.Map
This assumes each field in the structure contains one scalar value, as in the question (not an array).
The aim is to create a Map object with the field values as keys and the field names as values:
myMap = containers.Map(struct2cell(Columns),fieldnames(Columns))
Now, to get the field name for a value, index into myMap with that value:
myMap(6)
ans =
T21
This has the advantage that if the structure doesn't change you can repeatedly use myMap to find other value-field name pairs
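The two strategies in these answers (scan all fields once, or build an inverted map for repeated queries) are language-independent. Here is a minimal Python sketch of both, with a dict standing in for the Matlab struct; the names `matches` and `by_value` are made up for the example.

```python
columns = {"T21": 6, "ws21": 9, "wd21": 10, "u21": 11}

# One-off search: scan every field and keep the names whose value matches.
matches = [name for name, value in columns.items() if value == 6]

# Repeated searches: invert the mapping once. Values need not be unique,
# so each value maps to a list of field names.
by_value = {}
for name, value in columns.items():
    by_value.setdefault(value, []).append(name)

print(matches)
print(by_value[10])
```

Mapping each value to a list of names keeps the inverted table correct even when two fields share a value, a case the single-valued Map above would silently overwrite.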

Perl array of hash structures

This is a design setup question. I know that in Perl there are no arrays of arrays. I am looking at reading in data from large text files at phases of something in flight. Each of these phases tracks different variables (and different numbers of them). I have to store them because in the second part of the script I am rewriting them into another file that I am updating as I read in.
At first I thought I should have an array of hashes; however, the variables are not the same at each phase. Then I thought maybe an array holding the names of several arrays (an array of references, I guess).
Data example would be similar to
phase 100.00 mass 0.9900720175260495E+005
phase 240.00 gcrad 61442116.0 long 0.963710076E+003 gdalt 0.575477727E+002 vell 0.9862937759999998E+002
The data is made up, but you should get the idea. There would be many phases, and each phase would likely have between 1 and 25 variables.
You can use Arrays of Arrays in Perl. You can find documentation on Perl data structures including Arrays of Arrays here: http://perldoc.perl.org/perldsc.html. That said, looking at the sample you've provided it looks like what you need is an Array of Hashes. Perhaps something like this:
my @data = (
{ phase => 100.00,
mass => 0.9900720175260495e005 },
{ phase => 240.00,
gcrad => 61442116.0,
long => 0.963710076e003,
gdalt => 0.575477727e002,
vell => 0.9862937759999998e002 },
);
to access the data you would use:
$data[0]->{phase} # => 100.00
You could also use a Hash of Hashes like this:
my %data = (
name1 => {
phase => 100.00,
mass => 0.9900720175260495e005,
},
name2 => {
phase => 240.00,
gcrad => 61442116.0,
long => 0.963710076e003,
gdalt => 0.575477727e002,
vell => 0.9862937759999998e002,
},
);
to access the data you would use:
$data{name1}->{phase} # => 100.00
A great resource for learning how to implement advanced data structures and algorithms in Perl is the book, Mastering Algorithms in Perl
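To show how the sample lines map onto an array of hashes, here is a small parsing sketch in Python (chosen only for illustration; the same shape carries over to Perl). It assumes each line is a whitespace-separated sequence of name/value pairs with numeric values, which is a guess based on the two sample lines in the question. The function name `parse_phase_line` is hypothetical.

```python
def parse_phase_line(line: str) -> dict:
    """Turn 'name value name value ...' into {name: float(value), ...}."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected name/value pairs, got: {line!r}")
    names, values = tokens[0::2], tokens[1::2]
    return {name: float(value) for name, value in zip(names, values)}


sample = [
    "phase 100.00 mass 0.9900720175260495E+005",
    "phase 240.00 gcrad 61442116.0 long 0.963710076E+003 "
    "gdalt 0.575477727E+002 vell 0.9862937759999998E+002",
]

# One dict per phase; each may hold a different set of variables.
phases = [parse_phase_line(line) for line in sample]
print(phases[0]["phase"])
print(sorted(phases[1]))  # variable names recorded for the second phase
```

Because each phase is its own dict, it does not matter that the phases track different variables or different numbers of them.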
I use the following mnemonic when defining arrays, array references and hash references:
Use parentheses for lists -- lists can be assigned to either arrays or hashes:
my %person = (
given_name => 'Zaphod',
surname => 'Beeblebrox'
);
or
my @rainbow = (
'red',
'orange',
'yellow',
'green',
'blue',
'indigo',
'violet'
);
Because the lists are assigned to list types -- array and hash, there is no semantic ambiguity. When you deal with array references or hash references, however, the delimiter must distinguish between the types of reference, because the $ sigil for scalar variables can't be used to distinguish between the two types of reference. Therefore, [] is used to denote array references, just as [] is used to dereference arrays, and {} is used to denote hash references, just as {} is used to dereference hashes.
So an array of arrayrefs looks like this:
my @AoA = (
[1,2,3],
[4,5,6],
[7,8,9]
);
An array of hashrefs:
my @AoH = (
{ given_name => 'Ford', surname => 'Prefect' },
{ given_name => 'Arthur', surname => 'Dent' }
);
A hashref assigned to a scalar variable:
my $bones = {
head => 'skull',
jaw => 'mandible',
'great toe' => 'distal phalanx'
};

What is the best way to store 1key - 3 value in Perl?

I have a situation where I have 3 different values for each key. I have to print the data like this:
K1 V1 V2 V3
K2 V1 V2 V3
…
Kn V1 V2 V3
Is there an alternative, more efficient and easier way to achieve this other than those listed below? I am thinking of 2 approaches:
1. Maintain 3 hashes for 3 different values for each key.
Iterate through one hash based on the key and get the values from other 2 hashes
and print it.
Hash 1 - K1-->V1 ...
Hash 2 - K1-->V2 ...
Hash 3 - K1-->V3 ...
2. Maintain a single hash with key to reference to array of values.
Here I need to iterate and read only 1 hash.
K1 --> Ref{V1,V2,V3}
EDIT1:
The main challenge is that, the values V1, V2, V3 are derived at different places and cannot be pushed together as the array. So if I make the hash value as a reference to array, I have to dereference it every time I want to add the next value.
E.g., I am in subroutine1 - I populated Hash1 - K1-->[V1]
I am in subroutine2 - I have to de-reference [V1], then push V2. So now the hash
becomes K1-->[V1 V2], V3 is added in another routine. K1-->[V1 V2 V3]
EDIT2:
Now I am facing another challenge. I have to sort the hash based on V3.
Is it still feasible to store the hash as a key and a list reference?
K1-->[V1 V2 V3]
It really depends on what you want to do with your data, although I can't imagine your option 1 being convenient for anything.
Use a hash of arrays if you are happy referring to your V1, V2, V3 using indexes 0, 1, 2 or if you never really want to handle their values separately.
my %data;
$data{K1}[0] = V1;
$data{K1}[1] = V2;
$data{K1}[2] = V3;
or, of course
$data{K1} = [V1, V2, V3];
As an additional option, if your values mean something nameable you could use a hash of hashes, so
my %data;
$data{K1}{name} = V1;
$data{K1}{age} = V2;
$data{K1}{height} = V3;
or
@{$data{K1}}{qw/ name age height /} = (V1, V2, V3);
Finally, if you never need access to the individual values, it would be fine to leave them as they are in the file, like this
my %data;
$data{K1} = "V1 V2 V3";
But as I said, the internal storage is mostly dependent on how you want to access your data, and you haven't told us about that.
Edit
Now that you say
The main challenge is that, the values V1, V2, V3 are derived at
different places and cannot be pushed together as the array
I think perhaps the hash of hashes is more appropriate, but I wouldn't worry at all about dereferencing as it is an insignificant operation as far as execution time is concerned. But I wouldn't use push as that restricts you to adding the data in the correct order.
Depending which you prefer, you have the alternatives of
$data{K1}[2] = V3;
or
$data{K1}{height} = V3;
and clearly the latter is more readable.
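The point about not depending on insertion order can be illustrated with a short sketch, written in Python purely for illustration. The three functions stand in for the separate subroutines mentioned in the question, and their names and the sample values are hypothetical.

```python
from collections import defaultdict

# Every key gets its own inner dict the first time it is touched.
data = defaultdict(dict)


def record_name(key, name):        # hypothetical "subroutine1"
    data[key]["name"] = name


def record_age(key, age):          # hypothetical "subroutine2"
    data[key]["age"] = age


def record_height(key, height):    # hypothetical "subroutine3"
    data[key]["height"] = height


# The calls can arrive in any order, unlike push onto an array.
record_height("K1", 64)
record_name("K1", "ABC")
record_age("K1", 99)
print(dict(data))
```

Each routine writes to a named slot, so none of them needs to know whether the others have already run.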
Edit 2
As requested, to sort a hash of hashes by the third value (height in my example) you would write
use strict;
use warnings;
my %data = (
K1 => { name => 'ABC', age => 99, height => 64 },
K2 => { name => 'DEF', age => 12, height => 32 },
K3 => { name => 'GHI', age => 56, height => 9 },
);
for (sort { $data{$a}{height} <=> $data{$b}{height} } keys %data) {
printf "%s => %s %s %s\n", $_, @{$data{$_}}{qw/ name age height /};
}
or, if the data was stored as a hash of arrays
use strict;
use warnings;
my %data = (
K1 => [ 'ABC', 99, 64 ],
K2 => [ 'DEF', 12, 32 ],
K3 => [ 'GHI', 56, 9 ],
);
for (sort { $data{$a}[2] <=> $data{$b}[2] } keys %data) {
printf "%s => %s %s %s\n", $_, @{$data{$_}};
}
The output for both scripts is identical
K3 => GHI 56 9
K2 => DEF 12 32
K1 => ABC 99 64
In terms of readability/maintainability the second seems superior to me. The danger with the first is that you could end up with keys present in one hash but not the others. Also, if I came across the first approach, I'd have to think about it for a while, whereas the second seems "natural" (or a more common idiom, or more practical, or something else which means I'd understand it more readily).
The second approach (one array reference for each key) is:
In my experience, far more common,
Easier to maintain, since you only have one data structure floating around instead of three, and
More in line with the DRY principle: "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." Represent a key once, not three times.
Sure, it's better to maintain only one data structure:
%data = ( K1=>[V1, V2, V3], ... );
You can use Data::Dump for a fast view/debug of your data structure.
The choice really depends on the usage pattern. Specifically, it depends on whether you use procedural programming or object-oriented programming.
This is a philosophical difference, and it's unrelated to whether language-level classes and objects are used or not. Procedural programming is organised around work flow; procedures access and transform whatever data they need. OOP is organised around records of data; methods access and transform one particular record only.
The second approach is closely aligned with object-oriented programming. Object-oriented programming is by far the most common programming style in Perl, so the second approach is almost universally the preferred structure these days (even though it takes more memory).
But your edit implied you might be using a more procedural approach. As you discovered, the first approach is more convenient for procedural programming. It was very commonly used when procedural programming was in vogue.
Take whatever suits your code's organisation best.

returning multiple dissimilar data structures from R function in PL/R

I have been looking at various discussions here on SO and other places, and the general consensus seems that if one is returning multiple non-similar data structures from an R function, they are best returned as a list(a, b) and then accessed by the indexes 0 and 1 and so on. Except, when using an R function via PL/R inside a Perl program, the R list function flattens the list, and also stringifies even the numbers. For example
my $res = $sth->fetchrow_arrayref;
# now, $res is a single, flattened, stringified list
# even though the R function was supposed to return
# list([1, "foo", 3], [2, "bar"])
#
# instead, $res looks like c(\"1\", \""foo"\", \"3\", \"2\", \""bar"\")
# or some such nonsense
Using a data.frame doesn't work because the two arrays being returned are not symmetrical, and the function croaks.
So, how do I return a single data structure from an R function that is made up of an arbitrary set of nested data structures, and still be able to access each individual bundle from Perl as simply $res->[0], $res->[1] or $res->{'employees'}, $res->{'pets'}? update: I am looking for an R equiv of Perl's [[1, "foo", 3], [2, "bar"]] or even [[1, "foo", 3], {a => 2, b => "bar"}]
addendum: The main thrust of my question is how to return multiple dissimilar data structures from a PL/R function. However, the stringification noted above, though secondary, is also problematic because I convert the data to JSON, and all those extra quotes just add useless data to what is transferred between the server and the user.
I think you have a few problems here. The first is you can't just return an array in this case because it won't pass PostgreSQL's array checks (arrays must be symmetric, all of the same type, etc). Remember that if you are calling PL/R from PL/Perl across a query interface, PostgreSQL type constraints are going to be an issue.
You have a couple of options.
You could return setof text[], with one data type per row.
You could return some sort of structured data using structures PostgreSQL understands, like:
CREATE TYPE ab AS (
a text,
b text
);
CREATE TYPE r_retval AS (
labels text[],
my_ab ab
);
This would allow you to return something like:
{labels => [1, "foo", 3], my_ab => {a => 'foo', b => 'bar'} }
But at any rate you have to put it into a data structure that the PostgreSQL planner can understand and that is what I think is missing in your example.