Perfect Hash Function for Perl (like gperf)? - perl

I'm going to be using a key:value store and would like to create collision-free hashes in Perl. Is there a Perl module or function that I can use to generate a collision-free hash function or table (maybe something like gperf)? I already know my range of input values.

I can't find a pure-Perl solution; the closest is Reini Urban's examination of using perfect hashes with a type system. If you were to do it in XS, CMPH (the C Minimal Perfect Hashing Library) might be more apropos than gperf. CMPH seems to be optimized for non-trivial key sizes and run-time generation.
The cost of generating a perfect hash function at runtime in Perl might swamp the value of using it. To benefit, you'd want it compiled and cached. So again, writing an XS module which generates the function from a fixed key list at XS compile time might be the best way to go.
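To make the "generate once from a fixed key list, then cache" idea concrete, here is a minimal sketch in Python rather than Perl/XS. It brute-forces a seed for which a seeded hash sends every key to a distinct slot. The names slot, find_perfect_seed, KEYS and the sample keys are hypothetical, and this is not how gperf or CMPH work internally; brute force is only practical for small key sets.

```python
import hashlib

def slot(key, seed, size):
    """Map a key to a slot in range(size) using a seeded BLAKE2b digest."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),  # the seed selects the hash variant
    ).digest()
    return int.from_bytes(digest, "little") % size

def find_perfect_seed(keys, max_seed=100_000):
    """Try seeds until every key lands in its own slot (brute force)."""
    size = len(keys)
    for seed in range(max_seed):
        if len({slot(k, seed, size) for k in keys}) == size:
            return seed
    raise ValueError("no collision-free seed found; try a larger table")

# Hypothetical fixed key set, known ahead of time.
KEYS = ["apple", "banana", "cherry", "date", "elderberry"]

# Do the expensive search once; in practice the seed would be cached on disk.
SEED = find_perfect_seed(KEYS)

# Build the table: one key per slot, no collisions, no empty slots.
TABLE = [None] * len(KEYS)
for k in KEYS:
    TABLE[slot(k, SEED, len(KEYS))] = k
```

The search is the costly part, which is why the answer suggests doing it at build time and storing the result; lookups afterwards are a single hash evaluation.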
Out of curiosity, how big is your data and how many keys does the set contain?

You might be interested in Judy. It's not a hash table implementation, but it's supposedly a very efficient associative array implementation.
Mind you, Perl's hashes are very well tuned, and they automatically get rehashed when a bucket starts growing large.


Implementing language translators in racket

I am implementing an interpreter in Racket that generates code for another language. As a novice I'm trying to avoid macros to the extent that I can ;) Hence I came up with the following "interpreter":
(define op (open-output-bytes))
(define (interpret arg)
  (define r
    (syntax-case arg (if)
      [(if a b) #'(fprintf op "if (~a) {~a}" a b)]
      ; other cases here
      ))
  (eval r))
This looks a bit clumsy to me. Is there a "best practice" for doing this? Am I doing a totally crazy thing here?
Short answer: yes, this is a reasonable thing to do. The way in which you do it is going to depend a lot on the specifics of your situation, though.
You're absolutely right to observe that generating programs as strings is an error-prone and fragile way to do it. Avoiding this, though, requires being able to express the target language at a higher level, in order to circumvent that language's parser.
Again, it really has a lot to do with the language that you're targeting, and how good a job you want to do. I've hacked together things like this for generating Python myself, in a situation where I knew I didn't have time to do things right.
EDIT: oh, you're doing Python too? Bleah! :)
You have a number of different choices. Your cleanest choice is to generate a representation of Python AST nodes, so you can either inject them directly or use existing serialization. You're going to ask me whether there are libraries for this, and ... I fergits. I do believe that the current Python architecture includes ... okay, yes, I went and looked, and you're in good shape. Python's "parser" module generates syntax trees (note that it was removed in Python 3.10), and it looks like the node classes in the ast module can be constructed directly.
https://docs.python.org/3/library/ast.html#module-ast
I'm guessing your cleanest path would be to generate JSON that represents these AST nodes, then write a Python stub that translates the JSON into Python ASTs.
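A minimal sketch of such a stub, assuming Python 3.9 or later. The JSON layout (a "_type" key naming the ast class, with the remaining keys as fields) is an invented convention for this example, not a standard format, and build is a hypothetical helper name.

```python
import ast
import json

def build(node):
    """Recursively turn a JSON-decoded value into ast nodes."""
    if isinstance(node, list):
        return [build(item) for item in node]
    if isinstance(node, dict) and "_type" in node:
        cls = getattr(ast, node["_type"])
        # Only accept real AST node classes, not arbitrary ast attributes.
        if not (isinstance(cls, type) and issubclass(cls, ast.AST)):
            raise ValueError(f"not an AST node type: {node['_type']}")
        fields = {k: build(v) for k, v in node.items() if k != "_type"}
        return cls(**fields)
    return node  # plain value: identifier string, number, None, ...

# JSON that the code generator (e.g. the Racket side) would emit.
SOURCE_JSON = """
{"_type": "Module", "type_ignores": [], "body": [
  {"_type": "If",
   "test": {"_type": "Compare",
            "left": {"_type": "Name", "id": "x", "ctx": {"_type": "Load"}},
            "ops": [{"_type": "Gt"}],
            "comparators": [{"_type": "Constant", "value": 0}]},
   "body": [{"_type": "Assign",
             "targets": [{"_type": "Name", "id": "y", "ctx": {"_type": "Store"}}],
             "value": {"_type": "Constant", "value": 1}}],
   "orelse": []}
]}
"""

# Line/column info is required by compile(); fill it in automatically.
tree = ast.fix_missing_locations(build(json.loads(SOURCE_JSON)))

# Either run the tree directly ...
namespace = {"x": 5}
exec(compile(tree, "<generated>", "exec"), namespace)

# ... or turn it back into source text (ast.unparse needs Python 3.9+).
generated_source = ast.unparse(tree)
```

Because the generator never produces Python text itself, questions of indentation, quoting and operator precedence are left to Python's own tooling.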
All of this assumes that you want to take the high road; there's a broad spectrum of in-between approaches involving simple generalizations of Python syntax (e.g.: oh, it looks like this kind of statement has a colon followed by an indented block of code, etc.).
If your source language shares syntax with Racket, then use read-syntax to produce a syntax-object representing the input program. Then use recursive descent using syntax-case or syntax-parse to discern between the various constructs.
Instead of printing directly to an output port, I recommend building a tree of elements (strings, numbers, symbols etc). The last step is then to print all the elements of the tree. Representing the output using a tree is very flexible and allows you to handle sub expressions out of order. It also allows you to efficiently concatenate output from different sources.
Macros are not needed.
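The tree-of-elements idea is language independent. Here is a small sketch in Python rather than Racket: each generator function returns a nested list of fragments, and only the final step flattens and joins them. The names flatten and emit_if are hypothetical, and the output format mirrors the "if (~a) {~a}" template from the question.

```python
def flatten(tree):
    """Yield the leaves of a nested list/tuple tree as strings, left to right."""
    if isinstance(tree, (list, tuple)):
        for child in tree:
            yield from flatten(child)
    else:
        yield str(tree)

def emit_if(cond, body):
    """Return the output tree for an if statement; nothing is printed yet."""
    return ["if (", cond, ") {", body, "}"]

# Sub-expressions can be built in any order and nested freely.
inner = emit_if("y", ["z = ", 1, ";"])
fragment = emit_if(["x", " > ", 0], [inner])

# The last step: flatten the tree and join the pieces once.
output = "".join(flatten(fragment))
```

Here output is "if (x > 0) {if (y) {z = 1;}}". Joining once at the end also avoids repeated string concatenation while the tree is being built.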

How do I change the default ONE_AT_A_TIME_HARD hash function in Perl 5.18?

I'm not really familiar with Perl, but I've been searching in the documentation and other sources without success for the last 2 days. In the documentation, it is written:
Perl v5.18 includes support for multiple hash functions, and changed the default (to ONE_AT_A_TIME_HARD), you can choose a different algorithm by defining a symbol at compile time. For a current list, consult the INSTALL document. Note that as of Perl v5.18 we can only recommend use of the default or SIPHASH. All the others are known to have security issues and are for research purposes only.
The thing is that I can't find how to define this symbol, either in the INSTALL document or in any other source or site.
What I want to do is to change the default ONE_AT_A_TIME_HARD hash function to ONE_AT_A_TIME_OLD so I can simulate the old Perl 5.16 behavior.
This sounds like an XY problem. What are you trying to accomplish by forcibly downgrading the hash algorithm in perl to one that has known problems?
From comments:
I need to run a lot of test cases written in perl 5.16 whose functionality depends on the old hash implementation and it's quite impossible to change the code as the cases are hundreds.
Whew, that's bad news. Find those developers, and hit them around the head with a copy of perldata:
Hashes are unordered collections of scalar values indexed by their associated string key.
Specifically - if this is a problem for you, it means your codebase treats hashes as ordered, when they aren't and never were. (It's just they were fairly consistent before 5.18 and more random after).
From perldelta:
When encountering these changes, the key to cleaning up from them is to accept that hashes are unordered collections and to act accordingly.
See: http://blog.booking.com/hardening-perls-hash-function.html
To answer your question - if you really must:
./Configure -DPERL_HASH_FUNC_ONE_AT_A_TIME_OLD -des && make && make test
But it's a very very bad idea, because as the INSTALL file in your perl source package points out:
Note that as of Perl 5.18 we can only recommend the use of default or SIPHASH. All the others are known to have security issues and are for research purposes only.
By building your perl this way you introduce a known security flaw for every perl program using it.
Note - ONE_AT_A_TIME_HARD is already the default in perl 5.18, so selecting it won't change how perl 5.18 works. The symbol you mean is PERL_HASH_FUNC_ONE_AT_A_TIME_OLD.

Is there any benefit to using an obarray rather than a hash-table in Emacs Lisp?

I have an Emacs Lisp program that needs to keep track of a set of strings, use them for completion and test other strings for membership in the set. In most languages without a built-in set type, I would use a dictionary or hash table with a dummy t or 1 value for this, but it occurred to me that Elisp's obarray type could also serve the purpose, with intern, intern-soft and unintern taking the place of puthash, gethash and remhash.
(I know about the cl-lib functions which operate on lists as sets, but those are not particularly relevant for this problem, which only needs a set membership test).
Is there any advantage (in speed, memory usage or otherwise) in using an obarray rather than a hash table in a modern Emacs, or are obarrays other than the main symbol table more of a leftover from before Emacs Lisp had a separate hash-table type?
Since both work, it's to a large extent a question of taste or performance.
In terms of memory usage (counted in words), an obarray uses 1 array of fixed size N plus one symbol per entry (of size 6), whereas a hash-table has a size that is more or less 5 per element plus a bit more. So memorywise, it's a wash.
In terms of speed, I don't know anyone who has bothered to measure it, so it's probably not a big issue either.
IOW, it's a question of taste. FWIW, I prefer hash tables which offer more options; obarrays are largely a historical accident in my view.

Should all implementations of SHA512 give the same Hash?

I am working on writing a SHA512 function. When I check the file I am hashing against different sources (the Linux sha512sum tool, a couple of websites, and the old source code I have for SHA512), they all give different hash values. My thought going into this project was that all correct implementations of a hash algorithm output the same hash value for the same input, so that it can be used as a checksum. Am I wrong in thinking this? If I am wrong, how would I really check to see if my work is correct?
Thanks in advance.
Yes, that's one of the basic building blocks of PKI: the same data block passed to the same hash function should always return the same hash value.
Beware of the interpretation, though: the result of a SHA-512 hash is a block of 512 bits, not a string value, so it must first be encoded for human consumption. It is therefore possible that you see what look like different results when it's simply a matter of different encodings.
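A short sketch using Python's hashlib shows one digest rendered in several common encodings. The input b"abc" is the standard short test vector for SHA-512, which also makes it a handy check for a new implementation; the variable names are just for illustration.

```python
import base64
import hashlib

data = b"abc"

# The raw result: 64 bytes (512 bits), not text.
digest = hashlib.sha512(data).digest()

# Three textual renderings of the very same 64 bytes.
hex_lower = digest.hex()                               # what sha512sum prints
hex_upper = hex_lower.upper()                          # some tools use upper case
b64 = base64.b64encode(digest).decode("ascii")         # common on websites/APIs

# A genuinely different input: the same text plus a trailing newline.
# An editor or `echo` adding "\n" is a frequent cause of mismatched hashes.
digest_with_newline = hashlib.sha512(data + b"\n").digest()

print(hex_lower)
print(hex_upper)
print(b64)
```

The three printed strings look nothing alike but decode to identical bytes, whereas the newline variant really is a different hash. Comparing the decoded bytes, not the strings, is the reliable check.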

Looking for a fast hash-function

I'm looking for a special hash-function. Let's say I have a large list of strings, if I order them by their hash-values they should be ordered quasi randomly.
The most important point is: it must be super fast. I've tried md5 and sha1 and they use too much CPU power.
Clashes are not a problem.
I'm using javascript, so it shouldn't be too complicated to implement.
Take a look at Murmur hash. It has a nice space/collision trade-off:
http://sites.google.com/site/murmurhash/
It looks as if you want the sort of hash function used in a hash table, not the sort used to detect duplicates or tampering.
Googling will yield you a wealth of information on alternative hash functions. To start with, stay away from cryptographic hashes (like MD5 or SHA-1); they solve a different problem.
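As an illustration of the hash-table kind of function, here is 32-bit FNV-1a sketched in Python. FNV-1a is not named in this thread; it is shown only because it is a few lines long and ports easily to JavaScript. The function name fnv1a_32 and the sample words are hypothetical.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # 32-bit FNV prime

def fnv1a_32(text):
    """Return the 32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                          # mix in the next byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # multiply, keep the low 32 bits
    return h

# Ordering strings by hash value gives a stable but scrambled-looking order.
words = ["alpha", "beta", "gamma", "delta", "epsilon"]
shuffled = sorted(words, key=fnv1a_32)
```

In JavaScript the multiplication would need Math.imul (or equivalent) to stay within 32 bits; that detail is an implementation note, not part of the original answers.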
You can read this, or this, or this, to start with.
If speed is paramount, you can implement a simple ad-hoc hash, e.g. take the first and last letter and order your string by the last and then first letter. The result would look, as you say, "quasi random" and it would be fast. For instance, part of my answer sorted that way would look like this:
ca ad-hoc
el like
es simple
gt taking
hh hash
nc can
ti implement
uy you
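The ad-hoc ordering above can be sketched in a few lines of Python; the same logic ports directly to JavaScript. The name adhoc_key is hypothetical, and the sketch assumes every string is non-empty.

```python
def adhoc_key(word):
    """Sort key: last character followed by first character."""
    return word[-1] + word[0]

words = ["you", "can", "implement", "simple", "ad-hoc", "hash", "taking", "like"]

# Order the words by the two-character key, as in the listing above.
ordered = sorted(words, key=adhoc_key)

for word in ordered:
    print(adhoc_key(word), word)
```

This reproduces the listing: ad-hoc, like, simple, taking, hash, can, implement, you. Strings sharing both letters get the same key, which is acceptable here since clashes are not a problem.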
Hsieh's, Murmur, and Bob Jenkins's hash functions come to mind.
A nice page about hash functions that has some tests for quality and a simple S-box hash as well.