How can I address the SHA3 state vector in programming terms? - hash

I've been working on an implementation of SHA3, and I'm getting a bit muddled on this particular aspect of the algorithm. The addressing scheme of the state vector is given by the following diagram:
My issue with the above is: How does one go about addressing this in terms of actual code? I am using a 3 dimensional array to express the state vector, but this leads to obvious issues since the conventional mapping of an array (0 index is first) differs from the above convention used in SHA3.
For example, if I wanted to address the (0,0,0) bit in the SHA3 state array, the following expression would achieve this:
state_vector[2][2][0]
I find this highly cumbersome however because when implementing the actual round algorithms, the intended x and y values do not directly map to the array indices. Addressing state_vector[0][0][0] would return the very first index in the array instead of the (0,0,0) bit in the SHA3 state array.
Is there a way I can get around this in code?
Sorry, I know this is probably a stupid question.

The way this is customarily implemented is as a 5×5 array of 64-bit words, an array of 25 64-bit words or, if you believe your architecture (say, AArch64) will have a lot of registers, as 25 individual 64-bit words. (I prefer the second option because it's simpler to work with.) Typically they are indeed ordered in the typical order for arrays, and one simply rewrites things accordingly.
Usually this isn't a problem, because the operations are specified in terms of words in relation to each other, such as in the theta and chi steps. It's common to simply code rho and pi together such that it involves reading a word, rotating it, and storing it in the destination word, and in such a case you can simply just reorder the rotation constants as you need to.
If you want to get very fancy, you can write this as an SIMD implementation, but I think it's easier to see how it works in a practical implementation if you write it as a one- or two-dimensional array of words first.

Related

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

How to calculate equal hash for similar strings?

I create Antiplagiat. I use a shingle method. For example, I have the following shingles:
I go to the cinema
I go to the cinema1
I go to th cinema
Is there a method of calculating the equal hash for these lines?
I know of the existence of Levenshtein distance. However, I do not know what I should take source word. Maybe there is a better way than to consider Levenshtein distance.
The problem with hashing is that, logically, you'll run into 2 strings that differ by a single character that hash to different values.
Small proof:
Consider all possible strings.
Assume all of these hash to at least 2 different values.
Take any 2 strings A and B that hash to different values.
You can obviously go from A to B by just changing one character at a time.
Thus at some point the hash will change.
Thus at this point the hash will be different for a single character change.
Some options I can think of:
Hash multiple parts of the string and check each of these hashes. Probably won't work too well since a single character omission will cause significant difference in the hash values.
Check a range of hashes. A hash is one dimensional, but string similarity is not, thus this probably won't work either.
All in all, hashing is probably not the way to go.
This questions is a bit old but you may be interested in this paper by two researchers at AT&T. They employ a technique that is reminiscent of the Nilsimsa hash to detect when similar sms messages have been seen an "abnormal" number of times in a time window.
It sounds Locality Sensitive hashing would also be pertinent to your problem.

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element wise, comparing the sum of the ID vector is risky (a small ID can be compensated with a large one), may be I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d for example, looses points 11,103 and gets 9,105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is a set of length equal to the cardinality of the underlying universe of things in which each bit is set on or off according to the things membership of the (sub-set). You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the ID's I would recommend using the matlab Java interface to use the Java hashing algorithms
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function
DataHash on FEX it is quiet fast for vectors and the strcmp on the keys is a lot faster than I expected.

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second has function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the
hashed-to bucket until you find a free bucket (modding by table
size). Displacement lists might look something like { 3, 5, 11,
19... } etc. - ideally you want to have the difference between
displacements not be the sum of a sequence of other displacements.
rehash using a different algorithm (but then you'd need yet another
algorithm if you happen to clash twice etc.)
root a container in the
buckets, such that colliding strings can be searched for. Typically
the number of buckets should be similar to or greater than the
number of elements, so elements per bucket will be fairly small and
a brute-force search through an array/vector is a reasonable
approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or support separate heap & allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) is enough to find an empty bucket so the locality of memory use is reasonable) though they're more collision prone than an alternative hashing algorithm (which should approach #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to concider using a trie instead of a hash table for dictionaries of strings.
The up side of this approach is you get O(|S|) worst case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that hash table allows you only average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when load balance is too high.
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking if the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gained a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve that rare collision cases and assure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How this is possible?
bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However a bitmapped vector trie will be much faster for other operations (thanks to structural sharing) - for example creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation of why you might want these kind of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers but nobdy answers your question. The PersistenVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's diffrent. There are lots of small arrays floating around and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures, skip to the last part if you are only intrested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure, that has been implementated in Scala since 2010 (v 2.8). It is its clever bitwise indexing strategy that allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors :
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth2. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it uses the fact that the JVM optimizes so hard on read/write/copy array data structures. The key aspect IMO is that if your vector grows to a certain size index management becomes a  bottleneck . Here comes the very clever algorithm from persisted vector into play, on very large collections it outperforms the standard variant. So basically it is a functional data-structure which only performed so well because it is built up on small mutable highly optimizes JVM datastructures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).