Associative noncommutative hash function

Is there a hash function with the following properties?
is associative
is not commutative
easily implementable on 32-bit integers: int32 hash(int32, int32)
If I am correct, such a function makes it possible to achieve the following goals:
calculate hash of concatenated string from hashes of substrings
calculate hash concurrently
calculate the hash of a list implemented as a binary tree, so that it depends on element order but not on how the tree is balanced
The best I have found so far is multiplication of 4x4 matrices of bits, but that's awkward to implement and reduces the space to 16 bits.
I am grateful for any help.

Polynomial rolling hash could help:
H(A_1, ..., A_n) = (H(A_1, ..., A_{n-1}) * Base + A_n) mod P
It's easy to concatenate two results, or to subtract a prefix or suffix from a result, as long as the lengths are known.
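A minimal Python sketch of the concatenation property, assuming each hash is carried together with the length of the string it covers. The names BASE, MOD, hash_bytes and combine, and the particular constants, are illustrative assumptions and not part of the answer above.

```python
BASE = 257                # assumed base; any value larger than a byte works here
MOD = (1 << 31) - 1       # assumed modulus: the Mersenne prime 2^31 - 1


def hash_bytes(data):
    """Return (polynomial hash, length) for a byte string."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h, len(data)


def combine(left, right):
    """Hash of the concatenation, computed only from the two (hash, length) pairs."""
    h1, n1 = left
    h2, n2 = right
    # Shift the left hash by the length of the right part, then add the right hash.
    return (h1 * pow(BASE, n2, MOD) + h2) % MOD, n1 + n2


foo, bar, baz = hash_bytes(b"foo"), hash_bytes(b"bar"), hash_bytes(b"baz")
print(combine(foo, bar) == hash_bytes(b"foobar"))
print(combine(combine(foo, bar), baz) == combine(foo, combine(bar, baz)))
```

Because the pair (hash, length) is combined rather than the hash alone, the operation is associative, and swapping the operands generally gives a different result.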

Matrix multiplication is associative and non-commutative.
You could try representing your hashes as matrices, but this will lose information if they have determinant 0 (which is likely!).
So instead you should generate a triangular matrix with a diagonal of 1's, which guarantees a determinant of 1 (and therefore that composition does not lose information).
Furthermore, the product of two such triangular matrices is again a triangular matrix with a diagonal of 1's, so reading the result works the same way as generating it.
Note: to use this method, the number of values in your hash must be a triangular number, because an n x n matrix of this kind has n(n-1)/2 free entries above the diagonal!
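As a rough illustration, here is a Python sketch for the smallest interesting case: 3x3 upper triangular matrices with 1's on the diagonal, which have three free entries (a, b, c). The choice of entries modulo 2**32 and the name combine are assumptions made for this sketch. How elements are mapped to triples is left open.

```python
MASK = 0xFFFFFFFF  # assumption: matrix entries are taken modulo 2**32


def combine(x, y):
    """Multiply two matrices of the form
        [1 a c]
        [0 1 b]
        [0 0 1]
    each stored as the triple (a, b, c)."""
    a1, b1, c1 = x
    a2, b2, c2 = y
    return (
        (a1 + a2) & MASK,
        (b1 + b2) & MASK,
        (c1 + c2 + a1 * b2) & MASK,  # the only entry that depends on operand order
    )


x, y, z = (1, 2, 3), (4, 5, 6), (7, 8, 9)
print(combine(x, y), combine(y, x))
print(combine(combine(x, y), z) == combine(x, combine(y, z)))
```

Note that in this 3x3 case only the c entry is sensitive to order, since a and b are plain sums. Larger matrices have more order-dependent entries.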

What is the difference between matrix and array?

What is the more generalized term?
Why is MATLAB named matrix laboratory, then?
A matrix is a practical way to represent a linear transformation from a space of dimension n to a space of dimension m in the form of an mxn array of scalar values.
It is also a very practical way to perform linear algebra operations systematically, in a form that can be implemented on a computer. For instance, if matrix A represents the linear transformation f and matrix B the linear transformation g, then the composition f o g is written A*B, where * denotes matrix multiplication. Matlab also has a lot of routines related to matrix operations (i.e. linear algebra operations) like det, pinv, svd, etc.
As you can still see nowadays in Matlab, operators like * and / are strongly tied to matrix operations and thus to linear algebra, which I think was the original goal of Matlab in its early development, hence its name (this is speculative, but I guess not far from reality).
To perform element-wise operations on n-dimensional data sets, you have to write .* or ./, denoting that you are now performing array operations.
I would not say array operations encompass matrix operations; they are different. Matrix operations relate to linear algebra, while array operations are just a practical way to operate on large sets of data. These data are not limited to numbers; they are n-dimensional data sets of anything (strings, numbers, cells, etc.).
Matlab also has a very concise syntax for performing array operations on sub-blocks (i.e. linear/logical subscripts), which makes it very easy to reorganize data sets in just one line of code before applying subsequent matrix or array operations.
If you're asking about MATLAB, the word "matrix" typically refers to a 2d array, whereas an "array" can be n-dimensional.
Early versions of MATLAB supported only 2d matrices, not n-dimensional arrays. I believe support for n-dimensional arrays was introduced in version 5 of MATLAB.
I would say that MATLAB's matrix is a more advanced kind of array compared to C-style arrays, e.g. double array[], or Java arrays, e.g. double arry2[]. I would also say that the MATLAB matrix is better for mathematical purposes than the C++ vector or the Java ArrayList. However, if you mean the MATLAB array, I would say that it is more complicated. I would then recommend the link about MATLAB data, which describes the mxArray type used to store most of the data in MATLAB. The question is hard to answer completely without a better description of what you mean by array, but I would say that, regarding the type, there is no difference between an array like a = [1,2,3,4] and a matrix like b = [1,2,3,4;5,6,7,8]. There can also be matrices of higher dimensions, such as c = ones(3,4,3). These are in general called matrices as well in MATLAB or, if you need to be more specific, N-dimensional matrices.

Lexicographic ordering of triplets of integers in Matlab

I have the following problem: I have an array of N integer triplets (i.e. an Nx3 matrix) and I would like to order it lexicographically in Matlab. In order to do so I thought of using the built-in sort algorithm of Matlab, but I wanted to ask if the way I thought of doing it is right or if there exists a simpler way (preferably using Matlab routines).
I thought of converting every triplet into a single number and then sorting these numbers with sort(). If my integers were between 0 and 9, I could just read each triplet as a decimal number. However, they are bigger. If their maximum absolute value is M, I thought of converting them into the (M+1)-ary system like this: if (a,b,c) is the triplet, the corresponding integer is a*(M+1)^2+b*(M+1)+c. Would sorting these transformed integers solve the problem, or am I making a logical mistake in my reasoning?
Thank you!
PS: I know that sort() in Matlab does have a lexicographic option for strings, but my integers do not have the same digit length. Maybe padding them with leading zeros and concatenating them would do the trick?
Have you considered using sortrows?
It should let you sort your three columns of data lexicographically in a straightforward way.
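On the question of whether the single-number encoding is logically sound, here is a small check written in Python rather than MATLAB, purely to illustrate the arithmetic. It assumes the integers may be negative (the question speaks of a maximum absolute value M). In that case base M+1 is not enough: each value has to be shifted into the range 0..2M and encoded in base 2M+1. The function names are hypothetical.

```python
def naive_encode(triplet, m):
    """Encoding proposed in the question: base (m + 1), no shift."""
    a, b, c = triplet
    return a * (m + 1) ** 2 + b * (m + 1) + c


def shifted_encode(triplet, m):
    """Shift each value into 0..2m, then read the triplet in base 2m + 1."""
    base = 2 * m + 1
    a, b, c = (x + m for x in triplet)
    return (a * base + b) * base + c


m = 2
triplets = [(0, -2, 0), (-1, 2, 0), (2, 0, -1), (0, -2, -1)]

# Python tuples compare lexicographically, so sorted() is the reference order.
print(sorted(triplets))
print(sorted(triplets, key=lambda t: shifted_encode(t, m)))
print(sorted(triplets, key=lambda t: naive_encode(t, m)))
```

With non-negative inputs the base M+1 encoding from the question does preserve the order. In MATLAB, where numbers are doubles by default, the encoded values would also have to stay below 2^53 to remain exact, which is one more reason to prefer sortrows.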

What is a Vector

I come from the C/C++ programming world and find it difficult to understand what exactly a Vector / Matrix is in MATLAB, and why they are not called arrays everywhere.
What is a Vector in MATLAB, and why is it not called or referred to as an array?
The "MAT" in MATLAB is for Matrix, not Math. In MATLAB, basically everything you do is calculations with what you would call matrices / vectors in mathematical terms.
It is common to call a numeric array a matrix (or a vector if it's 1xn or nx1), and to call other arrays simply arrays. You'll see terms like cell array, which is an array of cells.
This way you can use mathematical terms when describing calculations with numerical arrays. For instance inv can be used to find the inverse of a matrix, instead of the inverse of a numeric array. (Btw, never use inv, it was just an example).
Matlab is designed to be used as a "Matrix-Lab": a tool for numerically processing linear-algebra objects such as vectors and matrices. So, in terms of "data structures", it does work with n-dimensional arrays, but it has special names for the special cases: "vector" for a 1-d array and "matrix" for a 2-d array.

automatic transpose of vectors for binary operations

I know alternatives exist, but I am just curious. When I perform binary operations such as *, -, /, + between two vectors of the same length, the dimensions sometimes do not match. For example, for a*b, a is of size (m,1) and b is also of size (m,1); or for a-b, the sizes of a and b are (m,1) and (1,m) respectively. Is there a way to make Matlab automatically match the dimensions of the vectors and perform the operation?
A simple approach is to use
a(:)-b(:)
instead of a-b. The linear indexing (:) turns everything into a column vector.
If one of the operands is in turn the result of an operation, for example b+c, you can't directly write a(:)-(b+c)(:) in Matlab. In that case you can use reshape, like this:
reshape(a,[],1) - reshape(b+c,[],1)
This works because reshape(...,[],1), like (:), converts its argument into a column; but now that argument can be the result of an operation.

How to calculate hash of a set (unordered list) of values?

I want to calculate sha1 hash of a set (unordered list) of elements. I have already calculated sha1 hash of each element. I'm considering two solutions:
Sort elements by their hashes and calculate top hash of such list.
Treat element hashes as 160-bit integer values and XOR (bitwise operation) them together into one 160-bit hash.
Is the second solution weaker in terms of secure hash function properties (pre-image resistance, second pre-image resistance, collision resistance)?
Option 1 is what is done in ERS: that standard uses hash trees, where each node contains a hash value computed over the set of hash values from the child nodes; since order is not significant in the tree, the values are sorted lexicographically before hashing. This is good, and, as far as we know, safe.
Option 2 is very unsafe: if the hash function has 160-bit output, then I can easily generate 160 random inputs such that the corresponding hash values constitute a basis of the vector space GF(2)^160, at which point I can produce a matching set for any aggregate hash value. Attack cost is negligible.
Option 3 suggested by @paj28 (sorting the values to hash, then hashing them) is fine, too, as long as you "concatenate" the sorted values with an unambiguous separator. For instance, if you hash the set of strings containing "bar" and "foo", you don't want to obtain the same hash value as with the set of strings containing "ba" and "rfoo". It is easier to get something safe when all values to hash have the same length.
Therefore, use option 1: hash each value in the set, then sort the hash values in lexicographic order, and hash the sorted list of values again.
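A minimal Python sketch of option 1 as summarized above, using hashlib from the standard library. The function name set_hash and the choice to deduplicate the input with set() are assumptions of this sketch.

```python
import hashlib


def set_hash(elements):
    """Hash each element, sort the digests, then hash the sorted digests."""
    # All SHA-1 digests are 20 bytes long, so concatenating them is unambiguous.
    digests = sorted(hashlib.sha1(e).digest() for e in set(elements))
    top = hashlib.sha1()
    for d in digests:
        top.update(d)
    return top.hexdigest()


print(set_hash([b"foo", b"bar"]) == set_hash([b"bar", b"foo"]))
print(set_hash([b"bar", b"foo"]) == set_hash([b"ba", b"rfoo"]))
```

The order of the input does not matter, while the "bar"/"foo" versus "ba"/"rfoo" ambiguity cannot arise because only fixed-length digests are concatenated.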
On the attack with option 2: this is linear algebra. Suppose that you have k vectors of n bits, such that none of them is equal to the XOR of some of the k-1 other vectors (they are said to be linearly independent). Then consider a new random vector v; the probability that this vector is equal to the XOR of some of the k vectors is equal to 2^(k-n), i.e. it is small as long as k < n. If the new vector v is indeed linearly independent of the k vectors you already have (which happens with probability 1-2^(k-n)), then add it to the set: you now have k+1 linearly independent vectors.
Recurse: you will soon obtain n vectors of n bits which are linearly independent of each other. But you cannot go further, because the probability that any new vector is linearly independent of the n previous ones has dropped to 0. The n vectors are said to be a basis for the vector space.
In this case, the vectors are obtained by simply hashing values (random values, or values with structure, it does not matter much, because the hash function acts as a randomizer).
For a given set of k vectors, determining whether a new vector v is linearly independent of the k vectors is easy with Gaussian elimination. The same algorithm lets you know, once you have a basis, which of your n basis vectors shall be XORed together to yield any vector v'. In the setup of this question, this means that once I have produced n values m_i such that the h(m_i) constitute a basis, then for any target n-bit output t, I can use Gaussian elimination to work out which of my h(m_i) may be XORed together to yield exactly the value t. The corresponding m_i values are then a preimage set for t.
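The attack can be sketched in a few lines of Python. This is an illustrative sketch only: the message format b"msg-<counter>", the function names, and the bookkeeping with integer bitmasks are assumptions of the sketch, not something prescribed above.

```python
import hashlib

N_BITS = 160  # SHA-1 output size


def h(msg):
    """SHA-1 digest interpreted as a 160-bit integer (a vector over GF(2))."""
    return int.from_bytes(hashlib.sha1(msg).digest(), "big")


def build_basis():
    """Collect messages whose hashes are linearly independent over GF(2)."""
    basis = {}      # pivot bit -> (reduced vector, bitmask of messages XORed into it)
    messages = []
    counter = 0
    while len(basis) < N_BITS:
        msg = b"msg-%d" % counter
        counter += 1
        vec, combo = h(msg), 1 << len(messages)
        # Gaussian elimination step: cancel the top bit while a pivot exists for it.
        while vec and (vec.bit_length() - 1) in basis:
            bvec, bcombo = basis[vec.bit_length() - 1]
            vec ^= bvec
            combo ^= bcombo
        if vec:  # independent of everything collected so far
            basis[vec.bit_length() - 1] = (vec, combo)
            messages.append(msg)
    return basis, messages


def preimage_set(target, basis, messages):
    """Return messages whose hashes XOR together to exactly `target`."""
    vec, combo = target, 0
    while vec:
        bvec, bcombo = basis[vec.bit_length() - 1]
        vec ^= bvec
        combo ^= bcombo
    return [m for i, m in enumerate(messages) if (combo >> i) & 1]


basis, messages = build_basis()
target = h(b"any aggregate value the attacker wants to hit")
chosen = preimage_set(target, basis, messages)
acc = 0
for m in chosen:
    acc ^= h(m)
print(acc == target, len(chosen))
```

Building the basis takes only slightly more than 160 hash evaluations, which is what makes the cost of the attack negligible.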
The other option (3) is to sort the elements first, then combine them into a single string using a separator that cannot appear as part of an element.
Of these possibilities, 2 would concern me the most. I can't think right now of a practical way to attack it, but it seems the riskiest.
So 1 and 3 are basically fine. But I would recommend 3 because you are using the hash in the way it is intended.
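A minimal Python sketch of option 3. Instead of a separator character, which requires knowing a byte that never occurs in an element, this sketch swaps in a length prefix before each element, which gives the same unambiguous encoding for arbitrary byte strings. The function name and the 8-byte big-endian prefix are assumptions of the sketch.

```python
import hashlib
import struct


def set_hash_sorted(elements):
    """Sort the elements, then hash them with an 8-byte length prefix each."""
    top = hashlib.sha1()
    for e in sorted(set(elements)):
        top.update(struct.pack(">Q", len(e)))  # length prefix makes boundaries unambiguous
        top.update(e)
    return top.hexdigest()


print(set_hash_sorted([b"foo", b"bar"]) == set_hash_sorted([b"bar", b"foo"]))
print(set_hash_sorted([b"bar", b"foo"]) == set_hash_sorted([b"ba", b"rfoo"]))
```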