Hashcode for geometries that differ only in their orientation

I have a collection that contains geometries (usually (Poly)Lines). Now I want to implement a HashCode for these geometries in order to put them into the collection. To do so, I have three members within every geometry that will not change and are thus suitable for a HashCode: the geometry type (which is PolyLine for all geometries), the from-point and the to-point.
So I wrote this code:
int hash = 17;
// this hashcode implementation uses the geometry type and the from- and to-points of the geometry
hash = hash * 31 + this.Geometry.GeometryType.GetHashCode();
hash = hash * 31 + this.Geometry.FromPoint.X.GetHashCode();
hash = hash * 31 + this.Geometry.FromPoint.Y.GetHashCode();
hash = hash * 31 + this.Geometry.ToPoint.X.GetHashCode();
hash = hash * 31 + this.Geometry.ToPoint.Y.GetHashCode();
Now we have another requirement within our application which makes it impossible for me to write a hash function: two geometries are also considered equal when they are contrary, i.e. when one is the reverse of the other. Since objects that are actually equal MUST have the same hashCode, I have to change the implementation so that it allows diagonal collisions.
This means the following:
when the fromPoint of geometry 1 equals the toPoint of geometry 2 (and vice versa), their hashCodes must also be equal.
Which of the factors do I have to change in my implementation to enable diagonal collisions, or am I totally wrong with my implementation (is there a better way to do this)?

For the swapped points to yield the same result, you need a mathematical operation where A op B == B op A and you need to apply it to both coordinates before adding the result to the hash.
I would try this:
hash = hash * 31 + (
this.Geometry.FromPoint.X.GetHashCode()
+ this.Geometry.ToPoint.X.GetHashCode()
);
This expression returns the same result, no matter in which order the two X coordinates are passed. Do the same for the two Y coordinates.
Note: If you add/remove lines of the polygon or move end points, then the hash code changes. So you must make sure that the geometry doesn't change as long as such an object is stored in a hash map/set.
If you need to change the geometry, you first have to remove the object from the hash map/set, change the geometry and add it again.
PS: The X in the last line of your code should be Y.
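As a rough illustration of the commutative idea above, here is a minimal sketch in Python rather than C#. The function names and the tuple representation of points are assumptions made for the example. The second function, which sorts the end points before hashing, is an alternative that the answer does not mention.

```python
from typing import Tuple

Point = Tuple[float, float]  # hypothetical stand-in for FromPoint / ToPoint


def commutative_hash(geometry_type: str, from_point: Point, to_point: Point) -> int:
    """Combine each coordinate pair with +, so swapping the end points changes nothing."""
    h = 17
    h = h * 31 + hash(geometry_type)
    h = h * 31 + (hash(from_point[0]) + hash(to_point[0]))  # X of both end points
    h = h * 31 + (hash(from_point[1]) + hash(to_point[1]))  # Y of both end points
    return h & 0xFFFFFFFF  # keep the value in 32 bits, like a C# int hash


def sorted_endpoint_hash(geometry_type: str, from_point: Point, to_point: Point) -> int:
    """Alternative (not from the answer): order the end points canonically, then hash."""
    first, second = sorted((from_point, to_point))
    return hash((geometry_type, first, second))
```

Both functions give a line and its reverse the same hash. The additive version also gives the same hash to the two diagonals of a rectangle, for example (0,0)-(1,1) and (0,1)-(1,0). That is still legal for a hash code, it only means more collisions, and the sorted variant avoids that particular case.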

I haven't really understood the geometrical aspects of the problem you are describing, but here are a few thoughts:
How many objects do you have? If there are not too many, it might be acceptable to not worry too much about the hashcode implementation and just make it constant.
If the equals operation of any two geometries is non-trivial, what about wrapping them in an object for which equality is easier to reason about, e.g. new MyGeometry("an Id", aGeometry)? Implementing hashCode / equals should be trivial then.

Related

What does EuclideanHash in Locality-Sensitive Hashing mean?

I found that Locality-Sensitive Hashing supports EuclideanHash, CosineHash and some other hash families, according to a repository on GitHub: lsh families. CosineHash is easy to understand:
double result = vector.dot(randomProjection);
return result > 0 ? 1 : 0;
But then EuclideanHash is hard to understand:
double hashValue = (vector.dot(randomProjection)+offset)/Double.valueOf(w); // offset = rand.nextInt(w)
return (int) Math.round(hashValue);
Generally, a Euclidean hash in LSH is a hash function that maps data (vectors) at nearby positions in Euclidean space to the same integer with high probability.
One way to do this is to generate a random line and divide it into segments, where each segment represents a hash value. The hash is then obtained by projecting the data vector onto this line and observing which segment it falls into.
The function you asked about uses this approach: the dot product with randomProjection is the (unnormalized) projection onto a random line, offset randomly shifts the segment boundaries, and dividing by w and rounding selects the segment of width w.
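The projection-and-segment idea can be sketched as follows. This is a Python sketch that mirrors the Java snippet from the question, not the library's actual code. The class layout and the Gaussian random projection are assumptions, since the question only shows the hash computation and the offset.

```python
import math
import random
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class EuclideanHash:
    def __init__(self, dimensions: int, w: int, rng: random.Random) -> None:
        self.w = w                       # width of one segment on the random line
        self.offset = rng.randrange(w)   # mirrors rand.nextInt(w) in the question
        # Assumption: Gaussian components; the question does not show how this is drawn.
        self.random_projection: List[float] = [rng.gauss(0.0, 1.0) for _ in range(dimensions)]

    def hash(self, vector: Sequence[float]) -> int:
        # Position along the random line, shifted by the offset, in units of w.
        value = (dot(vector, self.random_projection) + self.offset) / self.w
        # floor(x + 0.5) rounds halves up, like Java's Math.round.
        return math.floor(value + 0.5)
```

Vectors whose projections land within the same segment of width w get the same integer, while vectors far apart along the line get different ones.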

Passing values to a sparse matrix in MATLAB

This might sound too simple to you, but I need some help with doing all of the following in one shot instead of defining redundant variables, i.e. tmp_x, tmp_y:
X= sparse(numel(find(G==0)),2);
[tmp_x, tmp_y] = ind2sub(size(G), find(G == 0));
X(:)=[tmp_x, tmp_y];
(More info: G is a sparse matrix)
I tried:
X(:)=ind2sub(size(G), find(G == 0));
but that threw an error.
How can I achieve this without defining tmp_x, tmp_y?
A couple of comments about your code:
numel(find(G == 0)) is probably one of the worst ways to determine how many entries in your matrix are zero. I would personally do numel(G) - nnz(G). numel(G) determines how many elements are in G and nnz(G) determines how many non-zero values are in G. Subtracting the second from the first gives you the total number of elements that are zero.
What you are doing is first declaring X to be sparse... then when you're doing the final assignment in the last line to X, it reconverts the matrix to double. As such, the first statement is totally redundant.
If I understand what you are doing, you want to find the row and column locations of the zeros in G and place these into an N x 2 matrix. With what MATLAB currently has available, this cannot be done without intermediate variables. The functions that you'd typically use (find, ind2sub, etc.) require intermediate variables if you want to capture the row and column locations. Using one output variable will give you linear indices only, not separate row and column locations.
You don't have a choice but to use intermediate variables. However, if you want to make this more efficient, you don't even need to use ind2sub. Just use find directly:
[I,J] = find(~G);
X = [I,J];

Does a string hash exist which can ignore the order of chars in this string

Does a string hash exist which can ignore the order of chars in this string? E.g., "helloword" and "wordhello" can map into the same bucket.
There are a number of different approaches you can take.
You can add the values of the characters together. (a + b + c is equal to a + c + b.) Unfortunately, this is the least desirable approach, since strings like "ac" and "bb" will generate the same hash value.
To reduce the possibility of hash code collisions, you can XOR the values together. (a ^ b ^ c is equal to a ^ c ^ b.) Unfortunately, this will not give a very broad distribution of random bits, so it will still give a high chance of collisions for different strings.
To even further reduce the possibility of hash code collisions, you can multiply the values of the characters together. (a * b * c is equal to a * c * b.)
If that's not good enough either, then you can sort all the characters in the string before applying the default string hashing function offered to you by whatever language it is that you are using. (So, both "helloword" and "wordhello" would become "dehlloorw" before hashing, thus generating the same hash code.) The only disadvantage of this approach is that it is computationally more expensive than the others.
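The four approaches above can be sketched in a few lines of Python. The function names are made up for this illustration, and the 32-bit mask in the product version is an assumption that imitates fixed-width integer overflow.

```python
from functools import reduce
from operator import mul, xor


def sum_hash(s: str) -> int:
    """Add the character values; order does not matter."""
    return sum(ord(c) for c in s)


def xor_hash(s: str) -> int:
    """XOR the character values; note that a repeated character cancels itself out."""
    return reduce(xor, (ord(c) for c in s), 0)


def product_hash(s: str) -> int:
    """Multiply the character values, truncated to 32 bits."""
    return reduce(mul, (ord(c) for c in s), 1) & 0xFFFFFFFF


def sorted_hash(s: str) -> int:
    """Sort the characters first, then use the language's default string hash."""
    return hash("".join(sorted(s)))
```

All four return the same value for "helloword" and "wordhello". The sum version also collides on "ac" and "bb", and the XOR version maps "aa" and "bb" both to 0.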
Although the other suggestions of multiplying or adding characters would work, notice that such a hash function is not secure at all.
The reason is that it will introduce a ton of collisions, and one of the main properties a hash function should have is a low probability of collisions.
For example, a + b + c is the same as c + b + a. However, it is also the same as a + a + d (since the sum of the ASCII values is the same). Similar unwanted collisions arise when multiplying or XOR-ing the values.
In sum, if you want to achieve a hash function which ignores order, you can but it will introduce a ton of collisions which will potentially make your program faulty and insecure.

optimal way of storing multidimensional array/tensor

I am trying to create a tensor (can be conceived as a multidimensional array) package in scala. So far I was storing the data in a 1D Vector and doing index arithmetic.
But slicing and subarrays are not so easy to get. One needs to do a lot of arithmetic to convert multidimensional indices to 1D indices.
Is there any optimal way of storing a multidimensional array? If not, i.e. if a 1D array is the best solution, how can one optimally slice arrays (some concrete code would really help me)?
The key to answering this question is: when is pointer indirection faster than arithmetic? The answer is pretty much never. In-order traversals can be about equally fast for 2D, and things get worse from there:
2D random access
Array of Arrays - 600 M / second
Multiplication - 1.1 G / second
3D in-order
Array of Array of Arrays - 2.4 G / second
Multiplication - 2.8 G / second
(etc.)
So you're better off just doing the math.
Now the question is how to do slicing. Initially, if you have dimensions of n1, n2, n3, ... and indices of i1, i2, i3, ..., you compute the offset into the array by
i = i1 + n1*(i2 + n2*(i3 + ... ))
where typically i1 is chosen to be the last (innermost) dimension (but in general it should be the dimension most often in the innermost loop). That is, if it were an array of arrays of (...), you would index into it as a(...)(i3)(i2)(i1).
Now suppose you want to slice this. First, you might give an offset o1, o2, o3 to every index:
i = (i1 + o1) + n1*((i2 + o2) + n2*((i3 + o3) + ...))
and then you will have a shorter range on each (let's call these m1, m2, m3, ...).
Finally, if you eliminate a dimension entirely--let's say, for example, that m2 == 1, meaning that i2 == 0, you just simplify the formula:
i = (i1 + o1 + n1*o2) + (n1*n2)*((i3 + o3) + ...)
I will leave it as an exercise to the reader to figure out how to do this in general, but note that we can store the new constants o1 + n1*o2 and n1*n2 so we don't need to keep doing that math on the slice.
Finally, if you are allowing arbitrary dimensions, you just put that math into a while loop. This does, admittedly, slow it down a little bit, but you're still at least as well off as if you'd used a pointer dereference (in almost every case).
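One way to read the formulas above is as a flat array plus one precomputed offset and one stride per remaining dimension. The following is a minimal sketch under that reading, in Python rather than Scala. The names TensorView, full, slice and drop are hypothetical, and bounds checking is left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TensorView:
    data: List[float]    # shared flat storage, never copied by slicing
    shape: List[int]     # extents of the remaining dimensions (m1, m2, ...)
    strides: List[int]   # step in data per unit of each index (1, n1, n1*n2, ...)
    offset: int = 0      # stored constant such as o1 + n1*o2 + ...

    @staticmethod
    def full(data: List[float], dims: Sequence[int]) -> "TensorView":
        strides, step = [], 1
        for n in dims:           # first dimension is the innermost one (i1)
            strides.append(step)
            step *= n
        return TensorView(data, list(dims), strides, 0)

    def flat_index(self, idx: Sequence[int]) -> int:
        i = self.offset
        for k, s in zip(idx, self.strides):   # the loop over arbitrary dimensions
            i += k * s
        return i

    def get(self, *idx: int) -> float:
        return self.data[self.flat_index(idx)]

    def slice(self, dim: int, start: int, length: int) -> "TensorView":
        """Restrict one dimension to [start, start + length)."""
        shape = list(self.shape)
        shape[dim] = length
        return TensorView(self.data, shape, list(self.strides),
                          self.offset + start * self.strides[dim])

    def drop(self, dim: int, at: int) -> "TensorView":
        """Fix one index and eliminate that dimension entirely."""
        return TensorView(self.data,
                          self.shape[:dim] + self.shape[dim + 1:],
                          self.strides[:dim] + self.strides[dim + 1:],
                          self.offset + at * self.strides[dim])
```

For dimensions n1 = 4, n2 = 3, n3 = 2, dropping the second dimension at index 2 leaves strides 1 and n1*n2 = 12 with a stored offset of n1*2 = 8, which matches the simplified formula for the m2 == 1 case.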
From my own general experience: If you have to write a multidimensional (rectangular) array class yourself, do not aim to store the data as Array[Array[Double]] but use a one-dimensional storage and add helper methods for converting the multidimensional access tuples to a simple index and vice versa.
When using lists of lists, you need to do much too much bookkeeping to ensure that all lists are of the same size, and you need to be careful when assigning a sublist to another sublist (because this makes the assigned-to sublist identical to the first, and you wonder why changing the item at (0,5) also changes (3,5)).
Of course, if you expect a certain dimension to be sliced much more often than another and you want to have reference semantics for that dimension as well, a list of lists will be the better solution, as you may pass around those inner lists as a slice to the consumer without making any copy. But if you don’t expect that, it is a better solution to add a proxy class for the slices which maps to the multidimensional array (which in turn maps to the one-dimensional storage array).
Just an idea: how about a map with Int-tuples as keys?
Example:
val twoDimMatrix = Map((1,1) -> -1, (1,2) -> 5, (2,1) -> 7.7, (2,2) -> 9)
and then you could
scala> twoDimMatrix.filterKeys{_._2 == 1}.values
res1: Iterable[AnyVal] = MapLike(-1, 7.7)
or
twoDimMatrix.filterKeys{tuple => { val (dim1, dim2) = tuple; dim1 == dim2}} //diagonal
this way the index arithmetics would be done by the map. I don't know how practical and fast this is though.
As long as the number of dimensions is known at design time, you can use a collection of collections ... (n times) of collections. If you must be able to build a vector for any number of dimensions, then there's nothing convenient in the Scala API to do it (as far as I know).
You can simply store the information in a multidimensional array (e.g. Array[Array[Double]]).
If the tensors are small and can fit in cache, you can have a performance improvement with 1D arrays because of data memory locality. It should also be faster to copy the whole tensor.
For slicing arithmetic, it depends on what kind of slicing you require. I suppose you already have a function for extracting an element based on indices. So write a basic slicing loop that iterates over the indices, manually insert the expression for extracting an element, and then try to simplify the whole loop. This is often simpler than writing a correct expression from scratch.

Fast associative arrays or maps in Matlab

I need to build a fast one-to-one mapping between two large arrays of integers in Matlab. The mapping should take as input an element from a pre-defined array, e.g.:
in_range = [-200 2 56 45 ... ];
and map it, by its index in the previous array, to the corresponding element from another pre-defined array, e.g.:
out_range = [-10000 0 97 600 ... ];
For example, in the case above, my_map(-200) should output -10000, and my_map(45) should output 600.
I need a solution that
Can map very large arrays (~100K elements) relatively efficiently.
Scales well with the bounds of in_range and out_range (i.e. their min and max values)
So far, I have solved this problem using Matlab's external interface to Java with Java's HashMaps, but I was wondering if there was a Matlab-native alternative.
Thanks!
The latest versions of Matlab have hashes. I'm using 2007b and they aren't available, so I use structs whenever I need a hash. Just convert the integers to valid field names with genvarname.