What does EuclideanHash in Locality Sensitive Hashing mean?

I found that Locality Sensitive Hashing supports EuclideanHash, CosineHash, and some other hashes, according to the lsh families repository on GitHub. Anyway, CosineHash is easy to understand:
double result = vector.dot(randomProjection);
return result > 0 ? 1 : 0;
But then EuclideanHash is hard to understand:
double hashValue = (vector.dot(randomProjection)+offset)/Double.valueOf(w); // offset = rand.nextInt(w)
return (int) Math.round(hashValue);

Generally, a Euclidean hash in LSH means a hash function that maps data points (vectors) at nearby positions in Euclidean space to the same integer.
One way to do this is to generate a random line and divide it into segments, where each segment represents a hash number. The hash can then be obtained by projecting the data vector onto this line and observing which segment it falls into.
The function you asked about uses exactly this approach: the dot product with the random vector is the projection onto the line (up to a constant scale), the random offset shifts the segment boundaries, and w is the segment width.
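To make the geometry concrete, here is a minimal Python sketch of that segment picture (the vectors, width w, and seed are arbitrary choices for illustration, not values from the lsh families repository):

```python
import random

def euclidean_hash(vector, random_projection, offset, w):
    """Project the vector onto a random direction, shift by a random
    offset, and return the index of the width-w segment it lands in."""
    projection = sum(v * r for v, r in zip(vector, random_projection))
    return round((projection + offset) / w)

# Example: nearby vectors tend to land in the same segment.
random.seed(0)
w = 4.0
projection = [random.gauss(0, 1) for _ in range(3)]
offset = random.uniform(0, w)

a = [1.0, 2.0, 3.0]
b = [1.1, 2.1, 3.0]   # close to a in Euclidean space
print(euclidean_hash(a, projection, offset, w),
      euclidean_hash(b, projection, offset, w))
```

In a real LSH implementation several such hashes are combined, so that vectors that agree on many random segments are likely to be close.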

Related

How can I create a hash function in which different permutations of the digits of an integer map to the same key?

For example, 20986 and 96208 should generate the same key (but not 09862 or 9862, since a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the minimum/maximum sorted permutation and use that sorted number as the hash key, but sorting is too costly for my case; I need to generate the key in O(1) time.
Another idea I have is to traverse the number, get the frequency of each digit, and then derive a hash function from the frequencies. Now, what's the best function to combine the frequencies, given that 0 <= sum(f[i]) <= number_of_digits?
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.
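A minimal Python sketch of this idea (the multiplier 2654435761 is an arbitrary mixing constant, and each digit is offset by one so that a 0 digit still contributes; note that raw XOR would let digits occurring an even number of times cancel out, so addition of per-digit hashes is used here instead):

```python
def digit_hash(n):
    """Order-insensitive hash: mix each digit through a per-digit hash
    and combine with addition (commutative), in one pass, no sorting."""
    h = 0
    while n > 0:
        d = n % 10
        # Offset by 1 so digit 0 contributes; multiply by an arbitrary
        # mixing constant so e.g. digits {1,3} and {2,2} don't collide.
        h += ((d + 1) * 2654435761) & 0xFFFFFFFF
        n //= 10
    return h & 0xFFFFFFFF

print(digit_hash(20986) == digit_hash(96208))  # True: same digit multiset
print(digit_hash(20986) == digit_hash(9862))   # False: missing the 0 digit
```

Converting the input to an integer first also strips any leading zeros automatically, as the answer suggests.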

creating a random multidimensional array using randn() with nddata = fix(8*randn(10,5,3))

I am currently trying to learn MATLAB independently and had a question about a command that used randn().
nddata = fix(8*randn(10,5,3))
I understand what the fix() function does, and the multidimensional array that is created by randn. However, I am not sure what the 8 is doing here: it does not seem to be multiplying the outcome of the random numbers, and it is not part of a limit. So I just want to know the purpose of the 8 here.
Thanks
randn generates a matrix of standard normally distributed random numbers ("standard" in this context means mean = 0 and standard deviation = 1). The factor of 8 simply stretches this distribution along the x-axis: it is a scalar multiplication applied to each value in the 3D matrix, scaling the standard deviation from 1 to 8. The fix function then rounds each value toward zero, i.e. -3.9 becomes -3.0. This rounding slightly reduces the spread of the data.
To see this for yourself, split the expression up and create temporary variables for each operation, and step through it with the debugger.
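If it helps to see the effect numerically, here is the same pipeline sketched in Python with NumPy (np.fix rounds toward zero like MATLAB's fix; the seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Standard normal draws: mean 0, standard deviation 1.
raw = rng.standard_normal((10, 5, 3))
# Multiplying by 8 stretches the distribution: std becomes ~8.
stretched = 8 * raw
# fix() rounds each value toward zero: -3.9 -> -3.0, 3.9 -> 3.0.
nddata = np.fix(stretched)

print(raw.std(), stretched.std())   # roughly 1 and roughly 8
print(nddata.shape)                 # (10, 5, 3)
```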

Index when mean is constant

I am relatively new to MATLAB. I computed the consecutive (running) mean of a set of 1e6 random numbers with a given mean and standard deviation. Initially the calculated mean fluctuates, and then it converges to a certain value.
I would like to know the index (e.g. the 100th position) at which the mean converges. I have no idea how to do that.
I tried using logical operators, but I would have to go through 1e6 data points, and even with that I still can't find the index.
Y_c= sigma_c * randn(n_r, 1) + mu_c; %Random number creation
Y_f=sigma_f * randn(n_r, 1) + mu_f;%Random number creation
P_u=gamma*(B*B)/2.*N_gamma+q*B.*N_q + Y_c*B.*N_c; %Calculation of Ultimate load
prog_mu=cumsum(P_u)./cumsum(ones(size(P_u))); %Progressive Cumulative Mean of system response
logical(diff(prog_mu==0)); %Find index
I suspect the issue is that the mean will never truly be constant, but will rather fluctuate around the "true mean". As such, you'll most likely never encounter a situation where the two consecutive values of the cumulative mean are identical. What you should do is determine some threshold value, below which you consider fluctuations in the mean to be approximately equal to zero, and compare the difference of the cumulative mean to that value. For instance:
epsilon = 0.01;
const_ind = find(abs(diff(prog_mu))<epsilon,1,'first');
where epsilon will be the threshold value you choose. The find command will return the index at which the variation in the cumulative mean first drops below this threshold value.
EDIT: As was pointed out, this method may fail if the first few random numbers happen to produce differences smaller than epsilon before the mean has actually converged. I would like to suggest a different approach, then.
We calculate the cumulative means, as before, like so:
prog_mu=cumsum(P_u)./cumsum(ones(size(P_u)));
We also calculate the difference in these cumulative means, as before:
df_prog_mu = diff(prog_mu);
Now, to ensure that convergence has been achieved, we find the first index from which the difference between consecutive cumulative means stays below the threshold value epsilon. To phrase this another way, we want the index just after the last position in the array where that difference is above the threshold:
conv_index = find(abs(df_prog_mu) >= epsilon, 1, 'last') + 1;
In doing so, we guarantee that the value at the index, and all subsequent values, have converged below your predetermined threshold value.
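Since the logic is language-independent, here is the same "last step above the threshold" idea sketched in Python with NumPy on made-up normal data (the mean, spread, and epsilon are arbitrary); treat it as a cross-check of the approach rather than drop-in MATLAB code:

```python
import numpy as np

rng = np.random.default_rng(1)
P_u = rng.normal(loc=5.0, scale=2.0, size=1_000_000)

# Progressive (cumulative) mean of the samples.
prog_mu = np.cumsum(P_u) / np.arange(1, P_u.size + 1)

# Differences between consecutive cumulative means.
df_prog_mu = np.diff(prog_mu)

epsilon = 1e-4
# Index just after the last place the step size exceeded epsilon,
# so every step from conv_index onward is below the threshold.
above = np.flatnonzero(np.abs(df_prog_mu) >= epsilon)
conv_index = above[-1] + 1 if above.size else 0

print(conv_index)
```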
I wouldn't imagine that the mean would suddenly become constant at a single index; wouldn't it asymptotically approach a constant value? I would recommend a for loop to calculate the mean (it sounds like maybe you've already done this part?) like this:
avg = zeros(1, length(x)); % preallocate rather than growing inside the loop
for k=1:length(x)
avg(k) = mean(x(1:k));
end
Then plot the consecutive mean:
plot(avg)
hold on % this will allow us to plot more data on the same figure later
If you're trying to find the point at which the consecutive mean comes within a certain range of the true mean, try this:
Tavg = 5; % or whatever your true mean is
err = 0.01; % the range you want the consecutive mean to reach before we say that it "became constant"
inRange = avg>(Tavg-err) & avg<(Tavg+err); % gives you a binary logical array telling you which values fell within the range
q = 1000; % set this as high as you can while still getting a value for constIndex
constIndex = [];
for k = 1:length(inRange)-q
if all(inRange(k:k+q)) % the next q+1 values must all fall within the range
constIndex = k;
break % stop at the first such index
end
end
The other answer takes a similar approach, but makes an unsafe assumption that the first value to fall within the range is where the function starts to converge; any value could randomly fall within that range, so we need to make sure that the following values also fall within it. In the above code, you can tune q and err to optimize your result. I would recommend double-checking it by plotting.
plot(avg(constIndex), '*')

How to calculate the "rest value" of a plot?

Didn't know how to paraphrase the question well.
Function for example:
Data:https://www.dropbox.com/s/wr61qyhhf6ujvny/data.mat?dl=0
In this case how do I calculate that the rest point of this function is ~1? I have access to the vector that makes the plot.
I guess the mean is an approximation but in some cases it can be pretty bad.
Under the assumption that the "rest" point is the steady-state value of your data, and given that the steady-state value occurs the majority of the time in your data, you can simply bin all of the points, using each unique value as a separate bin. The bin with the highest count should correspond to the steady-state value.
You can do this by a combination of histc and unique. Assuming your data is stored in y, do this:
%// Find all unique values in your data
bins = unique(y);
%// Find the total number of occurrences per unique value
counts = histc(y, bins);
%// Figure out which bin has the largest count
[~,max_bin] = max(counts);
%// Figure out the corresponding y value
ss_value = bins(max_bin);
ss_value contains the steady-state value of your data, corresponding to the most occurring output point with the assumptions I laid out above.
A minor caveat with the above approach is that it is not friendly to floating-point data: values that should count as "the same" can differ beyond the first few significant digits and therefore end up in separate unique bins.
Here's an example of your data from point 2300 to 2320:
>> format long g;
>> y(2300:2320)
ans =
0.99995724232555
0.999957488454868
0.999957733165346
0.999957976465197
0.999958218362579
0.999958458865564
0.999958697982251
0.999958935720613
0.999959172088623
0.999959407094224
0.999959640745246
0.999959873049548
0.999960104014889
0.999960333649014
0.999960561959611
0.999960788954326
0.99996101464076
0.999961239026462
0.999961462118947
0.999961683925704
0.999961904454139
Therefore, what I'd recommend is to perhaps round so that the first 5 or so significant digits are maintained.
You can do this to your dataset before you continue:
num_digits = 5;
y_round = round(y*(10^num_digits))/(10^num_digits);
This first multiplies by 10^n, where n is the number of digits you desire, so that the decimal point is shifted over by n positions. We round this result, then divide by 10^n to bring it back to the original scale. If you do this, points that were 0.9999... to n decimal places will get rounded to 1, which helps the above calculations.
However, more recent versions of MATLAB have this functionality already built-in to round, and you can just do this:
num_digits = 5;
y_round = round(y,num_digits);
Minor Note
More recent versions of MATLAB discourage the use of histc and recommend histcounts instead. The interface is similar but not identical (for the same edge vector, histcounts returns one fewer bin than histc), so check the documentation before swapping it in.
Using the above logic, you could also use the median: if the majority of the data fluctuates around 1, then the median has a high probability of landing on the steady-state value, so try this too:
ss_value = median(y_round);
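The whole recipe (round, bin by unique value, take the most frequent) can be cross-checked in Python with NumPy; the synthetic decaying oscillation below is a stand-in for the Dropbox data, and the rounding precision is an arbitrary choice, coarser than the 5 digits above to suit this synthetic signal:

```python
import numpy as np

# Synthetic stand-in for the real signal: a transient that settles near 1.
t = np.linspace(0, 10, 5000)
y = 1 - np.exp(-t) * np.cos(5 * t)

num_digits = 3
y_round = np.round(y, num_digits)

# Each unique rounded value is its own bin; counts come back alongside.
bins, counts = np.unique(y_round, return_counts=True)
ss_value = bins[np.argmax(counts)]

print(ss_value)  # the most frequent rounded value, near the steady state
```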

Accumulating votes in MATLAB

First, a little background to my problem:
I am building an object recognition system using a geometric hashing technique. My hash table is indexed by the affine co-ordinates of points in a model determined by a basis triplet (allowing an affine invariant representation of any learned object). Each hash table entry is a structure :
entry = struct('ModelName', modelName, 'BasisTriplet', [a; b; c]);
Now, an arbitrary basis triplet is extracted from image points then the affine co-ordinates of all other points are calculated relative to this basis and used as indices to the hash table. For each entry that exists in this hash bin, a vote is cast for the modelName and basis triplet.
After checking all points, the models and their corresponding basis triplets with a sufficiently high number of votes are taken as candidates for an object and a further verification step is performed.
I am unsure however what is the most efficient method of casting these votes. Currently I am using a dynamic cell array, each time a new model and basis triplet pair is voted for, an additional row is added to the array. Otherwise the vote count of an existing candidate is incremented.
for i = 1:length(keylist)
% Where keylist is an array of indices to the relevant keys to look up
% xkeys is the n by 2 array of all of the keys in the hash table
% Obtain this hash bin
bin = hashTable(xkeys(keylist(i), 1), xkeys(keylist(i), 2));
% Vote for every entry in the bin
for entry = 1:length(bin)
% Find the index of this model/basis in the voting accumulator
indAcc = find( strcmp(bin(entry).ModelName, v_models(:, 1)) & myIsEqual(v_basisTriplets, bin(entry).BasisTriplet) );
if isempty(indAcc)
% If the entry does not exist yet, add a new one
v_models = [v_models; {bin(entry).ModelName, 1}];
v_basisTriplets = cat(3, v_basisTriplets, bin(entry).BasisTriplet);
else
% Otherwise increment the count
v_models{indAcc, 2} = v_models{indAcc, 2} + 1;
end
end
end
There is a separate 3D array (v_basisTriplets) in which the 2D basis arrays are concatenated and indexed along the third dimension. I did keep these basis triplets in the cell array as well at first, but I had difficulty searching the cell array for a 2D array. The myIsEqual function just walks the third dimension, checks whether the 2D array at each index is equal to A, and returns a column vector of matches for use in find.
function ind = myIsEqual(vec3D, A)
ind = zeros(size(vec3D, 3), 1);
for i = 1:size(vec3D, 3)
ind(i) = isequal(vec3D(:, :, i), A);
end
This is most certainly not the most efficient way. Immediately I can see that it would be more efficient to initialize the arrays that store the votes beforehand. However, is there a better way in general of going about this? I need to find the most efficient and elegant way of voting, as there are usually hundreds of points to check and time is valuable.
Thanks
If you are only considering time efficiency, consider using a 4d matrix.
The dimensions would be:
Model
coordinateA
coordinateB
coordinateC
Depending on the ratio between this matrix size and the amount of points that you check, consider using a sparse matrix.
Note that especially if you can't use a sparse array, this method can be rather memory inefficient and may therefore be infeasible.
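A common alternative pattern for this kind of voting is to key an accumulator hash map on the (model, basis) pair, turning each vote into an amortized O(1) update instead of a linear search through the candidate list. Sketched here in Python with hypothetical entries (in MATLAB the analogue would be a containers.Map keyed on a string built from the model name and basis values):

```python
from collections import defaultdict

# Accumulator keyed on (model name, basis triplet); the basis is stored
# as a tuple of tuples so it can serve as a hash-map key.
votes = defaultdict(int)

def cast_vote(model_name, basis_triplet):
    key = (model_name, tuple(map(tuple, basis_triplet)))
    votes[key] += 1  # O(1) amortized, no search through existing candidates

# Hypothetical entries as they might come out of hash-table bins.
cast_vote("mug", [(0, 0), (1, 0), (0, 1)])
cast_vote("mug", [(0, 0), (1, 0), (0, 1)])
cast_vote("phone", [(2, 1), (3, 0), (1, 1)])

# Candidates with enough votes move on to the verification step.
threshold = 2
candidates = [k for k, v in votes.items() if v >= threshold]
print(candidates)  # [('mug', ((0, 0), (1, 0), (0, 1)))]
```

This keeps memory proportional to the number of distinct (model, basis) pairs actually voted for, avoiding the dense-matrix memory concern noted above.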