Universal hash with collisions - hash

I have a homework assignment, but I really don't know where to start.
Write a program in Java that implements the two families of universal hash functions we saw in class today.
Both families of functions depend on various parameters, so in Java you should use two classes whose constructors receive the appropriate parameters, and in C you should make some kind of structure that contains a pointer to a function.
To test your implementations, use the following sets: U = {0, 1, 2, ..., 10008} and D = {0, 1, 2, ..., 2052}. Write an application that chooses a function of each type at random and inserts 500 random numbers from U into a table whose addresses are the elements of D. The output of your program should be the number of collisions obtained during the insertions, using a bucket policy to resolve collisions.
Can you please tell me how I can start, or what I should implement for this?
Thanks

Create the two hash functions you have been given.
Generate 500 random numbers from U and insert them into your hash table, each at the address in D given by the hash function. The hash function used for the insertions should itself be chosen at random.
Iterate over your hash table and count the number of buckets containing more than one item from U.
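
The three steps above can be sketched as follows. The question does not say which two families were covered in class, so this sketch assumes one common choice, the family h(x) = ((a*x + b) mod p) mod m, with p = 10009 (a prime, equal to the size of U) and m = 2053 (the size of D). All class and method names are made up for illustration. It counts a collision each time a number lands in a bucket that is already occupied, which may differ from the definition your course uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UniversalHashDemo {
    static final int P = 10009; // prime, equal to |U| for U = {0, ..., 10008}
    static final int M = 2053;  // |D| for D = {0, ..., 2052}

    /** One member of the assumed family h(x) = ((a*x + b) mod p) mod m. */
    static final class LinearHash {
        final long a;
        final long b;

        LinearHash(long a, long b) {
            this.a = a;
            this.b = b;
        }

        int hash(int x) {
            return (int) (((a * x + b) % P) % M);
        }
    }

    /** Picks a function at random: a in {1, ..., p-1}, b in {0, ..., p-1}. */
    static LinearHash randomFunction(Random rng) {
        return new LinearHash(1 + rng.nextInt(P - 1), rng.nextInt(P));
    }

    /** Inserts n random keys from U and counts insertions into non-empty buckets. */
    static int countCollisions(LinearHash h, int n, Random rng) {
        List<List<Integer>> table = new ArrayList<>();
        for (int i = 0; i < M; i++) {
            table.add(new ArrayList<>());
        }
        int collisions = 0;
        for (int i = 0; i < n; i++) {
            int key = rng.nextInt(P); // uniform over U; repeated keys are possible
            List<Integer> bucket = table.get(h.hash(key));
            if (!bucket.isEmpty()) {
                collisions++;
            }
            bucket.add(key);
        }
        return collisions;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        LinearHash h = randomFunction(rng);
        System.out.println("collisions: " + countCollisions(h, 500, rng));
    }
}
```

A second family would follow the same shape: another class whose constructor takes that family's parameters and exposes the same hash method.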

Split a Matlab table in several tables dynamically

I am working in MATLAB and I have not yet found a way to split a table T into different tables {T1,T2,T3,...} dynamically. What I mean by dynamic is that it must be done based on some conditions of the table T that are not known a priori. For now, I do it in a non-dynamic way with the following code (I hard-code the number of tables I want to have).
%% Separate data of table T in tables T1,T2,T3
starting_index = 1;
T1 = T(1:counter_simulations(1),:);
starting_index = counter_simulations(1)+1;
T2 = T(starting_index:starting_index+counter_simulations(2)-1,:);
starting_index = starting_index + counter_simulations(2);
T3 = T(starting_index:starting_index+counter_simulations(3)-1,:);
Any ideas on how to do it dynamically? I would like to do something like that:
for (i=1:number_of_tables_to_create)
T{i} = ...
end
EDIT: the variable counter_simulations is an array containing the number of rows I want to extract for each table. Example: counter_simulations(1)=200 means that the first table will be T1 = T(1:200, :). If counter_simulations(2)=300, the second table will be T2 = T(counter_simulations(1)+1:counter_simulations(1)+300, :), and so on.
I hope I was clear.
Should I use cell arrays instead of tables maybe?
Thanks!
For the example you give, where counter_simulations contains a list of the number of rows to take from T in each of the output tables, MATLAB's mat2cell function actually implements this behaviour directly:
T = mat2cell(T,counter_simulations);
While you haven't specified the contents of counter_simulations, it's clear that if sum(counter_simulations) > height(T) the example would fail. If sum(counter_simulations) < height(T) (and so your desired output doesn't contain the last row(s) of T) then you would need to add a final element to counter_simulations and then discard the resulting output table:
counter_simulations(end+1) = height(T) - sum(counter_simulations);
T = mat2cell(T,counter_simulations);
T(end) = [];
Whether this solution applies to all examples of "some conditions of the table T that are not known a priori" that you ask for in the question depends on the range of conditions you actually mean; for a broad enough interpretation there will not be a general solution, but you might be able to narrow it down if mat2cell performs too specific a job for your actual problem.

How can I assign a unique value to represent a list of unique integers

I need to compare two lists of unique integers by assigning each list a unique value to represent its integers. What method/algorithm can I apply for this that's not too computationally intensive and produces a relatively short id/hash for a set?
Both lists:
have a unique set of integers ranging from 1 to 1000
are ordered
For example:
l1 = [1,2,3,4...55,57...999]
l2 = [1,2,3,4...54,56...999]
l1 is missing 56 while l2 is missing 55.
All I need to know in this case is that the lists are not identical so I can update l2.
Updated after comment
See below for an explanation of why you can't use a hash code to assign "each list a unique value to represent its integers."
However, a hash code can still be useful. Suppose you create a hash code for each list. You'll want to make sure that you sort the items in each list before computing the hash code, because order definitely matters in hash code computations. That won't necessarily generate a unique hash code for each list, but if the hash codes for two lists aren't identical, the lists are definitely different. If the hash codes are identical, the lists might be identical. The code, then, looks like this:
bool AreListsIdentical(list1, list2)
{
    if (list1.hashCode != list2.hashCode)
    {
        // hash codes are different, so lists are definitely not identical
        return false;
    }
    // hash codes are equal. Lists might be identical.
    if (list1.Count != list2.Count)
    {
        // lists have different numbers of items. Definitely not identical.
        return false;
    }
    // have to compare individual items
    for (int i = 0; i < list1.Count; ++i)
    {
        if (list1[i] != list2[i])
        {
            return false;
        }
    }
    return true;
}
Previous answer
You have multiple lists, each of which contains unique numbers in the range 1 to 1,000. You don't say how large each list is, but for illustration purposes I'll say that each list contains 10 numbers.
You also don't say whether order matters in the list. Is the list [1,7,99,206] the same as [99,7,206,1]? I'll show you the calculations either way.
The number of permutations (order matters) of 1,000 items taken 10 at a time is 9.56E+29. The number of combinations (order doesn't matter) is 2.63E+23. Those are huge numbers.
You say you want a "relatively short id." We can express a 64-bit value easily in a 12-character string, so let's assume that you want to create a 64-bit hash code. There are 1.84E+19 possible 64-bit values.
There are roughly fifty billion times more possible permutations than possible hash codes. There are roughly 14,000 times more combinations than hash codes.
Applying the Pigeonhole principle, you have n things that you want to put in m boxes. Since n > m, at least one box will contain more than one item. You can't possibly have a unique 64-bit value for each list.
(In truth, assuming a good hash function, every hash code will represent approximately the same number of different lists.)
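
The counts in this argument can be checked with exact integer arithmetic. The following is a small sketch using java.math.BigInteger; the class and method names are illustrative only, and the list size of 10 is the same illustrative assumption made above.

```java
import java.math.BigInteger;

public class PigeonholeCount {
    /** Number of ordered selections of k items from n: n * (n-1) * ... * (n-k+1). */
    static BigInteger permutations(int n, int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < k; i++) {
            result = result.multiply(BigInteger.valueOf(n - i));
        }
        return result;
    }

    /** Number of unordered selections: permutations(n, k) divided by k!. */
    static BigInteger combinations(int n, int k) {
        return permutations(n, k).divide(permutations(k, k));
    }

    public static void main(String[] args) {
        BigInteger hashes = BigInteger.ONE.shiftLeft(64); // 2^64 distinct 64-bit values
        BigInteger perms = permutations(1000, 10);
        BigInteger combs = combinations(1000, 10);
        System.out.println("64-bit values:         " + hashes);
        System.out.println("permutations:          " + perms);
        System.out.println("combinations:          " + combs);
        System.out.println("permutations per hash: " + perms.divide(hashes));
        System.out.println("combinations per hash: " + combs.divide(hashes));
    }
}
```

Whenever either "per hash" ratio is above 1, some hash code must be shared by more than one list, which is the pigeonhole argument in numbers.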

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about the following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
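
A minimal sketch of that idea, assuming a two-table cuckoo layout: each slot stores the key together with the value, and a lookup compares the stored key with equals before returning anything. The class name, the second hash function and the displacement limit are all assumptions for illustration, not taken from the question's implementation; a real table would rehash or grow instead of throwing.

```java
public class CuckooMap<K, V> {
    private static final int MAX_KICKS = 32; // assumed limit on displacements

    /** A slot keeps the original key so lookups can tell equal hashes apart. */
    private static final class Entry<K, V> {
        final K key;
        V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Entry<K, V>[] first;
    private final Entry<K, V>[] second;

    @SuppressWarnings("unchecked")
    public CuckooMap(int capacity) {
        first = (Entry<K, V>[]) new Entry[capacity];
        second = (Entry<K, V>[]) new Entry[capacity];
    }

    private int slot1(K key) {
        return Math.floorMod(key.hashCode(), first.length);
    }

    private int slot2(K key) {
        int h = key.hashCode() * 0x9E3779B9; // arbitrary odd multiplier as a second hash
        return Math.floorMod(h ^ (h >>> 16), second.length);
    }

    /** Returns the value only if the stored key really equals the requested key. */
    public V get(K key) {
        Entry<K, V> e = first[slot1(key)];
        if (e != null && e.key.equals(key)) return e.value;
        e = second[slot2(key)];
        if (e != null && e.key.equals(key)) return e.value;
        return null;
    }

    public void put(K key, V value) {
        Entry<K, V> e = first[slot1(key)];
        if (e != null && e.key.equals(key)) { e.value = value; return; }
        e = second[slot2(key)];
        if (e != null && e.key.equals(key)) { e.value = value; return; }

        Entry<K, V> cur = new Entry<>(key, value);
        for (int i = 0; i < MAX_KICKS; i++) {
            int s1 = slot1(cur.key);
            Entry<K, V> evicted = first[s1];
            first[s1] = cur;                 // place in table 1, evicting any occupant
            if (evicted == null) return;
            int s2 = slot2(evicted.key);
            cur = second[s2];
            second[s2] = evicted;            // move the evicted entry to table 2
            if (cur == null) return;
        }
        // Sketch only: the entry held in cur is dropped here; a real table would rehash.
        throw new IllegalStateException("too many displacements");
    }
}
```

For example, the Java strings "Aa" and "BB" have the same hashCode, so after put("Aa", 1) a call to get("BB") inspects the slot holding "Aa", sees that the keys differ, and returns null instead of 1.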

Compute "substring" distances between sequences

My dataset (first line = header) is the following:
ID;Activity 1;Activity 2; ... ;Activity 20;
Company_X;A1A3T1D1O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1O2R2
Company_Y;A1A3T1O1R1;A1A3T2O1R2;...;A1A3T11O1O3R5
Company_Z;A1A3T1D8O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1R2
where, for each activity, each pair (one letter + one number) represents one part of a sequence: A1 = actor 1, A3 = actor 3, O1 = object 1. What I am trying to do is compute the difference between the activities of companies. For instance, activity 1 of Company_X should have a difference of, e.g., 2 from activity 1 of Company_Y, since they have A1A3T1O1R1 in common but not D1 and R8.
Can any packages in TraMineR do that? Which means comparing, within each event, a predefined number of chars?
Thank you very much for your help
From what I understand, each string (activity) like A1A3T6D2O1O2R2 should be considered as a sequence of pairs and you want to compare such sequences.
The seqdef function of TraMineR can read sequences in string form. However, when each element is defined by more than a single character, you have to introduce a separator (e.g., A1-A3-T6) for that. Then, to pair your sequences with company names you may also need to organize your data in table form with each sequence (activity) in a separate row, something like
ID Activity
company_x A1-A3-T6-D2-O1-O2-R2
company_y A1-A3-T1-O1-R1
...
Then, you can compute dissimilarities using measures applicable to sequences of different lengths. Optimal matching (OM), for instance, is the minimal cost of transforming one sequence into the other given the indel and substitution costs. This should give you what you expect. Depending on the substitution costs, the distance between A1A3T6D2O1O2R2 and A1A3T6D2O1R2 could be different from that between A1A3T6D2O1O2R2 and A3T4.
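
To make the OM idea concrete outside of R, here is a plain Java sketch that is not TraMineR code. It splits an activity string into letter-plus-number pairs and computes the minimal edit cost by dynamic programming. The costs used in the usage note (indel = 1, substitution = 2) are an assumption chosen for illustration, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ActivityDistance {
    // One element = one uppercase letter followed by one or more digits, e.g. "A1" or "T11".
    private static final Pattern PAIR = Pattern.compile("[A-Z]\\d+");

    /** Splits "A1A3T6" into ["A1", "A3", "T6"]. */
    static List<String> tokens(String activity) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(activity);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    /** Minimal cost of turning s into t with the given indel and substitution costs. */
    static double distance(String s, String t, double indel, double sub) {
        List<String> a = tokens(s);
        List<String> b = tokens(t);
        double[][] d = new double[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) d[i][0] = i * indel;
        for (int j = 1; j <= b.size(); j++) d[0][j] = j * indel;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                double replace = a.get(i - 1).equals(b.get(j - 1)) ? 0.0 : sub;
                d[i][j] = Math.min(d[i - 1][j - 1] + replace,
                          Math.min(d[i - 1][j] + indel, d[i][j - 1] + indel));
            }
        }
        return d[a.size()][b.size()];
    }

    public static void main(String[] args) {
        System.out.println(distance("A1A3T1D1O1R1R8", "A1A3T1O1R1", 1.0, 2.0));
    }
}
```

With these assumed costs, the question's example A1A3T1D1O1R1R8 versus A1A3T1O1R1 comes out as 2.0 (delete D1 and R8), matching the difference the question expects.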

Pseudorandom seed methodology for lookup tables

Could someone please suggest a good way of taking a global seed value, e.g. "Hello World", and using that value to look up values in arrays or tables?
I'm thinking of something like the classic spacefaring game "Elite", where the planets had different attributes that were not random but simply derived from the seed value for the universe.
I was thinking of running MD5 on the input value, then taking bytes from the hash, casting them to integers and reducing them modulo the table size to get valid indexes for the lookup tables, but I suspect there must be a better way. I read something about Mersenne twisters, but maybe that would be overkill.
I'm hoping for something that will give a good distribution over the values in my lookup tables, e.g. Red, Orange, Yellow, Green, Blue, Purple.
Also to emphasize I'm not looking for random values but consistent values each time.
Update: Perhaps I'm having difficulty in expressing my own problem domain. Here is an example of a site which uses generators and can generate X number of values: http://www.seventhsanctum.com
Additional criteria
I would prefer to work from first principles rather than making use of library functions such as System.Random
My approach would be to use your key as a seed for a random number generator:
public StarSystem(long systemSeed){
    java.util.Random r = new java.util.Random(systemSeed);
    Color c = colorArray[r.nextInt(colorArray.length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.nextInt(politicsArray.length)];
    ...
}
For a given seed this will produce the same color and the same political system every time.
To get the starting seed from a string you could just use MD5 and take the first or last 64 bits for your long; the other approach would be to just use a number for each planet. Elite also generated the names for each system using its pseudo-random generator.
for(long seed=1; seed<NUMBER_OF_SYSTEMS; seed++){
starSystems.add(new StarSystem(seed));
}
By setting the seed to a known value, the Random will return the same sequence every time it is called. This is why a good seed is very important when you want good random values. In your case, however, a known seed will produce the results you're looking for.
The C# equivalent is
public StarSystem(int systemSeed){
    System.Random r = new System.Random(systemSeed);
    Color c = colorArray[r.Next(colorArray.Length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.Next(politicsArray.Length)];
    ...
}
Notice a difference? No, nor did I.
Many common random number generators will generate the same sequence given the same seed value, so it seems that all you need to do is convert your name into a number. There are any number of hashing functions that will do that.
Supplementary question: Is it required that all unique strings generate unique hashes and so (probably) unique pseudo-random sequences?
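
Since the question prefers first principles over library generators, here is a minimal sketch of that pipeline with no Random class at all. It swaps in two techniques that are not from this thread: the 64-bit FNV-1a hash to turn the seed string into a number, and a 64-bit linear congruential generator with Knuth's MMIX constants to produce the sequence. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class SeededLookup {
    private long state;

    /** Derives the starting state from the seed text, e.g. "Hello World". */
    SeededLookup(String seedText) {
        state = fnv1a64(seedText);
    }

    /** 64-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime. */
    static long fnv1a64(String text) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    /** Advances the LCG and maps its high bits to an index in [0, bound). */
    int nextIndex(int bound) {
        state = state * 6364136223846793005L + 1442695040888963407L;
        return (int) ((state >>> 33) % bound);    // high bits are the better-mixed ones
    }

    /** Picks one entry of a lookup table; the same seed gives the same picks in order. */
    <T> T pick(T[] table) {
        return table[nextIndex(table.length)];
    }

    public static void main(String[] args) {
        String[] colors = {"Red", "Orange", "Yellow", "Green", "Blue", "Purple"};
        SeededLookup universe = new SeededLookup("Hello World");
        for (int i = 0; i < 5; i++) {
            System.out.println(universe.pick(colors));
        }
    }
}
```

Two different seed strings can in principle hash to the same 64-bit state, so unique strings are not guaranteed to give unique sequences; for a lookup-table generator that is usually acceptable.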