Pseudorandom seed methodology for lookup tables - hash

Could someone please suggest a good way of taking a global seed value, e.g. "Hello World", and using that value to look up values in arrays or tables?
I'm thinking of something like the classic spacefaring game "Elite", where the planets had different attributes that were not random but simply derived from the seed value for the universe.
I was thinking of running MD5 on the input value and then using bytes from the hash, casting them to integers and taking them modulo the table size to get acceptable indexes for the lookup tables, but I suspect there must be a better way. I read something about Mersenne twisters, but maybe that would be overkill.
I'm hoping for something which will give a good distribution over the values in my lookup tables, e.g. Red, Orange, Yellow, Green, Blue, Purple.
Also, to emphasize: I'm not looking for random values but for consistent values each time.
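To make that concrete, here is a rough sketch (Java, purely illustrative) of the kind of thing I had in mind, using the color table above:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SeededLookup {
    private static final String[] COLORS =
            { "Red", "Orange", "Yellow", "Green", "Blue", "Purple" };

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hash the global seed string once.
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest("Hello World".getBytes(StandardCharsets.UTF_8));

        // Use one byte of the digest per attribute; mask with 0xFF so the byte
        // is treated as 0..255 before taking it modulo the table size.
        String color = COLORS[(digest[0] & 0xFF) % COLORS.length];
        System.out.println(color); // always the same output for "Hello World"
    }
}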
Update: Perhaps I'm having difficulty in expressing my own problem domain. Here is an example of a site which uses generators and can generate X number of values: http://www.seventhsanctum.com
Additional criteria
I would prefer to work from first principles rather than making use of library functions such as System.Random

My approach would be to use your key as a seed for a random number generator:
public StarSystem(long systemSeed){
    java.util.Random r = new Random(systemSeed);
    Color c = colorArray[r.nextInt(colorArray.length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.nextInt(politicsArray.length)];
    ...
}
For a given seed this will produce the same color and the same political system every time.
For getting the starting seed from a string, you could just use an MD5 sum and grab the first/last 64 bits for your long. The other approach would be to just use a numeric seed for each planet. Elite also generated the names for each system using its pseudo-random generator.
for(long seed=1; seed<NUMBER_OF_SYSTEMS; seed++){
    starSystems.add(new StarSystem(seed));
}
By setting the seed to a known value, the Random will return the same sequence every time it is used; this is why, when trying for good random values, a good seed is very important. In your case, however, a known seed will produce exactly the results you're looking for.
The C# equivalent is:
public StarSystem(int systemSeed){
    System.Random r = new Random(systemSeed);
    Color c = colorArray[r.Next(colorArray.Length)]; // generates a pseudo-random number based on your seed
    PoliticalSystem politics = politicsArray[r.Next(politicsArray.Length)];
    ...
}
Notice much of a difference? No, nor did I.
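As for getting the long seed from the string via an MD5 sum, a minimal sketch in Java (the helper and class names are just placeholders) could be:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SeedFromString {
    // Take the first 64 bits of the 128-bit MD5 digest as the long seed.
    public static long seedFromString(String name) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(name.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(digest).getLong();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The same string always yields the same seed, so something like
        // new StarSystem(seedFromString("Hello World")) is fully reproducible.
        System.out.println(seedFromString("Hello World"));
    }
}
Any stable reduction of the string to 64 bits will do; MD5 just happens to mix the input well.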

Many common random number generators will generate the same sequence given the same seed value, so it seems that all you need to do is convert your name into a number. There are any number of hashing functions that will do that.
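For example (a minimal sketch; String.hashCode is just one convenient, if weak, choice of hashing function):
import java.util.Random;

public class NameSeed {
    public static void main(String[] args) {
        // Convert the name into a number and use it as the seed.
        Random r = new Random("Hello World".hashCode());
        System.out.println(r.nextInt(6)); // same table index every run for this name
    }
}
String.hashCode is specified by the language, so the result is stable across machines, but as the supplementary question below hints, it is not unique per string.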
Supplementary question: is it required that all unique strings generate unique hashes, and therefore (probably) unique pseudo-random sequences?


How does Snowflake calculate its HASH() output?

Take a look at this query:
select
    hash( col1, col2 ) as a,
    col1 || col2 as b, -- just taking a guess as to how hash can take multiple values
    hash( b ) as c
from table_name
The results for a and c are different.
So, my question is: how does Snowflake calculate the hash when there are many fields, like in a? Is it concatenating the fields first and then hashing that concatenated result?
Thank you
More to NickW's point that HASH is proprietary:
HASH is a proprietary function that accepts a variable number of input expressions of arbitrary types and returns a signed value. It is not a cryptographic hash function and should not be used as such.
I assume the core of what you are trying to achieve is to make a value in another system and be able to compare the two "safely". Concatenating strings together seems very dangerous for that, because the number of strings and the length of each string are properties of those strings that get lost in the concatenation.
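HASH itself is proprietary, so the following is not how Snowflake computes it, but a quick Java analogy shows why plain concatenation is risky: two different field lists can concatenate to the exact same string, while a field-aware combination keeps them apart.
import java.util.Objects;

public class ConcatVsFields {
    public static void main(String[] args) {
        // Two different pairs of fields...
        String a1 = "ab", a2 = "c";
        String b1 = "a",  b2 = "bc";

        // ...concatenate to the identical string, so any hash of the
        // concatenation cannot tell them apart.
        System.out.println((a1 + a2).equals(b1 + b2)); // true

        // A combiner that knows about the field boundaries does distinguish them.
        System.out.println(Objects.hash(a1, a2) == Objects.hash(b1, b2)); // false for these values
    }
}
If you do need to compare across systems via a concatenated string, inserting an explicit delimiter or length prefix between fields removes that ambiguity.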
The usage notes section has some good hints:
Any two values of type NUMBER that compare equally will hash to the same hash value, even if the respective types have different precision and/or scale.
This implies that values are converted to a common form before hashing, but the notes also warn about conversion:
Note that this guarantee does not apply to other combinations of types, even if implicit conversions exist between the types.
What would really help is for you to describe what you want to happen; then the question of whether "knowing how HASH works" is the best path to that end (or not, as I would suggest) becomes more answerable.
In other words, this answer is a long-form way of saying this question needs to be reworked.

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in its values to track lots of agricultural product. Due to the way the weighing of product takes place at more than one facility, I have no other option but to maintain the same base number and use letters in addition to that base number to denote split portions of each lot of product. The problem is that after I create record number 99, record 100 sorts up underneath 10 rather than after 99. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value (for which I use "AutoNumber" as the data type) just to keep things sorted. Either way, I need the alphanumeric lot ID, so having two IDs for the same lot can be confusing for anyone inputting values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, then you may try sorting by the string converted to a number, something like:
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng - it should not fail on non-numeric input.
Why not format your key properly before saving? E.g. "0000099". You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).

Compute "substring" distances between sequences

My dataset (first line = header) is the following:
ID;Activity 1;Activity 2; ... ;Activity 20;
Company_X;A1A3T1D1O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1O2R2
Company_Y;A1A3T1O1R1;A1A3T2O1R2;...;A1A3T11O1O3R5
Company Z;A1A3T1D8O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1R2
where for each activity, each pair (one letter + one number) represents one part of a sequence: A1 = actor 1, A3 = actor 3, O1 = object 1. What I am trying to do is compute the difference between the activities of companies. For instance, activity 1 of Company_X should have a difference of, e.g., 2 with activity 1 of Company_Y, since they have A1A3T1O1R1 in common but not D1 and R8.
Can any function in TraMineR do that, i.e., compare a predefined number of characters within each event?
Thank you very much for your help
From what I understand, each string (activity) like A1A3T6D2O1O2R2 should be considered as a sequence of pairs and you want to compare such sequences.
The seqdef function of TraMineR can read sequences in string form. However, when each element is defined by more than a single character, you have to introduce a separator (e.g., A1-A3-T6) for that. Then, to pair your sequences with company names you may also need to organize your data in table form with each sequence (activity) in a separate row, something like
ID Activity
company_x A1-A3-T6-D2-O1-O2-R2
company_y A1-A3-T1-O1-R1
...
Then, you can compute dissimilarities using measures applicable to sequences of different lengths. Optimal matching (OM), for instance, is the minimal cost of transforming one sequence into the other given the indel and substitution costs. This should give you what you expect. Depending on the substitution costs, the distance between A1A3T6D2O1O2R2 and A1A3T6D2O1R2 could be different from that between A1A3T6D2O1O2R2 and A3T4.

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys are important. The data stored is of no consequence, and neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith to UTF8Type, then the keys will be interpreted as UTF-8 strings. If you set CompareWith to TimeUUIDType, then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page: http://wiki.apache.org/cassandra/API - this is a good place to start. Also, you might find this article useful: http://www.sodeso.nl/?p=80 - in the third part or so he talks about slice-ranging his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Conceptually, the cluster is a circular hash ring, and keys get hashed by the partitioner to be distributed around the cluster.
Beware of using dates as row keys, however, since even the randomization of the default RandomPartitioner is limited and you could end up clustering your data.
What's more, if that date is changing, you would have to delete the previous row, since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row with a start value and an end value. This is used mostly for wide rows, as columns are ordered. Known column names defined in the CF are indexed, however, so they can be retrieved by specifying their names.
A key slice is a key associated with the sliced column range, as returned by Cassandra.
The equivalent of a WHERE clause uses secondary indexes. You may use inequality operators there; however, there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a RandomPartitioner, as the MD5 hash of your key doesn't keep lexical ordering.
What you want to use is a Column Family based index using a wide row:
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key (a "shard key") that would split the data across nodes, such as the user type or the region.
Having more data than necessary in Cassandra is not a problem, it's how it is designed, so what you must ask yourself is "what do I need to query" and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.

How does the hash part in hash maps work?

So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?
That's one of the key problems of hashmaps/dictionaries and so on: you have to choose a good hash function. A very bad but fast hash function could be the length of the keys. You instantly see that you will get a lot of collisions (different keys, but the same hash). Another bad hash function could be the ASCII value of the first character of your key. Lots of collisions, too.
So you need a function that is a lot better than those two. You could add (or XOR) all ASCII values of the key characters and mix the length in, for instance. In practice, you often derive the hash from the values (fields) of the object that you want to hash (the same values give the same hash, i.e., value semantics). For reference types you can mix in a memory location, for instance.
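For instance, a toy Java version of that idea (XOR the character values and mix the length in; purely illustrative, not a production hash) could look like:
public class ToyHash {
    // XOR all character values together, mix the length in,
    // then reduce the result to a bucket index.
    static int toyHash(String key, int tableSize) {
        int h = key.length();
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
        }
        return h % tableSize; // h stays non-negative here, so % gives a valid index
    }

    public static void main(String[] args) {
        System.out.println(toyHash("John Smith", 16));
        System.out.println(toyHash("Lisa Smith", 16));
    }
}
This still collides easily (XOR ignores character order, for example), which is exactly why real implementations mix the input much more thoroughly.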
In your example that's just simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps
A simple hash function may be as follows:
$hash = ord($string[0]) % HASH_TABLE_SIZE; // character code of the first letter, reduced to a bucket index
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.
The hash function most often (but not necessarily always) outputs an integer within the wanted range (often a parameter to the hash function). This integer can be used as an index. Notice that a hash function cannot be guaranteed to always produce a unique result when given different data to hash. This is called a hash collision, and the hash algorithm must always handle it in some way.
As for your specific question of how a string becomes a number: any string is composed of characters (J, o, h, n, ...), and characters can be interpreted as numbers in a computer. The ASCII and Unicode standards bind certain values to certain characters, so the result is deterministic and always the same on all computers. So the hash function operates on these characters as numbers and comes up with another number (the output). You could, for example, simply sum all the values and use the modulo operation to range-limit the result.
That would be quite a horrible hash function, because, for example, "ab" and "ba" would get the same result. Designing a hash function is difficult, so one should use a ready-made algorithm unless the situation dictates some other solution.
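A sketch of that sum-and-modulo idea in Java (purely illustrative), including the "ab"/"ba" collision just mentioned:
public class SumHash {
    // Sum the character values of the key and reduce modulo the table size.
    static int sumHash(String key, int tableSize) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c; // each character contributes its numeric (Unicode) value
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        // Same letters in a different order give the same sum, hence a collision.
        System.out.println(sumHash("ab", 16)); // ('a' + 'b') = 195, 195 % 16 = 3
        System.out.println(sumHash("ba", 16)); // also 3
    }
}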
There's a really good article on MSDN about how hash functions (and collision detection/resolution) work:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.
All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.
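To illustrate that last point with a hedged Java sketch: a constant hashCode satisfies the contract (equal keys always get equal hash codes) and a HashMap still returns correct answers, but every entry lands in one bucket, so lookups degrade to a linear scan.
import java.util.HashMap;
import java.util.Map;

public class ConstantHashKey {
    private final String name;

    ConstantHashKey(String name) { this.name = name; }

    @Override
    public boolean equals(Object o) {
        return o instanceof ConstantHashKey
                && ((ConstantHashKey) o).name.equals(name);
    }

    @Override
    public int hashCode() {
        return 1; // legal: equal objects still get equal hash codes
    }

    public static void main(String[] args) {
        Map<ConstantHashKey, Integer> map = new HashMap<>();
        map.put(new ConstantHashKey("John Smith"), 5211234);
        map.put(new ConstantHashKey("Lisa Smith"), 8976543);
        // Correct, just slow: every key is in the same bucket.
        System.out.println(map.get(new ConstantHashKey("Lisa Smith"))); // 8976543
    }
}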