Several questions about NDB keys - hash

Question a: Is there any absolute guarantee that an entity successfully retrieved from NDB will not contain a key with value None, nor key.id() with value None? Is this guarantee written anywhere in the docs?
Question b: Do keys store the id as a str/int? Or do keys store the id as a hash that is decoded when key.id() is called? If it is a hash, proceed to question c.
Question c: Is the number of characters in the key hash limited or a set amount? If yes, proceed to question d.
Question d: Does the Key constructor limit the str length of the id when creating a Key? If no, proceed to question e.
Question e: If I construct a key with a str id whose length equals the length of the key hash + 1, how can the key decrypt and retrieve the id with id() if the range of values the hash can take is smaller than the range of values of the str id I provide?
Thanks for your time!

Answer a: If you ever retrieve an entity with a key (or id) of None, that would indicate a much bigger problem with NDB. As far as I know, this is not possible because every entity needs a key. No, there's no "guarantee" of this apart from the source code.
Answer b: The key stores the id as a str/int, as seen in the source code:
def id(self):
    """Return the string or integer id in the last (kind, id) pair, if any.

    Returns:
      A string or integer id, or None if the key is incomplete.
    """
    return self.__pairs[-1][1]
Since it does not use a hash, I did not proceed with answers c, d and e.

Related

What's a best practice for saving a unique, random, short string to db?

I have a table with a varchar column named key, which is supposed to hold a unique, 8-char random string that will be used as a unique identifier by users. This field should be generated and saved on creation of objects, and I have a question about how to create it:
Most recommendations point to a UUID field, but that's not applicable for me because it's too long, and if I just take a subset of it there's no guarantee of uniqueness.
Currently I've just implemented a loop in my backend (not the DB), which generates a random string, tries to insert it into the DB, and retries if the string turns out not to be unique. But I feel that this is just really bad practice.
What's the best way to do this?
I'm using Postgresql 9.6
UPDATE:
My main concern is to remove the loop that retries until it finds a random, short string (or number, doesn't matter) that is unique in that table. AFAIK the solution should be a way to generate the string in the DB itself. The only thing that I can find for PostgreSQL is uuid and uuid-ossp, which do something like this, but uuid is way too long for my application, and I don't know of any way to have a shorter representation of a uuid without compromising its uniqueness (and I don't think it's possible theoretically).
So, how can I remove the loop and its back-and-forth to the DB?
Encryption is guaranteed to be unique; it has to be, otherwise decryption would not work. Provided you encrypt unique inputs, such as 0, 1, 2, 3, ... then you are guaranteed unique outputs.
You want 8 characters. You have 62 characters to play with: A-Z, a-z, 0-9, so convert the binary output of the encryption to a base-62 number.
You may need to use the cycle walking technique from Format-preserving encryption to handle a few cases.
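To make the encrypt-a-counter idea concrete, here is a minimal sketch in Python under my own assumptions (the names `short_id` and `_feistel`, the round count, and the key are all hypothetical, not an established library): a small balanced Feistel network gives a bijection on 48-bit integers, cycle walking keeps the output below 62^8, and the result is base-62 encoded to exactly 8 characters.

```python
import hashlib

BITS = 48                  # 62**8 is about 2**47.6, so 48 bits cover the 8-char space
HALF = BITS // 2
MASK = (1 << HALF) - 1
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
MAX_CODE = 62 ** 8         # number of distinct 8-character base-62 strings

def _round(half, key, rnd):
    # Round function: hash the half-block together with the key and round number.
    data = f"{key}:{rnd}:{half}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:3], "big") & MASK

def _feistel(n, key, rounds=4):
    # A balanced Feistel network is a bijection on 48-bit integers.
    left, right = n >> HALF, n & MASK
    for rnd in range(rounds):
        left, right = right, left ^ _round(right, key, rnd)
    return (left << HALF) | right

def encode_base62(n, width=8):
    # Fixed-width base-62 encoding (zero-padded to 8 characters).
    chars = []
    for _ in range(width):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def short_id(counter, key="secret"):
    """Map a sequence number (< 62**8) to a unique 8-char base-62 string."""
    n = _feistel(counter, key)
    while n >= MAX_CODE:   # cycle walking: re-encrypt until the value is in range
        n = _feistel(n, key)
    return encode_base62(n)
```

Since the Feistel permutation restricted by cycle walking is a bijection on [0, 62^8), feeding it a database sequence guarantees no collisions and no retry loop.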

How are same hash vs same key handled?

This question is not specific to any programming language, I am more interested in a generic logic.
Generally, associative maps take a key and map it to a value. As far as I know, implementations require the keys to be unique; otherwise values get overwritten. Alright.
So let us assume that the above is done by some hash implementation.
What if two DIFFERENT keys get the same hash value? I am thinking of this in the form of an underlying array whose indices are the result of hashing said keys. It could be possible that more than one unique key gets mapped to the same index, yes? If so, how does such an implementation handle this?
How is handling the same hash different from handling the same key, given that a duplicate key results in overwriting while a duplicate hash HAS to retain both values?
I understand hashing with collision, so I know chaining and probing. Do implementations iterate over the current values which are hashed to a particular index and determine if the key is the same?
While I was searching for the answer I came across these links:
1. What happens when a duplicate key is put into a HashMap?
2. HashMap with multiple values under the same key
They don't answer my question however. How do we distinguish between same hash vs same key?
By comparing the keys. If you look at object-oriented implementations of hash maps, you'll find that they usually require two methods to be implemented on the key type:
bool equal(Key key1, Key key2);
int hash(Key key);
If only the hash function can be given and no equality function, that restricts the hash map to be based on the language's default equality. This is not always desirable as sometimes keys need to be compared with a different equality function. For example, if the keys are strings, an application may need to do a case-insensitive comparison, and then it would pass a hash function that converts to lowercase before hashing, and an equal function that ignores case.
The hash map stores the key alongside each corresponding value. (Usually, that's a pointer to the key object that was originally stored.) Any lookup into the hash map has to make a key comparison after finding a matching hash, to verify that the key actually matches.
For example, for a very simple hash map that stores a list in each bucket, the list would be a list of (key, value) pairs, and any lookup compares the keys of each list entry until it finds a match. In pseudocode:
Array<List<Pair<Key, Value>>> buckets;

Value lookup(Key k_sought) {
    int h = hash(k_sought);
    List<Pair<Key, Value>> bucket = buckets[h % buckets.length];
    for (kv in bucket) {
        Key k_found = kv.0;
        Value v_found = kv.1;
        if (equal(k_sought, k_found)) {
            return v_found;
        }
    }
    throw Not_found;
}
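The same pseudocode can be turned into a runnable sketch, here in Python with hypothetical names (`HashMap`, `put`, `get` are my own choices for illustration): each bucket stores (key, value) pairs, and pluggable `hash_fn`/`equal_fn` parameters also cover the case-insensitive string example mentioned above.

```python
class HashMap:
    """Minimal bucket-list hash map storing (key, value) pairs."""

    def __init__(self, nbuckets=16, hash_fn=hash, equal_fn=lambda a, b: a == b):
        self.buckets = [[] for _ in range(nbuckets)]
        self.hash_fn = hash_fn
        self.equal_fn = equal_fn

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if self.equal_fn(key, k):       # same key: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # same hash, different key: chain

    def get(self, key):
        for k, v in self._bucket(key):      # compare stored keys, not hashes
            if self.equal_fn(key, k):
                return v
        raise KeyError(key)

# Case-insensitive keys: lowercase before hashing, ignore case when comparing.
ci = HashMap(hash_fn=lambda s: hash(s.lower()),
             equal_fn=lambda a, b: a.lower() == b.lower())
```

Note that `get` never trusts the bucket index alone; it always confirms the stored key, which is exactly how "same hash" is distinguished from "same key".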
You cannot tell what a key is from the index, so no, you cannot iterate over the values to recover any information about the keys. You either have to guarantee zero collisions or store the information that was hashed to produce the index.
If you only store values in your structure, there is no way to tell whether two entries have the same key or just the same hash. You need to store the key along with the value to know.

Generate C* bucket hash from multipart primary key

I will have C* tables that will be very wide. To prevent them from becoming too wide, I have encountered a strategy that could suit me well. It was presented in this video.
Bucket Your Partitions Wisely
The good thing about this strategy is that there is no need for a "look-up table" (it is fast); the bad part is that one needs to know the max number of buckets and may eventually run out of buckets to use (not scalable). I know my max bucket size, so I will try this.
A hash calculated from the table's primary key can be used as a bucket part together with the rest of the primary key.
I have come up with the following method to make sure (I think?) that the hash will always be the same for a specific primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part -> {
        combinedHashString.append(
            String.valueOf(
                Hashing.consistentHash(Hashing.sha512()
                    .hashBytes(part.getBytes()), maxBuckets)));
    });
    return combinedHashString.toString();
}
The reason I use sha512 is to be able to handle strings with a maximum of 256 characters (512 bits); otherwise the result will never be the same (as far as I can tell from my tests).
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: between different JVM executions on different nodes/machines, the result should always be the same for a given Cassandra primary key.
Can I rely on the mentioned method to do the job?
Is there a better solution of hashing large strings so they always will produce the same result for a given string?
Do I always need to hash from string or could there be a better way of doing this for a C* primary key and always produce same result?
Please, I don't want to discuss data modeling for a specific table, I just want to have a bucket strategy.
EDIT:
Elaborated further and came up with this so the length of string can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
    List<HashCode> hashCodes = new ArrayList<>();
    for (String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    }
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}
I currently use a similar solution in production. So for your method I would change to:
public static int bucket(List<String> primKeyParts, int maxBuckets) {
    String keyParts = String.join("", primKeyParts);
    return Hashing.consistentHash(
        Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
        maxBuckets);
}
So the differences:
- Send all the PK parts into the hash function at once.
- We set the max buckets as a code constant, since the hash is only consistent if the max buckets stay the same.
- We use the MurmurHash3 hash since we want it to be fast, not cryptographically strong.
For your direct questions: 1) Yes, the method should do the job. 2) I think that with the tweaks above you should be set. 3) The assumption is that you need the whole PK?
I'm not sure you need to use the whole primary key since the expectation is that your partition part of your primary key is gonna be the same for many things which is why you are bucketing. You could just hash the bits that will provide you with good buckets to use in your partition key. In our case we just hash some of the clustering key parts of the PK to generate the bucket id we use as part of the partition key.
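For readers outside the JVM, the same idea can be sketched in Python (the function name, the separator, and the constant are my assumptions). Note this uses a plain modulo over a stable digest rather than Guava's consistentHash, so it is only valid while max_buckets stays a fixed code constant, which matches the advice above.

```python
import hashlib

MAX_BUCKETS = 32  # fixed constant: bucket ids are only stable while this never changes

def bucket(prim_key_parts, max_buckets=MAX_BUCKETS):
    """Derive a deterministic bucket id from a list of primary-key parts.

    The parts are joined with a separator so that ("ab", "c") and
    ("a", "bc") do not produce the same hash input. MD5 is used here
    because it is stable across processes and machines (unlike Python's
    built-in salted str hash), not for cryptographic strength.
    """
    joined = "\x1f".join(prim_key_parts)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_buckets
```

The key property, as in the Guava version, is that the bucket id depends only on the key parts and the constant, never on process state.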

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
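A minimal Python sketch of this (the class and method names are my assumptions, and the bounded eviction loop stands in for a real table's rehash/grow step) shows why storing the key next to the value is what disambiguates equal hash codes:

```python
import hashlib

class CuckooMap:
    """Minimal cuckoo hash map sketch: two tables, two hash functions,
    and each slot holds a (key, value) pair so lookups can verify the
    stored key rather than trusting the slot index alone."""

    def __init__(self, n=16):
        self.n = n
        self.tables = [[None] * n, [None] * n]

    def _slot(self, which, key):
        # Two independent hash functions, derived by salting with the table id.
        data = f"{which}:{key}".encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big") % self.n

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:  # compare the stored key
                return entry[1]
        raise KeyError(key)

    def put(self, key, value):
        # Overwrite if the key is already present in either table.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise insert, evicting occupants back and forth between tables.
        item, which = (key, value), 0
        for _ in range(32):                # bounded eviction loop
            i = self._slot(which, item[0])
            item, self.tables[which][i] = self.tables[which][i], item
            if item is None:
                return
            which = 1 - which
        raise RuntimeError("eviction cycle; a real table would rehash or grow")
```

Because `get` checks `entry[0] == key`, a value stored under A's key is never returned for a different key B that merely hashes to the same slot.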

Swift: Unique Int id from String

I am using Parse, which has a preloaded User table in the database. I want a unique userId (Int) for each user. Parse's objectId is unique but not an Int, and username is a String. Username is unique for each user, so can I somehow convert each username into a number?
I tried .toInt() , Int() but I got nothing.
WHY:
I have an existing table with users' ratings (movies) and I want to extend this table with more ratings. The userId field is a Number value, so I must keep it that way.
Swift String has a hash property. It also conforms to the Hashable protocol. Maybe you can use that.
However, hashValue has the following comment:
Axiom: x == y implies x.hashValue == y.hashValue.
Note: The hash value is not guaranteed to be stable across different invocations of the same program. Do not persist the hash value across program runs.
so, use carefully...
Note: as stated in the comments, the hashValue is not guaranteed to be unique, but collisions should be rare, so it may be a solution anyway.
Having a unique arbitrary String-to-Int mapping is not possible. You have to put some constraints on the allowed characters and string length. However, even if you use case-insensitive alphanumeric user names with some smart variable-length bit-encoding, you are looking at roughly 5 bits per character on average. A 64-bit integer can accommodate up to about 12 characters this way. Longer than that, you will inevitably have collisions.
I think you are approaching the problem from the wrong end. Instead of having a function for String -> Int mapping, what stops you from having a separate table with an Int <-> String mapping? Just have some functionality that checks whether a given username already has an entry in that table, and if it does not, inserts a new record for it and assigns it a new unique number. This way it will take quite some time and service popularity to deplete 64-bit integer capacity.
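The lookup-table approach can be sketched in a few lines of Python (the class and method names are hypothetical; in practice the table and counter would live in the Parse database rather than in memory):

```python
import itertools

class UserIdTable:
    """Sketch of an Int <-> String mapping table: assign the next integer
    the first time a username is seen, then return the same id forever."""

    def __init__(self):
        self._by_name = {}                  # username -> userId
        self._counter = itertools.count(1)  # next id to hand out

    def user_id(self, username):
        if username not in self._by_name:
            self._by_name[username] = next(self._counter)
        return self._by_name[username]
```

Unlike hashValue, ids assigned this way are stable across runs (once persisted) and collision-free by construction.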