Generate C* bucket hash from multipart primary key

I will have C* tables that will be very wide. To prevent them from becoming too wide, I have come across a strategy that could suit me well. It was presented in this video:
Bucket Your Partitions Wisely
The good thing with this strategy is that there is no need for a "look-up table" (it is fast); the bad part is that one needs to know the maximum number of buckets up front and may eventually run out of buckets (not scalable). I know my maximum bucket count, so I will try this.
By calculating a hash from the table's primary key parts, the hash can be used as a bucket column together with the rest of the primary key.
I have come up with the following method to be sure (I think?) that the hash will always be the same for a given primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part ->
        combinedHashString.append(
            Hashing.consistentHash(
                Hashing.sha512().hashBytes(part.getBytes()),
                maxBuckets)));
    return combinedHashString.toString();
}
The reason I use sha512 is to be able to handle strings of up to 256 characters (512 bits); otherwise, as far as I can tell from my tests, the result will not always be the same.
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: between different JVM executions on different nodes/machines, the result must always be the same for a given Cassandra primary key.
1. Can I rely on the method above to do the job?
2. Is there a better way of hashing large strings so they always produce the same result for a given string?
3. Do I always need to hash from strings, or could there be a better way of doing this for a C* primary key that always produces the same result?
Please, I don't want to discuss data modeling for a specific table; I just want a bucketing strategy.
EDIT:
I elaborated further and came up with this, so the length of the strings can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
    List<HashCode> hashCodes = new ArrayList<>();
    for (String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    }
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}

I currently use a similar solution in production. So for your method I would change it to:
public static int bucket(List<String> primKeyParts, int maxBuckets) {
    String keyParts = String.join("", primKeyParts);
    return Hashing.consistentHash(
        Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
        maxBuckets);
}
So the differences:
Send all the PK parts into the hash function at once.
We actually set the max buckets as a code constant, since the consistent hash only holds if the max bucket count stays the same.
We use the Murmur3 hash, since we want it to be fast, not cryptographically strong.
For your direct questions: 1) Yes, the method should do the job. 2) I think with the tweaks above you should be set. 3) The assumption is that you need the whole PK, but do you?
I'm not sure you need to use the whole primary key, since the expectation is that the partition part of your primary key will be the same for many rows, which is why you are bucketing in the first place. You could just hash the bits that will give you good bucket distribution in your partition key. In our case, we just hash some of the clustering key parts of the PK to generate the bucket id we use as part of the partition key.
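For illustration, here is a minimal, self-contained sketch of this approach; the MAX_BUCKETS constant and the example PK parts are hypothetical, not from the original post:

import com.google.common.base.Charsets;
import com.google.common.hash.Hashing;

import java.util.Arrays;
import java.util.List;

public class BucketDemo {
    // Keep this a constant: changing it remaps existing rows to different
    // buckets and breaks reads of previously written data.
    private static final int MAX_BUCKETS = 32;

    // Same method as in the answer above.
    public static int bucket(List<String> primKeyParts, int maxBuckets) {
        String keyParts = String.join("", primKeyParts);
        return Hashing.consistentHash(
            Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
            maxBuckets);
    }

    public static void main(String[] args) {
        // Hypothetical PK parts: a user id plus a day, for a time-series table.
        List<String> pkParts = Arrays.asList("user42", "2016-01-01");
        // Murmur3 is fully deterministic, so this prints the same bucket id
        // on every run, on every JVM and node.
        System.out.println(bucket(pkParts, MAX_BUCKETS));
    }
}

One caveat of joining with an empty separator: ("ab", "c") and ("a", "bc") concatenate to the same string and thus the same bucket. If that could ever matter for your keys, join with a separator character that cannot appear in the parts.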

Related

What's a best practice for saving a unique, random, short string to db?

I have a table with a varchar column named key, which is supposed to hold a unique, 8-character random string that is going to be used as a unique identifier by users. This field should be generated and saved on creation of objects. I have a question about how to create it:
Most recommendations point to a UUID field, but that's not applicable for me because it's too long, and if I just take a subset of it there's no guarantee of uniqueness.
Currently I've implemented a loop in my backend (not the DB) that generates a random string, tries to insert it into the DB, and retries if the string turns out not to be unique. But I feel that this is really bad practice.
What's the best way to do this?
I'm using PostgreSQL 9.6.
UPDATE:
My main concern is to remove the loop that retries until it finds a random, short string (or number, doesn't matter) that is unique in that table. AFAIK the solution should be a way to generate the string in the DB itself. The only thing I can find for PostgreSQL is uuid and uuid-ossp, which do something like this, but a uuid is way too long for my application, and I don't know of any way to get a shorter representation of a uuid without compromising its uniqueness (and I don't think it's possible theoretically).
So, how can I remove the loop and its back-and-forth to the DB?
Encryption is guaranteed unique; it has to be, otherwise decryption would not work. Provided you encrypt unique inputs, such as 0, 1, 2, 3, ..., you are guaranteed unique outputs.
You want 8 characters. You have 62 characters to play with: A-Z, a-z, 0-9. So convert the binary output from the encryption to a base-62 number.
You may need to use the cycle-walking technique from format-preserving encryption to handle a few cases.
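To make the base-62 step concrete, here is a small sketch; the alphabet ordering and the fixed width of 8 are illustrative choices, and the unique non-negative input is assumed to come from your encryption step:

public final class Base62 {
    private static final char[] ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
            .toCharArray();

    // Encode a non-negative value in base 62, left-padded with '0' to width.
    // 62^8 is about 2.18e14, so any counter below that fits in 8 characters.
    public static String encode(long value, int width) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET[(int) (value % 62)]);
            value /= 62;
        } while (value > 0);
        while (sb.length() < width) {
            sb.append('0');
        }
        return sb.reverse().toString();
    }
}

For example, Base62.encode(12345L, 8) returns "000003D7".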

How to implement a high performing non incremental ID in postgresql? [duplicate]

I would like to replace some of the sequences I use for ids in my PostgreSQL db with my own custom-made id generator. The generator would produce a random number with a check digit at the end. So this:
SELECT nextval('customers')
would be replaced by something like this:
SELECT get_new_rand_id('customer')
The function would then return a numerical value such as: [1-9][0-9]{9} where the last digit is a checksum.
The concerns I have are:
How do I make the thing atomic?
How do I avoid returning the same id twice? (This would be caught by trying to insert it into a column with a unique constraint, but then it's too late, I think.)
Is this a good idea at all?
Note 1: I do not want to use uuid, since it is to be communicated to customers, and 10 digits are far simpler to communicate than a 36-character uuid.
Note 2: The function would rarely be called with SELECT get_new_rand_id(); it would instead be assigned as the default value of the id column, in place of nextval().
EDIT: OK, good discussion below! Here is some explanation of why:
So why would I over-complicate things this way? The purpose is to hide the primary key from the customers.
I give each new customer a unique customerId (a serial number generated in the db). Since I communicate that number to the customer, it is a fairly simple task for my competitors to monitor my business (there are other numbers, such as invoice nr and order nr, that have the same properties). It is this monitoring I would like to make a little bit harder (note: not impossible, but harder).
Why the check digit?
Before there was any talk of hiding the serial nr, I added a check digit to the order nr, since there were clumsy fingers at some points in production, and my thought was that this would be a good practice to keep in the future.
After reading the discussion I can certainly see that my approach is not the best way to solve my problem, but I have no other good idea of how to solve it, so please help me out here.
Should I add an extra column where I put the id I expose to the customer and keep the serial as primary key?
How can I generate the id to expose in a sane and efficient way?
Is the checkdigit necessary?
For generating unique and random-looking identifiers from a serial, using ciphers might be a good idea. Since their output is bijective (there is a one-to-one mapping between input and output values), you will not have any collisions, unlike with hashes. This means your identifiers don't have to be as long as hashes.
Most cryptographic ciphers work on 64-bit or larger blocks, but the PostgreSQL wiki has an example PL/pgSQL procedure for a "non-cryptographic" cipher function that works on (32-bit) int type. Disclaimer: I have not tried using this function myself.
To use it for your primary keys, run the CREATE FUNCTION call from the wiki page, and then on your empty tables do:
ALTER TABLE foo ALTER COLUMN foo_id SET DEFAULT pseudo_encrypt(nextval('foo_foo_id_seq')::int);
And voila!
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> insert into foo (foo_id) values(default);
pg=> select * from foo;
foo_id
------------
1241588087
1500453386
1755259484
(3 rows)
I added my comment to your question and then realized that I should have explained myself better... My apologies.
You could have a second key - not the primary key - that is visible to the user. That key could use the primary key as the seed for the hash function you describe and be the one that you use to do lookups. That key would be generated by a trigger after insert (which is much simpler than trying to ensure atomicity of the operation).
That is the key that you share with your clients, never the PK. I know there is debate (albeit, I can't understand why) about whether PKs should be visible to user applications or not. Modern database design practice, and my personal experience, all seem to suggest that PKs should NOT be visible to users. They tend to attach meaning to them and, over time, that is a very bad thing - regardless of whether there is a check digit in the key or not.
Your joins will still be done using the PK. This other generated key is just supposed to be used for client lookups. They are the face, the PK is the guts.
Hope that helps.
Edit: FWIW, there is little to be said about "right" or "wrong" in database design. Sometimes it boils down to a choice. I think the choice you face will be better served by leaving the PK alone and creating a secondary key - just that.
I think you are way over-complicating this. Why not let the database do what it does best and let it take care of atomicity and ensuring that the same id is not used twice? Why not use the PostgreSQL SERIAL type and get an autogenerated surrogate primary key, just like an integer IDENTITY column in SQL Server or DB2? Use that on the column instead. Plus, it will be faster than your user-defined function.
I concur regarding hiding this surrogate primary key and using an exposed secondary key (with a unique constraint on it) to look up clients in your interface.
Are you using a sequence because you need a unique identifier across several tables? This is usually an indication that you need to rethink your table design, and those several tables should perhaps be combined into one, with an autogenerated surrogate primary key.
Also see here
How you generate the random and unique ids is a useful question - but you seem to be making a counterproductive assumption about when to generate them!
My point is that you do not need to generate these ids at the time of creating your rows, because they are essentially independent of the data being inserted.
What I do is pre-generate random ids for future use; that way I can take my own sweet time and absolutely guarantee they are unique, and there's no processing to be done at the time of the insert.
For example, I have an orders table with order_id in it. This id is generated on the fly when the user enters the order, incrementing 1, 2, 3, etc. forever. The user does not need to see this internal id.
Then I have another table - random_ids, with (order_id, random_id). I have a routine that runs every night which pre-loads this table with enough rows to more than cover the orders that might be inserted in the next 24 hours. (If I ever get 10,000 orders in one day I'll have a problem - but that would be a good problem to have!)
This approach guarantees uniqueness and takes any processing load away from the insert transaction and into the batch routine, where it does not affect the user.
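A rough Java/JDBC sketch of such a nightly batch, under stated assumptions: the random_ids schema is simplified here to a single pre-generated column, the 10-digit range is illustrative, and a UNIQUE constraint on random_ids.random_id is assumed so the batch can simply be re-run if a collision with an earlier batch slips through:

import java.security.SecureRandom;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashSet;
import java.util.Set;

public final class RandomIdPreloader {
    // Nightly batch: stock the random_ids table with unique 10-digit ids for
    // tomorrow's orders, so no id generation happens at insert time.
    public static void preload(Connection conn, int batchSize) throws SQLException {
        SecureRandom rnd = new SecureRandom();
        Set<Long> batch = new LinkedHashSet<>(); // de-duplicates within the batch
        while (batch.size() < batchSize) {
            // Uniform in [1_000_000_000, 9_999_999_999]: always 10 digits.
            batch.add(1_000_000_000L + (long) (rnd.nextDouble() * 9_000_000_000L));
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO random_ids (random_id) VALUES (?)")) {
            for (long id : batch) {
                ps.setLong(1, id);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}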
Your best bet would probably be some form of hash function, and then a checksum added to the end.
If you're not using this too often (you do not have a new customer every second, do you?), then it is feasible to just get a random number and then try to insert the record. Just be prepared to retry inserting with another number when it fails with a unique constraint violation.
I'd use numbers 100000 to 999999 (900,000 possible numbers of the same length) and a check digit using the UPC or ISBN-10 algorithm. Two check digits would be better, though, as they'll eliminate 99% of human errors instead of 90%.
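For reference, a small Java sketch of the UPC-style check digit (the 3-1 weighting from the rightmost payload digit is the standard scheme; treat the helper as illustrative):

// UPC-style check digit: weight payload digits 3, 1, 3, 1, ... starting from
// the rightmost digit, sum them, and pick the digit that brings the total up
// to a multiple of 10.
static int upcCheckDigit(String payload) {
    int sum = 0;
    for (int i = 0; i < payload.length(); i++) {
        int digit = payload.charAt(payload.length() - 1 - i) - '0';
        sum += (i % 2 == 0) ? 3 * digit : digit;
    }
    return (10 - sum % 10) % 10;
}

For the UPC-A payload "03600029145" this yields 2, matching the published check digit for that code.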

Swift: Unique Int id from String

I am using Parse, which has a preloaded User table in the database. I want a unique userId (Int) for each user. Parse's objectId is unique but not an Int, and username is a String. Username is unique for each user, so can I somehow convert each username into a number?
I tried .toInt() and Int(), but I got nothing.
WHY :
I have an existing table with users' movie ratings, and I want to extend this table with more ratings. The userId field is a Number value, so I must keep it this way.
Swift's String has a hash property. It also conforms to the Hashable protocol. Maybe you can use that.
However, hashValue carries the following comment:
Axiom: x == y implies x.hashValue == y.hashValue.
Note: The hash value is not guaranteed to be stable across different invocations of the same program. Do not persist the hash value across program runs.
So, use carefully...
Note: as stated in the comments, the hashValue is not guaranteed to be unique, but collisions should be rare, so it may be a solution anyway.
Having a unique arbitrary String-to-Int mapping is not possible. You have to put some constraints on the allowed characters and string length. Even if you use case-insensitive alphanumeric user names with some smart variable-length bit encoding, you are looking at roughly 5 bits per character on average, so a 64-bit integer can accommodate only about 12 characters this way. Longer than that, you will inevitably have collisions.
I think you approach the problem from the wrong end. Instead of having a function for a String -> Int mapping, what stops you from having a separate table with an Int <-> String mapping? Just have some functionality that checks whether a userId exists in that table, and if it does not, inserts a new record for that userId and assigns a new unique number to it. This way it will take quite some time and service popularity to deplete the 64-bit integer capacity.
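As a rough sketch of that mapping idea (shown in Java for brevity, and in-memory only; a real implementation would persist the map in a table with a unique constraint on the username):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Assigns each username a stable, unique 64-bit id the first time it is seen,
// and returns the same id on every later lookup.
class UserIdRegistry {
    private final Map<String, Long> idsByUsername = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    long idFor(String username) {
        return idsByUsername.computeIfAbsent(
            username, u -> nextId.getAndIncrement());
    }
}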

Amazon DynamoDB table design and querying

We are considering DynamoDB for an expectedly large dataset. I come from a strong SQL background so the No-SQL way of thinking is new to me.
I have a problem and a design, but I ran into what appears to be a dead end.
The documentation says to make sure your Hash keys are widely distributed to aid in performance, okay that makes sense.
I am going to be recording various datapoints/actions for users. It makes sense to me that the hash key should be the user-id, and my range key can be the action(s) performed.
Now, if I want all the actions user #1 performs, I can easily query that.
But, if I want all the USERS who performed action X, I cannot do that without a table scan. From the Query documentation:
A Query operation directly accesses items from a table using the table primary key, or from an index using the index key. You must provide a specific hash key value.
So it would seem I am limited to getting data from a specific user, unless I am willing to do a table scan, which is slower and consumes many capacity units.
My question is, I think, ultimately a design question. Maybe I am missing something when it comes to No-SQL? Should my hash key be something else? Or is it simply that my requirements do not fit in with No-SQL (and more specifically, DynamoDB)?
It is almost as if the hash key is a kind of grouping with DynamoDB. I considered changing the hash key to the actions we are intending to put into place, but then I am not widely distributing my keys...
The DynamoDB way to meet your requirement to allow both types of queries is to store the data in two tables: one with hash key user-id and range key action-id, and one with hash key action-id and range key user-id.
You should also think about whether you need all the data in both tables, or whether one can be a summary table. For example, say you have a limited number of possible actions. Instead of putting the full record of every action in the user-keyed table, you might want a table with only one row for each user: a hash key of user-id, and a second attribute that is multi-valued and is a list of every action-id that the user has performed at least once.
You must create a Global Secondary Index (GSI). What this does is create a second pair of hash and range keys that differ from the original keys. You can then query the same table by also including the index name in your parameters.
Example in JS:
var table = 'tablename';
var index = 'actionId-username-gsi';
var action = '123'; // the actionId value to look up (N values are passed as strings)

var params = {
    TableName: table,
    IndexName: index,
    KeyConditionExpression: 'actionId = :v_actionId',
    ExpressionAttributeValues: {
        ':v_actionId': { N: action }
    },
    ProjectionExpression: 'actionId, username'
};

ddb.query(params, function(err, data) {
    if (err) {
        // Oh well
    } else {
        // Do something with data.Items
    }
});
This will query the actionId-username-gsi index and look for any actionId hashes with the value provided. Using ProjectionExpression returns only the specified attributes for each item, reducing consumed throughput if that ever becomes a concern. I hope this helps answer your question.
I guess the global secondary index option is better, as you get a single table.
Creating two tables introduces redundancy and additional work to maintain consistency when doing any CUD (Create, Update, Delete) operation on either table.

How to encrypt a string with standard PostgreSQL?

I'm working with PostgreSQL.
I need to transform "http://www.xyz.com/some_uri/index1.html" into something like "scdfdsffd" (some unique key, based on the URL, which is a unique key in the table).
In other words: the URL is a unique key in the table, but I need to generate a small unique key based on the URL.
What can I do with standard PostgreSQL 8.4?
Several methods:
a) Why not use an auto-incrementing column or sequence generator to generate unique integers per insert? If you have fewer than 100 million URLs, your identifiers are short and easy to remember. However, if that's not an option (e.g. because you don't want people guessing IDs and attacking the database that way):
b) The built-in MD5() function may help:
INSERT INTO table (pkey, url) VALUES (MD5('http://...'), 'http://...');
MD5() is a hash function and will most likely give you a unique identifier per URL. I say "most likely" because you get a 128-bit hash from MD5, and the likelihood of any two given URLs colliding is on the order of 2^-128 (about 10^-39).
If you need smaller identifiers, you can chop the result from MD5 down to a smaller number of characters, but doing so can significantly increase the chance of a hash collision: the fewer characters you keep, the more likely a collision becomes.
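As a hedged sketch of that truncation trade-off, computed application-side in Java here (the 8-character width is just an example; SQL's substr(md5(url), 1, 8) would behave the same):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ShortUrlKey {
    // First 8 hex chars of MD5 = 32 bits of hash. By the birthday bound,
    // collisions become likely around 2^16 (~65,000) distinct URLs, so size
    // the prefix to the number of rows you expect.
    public static String shortKey(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) { // 4 bytes -> 8 hex characters
            hex.append(String.format("%02x", digest[i]));
        }
        return hex.toString();
    }
}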
[Note: timestamp answer redacted since it in no way solves the original problem. -BobG]