What hash algorithms are most suitable for generating unique IDs in Postgres?

I have a large geospatial data set (~30m records) which I am currently importing into a PostgreSQL database. I need a unique ID to assign to each record, but an incrementing integer might be a bad idea because it could not be reliably recreated if I ever needed to reimport the data set.
It seems that a unique hash of the geometry data in a fixed, known projection might be the best option for a reliable identifier. Being able to calculate the hash within Postgres would be beneficial, and speed would be a benefit as well.
What are my options here? Is there a particular method that is especially well suited to this situation?

If you need a unique identifier that depends on (and can be recreated from) the data, the most straightforward option is an MD5 hash, which is built into PostgreSQL (no additional libraries needed) and is quite efficient and, for this scenario, secure.
The pgcrypto module provides additional hashing algorithms, e.g. SHA-1.
Of course, you need to ensure that the data to be hashed is unique.
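As a sketch, assuming PostGIS and a hypothetical table my_table with a geometry column geom, you could hash a canonical binary form of the geometry in a fixed projection:

-- md5() is built in and accepts bytea; ST_AsEWKB gives a canonical
-- binary form, and ST_Transform pins the projection (4326 here).
SELECT md5(ST_AsEWKB(ST_Transform(geom, 4326))) AS geom_id
FROM my_table;

-- With pgcrypto installed, SHA variants work the same way:
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT encode(digest(ST_AsEWKB(ST_Transform(geom, 4326)), 'sha1'), 'hex') AS geom_id
FROM my_table;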

Related

Which access method should be used for a Berkeley DB that is going to store 15,000,000 integer keys?

I am planning to evaluate BerkeleyDB for a project where I have to store 15,000,000 key/value pairs.
Keys are 10-digit integers.
Values are variable-length binary data.
In the BerkeleyDB documentation (https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/am_conf/intro.html) it is said that there are four access methods that can be configured:
Btree
Hash
Queue
Recno
While the documentation describes each access method, I cannot fully work out which one would best fit this specific data set.
Which access method should be used for this kind of data?
When unsure, choose btree. It's the most flexible access method. Sure, if you're positive that your application fits in one of the other ones, go for it.
A note of caution: writing an application using BDB that really works, that's transactional, recoverable, and offers consistency guarantees is going to be time consuming and prone to error at every step. And, if you're using this for commercial purposes, the licensing could be a total dealbreaker. For some things, it's really the best option. Just make sure you weigh all the other key value store options before embarking on your BDB quest: https://en.wikipedia.org/wiki/Key-value_database

HashIds - To Store Hash in DB or Not To

I am trying to figure out the best practices for using Hashids: should I store my hash IDs in a column in my DB, or should I use the library as demonstrated in the documentation, i.e. encoding the ID in one place and decoding it in another?
With my current setup, I have been encoding all of my primary key IDs and decoding them wherever the values are publicly accessible, which is the intended purpose of the module. But I'm worried that the hashes uniquely generated for my IDs will change at some point in the future, which could break things like link shareability for my application.
Given this scenario, should I really store the generated hash in a column in my DB and query that?
Whether you should store the ID in the database is really up to you to decide; I would say there is no need to do that.
Whether the hashes will change in the future depends on whether you update the package. From the project page:
Since produced ids might differ after updates, be sure to include exact version of Hashids, so you don't unknowingly change your existing ids.
I don't know which version you are using, but I'm the author of the .NET version, and I've been trying to follow Semantic Versioning: bug fixes increase the patch number, added (non-breaking) features increase the minor number, and any change in how the hashes are generated would increase the major number.

Reason behind MD5 Hashing?

I have sometimes seen, and have been recommended, storing strings and associative array keys as MD5 hash values. I have since learnt about hashing from MIT OCW 6.046J, and it seems to be a scheme for storing data in an efficient format for fast searching, and for preventing people from recovering the original.
But don't languages that support associative arrays / dictionaries do this internally? What additional advantage does the MD5 hash give?
Most languages do support this internally - for example, see Java's hashCode(), which is used when storing keys in a HashMap:
Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by HashMap.
But there are scenarios where you want to do it yourself.
Scenario 1 - using a hash as a key in a database:
Let's suppose you have a big NoSQL-ish database of letters and the metadata of those letters. You want to be able to find a letter's metadata quickly, without searching. What would your index be?
One option is a running index that's unrelated to the letter's content, but then you have to search the database before you can find a document's metadata. Another option is to create a signature for the document composed of, say, its prefix (just one example of many), but some documents may share that property ("Dear John,").
So how about taking the entire document into account? That's where you can use MD5 as the row key for your documents.
In this scenario you're relying on having no collisions, and the arguments in favour of this assumption usually mention that your chances of running into a demented gorilla are (usually) greater. The SHA family of algorithms produces even fewer collisions.
I mention this because databases normally do not do this out of the box (frameworks may...).
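As a minimal PostgreSQL sketch (the letters table and content column are hypothetical), deriving the row key from the entire document looks like this:

-- The key is a pure function of the content, so it can always be
-- recomputed, and identical documents get identical keys.
SELECT md5(content) AS row_key, content
FROM letters;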
Scenario 2 - One-way hash for password storage:
Note: this may no longer apply to MD5, but it does apply to the SHA-family variants.
In this scenario, you want to store passwords in your database, but storing plain-text passwords has drawbacks if the database is compromised (users often share passwords across sites, so a breach may compromise their accounts elsewhere too). The approach here is to store the hashed password, and when a user attempts to log in, compare only the hashes, not the passwords themselves. That way the plain-text password is never stored locally, and it is a lot harder to recover.
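A minimal sketch in PostgreSQL, using pgcrypto's crypt() and gen_salt() with bcrypt instead of a bare SHA hash (the users table and its columns are hypothetical):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Store only the salted hash, never the plain-text password.
INSERT INTO users (name, password_hash)
VALUES ('alice', crypt('s3cret', gen_salt('bf')));

-- On login, re-hash the attempt with the stored salt and compare.
SELECT 1 FROM users
WHERE name = 'alice'
  AND password_hash = crypt('s3cret', password_hash);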

Checksum field in PostgreSQL for content comparison

I have a field in a table with large content (text or binary data). If I want to know whether another text is equal to this one, I can use a checksum to compare the two texts. I can also define this field as UNIQUE to avoid repeated content.
My question is: if I create a checksum field, will this comparison be faster? Does PostgreSQL already do this (without programmer intervention), or do I need to do it manually?
EDIT: Which is better: comparing the TEXT field directly, or using a checksum for it, or do the two approaches amount to the same thing?
There is no default "checksum" for large columns in PostgreSQL, you will have to implement one yourself.
Reply to comment
Hash indexes provide fast performance for equality checks, and they are updated automatically. But they are not fully integrated in PostgreSQL (yet), so their use is discouraged - read the manual.
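For example (table and column names are hypothetical):

-- The index is maintained automatically and speeds up equality checks.
CREATE INDEX docs_content_hash_idx ON docs USING hash (content);

SELECT id FROM docs WHERE content = 'some large text ...';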
Also, you cannot query the stored hash values and use them in your application, for instance. You could do that with a checksum column, but then you need to maintain the column and, if your table is big, add an index on it for performance. I would use a BEFORE INSERT OR UPDATE trigger for that.
So, a hash index may or may not be for you. @A.H.'s idea certainly fits the problem ...
You might read the Index Types chapter of the manual, because basically you want to do the same thing as a hash index, but with your bare hands. So read up on the pros and cons of a hash index in PostgreSQL.
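A sketch of the trigger-maintained checksum column suggested above (all names are hypothetical; EXECUTE FUNCTION requires PostgreSQL 11+, older versions use EXECUTE PROCEDURE):

ALTER TABLE docs ADD COLUMN content_md5 text;

CREATE FUNCTION docs_set_md5() RETURNS trigger AS $$
BEGIN
  NEW.content_md5 := md5(NEW.content);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_md5
BEFORE INSERT OR UPDATE ON docs
FOR EACH ROW EXECUTE FUNCTION docs_set_md5();

-- Index the checksum so equality checks stay fast on a big table.
CREATE INDEX docs_content_md5_idx ON docs (content_md5);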

Using a hash of data as a salt

I was wondering - is there any disadvantages in using the hash of something as a salt of itself?
E.g. hashAlgorithm(data + hashAlgorithm(data))
This prevents the use of lookup tables, and does not require storing a salt in the database. If the attacker does not have access to the source code, he would not be able to obtain the algorithm, which would make brute-forcing significantly harder.
Thoughts? (I have a gut feeling that this is bad - but I wanted to check if it really is, and if so, why.)
If the attacker does not have access to the source code
This is called "security through obscurity", which is always considered bad. An inherently safe method is always better, even if the only difference is that you don't feel safe "because they don't know how". Someone can and will always find the algorithm - through careful analysis, trial and error, or because they found the source by SSH-ing into your shared hosting service, or any of a hundred other methods.
Using a hash of the data as salt for the data is not secure.
The purpose of salt is to produce unpredictable results from inputs that are otherwise the same. For example, even if many users select the same input (as a password, for example), after applying a good salt, you (or an attacker) won't be able to tell.
When the salt is a function of the data, an attacker can pre-compute a lookup table, because the salt for every password is predictable.
The best salts are chosen from a cryptographic pseudo-random number generator initialized with a random seed. If you really cannot store an extra salt, consider using something that varies per user (like a user name), together with something application specific (like a domain name). This isn't as good as a random salt, but it isn't fatally flawed.
Remember, a salt doesn't need to be secret, but it cannot be a function of the data being salted.
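In PostgreSQL, for instance, pgcrypto can produce such a salt (a sketch, not part of the original answer):

-- 16 random bytes from a cryptographically strong source, hex-encoded.
SELECT encode(gen_random_bytes(16), 'hex') AS salt;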
This offers no improvement over just hashing. Use a randomly generated salt.
The point of salting is to make the hashes of two otherwise identical values differ (for example, the same password chosen by two users), and by doing so to break pre-calculated lookup tables.
Consider:
data = "test"
hash = hashAlgorithm(data + hashAlgorithm(data))
The hash will be constant whenever data = "test". Thus, if the attacker has the algorithm (and the attacker always has the algorithm), they can pre-calculate hash values for a dictionary of data entries.
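You can check this directly in PostgreSQL; because the "salt" is a pure function of the data, the result never changes:

-- Always returns the same value, so it can go straight into a lookup table.
SELECT md5('test' || md5('test'));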
This is not a salt - you have just modified the hash function. Instead of using a lookup table for the original hashAlgorithm, an attacker can simply build the table for your modified one; this does not prevent the use of lookup tables.
It is always better to use truly random data as the salt. Imagine an implementation where the username is taken as the salt value: this would reduce security for common names like "root" or "admin".
If you don't want to create and manage a salt value for each hash, you could use a strong application-wide salt. In most cases this would be absolutely sufficient, and many other things would be more vulnerable than the hashes.