HashIds - To Store Hash in DB or Not To - hash

I am trying to figure out the best practice for using Hashids: should I store my hash ids in a column in my db, or should I use the library as demonstrated in the documentation, i.e. encoding the id in one place and decoding it in another?
With my current setup, I encode all of my primary key ids and decode them when the values are publicly accessible, which is the intended purpose of the module. However, I'm worried that the hashes generated for my ids will change at some point in the future, which could break things like link shareability in my application.
Given this scenario, should I really store the generated hash in a column in my db and query on that?

Whether you store the hash id in the database is really up to you. I would say there is no need to.
Whether the hashes will change in the future depends on whether you update the package. From the project page:
"Since produced ids might differ after updates, be sure to include exact version of Hashids, so you don't unknowingly change your existing ids."
I don't know which version you are using, but I'm the author of the .NET version and I try to follow Semantic Versioning: bug fixes increase the patch version, and added (non-breaking) features increase the minor version. If there were a change in how the hashes are generated, I would increase the major version.
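If you want a safety net on top of pinning the version, one option is to record a few known id/hash pairs and check them whenever the package is upgraded. The sketch below is only illustrative: `encode_id` is a stand-in base-36 encoder, not the Hashids algorithm, and `GOLDEN` and `hashes_are_stable` are hypothetical names. In a real setup you would pass in the library's own encode function and pairs recorded from your deployed version.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols


def encode_id(n: int) -> str:
    """Stand-in encoder (plain base 36), NOT the real Hashids algorithm."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


# Hypothetical "golden" pairs recorded with the currently deployed version.
GOLDEN = {1: "1", 35: "z", 36: "10"}


def hashes_are_stable(encode, golden) -> bool:
    """Return True if the encoder still produces every recorded hash."""
    return all(encode(i) == h for i, h in golden.items())
```

Running `hashes_are_stable` in a test suite would flag an upgrade that changes existing ids before it reaches production.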

Related

What kind of key should be used to group multiple rows within the same database table?

Use case
I need to store texts assigned to some entity. It's important to note that I only ever care about the most current texts that have been assigned to that entity. When new texts are inserted, older ones might be deleted. And that "might" is the problem, because I can't rely on only the most current texts being present.
The only thing I'm unsure about how to design is the case that some INSERT can provide either 1 or N texts for some entity. In the latter case, I need to know which N texts belong to the most current INSERT done for one and the same entity. Additionally, inserting N instead of 1 text will be pretty rare.
I know that things could be implemented using two different tables: One calculating some main-ID and the other mapping individual texts with their own IDs to that main-ID. Because multiple texts should happen rarely and a one table design already provides columns which could easily be reused for grouping multiple texts together, I prefer using one only.
Additionally, I thought of which concept would make a good grouping key in general as well. I somewhat doubt that others really always implement the two table-approach only and therefore created this question to get a better understanding. Of course I simply might be wrong and everybody avoids such "hacks" at all costs. :-)
Possible keys
Transaction-local timestamp
Postgres supports the concept of a transaction-local timestamp using current_timestamp. I need to have one of those to store when the texts have been stored anyway, so they might be used for grouping as well?
While there is a theoretical possibility of collisions, timestamps have a resolution of 1 microsecond, which is in practice enough for my needs. Texts are uploaded by human users, and it is very unlikely that multiple humans upload texts for the same entity at the same time.
That timestamp won't be used as a primary key of course, only to group multiple texts if necessary.
Transaction-ID
Postgres supports txid_current to get the ID of the current transaction, which should be ever-increasing over the lifetime of the current installation. The good thing is that this value is always available and the app doesn't need to do anything on its own. But things can easily break in case of restores, can't they? Can TXIDs, for example, occur again in the restored cluster?
People knowing things better than me write the following:
Do not use the transaction ID for anything at the application level. It is an internal system level field. Whatever you are trying to do, it's likely that the transaction ID is not the right way to do it.
https://stackoverflow.com/a/32644144/2055163
You shouldn't be using the transaction ID for anything except an identifier for a transaction. You can't even assume that a lower transaction ID is an older transaction, due to transaction ID wrap-around.
https://stackoverflow.com/a/20602796/2055163
Isn't my grouping a valid use case for wanting to know the ID of the current transaction?
Custom sequence
Grouping simply needs a unique key per transaction, which can be achieved using a custom sequence for that purpose only. I don't see any downsides, its values consume less storage than e.g. UUIDs and can easily be queried.
Reusing first unique ID
The table to store the texts contains a serial-column, so each inserted text gets an individual ID already. Therefore, the ID of the first inserted text could simply always be additionally reused as the group-key for all later added texts.
In case of only inserting one text, one should easily be able to use currval and doesn't even need to explicitly query the ID of the inserted row. In case of multiple texts this doesn't work anymore, though, because currval would provide updated IDs instead of the first one per transaction only. So some special handling would be necessary.
APP-generated random UUID
Each request to store multiple texts could simply generate some UUID and group by that. The mainly used database Postgres even supports a corresponding data type.
I mainly see two downsides with this: it already feels really hacky, and it consumes more space than necessary. OTOH, compared to the texts to be stored, the latter might simply be negligible.
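To make the app-generated UUID option concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database purely as a stand-in for Postgres. The `texts` table, its columns, and the helper names `store_texts` and `latest_texts` are all made up for illustration. It also assumes the auto-incremented `id` grows with insertion order, so the row with the highest `id` belongs to the most recent upload.

```python
import sqlite3
import uuid


def store_texts(conn, entity_id, texts):
    """Insert all texts of one upload under a single freshly generated group key."""
    group_key = str(uuid.uuid4())
    with conn:  # one transaction for the whole upload
        conn.executemany(
            "INSERT INTO texts (entity_id, group_key, body) VALUES (?, ?, ?)",
            [(entity_id, group_key, t) for t in texts],
        )
    return group_key


def latest_texts(conn, entity_id):
    """Return the texts belonging to the most recent upload for an entity."""
    rows = conn.execute(
        """
        SELECT body FROM texts
        WHERE entity_id = ?
          AND group_key = (SELECT group_key FROM texts
                           WHERE entity_id = ?
                           ORDER BY id DESC LIMIT 1)
        ORDER BY id
        """,
        (entity_id, entity_id),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "entity_id INTEGER NOT NULL, "
    "group_key TEXT NOT NULL, "
    "body TEXT NOT NULL)"
)
store_texts(conn, 1, ["old text"])
store_texts(conn, 1, ["new text A", "new text B"])
```

The same query shape would work with a custom sequence value or the first row's ID in place of the UUID; only the way `group_key` is produced changes.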

Which access method shall be used for a Berkeley DB that it is going to store 15.000.000 of integer keys?

I am planning to evaluate BerkeleyDB for a project where I have to store 15,000,000 key/value pairs.
Keys are 10-digit integers.
Values are variable-length binary data.
In the BerkeleyDB documentation (https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/am_conf/intro.html) it is said that there are four access methods that can be configured:
Btree
Hash
Queue
Recno
While the documentation describes each access method, I cannot work out which one would best fit the specific data set I need to store.
Which access method shall be used for this kind of data?
When unsure, choose btree. It's the most flexible access method. Sure, if you're positive that your application fits in one of the other ones, go for it.
A note of caution: writing an application using BDB that really works, that's transactional, recoverable, and offers consistency guarantees is going to be time consuming and prone to error at every step. And, if you're using this for commercial purposes, the licensing could be a total dealbreaker. For some things, it's really the best option. Just make sure you weigh all the other key value store options before embarking on your BDB quest: https://en.wikipedia.org/wiki/Key-value_database

What hash algorithms are most suitable for generating unique IDs in Postgres?

I have a large geospatial data set (~30m records) which I am currently importing into a PostgreSQL database. I need a unique ID to assign to each record, but an incrementing integer might be a bad idea because it could not be reliably recreated if I ever needed to reimport the data set.
It seems that a unique hash of the geometry data in a determined projection might be the best option for a reliable identifier. Being able to calculate the hash within Postgres would be beneficial, and speed would also be of benefit.
What is/are my options given this situation? Is there a particular method that is highly suitable for this situation?
If you need a unique identifier that depends on (and can be recreated from) the data, the most straightforward option seems to be an MD5 hash, which is built into PostgreSQL (no additional libraries needed), is quite efficient, and is secure enough for this scenario.
The pgcrypto module provides additional hashing algorithms, e.g. SHA-1.
Of course, you need to assert that the data to be hashed is unique.
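As a rough illustration of the idea outside the database, here is a small Python sketch. It assumes each geometry is available as a text serialization (WKT here) with consistent formatting. The function names are made up for the example. Note that the same geometry serialized differently (different precision, projection, or vertex order) would produce a different ID, so the serialization has to be fixed up front.

```python
import hashlib


def geometry_id(wkt: str) -> str:
    """Derive a repeatable 32-character hex ID from a geometry's text form."""
    return hashlib.md5(wkt.encode("utf-8")).hexdigest()


def find_duplicates(wkts):
    """Return the geometries whose hash ID was already produced by an earlier one."""
    seen = set()
    dupes = []
    for wkt in wkts:
        gid = geometry_id(wkt)
        if gid in seen:
            dupes.append(wkt)
        seen.add(gid)
    return dupes
```

Running something like `find_duplicates` over the data set before the import is one way to check the uniqueness requirement mentioned above.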

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like having descriptive field names (the ones I like the most) can take much space in the memory for big collections. I don't like the idea of giving them short and cryptic names to save memory, neither do I like the idea to translate field names to shortened fields somewhere in the application.
Is there a way to tell mongo not to store every field name as text?
For now, the only thing you can do is vote and wait for SERVER-863 to be resolved. After almost a year of discussion, the status of this issue has been changed to "planned but not scheduled"...
The workaround is to use a document mapping library such as Spring Data Document or Morphia (in the Java world) and work with nicely named objects. The underlying database names are still cryptic, though.
If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I can't provide any more guidance on which ODMs might fit your needs.
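If you end up doing the translation in your own data access layer, it could look roughly like the Python sketch below. The field names and the `FIELD_MAP` table are invented for illustration, and the sketch only handles flat documents (nested documents and query filters would need the same treatment).

```python
# Hypothetical mapping from descriptive application names to short stored names.
FIELD_MAP = {"first_name": "fn", "last_name": "ln", "date_of_birth": "dob"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}


def to_storage(doc: dict) -> dict:
    """Rename known descriptive keys to their short form before saving."""
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}


def from_storage(doc: dict) -> dict:
    """Restore descriptive keys after loading a stored document."""
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}
```

Keeping the mapping in one place means the rest of the application only ever sees the descriptive names.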

How to search the value when value is stored as encrypted

In my database I store the student information in encrypted form.
Now I want to perform a search to list all students whose name starts with "something" or contains "something".
Does anybody have an idea how to perform this type of query?
Please suggest.
Any decent encryption algorithm has as one of its core features the fact that it's impossible to deduce anything about the plaintext just by looking at the encrypted text. If you were able to tell, just by looking at the encrypted text, that the plaintext contained the string william, any attackers would be able to get that information just as easily, and you may as well not be encrypting at all.
The only way to perform this kind of operation on the data is to have access to the decrypted data. Using the model you've described - where the database only ever sees the encrypted data - it's not possible for the database to do this work, as the database has no access to the data it needs.
You need to have the data you're wanting to search on decrypted. The only complete way to do this is to have the application pull all the data out of the database, decrypt it, then do the filtering/sorting/whatever in your application. Obviously this is not going to scale well - but that's surely something you took into consideration when you decided to encrypt the data before putting it in the database.
Another option would be to store fragments of the data unencrypted. For example, if you have a first_name field and you want to be able to retrieve all records where first_name begins with a, have a first_name_first_letter field. Obviously this isn't going to scale well either - if you want to search for all records where first_name contains ill, you're going to have to store the complete first_name unencrypted.
There's a more serious problem with this solution though: by storing unencrypted data, you're leaking information about the encrypted data. The more unencrypted data you store, the more you leak. The more you leak, the more clues you're leaving for an attacker to defeat your encryption - plus, if you've stored the bit they were interested in unencrypted, they've already won.
Another question points to SQLCipher - it's an implementation of SQLite that does the encryption in the database. It seems to be targeted towards your use case - it's even already used in a couple of iPhone apps.
However, it has the database doing the encryption, not the application. This lets the database also handle the decryption, and hence the database is able to inspect the contents of the fields and do the searching you're looking for.
If you're still insisting on not doing the encryption in the database, this won't work for you.
If all you want is the equivalent of "starts with" and "contains", you might be able to do something with a bit field and the bitwise logical operators.
Not sure on the syntax you'd use, exactly (I'm a bit rusty on SQL), but the idea would be to create an additional field for each record which has a bit set for each letter that occurs in the name, then select the records whose bit field has every bit of the search value set, with something like:
SELECT * FROM someTable WHERE (bitField & searchValue) = searchValue
You then need to iterate over those records, decrypt them, and determine whether they actually meet the criteria you really wanted to search on (since you'd get a superset of the desired records back from the search).
You'd obviously be leaking some information about the contents of the field by doing this, but you can probably reduce that by encrypting the bitfields as well, or by turning on a few extra bits in each bitfield, so you can't tell "bob" from "bobby", for example.
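Here is a rough Python sketch of that bit-field idea, to show the two-stage filtering. All names are made up, records are assumed to be `(mask, encrypted_name)` pairs, and `decrypt` is whatever function your application already uses. The pre-filter requires every bit of the search mask to be set, which narrows the candidates more than testing for any overlap but still returns a superset.

```python
import string


def letter_mask(text: str) -> int:
    """Set bit i if the i-th letter of a-z occurs anywhere in the text."""
    mask = 0
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def candidates(records, search: str):
    """Pre-filter on the stored mask; records are (mask, encrypted_name) pairs."""
    wanted = letter_mask(search)
    return [enc for mask, enc in records if (mask & wanted) == wanted]


def search_names(records, search: str, decrypt):
    """Decrypt only the candidates and keep the true 'contains' matches."""
    return [
        name
        for name in map(decrypt, candidates(records, search))
        if search.lower() in name.lower()
    ]
```

For example, a search for "ill" would pull both "william" and "lilian" as candidates (each contains an i and an l), and only the decrypt-and-check step drops "lilian".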
I'm curious about what sort of security goal you're trying to meet with this encryption, though. If you describe the model a bit more, you might get better answers.