Which access method should be used for a Berkeley DB that is going to store 15,000,000 integer keys? - key-value

I am planning to evaluate Berkeley DB for a project where I have to store 15,000,000 key/value pairs.
Keys are 10-digit integers.
Values are variable-length binary data.
In the BerkeleyDB documentation (https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/am_conf/intro.html) it is said that there are four access methods that can be configured:
Btree
Hash
Queue
Recno
While the documentation describes each access method, I cannot fully work out which one would best fit the specific data set I need to store.
Which access method should be used for this kind of data?

When unsure, choose Btree. It's the most flexible access method. Of course, if you're positive that your application fits one of the others, go for it.
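For the data in the question (10-digit integer keys, variable-length values), a minimal Btree sketch might look like this, assuming the bsddb3 Python bindings since the question doesn't name a language; packing keys as fixed-width big-endian bytes keeps the Btree's lexicographic ordering consistent with numeric ordering:

```python
# Minimal Btree sketch using the bsddb3 Python bindings (an assumption;
# they require the Berkeley DB C library). 10-digit integer keys are packed
# as 8-byte big-endian values so byte order matches numeric order.
import struct

from bsddb3 import db

btree = db.DB()
# Positional args: filename, dbname (None = whole file), access method, flags.
btree.open("pairs.bdb", None, db.DB_BTREE, db.DB_CREATE)

def put(key_int, value_bytes):
    btree.put(struct.pack(">Q", key_int), value_bytes)

def get(key_int):
    return btree.get(struct.pack(">Q", key_int))

put(1234567890, b"some variable-length binary payload")
print(get(1234567890))

btree.close()
```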
A note of caution: writing an application on top of BDB that really works, that is transactional, recoverable, and offers consistency guarantees, is going to be time-consuming and error-prone at every step. And if you're using this for commercial purposes, the licensing could be a total dealbreaker. For some things it's really the best option; just make sure you weigh all the other key-value store options before embarking on your BDB quest: https://en.wikipedia.org/wiki/Key-value_database

Related

Store arbitrary precision integer in PostgreSQL

I have an application that needs to store cryptocurrency values in a PostgreSQL database. The application uses arbitrary-precision integers, and I have to store those in the database. What's the most efficient way to do that?
Why arbitrary precision? For two reasons:
For safety: there must never be an overflow.
Out of necessity: Ethereum, for example, uses uint256 internally, and 1 Ether = 10^18 wei, so transactions carry a gigantic number of digits that has to be stored if accuracy is to be preserved (which it is).
The best solution I came up with is to convert the number to a blob and store it as raw bits. But I'm hoping there's a better way that's more suitable for a database.
EDIT:
The reason I need a better storage method is performance. I don't want to get into benchmarks and all that detail; that's why I'm keeping the question simple, otherwise it will get complicated. So the question is whether there's a proper way to do this.
Have a look at the documentation.
If you need efficiency but also depend on exact values (and I would agree that you do), then you really should define your columns, or separate tables, with explicit decimal(precision, scale) presets.
If your tests reveal that the standard data types are not performing well enough, you might want to have a look at bignum and maybe others.
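As a hedged sketch of the decimal approach, assuming psycopg2 and a hypothetical transfers table: 2^256 - 1 has 78 decimal digits, so numeric(78, 0) can hold any uint256 wei amount exactly and still be compared and summed in SQL:

```python
# Sketch: storing uint256 wei amounts exactly with numeric(78, 0).
# The psycopg2 driver, DSN and table name are assumptions; 2**256 - 1 has
# 78 decimal digits, so numeric(78, 0) represents any uint256 without overflow.
import psycopg2

conn = psycopg2.connect("dbname=wallet")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS transfers (
            id     bigserial PRIMARY KEY,
            amount numeric(78, 0) NOT NULL   -- exact, fits any uint256
        )
    """)
    # psycopg2 passes arbitrarily large Python ints straight through.
    cur.execute("INSERT INTO transfers (amount) VALUES (%s)", (2**256 - 1,))
    cur.execute("SELECT sum(amount) FROM transfers")
    print(cur.fetchone()[0])
conn.close()
```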

HashIds - To Store Hash in DB or Not To

I am trying to figure out best practices for using Hashids, and whether I should be storing my hash ids in a column in my db or using the library as demonstrated in the documentation, i.e. encoding the id in one place and decoding it in another.
With my current setup, I have been encoding all of my primary key ids and decoding them when the values are publicly accessible, which is the intended purpose of the module. But I'm worried that the hashes uniquely generated for my ids will change at some point in the future, which could cause issues like broken link shareability for my application.
Based on this scenario, should I really be storing the generated hash in a column in my db and querying that?
Whether you should store the id in the database is really up to you to decide; I would say there is no need to do that.
Whether the hashes will change in the future really depends on whether you update the package or not. From the project page:
Since produced ids might differ after updates, be sure to include exact version of Hashids, so you don't unknowingly change your existing ids.
I don't know which version you are using, but I'm the author of the .NET version, and I've been trying to follow Semantic Versioning: bug fixes increase the patch version, added (non-breaking) features increase the minor version, and a change in how the hashes are generated would increase the major version.
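The answer above concerns the .NET version, but purely to illustrate the encode-on-output / decode-on-input pattern, here is a sketch with the Python port of Hashids (the package name, salt and helper names are assumptions); with the salt fixed and the library version pinned, the generated ids stay stable and never need to be stored:

```python
# Illustration with the Python port of Hashids (pip install hashids).
# The salt and helper names are assumptions. Keeping the salt fixed and
# pinning the library version is what keeps generated ids stable, so the
# hash never needs to live in a database column next to the primary key.
from typing import Optional

from hashids import Hashids

hashids = Hashids(salt="my-app-secret", min_length=8)

def to_public_id(primary_key: int) -> str:
    # Encode just before exposing the id (URLs, API responses, ...).
    return hashids.encode(primary_key)

def from_public_id(public_id: str) -> Optional[int]:
    # Decode just after receiving the id back from the client.
    decoded = hashids.decode(public_id)
    return decoded[0] if decoded else None

public = to_public_id(347)
assert from_public_id(public) == 347
```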

What hash algorithms are most suitable for generating unique IDs in Postgres?

I have a large geospatial data set (~30m records) which I am currently importing into a PostgreSQL database. I need a unique ID to assign to each record, but an incrementing integer might be a bad idea because it could not be reliably recreated if I ever needed to reimport the data set.
It seems that a unique hash of the geometry data in a fixed projection might be the best option for a reliable identifier. Being able to calculate the hash within Postgres would be beneficial, and speed would also be a benefit.
What are my options in this situation? Is there a particular method that is especially suitable?
If you need a unique identifier that depends on (and can be recreated from) the data, the most straightforward option seems to be an MD5 hash, which is built into PostgreSQL (no additional libraries needed) and is quite efficient and, for this scenario, secure.
The pgcrypto module provides additional hashing algorithms, e.g. SHA-1.
Of course, you need to make sure that the data being hashed is itself unique.
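As a sketch of how this could look, assuming PostGIS geometries, a hypothetical features table, and the psycopg2 driver; md5() is built into PostgreSQL, and pgcrypto's digest() adds other algorithms:

```python
# Sketch: deriving a deterministic id from the geometry itself.
# The DSN, table and column names are assumptions about the schema;
# md5() is built in, digest() requires the pgcrypto extension,
# ST_AsEWKT/ST_Transform come from PostGIS.
import psycopg2

conn = psycopg2.connect("dbname=gis")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Built-in md5() over a canonical text form of the geometry:
    cur.execute("""
        SELECT md5(ST_AsEWKT(ST_Transform(geom, 4326))) AS feature_id
          FROM features
         LIMIT 1
    """)
    print(cur.fetchone()[0])

    # With pgcrypto, stronger digests become available:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto")
    cur.execute("""
        SELECT encode(digest(ST_AsEWKT(geom), 'sha256'), 'hex') AS feature_id
          FROM features
         LIMIT 1
    """)
    print(cur.fetchone()[0])
conn.close()
```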

Is there a data store where I can access data directly via array index instead of hash key? Redis? MongoDB?

I need an external, memory-efficient (!) C/C++ data store for a Java app, one that does not have the downside of a normal database lookup (B-tree) but instead uses my IDs as an array index. Is there an open-source solution for this? I implemented this in C++ in-memory only, but I would like a "store to disk" option in case of a crash, or for backups. A Java binding would also be nice.
Redis, for example, looks good, but reading the docs I see that things are generally accessed by hash keys, which are O(1) only in theory - or can I somehow force the hashing scheme to match the storage index? Lists are not appropriate either, as they are implemented as linked lists. Or what about MongoDB?
And yes, I really need that fast read access (writes can be "okayish slow" :)) - it is not premature optimization, but if there is no alternative I'll try Redis before rolling my own. Also, Java is not possible (as I said: memory efficient ;)).
With a remote key-value store, the overhead is very often dominated by the network and protocol management rather than data access itself. That's why with efficient key-value stores (like Redis for instance), almost all the operations actually have the same cost.
The Redis benchmark page contains a good illustration of this point.
In other words, in the context of an in-memory remote store, and considering only latency, a random-access array will have exactly the same performance as a hash table, and even less efficient O(log n) containers like red-black trees, B-trees, etc. will be quite close.
If you really want maximum performance, I would suggest using an embedded (i.e. in-process) store. For instance, both Berkeley DB and Tokyo Cabinet provide disk-based random-access containers for fixed-length records.
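As an illustration of the fixed-length, record-number idea, here is a sketch of Berkeley DB's Recno access method through the bsddb3 Python bindings (an assumption; the C API works the same way). Record numbers behave like array indices for lookups, and set_re_len() fixes the record size:

```python
# Sketch of Berkeley DB's record-number (Recno) access method via the
# bsddb3 bindings (an assumption). Keys are 1-based record numbers, so
# reads look like array indexing; records are padded to the fixed length.
from bsddb3 import db

recno = db.DB()
recno.set_re_len(64)  # fixed-length 64-byte records (must be set before open)
recno.open("records.bdb", None, db.DB_RECNO, db.DB_CREATE)

recno.put(1, b"first record")    # record number 1
recno.put(2, b"second record")   # record number 2

print(recno.get(2))              # direct access by record number, no hashing
recno.close()
```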
KDB is the go-to solution for this problem in the financial systems (algo trading) world. Be prepared to have your brain melted by the syntax though. Oh, and it is not open source.

How does memcache store data?

I am a newbie to caching and have no idea how data is stored when caching. I have tried to read a few examples online, but everybody provides code snippets for storing and getting data rather than explaining how data is cached with memcache. I have read that it stores data in key-value pairs, but I am unable to understand where those key-value pairs are stored.
Also, could someone explain why data going into the cache is hashed or encrypted? I am a little confused between serialising data and hashing data.
A couple of quotes from the Memcache page on Wikipedia:
Memcached's APIs provide a giant hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least recently used (LRU) order.
And
The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it.
The rest of the page on Wikipedia is pretty informative, and it might help you get started.
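Since clients have to treat Memcached as transitory, the usual usage is the cache-aside pattern. Here is a hedged sketch with the pymemcache client (the client choice, key names and data-source function are assumptions): a miss simply falls back to the real source and repopulates the cache:

```python
# Cache-aside sketch with the pymemcache client (an assumption; any
# memcached client works the same way). A miss is normal and expected:
# the value is refetched from the real source and written back with a TTL.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_user_profile(user_id):            # hypothetical slow data source
    return ("profile-for-%d" % user_id).encode()

def get_user_profile(user_id):
    key = "user:%d" % user_id
    value = cache.get(key)                 # None on a miss (or after eviction)
    if value is None:
        value = load_user_profile(user_id)
        cache.set(key, value, expire=300)  # repopulate; memcached may still evict it
    return value

print(get_user_profile(42))
```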
The key-value pairs are stored in memory on the server; that way, if you use the same key/value often and you know it won't change for a while, you can keep it in memory for faster access.
I'm not deeply familiar with memcached, so take what I have to say with a grain of salt :-)
Memcached is a separate process, or set of processes, that maintains an in-memory key-value store so values can be accessed easily later. In a sense, it provides another global scope that can be shared by different parts of your program, enabling a value to be calculated once and used in many distinct and separate areas of your program. In another sense, it provides a fast, forgetful database for transient data. The data is not stored permanently, but in general it will be stored beyond the life of a particular request (it is possible for Memcached to never store your data, so every read will be a miss, but that's generally an indication that it is not set up correctly for your use case).
The data going into the cache does not have to be hashed or encrypted (though either can happen to the data, depending on the caching mechanism).
Serializing data actually has nothing to do with either concept; it is the process of converting data from one format (generally one suited to in-memory use) to another (generally one suitable for storage in a persistent medium). Another term for this process is marshalling and unmarshalling.
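To make the serialization point concrete, here is a small sketch (pymemcache again, purely as an assumption) where a Python dict is serialized to JSON bytes before being stored and deserialized on the way back; nothing is hashed or encrypted along the way:

```python
# Serialization sketch: the dict is converted to bytes (JSON here) before
# it is sent to memcached and converted back on read. No hashing or
# encryption is involved; the client and key names are assumptions.
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

profile = {"id": 7, "name": "Ada", "plan": "pro"}

cache.set("profile:7", json.dumps(profile).encode("utf-8"))  # serialize (marshal)

raw = cache.get("profile:7")
restored = json.loads(raw.decode("utf-8")) if raw else None  # deserialize (unmarshal)
print(restored)
```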