Let's assume we have a system with about 10 million users, and we need to cache those user objects in Redis after retrieving them from the database.
Now the question is: should we store those JSON objects as individual key-value pairs, with keys like "user_1", or would it be more appropriate to store them all in a single hash "users", where the hash field is the user ID ("1" in this case)?
I assume individual key-value pairs would take much more memory than a hash, but what about performance?
Since both the global key space and hashes are hash tables, access time has complexity O(1). Performance shouldn't be an issue in either case.
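For concreteness, here is a minimal sketch of the two layouts being compared, assuming the Jedis client and a hypothetical user:<id> / users naming scheme (adapt to your own conventions):

import redis.clients.jedis.Jedis;

public class UserCacheSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String userJson = "{\"id\":1,\"name\":\"Alice\"}";

            // Option 1: one top-level key per user
            jedis.set("user:1", userJson);
            String fromKey = jedis.get("user:1");

            // Option 2: one big hash, with the user ID as the field
            jedis.hset("users", "1", userJson);
            String fromHash = jedis.hget("users", "1");

            // Both lookups are O(1) on the server side
            System.out.println(fromKey.equals(fromHash)); // true
        }
    }
}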
BTW, I would take a look at the official Redis docs' article about memory optimization. Its first paragraph states:
Since Redis 2.2 many data types are optimized to use less space up to
a certain size. Hashes, Lists, Sets composed of just integers, and
Sorted Sets, when smaller than a given number of elements, and up to a
maximum element size, are encoded in a very memory efficient way that
uses up to 10 times less memory (with 5 time less memory used being
the average saving).
Also, you said:
we have about 10 million users.
Then, whether you use the global key space or hashes, you should take a look at sharding with Redis Cluster. That way you can probably optimize your scenario even further.
Three years late, but given Matias' answer above about sharding with Redis Cluster, it is worth noting that the unit of sharding is the key name. That means that all the fields of a hash will end up on the same server. So for millions of users, the global key space would allow sharding, but a single hash would not.
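You can see this concretely by computing the cluster hash slot of each key client-side. A minimal sketch, assuming the Jedis client (JedisClusterCRC16 lives in redis.clients.jedis.util in Jedis 3.x and later; the package name differs in older versions):

import redis.clients.jedis.util.JedisClusterCRC16;

public class KeySlotSketch {
    public static void main(String[] args) {
        // Per-user keys spread over many slots (and therefore many cluster nodes)...
        for (int id = 1; id <= 5; id++) {
            String key = "user:" + id;
            System.out.println(key + " -> slot " + JedisClusterCRC16.getSlot(key));
        }
        // ...whereas every field of the "users" hash lives in this one slot.
        System.out.println("users -> slot " + JedisClusterCRC16.getSlot("users"));
    }
}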
Related
I have some data consisting of a million key-value pairs. Which is the better way to store them in Redis: a single key holding one huge hash with a million fields, or a million keys each holding one value?
In my opinion, neither way seems very good given a million pairs.
Any suggestion?
Redis stores everything in one big hash table internally anyway, so for a single node it probably won't make a big difference.
However, as soon as you need to scale and add clustering, the one key that contains your hash (and therefore everything) will always live on the same shard, which means clustering isn't really doing anything for you.
I'd go for the proliferation of keys.
I've found myself needing to support result grouping with an accurate ngroups count. This required colocation of documents by a secondaryId field.
I'm currently indexing documents using the compositeId router in solr. The uniqueKey is documentId and I'm adding a shard key at the front like this:
doc.addField("documentId", secondaryId + "!" + actualDocId);
The problem I'm seeing is that the document count across my 3 shards is now uneven:
shard1: ~30k
shard2: ~60k
shard3: ~30k
(This is expected to grow a lot.)
Apparently the hashes of secondaryId are not very evenly distributed, but I don't know enough about possible values.
Any thoughts on getting a better distribution of these documents?
Your data is not evenly spread across your secondaryIds: some secondary IDs have a lot more data than others. There is no perfect and/or simple solution.
Assuming you cannot change your routing ID, one approach is to create a larger number of shards, say 16, on the same number of hosts. Your shards will now be smaller and still potentially uneven, but given their larger number you can move them around across the nodes you have to more or less balance the nodes in size.
The caveat is that this assumes you route your queries so that each query hits only one shard (see the sketch below). If you have unrouted queries, a large number of shards can cause significant performance degradation, because each query has to be run against every shard.
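For illustration, a routed query with SolrJ might look like the following; the collection name and secondaryId value are placeholders, and the _route_ parameter restricts the query to the shard owning that routing prefix:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RoutedQuerySketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            // Send the query only to the shard that owns this compositeId prefix.
            q.set("_route_", "SECONDARY_ID_42!");
            QueryResponse rsp = solr.query(q);
            System.out.println("numFound=" + rsp.getResults().getNumFound());
        }
    }
}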
What I've done is read the Solr routing code to see how it hashes, then replicated some of the logic manually to figure out the hash ranges to split.
I found these online tools to convert the IDs to a hash and then back and forth to hex, which is what the shard split command wants:
Murmur hash app: http://murmurhash.shorelabs.com/
Use the “MurmurHash3” form.
Hex converter app: https://www.rapidtables.com/convert/number/decimal-to-hex.html
I think you want “Hex signed 2's complement” when the values differ, but not when the value has a 00000000 prefix...
You'll also have to pay attention to masking. It's something like this:
Imagine you have a document whose hash is the hex value 12345678. This is a composite of:
primaryRouteId: 12xxxxxx
secondaryRouteId: xx34xxxx
documentId: xxxx5678
(Note that if you only have primaryRouteId!docId, then primaryRouteId takes the first 4 hex digits.)
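If you'd rather replicate the hashing in code than use the online tools, a rough sketch of the two-part (secondaryId!docId) case using Solr's own Hash utility from the solrj jar is below; treat it as an approximation and verify against the CompositeIdRouter source for your Solr version:

import org.apache.solr.common.util.Hash;

public class CompositeHashSketch {
    static int compositeHash(String routeKey, String docId) {
        int routeHash = Hash.murmurhash3_x86_32(routeKey, 0, routeKey.length(), 0);
        int idHash = Hash.murmurhash3_x86_32(docId, 0, docId.length(), 0);
        // Top 16 bits come from the route key, bottom 16 bits from the doc ID.
        return (routeHash & 0xFFFF0000) | (idHash >>> 16);
    }

    public static void main(String[] args) {
        int h = compositeHash("SECONDARY_ID_42", "actualDocId-1");
        System.out.printf("composite hash = %08x%n", h);
    }
}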
You can use Solr rebalancing with the feature called UTILIZENODE.
Check these links:
https://solr.apache.org/guide/8_4/cluster-node-management.html#utilizenode
https://solr.pl/en/2018/01/02/solr-7-2-rebalancing-replicas-using-utilizenode/
It will automatically handle the uneven shards and will balance them across all the servers.
Note: it is a relatively new feature and only works with Solr versions greater than or equal to 7.2.
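A minimal sketch of invoking the UTILIZENODE action of the Collections API from Java (the host and node name below are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class UtilizeNodeSketch {
    public static void main(String[] args) throws Exception {
        // Ask SolrCloud to move replicas onto the under-used (or newly added) node.
        String node = "newhost:8983_solr"; // node name as it appears in the cluster state
        URL url = new URL("http://localhost:8983/solr/admin/collections?action=UTILIZENODE&node=" + node);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            in.lines().forEach(System.out::println);
        }
        conn.disconnect();
    }
}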
I've been thinking about selecting the best shard key (through a compound index) for my data, and I thought the combination of the document creation date and a customer number (or invoice number) would be a good one, IF MongoDB would consider the customer number as a string reversed, i.e.:
90043 => 34009
90044 => 44009
90045 => 54009
etc.
Indexing on the creation date would ensure that relatively new data is kept in memory, and the reversed customer number would help MongoDB distribute the data/load across the cluster.
Is this a correct assumption? And if so... would I need to save my customer number reversed for it to be distributed the way I expect?
Regarding your specific question of "would I need to save my customer no reversed for it to be distributed the way I expect?", no - you would not.
Even with the relatively narrow spread of customer number values you listed, if you use customerNumber in your compound key, MongoDB will break apart the data into chunks and distribute these accordingly. As long as the data associated with customerNumber are relatively evenly distributed (e.g., one user doesn't dominate the system), you will get the shard balancing you desire.
I would consider either your original choice (minus the string reversal) or Dan's choice (using the built-in ObjectId instead of timestamp) as good candidates for your compound key.
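For reference, sharding on such a compound key with the MongoDB Java driver, run against a mongos, would look roughly like this (the database, collection, and field names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ShardKeySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Equivalent of sh.shardCollection("mydb.invoices", { creationDate: 1, customerNumber: 1 })
            Document cmd = new Document("shardCollection", "mydb.invoices")
                    .append("key", new Document("creationDate", 1).append("customerNumber", 1));
            Document result = client.getDatabase("admin").runCommand(cmd);
            System.out.println(result.toJson());
        }
    }
}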
From what I have read in the documentation, the MongoDB ObjectId is already time-based.
Therefore you can add the _id to your compound key like this: (_id, customerid). If you don't need the date in your application, you can just drop that field, which would save you some storage.
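As a small illustration of the _id being time-based, the Java driver's ObjectId exposes the embedded creation timestamp directly (a minimal sketch):

import org.bson.types.ObjectId;
import java.util.Date;

public class ObjectIdTimestampSketch {
    public static void main(String[] args) {
        ObjectId id = new ObjectId();
        // The first four bytes of an ObjectId are a Unix timestamp (in seconds),
        // so the creation time can be recovered without a separate date field.
        Date createdAt = id.getDate();
        System.out.println(id.toHexString() + " was created at " + createdAt);
    }
}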
MongoDB keeps recently used data in memory, and it will always try to keep a collection's indexes in RAM.
When an index is too large to fit into RAM, MongoDB must read the
index from disk, which is a much slower operation than reading from
RAM. Keep in mind an index fits into RAM when your server has RAM
available for the index combined with the rest of the working set.
Hope this helps.
Cheers dan
I think the issue with your thinking is that, somehow, you feel Node 1 would be faster than Node 2. Unless the hardware is drastically different, Node 1 and Node 2 would be accessed equally fast, and thus reversing the strings would not help you out.
The main issue I see has to do with the number of customers in your system. This can lead to monotonic sharding, wherein the last shard is always the one being hit, which can cause excessive splitting and migration. If you have a large number of customers then there is no issue; otherwise you might want to add another key on top of the customer ID and date fields to divide up your content more evenly. I have heard of people using random identifiers, hashing the _id, or using a GUID to overcome this issue.
I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all key names in the collection be too much of a performance penalty to be worth the potential space savings?
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
It might not provide a measurable performance** advantage at all until you have millions of documents.
If the server does it, the full key names still have to be transmitted over the network.
If compressed key names are transmitted over the network, then readability really suffers using the javascript console.
Compressing the entire JSON document offers an even better performance advantage.
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression was being considered for a future MongoDB version, and is available as of version 3.0 (see below).
* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage. Smaller documents means that more documents can be read per IO, which means that in a system with fixed IO, more documents per second can be read.
Update
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy, and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.
In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.
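For example, with the Java driver you can opt a single collection into a different WiredTiger block compressor at creation time (the database and collection names are placeholders; snappy remains the server-wide default):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import org.bson.Document;

public class CompressedCollectionSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("mydb");
            // Per-collection override of the WiredTiger block compressor
            // (zlib trades CPU for better storage density).
            db.createCollection("events",
                    new CreateCollectionOptions().storageEngineOptions(
                            new Document("wiredTiger",
                                    new Document("configString", "block_compressor=zlib"))));
        }
    }
}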
I believe one of the original reasons behind storing the key names with the documents is to allow a more easily scalable schema-less database. Each document is self-contained to a greater extent, in that if you move the document to another server (for example, via replication or sharding) you can index the contents of the document without having to reference separate or centralized metadata such as a mapping of key names to more compact key IDs.
Since there is no enforced schema for a MongoDB collection, the field names can potentially be different for every document in the same collection. In a sharded environment, inserts to each shard are (intentionally) independent so at a document level the raw data could end up differing unless the key mapping was able to be consistent per shard.
Depending on your use case, the key names may or may not consume a significant amount of space relative to the accompanying data. You can always work around the storage concern at the application / ODM level by mapping YourFriendlyKeyNames to shorter DB key equivalents.
There is an open MongoDB Jira issue and some further discussion to have the server tokenize field names, which you can vote on to help prioritize including this feature in a future release.
MongoDB's current design goals include performance with dynamic schemas, replication & high availability, auto-sharding, and in-place updates, with one potential tradeoff being some extra disk usage.
Having to look this up within the database for each and every query would be a serious penalty.
Most drivers allow you to specify ElementName, so that MyLongButReadablePropertyName in your domain model becomes mlbrpn in mongodb.
Therefore, when you query in your application, it's the application that transforms the query that would have been:
db.myCollection.find({"MyLongButReadablePropertyName" : "some value"})
into
db.myCollection.find({"mlbrpn" : "some value"})
Efficient drivers, like the C# driver, cache this mapping, so they don't need to look it up for each and every query.
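The MongoDB Java driver's POJO codec offers an equivalent mapping via the @BsonProperty annotation. A minimal sketch (it assumes you have registered the PojoCodecProvider on your client; class and property names are placeholders):

import org.bson.codecs.pojo.annotations.BsonProperty;

public class MyEntity {
    // Persisted in MongoDB under the short field name "mlbrpn",
    // while the application keeps the readable property name.
    @BsonProperty("mlbrpn")
    private String myLongButReadablePropertyName;

    public String getMyLongButReadablePropertyName() {
        return myLongButReadablePropertyName;
    }

    public void setMyLongButReadablePropertyName(String value) {
        this.myLongButReadablePropertyName = value;
    }
}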
Coming back to the title of your question:
Why are key names stored in the document in MongoDB?
Because that is the only way documents can be searched: without the key names stored, there'd be no key to search on.
Hope this helps
I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an English string (which will obviously be unique) as the unique key, but I want to make sure that it won't tie my hands later on.
I've just recently been getting familiar with MongoDB myself, so take this with a grain of salt, but I suspect that sharding is probably more efficient when using an ObjectId rather than your own key values, because part of the ObjectId identifies the machine (or shard) the document was created on. The bottom of this page in the Mongo docs explains what each portion of the ObjectId means.
I asked this question on the Mongo user list, and basically the reply was that it's OK to generate your own value of _id and it will not make sharding more difficult. For me it's sometimes necessary to have numeric values in _id, such as when I'm going to use them in a URL, so I generate my own _id in some collections.
ObjectId is designed to be globally unique, so when it is used as a primary key and a new record is appended to the dataset without a primary key value, each shard can generate a new ObjectId without worrying about collisions with other shards. This somewhat simplifies life for everyone :)
A shard key does not have to be unique, though, and we can't conclude that sharding a collection on ObjectId is always efficient.
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/, the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, inserts that are keyed by OID will all land on the same machine, and the write I/O capacity of that one machine will be the total I/O of your entire cluster. (This is true not just of OIDs, but of any predictable key: timestamps, autoincrementing numbers, etc.)
Contrariwise, if you chose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)
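As an earlier answer mentioned, hashing the _id is one way to get that even spread of writes while keeping ObjectId as the key. A minimal sketch of enabling a hashed shard key with the Java driver, run against a mongos (the namespace is a placeholder):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class HashedShardKeySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Equivalent of sh.shardCollection("mydb.events", { _id: "hashed" }):
            // hashing the monotonically increasing ObjectId spreads inserts across
            // shards instead of always hitting the "rightmost" chunk.
            Document cmd = new Document("shardCollection", "mydb.events")
                    .append("key", new Document("_id", "hashed"));
            System.out.println(client.getDatabase("admin").runCommand(cmd).toJson());
        }
    }
}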