Redis: single big hash vs many keys

I have some data consisting of a million key-value pairs. Which way is better to store them in Redis: a single key holding one huge hash with a million fields, or a million keys each holding a single value?
In my opinion, neither way seems very good given a million pairs.
Any suggestions?

Redis stores everything in one big hash table internally anyway, so on a single node it probably won't make a big difference.
However, as soon as you need to scale and add clustering, the one key that holds your entire hash will always be on the same shard, which means clustering isn't really doing anything for you.
I'd go for the proliferation of keys.
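For example, here is a minimal redis-py sketch of the two layouts (the connection details and key names are placeholders I've picked for illustration):

    import redis  # assumes the redis-py client is installed

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Option 1: one key per pair. Each key can hash to a different cluster slot,
    # so the data spreads across shards.
    r.set("item:42", "value-42")
    print(r.get("item:42"))

    # Option 2: one huge hash. All million fields hang off the single key "items",
    # so in a cluster they all live on the same shard.
    r.hset("items", "42", "value-42")
    print(r.hget("items", "42"))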

Related

Does Redis hash guarantee that no two different keys have same hash?

Does the Redis hash described at https://redislabs.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-4-hashes/ guarantee that two different strings will have two different hashes? If not, how are collisions handled? I cannot find any such information on the web.
Use case: I want to store my suggestion index in Redis. For every word, I want to give some suggestions, e.g.
{
  "hel": "hell,hello",
  "cap": "captain,capital"
}
and so on.
So if I store this info in a Redis hash as key-value pairs, is it guaranteed that no two keys will have the same hash, so that when I look up the hash table I don't show something irrelevant?
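As far as I know, field names in a Redis hash are matched by their exact value, so any internal hash-table collisions are resolved by Redis itself and never surface as a wrong lookup result. A minimal redis-py sketch of the suggestion-index use case (the key name and connection details are my own placeholders):

    import redis  # assumes the redis-py client is installed

    r = redis.Redis(decode_responses=True)

    # Store the suggestion index as fields of one hash; the key name is illustrative.
    r.hset("suggest", mapping={"hel": "hell,hello", "cap": "captain,capital"})

    # HGET matches the field name exactly, so it never returns another field's value.
    print(r.hget("suggest", "hel"))  # -> "hell,hello"
    print(r.hget("suggest", "xyz"))  # -> None (no such field)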

How to create a Table-Like Reliable Collection

What is the best way to create a table-like reliable collection? Can we roll out our own?
I am looking for something to store simple lists or bags for indexes and to track keys and other simple details since the cost of enumerating multi-partition dictionaries is so high. Preferably sequential rather than random access.
The obvious options:
IDictionary<Guid, List> has concurrency issues and poor performance
Try to enumerate a queue, but I doubt it would be better than a dictionary
Use an external data store
None of these seem particularly good.
The partitioning is actually there to gain performance. The trick is to shard your data in such a way that cross-partition queries aren't needed. You can also create multiple dictionaries holding different aggregates of the same data (use transactions to keep them consistent).
Read more in the chapter 'plan for partitioning' here.
Of course you can roll out your own reliable collection. After all, a reliable collection is just an in-memory data structure backed by an Azure Storage object. If you want a reliable list of strings, you can implement IList<string> and, in the various methods (Add, Remove, GetEnumerator, etc.), insert code to track and persist the data structure.
Depending on your content, it can be a table (if you can generate a good partition/row key) or just a blob (where you serialize/deserialize the content every time, or at checkpoints, or per whatever policy you choose!)
I don't get why IReliableDictionary<K, V> is not good for you. Do you need to store key-value pairs but don't want the keys distributed across partitions, for performance reasons (because a "getAll" would span machines)?
Or do you just need a list of keys (like colors, or what you would put in a HashSet)?
In any case, depending on the size of the data, you can partition it differently, using something like IReliableDictionary<int, ...>, where the int key can be a single value (like 42, giving you one partition) or one of a few (hash the key and then mod (%) by the number of buckets you want), so you can fetch a whole bunch of keys (from all of them down to one of N sections) at once.
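As a rough, Service-Fabric-agnostic illustration of that hash-then-mod idea, here is a small Python sketch (the bucket count and key names are made up):

    import zlib

    NUM_BUCKETS = 16  # with 1 bucket everything lands in a single partition

    def bucket_for(key: str) -> int:
        # Stable hash of the key, then mod by the number of buckets you want.
        return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

    # Group keys into N sections so you can fetch one section at a time
    # instead of enumerating every key.
    buckets: dict[int, list[str]] = {i: [] for i in range(NUM_BUCKETS)}
    for k in ("user:1", "user:2", "order:99"):
        buckets[bucket_for(k)].append(k)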

Key value pairs Vs hashes for large amounts of data

Let's assume a system where we have about 10 million users, and we need to cache those user objects in Redis after retrieving them from the database.
Now the question is: should we store those JSON objects as individual key-value pairs, with keys like "user_1", or would a more appropriate solution be to put them all into one hash "users", with the hash field being the user ID ("1" in this case)?
I assume individual key-value pairs would take much more memory than a hash, but what about performance?
Since both the global key space and hashes are hash tables, access time has O(1) complexity, so performance shouldn't be an issue in either case.
BTW, I would take a look at this official Redis docs article about memory optimization. Its first paragraph states:
Since Redis 2.2 many data types are optimized to use less space up to a certain size. Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory (with 5 times less memory used being the average saving).
Also, you said:
we have about 10 million users.
Then, whether you use the global key space or hashes, you should take a look at sharding with Redis Cluster. That way you can probably optimize your scenario even further.
Three years late, but given @Matias' comment on sharding with Redis Cluster, it is worth noting that the unit of sharding is the key name. That means all values in a hash will end up on the same server. So for millions of users, the global key space allows sharding, but a single hash does not.
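A common middle ground, in line with the memory-optimization article quoted above, is to split the users across many small hashes: each bucket stays below the small-hash encoding threshold (hash-max-ziplist-entries), and because there are many bucket keys, Redis Cluster can still spread them across shards. A minimal redis-py sketch, where the bucket size, key naming scheme, and connection details are all assumptions of mine:

    import json
    import redis  # assumes the redis-py client is installed

    r = redis.Redis(decode_responses=True)
    BUCKET_SIZE = 100  # keep each hash small enough for the compact encoding

    def cache_user(user_id: int, user_obj: dict) -> None:
        # e.g. user 1234567 -> key "users:12345", field "67"
        r.hset(f"users:{user_id // BUCKET_SIZE}",
               str(user_id % BUCKET_SIZE),
               json.dumps(user_obj))

    def get_user(user_id: int) -> dict | None:
        raw = r.hget(f"users:{user_id // BUCKET_SIZE}", str(user_id % BUCKET_SIZE))
        return json.loads(raw) if raw is not None else None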

DynamoDB: Get All Items

I'm trying to retrieve all of the keys from a DynamoDB table in an optimized way. There are millions of keys.
In Cassandra I would probably create a single row with a column for every key, which would eliminate the need for a full table scan. DynamoDB's 64 KB limit per item would seemingly preclude this option, though.
Is there a quick way for me to get back all of the keys?
Thanks.
I believe the DynamoDB analogue would be to use composite keys: have a primary (hash) key of "allmykeys" and a range attribute for each of the original keys being tracked: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html#DataModelPrimaryKey
I suspect this will scale poorly to billions of entries, but should work adequately for a few million.
Finally, again as with Cassandra, the most straightforward solution is to use map/reduce to get the keys: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
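For what it's worth, here is a boto3 sketch of that composite-key approach; the table name, attribute names, and the "allmykeys" hash value are hypothetical, not part of any real schema:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("key_index")  # hypothetical table

    def all_tracked_keys():
        # Query the single "allmykeys" partition and page through the results.
        kwargs = {"KeyConditionExpression": Key("pk").eq("allmykeys")}
        while True:
            page = table.query(**kwargs)
            for item in page["Items"]:
                yield item["tracked_key"]  # the range attribute
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]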

Why does ObjectId make sharding easier in MongoDB?

I keep reading that using an ObjectId as the unique key makes sharding easier, but I haven't seen a relatively detailed explanation as to why that is. Could someone shed some light on this?
The reason I ask is that I want to use an English string (which will obviously be unique) as the unique key, but I want to make sure that it won't tie my hands later on.
I've only just been getting familiar with MongoDB myself, so take this with a grain of salt, but I suspect that sharding is probably more efficient when using ObjectId rather than your own key values, because part of the ObjectId points out which machine or shard the document was created on. The bottom of this page in the Mongo docs explains what each portion of the ObjectId means.
I asked this question on the Mongo user list and the reply was basically that it's OK to generate your own value for _id and it will not make sharding more difficult. For me it's sometimes necessary to have numeric values for _id, like when I'm going to use them in a URL, so I generate my own _id in some collections.
ObjectId is designed to be globally unique. So when it is used as the primary key and a new record is appended to the dataset without a primary-key value, each shard can generate a new ObjectId and not worry about collisions with other shards. This somewhat simplifies life for everyone :)
A shard key does not have to be unique, and we can't conclude that sharding a collection on ObjectId is always efficient.
Actually, ObjectID is probably a poor choice for a shard key.
From the docs (http://docs.mongodb.org/manual/core/sharded-cluster-internals/ the section on "Write Scaling"):
"[T]he most significant bits of [an ObjectID] represent a time stamp, which means that they increment in a regular and predictable pattern. [Therefore] all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster."
In other words, because every OID sorts "bigger" than the one created immediately before it, all inserts keyed by OID will land on the same machine, and the write I/O capacity of that one machine will be the total write capacity of your entire cluster. (This is true not just of OIDs, but of any predictable key -- timestamps, autoincrementing numbers, etc.)
Contrariwise, if you choose a random string as your shard key, writes would tend to distribute evenly over the cluster, and your throughput would be the total I/O of the whole cluster.
(EDIT to be complete: with an OID shard key, as new records landed on the "rightmost" shard, the balancer would handle moving them elsewhere, so they would eventually end up on other machines. But that doesn't solve the I/O problem; it actually makes it worse.)
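A small bson/pymongo sketch that demonstrates the ordering behavior described above (it only illustrates the timestamp prefix; nothing here is taken from the MongoDB docs):

    import time
    from bson import ObjectId  # ships with pymongo

    # ObjectIds created later compare as "bigger", because their most
    # significant bytes are a timestamp.
    a = ObjectId()
    time.sleep(1)
    b = ObjectId()

    assert b > a                 # monotonically increasing key values
    print(a.generation_time)     # embedded creation timestamp (UTC)
    print(b.generation_time)

    # With such a monotonic shard key, every new insert targets the chunk
    # covering the current maximum key, i.e. a single shard.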