http://www.infoq.com/presentations/newport-evolving-key-value-programming-model is a video about KV stores, and the whole premise is that Redis promotes a column-based style: storing the attributes of an object under separate keys rather than serialising the object and storing it under a single key.
(This question is not redis-specific, but more a general style and best practice for KV stores in general.)
Instead of a blob for, say, a 'person', Redis encourages a column-based style where the attributes of an object are stored as separate keys, e.g.
R.set("U:123:firstname","Billy")
R.set("U:123:surname","Newport")
...
I am curious if this is best practice, and if people take different approaches.
E.g. you could 'pickle' an object under a single key. This has the advantage that the whole object can be fetched or set in a single request.
Or a person could be stored as a list, with the first item being an index of the field names, or something similar?
This got me thinking - I'd like a hierarchical key store, e.g.
R.set(["U:123","firstname"],"Billy")
R.set(["U:123","surname"],"Newport")
R.get(["U:123"]) returns [("firstname","Billy"),("surname","Newport")]
And then to add in transactions:
with R.get(["U:123"]) as user:
    user.set("firstname","Paul")
    user.set("surname","Simon")
From a scaling perspective, I assume the batching of gets and sets is going to be important?
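For concreteness, here is a rough sketch of that batching with the redis-py client (just illustrative, reusing the key layout from the example above):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Queue several per-attribute writes and send them in one round trip;
# by default redis-py wraps the pipeline in MULTI/EXEC, so the queued
# writes also apply atomically.
with r.pipeline() as pipe:
    pipe.set("U:123:firstname","Billy")
    pipe.set("U:123:surname","Newport")
    pipe.execute()

# Batch the reads too: one MGET instead of one GET per attribute.
firstname, surname = r.mget("U:123:firstname","U:123:surname")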
Are there key stores that do have support for this or have other applicable approaches?
You can get similar behavior in Redis by using an extra Set to keep track of the individual members of your object.
SET U:123:firstname Billy
SADD U:123:members firstname
SET U:123:surname Cobin
SADD U:123:members surname
GET U:123:firstname => Billy
GET U:123:surname => Cobin
SORT U:123:members GET U:123:* -> [Billy, Cobin]
or
SMEMBERS U:123:members -> [firstname, surname]
MGET U:123:firstname U:123:surname
Not a perfect match, but good enough in many situations. There's an interesting article about how hurl uses this pattern with Redis.
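Translated into a rough redis-py sketch (untested, but the calls map one-to-one onto the commands above):

import redis

r = redis.Redis()

def set_attr(obj_id, field, value):
    # Store the attribute under its own key, and record the field
    # name in the object's member set.
    r.set(f"{obj_id}:{field}", value)
    r.sadd(f"{obj_id}:members", field)

def get_obj(obj_id):
    # SMEMBERS tells us which fields exist; MGET fetches them all in one call.
    fields = [f.decode() for f in r.smembers(f"{obj_id}:members")]
    if not fields:
        return {}
    values = r.mget([f"{obj_id}:{f}" for f in fields])
    return dict(zip(fields, (v.decode() for v in values)))

set_attr("U:123", "firstname", "Billy")
set_attr("U:123", "surname", "Cobin")
print(get_obj("U:123"))  # {'firstname': 'Billy', 'surname': 'Cobin'}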
I'd like to model parent-to-sole-child relationships in a RESTful manner.
I have a User table with partition key userId formatted as usr_############. This userId is readily available in my front-end at all times.
I need to associate to each User the following child resource tables: Profile and Settings. The relationships are 1-to-1. I could make a single table, but I prefer to keep them separate.
My question is how to best index and link the tables in a RESTful manner?
One option is to give each sole child the same partition key as its parent. I can then get the users' information as follows:
/users/usr_abcdefg123456
/profiles/usr_abcdefg123456
/settings/usr_abcdefg123456
The issue with this is that my keys are formatted to have the object type as a prefix (e.g., usr_), and it is confusing to call /profiles/{profileId} with profileId="usr_###...". It also violates the principle that each resource should have a distinct identifier. Imagine that in the future I need an array of identifiers for a mixed group of objects.
A second option is to make a separate partition key (e.g., profileId, settingsId) and have an ownership attribute/global index ownerId for each child. Since I would not know these new partition keys beforehand (I only have access to userId), my endpoints would have to be either:
/profiles/me
/settings/me
(not ideal because "me" is not a resource.)
/profiles?ownerId=usr_abcdefg123456
/settings?ownerId=usr_abcdefg123456
(not ideal because /profiles and /settings return collections (lists) and not a single object)
/users/usr_abcdefg123456/profile
/users/usr_abcdefg123456/settings
(not ideal because it is nested and I would have to create 2 additional REST endpoints)
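For concreteness, a minimal Flask sketch of the nested option above (get_profile/get_settings are hypothetical stand-ins for the actual table lookups):

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical stand-ins for the Profile and Settings table lookups,
# keyed by the parent's partition key (userId).
def get_profile(user_id):
    return {"userId": user_id, "bio": "..."}

def get_settings(user_id):
    return {"userId": user_id, "theme": "dark"}

@app.route("/users/<user_id>/profile")
def user_profile(user_id):
    # The child shares the parent's partition key, so the nested URL
    # needs nothing beyond the userId the front end already has.
    return jsonify(get_profile(user_id))

@app.route("/users/<user_id>/settings")
def user_settings(user_id):
    return jsonify(get_settings(user_id))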
Is there a better way to do this?
Your help is greatly appreciated.
I have a persistent actor which holds some values. I need to retrieve a filtered subset of them. So, I have two ways:
1) Create a new message, say
GetValuesWithNameAndAgeGreaterThan(name: String, age: Int)
pro: immutable, orthodox :)
contra: the problem here is that query logic leaks into the persistent actor, which should be responsible only for keeping and providing data (and yes, this case arguably fits the definition of providing data). But why should it know about the "name" and "age" of the values it keeps?!
And tomorrow I would need to add more and more such messages, which would become a mess in the end.
2) Create a generic message with a filtering predicate
Filter(p: Value => Boolean)
pro: single, scalable, immutable when used properly
contra: the only problem I see is when someone does
val ages = scala.collection.mutable.Buffer[Int]()
persistor ? Filter(v => ages.contains(v.age))
ages += 18
ages += 33
but we usually use immutable values in Scala!
Also, it would be unnatural to try to persist a lambda, but we use it only for reads!
So, what do you think?!
Never follow the second approach!
The first is valid, but I'd change it a little bit.
You could initially agree on some contract for the stored data, e.g. an Enumeration where each entry carries a value type.
So you modify your message to
GetValueForCondition(conditions: Seq[(DataType, Value => Boolean)]), where DataType is an Enumeration value specifying the name and type of a data field, and the paired filter defines the condition on that field's value.
This way you can specify generic queries for entities, which is reusable for requests to other data-storing actors. You could also include a Boolean indicating, for each data field, whether it has to be set for an entity to be returned in the result.
In case you store all the information belonging to one entity (e.g. name, age, ...) in an Entity object (e.g. your persistent actor has some storage for multiple such Entities), you could implement the filtering generically within that Entity class, and your data-providing actor stays free of this logic.
Let's say we have two models like this:
User:
- _id
- name
- email
Company:
- _id
- name
- slug
Now let's say I need to connect a user to a company. A user can have one company assigned. To do this, I can add a new field called companyID to the user model. But I'm not sending the _id field to the front end; all the requests that come to the API will have the slug only. There are two ways I can do this:
1) Add slug to relate the company: If I do this, I can take the slug sent from a request and directly query for the company.
2) Add the _id of the company: If I do this, I need to first use the slug to query for the company and then use the _id returned to query for the required data.
May I please know which way is the best? Is there any extra benefit when using the _id of a record for the relationship?
I agree with the 2nd approach. There are several issues to consider when deciding which field to use as a join key (this is true of all DBs, not just Mongo):
The field must be unique. I'm not sure exactly what the 'slug' field in your schema represents, but if there is any chance this could be duplicated, then don't use it.
The field must not change. Strictly speaking, you can change a key field, but the only way to safely do so is to simultaneously change it in all the child tables atomically. This is difficult to do reliably because a) you have to know which tables are using the field (maybe some other developer added another table that you're not aware of); b) if you do it one table at a time, you'll introduce race conditions; c) if any of the updates fail, you'll have inconsistent data and corrupted parent-child links. Some SQL DBs have a cascading-update feature to solve this problem, but Mongo does not. It's a hard enough problem that you really, really don't want to change a key field if you don't have to.
The field must be indexed. Strictly speaking this isn't true, but if you're going to join on it, then you will be running a lot of queries on it, so you'll need to index it.
For these reasons, it's almost always recommended to use a key field that serves solely as a key field, with no actual information stored in it. Plenty of people have been burned using things like Social Security Numbers, drivers licenses, etc. as key fields, either because there can be duplicates (e.g. SSNs can be duplicated if people are using fake numbers, or if they don't have one), or the numbers can change (e.g. drivers licenses).
Plus, by doing so, you can format the key field to optimize for speed of unique generation and indexing. For example, if you use SSNs, you need to check each new SSN against the rest of the DB to ensure it's unique. That takes time if you have millions of records. Similarly for slugs, which are text fields that need to be hashed and checked against an index. OTOH, MongoDB essentially uses UUID-like ObjectIds as keys, which means it doesn't have to check for uniqueness (the generation algorithm guarantees a high statistical likelihood of uniqueness).
The bottomline is that there are very good reasons not to use a "real" field as your key if you can help it. Fortunately for you, mongoDB already gives you a great key field which satisfies all the above criteria, the _id field. Therefore, you should use it. Even if slug is not a "real" field and you generate it the exact same way as an _id field, why bother? Why does a record have to have 2 unique identifiers?
The second issue in your situation is that you don't expose the company's _id field to the user. Intuitively, it seems like that should be a valuable piece of information that shouldn't be given out willy-nilly. But the truth is, it has no informational value by itself, because, as stated above, a key should have no actual information. The place to implement security is in the query, ensuring that the user doing the query has permission to access the record / specific fields that she's asking for. Hiding the key is a classic security-by-obscurity that doesn't actually improve security.
The only time to hide your primary key is if you're using a poorly thought-out key that does contain useful information. For example, an invoice Id that increments by 1 for each invoice can be used by someone to figure out how many orders you get in a day. Auto-increment Ids can also be easily guessed (if my invoice is #5, can I snoop on invoice #6?). Fortunately, Mongo uses UUID-like ObjectIds, so there's really no information leaking out (except maybe for timing attacks on its generation algorithm? And if you're worried about that, you need far more in-depth security considerations than this post :-).
Look at it another way: if a slug reliably points to a specific company and user, then how is it more secure than just using the _id?
That said, there are some instances where exposing a secondary key (like slugs) is helpful, none of which have to do with security. For example, if in the future you need to migrate DB platforms and need to re-generate keys because the new platform can't use your old ones; or if users will be manually typing in identifiers, then it's helpful to give them something easier to remember like slugs. But even in those situations, you can use the slug as a handy identifier for users to use, but in your DB, you should still use the company ID to do the actual join (like in your option #2). Check out this discussion about the pros/cons of exposing _ids to users:
https://softwareengineering.stackexchange.com/questions/218306/why-not-expose-a-primary-key
So my recommendation would be to go ahead and give the user the company Id (along with the slug if you want a human-readable format e.g. for URLs, although mongo _ids can be used in a URL). They can send it back to you to get the user, and you can (after appropriate permission checks) do the join and send back the user data. If you don't want to expose the company Id, then I'd recommend your option #2, which is essentially the same thing except you're adding an additional query to first get the company Id. IMHO, that's a waste of cycles for no real improvement in security, but if there are other considerations, then it's still acceptable. And both of those options are better than using the slug as a primary key.
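To make option #2 concrete, here is a minimal pymongo sketch (the collection and field names are assumed from the question):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# The user document stores the company's _id, not its slug.
company_id = db.companies.insert_one(
    {"name": "Acme", "slug": "acme"}
).inserted_id
db.users.insert_one(
    {"name": "Jane", "email": "jane@example.com", "companyId": company_id}
)

# A request arrives carrying only the slug: resolve it once...
company = db.companies.find_one({"slug": "acme"}, {"_id": 1})
# ...then join on the immutable _id.
user = db.users.find_one({"companyId": company["_id"]})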
The second approach is best, that is, add the _id of the company.
Using _id is the best practice for querying any kind of information; even complex queries can be resolved using _id, as it is a unique ObjectId created by MongoDB. Population (in Mongoose) is the process of automatically replacing specified paths in a document with document(s) from other collection(s). We may populate a single document, multiple documents, a plain object, multiple plain objects, or all objects returned from a query.
I have a memcache backend and I want to add Redis for storing the metadata of the memcache keys.
Meta data is as follows:
Miss_count: The number of times the data was not present in the memcache.
Hash_value: The hash value of the data corresponding to the key in the memcache.
Data in memcache : key1 ::: Data
Meta data (miss count) : key1_miss ::: 10
Meta data (hash value) : key1_hash ::: hash(Data)
Please advise which data store is preferable: when I store the metadata in the memcache itself, it is evicted well before its expiry time, because the metadata is small and the slab allocation assigns it a small memory chunk.
As the metadata grows with time, the Redis small-hash optimization will stop applying, so client-side logic would be needed to ensure the hash-max-ziplist limits are still satisfied.
If I understand your use case correctly I suspect Redis might be a good choice. Assuming you'll be periodically updating the meta data miss counts associated with the various hashes over time, you'd probably want to use Redis sorted sets. For example, if you wanted the miss counts stored in a sorted set called "misscounts", the Redis command to add/update those counts would be one and the same:
zadd misscounts misscount key1
... because zadd adds the entry if one doesn't already exist or overwrites an existing entry if it does. If you have a hook into the process that fires each time a miss occurs, you could instead use:
zincrby misscounts 1 key1
Similar to the zadd command's behavior, zincrby will create a new entry (using the increment value as the count) if one doesn't exist, or increment the existing count by the increment value you pass if an entry does exist.
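In redis-py terms (a sketch; note that recent redis-py versions take a mapping for zadd), the two variants look like this:

import redis

r = redis.Redis()

# Set-or-overwrite: ZADD stores the miss count as the member's score.
r.zadd("misscounts", {"key1": 10})

# Increment-on-miss: ZINCRBY creates the member with score 1 if absent,
# otherwise bumps its score by the given amount.
r.zincrby("misscounts", 1, "key1")

# Highest miss counts first, with their scores.
print(r.zrevrange("misscounts", 0, -1, withscores=True))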
Complete documentation of Redis commands can be found here. Descriptions of the different types of storage options in Redis are detailed here.
Oh, and a final note. In my experience, Redis is THE SHIT. Sorry to curse (in caps), but there's simply no other way to do Redis justice. We call our Redis server "honey badger", because when load starts increasing and our other servers start auto-scaling, honey badger just don't give a shit.
I am trying to understand how Voldemort can be used. Say I have this scenario, given that Voldemort is a key-value store:
I need to fetch a value (say some text) on the basis of 3 parameters.
So, what will be the key in this case? I cannot use 3 keys for 1 value, right? But that value should be searchable on the basis of those 3 parameters.
Am I making sense?
Thanks
EDIT1
e.g. a blog system. A user posts a blog. The user's data stored: Name, Age and Sex.
The blog content (text) is stored.
Now I need to use Voldemort here: if a user searches from the front end for all blog posts with Sex: Male,
then my code should query Voldemort and return all the "blog content (text)" entries which have Sex as Male.
So, as per my understanding:
Key = Name, Age and Sex
Value = Text
I am using Java.
Edited answer to fit with example added to question:
The thing to understand about Voldemort is that it's a very simple key-value store. As far as I know, the only thing you can do is store a value under a key and then fetch that value by key. So for your example case, if you really want to use Voldemort, you have a few options.
So, for example, you've said that you're storing data for users. So, you might have something like this:
Key = user-Chad
Value = Name:Chad Birch, Age:26, Sex:Male
Now, if I want to post a new blog post, you also need to store that under a key. So you could do something like this:
Key = blog-Chad1
Value = Here is my very first blog post.
Now, your problem is that you need some way to look up all the blog posts made by users with Sex:Male, but there's no way to get that data directly. At this point, you have to either:
1) Pull out every single user, check if they're male, and if they are, pull out their blog posts.
2) Start storing more stuff in other key-value pairs so that you can look this up.
To implement #2, you could add another pair like this:
Key = search-Sex:Male
Value = Chad1 Chad2 Steve1 ...
Then, when someone does a search for Sex:Male, you pull out the value for this, split it up, and then go fetch all those blog posts.
Does that make sense? Using a k-v store is quite a bit different from a database, because you lose all these relational abilities.
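Here's what that second option can look like in Python, with a plain dict standing in for the Voldemort store (a real client would use its get/put API, and concurrent updates to the index key would need Voldemort's versioning to resolve):

store = {}  # stand-in for the Voldemort store: key -> value

def put_post(post_id, text, sex):
    # Store the post itself, then maintain the inverted index by
    # appending the post id to the space-separated list under the
    # search key.
    store[f"blog-{post_id}"] = text
    search_key = f"search-Sex:{sex}"
    ids = store.get(search_key, "")
    store[search_key] = (ids + " " + post_id).strip()

def posts_by_sex(sex):
    ids = store.get(f"search-Sex:{sex}", "").split()
    return [store[f"blog-{i}"] for i in ids]

put_post("Chad1", "Here is my very first blog post.", "Male")
put_post("Steve1", "Hello world.", "Male")
print(posts_by_sex("Male"))  # both posts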
I don't think you can do that directly with a key-value store, but one way to work around it is to store the user in multiple places.
For example, you have key-value mapping of a user to a list of blog posts. You also have a mapping of an age to a list of users. Also a gender to a list of users. Now if you want to search by age or gender you pull the corresponding list of users, and then pull all of their blog posts.
Part of the reason a key-value store like Voldemort can work is that storage and queries are cheap enough that you can do extra ones.
The problem with the above scheme, though, is that if you're using Voldemort in a distributed way, you're better off with lots of keys that map to short lists of data (so you can distribute based on key); something like mapping gender to users violates that (only a few keys, each with a potentially very large list of data).