Correct usage of Voldemort as key-value pair? - voldemort

I am trying to understand how Voldemort can be used. Say I have this scenario:
Voldemort is a key-value store.
I need to fetch a value (say, some text) on the basis of 3 parameters.
So what will the key be in this case? I cannot use 3 keys for 1 value, right? But that value should be searchable on the basis of those 3 parameters.
Am I making sense?
Thanks
EDIT1
E.g. a blog system. A user posts a blog. The user's stored data: Name, Age and Sex.
The blog content (text) is stored.
Now, I need to use Voldemort here. If a user searches from the front end for all the blog posts by Sex: Male,
then my code should query Voldemort and return all the "blog content (text)" entries which have Sex as Male.
So, as per my understanding:
Key = Name, Age and Sex
Value = Text
I am using Java.

Edited answer to fit with example added to question:
The thing to understand about Voldemort is that it's a very simple key-value store. As far as I know about it, the only thing you can do is store a value under a key, and then fetch those values by key. So for your example case, if you really want to use Voldemort, you have a few options.
So, for example, you've said that you're storing data for users. So, you might have something like this:
Key = user-Chad
Value = Name:Chad Birch, Age:26, Sex:Male
Now, if you want to post a new blog post, you also need to store that under a key. So you could do something like this:
Key = blog-Chad1
Value = Here is my very first blog post.
Now, your problem is that you need some way to look up all the blog posts made by users with Sex:Male, but there's no way to get that data directly. At this point, you have to either:
1) Pull out every single user, check if they're male, and if they are, pull out their blog posts.
2) Start storing more stuff in other key-value pairs so that you can look this up.
To implement #2, you could add another pair like this:
Key = search-Sex:Male
Value = Chad1 Chad2 Steve1 ...
Then, when someone does a search for Sex:Male, you pull out the value for this, split it up, and then go fetch all those blog posts.
Does that make sense? Using a k-v store is quite a bit different from a database, because you lose all these relational abilities.
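The index-pair approach (option #2) can be sketched as follows. This is only an illustration under stated assumptions: the `store` dict stands in for a key-value store that offers nothing but put and get by key, and the helper names (`put`, `get`, `add_post`, `posts_by_sex`) are hypothetical, not part of Voldemort's client API.

```python
# In-memory stand-in for a key-value store: only put and get by key.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

def add_post(sex, post_id, text):
    """Store the post, then append its id to the index entry for the author's sex."""
    put(f"blog-{post_id}", text)
    index_key = f"search-Sex:{sex}"
    existing = get(index_key, "")
    put(index_key, f"{existing} {post_id}".strip())

def posts_by_sex(sex):
    """Read the index value, split it, and fetch each post by its own key."""
    ids = get(f"search-Sex:{sex}", "").split()
    return [get(f"blog-{post_id}") for post_id in ids]

add_post("Male", "Chad1", "Here is my very first blog post.")
add_post("Male", "Chad2", "A second post.")
add_post("Female", "Anna1", "Hello from Anna.")
add_post("Male", "Steve1", "Steve's first post.")

male_posts = posts_by_sex("Male")
```

Note that updating the index is a read-modify-write on a single key, so a real store with concurrent writers would need some way to handle conflicting updates to that entry.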

I don't think you can do that directly with a key-value store, but one workaround is to store the user in multiple places.
For example, you have a key-value mapping of a user to a list of blog posts. You also have a mapping of an age to a list of users, and a mapping of a gender to a list of users. Now if you want to search by age or gender, you pull the corresponding list of users, and then pull all of their blog posts.
Part of the reason a key-value store like Voldemort can work is that storage and queries are cheap enough that you can do extra ones.
The problem with the above scheme, though, is that if you're using Voldemort in a distributed way, you're better off with lots of keys that map to short lists of data (so the load can be distributed by key). Something like mapping gender to users violates that: it has only a few keys, each with a potentially very large list of data.

Related

Firestore Structure for public/private fields

I'm after some advice on a Firestore DB structure. I have an app that has a Firestore DB and allows a single user (under one UID) to create a profile for each member of their family (each profile is a document within the collection). Each document holds the personal details of the family member as fields (for example, field1 = first name, field2 = last name, field3 = phone number and so on). This works well, but there is one other detail I need to attach to every field within each profile: I need to be able to set a private or public flag against each individual field (for example, first name has a public flag, last name has a private flag, phone number has a private flag and so on). It would be nice if each field could have nested fields underneath (such as a "private" bool field), but that's not how Firestore works. It seems to be Collection/Document/Collection/Document/and so on...
If I didn't need the private/public flag, I would not have an issue. My data would fit the Firestore structure perfectly.
Does anyone have any suggestions on how I might best achieve this outcome?
Cheers and thanks in advance...
Family Profiles current structure without flags
You can use the structure above. With this structure you can fetch private data and public data separately whenever you need to. If you want to show only the first name to other users in your app, you can use queries to control what is shown to them. Also, always use unique IDs to store data rather than hardcoded names such as JaneDoe or JoeDoe. Otherwise you may run into problems later when fetching data from Firestore.
If you have questions, feel free to ask.
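The structure the answer refers to is not reproduced in this text, so the following is only a guess at one layout consistent with "fetch private data and public data separately": each profile is split into a public part and a private part according to per-field flags. The function name `split_profile`, the flag values and the sample data are all hypothetical, and plain dicts are used instead of any Firestore client calls.

```python
def split_profile(fields, visibility):
    """Split one profile into public and private parts.

    Fields without an explicit "public" flag are treated as private.
    """
    public, private = {}, {}
    for name, value in fields.items():
        target = public if visibility.get(name) == "public" else private
        target[name] = value
    return {"public": public, "private": private}

profile = split_profile(
    {"firstname": "Jane", "lastname": "Doe", "phone": "555-0100"},
    {"firstname": "public", "lastname": "private", "phone": "private"},
)
```

Each part could then be stored where the access rules differ (for example, two separate documents), so that other users only ever read the public part.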
Take a look at the official Firebase documentation. The information there will help you understand which solution is most suitable for working with data structures on this service. As for your question, it depends on your use case; it would be useful if you could provide more context about why your implementation needs to work the way you describe.
Also, since your concerns are about managing the privacy of your data, check this document too.
I hope this information helps.

Redis hash usage as table

I want to use Redis like a NoSQL database, and I have an idea like the one below.
Assume that I have 3 tables:
1 - user
2 - post
3 - comment
I create a hash for each table, like below:
hset user _usr_100 '{"id":"_usr_100","name":"john","username":"jhn","age":25}'
hset user _usr_101 '{"id":"_usr_101","name":"adam","username":"adm","age":26}'
hset user _usr_102 '{"id":"_usr_102","name":"eric","username":"erc","age":27}'
hset post _post_100 '{"id":"_post_100","title":"title","content":"testpost","userid":"_usr_100"}'
hset post _post_101 '{"id":"_post_101","title":"title","content":"testpost","userid":"_usr_101"}'
hset post _post_102 '{"id":"_post_102","title":"title","content":"testpost","userid":"_usr_102"}'
hset comment _comment_100 '{"id":"_comment_100","content":"testpost","userid":"_usr_100","postid":"_post_100"}'
hset comment _comment_101 '{"id":"_comment_101","content":"testpost","userid":"_usr_101","postid":"_post_101"}'
hset comment _comment_102 '{"id":"_comment_102","content":"testpost","userid":"_usr_102","postid":"_post_102"}'
When I want to get a user (_usr_100) from Redis:
hget user _usr_100
{"id":"_usr_100","name":"john","username":"jhn","age":25}
When I want to get all users:
hgetall user
{"id":"_usr_100","name":"john","username":"jhn","age":25}
{"id":"_usr_101","name":"adam","username":"adm","age":26}
{"id":"_usr_102","name":"eric","username":"erc","age":27}
After deserializing the JSON strings one by one and filling them into a list, I have a List, so I can do some operations (search, group by, order, pagination ...), and I can do the same thing for the other hashes (post, comment).
I can delete and update users with:
hdel user _usr_101 // deleted _usr_101
hset user _usr_100 '{"id":"_usr_100","name":"john","username":"jhn","age":26}' //updated age
hset user _usr_103 '{"id":"_usr_103","name":"max","username":"max","age":15}' //new user
hgetall user
{"id":"_usr_100","name":"john","username":"jhn","age":26}
{"id":"_usr_102","name":"eric","username":"erc","age":27}
{"id":"_usr_103","name":"max","username":"max","age":15}
What are the disadvantages of this usage? Can you suggest another way to use hashes so that Redis works like NoSQL tables?
Depending on your business rules/model, this option "may" work, but it may not be the best (or near the best) solution for your domain. Using a key/value store for a mostly relational domain forces you to make tradeoffs that may work against you.
When your user class gets new fields and those fields need to be queried, you need to use more "space" to reduce the "time". You keep denormalizing your data just to serve a single query, and you end up trying to implement your relational database in the key/value store world. In a relational database, you can update user 101 with a simple statement:
UPDATE users SET username = 'mynewusername' where id = 101;
In your case you will need to find all related keys/fields across all hashes/sets/lists and update them to keep the data consistent. Keeping age as a field may be a bad idea; you will need to store the birthday instead. If your business needs to fetch the list of users whose birthday is today, then you need to create new keys, duplicate most of your data, and migrate all your existing users there just to get today's birthdays. Keep in mind that you need to query by day and month to get birthdays, which means that you have to keep users in separate sets such as users:birthday:01:01, users:birthday:02:05, users:birthday:11:08 to fetch them. If users want to update their birthday (depending on the business), then you need to manually move them between those sets while updating the other sets too.
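To make that bookkeeping concrete, here is a rough sketch of what the application has to do on every write once a birthday index exists. It assumes in-memory stand-ins (a dict for the "user" hash, a dict of sets for the users:birthday:MM:DD sets) rather than real Redis calls; `save_user`, `users_with_birthday` and the "birthday" field format are hypothetical.

```python
from collections import defaultdict

users = {}                          # stand-in for the "user" hash: id -> record
birthday_index = defaultdict(set)   # stand-in for sets like users:birthday:MM:DD

def birthday_key(birthday):
    """Build the index key from a "YYYY-MM-DD" birthday string."""
    _, month, day = birthday.split("-")
    return f"users:birthday:{month}:{day}"

def save_user(user_id, record):
    """Insert or update a user and keep the birthday index in step."""
    old = users.get(user_id)
    if old is not None:
        # The application must remove the stale index entry itself.
        birthday_index[birthday_key(old["birthday"])].discard(user_id)
    users[user_id] = record
    birthday_index[birthday_key(record["birthday"])].add(user_id)

def users_with_birthday(month, day):
    return sorted(birthday_index[f"users:birthday:{month}:{day}"])

save_user("_usr_100", {"name": "john", "birthday": "1995-01-01"})
save_user("_usr_101", {"name": "adam", "birthday": "1994-02-05"})
save_user("_usr_100", {"name": "john", "birthday": "1995-02-05"})  # birthday corrected
```

Every additional query pattern (active users, pagination order, tags) adds another structure that `save_user` would have to maintain in the same way.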
Adding an active/passive status to users will be another pain. I am not sure whether you need to get all users at once; you may need to paginate them, and with a hash that will be hard. You will need another sorted set/list to do that.
The same goes for comments on users' posts, the last 25 comments of a user, the most recent comments of the users who have the most posts, searching through users' posts, and so on. Your product manager will come up with an idea like "let's add tags to each post", and you will need to work that into your data model with new data structures.
This is relational data, and it is better to keep it relational. When you start modeling your data in a non-relational database, all the flexibility an RDBMS provides will be gone, replaced with complexity in both the data and application layers.
A single PostgreSQL instance may serve you far better than Redis for this problem. Redis has excellent features for solving problems, but user/post/comment is not one of them.
This post may provide some insights too

Good URL syntax for a GET request with a composite key

Let's take the following resource in my REST API:
GET `http://api/v1/user/users/{id}`
In normal circumstances I would use this like so:
GET `http://api/v1/user/users/aabc`
Where aabc is the user id.
There are times, however, when I have had to design my REST API in a way that some extra information is passed with the ID. For example:
GET `http://api/v1/user/users/customer:1`
Where customer:1 denotes I am using an id from the customer domain to lookup the user and that id is 1.
I now have a scenario where the identifier is more than one key (a composite key). For example:
GET `http://api/v1/user/users/customer:1;type:agent`
My question: in the above URL, what should I use as the separator between customer:1 and type:agent?
According to https://www.ietf.org/rfc/rfc3986.txt I believe that the semi-colon is not allowed.
You should either:
Use parameters:
GET `http://api/v1/user/users?customer=1`
Or use a new URL:
GET `http://api/v1/user/users/customer/1`
But follow standards like this one:
("Paths tend to be cached, parameters tend to not be, as a general rule.")
Instead of trying to create a general structure for accessing records via multiple keys at once, I would suggest trying to think of this on more of a case-by-case basis.
To take your example, one way to interpret it is that you have multiple customers, and those customers each may have multiple user accounts. A natural hierarchy for this would be:
/customer/x/user/y
Often an elegant decision like this can be made, that not only solves the problem but also documents your data-model in a way that someone can easily see that users belong to customers via a 1-to-many relationship.
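The two shapes suggested in the answers (query parameters versus a hierarchical path) can be sketched with Python's standard `urllib.parse`. The base URL is taken from the question; the helper names `users_by_query` and `user_under_customer` are hypothetical, and percent-encoding each path segment is an added precaution rather than something the answers state.

```python
from urllib.parse import quote, urlencode

BASE = "http://api/v1"  # host and version prefix as written in the question

def users_by_query(**keys):
    """Composite key expressed as query parameters."""
    return f"{BASE}/user/users?{urlencode(keys)}"

def user_under_customer(customer_id, user_id):
    """Composite key expressed as a hierarchy: /customer/x/user/y."""
    customer = quote(str(customer_id), safe="")
    user = quote(str(user_id), safe="")
    return f"{BASE}/customer/{customer}/user/{user}"

query_url = users_by_query(customer=1, type="agent")
path_url = user_under_customer(1, "aabc")
```

With either shape no custom separator between the key parts is needed, which sidesteps the question of which characters are safe inside a single path segment.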

To relate one record to another in MongoDB, is it ok to use a slug?

Let's say we have two models like this:
User:
- _id
- name
- email
Company:
- _id
- name
- slug
Now let's say I need to connect a user to the company. A user can have one company assigned. To do this, I can add a new field called companyID in the user model. But I'm not sending the _id field to the front end. All the requests that come to the API will have the slug only. There are two ways I can do this:
1) Add slug to relate the company: If I do this, I can take the slug sent from a request and directly query for the company.
2) Add the _id of the company: If I do this, I need to first use the slug to query for the company and then use the _id returned to query for the required data.
May I please know which way is the best? Is there any extra benefit when using the _id of a record for the relationship?
Agree with the 2nd approach. There are several issues to consider when deciding on which field to use as a join key (this is true of all DBs, not just Mongo):
The field must be unique. I'm not sure exactly what the 'slug' field in your schema represents, but if there is any chance this could be duplicated, then don't use it.
The field must not change. Strictly speaking, you can change a key field but the only way to safely do so is to simultaneously change it in all the child tables atomically. This is a difficult thing to do reliably because a) you have to know which tables are using the field (maybe some other developer added another table that you're not aware of) b) If you do it one at a time, you'll introduce race conditions c) If any of the updates fail, you'll have inconsistent data and corrupted parent-child links. Some SQL DBs have a cascading-update feature to solve this problem, but Mongo does not. It's a hard enough problem that you really, really don't want to change a key field if you don't have to.
The field must be indexed. Strictly speaking this isn't true, but if you're going to join on it, then you will be running a lot of queries on it, so you'll need to index it.
For these reasons, it's almost always recommended to use a key field that serves solely as a key field, with no actual information stored in it. Plenty of people have been burned using things like Social Security Numbers, driver's licenses, etc. as key fields, either because there can be duplicates (e.g. SSNs can be duplicated if people are using fake numbers, or if they don't have one), or because the numbers can change (e.g. driver's licenses).
Plus, by doing so, you can format the key field to optimize for speed of unique generation and indexing. For example, if you use SSNs, you need to check the SSN against the rest of the DB to ensure it's unique. That takes time if you have millions of records. The same goes for slugs, which are text fields that need to be hashed and checked against an index. OTOH, MongoDB's default _id is an ObjectId, which works much like a UUID: it is built from a timestamp, a per-process value and a counter, so generating one doesn't require searching the DB for an unused value (the algorithm gives a high statistical likelihood of uniqueness).
The bottom line is that there are very good reasons not to use a "real" field as your key if you can help it. Fortunately for you, MongoDB already gives you a great key field which satisfies all the above criteria: the _id field. Therefore, you should use it. Even if slug is not a "real" field and you generate it the exact same way as an _id field, why bother? Why does a record have to have 2 unique identifiers?
The second issue in your situation is that you don't expose the company's _id field to the user. Intuitively, it seems like that should be a valuable piece of information that shouldn't be given out willy-nilly. But the truth is, it has no informational value by itself, because, as stated above, a key should have no actual information. The place to implement security is in the query, ensuring that the user doing the query has permission to access the record / specific fields that she's asking for. Hiding the key is a classic security-by-obscurity that doesn't actually improve security.
The only time to hide your primary key is if you're using a poorly thought-out key that does contain useful information. For example, an invoice Id that increments by 1 for each invoice can be used by someone to figure out how many orders you get in a day. Auto-increment Ids can also be easily guessed (if my invoice is #5, can I snoop on invoice #6?). Fortunately, Mongo's ObjectIds are not plain sequence numbers, so there's very little information leaking out (an ObjectId does embed its creation time, but if you're worried about that, you need far more in-depth security considerations than this post :-).
Look at it another way: if a slug reliably points to a specific company and user, then how is it more secure than just using the _id?
That said, there are some instances where exposing a secondary key (like slugs) is helpful, none of which have to do with security. For example, if in the future you need to migrate DB platforms and need to re-generate keys because the new platform can't use your old ones; or if users will be manually typing in identifiers, then it's helpful to give them something easier to remember like slugs. But even in those situations, you can use the slug as a handy identifier for users to use, but in your DB, you should still use the company ID to do the actual join (like in your option #2). Check out this discussion about the pros/cons of exposing _ids to users:
https://softwareengineering.stackexchange.com/questions/218306/why-not-expose-a-primary-key
So my recommendation would be to go ahead and give the user the company Id (along with the slug if you want a human-readable format e.g. for URLs, although mongo _ids can be used in a URL). They can send it back to you to get the user, and you can (after appropriate permission checks) do the join and send back the user data. If you don't want to expose the company Id, then I'd recommend your option #2, which is essentially the same thing except you're adding an additional query to first get the company Id. IMHO, that's a waste of cycles for no real improvement in security, but if there are other considerations, then it's still acceptable. And both of those options are better than using the slug as a primary key.
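Option #2 from the question (resolve the slug to the company's _id, then join on the _id) might look roughly like this. Plain Python lists stand in for the two collections; the sample records, the `companyID` values and the helper names are made up for illustration, and no MongoDB driver calls are shown.

```python
# In-memory stand-ins for the two collections.
companies = [
    {"_id": "c1", "name": "Acme", "slug": "acme"},
    {"_id": "c2", "name": "Globex", "slug": "globex"},
]
users = [
    {"_id": "u1", "name": "Ann", "email": "ann@example.com", "companyID": "c1"},
    {"_id": "u2", "name": "Bob", "email": "bob@example.com", "companyID": "c1"},
    {"_id": "u3", "name": "Cy", "email": "cy@example.com", "companyID": "c2"},
]

def find_company_by_slug(slug):
    """First query: translate the public slug into the internal _id."""
    return next((c for c in companies if c["slug"] == slug), None)

def users_for_company_slug(slug):
    """Second query: join on the _id, never on the slug."""
    company = find_company_by_slug(slug)
    if company is None:
        return []
    return [u["name"] for u in users if u["companyID"] == company["_id"]]

before = users_for_company_slug("acme")

# Renaming the slug touches one company record; no user record changes.
find_company_by_slug("acme")["slug"] = "acme-inc"
after = users_for_company_slug("acme-inc")
```

Because the users reference the _id, changing the slug is a single-record update, which is the "field must not change" point made above.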
The second approach is the best, that is, adding the _id of the company.
Using _id is the best practice for querying any kind of information; even complex queries can be handled using _id, as it is a unique ObjectId created by MongoDB. Population (in Mongoose) is the process of automatically replacing the specified paths in the document with document(s) from other collection(s). We may populate a single document, multiple documents, a plain object, multiple plain objects, or all objects returned from a query.

Structuring cassandra database

I don't understand one thing about Cassandra. Say, I have similar website to Facebook, where people can share, like, comment, upload images and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new Column Family for each single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of. And even after I do that, I would still need to create secondary indexes for most of the columns just so I could search for data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' ids and then search through all of those Column Families for each user id?
EDIT
OK, so I did some more reading and now I understand things a little bit better, but I still can't really figure out how to structure my tables, so I will set a bounty. I want to get a clear example of how my tables should look if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the last ten uploaded files of all my friends or the people I follow. This is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything else, like comments and shares, would be similar to that...
Now probably the biggest challenge would be to retrieve the last 10 things of all categories together, so the list would be a mix of all the things...
I don't need an answer with fully detailed tables; I just need a really clear example of how I would structure and retrieve data like I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items which your friends uploaded. One way to do this is to have a single table per user, and write to this table whenever a friend of that user uploads something.
friendUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
As an example:
friendUploads {
    userA {
        12313-upload5 : null
        12512-upload6 : null
        13512-upload8 : null
    }
}
friendUploads {
    userB {
        11313-upload3 : null
        12512-upload6 : null
    }
}
Note that upload6 is duplicated to two different columns, as whoever did upload6 is a friend of both User A and User B.
Now to query the friends' uploads display of a user, do a getSlice with a limit of 10 on the userid column. This will return the first 10 items, sorted by key.
To put the newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback to this scheme is that when User A uploads a song, you have to do N writes to update the friendUploads columns, where N is the number of people who are friends of User A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a json blob), or you can store nothing, and fetch the upload information using the uploadid.
To avoid duplicating writes, you can use a structure like:
userUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
This stores the uploads for a particular user. Now when you want to display the uploads of User B's friends, you have to do N queries, one for each friend of User B, and merge the results in your application. This is slower to query, but faster to write.
Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.
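The trade-off between the two schemes can be sketched with in-memory stand-ins. This is not Cassandra client code: the dicts below only mimic the friendUploads and userUploads layouts, and the friendship graph, sample uploads and function names are invented for illustration.

```python
from collections import defaultdict

# Symmetric friendship graph (hypothetical sample data).
friends = {
    "userA": {"userC"},
    "userB": {"userC", "userD"},
    "userC": {"userA", "userB"},
    "userD": {"userB"},
}

friend_uploads = defaultdict(list)  # scheme 1: one feed per reader
user_uploads = defaultdict(list)    # scheme 2: one list per uploader

def record_upload(uploader, timestamp, upload_id):
    entry = (timestamp, upload_id)
    user_uploads[uploader].append(entry)      # scheme 2: a single write
    for friend in friends[uploader]:          # scheme 1: N writes, one per friend
        friend_uploads[friend].append(entry)

def feed_from_fanout(user, limit=10):
    """Scheme 1 read: one lookup, newest first."""
    return sorted(friend_uploads[user], reverse=True)[:limit]

def feed_from_merge(user, limit=10):
    """Scheme 2 read: one lookup per friend, merged in the application."""
    merged = [entry for friend in friends[user] for entry in user_uploads[friend]]
    return sorted(merged, reverse=True)[:limit]

record_upload("userC", 12512, "upload6")
record_upload("userD", 11313, "upload3")
record_upload("userC", 13512, "upload8")

feed_b = feed_from_fanout("userB")
```

Both read functions return the same feed; the difference is whether the per-friend work happens at write time (scheme 1) or at read time (scheme 2).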
As an example of denormalization, look at how many writes twitter rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat noSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's #OneToMany stores the "many" side like so:
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also have a reverse reference stored as well, so the user might have references to the people that marked him as a friend but whom he did not mark back (let's call that a buddy), so you might have:
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm, you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can also learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here is the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra, joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And looking into the twissandra sample project: https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.