Document design - nosql

I am trying out some different options to design and store a document structure in an efficient way in RavenDB.
The structure I am handling is user Session and activity tracking information.
A Session is started when a User logs into the system and the activities start getting created. There could be hundreds of activities per session.
The session ends when the user closes / logs out.
A factor that complicates the scenario slightly is that the sessions are displayed in a web portal in real time. In other words: I need to keep track of the session and activities and correlate them to be able to find out if they are ongoing (and how long they have been running) or if they are done.
You can also dig around in the history of course.
I did some research and found two relevant questions here on Stack Overflow, but neither of them really helped me:
Document structure for RavenDB
Activity stream design with RavenDb
The two options I have spiked successfully are: (simplified structures)
1:
{
"User": "User1",
"Machine": "machinename",
"StartTime": "2012-02-13T13:11:52.0000000",
"EndTime": "2012-02-13T13:13:54.0000000",
"Activities": [
{
"Text": "Loaded Function X",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
},
{
"Text": "Executed action Z",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
]
}
2:
{
"Session" : "SomeSessionId-1",
"User": "User1",
"Machine": "machinename",
"Text": "Loaded Function X",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
{
"Session" : "SomeSessionId-1",
"User": "User1",
"Machine": "machinename",
"Text": "Executed action Z",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
Alternative 1 feels more natural, coming from a relational background, and it was really simple to load up a Session, add events and store it away. But the overhead of loading a Session object and then appending events every time feels really bad for insert performance.
Alternative 2 feels much more efficient: I can simply append events (almost like event sourcing). But the selections when digging around in events and showing them per session get a bit more complex.
Is there perhaps a third better alternative?
Could the solution be to separate the events and create another read model?
Am I overcomplicating the issue?

I definitely think you should go with some variant of option 2. Won't the documents grow very large in option 1? That would probably make the inserts very slow.
I can't really see why showing events per session would be any more complicated in option 2 than in option 1, you can just select events by session with
session.Query<Event>().Where(x => x.Session == sessionId)
and RavenDB will automatically create an index for it. And if you want to make more complicated queries you could always create more specialized indexes for that.
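To make the per-session lookup concrete, here is a minimal in-memory sketch in Python (it is not RavenDB client code). It stores flat activity documents as in option 2, selects them by session, and derives whether the session is still running. The helper names `activities_for` and `session_summary`, and the convention that a running activity has `EndTime` set to `None`, are assumptions made for illustration.

```python
from datetime import datetime

# Flat activity documents, shaped like option 2 in the question.
# Assumption: an activity that is still running has EndTime = None.
activities = [
    {"Session": "SomeSessionId-1", "User": "User1", "Text": "Loaded Function X",
     "StartTime": datetime(2012, 2, 13, 13, 12, 10),
     "EndTime": datetime(2012, 2, 13, 13, 12, 40)},
    {"Session": "SomeSessionId-1", "User": "User1", "Text": "Executed action Z",
     "StartTime": datetime(2012, 2, 13, 13, 12, 45),
     "EndTime": None},
    {"Session": "SomeSessionId-2", "User": "User2", "Text": "Loaded Function Y",
     "StartTime": datetime(2012, 2, 13, 13, 20, 0),
     "EndTime": datetime(2012, 2, 13, 13, 20, 5)},
]


def activities_for(session_id, docs):
    """Select the activities of one session (same idea as the Where clause above)."""
    return [d for d in docs if d["Session"] == session_id]


def session_summary(session_id, docs, now):
    """Derive a small read model for one session from its activities."""
    own = activities_for(session_id, docs)
    start = min(d["StartTime"] for d in own)
    ongoing = any(d["EndTime"] is None for d in own)
    # For a running session, measure up to "now"; otherwise up to the last end time.
    end = now if ongoing else max(d["EndTime"] for d in own)
    return {
        "Session": session_id,
        "Count": len(own),
        "Ongoing": ongoing,
        "RunningSeconds": (end - start).total_seconds(),
    }


now = datetime(2012, 2, 13, 13, 13, 10)
summary_1 = session_summary("SomeSessionId-1", activities, now)
summary_2 = session_summary("SomeSessionId-2", activities, now)
```

In a real deployment the summary would more likely come from an index or a separate read model than from scanning every document, but the shape of the result would be similar.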

Looks like you just need a User document and a Session document. Create two models, "User" and "Session". The Session doc would have the user ID as one property and would also hold nested "activity" properties. It will be easy to show real-time users, sessions and activities in this case. Without knowing more details, I'm oversimplifying, of course.
EDIT:
//Sample User Document
{
UserId:"ABC01",
HomeMachine:"xxxx",
DateCreated:"12/12/2011"
}
//Sample Session Document
{
UserId:"ABC01",
Activities: [
{
// Activity 1 properties
},
{
// Activity 2 properties
}
// ...etc.
]
}
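As a rough illustration of this nested layout, the following Python sketch keeps activities inside the session document and treats a session without an end time as ongoing. It uses plain dictionaries rather than a RavenDB client, and the `StartTime`/`EndTime` fields on the session plus the helper names are assumptions borrowed from option 1 in the question.

```python
from datetime import datetime


def start_session(user_id, started):
    """Create a session document with an empty embedded activity list."""
    return {"UserId": user_id, "StartTime": started, "EndTime": None, "Activities": []}


def add_activity(session, text, started, ended=None):
    """Append an activity to the embedded list.

    Against a real document store this is a load-modify-store cycle
    on the whole session document.
    """
    session["Activities"].append({"Text": text, "StartTime": started, "EndTime": ended})


def is_ongoing(session):
    """A session counts as ongoing until it gets an end time."""
    return session["EndTime"] is None


session = start_session("ABC01", datetime(2012, 2, 13, 13, 11, 52))
add_activity(session, "Loaded Function X",
             datetime(2012, 2, 13, 13, 12, 10), datetime(2012, 2, 13, 13, 12, 10))
add_activity(session, "Executed action Z", datetime(2012, 2, 13, 13, 12, 10))

was_live = is_ongoing(session)                       # checked while the user is logged in
session["EndTime"] = datetime(2012, 2, 13, 13, 13, 54)  # user logs out
```

Everything needed to render one session lives in one document, at the cost of rewriting that document for each new activity.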

Related

Modelling a repetitive task list in Mongo

I am trying to write a small app for work in order to learn Meteor, which is going well except that I am struggling to model the data in Mongo. I seem to be trapped in an RDBMS mindset. I have a set of tasks that need to be completed every day, with a comment stored for each task as it is done.
I keep thinking of creating a table of tasks and then having another table with task_id, date, status and comment fields, but this seems totally against the NoSQL way?
I suppose I could have a document consisting of the tasks and each task having a sub document that just consists of dates and comments?
Does anyone have any ideas on what would be the most effective way to model this? It's so simple but I'm trapped in the old ways!!
Many thanks in advance!
I think this question has more than one right answer. So take my approach as an option rather than the only solution.
If you structure your tasks like so:
tasks:
{
"_id": "ObjectId()",
"task": "string",
"comment": "string",
"dueDate": "date",
"status": "string/bool/number"
}
Then you can easily access, filter and manipulate your data with a single query. If you would like to separate the comment from the task and add it only on completion, you can use two collections:
tasks:
{
"_id": "ObjectId()",
"task": "string",
"dueDate": "date",
"status": "string/bool/number"
}
comments:
{
"_id": "ObjectId()",
"taskId": "taskId",
"comment": "string"
}
Now whenever you update a task in your db you can get back its taskId and use it to insert or update the comment.
This is just the tip of the iceberg. You have many options if you want more than one comment per task or more than one user per task, and it goes deeper from there.
Try reading this article on the MongoDB website; it might be of great help in understanding this.
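The two-collection variant can be sketched as follows. This is an in-memory stand-in written in Python, not Meteor or MongoDB driver code: the dictionaries `tasks` and `comments` play the role of collections, an integer counter stands in for `ObjectId()`, and the function names are made up for the example.

```python
import itertools

_ids = itertools.count(1)  # stand-in for ObjectId generation
tasks = {}                 # _id -> task document
comments = {}              # _id -> comment document


def insert_task(task, due_date):
    """Insert a task and return its _id, as a driver would."""
    _id = next(_ids)
    tasks[_id] = {"_id": _id, "task": task, "dueDate": due_date, "status": "open"}
    return _id


def complete_task(task_id, comment):
    """Mark the task done, then store the comment linked through taskId."""
    tasks[task_id]["status"] = "done"
    comment_id = next(_ids)
    comments[comment_id] = {"_id": comment_id, "taskId": task_id, "comment": comment}
    return comment_id


def comments_for(task_id):
    """Second lookup needed because comments live in their own collection."""
    return [c["comment"] for c in comments.values() if c["taskId"] == task_id]


task_id = insert_task("Check backups", "2014-01-15")
complete_task(task_id, "All backups verified")
```

Note that showing a task together with its comment now takes two lookups, which is the price of keeping the comment out of the task document.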

MongoDB collections - which way will be more efficient?

I am more used to MySQL but I decided to go MongoDB for this project.
Basically it's a social network.
I have a posts collection where documents currently look like this:
{
"text": "Some post...",
"user": "3j219dj21h18skd2" // User's "_id"
}
I am looking to implement a replies system. Will it be better to simply add an array of replies, like so:
{
"text": "Some post...",
"user": "3j219dj21h18skd2", // User's "_id"
"replies": [
{
"user": "3j219dj200928smd81",
"text": "Nice one!"
},
{
"user": "3j219dj2321md81zb3",
"text": "Wow, this is amazing!"
}
]
}
Or will it be better to have a whole separate "replies" collection with a unique ID for each reply, and then "link" to it by ID in the posts collection?
I am not sure, but feels like the 1st way is more "NoSQL-like", and the 2nd way is the way I would go for MySQL.
Any inputs are welcome.
This is a typical data modeling question in MongoDB. Since you are planning to store just the _id of the user, the answer is definitely to embed the replies, because they are part of the post object.
If those replies can number in the hundreds or thousands and you are not going to show them by default (for example, you are going to have the users click to load those comments) then it would make more sense to store the replies in a separate collection.
Finally, if you need to store more than the user _id (such as the name) you have to think about maintaining the name in two places (here and in the user maintenance page) as you are duplicating data. This can be manageable or too much work. You have to decide.
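For the case where replies are kept in a separate collection and loaded only on request, a minimal Python sketch might look like this. It uses plain lists and dictionaries instead of a MongoDB driver; the `post` reference field, the `_id` values and the `skip`/`limit` parameters are assumptions for illustration.

```python
# Posts no longer carry a "replies" array.
posts = {
    "p1": {"text": "Some post...", "user": "3j219dj21h18skd2"},
}

# Each reply is its own document and points back to its post.
replies = [
    {"_id": "r1", "post": "p1", "user": "3j219dj200928smd81", "text": "Nice one!"},
    {"_id": "r2", "post": "p1", "user": "3j219dj2321md81zb3", "text": "Wow, this is amazing!"},
    {"_id": "r3", "post": "p2", "user": "3j219dj200928smd81", "text": "Unrelated reply"},
]


def load_post(post_id):
    """Fetch the post alone; cheap even when it has thousands of replies."""
    return posts[post_id]


def load_replies(post_id, skip=0, limit=20):
    """Fetch one page of replies when the user clicks to expand them."""
    matching = [r for r in replies if r["post"] == post_id]
    return matching[skip:skip + limit]


first_page = load_replies("p1")
second_reply_only = load_replies("p1", skip=1, limit=1)
```

With embedded replies the second function disappears, since the post document already contains everything.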

Normalized vs denormalized data in mongo

I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).
{
"content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by #psawers",
"author": {
"username": "TheNextWeb",
"id": "10876852",
"name": "The Next Web",
"photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
"url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
"description": "",
"serviceName": "twitter"
},
"attachments": [
{
"title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
"description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
"url": "http:\/\/t.co\/tbsSrVYneK",
"type": "link",
"photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
}
]
}
Posts are read often (we have a view with 4 tabs, each tab requires 24 posts to be shown). Currently we are indexing these lists in Redis, so querying 4x24 posts is as simple as fetching the lists from Redis (which returns a list of mongo ids) and querying posts with the ids.
Updates on the embedded author happen rarely (for example when the author changes his picture). The updates do not have to be instantaneous or even fast.
We're wondering if we should split up the author and the post into two different collections, so that a post would have a reference to its author instead of an embedded / duplicated author. Is a normalized data state preferred here, given that the author is currently duplicated for every post, resulting in a lot of duplicated data / extra bytes? Or should we continue with the de-normalized state?
As it seems that you have a few orders of magnitude more reads than writes, it probably makes little sense to split this data out into two collections. Especially with few updates, and since you need almost all of the author information while showing posts, one query is going to be faster than two. You also get data locality, so potentially you would need less data in memory as well, which should provide another benefit.
However, you can only really find out by benchmarking this with the amount of data that you'd be using in production.
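The cost that remains with the embedded author is the occasional fan-out update. A minimal Python sketch of that step, using an in-memory list in place of the posts collection, could look like this; the function name `update_embedded_author` and the shortened sample values are hypothetical.

```python
# Posts with an embedded (duplicated) author, trimmed to the relevant fields.
posts = [
    {"_id": 1, "content": "First post",
     "author": {"id": "10876852", "username": "TheNextWeb", "photo": "old.png"}},
    {"_id": 2, "content": "Second post",
     "author": {"id": "10876852", "username": "TheNextWeb", "photo": "old.png"}},
    {"_id": 3, "content": "Someone else",
     "author": {"id": "42", "username": "other", "photo": "other.png"}},
]


def update_embedded_author(posts, author_id, changes):
    """Apply changes to every embedded copy of one author.

    Returns the number of posts touched, which is the write cost
    paid for each (rare) author update.
    """
    touched = 0
    for post in posts:
        if post["author"]["id"] == author_id:
            post["author"].update(changes)
            touched += 1
    return touched


touched = update_embedded_author(posts, "10876852", {"photo": "new.png"})
```

Because the question states that these updates are rare and need not be fast, this loop can run in the background, while reads keep needing a single query per post.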

Structuring Nested Collections in RavenDB

I've got a question about how I structure my data in RavenDB. Like most, I'm coming from a relational database background and it feels slightly like I'm having to re-program my brain :).
Anyway. I have a utility which looks as below
{
"Name": "Gas",
"Calendars": [
{
"Name": "EFA"
},
{
"Name": "Calendar"
}
]
}
And I have a contract. Whilst creating the contract I need to first pick a utility type. Then based upon that I need to pick a Calendar type.
For example, I would pick Gas and then I would pick EFA. My question is: how should I store this information against the contract object? It almost feels like each of my calendars should have an id, but I'm guessing this is wrong? Or should I just be storing the text values?
Any advice on the correct way to do this would be appreciated.
You can have internal objects have ids in RavenDB, but those are application managed, not managed by RavenDB.
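A minimal sketch of what "application managed" could mean in practice, written in Python with plain dictionaries rather than the RavenDB client: the application assigns each calendar an id that is unique within its utility, and the contract stores the utility id plus the calendar id. The id scheme, the `utilities/1` key and the helper names are assumptions for illustration.

```python
utilities = {
    "utilities/1": {"Name": "Gas", "Calendars": []},
}


def add_calendar(utility, name):
    """Assign the next free id within this utility; the database is not involved."""
    next_id = max((c["Id"] for c in utility["Calendars"]), default=0) + 1
    calendar = {"Id": next_id, "Name": name}
    utility["Calendars"].append(calendar)
    return calendar


def resolve_calendar(contract, utilities):
    """Look up the utility and calendar names a contract refers to."""
    utility = utilities[contract["UtilityId"]]
    for calendar in utility["Calendars"]:
        if calendar["Id"] == contract["CalendarId"]:
            return utility["Name"], calendar["Name"]
    raise KeyError("calendar id not found in utility")


gas = utilities["utilities/1"]
efa = add_calendar(gas, "EFA")
add_calendar(gas, "Calendar")

# The contract stores ids, not the text values.
contract = {"UtilityId": "utilities/1", "CalendarId": efa["Id"]}
before_rename = resolve_calendar(contract, utilities)

efa["Name"] = "EFA Calendar"  # renaming does not break the contract
after_rename = resolve_calendar(contract, utilities)
```

Storing the text values instead would avoid the lookup but leave contracts holding a stale name after a rename.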

Schema design in MongoDB — to replicate data or not

I have a Share collection which stores a document for every time a user has shared a Link in my application. The schema looks like this:
{
"userId": String,
"linkId": String,
"dateCreated": Date
}
In my application I am making requests for these documents, but my application requires that the information referenced by the userId and linkId properties is fully resolved/populated/joined (not sure on the terminology) in order to display the information as needed. Thus, every request for a Share document results in a lookup for the corresponding User and Link documents. Furthermore, each Link has a parent Feed document which must also be looked up. This means I have some spaghetti-like code to perform each find operation in a series (3 in total). Yet, the application only needs some of the data found in these calls (one or two properties). That said, the application does need the entire Link document.
This is very slow, and I am wondering whether I should just be replicating the data in the Share document itself. In my head, this is fine because most of the data will not change, but some of it might (e.g. a User's username). This suggests a Share schema design like so:
{
"userId": String,
"user": {
"username": String,
"name": String
},
"linkId": String,
"link": {}, // all of the `Link` data
"feed": {
"title": String
},
"dateCreated": Date
}
What is the consensus on optimising data for the application with regards to this? Do you recommend that I replicate the data and write some glue code to ensure the replicated username gets updated if it changes (for example), or can you recommend a better solution (with details on why)? My other worry about replicating data in this manner is, what if I needed more data in the Share document further down the line?
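To make the trade-off concrete, here is a minimal Python sketch of building such a replicated Share at write time. It uses in-memory dictionaries in place of the User, Link and Feed collections; the `feedId` field on the link, the sample values and the function name `build_share` are assumptions, since the original schemas for those documents are not shown.

```python
import copy

users = {"u1": {"_id": "u1", "username": "alice", "name": "Alice", "email": "a@example.com"}}
feeds = {"f1": {"_id": "f1", "title": "Tech news"}}
links = {"l1": {"_id": "l1", "feedId": "f1", "url": "http://example.com/a", "title": "A"}}


def build_share(user_id, link_id, date_created):
    """Resolve the three lookups once, at write time, and store the result."""
    user = users[user_id]
    link = links[link_id]
    feed = feeds[link["feedId"]]
    return {
        "userId": user_id,
        "user": {"username": user["username"], "name": user["name"]},  # only what the view needs
        "linkId": link_id,
        "link": copy.deepcopy(link),                                   # the whole Link document
        "feed": {"title": feed["title"]},
        "dateCreated": date_created,
    }


share = build_share("u1", "l1", "2013-12-01")

# Reads now need no further lookups, but the copy does not follow later changes:
users["u1"]["username"] = "alice2"
stale_username = share["user"]["username"]
```

The last two lines show the gap the glue code would have to close: after a rename, every Share holding the old username needs to be updated separately.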