An application user can perform different tasks. Each kind of task has a unique identifier, and each user activity is recorded in the database.
So we have the following Event entity to keep in the database:
{
  "user_id": 1,
  "task_id": 2,
  "event_dt": [2013, 11, 15, 10, 0, 0, 0]
}
I need to know how many tasks of each type were performed by a particular user during a particular timeframe. The timeframe might be quite long (e.g. a rolling chart for the last year is requested).
For better understanding, the map function might be something like:
emit([doc.user_id, doc.task_id, doc.event_dt], 1)
and it might be queried using group_level=2 (or group_level=1 if only the number of events per user is needed).
Is it possible to answer the above question with a single view query using the map/reduce mechanism? Or do I have to use the list functionality (even though it may cause performance issues)?
Just use a flat key, [doc.user_id, doc.task_id].concat(doc.event_dt), since it simplifies the request and the grouping logic:
with group_level=1: you get the number of tasks per user for all time
with group_level=2: the count of each task id per user for all time
with group_level=3: the same as above, but per year
with group_level=4: the same as above, but also grouped by month
and so on, down to days, hours, minutes and seconds
For instance, the result for group_level=3 may be:
{"rows":[
{"key": ["user1", "task1", 2012], "value": 3},
{"key": ["user1", "task2", 2013], "value": 14},
{"key": ["user1", "task3", 2013], "value": 15},
{"key": ["user2", "task1", 2012], "value": 9},
{"key": ["user2", "task4", 2012], "value": 26},
{"key": ["user2", "task4", 2013], "value": 53},
{"key": ["user3", "task1", 2013], "value": 5}
]}
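As a concrete sketch of how such a view might be queried over HTTP with Python's requests library (the database name events, the design document stats and the view name by_user_task_date are made up for this example, and the view is assumed to pair the flat-key map with a _sum or _count reduce so that group_level queries work):

import requests

# Hypothetical names: an "events" database whose view emits the flat key
# [user_id, task_id, year, month, day, hour, minute, second] -> 1
# and reduces with _sum.
view = "http://localhost:5984/events/_design/stats/_view/by_user_task_date"

# Per-task, per-year counts for a single user (the group_level=3 result above).
params = {
    "group_level": 3,
    "startkey": '["user1"]',
    "endkey": '["user1", {}]',   # {} collates after every task id
}
for row in requests.get(view, params=params).json()["rows"]:
    print(row["key"], row["value"])

Narrowing to a specific task and timeframe works the same way, by extending startkey/endkey with the task id and date components, e.g. startkey=["user1", "task1", 2013] and endkey=["user1", "task1", 2014].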
While processing my input, I want to add a new field to the output JSON whose value should be auto-incremented.
For example:
Input list
{"name": "Amar", "age": 10}
{"name": "Akbar", "age": 20}
{"name": "Anthony", "age": 30}
Expected output after adding the serial number:
{"No": 1, "name": "Amar", "age": 10}
{"No": 2, "name": "Akbar", "age": 20}
{"No": 3, "name": "Anthony", "age": 30}
Beam processes elements in parallel and does not guarantee the ordering of elements.
However, if you still want to assign a counter, you can use state in Apache Beam to maintain it. Reference: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
The scope of a state cell is a key + window, so this works fine for assigning independent counters to different keys.
However, if you have a small number of keys and windows, this can limit the parallelism of your pipeline.
Also, such a counter is rarely useful in distributed data processing, so it would be great if you could describe your use case a bit more.
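For illustration, here is a minimal sketch using the Python SDK's stateful processing support; the class name, the artificial "all" key and the output field name No are chosen for this example only. The counter state is scoped to the key, so funnelling every record through a single key yields one global sequence at the cost of parallelism, exactly as noted above:

import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec

class AssignSerialNo(beam.DoFn):
    # Per-key running counter; read() returns 0 before anything has been added.
    COUNTER = CombiningValueStateSpec('counter', sum)

    def process(self, element, counter=beam.DoFn.StateParam(COUNTER)):
        _, record = element                 # stateful DoFns consume (key, value) pairs
        current = counter.read()
        counter.add(1)
        yield dict(record, No=current + 1)

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create([{"name": "Amar", "age": 10},
                    {"name": "Akbar", "age": 20},
                    {"name": "Anthony", "age": 30}])
     | beam.Map(lambda r: ("all", r))       # one key => one counter, no parallelism
     | beam.ParDo(AssignSerialNo())
     | beam.Map(print))

Note that the assigned numbers follow processing order, not input order, since ordering is not guaranteed.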
Let's assume we have the following collections:
Users
{
  "id": MongoId,
  "username": "jsloth",
  "first_name": "John",
  "last_name": "Sloth",
  "display_name": "John Sloth"
}
Places
{
  "id": MongoId,
  "name": "Conference Room",
  "description": "Some longer description of this place"
}
Meetings
{
  "id": MongoId,
  "name": "Very important meeting",
  "place": <?>,
  "timestamp": "1506493396",
  "created_by": <?>
}
Later on, we want to return (e.g. from a REST web service) a list of upcoming events like this:
[
  {
    "id": MongoId(Meetings),
    "name": "Very important meeting",
    "created_by": {
      "id": MongoId(Users),
      "display_name": "John Sloth"
    },
    "place": {
      "id": MongoId(Places),
      "name": "Conference Room"
    }
  },
  ...
]
It's important to return the basic information that needs to be displayed on the main page in the web UI (so no additional calls are needed to render the table). That's why each entry contains the display_name of the user who created it and the name of the place. I think that's a pretty common scenario.
Now my question is: how should I store this information in the db (the question-mark values in the Meeting document)? I see 2 options:
1) Store references to other collections:
place: MongoId(Places)
(+) data is always consistent
(-) additional calls to db have to be made in order to construct the response
2) Denormalize data:
"place": {
"id": MongoId(Places),
"name": "Conference room",
}
(+) no need for additional calls (response can be constructed based on one document)
(-) data must be updated each time related documents are modified
What is the proper way of dealing with such a scenario?
If I use option 1), how should I query the other documents? Asking for each related document separately seems like overkill. How about getting the last 20 meetings, aggregating the list of related document ids, and then performing a query like db.users.find({_id: { $in: <id list> }})?
If I go for option 2), how should I keep the data in sync?
Thanks in advance for any advice!
You can keep the DB model you already have and still do only a single query, as MongoDB introduced the $lookup aggregation stage in version 3.2. It is similar to a join in an RDBMS.
$lookup
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. The $lookup stage does an equality match between a field from the input documents with a field from the documents of the “joined” collection.
So instead of denormalizing the data into the Meetings documents, just store the referenced document's ID and let $lookup pull in the rest at query time.
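For example, a rough pymongo sketch of building the response shape above in one aggregation round trip, assuming the collections are named meetings, users and places, that place and created_by hold the referenced documents' _id values, and a database named mydb:

from pymongo import MongoClient

db = MongoClient().mydb

upcoming = db.meetings.aggregate([
    {"$sort": {"timestamp": 1}},
    # Each $lookup matches the stored id against the other collection's _id.
    {"$lookup": {"from": "users", "localField": "created_by",
                 "foreignField": "_id", "as": "created_by"}},
    {"$lookup": {"from": "places", "localField": "place",
                 "foreignField": "_id", "as": "place"}},
    # $lookup produces arrays, so flatten them back to single embedded objects.
    {"$unwind": "$created_by"},
    {"$unwind": "$place"},
    {"$project": {"name": 1,
                  "created_by._id": 1, "created_by.display_name": 1,
                  "place._id": 1, "place.name": 1}},
])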
For example, I have some sensors, and each has a unique SN.
Then, I have some history data (not sorted) in this format: (SN, timestamp, value).
I want to maintain each sensor's latest status in MongoDB: {"sn": xxx, "timestamp": xxx, "value": xxx, "installed_time": xxx}. Installed times might be added manually later.
So currently, my code is like
if db.sensors.find_one({"sn": SN, "timestamp": {"$lt": timestamp}}):
    db.sensors.update_one({"sn": SN}, {"$set": {"timestamp": timestamp, "value": value}}, upsert=True)
I'd like to know whether I can combine these into one operation.
I tried to do a conditional upsert:
db.sensors.update_one({"sn": SN, "timestamp": {"$lt": timestamp}}, {"$set": {"timestamp": timestamp, "value": value}}, upsert=True). The problem is that I end up with multiple documents with the same SN.
For example, let's start with an empty collection. First, (1, 3, 1) is processed and {"sn": 1, "timestamp": 3, "value": 1} is inserted. Then, when (1, 1, 2) is processed, it creates another document {"sn": 1, "timestamp": 1, "value": 2}. The intended behaviour is to just ignore this data point.
I also tried document replacement with db.sensors.replace_one({"sn": SN, "timestamp": {"$lt": timestamp}}, {"sn": SN, "timestamp": timestamp, "value": value}, upsert=True). This overwrites other fields like installed_time.
I'm trying to store a multitude of documents which are doubly linked, i.e. they can have a predecessor and a successor. Since the collection consists of different documents, I'm not sure whether I can create a feasible index on it:
{"_id": "1234", "title": "Document1", "content":"...", "next": "1236"}
{"_id": "1235", "title": "Document2", "content":"...", "next": "1238"}
{"_id": "1236", "title": "Document1a", "content":"...", "prev": "1234"}
{"_id": "1237", "title": "Document2a", "content":"...", "prev": "1235", "next": "1238"}
{"_id": "1238", "title": "Document2b", "content":"...", "prev": "1237", "next": "1239"}
...
Since I'll need the whole 'history' of a document, including prev and next documents, I guess I'll have to perform a multitude of queries depending on the size of the list?
Any suggestions on how to create a performant index? A different structure for storing doubly linked lists would also be interesting.
If you want to optimize reads, you can use arrays to store the previous and next documents.
{
  "_id": "1237",
  "title": "Document1",
  "content": "...",
  "next": "1238",
  "prev": "1235",
  "parents": ["1000", "1235"],
  "children": ["1238", "1239"]
}
You can then get all the documents whose _id is in either the children or the parents array. This solution is good if you only need the parents or children of a document. To get the whole list, though, you can't efficiently use indexes with $or and two $in operators.
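As a sketch of the read side for that first layout (assuming a documents collection in a database named mydb), fetching a document's immediate relatives is a single $in lookup resolved via the _id index:

from pymongo import MongoClient

db = MongoClient().mydb

doc = db.documents.find_one({"_id": "1237"})
# Immediate parents and children only, in one round trip.
relatives = list(db.documents.find(
    {"_id": {"$in": doc["parents"] + doc["children"]}}))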
An alternative, and probably a better solution, is to store the entire list for each document, i.e. children and parents in one array:
{
  "_id": "1237",
  "title": "Document1",
  "content": "...",
  "next": "1238",
  "prev": "1235",
  "list_ids": ["1000", "1235", "1238", "1239", "1237"]
}
That way you can have an index on list_ids and get all the documents with a simple $in query that will be fast.
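A rough pymongo sketch of that idea (again assuming a documents collection in a database named mydb): index the array field once, then a single query on it returns every member of the list, because each member carries the full list_ids.

from pymongo import MongoClient

db = MongoClient().mydb
db.documents.create_index("list_ids")   # multikey index over the array

# Every document whose list_ids contains "1237", i.e. the whole chain,
# in one indexed query.
chain = list(db.documents.find({"list_ids": {"$in": ["1237"]}}))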
The problem with both solutions is that you will need to update all related documents when you add a new document, so this is probably not a good approach if you're going to have a write-heavy app.
I am developing a small app which will store information on users, accounts and transactions. Users will have many accounts (probably fewer than 10) and accounts will have many transactions (perhaps thousands). Reading the docs, it seems to suggest that embedding as follows is the way to go...
{
  "username": "joe",
  "accounts": [
    {
      "name": "account1",
      "transactions": [
        { "date": "2013-08-06", "desc": "transaction1", "amount": "123.45" },
        { "date": "2013-08-07", "desc": "transaction2", "amount": "123.45" },
        { "date": "2013-08-08", "desc": "transaction3", "amount": "123.45" }
      ]
    },
    {
      "name": "account2",
      "transactions": [
        { "date": "2013-08-06", "desc": "transaction1", "amount": "123.45" },
        { "date": "2013-08-07", "desc": "transaction2", "amount": "123.45" },
        { "date": "2013-08-08", "desc": "transaction3", "amount": "123.45" }
      ]
    }
  ]
}
My question is: since the list of transactions will grow to perhaps thousands within the document, will the data become fragmented and slow down performance? Would I be better off having a document to store the users and the accounts, which will not grow as big, and then a separate collection to store transactions that reference the accounts? Or is there a better way?
This is not the way to go. You have a lot of transactions, and you don't know how many you will get. Instead of this, you should store them like:
{ "username": "joe", "name": "account1", "date": "2013-08-06", "desc": "transaction1", "amount": "123.45" },
{ "username": "joe", "name": "account1", "date": "2013-08-07", "desc": "transaction2", "amount": "123.45" },
{ "username": "joe", "name": "account1", "date": "2013-08-08", "desc": "transaction3", "amount": "123.45" },
{ "username": "joe", "name": "account2", "date": "2013-08-06", "desc": "transaction1", "amount": "123.45" },
{ "username": "joe", "name": "account2", "date": "2013-08-07", "desc": "transaction2", "amount": "123.45" },
{ "username": "joe", "name": "account2", "date": "2013-08-08", "desc": "transaction3", "amount": "123.45" }
In a NoSQL database like MongoDB you shouldn't be afraid to denormalise. As you noticed, I haven't even bothered with a separate collection for users. If your users have more information that you will have to show with each transaction, you might want to consider including that information as well.
If you need to search on, or select by, any of those fields, then don't forget to create indexes, for example:
// look up all transactions for an account
db.transactions.ensureIndex( { username: 1, name: 1 } );
and:
// look up all transactions for "2013-08-06"
db.transactions.ensureIndex( { date: 1 } );
etc.
There are a lot of advantages to duplicating data. With a schema like the above, you can have as many transactions as you want and you will never get any fragmentation, as documents never change - you only add new ones. This also increases write performance and makes it a lot easier to do other queries.
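For instance, a couple of typical reads against that schema in pymongo (the transactions collection name matches the index examples above; the database name mydb is assumed):

from pymongo import MongoClient

db = MongoClient().mydb

# All transactions for one account, answered via the { username: 1, name: 1 } index.
account_txns = db.transactions.find({"username": "joe", "name": "account1"})

# Everything that happened on one day, answered via the { date: 1 } index.
daily_txns = db.transactions.find({"date": "2013-08-06"})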
Alternative
An alternative might be to store username/name in a collection and only use its ID with the transactions:
Accounts:
{
"username": "joe",
"name": "account1",
"account_id": 42,
}
Transactions:
{
"account_id": 42,
"date": "2013-08-06",
"desc": "transaction1",
"amount": "123.45"
},
This creates smaller transaction documents, but it does mean you have to do two queries to also get user information.
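A sketch of those two queries in pymongo, assuming accounts and transactions collections in a database named mydb:

from pymongo import MongoClient

db = MongoClient().mydb

# First resolve the account, then fetch its transactions by account_id.
account = db.accounts.find_one({"username": "joe", "name": "account1"})
txns = db.transactions.find({"account_id": account["account_id"]})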
Since the list of transactions will grow to perhaps thousands within the document, will the data become fragmented and slow down performance?
Almost certainly. In fact, I would be surprised if, over a period of years, transactions for a single account only reached into the thousands rather than tens of thousands.
Add to that the level of fragmentation you will see from the constantly growing document, and you could end up with serious problems, if not run out of root document space (the limit being 16 MB). In fact, given that you store all of a person's accounts in one document, I would say you run a high risk of filling up a document within about 2 years.
I would reference this relationship.
I would separate the transactions into a different collection. It seems like the data and update patterns for users and transactions are quite different. If transactions are constantly added to the user document and cause it to grow all the time, it will be moved around a lot in the Mongo data files. So yes, it has a performance impact (fragmentation, more IO, more work for Mongo).
Also, array operation performance sometimes degrades on big arrays in documents, so holding thousands of objects in an array might not be a good idea (it depends on what you do with them).
You should consider creating indexes using the ensureIndex() function; it should reduce the risk of performance issues.
The earlier you add these, the better you'll understand how the collection should be structured.
I haven't been using Mongo for too long, but I haven't come across any issues (not yet anyway) with data becoming fragmented.
Edit: If you intend to use this for multi-object commits, Mongo doesn't support rollbacks. You need to use the 64-bit version to allow journaling and make your transactions durable.