Logging file access with MongoDB

I am designing my first MongoDB (and first NoSQL) database and would like to store information about files in a collection. As part of each file document, I would like to store a log of file accesses (both reads and writes).
I was considering creating an array of log messages as part of the document:
{
"filename": "some_file_name",
"logs" : [
{ "timestamp": "2012-08-27 11:40:45", "user": "joe", "access": "read" },
{ "timestamp": "2012-08-27 11:41:01", "user": "mary", "access": "write" },
{ "timestamp": "2012-08-27 11:43:23", "user": "joe", "access": "read" }
]
}
Each log message will contain a timestamp, the type of access, and the username of the person accessing the file. I figured that this would allow very quick access to the logs for a particular file, probably the most common operation that will be performed with the logs.
I know that MongoDB has a 16 MB document size limit. I imagine that files that are accessed very frequently could push against this limit.
Is there a better way to design the NoSQL schema for this type of logging?

Let's first try to calculate the average size of one log record:
The field name "timestamp" = 18 bytes and its value = 8 bytes; the field name "user" = 8 bytes and its value = 20 bytes (assuming usernames average around 10 characters); the field name "access" = 12 bytes and its value = 10 bytes. That totals roughly 76 bytes per record, so you can fit ~220,000 log records in one document.
Note that about half of the physical space is consumed by the field names themselves. If you shorten them (timestamp = t, user = u, access = a), you will be able to store ~440,000 log items.
I think that is enough for most systems. In my projects I always try to embed rather than create a separate collection, because embedding is a good way to achieve performance with MongoDB.
In the future you can move your log records into a separate collection. For performance, you can also denormalize the last 30 or so log records into the file document for fast retrieval, in addition to the logs collection.
If you go with one collection, make sure you are not loading the logs when you do not need them (you can include/exclude fields in MongoDB), and use $slice for paging, as in the sketch below.
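For example, a minimal mongo shell sketch, assuming the collection is named files and using the field names from the question (newLogEntry is a hypothetical variable; the $slice modifier to $push requires MongoDB 2.4+):
// Fetch file metadata without loading the (potentially large) logs array
db.files.find({filename: "some_file_name"}, {logs: 0})

// Page through the logs: skip the first 20 entries, return the next 10
db.files.find({filename: "some_file_name"}, {logs: {$slice: [20, 10]}})

// Denormalize: keep only the 30 most recent entries in the file document
db.files.update(
    {filename: "some_file_name"},
    {$push: {logs: {$each: [newLogEntry], $slice: -30}}}
)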
And one last thing: Enjoy mongo!

If you think the document size limit will become an issue, there are a few alternatives.
The obvious one is to simply create a new document for each log entry.
So you would have a collection "logs" with this schema:
{
"filename": "some_file_name",
"timestamp": "2012-08-27 11:40:45",
"user": "joe",
"access": "read"
}
A query to find which files "joe" read would be something like:
db.logs.find({user: "joe", access: "read"})
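As that collection grows, these lookups would want supporting indexes; a minimal sketch (the exact index shapes are assumptions based on the queries discussed above):
// "which files did joe read?"
db.logs.createIndex({user: 1, access: 1})
// all log entries for one file, newest first
db.logs.createIndex({filename: 1, timestamp: -1})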

Inserting multiple key-value pairs under a single _id in Cloudant DB at various times?

My requirement is to receive JSON pairs from an MQTT subscriber at different times and store them under a single _id in Cloudant, but I'm running into trouble: when I try to insert a new JSON pair into an existing _id, it simply replaces the old one. I need at least 10 JSON pairs under one _id, inserted at different times.
First, you should revisit your architectural decision to update a particular document multiple times. In general this is discouraged, though it depends on your application. Instead, you could insert each new piece of information as a separate document and then use a map-reduce view to reflect the current state of your application.
For example (I'm going to assume that you have multiple "devices", each with some kind of unique identifier, that need to add data to a Cloudant DB):
PUT
{
"info_a":"data a",
"device_id":123
}
{
"info_b":"data b",
"device_id":123
}
{
"info_a":"message a"
"device_id":1234
}
Then you'll need a map function like
_design/device/_view/state
function (doc) {
    emit(doc.device_id, 1);
}
Then you can GET the results of that view to see all of the "info_X" data that is associated with the particular device.
GET account.cloudant.com/databasename/_design/device/_view/state
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1},
{"id":"eaa710a5fa1ff4ba6156c997ddf6099b","key":1234,"value":1}
]}
Then you can use the query parameters to control the output, for example
GET account.cloudant.com/databasename/_design/device/_view/state?key=123&include_docs=true
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1,"doc":
{"_id":"28324b34907981ba972937f53113ac3f",
"_rev":"1-bac5dd92a502cb984ea4db65eb41feec",
"info_b":"data b",
"device_id":123}
},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1,"doc":
{"_id":"d50553d206d722b960fb176f11841974",
"_rev":"1-a2a6fea8704dfc0a0d26c3a7500ccc10",
"info_a":"data a",
"device_id":123}}
]}
And now you have the complete state for device_id:123.
Timing
Another issue is the rate at which you're updating your documents.
The bottom-line recommendation is that if you are only updating the document about once per minute or less frequently, it can be reasonable for your application to update a single document. That is, you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database. You must make sure that you are providing the most recent _rev of that document, and you should also check for conflicts that could occur if the document is being updated by multiple devices.
If you are acquiring new data for a particular device at a high rate, you'll likely run into conflicts very frequently, because Cloudant is a distributed document store. In this case, you should follow something like the example I gave above.
Example flow for the second approach outlined by @gadamcox, for use cases where document updates are not required very frequently:
[...] you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database.
Your application first fetches the existing document by id: (https://docs.cloudant.com/document.html#read)
GET /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1"]
}
Then your application updates the document in memory
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1","2"]
}
and saves it in the database by specifying the _id and _rev (https://docs.cloudant.com/document.html#update)
PUT /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No":["1","2"]
}
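A minimal Node.js sketch of that read-modify-write cycle (an assumption-laden illustration, not Cloudant's official client: it assumes Node 18+ for the built-in fetch, a hypothetical CLOUDANT_URL environment variable pointing at https://account.cloudant.com/databasename, and omits authentication for brevity):
const BASE_URL = process.env.CLOUDANT_URL; // hypothetical

async function appendValue(docId, value) {
    // 1. GET the full document; it carries the latest _rev
    const doc = await (await fetch(`${BASE_URL}/${docId}`)).json();

    // 2. Modify it in memory (the "No" array follows the example above)
    doc.No.push(value);

    // 3. PUT it back; the _rev in the body tells Cloudant which
    //    revision this update is based on
    const res = await fetch(`${BASE_URL}/${docId}`, {
        method: "PUT",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify(doc),
    });

    // 409 Conflict means another writer updated the doc first:
    // re-read and retry, exactly the conflict case described above
    if (res.status === 409) return appendValue(docId, value);
    return res.json();
}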

Is it possible to process objects in a Google Cloud Storage bucket in FIFO order?

In my web app, I need to pull objects from GCS one by one and process them.
So the question is,
"How do I send a request to gcs to get the next unprocessed object?"
What I’d like to do is to simply rely on the sort order provided by gcs and then just process the objects in this sorted list one by one.
That way, I only need to keep track of the last processed item in my app.
I’d like to rely on the sort order provided by the timeCreated timestamp on each individual object in the bucket.
When I query my bucket via the JSON API, I notice that the objects are returned sorted by timeCreated from oldest to newest.
For example, this query ...
returns this list ...
{
"items": [
{
"name": "cars_train/00001.jpg",
"timeCreated": "2016-03-23T19:19:47.506Z"
},
{
"name": "cars_train/00002.jpg",
"timeCreated": "2016-03-23T19:19:49.320Z"
},
{
"name": "cars_train/00003.jpg",
"timeCreated": "2016-03-23T19:19:50.228Z"
},
{
"name": "cars_train/00004.jpg",
"timeCreated": "2016-03-23T19:19:51.377Z"
},
{
"name": "cars_train/00005.jpg",
"timeCreated": "2016-03-23T19:19:51.778Z"
},
{
"name": "cars_train/00006.jpg",
"timeCreated": "2016-03-23T19:19:52.817Z"
},
{
"name": "cars_train/00007.jpg",
"timeCreated": "2016-03-23T19:19:53.868Z"
},
{
"name": "cars_train/00008.jpg",
"timeCreated": "2016-03-23T19:19:54.925Z"
},
{
"name": "cars_train/00009.jpg",
"timeCreated": "2016-03-23T19:19:58.426Z"
},
{
"name": "cars_train/00010.jpg",
"timeCreated": "2016-03-23T19:19:59.323Z"
}
]
}
This sort order by timeCreated is exactly what I need, though I’m not certain if I can rely on this always being true?
So, I could code my app to process this list by simply searching for the first timeCreated value greater than that of the last object processed.
The problem is this list can be very large and searching through a huge list every single time the user presses the NEXT button is too computationally expensive.
I would like to be able to specify in my query to gcs to filter the list so that I return only the single item that I need.
The API does allow me to set the maxResults returned to a value of 1.
However, I do not see an option that would allow me to return only objects whose timeCreated value is greater than the value I specified.
I think what I am trying to do is probably fairly common, so I’m guessing that a solution may exist for this problem.
One workaround for this problem is to physically move each object that has been processed to another bucket.
That way the first item in the list would always be the next one to process, and I could simply send the request with maxResults=1.
But this adds complexity because it forces me to have two separate buckets for every project instead of one.
Is there a way to filter this list of objects to only include ones whose timeCreated date is above a specified value?
In MySQL, it might be something like ...
SELECT name
FROM bucket
WHERE timeCreated > X
ORDER BY timeCreated
LIMIT 1
You can configure object change notifications on the bucket, and get a notification each time a new object arrives. That would allow you to process new objects without scanning a long listing each time. It also avoids the problem that listing a bucket is only eventually consistent (so, recently uploaded objects may not show up immediately when you list objects; I don't know if that's a problem for your app).
Details about object change notification are documented at https://cloud.google.com/storage/docs/object-change-notification.
Object listing in GCS is not sorted by timeCreated. Object listing results are always in alphabetical order. In your example, the two orderings merely happen to coincide.
If you want to get a list of objects in the order they were uploaded, you must ensure that each object has a name alphabetically later than the name of any object uploaded before it. Even then, however, you must take care, as object listing is eventually consistent, which means that objects you upload may not immediately show up in a listing.
If some ordering of objects is critically important, it would be a good idea to maintain a separate index of the objects and their timestamps in a separate data structure, perhaps populated via object change notifications as Mike suggested.
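For example, a minimal mongo shell sketch of such an index collection (the collection name and the lastProcessed variable are hypothetical; the sample entry reuses the first object from the listing above, and the query mirrors the SQL from the question):
// One document per GCS object, inserted when a change notification arrives
db.gcs_objects.insert({
    name: "cars_train/00001.jpg",
    timeCreated: ISODate("2016-03-23T19:19:47.506Z")
})
db.gcs_objects.createIndex({timeCreated: 1})

// SELECT name FROM bucket WHERE timeCreated > X ORDER BY timeCreated LIMIT 1
db.gcs_objects.find({timeCreated: {$gt: lastProcessed}})
    .sort({timeCreated: 1})
    .limit(1)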

How to store a user record on another collection in Meteor

I'm trying to find how to associate a user with another collection document in Meteor and am unsure about how to do this.
My objective is to find the most efficient and future proof method for storing this information, although I am now aware this can be seen as somewhat subjective.
In my example, I am using a "Message" collection and am storing user ids on this document as both "sender" and "recipients", recipients being an array of user ids.
When I want to display information about the sender/recipients of this message, should I use helpers to output certain data? Or add things like senderName and senderAvatar onto the document itself when it gets created? Or am I missing another way of associating a user with another object that is perhaps more efficient?
Here's a JSON example:
Option 1 - Simply storing user ids on the other object (Message)
{
"_id": "boDNs36xzLw7eLLhx",
"sender": "8jpS96b4T65g5ARug",
"recipient": "4Pa5i5vQ2gDtYQBDP",
"message": "A new message.",
"createdAt": "2015-10-22T21:18:18.291Z"
}
Option 2 - Storing more information on the document itself
{
"_id": "boDNs36xzLw7eLLhx",
"sender": "8jpS96b4T65g5ARug",
"senderName": "Joe Bloggs",
"senderAvatar": "http://myimg.com",
"recipient": "4Pa5i5vQ2gDtYQBDP",
"recipientName": "Bill Bloggs",
"recipientAvatar": "http://myotherimg.com",
"message": "A new message.",
"createdAt": "2015-10-22T21:18:18.291Z"
}
In my opinion you should stick with the first option and then fetch user data from the users collection by user id.
Two main arguments:
The second option duplicates data. If users exchange 100 messages, there are 101 identical "Joe Bloggs", "Bill Bloggs", etc. strings in the database (100 copies in the messages plus the original in the users collection).
If one of the users changes their name, it either doesn't change in the messages, or you have to update every message sent and received by this user, which means a massive and unnecessary database load.
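A minimal template-helper sketch of the first option (the template name is an assumption, as is storing the display name under profile; it also assumes the relevant user documents are published to the client):
// client-side: resolve the sender's user document on demand
Template.message.helpers({
    sender: function () {
        // `this` is the message document in the template's data context
        return Meteor.users.findOne(this.sender);
    }
});
The template can then render something like {{#with sender}}{{profile.name}}{{/with}}, so a rename only ever touches the users collection.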

Best way to represent multilingual database on mongodb

I have a MySQL database to support a multilingual website where the data is represented as the following:
table1
    id
    is_active
    created
table1_lang
    table1_id
    name
    surname
    address
What's the best way to achieve the same on mongo database?
You can design a schema where you either reference or embed documents. Let's look first at the option of embedded documents. With your above application, you might store the information in a document as follows:
// db.table1 schema
{
"_id": 3, // table1_id
"is_active": true,
"created": ISODate("2015-04-07T16:00:30.798Z"),
"lang": [
{
"name": "foo",
"surname": "bar",
"address": "xxx"
},
{
"name": "abc",
"surname": "def",
"address": "xyz"
}
]
}
In the example schema above, you have essentially embedded the table1_lang information within the main table1 document. This design has its merits, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that spinning disks will take less time to seek to a particular location on the disk. If your application frequently accesses table1 information along with the table1_lang data, then you'll almost certainly want to go the embedded route. The other advantage of embedded documents is atomicity and isolation when writing data. To illustrate this, say you want to remove a document whose lang key "name" has the value "foo"; this can be done with one single (atomic) operation:
db.table1.remove({"lang.name": "foo"});
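If instead you only wanted to remove the matching embedded lang entry while keeping the rest of the document, that is also a single atomic operation, e.g. with $pull:
db.table1.update(
    {"lang.name": "foo"},
    {$pull: {lang: {name: "foo"}}}
)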
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, specifically Model One-to-Many Relationships with Embedded Documents
The other design option is referencing documents where you follow a normalized schema. For example:
// db.table1 schema
{
"_id": 3
"is_active": true
"created": ISODate("2015-04-07T16:00:30.798Z")
}
// db.table1_lang schema
/*
1
*/
{
"_id": 1,
"table1_id": 3,
"name": "foo",
"surname": "bar",
"address": "xxx"
}
/*
2
*/
{
"_id": 2,
"table1_id": 3,
"name": "abc",
"surname": "def",
"address": "xyz"
}
The above approach gives increased flexibility in performing queries. For instance, retrieving all child table1_lang documents for the main parent entity table1 with id 3 is straightforward: simply query the table1_lang collection:
db.table1_lang.find({"table1_id": 3});
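At scale you would typically back that query with an index on the referencing field (a reasonable assumption for this access pattern, not something the original schema mandates):
db.table1_lang.createIndex({table1_id: 1})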
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of table1_lang documents per table1 entity, embedding has significant drawbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16 MB.
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model will be appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the two options below would be a good design? We would want to query based on userID as well as responseID.
Having 2 Collections one for the user information and another to store the responses along with user ID. Each response is stored as a document so we will have 1 billion documents.
User Collection
{
"userid": "userid1",
"password": "xyz",
...
"City": "New York"
},
{
"userid": "userid2",
"password": "abc",
...
"City": "New York"
}
responses Collection
{
"userid": "userid1",
"responseID": "responseID1",
"response" : "xyz"
},
{
"userid": "userid1",
"responseID": "responseID2",
"response" : "abc"
},
{
"userid": "userid2",
"responseID": "responseID3",
"response" : "mno"
}
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
"userid": "userid1",
"responseID1": "xyz",
"responseID2": "abc",
...
"responseIDN": "mno",
"city": "New York"
}
If you went with your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage:
{
"userId": n,
"city": "foo"
"responses": {
"responseId1": "response message 1",
"responseId2": "response message 2"
}
}
As for which would give better performance, run a few benchmark tests.
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design, but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as a document grows and outgrows the space allocated for it on disk, MongoDB must move it to a new location to accommodate the new size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users: you cannot return only the matching embedded documents. Using the positional operator, however, it is possible to get the first matching embedded document, as sketched below.
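For illustration, a positional-operator projection in the mongo shell, assuming a hypothetical schema in which responses are stored as an array of embedded documents (not the flat keys of option 2):
// Return only the first embedded response whose responseID matches
db.users.find(
    {"responses.responseID": "responseID2"},
    {"responses.$": 1}
)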
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection. A document per day, per user, per ...whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you would query for them; for example, you might bucket responses per user per day, as sketched below. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
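A sketch of that per-user-per-day bucketing (the collection layout and field names are assumptions):
// One bucket document per user per day; upsert creates it on first write
db.responses.update(
    {userid: "userid1", day: "2016-03-23"},
    {$push: {responses: {responseID: "responseID1", response: "xyz"}}},
    {upsert: true}
)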
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get that count by running an aggregate query on the full dataset.
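A minimal sketch of such a pre-aggregated counter (the responseCount field is an assumption):
// Bump the user's running total whenever a response is recorded
db.users.update(
    {userid: "userid1"},
    {$inc: {responseCount: 1}}
)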