MongoDB for BI use

We want to use MongoDB for some BI processing, and we don't know which schema is best suited to get the job done. Imagine we have 100,000 records describing the sales of a certain network. Do we have to put all of this data into one array, like this?
{
  "_id" : ObjectId(),
  "dataset" : "set1",
  "values" : [
    { "property" : "value_1" },
    ...
    { "property" : "value_100000" }
  ]
}
Or one document for each entry, like this?
{"_id: ObjectId(), "property":"value_1"}
.
.
.
{"_id: ObjectId(), "property":"value_100000"}
Or, simply, what is the ideal way to design the schema for this use case?

Embedding is better for:
Small subdocuments
Data that does not change regularly
When eventual consistency is acceptable
Documents that grow by a small amount
Data that you'll often need to perform a second query to fetch
Fast read speed
References are better for:
Large subdocuments
Volatile data
When immediate consistency is necessary
Documents that grow by a large amount
Data that you'll often exclude from the results
Fast write speed
- From "MongoDB: The Definitive Guide"
A reference looks something like:
{ '_id' : ObjectId("123"), 'cousin' : ObjectId("456") }
The document refers to its cousin through the cousin's ObjectId, somewhat like a foreign key in SQL.
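To make the reference approach concrete, here is a minimal sketch, assuming a hypothetical people collection where each document may point to a cousin by ObjectId; the reference can be resolved either with a second query or, from MongoDB 3.2 onwards, with $lookup:
// Hypothetical "people" collection; "cousin" holds the ObjectId of another document.
var person = db.people.findOne();                       // any document with a "cousin" reference
var cousin = db.people.findOne({ _id: person.cousin }); // second query follows the reference

// Alternatively, resolve it server-side in an aggregation pipeline (MongoDB 3.2+).
db.people.aggregate([
  { $lookup: {
      from: "people",         // collection holding the referenced documents
      localField: "cousin",   // field containing the stored ObjectId
      foreignField: "_id",    // field it points at
      as: "cousinDoc"         // matched documents are returned in this array
  } }
])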

MongoDB : Arrays or not?

I am storing some data in a MongoDB database and I'm not sure about the structure I should use. It's about IoT sensors that send a value (temperature, pressure, etc.) at regular intervals. I want to store in one collection per sensor (the collection name would be the sensor name) all the values from that sensor over time (I thought about an array), plus the sensor type (like temperature).
Here is an example:
{
  history : [
    { date : ISODate("2016-02-01T11:23:21.321Z"), value : 10.232216 },
    { date : ISODate("2016-02-01T11:26:41.314Z"), value : 10.164892 }
  ],
  type : "temperature"
}
But my problem is that I want to query the database to get the history as a "list" of documents, each one with the date and the value.
On the other hand, I want to add a new value to the history each time a new one arrives.
Thanks
Store every reading in a readings collection like:
{
  date : ISODate("2016-02-01T11:23:21.321Z"),
  value : 10.232216,
  type : "temperature",
  "sensor-name" : "sensor-1"
}
This way you can access readings by type, date, value AND sensor. There is no reason why you would need to create a collection for each sensor.
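As a rough sketch of that layout (the collection name readings and the index are assumptions, not from the answer), inserting and querying readings could look like this:
// One document per reading, all sensors in the same "readings" collection.
db.readings.insertOne({
  date: ISODate("2016-02-01T11:23:21.321Z"),
  value: 10.232216,
  type: "temperature",
  "sensor-name": "sensor-1"
});

// History of one sensor for a given day, oldest reading first.
db.readings.find({
  "sensor-name": "sensor-1",
  type: "temperature",
  date: { $gte: ISODate("2016-02-01T00:00:00Z"), $lt: ISODate("2016-02-02T00:00:00Z") }
}).sort({ date: 1 });

// A compound index keeps these queries from scanning the whole collection.
db.readings.createIndex({ "sensor-name": 1, type: 1, date: 1 });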
Ting Sun's answer is absolutely appropriate: just store each measurement reading as a separate document in a collection. It's up to you whether you arrange a separate collection for each sensor, although putting them all into the same collection seems more obvious.
In particular, you should not store items - in your case measurement readings - whose number grows essentially without bound, or could become "very large", in an embedded array of another MongoDB document. This is because:
The size of an individual document is limited to 16 MB (MongoDB version 3.2).
Frequently recurring modifications of the parent document can be inefficient for the database engine's memory management.
Furthermore, queries for individual embedded items/measurements are inefficient and more difficult to implement, because you would actually have to query for the entire parent document.
How you divide readings into collections is completely up to you, whether one collection or multiple. And there are likely good arguments to be had on both sides.
However, regarding arrays: just remember that sensor readings are unbounded. That is, they are possibly infinite in nature - just a flow of readings. MongoDB documents are limited in size (currently 16 MB). With unbounded arrays you will eventually hit this limit, which will result in failed updates and require you to alter your storage architecture to accommodate your sensor readings.
So you either need to devise a scheme that splits the array data across multiple documents (to avoid document-size-limit issues), or avoid arrays and store readings in separate documents.
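If you still prefer an array-based layout, one compromise worth sketching is the bucket pattern; the bucket size of 200 readings and the collection name below are illustrative assumptions, not something from the answers above:
// Hypothetical "sensor_buckets" collection: at most 200 readings per document.
// The filter only matches a bucket that still has room; with upsert, a new bucket
// is created automatically once the current one is full.
db.sensor_buckets.updateOne(
  { "sensor-name": "sensor-1", type: "temperature", count: { $lt: 200 } },
  {
    $push: { history: { date: ISODate("2016-02-01T11:26:41.314Z"), value: 10.164892 } },
    $inc: { count: 1 }
  },
  { upsert: true }
);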

Mongodb text search for large collection

Given the collection below, which could grow to roughly 18 million documents, I need search functionality on the payload part of the document.
Because of the large volume of data, will it create performance issues if I create a text index on the payload field in the document? Are there any known performance issues when the collection contains millions of documents?
{
"_id" : ObjectId("5575e388e4b001976b5e570d"),
"createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
"env" : "prod",
"messageId" : "my-message-id-1",
"payload" : "message payload typically 500-1000 bytes of string data"
}
I use MongoDB 3.0.3
I believe that is exactly what NoSQL databases were designed to do: give you quick access to a piece of data via an [inverted] index. Mongo is designed for that. NoSQL databases like Mongo are designed to handle massive sets of data distributed across multiple nodes in a cluster. 18 million documents is pretty small in the scope of Mongo. You should not have any performance problems if you index properly. You might also want to read up on sharding; it is key to getting the best performance out of your MongoDB.
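As a small sketch of the text-index route (the collection name messages is assumed), creating the index and running a $text query would look roughly like this; on a collection with millions of documents the initial index build takes a while, so building it in the background is worth considering:
// Assumed collection name: messages.
db.messages.createIndex({ payload: "text" }, { background: true });

// Full-text search; sorting on the text score puts the best matches first.
db.messages.find(
  { $text: { $search: "payload string data" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } }).limit(20);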
You can use MongoDB Atlas Search, which lets you search your text with different analyzers that MongoDB provides. You can then do a fuzzy search, so that text close to your query is also returned.
PS: For an exact full-text match, just leave the fuzzy object out of the example below.
db.collection.aggregate([
  {
    $search: {
      index: 'analyzer_name_created_from_atlas_search',
      text: {
        query: 'message payload typically 500-1000 bytes of string data',
        path: 'payload',
        fuzzy: {
          maxEdits: 2
        }
      }
    }
  }
])

How efficient are MongoDB projections?

Is there a lot of overhead in excluding nearly all of the data in a document when querying a mongo database?
For example, in the case where I only want field1 and field2, for a collection with a document structure of:
{
  "field1" : 1,
  "field2" : true,
  "field3" : ["big","array",...],
  "field4" : ["another","big","array",...]
}
would I benefit more from:
1. Creating a separate collection alongside this collection containing only field1 and field2, or
2. Using .find() on the original documents with inclusion/exclusion parameters?
Note: The inefficiency of saving the same data twice isn't a concern for me as much as the efficiency of querying the data
Many thanks!
Projection is somewhat similar to naming columns explicitly in SQL, so it seems a little counter-intuitive to ask whether returning a smaller amount of data incurs more overhead than returning a larger amount of data (the full document).
So you have to find the document (depending on how you .find() it, that may be fast or slow), but returning only the first two fields of the document rather than all of the fields (the complete document) will make the query faster, not slower.
Having a second collection only helps if you are concerned about your collection fitting into RAM. If the documents in the duplicate collection are much smaller, then they can presumably fit into a smaller amount of total RAM, decreasing the chance that a page will need to be swapped in from disk. However, if you are writing to this collection as well as the original collection, you have to keep a lot more data in RAM than if you just had the original collection.
So while the intricate details may depend on your individual set-up, the general answer would probably be 2: you will benefit more from using projection and only returning the two fields you need.
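For reference, a projection on the example document above would look like this (the collection name is assumed):
// Assumed collection name: items.
// 1 includes a field; _id is returned by default unless explicitly excluded.
db.items.find(
  {},                                  // match everything, or add a filter here
  { field1: 1, field2: 1, _id: 0 }     // return only field1 and field2
);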

Evaluating mongo DB performance

I have a collection in a MongoDB database on which I have set user preferences. There is a very large number of objects in one particular collection, and a user can follow a key in that collection. For example:
collectionx : { key1: value1, key2: value2, key3: value3, ..., keyn: valuen }
Now the user can follow any number of keys, i.e., "when key1 equals some value, update me" (very similar to the Twitter "follow" feature).
Now how can I efficiently do this?
Also, if I query Mongo with a query like this:
db.collection.find({ keyId : 290 })
or db.collection.find({ keyId : { $in : [ 290 ] } }), will there be any drastic performance difference when there are millions of users and they all follow one show?
I think one of the biggest concerns with having large amounts of data in any database is that, when you are querying, you want to avoid hitting the disk. MongoDB does a fairly good job of keeping data in memory, but if your data set outgrows your memory, you will start swapping and that will hurt your performance.
There shouldn't be much of a difference between doing an $eq query and an $in query as long as there is an index on the key you are querying. If there is no index, you'll do a full collection scan.
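A short sketch of that (field name taken from the question, collection name generic): the index, the two equivalent queries, and explain() to confirm the index is actually used:
// Index on the followed key; without it both queries do a full collection scan.
db.collection.createIndex({ keyId: 1 });

// An equality match and a single-element $in resolve to essentially the same index bounds.
db.collection.find({ keyId: 290 });
db.collection.find({ keyId: { $in: [290] } });

// explain() shows IXSCAN (index used) vs COLLSCAN (full scan).
db.collection.find({ keyId: 290 }).explain("executionStats");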
For large amounts of data it is highly recommended to work with sharding.
It allows you to split the data between shards, so your index can fit in RAM. I think findOne by index should be quite efficient. The only thing that can harm your performance in this case is massive writes on top of your read operations, since MongoDB (in older versions) uses a global write lock.
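For the mechanics only, enabling sharding looks roughly like the sketch below (database and collection names are placeholders); note that the choice of shard key matters a great deal, and keyId alone could create a hot shard when millions of users follow the same show:
// Placeholder database/collection names; run against a mongos of a sharded cluster.
sh.enableSharding("mydb");
sh.shardCollection("mydb.collectionx", { keyId: 1 });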

Mongodb data storage performance - one doc with items in array vs multiple docs per item

I have statistical data in a MongoDB collection, saved per record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance is concerned, is this strategy superior to storing one document per record_id containing an array of statistical entries like the one above:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side, I would need only one query to get the entire stat history of a record without resorting to the slower map-reduce method; on the con side, I'd have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise, there is some disk reallocation that will go on.
I think this depends on the usage scenario. If the data set for a single aggregation is small, like those 700 records, and you want to do this in real time, I think it's best to choose yet another option: query all the individual records and aggregate them client-side. This avoids the map/reduce overhead, it's easier to maintain, and it does not suffer from reallocation or size limits. Index use should be efficient, and connection-wise I doubt there's much of a difference: most drivers batch transfers anyway.
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, the maximum number of dates per record would go down with the subdocument approach. It's also generally easier to work with top-level documents than with subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
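To make the client-side aggregation concrete, here is a minimal sketch under the one-document-per-record_id-and-day schema (the collection name stats and the date range are assumptions):
// Compound index so the per-record, per-date-range query stays cheap.
db.stats.createIndex({ record_id: 1, date: 1 });

// Fetch the per-day documents for one record and one month, then sum client-side.
var total1 = 0, total2 = 0;
db.stats.find({
  record_id: 12345,
  date: { $gte: ISODate("2011-12-01T00:00:00Z"), $lt: ISODate("2012-01-01T00:00:00Z") }
}).sort({ date: 1 }).forEach(function (doc) {
  total1 += doc.stat_value_1;
  total2 += doc.stat_value_2;
});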
You can refer to the discussion here, and also see how foursquare solved this kind of problem here. Both are valuable.