...and am I doing it wrong if I need to ask?
I have a data set composed of several thousand items (tracked objects in a video), each of which is composed of anywhere between 1 and about 100,000 other sub-items (data from each frame). I'm trying to figure out if it is wise to refer to every single frame within the object document, roughly like so:
{
    "_id" : ObjectId("541e59c033e2931c587ad85a"),
    "frames" : [
        ObjectId("541e599b33e2931c587ad7f6"),
        ObjectId("541e599b33e2931c587ad7f7"),
        ObjectId("541e599b33e2931c587ad7f8")
    ],
    "track_id" : 124
}
My frames would be in another collection and look something like:
{
    "_id" : ObjectId("541e599b33e2931c587ad7f6"),
    "track_id" : 124,
    "frame" : 1,
    "centroid" : [1234, 2345]
}
Because the frames array in the "tracked" collection could extend into roughly the 100k range, I'm a bit worried I could run up against the 16 MB document size limit.
My XY problem is that if all my frame data is plainly ordered with an integer, and unique between a track_id and frame_no combo, should I even bother with the document references?
I think the frames field is redundant, because all documents in the frames collection can be collected by a given track_id. It's safe to remove this field, and then you no longer have to worry about the BSON size limit.
By the way, this is very similar to GridFS, which MongoDB supports out of the box.
To answer your title question: an ObjectId is 12 bytes.
But it sounds like you don't need the frames references. Add a unique index to the frames collection on {track_id: 1, frame: 1}, which would let you quickly (and more easily) find any frame of any track.
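As a minimal sketch, assuming the frames collection and field names from the question:
db.frames.createIndex({ track_id: 1, frame: 1 }, { unique: true })
// Fetch a specific frame of a track directly, with no ObjectId references needed:
db.frames.find({ track_id: 124, frame: 1 })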
Related
I want to store my data in a collection that holds a maximum of ~10000 past records before they're deleted automatically from the database, like a FIFO queue of 10k records.
I have been looking at capped collections, but apart from the maximum number of records, they also require the maximum size of the collection. Now, I don't really know how much space 10k records are going to occupy. If I set a size that can't hold 10k records, I'll end up with fewer than I need. On the other hand, if I set a generous upper limit on the size, space is going to be wasted because mongo allocates the space beforehand.
What I could do is insert dummy records, but I don't know how to check the size of each document.
Does anyone know of a method that sets an upper limit solely on the number of documents in my collection? I'm using the latest version of mongo out right now (v3.6.3).
I'd personally push 10,000 actual, real documents into a MongoDB collection and then call stats on the collection:
db.test.stats()
Then use the size property:
> db.test.stats()
{
    "ns" : "test.test",
    "size" : 28609,
    "count" : 1018,
    "avgObjSize" : 28,
    ...
You may also want to add some padding, say 10%, to that number:
cap col size = size * 1.1
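For illustration, a rough sketch of putting that together in the shell; the collection name records is made up, and note that capped collections also accept an optional max document count alongside the required size:
// Assuming ~10,000 representative documents were pushed into db.test first:
var sampleSize = db.test.stats().size;      // total BSON size of the sample documents
var capSize = Math.ceil(sampleSize * 1.1);  // add ~10% padding
db.createCollection("records", { capped: true, size: capSize, max: 10000 })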
The MongoDB collection I am working on takes sensor data from a cellphone, and it is pinged to the server about every 2-6 seconds.
The data is huge, and the 16 MB limit is crossed after 4-5 hours; there doesn't seem to be any workaround for this.
I have tried searching for it on Stack Overflow and went through various questions, but no one actually shared their hack.
Is there any way, maybe on the DB side, to distribute the data into chunks the way it is done for big files via GridFS?
To fix this problem you will need to make some small amendments to your data structure. By the sounds of it, for your documents to exceed the 16 MB limit, you must be embedding your sensor data into an array in a single document.
I would not suggest using GridFS here; I do not believe it to be the best solution, and here is why.
There is a technique known as bucketing that you could employ which will essentially split your sensor readings out into separate documents, solving this problem for you.
The way it works is this:
Let's say I have a document with some embedded readings for a particular sensor that looks like this:
{
    _id : ObjectId("xxx"),
    sensor : "SensorName1",
    readings : [
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" }
    ]
}
With the structure above there is already a major flaw: the readings array could grow without bound and exceed the 16 MB document limit.
So what we can do is change the structure slightly to look like this, to include a count property:
{
    _id : ObjectId("xxx"),
    sensor : "SensorName1",
    readings : [
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" }
    ],
    count : 3
}
The idea behind this is, when you $push your reading into your embedded array, you increment ($inc) the count variable for every push that is performed. And when you perform this update (push) operation, you would include a filter on this "count" property, which might look something like this:
{ count : { $lt : 500} }
Then, set your Update Options so that you can set "upsert" to "true":
db.sensorReadings.update(
    // Filter: match the sensor, and only buckets that still have room
    { sensor: "SensorName1", count: { $lt: 500 } },
    {
        // Your update: $push your reading and $inc your count
        $push: { readings: ReadingDocumentToPush },
        $inc: { count: 1 }
    },
    { upsert: true }
)
See here for more info on the MongoDB update operation and the upsert option:
MongoDB update documentation
What will happen is this: when the filter condition is not met (i.e. when there is either no existing document for this sensor, or the count is greater than or equal to 500, because you are incrementing it every time an item is pushed), a new document will be created, and the readings will now be embedded in this new document. So you will never hit the 16 MB limit if you do this properly.
Now, when querying the database for readings of a particular sensor, you may get back multiple documents for that sensor (instead of just one with all the readings in it); for example, if you have 10,000 readings, you will get 20 documents back, each with 500 readings.
You can then use the aggregation pipeline and $unwind to filter your readings as if they were their own individual documents.
For more information on $unwind see here; it's very useful:
MongoDB Unwind
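As a rough sketch, using the collection and field names from the update example above, such a pipeline might look like this:
db.sensorReadings.aggregate([
    { $match: { sensor: "SensorName1" } },   // all buckets for this sensor
    { $unwind: "$readings" },                // one output document per reading
    { $sort: { "readings.date": 1 } },       // order the readings chronologically
    { $project: { _id: 0, date: "$readings.date", reading: "$readings.reading" } }
])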
I hope this helps.
You can handle this type of situation using GridFS in MongoDB.
Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
The GridFS documentation contains almost everything you need to implement it. You can follow it.
As your data is a stream, you can try something like the following...
gs.write(data, callback)
where data is a Buffer or a string, and callback gets two parameters: an error object (if an error occurred) and a result value indicating whether the write was successful. While the GridStore is not closed, every write is appended to the opened GridStore.
You can follow this github page for streaming related information.
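Note that GridStore is the legacy driver API; as a hedged sketch only, roughly the same append-style writes with the current Node.js driver's GridFSBucket might look like this (the database name, bucket name, and sample readings are placeholders):
const { MongoClient, GridFSBucket } = require('mongodb');
async function storeSensorStream() {
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const bucket = new GridFSBucket(client.db('telemetry'), { bucketName: 'sensorData' });
    // Placeholder sample payloads; in practice these arrive every 2-6 seconds.
    const incomingReadings = [{ t: Date.now(), accel: [0.1, 0.2, 9.8] }];
    // One GridFS file per session; GridFS splits it into 255 kB chunks automatically.
    const upload = bucket.openUploadStream('sensor-stream');
    for (const reading of incomingReadings) {
        upload.write(Buffer.from(JSON.stringify(reading) + '\n'));
    }
    upload.end(() => client.close());   // flush the remaining chunk, then close the connection
}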
I'm practicing MongoDB through a small personal project,
in which I may need to store some intermediate data abstracted as an unboundedly growing table. Both rows and columns would grow without bound.
The usage of this abstract table is that I want to be able to
know the corresponding column for each entry in a row
know the corresponding row for each entry in a column
Or, in other words, know the index of each table entry.
Hence two choices come to mind for modeling the table:
Make two collections:
one holds each row as a document, which embeds a growing structure of row entries referencing the corresponding columns;
and similarly, another collection holds each column as a document embedding a growing structure that references the corresponding rows.
Make a single separate collection that holds each table entry as a document. Hence each document size is fixed.
The first model has a problem with document growth (in fact, in my application the table grows a bit askew, and only one collection would encounter the document growth issue). The second model seems fine to me. Are there pitfalls or other issues I should be aware of? And what is the common practice for dealing with such a problem?
UPDATE: explain things in more detail
I am trying to do automatic summarization of an ongoing conversation. The input is a corpus of sentences, and terms are extracted from each sentence. For example, English terms are stemmed, and sentences in CJK languages are segmented. This yields a term-sentence matrix. One of the methods then needs to compute the (sparse) SVD of this term-sentence matrix.
The sentences and extracted terms would be stored in the database, but the term-sentence matrix would grow without bound.
(Or one can think of the problem of storing a mapping between tweets and hashtags)
Two draft schemas came to mind:
choice one (hold two-way linkages between sentences and terms)
{ // sentence collection doc
    "_id" : // generated by timestamp
    , "text" : //
    , "contained_terms" : [
        // an array of "_id"s in term collection
    ]
}
{ // term collection doc
    "_id" : // use term name
    , "in_sentences" : [
        // an array of "_id"s in sentence collection
    ]
}
choice two (make linkages into a separate collection)
{ // linkage collection doc (as matrix entries)
    "_id" : // generated by timestamp
    , "term" : // an "_id" in term collection
    , "in_sentence" : // an "_id" in sentence collection
}
{ // sentence collection doc
    "_id" : // generated by timestamp
    , "text" : //
}
{ // term collection doc
    "_id" : // use term name
}
Choice one runs into the document growth problem, because the "in_sentences" array of a term document is very likely to grow beyond the limit when sentences come in nonstop.
Choice two extracts the linkage between terms and sentences into a separate collection, and hence avoids document growth. Querying "which sentences contain this term" costs more, but in the end it seems I don't actually need that operation much.
Currently I'm thinking that choice two better suits my needs. The linkage collection seems to conform to the input of a sparse SVD. To speed up computation, very-high-frequency terms can be filtered out if a term frequency field is added to each term document (or kept in a separate collection when there is more than one conversation). This filtering seems fine in the case of automatic summarization.
But I still wonder:
Are there issues or pitfalls I should be aware of?
What is the common practice for a similar situation?
My understanding of MongoDB is that you need to design your schema around your queries, so how you save your data depends heavily on what data you will be querying. Even for the same set of data, your schema can vary depending on the actual use case. Additionally, data redundancy is quite common in NoSQL database design: if you are going to need some data again and again, there is no point in keeping it only in a separate collection. You can duplicate it in two collections, and that's a fair enough cost for faster querying. Memory is cheap, processing isn't!
Additionally, pre-aggregation helps in the case of Mongo for huge data sets. Your queries will work fine for a decent number of documents, but once you go into the realm of millions of records, you may face problems with a certain class of queries like counts, aggregations, etc. Pre-aggregation helps in keeping things real time, though it may have a higher write/insertion overhead. Always avoid a full table scan whenever you can.
Above are some broad level concepts that I find relevant to your question. I'll try and explain it in your context with some examples (as I am not sure what data you are eventually going to need, or the queries you will do).
Let's say you are going to need terms per sentence frequently, to highlight them. In that case the recommended schema will be:
{ "_id" : // sentence id - you will query on this
, "text" : // sentence text
, "terms" : ["term1", "term2", "term3"]
}
So for each new sentence, you extract all the terms and save them (not their ids) along with the sentence. The advantage here is that you will not need to query for the terms separately; you can get all the terms for a given sentence in a single query. Additionally, the document size doesn't grow, and hence there is no document relocation.
Let's say you also want to have a unique list of terms and some per-term metadata. You can have a separate terms collection which holds a list of all the unique terms:
{ "_id" :
, "term" :     // the term itself
, "meaning" :
, "metadata" :
, "count" : 1
}
You can have a unique index on term. Each time you extract terms from a sentence, you look each one up in this collection, and in case you don't find it, you insert it. Now let's say you also want to maintain a count of term appearances. So each time you find a term in a sentence and do a lookup in the terms collection, you can increment (atomically) the count as well - pre-aggregation. If you add an index on count, you can easily get the top 100 terms, etc. on the fly.
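A minimal sketch of that lookup-or-insert plus count increment (the collection name terms is assumed), using an upsert so the whole step is a single atomic operation:
db.terms.createIndex({ term: 1 }, { unique: true })   // unique index on term
// Insert the term if it is missing, and atomically bump its count either way:
db.terms.update(
    { term: "term1" },
    { $inc: { count: 1 } },
    { upsert: true }
)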
Now let's say you want to query/count all the sentences with a given term. You can add an index on the terms array and directly look up all the sentences containing a given term:
Sentence.where(:terms => "term1").count # Mongoid query
Again, you are achieving this with a single query, as opposed to first getting a term id (as in your design) and then the sentences.
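In the shell, the supporting multikey index and the equivalent query might look like this (assuming a sentences collection behind the Sentence model):
db.sentences.createIndex({ terms: 1 })          // multikey index over the terms array
db.sentences.find({ terms: "term1" }).count()   // count sentences containing "term1"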
Other than this, it's always advisable to ensure that your working set and indexes fit in RAM for best performance.
So again, there are no right or wrong answers for schema design; it definitely depends on the queries you will be doing. I would also advise you to unlearn some of your relational DB concepts when trying to design for NoSQL databases. I learned that the hard way =) Hope some of this helps you come up with an efficient schema for your use case.
If you are trying to model a matrix with the whole collection representing the matrix, I think the go-to model is to have each entry (row i, column j) as its own document. If you add a field like "index" : { "row" : i, "column" : j } and appropriate indexes, then it's easy and fast to do fun things like
get the entry at (i, j)
get row i
get column j
The matrix is represented sparsely, so if row i only has 10 columns with values, row i is just 10 documents. If the rows/columns really do grow unboundedly to very large sizes, then modeling a document as a row or a column or anything "1-dimensional" could hit the hard 16 MB BSON document size limit.
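A hedged sketch of that entry-per-document model (the collection name matrix and the value field are made up):
db.matrix.createIndex({ "index.row": 1, "index.column": 1 }, { unique: true })
db.matrix.createIndex({ "index.column": 1 })   // second index so column lookups are also fast
db.matrix.insertOne({ index: { row: 3, column: 7 }, value: 0.42 })
db.matrix.find({ "index.row": 3, "index.column": 7 })   // the entry at (3, 7)
db.matrix.find({ "index.row": 3 })                      // all of row 3
db.matrix.find({ "index.column": 7 })                   // all of column 7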
I'm thinking the biggest drawback could be large index sizes given that every entry is its own document.
Let's say we have a structure like this per entry that goes to Solr. The document is first amended and then saved. The way it is amended at the moment is such that we lose the connection between the number and the score. However, we could change that into something else if necessary.
"keywords" : [
    {
        "score" : 1,
        "content" : "great finisher"
    },
    {
        "score" : 1,
        "content" : "project"
    },
    {
        "score" : 1,
        "content" : "staying"
    },
    {
        "score" : 1,
        "content" : "staying motivated"
    }
]
What we want is to boost a document in the Solr query results using the "score" value whenever the query contains the word/collocation with which that score is associated.
So each document has a different "map" of keywords with scores. The relevancy would be computed the way Solr normally does it now, but with a boost according to this map and the words present in the query.
From what I saw we can boost results according to some criteria, but this criterion is very dynamic and context dependent. I'm not sure how to implement this or where to start.
At the moment there is no built-in support in Solr to do anything like this. The most ideal way would be to have each term in a multiValued field boosted separately, but this is currently not possible (progress, although there is none, is tracked in SOLR-2499).
There are, however, ways of working around this; two are suggested in the issue tracker above. I can't say much about using payloads and a custom BoostingTermQuery, but using dynamic fields is a possibility. The drawback is managing your cache sizes if you have many different field names and query/sort by most of them. A small index with fewer terms will work, but a larger one (in the higher five and six digits) with many dynamic fields will eat up your memory quickly, as each field you sort or query on will get a lookup cache with an int/long array the same size as your document count.
Another suggestion would be to look at using function queries together with a boost. If you reference the field there instead, you might avoid the cache issue. Try it!
I'm new to MongoDB. When creating a new table, a question came to my mind related to how to design it and its performance. My table structure looks like this:
{
    "name" : string,
    "data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", .... }
}
The "data" field could grow until it reaches about 100,000 elements ("data100000" : "aaaXXX"). However, the number of rows in this table would stay under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would do queries like this one (I'll put an example in Java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower as the number of "dataX"... elements grows.
So the question is: Is this design correct? Should I change something?
I'll be pleased to read your advice, many thanks in advance
A document can be viewed like a table with columns, but you have to be careful; it has other usage characteristics. The document size can be at most 16 MB, and you have to keep in mind that documents are held in memory by Mongo.
With your query the whole document will be returned. Ask yourself: do you need all entries, or
will you use a single entry on its own?
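If you only need a single entry, a projection keeps the response small; a sketch of the shell equivalent of the Java query above (the collection name and "someName" are placeholders):
db.collection.find(
    { "name" : "someName", "data.data3" : "zzz" },
    { "data.data3" : 1 }   // return just the matching entry instead of the whole document
)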
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing it as an array of strings; then you can index the array field, which would index all the values.
{
    "name" : string,
    "data" : [ "xxx", "yyy", "zzz" ]
}
If, as in your query, you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz' you can do:
db.Collection.find( { "data" : "zzz" } )
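A multikey index on the array would support that last query; a quick sketch:
db.Collection.createIndex({ "data" : 1 })   // multikey index: indexes every value in the data array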
100,000 strings is not going to get anywhere near 16 MB, so you don't need to worry about that. However, having 100,000 fields in a nested document or array suggests something is wrong with the design; without knowing what data is, I couldn't say for sure.