Multiple nested arrays in MongoDB - mongodb

I am having difficulties figuring out an effective way of working with a multiple nested document. It looks like the following:
{ "_id" :
{ "$oid" : "53ce46e3f0c25036e7b0ddd8"} ,
"someid" : 7757099 ,
"otherids" :
[ { "id" : 100 ,
"line" : "test" ,
"otherids" :
[ { "id" : 129}]}
]}
and there will be one more level of array nesting in addition.
I cannot find a way to query this structure any deeper than the first "otherids" array. Is this possible to do in an efficient way at all?
These arrays might grow a bit, but not hugely.
My thought was to use this structure since it is efficient to fetch a lot of data in one go. But this data also needs to be updated quite often. Is this a hopeless design with MongoDB?
Regards, a MongoDB newb
EDIT:
I would like to do it as simply and fast as possible :-)
Like: someid.4.otherids.2.line -> somevalue
I know that I would probably have to do a query to check whether values exist, but it would be nice to do it as an upsert. Right now I only work with objects in Java, and it takes 14 secs to insert 10,000 records. Most of these inserts are "leaf nodes", meaning I have to query, find out what is already there, modify the document, then update the whole root. This takes too long.
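On a modern MongoDB (3.6+), arrayFilters can update such a leaf in a single round trip. Here is a minimal mongo-shell sketch, assuming the field names from the question; the collection name mycoll is made up, and this updates existing leaves rather than upserting new ones:
db.mycoll.updateOne(
    { someid: 7757099 },                                                    // locate the root document
    { $set: { "otherids.$[outer].otherids.$[inner].line": "somevalue" } }, // set a leaf two levels deep
    { arrayFilters: [ { "outer.id": 100 },                                 // which outer element to touch
                      { "inner.id": 129 } ] }                              // which inner element to touch
)
This avoids the query-modify-rewrite cycle on the whole root document, which is likely where the 14 seconds were going.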

Related

MongoDB - Using Index to get nested IDs is slow

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
    _id: ...,
    _session: {
        _id: ...
    },
    data: {...}
}
I need to get all the _session._id for my application. The following approach (python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?
In the mongo shell you can hint() the query optimizer to use the available index as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work via Python:
import pymongo

db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb = db["test"]
docs = mydb.test2.find({}).hint([("x.y", pymongo.ASCENDING)])
for i in docs:
    print(i)
db.test2.createIndex({"x.y":1})
{
    "v" : 2,
    "key" : {
        "x.y" : 1
    },
    "name" : "x.y_1"
}
Tested with Python 3.7, pymongo 3.11.2, mongod 5.0.5.
In your case it seems to be a text index (by the way, it seems a bit strange that the session field has a text index). For a text index, something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo

db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get first 10 docs from test.test:')
mydb = db["test"]
docs = mydb.test2.find({"x.y": "3"}).hint("x.y_text")
print("===start:====")
for i in docs:
    print(i)
db.test2.createIndex({"x.y":"text"}):
{
    "v" : 2,
    "key" : {
        "_fts" : "text",
        "_ftsx" : 1
    },
    "name" : "x.y_text",
    "weights" : {
        "x.y" : 1
    },
    "default_language" : "english",
    "language_override" : "language",
    "textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in @R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in @R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of @R2D2's answer.
Only the first of these is likely to even be relevant to the original question. Note the difference: a text index is a specialized index meant for performing more advanced text searching. Such indexes are not required for simple string matching or value retrieval. But a standard index like { "_session._id": 1 } will also store string values and is what is relevant here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear whether it would be faster. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
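To check whether the query is actually covered, run explain on the hinted query; note that indexes can only cover queries on embedded fields like _session._id on MongoDB 3.6 and newer:
db.collection.find({}, {_id:0, "_session._id":1})
             .hint({"_session._id":1})
             .explain("executionStats")
A covered plan reports totalDocsExamined: 0, meaning the results came from the index alone.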
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this is the case above. But to find out whether the database is using the index, you can navigate to the Explain tab in Compass and take a look. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index, given your query.
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?
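For instance, if the application only needs the set of unique session IDs, distinct may be a simpler fit. A minimal sketch, with the caveat that its result must fit in a single 16MB BSON document:
db.collection.distinct("_session._id")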

Solr Increase relevance of search result based on a map of word:value

Let's say we have a structure like this per entry that goes to Solr. The document is first amended and then saved. The way it is amended at the moment, we lose the connection between the number and the score; however, we could change that into something else if necessary.
"keywords" : [
{
"score" : 1,
"content" : "great finisher"
},
{
"score" : 1,
"content" : "project"
},
{
"score" : 1,
"content" : "staying"
},
{
"score" : 1,
"content" : "staying motivated"
}
]
What we want is to give a boost to a solr query result to a document using the "score" value in case the query contains the word/collocation to which the score is associated.
So each document has a different "map" of keywords with scores. Relevancy would be computed as Solr normally does now, but with a boost according to this map and the words present in the query.
From what I saw we can give boosts to results according to some criteria, but this criterion is very dynamic and context dependent. I am not sure how to implement this or where to start.
At the moment there is no built-in support in Solr for anything like this. The most ideal way would be to have each term in a multiValued field boosted separately, but this is currently not possible (progress, although there is none, is tracked in SOLR-2499).
There are, however, ways of working around this; two are suggested in the issue tracker above. I can't say much about using payloads and a custom BoostingTermQuery, but using dynamic fields is a possibility. The drawback is managing your cache sizes if you have many different field names and query/sort by most of them. If you have a small index with fewer terms it will work, but a larger one (in the higher five and six digits) with many dynamic fields will eat up your memory quickly, as each sorted/queried field gets its own lookup cache holding an int/long array the same size as your document count.
Another suggestion would be to look at using function queries together with a boost. If you reference the field there instead, you might avoid the cache issue. Try it!
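To make the dynamic-field workaround concrete, here is a hedged sketch; every field and parameter name below is made up for illustration. Index each keyword as its own dynamic float field holding its score:
{ "id" : "doc1", "kw_staying_motivated_f" : 1.0, "kw_project_f" : 1.0 }
Then have the application map query terms to those field names and apply a multiplicative function-query boost (edismax shown; def() falls back to 1 for documents lacking the field):
q=staying motivated&defType=edismax&boost=def(kw_staying_motivated_f,1)
The mapping from query words to field names has to happen in your application layer before the query is sent.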

How to use an index with MongoCollection.Update()

I am writing a method that updates a single document in a very large MongoCollection,
and I have an index that I want the MongoCollection.Update() call to use to drastically reduce lookup time, but I can't seem to find anything like MongoCursor.SetHint(string indexName).
Is using an index on an update operation possible? If so, how?
You can create an index according to the query section of your update command.
For example if you have this collection, named data:
> db.data.find()
{ "_id" : ObjectId("5334908bd7f87918dae92eaf"), "name" : "omid" }
{ "_id" : ObjectId("5334943fd7f87918dae92eb0"), "name" : "ali" }
{ "_id" : ObjectId("53349478d7f87918dae92eb1"), "name" : "reza" }
and if you do this update query:
> db.data.update({name: 'ali'}, {name: 'Ali'})
without any defined index, the number of scanned documents is 2:
"nscanned" : 2,
But if you define an index according to your query, here on the name field:
db.data.ensureIndex({name:1})
Now if you update it again:
> db.data.update({name: 'Ali'}, {name: 'ALI'})
MongoDB uses your index to do the update, and the number of scanned documents is 1:
"nscanned" : 1,
But if you want a hint for the update, you can hint the query:
# Assume that the index and field of it exists.
> var cursor = db.data.find({name:'ALI'}).hint({family:1})
Then use it in your update query:
> db.data.update(query=cursor, update={name: 'ALI'})
If you have already indexed your collection, update will use the CORRECT index right away. There is no point in providing a hint (in fact, you can't hint with update).
Hint is only for debugging and testing purposes. Mongo is in most cases smart enough to automatically decide which index (if you have many of them) should be used in a particular query, and it reviews its strategy from time to time.
So short answer - do nothing. If you have an index and it is useful, it will be automatically used on find, update, delete, findOne.
If you want to see if it is used - take the part of the query which searches for something and run it through find with explain.
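A minimal shell sketch of that check, assuming the {name:1} index from the answer above exists:
db.data.find({ name: 'Ali' }).explain("executionStats")
In the output, the winningPlan should contain an IXSCAN stage on "name_1", and executionStats.totalDocsExamined should be close to nReturned.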
Example for hellboy. This is just an example and in real life it can be more complex.
So you have a collection with docs like this: {a: int, b: timestamp}. You have 2 indexes: one on a, another on b. Right now you need to run a query like "a > 5 and b is after 2014". For some reason it uses index a, which does not give you the fastest time (maybe because you have 1000 elements and most of them are bigger than 5, while only 10 are after 2014). So you decided to hint it to use the b index. Cool, it works much faster now. But your collection changes, and now it is 2020 and most of your documents have b bigger than 2014. So now your index b is not doing so much work, but Mongo still uses it, because you told it to.

How to add data into mongo collections

I have the following MongoDB collection structure:
{
    "_id" : ObjectId("52204f5b24c8cbf03ca16f8e"),
    "Date" : 1377849179,
    "cpuUtilization" : 31641,
    "memory" : 20623801,
    "hostId" : "600.6.6.6"
}
In the above collection I have 1000 hostIds, and every hostId produces cpuUtilization and memory values every 5 minutes. Can anyone suggest whether I should put my data into a single collection, or create 1000 separate collections using the hostId as the collection name (100.1.12.2, 101.2.10.1, ...)?
I also want to index the collections for searching records.
From the structure you have shared it would be a sensible choice to put the data into separate records, since memory and cpuUtilization will always differ. Also, if you store a timestamp in the Date field, that will always be different too.
It will be far easier to query your database if you store the records separately, and you can avoid aggregation as well, which gives you better query performance when using appropriate indexes.
So your records should look like below:
{ "_id" : ObjectId("someID1"),"Date" : 1377849179,"cpuUtilization" : 31641,"memory" : 20623801,"hostId" : "600.6.6.6"}
{ "_id" : ObjectId("someID2"),"Date" : 1377849210,"cpuUtilization" : 20141,"memory" : 28787801,"hostId" : "600.6.6.6"}
One collection will be good enough to store the information. One thing you have to take care of is write performance: since MongoDB locks at the database level while writing, writes may be slow. One suggestion is to have two or three databases, each holding the collections for a specific range of hosts; that will help you write faster. Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations.

MongoDB Table Design and Query Performance

I'm new to MongoDB. While creating a new table, a question came to my mind about how to design it and the performance implications. My table structure looks this way:
{
    "name" : string,
    "data" : { "data1" : "xxx", "data2" : "yyy", "data3" : "zzz", ... }
}
The "data" field could grow until it reaches an amount of 100.000 elements ( "data100.000" : "aaaXXX"). However the number of rows in this table would be under control (between 500 and 1000).
This table will be accessed many times in my application and I'd like to maximize the performance of any queries. I would do queries like this one (I'll put an example in java):
new Query().addCriteria(Criteria.where("name").is(name).and("data.data3").is("zzz"));
I don't know if this would get slower when the amount of "dataX"... elements grows.
So the question is: Is this design correct? Should I change something?
I'll be pleased to read your advice, many thanks in advance
A document can be viewed like a table with columns, but you have to be careful: it has different usage characteristics. The maximum document size is 16 MB, and you have to keep in mind that documents are held in memory by Mongo.
With your query the whole document will be returned. Ask yourself: do you need all the entries, or will you use a single entry on its own?
Using MongoDB for eCommerce
MongoDB Schema Design
MongoDB and eCommerce
MongoDB Transactions
This should be a good start.
What is data? I wouldn't store a single nested document with up to 100,000 fields, as you wouldn't be able to index it easily, so you would get performance issues.
You'd be better off storing the values as an array of strings; then you can index the array field, which indexes all the values.
{
    "name" : string,
    "data" : [ "xxx", "yyy", "zzz" ]
}
If, as in your query, you then wanted the value at a particular position in the array, instead of data.data3 you could do:
db.Collection.find( { "data.2" : "zzz" } )
Or, if you don't care about the position and just want all documents where the data array contains 'zzz' you can do:
db.Collection.find( { "data" : "zzz" } )
100,000 strings is not going to get anywhere near 16MB, so you don't need to worry about that; but having 100,000 fields in a nested document or array indicates something is wrong with the design. Without knowing what data is, though, I couldn't say for sure.
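For completeness, a minimal sketch of the index that serves both queries above; an index on an array field is multikey, so every string in data gets indexed (the value "someName" is made up):
db.Collection.createIndex({ name: 1, data: 1 })
db.Collection.find({ name: "someName", data: "zzz" })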