What has better performance: two parallel queries, or one query and then array.reduce? (MongoDB and JavaScript performance)

I want to know which option has better performance in the long run, assuming the DB is full of documents (maybe thousands).
1. Do two queries in parallel:
The first query gets all rooms, with their properties, that the user is in.
The second query gets the list of distinct rooms the user is in whose share property is set to true.
const roomsInfo = await Promise.all([
  db.collection('rooms').find({ 'users.id': "myuserId" }).toArray(),
  db.collection('rooms').distinct('id', { 'users.id': "myuserId", share: true }),
])
Example output:
roomsInfo[0] = [{
    "id": "room1",
    "name": "room1name",
    "data": "XXXX",
    "users": [{
        "id": "User1",
        "data": "XXXX",
    }, {
        "id": "User2",
        "data": "XXXX",
    }],
    "share": true
}, {
    "id": "room2",
    ...
    "share": true
}, {
    "id": "room3",
    ...
    "share": false
}, {
    "id": "room4",
    ...
    "share": false
}]
roomsInfo[1] = ["room1", "room2"]
2. Do one query and then reduce:
const roomInfo = await db.collection('rooms').find({'users.id': "myuserId"}).toArray()
// keep only the ids of rooms that have share === true
const roomFiltered = roomInfo.reduce((a, o) => (o.share && a.push(o.id), a), [])
The second option has the same result as the first one:
roomInfo (2nd option) = roomsInfo[0] (1st option)
roomFiltered (2nd option) = roomsInfo[1] (1st option)
Thank you very much in advance.

I think you should use the first solution (two queries in parallel).
For example, if you have 1,000,000 records in your DB, MongoDB's distinct query is faster than your local for loop.

I tested it with 100,000, 500,000 and 1M documents using mgodatagen, and my conclusions are:
Queries in parallel are better than one query followed by filter, map, reduce or any other array method for big data.
When you have big data in your database, you must query by index; otherwise you won't get any result in a reasonable time.
Split your queries as much as you can. For example, if you query all room documents matching a given array of room ids, and rooms are divided into 2 types (single and group), it is better to do 2 parallel queries: the first finding all single rooms that correspond to the given array, and the second all group rooms that correspond to it (see the sketch below).
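A minimal sketch of that split, assuming each room document carries a hypothetical type field of 'single' or 'group' and roomIds is the given array of ids:

// Hypothetical split: both type-scoped queries run concurrently.
// Assumes an index such as { type: 1, id: 1 } so each query stays index-bound.
const [singleRooms, groupRooms] = await Promise.all([
  db.collection('rooms').find({ type: 'single', id: { $in: roomIds } }).toArray(),
  db.collection('rooms').find({ type: 'group', id: { $in: roomIds } }).toArray(),
])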

Related

MongoDB query getting slow even after indexing

I have a collection, let's say Fruits, in my db, which has the following fields:
{
    "_id": ObjectId(...),
    "item": "Banana",
    "category": ["food", "produce", "grocery"],
    "location": "4th Street Store",
    "stock": 4,
    "type": "cases"
}
There is an index by default on _id, and I added another index, which is:
{
    "item": 1,
    "category": 1,
    "stock": 1,
    "type": 1
}
This collection has thousands of documents, and my query response is slow.
After adding the index I mentioned above, do I need to include all of these
keys in my query, or can I use any one of the keys in the index?
Currently my queries look like:
fruits.find({item: 'new'});
fruits.find({item: 'new', category: 'history'});
fruits.find({stock: '5', category: 'drama'});
fruits.find({type: 'new'});
Is my index, which has all these keys, enough, or do I need to create
different indexes for each of the combinations of keys I mentioned above?
Sometimes I query this collection directly, and sometimes I run an aggregation on some other collection with a lookup into this fruits collection and then search on the result.
{
    "item": 1,
    "category": 1,
    "stock": 1,
    "type": 1
}
This index will partially work for the following.
fruits.find({item: 'new'}); **Will work (Partially)**
fruits.find({item: 'new', category: 'history'}); **Will work (Partially)**
fruits.find({stock: '5', category: 'drama'}); **Won't work**
fruits.find({type: 'new'}); **Won't work**
Partially => An index is basically an entry in a B-Tree data structure in MongoDB which maps to a document in the system. The index prefix on item allows the index to serve the first and second queries you mentioned, but the third and last ones fall back to a collection scan.
Read about prefixes here.
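To make that concrete, here is a sketch of the extra indexes that would keep the last two queries off a collection scan (the existing compound index already serves the first two through its prefixes); these index shapes are a suggestion, not the only valid choice:

// Existing compound index: its prefixes cover {item} and {item, category}.
db.fruits.createIndex({ item: 1, category: 1, stock: 1, type: 1 })

// Suggested additional indexes for the remaining queries:
db.fruits.createIndex({ category: 1, stock: 1 })  // serves {stock, category}
db.fruits.createIndex({ type: 1 })                // serves {type}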
You need to properly understand indexes in the long run; you can ask for help with specific queries, but the knowledge gap will become a problem. This brief read will be really useful.
Edit
Aggregation => It depends on the query: essentially only the initial $match stage can use an index, and everything after it happens in memory (check this for more details). For $lookup, the data from the other collection is fetched using an index if one exists there (again, for the match part), but anything you do with that data afterwards is done in memory. Logically, fetching the data is where indexes are used anyway; for the sorting part, read the document linked above.
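As a minimal illustration of that ordering (the pipeline itself is just an example), the $match below can use the { item: 1, ... } index, while the $group runs in memory on the matched documents:

db.fruits.aggregate([
    // Index-eligible: $match placed first so the index on `item` can be used.
    { $match: { item: "Banana" } },
    // In-memory: grouping happens on the already-matched documents.
    { $group: { _id: "$location", totalStock: { $sum: "$stock" } } }
])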

Firebase Combining Query and Pagination

A query can be used to filter a large set of items down to a smaller number suitable for synchronizing to the client.
Pagination serves the same purpose: to limit the items to a smaller number suitable to be fetched by the client.
Consider the following database schema:
"users": {
"-KRyXWjI0X6UvffIB_Gc": {
"active": true,
"name": "John Doe",
"occupation": "Looking for firebase answer"
},
"-KRyXBWwaK112OWGw5fa": {
"active": false,
"name": "Jane Doe",
"occupation": "Still stuck on combining query and pagination"
},
"-KRyWfOg7Nj59qtoCG30": {
"active": true,
"name": "Johnnie Doe",
"occupation": "There is no greater sorrow than to recall in misery the time when we were stuck"
}
}
If I were to get all the active users, it would look like this (code in Swift):
let usersRef = ref.child("users");
let query = usersRef.queryOrderedByChild("active")
.queryEqualToValue(true)
That filtering leaves me with 10,000 users. Fetching all of those users at once is out of the question; it must be paginated.
To do the pagination, I have to do the query on the unique sorted value, which is none other than the key itself. This is how it looks now:
let usersRef = ref.child("users");
let query = usersRef.queryOrderedByChild("active")
.queryEqualToValue(true)
let usersPerPage = 10;
query.queryOrderedByKey()
.queryStartingAtValue(lastKey)
.queryLimitedToFirst(usersPerPage)
This wouldn't work because:
You can only use one order-by method at a time. Calling an order-by
method multiple times in the same query throws an error.
After spending 2 days thinking about how to solve this situation, I could only come up with this "anti best practice" solution.
I modified the database schema. I converted the active boolean value to a string and appended it to the key, to give the key control over the ordering. This is how it looks now:
"users": {
"-KRyXWjI0X6UvffIB_Gc": {
"key_active": "-KRyXWjI0X6UvffIB_Gc true"
"active": true,
"name": "John Doe",
"occupation": "Looking for firebase answer"
}
}
Now I can do both the pagination and the query using a single orderBy:
let usersPerPage = 10;
query.queryOrderedByChild("key_active")
.queryStartingAtValue("\(lastKey) true")
.queryLimitedToFirst(usersPerPage)
Somehow my brain rejects the idea of embedding the key inside another key, because it's about as dirty as a solution can get. I want to know the right solution for this particular situation; any solution would be greatly appreciated.
Just add this to your JSON tree:
active_Users: {
    uid1: {
        index: 1,
        name: "John Doe"
    },
    uid2: {
        index: 2,
        name: "Johnnie Doe"
    }
}
After this, just follow this answer: Retrieving 5 users at a time, and modify it according to your requirements.
Note: the total number of users/posts in that answer is retrieved by counting children. Given your heavy database, you might want to store totalNoOfUser in a separate node and increment it every time a new user is added.
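For what it's worth, a minimal sketch of paginating that active_Users node by its index field, written against the Firebase web SDK rather than Swift (the node and field names follow the answer above; adapt the calls to your SDK version):

// Sketch: fetch one page of active users, ordered by the numeric `index` field.
const usersPerPage = 10;

function fetchPage(lastIndex) {
    return firebase.database().ref("active_Users")
        .orderByChild("index")
        .startAt(lastIndex + 1)      // resume just after the last item seen
        .limitToFirst(usersPerPage)
        .once("value");
}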

Execution time of a query - MongoDB

I have two collections: coach and team.
The coach collection contains information about coaches, like name, surname and age, plus an array coached_Team that contains the _ids of the teams a coach has coached.
The team collection contains data about teams, like _id, common name, official name, country, championship, and so on.
If I want to find, for example, the official names of all teams coached by Allegri, I have to do two queries. The first is on the coach collection:
>var x = db.coach.find({surname:"Allegri"},{_id:0, "coached_Team.team_id":1})
>var AllegriTeams
>while(x.hasNext()) AllegriTeams=x.next()
{
    "coached_Team" : [
        {
            "team_id" : "Juv.26"
        },
        {
            "team_id" : "Mil.74"
        },
        {
            "team_id" : "Cag.00"
        }
    ]
}
>AllegriTeams=AllegriTeams.coached_Team
[
    {
        "team_id" : "Juv.26"
    },
    {
        "team_id" : "Mil.74"
    },
    {
        "team_id" : "Cag.00"
    }
]
And then I have to execute three queries on team collection:
> db.team.find({ _id:AllegriTeams[0].team_id}, {official_name:1,_id:0})
{official_name : "Juventus Football Club S.p.A."}
> db.team.find({ _id:AllegriTeams[1].team_id}, {official_name:1,_id:0})
{official_name : "Associazione Calcio Milan S.p.A"}
> db.team.find({ _id:AllegriTeams[2].team_id}, {official_name:1,_id:0})
{official_name:"Cagliari Calcio S.p.A"}
Now consider that I have about 100k documents in the team and coach collections. The first query, on the coach collection, needs about 71 ms plus the time of the while loop. The three queries on the team collection, according to cursor.explain("executionStats"), need 0 ms. I don't understand why these queries take 0.
I need the executionTimeMillis of these three queries to compute the execution time of the overall task "find the official names of all teams coached by Allegri". I want to add the execution time of the query on the coach collection (71 ms) to the execution time of these three. If the time of these three queries is 0, what can I say about the overall execution time?
I think the more important observation here is that 71 ms is a long time for a simple fetch of one item. It looks like your "surname" field needs an index. The other "three" queries are simple lookups on a primary key, which is why they are relatively fast.
db.coach.createIndex({ "surname": 1 })
If that surname is actually "unique" then add that too:
db.coach.createIndex({ "surname": 1 },{ "unique": true })
You can also simplify your "three" queries into one by simply mapping the array and applying the $in operator:
var teamIds = [];
db.coach.find(
    { "surname": "Allegri" },
    { "_id": 0, "coached_Team.team_id": 1 }
).forEach(function(coach) {
    // flatten each coach's coached_Team entries into one array of ids
    teamIds = coach.coached_Team.map(function(team) {
        return team.team_id;
    }).concat(teamIds);
});

db.team.find(
    { "_id": { "$in": teamIds } },
    { "official_name": 1, "_id": 0 }
).forEach(function(team) {
    printjson(team);
});
Then the overall execution time is certainly way down, and the overhead of multiple operations is reduced to just the two queries required.
Also remember that, despite what the execution plan stats say, the more queries you make to and from the server, the longer the overall real execution time will be, simply from making each request and retrieving the data. So it is best to keep things as minimal as possible.
Therefore it would be even more logical that, where you "need" this information regularly, storing the coach name on the team itself (and indexing that data) leads to the fastest possible response with only a single query operation, as sketched below.
It's easy to get caught up in observing execution stats. But really, think of what is "best" and "fastest" as a pattern for the sort of queries you want to do.
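A minimal sketch of that denormalization; the coach_surname field name is illustrative, not part of the original schema:

// Hypothetical denormalized team document:
// { "_id": "Juv.26", "official_name": "Juventus Football Club S.p.A.",
//   "coach_surname": "Allegri", ... }
db.team.createIndex({ "coach_surname": 1 })

// One indexed query now replaces the coach lookup plus the $in query.
db.team.find(
    { "coach_surname": "Allegri" },
    { "official_name": 1, "_id": 0 }
)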

How do I manage a sublist in MongoDB?

I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type).
I'm interested in using MongoDB to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I also need to maintain a relational list of ids where this particular product is available (e.g., store location ids).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
    name: "O'Reilly Media",
    founded: 1980,
    location: "CA",
    books: [12346789, 234567890, ...]
}
I am currently importing the data from a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to run ensureIndex on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the IDs to insert each new related ID into books?
Your question could be defined a little better, but let's consider the case where you have rows in a spreadsheet or other source that are all de-normalized in some way. In a JSON representation, the rows would look something like this:
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 12346789
},
{
    "publisher": "O'Reilly Media",
    "founded": 1980,
    "location": "CA",
    "book": 234567890
}
In order to get those row results into the structure you want, one way to do it is to use the "upsert" functionality of the .update() method. Assuming you have some way of looping the input values, an analog would be something like:
books.forEach(function(book) {
    db.publishers.update(
        { "name": book.publisher },
        {
            // set only when a new document is created
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            // add the book id if it is not already in the set
            "$addToSet": { "books": book.book }
        },
        { "upsert": true }
    );
});
This essentially simplifies the code so that MongoDB does all of the data collection work for you. Where the "name" of the publisher is considered unique, the statement first searches for a document in the collection that matches the query condition given, i.e. the "name".
Where that document is not found, a new document is inserted. The database or driver takes care of creating the new _id value for this document, and your query "condition" is automatically written to the new document as well, since it is an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
That is simpler logic than aggregating the new records in code before sending an insert operation. However, it is not very "batch"-like, as you are still performing one operation against the server per row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });
    // flush to the server every 500 statements
    if ( batch.length % 500 == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});
// send any remaining statements
if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
What this does is queue all of the constructed update statements and send them to the server in a single call, with a sensible number of operations per batch, in this case one call for every 500 items processed. The actual limit is the BSON document maximum of 16MB, so this can be tuned to suit your data.
If your MongoDB version is lower than 2.6, then you either use the first form or do something similar with the existing batch insert functionality. But if you choose to insert, you need to do all the pre-aggregation work in your own code.
All of these methods are of course supported by the PHP driver, so it is just a matter of adapting this to your actual code and deciding which course you want to take.
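As an aside, on recent shells and drivers the same batching is usually written with bulkWrite; a minimal sketch under the same schema assumptions as above:

// Same upsert logic as above, expressed as a single bulkWrite call.
db.publishers.bulkWrite(
    books.map(function(book) {
        return {
            updateOne: {
                filter: { "name": book.publisher },
                update: {
                    "$setOnInsert": { "founded": book.founded, "location": book.location },
                    "$addToSet": { "books": book.book }
                },
                upsert: true
            }
        };
    }),
    { ordered: false }  // statements are independent, so order does not matter
);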

Can I add or sort by a MongoDB match object?

So I have two collections, Trait and Question. For a given user, I iterate over the user's traits, and I want to query all of the questions that correspond to missing traits:
var match = { "$or": [] };  // note: `match` must be initialized before pushing

linq.From(missingTraits)
    .ForEach(function(trait)
    {
        match.$or.push({ "Trait": trait });
    });

database.collection("Questions", function(err, collection)
{
    collection.find(match).limit(2).toArray(function(err, questions)
    {
        next(err, questions);
    });
});
This works, but I'd like the objects to come back sorted by a field on the Trait document (which is NOT on the Question document):
Traits
[
    { "Name": "Height", "Value": "73", "Importance": 15 },
    { "Name": "Weight", "Value": "230", "Importance": 10 },
    { "Name": "Age", "Value": "29", "Importance": 20 }
]
Questions
[
    { "Trait": "Height", "Text": "How tall are you?" },
    { "Trait": "Weight", "Text": "How much do you weigh?" },
    { "Trait": "Age", "Text": "How old are you?" }
]
So in the above example, if all three traits were missing, I'd want to bring back only Age and Height (in that order, since they have the highest Importance). Is it possible to modify the query or the match object in some way to facilitate this?
Thank you.
If I am understanding your question right, you would like to find the questions for the two most important traits? If your missingTraits array is just the names of the traits ("Height", "Weight", "Age"), then you would want to query the Traits collection to find the two most important before sending the query to the Questions collection.
Unfortunately, modifying just the query on your Questions collection will not work without a separate call to the Traits collection, since that is where the information on importance lives.
A query like
database.collection("Traits", function(err, collection)
{
    collection.find(match).sort({ Importance: -1 }).limit(2).toArray(function(err, traits)
    {
        // cache the traits to query the questions collection
    });
});
will get you the two most important Traits that match the $or query (all matches, sorted by Importance by descending order, limited to 2). You can then use these results to query the Questions collection.
Telling the Questions collection to return the docs in "Age", "Height" order is difficult in this case, so I would recommend caching the importance order of the traits you receive back from the Traits collection and then sorting the results of the query on Questions client-side.
Here are the docs for sort():
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-%7B%7Bsort%28%29%7D%7D
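Coming back to the client-side ordering suggested above, here is a minimal sketch, assuming traits holds the documents returned by the Traits query and questions the matching Question documents (both names are illustrative):

// Build a lookup from trait name to its Importance...
var importance = {};
traits.forEach(function(trait) {
    importance[trait.Name] = trait.Importance;
});

// ...then sort the questions by descending importance of their Trait.
questions.sort(function(a, b) {
    return importance[b.Trait] - importance[a.Trait];
});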
One last thing: if you are populating missingTraits via a database call to the Traits collection, then it should be possible to place the sort() logic in that call, rather than adding a second call to the collection.