MongoDB aggregate search with multiple fields - mongodb

I am trying to build an API for searching jobs.
Frontend input: a single keyword field containing a string
API response: a list of jobs that match the keyword in any of the following fields
skills
location
company
Schemas
1. Job schema
title: {
  type: String,
  required: true,
},
location: {
  type: mongoose.Schema.Types.ObjectId,
  ref: 'location',
},
skills: [{
  type: mongoose.Schema.Types.ObjectId,
  ref: 'Skill'
}],
company: {
  type: mongoose.Schema.Types.ObjectId,
  ref: 'company',
},
As you can see, skills, location and company are references to other collections, and the frontend does not say which of them the keyword refers to, so I am not sure how to write an effective search query.
My current approach is:
Find the skill_id based on the skill name and fetch all jobs that have the desired skill
Do the same for location and company
But I am not sure this is the right approach. Can somebody advise a proper way of doing this?

Many strategies can be applied in your case (yours, AdamExchange's, an aggregation with $lookup stages...), depending on the size of the collections, the indexes, etc.
But I think you really have to look at indexes and index intersection strategies to optimize your query.
I would:
First create three single-field indexes, on skill.name, location.name and company.name, so that you can find the ids in the different collections using an index.
Create single-field indexes on the job collection: location, skills, company.
Then you can simply run your queries like this (assuming MyKeyword is the value of your frontend field) [mongo shell style, since I don't know the language you use]:
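// Look up the _id of each referenced document by name (served by the name indexes).
var skill = db.skill.findOne({ name: MyKeyword }, { _id: 1 });
var location = db.location.findOne({ name: MyKeyword }, { _id: 1 });
var company = db.company.findOne({ name: MyKeyword }, { _id: 1 });

// Add a clause only for lookups that matched: comparing a field against null
// would also match jobs where that field is missing.
var clauses = [];
if (skill) clauses.push({ skills: skill._id });
if (location) clauses.push({ location: location._id });
if (company) clauses.push({ company: company._id });

// $or requires a non-empty array, so skip the query when nothing matched.
var jobs = clauses.length > 0 ? db.job.find({ $or: clauses }).toArray() : [];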
This way the lookups on the 'secondary' collections use the name indexes, and each clause of the $or condition on the main collection can use its own single-field index.

Related

MongoDB: index enums and nullable fields for search?

I have a "log" type of collection where, depending on the source, there might be some id fields. I want to search those fields with queries, but not sort by them. Should I index them to improve search performance? To visualize the problem:
[{
  _id: ObjectID("..."), // unique
  userId: ObjectID("..."), // not unique
  createdAt: ...,
  type: 'USER_CREATED'
},
{
  _id: ObjectID("..."), // unique
  basketId: ObjectID("..."), // not unique
  createdAt: ...,
  type: 'BASKET_CREATED'
},
...]
I want to filter by the (nullable) userId or basketId as well as by the type enum. I am not sure whether those fields need an index. createdAt certainly does, since it is sortable. But what about sparse fields containing enums or null (and simply non-unique) values: how should those be treated as a rule of thumb?

Structuring MongoDB schema

I'm playing around with MongoDB and was wondering what best practices are for how a SQL-ish schema may correspond to MongoDB. Here are the tables/data I have so far:
user
  id
  email
  name
answer
  user_id (FK user.id)
  tag
  upvotes
repo
  id
  owner
  name
  description
  stars
repo_tag
  repo_id (FK to repo.id)
  tag
  is_language
  percentage
repo_contrib
  repo_id (FK to repo.id)
  user_id (FK to user.id)
  lines_of_code
The structure goes something like this:
user
  answer (left outer)
  repo_contrib (left outer)
    repo
      repo_tag
Note: Every user will have at least one answer or one repo, but does not necessarily have both.
How might I put this into a Mongo schema? Would this be one 'collection'? Or would this be two collections, one for user and one for repo, or more?
My queries will be something like: "Grab all users with an Answer with tag [Python] with more than 2 upvotes or a repo with the [Python] tag with more than two stars."
Let me divide this to couple of steps:
STEP 1 - MONGODB and MONGOOSE
MongoDB is a document based database. Each record in a collection is a document, and every document should be self-contained (it should contain all information that you need inside it).
Since MongoDB is a non-relational database, you cannot create relations between collections, but you can store a reference to a document of one collection as a property of a document in another collection. To help you manage all of this, there is a great package called Mongoose, which allows you to create a Model for each Collection. After you define the Models, Mongoose lets you easily query the database.
STEP 2 - DEFINING MODELS
As we said, documents should be self-contained, so they should have all information that you need inside them. We can have 2 approaches based on your example:
APPROACH 1:
Create one collection for each table that you have in your relational database. This is the best practice when you have documents with a lot of data, because it is scalable.
APPROACH 2:
Create 3 Collections - USERS, ANSWERS and REPOS. Because repo_contrib does not have a lot of data, you can store all user's contributions in a USERS document. That way, when you fetch a User document, you will have everything that you need in one place. The same goes for repo_tag - we can store all repo's tags in a REPOS document.
APPROACH 3:
Create 2 Collections - USERS and REPOS. The same as APPROACH 2, but you can also add all user's answers to the USERS document.
RECOMMENDATION:
I would go with APPROACH 2 in this case, since repo_contrib and repo_tag do not store much data and can easily be stored in the USERS and REPOS documents. This approach also makes querying the database a lot easier. The reason I didn't choose APPROACH 3 is that a user can theoretically have thousands or tens of thousands of answers, and that would not scale well.
STEP 3 - IMPLEMENTATION
NOTE: MongoDB will automatically assign _id to each document, so you don't have to define id property when implementing Models.
Tables from your relational database example can be mapped to collections like this (This implementation is for APPROACH 2):
USERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
  email: { type: String, required: true, unique: true },
  name: { type: String, required: true, unique: false },
  // One entry per repo the user contributed to (embedded repo_contrib rows).
  contributions: [{
    repo_id: { type: mongoose.Schema.Types.ObjectId, ref: 'REPOS' },
    lines_of_code: { type: Number }
  }]
});
const Users = mongoose.model('USERS', schema);
module.exports = Users;
ANSWERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
  user_id: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
  tag: { type: String, required: true, unique: false },
  upvotes: { type: Number, default: 0, unique: false }
});
const Answers = mongoose.model('ANSWERS', schema);
module.exports = Answers;
REPOS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
  owner: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
  name: { type: String, required: true, unique: false },
  description: { type: String, required: false, unique: false },
  stars: { type: Number, default: 0, unique: false },
  tags: [{
    name: { type: String, required: true, unique: false },
    is_language: { type: Boolean, required: true, unique: false },
    percentage: { type: Number, default: 0, unique: false }
  }]
});
const Repos = mongoose.model('REPOS', schema);
module.exports = Repos;
STEP 4 - POPULATION AND DATABASE QUERIES
One of the best features of Mongoose is called population. If you store a reference to a document of one collection as a property of a document in another collection, Mongoose can replace the references with the actual documents when you call populate() on the query.
Example 1:
Let us first take as an example the first query that you suggested: Find all users with an Answer with tag [Python] with more than 2 upvotes. Since we stored user_id in ANSWERS Collection as a reference to the document from the USERS collection, that means that we can just query the ANSWERS Collection, and when returning the final result Mongoose will go to USERS collection and replace the references with the actual User documents. The database query that will perform this looks like this:
const ANSWERS = require('../models/answers');
ANSWERS.find({
  "tag": "Python",
  "upvotes": {
    "$gt": 2
  }
}).populate('user_id');
Example 2:
Second query that you have suggested is: Find all repos with the [Python] tag with more than two stars. Since we are storing all repo's tags in one array, we just need to check if that array contains an item with the name field that is equal to Python, and that the repo's stars fields is greater than 2. The database query that will perform this looks like this:
const REPOS = require('../models/repos');
REPOS.find({
  "tags.name": "Python",
  "stars": {
    "$gt": 2
  }
})
Here is also the working example: https://mongoplayground.net/p/rgBtVVDgPzG
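Neither example returns users for the combined "or" condition from the question. A minimal sketch of how the two result sets might be merged into one USERS filter under APPROACH 2 follows. The helper name buildUserFilter is hypothetical, the answers are assumed to be fetched without populate(), and "has a repo" is interpreted as "contributed to a repo", which is an assumption:

```javascript
// Hypothetical helper (not part of Mongoose): builds a USERS filter from the
// documents returned by the two example queries.
function buildUserFilter(answers, repos) {
  // De-duplicate user ids by their string form while keeping the original values.
  const userIds = [
    ...new Map(answers.map((a) => [String(a.user_id), a.user_id])).values(),
  ];
  const repoIds = repos.map((r) => r._id);
  return {
    $or: [
      { _id: { $in: userIds } },
      { "contributions.repo_id": { $in: repoIds } },
    ],
  };
}

// Made-up sample results standing in for the two queries.
const filter = buildUserFilter(
  [{ user_id: "u1" }, { user_id: "u1" }, { user_id: "u2" }],
  [{ _id: "r1" }]
);
console.log(JSON.stringify(filter));
```

The resulting object could then be passed to find() on the USERS model.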
Designing a database model is very complex most of the time, and I guess you are looking for best practices from a reputable source. I think this is the missing point in the other answers, even if @NenadMilosavljevic got close to it.
Brief introduction on NoSQL modeling
You are probably used to modeling SQL databases; NoSQL modeling is totally different. These are some of the differences:
SQL modeling: This type of modeling is "data-oriented" in the sense that it is designed to be used and shared by many applications. Data is normalized, generally using normal forms, to avoid duplication and to make future changes easier and with the lowest downtime possible. You start from requirements analysis, then the conceptual design, and in the end the physical design.
NoSQL modeling: This type of modeling is "application-oriented" because it should be built from the requirements of a single application, in order to reach the maximum level of optimization. If you want to optimize your application, you need to start from the app itself and from the operations needed: this is the so-called workload. After that come the conceptual and physical design, of course.
I want to focus on the workload a little more because it is very important. Since you come from a SQL-based application, you can describe the workload starting from various scenarios, production logs and statistics. For each query you need, these parameters are essential:
Size of data requested
Frequency of the query
The complexity of the operations involved
Returning to the original question: "My queries will be something like..." is not enough for me to help you on building a NoSQL model. There are many solutions for your problem but, unless you provide a lot more info on the queries you need to do, they are all correct.
@NenadMilosavljevic gave you multiple approaches, but I can't say whether the second one is the correct one, for the reasons I gave above. For example, he suggests keeping a user and the user's contributions together so that a single query retrieves them, instead of doing JOINs or something even more expensive.
This is certainly clever but suppose, and it is probably not your case, that you have to update the user contributions very often, then in this case keeping them in a separate collection might be better.
What I am saying is that too many assumptions are missing and the solution we give you could be good but not optimal.
Honestly, it is not clear to me whether you need a trivial conversion from a SQL model to a NoSQL model, or whether you are trying to apply NoSQL principles. I have no idea about the size of your database, but if performance is not a problem, just go with the solution you find most appropriate. Researching how to better model your data would be a waste of time.
Instead, if you really need to design a NoSQL database, not a SQL-like NoSQL database, then my advice is to follow this course. Actually, you could finish it in less than 5 hours and many lessons are unnecessary in your situation, but it is worth taking a look at it. No one here talked about patterns and how to deal with one-to-zillion relationships for instance. It is very important to know their existence unless you like to redesign your database when it is too late.
Here is my suggestion: use 3 collections, namely user, repo and answer. Below are the schemas for reference.
user collection
  id: String
  email: String
  name: String
repo collection
  id: String
  owner: String
  name: String
  description: String
  tag: [String] // Array of strings
  contributors: [String] // Array of user ids
I suggest having another collection named answer, because a user could provide a lot of answers. Having them in a separate collection will be easier to query compared to having them inside subdocuments of the user collection.
answer collection
  answer_id
  user_id
  tag
  upvotes
I hope it is helpful.
Mongo Schema design 101: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1.
If in SQL you think about data in an object-oriented way (your models represent some business entities and you build functions around them), in Mongo you should think in a functional way: what data you have as input and what you need as output. In other words, your schema should be based on the queries you need to run, not on the data you have.
It is a bit more tricky, as there is no best way: any schema will be better for some queries than for others. You will need to choose which queries should be prioritized. To make it even more interesting, you will need to anticipate which queries you may have in the future.
And of course it's all about data size. If it fits into a single server, you have the luxury of aggregation lookups to "join" collections. Otherwise sharding will significantly restrict your choices.
On the other hand, embedding should be used with care: document size cannot exceed 16MB, and modifying embedded documents is not that straightforward.
Last but not least, consider indexes. Your schema should allow efficient indexes for your queries. Here you will need to consider not only data size but also its quality (selectivity/cardinality).
In light of the foregoing, the best schema to "grab all users with an Answer with tag [Python] with more than 2 upvotes or a repo with the [Python] tag with more than two stars" would be two collections:
user:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"id": {
"bsonType": "objectId"
},
"email": {
"bsonType": "string"
},
"name": {
"bsonType": "string"
},
"answers": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"tag": {
"bsonType": "string"
},
"upvotes": {
"bsonType": "int"
}
},
"required": [
"tag",
"upvotes"
]
}
]
}
},
"required": [
"id",
"email",
"name",
"answers"
]
}
repo:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"id": {
"bsonType": "objectId"
},
"owner": {
"bsonType": "string"
},
"name": {
"bsonType": "string"
},
"description": {
"bsonType": "string"
},
"stars": {
"bsonType": "int"
},
"tags": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"tag": {
"bsonType": "string"
},
"is_language": {
"bsonType": "bool"
},
"percentage": {
"bsonType": "double"
}
},
"required": [
"tag",
"is_language",
"percentage"
]
}
]
},
"contributors": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"user_id": {
"bsonType": "objectId"
},
"lines_of_code": {
"bsonType": "int"
}
},
"required": [
"user_id",
"lines_of_code"
]
}
]
}
},
"required": [
"id",
"owner",
"name",
"description",
"stars"
]
}
with the queries:
db.user.find({answers: {$elemMatch:{tag:"Python", upvotes:{$gt:2}}}})
db.repo.find({"tags.tag":"Python", stars:{$gt:2}})
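The first query uses $elemMatch because both conditions must hold for the same answer. A small in-memory sketch of the difference, using made-up documents and plain array functions rather than a real database:

```javascript
// Made-up sample documents for illustration.
const users = [
  { name: "alice", answers: [{ tag: "Python", upvotes: 5 }] },
  { name: "bob", answers: [{ tag: "Python", upvotes: 1 }, { tag: "Go", upvotes: 9 }] },
];

// Like $elemMatch: a single array element must satisfy both conditions.
const sameElement = users
  .filter((u) => u.answers.some((a) => a.tag === "Python" && a.upvotes > 2))
  .map((u) => u.name);

// Like {"answers.tag": "Python", "answers.upvotes": {$gt: 2}}:
// each condition may be satisfied by a different element.
const anyElement = users
  .filter(
    (u) =>
      u.answers.some((a) => a.tag === "Python") &&
      u.answers.some((a) => a.upvotes > 2)
  )
  .map((u) => u.name);

console.log(sameElement); // [ 'alice' ]
console.log(anyElement); // [ 'alice', 'bob' ]
```

Here bob has a Python answer and a highly upvoted answer, but they are not the same answer, so only the $elemMatch form excludes him.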
In the comment you mentioned something like "to get all the repos for a given user". Assuming it's about contributors, otherwise you don't need this array at all.
The query will be:
db.repo.find({"contributors.user_id": ObjectId("12313212313232")})

How to build a MongoDB query that combines two fields temporarily?

I have a schema which has one field named ownerId and an array field named participantsIds. In the frontend, users can select participants. I use these ids to filter documents by querying participantsIds with the $all operator and the list of participant ids from the frontend. This works perfectly, except that the participantsIds in the document don't include the ownerId. I thought about using aggregate to add a new field that holds a list like [...participantsIds, ownerId], then querying against this new field with $all, and finally deleting the field again since it isn't needed in the frontend.
What would such a query look like, or is there a better way to achieve this behavior? I'm really lost right now, since I have been trying to implement this with mongo_dart for the last 3 hours.
This is how the schema looks like:
{
  _id: ObjectId(),
  title: 'Title of the Event',
  startDate: '2020-09-09T00:00:00.000',
  endDate: '2020-09-09T00:00:00.000',
  startHour: 1,
  durationHours: 1,
  ownerId: '5f57ff55202b0e00065fbd10',
  participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13'],
  classesIds: [],
  categoriesIds: [],
  roomsIds: [],
  creationTime: '2020-09-10T16:42:14.966',
  description: 'Some Desc'
}
Tl;dr I want to query documents with the $all operator on the participantsIds field but the ownerId should be included in this query.
What I want is instead of querying against:
participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13']
I want to query against:
participantsIds: ['5f57ff55202b0e00065fbd14', '5f57ff55202b0e00065fbd15', '5f57ff55202b0e00065fbd13', '5f57ff55202b0e00065fbd10']
Just having fun here. By the way, if you run this query frequently it is better to use Joe's answer, or even better, to maintain an "all" field at insertion time.
Additional note: use a projection at the start or end of the pipeline to return only what you need.
https://mongoplayground.net/p/UP_-IUGenGp
db.collection.aggregate([
  {
    "$addFields": {
      "all": {
        $setUnion: [
          "$participantsIds",
          ["$ownerId"]
        ]
      }
    }
  },
  {
    $match: {
      all: {
        $all: [
          "5f57ff55202b0e00065fbd14",
          "5f57ff55202b0e00065fbd15",
          "5f57ff55202b0e00065fbd13",
          "5f57ff55202b0e00065fbd10"
        ]
      }
    }
  }
])
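To also remove the temporary field before the result reaches the frontend, a final exclusion $project stage can be appended. A minimal sketch that builds such a pipeline as a plain array; the helper name buildParticipantsPipeline is hypothetical, and the field name "all" is taken from the pipeline above:

```javascript
// Hypothetical helper: returns the aggregation pipeline for a list of selected ids.
function buildParticipantsPipeline(selectedIds) {
  return [
    // Temporary field: participants plus the owner, without duplicates.
    { $addFields: { all: { $setUnion: ["$participantsIds", ["$ownerId"]] } } },
    // Keep documents whose combined list contains every selected id.
    { $match: { all: { $all: selectedIds } } },
    // Drop the temporary field so it never reaches the frontend.
    { $project: { all: 0 } },
  ];
}

const pipeline = buildParticipantsPipeline([
  "5f57ff55202b0e00065fbd14",
  "5f57ff55202b0e00065fbd10",
]);
console.log(JSON.stringify(pipeline, null, 2));
```

The returned array is what would be passed to aggregate(); how that call looks in mongo_dart is not shown here.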
I didn't fully understand what you want to do, but maybe this helps:
db.collection.find({
  ownerId: "5f57ff55202b0e00065fbd10",
  participantsIds: {
    $all: ['5f57ff55202b0e00065fbd14',
      '5f57ff55202b0e00065fbd15',
      '5f57ff55202b0e00065fbd13']
  }
})
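You could use the pipeline form of update (available since MongoDB 4.2) to either add the owner to the participant list or add a new consolidated field. updateMany is used here because update only modifies the first matching document by default:
db.collection.updateMany({}, [{ $set: {
  allParticipantsIds: { $setUnion: [
    "$participantsIds",
    ["$ownerId"]
  ] }
} }])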

What's the strategy for creating index for this query?

I have a comment model for each thread,
const CommentSchema = new mongoose.Schema({
  author: { type: ObjectID, required: true, ref: 'User' },
  thread: { type: ObjectID, required: true, ref: 'Thread' },
  parent: { type: ObjectID, required: true, ref: 'Comment' },
  text: { type: String, required: true },
}, {
  timestamps: true,
});
Besides a single query via _id, I want to query the database via this way:
Range query
const query = {
thread: req.query.threadID,
_id: { $gt: req.query.startFrom }
};
CommentModel.find(query).limit(req.query.limit);
My intention here is to find the comments related to a thread and then get part of the result. The query seems to work as expected. My questions are:
Is this the right way to fulfill my requirement?
How do I properly index the fields? Is this a compound index, or do I need to index each field separately? I checked the result of explain(), and it seems that as long as one of the query fields has an index, inputStage.stage is always IXSCAN rather than COLLSCAN. Is this the key information for checking the performance of the query?
Does it mean that every time I need to find by some field, I need to make an index for that field? Let's say that I want to find all the comments posted by an author in a specific thread.
Code like this:
const query = {
thread: req.query.threadID,
author: req.query.authorID,
};
Do I need to create a compound index for this requirement?
If you want to query by several fields at once, a compound index is usually the best fit.
For example, for this query:
const query = {
  thread: req.query.threadID,
  author: req.query.authorID,
};
you would create a compound index like this:
db.comments.createIndex( { "thread": 1, "author": 1 } );
This index supports queries on the thread field alone as well as on both the thread and author fields:
db.comments.find( { thread: "threadName" } )
db.comments.find( { thread: "threadName", author: "authorName" } )
The order of the fields inside the query document does not matter, so this one is supported as well:
db.comments.find( { author: "authorName", thread: "threadName" } )
but a query on author alone cannot use the index efficiently, because author is not a prefix of the index:
db.comments.find( { author: "authorName" } )
Say you create an index on { "thread": 1, "author": 1, "parent": 1 }. Then the index supports queries on the following field combinations (the index prefixes):
the thread field,
the thread field and the author field,
the thread field and the author field and the parent field.
But it does not support queries on:
the author field,
the parent field, or
the author and parent fields.
For more read this
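The prefix rule can be sketched as a small check. The helper name queryMatchesIndexPrefix is hypothetical, and this is a simplification: the query planner may still use an index partially for other field combinations.

```javascript
// Hypothetical helper: true when the fields used in `query` are exactly the
// first N fields of the compound index, in any order within the query.
function queryMatchesIndexPrefix(indexFields, query) {
  const queryFields = Object.keys(query);
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = indexFields.slice(0, queryFields.length);
  return prefix.every((field) => queryFields.includes(field));
}

const index = ["thread", "author", "parent"];
console.log(queryMatchesIndexPrefix(index, { thread: "t1" })); // true
console.log(queryMatchesIndexPrefix(index, { author: "a1", thread: "t1" })); // true
console.log(queryMatchesIndexPrefix(index, { author: "a1" })); // false
console.log(queryMatchesIndexPrefix(index, { author: "a1", parent: "p1" })); // false
```

For a real query, explain() remains the authoritative way to see which index was chosen.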

Adding indexes in MongoDB

I currently have a mongodb database which is pretty unstructured. I am attempting to extract all the followers of a given set of profiles on twitter. My database looks like this:
{'123':1
'123':2
'123':3
'567':8
'567':9
}
Where each key is a user and the value is a single follower. When I attempt to create an index on these keys, I simply run out of the available index as I have a lot of users (8 million). After googling, I find that the maximum number of indexes I can have is about 64. How do I create a proper indexing on this database? OR would you suggest a different way for me to store my data?
You should structure your data differently.
I would recommend you to have a collection of "user" documents, where every user has an array "followers". This array should be filled with unique identifiers of the users who follow (like name, _id or your own ID number).
{ name: "userA",
  followers: [
    "userB",
    "userC"
  ]
},
{ name: "userB",
  followers: [
    "userD",
    "userF"
  ]
},
You can then create an index on the followers field to quickly find all users that a given user follows. When you want to find all users who are followed by the users "userX", "userY" and "userZ", you would do it with this query:
db.users.find({followers: { $all: ["userX", "userY", "userZ" ] } });
Edit:
To add a follower to a user, use the $push operator (or $addToSet if the same follower must not be added twice):
db.users.update({name:"userA"}, { $push: { followers: "userB" } } );
The $pull operator can be used to remove array entries:
db.users.update({name:"userA"}, { $pull: { followers: "userB" } } );
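Since the followers array is meant to hold unique identifiers, it may help to see how the array operators behave on duplicates. The sketch below imitates $push, $addToSet and $pull with plain array functions; it is an in-memory illustration, not driver code:

```javascript
// In-memory stand-ins for the three update operators (illustration only).
const push = (arr, v) => [...arr, v]; // like $push: always appends
const addToSet = (arr, v) => (arr.includes(v) ? arr : [...arr, v]); // like $addToSet
const pull = (arr, v) => arr.filter((x) => x !== v); // like $pull: removes every match

const followers = ["userB", "userC"];
console.log(push(followers, "userB")); // [ 'userB', 'userC', 'userB' ]
console.log(addToSet(followers, "userB")); // [ 'userB', 'userC' ]
console.log(pull(push(followers, "userB"), "userB")); // [ 'userC' ]
```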