MongoDB Text Search on Large Dataset

I have a book collection with currently 7.7 million records, and I have set up the following text index so that I can search the collection by title and author:
db.book.createIndex( { title: "text", author: "text" }, {sparse: true, background: true, weights: {title: 15, author: 5}, name: "text_index"} )
The problem is that when I use a search query that returns a lot of results (e.g. John) and then sort by the textScore, the query takes over 60 seconds.
Please see an example query below:
db.runCommand(
{
aggregate: "book",
pipeline : [
{ $match: { $text: { $search: "John" } } },
{ $sort: { score: { $meta: "textScore" } } },
{ $limit: 6 }
],
allowDiskUse : true
}
)
Can anyone suggest a solution to reduce this search time down to a reasonable level?
Many thanks.

Related

optimize indexes in MongoDB

I have a Order collection with records looking like this:
{
"_id": ObjectId,
"status": String Enum,
"products": [{
"sku": String UUID,
...
}, ...],
...
},
My goal is to find which products users buy together. Given a SKU, I would like to browse past orders and find, for orders that contain more than one product and include the product with the looked-up SKU, which other products were bought along with it.
So I created an aggregation pipeline that works:
[
// exclude cancelled orders
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT'
]
}
}
},
// add a fields with product size and just the products sku
{
'$addFields': {
'size': {
'$size': '$products'
},
'skus': '$products.sku'
}
},
// limit to orders with 2 products or more including the looked up SKU
{
'$match': {
'size': {
'$gte': 2
},
'skus': {
'$elemMatch': {
'$eq': '3516215049767'
}
}
}
},
// group by skus
{
'$unwind': {
'path': '$skus'
}
}, {
'$group': {
'_id': '$skus',
'count': {
'$sum': 1
}
}
},
// sort by count, exclude the looked up sku, limit to 4 results
{
'$sort': {
'count': -1
}
}, {
'$match': {
'_id': {
'$ne': '3516215049767'
}
}
}, {
'$limit': 4
}
]
Although this works, the collection contains more than 10K docs and I have an alert on my MongoDB instance telling me that the ratio Scanned Objects / Returned has gone above 1000.
So my question is: how can my query be improved, and what indexes can I add to help?
db.Orders.stats();
{
size: 14329835,
count: 10571,
avgObjSize: 1355,
storageSize: 4952064,
freeStorageSize: 307200,
capped: false,
nindexes: 2,
indexBuilds: [],
totalIndexSize: 466944,
totalSize: 5419008,
indexSizes: { _id_: 299008, status_1__created_at_1: 167936 },
scaleFactor: 1,
ok: 1,
operationTime: Timestamp({ t: 1635415716, i: 1 })
}

Let's start with rewriting the query a little bit to make it more efficient.
Currently you match all the orders with a certain status and only then start manipulating the data. This means every single stage is doing work on a larger data set than needed.
What we can do is move all the conditions into the first stage. This is made possible by Mongo's dot notation, like so:
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT',
],
},
'products.sku': '3516215049767', // mongo allows you to do this using the dot notation.
'products.1': { $exists: true }, // this requires the array to have at least two elements.
},
},
Now this achieves two things:
We start the pipeline with only the relevant documents, so there is no need to calculate the $size of the array for many irrelevant ones. This alone should improve performance noticeably.
We can now create a compound index that supports this specific query. Before, that was not possible because index usage is limited to the first stage, which only filtered on the status field. (As an aside, Mongo does optimize pipelines, but in this case no optimization was possible due to the use of $addFields.)
The index that I recommend building is:
{ status: 1, "products.sku": 1 }
This will allow the best match to start off your pipeline.
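Putting the rewritten $match together with the remaining stages from the question might look like the sketch below. The helper name buildCoPurchasePipeline and its maxResults parameter are hypothetical; the later stages are the question's own, with $addFields dropped and the looked-up SKU excluded before grouping. Treat it as an illustration rather than a tuned query.

```javascript
// Hypothetical helper: builds the "bought together" pipeline for one SKU.
function buildCoPurchasePipeline(sku, maxResults = 4) {
  return [
    // All filtering happens in the first stage so an index can serve it.
    {
      $match: {
        status: { $nin: ['CANCELLED', 'CHECK_OUT'] },
        'products.sku': sku,
        'products.1': { $exists: true }, // array has at least two elements
      },
    },
    // One document per product in each matching order.
    { $unwind: '$products' },
    // Drop the looked-up SKU itself before counting.
    { $match: { 'products.sku': { $ne: sku } } },
    { $group: { _id: '$products.sku', count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: maxResults },
  ];
}

const pipeline = buildCoPurchasePipeline('3516215049767');
console.log(JSON.stringify(pipeline[0]));
```

In the shell this would be passed to db.Orders.aggregate(pipeline) after creating the index above. Running the aggregation through explain is the way to confirm which index is actually used.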

use geonear with fuzzy search text mongodb

I have the following query
db.hotels.aggregate([
{
$search: {
index:'txtIdx', // this is the index name
text: {
query:"sun",
path:['name','address.landmark','address.city','address.state','address.country'],
fuzzy: {
maxEdits:2,
prefixLength: 1,
},
},
},
},
{
$project: {
_id: 1,
name: 1,
address:1,
score: { $meta: "searchScore" }
}
},
{$limit:100},
])
There is also a field called 'location' in the hotels collection, which holds coordinates as follows:
"location": {
"type": "Point",
"coordinates": [
72.867804,
19.076033
]
}
How can I use geonear with this search query to return only hotels near the user, given a latitude, longitude and distance?
I also tried this query
{
$search: {
index:'searchIndex',
compound: {
must: {
text: {
query:'sun',
path:['name','address.landmark','address.city','address.state','address.country'],
fuzzy: {
maxEdits:2,
prefixLength: 3,
},
},
},
should: {
near:{
origin: {
type: 'Point',
coordinates: [-122.45665489904827,37.75118012951178],
},
pivot: 1000,
path: 'location'
},
}
}
}
},
But the above query returns results that are not even near that location. It returns the same results as 'search' would without 'near'. I have created a 'geo' index for location, but it still doesn't return nearby hotels.
Or is there another way, apart from using geonear along with search? I have been trying for the past 2 days and haven't found anything useful. I also want to use fuzzy text search only. Please let me know if there is a way to solve this.
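One likely reason for the behaviour described: in a compound query, a should clause only influences scoring and does not exclude documents. A possible approach, sketched here and not verified against this data set, is to put the location constraint in a filter clause using the Atlas Search geoWithin operator with a circle. The helper name and its parameters are hypothetical, and the sketch assumes 'location' is mapped as a geo field in the search index named 'searchIndex'.

```javascript
// Hypothetical helper: fuzzy text search restricted to a circle around the user.
function buildNearbySearchStage(term, lng, lat, radiusMeters) {
  return {
    $search: {
      index: 'searchIndex',
      compound: {
        // must: the fuzzy text clause from the question
        must: [{
          text: {
            query: term,
            path: ['name', 'address.landmark', 'address.city', 'address.state', 'address.country'],
            fuzzy: { maxEdits: 2, prefixLength: 1 },
          },
        }],
        // filter: excludes documents outside the circle without changing the score
        filter: [{
          geoWithin: {
            path: 'location',
            circle: {
              // GeoJSON order is [longitude, latitude]
              center: { type: 'Point', coordinates: [lng, lat] },
              radius: radiusMeters,
            },
          },
        }],
      },
    },
  };
}

const stage = buildNearbySearchStage('sun', 72.867804, 19.076033, 5000);
console.log(JSON.stringify(stage.$search.compound.filter));
```

The stage would replace the first element of the aggregate pipeline in the question; the $project and $limit stages can stay as they are.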

How to return a specific element of an array in a document?

I have a table containing documents set up as follows:
{
_id: 1,
name: { first: 'John', last: 'Doe' },
tools: [ 'Tool1', 'Tool2', 'Tool3' ],
skills: [
{ type: 'carpentry',
years: 3 },
{ type: 'plumbing',
year: 5 },
{ type: 'electrical',
year: 8 }
]
}
I need to write a script that can search each document in the table and return the value of a specific skill, for example: Find the number of years John Doe has in plumbing.
Since I don't need the full document, db.table.find({skills: {$elemMatch: {type:'plumbing'}}}) feels unnecessary and would still require me to search the document to find the value I'm looking for. Is there a way to just return the part of the document I'm looking for?
The desired output would be {type: 'plumbing', year: 5} so that I could then manipulate that data into another field in the document.

Try this:
db.collection.aggregate([
{
"$unwind": "$skills"
},
{
"$match": {
"skills.type": "plumbing"
}
},
{
"$project": {
skills: 1
}
}
])
Mongo Playground
OR try this if you only want year.
Mongo Playground 2
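As an alternative to $unwind, the matching element can be picked out with $filter and $arrayElemAt in a single $project. This is a different technique from the answer above, shown as a sketch; the helper name buildSkillPipeline and the output field name skill are hypothetical.

```javascript
// Hypothetical helper: returns only the first skill entry of the given type.
function buildSkillPipeline(skillType) {
  return [
    // Keep only documents that have the skill at all.
    { $match: { 'skills.type': skillType } },
    {
      $project: {
        _id: 0,
        // $filter keeps the matching array elements; $arrayElemAt takes the first one.
        skill: {
          $arrayElemAt: [
            {
              $filter: {
                input: '$skills',
                as: 's',
                cond: { $eq: ['$$s.type', skillType] },
              },
            },
            0,
          ],
        },
      },
    },
  ];
}

console.log(JSON.stringify(buildSkillPipeline('plumbing')));
```

To restrict the lookup to one person, a condition such as 'name.first' could be added to the $match stage.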

Mongoose Aggregate pagination and total number [duplicate]

I am interested in optimizing a "pagination" solution I'm working on with MongoDB. My problem is straightforward. I usually limit the number of documents returned using limit(). This forces me to issue a redundant query without limit() to capture the total number of documents, so I can pass that to the client and let them know they'll have to issue additional requests to retrieve the rest of the documents.
Is there a way to condense this into 1 query? Get the total number of documents but at the same time only retrieve a subset using limit()? Is there a different way to think about this problem than I am approaching it?

MongoDB 3.4 introduced the $facet aggregation stage,
which processes multiple aggregation pipelines within a single stage
on the same set of input documents.
Using $facet and $group, you can fetch a page of documents with $limit and get the total count in the same query.
You can use the aggregation below in MongoDB 3.4:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}}
]
}}
])
You can also use the $count stage, which was likewise introduced in MongoDB 3.4, in place of $group:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$count": "count" }
]
}}
])
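The pattern above can be wrapped in two small helpers, sketched below. The names buildPagePipeline and unpackPage are hypothetical. The sketch assumes 1-based page numbers and moves the filter into a $match before $facet, so that both sub-pipelines see the same filtered input.

```javascript
// Hypothetical helper: one pipeline that returns a page plus the total count.
function buildPagePipeline(filter, page, pageSize) {
  const skip = (page - 1) * pageSize; // page numbers start at 1
  return [
    { $match: filter },
    {
      $facet: {
        totalData: [{ $skip: skip }, { $limit: pageSize }],
        totalCount: [{ $count: 'count' }],
      },
    },
  ];
}

// Hypothetical helper: flattens the single document that $facet returns.
function unpackPage(aggregateResult) {
  const [{ totalData, totalCount }] = aggregateResult;
  // $count emits no document when its input is empty, so totalCount may be [].
  const total = totalCount.length > 0 ? totalCount[0].count : 0;
  return { items: totalData, total };
}

console.log(unpackPage([{ totalData: [], totalCount: [] }]));
```

For stable pages, a $sort stage would normally go between $match and $facet.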

No, there is no other way. Two queries - one for count - one with limit. Or you have to use a different database. Apache Solr for instance works like you want. Every query there is limited and returns totalCount.

MongoDB allows you to use cursor.count() even when you pass limit() or skip().
Let's say you have a db.collection with 10 items.
You can do:
async function getQuery() {
let query = await db.collection.find({}).skip(5).limit(5); // returns last 5 items in db
let countTotal = await query.count() // returns 10-- will not take `skip` or `limit` into consideration
let countWithConstraints = await query.count(true) // returns 5 -- will take into consideration `skip` and `limit`
return { query, countTotal }
}

Here's how to do this with MongoDB 3.4+ (with Mongoose) using $facet. This example returns a $count based on the documents after they have been matched.
const facetedPipeline = [
{ "$match": { "dateCreated": { $gte: new Date('2021-01-01') } } }, // each stage must be its own object
{ "$project": { 'exclude.some.field': 0 } },
{
"$facet": {
"data": [
{ "$skip": 10 },
{ "$limit": 10 }
],
"pagination": [
{ "$count": "total" }
]
}
}
];
const results = await Model.aggregate(facetedPipeline);
This pattern is useful for getting pagination information to return from a REST API.
Reference: MongoDB $facet

Times have changed, and I believe you can achieve what the OP is asking by using aggregation with $sort, $group and $project. For my system, I needed to also grab some user info from my users collection. Hopefully this can answer any questions around that as well. Below is an aggregation pipe. The last three objects (sort, group and project) are what handle getting the total count, then providing pagination capabilities.
db.posts.aggregate([
{ $match: { public: true } },
{ $lookup: {
from: 'users',
localField: 'userId',
foreignField: 'userId',
as: 'userInfo'
} },
{ $project: {
postId: 1,
title: 1,
description: 1,
updated: 1,
userInfo: {
$let: {
vars: {
firstUser: {
$arrayElemAt: ['$userInfo', 0]
}
},
in: {
username: '$$firstUser.username'
}
}
}
} },
{ $sort: { updated: -1 } },
{ $group: {
_id: null,
postCount: { $sum: 1 },
posts: {
$push: '$$ROOT'
}
} },
{ $project: {
_id: 0,
postCount: 1,
posts: {
$slice: [
'$posts',
currentPage ? (currentPage - 1) * RESULTS_PER_PAGE : 0,
RESULTS_PER_PAGE
]
}
} }
])

There is a way in MongoDB 3.4: $facet.
You can do:
db.collection.aggregate([
{
$facet: {
data: [{ $match: {} }],
total: [{ $count: 'total' }]
}
}
])
This lets you run two aggregation pipelines at the same time.

By default, the count() method ignores the effects of the
cursor.skip() and cursor.limit() (MongoDB docs)
As the count method excludes the effects of limit and skip, you can use cursor.count() to get the total count
const cursor = await database.collection(collectionName).find(query).skip(offset).limit(limit)
return {
data: await cursor.toArray(),
count: await cursor.count() // this will give count of all the documents before .skip() and limit()
};

It all depends on the pagination experience you need as to whether or not you need to do two queries.
Do you need to list every single page or even a range of pages? Does anyone even go to page 1051 - conceptually what does that actually mean?
There's been a lot of UX work on pagination patterns. "Avoid the pains of pagination" covers various types of pagination and their scenarios, and many don't need a count query to know whether there's a next page. For example, if you display 10 items on a page and you limit to 13, you'll know if there's another page.
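The "fetch a few more than you show" idea can be sketched as follows. The helper name splitPage is hypothetical, and it uses the minimal variant of the idea: query with a limit of pageSize + 1 (rather than 13 for a page of 10) and check whether the extra item came back.

```javascript
// Hypothetical helper: `fetched` is the result of a query run with limit(pageSize + 1).
function splitPage(fetched, pageSize) {
  return {
    // Only the first pageSize items are shown to the user.
    items: fetched.slice(0, pageSize),
    // An extra item means at least one more page exists.
    hasNextPage: fetched.length > pageSize,
  };
}

console.log(splitPage(['a', 'b', 'c'], 2)); // one extra item was fetched
```

No count query is needed with this approach, at the cost of not knowing the total number of pages.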

MongoDB has introduced a new method for getting only the count of the documents matching a given query and it goes as follows:
const result = await db.collection('foo').count({name: 'bar'});
console.log('result:', result) // prints the matching doc count
Recipe for usage in pagination:
const query = {name: 'bar'};
const skip = (pageNo - 1) * pageSize; // assuming pageNo starts from 1
const limit = pageSize;
const [listResult, countResult] = await Promise.all([
db.collection('foo')
.find(query)
.skip(skip)
.limit(limit)
.toArray(), // find() returns a cursor; toArray() resolves to the documents
db.collection('foo').count(query)
])
return {
totalCount: countResult,
list: listResult
}
For more details on db.collection.count visit this page
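Newer releases of the Node.js driver deprecate collection.count() in favour of countDocuments(). A sketch of the same recipe using that method is shown below; the function name fetchPage is hypothetical, and `collection` is assumed to be a driver Collection object.

```javascript
// Hypothetical helper: runs the page query and the count query in parallel.
async function fetchPage(collection, query, pageNo, pageSize) {
  const skip = (pageNo - 1) * pageSize; // pageNo starts from 1
  const [list, totalCount] = await Promise.all([
    // toArray() turns the cursor into an array of documents
    collection.find(query).skip(skip).limit(pageSize).toArray(),
    // counts every document matching the query, ignoring skip and limit
    collection.countDocuments(query),
  ]);
  return { totalCount, list };
}
```

A call would look like `await fetchPage(db.collection('foo'), { name: 'bar' }, 1, 10)`.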

It is possible to get the total result size without the effect of limit() using count() as answered here:
Limiting results in MongoDB but still getting the full count?
According to the documentation you can even control whether limit/pagination is taken into account when calling count():
https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count
Edit: in contrast to what is written elsewhere - the docs clearly state that "The operation does not perform the query but instead counts the results that would be returned by the query". Which - from my understanding - means that only one query is executed.
Example:
> db.createCollection("test")
{ "ok" : 1 }
> db.test.insert([{name: "first"}, {name: "second"}, {name: "third"},
{name: "forth"}, {name: "fifth"}])
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 5,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
> db.test.find()
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c8"), "name" : "forth" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c9"), "name" : "fifth" }
> db.test.count()
5
> var result = db.test.find().limit(3)
> result
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
> result.count()
5 (total result size of the query without limit)
> result.count(1)
3 (result size with limit(3) taken into account)

Try as below:
cursor.count(false, function(err, total){ console.log("total", total) })
core.db.users.find(query, {}, {skip:0, limit:1}, function(err, cursor){
if(err)
return callback(err);
cursor.toArray(function(err, items){
if(err)
return callback(err);
cursor.count(false, function(err, total){
if(err)
return callback(err);
console.log("cursor", total)
callback(null, {items: items, total:total})
})
})
})

A word of caution about using aggregate for pagination: it's better to use two queries if the API is called frequently by users. In my experience this was at least 50 times faster than fetching the data with aggregate on a production server with many users online. Aggregate and $facet are better suited to dashboards, reports, and cron jobs that are called less frequently.

We can do it using two queries.
const limit = parseInt(req.query.limit || 50, 10);
let page = parseInt(req.query.page || 0, 10);
if (page > 0) { page = page - 1 } // convert a 1-based page number to a 0-based index
let doc = await req.db.collection('bookings').find().sort( { _id: -1 }).skip(page * limit).limit(limit).toArray(); // skip whole pages, not single documents
let count = await req.db.collection('bookings').find().count();
res.json({data: [...doc], count: count});

I took the two queries approach, and the following code has been taken straight out of a project I'm working on, using MongoDB Atlas and a full-text search index:
return new Promise( async (resolve, reject) => {
try {
const search = {
$search: {
index: 'assets',
compound: {
should: [{
text: {
query: args.phraseToSearch,
path: [
'title', 'note'
]
}
}]
}
}
}
const project = {
$project: {
_id: 0,
id: '$_id',
userId: 1,
title: 1,
note: 1,
score: {
$meta: 'searchScore'
}
}
}
const match = {
$match: {
userId: args.userId
}
}
const skip = {
$skip: args.skip
}
const limit = {
$limit: args.first
}
const group = {
$group: {
_id: null,
count: { $sum: 1 }
}
}
const searchAllAssets = await Models.Assets.schema.aggregate([
search, project, match, skip, limit
])
const [ totalNumberOfAssets ] = await Models.Assets.schema.aggregate([
search, project, match, group
])
return await resolve({
searchAllAssets: searchAllAssets,
totalNumberOfAssets: totalNumberOfAssets.count
})
} catch (exception) {
return reject(new Error(exception))
}
})

I had the same problem and came across this question. The correct solution to this problem is posted here.

You can do this in one query. First you run a count and within that run the limit() function.
In Node.js and Express.js, you will have to use it like this to be able to use the "count" function along with the toArray's "result".
var curFind = db.collection('tasks').find(query); // query is your filter object
Then you can run two functions after it like this (one nested in the other)
curFind.count(function (e, count) {
// Use count here
curFind.skip(0).limit(10).toArray(function(err, result) {
// Use result here and count here
});
});

MongoDB: select matched elements of subcollection

I'm using mongoose.js to do queries to mongodb, but I think my problem is not specific to mongoose.js.
Say I have only one record in the collection:
var album = new Album({
tracks: [{
title: 'track0',
language: 'en',
},{
title: 'track1',
language: 'en',
},{
title: 'track2',
language: 'es',
}]
})
I want to select all tracks with language field equal to 'en', so I tried two variants:
Album.find({'tracks.language':'en'}, {'tracks.$': 1}, function(err, albums){
and tried to do the same thing with the $elemMatch projection:
Album.find({}, {tracks: {$elemMatch: {'language': 'en'}}}, function(err, albums){
in either case I've got the same result:
{tracks:[{title: 'track0', language: 'en'}]}
selected album.tracks contain only ONE track element with title 'track0' (but there should be both 'track0', 'track1'):
{tracks:[{title: 'track0', language: 'en'}, {title: 'track1', language: 'en'}]}
What am I doing wrong?

Like @JohnnyHK already said, you'll have to use the aggregation framework to accomplish that because both $ and $elemMatch only return the first match.
Here's how:
db.Album.aggregate(
// This is optional. It might make your query faster if you have
// many albums that don't have any English tracks. Take a larger
// collection and measure the difference. YMMV.
{ $match: {tracks: {$elemMatch: {'language': 'en'}} } },
// This will create an 'intermediate' document for each track
{ $unwind : "$tracks" },
// Now filter out the documents that don't contain an English track
// Note: at this point, documents' 'tracks' element is not an array
{ $match: { "tracks.language" : "en" } },
// Re-group so the output documents have the same structure, ie.
// make tracks a subdocument / array again
{ $group : { _id : "$_id", tracks : { $addToSet : "$tracks" } }}
);
You might want to try that aggregate query with only the first expression and then add expressions line by line to see how the output is changed. It's particularly important to understand how $unwind creates intermediate documents that are later re-merged using $group and $addToSet.
Results:
> db.Album.aggregate(
{ $match: {tracks: {$elemMatch: {'language': 'en'}} } },
{ $unwind : "$tracks" },
{ $match: { "tracks.language" : "en" } },
{ $group : { _id : "$_id", tracks : { $addToSet : "$tracks" } }} );
{
"result" : [
{
"_id" : ObjectId("514217b1c99766f4d210c20b"),
"tracks" : [
{
"title" : "track1",
"language" : "en"
},
{
"title" : "track0",
"language" : "en"
}
]
}
],
"ok" : 1
}
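On MongoDB versions that have the $filter operator (3.2 and later), the same result can be sketched without $unwind and $group. This is a different technique from the answer above, not a restatement of it; the helper name buildTracksByLanguagePipeline is hypothetical.

```javascript
// Hypothetical helper: keeps only the tracks in the given language.
function buildTracksByLanguagePipeline(language) {
  return [
    // Optional pre-filter, as in the answer: skip albums with no matching track.
    { $match: { 'tracks.language': language } },
    {
      $project: {
        // $filter returns every matching element, not just the first one.
        tracks: {
          $filter: {
            input: '$tracks',
            as: 'track',
            cond: { $eq: ['$$track.language', language] },
          },
        },
      },
    },
  ];
}

console.log(JSON.stringify(buildTracksByLanguagePipeline('en')));
```

Unlike $addToSet, which does not guarantee element order (note that track1 comes before track0 in the results above), $filter keeps the tracks in their original array order.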
