MongoDB query optimization - running query in parallel

I am trying to run some wildcard/regex-based queries on a Mongo cluster from the Java driver.
Mongo Replica Set config:
3-member replica set
16 CPUs (hyperthreaded), 24 GB RAM, Linux x86_64
Collection size: 6M documents, 7 GB of data
Client is localhost (Mac OS X 10.8) with the latest mongo-java driver
Query using the Java driver with readPref = primaryPreferred:
{ "$and" : [{ "$or" : [ { "country" : "united states"}]} , { "$or" : [ { "registering_organization" : { "$regex" : "^.*itt.*hartford.*$"}} , { "registering_organization" : { "$regex" : "^.*met.*life.*$"}} , { "registering_organization" : { "$regex" : "^.*cardinal.*health.*$"}}]}]}
I have a regular index on both "country" and "registering_organization". But as per the Mongo docs a single query can utilize only one index, and I can see that from explain() on the above query as well.
So my question is: what is the best alternative to achieve better performance for the above query?
Should I break up the 'and' operations and do the intersection in memory? Going further, I will have 'not' operations in the query too.
I think my application may turn into reporting/analytics in the future, but that's further down the line and I am not looking to design for it yet.

There are so many things wrong with this query.
Your nested conditionals with regexes will never be fast in MongoDB. MongoDB is not the best tool for "data discovery" (i.e. ad-hoc, multi-conditional queries for uncovering unknown information). MongoDB is blazing fast when you know the metrics you are generating, but not for data discovery.
If this is a common query you are running, then I would create an attribute called "united_states_or_health_care", and set the value to the timestamp of the create date. With this method, you are moving your logic from your query to your document schema. This is one common way to think about scaling with MongoDB.
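As a rough sketch of what that could look like (the "registrations" collection name, field values, and dates here are illustrative, not from the original question):
// Sketch only: collection name and example values are assumptions.
// At insert time the application sets the precomputed flag, using the create date as its value:
db.registrations.insert({
    country: "united states",
    registering_organization: "itt hartford",
    united_states_or_health_care: new Date()   // the flag doubles as the create timestamp
});

// The flag can then be indexed and queried directly, with no regexes at read time:
db.registrations.ensureIndex({ united_states_or_health_care: 1 });
db.registrations.find({ united_states_or_health_care: { $gte: ISODate("2013-01-01") } });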
If you are doing data discovery, you have a few different options:
Have your application concatenate the results of the different queries
Run query on a secondary MongoDB, and accept slower performance
Pipe your data to Postgresql using mosql. Postgres will run these data-discovery queries much faster.
Another Tip:
Your regexes are not anchored in a way that can be fast. It would be best to run your "registering_organization" attribute through a "findable_registering_organization" filter. The filter would break the organization apart into an array of queryable name subsets, and you would quit using the regexes. +2 points if you can filter incoming names by an industry lookup.
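A possible shape for that filter, as a sketch (the helper function, the token field, and the collection name are hypothetical):
// Hypothetical helper: break an organization name into lowercase tokens.
function findableRegisteringOrganization(name) {
    return name.toLowerCase()
               .split(/[^a-z0-9]+/)
               .filter(function (token) { return token.length > 0; });
}

// Backfill the token array onto existing documents:
db.registrations.find({}, { registering_organization: 1 }).forEach(function (doc) {
    db.registrations.update(
        { _id: doc._id },
        { $set: { findable_registering_organization: findableRegisteringOrganization(doc.registering_organization) } }
    );
});

// A multikey index on the token array turns the regex search into exact matches:
db.registrations.ensureIndex({ findable_registering_organization: 1 });
db.registrations.find({ findable_registering_organization: { $all: ["met", "life"] } });
The country condition can then be applied as a plain equality filter on top of the indexed token match.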

Related

Deleting large amounts of data from MongoDB

I have the following code that currently works. It goes through and finds every file that's newer than a specified date and that matches a regex, then deletes it, as well as the chunks that are pointing to it.
conn = new Mongo("<url>");
db = conn.getDB("<project>");
res = db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/});
while (res.hasNext()) {
    var tmp = res.next();
    db.getCollection('fs.chunks').remove({"files_id" : tmp._id});
    db.fs.files.remove({ "_id" : tmp._id});
}
It's extremely slow, and most of the time, the client I'm running it from just times out.
Also, I know that I'm deleting files from the filesystem, and not from the normal collections. It's a long story, but the code above does exactly what I want it to do.
How can I get this to run faster? It was pointed out to me earlier that I'm running this code on a client, but is it possible to run it on the server side? Before this, I was trying to use the JavaScript driver, which is probably why that came up. I assume using the Mongo shell executes everything on the server.
Any help would be appreciated. So close but so far...
I know that I'm deleting files from the filesystem, and not from the normal collections
GridFS is a specification for storing binary data in MongoDB, so you are actually deleting documents from MongoDB collections rather than files from the filesystem.
It was pointed out to me earlier that I'm running this code on a client, but it's possible to run it on the server side?
The majority of your code (queries and commands) is being executed by your MongoDB server. The client (mongo shell, in this case) isn't doing any significant processing.
It's extremely slow, and most of the time, the client I'm running it from just times out.
You need to investigate where the time is being spent.
If there is problematic network latency between your mongo shell and your deployment, you could consider running the query from a mongo shell session closer to the deployment (if possible) or use query criteria matching a smaller range of documents.
Another obvious candidate to look into would be server resources. For example, is deleting a large number of documents putting pressure on your I/O or RAM? Reducing the number of documents you delete in each script run may also help in this case.
db.fs.files.find({"uploadDate" : { $gte : new ISODate("2017-04-04")}}, {filename : /.*(png)/})
This query likely isn't doing what you intended: the filename is being provided as the second option to find() (so is used for projection rather than search criteria) and the regex matches a filename containing png anywhere (for example: typng.doc).
I assume using the Mongo shell executes everything on the server.
That's an incorrect general assumption. The mongo shell can evaluate local functions, so depending on your code there may be aspects that are executed/evaluated in a client context rather than a server context. Your example code is running queries/commands which are processed on the server, but fs.files documents returned from your find() query are being accessed in the mongo shell in order to construct the query to remove related documents in fs.chunks.
How can I get this to run faster?
In addition to comments noted above, there are a few code changes you can make to improve efficiency. In particular, you are currently removing chunk documents individually. There is a Bulk API in MongoDB 2.6+ which will reduce the round trips required per batch of deletes.
Some additional suggestions to try to improve the speed:
Add an index on {uploadDate:1, filename: 1} to support your find() query:
db.fs.files.createIndex({uploadDate:1, filename: 1})
Use the Bulk API to remove matching chunk documents rather than individual removes:
while (res.hasNext()) {
    var tmp = res.next();
    // Queue and execute the removal of all chunks for this file in one bulk operation
    var bulk = db.fs.chunks.initializeUnorderedBulkOp();
    bulk.find( {"files_id" : tmp._id} ).remove();
    bulk.execute();
    // Then remove the fs.files document itself
    db.fs.files.remove({ "_id" : tmp._id});
}
Add a projection to the fs.files query to only include the fields you need:
var res = db.fs.files.find(
    // query criteria
    {
        uploadDate: { $gte: new ISODate("2017-04-04") },
        // Filenames that end in png
        filename: /\.png$/
    },
    // Only include the _id field
    { _id: 1 }
)
Note: unless you've added a lot of metadata to your GridFS files (or have a lot of files to remove) this may not have a significant impact. The default fs.files documents are ~130 bytes, but the only field you require is _id (a 12 byte ObjectId).
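Putting those pieces together, a sketch of what the whole pass might look like (same example date and extension as above; adjust the criteria to your own data, and consider smaller batches if the result set is very large):
// Find only the _id of matching fs.files documents.
var res = db.fs.files.find(
    { uploadDate: { $gte: new ISODate("2017-04-04") }, filename: /\.png$/ },
    { _id: 1 }
);

// Queue all chunk removals into a single unordered bulk operation.
var bulk = db.fs.chunks.initializeUnorderedBulkOp();
var fileIds = [];
res.forEach(function (doc) {
    bulk.find({ files_id: doc._id }).remove();
    fileIds.push(doc._id);
});

if (fileIds.length > 0) {
    bulk.execute();                                  // one round trip per batch of chunk removals
    db.fs.files.remove({ _id: { $in: fileIds } });   // then remove the matching fs.files documents
}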

MongoDB text search for a large collection

Given the collection below, which could potentially grow to ~18 million documents, I need search functionality on the payload part of the document.
Because of the large volume of data, will it create performance issues if I create a text index on the payload field in the document? Are there any known performance issues when the collection contains millions of documents?
{
    "_id" : ObjectId("5575e388e4b001976b5e570d"),
    "createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
    "env" : "prod",
    "messageId" : "my-message-id-1",
    "payload" : "message payload typically 500-1000 bytes of string data"
}
I use MongoDB 3.0.3
I believe that is exactly what NoSQL DBs were designed to do: give you quick access to a piece of data via an [inverted] index. Mongo is designed for that. NoSQL DBs like Mongo are designed to handle massive sets of data distributed across multiple nodes in a cluster. 18 million documents in the scope of Mongo is pretty small. You should not have any performance problems if you index properly. You might want to read up on sharding also; it is key to getting the best performance out of your MongoDB.
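For reference, a minimal sketch of creating and using a text index on the payload field (the "messages" collection name here is an assumption):
// Create a text index on the payload field (builds an inverted index over its terms).
db.messages.createIndex({ payload: "text" });

// Query it, returning the best-matching documents first:
db.messages.find(
    { $text: { $search: "payload string data" } },
    { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } }).limit(10);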
You can use MongoDB Atlas Search, where you can search your text using the different analyzers that MongoDB provides. You can then do a fuzzy search, where text close to your query text will also be returned:
PS: For a full-text match that ignores fuzziness, just exclude the fuzzy object from the stage below.
{
    $search: {
        index: 'analyzer_name_created_from_atlas_search',
        text: {
            query: 'message payload typically 500-1000 bytes of string data',
            path: 'payload',
            fuzzy: {
                maxEdits: 2
            }
        }
    }
}
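For reference, the stage above runs as the first stage of an aggregation pipeline, roughly like this (the collection name is an assumption):
// Collection name is assumed; $search must be the first stage in the pipeline.
db.messages.aggregate([
    {
        $search: {
            index: 'analyzer_name_created_from_atlas_search',
            text: {
                query: 'message payload typically 500-1000 bytes of string data',
                path: 'payload',
                fuzzy: { maxEdits: 2 }
            }
        }
    },
    { $limit: 10 }
]);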

MongoDB {aggregation $match} vs {find} speed

I have a MongoDB collection with millions of documents and I'm trying to optimize my queries. I'm currently using the aggregation framework to retrieve data and group it as I want. My typical aggregation query is something like: $match > $group > $group > $project
However, I noticed that the last stages only take a few ms; the beginning is the slowest part.
I tried to perform a query with only the $match filter, and then to perform the same query with collection.find. The aggregation query takes ~80ms while the find query takes 0 or 1ms.
I have indexes on pretty much every field so I guess this isn't the problem. Any idea what could be going wrong? Or is it just a "normal" drawback of the aggregation framework?
I could use find queries instead of aggregation queries, however I would have to perform a lot of processing after the request, and this processing can be done quickly with $group etc., so I would rather keep the aggregation framework.
Thanks,
EDIT:
Here are my criteria:
{
    "action" : "click",
    "timestamp" : {
        "$gt" : ISODate("2015-01-01T00:00:00Z"),
        "$lt" : ISODate("2015-02-11T00:00:00Z")
    },
    "itemId" : "5"
}
The main purpose of the aggregation framework is to ease the query of a big number of entries and generate a low number of results that hold value to you.
As you have said, you can also use multiple find queries, but remember that you cannot create new fields with find queries. On the other hand, the $group stage allows you to define your new fields.
If you would like to achieve the functionality of the aggregation framework, you would most likely have to run an initial find (or chain several ones), pull that information and further manipulate it with a programming language.
The aggregation pipeline might seem to take longer, but at least you know you only have to take into account the performance of one system - MongoDB engine.
Whereas, when it comes to the data returned from a find query, you would most likely have to manipulate it further with a programming language, thus increasing the complexity depending on the intricacies of your programming language of choice.
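As an illustration, a pipeline of the shape described in the question might look like this (the collection name and the grouping key are assumptions; the match criteria are taken from the question's edit):
// Collection name ("events") and the per-day grouping are assumptions for the sketch.
db.events.aggregate([
    // Filter first so later stages only see the relevant documents.
    { $match: {
        action: "click",
        timestamp: { $gt: ISODate("2015-01-01T00:00:00Z"), $lt: ISODate("2015-02-11T00:00:00Z") },
        itemId: "5"
    } },
    // $group creates a new field (a per-day click count) that a plain find() cannot produce.
    { $group: {
        _id: { day: { $dayOfYear: "$timestamp" } },
        clicks: { $sum: 1 }
    } },
    { $project: { _id: 0, day: "$_id.day", clicks: 1 } }
]);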
Have you tried using explain() on your find queries? It will give you a good idea of how much time the find() query actually takes. You can do the same for $match (using the explain option on the aggregation) and see whether there is any difference in index access and other parameters.
Also, the $group stage of the aggregation framework doesn't use indexes, so it has to process all the records returned by the $match stage. So to better understand how your query behaves, look at the result set it returns and whether it fits into memory for MongoDB to process.
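For example, one way to compare the two in the shell (the explain option on aggregate works on 2.6+, the explain() helper with a verbosity argument needs a 3.0+ shell; the collection name is an assumption):
// Collection name is assumed; criteria taken from the question's edit.
var criteria = {
    action: "click",
    timestamp: { $gt: ISODate("2015-01-01T00:00:00Z"), $lt: ISODate("2015-02-11T00:00:00Z") },
    itemId: "5"
};

// Execution statistics for the plain find:
db.events.explain("executionStats").find(criteria);

// Plan information for the same criteria as a $match stage:
db.events.aggregate([ { $match: criteria } ], { explain: true });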
If you are concerned with performance, then aggregation is no doubt a more time-consuming task than a find clause.
When you are fetching records on multiple conditions, with lookups, grouping, and a limited (paginated) result set, aggregate is the best approach. Meanwhile, a find query is fast when you have to fetch a very big data set; if you only need population and projection with no pagination, I suggest using a find query, which is fast.

MongoDB querying performance for over 5 million records

We've recently passed 2 million records in one of our main collections, and we have started to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields which you can filter by using the UI, and the results are supposed to be sorted by the timestamp field of when the record was processed.
I've added several compound indexes with the filtered fields and the timestamp,
e.g.:
db.events.ensureIndex({somefield: 1, timestamp:-1})
I've also added a couple of indexes for using several filters at once, to hopefully achieve better performance. But some filters still take an awfully long time.
I've made sure, using explain(), that the queries do use the indexes I've created, but performance is still not good enough.
I was wondering if sharding is the way to go now... but we will soon start to have about 1 million new records per day in that collection, so I'm not sure if it will scale well...
EDIT: example for a query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey#acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
    "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 30060,
    "nscanned" : 30060,
    "nscannedObjectsAllPlans" : 120241,
    "nscannedAllPlans" : 120241,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 26495,
    "indexBounds" : {
        "user.userName" : [
            [
                "nickey#acme.com",
                "nickey#acme.com"
            ]
        ],
        "timestamp" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "yarin:27017"
}
Please note that deviceType has only 2 values in my collection.
This is searching for a needle in a haystack. We'd need some output of explain() for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
Ensure it's not because of insufficient RAM and excessive paging
Enable the DB profiler (using db.setProfilingLevel(1, timeout) where timeout is the threshold for the number of milliseconds the query or command takes, anything slower will be logged)
Inspect the slow queries in db.system.profile and run the queries manually using explain() (a minimal sketch of these two steps follows this list)
Try to identify the slow operations in the explain() output, such as scanAndOrder or large nscanned, etc.
Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing the filter setting for the end-user or give him a warning dialog that the operation might be slow.
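As referenced above, a minimal shell sketch of the profiler steps (the 100 ms threshold is just an example; the sample query reuses the one from the question):
// Log operations slower than 100 ms (profiling level 1 = slow operations only).
db.setProfilingLevel(1, 100);

// Later, inspect the slowest logged operations for this database:
db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 }).limit(10).pretty();

// Re-run a suspect query manually with explain() to see how it is executed:
db.audit.find({ "userAgent.deviceType": "MOBILE" }).sort({ timestamp: -1 }).explain();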
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents (100 / 0.6, since 3 out of every 5 documents are active), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use with time is high, you might end up having to scan thousands of documents before finding a hundred matches.
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
Things get worse when the data is too big for the RAM, because then, having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.
If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.
-- EDIT --
The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey#acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'mobile'.
I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if they switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done using a compound index that contains the device type, e.g. using
a) ensureIndex({'user.userName': 1, 'userAgent.deviceType' : 1, 'timestamp' : -1})
or
b) ensureIndex({'userAgent.deviceType' : 1, 'user.userName' : 1, 'timestamp' : -1})
Unfortunately, that means that queries like find({"user.userName" : "foo"}).sort({"timestamp" : -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
I'm afraid there's no very good solution for this using mongodb at this time.
Mongo only uses one index per query.
So if you want to filter on 2 fields, mongo will use the index on one of the fields, but still needs to scan the entire subset.
This means that basically you'll need an index for every type of query in order to achieve the best performance.
Depending on your data, it might not be a bad idea to have one query per field, and process the results in your app.
This way you'll only need an index on each individual field, but it may be too much data to process.
If you are using $in, check explain() to confirm the index is actually being used; a single-value $in can be rewritten as a plain equality match, which keeps the plan simpler and may perform better than what you got earlier.
http://docs.mongodb.org/manual/core/query-optimization/

How to mapreduce on key from another collection

Say I have a collection of users like this:-
{
    "_id" : "1234",
    "Name" : "John",
    "OS" : "5.1",
    "Groups" : [{
        "_id" : "A",
        "Name" : "Group A"
    }, {
        "_id" : "C",
        "Name" : "Group C"
    }]
}
And I have a collection of events like this:-
{
    "_id" : "15342",
    "Event" : "VIEW",
    "UserId" : "1234"
}
I'm able to use mapreduce to work out the count of events per user, as I can just emit the "UserId" and count off of that; however, what I want to do now is count events by group.
If I had a "Groups" array in my event document then this would be easy; however, I don't, and this is only an example. The actual application of this is much more complicated, and I don't want to replicate all that data into the event document.
I've seen an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ but I can't see how it applies in this situation, as it is aggregating values from two places... all I really want to do is perform a lookup.
In SQL I would simply JOIN my flattened UserGroup table to the event table and just GROUP BY UserGroup.GroupName
I'd be happy with multiple passes of mapreduce... a first pass to count by UserId into something like { "_id" : "1234", "count" : 9 }, but I get stuck on the next pass... how do I include the group id?
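For reference, that first pass might look something like this (a sketch; the "events" collection name and the output collection name are assumptions):
// Count events per user; collection names are assumed for the sketch.
db.events.mapReduce(
    function () { emit(this.UserId, 1); },                  // map: one count per event
    function (key, values) { return Array.sum(values); },   // reduce: sum the counts per user
    { out: "events_per_user" }
);
// Produces documents like { "_id" : "1234", "value" : 9 }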
Some potential approaches I've considered:-
Include group info in the event document (not feasible)
Work out how to "join" the user collection or look-up the users groups from within the map function so I can emit the group id's as well (don't know how to do this)
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
What is possible and what are the benefits/issues with each approach?
Your third approach is the way to go:
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
To do this you'll need to create a new collection J with the "joined" data you need for map-reduce. There are several strategies you can use for this:
Update your application to insert/update J in the normal course of business. This is best in the case where you need to run MR very frequently and with up-to-date data. It can add substantially to code complexity. From an implementation standpoint, you can do this either directly (by writing to J) or indirectly (by writing changes to a log collection L and then applying the "new" changes to J). If you choose the log collection approach you'll need a strategy for determining what's changed. There are two common ones: high-watermark (based on _id or a timestamp) and using the log collection as a queue with the findAndModify command.
Create/update J in batch mode. This is the way to go in the case of high-performance systems where the multiple updates from the above strategy would affect performance. This is also the way to go if you do not need to run the MR very frequently and/or you do not have to guarantee up-to-the-second data accuracy.
If you go with (2) you will have to iterate over documents in the collections you need to join--as you've figured out, Mongo map-reduce won't help you here. There are many possible ways to do this:
If you don't have many documents and if they are small, you can iterate outside of the DB with a direct connection to the DB (a rough sketch of this approach follows below).
If you cannot do (1) you can iterate inside the DB using db.eval(). If the number of documents is not small, make sure to use nolock: true as db.eval is blocking by default. This is typically the strategy I choose as I tend to deal with very large document sets and I cannot afford to move them over the network.
If you cannot do (1) and do not want to do (2) you can clone the collections to another node with a temporary DB. Mongo has a convenient cloneCollection command for this. Note that this does not work if the DB requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore. Once you have the data local to a new DB you can party on it as you see fit. Once you complete the MR you can update the result collection in your production DB. I use this strategy for one-off map-reduce operations with heavy pre-processing so as to not load the production replica sets.
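To make the overall approach concrete, here is a rough sketch of building J by iterating with a direct connection (option 1 above) and then map-reducing J by group. The collection names events, users, and events_joined are assumptions, not from the original post:
// Build the joined collection J by looking up each event's user and copying its group ids.
db.events.find().forEach(function (ev) {
    var user = db.users.findOne({ _id: ev.UserId }, { Groups: 1 });
    db.events_joined.insert({
        _id: ev._id,
        Event: ev.Event,
        UserId: ev.UserId,
        GroupIds: (user && user.Groups) ? user.Groups.map(function (g) { return g._id; }) : []
    });
});

// Map-reduce over J, emitting one count per group id for each event.
db.events_joined.mapReduce(
    function () {
        this.GroupIds.forEach(function (groupId) { emit(groupId, 1); });
    },
    function (key, values) { return Array.sum(values); },
    { out: "events_per_group" }
);
// Result documents look like { "_id" : "A", "value" : 42 }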
Good luck!