Performance issues related to $nin/$ne querying in large database - mongodb

I am working on a pipeline where multiple microservices (workers) modify and add attributes to documents. Some of them have to make sure the document was already processed by another microservice and/or make sure they don't process a document twice.
I've already tried two different data structures for this: array and object:
{
  ...other_attributes,
  "worker_history_array": ["worker_1", "worker_2", ...],
  "worker_history_object": { "worker_1": true, "worker_2": true, ... }
}
I also created indexes for the two fields:
{ "worker_history_array": 1 }
{ "worker_history_object.$**": 1 }
Both data structures use the index and work very well when querying for the existence of a worker in the history:
{
"worker_history_array": "worker_1"
}
{
"worker_history_object.worker_1": true
}
But I can't seem to find a query that is fast and hits the index when checking whether a worker has not already processed a document. All of the following queries perform awfully:
{
"worker_history_array": { $ne: "worker_1" }
}
{
"worker_history_array": { $nin: ["worker_1"] }
}
{
"worker_history_object.worker_1": { $exists: false }
}
{
"worker_history_object.worker_1":{ $not: { $exists: true } }
}
{
"worker_history_object.worker_1": { $ne: true }
}
Performance is already bad with 500k documents, but the database will grow to millions of documents.
Is there a way to improve the query performance?
Can I work around the low selectivity of $ne and $nin?
Different index?
Different data structure?
I don't think it matters, but I'm using MongoDB Atlas (MongoDB 4.4.1, cluster with read replicas) on Google Cloud and examined the performance of the queries with MongoDB Compass.
Additional information/restrictions:
Millions of records
Hundreds of workers
I don't know all workers beforehand
Not every worker processes every document (some may only work on documents with type: "x" while others work only on documents with type: "y")
No worker should have knowledge about the pipeline, only about the worker that directly precedes it.
Any help would be greatly appreciated.

Related

MongoDB query over several collections with one sort stage

I have some data with identical layout divided over several collections, say we have collections named Jobs.Current, Jobs.Finished, Jobs.ByJames.
I have implemented a complex query using some aggregation stages on one of these collections, where the last stage is the sorting. It's something like this (but in reality it's implemented in C# and additionally does a projection):
db.ArchivedJobs.aggregate([
  { $match: { Name: { $gte: "A" } } },
  { $addFields: { "UpdatedTime": { $max: "$Transitions.TimeStamp" } } },
  { $sort: { "__tmp": 1 } }
])
My new requirement is to include all of these collections in my query. I could do it by simply running the same query on all collections in sequence, but then I would still need to sort the results together. As this sort isn't trivial (it uses an additional field created by a $max over a sub-array) and I'm using skip and limit options, I hope it's possible to do it like this:
Doing the query I already implemented on all relevant collections by defining appropriate aggregation steps
Sorting the whole result afterwards inside the same aggregation request
I found something with a $lookup stage, but couldn't apply it to my request as it needs to do some field-oriented matching (?). I need to access the complete objects.
The data is something like
{
  "_id": "4242",
  "name": "Stream recording Test - BBC 60 secs switch",
  "transitions": [
    {
      "_id": "123",
      "timeStamp": "2020-02-13T14:59:40.449Z",
      "currentProcState": "Waiting"
    },
    {
      "_id": "124",
      "timeStamp": "2020-02-13T14:59:40.55Z",
      "currentProcState": "Running"
    },
    {
      "_id": "125",
      "timeStamp": "2020-02-13T15:00:23.216Z",
      "currentProcState": "Error"
    }
  ],
  "currentState": "Error"
}
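What I have in mind would look roughly like the following, assuming a stage such as $unionWith (MongoDB 4.4+) can be used for this; it is untested, and the sub-pipeline simply mirrors the query above:

// Sketch only: assumes MongoDB 4.4+ and reuses the collection names from above.
var subPipeline = [
  { $match: { Name: { $gte: "A" } } },
  { $addFields: { UpdatedTime: { $max: "$Transitions.TimeStamp" } } }
];

db.getCollection("Jobs.Current").aggregate(subPipeline.concat([
  { $unionWith: { coll: "Jobs.Finished", pipeline: subPipeline } },
  { $unionWith: { coll: "Jobs.ByJames", pipeline: subPipeline } },
  { $sort: { UpdatedTime: 1 } },   // single sort over the combined result
  { $skip: 0 },                    // skip/limit values are placeholders
  { $limit: 50 }
]));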

Mongo indexing in Meteor

Not sure I'm understanding indexing of mongo queries in Meteor. Right now, none of my queries are indexed. On some of the pages in the app, there are 15 or 20 links that each trigger a unique mongo query. Would each query be indexed individually?
For example, if one of the queries is something like:
Template.myTemplate.helpers({
  ...
  if (weekly === "1") {
    var firstWeekers = _.where(progDocs, { Week1: "1" }),
        firstWeekNames = firstWeekers.map(function (doc) {
          return doc.FullName;
        });
    return Demographic.find({ FullName: { $in: firstWeekNames } }, { sort: { FullName: 1 } });
  }
  ...
})
How would I implement each of the indexes?
Firstly, minimongo (mongo on the client side) runs in memory, so indexing is much less of a factor than on disk. To minimize network consumption you also generally want to keep your collections on the client fairly small, which makes client-side indexing even less important.
On the server, however, indexing can be critical to good performance. There are two common methods to set up indexes on the server:
via the meteor mongo shell, e.g. db.demographic.createIndex( { FullName: 1 } )
via setting the field to be indexed in your schema when using the Collection2 package. See aldeed:schema-index
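A minimal sketch of doing it from server code instead of the shell, assuming the Demographic collection from the question (rawCollection() exposes the underlying Node driver collection):

// Server-side only: create the index once when the app starts.
Meteor.startup(function () {
  Demographic.rawCollection().createIndex({ FullName: 1 });
});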

Is it possible to perform multiple DB operations in a single transaction in MongoDB?

Suppose I have two collections A and B
I want to perform an operation
db.A.remove({_id:1});
db.B.insert({_id:"1","name":"dev"})
I know MongoDB maintains atomicity at the document level. Is it possible to perform the above set of operations in a single transaction?
Yes, now you can!
MongoDB has had atomic write operations at the level of a single document for a long time. But MongoDB did not support such atomicity for multi-document operations until v4.0.0. Multi-document operations are now atomic thanks to the release of MongoDB Transactions.
But remember that transactions are only supported in replica sets using the WiredTiger storage engine, not on standalone servers (though they may be supported on standalone servers in the future!).
Here is a mongo shell example also provided in the official docs:
// Start a session.
session = db.getMongo().startSession( { readPreference: { mode: "primary" } } );
employeesCollection = session.getDatabase("hr").employees;
eventsCollection = session.getDatabase("reporting").events;
// Start a transaction
session.startTransaction( { readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } } );
// As many operations as you want inside this transaction
try {
  employeesCollection.updateOne( { employee: 3 }, { $set: { status: "Inactive" } } );
  eventsCollection.insertOne( { employee: 3, status: { new: "Inactive", old: "Active" } } );
} catch (error) {
  // Abort the transaction on error
  session.abortTransaction();
  throw error;
}
// Commit the transaction using write concern set at transaction start
session.commitTransaction();
session.endSession();
I recommend reading this and this to better understand how to use them!
MongoDB cannot guarantee atomicity when more than one document is involved.
Also, MongoDB does not offer any single operations which affect more than one collection.
When you want to do whatever you actually want to do in an atomic manner, you need to merge collections A and B into one collection. Remember that MongoDB is a schemaless database. You can store documents of different types in one collection and you can perform single atomic update operations which perform multiple changes to a document. That means that a single update can transform a document of type A into a document of type B.
To tell different types in the same collection apart, you could have a type field and add this to all of your queries, or you could use duck-typing and identify types by checking if a certain field $exists.
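For illustration only (the collection and field names here are made up, not part of the question): once A and B documents live in one merged collection with a type field, a single atomic update can transform one into the other:

// Hypothetical merged collection "ab"; documents carry a type field.
db.ab.insert({ _id: 1, type: "A", payload: "some A-specific data" });

// One atomic update turns the type-A document into a type-B document.
db.ab.update(
  { _id: 1, type: "A" },
  {
    $set: { type: "B", name: "dev" },
    $unset: { payload: "" }
  }
);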

Calculating collection stats for a subset of documents in MongoDB

I know the cardinal rule of SE is to not ask a question without giving examples of what you've already tried, but in this case I can't find where to begin. I've looked at the documentation for MongoDB and it looks like there are only two ways to calculate storage usage:
db.collection.stats() returns the statistics about the entire collection. In my case I need to know the amount of storage being used by a subset of data within a collection (data for a particular user).
Object.bsonsize(<document>) returns the storage size of a single record, which would require a cursor function to calculate the size of each document, one at a time. My only concern with this approach is performance with large amounts of data. If a single user has tens of thousands of documents this process could take too long.
Does anyone know of a way to calculate the aggregate document size of a set of records within a collection efficiently and accurately?
Thanks for the help.
This may not be the most efficient or accurate way to do it, but I ended up using a Mongoose plugin to get the size of the JSON representation of the document before it's saved:
var mongoose = require('mongoose');

module.exports = exports = function defaultPlugin(schema, options) {
  schema.add({
    userId: { type: mongoose.Schema.Types.ObjectId, ref: "User", required: true },
    recordSize: Number
  });

  schema.pre('save', function(next) {
    // Store the length of the JSON representation before saving.
    this.recordSize = JSON.stringify(this).length;
    next();
  });
};
This will convert the schema object to a JSON representation, get its length, then store the size in the document itself. I understand that this will actually add a tiny bit of extra storage to record the size, but it's the best I could come up with.
Then, to generate a storage report, I'm using a simple aggregate call to get the sum of all of the recordSize values in the collection, filtered by userId:
mongoose.model('YourCollectionName').aggregate([
  {
    $match: {
      userId: userId
    }
  },
  {
    $group: {
      _id: null,
      recordSize: { $sum: '$recordSize' },
      recordCount: { $sum: 1 }
    }
  }
], function (err, results) {
  // Do something with your results
});
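As an aside, and not part of my original approach: if you are on MongoDB 4.4 or newer, the server can sum the real BSON size directly with the $bsonSize operator, which avoids storing a precomputed size. Shown here in the mongo shell with a placeholder collection name; the same pipeline works through Mongoose:

// Requires MongoDB 4.4+. Sums the actual BSON size of every matching document.
db.yourCollection.aggregate([
  { $match: { userId: userId } },
  {
    $group: {
      _id: null,
      totalSize: { $sum: { $bsonSize: "$$ROOT" } },
      recordCount: { $sum: 1 }
    }
  }
]);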

delayed_jobs with mongomapper is slow

I'm using delayed_jobs with mongomapper. However, it's slow when fetching delayed_jobs records (around 500k records).
I created an index on { locked_by: -1, priority: 1, run_at: 1 }, but it doesn't help.
I really don't know which indexes would improve the query. Each fetch takes around 2 seconds.
Here is the mongodb log:
Tue Dec 13 09:52:38 [conn497] query api_production.$cmd ntoreturn:1 command: {
  findandmodify: "delayed_jobs",
  query: { run_at: { $lte: new Date(1323769957289) }, failed_at: null,
    $or: [ { locked_by: "host:ip-10-128-145-246 pid:26157" }, { locked_at: null },
           { locked_at: { $lt: new Date(1323769057289) } } ] },
  sort: { locked_by: -1, priority: -1, run_at: 1 },
  update: { $set: { locked_at: new Date(1323769957289), locked_by: "host:ip-10-128-145-246 pid:26157" } }
} reslen:699 1486ms
Your indexes don't match the query. Your query first eliminates candidates based on run_at, so that should be the first field in your index, but it's not.
Then comes a rather inelegant $or clause. Now it will be hard to choose an appropriate index, because two criteria are locked_at while one is locked_by.
To make matters worse, there are three sort criteria, and they are exactly the reverse of the direction of the query constraints. Also, you're sorting on a rather lengthy string.
Basically, I think the query is not very well designed; it tries to accomplish too much in a single query. I don't know if delayed_jobs is some kind of module, but it would be much easier if the rules were simpler. Why does a worker lock so many jobs, for instance? In fact, I think it's best if you only lock the job you're currently working on and have different workers fetch different job types for scaling. The workers might also want to use UUIDs instead of their IP address and PID (with a prefix that adds no entropy and no selectivity), etc.
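If it helps, a minimal sketch of the first point above, i.e. an index that leads with run_at; the order of the remaining fields is a guess and should be verified with explain() against the real workload:

// Hypothetical index leading with run_at, as suggested above.
// Field order beyond run_at is an assumption, not a tested recommendation.
db.delayed_jobs.createIndex({ run_at: 1, locked_by: -1, priority: -1 })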
Basically, I think the query is not very well designed, it tries to accomplish too much in a single query. I don't know if delayed_jobs is some kind of module, but it would be much easier if the rules were simpler. Why does a worker lock so many jobs, for instance? In fact, I think it's best if you only lock the job you're currently working on and have different workers fetch different job types for scaling. The workers might want to use uuids instead of using their ip address and pid (with a prefix that adds no entropy and no selectivity), etc.