MongoDB capped collections: consume from offset

We are considering using MongoDB capped collections as our FIFO queues.
Our requirements are the following:
Processing of the messages should be done in insertion order
No messages should be lost/skipped
We should be able to start consuming from a specific offset
However, we are facing the following issue:
Capped collections guarantee reads in insertion order, but the _ids are not guaranteed to be monotonically increasing.
This means that with multiple producers the following situation can occur:
[
  ...
  { _id: ObjectId("5b72f12599757c9e26c0946b"), ... },
  { _id: ObjectId("5b72f12599757c9e26c0946d"), ... },
  { _id: ObjectId("5b72f12599757c9e26c0946c"), ... },
  { _id: ObjectId("5b72f12599757c9e26c0946e"), ... },
  ...
]
This means that if we start consuming with the following code:
const cursor = collection
  .find({ _id: { $gt: ObjectId("5b72f12599757c9e26c0946d") } }) // compare against an ObjectId, not a string
  .tailable()
  .cursor();
Then the message with _id 5b72f12599757c9e26c0946c will be skipped, because its _id sorts before the offset even though it was inserted later.
So my questions are the following:
Is it possible to guarantee monotonic ids on a capped collection?
Is it possible to start consuming from a specific offset without skipping messages with out-of-order _ids?
Are we missing something?
Thanks in advance.
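For reference, one common pattern for the offset requirement is to give every message its own strictly monotonic sequence number from an atomic counter and consume by that field rather than by _id. A minimal sketch, assuming a counters collection and a seq field (both names are illustrative):

// Sketch: producers draw a strictly monotonic sequence number from a shared
// counter document before inserting (collection and field names are assumptions).
async function nextSeq(db) {
  const result = await db.collection('counters').findOneAndUpdate(
    { _id: 'queue' },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: 'after' }
  );
  return result.value.seq; // Node driver <= v5 result shape
}

async function produce(db, payload) {
  const seq = await nextSeq(db);
  await db.collection('queue').insertOne({ seq, payload });
}

// Consumers resume from an offset by seq instead of _id:
//   db.collection('queue').find({ seq: { $gt: lastSeenSeq } }).sort({ seq: 1 })
// Caveat: under concurrency an insert with seq N can still become visible after
// seq N+1, so a consumer must tolerate gaps or wait for contiguous seq values.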

Related

Mongodb to fetch top 100 results for each category

I have a collection of transactions that has below schema:
{
  _id,
  client_id,
  billings,
  date,
  total
}
What I want to achieve is to get the 10 latest transactions, based on the date, for a list of client IDs. I don't think $slice will work well here, as its use case is mostly for embedded arrays.
Currently, I am iterating through the client_ids and using find with a limit, but it is extremely slow.
UPDATE
Example
https://mongoplayground.net/p/urKH7HOxwqC
This shows two clients with 10 transactions each on different days; I want to write a query that returns the latest 5 transactions for each.
Any suggestions on how to query the data to make it faster?
The most efficient way would be to just execute multiple queries, 1 for each client, like so:
const clients = await db.collection.distinct('client_id');
const results = await Promise.all(
  clients.map((clientId) =>
    // toArray() materializes the cursor so Promise.all resolves to documents
    db.collection.find({ client_id: clientId }).sort({ date: -1 }).limit(5).toArray()
  )
);
To improve this performance, make sure you have a compound index on client_id and date (a sketch of creating one is shown after the pipeline). If for whatever reason you can't build that index, I'd recommend using the following pipeline (with syntax available starting in version 5.2):
db.collection.aggregate([
  {
    $group: {
      _id: "$client_id",
      latestTransactions: {
        $bottomN: {
          n: 5,
          sortBy: { date: 1 },
          output: "$$ROOT"
        }
      }
    }
  }
])
Mongo Playground
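For reference, the compound index mentioned above could be created like this (a minimal sketch; the descending date component matches the sort used by the per-client queries):

// Supports { client_id: <equality> } filters sorted by date descending
db.collection.createIndex({ client_id: 1, date: -1 })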

Performance issues related to $nin/$ne querying in large database

I am working on a pipeline where multiple microservices (workers) modify and add attributes to documents. Some of them have to make sure the document was already processed by another microservice and/or make sure they don't process a document twice.
I've already tried two different data structures for this: array and object:
{
  ...other_attributes,
  worker_history_array: ["worker_1", "worker_2", ...],
  worker_history_object: { "worker_1": true, "worker_2": true, ... }
}
I also created indexes for the two fields:
{ "worker_history_array": 1 }
{ "worker_history_object.$**": 1 }
Both data structures use the index and work very well when querying for the existence of a worker in the history:
{ "worker_history_array": "worker_1" }
{ "worker_history_object.worker_1": true }
But I can't seem to find a query that is fast / hits the index when checking whether a worker has not already processed a document. All of the following queries perform awfully:
{ "worker_history_array": { $ne: "worker_1" } }
{ "worker_history_array": { $nin: ["worker_1"] } }
{ "worker_history_object.worker_1": { $exists: false } }
{ "worker_history_object.worker_1": { $not: { $exists: true } } }
{ "worker_history_object.worker_1": { $ne: true } }
Performance is already bad with 500k documents, but the database will grow to millions of documents.
Is there a way to improve the query performance?
Can I work around the low selectivity of $ne and $nin?
Different index?
Different data structure?
I don't think it matters but I'm using MongoDB Atlas (MongoDB 4.4.1, cluster with read replicas) on Google Cloud and examined the performance of the queries with MongoDB Compass.
Additional info/restrictions:
Millions of records
Hundreds of workers
I don't know all workers beforehand
Not every worker processes every document (some may only work on documents with type: "x" while others work only on documents with type: "y")
No worker should have knowledge about the pipeline, only about the worker that directly precedes it.
Any help would be greatly appreciated.
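A sketch of one possible workaround, under the assumption that the set of workers that must process a document is known when it is inserted (which may conflict with the restrictions above): invert the bookkeeping so the negative check becomes a positive, indexable equality match.

// Instead of recording which workers HAVE run, record which ones still have to
// (field name pending_workers is an assumption):
db.collection.insertOne({ type: "x", pending_workers: ["worker_1", "worker_2"] })

// A regular multikey index supports fast equality matches on the array:
db.collection.createIndex({ pending_workers: 1 })

// "worker_1 has not processed this document yet" is now an indexed equality query:
db.collection.find({ pending_workers: "worker_1" })

// When worker_1 finishes, it removes itself from the array:
db.collection.updateOne({ _id: docId }, { $pull: { pending_workers: "worker_1" } })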

aggregating and sorting based on a Mongodb Relationship

I'm trying to figure out if what I want to do is even possible in Mongodb. I'm open to all sorts of suggestions regarding more appropriate ways to achieve what I need.
Currently, I have 2 collections:
vehicles (Contains vehicle data such as make and model. This data can be highly unstructured, which is why I turned to Mongodb for this)
views (Simply contains an IP, a date/time that the vehicle was viewed and the vehicle_id. There could be thousands of views)
I need to return a list of vehicles that have views between 2 dates. The list should include the number of views. I need to be able to sort by the number of views in addition to any of the usual vehicle fields. So, to be clear, if a vehicle has had 1000 views, but only 500 of those between the given dates, the count should return 500.
I'm pretty sure I could perform this query without any issues in MySQL - however, trying to store the vehicle data in MySQL has been a real headache in the past and it has been great moving to Mongo where I can add new data fields with ease and not worry about the structure of my database.
What do you all think?? TIA!
As it turns out, it's totally possible. It took me a long while to get my head around this, so I'm posting it up for future google searches...
db.statistics.aggregate([
  {
    $match: {
      branch_id: { $in: [14] }
    }
  },
  {
    $lookup: {
      from: 'vehicles',
      localField: 'vehicle_id',
      foreignField: '_id',
      as: 'vehicle'
    }
  },
  {
    $group: {
      _id: "$vehicle_id",
      count: { $sum: 1 },
      vehicleObject: { $first: "$vehicle" }
    }
  },
  { $unwind: "$vehicleObject" },
  {
    $project: {
      daysInStock: { $subtract: [new Date(), "$vehicleObject.date_assigned"] },
      vehicleObject: 1,
      count: 1
    }
  },
  { $sort: { count: -1 } },
  { $limit: 10 }
]);
To explain the above:
The MongoDB aggregation framework is the way forward for complex queries like this. Firstly, I run a $match to filter the records. Then we use $lookup to grab the vehicle record. Worth mentioning that this is a Many-to-One relationship (lots of stats, each having a single vehicle). I can then group on the vehicle_id field, which enables me to return one record per vehicle with a count of the number of stats in the group. As it is a group, we technically have lots of copies of that same vehicle document in each group, so I then add just the first one into the vehicleObject field. Since the $lookup result is an array, $first returns an array with a single entry, so I added the $unwind stage to pull the actual vehicle document out. I then added a $project stage to calculate an additional field, sorted by the count descending and limited the results to 10.
And take a breath :)
I hope that helps someone. If you know of a better way to do what I did, then I'm open to suggestions to improve.
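One thing the pipeline above does not show is the date-range requirement from the original question; presumably the first $match stage would also constrain the view date. A sketch, with date_viewed as an assumed field name:

{
  $match: {
    branch_id: { $in: [14] },
    // limit the counted views to the window of interest (field name is an assumption):
    date_viewed: { $gte: ISODate("2024-01-01"), $lt: ISODate("2024-02-01") }
  }
}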

Meteor collection get last document of each selection

Currently I use the following find query to get the latest document for a certain ID:
Conditions.find({
  caveId: caveId
}, {
  sort: { diveDate: -1 },
  limit: 1,
  fields: { caveId: 1, "visibility.visibility": 1, diveDate: 1 }
});
How can I do the same for multiple ids, with $in for example?
I tried it with the following query. The problem is that it limits the result to 1 document across all the found caveIds, but it should apply the limit per caveId.
Conditions.find({
  caveId: { $in: caveIds }
}, {
  sort: { diveDate: -1 },
  limit: 1,
  fields: { caveId: 1, "visibility.visibility": 1, diveDate: 1 }
});
One solution I came up with is using the aggregate functionality.
var conditionIds = Conditions.aggregate([
  { "$match": { caveId: { "$in": caveIds } } },
  // sort first so that $last is deterministic; without a $sort stage the
  // document order (and therefore $last) is undefined
  { "$sort": { diveDate: 1 } },
  {
    "$group": {
      _id: "$caveId",
      conditionId: { $last: "$_id" },
      diveDate: { $last: "$diveDate" }
    }
  }
]).map(function(child) { return child.conditionId; });
var conditions = Conditions.find({
  _id: { $in: conditionIds }
}, {
  fields: { caveId: 1, "visibility.visibility": 1, diveDate: 1 }
});
You don't want to use $in here as noted. You could solve this problem by looping through the caveIds and running the query on each caveId individually.
You're basically looking at a join query here: you need all caveIds and then look up the last dive for each.
This is a problem of database schema/denormalization, in my opinion (but it is only an opinion!):
You could, as mentioned above, look up all caveIds and then run the single query for each, every time you need to look up the last dives.
However, I think you are much better off recording/updating the last dive inside your cave document, and then looking up all caveIds of interest, pulling only the lastDive field.
That will give you immediately what you need, rather than going through expensive search/sort queries. This is at the expense of maintaining that field in the document, but it sounds like it should be fairly trivial, as you only need to update the one field when a new event occurs.
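A minimal sketch of that denormalization in Meteor, assuming a Caves collection and a lastDive field (both names are illustrative):

// When a new condition/dive is recorded, also stamp the cave document.
// The "only if newer" guard keeps out-of-order inserts from regressing the field.
Caves.update(
  {
    _id: caveId,
    $or: [
      { lastDive: { $exists: false } },
      { "lastDive.diveDate": { $lt: diveDate } }
    ]
  },
  { $set: { lastDive: { conditionId: conditionId, diveDate: diveDate } } }
);

// Fetching the latest dive for many caves is then a single indexed lookup:
Caves.find({ _id: { $in: caveIds } }, { fields: { lastDive: 1 } });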

delayed_jobs with mongomapper is slow

I'm using delayed_jobs with mongomapper. However, it's slow when fetching delayed_jobs records (around 500k records).
I created the index { locked_by: -1, priority: 1, run_at: 1 }, but it doesn't help.
I really don't know which indexes would improve the query. Each fetch takes around 2 seconds.
Here is the mongodb log:
Tue Dec 13 09:52:38 [conn497] query api_production.$cmd ntoreturn:1 command: {
  findandmodify: "delayed_jobs",
  query: {
    run_at: { $lte: new Date(1323769957289) },
    failed_at: null,
    $or: [
      { locked_by: "host:ip-10-128-145-246 pid:26157" },
      { locked_at: null },
      { locked_at: { $lt: new Date(1323769057289) } }
    ]
  },
  sort: { locked_by: -1, priority: -1, run_at: 1 },
  update: { $set: { locked_at: new Date(1323769957289), locked_by: "host:ip-10-128-145-246 pid:26157" } }
} reslen:699 1486ms
Your indexes don't match the query. Your query first eliminates candidates based on run_at, so that should be the first field in your index, but it's not.
Then comes a rather inelegant $or clause. Now it will be hard to choose an appropriate index, because two criteria are on locked_at while one is on locked_by.
To make matters worse, there are three sort criteria, and their directions are essentially the reverse of the query constraints. Also, you're sorting on a rather lengthy string.
Basically, I think the query is not very well designed; it tries to accomplish too much in a single query. I don't know if delayed_jobs is some kind of module, but it would be much easier if the rules were simpler. Why does a worker lock so many jobs, for instance? In fact, I think it's best if you only lock the job you're currently working on and have different workers fetch different job types for scaling. The workers might want to use uuids instead of their ip address and pid (with a prefix that adds no entropy and no selectivity), etc.
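Following that advice, an index that at least leads with run_at might look like the sketch below; whether it helps the $or branches and the sort depends on the optimizer, so treat it as a starting point rather than a fix:

// Lead with the range filter on run_at, then the fields used by the sort
db.delayed_jobs.createIndex({ run_at: 1, locked_by: -1, priority: -1 })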
Basically, I think the query is not very well designed, it tries to accomplish too much in a single query. I don't know if delayed_jobs is some kind of module, but it would be much easier if the rules were simpler. Why does a worker lock so many jobs, for instance? In fact, I think it's best if you only lock the job you're currently working on and have different workers fetch different job types for scaling. The workers might want to use uuids instead of using their ip address and pid (with a prefix that adds no entropy and no selectivity), etc.