Speed up aggregation on large collection - mongodb

I currently have a database with about 270 000 000 documents. They look like this:
[{
'location': 'Berlin',
'product': 4531,
'createdAt': ISODate(...),
'value': 3523,
'minOffer': 3215,
'quantity': 7812
},{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}]
The database currently holds a bit over one month of data and has ~170 locations (in EU and US) with ~8000 products. These documents represent timesteps, so there are about ~12-16 entries per day, per product per location (at most 1 per hour though).
My goal is to retrieve all timesteps of a product in a given location for the last 7 days. For a single location this query works reasonable fast (150ms) with the index { product: 1, location: 1, createdAt: -1 }.
However, I also need these timesteps not just for a single location, but an entire region (so about 85 locations). I'm currently doing that with this aggregation, which groups all the entries per hour and averages the desired values:
this.db.collection('...').aggregate([
{ $match: { { location: { $in: [array of ~85 locations] } }, product: productId, createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) } } }, {
$group: {
_id: {
$toDate: {
$concat: [
{ $toString: { $year: '$createdAt' } },
'-',
{ $toString: { $month: '$createdAt' } },
'-',
{ $toString: { $dayOfMonth: '$createdAt' } },
' ',
{ $toString: { $hour: '$createdAt' } },
':00'
]
}
},
value: { $avg: '$value' },
minOffer: { $avg: '$minOffer' },
quantity: { $avg: '$quantity' }
}
}
]).sort({ _id: 1 }).toArray()
However, this is really really slow, even with the index { product: 1, createdAt: -1, location: 1 } (~40 secs). Is there any way to speed up this aggregation so it goes down to a few seconds at most? Is this even possible, or should I think about using something else?
I've thought about saving these aggregations in another database and just retrieving that and aggregating the rest, this is however really awkward for the first users on the site who have to sit 40 secs through waiting.

These are some ideas which can benefit the querying and performance. Whether all these will work together is matter of some trials and testing. Also, note that changing the way data is stored and adding new indexes means that there will changes to application, i.e., capturing data, and the other queries on the same data need to be carefully verified (that they are not affected in a wrong way).
(A) Storing a Day's Details in a Document:
Store (embed) a day's data within the same document as an array of sub-documents. Each sub-document represents an hour's entry.
From:
{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}
to:
{
location: 'London',
product: 1231,
createdAt: ISODate(...),
details: [ { value: 53523, minOffer: 44215, quantity: 2812 }, ... ]
}
This means about ten entries per document. Adding data for an entry will be pushing data into the details array, instead of adding a document as in present application. In case the hour's info (time) is required it can also be stored as part of the details sub-document; it will entirely depend upon your application needs.
The benefits of this design:
The number of documents to maintain and query will reduce (per
product per day about ten documents).
In the query, the group stage will go away. This will be just a
project stage. Note that the $project supports accumulators $avg and $sum.
The following stage will create the sums and averages for the day (or a document).
{
$project: { value: { $avg: '$value' }, minOffer: { $avg: '$minOffer' }, quantity: { $avg: '$quantity' } }
}
Note the increase in size of the document is not much, with the amount of details being stored per day.
(B) Querying by Region:
The present matching of multiple locations (or a region) with this query filer: { location: { $in: [array of ~85 locations] } }. This filter says : location: location-1, -or- location: location-3, -or- ..., location: location-50. Adding a new field , region, will filter with one value matching.
The query by region will change to:
{
$match: {
region: regionId,
product: productId,
createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
}
}
The regionId variable is to be supplied to match with the region field.
Note that, both the queries, "by location" and "by region", will benefit with the above two considerations, A and B.
(C) Indexing Considerations:
The present index: { product: 1, location: 1, createdAt: -1 }.
Taking into consideration, the new field region, newer indexing will be needed. The query with region cannot benefit without an index on the region field. A second index will be needed; a compound index to suit the query. Creating an index with the region field means additional overhead on write operations. Also, there will be memory and storage considerations.
NOTES:
After adding the index, both the queries ("by location" and "by region") need to be verified using explain if they are using their respective indexes. This will require some testing; a trial-and-error process.
Again, adding new data, storing data in a different format, adding new indexes requires to consider these:
Careful testing and verifying that the other existing queries perform as usual.
The change in data capture needs.
Testing the new queries and verifying if the new design performs as expected.

Honestly your aggregation is pretty much as optimized as it can get, especially if you have { product: 1, createdAt: -1, location: 1 } as an index like you stated.
I'm not exactly sure how your entire product is built, however the best solution in my opinion is to have another collection containing just the "relevant" documents from the past week.
Then you could query that collection with ease, This is quite easy to do in Mongo as well using a TTL Index.
If this not an option you could add a temporary field to the "relevant" documents and query on that making it somewhat faster to retrieve them, but maintaining this field will require you to have a process running every X time which could make your results now 100% accurate depending when you decide to run it.

Related

MongoDB, Panache, Quarkus: How to do aggregate, $sum and filter

I have a table in mongodb with sales transactions each containing a userId, a timestamp and a corresponding revenue value of the specific sales transaction.
Now, I would like to query these users and getting the minimum, maximum, sum and average of all transactions of all users. There should only be transactions between two given timestamps and it should only include users, whose sum of revenue is greater than a specified value.
I have composed the corresponding query in mongosh:
db.salestransactions.aggregate(
{
"$match": {
"timestamp": {
"$gte": new ISODate("2020-01-01T19:28:38.000Z"),
"$lte": new ISODate("2020-03-01T19:28:38.000Z")
}
}
},
{
$group: {
_id: { userId: "$userId" },
minimum: {$min: "$revenue"},
maximum: {$max: "$revenue"},
sum: {$sum: "$revenue"},
avg: {$avg: "$revenue"}
}
},
{
$match: { "sum": { $gt: 10 } }
}
]
)
This query works absolutely fine.
How do I implement this query in a PanacheMongoRepository using quarkus ?
Any ideas?
Thanks!
A bit late but you could do it something like this.
Define a repo
this code is in kotkin
class YourRepositoryReactive : ReactivePanacheMongoRepository<YourEntity>{
fun getDomainDocuments():List<YourView>{
val aggregationPipeline = mutableListOf<Bson>()
// create your each stage with Document.parse("stage_obj") and add to aggregates collections
return mongoCollection().aggregate(aggregationPipeline,YourView::class.java)
}
mongoCollection() automatically executes on your Entity
YourView, a call to map related properties part of your output. Make sure that this class has
#ProjectionFor(YourEntity.class)
annotation.
Hope this helps.

Mongodb selecting every nth of a given sorted aggregation

I want to be able to retrieve every nth item of a given collection which is quite large (millions of records)
Here is a sample of my collection
{
_id: ObjectId("614965487d5d1c55794ad324"),
hour: ISODate("2021-09-21T17:21:03.259Z"),
searches: [
ObjectId("614965487d5d1c55794ce670")
]
}
My start of aggregation is like so
[
{
$match: {
searches: {
$in: [ObjectId('614965487d5d1c55794ce670')],
},
},
},
{ $sort: { hour: -1 } },
{ $project: { hour: 1 } },
...
]
I have tried many things including
$sample which does not make the pick in the good order
Using $skip makes it very slow as the number given to skip grows
Using _id instead of $skip but my ids are unfortunately not created in an ordered manner
My goal is thus to retrieve the hour of a record, every 20000 record, so that I can then make a call to retrieve data by chunks of approximately 20000 records.
I imagine it would be possible to
sort, and number every records, then keep only the first, 20000, 40000, ..., and the last
Thanks for your help and let me know if you need more information

Aggregate query with no $match

I have a collection in which unique documents from a different collection can appear over and over again (in example below item), depending on how much a user shares them. I want to create an aggregate query which finds the most shared documents. There is no $match necessary because I'm not matching a certain criteria, I'm just querying the most shared. Right now I have:
db.stories.aggregate(
{
$group: {
_id:'item.id',
'item': {
$first: '$item'
},
'total': {
$sum: 1
}
}
}
);
However this only returns 1 result. It occurs to me I might just need to do a simple find query, but I want the results aggregated, so that each result has the item and total is how many times it's appeared in the collection.
Example of a document in the stories collection:
{
_id: ObjectId('...'),
user: {
id: ObjectId('...'),
propertyA: ...,
propertyB: ...,
etc
},
item: {
id: ObjectId('...'),
propertyA: ...,
propertyB: ...,
etc
}
}
users and items each have their own collections as well.
Change the line
_id:'item.id'
to
_id:'$item.id'
Currently you group by the constant 'item.id' and therefore you only get one document as result.

Filter large dataset base on aggregation result

I need to do sort of an "Advanced Search" functionality with MongoDB. It's a sport system, where player statistic are collected for each season like this:
{
player: {
id: int,
name: string
},
goals: int,
season: int
}
Uses can search data across season, for example: I want to search for player who scored > 30 goals from season 2012 - 2016.
I could use mongodb aggregation:
db.stats.aggregate( [
{ $match: { season: { $gte: 2014, $lte: 2016 } } }
{ $group: { _id: "$player", totalGoals: { $sum: "$goals" } } },
{ $match: { $totalGoals: { $gte: 30 } } },
{ $limit: 10 },
{ $skip: 0 }
] )
That's working fine, the speed is acceptable for the collections with more than 3 millions records.
However, if the user just want to search for a larger seasons range, let say: players lifetime statistic. The aggregation turns out to be very very very slow. And I understand that MongoDB has to go through all the docs and calculate the $totalGoals.
I just wonder if there is better approach that could solve this performance problem?
you can have pre-calculated data for past seasons and make two step query:
a) get past data
b) get current data
you could try to optimise indexes on that query
hardware: use SSD
hardware: more memory
introduce sharding to split load

Query running very slow on big MongoDB db

I have a MongoDB db with a single rather large collection of documents (13GB for about 2M documents) sitting on a single server with 8GB RAM. Each document has a text field that can be relatively large (it can be a whole blog post) and the other fields are data about the text content and the text author. Here's what the schema looks like:
{
text: "Last night there was a storm in San Francisco...",
author: {
name: "Firstname Lastname",
website_url: "http://..."
},
date: "201403075612",
language: "en",
concepts: [
{name: "WeatherConcept", hit: "storm", start: 23, stop: 28},
{name: "LocationConcept", hit: "San Francisco", start: 32, stop: 45}
],
location: "us",
coordinates: []
}
I'm planning to query the data in different ways:
Full-text search on the "text" field. So let's say my text search query is q:
db.coll.aggregate([
{
$match:{
$text: {
$search:q
}
}
}
])
Aggregate documents by author:
db.coll.aggregate([
{
$project: {
name: "$author.name",
url: "$author.website_url"
}
},
{
$group: {
_id: "$name",
size: {
$sum:1
},
url: {
$first: "$url"
}
}
},
{
$sort:{
size:-1
}
}
])
Aggregate documents by concepts:
db.coll.aggregate([
{
$unwind: "$concepts"
},
{
$group: {
_id: "$concepts.name",
size: {
$sum:1
}
}
},
{
$sort:{
size:-1
}
}
])
These three queries may also include filtering on the following fields: date, location, coordinates, language, author.
I don't have indexes yet in place, so the queries run very slow. But since the indexes would be very different for the different ways I hit the data, does that rule out indexing as a solution? Or is there a way to index for all these cases and not have to shard the collection? Basically my questions are:
What would be a good indexing strategy in this case?
Do I need to create separate collections for authors and concepts?
Should I somehow restructure my data?
Do I need to shard my collection or is my 8GB single-server powerful enough to handle that data?
Do you have any indexes on your collection?
Have a look at the following
http://docs.mongodb.org/manual/indexes/
if you do have indexes make sure they are being hit by doing the following
db.CollectionName.find({"Concept":"something"}).explain();
You also need to give us more information about your setup. How much RAM does the server have? I've worked with a MongoDB that has 200GB sitting on 3 shards. So 13GB on 1 shouldn't be an issue