MongoDB relative frequency in grouping aggregation

I have data that looks like this:
{"customer_id":1, "amount": 100, "item": "a"}
{"customer_id":1, "amount": 20, "item": "b"}
{"customer_id":2, "amount": 25, "item": "a"}
{"customer_id":3, "amount": 10, "item": "a"}
{"customer_id":4, "amount": 10, "item": "b"}
Using R, I can get an overview of relative frequencies very easily by doing this:
data %>%
group_by(customer_id,item) %>%
summarise(total=sum(amount)) %>%
mutate(per_customer_spend=total/sum(total))
Which returns:
customer_id item total per_customer_spend
<dbl> <chr> <dbl> <dbl>
1 1 a 100 0.833
2 1 b 20 0.167
3 2 a 25 1
4 3 a 10 1
5 4 b 10 1
I can't figure out how to do this efficiently in Mongo. The best solution I have so far involves multiple groups plus pushing and unwinding.

If you don't want to change the data structure, there's no way around grouping all the data, since we need to determine the total amount spent by each customer. However, this requires just a single $group stage and a single $unwind stage. It would look something like this:
db.collection.aggregate([
  {
    $group: {
      _id: "$customer_id",
      total: {$sum: "$amount"},
      rootHolder: {$push: "$$ROOT"}
    }
  },
  {
    $unwind: "$rootHolder"
  },
  {
    $project: {
      newRoot: {
        $mergeObjects: [
          "$rootHolder",
          {total: "$total"}
        ]
      }
    }
  },
  {
    $replaceRoot: {
      newRoot: "$newRoot"
    }
  },
  {
    $project: {
      customer_id: 1,
      item: 1,
      total: "$amount",
      per_customer_spend: {$divide: ["$amount", "$total"]}
    }
  }
])
That said, this pipeline becomes very expensive as scale increases. Depending on how big the scale is and on the number of unique customer_id x item pairs, I would advise the following:
Although Mongo doesn't like data duplication, assuming a user does not "buy" new items too often, it might be worth actually saving the total as a field in the current collection (which requires updating all of the user's items on each purchase). I know this sounds "weird" and costly, but again, depending on the frequency of purchases, it might actually be worth it.
Assuming you decide not to do the above, I would instead create a new collection holding customer_id and customer_total. Mind you, this field still requires upkeep, although it is much cheaper.
With this collection you can then $lookup the total (which, again, can be expensive).
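A minimal sketch of the separate-collection idea, assuming a hypothetical side collection named customer_totals with one document per customer shaped like {customer_id: 1, customer_total: 120}. Neither that collection name nor the variable name below comes from the original data; the pipeline is written as a plain array that could be passed to aggregate() on the purchases collection.

```javascript
// Assumption: "customer_totals" is a hypothetical collection holding
// {customer_id, customer_total}, updated on every purchase.
const perCustomerSpendPipeline = [
  {
    // Attach the precomputed total to each purchase document.
    $lookup: {
      from: "customer_totals",
      localField: "customer_id",
      foreignField: "customer_id",
      as: "totals"
    }
  },
  // $lookup always produces an array; flatten the single matching total.
  { $unwind: "$totals" },
  {
    $project: {
      _id: 0,
      customer_id: 1,
      item: 1,
      total: "$amount",
      per_customer_spend: { $divide: ["$amount", "$totals.customer_total"] }
    }
  }
];

// Hypothetical upkeep on each purchase (shown as a comment, not executed):
// db.customer_totals.updateOne(
//   { customer_id: 1 },
//   { $inc: { customer_total: 100 } },
//   { upsert: true }
// );
```

The trade-off is the one described above: reads no longer need a $group over every purchase, but every write has to keep customer_total in sync.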

Related

mongodb aggregation select specific document in group

I need a bit of help with MongoDB aggregation.
First I have a $match to filter some specific documents.
Then I group by the field I need them grouped by.
Within each group I need to select the document where a field has a specific value and use that document's data as the main data.
{"$match": {"$and": [
{chain: chain},
{dex: dex}
]}},
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: {the document with timeframe inside this group 86400.txCount}
}},
{$sort: {txCount: -1}},
{$skip: 0},
{$limit: 100}
Each group consists of documents with different timeframes. I need to somehow select a specific timeframe and add fields from that timeframe's document to the group. For example, each timeframe has a different txCount; after the $group I want to sort by txCount, limit the number of results, and use $skip for pagination.
The problem is selecting the document with the specific timeframe from that group.
If anyone could point me in the right direction, that would be awesome.
Here is an example of how the data is stored in the database and what I would like the result to be.
const document = {
_id: '3567356735672467',
pairAddress: '0x45jk6v34jy5634jkh5v6kj4h5v62j4h56', // group by pair address
baseToken: '0x456jn345k6hb4k5h6b3khb65k3hb56k3h4b6',
resolution: 86400, // a pair address has 6 documents, each with its own timeframe: 300, 900, 1800, 3600, 43200, 86400
base0: true,
txCount: 26,
buyCount: 10,
sellCount:16,
buyVolume: '2342354.345',
sellVolume: '1234.34',
volume: '1232352.345',
change: '12.34',
positive: true,
time: 1676865981,
chain: 'ETH',
dex: 'SUS',
price: '12.45',
};
const result = [
{
_id: "0x45jk6v34jy5634jkh5v6kj4h5v62j4h56",
allChange: {"$push": "$$ROOT"}, // array of all documents/timeframes for a pairAddress
selectedTxAmount: 26, // txCount of the document with the selected timeframe (e.g. 86400); it must come from the same pairAddress group
}
];
Maybe it's possible to change the aggregation to make it work and be faster:
Match on timeframe, dex and chain.
Sort by txCount.
Skip X documents.
Limit to 100.
Return every remaining document with a field containing all timeframes for its pairAddress.
Currently, thanks to #1sina1, I have this and it works.
{"$match": {"$and": [
{"chain": chain},
{"dex": dex}
]}},
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: {
"$push": {
"$cond": {
"if": {
"$eq": [
"$resolution",
43200
]
},
"then": "$txCount",
"else": "$$REMOVE"
}
}
}
}},
{$sort: {txCount: -1}},
{$skip: parseInt(page) * 100},
{$limit: 100},
But I think there might be a way to do it a bit differently. Right now we first group everything (about 20k documents), while I am only interested in 100. So maybe first match on timeframe/resolution, then sort, skip, and limit, and then fetch, for just those 100 pairAddress values, all the corresponding timeframes/resolutions as a field allChange.
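The reordering described above could be sketched as follows. This is only a sketch under assumptions: the collection name pairs and the function name buildTopPairsPipeline are hypothetical (the question does not name the collection), and it assumes each pairAddress has exactly one document per resolution and that a pairAddress does not repeat across chains or dexes.

```javascript
// Builds the pipeline; "pairs" is a placeholder collection name.
function buildTopPairsPipeline(chain, dex, resolution, page, pageSize = 100) {
  return [
    // Keep only the single timeframe used for ranking, so no $group is needed.
    { $match: { chain: chain, dex: dex, resolution: resolution } },
    { $sort: { txCount: -1 } },
    { $skip: page * pageSize },
    { $limit: pageSize },
    // Self-join: fetch every timeframe document, but only for the pairs
    // that survived the limit.
    {
      $lookup: {
        from: "pairs",
        localField: "pairAddress",
        foreignField: "pairAddress",
        as: "allChange"
      }
    },
    {
      $project: {
        _id: "$pairAddress",
        baseToken: 1,
        txCount: 1,
        allChange: 1
      }
    }
  ];
}
```

It would be used as db.pairs.aggregate(buildTopPairsPipeline(chain, dex, 86400, parseInt(page))). Whether this is actually faster depends on the indexes available for the $match/$sort and for the pairAddress lookup, so it is worth checking with explain().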

MongoDB $lookup with conditional foreignField

Playground: https://mongoplayground.net/p/OxMnsCFZpmQ
My MongoDB version: 4.2.
I have two collections, car_parts and customers.
As the name suggests, car_parts holds car parts, some of which have a field sub_parts: a list of the car_parts._id values the part consists of.
Every customer who bought something from us is stored in customers. The parts field of a customer contains a list of the parts the customer bought together on a certain date.
I would like to have an aggregate query in MongoDB that returns a mapping of which car parts were bought (bought_parts) by which customers. However, if a car part has the field sub_parts, the customer should show up for the sub-parts only.
So the query in the playground already gives almost the correct result, except for the sub_parts handling.
Example for customer_3:
{
"_id": "customer_3",
"parts": [
{
"bought_parts": [
3
],
"date": "15.07.2020"
}
]
}
Since bought_parts has car_parts._id = 3:
{
"_id": 3,
"name": "steering wheel",
"sub_parts": [
1, // other car_parts._id s
2
]
}
The result should show customer_3 as a customer of car parts 1 and 2.
I'm not sure how to accomplish this, but I assume a "temporary" replacement of the id 3 in bought_parts with the actual ids [1,2] might solve it.
Expected output:
[
{
"_id": 1,
"customers": [
"customer_1",
"customer_2",
"customer_3" // <- since customer_3 bought car part 3 which has 1 in sub_parts
]
},
{
"_id": 2,
"customers": [
"customer_3" // <- since customer_3 bought car part 3 which has 2 in sub_parts
]
},
{
"_id": 3,
"customers": [
"customer_1", // <- since car_parts.id = 3 has [1, 2] in sub_parts, show customers of ids [1, 2]
"customer_2",
"customer_3"
]
},
{
"_id": 4,
"customers": [
"customer_1",
"customer_2"
]
}
]
Thanks a lot in advance!
EDIT: One way to do it is:
db.car_parts.aggregate([
  {
    $project: {
      topLevel: {$concatArrays: [{$ifNull: ["$sub_parts", []]}, ["$_id"]]},
      sub_parts: 1
    }
  },
  {$unwind: "$topLevel"},
  {
    $group: {
      _id: "$topLevel",
      parts: {$push: "$_id"},
      sub_parts: {$first: "$sub_parts"}
    }
  },
  {
    $project: {
      parts: {$concatArrays: [{"$ifNull": ["$sub_parts", []]}, "$parts"]}
    }
  },
  {
    $lookup: {
      from: "customers",
      localField: "parts",
      foreignField: "parts.bought_parts",
      as: "customers"
    }
  },
  {$project: {customers: "$customers._id"}}
])
You can see it working on this playground.
Since you said there is only one level of sub-parts, I used another idea: creating a top level before the $lookup. Since you want customers who bought part 3, for example, to be registered under parts 1 and 2, which are sub-parts of 3, the idea is to group them. Making this connection after the $lookup is a bit clumsy, but the car_parts collection already tells us, before the $lookup, that parts 1 and 2 are sub-parts of 3. Creating a temporary topLevel field lets us group, in advance, all the parts and sub-parts for which a customer who bought one of them should be registered under this top-level part. This makes things much more elegant.

Query object with max field on MongoDB

I am new to MongoDB and I use Atlas & Charts in order to query and visualize the results.
I want to create a graph that shows the max amount of money for every day and indicates the person with the max amount of money.
For example, if my collection contains the following documents:
{"date": "15-12-2020", "name": "alice", "money": 7}
{"date": "15-12-2020", "name": "bob", "money": 9}
{"date": "16-12-2020", "name": "alice", "money": 39}
{"date": "16-12-2020", "name": "bob", "money": 25}
What query should I put in the query box (in Charts) to create a graph with the following result?
date       | max_money | the_person_with_max_money
15-12-2020 | 9         | bob
16-12-2020 | 39        | alice
You have to use an aggregation, and I think this should work.
First of all $sort values by money (I'll explain later why).
And then use $group to group values by date.
The query looks like this:
db.collection.aggregate([
  {
    "$sort": { "money": -1 }
  },
  {
    "$group": {
      "_id": "$date",
      "max_money": { "$max": "$money" },
      "the_person_with_max_money": { "$first": "$name" }
    }
  }
])
Example here
How does this work? Well, there is a "problem" with $group: you can't carry values into the next stage unless you use an accumulator. So the best way seems to be to use $first to get the first name.
This is why the documents are sorted by money in descending order: so that the name with the greatest money value is in first position.
Sorting thus ensures that the first value is the one you want.
Then $group gathers the documents with the same date and creates the fields max_money and the_person_with_max_money.
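To make the sort-then-$first reasoning concrete, here is a small plain-JavaScript imitation of the two stages, run on the four sample documents from the question. It does not talk to MongoDB; the function name maxMoneyPerDay is just an illustrative choice.

```javascript
const docs = [
  { date: "15-12-2020", name: "alice", money: 7 },
  { date: "15-12-2020", name: "bob", money: 9 },
  { date: "16-12-2020", name: "alice", money: 39 },
  { date: "16-12-2020", name: "bob", money: 25 }
];

function maxMoneyPerDay(documents) {
  // Imitates {$sort: {money: -1}}: largest amounts first (sorting a copy).
  const sorted = [...documents].sort((a, b) => b.money - a.money);
  const groups = new Map();
  for (const doc of sorted) {
    // Imitates $first: only the first document seen for each date is kept,
    // which after the sort is the one with the highest money.
    if (!groups.has(doc.date)) {
      groups.set(doc.date, {
        _id: doc.date,
        max_money: doc.money,
        the_person_with_max_money: doc.name
      });
    }
  }
  return [...groups.values()];
}

const result = maxMoneyPerDay(docs);
```

For the sample data this yields bob with 9 for 15-12-2020 and alice with 39 for 16-12-2020, matching the table in the question.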

MongoDB aggregate query extremely slow

I have a MongoDB query that runs extremely slowly without an index, but the queried fields are too big to index, so I'm looking for advice on how to optimise this query or create a valid index for it:
collection.aggregate([{
$match: {
article_id: {
$nin: read_article_ids
},
author_id: {
$in: liked_authors,
$nin: disliked_authors
},
word_count: {
$gte: 1000,
$lte: 10000
},
article_sentiment: {
$elemMatch: {
sentiments: mood
}
}
}
}, {
$sample: {
size: 4
}
}])
The collection in this case is a collection of articles with article_id, author_id, word_count, and article_sentiment. There are around 1.6 million documents in the collection, and a query like this takes upwards of 10 seconds without an index. The box has 56 GB of memory and is well specced all around.
The query's purpose is to retrieve a batch of 4 articles by authors the user likes, which the user has not read and which match a given sentiment (the article_sentiment key holds a nested array of key:value pairs).
So is this query incorrect for what I'm trying to achieve? Is there a way to improve it?
EDIT: Here is a sample document for this collection.
{
"_id": ObjectId("57f7dd597a1026d326fc02c4"),
"publication_name": "National News Inc",
"author_name": "John Hardwell",
"title": "How Shifting Policy Has Stunted Cultural Growth",
"article_id": "2f0896cd47c9423cb5a309c7277dd90d",
"author_id": "51b7f46f6c0f46f2949608c9ec2624d4",
"word_count": 1202,
"article_sentiment": [{
"sentiments": "happy",
"weight": 0.528596282005
}, {
"sentiments": "serious",
"weight": 0.569274544716
}, {
"sentiments": "relaxed",
"weight": 0.825395524502
}]
}

How to maintain the top count(s) of array elements in mongoDB?

I am looking for a way to get the top two (or any other number) counts of a specific element from the given collection.
{"id": "xyz" , "fruits": ["Apple", "Mango"]}
{"id": "abx", "fruits": ["Apple", "Banana"]}
{"id" : "pqr", "fruits": ["Apple", "Mango"]}
For the above example, the result would be Apple and Mango, because Apple occurs most often (three times), followed by Mango (two times). Do I need to go with Mongo's map-reduce functionality?
I am leaning towards performance and stability of the backend platform. How should I proceed if the occurrence counts change in real time?
Any help would be appreciated.
You could use aggregate. Here is a simple example, which assumes that a fruit value is not repeated within a single document:
[
  {
    $unwind: "$fruits"
  },
  {
    $group: {
      _id: "$fruits",
      count: {$sum: 1}
    }
  },
  {
    $sort: {count: -1}
  },
  {
    $limit: 2
  }
]
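As a sanity check of what the stages above compute, the same counting can be imitated in plain JavaScript on the three sample documents. This is an illustration only and does not query MongoDB; the names fruitDocs and topFruits are made up for the example.

```javascript
const fruitDocs = [
  { id: "xyz", fruits: ["Apple", "Mango"] },
  { id: "abx", fruits: ["Apple", "Banana"] },
  { id: "pqr", fruits: ["Apple", "Mango"] }
];

function topFruits(documents, n) {
  const counts = new Map();
  for (const doc of documents) {
    // Imitates $unwind + $group: one increment per array element.
    for (const fruit of doc.fruits) {
      counts.set(fruit, (counts.get(fruit) || 0) + 1);
    }
  }
  // Imitates $sort by count descending, then $limit n.
  return [...counts.entries()]
    .map(([fruit, count]) => ({ _id: fruit, count: count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, n);
}

const top = topFruits(fruitDocs, 2);
```

For the sample data, top holds Apple with a count of 3 followed by Mango with a count of 2.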