MongoDB: aggregate frequencies of every (dichotomous) field in one query

I am fairly new to MongoDB and am trying to find a good way to evaluate a so-called multiple-choice question.
The data looks like this:
db.test.insertMany([
{
Q2_1: 1,
Q2_2: -77,
Q2_3: 1
},
{
Q2_1: -77,
Q2_2: -77,
Q2_3: 1
},
{
Q2_1: 1,
Q2_2: 0,
Q2_3: 0
},
{
Q2_1: 0,
Q2_2: 1,
Q2_3: 0
}
])
In this example we have 4 probands, who gave answers to 3 items.
Every field can contain one of three values: -77, 0, or 1.
-77: the proband did not see the item, so it counts for neither 'base' nor 'abs'.
0: the proband saw the item but did not choose it (counts for 'base' but not for 'abs').
1: the proband saw the item and chose it (counts for both 'base' and 'abs').
Now I want one result per item, keyed by the item number (Q2_1 has the key 1, and so on).
Item 1 was seen by 3 probands, so its 'base' is 3.
It was chosen by 2 probands, so its 'abs' is 2.
Its 'perc' is therefore abs / base = 2 / 3 = 0.666666.
Expected result array:
[
{
"key": 1,
"abs": 2,
"base": 3,
"perc": 0.6666666666
},
{
"key": 2,
"abs": 1,
"base": 2,
"perc": 0.5
},
{
"key": 3,
"abs": 2,
"base": 4,
"perc": 0.5
}
]
Is it possible to do this evaluation in one aggregation query and get this expected result array?
thanks for help :-)

Query
$objectToArray turns each document's fields into key/value pairs, so the item names stored in the keys become data (you should not store data in field names; field names are for the schema only, and the MongoDB query language is not designed for data in field names)
$unwind the pairs and $replaceRoot, so each pair becomes its own document
$group by item with two condition-based accumulators, base and abs
add perc and fix the key: split the field name on _ and take the second part
sort by key
*The query is longer than it would otherwise be because data in field names doesn't work well in MongoDB, so try to avoid it.
Playmongo (you can put the mouse at the end of each stage to see what it does)
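db.test.aggregate([
  // Drop _id, then turn each document's remaining fields into {k, v} pairs
  {"$unset": ["_id"]},
  {"$set": {"data": {"$objectToArray": "$$ROOT"}}},
  // One document per (item, answer) pair
  {"$unwind": "$data"},
  {"$replaceRoot": {"newRoot": "$data"}},
  // base: everyone who saw the item (not -77); abs: everyone who chose it (1)
  {"$group": {
    "_id": "$k",
    "base": {"$sum": {"$cond": [{"$eq": ["$v", -77]}, 0, 1]}},
    "abs": {"$sum": {"$cond": [{"$eq": ["$v", 1]}, 1, 0]}}}},
  // "Q2_1" -> 1; $toInt (MongoDB 4.0+) makes the key numeric, as in the expected result
  {"$set": {"key": {"$toInt": {"$arrayElemAt": [{"$split": ["$_id", "_"]}, 1]}}}},
  {"$set": {"_id": "$$REMOVE", "perc": {"$divide": ["$abs", "$base"]}}},
  {"$sort": {"key": 1}}
])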
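As a sanity check on the counting rules, here is a minimal plain-JavaScript sketch that applies the same base/abs/perc logic to the four sample documents without a database. The helper name itemFrequencies is made up for illustration, and returning null for an item nobody saw is an assumption (the pipeline's $divide would raise an error on a zero base instead).

```javascript
// Sample documents from the question.
const sampleDocs = [
  { Q2_1: 1, Q2_2: -77, Q2_3: 1 },
  { Q2_1: -77, Q2_2: -77, Q2_3: 1 },
  { Q2_1: 1, Q2_2: 0, Q2_3: 0 },
  { Q2_1: 0, Q2_2: 1, Q2_3: 0 },
];

// Hypothetical helper mirroring the $group stage: one entry per item.
function itemFrequencies(documents) {
  const totals = new Map();
  for (const doc of documents) {
    for (const [field, value] of Object.entries(doc)) {
      const key = Number(field.split("_")[1]); // "Q2_1" -> 1
      const entry = totals.get(key) || { key, abs: 0, base: 0 };
      if (value !== -77) entry.base += 1; // item was seen
      if (value === 1) entry.abs += 1; // item was chosen
      totals.set(key, entry);
    }
  }
  return [...totals.values()]
    .sort((a, b) => a.key - b.key)
    // Assumption: report null instead of dividing by a zero base.
    .map((e) => ({ ...e, perc: e.base === 0 ? null : e.abs / e.base }));
}

const frequencies = itemFrequencies(sampleDocs);
console.log(frequencies);
```

For the sample data this gives abs/base of 2/3, 1/2 and 2/4 for items 1 to 3, matching the expected result array in the question.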

Related

mongodb aggregation select specific document in group

I need a bit of help with a MongoDB aggregation.
First I have a $match to filter some specific documents.
Then I group by the field I need them grouped on.
Within each group I need to select the document where a field has a particular value and use that document as the main data.
{"$match": {"$and": [
{chain: chain},
{dex: dex}
]}};
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: {the document with timeframe inside this group 86400.txCount}
}},
{$sort: {txCount: -1}}
{$skip: 0}
{$limit: 100}
Each group consists of documents with different timeframes. I need to somehow select a specific timeframe and add fields from that timeframe's document to the group. For example, each timeframe has a different txCount; after the group I want to sort by txCount, limit the number of results, and use skip for pagination.
The problem is selecting the document with the specific timeframe from that group.
If anyone could point me in the right direction, that would be awesome.
Here is an example of how the data is stored in the database and what I would like the result to be.
const document = {
_id: '3567356735672467',
pairAddress: '0x45jk6v34jy5634jkh5v6kj4h5v62j4h56', // group by pair address
baseToken: '0x456jn345k6hb4k5h6b3khb65k3hb56k3h4b6',
resolution: 86400, // a pair address has 6 documents with each a own timeframe 300, 900, 1800, 3600, 43200, 86400
base0: true,
txCount: 26,
buyCount: 10,
sellCount:16,
buyVolume: '2342354.345',
sellVolume: '1234.34',
volume: '1232352.345',
change: '12.34',
positive: true,
time: 1676865981,
chain: 'ETH',
dex: 'SUS',
price: '12.45',
};
const result = [
{
_id: "0x45jk6v34jy5634jkh5v6kj4h5v62j4h56",
allChange: {"$push": "$$ROOT"}, // array of all documents/timeframes for a pairAddress
selectedTxAmount: 26, // txCount of the document with the selected timeframe (e.g. 86400) within this pairAddress group
}
];
Maybe it's possible to change the aggregation to make it work, and make it faster:
match all timeframes, dex and chain,
sort by txCount,
skip X documents,
limit to 100,
and return every document with a field containing all timeframes for the pairAddresses left after the aggregation.
Currently, thanks to #1sina1, I have this and it works:
{"$match": {"$and": [
{"chain": chain},
{"dex": dex}
]}},
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: {
"$push": {
"$cond": {
"if": {
"$eq": [
"$resolution",
43200
]
},
"then": "$txCount",
"else": "$$REMOVE"
}
}
}
}},
{$sort: {txCount: -1}},
{$skip: parseInt(page) * 100},
{$limit: 100},
But I think there might be a better way. Right now we first group everything (about 20k documents), while I am only interested in 100 of them. So maybe: first match on the timeframe/resolution, then sort, skip, and limit, and only then fetch, for those 100 pairAddresses, all of their timeframes/resolutions as a field allChange.
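One possible reading of that idea, as an untested sketch rather than a definitive solution: filter to the chosen resolution first, page on txCount, and only then pull in the other timeframes with a self-$lookup. The collection name "pairs" and the helper name buildTopPairsPipeline are assumptions, and the sketch assumes each pairAddress has exactly one document per resolution.

```javascript
// Hypothetical helper: builds a pipeline that pages on one resolution first
// and attaches all timeframes afterwards.
function buildTopPairsPipeline(chain, dex, resolution, page, pageSize = 100) {
  return [
    // Only the documents of the selected timeframe take part in sorting.
    { $match: { chain, dex, resolution } },
    { $sort: { txCount: -1 } },
    { $skip: page * pageSize },
    { $limit: pageSize },
    // For the remaining pairs only, fetch every timeframe document.
    {
      $lookup: {
        from: "pairs", // assumed collection name; use your own
        localField: "pairAddress",
        foreignField: "pairAddress",
        as: "allChange",
      },
    },
  ];
}

const topPairsPipeline = buildTopPairsPipeline("ETH", "SUS", 86400, 0);
console.log(JSON.stringify(topPairsPipeline, null, 2));
```

Note that this $lookup joins on pairAddress alone, so if the same pairAddress can appear under several chains or dexes, the join would need an extra condition. The result would be passed to aggregate() on the same collection.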

Get items of array by index in MongoDB

So I have a data structure in a Mongo collection (v. 4.0.18) that looks something like this…
{
"_id": ObjectId("242kl4j2lk23423"),
"name": "Doug",
"kids": [
{
"name": "Alice",
"age": 15,
},
{
"name": "James",
"age": 13,
},
{
"name": "Michael",
"age": 10,
},
{
"name": "Sharon",
"age": 8,
}
]
}
In Mongo, how would I get back a projection of this object with only the first two kids? I want the output to look like this:
{
"_id": ObjectId("242kl4j2lk23423"),
"name": "Doug",
"kids": [
{
"name": "Alice",
"age": 15,
},
{
"name": "James",
"age": 13,
}
]
}
It seems like I should easily be able to get them by index, but I'm not seeing anything in the docs about how to do that. The real-world problem I'm trying to solve has nothing to do with kids, and the array could be quite lengthy. I'm trying to break it up and process it in batches without having to load the whole thing into memory in my application.
EDIT (non-sequential indexes):
I noticed that since I asked about item 1 & 2 that $slice would suffice…however, what if I wanted items 1 & 3? Is there a way I can specify specific array indexes to return?
Any ideas or pointers for how to accomplish that?
Thanks!
You are looking for the $slice projection operator if the desired elements are next to each other.
https://docs.mongodb.com/manual/reference/operator/projection/slice/
This would return the first two:
client.db.collection.find({"name":"Doug"}, { "kids": { "$slice": 2 } })
returns
{'_id': ObjectId('5f85f682a45e15af3a907f51'), 'name': 'Doug', 'kids': [{'name': 'Alice', 'age': 15}, {'name': 'James', 'age': 13}]}
This would skip the first kid and return the next two (second and third):
client.db.collection.find({"name":"Doug"}, { "kids": { "$slice": [1, 2] } })
returns
{'_id': ObjectId('5f85f682a45e15af3a907f51'), 'name': 'Doug', 'kids': [{'name': 'James', 'age': 13}, {'name': 'Michael', 'age': 10}]}
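Since the question mentions processing a long array in batches, the [skip, limit] form of $slice can be generated per batch. The following is a small sketch with made-up helper names (batchProjection, batchOf); the second helper only mirrors on a plain array what the projection is expected to select.

```javascript
// Hypothetical helper: projection document for batch number `batchIndex`
// (0-based) of the "kids" array, `batchSize` elements per batch.
function batchProjection(batchIndex, batchSize) {
  return { kids: { $slice: [batchIndex * batchSize, batchSize] } };
}

// Plain-array equivalent of $slice: [skip, limit], for illustration only.
function batchOf(array, batchIndex, batchSize) {
  const start = batchIndex * batchSize;
  return array.slice(start, start + batchSize);
}

const kidNames = ["Alice", "James", "Michael", "Sharon"];
const secondBatchProjection = batchProjection(1, 2);
const secondBatch = batchOf(kidNames, 1, 2);

console.log(secondBatchProjection, secondBatch);
```

Here batchProjection(1, 2) produces { kids: { $slice: [2, 2] } }, which would be passed as the projection argument of find() in place of the hard-coded values above.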
Edit:
Arbitrary selections such as items 1 and 3 probably need to go through an aggregation pipeline rather than a simple query. Performance shouldn't differ much, assuming you have an index on the $match field.
Steps of your pipeline should be pretty obvious and you should be able to take it from here.
Hate to point to RTFM, but that's going to be super helpful here to at least be acquainted with the pipeline operations.
https://docs.mongodb.com/manual/reference/operator/aggregation/
Your pipeline should:
$match on your desired query
$set a new field kid_selection to element 1 (the second element) and element 3 (the fourth element), since counting starts at 0. Notice the $ prefix on "kids" in the kid_selection setter: when referencing a field of the document you're working on, you need to prefix it with $
$project the whole document, minus the original kids field that we've selected from
client.db.collection.aggregate([
{"$match":{"name":"Doug"}},
{"$set": {"kid_selection": [
{ "$arrayElemAt": [ "$kids", 1 ] },
{ "$arrayElemAt": [ "$kids", 3 ] }
]}},
{ "$project": { "kids": 0 } }
])
returns
{
'_id': ObjectId('5f86038635649a988cdd2ade'),
'name': 'Doug',
'kid_selection': [
{'name': 'James', 'age': 13},
{'name': 'Sharon', 'age': 8}
]
}

Is it possible to do a subquery to return an array for the $nin operator in MongoDB?

I have a data set that looks something like:
{"key": "abc", "val": 1, "status": "np"}
{"key": "abc", "val": 2, "status": "p"}
{"key": "def", "val": 3, "status": "np"}
{"key": "ghi", "val": 4, "status": "np"}
{"key": "ghi", "val": 5, "status": "p"}
I want a query that returns document(s) that have a status="np" but only where there are no other documents with the same key that have a status value of "p". So the document returned from the data set above would be key="def", since "abc" has a value of "np" but "abc" also has a document with a value of "p". This is also true for key="ghi". I came up with something close, but I don't think the $nin operator supports a distinct query.
db.test2.find({$and: [{"status":"np"}, {"key": {$nin:[<distinct value query>]}}]})
If I were to hardcode the value in the $nin array, it would work:
db.test2.find({$and: [{"status":"np"}, {"key": {$nin:['abc', 'ghi']}}]})
I just need to be able to write a find inside the square brackets. I could do something like:
var res=[];
res = db.test2.distinct("key", {"status": "p"});
db.test2.find({$and: [{"status":"np"}, {"key": {$nin:res}}]});
But the problem with this is that, in the time between the two queries, another process may update the "status" of a document, and then I'd have inconsistent data.
Try this
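db.test2.aggregate([
  // Collect every status seen for each key
  {$group: {'_id': '$key', 'st': {$push: '$status'}}},
  // Flag whether the key has any "np" and any "p" documents ($in needs MongoDB 3.4+)
  {$project: {st: 1, 'hasNp': {$in: ['np', '$st']}, hasP: {$in: ['p', '$st']}}},
  // Keep only keys that have "np" but never "p"
  {$match: {hasNp: true, hasP: false}}
]);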
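The pipeline above returns one result per key rather than the original documents. As a hedged sketch, the variant below also carries the documents through the $group and unwinds them again at the end; the names npOnlyPipeline and npDocsWithoutP are made up for illustration, the pipeline has not been run against a live server, and the plain-JavaScript function only mirrors the intended logic on the sample data.

```javascript
// Assumed variant: keep the original documents while grouping by key.
const npOnlyPipeline = [
  { $group: { _id: "$key", statuses: { $addToSet: "$status" }, docs: { $push: "$$ROOT" } } },
  // Keys that have an "np" document and no "p" document.
  { $match: { statuses: { $all: ["np"], $nin: ["p"] } } },
  { $unwind: "$docs" },
  { $match: { "docs.status": "np" } },
  { $replaceRoot: { newRoot: "$docs" } },
];

// Sample data from the question.
const rows = [
  { key: "abc", val: 1, status: "np" },
  { key: "abc", val: 2, status: "p" },
  { key: "def", val: 3, status: "np" },
  { key: "ghi", val: 4, status: "np" },
  { key: "ghi", val: 5, status: "p" },
];

// Plain-JavaScript mirror of the same selection logic.
function npDocsWithoutP(documents) {
  const keysWithP = new Set(
    documents.filter((d) => d.status === "p").map((d) => d.key)
  );
  return documents.filter((d) => d.status === "np" && !keysWithP.has(d.key));
}

const npOnly = npDocsWithoutP(rows);
console.log(npOnly);
```

On the sample data only the key="def" document remains. Because the aggregation is a single command, it avoids the separate distinct-then-find round trip described in the question.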

MongoDB sort by relevance (mix $and and $or)

with 2 documents like :
{
"name": "hello",
"family": 1
},
{
"name": "world",
"family": 1,
"category": 2
}
and a query like :
doc.find({$or: [{family: 1}, {category: 2}]})
How can I get results sorted so that the document matching both conditions ("world") comes first, while the document matching only one condition ("hello") still appears last?
I can't use the default $and operator, as I would then not see the "hello" document, which does not match both conditions.
I saw how aggregation could help, but for a more complex example than this it would be a lot of computation. I'm guessing this is a common use case and there must be something obvious I'm missing.
You cannot do that sort of query (pun not intended) with a simple .find() statement. What you are asking for involves "weighting", which is applying a "calculated" precedence to values.
Anything involving "calculation" basically requires conditions to be applied programmatically, and the particular requirement here to "sort" rules out the "JavaScript runner" options like mapReduce, leaving the Aggregation Framework or other handling of the results.
For the aggregation framework approach you would need to $project a calculated "weight" to each matched document based on the conditions:
db.collection.aggregate([
// Same match conditions to filter
{ "$match": { "$or": [{ "family": 1, }, { "category": 2 }] } },
// Assign the "weight" based on conditions
{ "$project": {
"name": 1,
"family": 1,
"weight": {
"$add": [
{ "$cond": {
"if": { "$eq": [ "$family", 1 ] },
"then": 1,
"else": 0
}},
{ "$cond": {
"if": { "$eq": [ "$category", 2 ] },
"then": 1,
"else": 0
}}
]
}
}},
// Then sort "descending" with highest "weight" on top
{ "$sort": { "weight": -1 } }
])
Basically, you are using $cond to check which of your conditions each returned document actually meets, since the $or selection accepts a match on either field. Where a condition is met we assign a value, and where it is not the value is 0.
When "both" conditions are met, the $add operation combines them into the total weight. So here documents that met only one condition have a weight of 1, and those that met both have 2. If you wanted, for example, "family" to have the greater preference, then you would assign 2 in its condition, leaving you with possible document scores of:
3 : For both category and family
2 : For family only
1 : For category only
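The arithmetic behind those scores can be sketched in plain JavaScript on the two documents from the question. The weight function is a made-up stand-in for the two $cond expressions, with "family" worth 2 and "category" worth 1.

```javascript
// The two documents from the question.
const people = [
  { name: "hello", family: 1 },
  { name: "world", family: 1, category: 2 },
];

// Hypothetical stand-in for the $add of two $cond expressions.
function weight(doc) {
  const familyScore = doc.family === 1 ? 2 : 0; // preferred condition
  const categoryScore = doc.category === 2 ? 1 : 0;
  return familyScore + categoryScore;
}

// Highest weight first, like { $sort: { weight: -1 } }.
const ranked = people
  .map((doc) => ({ ...doc, weight: weight(doc) }))
  .sort((a, b) => b.weight - a.weight);

console.log(ranked);
```

With these weights "world" scores 3 (both conditions) and "hello" scores 2 (family only), so "world" sorts first.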
You could shorten the syntax of the $project in MongoDB 3.4 or later with the $addFields pipeline operator instead, which is most useful when you have a "lot" of other document properties you want to return without needing to list them all in the $project.
Aside from this, the database services don't allow for "calculations" on the "sort". This is considered "manipulation", which is the purpose of the Aggregation Framework.
Whilst you can do the same sort of "weighting" by post processing the result set in client code, the issue here is of course where you want to "limit" the results to return in actions like "paging". This is where running the operations on the server comes into play, and the reason why you use the Aggregation Framework for this.
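As a rough sketch of the $addFields form combined with server-side paging, the helper below builds the pipeline as a plain array. The function name buildWeightedPage and its parameters are assumptions for illustration, and $addFields requires MongoDB 3.4 or later as noted above.

```javascript
// Hypothetical helper: weighted match, sorted by weight, one page at a time.
function buildWeightedPage(page, pageSize) {
  return [
    { $match: { $or: [{ family: 1 }, { category: 2 }] } },
    // $addFields keeps every existing field and adds "weight".
    {
      $addFields: {
        weight: {
          $add: [
            { $cond: [{ $eq: ["$family", 1] }, 1, 0] },
            { $cond: [{ $eq: ["$category", 2] }, 1, 0] },
          ],
        },
      },
    },
    { $sort: { weight: -1 } },
    // Paging happens on the server, after the weighting.
    { $skip: page * pageSize },
    { $limit: pageSize },
  ];
}

const firstPage = buildWeightedPage(0, 10);
console.log(JSON.stringify(firstPage, null, 2));
```

The array would be passed to db.collection.aggregate(). In practice a second sort key such as _id is often added so that pages stay stable among documents with equal weight.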

Query on the last element of an array in MongoDB when the array size is stored in a variable

I have a dataset in MongoDB, and this is an example of one document of my data:
{ "conversionDate": "2016-08-01",
"timeLagInDaysHistogram": 0,
"pathLengthInInteractionsHistogram": 4,
"campaignPath": [
{"campaignName": "name1", "source": "sr1", "medium": "md1", "click": "0"},
{"campaignName": "name2", "source": "sr1", "medium": "md1", "click": "0"},
{"campaignName": "name1", "source": "sr2", "medium": "md2", "click": "1"},
{"campaignName": "name3", "source": "sr1", "medium": "md3", "click": "1"}
],
"totalTransactions": 1,
"totalValue": 37.0,
"avgCartValue": 37.0
}
The length of campaignPath is not constant, so each document can have a different number of elements.
I want to find documents whose last campaignPath element matches source = "sr1".
I know I can't do a query with something like
db.paths.find(
{
'campaignPath.-1.source': "sr1"
}
)
But since I have "pathLengthInInteractionsHistogram" stored, which is equal to the length of campaignPath, can't I do something like:
db.paths.find(
{
'campaignPath.$pathLengthInInteractionsHistogram.source': "sr1"
}
)
Starting with MongoDB 3.2, you can do this with aggregate which provides the $arrayElemAt operator which accepts a -1 index to access the last element.
db.paths.aggregate([
// Project the original doc along with the last campaignPath element
{$project: {
doc: '$$ROOT',
lastCampaign: {$arrayElemAt: ['$campaignPath', -1]}
}},
// Filter on the last campaign's source
{$match: {'lastCampaign.source': 'sr1'}},
// Remove the added lastCampaign field
{$project: {doc: 1}}
])
In earlier releases, you're stuck using $where. This will work but has poor performance:
db.paths.find({
$where: 'this.campaignPath[this.pathLengthInInteractionsHistogram-1].source === "sr1"'
})
which you could also do without using pathLengthInInteractionsHistogram:
db.paths.find({$where: 'this.campaignPath[this.campaignPath.length-1].source === "sr1"'})
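On MongoDB 3.6 or later, another option is to use $expr so that the same $arrayElemAt test runs inside a plain find() filter, which returns the original documents without the extra $project stages. This is a sketch under that version assumption; lastSourceFilter and lastSourceIs are made-up names, and the plain-JavaScript function only mirrors what the filter is meant to select.

```javascript
// Filter for find(): "$campaignPath.source" resolves to the array of all
// sources, and index -1 picks the last one. Requires $expr (MongoDB 3.6+).
const lastSourceFilter = {
  $expr: {
    $eq: [{ $arrayElemAt: ["$campaignPath.source", -1] }, "sr1"],
  },
};

// Plain-JavaScript mirror of the same test, for illustration only.
function lastSourceIs(doc, source) {
  const path = doc.campaignPath || [];
  return path.length > 0 && path[path.length - 1].source === source;
}

// Shortened version of the sample document from the question.
const sampleDoc = {
  conversionDate: "2016-08-01",
  campaignPath: [
    { campaignName: "name1", source: "sr1", medium: "md1", click: "0" },
    { campaignName: "name1", source: "sr2", medium: "md2", click: "1" },
    { campaignName: "name3", source: "sr1", medium: "md3", click: "1" },
  ],
};

console.log(lastSourceIs(sampleDoc, "sr1")); // usage: db.paths.find(lastSourceFilter)
```

Like $where, an $expr comparison of this kind generally cannot use an index on campaignPath.source, so it still scans the candidate documents.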