How to retrieve the total cases in my data set on a MongoDB query? - mongodb

I want the total number of cases in all my documents,
This is the query I tried to use:
db.coviddatajson.aggregate([
{ $group: { _id: null, total: { $sum: "$total_cases"} } }
])
For some reason the result is 0 which does not make sense, as it's supposed to be 1000+ at least and the expected result anything that is not zero will make sense but it's supposed to be a few thousands or something like that.
This is the dataset I am using:
https://covid.ourworldindata.org/data/owid-covid-data.json
What am I doing wrong here?
Any ideas on how to fix this query?

The total_cases field is inside data array, and $sum requires field type as number in $group stage, so before we need to do total($sum) of data.total_cases in current document and then pass it to $group stage and count total sum,
db.coviddatajson.aggregate([
{
$project: { total_cases: { $sum: "$data.total_cases" } }
},
{
$group: {
_id: null,
total: { $sum: "$total_cases" }
}
}
])
Playground

The data set has some issues.
The document size is bigger than 16MiB, you cannot load documents >16MiB into MongoDB. This in an internal limitation. You would need to split the document into sub-documents.
The document contains data for each country but also summarized data for "World". Do you have to exclude the "World" data? Can you use it, instead of manual summary?
The data is not consistent. For example some countries do not provide a number of male/female smokers or median age. Not all countries provide all data for each date, you may have missing values. How to deal with them?
Do you like a simple sum of all total_cases? If yes, the query would be easy, however the result would be pointless (15'773'189'214 total cases, twice population of the world).

Related

How do I calculate the average of multiple fields in a mongoDB document?

I have a collection with the following fields:
_id:612ff22c17286411a17252cf
WELL:"199-H4-15A"
TYPE:"E"
SYSTEM:"HX"
ID:"199-H4-15A_E_HX"
Jan-14:-168.8
Feb-14:-151
Mar-14:-164.1
Apr-14:-168.7
May-14:-172.6
Jun-14:-177.3
Jul-14:-177.6
Aug-14:-171.9
Sep-14:-138.9
Oct-14:-130.3
Nov-14:-163.8
Dec-14:-161.4
Jan-15:-168.7
Feb-15:-168.9
Mar-15:-168.6
Apr-15:-164.6
May-15:-141.7
Jun-15:-153.5
Jul-15:-163.7
Aug-15:-167.7
and I am trying to take the average of all of the month fields (e.g. all files like "Jan-14", "Feb-14" and so on). I was thinking of somehow pushing all of the month fields data into an array and then average the values but would like to avoid having to list all of the individual field names. Below is what I have so far.
[{$match: {
'WELL': '199-H4-15A'
}}, {$group: {
_id: null,
MonthAverageFlows: {
$push: {
$isNumber: ['$all']
}
}
}}, {$unwind: '$MonthAverageFlows'}, {$group: {
_id: null,
average: {
$avg: '$MonthAverageFlows.value'
}
}}]
All that comes out is ```null``. Any and all help would be appreciated. The raw data is in csv form:
WELL, TYPE, SYSTEM, ID, JAN-14, FEB-14, . . .
"199-H4-15A", "E", "HX", "199-H4-15A_E_HX", -168.8, -151, . . .
Using dynamic values as field name is generally considered as anti-pattern and you should avoid that. You are likely to introduce unnecessary difficulty to composing and maintaining your queries.
Nevertheless, you can do the followings in an aggregation pipeline:
$objectToArray to convert your raw document into k-v tuples array
$filter the array to get contain the monthly data only
$avg to calculate the sum
Here is the Mongo playground for your reference.

aggregating and sorting based on a Mongodb Relationship

I'm trying to figure out if what I want to do is even possible in Mongodb. I'm open to all sorts of suggestions regarding more appropriate ways to achieve what I need.
Currently, I have 2 collections:
vehicles (Contains vehicle data such as make and model. This data can be highly unstructured, which is why I turned to Mongodb for this)
views (Simply contains an IP, a date/time that the vehicle was viewed and the vehicle_id. There could be thousands of views)
I need to return a list of vehicles that have views between 2 dates. The list should include the number of views. I need to be able to sort by the number of views in addition to any of the usual vehicle fields. So, to be clear, if a vehicle has had 1000 views, but only 500 of those between the given dates, the count should return 500.
I'm pretty sure I could perform this query without any issues in MySQL - however, trying to store the vehicle data in MySQL has been a real headache in the past and it has been great moving to Mongo where I can add new data fields with ease and not worry about the structure of my database.
What do you all think?? TIA!
As it turns out, it's totally possible. It took me a long while to get my head around this, so I'm posting it up for future google searches...
db.statistics.aggregate({
$match: {
branch_id: { $in: [14] }
}
}, {
$lookup: {
from: 'vehicles', localField: 'vehicle_id', foreignField: '_id', as: 'vehicle'
}
}, {
$group: {
_id: "$vehicle_id",
count: { $sum: 1 },
vehicleObject: { $first: "$vehicle" }
}
}, { $unwind: "$vehicleObject" }, {
$project: {
daysInStock: { $subtract: [ new Date(), "$vehicleObject.date_assigned" ] },
vehicleObject: 1,
count: 1
}
}, { $sort: { count: -1 } }, { $limit: 10 });
To explain the above:
The Mongodb aggregate framework is the way forward for complex queries like this. Firstly, I run a $match to filter the records. Then, we use $lookup to grab the vehicle record. Worth mentioning here that this is a Many to One relationship here (lots of stats, each having a single vehicle). I can then group on the vehicle_id field, which will enable me to return one record per vehicle with a count of the number of stats in the group. As it is a group, we technically have lots of copies of that same vehicle document now in each group, so I then add just the first one into the vehicleObject variable. This would be fine, but $first tends to return an array with a single entry (pointless in my opinion), so I added the $unwind stage to pull the actual vehicle out. I then added a $project stage to calculate an additional field, sorted by the count descending and limited the results to 10.
And take a breath :)
I hope that helps someone. If you know of a better way to do what I did, then I'm open to suggestions to improve.

Meteor collection get last document of each selection

Currently I use the following find query to get the latest document of a certain ID
Conditions.find({
caveId: caveId
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
How can I use the same using multiple ids with $in for example
I tried it with the following query. The problem is that it will limit the documents to 1 for all the found caveIds. But it should set the limit for each different caveId.
Conditions.find({
caveId: {$in: caveIds}
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
One solution I came up with is using the aggregate functionality.
var conditionIds = Conditions.aggregate(
[
{"$match": { caveId: {"$in": caveIds}}},
{
$group:
{
_id: "$caveId",
conditionId: {$last: "$_id"},
diveDate: { $last: "$diveDate" }
}
}
]
).map(function(child) { return child.conditionId});
var conditions = Conditions.find({
_id: {$in: conditionIds}
},
{
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
You don't want to use $in here as noted. You could solve this problem by looping through the caveIds and running the query on each caveId individually.
you're basically looking at a join query here: you need all caveIds and then lookup last for each.
This is a problem of database schema/denormalization in my opinion: (but this is only an opinion!):
You could as mentioned here, lookup all caveIds and then run the single query for each, every single time you need to look up last dives.
However I think you are much better off recording/updating the last dive inside your cave document, and then lookup all caveIds of interest pulling only the lastDive field.
That will give you immediately what you need, rather than going through expensive search/sort queries. This is at the expense of maintaining that field in the document, but it sounds like it should be fairly trivial as you only need to update the one field when a new event occurs.

Result from "aggregate with unwind" is different from the "find with count"?

Here is a few documents from my collections:
{"make":"Lenovo", "model":"Thinkpad T430"},
{"make":"Lenovo", "model":"Thinkpad T430", "problems":["Battery"]},
{"make":"Lenovo", "model":"Thinkpad T430", "problems":["Battery","Brakes"]}
As you can see some documents have no problems, some have only one problem and some have few problems in a list.
I want to calculate how many reviews have a specific "problem" (like "Battery") in problems list.
I have tried to use the following aggregate command:
{ $match : { model : "Thinkpad T430"} },
{ $unwind : "$problems" },
{ $group: {
_id: '$problems',
count: { $sum: 1 }
}}
And for battery problem the count was 382. I also decided to double check this result with find() and count():
db.reviews.find({model:"Thinkpad T430",problems:"Battery"}).count()
Result was 362.
Why do I have this difference? And what is the right way to calculate it?
You likely have documents in the collection where problems contains more than one "Battery" string in the array.
When using $unwind, these will each result in their own doc, so the subsequent $group operation will count them separately.

sorting documents in mongodb

Let's say I have four documents in my collection:
{u'a': {u'time': 3}}
{u'a': {u'time': 5}}
{u'b': {u'time': 4}}
{u'b': {u'time': 2}}
Is it possible to sort them by the field 'time' which is common in both 'a' and 'b' documents?
Thank you
No, you should put your data into a common format so you can sort it on a common field. It can still be nested if you want but it would need to have the same path.
You can use use aggregation and the following code has been tested.
db.test.aggregate({
$project: {
time: {
"$cond": [{
"$gt": ["$a.time", null]
}, "$a.time", "$b.time"]
}
}
}, {
$sort: {
time: -1
}
});
Or if you also want the original fields returned back: gist
Alternatively you can sort once you get the result back, using a customized compare function ( not tested,for illustration purpose only)
db.eval(function() {
return db.mycollection.find().toArray().sort( function(doc1, doc2) {
var time1 = doc1.a? doc1.a.time:doc1.b.time,
time2 = doc2.a?doc2.a.time:doc2.b.time;
return time1 -time2;
})
});
You can, using the aggregation framework.
The trick here is to $project a common field to all the documents so that the $sort stage can use the value in that field to sort the documents.
The $ifNull operator can be used to check if a.time exists, it
does, then the record will be sorted by that value else, by b.time.
code:
db.t.aggregate([
{$project:{"a":1,"b":1,
"sortBy":{$ifNull:["$a.time","$b.time"]}}},
{$sort:{"sortBy":-1}},
{$project:{"a":1,"b":1}}
])
consequences of this approach:
The aggregation pipeline won't be covered by any of the index you
create.
The performance will be very poor for very large data sets.
What you could ideally do is to ask the source system that is sending you the data to standardize its format, something like:
{"a":1,"time":5}
{"b":1,"time":4}
That way your query can make use of the index if you create one on the time field.
db.t.ensureIndex({"time":-1});
code:
db.t.find({}).sort({"time":-1});