Finding duplicates with the same values in MongoDB

I have a data set in which I want to find duplicates, i.e. the same values repeated across multiple rows. For instance, I have a data set of 12 different rows, and I know the bottom 2 rows have identical values apart from the _id and title fields. How do I query to find these results, as if I didn't already know they are duplicates?
My collection is 'sales':
[
  {
    _id: "C12",
    title: "blouse",
    price: 15,
    units_sold: 100,
    retail_price: 30,
    ad_boost: 1,
    rate_count: 34,
    rating: 4
  },
  {
    _id: "C10",
    title: "loose floral blouse",
    price: 15,
    units_sold: 100,
    retail_price: 30,
    ad_boost: 1,
    rate_count: 34,
    rating: 4
  }
]

Simply $group by all the fields that you identify as the duplicate-check key. You can $push the _id (and title) into an array for later fetching/processing.
db.collection.aggregate([
  {
    $group: {
      _id: {
        price: "$price",
        units_sold: "$units_sold",
        retail_price: "$retail_price",
        ad_boost: "$ad_boost",
        rate_count: "$rate_count",
        rating: "$rating"
      },
      duplicate_ids: {
        $push: {
          _id: "$_id",
          title: "$title"
        }
      }
    }
  }
])
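If you only want the groups that actually contain duplicates, one option is to append a $match stage that keeps groups whose duplicate_ids array has more than one entry. A sketch ($expr with $size needs MongoDB 3.6+):

db.collection.aggregate([
  {
    $group: {
      _id: {
        price: "$price",
        units_sold: "$units_sold",
        retail_price: "$retail_price",
        ad_boost: "$ad_boost",
        rate_count: "$rate_count",
        rating: "$rating"
      },
      duplicate_ids: { $push: { _id: "$_id", title: "$title" } }
    }
  },
  // keep only groups with more than one member, i.e. actual duplicates
  { $match: { $expr: { $gt: [{ $size: "$duplicate_ids" }, 1] } } }
])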

Related

Extract last value from an array of objects using mongodb query language?

I'm new to Mongo. By new I mean a couple of hours new.
Basically I have this document structure:
{
  _id: ObjectId("614513461af3bf569fdc420e"),
  item: 'postcard',
  status: 'A',
  size: { h: 10, w: 15.25, uom: 'cm' },
  instock: [ { warehouse: 'B', qty: 15 }, { warehouse: 'C', qty: 35 } ]
}
I would like, if possible, to extract a particular field (i.e. its value) from instock's last element. In this case I just need to extract 35, i.e. the qty field.
I have managed to do this:
db.offer.find( { _id: ObjectId("614513461af3bf569fdc420e") }, { instock: 1, _id: 0} )
Which results in:
{ instock: [ { warehouse: 'B', qty: 15 }, { warehouse: 'C', qty: 35 } ] }
I don't know how to reach the last object in the array and then its qty field, and everything needs to be a single query.
Aggregate solution
(requires MongoDB 5.0, else the query would be a little bigger)
Query:
- filter for the _id with the $match stage
- get the last element of $instock, and then its qty field
- $project to keep only the part above
We do it the way we would in a programming language: get the last element, then get a field's value.
db.collection.aggregate([
  { "$match": { "_id": ObjectId("614513461af3bf569fdc420e") } },
  {
    "$project": {
      "_id": 0,
      "qty": { "$getField": { "field": "qty", "input": { "$last": "$instock" } } }
    }
  }
])
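As noted above, on servers older than 5.0 the query gets a little bigger. A sketch of one pre-5.0 alternative uses $arrayElemAt (available since MongoDB 3.2) instead of $getField/$last:

db.offer.aggregate([
  { "$match": { "_id": ObjectId("614513461af3bf569fdc420e") } },
  {
    "$project": {
      "_id": 0,
      // "$instock.qty" resolves to the array of qty values; index -1 takes the last one
      "qty": { "$arrayElemAt": ["$instock.qty", -1] }
    }
  }
])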

Can anyone explain the mongo insert process when a duplicate key error happens?

When I use db.collection.insert(document) and the inserted document is an array, and one element of the array triggers a duplicate key error, can all the other elements in the array still be successfully inserted into the collection?
If you set the ordered option to false, the insert statement will insert all the documents except the duplicates.
For example :
db.products.insertMany( [
  { _id: 10, item: "large box", qty: 20 },
  { _id: 11, item: "small box", qty: 55 },
  { _id: 11, item: "medium box", qty: 30 },
  { _id: 12, item: "envelope", qty: 100 },
  { _id: 13, item: "stamps", qty: 125 },
  { _id: 13, item: "tape", qty: 20 },
  { _id: 14, item: "bubble wrap", qty: 30 }
], { ordered: false } );
In the above insert statement, all documents will be inserted except the second occurrences of the duplicate _ids 11 and 13.
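For contrast, with the default ordered: true the server stops at the first error, so a sketch like this would insert only _id 10 and 11:

db.products.insertMany( [
  { _id: 10, item: "large box", qty: 20 },
  { _id: 11, item: "small box", qty: 55 },
  { _id: 11, item: "medium box", qty: 30 }, // duplicate _id: insertion stops here
  { _id: 12, item: "envelope", qty: 100 }   // never attempted
] ); // ordered defaults to true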

Counting data per user with mongo aggregation framework

I have a collection where each document contains a user_ids property, which is an array field. Example documents would be:
[
  {
    _id: 'i3oi1u31o2yi12o3i1',
    unique_prop: 33,
    prop1: 'some string value',
    prop2: 212,
    user_ids: [1, 2, 3, 4]
  },
  {
    _id: 'i3oi1u88ffdfi12o3i1',
    unique_prop: 34,
    prop1: 'some string value',
    prop2: 216,
    user_ids: [2, 3, 4]
  },
  {
    _id: 'i3oi1u8834432ddsda12o3i1',
    unique_prop: 35,
    prop1: 'some string value',
    prop2: 211,
    user_ids: [2]
  }
]
My goal is to get the number of documents per user, so the sample output would be:
[
  { user_id: 1, count: 1 },
  { user_id: 2, count: 3 },
  { user_id: 3, count: 2 },
  { user_id: 4, count: 2 }
]
I've tried a couple of things, none of which worked; lastly I tried:
aggregate([
  {
    $group: {
      _id: { unique_prop: "$unique_prop" },
      users: { "$addToSet": "$user_ids" },
      count: { "$sum": 1 }
    }
  }
])
But it just returned the users per document. I'm still trying to learn, so any resource or advice would help.
You need to $unwind the "user_ids" array and, in the $group stage, count the number of times each "id" appears in the collection.
db.collection.aggregate([
  { "$unwind": "$user_ids" },
  { "$group": { "_id": "$user_ids", "count": { "$sum": 1 } } }
])
MongoDB aggregation performs computations on groups of values from documents in a collection and returns the computed result by executing its stages in a pipeline.
Following the above description, please try executing the following aggregate query in the MongoDB shell.
db.collection.aggregate(
  // Pipeline
  [
    // Stage 1
    { $unwind: "$user_ids" },
    // Stage 2
    {
      $group: {
        _id: { user_id: '$user_ids' },
        total: { $sum: 1 }
      }
    },
    // Stage 3
    {
      $project: {
        _id: 0,
        user_id: '$_id.user_id',
        count: '$total'
      }
    }
  ]
);
In the above aggregate query, the $unwind operator first breaks the user_ids array field of each document into a separate document per array element; $group then groups the resulting documents by the value of their user_ids field and sums the number of documents for each value.

MongoDB aggregation but not including certain items

I'm very new to MongoDB's aggregation framework, so I don't properly know how to do this.
I have a data model that is structured like this:
{
  name: String,
  store: {
    item1: Number,
    item2: Number,
    item3: Number,
    item4: Number
  },
  createdAt: Date
}
I want to return the average price of every item i. I'm trying with this query:
db.commerces.aggregate([
  {
    $group: {
      _id: "",
      item1Avg: { $avg: "$store.item1" },
      item2Avg: { $avg: "$store.item2" },
      item3Avg: { $avg: "$store.item3" },
      item4Avg: { $avg: "$store.item4" }
    }
  }
]);
The problem is that when an item has no price set, it's stored in the database as -1.
I don't want these values to pollute the average result. Is there any way to limit the aggregation to only take prices > 0 into account?
A $match operator before $group is not a solution, because I still want to return all the average prices.
Thank you!
EDIT: Here you have an example of the input & desired output:
[{
  name: 'name',
  store: {
    item1: 10,
    item2: -1,
    item3: 12,
    item4: 3
  }
},
{
  name: 'name2',
  store: {
    item1: 10,
    item2: -1,
    item3: -1,
    item4: 2
  }
}, ...]
And the desired output:
{
  item1Avg: 10,
  item2Avg: 0,
  item3Avg: 12,
  item4Avg: 2.5
}
You need to $unwind the store, then $match values that meet your condition, then $group the ones that passed the test. Unfortunately there is no way to $unwind an object, so you need to $project it to an array first:
db.commerces.aggregate([
  { $project: { store: [
      { item: { $literal: "item1" }, val: "$store.item1" },
      { item: { $literal: "item2" }, val: "$store.item2" },
      { item: { $literal: "item3" }, val: "$store.item3" },
      { item: { $literal: "item4" }, val: "$store.item4" }
  ]}},
  { $unwind: "$store" },
  { $match: { "store.val": { $gt: 0 } } },
  { $group: { _id: "$store.item", avg: { $avg: "$store.val" } } }
])
EDIT:
As @blakes-seven pointed out, it may not work on versions < 3.2. An alternative approach with $map may work:
db.commerces.aggregate([
  { $project: {
      store: {
        $map: {
          input: [
            { item: { $literal: "item1" }, val: "$store.item1" },
            { item: { $literal: "item2" }, val: "$store.item2" },
            { item: { $literal: "item3" }, val: "$store.item3" },
            { item: { $literal: "item4" }, val: "$store.item4" }
          ],
          as: "i",
          in: "$$i"
        }
      }
  }},
  { $unwind: "$store" },
  { $match: { "store.val": { $gt: 0 } } },
  { $group: { _id: "$store.item", avg: { $avg: "$store.val" } } }
])
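If you'd rather keep the single-$group shape from the question (one document with item1Avg through item4Avg), another sketch wraps each $avg input in a $cond that turns non-positive prices into null; $avg ignores null, so the -1 values drop out of the computation:

db.commerces.aggregate([
  {
    $group: {
      _id: null,
      // $cond maps any non-positive price to null, which $avg skips
      item1Avg: { $avg: { $cond: [{ $gt: ["$store.item1", 0] }, "$store.item1", null] } },
      item2Avg: { $avg: { $cond: [{ $gt: ["$store.item2", 0] }, "$store.item2", null] } },
      item3Avg: { $avg: { $cond: [{ $gt: ["$store.item3", 0] }, "$store.item3", null] } },
      item4Avg: { $avg: { $cond: [{ $gt: ["$store.item4", 0] }, "$store.item4", null] } }
    }
  }
]);

Note that an item with no positive prices at all comes out as null rather than the 0 shown in the desired output.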

Mongodb aggregate on subdocument in array

I am implementing a small application using mongodb as a backend. In this application I have a data structure where the documents will contain a field that contains an array of subdocuments.
I use the following use case as a basis:
http://docs.mongodb.org/manual/use-cases/inventory-management/
As you can see from the example, each document has a field (carted in the use case, items in the document below) which is an array of subdocuments.
{
  _id: 42,
  last_modified: ISODate("2012-03-09T20:55:36Z"),
  status: 'active',
  items: [
    { sku: '00e8da9b', qty: 1, item_details: {...} },
    { sku: '0ab42f88', qty: 4, item_details: {...} }
  ]
}
This fits me perfect, except for one problem:
I want to count each unique item (with "sku" as the unique identifier key) in the entire collection, where each document increases the count by 1 (multiple instances of the same "sku" in the same document still count just once). E.g. I would like this result:
{ sku: '00e8da9b', doc_count: 1 },
{ sku: '0ab42f88', doc_count: 9 }
After reading up on MongoDB, I am quite confused about how to do this (fast) when you have a complex schema as described above. If I have understood the otherwise excellent documentation correctly, such an operation may perhaps be achieved using either the aggregation framework or the map/reduce framework, but this is where I need some input:
Which framework would be better suited to achieve the result I am looking for, given the complexity of the structure?
What kind of indexes would be preferred in order to gain the best possible performance out of the chosen framework?
MapReduce is slow, but it can handle very large data sets. The Aggregation framework, on the other hand, is a little quicker but will struggle with large data volumes.
The trouble with your structure as shown is that you need to "$unwind" the arrays to crack open the data. This means creating a new document for every array item, and with the aggregation framework it needs to do this in memory. So if you have 1000 documents with 100 array elements, it needs to build a stream of 100,000 documents in order to $group and count them.
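If you do hit the pipeline memory limits at that scale, the aggregate command accepts an allowDiskUse option (MongoDB 2.6+) that lets memory-hungry stages such as $group spill to temporary files; a minimal sketch:

db.so.aggregate([ /* stages as below */ ], { allowDiskUse: true })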
You might want to consider seeing if there's a schema layout that will serve your queries better, but if you want to do it with the Aggregation framework, here's how you could do it (with some sample data so the whole script will drop into the shell):
db.so.deleteMany({});
db.so.createIndex({ "items.sku": 1 }, { unique: false });
db.so.insertMany([
  {
    _id: 42,
    last_modified: ISODate("2012-03-09T20:55:36Z"),
    status: 'active',
    items: [
      { sku: '00e8da9b', qty: 1, item_details: {} },
      { sku: '0ab42f88', qty: 4, item_details: {} },
      { sku: '0ab42f88', qty: 4, item_details: {} },
      { sku: '0ab42f88', qty: 4, item_details: {} }
    ]
  },
  {
    _id: 43,
    last_modified: ISODate("2012-03-09T20:55:36Z"),
    status: 'active',
    items: [
      { sku: '00e8da9b', qty: 1, item_details: {} },
      { sku: '0ab42f88', qty: 4, item_details: {} }
    ]
  }
]);
db.so.runCommand("aggregate", {
  pipeline: [
    { // optional filter to exclude inactive elements - can be removed
      // you'll want an index on this if you use it too
      $match: { status: "active" }
    },
    // unwind creates a doc for every array element
    { $unwind: "$items" },
    {
      $group: {
        // group by unique doc/SKU pair, since a SKU should only count once per doc id
        _id: { _id: "$_id", sku: "$items.sku" }
      }
    },
    {
      $group: {
        // then group by unique SKU and count the pairs
        _id: { sku: "$_id.sku" },
        doc_count: { $sum: 1 }
      }
    }
  ],
  cursor: {} // the aggregate command requires this option on MongoDB 3.6+
  //, explain: true
})
Note that I've $group'd twice, because you said that an SKU can only count once per document, so we need to first sort out the unique doc/sku pairs and then count them up.
If you want the output a little different (in other words, EXACTLY like in your sample), we can $project them.
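For example, appending a final stage along these lines (a sketch) reshapes each result into the { sku, doc_count } form from the question:

{ $project: { _id: 0, sku: "$_id.sku", doc_count: 1 } }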
With the latest mongo build (it may be true for other builds too), I've found that a slightly different version of cirrus's answer performs faster and consumes less memory. I don't know the details of why; it seems that with this version mongo has more opportunity to optimize the pipeline.
db.so.runCommand("aggregate", {
  pipeline: [
    { $unwind: "$items" },
    {
      $group: {
        // create an array of unique sku's (a set) per id
        _id: { id: "$_id" },
        sku: { $addToSet: "$items.sku" }
      }
    },
    // unroll all the sets
    { $unwind: "$sku" },
    {
      $group: {
        // then count unique values per each id
        _id: { id: "$_id.id", sku: "$sku" },
        count: { $sum: 1 }
      }
    }
  ],
  cursor: {} // required on MongoDB 3.6+
})
To match exactly the format asked for in the question, the id part of the final grouping key should be dropped, i.e. the last $group should key on the sku alone.
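Putting that together, a sketch of the adjusted pipeline (here using the db.so.aggregate() shell helper) might look like this:

db.so.aggregate([
  { $unwind: "$items" },
  // one set of unique sku's per document
  { $group: { _id: "$_id", sku: { $addToSet: "$items.sku" } } },
  { $unwind: "$sku" },
  // group by sku alone so each document contributes at most once per sku
  { $group: { _id: "$sku", doc_count: { $sum: 1 } } },
  // reshape to { sku, doc_count } as in the question's sample output
  { $project: { _id: 0, sku: "$_id", doc_count: 1 } }
])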