I am running a MongoDB aggregate query to group the data of a collection, get the sum of the values in a field, and insert the result into another collection.
ex: collection1: [
  { name: "foo", group_id: 1, marks: 10 },
  { name: "bar", group_id: 1, marks: 20 },
  { name: "Hello World", group_id: 2, marks: 40 }
]
So the group-by query will insert into another collection, ex: collection2, with the following data:
collection2: [
  { group_id: 1, marks: 30 },
  { group_id: 2, marks: 40 }
]
I need to do these two operations:
Group the data and get the aggregate
Create a new collection with that data
Now comes the interesting part: the data being grouped is about 5 billion rows, so the query to get the aggregate of the marks will be very slow to execute.
Thus, writing a Node script that fetches the data group by group and then inserts it into another collection will not be very optimal. The other way I was thinking of was to limit the data to batches of x, ex: 1000, group each batch of 1000 and insert that into collection2, then for the next 1000 update collection2, and so on.
So, here are my questions. Is aggregating the data with a limit and then iterating over it faster?
ex:
step 1: group and get the sum of the marks of 1000 rows
step 2: insert/update collection2 with this data
step 3: goto step1
Is this method more useful than just getting the aggregate by grouping all 5 billion records and then inserting the result into collection2? Assuming there is a Node API doing the above task, how do I determine/calculate the limit size for the fastest operation? And how do I use whenMatched to update/insert the marks into collection2?
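To illustrate the whenMatched part, here is a minimal mongo shell sketch (assuming MongoDB 4.2+ for $merge; the commented-out $match batch filter is a hypothetical placeholder for however you slice the 5 billion rows). Each run adds the batch's sums onto whatever is already in collection2, which is what the incremental approach needs; for a single full-collection pass you could drop the $add/$ifNull and simply let whenMatched replace the matched document.

// $merge requires a unique index on the "on" field(s) in the target collection
db.collection2.createIndex({ group_id: 1 }, { unique: true });

db.collection1.aggregate([
  // hypothetical batch filter: replace with however you page through collection1
  // { $match: { group_id: { $gte: lowerBound, $lt: upperBound } } },
  { $group: { _id: "$group_id", marks: { $sum: "$marks" } } },
  { $project: { _id: 0, group_id: "$_id", marks: 1 } },
  { $merge: {
      into: "collection2",
      on: "group_id",
      // "$marks" is the value already stored in collection2, "$$new.marks" is this run's sum
      whenMatched: [
        { $set: { marks: { $add: [ { $ifNull: ["$marks", 0] }, "$$new.marks" ] } } }
      ],
      whenNotMatched: "insert"
  } }
], { allowDiskUse: true });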
In my Postgres DB, I'm performing a deletion operation using another table, as below.
DELETE FROM user_records
USING to_delete_records
WHERE user_records.record_id = to_delete_records.record_id
The user_records table contains around 200 million records, while the to_delete_records table contains around 5-10 million records. Every day the to_delete_records table is updated with a new set of records, and I have to perform the above deletion operation. (Similar to the deletions, insertion operations of around 5-10 million records take place as well, hence the total size of user_records remains around 200 million.)
Now I'm replacing the Postgres DB with MongoDB, and the following is the script I'm using for deleting records in the user_records collection:
db.to_delete_records.find({}, {_id: 0}).forEach(function(doc) {
    db.user_records.deleteOne({record_id: doc.record_id});
});
As this runs as a loop issuing one delete per document, it seems inefficient.
Is there a better way to delete documents of a collection using another collection in Mongo?
If record_id is a unique field in both user_records and to_delete_records, you can build a unique index on the field in each collection, if you have not done so already.
db.user_records.createIndex({record_id: 1}, {unique:true});
db.to_delete_records.createIndex({record_id: 1}, {unique:true});
Afterwards, you can use a $merge stage to add an auxiliary field toDelete to the collection user_records, based on the content of to_delete_records:
db.to_delete_records.aggregate([
  {
    "$merge": {
      "into": "user_records",
      "on": "record_id",
      "whenMatched": [
        { "$set": { "toDelete": true } }
      ]
    }
  }
])
Finally, run a deleteMany on user_records:
db.user_records.deleteMany({toDelete: true});
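If you want a sanity check on how many documents were flagged before actually deleting them, you can run a count between the two steps (assuming MongoDB 4.0+ for countDocuments):

// number of user_records documents marked by the $merge above
db.user_records.countDocuments({toDelete: true});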
I'm trying to efficiently find the top entries by group in Arango (AQL). I have a fairly standard object collection and an edge collection representing Departments and Employees in that department.
Example purpose: Find the top 2 employees in each department by most years of experience.
Sample Data:
"departments" is an object collection. Here are some entries:
_id             name
departments/1   engineering
departments/2   sales
"dept_emp_edges" is an edge collection connecting departments and employee objects by ids.
_id                _from           _to           years_exp
dept_emp_edges/1   departments/1   employees/1   3
dept_emp_edges/2   departments/1   employees/2   4
dept_emp_edges/3   departments/1   employees/3   5
dept_emp_edges/4   departments/2   employees/1   6
I would like to end up with the top 2 employees per department by most years of experience:
department      employee        years_exp
departments/1   employees/3     5
departments/1   employees/2     4
departments/2   employees/1     6
Long Working Query
The following query works! But it is a bit slow on larger collections and feels inefficient.
FOR dept IN departments
LET top2earners = (
FOR dep_emp_edge IN dept_emp_edges
FILTER dep_emp_edge._from == dept._id
SORT dep_emp_edge.years_exp DESC
LIMIT 2
RETURN {'department': dep_emp_edge._from,
'employee': dep_emp_edge._to,
'years_exp': dep_emp_edge.years_exp}
)
FOR row in top2earners
return {'department': dep_emp_edge._from,
'employee': dep_emp_edge._to,
'years_exp': dep_emp_edge.years_exp}
I don't like this because there are 3 loops in here and it feels rather inefficient.
Short Query
However, I tried to write:
FOR dept IN departments
FOR dep_emp_edge IN dept_emp_edges
FILTER dep_emp_edge._from == dept._id
SORT dep_emp_edge.years_exp DESC
LIMIT 2
RETURN {'department': dep_emp_edge._from,
'employee': dep_emp_edge._to,
'years_exp': dep_emp_edge.years_exp}
But this last query only outputs the top 2 results overall, not the top 2 in each department.
My questions are: (1) why doesn't the second shorter query give all results? and (2) I'm quite new to Arango and ArangoQL, what other things can I do to make sure this is efficient?
Your first query is incorrect as written (Query: AQL: collection or view not found: dep_emp_edge (while parsing)); as I could only guess what you mean, I will ignore it for now.
Your shorter query limits the overall results to two - counterintuitively - because you are not grouping by department.
I suggest a slightly different approach: use the edge collection as the central source and group by _from, returning one document per department that contains an array of the (up to) two top employees, rather than one document per employee:
FOR edge IN dept_emp_edges
  SORT edge.years_exp DESC
  COLLECT dep = edge._from INTO deps
  LET emps = (
    FOR e IN deps
      LIMIT 2
      RETURN ZIP(["employee", "years_exp"], [e.edge._to, e.edge.years_exp])
  )
  RETURN {"department": dep, "employees": emps}
For your example database this returns:
[
  {
    "department": "departments/1",
    "employees": [
      { "employee": "employees/3", "years_exp": 5 },
      { "employee": "employees/2", "years_exp": 4 }
    ]
  },
  {
    "department": "departments/2",
    "employees": [
      { "employee": "employees/1", "years_exp": 6 }
    ]
  }
]
If the query is too slow, an index on the years_exp field of the dept_emp_edges collection could help (Explain suggests it would).
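For reference, a minimal arangosh sketch for creating such an index (this assumes ArangoDB 3.x, where sortable persistent indexes are created via ensureIndex):

// persistent index on the field the query sorts by
db.dept_emp_edges.ensureIndex({ type: "persistent", fields: ["years_exp"] });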
var product = db.GetCollection<Product>("Product");
var lookup1 = new BsonDocument(
"$lookup",
new BsonDocument {
{ "from", "Variant" },
{ "localField", "Maincode" },
{ "foreignField", "Maincode" },
{ "as", "variants" }
}
);
var pipeline = new[] { lookup1};
var result = product.Aggregate<Product>(pipeline).ToList();
The data in the collections is very large, so it takes 30 seconds to put the data into the list.
What should I do to make the lookup faster?
What that query is doing is retrieving every document from the Product collection, and then for each document found, performing a find query in the Variant collection. If there is no index on the Maincode field in the Variant collection, it will be reading the entire collection for each document.
This means that if there are, say, 1000 total products, with 3000 total variants (3 per product, on average), this query will be reading all 1000 documents from Product, and if that index isn't there, it would read all 3000 documents from Variant 1000 times, i.e. it will be examining 3 million documents.
Some ways to possibly speed this up:
create an index on {Maincode:1} in the Variant collection (see the sketch after this list)
This will reduce the number of documents that must be read in order to complete the lookup
change the schema
If the variants are stored in the same document with the product, there is no need for a lookup
filter the products prior to lookup
Again, reducing the documents read during the lookup
use a cursor to retrieve the documents in batches
If you perform any necessary sorting first, and the lookup last, you can return the documents to the application in batches, which would allow the application to display or begin processing the first batch before the second batch is available. This doesn't make the query itself faster, but it can reduce the perceived wait in the application.
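As a concrete starting point for the index and pre-filtering suggestions above, here is a minimal mongo shell sketch (shell syntax rather than the C# driver; collection and field names are taken from the question, and the $match filter is a hypothetical placeholder):

// lets each $lookup resolve Maincode from the index instead of scanning Variant
db.Variant.createIndex({ Maincode: 1 });

// filtering Product before the $lookup reduces how many lookups are performed at all
db.Product.aggregate([
  { $match: { /* hypothetical filter, e.g. a category or date range */ } },
  { $lookup: { from: "Variant", localField: "Maincode", foreignField: "Maincode", as: "variants" } }
]);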
I have a collection with over 10 million records. I need to match on a particular field and get the distinct _ids of the resulting record set.
After the $match stage the result set becomes less than 5 million.
If I group by the ID to get the unique IDs, the execution time on my local environment is over 20 seconds.
db.getCollection('viewscounts').aggregate([
  { $match: { MODULE_ID: 4 } },
  { $group: { _id: '$ITEM_ID' } }
], { allowDiskUse: true })
If I get rid of either $match or $group and keep only one stage in the pipeline, the execution time is less than 0.1 seconds.
I'm okay with limiting the _ids, but they should be unique.
Can anyone suggest a better way to get the results faster?
You have already implemented the best aggregation pipeline possible to get your desired output.
The reason your query is faster when using only one of the stages is that the query returns partial output instead of the entire 5 million records, whereas when you add both stages, the entire output of the $match stage has to be processed by the $group stage, resulting in more time.
The only way to optimize your aggregation query is to apply indexes on the MODULE_ID and ITEM_ID keys:
db.viewscounts.createIndex({MODULE_ID: 1}, { sparse: true })
db.viewscounts.createIndex({ITEM_ID: 1})
It should be faster after you perform the above two indexes on your viewscounts collection.
Additionally, you can also get your desired output from MongoDB distinct command. Give the below query a try and see if it helps.
db.getCollection('viewscounts').distinct("ITEM_ID", {"MODULE_ID": 4})
Note: the above query returns an array of unique values instead of objects like the aggregation query does.
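As an aside that is not from the original answer: depending on the data, a single compound index on both keys may serve this query better than two separate single-field indexes, since both the filter and the distinct key can then be answered from the index; it is worth benchmarking against the two-index setup above.

// hypothetical alternative: one compound index covering the filter and the distinct key
db.viewscounts.createIndex({MODULE_ID: 1, ITEM_ID: 1})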
Hope this helps
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because they matter. Here is the complete Python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. query the original collection by the entry key, order it by test_date ascending,
        #     limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove, and this is the important detail of my question that changes everything; I should have been more specific about it - sorry.
Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do that with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to mess up Mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
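A minimal mongo shell sketch of that chunked pattern (the {name: 'John'} filter and the 1000-document batch size are just the example values from above; adjust both to your case):

var query = {name: 'John'};
var batchSize = 1000;
var deleted;
do {
    // grab one batch of _ids, then delete exactly those documents
    var ids = db.collection.find(query, {_id: 1}).limit(batchSize).toArray().map(function (d) { return d._id; });
    deleted = db.collection.deleteMany({_id: {$in: ids}}).deletedCount;
} while (deleted > 0);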
You can remove it directly using the MongoDB shell:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6M documents in a 100M-document collection. Documentation: https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany
db.collection.deleteMany(
  <filter>,
  {
    writeConcern: <document>,
    collation: <document>
  }
)
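For the example filter used earlier in this thread, the call is simply (a sketch; substitute your own collection name and filter):

db.collection.deleteMany({name: 'John'});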
I would recommend paging if there is a large number of records.
First: get the count of the data you want to delete:
-------------------------- COUNT --------------------------
var query= {"FEILD":"XYZ", 'DATE': {$lt:new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query= {"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});