MongoDB - Best way to delete documents by query based on results of another query

I have a collection that can contain several million documents; for simplicity, let's say they look like this:
{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}
I want a daily job that deletes all of a user_id's documents if that user_id has a doc with event_type 'c'.
So the resulting collection will be
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
I did it successfully with the mongo shell, like this:
var cur = db.my_collection.find({'event_type': 'c'});
ids = [];
while (cur.hasNext()) {
  ids.push(cur.next()['user_id']);
  if (ids.length == 5) {
    print('deleting for user_ids', ids);
    print(db.my_collection.deleteMany({user_id: {$in: ids}}));
    ids = [];
  }
}
if (ids.length) { db.my_collection.deleteMany({user_id: {$in: ids}}); }
I created a cursor over all docs with event_type 'c', grouped their user_ids into batches of 5, then deleted all docs with those user_ids.
It works, but it looks very slow, as if each cur.next() only fetches one doc at a time.
I wanted to know if there is a better or more correct way to achieve this. If it were Elasticsearch, I would create a sliced scroll, scan each slice in parallel, and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with Mongo?
Scale-wise, I expect several million docs (~10M) in the collection, ~300K docs that match the query, and ~700K docs that should be deleted.

It sounds like you can just use deleteMany with the original query:
db.my_collection.deleteMany({
  event_type: 'c'
})
There are no size limitations on it; it might just take a couple of minutes to run, depending on instance size.
EDIT:
I would personally try to use the distinct function; it's the cleanest and easiest code. distinct does have a 16MB limit on its result, though, and ~300K unique ids a day (depending on the user_id field size) sounds a bit close to that threshold, or past it.
const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})
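If the distinct result does get near that 16MB limit, a common workaround (a sketch of my own, not part of the answer above) is to collect the ids with an aggregation cursor instead, since a $group stage streams its results back as many small documents rather than one big one:
// Sketch: gather the distinct user_ids via an aggregation cursor.
// Each { _id: <user_id> } result document is tiny, so the 16MB cap on a
// single distinct() reply no longer applies.
const ids = db.my_collection.aggregate([
  { $match: { event_type: 'c' } },
  { $group: { _id: '$user_id' } }
]).toArray().map(doc => doc._id);
db.my_collection.deleteMany({ user_id: { $in: ids } });
For a very large id list you would still split the $in into chunks, as in the batched approach below.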
Assuming you expect the scale to increase, or this fails your tests, then the best way is to use something similar to your approach, just in much larger batches. For example:
const batchSize = 100000;
const count = await db.my_collection.countDocuments({'event_type': 'c'});
let iteration = 0;
while (iteration * batchSize < count) {
  // No skip() is needed: each pass deletes the matched 'c' docs along with the
  // rest of those users' docs, so the next find() returns the next batch.
  const batch = await db.my_collection.find({'event_type': 'c'}, { projection: { user_id: 1 } }).limit(batchSize).toArray();
  if (batch.length === 0) {
    break;
  }
  await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
  iteration++;
}

Related

Flutter/Dart Get All Values Of a Key from a JSON

How can I get all values of a second-level key in a JSON?
I need to get all the total values, like 409571117, 410559043, 411028977, 411287235.
JSON:
{"country":"USA","timeline":
[{"total":409571117,"daily":757824,"totalPerHundred":121,"dailyPerMillion":2253,"date":"10/14/21"},
{"total":410559043,"daily":743873,"totalPerHundred":122,"dailyPerMillion":2212,"date":"10/15/21"},
{"total":411028977,"daily":737439,"totalPerHundred":122,"dailyPerMillion":2193,"date":"10/16/21"},
{"total":411287235,"daily":731383,"totalPerHundred":122,"dailyPerMillion":2175,"date":"10/17/21"}]}
I am able to get the first-level values but I don't know how to get the second level.
final list = [
{'id': 1, 'name': 'flutter', 'title': 'dart'},
{'id': 35, 'name': 'flutter', 'title': 'dart'},
{'id': 93, 'name': 'flutter', 'title': 'dart'},
{'id': 82, 'name': 'flutter', 'title': 'dart'},
{'id': 28, 'name': 'flutter', 'title': 'dart'},
];
final idList = list.map((e) => e['id']).toList(); // [1, 35, 93, 82, 28]
Python version of the same question: Python: Getting all values of a specific key from json
UPDATE: You must declare the types in map. See below.
Have you tried subsetting on timeline after using jsonDecode?
For example, you format the data as json:
final newList = jsonEncode(
{ "country": "USA", "timeline":[
{ "total": 409571117, "daily": 757824, "totalPerHundred": 121, "dailyPerMillion": 2253, "date": "10/14/21" },
{"total": 410559043, ...
Then you decode the data into the list you want by first subsetting the timeline feature:
final extractedData = jsonDecode(newList) as Map<String, dynamic>;
final newIdList = extractedData['timeline'].map((e) => e["total"]).toList();

Flutter remove duplicate array by value

I have an array that can contain the same name value more than once; when a duplicate value comes in, I need to delete it.
My data looks like this:
[{'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}, {'name':'XYZ', 'data': [{'age': 1, 'number': 2}]}, {'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}];
I want to show it like this, with no duplicate names.
Expected output: dataaa = [{'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}, {'name':'XYZ', 'data': [{'age': 1, 'number': 2}]}];
var list = [
  {'name': 'Rameez', 'data': [{'age': 1, 'number': 2}]},
  {'name': 'XYZ', 'data': [{'age': 1, 'number': 2}]},
  {'name': 'Rameez', 'data': [{'age': 1, 'number': 2}]}
];
for (int i = 0; i < list.length; i++) {
  for (int j = i + 1; j < list.length; j++) {
    if (list[i]["name"] == list[j]["name"]) {
      list.removeAt(j);
      j--; // stay at this index: removal shifted the next element down
    }
  }
}
list.forEach((item) => print(item.toString()));
Output
{name: Rameez, data: [{age: 1, number: 2}]}
{name: XYZ, data: [{age: 1, number: 2}]}

How does $max work over an array of objects?

Take an example collection with these documents:
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
{'name': 'a', 'sales': 0, 'leads': 2},
{'name': 'b', 'sales': 5, 'leads': 1},
{'name': 'c', 'sales': 3.3, 'leads': 1}]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
{'name': 'a', 'sales': 6, 'leads': 1},
{'name': 'b', 'sales': 6, 'leads': 3},
{'name': 'c', 'sales': 1.3, 'leads': 4}]})
How does $max determine which item in the location array is maximal?
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b995d72eabb0f0d86dceda5'),
'maxItem': {'leads': 1, 'name': 'b', 'sales': 5}},
{'_id': ObjectId('5b995d72eabb0f0d86dceda6'),
'maxItem': {'leads': 3, 'name': 'b', 'sales': 6}}]
It looks like $max is sorting on sales, but I am not sure why.
I discovered this
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#objects
which states:
MongoDB’s comparison of BSON objects uses the following order:
1. Recursively compare key-value pairs in the order that they appear within the BSON object.
2. Compare the key field names.
3. If the key field names are equal, compare the field values.
4. If the field values are equal, compare the next key/value pair (return to step 1). An object without further pairs is less than an object with further pairs.
which means that if sales is the first key in the BSON object, then I have my answer. I'm using pymongo, and Python dictionaries aren't ordered, so I switched to bson.son.SON and re-did the example:
client.test.foo.delete_many({})
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 0), ('leads', 2)]),
bson.son.SON([('name', 'b'), ('sales', 5), ('leads', 1)]),
bson.son.SON([('name', 'c'), ('sales', 3.3), ('leads', 1)])]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 6), ('leads', 1)]),
bson.son.SON([('name', 'b'), ('sales', 6), ('leads', 3)]),
bson.son.SON([('name', 'c'), ('sales', 1.3), ('leads', 4)])]})
And now it's sorting by name:
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b99619beabb0f0d86dcedaf'),
'maxItem': {'leads': 1, 'name': 'c', 'sales': 3.3}},
{'_id': ObjectId('5b99619beabb0f0d86dcedb0'),
'maxItem': {'leads': 4, 'name': 'c', 'sales': 1.3}}]

kafka-streams join produce duplicates

I have two topics:
// photos
{'id': 1, 'user_id': 1, 'url': 'url#1'},
{'id': 2, 'user_id': 2, 'url': 'url#2'},
{'id': 3, 'user_id': 2, 'url': 'url#3'}
// users
{'id': 1, 'name': 'user#1'},
{'id': 1, 'name': 'user#1'},
{'id': 1, 'name': 'user#1'}
I create a photo-by-user mapping:
KStream<Integer, Photo> photo_by_user = ...
photo_by_user.to("photo_by_user")
Then, I try to join two tables:
KTable<Integer, User> users_table = builder.table("users");
KTable<Integer, Photo> photo_by_user_table = builder.table("photo_by_user");
KStream<Integer, Result> results = users_table.join(photo_by_user_table, (a, b) -> Result.from(a, b)).toStream();
results.to("results");
The result looks like:
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
I see that the results are duplicated. Why, and how do I avoid it?
You might be hitting a known bug. On "flush", a KTable-KTable join might produce some duplicates. Note that those duplicates are, strictly speaking, not incorrect, because the result is an update-stream and updating "A" to "A" does not change the result. It is of course undesirable to get those duplicates. Try to disable caching -- without caching, the "flush" issue should not occur.
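For reference, caching is controlled by the cache.max.bytes.buffering setting; here is a minimal sketch, assuming a standard Properties-based Streams setup (the application id and broker address are placeholders, and builder is the StreamsBuilder from the topology above):
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "photo-user-join");   // placeholder app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
// A record cache of 0 bytes disables caching, so every update is forwarded
// downstream immediately and the "flush" duplicates should no longer appear.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
The trade-off is more output records overall, since intermediate updates are no longer absorbed by the cache.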

Django/NetworkX Eliminating Repeated Nodes

I want to use d3js to visualize the connections between the users of my Django website.
I am reusing the code from the force-directed graph example, which requires that each node has two attributes (ID and Name). I have created a node for each user in user_profiles_table and added an edge between already-created nodes based on each row in connections_table. It does not work; networkx creates new nodes when I start working with the connection_table.
nodeindex = 0
for user_profile in UserProfile.objects.all():
    sourcetostring = user_profile.full_name3()
    G.add_node(nodeindex, name=sourcetostring)
    nodeindex = nodeindex + 1
for user_connection in Connection.objects.all():
    target_tostring = user_connection.target()
    source_tostring = user_connection.source()
    G.add_edge(sourcetostring, target_tostring, value=1)
data = json_graph.node_link_data(G)
result:
{'directed': False,
'graph': [],
'links': [{'source': 6, 'target': 7, 'value': 1},
{'source': 7, 'target': 8, 'value': 1},
{'source': 7, 'target': 9, 'value': 1},
{'source': 7, 'target': 10, 'value': 1},
{'source': 7, 'target': 7, 'value': 1}],
'multigraph': False,
'nodes': [{'id': 0, 'name': u'raymondkalonji'},
{'id': 1, 'name': u'raykaeng'},
{'id': 2, 'name': u'raymondkalonji2'},
{'id': 3, 'name': u'tester1cet'},
{'id': 4, 'name': u'tester2cet'},
{'id': 5, 'name': u'tester3cet'},
{'id': u'tester2cet'},
{'id': u'tester3cet'},
{'id': u'tester1cet'},
{'id': u'raykaeng'},
{'id': u'raymondkalonji2'}]}
How can I eliminate the repeated nodes?
You probably get repeated nodes because your user_connection.target() and user_connection.source() functions return the node name, not its id. When you call add_edge, if the endpoints do not exist in the graph, they are created, which explains why you get duplicates.
The following code should work.
for user_profile in UserProfile.objects.all():
    source = user_profile.full_name3()
    G.add_node(source, name=source)
for user_connection in Connection.objects.all():
    target = user_connection.target()
    source = user_connection.source()
    G.add_edge(source, target, value=1)
data = json_graph.node_link_data(G)
Also note that you should dump the data object to JSON if you want a properly formatted JSON string. You can do that as follows.
import json
json_string = json.dumps(data)  # get the string representation
with open('somefile.json', 'w') as f:
    json.dump(data, f)  # write to file