Django/NetworkX Eliminating Repeated Nodes - networkx

I want to use d3js to visualize the connections between the users of my Django website.
I am reusing the code from the force-directed graph example, which requires that each node has two attributes (ID and Name). I created a node for each user in user_profiles_table and added an edge between already-created nodes based on each row in connections_table. It does not work: networkx creates new nodes when I start working with the connections_table.
nodeindex = 0
for user_profile in UserProfile.objects.all():
    sourcetostring = user_profile.full_name3()
    G.add_node(nodeindex, name=sourcetostring)
    nodeindex = nodeindex + 1
for user_connection in Connection.objects.all():
    target_tostring = user_connection.target()
    source_tostring = user_connection.source()
    G.add_edge(source_tostring, target_tostring, value=1)
data = json_graph.node_link_data(G)
result:
{'directed': False,
'graph': [],
'links': [{'source': 6, 'target': 7, 'value': 1},
{'source': 7, 'target': 8, 'value': 1},
{'source': 7, 'target': 9, 'value': 1},
{'source': 7, 'target': 10, 'value': 1},
{'source': 7, 'target': 7, 'value': 1}],
'multigraph': False,
'nodes': [{'id': 0, 'name': u'raymondkalonji'},
{'id': 1, 'name': u'raykaeng'},
{'id': 2, 'name': u'raymondkalonji2'},
{'id': 3, 'name': u'tester1cet'},
{'id': 4, 'name': u'tester2cet'},
{'id': 5, 'name': u'tester3cet'},
{'id': u'tester2cet'},
{'id': u'tester3cet'},
{'id': u'tester1cet'},
{'id': u'raykaeng'},
{'id': u'raymondkalonji2'}]}
How can I eliminate the repeated nodes?

You probably get repeated nodes because your user_connection.target() and user_connection.source() functions return the node name, not its id. When you call add_edge, any endpoint that does not already exist in the graph is created on the fly, which explains why you get duplicates.
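You can see this behavior in isolation with plain networkx, independent of the Django models:

import networkx as nx

G = nx.Graph()
G.add_node(0, name='alice')   # node keyed by an integer index
G.add_edge('alice', 'bob')    # endpoints don't exist yet, so networkx creates them
print(list(G.nodes(data=True)))
# [(0, {'name': 'alice'}), ('alice', {}), ('bob', {})]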
The following code should work.
for user_profile in UserProfile.objects.all():
    source = user_profile.full_name3()
    G.add_node(source, name=source)
for user_connection in Connection.objects.all():
    target = user_connection.target()
    source = user_connection.source()
    G.add_edge(source, target, value=1)
data = json_graph.node_link_data(G)
Also note that you should dump the data object to json if you want a properly formatted json string. You can do that as follows.
import json

json_string = json.dumps(data)  # get the string representation
with open('somefile.json', 'w') as f:
    json.dump(data, f)  # write to a file (json.dump expects a file object, not a filename)

Related

MongoDB - Best way to delete documents by query based on results of another query

I have a collection that can contain several million documents; for simplicity, let's say they look like this:
{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}
I want a daily job that runs and deletes all of a user_id's documents if that user_id has a doc with event_type 'c'.
So the resulting collection will be
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
I did it successfully in the mongo shell like this:
var cur = db.my_collection.find({'event_type': 'c'})
ids = [];
while (cur.hasNext()) {
    ids.push(cur.next()['user_id']);
    if (ids.length == 5) {
        print('deleting for user_ids', ids);
        print(db.my_collection.deleteMany({user_id: {$in: ids}}));
        ids = [];
    }
}
if (ids.length) { db.my_collection.deleteMany({user_id: {$in: ids}}) }
I created a cursor over all docs with event_type 'c', grouped their user_ids into batches of 5, then deleted all docs with those ids.
It works, but it seems very slow, as if each cur.next() fetches only one doc at a time.
I wanted to know if there is a better or more correct way to achieve this. If it were Elasticsearch, I would create a sliced scroll, scan each slice in parallel, and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with mongo?
Scale-wise, I expect several million docs (~10M) in the collection, 300K docs that match the query, and ~700K that should be deleted.
It sounds like you can just use deleteMany with the original query:
db.my_collection.deleteMany({
    event_type: 'c'
})
There are no size limitations on it; it might just take a couple of minutes to run, depending on instance size.
EDIT:
I would personally try the distinct function first; it's the cleanest and simplest code. Note that distinct does have a 16MB limit, and ~300K unique ids a day (depending on the user_id field size) sounds close to that threshold, or past it.
const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})
Assuming you expect the scale to increase, or if this fails your tests, then the best way is to use something similar to your approach, just in much larger batches. For example:
const batchSize = 100000;
const count = await db.my_collection.countDocuments({'event_type': 'c'});
let iteration = 0;
while (iteration * batchSize < count) {
    // Fetch the next batch of matching docs, projecting only user_id.
    const batch = await db.my_collection
        .find({'event_type': 'c'}, { projection: { user_id: 1 } })
        .limit(batchSize)
        .toArray();
    if (batch.length === 0) {
        break;
    }
    // Deleting by user_id also removes the event_type 'c' docs themselves,
    // so the next find() naturally advances to users not yet processed.
    await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
    iteration++;
}

How to properly sort list by another list in Dart

I have two lists: one is a list of Map items, and the other is the desired order.
I would like to sort the items based on their description attribute, compare them against the order list, and have the matched items inserted at the top.
import 'package:collection/collection.dart';

void main() {
  List<String> order = [
    'top european',
    'top usa',
    'top rest of the world'
  ];
  List<Map> items = [
    {'id': 0, 'id2': 5, 'description': 'Top USA'},
    {'id': 2, 'id2': 2, 'description': 'Top A'},
    {'id': 3, 'id2': 0, 'description': 'Top Z'},
    {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
    {'id': 4, 'id2': 4, 'description': 'Top C'},
    {'id': 5, 'id2': 1, 'description': 'Top D'},
    {'id': 1, 'id2': 3, 'description': 'Top European'},
  ];
  // this works but adds the matched items at the end
  items.sort((a, b) {
    return order.indexOf(a['description'].toLowerCase()) -
        order.indexOf(b['description'].toLowerCase());
  });
  /// Results: print(items);
  // List<Map> items = [
  //   {'id': 2, 'id2': 2, 'description': 'Top A'},
  //   {'id': 3, 'id2': 0, 'description': 'Top Z'},
  //   {'id': 4, 'id2': 4, 'description': 'Top C'},
  //   {'id': 5, 'id2': 1, 'description': 'Top D'},
  //   {'id': 1, 'id2': 3, 'description': 'Top European'},
  //   {'id': 0, 'id2': 5, 'description': 'Top USA'},
  //   {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
  // ];
}
SOLUTION: I also tried this approach, which is not ideal, but it works.
List<Map> itemsOrder = items
    .where(
        (ele) => order.contains(ele['description'].toString().toLowerCase()))
    .toList();
itemsOrder.sort((a, b) {
  return order.indexOf(a['description'].toLowerCase()) -
      order.indexOf(b['description'].toLowerCase());
});
items.removeWhere(
    (ele) => order.contains(ele['description'].toString().toLowerCase()));
itemsOrder = itemsOrder.reversed.toList();
for (int i = 0; i < itemsOrder.length; i++) {
  items.insert(0, itemsOrder[i]);
}
///Results: print(items);
// List<Map> items = [
// {'id': 1, 'id2': 3, 'description': 'Top European'},
// {'id': 0, 'id2': 5, 'description': 'Top USA'},
// {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
// {'id': 2, 'id2': 2, 'description': 'Top A'},
// {'id': 3, 'id2': 0, 'description': 'Top Z'},
// {'id': 4, 'id2': 4, 'description': 'Top C'},
// {'id': 5, 'id2': 1, 'description': 'Top D'},
// ];
Ideally, I would like to use sortBy or sortByCompare but unfortunately, I cannot find a proper example or get a grasp of how to use it.
The way I would fix this is to look up the index of each description in the order list; if it cannot be found, I use an index beyond the end of the order list so that the item sorts to the bottom.
This would be my solution:
void testIt() {
  // assumes the `order` and `items` lists from the question are in scope
  final outOfBounds = order.length + 1;
  const description = 'description';
  items.sort(
    (lhs, rhs) {
      final lhsDesc = (lhs[description] as String).toLowerCase();
      final rhsDesc = (rhs[description] as String).toLowerCase();
      final lhsIndex =
          order.contains(lhsDesc) ? order.indexOf(lhsDesc) : outOfBounds;
      final rhsIndex =
          order.contains(rhsDesc) ? order.indexOf(rhsDesc) : outOfBounds;
      return lhsIndex.compareTo(rhsIndex);
    },
  );
}
And the result is:
[{id: 1, id2: 3, description: Top European}, {id: 0, id2: 5, description: Top USA}, {id: 6, id2: 6, description: Top Rest of the world}, {id: 2, id2: 2, description: Top A}, {id: 3, id2: 0, description: Top Z}, {id: 4, id2: 4, description: Top C}, {id: 5, id2: 1, description: Top D}]
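Since the question mentions sortByCompare from package:collection, the same out-of-bounds-index idea can also be expressed with it. A sketch, assuming the same order and items lists from the question:

import 'package:collection/collection.dart';

void sortWithSortByCompare(List<Map> items, List<String> order) {
  // Derive a sort key: the position in `order`, or past-the-end when absent,
  // so unmatched items fall to the bottom.
  int keyOf(Map item) {
    final index = order.indexOf((item['description'] as String).toLowerCase());
    return index == -1 ? order.length : index;
  }

  // sortByCompare (a List extension from package:collection) sorts in place
  // by the derived key, using the given comparator on the keys.
  items.sortByCompare(keyOf, (a, b) => a.compareTo(b));
}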

Flutter/Dart Get All Values Of a Key from a JSON

How can I get all values of a second-level key in a JSON?
I need to get all the total values, i.e. 409571117, 410559043, 411028977, 411287235.
JSON:
{"country":"USA","timeline":
[{"total":409571117,"daily":757824,"totalPerHundred":121,"dailyPerMillion":2253,"date":"10/14/21"},
{"total":410559043,"daily":743873,"totalPerHundred":122,"dailyPerMillion":2212,"date":"10/15/21"},
{"total":411028977,"daily":737439,"totalPerHundred":122,"dailyPerMillion":2193,"date":"10/16/21"},
{"total":411287235,"daily":731383,"totalPerHundred":122,"dailyPerMillion":2175,"date":"10/17/21"}]}
I'm able to get first-level values, but I don't know how to get the second level.
final list = [
  {'id': 1, 'name': 'flutter', 'title': 'dart'},
  {'id': 35, 'name': 'flutter', 'title': 'dart'},
  {'id': 93, 'name': 'flutter', 'title': 'dart'},
  {'id': 82, 'name': 'flutter', 'title': 'dart'},
  {'id': 28, 'name': 'flutter', 'title': 'dart'},
];
final idList = list.map((e) => e['id']).toList(); // [1, 35, 93, 82, 28]
Python version of the same question: Python: Getting all values of a specific key from json
UPDATE: You must declare the types in map. See below.
Have you tried subsetting on timeline after using jsonDecode?
For example, you encode the data as JSON:
final newList = jsonEncode(
  { "country": "USA", "timeline": [
    { "total": 409571117, "daily": 757824, "totalPerHundred": 121, "dailyPerMillion": 2253, "date": "10/14/21" },
    {"total": 410559043, ...
Then you decode the data into the list you want by first subsetting the timeline feature:
final extractedData = jsonDecode(newList) as Map<String, dynamic>;
final newIdList = extractedData['timeline'].map((e) => e["total"]).toList();
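Per the UPDATE above, here is a variant with explicit types. This is a sketch that assumes the decoded structure shown in the question:

// continuing from extractedData above
final timeline = (extractedData['timeline'] as List).cast<Map<String, dynamic>>();
final totals = timeline.map<int>((e) => e['total'] as int).toList();
// totals: [409571117, 410559043, 411028977, 411287235]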

How does $max work over an array of objects?

Take an example collection with these documents:
client.test.foo.insert_one({
    'name': 'clientA',
    'locations': [
        {'name': 'a', 'sales': 0, 'leads': 2},
        {'name': 'b', 'sales': 5, 'leads': 1},
        {'name': 'c', 'sales': 3.3, 'leads': 1}]})
client.test.foo.insert_one({
    'name': 'clientB',
    'locations': [
        {'name': 'a', 'sales': 6, 'leads': 1},
        {'name': 'b', 'sales': 6, 'leads': 3},
        {'name': 'c', 'sales': 1.3, 'leads': 4}]})
How does $max determine which item in the location array is maximal?
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b995d72eabb0f0d86dceda5'),
'maxItem': {'leads': 1, 'name': 'b', 'sales': 5}},
{'_id': ObjectId('5b995d72eabb0f0d86dceda6'),
'maxItem': {'leads': 3, 'name': 'b', 'sales': 6}}]
It looks like $max is comparing on sales, but I am not sure why.
I discovered this
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#objects
which states:
MongoDB’s comparison of BSON objects uses the following order:

1. Recursively compare key-value pairs in the order that they appear within the BSON object.
2. Compare the key field names.
3. If the key field names are equal, compare the field values.
4. If the field values are equal, compare the next key/value pair (return to step 1). An object without further pairs is less than an object with further pairs.
which means that if sales is the first key in the BSON object, then I have my answer: comparing {'sales': 5, 'name': 'b', ...} with {'sales': 3.3, 'name': 'c', ...}, the first key names are both 'sales', so the values 5 and 3.3 decide the comparison and name is never reached. I'm using pymongo, and plain Python dictionaries don't guarantee key order (before Python 3.7), so I switched to bson.son.SON and re-did the example:
client.test.foo.delete_many({})
client.test.foo.insert_one({
    'name': 'clientA',
    'locations': [
        bson.son.SON([('name', 'a'), ('sales', 0), ('leads', 2)]),
        bson.son.SON([('name', 'b'), ('sales', 5), ('leads', 1)]),
        bson.son.SON([('name', 'c'), ('sales', 3.3), ('leads', 1)])]})
client.test.foo.insert_one({
    'name': 'clientB',
    'locations': [
        bson.son.SON([('name', 'a'), ('sales', 6), ('leads', 1)]),
        bson.son.SON([('name', 'b'), ('sales', 6), ('leads', 3)]),
        bson.son.SON([('name', 'c'), ('sales', 1.3), ('leads', 4)])]})
And now it sorts by name:
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b99619beabb0f0d86dcedaf'),
'maxItem': {'leads': 1, 'name': 'c', 'sales': 3.3}},
{'_id': ObjectId('5b99619beabb0f0d86dcedb0'),
'maxItem': {'leads': 4, 'name': 'c', 'sales': 1.3}}]

kafka-streams join produce duplicates

I have two topics:
// photos
{'id': 1, 'user_id': 1, 'url': 'url#1'},
{'id': 2, 'user_id': 2, 'url': 'url#2'},
{'id': 3, 'user_id': 2, 'url': 'url#3'}
// users
{'id': 1, 'name': 'user#1'},
{'id': 2, 'name': 'user#2'},
{'id': 3, 'name': 'user#3'}
I create a mapping of photos keyed by user:
KStream<Integer, Photo> photo_by_user = ...
photo_by_user.to("photo_by_user")
Then, I try to join two tables:
KTable<Integer, User> users_table = builder.table("users");
KTable<Integer, Photo> photo_by_user_table = builder.table("photo_by_user");
KStream<Integer, Result> results = users_table.join(photo_by_user_table, (a, b) -> Result.from(a, b)).toStream();
results.to("results");
The result looks like:
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
I see that the results are duplicated. Why, and how can I avoid it?
You might be hitting a known bug. On "flush", a KTable-KTable join might produce some duplicates. Note that those duplicates are, strictly speaking, not incorrect, because the result is an update-stream, and updating "A" to "A" does not change the result. It is of course undesirable to get those duplicates. Try disabling caching; without caching, the "flush issue" should not occur.
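For reference, a minimal sketch of disabling the record cache in the streams configuration (the application id and bootstrap servers below are placeholders, and builder is the StreamsBuilder from the question):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "photo-user-join");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// A cache size of 0 disables record caching, so the join forwards every
// update downstream immediately instead of buffering and flushing.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();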