I have two topics:
// photos
{'id': 1, 'user_id': 1, 'url': 'url#1'},
{'id': 2, 'user_id': 2, 'url': 'url#2'},
{'id': 3, 'user_id': 2, 'url': 'url#3'}
// users
{'id': 1, 'name': 'user#1'},
{'id': 2, 'name': 'user#2'},
{'id': 3, 'name': 'user#3'}
I create a stream of photos keyed by user id:
KStream<Integer, Photo> photo_by_user = ...
photo_by_user.to("photo_by_user");
Then, I try to join two tables:
KTable<Integer, User> users_table = builder.table("users");
KTable<Integer, Photo> photo_by_user_table = builder.table("photo_by_user");
KStream<Integer, Result> results = users_table.join(photo_by_user_table, (a, b) -> Result.from(a, b)).toStream();
results.to("results");
The result looks like this:
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
I see that results are duplicated. Why, and how to avoid it?
You might be hitting a known bug. On "flush", a KTable-KTable join can produce duplicates. Note that those duplicates are, strictly speaking, not incorrect, because the result is an update stream, and updating "A" to "A" does not change the result. Of course, the duplicates are undesirable. Try disabling caching; without caching, the "flush" issue should not occur.
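Caching can be disabled by setting the record cache size to zero in the streams configuration. A minimal sketch, assuming the StreamsBuilder API from the question and that your other properties stay unchanged:
Properties props = new Properties();
// ... your existing application.id, bootstrap.servers, serdes, etc. ...
// A 0-byte record cache disables caching entirely, so every update is
// forwarded downstream immediately and the "flush" duplicates disappear.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
KafkaStreams streams = new KafkaStreams(builder.build(), props);
The trade-off is a larger downstream update volume, since consecutive updates to the same key are no longer collapsed in the cache.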
I have a collection that can contain several million documents; for simplicity, let's say they look like this:
{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}
I want a daily job that runs and deletes all of a user_id's documents if that user_id has a doc with event_type 'c'.
So the resulting collection will be:
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
I did it successfully in the mongo shell like this:
var cur = db.my_collection.find({'event_type': 'c'});
var ids = [];
while (cur.hasNext()) {
  ids.push(cur.next()['user_id']);
  if (ids.length == 5) {
    print('deleting for user_ids', ids);
    print(db.my_collection.deleteMany({user_id: {$in: ids}}));
    ids = [];
  }
}
if (ids.length) { db.my_collection.deleteMany({user_id: {$in: ids}}); }
I created a cursor over all docs with event_type 'c', grouped their user_ids into batches of 5, then deleted all docs with those user_ids.
It works, but it looks very slow, as if each cur.next() fetches only one doc at a time.
I wanted to know if there is a better or more correct way to achieve this. If this were Elasticsearch, I would create a sliced scroll, scan each slice in parallel, and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with Mongo?
Scale-wise, I expect several million docs (~10M) in the collection, ~300K docs that match the query, and ~700K that should be deleted.
It sounds like you can just use deleteMany with the original query:
db.my_collection.deleteMany({
event_type: 'c'
})
There are no size limitations on it; it might just take a couple of minutes to run, depending on instance size.
EDIT:
I would personally try the distinct function first; it gives the cleanest and easiest code. Note that distinct does have a 16MB limit on its result, and ~300K unique ids a day (depending on the user_id field size) sounds close to that threshold, or past it.
const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})
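If the 16MB distinct limit does become a problem, one alternative sketch is to collect the unique ids with $group in an aggregation, which returns its results through a cursor rather than a single document:
const userIds = db.my_collection
  .aggregate([
    { $match: { event_type: 'c' } },
    { $group: { _id: '$user_id' } }  // one output doc per unique user_id
  ])
  .toArray()
  .map(doc => doc._id);
db.my_collection.deleteMany({ user_id: { $in: userIds } });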
Assuming you expect the scale to increase, or this fails your tests, then the best way is to use something similar to your approach, just in much larger batches. For example:
const batchSize = 100000;
const count = await db.my_collection.countDocuments({'event_type': 'c'});
let iteration = 0;
while (iteration * batchSize < count) {
  const batch = await db.my_collection
    .find({'event_type': 'c'}, { projection: { user_id: 1 } })
    .limit(batchSize)
    .toArray();
  if (batch.length === 0) {
    break;
  }
  await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
  iteration++;
}
I have two lists: one is a list of Map items, and the other is a list that defines the desired order.
I would like to sort the items based on their description attribute, comparing them against the order list, so that the matching items are inserted at the top.
import 'package:collection/collection.dart';
void main() {
List<String> order = [
'top european',
'top usa',
'top rest of the world'
];
List<Map> items = [
{'id': 0, 'id2': 5, 'description': 'Top USA'},
{'id': 2, 'id2': 2, 'description': 'Top A'},
{'id': 3, 'id2': 0, 'description': 'Top Z'},
{'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
{'id': 4, 'id2': 4, 'description': 'Top C'},
{'id': 5, 'id2': 1, 'description': 'Top D'},
{'id': 1, 'id2': 3, 'description': 'Top European'},
];
// this sorts, but items missing from the order list get indexOf == -1,
// so the matched items end up at the end instead of the top
items.sort((a,b) {
return order.indexOf(a['description'].toLowerCase()) -
order.indexOf(b['description'].toLowerCase());
});
///Results: print(items);
// List<Map> items = [
// {'id': 2, 'id2': 2, 'description': 'Top A'},
// {'id': 3, 'id2': 0, 'description': 'Top Z'},
// {'id': 4, 'id2': 4, 'description': 'Top C'},
// {'id': 5, 'id2': 1, 'description': 'Top D'},
// {'id': 1, 'id2': 3, 'description': 'Top European'},
// {'id': 0, 'id2': 5, 'description': 'Top USA'},
// {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
// ];
}
SOLUTION: I also tried this approach, which is not ideal, but it works.
List<Map> itemsOrder = items
.where(
(ele) => order.contains(ele['description'].toString().toLowerCase()))
.toList();
itemsOrder.sort((a, b) {
return order.indexOf(a['description'].toLowerCase()) -
order.indexOf(b['description'].toLowerCase());
});
items.removeWhere(
(ele) => order.contains(ele['description'].toString().toLowerCase()));
itemsOrder = itemsOrder.reversed.toList();
for (int i = 0; i < itemsOrder.length; i++) {
items.insert(0, itemsOrder[i]);
}
///Results: print(items);
// List<Map> items = [
// {'id': 1, 'id2': 3, 'description': 'Top European'},
// {'id': 0, 'id2': 5, 'description': 'Top USA'},
// {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
// {'id': 2, 'id2': 2, 'description': 'Top A'},
// {'id': 3, 'id2': 0, 'description': 'Top Z'},
// {'id': 4, 'id2': 4, 'description': 'Top C'},
// {'id': 5, 'id2': 1, 'description': 'Top D'},
// ];
Ideally, I would like to use sortBy or sortByCompare, but unfortunately I cannot find a proper example or get a grasp of how to use them.
The way I would fix this is to find the index of the description in the order list and, if it cannot be found, use an index that is out of bounds for the order list, so that the item sorts to the bottom.
This would be my solution:
void testIt() {
final outOfBounds = order.length + 1;
const description = 'description';
items.sort(
(lhs, rhs) {
final lhsDesc = (lhs[description] as String).toLowerCase();
final rhsDesc = (rhs[description] as String).toLowerCase();
final lhsIndex =
order.contains(lhsDesc) ? order.indexOf(lhsDesc) : outOfBounds;
final rhsIndex =
order.contains(rhsDesc) ? order.indexOf(rhsDesc) : outOfBounds;
return lhsIndex.compareTo(rhsIndex);
},
);
}
And the result is:
[{id: 1, id2: 3, description: Top European}, {id: 0, id2: 5, description: Top USA}, {id: 6, id2: 6, description: Top Rest of the world}, {id: 2, id2: 2, description: Top A}, {id: 3, id2: 0, description: Top Z}, {id: 4, id2: 4, description: Top C}, {id: 5, id2: 1, description: Top D}]
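Since you asked about sortByCompare from package:collection: a minimal sketch of the same idea, where a helper (the rank function below is my own naming, not from your code) maps unmatched descriptions past the end of order:
import 'package:collection/collection.dart';

void sortWithOrder(List<Map> items, List<String> order) {
  // Rank = position in the order list; unmatched descriptions rank last.
  num rank(Map e) {
    final i = order.indexOf((e['description'] as String).toLowerCase());
    return i == -1 ? order.length : i;
  }

  // sortByCompare takes a key extractor and a comparator for the keys.
  items.sortByCompare(rank, (a, b) => a.compareTo(b));
}
Calling sortWithOrder(items, order) should produce the same result as the compareTo solution above.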
How can I get all values of a second-level key in a JSON?
I need to get all the total values, i.e. 409571117, 410559043, 411028977, 411287235.
JSON:
{"country":"USA","timeline":
[{"total":409571117,"daily":757824,"totalPerHundred":121,"dailyPerMillion":2253,"date":"10/14/21"},
{"total":410559043,"daily":743873,"totalPerHundred":122,"dailyPerMillion":2212,"date":"10/15/21"},
{"total":411028977,"daily":737439,"totalPerHundred":122,"dailyPerMillion":2193,"date":"10/16/21"},
{"total":411287235,"daily":731383,"totalPerHundred":122,"dailyPerMillion":2175,"date":"10/17/21"}]}
I am able to get first-level values, but I don't know how to get the second level.
final list = [
{'id': 1, 'name': 'flutter', 'title': 'dart'},
{'id': 35, 'name': 'flutter', 'title': 'dart'},
{'id': 93, 'name': 'flutter', 'title': 'dart'},
{'id': 82, 'name': 'flutter', 'title': 'dart'},
{'id': 28, 'name': 'flutter', 'title': 'dart'},
];
final idList = list.map((e) => e['id']).toList(); // [1, 35, 93, 82, 28]
Python version of the same question: Python: Getting all values of a specific key from json
UPDATE: You must declare the types in the map. See below.
Have you tried subsetting on timeline after using jsonDecode?
For example, you encode the data as a JSON string:
final newList = jsonEncode(
{ "country": "USA", "timeline":[
{ "total": 409571117, "daily": 757824, "totalPerHundred": 121, "dailyPerMillion": 2253, "date": "10/14/21" },
{"total": 410559043, ...
Then you decode the data into the list you want by first subsetting the timeline feature:
final extractedData = jsonDecode(newList) as Map<String, dynamic>;
final newIdList = extractedData['timeline'].map((e) => e["total"]).toList();
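Putting it together, a minimal runnable sketch (the timeline is abbreviated to two entries here):
import 'dart:convert';

void main() {
  const raw = '{"country":"USA","timeline":'
      '[{"total":409571117,"date":"10/14/21"},'
      '{"total":410559043,"date":"10/15/21"}]}';
  final data = jsonDecode(raw) as Map<String, dynamic>;
  // timeline is a JSON array, so cast it to List before mapping over it.
  final totals =
      (data['timeline'] as List).map((e) => e['total'] as int).toList();
  print(totals); // [409571117, 410559043]
}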
I have an array that can contain entries with the same name value; when the same name comes up again, I need to delete the duplicate.
My data looks like
[{'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}, {'name':'XYZ', 'data': [{'age': 1, 'number': 2}]}, {'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}];
I want to show it like this, with no duplicate names. Expected output:
[{'name':'Rameez', 'data': [{'age': 1, 'number': 2}]}, {'name':'XYZ', 'data': [{'age': 1, 'number': 2}]}]
var list = [
  {'name': 'Rameez', 'data': [{'age': 1, 'number': 2}]},
  {'name': 'XYZ', 'data': [{'age': 1, 'number': 2}]},
  {'name': 'Rameez', 'data': [{'age': 1, 'number': 2}]}
];
for (int i = 0; i < list.length; i++) {
  for (int j = i + 1; j < list.length; j++) {
    if (list[i]['name'] == list[j]['name']) {
      list.removeAt(j);
      j--; // stay at this index: the next element just shifted into slot j
    }
  }
}
list.forEach((item) => print(item.toString()));
Output
{name: Rameez, data: [{age: 1, number: 2}]}
{name: XYZ, data: [{age: 1, number: 2}]}
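A shorter alternative sketch: keep only the first occurrence of each name with a Set, relying on Set.add returning false for a value it has already seen.
final seen = <String>{};
// where() keeps an element only if its name was not seen before.
final deduped = list.where((e) => seen.add(e['name'] as String)).toList();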
I want to use d3js to visualize the connections between the users of my Django website.
I am reusing the code from the force-directed graph example, which requires that each node has two attributes (id and name). I created a node for each user in user_profiles_table and added an edge between already-created nodes based on each row in connections_table. It does not work: networkx creates new nodes when I start working with the connections_table.
nodeindex = 0
for user_profile in UserProfile.objects.all():
    sourcetostring = user_profile.full_name3()
    G.add_node(nodeindex, name=sourcetostring)
    nodeindex = nodeindex + 1

for user_connection in Connection.objects.all():
    target_tostring = user_connection.target()
    source_tostring = user_connection.source()
    G.add_edge(sourcetostring, target_tostring, value=1)

data = json_graph.node_link_data(G)
result:
{'directed': False,
'graph': [],
'links': [{'source': 6, 'target': 7, 'value': 1},
{'source': 7, 'target': 8, 'value': 1},
{'source': 7, 'target': 9, 'value': 1},
{'source': 7, 'target': 10, 'value': 1},
{'source': 7, 'target': 7, 'value': 1}],
'multigraph': False,
'nodes': [{'id': 0, 'name': u'raymondkalonji'},
{'id': 1, 'name': u'raykaeng'},
{'id': 2, 'name': u'raymondkalonji2'},
{'id': 3, 'name': u'tester1cet'},
{'id': 4, 'name': u'tester2cet'},
{'id': 5, 'name': u'tester3cet'},
{'id': u'tester2cet'},
{'id': u'tester3cet'},
{'id': u'tester1cet'},
{'id': u'raykaeng'},
{'id': u'raymondkalonji2'}]}
How can I eliminate the repeated nodes?
You probably get repeated nodes because your user_connection.target() and user_connection.source() functions return the node name, not its id. When you call add_edge, endpoints that do not exist in the graph are created, which explains why you get duplicates.
The following code should work.
for user_profile in UserProfile.objects.all():
    source = user_profile.full_name3()
    G.add_node(source, name=source)

for user_connection in Connection.objects.all():
    target = user_connection.target()
    source = user_connection.source()
    G.add_edge(source, target, value=1)

data = json_graph.node_link_data(G)
Also note that you should dump the data object to JSON if you want a properly formatted JSON string. You can do that as follows.
import json
json.dumps(data)  # get the string representation
with open('somefile.json', 'w') as f:
    json.dump(data, f)  # json.dump expects a file object, not a filename