How does $max work over an array of objects? - mongodb

Take an example collection with these documents:
client.test.foo.insert_one({
    'name': 'clientA',
    'locations': [
        {'name': 'a', 'sales': 0, 'leads': 2},
        {'name': 'b', 'sales': 5, 'leads': 1},
        {'name': 'c', 'sales': 3.3, 'leads': 1}]})
client.test.foo.insert_one({
    'name': 'clientB',
    'locations': [
        {'name': 'a', 'sales': 6, 'leads': 1},
        {'name': 'b', 'sales': 6, 'leads': 3},
        {'name': 'c', 'sales': 1.3, 'leads': 4}]})
How does $max determine which item in the location array is maximal?
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b995d72eabb0f0d86dceda5'),
'maxItem': {'leads': 1, 'name': 'b', 'sales': 5}},
{'_id': ObjectId('5b995d72eabb0f0d86dceda6'),
'maxItem': {'leads': 3, 'name': 'b', 'sales': 6}}]
It looks like $max is comparing on sales, but I am not sure why.

I discovered this
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#objects
which states:
MongoDB’s comparison of BSON objects uses the following order:
1. Recursively compare key-value pairs in the order that they appear within the BSON object.
2. Compare the key field names.
3. If the key field names are equal, compare the field values.
4. If the field values are equal, compare the next key/value pair (return to step 1). An object without further pairs is less than an object with further pairs.
This means that if sales is the first key in the BSON object, then I have my answer. I'm using pymongo, and Python dictionaries aren't ordered (at least not guaranteed to be in older versions), so I switched to bson.son.SON and re-did the example:
client.test.foo.delete_many({})
client.test.foo.insert_one({
    'name': 'clientA',
    'locations': [
        bson.son.SON([('name', 'a'), ('sales', 0), ('leads', 2)]),
        bson.son.SON([('name', 'b'), ('sales', 5), ('leads', 1)]),
        bson.son.SON([('name', 'c'), ('sales', 3.3), ('leads', 1)])]})
client.test.foo.insert_one({
    'name': 'clientB',
    'locations': [
        bson.son.SON([('name', 'a'), ('sales', 6), ('leads', 1)]),
        bson.son.SON([('name', 'b'), ('sales', 6), ('leads', 3)]),
        bson.son.SON([('name', 'c'), ('sales', 1.3), ('leads', 4)])]})
And now it's sorting by name:
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b99619beabb0f0d86dcedaf'),
'maxItem': {'leads': 1, 'name': 'c', 'sales': 3.3}},
{'_id': ObjectId('5b99619beabb0f0d86dcedb0'),
'maxItem': {'leads': 4, 'name': 'c', 'sales': 1.3}}]
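For completeness, here is a minimal sketch (not from the original post) of how the element with the highest sales could be selected explicitly, so the result no longer depends on BSON key order; it assumes the same test.foo collection as above, and the output field name maxBySales is made up:
pipeline = [
    {'$project': {
        'maxBySales': {  # hypothetical output field name
            '$arrayElemAt': [
                '$locations',
                # index of the element whose sales equals the maximum sales value
                {'$indexOfArray': ['$locations.sales', {'$max': '$locations.sales'}]}
            ]
        }
    }}
]
list(client.test.foo.aggregate(pipeline))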

Related

MongoDB - Best way to delete documents by query based on results of another query

I have a collection that can contain several million documents; for simplicity, let's say they look like this:
{'_id': '1', 'user_id': 1, 'event_type': 'a', 'name': 'x'}
{'_id': '2', 'user_id': 1, 'event_type': 'b', 'name': 'x'}
{'_id': '3', 'user_id': 1, 'event_type': 'c', 'name': 'x'}
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
{'_id': '8', 'user_id': 4, 'event_type': 'a', 'name': 'x'}
{'_id': '9', 'user_id': 4, 'event_type': 'b', 'name': 'x'}
{'_id': '10', 'user_id': 4, 'event_type': 'c', 'name': 'x'}
I want a daily job that runs and deletes all documents for a user_id if that user_id has a doc with event_type 'c'.
So the resulting collection will be
{'_id': '4', 'user_id': 2, 'event_type': 'a', 'name': 'x'}
{'_id': '5', 'user_id': 2, 'event_type': 'b', 'name': 'x'}
{'_id': '6', 'user_id': 3, 'event_type': 'a', 'name': 'x'}
{'_id': '7', 'user_id': 3, 'event_type': 'b', 'name': 'x'}
I did it successfully with the mongo shell like this:
var cur = db.my_collection.find({'event_type': 'c'})
ids = [];
while (cur.hasNext()) {
    ids.push(cur.next()['user_id']);
    if (ids.length == 5) {
        print('deleting for user_ids', ids);
        print(db.my_collection.deleteMany({user_id: {$in: ids}}));
        ids = [];
    }
}
if (ids.length) { db.my_collection.deleteMany({user_id: {$in: ids}}) }
I created a cursor over all docs with event_type 'c', grouped their user_ids into batches of 5, and then deleted all docs with those user_ids.
It works, but it looks very slow, as if each cur.next() only fetches one doc at a time.
I wanted to know if there is a better or more correct way to achieve this. If it were Elasticsearch, I would create a sliced scroll, scan each slice in parallel, and submit parallel deleteByQuery requests with 1000 ids each. Is something like this possible/preferable with Mongo?
Scale-wise, I expect several million docs (~10M) in the collection, ~300K docs that match the query, and ~700K that should be deleted.
It sounds like you can just use deleteMany with the original query:
db.my_collection.deleteMany({
    event_type: 'c'
})
No size limitations on it; it might just take a couple of minutes to run, depending on instance size.
EDIT:
I would personally try to use the distinct function; it's the cleanest and easiest code. distinct does have a 16MB result limit, though, and ~300K unique ids a day (depending on the user_id field size) sounds a bit close to that threshold, or past it.
const userIds = db.my_collection.distinct('user_id', { event_type: 'c'});
db.my_collection.deleteMany({user_id: {$in: userIds}})
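For reference, a rough pymongo equivalent of the same distinct-then-delete idea (the connection and database names here are assumed, not from the post):
from pymongo import MongoClient

coll = MongoClient().test.my_collection  # assumed connection and database names
user_ids = coll.distinct('user_id', {'event_type': 'c'})
coll.delete_many({'user_id': {'$in': user_ids}})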
Assuming you expect scale to increase, or the distinct approach fails your tests, the best way is to use something similar to your approach, just in much larger batches. For example:
const batchSize = 100000;
const count = await db.my_collection.countDocuments({'event_type': 'c'});
let iteration = 0;
while (iteration * batchSize < count) {
    // grab the next batch of docs that still have event_type 'c'
    const batch = await db.my_collection.find({'event_type': 'c'}, { projection: { user_id: 1 } }).limit(batchSize).toArray();
    if (batch.length === 0) {
        break;
    }
    // deleting by user_id also removes the 'c' docs themselves, so the next find() moves on
    await db.my_collection.deleteMany({user_id: {$in: batch.map(v => v.user_id)}});
    iteration++;
}
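And a hedged pymongo sketch of the same batched approach, in case the job is driven from Python like the first example (connection and database names are again assumed, not from the post):
from pymongo import MongoClient

coll = MongoClient().test.my_collection  # assumed connection and database names
batch_size = 100000
while True:
    # next batch of user_ids that still have an event_type 'c' document
    batch = list(coll.find({'event_type': 'c'}, {'user_id': 1}).limit(batch_size))
    if not batch:
        break
    user_ids = list({doc['user_id'] for doc in batch})
    # deleting by user_id also removes the matching 'c' docs,
    # so the next find() naturally moves on to the remaining users
    coll.delete_many({'user_id': {'$in': user_ids}})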

pyspark: calculate row difference with variable length partitions

I have a time series dataset where each row is a reading of a certain property. A number of properties are read at the same time in a batch and are tagged with a common timestamp. The number of properties read can vary from batch to batch (i.e. the batches are of variable length). I need to label each such batch with the time delta from the previous batch. I know how to do this for fixed-size batches, but can't figure out how to do it in this particular case.
I need to perform this operation on tens of millions of rows, so the solution has to be highly efficient.
Here's sample data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data_in = [
    {'time': 12, 'batch_id': 1, 'name': 'a'}
    , {'time': 12, 'batch_id': 1, 'name': 'c'}
    , {'time': 12, 'batch_id': 1, 'name': 'e'}
    , {'time': 12, 'batch_id': 1, 'name': 'd'}
    , {'time': 12, 'batch_id': 1, 'name': 'e'}
    , {'time': 14, 'batch_id': 2, 'name': 'a'}
    , {'time': 14, 'batch_id': 2, 'name': 'b'}
    , {'time': 14, 'batch_id': 2, 'name': 'c'}
    , {'time': 19, 'batch_id': 3, 'name': 'b'}
    , {'time': 19, 'batch_id': 3, 'name': 'c'}
    , {'time': 19, 'batch_id': 3, 'name': 'e'}
    , {'time': 19, 'batch_id': 3, 'name': 'f'}
    , {'time': 19, 'batch_id': 3, 'name': 'g'}
]
# creating a dataframe
dataframe_in = spark.createDataFrame(data_in)
# show data frame
display(dataframe_in.select('time', 'batch_id', 'name'))
Here's what I'm looking for in the output:
data_out = [
    {'time': 12, 'time_delta': None, 'batch_id': 1, 'name': 'a'}
    , {'time': 12, 'time_delta': None, 'batch_id': 1, 'name': 'c'}
    , {'time': 12, 'time_delta': None, 'batch_id': 1, 'name': 'e'}
    , {'time': 12, 'time_delta': None, 'batch_id': 1, 'name': 'd'}
    , {'time': 12, 'time_delta': None, 'batch_id': 1, 'name': 'e'}
    , {'time': 14, 'time_delta': 2, 'batch_id': 2, 'name': 'a'}
    , {'time': 14, 'time_delta': 2, 'batch_id': 2, 'name': 'b'}
    , {'time': 14, 'time_delta': 2, 'batch_id': 2, 'name': 'c'}
    , {'time': 19, 'time_delta': 5, 'batch_id': 3, 'name': 'b'}
    , {'time': 19, 'time_delta': 5, 'batch_id': 3, 'name': 'c'}
    , {'time': 19, 'time_delta': 5, 'batch_id': 3, 'name': 'e'}
    , {'time': 19, 'time_delta': 5, 'batch_id': 3, 'name': 'f'}
    , {'time': 19, 'time_delta': 5, 'batch_id': 3, 'name': 'g'}
]
You can use first and lag to get the previous group's value.
from pyspark.sql import functions as F, Window

df = (df.withColumn('lag', F.lag('time').over(Window.orderBy('batch_id')))
        .withColumn('time_delta', F.col('time') -
                    F.first('lag').over(Window.partitionBy('batch_id').orderBy('lag')))
     )
Alternatively, you can use min; this might be slightly more efficient because it lets me omit the orderBy in the second window, but it will return 0 for batch 1 instead of None.
df = (df.withColumn('lag', F.lag('time').over(Window.orderBy('batch_id')))
        .withColumn('time_delta', F.col('time') -
                    F.min('lag').over(Window.partitionBy('batch_id')))
     )
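A hedged alternative sketch (not part of the answer above): compute one row per batch, take the lag of the batch time there, and join the delta back onto the detail rows. It assumes the dataframe_in defined in the question:
from pyspark.sql import functions as F, Window

# one row per batch, with the previous batch's time
batch_times = (dataframe_in
    .select('batch_id', 'time').distinct()
    .withColumn('time_delta',
                F.col('time') - F.lag('time').over(Window.orderBy('batch_id'))))
# attach the per-batch delta back to every reading in the batch
result = dataframe_in.join(batch_times.select('batch_id', 'time_delta'), 'batch_id', 'left')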

Combine array of objects in pyspark

Consider the following DF:
df = spark.createDataFrame(
    [
        Row(
            x='a',
            y=[
                {'f1': 1, 'f2': 2},
                {'f1': 3, 'f2': 4}
            ],
            z=[
                {'f3': 1, 'f4': '2'},
                {'f3': 1, 'f4': '4', 'f5': [1, 2, 3]}
            ]
        )
    ]
)
I wish to combine y and z index-wise, so I may get:
[
    Row(x='a', y={'f1': 1, 'f2': 2}, z={'f3': 1, 'f4': 2}),
    Row(x='a', y={'f1': 3, 'f2': 4}, z={'f3': 1, 'f4': 4, 'f5': [1, 2, 3]})
]
How can it be done without converting to an RDD?
This output differs a little from your expectation: the values in the z column are all converted to strings, whether they were originally an int, a string, or a list.
[Row(x='a', y={'f2': 2, 'f1': 1}, z={'f3': '1', 'f4': '2'}), Row(x='a', y={'f2': 4, 'f1': 3}, z={'f3': '1', 'f4': '4', 'f5': '[1, 2, 3]'})]
It was produced by the following code:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import explode, monotonically_increasing_id

df = spark.createDataFrame(
    [Row(x='a',
         y=[{'f1': 1, 'f2': 2}, {'f1': 3, 'f2': 4}],
         z=[{'f3': 1, 'f4': '2'}, {'f3': 1, 'f4': '4', 'f5': [1, 2, 3]}])],
    StructType([StructField('x', StringType(), True),
                StructField('y', ArrayType(MapType(StringType(), IntegerType(), True), True), True),
                StructField('z', ArrayType(MapType(StringType(), StringType(), True), True), True)]))
df1 = df.select('x',explode(df.y).alias("y")).withColumn("id", monotonically_increasing_id())
df2 = df.select(explode(df.z).alias("z")).withColumn("id", monotonically_increasing_id())
df3 = df1.join(df2, "id", "outer").drop("id")
df3.collect()
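If you are on Spark 2.4 or later, a hedged alternative sketch that zips the two arrays index-wise without relying on monotonically_increasing_id as a join key (it reuses the df defined in the question):
from pyspark.sql import functions as F

# arrays_zip pairs up y[i] with z[i]; explode turns each pair into its own row
zipped = df.select('x', F.explode(F.arrays_zip('y', 'z')).alias('pair'))
result = zipped.select('x', F.col('pair.y').alias('y'), F.col('pair.z').alias('z'))
result.collect()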

kafka-streams join produce duplicates

I have two topics:
// photos
{'id': 1, 'user_id': 1, 'url': 'url#1'},
{'id': 2, 'user_id': 2, 'url': 'url#2'},
{'id': 3, 'user_id': 2, 'url': 'url#3'}
// users
{'id': 1, 'name': 'user#1'},
{'id': 2, 'name': 'user#2'},
{'id': 3, 'name': 'user#3'}
I create a mapping of photos keyed by user id:
KStream<Integer, Photo> photo_by_user = ...
photo_by_user.to("photo_by_user")
Then, I try to join two tables:
KTable<Integer, User> users_table = builder.table("users");
KTable<Integer, Photo> photo_by_user_table = builder.table("photo_by_user");
KStream<Integer, Result> results = users_table.join(photo_by_user_table, (a, b) -> Result.from(a, b)).toStream();
results.to("results");
The result looks like:
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
I see that results are duplicated. Why, and how to avoid it?
You might be hitting a known bug. On "flush", a KTable-KTable join might produce some duplicates. Note that those duplicates are, strictly speaking, not incorrect, because the result is an update stream, and updating "A" to "A" does not change the result. It is of course undesirable to get those duplicates. Try disabling caching (for example, by setting cache.max.bytes.buffering to 0) -- without caching, the "flush issues" should not occur.

Django/NetworkX Eliminating Repeated Nodes

I want to use d3js to visualize the connections between the users of my Django website.
I am reusing the code from the force-directed graph example, which requires that each node has two attributes (ID and Name). I have created a node for each user in user_profiles_table and added an edge between already-created nodes based on each row in connections_table. It does not work; networkx creates new nodes when I start working with the connection_table.
nodeindex = 0
for user_profile in UserProfile.objects.all():
    sourcetostring = user_profile.full_name3()
    G.add_node(nodeindex, name=sourcetostring)
    nodeindex = nodeindex + 1
for user_connection in Connection.objects.all():
    target_tostring = user_connection.target()
    source_tostring = user_connection.source()
    G.add_edge(sourcetostring, target_tostring, value=1)
data = json_graph.node_link_data(G)
result:
{'directed': False,
'graph': [],
'links': [{'source': 6, 'target': 7, 'value': 1},
{'source': 7, 'target': 8, 'value': 1},
{'source': 7, 'target': 9, 'value': 1},
{'source': 7, 'target': 10, 'value': 1},
{'source': 7, 'target': 7, 'value': 1}],
'multigraph': False,
'nodes': [{'id': 0, 'name': u'raymondkalonji'},
{'id': 1, 'name': u'raykaeng'},
{'id': 2, 'name': u'raymondkalonji2'},
{'id': 3, 'name': u'tester1cet'},
{'id': 4, 'name': u'tester2cet'},
{'id': 5, 'name': u'tester3cet'},
{'id': u'tester2cet'},
{'id': u'tester3cet'},
{'id': u'tester1cet'},
{'id': u'raykaeng'},
{'id': u'raymondkalonji2'}]}
How can I eliminate the repeated nodes?
You probably get repeated nodes because your user_connection.target() and user_connection.source() functions return the node name, not its id. When you call add_edge, if the endpoints do not exist in the graph, they are created, which explains why you get duplicates.
The following code should work.
for user_profile in UserProfile.objects.all():
    source = user_profile.full_name3()
    G.add_node(source, name=source)
for user_connection in Connection.objects.all():
    target = user_connection.target()
    source = user_connection.source()
    G.add_edge(source, target, value=1)
data = json_graph.node_link_data(G)
Also note that you should dump the data object to json if you want a properly formatted json string. You can do that as follows.
import json
json_string = json.dumps(data)  # get the string representation
with open('somefile.json', 'w') as f:
    json.dump(data, f)  # write to file