pyspark: calculate row difference with variable length partitions

I have a time series dataset where each row is a reading of a certain property. A number of properties are read at the same time in a batch and tagged with a common timestamp. The number of properties read can vary from batch to batch (i.e. the batches are of variable length). I need to label each batch with the time delta from the previous batch. I know how to do this for fixed-size batches, but can't figure out how to do it in this particular case.
I need to perform this operation on tens of millions of rows, so the solution has to be highly efficient.
Here's sample data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data_in = [
{'time': 12, 'batch_id': 1, 'name': 'a'}
, {'time': 12, 'batch_id': 1, 'name': 'c'}
, {'time': 12, 'batch_id': 1, 'name': 'e'}
, {'time': 12, 'batch_id': 1, 'name': 'd'}
, {'time': 12, 'batch_id': 1, 'name': 'e'}
, {'time': 14, 'batch_id': 2, 'name': 'a'}
, {'time': 14, 'batch_id': 2, 'name': 'b'}
, {'time': 14, 'batch_id': 2, 'name': 'c'}
, {'time': 19, 'batch_id': 3, 'name': 'b'}
, {'time': 19, 'batch_id': 3, 'name': 'c'}
, {'time': 19, 'batch_id': 3, 'name': 'e'}
, {'time': 19, 'batch_id': 3, 'name': 'f'}
, {'time': 19, 'batch_id': 3, 'name': 'g'}
]
# creating a dataframe
dataframe_in = spark.createDataFrame(data_in)
# show data frame
display(dataframe_in.select('time', 'batch_id', 'name'))
Here's what I'm looking for in the output:
data_out = [
{'time': 12, 'time_delta':None, 'batch_id': 1, 'name': 'a'}
, {'time': 12, 'time_delta':None, 'batch_id': 1, 'name': 'c'}
, {'time': 12, 'time_delta':None, 'batch_id': 1, 'name': 'e'}
, {'time': 12, 'time_delta':None, 'batch_id': 1, 'name': 'd'}
, {'time': 12, 'time_delta':None, 'batch_id': 1, 'name': 'e'}
, {'time': 14, 'time_delta':2, 'batch_id': 2, 'name': 'a'}
, {'time': 14, 'time_delta':2, 'batch_id': 2, 'name': 'b'}
, {'time': 14, 'time_delta':2, 'batch_id': 2, 'name': 'c'}
, {'time': 19, 'time_delta':5, 'batch_id': 3, 'name': 'b'}
, {'time': 19, 'time_delta':5, 'batch_id': 3, 'name': 'c'}
, {'time': 19, 'time_delta':5, 'batch_id': 3, 'name': 'e'}
, {'time': 19, 'time_delta':5, 'batch_id': 3, 'name': 'f'}
, {'time': 19, 'time_delta':5, 'batch_id': 3, 'name': 'g'}
]

You can use first and lag to get the previous batch's time.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = (df.withColumn('lag', F.lag('time').over(Window.orderBy('batch_id')))
      .withColumn('time_delta', F.col('time') -
                  F.first('lag').over(Window.partitionBy('batch_id').orderBy('lag')))
     )
Alternatively, you can use min. This might be slightly more efficient because it lets you omit the orderBy, but it will return 0 instead of null for batch 1.
df = (df.withColumn('lag', F.lag('time').over(Window.orderBy('batch_id')))
      .withColumn('time_delta', F.col('time') -
                  F.min('lag').over(Window.partitionBy('batch_id')))
     )
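Both snippets use an un-partitioned window (Window.orderBy('batch_id')), which forces Spark to pull every row into a single partition for that step; at tens of millions of rows this can become the bottleneck. One way around it, as a sketch assuming each batch_id maps to exactly one time value, is to compute the delta on the distinct batches (a much smaller dataframe) and join it back onto dataframe_in from the question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# one row per batch; the single-partition window only sees this small dataframe
batches = dataframe_in.select('batch_id', 'time').distinct()
batch_deltas = batches.withColumn(
    'time_delta',
    F.col('time') - F.lag('time').over(Window.orderBy('batch_id'))
)

# join the per-batch delta back onto the full dataset; broadcast is optional
# and only helps if the number of distinct batches is small enough
result = dataframe_in.join(
    F.broadcast(batch_deltas.select('batch_id', 'time_delta')),
    on='batch_id', how='left'
)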

Related

How to properly sort list by another list in Dart

I have two lists: one is a list of Map items, and the other is a list that defines the order.
I would like to sort the items based on their description attribute, compare them against the order list, and have the matching items inserted at the top.
import 'package:collection/collection.dart';

void main() {
  List<String> order = [
    'top european',
    'top usa',
    'top rest of the world'
  ];
  List<Map> items = [
    {'id': 0, 'id2': 5, 'description': 'Top USA'},
    {'id': 2, 'id2': 2, 'description': 'Top A'},
    {'id': 3, 'id2': 0, 'description': 'Top Z'},
    {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
    {'id': 4, 'id2': 4, 'description': 'Top C'},
    {'id': 5, 'id2': 1, 'description': 'Top D'},
    {'id': 1, 'id2': 3, 'description': 'Top European'},
  ];
  // this works but adds the items at the end
  items.sort((a, b) {
    return order.indexOf(a['description'].toLowerCase()) -
        order.indexOf(b['description'].toLowerCase());
  });
  /// Results: print(items);
  // List<Map> items = [
  //   {'id': 2, 'id2': 2, 'description': 'Top A'},
  //   {'id': 3, 'id2': 0, 'description': 'Top Z'},
  //   {'id': 4, 'id2': 4, 'description': 'Top C'},
  //   {'id': 5, 'id2': 1, 'description': 'Top D'},
  //   {'id': 1, 'id2': 3, 'description': 'Top European'},
  //   {'id': 0, 'id2': 5, 'description': 'Top USA'},
  //   {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
  // ];
}
SOLUTION: I also tried this approach, which is not ideal, but it works.
List<Map> itemsOrder = items
    .where(
        (ele) => order.contains(ele['description'].toString().toLowerCase()))
    .toList();
itemsOrder.sort((a, b) {
  return order.indexOf(a['description'].toLowerCase()) -
      order.indexOf(b['description'].toLowerCase());
});
items.removeWhere(
    (ele) => order.contains(ele['description'].toString().toLowerCase()));
itemsOrder = itemsOrder.reversed.toList();
for (int i = 0; i < itemsOrder.length; i++) {
  items.insert(0, itemsOrder[i]);
}
/// Results: print(items);
// List<Map> items = [
//   {'id': 1, 'id2': 3, 'description': 'Top European'},
//   {'id': 0, 'id2': 5, 'description': 'Top USA'},
//   {'id': 6, 'id2': 6, 'description': 'Top Rest of the world'},
//   {'id': 2, 'id2': 2, 'description': 'Top A'},
//   {'id': 3, 'id2': 0, 'description': 'Top Z'},
//   {'id': 4, 'id2': 4, 'description': 'Top C'},
//   {'id': 5, 'id2': 1, 'description': 'Top D'},
// ];
Ideally, I would like to use sortBy or sortByCompare, but unfortunately I cannot find a proper example or get a grasp of how to use them.
The way I would fix this is to find the index of each description in the order list and, if it cannot be found, use an index beyond the end of the order list to indicate that the item should sort to the bottom.
This would be my solution:
void testIt() {
  final outOfBounds = order.length + 1;
  const description = 'description';
  items.sort(
    (lhs, rhs) {
      final lhsDesc = (lhs[description] as String).toLowerCase();
      final rhsDesc = (rhs[description] as String).toLowerCase();
      final lhsIndex =
          order.contains(lhsDesc) ? order.indexOf(lhsDesc) : outOfBounds;
      final rhsIndex =
          order.contains(rhsDesc) ? order.indexOf(rhsDesc) : outOfBounds;
      return lhsIndex.compareTo(rhsIndex);
    },
  );
}
And the result is:
[{id: 1, id2: 3, description: Top European}, {id: 0, id2: 5, description: Top USA}, {id: 6, id2: 6, description: Top Rest of the world}, {id: 2, id2: 2, description: Top A}, {id: 3, id2: 0, description: Top Z}, {id: 4, id2: 4, description: Top C}, {id: 5, id2: 1, description: Top D}]

Combine array of objects in pyspark

Consider the following DF:
from pyspark.sql import Row

df = spark.createDataFrame(
    [
        Row(
            x='a',
            y=[
                {'f1': 1, 'f2': 2},
                {'f1': 3, 'f2': 4}
            ],
            z=[
                {'f3': 1, 'f4': '2'},
                {'f3': 1, 'f4': '4', 'f5': [1, 2, 3]}
            ]
        )
    ]
)
I wish to combine y and z index-wise, so I may get:
[
Row(x='a', y={'f1': 1, 'f2': 2}, z={'f3': 1, 'f4': 2}),
Row(x='a', y={'f1': 3, 'f2': 4}, z={'f3': 1, 'f4': 4, 'f5': [1,2,3]})
]
How can this be done without converting to an RDD?
Here is the output. It differs slightly from your expectation: every value in the z column is converted to a string, whether it was originally an int, a string, or a list.
[Row(x='a', y={'f2': 2, 'f1': 1}, z={'f3': '1', 'f4': '2'}), Row(x='a', y={'f2': 4, 'f1': 3}, z={'f3': '1', 'f4': '4', 'f5': '[1, 2, 3]'})]
It was produced by the following code:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import explode, monotonically_increasing_id

df = spark.createDataFrame(
    [Row(x='a',
         y=[{'f1': 1, 'f2': 2}, {'f1': 3, 'f2': 4}],
         z=[{'f3': 1, 'f4': '2'}, {'f3': 1, 'f4': '4', 'f5': [1, 2, 3]}])],
    StructType([StructField('x', StringType(), True),
                StructField('y', ArrayType(MapType(StringType(), IntegerType(), True), True), True),
                StructField('z', ArrayType(MapType(StringType(), StringType(), True), True), True)]))

df1 = df.select('x', explode(df.y).alias("y")).withColumn("id", monotonically_increasing_id())
df2 = df.select(explode(df.z).alias("z")).withColumn("id", monotonically_increasing_id())
df3 = df1.join(df2, "id", "outer").drop("id")
df3.collect()
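Note that monotonically_increasing_id only guarantees unique, increasing ids within a single DataFrame; using it to line up two separately exploded DataFrames can break if their partitioning differs. A sketch of an alternative, assuming y and z always have the same length and that x (plus a row id, if x is not unique) identifies a row, is to use posexplode so the array position itself becomes the join key; on Spark 2.4+, arrays_zip can avoid the explode/join entirely.
from pyspark.sql.functions import posexplode

# keep the array position from each explode and join on it
df_y = df.select('x', posexplode(df.y).alias('pos', 'y'))
df_z = df.select('x', posexplode(df.z).alias('pos', 'z'))
combined = df_y.join(df_z, ['x', 'pos']).drop('pos')
combined.collect()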

How does $max work over an array of objects?

Take an example collection with these documents:
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
{'name': 'a', 'sales': 0, 'leads': 2},
{'name': 'b', 'sales': 5, 'leads': 1},
{'name': 'c', 'sales': 3.3, 'leads': 1}]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
{'name': 'a', 'sales': 6, 'leads': 1},
{'name': 'b', 'sales': 6, 'leads': 3},
{'name': 'c', 'sales': 1.3, 'leads': 4}]})
How does $max determine which item in the locations array is maximal?
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b995d72eabb0f0d86dceda5'),
'maxItem': {'leads': 1, 'name': 'b', 'sales': 5}},
{'_id': ObjectId('5b995d72eabb0f0d86dceda6'),
'maxItem': {'leads': 3, 'name': 'b', 'sales': 6}}]
It looks like $max is choosing to compare on sales, but I am not sure why.
I discovered this
https://docs.mongodb.com/manual/reference/bson-type-comparison-order/#objects
which states:
MongoDB’s comparison of BSON objects uses the following order:
1. Recursively compare key-value pairs in the order that they appear within the BSON object.
2. Compare the key field names.
3. If the key field names are equal, compare the field values.
4. If the field values are equal, compare the next key/value pair (return to step 1). An object without further pairs is less than an object with further pairs.
which means that if sales is the first key in the BSON object, then I have my answer. I'm using pymongo, and Python dictionaries aren't ordered, so I switched to bson.son.SON and re-did the example:
client.test.foo.delete_many({})
client.test.foo.insert_one({
'name': 'clientA',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 0), ('leads', 2)]),
bson.son.SON([('name', 'b'), ('sales', 5), ('leads', 1)]),
bson.son.SON([('name', 'c'), ('sales', 3.3), ('leads', 1)])]})
client.test.foo.insert_one({
'name': 'clientB',
'locations': [
bson.son.SON([('name', 'a'), ('sales', 6), ('leads', 1)]),
bson.son.SON([('name', 'b'), ('sales', 6), ('leads', 3)]),
bson.son.SON([('name', 'c'), ('sales', 1.3), ('leads', 4)])]})
And now it's sorting by name:
list(client.test.foo.aggregate([{'$project': {'maxItem': {'$max': '$locations'}}}]))
Returns:
[{'_id': ObjectId('5b99619beabb0f0d86dcedaf'),
'maxItem': {'leads': 1, 'name': 'c', 'sales': 3.3}},
{'_id': ObjectId('5b99619beabb0f0d86dcedb0'),
'maxItem': {'leads': 4, 'name': 'c', 'sales': 1.3}}]
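If what you actually want is the location with the highest sales, regardless of key order in the documents, one option (a sketch using the field names from the question) is to reduce over the array yourself rather than relying on BSON object comparison:
pipeline = [{'$project': {'maxItem': {'$reduce': {
    'input': '$locations',
    # start from the first element, then keep whichever has higher sales
    'initialValue': {'$arrayElemAt': ['$locations', 0]},
    'in': {'$cond': [{'$gt': ['$$this.sales', '$$value.sales']},
                     '$$this', '$$value']}
}}}}]
list(client.test.foo.aggregate(pipeline))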

kafka-streams join produce duplicates

I have two topics:
// photos
{'id': 1, 'user_id': 1, 'url': 'url#1'},
{'id': 2, 'user_id': 2, 'url': 'url#2'},
{'id': 3, 'user_id': 2, 'url': 'url#3'}
// users
{'id': 1, 'name': 'user#1'},
{'id': 2, 'name': 'user#2'},
{'id': 3, 'name': 'user#3'}
I create a mapping of photos keyed by user:
KStream<Integer, Photo> photo_by_user = ...
photo_by_user.to("photo_by_user")
Then, I try to join two tables:
KTable<Integer, User> users_table = builder.table("users");
KTable<Integer, Photo> photo_by_user_table = builder.table("photo_by_user");
KStream<Integer, Result> results = users_table.join(photo_by_user_table, (a, b) -> Result.from(a, b)).toStream();
results.to("results");
The result looks like:
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
{'photo_id': 1, 'user': 1, 'url': 'url#1', 'name': 'user#1'}
{'photo_id': 2, 'user': 2, 'url': 'url#2', 'name': 'user#2'}
{'photo_id': 3, 'user': 3, 'url': 'url#3', 'name': 'user#3'}
I see that the results are duplicated. Why, and how can I avoid it?
You might be hitting a known bug. On "flush", a KTable-KTable join might produce some duplicates. Note that those duplicates are, strictly speaking, not incorrect, because the result is an update-stream, and updating "A" to "A" does not change the result. It is of course undesirable to get those duplicates. Try disabling caching (e.g. by setting cache.max.bytes.buffering to 0 in the streams configuration); without caching, the "flush issues" should not occur.

Django/NetworkX Eliminating Repeated Nodes

I want to use d3js to visualize the connections between the users of my Django website.
I am reusing the code from the force-directed graph example, which requires that each node has two attributes (ID and Name). I created a node for each user in user_profiles_table and added an edge between already-created nodes based on each row in connections_table. It does not work: networkx creates new nodes when I start working with the connection_table.
nodeindex = 0
for user_profile in UserProfile.objects.all():
    sourcetostring = user_profile.full_name3()
    G.add_node(nodeindex, name=sourcetostring)
    nodeindex = nodeindex + 1

for user_connection in Connection.objects.all():
    target_tostring = user_connection.target()
    source_tostring = user_connection.source()
    G.add_edge(sourcetostring, target_tostring, value=1)

data = json_graph.node_link_data(G)
result:
{'directed': False,
'graph': [],
'links': [{'source': 6, 'target': 7, 'value': 1},
{'source': 7, 'target': 8, 'value': 1},
{'source': 7, 'target': 9, 'value': 1},
{'source': 7, 'target': 10, 'value': 1},
{'source': 7, 'target': 7, 'value': 1}],
'multigraph': False,
'nodes': [{'id': 0, 'name': u'raymondkalonji'},
{'id': 1, 'name': u'raykaeng'},
{'id': 2, 'name': u'raymondkalonji2'},
{'id': 3, 'name': u'tester1cet'},
{'id': 4, 'name': u'tester2cet'},
{'id': 5, 'name': u'tester3cet'},
{'id': u'tester2cet'},
{'id': u'tester3cet'},
{'id': u'tester1cet'},
{'id': u'raykaeng'},
{'id': u'raymondkalonji2'}]}
How can I eliminate the repeated nodes?
You probably get repeated nodes because your user_connection.target() and user_connection.source() functions return the node name, not its id. When you call add_edge, if the endpoints do not exist in the graph, they are created, which explains why you get duplicates.
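As a quick standalone illustration of that behaviour (a minimal sketch, independent of the Django models):
import networkx as nx

G = nx.Graph()
G.add_node(0, name='alice')   # node keyed by an integer id
G.add_edge('alice', 'bob')    # endpoints don't exist yet, so networkx creates them
print(list(G.nodes(data=True)))
# [(0, {'name': 'alice'}), ('alice', {}), ('bob', {})]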
The following code should work.
for user_profile in UserProfile.objects.all():
    source = user_profile.full_name3()
    G.add_node(source, name=source)

for user_connection in Connection.objects.all():
    target = user_connection.target()
    source = user_connection.source()
    G.add_edge(source, target, value=1)

data = json_graph.node_link_data(G)
Also note that you should dump the data object to JSON if you want a properly formatted JSON string. You can do that as follows.
import json

json_string = json.dumps(data)           # get the string representation
with open('somefile.json', 'w') as f:    # write to a file
    json.dump(data, f)