PySpark: remove rows which derive from others

I have the following dataframe, which contains all the paths within a tree after going through all nodes. For each jump between nodes a row is created, where "dist" is the distance (number of jumps) so far, "node" is the current node and "path" is the path so far.
dist | node | path
0 | 1 | [1]
1 | 2 | [1,2]
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
At the end I just want to have a dataframe containing the complete paths without the intermediate steps:
dist | node | path
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
I also tried having the path column as a string ("1;2;3") and comparing which row is a substring of another, however I could not find a way to do that.
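For reference, a minimal sketch of that prefix comparison (assuming the dataframe is called df and path is an array column): join each row against all other paths and keep only the rows whose path is not a proper prefix of another path.
from pyspark.sql import functions as F

# represent each path as a ";"-joined string (cast in case path holds integers)
df_str = df.withColumn("path_str", F.concat_ws(";", F.col("path").cast("array<string>")))
others = df_str.select(F.col("path_str").alias("other_path"))
# keep only rows whose path is not a proper prefix of another row's path
complete_paths = df_str.join(
    others,
    F.col("other_path").startswith(F.concat(F.col("path_str"), F.lit(";"))),
    "left_anti",
).drop("path_str")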

I found my old code and created an adapted example for your problem. I used the Spark graph library GraphFrames for this. The path can be determined by a Pregel-like message aggregation loop.
Here is the code.
First, import all modules:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from graphframes import GraphFrame
from pyspark.sql.types import *
from graphframes.lib import *
# shortcut for the aggregate message object from the graphframes.lib
AM=AggregateMessages
# to plot the graph
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
spark = (SparkSession
    .builder
    .appName("PathReduction")
    .getOrCreate()
)
sc = spark.sparkContext
Then create a sample dataset
# create dataframe
raw_data = [
    ("0", "1"),
    ("1", "2"),
    ("1", "5"),
    ("2", "3"),
    ("2", "4"),
    ("a", "b"),
    ("b", "c"),
    ("c", "d")]
schema = ["src", "dst"]
data = spark.createDataFrame(data=raw_data, schema=schema)
data.show()
data.show()
+---+---+
|src|dst|
+---+---+
| 0| 1|
| 1| 2|
| 1| 5|
| 2| 3|
| 2| 4|
| a| b|
| b| c|
| c| d|
+---+---+
For visualisation run
plotData_1 = data.select("src", "dst").rdd.collect()
plotData_2 = np.array(plotData_1)
plotData_3 = []
for row in plotData_2:
    plotData_3.append((row[0], row[1]))
G = nx.DiGraph(directed=True)
G.add_edges_from(plotData_3)
options = {
    'node_color': 'orange',
    'node_size': 500,
    'width': 2,
    'arrowstyle': '-|>',
    'arrowsize': 20,
}
nx.draw(G, arrows=True, with_labels=True, **options)
With this message-aggregation algorithm you find the paths you are looking for. If you set the flag show_steps to True, the results of each step are shown, which helps to understand the algorithm.
# if flag is true, print results within the loop for debugging
show_steps=False
# max iterations of the loop; should be larger than the longest expected path
max_iter=10
# create vertices from the edge data set
vertices=(data.select("src").union(data.select("dst")).distinct().withColumnRenamed('src', 'id'))
edges=data
# create graph to get in and out degrees
gx = GraphFrame(vertices, edges)
# calculate in and out degrees of each node
inDegrees=gx.inDegrees
outDegrees=gx.outDegrees
if(show_steps==True):
    print("in and out degrees")
    inDegrees.show()
    outDegrees.show()
# create initial vertices
init_vertices=(vertices
    # join out degrees on vertices
    .join(outDegrees,on="id",how="left")
    # join in degrees on vertices
    .join(inDegrees,on="id",how="left")
    # label roots, children in the middle and leaves of the path in order to distinguish full paths later on
    .withColumn("nodeType",f.when(f.col("inDegree").isNull(),"root").otherwise(f.when(f.col("outDegree").isNull(),"leaf").otherwise("child")))
    # define the message with all information [array(id) and array(nodeType)] to be sent to the next node
    .withColumn("message",f.array_union(f.array(f.array(f.col("id"))),f.array(f.array(f.col("nodeType")))))
    # remove columns that are not used anymore
    .drop("inDegree","outDegree")
)
if(show_steps==True):
    print("init vertices")
    init_vertices.show()
# update graph object with init vertices
gx = GraphFrame(init_vertices, edges)
# define an empty dataframe to append found paths to
results = spark.createDataFrame(
    sc.emptyRDD(),
    StructType([StructField("paths",ArrayType(StringType()),True)])
)
# start loop for message aggregation; max_iter has to be larger than the longest expected path
for iter_ in range(max_iter):
    if(show_steps==True):
        print("iteration step=" + str(iter_))
        print("##################################################")
    # define the message that should be sent. Here we send the destination node's message column backward to the source node
    msgToSrc = AM.dst["message"]
    agg = gx.aggregateMessages(
        f.collect_set(AM.msg).alias("aggMess"),  # the aggregation collects into an array (attention!! this can be an expensive operation in terms of shuffle)
        sendToSrc=msgToSrc,
        sendToDst=None
    )
    if(show_steps==True):
        print("aggregated message")
        agg.show(truncate=False)
    # stop the loop if no more aggregated messages are collected
    if(len(agg.take(1))==0):
        print("All paths found in " + str(iter_) + " iterations")
        break
    # build the new vertices for the next round. Here we prepare the next message columns; all _column_names are temporary columns for calculation purposes only
    vertices_update=(agg
        # join the initial data to the aggregation in order to have the nodeType of the vertex
        .join(init_vertices,on="id",how="left")
        # explode the nested array with the path and the nodeType
        .withColumn("_explode_to_flatten_array",f.explode(f.col("aggMess")))
        # put the path array into a separate column
        .withColumn("_dataMsg",f.col("_explode_to_flatten_array")[0])
        # put the node type into a separate column
        .withColumn("_typeMsg",f.col("_explode_to_flatten_array")[1][0])
        # decide if a path is complete. A path is complete if the vertex type is root and the message type is leaf
        .withColumn("pathComplete",f.when(((f.col("nodeType")=="root") & (f.col("_typeMsg")=="leaf")),True).otherwise(False))
        # prepend the current vertex id to the path array that is sent forward
        .withColumn("_message",f.array_union(f.array(f.col("id")),f.col("_dataMsg")))
        # merge the path array and the nodeType array into the new message object
        .withColumn("message",f.array_union(f.array(f.col("_message")),f.array(f.array(f.col("_typeMsg")))))
    )
    if(show_steps==True):
        print("new vertices with all temp columns")
        vertices_update.show()
    # add complete paths to the result dataframe
    results=(
        results
        .union(
            vertices_update
            .where(f.col("pathComplete")==True)
            .select(f.col("_message"))
        )
    )
    # cache the vertices for the next iteration and only push forward the two relevant columns in order to reduce data shuffling between spark executors
    cachedNewVertices = AM.getCachedDataFrame(vertices_update.select("id","message"))
    # create the updated graph object for the next iteration
    gx = GraphFrame(cachedNewVertices, gx.edges)
print("##################################################")
print("Collecting result set")
results.show()
It then shows the correct results:
All paths found in 3 iterations
##################################################
Collecting result set
+------------+
| paths|
+------------+
| [0, 1, 5]|
|[0, 1, 2, 3]|
|[0, 1, 2, 4]|
|[a, b, c, d]|
+------------+
To get your final dataframe you can join it back, or take the first and last element of the array into separate columns:
result2=(results
    .withColumn("dist",f.element_at(f.col("paths"), 1))
    .withColumn("node",f.element_at(f.col("paths"), -1))
)
result2.show()
+------------+----+----+
| paths|dist|node|
+------------+----+----+
| [0, 1, 5]| 0| 5|
|[0, 1, 2, 3]| 0| 3|
|[0, 1, 2, 4]| 0| 4|
|[a, b, c, d]| a| d|
+------------+----+----+
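If you want dist to be the number of jumps (as in the question's dataframe) rather than the first array element, a minimal sketch based on the results dataframe above (note the counts are offset by one here because these paths start at the extra root node "0"):
result3=(results
    .withColumn("dist", f.size(f.col("paths")) - 1)
    .withColumn("node", f.element_at(f.col("paths"), -1))
)
result3.show()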
I suppose you could write the same algorithm with the GraphFrames Pregel API.
P.S.: The algorithm in this form might cause problems if the graph has loops or backward-directed edges. I had another algorithm to first clean up loops and cycles.

Related

PySpark incrementally add id based on another column and previous data

Incrementally derive an ID from a name column; on the next load, if new values are added to that name column, assign them new IDs that are not already assigned in the previous data.
Example - first load:
Name
a
b
b
a
Result:
ID | Name
 1 | a
 2 | b
 2 | b
 1 | a
Next load:
Name
a
b
b
a
c
d
c
Result:
ID | Name
 1 | a
 2 | b
 2 | b
 1 | a
 3 | c
 4 | d
 3 | c
As described in the question, I am looking for a solution in PySpark.
You can create an additional dataframe df_map in which you store your IDs between loads. If you need to, you can save and restore this dataframe from disk.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = spark.createDataFrame(
    data=[['a'], ['b'], ['b'], ['a']],
    schema=["name"]
)
df2 = spark.createDataFrame(
    data=[['a'], ['b'], ['b'], ['a'], ['c'], ['d'], ['c'], ['0']],
    schema=["name"]
)
w = Window.orderBy('name')

# create empty map
df_map = spark.createDataFrame([], schema='name string, id int')
df_map.show()

# get additional name->id mappings for df1
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df1.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()

# the map can be saved to disk between runs

# get additional name->id mappings for df2
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df2.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()

# join to get the final dataframe
df2.join(df_map, on='name').show()
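For the "saved to disk between runs" step, a minimal sketch (the parquet path is just a placeholder):
# persist the mapping at the end of a run (placeholder path)
df_map.write.mode('overwrite').parquet('/tmp/df_map')
# and restore it at the start of the next run
df_map = spark.read.parquet('/tmp/df_map')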
Alternatively, you can use a window and dense_rank. The code below sorts the dataframe by the 'name' column and gives each unique name an incremental unique id.
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
window = W.orderBy('name')
(
    df
    .withColumn('id', F.dense_rank().over(window))
).show()
+----+---+
|name| id|
+----+---+
| a| 1|
| a| 1|
| b| 2|
| b| 2|
| c| 3|
| c| 3|
| d| 4|
+----+---+

PySpark Working with Delta tables - For Loop Optimization with Union

I'm currently working in Databricks and have a delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then create the other 26 columns so I can merge the values back into the original delta table. So the input is 28 columns and the output is 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit
from functools import reduce

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning", "true")
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning", "true")

output=spark.sql("select * from delta.`table`").cache()
SeriesAppend=[]

for i in output.collect():
    # small mapping fix
    if i['col1']=='val1':
        var0='a'
    elif i['col1']=='val2':
        var0='b'
    elif i['col1']=='val3':
        var0='c'
    elif i['col1']=='val4':
        var0='d'
    var0=set([var0])
    req_var = set(['a','b','c','d'])
    var_list=list(req_var-var0)
    # subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]
    if len(i['col2'])<500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps=json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1',lit(i['col1']))
        json_df = json_df.withColumn('col2',lit(i['col2']))
        json_df = json_df.withColumn('col3',lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass

Series_output=reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1"),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2"),
        ("c", "fish", "owner3"),
        ("d", "fox", "owner4"),
        ("e", "rat", "owner5"),
    ],
    ["col1", "col2", "col3"])  # add your column names here
I really just need to write the response plus the other column values to a delta table, so dataframes are not necessarily required, but I haven't found a faster way than the above. Right now I can run 5 inputs, which returns 15, in 25.3 seconds without the unionAll. With the union included, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
    [
        ("a", "cat", "owner1", "MI", 48003),  # create your data here, be consistent in the types.
        ("b", "dog", "owner2", "MI", 48003),
        ("c", "fish", "owner3", "MI", 48003),
        ("d", "fox", "owner4", "MI", 48003),
        ("e", "rat", "owner5", "MI", 48003),
    ],
    ["col1", "col2", "col3", "col4", "col5"])  # add your column names here
How can I make this faster in spark?
As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of collecting everything and letting a single machine (the driver) run it all. That is simply the wrong approach and is not scalable.
# This is your main function, pure Python, and you can unit test it in any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'
    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)
    # subscription info
    headers = {header}  # !!! `header` must be available **inside** this function, global won't work
    body = [{
        'text': col2
    }]
    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` must be available **inside** this function, global won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        return (response.col4, response.col5)
    else:
        return None

# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same for the output: I'm using a tuple with only 2 elements, but you can return as many items as you wish.
from pyspark.sql import functions as F
from pyspark.sql import types as T

df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2')).show()
# Output
# +----+----+------+------------------+
# |col1|col2| col3| temp|
# +----+----+------+------------------+
# | a| cat|owner1|[foo_cat, bar_cat]|
# | b| dog|owner2|[foo_dog, bar_dog]|
# | c|fish|owner3| null|
# | d| fox|owner4| null|
# | e| rat|owner5| null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
.withColumn('col4', F.col('temp')[0])
.withColumn('col5', F.col('temp')[1])
.drop('temp')
.show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2| col3| col4| col5|
# +----+----+------+-------+-------+
# | a| cat|owner1|foo_cat|bar_cat|
# | b| dog|owner2|foo_dog|bar_dog|
# | c|fish|owner3| null| null|
# | d| fox|owner4| null| null|
# | e| rat|owner5| null| null|
# +----+----+------+-------+-------+
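A possible variation (not from the answer above) is to give the UDF a struct return type, so the two API values come back as named fields instead of array positions; a minimal sketch assuming the same req function and the hypothetical field names col4/col5:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# hypothetical schema for the two values returned by the API
api_schema = T.StructType([
    T.StructField('col4', T.StringType()),
    T.StructField('col5', T.StringType()),
])
req_udf = F.udf(req, api_schema)

(df
 .withColumn('temp', req_udf('col1', 'col2'))
 .select('*', 'temp.col4', 'temp.col5')  # pull the struct fields out as columns
 .drop('temp')
 .show()
)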

Pyspark - groupby concat string columns by order

I have a dataframe with the following columns - User, Order, Food.
For example:
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
I would like to concatenate all of the foods into a single string, sorted by Order and grouped per User.
If I run the following:
df.groupBy("User").agg(concat_ws(" $ ",collect_list("Food")).alias("Food List"))
I get a single list but the foods are not concatenated in order.
User Food List
B Salad
C Bread
A Eggs $ Water $ Peaches
What is a good way to get the food list concatenated in order?
Try using a window here:
Build the DataFrame
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import mean, pandas_udf, PandasUDFType
from pyspark.sql.types import *
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
df.show()
+----+-----+-------+
|User|Order| Food|
+----+-----+-------+
| A| 1| Eggs|
| B| 1| Salad|
| A| 2|Peaches|
| C| 1| Bread|
| A| 3| Water|
+----+-----+-------+
Create a window and apply a pandas UDF to join the strings:
w = Window.partitionBy('User').orderBy('Order').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

@pandas_udf(StringType(), PandasUDFType.GROUPED_AGG)
def _udf(v):
    return ' $ '.join(v)

df = df.withColumn('Food List', _udf(df['Food']).over(w)).dropDuplicates(['User', 'Food List']).drop(*['Order', 'Food'])
df.show(truncate=False)
+----+----------------------+
|User|Food List |
+----+----------------------+
|B |Salad |
|C |Bread |
|A |Eggs $ Peaches $ Water|
+----+----------------------+
Based on the possible duplicate comment - collect_list by preserving order based on another variable, I was able to come up with a solution.
First, define a sorter function. It takes a list of structs, sorts by Order and then returns the Food items as a single string separated by ' $ '.
# define udf
def sorter(l):
    res = sorted(l, key=lambda x: x.Order)
    return ' $ '.join([item[1] for item in res])

sort_udf = udf(sorter, StringType())
Then create the struct and run the sorter function:
SortedFoodList = (df.groupBy("User")
    .agg(collect_list(struct("Order","Food")).alias("food_list"))
    .withColumn("sorted_foods", sort_udf("food_list"))
    .drop("food_list")
)
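On Spark 2.4+ you could also skip the Python UDF entirely and sort inside the aggregation with array_sort plus transform; a minimal sketch assuming the same df:
from pyspark.sql import functions as F

SortedFoodList2 = (df.groupBy("User")
    # structs sort by their first field, so sorting the collected structs orders by "Order"
    .agg(F.array_sort(F.collect_list(F.struct("Order", "Food"))).alias("food_list"))
    .withColumn("Food List", F.concat_ws(" $ ", F.expr("transform(food_list, x -> x.Food)")))
    .drop("food_list")
)
SortedFoodList2.show(truncate=False)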

pyspark dataframe complex calculation with previous row

I am working with PySpark and trying to figure out how to do a complex calculation involving previous rows. I think there are generally two ways to do such calculations: window functions and mapPartitions. I think my problem is too complex to solve with windows, and I want the result as a separate row, not a column. So I am trying to use mapPartitions, but I am having trouble with its syntax.
For instance, here is a rough draft of the code.
def change_dd(rows):
    prev_rows = []
    prev_rows.append(rows)
    for row in rows:
        new_row=[]
        for entry in row:
            # Testing to figure out syntax, things would get more complex
            new_row.append(entry + prev_rows[0])
        yield new_row

updated_rdd = select.rdd.mapPartitions(change_dd)
However, I can't access the individual items of prev_rows. It seems that prev_rows[0] is an itertools.chain. How do I iterate over prev_rows[0]?
edit
neighbor = sc.broadcast(df_sliced.where(df_sliced.id == neighbor_idx).collect()[0][:-1]).value
current = df_sliced.where(df_sliced.id == i)

def oversample_dt(dataframe):
    for row in dataframe:
        new_row = []
        for entry, neigh in zip(row, neighbor):
            if isinstance(entry, str):
                if scale < 0.5:
                    new_row.append(entry)
                else:
                    new_row.append(neigh)
            else:
                if isinstance(entry, int):
                    new_row.append(int(entry + (neigh - entry) * scale))
                else:
                    new_row.append(entry + (neigh - entry) * scale)
        yield new_row

sttt = time.time()
sample = current.rdd.mapPartitions(oversample_dt).toDF(schema)
In the end I ended up doing it like this for now, but I really don't want to use collect in the first line. If someone knows how to fix this, or can point out any problems with my use of PySpark, please tell me.
edit2
Suppose Alice and its neighbour Alice_2, with scale = 0.4:
+---+-------+--------+
|age| name | height |
+---+-------+--------+
| 10| Alice | 170 |
| 11|Alice_2| 175 |
+---+-------+--------+
Then, I want a row
+----------+---------+-------------+
|       age| name    | height      |
+----------+---------+-------------+
| 10+1*0.4 | Alice_2 | 170 + 5*0.4 |
+----------+---------+-------------+
Why not use dataframes?
Add a column to the dataframe with the previous values using window functions, like this:
from pyspark.sql import SparkSession, functions
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([{'name': 'Alice', 'age': 1}, {'name': 'Alice_2', 'age': 2}])
df.show()
+---+-------+
|age| name|
+---+-------+
| 1| Alice|
| 2|Alice_2|
+---+-------+
window = Window.partitionBy().orderBy('age')
df = df.withColumn("age-1", functions.lag(df.age).over(window))
df.show()
You can use this function for every column
+---+-------+-----+
|age| name|age-1|
+---+-------+-----+
| 1| Alice| null|
| 2|Alice_2| 1|
+---+-------+-----+
And then just do your calculation.
And if you want to use an RDD, just use df.rdd.
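Following that idea, a minimal sketch for the edit2 example (assuming the rows are ordered by age so that lead picks up the neighbour row, and scale = 0.4 as in the question):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
scale = 0.4
df = spark.createDataFrame(
    [(10, 'Alice', 170), (11, 'Alice_2', 175)],
    ['age', 'name', 'height'])

w = Window.partitionBy().orderBy('age')
interpolated = (df
    # pull the neighbour's values onto the current row
    .withColumn('next_age', F.lead('age').over(w))
    .withColumn('next_name', F.lead('name').over(w))
    .withColumn('next_height', F.lead('height').over(w))
    # the last row has no neighbour
    .where(F.col('next_age').isNotNull())
    # interpolate the numeric columns, take the neighbour's name
    .select(
        (F.col('age') + (F.col('next_age') - F.col('age')) * scale).alias('age'),
        F.col('next_name').alias('name'),
        (F.col('height') + (F.col('next_height') - F.col('height')) * scale).alias('height'),
    ))
interpolated.show()
# expected: age 10.4, name Alice_2, height 172.0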

How to randomly select rows from one dataframe using information from another dataframe

I am attempting the following in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to each date. I would like to randomly select a certain number of entries for each dateCountDF.month from another dataframe entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new dataframe. See below for a data example.
I'm not at all sure how to approach this problem from a Spark-SQl or Spark-MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a dataFrame and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
  val count: Int = x(1).toString().toInt
  val result = entitiesDF.take(count)
  return result
})
DataFrames
**dateCountDF**
| Date | Count |
+----------+----------------+
|2016-08-31| 4|
|2015-12-31| 1|
|2016-09-30| 5|
|2016-04-30| 5|
|2015-11-30| 3|
|2016-05-31| 7|
|2016-11-30| 2|
|2016-07-31| 5|
|2016-12-31| 9|
|2014-06-30| 4|
+----------+----------------+
only showing top 10 rows
**entitiesDF**
| ID | FirstDate | LastDate |
+----------+-----------------+----------+
| 296| 2014-09-01|2015-07-31|
| 125| 2015-10-01|2016-12-31|
| 124| 2014-08-01|2015-03-31|
| 447| 2017-02-01|2017-01-01|
| 307| 2015-01-01|2015-04-30|
| 574| 2016-01-01|2017-01-31|
| 613| 2016-04-01|2017-02-01|
| 169| 2009-08-23|2016-11-30|
| 205| 2017-02-01|2017-02-01|
| 433| 2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows
Edit:
For clarification: my inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF, and for each row I want to select a random number of entities from entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate.
To select random rows you can do something like this (note that the snippet below is Python/PySpark rather than Scala):
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    vals = random.sample(range(1, colmax), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want, store them in a dataframe, and then add this data to the other dataframe; for this you can use unionAll.
You can also refer to this answer.
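Another possible direction (a rough PySpark sketch, not from the answer above, and assuming the date-range columns live in entitiesDF and the Date/Count columns in dateCountDF, as in the sample data): do a range join and then sample per date.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# join every date to the entities whose (FirstDate, LastDate] range contains it
joined = dateCountDF.join(
    entitiesDF,
    (entitiesDF.FirstDate < dateCountDF.Date) & (dateCountDF.Date <= entitiesDF.LastDate),
)

# shuffle the entities within each date and keep only Count of them
w = Window.partitionBy('Date').orderBy(F.rand())
randomEntities = (joined
    .withColumn('rn', F.row_number().over(w))
    .where(F.col('rn') <= F.col('Count'))
    .drop('rn'))
randomEntities.show()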