Currently I'm working with a dataframe and need to calculate the number of days (as an integer) between two dates stored as timestamps.
I've opted for this solution:
from pyspark.sql.functions import lit, when, col, datediff
df1 = df1.withColumn("LD", datediff("MD", "TD"))
But when I then try to calculate a sum from a list of columns I get the error "Column is not iterable", which makes it impossible for me to sum rows based on column names:
from pyspark.sql import functions as F

col_list = ["a", "b", "c"]
df2 = df1.withColumn("My_Sum", sum([F.col(c) for c in col_list]))
How can I deal with this so that I can calculate the difference between the dates and then sum rows across a given list of column names?
datediff has nothing to do with summing a column. The pyspark SQL sum function takes a single column and calculates the sum over all rows of that column.
Here are a couple of ways to get the sums of a list of columns using a list comprehension.
Single-row output with the sum of each column:
from pyspark.sql import functions as func

data_sdf. \
    select(*[func.sum(c).alias(c + '_sum') for c in col_list]). \
    show()
# +-----+-----+-----+
# |a_sum|b_sum|c_sum|
# +-----+-----+-----+
# | 1337| 3778| 6270|
# +-----+-----+-----+
The sum over all rows of each column, attached to every row (using a window over the whole DataFrame):
from pyspark.sql.window import Window as wd
data_sdf. \
    select('*',
           *[func.sum(c).over(wd.partitionBy()).alias(c + '_sum') for c in col_list]
           ). \
    show(5)
# +---+---+---+-----+-----+-----+
# | a| b| c|a_sum|b_sum|c_sum|
# +---+---+---+-----+-----+-----+
# | 45| 58|125| 1337| 3778| 6270|
# | 9| 99|143| 1337| 3778| 6270|
# | 33| 91|146| 1337| 3778| 6270|
# | 21| 85|118| 1337| 3778| 6270|
# | 30| 55|101| 1337| 3778| 6270|
# +---+---+---+-----+-----+-----+
# only showing top 5 rows
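If instead you want a row-wise sum across the listed columns (which is what your original snippet attempts), no aggregate function is needed; you can simply add the Column objects together. A minimal sketch, assuming data_sdf has the columns a, b and c:
from functools import reduce
import operator
from pyspark.sql import functions as F

col_list = ['a', 'b', 'c']

# build a single column expression equivalent to a + b + c
row_sum = reduce(operator.add, [F.col(c) for c in col_list])
data_sdf.withColumn('My_Sum', row_sum).show(5)
Using reduce with operator.add also sidesteps any confusion between Python's built-in sum and pyspark.sql.functions.sum, which accepts only a single column.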
I have the following dataframe, which contains all the paths within a tree after traversing all nodes. For each jump between nodes a row is created, where "dist" is the number of nodes so far, "node" is the current node and "path" is the path so far.
dist | node | path
0 | 1 | [1]
1 | 2 | [1,2]
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
At the end I just want to have a dataframe containing the complete paths without the intermediate steps:
dist | node | path
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
I also tried storing the path column as a string ("1;2;3") and checking which rows are substrings of each other, but I could not find a way to do that.
I dug out my old code and adapted an example for your problem. I used the Spark graph library GraphFrames for this. The paths can be determined by a Pregel-like message aggregation loop.
Here is the code.
First, import all modules:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from graphframes import GraphFrame
from pyspark.sql.types import *
from graphframes.lib import *
# shortcut for the aggregate message object from the graphframes.lib
AM=AggregateMessages
# to plot the graph
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
spark = (SparkSession
.builder
.appName("PathReduction")
.getOrCreate()
)
sc=spark.sparkContext
Then create a sample dataset
# create dataframe
raw_data = [
("0","1"),
("1","2"),
("1","5"),
("2","3"),
("2","4"),
("a","b"),
("b","c"),
("c","d")]
schema = ["src","dst"]
data = spark.createDataFrame(data=raw_data, schema = schema)
data.show()
+---+---+
|src|dst|
+---+---+
| 0| 1|
| 1| 2|
| 1| 5|
| 2| 3|
| 2| 4|
| a| b|
| b| c|
| c| d|
+---+---+
For visualisation run
plotData_1 = data.select("src","dst").rdd.collect()
plotData_2 = np.array(plotData_1)
plotData_3=[]
for row in plotData_2:
    plotData_3.append((row[0], row[1]))
G=nx.DiGraph(directed=True)
G.add_edges_from(plotData_3)
options = {
'node_color': 'orange',
'node_size': 500,
'width': 2,
'arrowstyle': '-|>',
'arrowsize': 20,
}
nx.draw(G, arrows=True, **options,with_labels=True)
With this message aggregation algorithm you find the paths you are looking for. If you set the flag show_steps to True, the result of each step is shown, which helps to understand the algorithm.
# if flag is true, print results within the loop for debugging
show_steps=False
# max iterations of the loop; should be larger than the longest expected path
max_iter=10
# create vertices from edge data set
vertices=(data.select("src").union(data.select("dst")).distinct().withColumnRenamed('src', 'id'))
edges=data
# create graph to get in and out degrees
gx = GraphFrame(vertices, edges)
# calculate in and out degrees of each node
inDegrees=gx.inDegrees
outDegrees=gx.outDegrees
if(show_steps==True):
    print("in and out degrees")
    inDegrees.show()
    outDegrees.show()
# create intial vertices
init_vertices=(vertices
# join out degrees on vertices
.join(outDegrees,on="id",how="left")
# join in degree on vertices
.join(inDegrees,on="id",how="left")
# define root, children in the middle and leaves of the path in order to distinguish full paths later on
.withColumn("nodeType",f.when(f.col("inDegree").isNull(),"root").otherwise(f.when(f.col("outDegree").isNull(),"leaf").otherwise("child")))
# define message with all information [array(id) and array(nodeType)] to be sent to the next node
.withColumn("message",f.array_union(f.array(f.array(f.col("id"))),f.array(f.array(f.col("nodeType")))))
# remove columns that are not used anymore
.drop("inDegree","outDegree")
)
if(show_steps==True):
    print("init vertices")
    init_vertices.show()
# update graph object with init vertices
gx = GraphFrame(init_vertices, edges)
# define empty dataframe to append found paths to
results = spark.createDataFrame(
    sc.emptyRDD(),
    StructType([StructField("paths", ArrayType(StringType()), True)])
)
# start loop for message aggregation. Set a max_iter value that is larger than the longest expected path
for iter_ in range(max_iter):
    if(show_steps==True):
        print("iteration step=" + str(iter_))
        print("##################################################")

    # define the message that should be sent. Here we send a message to the source node,
    # taking the message column from the destination node, i.e. we send it backwards
    msgToSrc = AM.dst["message"]
    agg = gx.aggregateMessages(
        f.collect_set(AM.msg).alias("aggMess"),  # aggregation function is a collect into an array (attention!! this can be an expensive operation in terms of shuffle)
        sendToSrc=msgToSrc,
        sendToDst=None
    )
    if(show_steps==True):
        print("aggregated message")
        agg.show(truncate=False)

    # stop loop if no more agg messages collected
    if(len(agg.take(1))==0):
        print("All paths found in " + str(iter_) + " iterations")
        break

    # get new vertices to send into the next round. Here we have to prepare the next message columns;
    # all _column_names are temporary columns for calculation purposes only
    vertices_update=(agg
        # join initial data to the aggregation in order to have the nodeType of the vertex
        .join(init_vertices,on="id",how="left")
        # explode the nested array with the path and the nodeType
        .withColumn("_explode_to_flatten_array",f.explode(f.col("aggMess")))
        # put the path array into a separate column
        .withColumn("_dataMsg",f.col("_explode_to_flatten_array")[0])
        # put the node type into a separate column
        .withColumn("_typeMsg",f.col("_explode_to_flatten_array")[1][0])
        # decide if a path is complete. A path is complete if the vertex type is root and the message type is leaf
        .withColumn("pathComplete",f.when(((f.col("nodeType")=="root") & (f.col("_typeMsg")=="leaf")),True).otherwise(False))
        # append the current vertex id to the path array that is sent forward
        .withColumn("_message",f.array_union(f.array(f.col("id")),f.col("_dataMsg")))
        # merge together the path array and the nodeType array for the new message object
        .withColumn("message",f.array_union(f.array(f.col("_message")),f.array(f.array(f.col("_typeMsg")))))
    )
    if(show_steps==True):
        print("new vertices with all temp columns")
        vertices_update.show()

    # add complete paths to the result dataframe
    results=(
        results
        .union(
            vertices_update
            .where(f.col("pathComplete")==True)
            .select(f.col("_message"))
        )
    )
    # cache the vertices for the next iteration and only push forward the two relevant columns
    # in order to reduce data shuffling between Spark executors
    cachedNewVertices = AM.getCachedDataFrame(vertices_update.select("id","message"))
    # create new updated graph object for next iteration
    gx = GraphFrame(cachedNewVertices, gx.edges)
print("##################################################")
print("Collecting result set")
results.show()
It then shows the correct results:
All paths found in 3 iterations
##################################################
Collecting result set
+------------+
| paths|
+------------+
| [0, 1, 5]|
|[0, 1, 2, 3]|
|[0, 1, 2, 4]|
|[a, b, c, d]|
+------------+
To get your final dataframe you can either join it back or take the first and last elements of the array into separate columns:
result2=(results
.withColumn("dist",f.element_at(f.col("paths"), 1))
.withColumn("node",f.element_at(f.col("paths"), -1))
)
result2.show()
+------------+----+----+
| paths|dist|node|
+------------+----+----+
| [0, 1, 5]| 0| 5|
|[0, 1, 2, 3]| 0| 3|
|[0, 1, 2, 4]| 0| 4|
|[a, b, c, d]| a| d|
+------------+----+----+
You could presumably write the same algorithm with the GraphFrames Pregel API.
P.S.: The algorithm in this form might cause problems if the graph has loops or backward-directed edges. I had another algorithm to first clean up loops and cycles.
I have a dataframe with the following columns - User, Order, Food.
For example:
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
I would like to concatenate all of the foods into a single string, sorted by Order and grouped per User.
If I run the following:
df.groupBy("User").agg(concat_ws(" $ ",collect_list("Food")).alias("Food List"))
I get a single list but the foods are not concatenated in order.
User Food List
B Salad
C Bread
A Eggs $ Water $ Peaches
What is a good way to get the food list concatenated in order?
Try using a window here:
Build the DataFrame
import pandas as pd
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import mean, pandas_udf, PandasUDFType
from pyspark.sql.types import *
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
df.show()
+----+-----+-------+
|User|Order| Food|
+----+-----+-------+
| A| 1| Eggs|
| B| 1| Salad|
| A| 2|Peaches|
| C| 1| Bread|
| A| 3| Water|
+----+-----+-------+
Create a window and apply a pandas UDF to join the strings:
w = Window.partitionBy('User').orderBy('Order').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
@pandas_udf(StringType(), PandasUDFType.GROUPED_AGG)
def _udf(v):
    return ' $ '.join(v)
df = df.withColumn('Food List', _udf(df['Food']).over(w)).dropDuplicates(['User', 'Food List']).drop(*['Order', 'Food'])
df.show(truncate=False)
+----+----------------------+
|User|Food List |
+----+----------------------+
|B |Salad |
|C |Bread |
|A |Eggs $ Peaches $ Water|
+----+----------------------+
Based on the possible-duplicate comment ("collect_list by preserving order based on another variable"), I was able to come up with a solution.
First define a sorter function. This takes a list of structs, sorts it by Order, and returns the items as a single string separated by ' $ '.
# define udf
from pyspark.sql.functions import udf, collect_list, struct
from pyspark.sql.types import StringType

def sorter(l):
    res = sorted(l, key=lambda x: x.Order)
    return ' $ '.join([item[1] for item in res])

sort_udf = udf(sorter, StringType())
Then create the struct and run the sorter function:
SortedFoodList = (df.groupBy("User")
.agg(collect_list(struct("Order","Food")).alias("food_list"))
.withColumn("sorted_foods",sort_udf("food_list"))
.drop("food_list")
)
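If you would rather avoid a Python UDF altogether, the same sort-and-join can be done with built-in functions in Spark 2.4+. This is only a sketch of that idea, not part of the original answers:
from pyspark.sql import functions as F

no_udf = (df.groupBy('User')
            # collect (Order, Food) structs and sort them; sorting structs orders by the first field, i.e. Order
            .agg(F.sort_array(F.collect_list(F.struct('Order', 'Food'))).alias('food_list'))
            # extract the Food values and join them with ' $ '
            .withColumn('Food List', F.expr("array_join(transform(food_list, x -> x.Food), ' $ ')"))
            .drop('food_list'))
no_udf.show(truncate=False)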
I want to add a conditional column Flag to dataframe A. When the following two conditions are satisfied, set Flag to 1, otherwise 0:
num from dataframe A is in between numStart and numEnd from dataframe B.
If the above condition is satisfied, check whether include is 1.
DataFrame A (it's a very big dataframe, containing millions of rows):
+----+------+-----+------------------------+
|num |food |price|timestamp |
+----+------+-----+------------------------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|
+----+------+-----+------------------------+
DataFrame B (it's a very small DF, containing only 100 rows):
+----------+-----------+-------+
|numStart |numEnd |include|
+----------+-----------+-------+
|0 |200 |1 |
|250 |1050 |0 |
|2000 |3000 |1 |
|10001 |15001 |1 |
+----------+-----------+-------+
Expected output:
+----+------+-----+------------------------+----------+
|num |food |price|timestamp |Flag |
+----+------+-----+------------------------+----------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|0 |
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|1 |
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|1 |
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|0 |
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|0 |
+----+------+-----+------------------------+----------+
You can left-join dfB to dfA based on the first condition you described, then build a Flag column using withColumn and the coalesce function to "default" to 0:
Records for which a match was found use the include value of the matching dfB record.
Records for which there was no match have include=null; per your requirement such records should get Flag=0, so we use coalesce, which in case of null falls back to the default given as the literal lit(0).
Lastly, get rid of the dfB columns, which are of no interest to you:
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession
dfA.join(dfB, $"num".between($"numStart", $"numEnd"), "left")
.withColumn("Flag", coalesce($"include", lit(0)))
.drop(dfB.columns: _*)
.show()
// +----+------+-----+--------------------+----+
// | num| food|price| timestamp|Flag|
// +----+------+-----+--------------------+----+
// |1275|tomato| 1.99|2018-07-21T00:00:...| 0|
// | 145|carrot| 0.45|2018-07-21T00:00:...| 1|
// |2678| apple| 0.99|2018-07-21T01:00:...| 1|
// |6578|banana| 1.29|2018-07-20T01:11:...| 0|
// |1001| taco| 2.59|2018-07-21T01:00:...| 0|
// +----+------+-----+--------------------+----+
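For reference, the same left-join-plus-coalesce pattern written in PySpark could look like this (a sketch, assuming dfA and dfB are the DataFrames shown above):
from pyspark.sql import functions as F

flagged = (dfA.join(dfB, dfA.num.between(dfB.numStart, dfB.numEnd), 'left')
              .withColumn('Flag', F.coalesce(F.col('include'), F.lit(0)))
              .drop('numStart', 'numEnd', 'include'))
flagged.show(truncate=False)
Since dfB only has about 100 rows, wrapping it in F.broadcast(dfB) in the join would avoid shuffling the large dfA.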
Join the two dataframes together on the first condition while keeping all rows in dataframe A (i.e. with a left join, see the code below). After the join, the include column can be renamed to Flag and any null values in it set to 0. The two extra columns, numStart and numEnd, are dropped.
The code can thus be written as follows:
A.join(B, $"num" >= $"numStart" && $"num" <= $"numEnd", "left")
.withColumnRenamed("include", "Flag")
.drop("numStart", "numEnd")
.na.fill(Map("Flag" -> 0))
I have a pyspark dataframe with a range of numerical variables.
For example, my dataframe has a column with values from 1 to 100.
1-10 - group1   <== for column values 1 to 10, the new value should be group1
11-20 - group2
.
.
.
91-100 group10
How can I achieve this using a pyspark dataframe?
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to find the integral part of a number; for example, floor(15.5) is 15. We take the integral part of Var/10 and add 1 to it, because the indexing starts from 1 rather than 0. Finally, we need to prepend "group" to the value. Concatenation can be achieved with the concat() function, but keep in mind that the prepended word "group" is not a column, so we need to wrap it in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
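Note that with floor(Var/10) + 1 a value of exactly 10 lands in group2, 20 in group3, and so on. If the buckets are meant to be the inclusive ranges 1-10, 11-20, ..., 91-100, a small variation (my adjustment, not part of the original answer) shifts the value by one before dividing; it produces the same output for the sample values above:
from pyspark.sql.functions import col, floor, lit, concat

# 1-10 -> group1, 11-20 -> group2, ..., 91-100 -> group10
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))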
I have a DF in which I have bookingDt and arrivalDt columns. I need to find all the dates between these two dates.
Sample code:
from pyspark.sql import Row
from pyspark.sql.functions import datediff

df = spark.sparkContext.parallelize(
    [Row(vyge_id=1000, bookingDt='2018-01-01', arrivalDt='2018-01-05')]).toDF()
diffDaysDF = df.withColumn("diffDays", datediff('arrivalDt', 'bookingDt'))
diffDaysDF.show()
code output:
+----------+----------+-------+--------+
| arrivalDt| bookingDt|vyge_id|diffDays|
+----------+----------+-------+--------+
|2018-01-05|2018-01-01| 1000| 4|
+----------+----------+-------+--------+
What I tried was finding the number of days between the two dates, generating all the dates using the timedelta function, and exploding the result.
dateList = [str(bookingDt + timedelta(i)) for i in range(diffDays)]
Expected output:
Basically, I need to build a DF with a record for each date in between bookingDt and arrivalDt, inclusive.
+----------+----------+-------+----------+
| arrivalDt| bookingDt|vyge_id|txnDt |
+----------+----------+-------+----------+
|2018-01-05|2018-01-01| 1000|2018-01-01|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01| 1000|2018-01-02|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01| 1000|2018-01-03|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01| 1000|2018-01-04|
+----------+----------+-------+----------+
|2018-01-05|2018-01-01| 1000|2018-01-05|
+----------+----------+-------+----------+
For Spark 2.4+, the sequence function can be used to create an array containing all dates between bookingDt and arrivalDt. This array can then be exploded.
from pyspark.sql import functions as F
df = df \
.withColumn('bookingDt', F.col('bookingDt').cast('date')) \
.withColumn('arrivalDt', F.col('arrivalDt').cast('date'))
df.withColumn('txnDt', F.explode(F.expr('sequence(bookingDt, arrivalDt, interval 1 day)')))\
.show()
Output:
+-------+----------+----------+----------+
|vyge_id| bookingDt| arrivalDt| txnDt|
+-------+----------+----------+----------+
| 1000|2018-01-01|2018-01-05|2018-01-01|
| 1000|2018-01-01|2018-01-05|2018-01-02|
| 1000|2018-01-01|2018-01-05|2018-01-03|
| 1000|2018-01-01|2018-01-05|2018-01-04|
| 1000|2018-01-01|2018-01-05|2018-01-05|
+-------+----------+----------+----------+
As long as you're using Spark version 2.1 or higher, you can exploit the fact that we can use column values as arguments when using pyspark.sql.functions.expr():
Create a dummy string of repeating commas with a length equal to diffDays
Split this string on ',' to turn it into an array of size diffDays
Use pyspark.sql.functions.posexplode() to explode this array along with its indices
Finally use pyspark.sql.functions.date_add() to add the index value number of days to the bookingDt
Code:
import pyspark.sql.functions as f
diffDaysDF.withColumn("repeat", f.expr("split(repeat(',', diffDays), ',')"))\
.select("*", f.posexplode("repeat").alias("txnDt", "val"))\
.drop("repeat", "val", "diffDays")\
.withColumn("txnDt", f.expr("date_add(bookingDt, txnDt)"))\
.show()
#+----------+----------+-------+----------+
#| arrivalDt| bookingDt|vyge_id| txnDt|
#+----------+----------+-------+----------+
#|2018-01-05|2018-01-01| 1000|2018-01-01|
#|2018-01-05|2018-01-01| 1000|2018-01-02|
#|2018-01-05|2018-01-01| 1000|2018-01-03|
#|2018-01-05|2018-01-01| 1000|2018-01-04|
#|2018-01-05|2018-01-01| 1000|2018-01-05|
#+----------+----------+-------+----------+
Well, you can do the following.
Create a dataframe with dates only:
dates_df # with all days between first bookingDt and last arrivalDt
and then join the two DataFrames on a between condition:
df.join(dates_df,
        on=dates_df.dates.between(df.bookingDt, df.arrivalDt)) \
  .select(df['*'], dates_df.dates)
It might even work faster than the explode solution; however, you need to figure out the start and end dates for this DataFrame.
A dates_df covering 10 years would have only about 3650 records, not that many to worry about.
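One way to build such a dates_df is to take the minimum bookingDt and the maximum arrivalDt and generate every day in between (a sketch, assuming Spark 2.4+ for the sequence function):
from pyspark.sql import functions as F

# overall date range of the data
bounds = df.select(F.min('bookingDt').alias('start'), F.max('arrivalDt').alias('end'))
# one row per calendar day in that range
dates_df = bounds.select(
    F.explode(F.expr("sequence(to_date(start), to_date(end), interval 1 day)")).alias('dates')
)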
As @vvg suggested:
# I assume the bookingDt date range includes arrivalDt;
# otherwise you have to find the intersection of the unique dates of bookingDt and arrivalDt
dates_df = df.select('bookingDt').distinct()
dates_df = dates_df.withColumnRenamed('bookingDt', 'day_of_listing')

listing_days_df = df.join(dates_df, on=dates_df.day_of_listing.between(df.bookingDt, df.arrivalDt))
Output:
+----------+----------+-------+-------------------+
| arrivalDt| bookingDt|vyge_id|day_of_listing |
+----------+----------+-------+-------------------+
|2018-01-05|2018-01-01| 1000|2018-01-01 |
+----------+----------+-------+-------------------+
|2018-01-05|2018-01-01| 1000|2018-01-02 |
+----------+----------+-------+-------------------+
|2018-01-05|2018-01-01| 1000|2018-01-03 |
+----------+----------+-------+-------------------+
|2018-01-05|2018-01-01| 1000|2018-01-04 |
+----------+----------+-------+-------------------+
|2018-01-05|2018-01-01| 1000|2018-01-05 |
+----------+----------+-------+-------------------+