Using pandas udf without looping in pyspark - pyspark

So suppose I have a big spark dataframe .I dont know how many columns.
(the solution has to be in pyspark using pandas udf. Not a different approach)
I want to perform an action on all columns. So it's ok to loop inside on all columns
But I dont want to loop through rows. I want it to act on the column at once.
I didnt find on the internet how this could be done.
Suppose I have this datafrme
A B C
5 3 2
1 7 0
Now I want to send to pandas udf to get sum of each row.
Sum
10
8
Number of columns not known.
I can do it inside the udf by looping row at a time. But I dont want. I want it to act on all rows without looping. And I allow looping through columns if needed.
One option I tried is combining all colmns to array column
ARR
[5,3,2]
[1,7,0]
But even here it doesnt work for me without looping.
I send this column to the udf and then inside I need to loop through its rows and sum each value of the list-row.
It would be nice if I could seperate each column as a one and act on the whole column at once
How do I act on the column at once? Without looping through the rows?
If I loop through the rows I guess it's no better than a regular python udf

I wouldnt go to pandas udfs, resort to udfs it cant be done in pyspark. Anyway code for both below
df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True,format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
#sf.show()
columns = ["id","ArrDelay","DepDelay"]
data = [("a", 81.0,3),
("b", 36.2,5),
("c", 12.0,5),
("d", 81.0,5),
("e", 36.3,5),
("f", 12.0,5),
("g", 111.7,5)]
sf = spark.createDataFrame(data=data,schema=columns)
sf.show()
# Use aggregate function
new = (sf.withColumn('sums', array(*[x for x in ['ArrDelay','DepDelay'] ]))#Create an array of values per row on desired columns
.withColumn('sums', expr("aggregate(sums,cast(0 as double), (c,i)-> c+i)"))# USE aggregate to sum
).show()
#use pandas udf
sch= sf.withColumn('v', lit(90.087654623)).schema
def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
for pdf in iterator:
yield pdf.assign(v=pdf.sum(1))
sf.mapInPandas(sum_s, schema=sch).show()

here's a simple way to do it
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
from functools import reduce
df = spark.createDataFrame(
[
(5,3,2),
(1,7,0),
],
["A", "B", "C"],
)
cols = df.columns
calculate_sum = reduce(lambda a, x: a+x, map(col, cols))
df = (
df
.withColumn(
"sum",calculate_sum
)
)
df.show()
output:
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 5| 3| 2| 10|
| 1| 7| 0| 8|
+---+---+---+---+

Related

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using #udf or #pandas_udf

I do try to compute .dot product between 2 columns of a give dataframe,
SparseVectors has this ability in spark already so I try to execute this in an easy & scalable way without converting to RDDs or to
DenseVectors but i'm stuck, spent past 3 days to try find out of an
approach and does fail, doesn't return computation for passed 2 vector
columns from dataframe and looking for guidance on this matter,
please, because something I'm missing here and not sure what is root cause ...
For separate vectors and rdd vectors works this approach but does fail
to work when passing dataframe column vectors, to replicate the flow
and issues please see below, ideally would be this computation to happen in parallel since real work data is with billions or more rows (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row
df = spark.createDataFrame(
[
[["a","b","c"], SparseVector(4527, {0:0.6363067860791387, 1:1.0888040725098247, 31:4.371858972705023}),SparseVector(4527, {0:0.6363067860791387, 1:2.0888040725098247, 31:4.371858972705023})],
[["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
], ["word", "i", "j"])
# # daframe content
df.show()
+---------+--------------------+--------------------+
| word| i| j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
#udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
if v1 is not None and v2 is not None:
return float(v1.dot(v2))
# # calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))
# # output after udf
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j| dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| null|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| null|
+---------+--------------------+--------------------+----------+
Rewriting udf as lambda does work on spark 2.4.5. Posting in case
anyone is interested in this approach for PySpark dataframes:
# # rewrite udf as lambda function:
sim_cos = F.udf(lambda x,y : float(x.dot(y)), FloatType())
# # executing udf on dataframe
df = df.withColumn("similarity", sim_cos(col("i"),col("j")))
# # end result
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column which values are taken from other columns from my dataframe, based on processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that i cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column", when(col("special_column")=="one", col("one_processed"
).otherwise(concat_ws("_", col("special_column"), lit("processed"))
The easiest way in your case would be just a simple when/oterwise like:
>>> df = spark.createDataFrame([(1, 2, "one"), (1,2,"two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to get a column value by name, as execution plan would depend on the data.

spark scala cartesian product of each element in a column

I have a dataframe which is like :
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
Desired output is that:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and i think that converting df to rdd and then flatMap with cartesian product are ideal for the problem. However i could not combine them together.
Thanks,
It looks like you are trying to do combination rather than cartesian. Please check my understanding.
This is in PySpark but the only python thing is the UDF, the rest is just DataFrame operations.
process is
Create dataframe
define UDF to get all pairs of combinations ignoring order
use UDF to convert array into array of pairs of structs, one for each element of the combination
explode the results to get rows of pair of structs
select each struct and original column 1 into desired result columns
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame([
("a", ["p1", "p2", "p3"]),
("b", ["p1", "p4"])
],
["col1", "col2"]
)
# define and register udf that takes an array and returns an array of struct of two strings
#udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
return combinations(x, 2)
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+

Filter Pyspark Dataframe with udf on entire row

Is there a way to select the entire row as a column to input into a Pyspark filter udf?
I have a complex filtering function "my_filter" that I want to apply to the entire DataFrame:
my_filter_udf = udf(lambda r: my_filter(r), BooleanType())
new_df = df.filter(my_filter_udf(col("*"))
But
col("*")
throws an error because that's not a valid operation.
I know that I can convert the dataframe to an RDD and then use the RDD's filter method, but I do NOT want to convert it to an RDD and then back into a dataframe. My DataFrame has complex nested types, so the schema inference fails when I try to convert the RDD into a dataframe again.
You should write all columns staticly. For example:
from pyspark.sql import functions as F
# create sample df
df = sc.parallelize([
(1, 'b'),
(1, 'c'),
]).toDF(["id", "category"])
#simple filter function
#F.udf(returnType=BooleanType())
def my_filter(col1, col2):
return (col1>0) & (col2=="b")
df.filter(my_filter('id', 'category')).show()
Results:
+---+--------+
| id|category|
+---+--------+
| 1| b|
+---+--------+
If you have so many columns and you are sure to order of columns:
cols = df.columns
df.filter(my_filter(*cols)).show()
Yields the same output.

spark (Scala) dataframe filtering (FIR)

Let say I have a dataframe ( stored in scala val as df) which contains the data from a csv:
time,temperature
0,65
1,67
2,62
3,59
which I have no problem reading this from file as a spark dataframe in scala language.
I would like to add a filtered column (by filter I meant signal processing moving average filtering), (say I want to do (T[n]+T[n-1])/2.0):
time,temperature,temperatureAvg
0,65,(65+0)/2.0
1,67,(67+65)/2.0
2,62,(62+67)/2.0
3,59,(59+62)/2.0
(Actually, say for the first row, I want 32.5 instead of (65+0)/2.0. I wrote it to clarify the expected 2-time-step filtering operation output)
So how to achieve this? I am not familiar with spark dataframe operation which combine rows iteratively along column...
Spark 3.1+
Replace
$"time".cast("timestamp")
with
import org.apache.spark.sql.functions.timestamp_seconds
timestamp_seconds($"time")
Spark 2.0+
In Spark 2.0 and later it is possible to use window function as a input for groupBy. It allows you to specify windowDuration, slideDuration and startTime (offset). It works only with TimestampType column but it is not that hard to find a workaround for that. In your case it will require some additional steps to correct for boundaries but general solution can expressed as shown below:
import org.apache.spark.sql.functions.{window, avg}
df
.withColumn("ts", $"time".cast("timestamp"))
.groupBy(window($"ts", windowDuration="2 seconds", slideDuration="1 second"))
.avg("temperature")
Spark < 2.0
If there is a natural way to partition your data you can use window functions as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.mean
val w = Window.partitionBy($"id").orderBy($"time").rowsBetween(-1, 0)
val df = sc.parallelize(Seq(
(1L, 0, 65), (1L, 1, 67), (1L, 2, 62), (1L, 3, 59)
)).toDF("id", "time", "temperature")
df.select($"*", mean($"temperature").over(w).alias("temperatureAvg")).show
// +---+----+-----------+--------------+
// | id|time|temperature|temperatureAvg|
// +---+----+-----------+--------------+
// | 1| 0| 65| 65.0|
// | 1| 1| 67| 66.0|
// | 1| 2| 62| 64.5|
// | 1| 3| 59| 60.5|
// +---+----+-----------+--------------+
You can create windows with arbitrary weights using lead / lag functions:
lit(0.6) * $"temperature" +
lit(0.3) * lag($"temperature", 1) +
lit(0.2) * lag($"temperature", 2)
It is still possible without partitionBy clause but will be extremely inefficient. If this is the case you won't be able to use DataFrames. Instead you can use sliding over RDD (see for example Operate on neighbor elements in RDD in Spark). There is also spark-timeseries package you may find useful.