Say I have two PySpark DataFrames df1 and df2.
df1= 'a'
1
2
5
df2= 'b'
3
6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abx(x-y)) for all y in df2['b'](note: can assume that there is only one y that can achieve the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def dict(x,y):
return abs(x-y)
udf_dict = udf(dict, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way to construct the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then find the min one)
What's wrong with the last line of my code and how to fix it?
Starting with your second question - you can apply udf only to existing dataframe, I think you were thinking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
| a| b|distance|
+---+---+--------+
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
+---+---+--------+
But there is a more efficient way to apply this distance, by using internal abs:
>>> from pyspark.sql.functions import abs
>>> df1.join(df2).withColumn('distance', abs(df1.a -df2.b))
Then you can find matching numbers by calculating:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a -df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| 5| 6|
| 1| 3|
| 2| 3|
+---+---+
Related
I would like to build a moving average on each row in a window. Let's say -10 rows. BUT if there are less than 10 rows available I would like to insert a 0 in the resulting row -> new column.
So what I would try to achieve is using a UDF in an aggregate window with input paramter List() (or whatever superclass) which has the values of all rows available.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List( 1,2,3,4) if no more rows are available and take this as input paramter for the udf function. udf function should return a calculated value or 0 if less than 10 rows available.
the above code terminates: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition without the need of UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
toDF("id", "amount")
val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)
df.
withColumn("slidingCount", count($"amount").over(winSpec)).
withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
otherwise(avg($"amount").over(winSpec))
).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// | 1| 10.0| 1| 0.0|
// | 1| 11.0| 2| 0.0|
// | 1| 12.0| 3| 0.0|
// | 1| 13.0| 4| 11.5|
// | 1| 14.0| 4| 12.5|
// | 1| 15.0| 4| 13.5|
// | 1| 16.0| 4| 14.5|
// | 1| 17.0| 4| 15.5|
// | 2| 20.0| 1| 0.0|
// | 2| 21.0| 2| 0.0|
// | 2| 22.0| 3| 0.0|
// | 2| 23.0| 4| 21.5|
// | 2| 24.0| 4| 22.5|
// | 2| 25.0| 4| 23.5|
// | 2| 26.0| 4| 24.5|
// | 2| 27.0| 4| 25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf{ (ls: Seq[Double]) =>
val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)){
case ((a, i), next) => (a + (next-a)/i, i + 1)
}
if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}
// To apply the UDF:
df.
withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large to be passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for what's needed for your specific requirement, consider passing them to the UDF instead.
In general, especially if the data at hand is already in DataFrame format, it'd perform and scale better by using built-in DataFrame API to take advantage of Spark's execution engine optimization than using user-defined UDF/UDAF. You might be interested in reading this article re: advantages of DataFrame/Dataset API over UDF/UDAF.
Seems like this should work, but I'm getting errors:
mu = mean(df[input])
sigma = stddev(df[input])
dft = df.withColumn(output, (df[input]-mu)/sigma)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is
empty, and '`user`' is not an aggregate function. Wrap
'(((CAST(`sum(response)` AS DOUBLE) - avg(`sum(response)`)) /
stddev_samp(CAST(`sum(response)` AS DOUBLE))) AS `scaled`)' in
windowing function(s) or wrap '`user`' in first() (or first_value) if
you don't care which value you get.;;\nAggregate [user#0,
sum(response)#26L, ((cast(sum(response)#26L as double) -
avg(sum(response)#26L)) / stddev_samp(cast(sum(response)#26L as
double))) AS scaled#46]\n+- AnalysisBarrier\n +- Aggregate
[user#0], [user#0, sum(cast(response#3 as bigint)) AS
sum(response)#26L]\n +- Filter item_id#1 IN
(129,130,131,132,133,134,135,136,137,138)\n +-
Relation[user#0,item_id#1,response_value#2,response#3,trait#4,response_timestamp#5]
csv\n"
I'm not sure what's going on with this error message.
Using collect() is not a good solution in general and you will see that this will not scale as your data grows.
If you don't want to use StandardScaler, a better way is to use a Window to compute the mean and standard deviation.
Borrowing the same example from StandardScaler in Spark not working as expected:
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql import Window
df = spark.createDataFrame(
np.array(range(1,10,1)).reshape(3,3).tolist(),
["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#| 1| 2| 3|
#| 4| 5| 6|
#| 7| 8| 9|
#+----+----+----+
Suppose you wanted to standardize the column int2:
input_col = "int2"
output_col = "int2_scaled"
w = Window.partitionBy()
mu = mean(input_col).over(w)
sigma = stddev(input_col).over(w)
df.withColumn(output_col, (col(input_col) - mu)/(sigma)).show()
#+----+----+----+-----------+
#|int1|int2|int3|int2_scaled|
#+----+----+----+-----------+
#| 1| 2| 3| -1.0|
#| 7| 8| 9| 1.0|
#| 4| 5| 6| 0.0|
#+----+----+----+-----------+
If you wanted to use the population standard deviation as in the other example, replace pyspark.sql.functions.stddev with pyspark.sql.functions.stddev_pop().
Fortunately, I was able to find code that works:
summary = df.select([mean(input).alias('mu'), stddev(input).alias('sigma')])\
.collect().pop()
dft = df.withColumn(output, (df[input]-summary.mu)/summary.sigma)
Scala 2.12 and Spark 2.2.1 here. I have the following code:
myDf.show(5)
myDf.withColumn("rank", myDf("rank") * 10)
myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
When I run this, in the logs I see:
+---------+-----------+----+
|fizz|buzz|rizzrankrid|rank|
+---------+-----------+----+
| 2| 5| 1440370637| 128|
| 2| 5| 2114144780|1352|
| 2| 8| 199559784|3233|
| 2| 5| 1522258372| 895|
| 2| 9| 918480276| 882|
+---------+-----------+----+
And now:
+---------+-----------+-----+
|fizz|buzz|rizzrankrid| rank|
+---------+-----------+-----+
| 2| 5| 1440370637| 1280|
| 2| 5| 2114144780|13520|
| 2| 8| 199559784|32330|
| 2| 5| 1522258372| 8950|
| 2| 9| 918480276| 8820|
+---------+-----------+-----+
So, interesting:
The first withColumn works, transforming each row's rank value by multiplying itself by 10
However the second withColumn fails, which is just adding the current date/time to all rows as a new lastRanOn column
What do I need to do to get the lastRanOn column addition working?
Your example is probably too simple, because modifying rank should also not work.
withColumn does not update DataFrame, it's create a new DataFrame.
So you must do:
// if myDf is a var
myDf.show(5)
myDf = myDf.withColumn("rank", myDf("rank") * 10)
myDf = myDf.withColumn("lastRanOn", current_date())
println("And now:")
myDf.show(5)
or for example:
myDf.withColumn("rank", myDf("rank") * 10).withColumn("lastRanOn", current_date()).show(5)
Only then you will have new column added - after reassigning new DataFrame reference
I have a dataframe where i need to first apply dataframe and then get weighted average as shown in the output calculation below. What is an efficient way in pyspark to do that?
data = sc.parallelize([
[111,3,0.4],
[111,4,0.3],
[222,2,0.2],
[222,3,0.2],
[222,4,0.5]]
).toDF(['id', 'val','weight'])
data.show()
+---+---+------+
| id|val|weight|
+---+---+------+
|111| 3| 0.4|
|111| 4| 0.3|
|222| 2| 0.2|
|222| 3| 0.2|
|222| 4| 0.5|
+---+---+------+
Output:
id weigthed_val
111 (3*0.4 + 4*0.3)/(0.4 + 0.3)
222 (2*0.2 + 3*0.2+4*0.5)/(0.2+0.2+0.5)
You can multiply columns weight and val, then aggregate:
import pyspark.sql.functions as F
data.groupBy("id").agg((F.sum(data.val * data.weight)/F.sum(data.weight)).alias("weighted_val")).show()
+---+------------------+
| id| weighted_val|
+---+------------------+
|222|3.3333333333333335|
|111|3.4285714285714293|
+---+------------------+
I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired endresult would be:
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For true sliding window (like sliding average) you can use rowsBetween or rangeBetween clauses of the window definition but it is not really required here. Nevertheless example usage could be something like this:
val w2 = Window.partitionBy("userId")
.orderBy(asc("timestamp"))
.rowsBetween(-1, 0)
avg($"foo").over(w2)