pyspark: take the min or max values across a row? - pyspark

I have the following values:
A  B  C
1  2  3
2  3  6
3  5  4
I want to take the minimum of columns B and C for each row, so that the result is
A  min(B,C)
1  2
2  3
3  4
How do I do this with a PySpark DataFrame?

Whatever you want to check and study, refer to the PySpark API docs; they list all available functions and their documentation. In the example below, I used least for the minimum and greatest for the maximum.
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    [1, 3, 2],
    [2, 3, 6],
    [3, 5, 4]
], ['A', 'B', 'C'])

df.withColumn(
    "max",
    F.greatest(*[F.col(cl) for cl in df.columns[1:]])
).withColumn(
    "min",
    F.least(*[F.col(cl) for cl in df.columns[1:]])
).show()
PySpark API link: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame
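If you only need the row-wise minimum of B and C next to A, as in the question's expected output, a minimal sketch (reusing the df defined above) would be:

from pyspark.sql import functions as F

# Keep A and add only the row-wise minimum of B and C,
# matching the layout asked for in the question.
df.select('A', F.least('B', 'C').alias('min(B,C)')).show()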

Related

How can I transpose a data frame in pyspark?

I couldn't find any function in pyspark for transposing a dataframe.
Cal   Cal2  Cal3
'A'     12    11
'U'     10     9
'O'      5     5
'ER'     6     5
Desired output:
Cal   'A'  'U'  'O'  'ER'
Cal2   12   10    5     6
Cal3   11    9    5     5
In pandas this is very easy: df.T, but I am not sure how to do it in pyspark!
Generation of the sample dataframe
df = spark.createDataFrame(
    [('A', 12, 11), ('U', 10, 9), ('O', 5, 5), ('ER', 6, 5)],
    ['Cal', 'Cal2', 'Cal3'])
Option 1: pyspark.pandas.DataFrame.T
For large dataframes, increasing compute.max_rows might be required:
import pyspark.pandas as ps

ps.get_option("compute.max_rows")  # 1000
ps.set_option("compute.max_rows", 2000)

(df
 .to_pandas_on_spark()
 .set_index('Cal')
 .T
 .reset_index()
 .rename(columns={"index": "Cal"})
 .to_spark()
 .show())
+----+---+---+---+---+
| Cal|  A|  U|  O| ER|
+----+---+---+---+---+
|Cal2| 12| 10|  5|  6|
|Cal3| 11|  9|  5|  5|
+----+---+---+---+---+
Option 2: pyspark, the hard way
import pyspark.sql.functions as F

header_col = 'Cal'
cols_minus_header = df.columns
cols_minus_header.remove(header_col)

df1 = (df
       .groupBy()
       .pivot('Cal')
       .agg(F.first(F.array(cols_minus_header)))
       .withColumn(header_col, F.array(*map(F.lit, cols_minus_header)))
       )
df1.show(truncate=False)
+--------+------+------+-------+------------+
|A       |ER    |O     |U      |Cal         |
+--------+------+------+-------+------------+
|[12, 11]|[6, 5]|[5, 5]|[10, 9]|[Cal2, Cal3]|
+--------+------+------+-------+------------+
df2 = df1.select(F.arrays_zip(*df1.columns).alias('az')).selectExpr('inline(az)')
df2.show(truncate = False)
+---+---+---+---+----+
|A  |ER |O  |U  |Cal |
+---+---+---+---+----+
|12 |6  |5  |10 |Cal2|
|11 |5  |5  |9  |Cal3|
+---+---+---+---+----+
You can unpivot the dataframe and then pivot it based on a different column.
from pyspark.sql import functions as F

data = [('A', 12, 11,),
        ('U', 10, 9,),
        ('O', 5, 5,),
        ('ER', 6, 5,), ]
df = spark.createDataFrame(data, ("Cal", "Cal2", "Cal3",))

key_column = "Cal"
unpivot_cols = [x for x in df.columns if x != key_column]
unpivot_col_expr = " ,".join([f"'{c}', {c}" for c in unpivot_cols])
unpivot_expr = f"stack({len(unpivot_cols)}, {unpivot_col_expr}) as (key,value)"

unpivoted_df = df.select("Cal", F.expr(unpivot_expr))
unpivoted_df.groupBy("key").pivot(key_column).agg(F.first("value")).withColumnRenamed("key", key_column).show()
"""
+----+---+---+---+---+
| Cal|  A| ER|  O|  U|
+----+---+---+---+---+
|Cal3| 11|  5|  5|  9|
|Cal2| 12|  6|  5| 10|
+----+---+---+---+---+
"""
As of Spark 3.2.1, PySpark supports the pandas API as well.
If your dataframe is small, you can make use of it. Note the caveat from the documentation:
This method is based on an expensive operation due to the nature of big data. Internally it needs to generate each row for each value, and then group twice - it is a huge operation. To prevent misusage, this method has the ‘compute.max_rows’ default limit on the input length, and raises a ValueError when it is exceeded.
See the link below for more details:
pyspark.pandas.DataFrame.transpose
In addition to the above, you can also use Koalas (available in Databricks), which is similar to pandas but designed for distributed processing, and is available in PySpark (from 3.0.0 onwards). Something like the below:
kdf = df.to_koalas()
Transpose_kdf = kdf.transpose()
TransposeDF = Transpose_kdf.to_spark()
Koalas documentation - Databricks Koalas
One thing to note is that you need to define partitions in order to use Koalas efficiently, otherwise there can be serious performance issues.
Another option that PySpark provides natively is the combination of stack and pivot. See the documentation for details:
Stack
Pivot
I'll leave these two for you to explore in detail; a rough sketch follows below.
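For orientation only, here is a minimal sketch of the stack-then-pivot route, assuming the same df with columns Cal, Cal2, Cal3 as above (a generic version that builds the stack expression from the column list is shown in the unpivot answer earlier):

from pyspark.sql import functions as F

# Unpivot Cal2 and Cal3 into (key, value) rows with stack, then pivot on Cal
# so the original column names become the rows of the transposed result.
transposed = (df
              .select('Cal', F.expr("stack(2, 'Cal2', Cal2, 'Cal3', Cal3) as (key, value)"))
              .groupBy('key')
              .pivot('Cal')
              .agg(F.first('value'))
              .withColumnRenamed('key', 'Cal'))
transposed.show()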

Apply UDF function to Spark window where the input parameter is a list of all column values in range

I would like to build a moving average over each row in a window, let's say the last 10 rows. BUT if there are fewer than 10 rows available, I would like to insert a 0 in the resulting row -> new column.
So what I am trying to achieve is to use a UDF in an aggregate window with an input parameter of List() (or whatever superclass) that holds the values of all available rows.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List(1,2,3,4) if no more rows are available, and this list is taken as the input parameter for the udf function. The udf function should return a calculated value, or 0 if fewer than 10 rows are available.
The above code terminates with: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition without the need for a UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
  toDF("id", "amount")

val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)

df.
  withColumn("slidingCount", count($"amount").over(winSpec)).
  withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
    otherwise(avg($"amount").over(winSpec))
  ).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// |  1|  10.0|           1|       0.0|
// |  1|  11.0|           2|       0.0|
// |  1|  12.0|           3|       0.0|
// |  1|  13.0|           4|      11.5|
// |  1|  14.0|           4|      12.5|
// |  1|  15.0|           4|      13.5|
// |  1|  16.0|           4|      14.5|
// |  1|  17.0|           4|      15.5|
// |  2|  20.0|           1|       0.0|
// |  2|  21.0|           2|       0.0|
// |  2|  22.0|           3|       0.0|
// |  2|  23.0|           4|      21.5|
// |  2|  24.0|           4|      22.5|
// |  2|  25.0|           4|      23.5|
// |  2|  26.0|           4|      24.5|
// |  2|  27.0|           4|      25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf{ (ls: Seq[Double]) =>
  val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)){
    case ((a, i), next) => (a + (next - a) / i, i + 1)
  }
  if (count <= n) 0.0 else avg  // Expand/Modify this for specific requirement
}

// To apply the UDF:
df.
  withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
  show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large when passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for your specific requirement, consider passing them to the UDF instead.
In general, especially if the data at hand is already in DataFrame format, the built-in DataFrame API will perform and scale better than a user-defined UDF/UDAF, since it takes advantage of Spark's execution-engine optimizations. You might be interested in reading this article re: advantages of the DataFrame/Dataset API over UDF/UDAF.
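For readers working in PySpark rather than Scala, a rough equivalent of the built-in when/otherwise approach above could look like the following sketch (assuming a DataFrame df with id and amount columns, as in the dummy data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

sliding_win = 4
win_spec = Window.partitionBy('id').rowsBetween(-(sliding_win - 1), 0)

# Count the rows actually present in the sliding frame; emit 0.0 until the
# frame is full, then emit the windowed average.
result = (df
          .withColumn('slidingCount', F.count('amount').over(win_spec))
          .withColumn('slidingAvg',
                      F.when(F.col('slidingCount') < sliding_win, 0.0)
                       .otherwise(F.avg('amount').over(win_spec))))
result.show()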

How to standardize a column in PySpark without using StandardScaler?

Seems like this should work, but I'm getting errors:
mu = mean(df[input])
sigma = stddev(df[input])
dft = df.withColumn(output, (df[input]-mu)/sigma)
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is
empty, and '`user`' is not an aggregate function. Wrap
'(((CAST(`sum(response)` AS DOUBLE) - avg(`sum(response)`)) /
stddev_samp(CAST(`sum(response)` AS DOUBLE))) AS `scaled`)' in
windowing function(s) or wrap '`user`' in first() (or first_value) if
you don't care which value you get.;;\nAggregate [user#0,
sum(response)#26L, ((cast(sum(response)#26L as double) -
avg(sum(response)#26L)) / stddev_samp(cast(sum(response)#26L as
double))) AS scaled#46]\n+- AnalysisBarrier\n +- Aggregate
[user#0], [user#0, sum(cast(response#3 as bigint)) AS
sum(response)#26L]\n +- Filter item_id#1 IN
(129,130,131,132,133,134,135,136,137,138)\n +-
Relation[user#0,item_id#1,response_value#2,response#3,trait#4,response_timestamp#5]
csv\n"
I'm not sure what's going on with this error message.
Using collect() is not a good solution in general and you will see that this will not scale as your data grows.
If you don't want to use StandardScaler, a better way is to use a Window to compute the mean and standard deviation.
Borrowing the same example from StandardScaler in Spark not working as expected:
import numpy as np
from pyspark.sql.functions import col, mean, stddev
from pyspark.sql import Window

df = spark.createDataFrame(
    np.array(range(1, 10, 1)).reshape(3, 3).tolist(),
    ["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+
Suppose you wanted to standardize the column int2:
input_col = "int2"
output_col = "int2_scaled"
w = Window.partitionBy()
mu = mean(input_col).over(w)
sigma = stddev(input_col).over(w)
df.withColumn(output_col, (col(input_col) - mu)/(sigma)).show()
#+----+----+----+-----------+
#|int1|int2|int3|int2_scaled|
#+----+----+----+-----------+
#|   1|   2|   3|       -1.0|
#|   7|   8|   9|        1.0|
#|   4|   5|   6|        0.0|
#+----+----+----+-----------+
If you wanted to use the population standard deviation as in the other example, replace pyspark.sql.functions.stddev with pyspark.sql.functions.stddev_pop().
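For instance, only the aggregate function changes in the snippet above (a small sketch, reusing input_col and output_col from before):

from pyspark.sql.functions import col, mean, stddev_pop
from pyspark.sql import Window

w = Window.partitionBy()
mu = mean(input_col).over(w)
# Population standard deviation instead of the sample standard deviation.
sigma_pop = stddev_pop(input_col).over(w)

df.withColumn(output_col, (col(input_col) - mu) / sigma_pop).show()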
Fortunately, I was able to find code that works:
summary = df.select([mean(input).alias('mu'), stddev(input).alias('sigma')]) \
            .collect().pop()
dft = df.withColumn(output, (df[input] - summary.mu) / summary.sigma)

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category  Color   Number  Letter
1         Red     4       A
1         Yellow  Null    B
3         Green   8       C
2         Blue    Null    A
1         Green   9       A
3         Green   8       B
3         Yellow  Null    C
2         Blue    9       B
3         Blue    8       B
1         Blue    Null    Null
1         Red     7       C
2         Green   Null    C
1         Yellow  7       Null
3         Red     Null    B
Now we want to group by Category, then Color, and then find the size of the grouping, the count of non-null Number values, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean, since this is a string).
So the output would ideally be:
Category  Color   CountNumber(Non-Nulls)  Size  MeanNumber  ModeNumber  ModeCountNumber  CountLetter(Non-Nulls)  ModeLetter  ModeCountLetter
1         Red     2                       2     5.5         4 (or 7)
1         Yellow  1                       2     7           7
1         Green   1                       1     9           9
1         Blue    1                       1     -           -
2         Blue    1                       2     9           9           etc
2         Green   -                       1     -           -
3         Green   2                       2     8           8
3         Yellow  -                       1     -           -
3         Blue    1                       1     8           8
3         Red     -                       1     -           -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.
As far as I know, there's no simple way to compute the mode - you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straightforward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base")
  .join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
    $"base.Category" === $"max.Category" and
    $"base.Color" === $"max.Color" and
    $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  .agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
  .where($"ModeNumber".isNotNull)

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
  count("Color") as "Size",         // counting a key column -> includes nulls
  count("Number") as "CountNumber", // does not include nulls
  mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// |       3|Yellow|   1|          0|      null|      null|           null|
// |       1| Green|   1|          1|       9.0|         9|              1|
// |       1|   Red|   2|          2|       5.5|         7|              1|
// |       2| Green|   1|          0|      null|      null|           null|
// |       3|  Blue|   1|          1|       8.0|         8|              1|
// |       1|Yellow|   2|          1|       7.0|         7|              1|
// |       2|  Blue|   2|          1|       9.0|         9|              1|
// |       3| Green|   2|          2|       8.0|         8|              2|
// |       1|  Blue|   1|          0|      null|      null|           null|
// |       3|   Red|   1|          0|      null|      null|           null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.
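For readers doing this in PySpark, here is a rough sketch of the same count-the-occurrences-then-keep-the-max idea, using a row_number window instead of the extra join (column names as in the example; this is an illustration, not the original Scala answer translated line by line):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count occurrences of each Number per (Category, Color), then keep the most
# frequent value per group; ties are broken arbitrarily, as in the join version.
counts = (df.where(F.col('Number').isNotNull())
            .groupBy('Category', 'Color', 'Number').count())

w = Window.partitionBy('Category', 'Color').orderBy(F.desc('count'))
mode_numbers = (counts
                .withColumn('rn', F.row_number().over(w))
                .where(F.col('rn') == 1)
                .select('Category', 'Color',
                        F.col('Number').alias('ModeNumber'),
                        F.col('count').alias('ModeCountNumber')))

# The Letter statistics would repeat the same pattern on the 'Letter' column,
# and both results would be left-joined onto the groupBy('Category', 'Color') aggregates.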

Pyspark Dataframe Apply function to two columns

Say I have two PySpark DataFrames df1 and df2.
df1 = 'a'
       1
       2
       5
df2 = 'b'
       3
       6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) for all y in df2['b'] (note: we can assume that there is only one y that achieves the minimum distance), and the result would be
'a'  'b'
1    3
2    3
5    6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def dist(x, y):
    return abs(x - y)

udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)

udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.createDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way of constructing the output DataFrame efficient? (My way would be to create a distance matrix for all the a and b values first, and then find the minimum one.)
What's wrong with the last line of my code, and how do I fix it?
Starting with your second question - you can only apply a udf to an existing dataframe. I think you were thinking of something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
|  a|  b|distance|
+---+---+--------+
|  1|  3|       2|
|  1|  6|       5|
|  2|  3|       1|
|  2|  6|       4|
|  5|  3|       2|
|  5|  6|       1|
+---+---+--------+
But there is a more efficient way to compute this distance, by using the built-in abs:
>>> from pyspark.sql.functions import abs, min
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find matching numbers by calculating:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
|  a|  b|
+---+---+
|  5|  6|
|  1|  3|
|  2|  3|
+---+---+
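One practical note on the cross join used above: if df2 is small, as in this example, you can hint Spark to broadcast it so the join does not shuffle the larger side. A sketch of the same pipeline with that hint:

from pyspark.sql import functions as F

# Broadcasting the small side keeps the cartesian join local to each executor.
distances = df1.join(F.broadcast(df2)).withColumn('distance', F.abs(df1.a - df2.b))
min_distances = distances.groupBy('a').agg(F.min('distance').alias('distance'))
distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()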