Pyspark sort and get first and last

I used the code below to sort based on one column. I am wondering how I can get the first element and last element of the sorted dataframe?
group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count"))

The max and min functions need a group to work with. To circumvent the issue, you can create a dummy column as below, then call max and min on the grouped result to get the maximum and minimum values.
If that's all you need, you don't really need sort here.
from pyspark.sql import functions as F
df = spark.createDataFrame([("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)], ["n", "val"])
df.show()
+---+------+
| n| val|
+---+------+
| a| 0.694|
| b|-2.669|
| a| 0.245|
| a| 0.1|
| b| 0.3|
| c| 0.3|
+---+------+
df = df.groupby('n').count() #.sort(F.desc('count'))
df = df.withColumn('dummy', F.lit(1))
df.show()
+---+-----+-----+
| n|count|dummy|
+---+-----+-----+
| c| 1| 1|
| b| 2| 1|
| a| 3| 1|
+---+-----+-----+
df = df.groupBy('dummy').agg(F.min('count').alias('min'), F.max('count').alias('max')).drop('dummy')
df.show()
+---+---+
|min|max|
+---+---+
| 1| 3|
+---+---+
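If only the global minimum and maximum of count are needed, the dummy column can also be skipped: calling agg on the ungrouped DataFrame aggregates over all rows. And if the goal is the full first and last rows of the sorted result, sorting in each direction and taking first() also works. A minimal sketch, assuming df is the counted frame from the groupby('n').count() step above:
# global min/max of `count` without a dummy column
df.agg(F.min('count').alias('min'), F.max('count').alias('max')).show()
# first and last rows of the sorted DataFrame (two sorts: simple, but not the cheapest)
first_row = df.orderBy(F.asc('count')).first()
last_row = df.orderBy(F.desc('count')).first()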

Related

PySpark: Pandas UDF for scipy statistical transformations

I'm trying to create a column of standardized (z-score) values of a column x on a Spark dataframe, but I am missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from scipy.stats import zscore
@pandas_udf('float')
def zscore_udf(x: pd.Series) -> pd.Series:
    return zscore(x)
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["id", "x"]
data = [("a", 81.0),
        ("b", 36.2),
        ("c", 12.0),
        ("d", 81.0),
        ("e", 36.3),
        ("f", 12.0),
        ("g", 111.7)]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
df = df.withColumn('y', zscore_udf(df.x))
df.show()
Which results in obviously wrong calculations:
+---+-----+----+
| id| x| y|
+---+-----+----+
| a| 81.0|null|
| b| 36.2| 1.0|
| c| 12.0|-1.0|
| d| 81.0| 1.0|
| e| 36.3|-1.0|
| f| 12.0|-1.0|
| g|111.7| 1.0|
+---+-----+----+
Thank you for your help.
How to fix:
Instead of using a UDF, calculate the stddev_pop and the avg of the dataframe and compute the z-score manually. I suggest using a window function over the entire dataframe for the first step, and then simple arithmetic to get the z-score.
See the suggested code:
from pyspark.sql.functions import avg, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
    .select(
        "*",
        avg("x").over(Window.partitionBy()).alias("avg_x"),
        stddev_pop("x").over(Window.partitionBy()).alias("stddev_x"),
    ) \
    .withColumn("manual_z_score", (col("x") - col("avg_x")) / col("stddev_x"))
Why didn't the UDF work?
Spark is used for distributed computation. When you perform operations on a DataFrame, Spark distributes the workload into partitions across the available executors/workers.
pandas_udf is no different. When running a UDF of type pd.Series -> pd.Series, some rows are sent to partition X and some to partition Y; when zscore runs, it calculates the mean and std of the data in its partition and writes the z-score based on that data alone.
I'll use spark_partition_id to "prove" this.
Rows a, b, c were mapped to partition 0 while d, e, f, g went to partition 1. I manually calculated the mean/stddev_pop of both the entire set and the partitioned data and then calculated the z-score: the UDF z-score was equal to the z-score of the partition.
from pyspark.sql.functions import pandas_udf, spark_partition_id, avg, stddev, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
    .select(
        "*",
        zscore_udf(df.x).alias("z_score"),
        spark_partition_id().alias("partition"),
        avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
        stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
    ) \
    .withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))
df2.show()
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| id| x| z_score|partition| avg_partition_x|stddev_partition_x| partition_z_score|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| a| 81.0| 1.327058| 0|43.06666666666666|28.584533502500186| 1.3270579815484989|
| b| 36.2|-0.24022315| 0|43.06666666666666|28.584533502500186|-0.24022314955974558|
| c| 12.0| -1.0868348| 0|43.06666666666666|28.584533502500186| -1.0868348319887526|
| d| 81.0| 0.5366879| 1| 60.25|38.663063768925504| 0.5366879387524718|
| e| 36.3|-0.61945426| 1| 60.25|38.663063768925504| -0.6194542714757446|
| f| 12.0| -1.2479612| 1| 60.25|38.663063768925504| -1.247961110593097|
| g|111.7| 1.3307275| 1| 60.25|38.663063768925504| 1.3307274433163698|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
I also added df.repartition(8) prior to the calculation and managed to reproduce results similar to those in the original question: partitions with 0 stddev --> null z-score, partitions with 2 rows --> (-1, 1) z-scores.
+---+-----+-------+---------+---------------+------------------+-----------------+
| id| x|z_score|partition|avg_partition_x|stddev_partition_x|partition_z_score|
+---+-----+-------+---------+---------------+------------------+-----------------+
| a| 81.0| null| 0| 81.0| 0.0| null|
| d| 81.0| null| 0| 81.0| 0.0| null|
| f| 12.0| null| 1| 12.0| 0.0| null|
| b| 36.2| -1.0| 6| 73.95| 37.75| -1.0|
| g|111.7| 1.0| 6| 73.95| 37.75| 1.0|
| c| 12.0| -1.0| 7| 24.15|12.149999999999999| -1.0|
| e| 36.3| 1.0| 7| 24.15|12.149999999999999| 1.0|
+---+-----+-------+---------+---------------+------------------+-----------------+
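If you specifically want to keep scipy's zscore, a grouped-map pandas UDF can be applied over the whole frame by grouping on a constant column so that all rows land in a single group (which, like the empty window, sends everything to one worker). A sketch, assuming Spark 3.x and the df from the question:
import pandas as pd
from scipy.stats import zscore
from pyspark.sql import functions as F
def add_zscore(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["y"] = zscore(pdf["x"])  # z-score over the whole group, i.e. the whole frame
    return pdf
df_z = (
    df.withColumn("g", F.lit(1))  # constant column -> a single group
      .groupby("g")
      .applyInPandas(add_zscore, schema="id string, x double, g int, y double")
      .drop("g")
)
df_z.show()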

Get first not null value spark scala on a dataframe during aggregation

During the group-by aggregation, I am currently taking the first value, but I need the first non-null value for the visit_id column. Please let me know if there is any approach to do this in Spark Scala.
DF.groupBy("id").agg(lit(first(col("visit_id"))).alias("visit_id")
Thanks in advance.
You can use the ignoreNulls parameter of first:
Example:
val df = Seq((1, Some(2)), (1, None), (2, None), (2, Some(3))).toDF("id", "visit_id")
df.show
+---+--------+
| id|visit_id|
+---+--------+
| 1| 2|
| 1| null|
| 2| null|
| 2| 3|
+---+--------+
df.groupBy("id").agg(first("visit_id", ignoreNulls=true).as("visit_id")).show
+---+--------+
| id|visit_id|
+---+--------+
| 1| 2|
| 2| 3|
+---+--------+
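For reference, the PySpark equivalent uses the ignorenulls argument of functions.first (a sketch, assuming an analogous DataFrame df):
from pyspark.sql import functions as F
df.groupBy("id").agg(F.first("visit_id", ignorenulls=True).alias("visit_id")).show()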

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe, so I used the following method.
val temp_data = spark.createDataFrame(Seq(
  (1, 5),
  (2, 4),
  (3, 7),
  (4, 6)
)).toDF("A", "B")
val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
This method works fine and produces the expected output. However, I want to create the cols variable without specifying the column names explicitly, so I used typedLit as follows.
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
| A| B| sum|
+---+---+----+
| 1| 5|null|
| 2| 4|null|
| 3| 7|null|
| 4| 6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
lit or typedLit is not a replacement for Column. What your code does is create a list of string literals - "A" and "B":
temp_data.select(cols2: _*).show
+---+---+
| A| B|
+---+---+
| A| B|
| A| B|
| A| B|
| A| B|
+---+---+
and asks for their sums - the string literals cannot be added numerically, hence the result is undefined (null).
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce {
  (x, y) => (x + y).as[Int]
}
temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
You are using typedLit, which is not right here, and as the other answer mentioned you don't have to use TypedColumn either. You can simply map over the dataframe's columns to convert them to a List[Column].
Change your cols2 statement to the following and try:
val cols = temp_data.columns.map(f=> col(f))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
You will get the output below.
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
Thanks
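For reference, the PySpark analogue of this column-wise reduction is a plain Python reduce over Column objects built with col (a sketch, assuming a DataFrame temp_data with the same columns):
from functools import reduce
from pyspark.sql.functions import col
cols = [col(c) for c in temp_data.columns]  # Column objects, not string literals
temp_data.withColumn("sum", reduce(lambda a, b: a + b, cols)).show()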

create column in pyspark based on conditions [duplicate]

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
For row in df:
    if row.Rank > 5:
        then replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like this:
from pyspark.sql.functions import *
df\
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other'))\
    .drop(df.Id)\
    .select(col('Id_New').alias('Id'), col('Rank'))\
    .show()
This gives the following output:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting with @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import *
df.withColumn('Id',when(df.Rank <= 5,df.Id).otherwise('other')).show()
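The same condition can also be written as a SQL CASE expression via expr, if that reads more naturally (a sketch on the same df):
from pyspark.sql.functions import expr
df.withColumn('Id', expr("CASE WHEN Rank > 5 THEN 'other' ELSE Id END")).show()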

How to use Sum on groupBy result in Spark DataFrames?

Based on the following dataframe:
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
| 1| A| 10|
| 1| A| 5|
| 2| A| 56|
| 2| B| 13|
+---+-----+----+
I would like to obtain the sum of the column Amnt grouped by ID and Categ.
+---+-----+-----+
| ID|Categ|Count|
+---+-----+-----+
| 1| A| 15 |
| 2| A| 56 |
| 2| B| 13 |
+---+-----+-----+
In SQL I would be doing something like
SELECT ID,
Categ,
SUM (Count)
FROM Table
GROUP BY ID,
Categ;
But how to do this in Scala?
I tried
DF.groupBy($"ID", $"Categ").sum("Count")
But this just changed the Count column name into sum(count) instead of actually giving me the sum of the counts.
Maybe you were summing the wrong column, but your groupBy/sum statement looks syntactically correct to me:
val df = Seq(
  (1, "A", 10),
  (1, "A", 5),
  (2, "A", 56),
  (2, "B", 13)
).toDF("ID", "Categ", "Amnt")
df.groupBy("ID", "Categ").sum("Amnt").show
// +---+-----+---------+
// | ID|Categ|sum(Amnt)|
// +---+-----+---------+
// | 1| A| 15|
// | 2| A| 56|
// | 2| B| 13|
// +---+-----+---------+
EDIT:
To alias the sum(Amnt) column (or, for multiple aggregations), wrap the aggregation expression(s) with agg. For example:
import org.apache.spark.sql.functions.{count, sum}
// Rename `sum(Amnt)` as `Sum`
df.groupBy("ID", "Categ").agg(sum("Amnt").as("Sum"))
// Aggregate `sum(Amnt)` and `count(Categ)`
df.groupBy("ID", "Categ").agg(sum("Amnt"), count("Categ"))