PySpark: aggregate max absolute value but keep the sign

I'm aggregating val while grouping on another column. For example, if the data was:
ID val
A 10
A 100
A -150
A 15
B 10
B 200
B -150
B 15
I'd want to return the below, i.e. the value with the largest absolute value per ID, with its sign kept. I'm not sure how to do this while keeping the sign.
ID max(val)
A -150
B 200

Option 1: using a window function + row_number. We partition by ID and order by abs(val) descending. Then we simply take the first row.
import pyspark.sql.functions as F
from pyspark.sql import Window  # needed for the window specification below
data = [
    ('A', 10),
    ('A', 100),
    ('A', -150),
    ('A', 15),
    ('B', 10),
    ('B', 200),
    ('B', -150),
    ('B', 15)
]
df = spark.createDataFrame(data=data, schema=('ID','val'))
# One window per ID, largest absolute value first
w = Window.partitionBy('ID').orderBy(F.abs('val').desc())
(df
    .withColumn('rn', F.row_number().over(w))
    .filter(F.col('rn') == 1)
    .drop('rn')
).show()
+---+----+
| ID| val|
+---+----+
| A|-150|
| B| 200|
+---+----+
Option 2: A solution which works with agg. We compare the max value to the absolute max value. If they match then take the max, if they don't then take the min. Note that this solution prefers the positive value in case of ties.
df.groupby('ID').agg(
F.when(F.max('val') == F.max(F.abs('val')), F.max('val')).otherwise(F.min('val')).alias('max_val')
).show()
+---+-------+
| ID|max_val|
+---+-------+
| A| -150|
| B| 200|
+---+-------+
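To make the difference between the two options concrete, here is a minimal plain-Python sketch of the same per-group logic. It does not use Spark, and the helper names by_window and by_agg are made up for illustration. It mirrors Option 1 as "sort by absolute value, take the first" and Option 2 as the max/min comparison, and it shows the tie case for Option 2.

```python
from collections import defaultdict

rows = [("A", 10), ("A", 100), ("A", -150), ("A", 15),
        ("B", 10), ("B", 200), ("B", -150), ("B", 15)]

# Collect the values of each ID, like a groupBy would.
groups = defaultdict(list)
for key, val in rows:
    groups[key].append(val)

def by_window(vals):
    # Option 1 idea: order by abs(val) descending and keep the first row.
    return sorted(vals, key=abs, reverse=True)[0]

def by_agg(vals):
    # Option 2 idea: if max equals the max absolute value, the largest
    # magnitude is positive; otherwise it must be the minimum.
    return max(vals) if max(vals) == max(abs(v) for v in vals) else min(vals)

window_result = {k: by_window(v) for k, v in groups.items()}
agg_result = {k: by_agg(v) for k, v in groups.items()}
print(window_result)
print(agg_result)
print(by_agg([-5, 5]))  # tie between -5 and 5: the positive value wins
```

With Option 1, which of two tied rows comes first depends on row order, which Spark does not guarantee; adding a secondary sort key to the window would make that choice deterministic.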

Related

I have a df[rn,rn1,rn2]. When rn is null, I want to generate a random number and assign it to rn1,rn2

The DataFrame consists of the columns [rn,rn1,rn2].
The condition is: if rn is null, generate a random number between 0 and 1000 and assign that value to rn1 and rn2. Any suggestions, please?
I have tried all the options I could think of, but could not figure it out since I'm new to Azure. Please help.
from pyspark.sql.functions import rand, col, when, floor
data = [(None, 200, 1000), (2,300,400), (None, 300,500)]
df = spark.createDataFrame(data).toDF("rn","rn1","rn2")
df.select("*").show()
+----+---+----+
| rn|rn1| rn2|
+----+---+----+
|null|200|1000|
| 2|300| 400|
|null|300| 500|
+----+---+----+
df.select(
    df.rn,
    when(
        df.rn.isNull(),        # condition
        floor(rand() * 1000)   # true value
    ).otherwise(
        df.rn1                 # false value
    ).alias("rn1")
).show()
+----+---+
| rn|rn1|
+----+---+
|null|545|
| 2|300|
|null|494|
+----+---+
Rinse and repeat for rn2.
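Note that the question asks for the same random value in both rn1 and rn2, whereas a separate rand() call per column would draw two different numbers. Below is a minimal plain-Python sketch of the intended per-row rule. It is not Spark code; fill_row is a hypothetical helper, and the fixed seed is only there to make the run repeatable.

```python
import random

rows = [(None, 200, 1000), (2, 300, 400), (None, 300, 500)]

def fill_row(row, rng):
    rn, rn1, rn2 = row
    if rn is None:
        # Draw once per row and reuse the value for both columns.
        r = rng.randrange(1000)  # integer in [0, 1000), like floor(rand()*1000)
        return (rn, r, r)
    # rn is present: leave the row untouched.
    return row

rng = random.Random(42)  # seed chosen arbitrarily for this sketch
filled = [fill_row(row, rng) for row in rows]
for row in filled:
    print(row)
```

In Spark, one way to get the same effect would presumably be to store floor(rand()*1000) in a temporary column first and then reference that column in both when(...) expressions.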

Find top N game for every ID based on total time using spark and scala

I want to find the top N games for every id, based on total time watched. Here is my input dataframe:
InputDF:
id | Game | Time
1 A 10
2 B 100
1 A 100
2 C 105
1 N 103
2 B 102
1 N 90
2 C 110
And this is the output that I am expecting:
OutputDF:
id | Game | Time|
1 N 193
1 A 110
2 C 215
2 B 202
Here is what I have tried, but it is not working as expected:
val windowDF = Window.partitionBy($"id").orderBy($"Time".desc)
InputDF.withColumn("rank", row_number().over(windowDF))
.filter("rank<=10")
Your top-N ranking applies only to individual time rather than total time per game. A groupBy/sum to compute total time followed by a ranking on the total time will do:
val df = Seq(
(1, "A", 10),
(2, "B", 100),
(1, "A", 100),
(2, "C", 105),
(1, "N", 103),
(2, "B", 102),
(1, "N", 90),
(2, "C", 110)
).toDF("id", "game", "time")
import org.apache.spark.sql.expressions.Window
val win = Window.partitionBy($"id").orderBy($"total_time".desc)
df.
groupBy("id", "game").agg(sum("time").as("total_time")).
withColumn("rank", row_number.over(win)).
where($"rank" <= 10).
show
// +---+----+----------+----+
// | id|game|total_time|rank|
// +---+----+----------+----+
// | 1| N| 193| 1|
// | 1| A| 110| 2|
// | 2| C| 215| 1|
// | 2| B| 202| 2|
// +---+----+----------+----+
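The same two-step idea (sum first, then rank within each id) can be sketched in plain Python, outside Spark. The function name top_n_games and the parameter n are illustrative only.

```python
from collections import defaultdict

rows = [(1, "A", 10), (2, "B", 100), (1, "A", 100), (2, "C", 105),
        (1, "N", 103), (2, "B", 102), (1, "N", 90), (2, "C", 110)]

def top_n_games(rows, n):
    # Step 1: total time per (id, game), like groupBy + sum.
    totals = defaultdict(int)
    for id_, game, time in rows:
        totals[(id_, game)] += time

    # Step 2: within each id, sort by total time descending and keep n rows.
    per_id = defaultdict(list)
    for (id_, game), total in totals.items():
        per_id[id_].append((game, total))
    return {id_: sorted(games, key=lambda g: g[1], reverse=True)[:n]
            for id_, games in per_id.items()}

result = top_n_games(rows, 2)
print(result)
```

Ranking the raw rows directly, as in the question, would compare single viewing times instead of the per-game totals computed in step 1.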

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id Value Weights
1 2 4
1 5 2
2 1 4
2 6 2
2 9 4
3 2 4
I need to group by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the number inside the array is repeated as many times as specified in the column 'Weights' (casting to int is necessary, because array_repeat expects this argument to be of int type). After this step, the first value of 2 will be transformed into [2,2,2,2].
Then, explode will create a row for every element in the array. So, the line [2,2,2,2] will be transformed into 4 rows, each containing an integer 2.
Then you can calculate statistics, the results will have weights applied, as your dataframe is now transformed according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
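As a sanity check on the numbers above, the repeat-and-explode idea can be reproduced in plain Python. This sketch assumes integer weights, and it assumes that percentile interpolates linearly between neighbouring sorted values, which is consistent with the 4.25 in the table. The percentile function here is a hand-written helper, not the Spark function.

```python
def percentile(sorted_vals, p):
    # Linear interpolation at position (n - 1) * p in the sorted values.
    pos = (len(sorted_vals) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

data = [(1, 2, 4), (1, 5, 2), (2, 1, 4), (2, 6, 2), (2, 9, 4), (3, 2, 4)]

# Repeat each value `weight` times per id (the array_repeat + explode step).
expanded = {}
for id_, value, weight in data:
    expanded.setdefault(id_, []).extend([value] * weight)

# Per id: (weighted mean, weighted median, lower quartile, upper quartile).
stats = {}
for id_, vals in expanded.items():
    vals.sort()
    stats[id_] = (sum(vals) / len(vals),
                  percentile(vals, 0.5),
                  percentile(vals, 0.25),
                  percentile(vals, 0.75))
print(stats)
```

For the mean alone, expanding the rows is not required: sum(value * weight) / sum(weight) gives the same result and also works with non-integer weights.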

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
val df = List(
("ABC", "1", "a"),
("1", "2", "b"),
("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")
val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
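from functools import reduce  # built in on Python 2, required import on Python 3
from pyspark.sql import functions as func
dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
# sqlc is an existing SQLContext
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])
# Map every column to a 0/1 null flag, then add the flags up across the row
new_df = df.withColumn('null_cnt', reduce(lambda x, y: x + y,
    map(lambda x: func.when(func.col(x).isNull(), 1).otherwise(0),
        df.schema.names)))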
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+
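Both answers use the same pattern: map every column to 0 or 1, then add the results across the row. Here is a minimal plain-Python sketch of that pattern. It does not use Spark, and count_matches is a made-up helper name; it covers both the null count and the count of a specific value.

```python
from functools import reduce

def count_matches(row, target=None):
    # Map each cell to 1 if it matches the target (None means null), else 0,
    # then reduce the flags with addition.
    if target is None:
        flags = [1 if cell is None else 0 for cell in row]
    else:
        flags = [1 if cell == target else 0 for cell in row]
    return reduce(lambda a, b: a + b, flags, 0)

null_rows = [(None, "1", "a"), ("1", "2", "b"), ("2", None, None)]
abc_rows = [("ABC", "1", "a"), ("1", "2", "b"), ("2", "ABC", "ABC")]

null_counts = [count_matches(r) for r in null_rows]
abc_counts = [count_matches(r, "ABC") for r in abc_rows]
print(null_counts)
print(abc_counts)
```

Using an explicit 0/1 flag instead of adding booleans directly sidesteps the boolean-and-int type mismatch quoted in the question.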

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by the id column and, for every id that has both Y and N in val, remove the row where val is "N".
Please help me resolve this issue, as I am a beginner with PySpark.
You can first identify the affected ids with a filter for val=="Y" and then join this dataframe back to the original one. Finally, you can filter for null values (ids without a "Y" row) and for the rows you want to keep (ids with a "Y" row where val is not "N"). PySpark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
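from pyspark.sql.functions import col
df_new = spark.createDataFrame([
    (1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
    (2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
# Rows with val == "Y", renamed so the join does not produce duplicate column names
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
# Keep ids without any "Y" row, and for ids with a "Y" row keep everything except "N"
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()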
This gives the desired result:
+---+---+
| id|val|
+---+---+
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
+---+---+
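The rule itself can be stated compactly: drop a row only if its val is "N" and the same id also has a "Y" row. Here is a plain-Python sketch of that rule, run on the data from the question. It is not Spark code, and drop_n_where_y is a hypothetical name.

```python
def drop_n_where_y(rows):
    # ids that have at least one "Y" row (the role of df_Y in the join).
    ids_with_y = {id_ for id_, val in rows if val == "Y"}
    # Keep everything except "N" rows belonging to those ids.
    return [(id_, val) for id_, val in rows
            if not (val == "N" and id_ in ids_with_y)]

rows = [(1, "Y"), (1, "N"), (2, "a"), (2, "b"), (3, "N")]
kept = drop_n_where_y(rows)
print(kept)
```

Because the ids are collected into a set, an id with several "Y" rows does not duplicate any rows here. With the join-based approach, deduplicating df_Y first would presumably be needed in that case.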