Drop rows in PySpark

How can I drop rows in PySpark based on the row number (row index)?
I am new to PySpark (and to coding). I have tried writing something, but it is not working.

You can't drop specific rows directly, but you can keep only the ones you want by using filter or its alias, where.
Imagine you want "to drop" the rows where the age of a person is lower than 3. You can just keep the opposite rows, like this:
df.filter(df.age >= 3)

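If the DataFrame has a row-number column, you can filter on it directly. For example, build a small DataFrame with a rownumber column and keep only rows 2 through 4:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema1 = StructType([StructField('rownumber', IntegerType(), True), StructField('name', StringType(), True)])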
data1 = [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e')]
df1 = spark.createDataFrame(data1, schema1)
df1.show()
+---------+----+
|rownumber|name|
+---------+----+
|        1|   a|
|        2|   b|
|        3|   c|
|        4|   d|
|        5|   e|
+---------+----+
df1.filter(F.col("rownumber").between(2,4)).show()
+---------+----+
|rownumber|name|
+---------+----+
|        2|   b|
|        3|   c|
|        4|   d|
+---------+----+

Related

Show all pyspark columns after group and agg

I wish to group by one column and find the max of another column, and then show all the columns for the rows that match. However, my code only shows two columns rather than all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
    (2,2,'0-2'),
    (2,23,'22-24')],
    ['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
    (4,6,'4-6'),
    (5,7,'6-8')],
    ['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
Output:
+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+
Expected output:
+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
    .withColumn('max_val', F.row_number().over(w) == 1)
    .where("max_val == True")
    .drop("max_val")
    .show())
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  5|  7|  6-8|
|  2| 23|22-24|
|  4|  6|  4-6|
+---+---+-----+
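Explanation
Window functions are useful when we want to attach a new column to the existing set of columns without collapsing the rows.
In this case, I tell the window to partition by column a (partitionBy("a")) and to sort column b in descending order (F.desc("b")). This makes the first row in each partition the one with the maximum value of b.
Then we use F.row_number() to keep only the rows whose row number equals 1, that is, the rows holding the max values.
Finally, we drop the helper column, since it is no longer needed after filtering the DataFrame.
Note that if several rows tie for the maximum, row_number() keeps only one of them, whereas the join in Method 2 keeps every tied row.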
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  2| 23|22-24|
|  5|  7|  6-8|
|  4|  6|  4-6|
+---+---+-----+

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by adding up, row by row, the values of a set of columns in a DataFrame. I used the following approach:
val temp_data = spark.createDataFrame(Seq(
(1, 5),
(2, 4),
(3, 7),
(4, 6)
)).toDF("A", "B")
val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
|  A|  B|sum|
+---+---+---+
|  1|  5|  6|
|  2|  4|  6|
|  3|  7| 10|
|  4|  6| 10|
+---+---+---+
This method works fine and produces the expected output. However, I want to create the cols variable without specifying the column names explicitly, so I used typedLit as follows:
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
|  A|  B| sum|
+---+---+----+
|  1|  5|null|
|  2|  4|null|
|  3|  7|null|
|  4|  6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
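lit or typedLit is not a replacement for col. Both return a Column, but one that holds a literal value rather than a reference to a DataFrame column. What your code does is create a list of string literals, "A" and "B":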
temp_data.select(cols2: _*).show
+---+---+
|  A|  B|
+---+---+
|  A|  B|
|  A|  B|
|  A|  B|
|  A|  B|
+---+---+
and asks for their sum. The strings "A" and "B" cannot be interpreted as numbers, hence the NULL result.
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce{
(x, y) => (x + y).as[Int]
}
temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
typedLit is not the right tool here, and, as the other answer mentioned, you don't need TypedColumn either. You can simply map over the DataFrame's column names with col to get a collection of Column objects.
Change your cols2 statement to the following and try again:
val cols2 = temp_data.columns.map(f => col(f))
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
You will get the following output:
+---+---+---+
|  A|  B|sum|
+---+---+---+
|  1|  5|  6|
|  2|  4|  6|
|  3|  7| 10|
|  4|  6| 10|
+---+---+---+

Create column in PySpark based on conditions

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+
For each row, I'm looking to replace the value in the Id column with "other" if the Rank column is greater than 5.
If I use pseudocode to explain:
for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise, like this:
from pyspark.sql.functions import col, when
df\
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other'))\
    .drop(df.Id)\
    .select(col('Id_New').alias('Id'), col('Rank'))\
    .show()
This gives the following output:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
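Starting from @Pushkr's solution, couldn't you just use the following? Reusing the name 'Id' in withColumn replaces the existing column, so no drop or select is needed.
from pyspark.sql.functions import when
df.withColumn('Id', when(df.Rank <= 5, df.Id).otherwise('other')).show()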

Different aggregate operations on different columns pyspark

I am trying to apply different aggregation functions to different columns in a PySpark DataFrame. Following some suggestions on Stack Overflow, I tried this:
the_columns1 = ["product1","product2"]
the_columns2 = ["customer1","customer2"]
exprs = [mean(col(d)) for d in the_columns1, count(col(c)) for c in the_columns2]
followed by
df.groupby(*group).agg(*exprs)
where "group" holds columns not present in either the_columns1 or the_columns2. This does not work. How can I apply different aggregation functions to different columns?
You are very close already. Instead of putting both comprehensions inside one list, build two lists and concatenate them so that you end up with a flat list of expressions:
exprs = [mean(col(d)) for d in the_columns1] + [count(col(c)) for c in the_columns2]
Here is a demo:
import pyspark.sql.functions as F
df.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1|  2|  1|
|  1|  2|  2|  2|
|  2|  3|  3|  3|
|  2|  4|  3|  4|
+---+---+---+---+
cols = ['b']
cols2 = ['c', 'd']
exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]
df.groupBy('a').agg(*exprs).show()
+---+------+--------+--------+
|  a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
|  1|   1.5|       2|       2|
|  2|   3.5|       2|       2|
+---+------+--------+--------+
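As an alternative sketch, agg also accepts a dict that maps a column name to the name of an aggregate function, so the specification can be assembled in plain Python. The names mean_cols, count_cols and agg_spec are illustrative, and the final Spark call is left as a comment because it assumes the df from the demo above:

```python
mean_cols = ['b']        # columns to average (names taken from the demo above)
count_cols = ['c', 'd']  # columns to count

# Build one {column_name: aggregate_function_name} mapping.
agg_spec = {
    **{c: 'mean' for c in mean_cols},
    **{c: 'count' for c in count_cols},
}
print(agg_spec)  # {'b': 'mean', 'c': 'count', 'd': 'count'}

# With a live DataFrame named df (an assumption here), this would be passed as:
#     df.groupBy('a').agg(agg_spec).show()
```

Because dict keys are unique, this form allows only one aggregation per column; the flat list of expressions is the more flexible option when a column needs several aggregates.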

How to delete columns in a DataFrame

df2000.drop('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec').show()
This shows the DataFrame without the dropped columns.
df2000.show()
But when I run show on its own afterwards to check the table, the dropped columns are still there.
drop does not modify the DataFrame in place. It returns a new DataFrame with the specified columns removed, so you have to assign the result to a variable in order to reference it later, as shown below.
>>> df2000 = spark.createDataFrame([('a',10,20,30),('a',10,20,30),('a',10,20,30),('a',10,20,30)],['key', 'jan', 'feb', 'mar'])
>>> cols = ['jan', 'feb', 'mar']
>>> df2000.show()
+---+---+---+---+
|key|jan|feb|mar|
+---+---+---+---+
|  a| 10| 20| 30|
|  a| 10| 20| 30|
|  a| 10| 20| 30|
|  a| 10| 20| 30|
+---+---+---+---+
>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> df2000_dropped_col = reduce(lambda x,y: x.drop(y),cols,df2000)
>>> df2000_dropped_col.show()
+---+
|key|
+---+
|  a|
|  a|
|  a|
|  a|
+---+
Calling show on the new DataFrame now yields the desired result, with all the month columns dropped.