Pyspark copy values from other columns depending on a specific column

I want to create the New column based on the Column A values using PySpark.
I want to take the Column B value for items where Column A is 1300 or greater when creating the new column,
but I want to keep the Column C value for items where Column A is less than 1300.
I am only a beginner. Thank you for your help.
Column A  Column B  Column C  New
1210      100       200       200
1300      70        50        70
1200      10        50        50
1310      15        300       15
I have tried to filter out items greater than 1300.

You can do so with col and when: when A >= 1300, take column B, otherwise take column C.
from pyspark.sql.functions import when, col
data = [
    [1210, 100, 200],
    [1300, 70, 50],
    [1200, 10, 50],
    [1310, 15, 300]
]
df = spark.createDataFrame(data, ['A', 'B', 'C'])
df.withColumn('New', when(col('A') >= 1300, col('B')).otherwise(col('C'))).show()
+----+---+---+---+
| A| B| C|New|
+----+---+---+---+
|1210|100|200|200|
|1300| 70| 50| 70|
|1200| 10| 50| 50|
|1310| 15|300| 15|
+----+---+---+---+
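If you prefer SQL syntax, the same rule can also be written as a CASE expression inside expr. A small sketch using the same df and column names as above:

from pyspark.sql.functions import expr

# Same condition as the when/otherwise version above, expressed as SQL.
df.withColumn('New', expr("CASE WHEN A >= 1300 THEN B ELSE C END")).show()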

Related

Pyspark agg max abs val but keep sign

I'm aggregating on another column. For example, if the column was
ID val
A 10
A 100
A -150
A 15
B 10
B 200
B -150
B 15
I'd want to return the below (keeping the sign). I'm not sure how to do this while keeping the sign.
ID max(val)
A -150
B 200
Option 1: using a window function + row_number. We partition by ID and order by abs(val) descending. Then we simply take the first row.
import pyspark.sql.functions as F
from pyspark.sql import Window

data = [
    ('A', 10),
    ('A', 100),
    ('A', -150),
    ('A', 15),
    ('B', 10),
    ('B', 200),
    ('B', -150),
    ('B', 15)
]
df = spark.createDataFrame(data=data, schema=('ID', 'val'))
w = Window.partitionBy('ID').orderBy(F.abs('val').desc())

(df
 .withColumn('rn', F.row_number().over(w))
 .filter(F.col('rn') == 1)
 .drop('rn')
).show()
+---+----+
| ID| val|
+---+----+
| A|-150|
| B| 200|
+---+----+
Option 2: A solution which works with agg. We compare the max value to the absolute max value. If they match then take the max, if they don't then take the min. Note that this solution prefers the positive value in case of ties.
df.groupby('ID').agg(
    F.when(F.max('val') == F.max(F.abs('val')), F.max('val')).otherwise(F.min('val')).alias('max_val')
).show()
+---+-------+
| ID|max_val|
+---+-------+
| A| -150|
| B| 200|
+---+-------+
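If you are on Spark 3.3 or newer, pyspark.sql.functions.max_by gives a more direct aggregate: take the val whose abs(val) is largest per group. A sketch on the same df (note that tie behaviour is not guaranteed to prefer the positive value the way Option 2 does):

import pyspark.sql.functions as F

# max_by returns the 'val' associated with the maximum of abs('val') within each group.
df.groupby('ID').agg(F.max_by('val', F.abs('val')).alias('max_val')).show()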

spark aggregation with sorted rows that returns a row's value before a condition is met

I have some data (invoice data). Assuming id ~ date and id is what I'm sorting by:
fid, id, due, overdue
0, 1, 5, 0
0, 3, 5, 5
0, 13, 5, 10
0, 14, 5, 0
1, 5, 5, 0
1, 26, 5, 5
1, 27, 5, 10
1, 38, 5, 0
0. remove all rows under some arbitrary date-id, id = 20
1. group_by fid and sort by id within the group
2. (major) aggregate a new column overdue_id that is the id of the row before the first row in the group that has a nonzero value for overdue
3. (minor) fill a row for every fid even if all rows are filtered out by #0
so the output would be (given default value null)
fid, overdue_id
0, 1
1, null
because for fid = 0, the first id with nonzero overdue is id = 3, and I'd like to output the id of the row before that in id-date time, which is id = 1.
I have group_by('fid').withColumn('overdue_id', ...), and want to use functions like agg, min, when, but am not sure what to do after that, as I am very new to the docs.
You can use the following steps to solve this:
import pyspark.sql.functions as F
from pyspark.sql import *

# added fid=2 for the all-zero overdue condition
fid = [0, 1, 2] * 4
fid.sort()
dateId = [1, 3, 13, 14, 5, 26, 27, 28]
dateId.extend(range(90, 95))
due = [5] * 12
overdue = [0, 5, 10, 0] * 2
overdue.extend([0, 0, 0, 0])
data = list(zip(fid, dateId, due, overdue))
df = spark.createDataFrame(data, schema=["fid", "dateId", "due", "overdue"])

win = Window.partitionBy(df['fid']).orderBy(df['dateId'])
res = df\
    .filter(F.col("dateId") != 20)\
    .withColumn("lag_id", F.lag(F.col("dateId"), 1).over(win))\
    .withColumn("overdue_id", F.when(F.col("overdue") != 0, F.col("lag_id")).otherwise(None))\
    .groupBy("fid")\
    .agg(F.min("overdue_id").alias("min_overdue_id"))
res.show()
+---+--------------+
|fid|min_overdue_id|
+---+--------------+
| 0| 1|
| 1| 5|
| 2| null|
+---+--------------+
You need to use the lag and window functions. Before we begin: why is your example output showing null for fid 1? The first nonzero overdue is for id 26, and the id before that is 5, so shouldn't it be 5? Unless you need something else, you can try this.
import pyspark.sql.functions as F
from pyspark.sql import Window

tst = sqlContext.createDataFrame(
    [(0, 1, 5, 0), (0, 20, 5, 0), (0, 30, 5, 5), (0, 13, 5, 10), (0, 14, 5, 0),
     (1, 5, 5, 0), (1, 26, 5, 5), (1, 27, 5, 10), (1, 38, 5, 0)],
    schema=["fid", "id", "due", "overdue"])
# Filter the data
tst_f = tst.where('id!=20')
# Define the window
w = Window.partitionBy('fid').orderBy('id')
tst_lag = tst_f.withColumn('overdue_id', F.lag('id').over(w))
# Remove rows with 0 overdue
tst_od = tst_lag.where('overdue!=0')
# Find the row before the first nonzero overdue
tst_res = tst_od.groupby('fid').agg(F.first('overdue_id').alias('overdue_id'))
tst_res.show()
+---+----------+
|fid|overdue_id|
+---+----------+
| 0| 1|
| 1| 5|
+---+----------+
If you are wary of using the first function, or just want to be confident about avoiding ghost issues, you can try the below, more performance-expensive option.
# Create a copy to avoid an ambiguous join, and select the minimum id from the nonzero-overdue rows
tst_min = tst_od.withColumn("dummy", F.lit('dummy')).groupby('fid').agg(F.min('id').alias('id_min'))
# Join this with the dataframe to get the results
tst_join = tst_od.join(tst_min, on=tst_od.id == tst_min.id_min, how='right')
tst_join.show()
+---+---+---+-------+----------+---+------+
|fid| id|due|overdue|overdue_id|fid|id_min|
+---+---+---+-------+----------+---+------+
| 1| 26| 5| 5| 5| 1| 26|
| 0| 13| 5| 10| 1| 0| 13|
+---+---+---+-------+----------+---+------+
This way you can see all the information, and you can then filter what you need from this dataframe using the filter() or where() method.
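To also cover the (minor) requirement of producing a row (with null) for every fid, even one whose overdue values are all zero, one option is to left-join the distinct fids back onto the aggregated result. A sketch building on tst_f and tst_res from above (in this sample data both fids have a nonzero overdue, so no null appears here, but a fid like the added fid=2 in the first answer would come out as null):

# Every fid present after the id filter, left-joined to the aggregation result;
# fids with no nonzero overdue keep a null overdue_id.
all_fids = tst_f.select('fid').distinct()
all_fids.join(tst_res, on='fid', how='left').show()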

Spark Scala aggregation group Dataframe

I have an input Dataframe and have to produce an output Dataframe.
On the input Dataframe, I have to group several columns, and if the sum of another column for that group equals some value, then I have to update one column for each member of that group with x.
So I will get several groups and have to update one of their columns with x; for rows that don't fall into any such group, the value in that column must not be changed.
Like:
Job id, job name, department, age, old.
The first 3 columns are grouped; if sum(age) = 100, then old gets x for all rows in the group.
And there will be several groups.
And the output Dataframe will have the same number of rows as the input one.
val dfIn = job id , job name , department , age , old
24 Dev Sales 30 0
24 Dev Sales 40 0
24 Dev Sales 20 0
24 Dev Sales 10 0
24 Dev HR 30 0
24 Dev HR 20 0
24 Dev Retail 50 0
24 Dev Retail 50 0
val dfOut= job id , job name , department , age , old
24 Dev Sales 30 x
24 Dev Sales 40 x
24 Dev Sales 20 x
24 Dev Sales 10 x
24 Dev HR 30 0
24 Dev HR 20 0
24 Dev Retail 50 x
24 Dev Retail 50 x
Just calculate sum_age using a Window function and use when/otherwise to assign X to the old column when sum_age = 100; otherwise keep the same value 0.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // already in scope in spark-shell

val df = Seq(
  (24, "Dev", "Sales", 30, "0"), (24, "Dev", "Sales", 40, "0"),
  (24, "Dev", "Sales", 20, "0"), (24, "Dev", "Sales", 10, "0"),
  (24, "Dev", "HR", 30, "0"), (24, "Dev", "HR", 20, "0"),
  (24, "Dev", "Retail", 50, "0"), (24, "Dev", "Retail", 50, "0")
).toDF("job_id", "job_name", "department", "age", "old")

val w = Window.partitionBy($"job_id", $"job_name", $"department").orderBy($"job_id")

val dfOut = df.withColumn("sum_age", sum(col("age")).over(w))
  .withColumn("old", when($"sum_age" === lit(100), lit("X")).otherwise($"old"))
  .drop($"sum_age")
dfOut.show()
+------+--------+----------+---+---+
|job_id|job_name|department|age|old|
+------+--------+----------+---+---+
| 24| Dev| HR| 30| 0|
| 24| Dev| HR| 20| 0|
| 24| Dev| Retail| 50| X|
| 24| Dev| Retail| 50| X|
| 24| Dev| Sales| 30| X|
| 24| Dev| Sales| 40| X|
| 24| Dev| Sales| 20| X|
| 24| Dev| Sales| 10| X|
+------+--------+----------+---+---+
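If you would rather avoid the window, an equivalent approach is to compute the group sums with groupBy/agg and join them back onto the original rows. A sketch using the same column names and the 100 threshold from the example above (dfOutJoin is just an illustrative name):

import org.apache.spark.sql.functions._

// Total age per group, joined back onto the original rows, then flag groups summing to 100.
val sums = df.groupBy("job_id", "job_name", "department")
  .agg(sum("age").as("sum_age"))

val dfOutJoin = df.join(sums, Seq("job_id", "job_name", "department"))
  .withColumn("old", when($"sum_age" === 100, lit("X")).otherwise($"old"))
  .drop("sum_age")

dfOutJoin.show()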

Spark dataframe data aggregation

I have the below requirement to aggregate data on a Spark dataframe in Scala.
I have a spark dataframe with two columns.
mo_id sales
201601 11.01
201602 12.01
201603 13.01
201604 14.01
201605 15.01
201606 16.01
201607 17.01
201608 18.01
201609 19.01
201610 20.01
201611 21.01
201612 22.01
As shown above, the dataframe has two columns, 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the dataframe which should have the cumulative sum of sales up to the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603 agg_sales will be sum of sales from 201601 to 201603.
For the month 201604 agg_sales will be sum of sales from 201601 to 201604.
and so on.
Can anyone please help me do this?
Versions used: Spark 1.6.2 and Scala 2.10.
You are looking for a cumulative sum which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+---------+
|    id|sales|agg_sales|
+------+-----+---------+
|201601|   10|       10|
|201602|   20|       30|
|201603|   30|       60|
|201604|   40|      100|
|201605|   50|      150|
|201606|   60|      210|
|201607|   70|      280|
|201608|   80|      360|
|201609|   90|      450|
|201610|  100|      550|
|201611|  110|      660|
|201612|  120|      780|
+------+-----+---------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.
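For reference, the default frame that comes with orderBy is "unbounded preceding to current row"; you can spell it out explicitly with rowsBetween. A sketch on the same df (the numeric-bounds form of rowsBetween shown here also exists on older Spark versions; the single-partition warning remains unless you also add a partitionBy, e.g. on a year column):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Explicit running-sum frame: from the start of the ordering up to the current row.
val cumulative = Window.orderBy("id").rowsBetween(Long.MinValue, 0)
df.withColumn("agg_sales", sum($"sales").over(cumulative)).show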

giving duplicate values unique identifiers in spark/scala

I was hoping somebody might know a simple solution to this problem using spark and scala.
I have some network data of animal movements in the following format (currently in a dataframe in spark):
id start end date
12 0 10 20091017
12 10 20 20091201
12 20 0 20091215
12 0 15 20100220
12 15 0 20100320
The id is the id of the animal; start and end are the locations of movements (i.e. the second row is a movement from location id 10 to location id 20). If the start or end is 0, that means the animal is born or has died (i.e. in the first row animal 12 is born, and in row 3 the animal has died).
The problem I have is that the data was collected so that animal IDs were re-used in the database, so after an animal has died its id may re-occur.
What I want to do is apply a unique tag to all movements which are re-used. So you would get a database something like
id start end date
12a 0 10 20091017
12a 10 20 20091201
12a 20 0 20091215
12b 0 15 20100220
12b 15 0 20100320
I've been trying a few different approaches but can't seem to get anything that works. The database is very large (several gigabytes), so I need something that works quite efficiently.
Any help is much appreciated.
The only solution that may work relatively well directly on DataFrames is to use window functions but I still wouldn't expect particularly high performance here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (12, 0, 10, 20091017), (12, 10, 20, 20091201),
  (12, 20, 0, 20091215), (12, 0, 15, 20100220),
  (12, 15, 0, 20100320)
).toDF("id", "start", "end", "date")

val w = Window.partitionBy($"id").orderBy($"date")

// Pair the original id with a running count of birth rows (start == 0),
// which increments each time the id is reused for a new animal.
val uniqueId = struct(
  $"id", sum(when($"start" === 0, 1).otherwise(0)).over(w))

df.withColumn("unique_id", uniqueId).show
// +---+-----+---+--------+---------+
// | id|start|end| date|unique_id|
// +---+-----+---+--------+---------+
// | 12| 0| 10|20091017| [12,1]|
// | 12| 10| 20|20091201| [12,1]|
// | 12| 20| 0|20091215| [12,1]|
// | 12| 0| 15|20100220| [12,2]|
// | 12| 15| 0|20100320| [12,2]|
// +---+-----+---+--------+---------+
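If you would rather have a flat string id such as 12_1 / 12_2 instead of the struct (getting the exact 12a / 12b lettering from the question would need an extra mapping from number to letter), a sketch reusing the same window w and running sum:

import org.apache.spark.sql.functions._

// Glue the per-animal life number onto the original id as a single string column.
val lifeSeq = sum(when($"start" === 0, 1).otherwise(0)).over(w)
df.withColumn("unique_id", concat_ws("_", $"id", lifeSeq)).show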