I have data like below:
A  B  C    D
1  A  Day  D1
1  A  Tim  1am
1  A  Tim  3am
I need to create output like this:
A  B  Day  Tim1  Tim2
1  A  D1   1am   3am
Can you help with how to do this in Spark Scala?
You can add the row numbers for the duplicates first and then do the pivot.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w1 = Window.partitionBy("A", "B", "C").orderBy("D")
val w2 = Window.partitionBy("A", "B", "C")
val df1 = df0
  .withColumn("row_num", row_number().over(w1))
  .withColumn("max_num", max("row_num").over(w2))
df1.show(false)
//+---+---+---+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+---+---+-------+-------+
//|1 |A |Tim|1am|1 |2 |
//|1 |A |Tim|3am|2 |2 |
//|1 |A |Day|D1 |1 |1 |
//+---+---+---+---+-------+-------+
val df2 = df1.withColumn("C", expr("if(max_num != 1, concat(C, row_num), C)"))
df2.show(false)
//+---+---+----+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+----+---+-------+-------+
//|1 |A |Tim1|1am|1 |2 |
//|1 |A |Tim2|3am|2 |2 |
//|1 |A |Day |D1 |1 |1 |
//+---+---+----+---+-------+-------+
val df3 = df2.groupBy("A", "B").pivot("C").agg(first("D"))
df3.show(false)
//+---+---+---+----+----+
//|A |B |Day|Tim1|Tim2|
//+---+---+---+----+----+
//|1 |A |D1 |1am |3am |
//+---+---+---+----+----+
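If the distinct values of C are known up front, a minor variation (just a sketch, using the same df2 as above) is to pass them to pivot explicitly; this skips the extra job Spark runs to collect the pivot values and fixes the output column order:
// Passing the pivot values avoids an extra pass over the data to discover them
// and guarantees the column order Day, Tim1, Tim2.
val df3 = df2.groupBy("A", "B").pivot("C", Seq("Day", "Tim1", "Tim2")).agg(first("D"))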
I have a problem joining 2 dataframes grouped by ID.
val df1 = Seq(
  (1, 1, 100),
  (1, 3, 20),
  (2, 5, 5),
  (2, 2, 10)).toDF("id", "index", "value")
val df2 = Seq(
  (1, 0),
  (2, 0),
  (3, 0),
  (4, 0),
  (5, 0)).toDF("index", "value")
df1 should be joined with df2 by the index column, for every id.
Expected result:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1  |1    |100  |
|1  |2    |0    |
|1  |3    |20   |
|1  |4    |0    |
|1  |5    |0    |
|2  |1    |0    |
|2  |2    |10   |
|2  |3    |0    |
|2  |4    |0    |
|2  |5    |5    |
+---+-----+-----+
Please help me with this.
First of all, I would replace your df2 table with this:
var df2 = Seq(
  (Array(1, 2), Array(1, 2, 3, 4, 5))
).toDF("id", "index")
This allows us to use explode and auto-generate the full (id, index) grid we need:
df2 = df2
  .withColumn("id", explode(col("id")))
  .withColumn("index", explode(col("index")))
and it gives:
+---+-----+
|id |index|
+---+-----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|1 |5 |
|2 |1 |
|2 |2 |
|2 |3 |
|2 |4 |
|2 |5 |
+---+-----+
Now all we need to do is join with your df1 as below:
df2 = df2
  .join(df1, Seq("id", "index"), "left")
  .withColumn("value", when(col("value").isNull, 0).otherwise(col("value")))
And we get this final output:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1 |1 |100 |
|1 |2 |0 |
|1 |3 |20 |
|1 |4 |0 |
|1 |5 |0 |
|2 |1 |0 |
|2 |2 |10 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |5 |
+---+-----+-----+
which should be what you want. Good luck!
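If you'd rather not hardcode the ids and indexes, a variation (just a sketch, using df1 and df2 exactly as defined in the question) builds the same grid with a cross join:
import org.apache.spark.sql.functions._
// Pair every distinct id from df1 with every index from df2, then left-join
// df1 back in and default the missing values to 0.
val grid = df1.select("id").distinct().crossJoin(df2.select("index"))
val result = grid
  .join(df1, Seq("id", "index"), "left")
  .withColumn("value", coalesce(col("value"), lit(0)))
  .orderBy("id", "index")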
I have a dataset like:
a
c
c
d
b
a
a
d
d
c
c
b
a
b
I want to add a column that looks like the one below. Whenever 'c' is reached, the new column is zero, and after each run of 'c' it increases by one. Is there a way to do this using PySpark?
a 1
c 0
c 0
d 2
b 2
a 2
a 2
d 2
d 2
c 0
c 0
b 3
a 3
b 3
I have tried the code below, but it is not working.
from pyspark.sql.functions import col, when, lag, sum
s = df.filter(col("col") == 'c')
df = df.withColumn("new", when(s.neq(lag("s", 1).over()), sum("s").over(Window.orderBy("index"))).otherwise(0))
The following solution uses PySpark SQL functions to implement the logic requested above.
Set-Up
Create a DataFrame to mimic the example provided
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
[('a',),
('c',),
('c',),
('d',),
('b',),
('a',),
('a',),
('d',),
('d',),
('c',),
('c',),
('b',),
('a',),
('b',),],
['id',])
Output
+---+
|id |
+---+
|a |
|c |
|c |
|d |
|b |
|a |
|a |
|d |
|d |
|c |
|c |
|b |
|a |
|b |
+---+
Logic
Calculate a row number to fix the ordering of the rows.
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
Use the row number to determine the preceding id value (the lag). There is no preceding id for the first row, so the lag results in a null; set this missing value to "c".
df = df.withColumn("lag_id", F.lag("id",1).over(Window.orderBy("row_num")))
df = df.na.fill(value="c", subset=['lag_id'])
Output
+---+-------+------+
|id |row_num|lag_id|
+---+-------+------+
|a  |1      |c     |
|c  |2      |a     |
|c  |3      |c     |
|d  |4      |c     |
|b  |5      |d     |
|a  |6      |b     |
|a  |7      |a     |
|d  |8      |a     |
|d  |9      |d     |
|c  |10     |d     |
|c  |11     |c     |
|b  |12     |c     |
|a  |13     |b     |
|b  |14     |a     |
+---+-------+------+
Determine order (sequence) for rows that immediately follow a row where id = "c"
df_sequence = df.filter((df.id != "c") & (df.lag_id == "c"))
df_sequence = df_sequence.withColumn("sequence", F.row_number().over(Window.orderBy("row_num")))
Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|d  |4      |c     |2       |
|b  |12     |c     |3       |
+---+-------+------+--------+
Join the sequence DF to the original DF
df_joined = df.alias("df1").join(df_sequence.alias("df2"),
                                 on="row_num",
                                 how="leftouter") \
              .select(df["*"], df_sequence["sequence"])
Set sequence to 0 when id = "c"
df_joined = df_joined.withColumn('sequence', F.when(df_joined.id == "c", 0)
                                              .otherwise(df_joined.sequence))
Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|c  |2      |a     |0       |
|c  |3      |c     |0       |
|d  |4      |c     |2       |
|b  |5      |d     |null    |
|a  |6      |b     |null    |
|a  |7      |a     |null    |
|d  |8      |a     |null    |
|d  |9      |d     |null    |
|c  |10     |d     |0       |
|c  |11     |c     |0       |
|b  |12     |c     |3       |
|a  |13     |b     |null    |
|b  |14     |a     |null    |
+---+-------+------+--------+
Forward fill the sequence values.
df_final = df_joined.withColumn('sequence', F.last('sequence', ignorenulls=True).over(Window.orderBy("row_num")))
Final Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|c  |2      |a     |0       |
|c  |3      |c     |0       |
|d  |4      |c     |2       |
|b  |5      |d     |2       |
|a  |6      |b     |2       |
|a  |7      |a     |2       |
|d  |8      |a     |2       |
|d  |9      |d     |2       |
|c  |10     |d     |0       |
|c  |11     |c     |0       |
|b  |12     |c     |3       |
|a  |13     |b     |3       |
|b  |14     |a     |3       |
+---+-------+------+--------+
I have a dataframe with 2 columns, "Id" and "Category". For each category, I want to label encode the column "Id", so the expected outcome is the column "Enc_id" like this:
Id  Category  Enc_id
a1  A         0
a2  A         1
b1  B         0
c1  C         0
c2  C         1
a3  A         2
b2  B         1
b3  B         2
b4  B         3
b4  B         3
b3  B         2
Here, the Id may not be unique, so there may be duplicated rows. I thought of creating a window partitioned by Category and then applying label encoding (StringIndexer) over this window, but it didn't work. Any hint, please?
You can use a window function, partitioning by Category (and the first character of Id via substring), and calculate the rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy($"Category", substring($"Id", 1, 1)).orderBy("Id")
df.withColumn("Enc_id", rank().over(window) - 1) // -1 to start the rank from 0
  .show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b4 |B |3 |
+---+--------+------+
Update 1:
For the updated case with duplicate Ids, deduplicate first, rank, then explode the duplicates back:
df1.groupBy("Id", "Category")
  .agg(collect_list("Category") as "list_category")
  .withColumn("Enc_id", rank().over(window) - 1)
  .withColumn("Category", explode($"list_category"))
  .drop("list_category")
  .show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b3 |B |2 |
|b4 |B |3 |
|b4 |B |3 |
+---+--------+------+
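A shorter variation worth considering (just a sketch, assuming Ids sort lexicographically within each Category as in the sample) is dense_rank, which assigns duplicate Ids the same rank with no gaps, so the groupBy/explode round trip isn't needed:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Equal Ids get the same dense rank, so the duplicated b3/b4 rows both keep
// their Enc_id and the next Id continues without skipping a value.
val w = Window.partitionBy("Category").orderBy("Id")
df1.withColumn("Enc_id", dense_rank().over(w) - 1).show(false)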
I'm trying to replace NULL or invalid values present in a column with the nearest non-null value above or below in the same column. For example:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
In this case, I want to replace all the NULL values in the column "Name" (the 1st NULL should be replaced with 'a' and the 2nd with 'c') and the NULL values in the column "Place" with 'a2'.
When we replace the NULL in row 8 of the "Place" column, it should likewise take the nearest non-null value, 'a2'.
Required result:
If we select the row-8 NULL of the "Place" column for replacement, the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
|a2 |8
d |c1 |9
If we select the row-4 NULL of the "Name" column for replacement, the result will be:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
a |d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
Window functions come in handy to solve this. For the sake of simplicity, I'm focusing on just the name column. If the previous row is null, I use the next row's value; you can change this order according to your needs. The same approach can be applied to the other columns as well.
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(("a", "a1", "1"),
("a", "a2", "2"),
("a", "a2", "3"),
("d1", null, "4"),
("b", "a2", "5"),
("c", "a2", "6"),
(null, null, "7"),
(null, null, "8"),
("d", "c1", "9")).toDF("name", "place", "row_count")
val window = Window.orderBy("row_count")
val lagNameWindowExpression = lag('name, 1).over(window)
val leadNameWindowExpression = lead('name, 1).over(window)
val nameConditionExpression = when($"name".isNull.and('previous_name_col.isNull), 'next_name_col)
.when($"name".isNull.and('previous_name_col.isNotNull), 'previous_name_col).otherwise($"name")
df.select($"*", lagNameWindowExpression as 'previous_name_col, leadNameWindowExpression as 'next_name_col)
.withColumn("name", nameConditionExpression).drop("previous_name_col", "next_name_col")
.show(false)
Output
+----+-----+---------+
|name|place|row_count|
+----+-----+---------+
|a |a1 |1 |
|a |a2 |2 |
|a |a2 |3 |
|d1 |null |4 |
|b |a2 |5 |
|c |a2 |6 |
|c |null |7 |
|d |null |8 |
|d |c1 |9 |
+----+-----+---------+
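The when/otherwise chain above is equivalent to coalescing the value with its lag and then its lead, which makes it easy to repeat for every column. A sketch of that variation (same window; like the original, a run of several consecutive nulls may still leave some behind):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.orderBy("row_count")
// For each column, take the first non-null of: the value itself, the previous
// row's value, then the next row's value.
val filled = Seq("name", "place").foldLeft(df) { (acc, c) =>
  acc.withColumn(c, coalesce(col(c), lag(col(c), 1).over(window), lead(col(c), 1).over(window)))
}
filled.show(false)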
from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat",'Combined')
agg = ''
agg = 'Cat'
tdf = (sc.parallelize([
    row(1, 1, '12/10/2016', "A", 'Water^World'),
    row(1, 2, None, 'A', 'Sea^Born'),
    row(2, 1, '14/10/2016', 'B', 'Germ^Any'),
    row(3, 3, '!~2016/2/276', 'B', 'Fin^Land'),
    row(None, 1, '26/09/2016', 'A', 'South^Korea'),
    row(1, 1, '12/10/2016', "A", 'North^America'),
    row(1, 2, None, 'A', 'South^America'),
    row(2, 1, '14/10/2016', 'B', 'New^Zealand'),
    row(None, None, '!~2016/2/276', 'B', 'South^Africa'),
    row(None, 1, '26/09/2016', 'A', 'Saudi^Arabia')
]).toDF())
cols = F.split(tdf['Combined'], '^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)
Above is my sample code. For some reason it is not splitting the column by the ^ character. Any advice?
The pattern is a regular expression (see split), and ^ is an anchor that matches the beginning of the string in regex; to match it literally, you need to escape it:
cols = F.split(tdf['Combined'], r'\^')
tdf = tdf.withColumn('column1', cols.getItem(0))
tdf = tdf.withColumn('column2', cols.getItem(1))
tdf.show(truncate=False)
+----+----+------------+---+-------------+-------+-------+
|UK_1|UK_2|Date |Cat|Combined |column1|column2|
+----+----+------------+---+-------------+-------+-------+
|1 |1 |12/10/2016 |A |Water^World |Water |World |
|1 |2 |null |A |Sea^Born |Sea |Born |
|2 |1 |14/10/2016 |B |Germ^Any |Germ |Any |
|3 |3 |!~2016/2/276|B |Fin^Land |Fin |Land |
|null|1 |26/09/2016 |A |South^Korea |South |Korea |
|1 |1 |12/10/2016 |A |North^America|North |America|
|1 |2 |null |A |South^America|South |America|
|2 |1 |14/10/2016 |B |New^Zealand |New |Zealand|
|null|null|!~2016/2/276|B |South^Africa |South |Africa |
|null|1 |26/09/2016 |A |Saudi^Arabia |Saudi |Arabia |
+----+----+------------+---+-------------+-------+-------+
Try with '\^'. It will be the same situation when you use '.' (dot) as the delimiter, which also needs escaping: r'\.'.