Create a boolean feature to check if two columns are the same - scala

I have a dataframe DF1 that has three features (columns) a,b,c, all of StringType. I want to create a new dataframe DF2 from DF1 that has two columns:
The column a
A new column d with 1 if b=c otherwise 0
Input example:
a b c
A B B
B C A
D D D
Wanted output
a d
A 1
B 0
D 1

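The missing part is dropping the other two columns. Note that the comparison produces a boolean column; add .cast("int") to the comparison if you need 1/0 rather than true/false.
import org.apache.spark.sql.functions.col
val df2 = df1.withColumn("d", col("b") === col("c")).drop("b").drop("c")
df2.show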
This gives us
+---+-----+
|  a|    d|
+---+-----+
|  A| true|
|  B|false|
|  D| true|
+---+-----+

Use this:
val df2 = df1.withColumn("d", col("b") === col("c"))
Here withColumn adds the new column d to df2; columns b and c are still present.

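Since I was using PySpark, the following commands worked for me (display() is available in Databricks notebooks; use show() elsewhere):
from pyspark.sql.functions import col
df2 = df1.withColumn("d", col("b") == col("c")).select(col("a"), col("d"))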
df2.display()

Related

PySpark incrementally add id based on another column and previous data

I want to derive an incremental ID from a name column. On the next load, if new values have been added to the name column, they should be assigned new IDs that are not already assigned to previous data.
Example - first load:
Name
a
b
b
a
Result:
ID Name
1 a
2 b
2 b
1 a
Next load:
Name
a
b
b
a
c
d
c
Result:
ID Name
1 a
2 b
2 b
1 a
3 c
4 d
3 c
As described in the question, I am looking for a solution in PySpark.
You can create an additional dataframe df_map in which you store your IDs between loads. If you need to, you can save this dataframe to disk and restore it on the next run.
from pyspark.sql import functions as F
from pyspark.sql import Window
df1 = spark.createDataFrame(
data=[['a'], ['b'], ['b'], ['a']],
schema=["name"]
)
df2 = spark.createDataFrame(
data=[['a'], ['b'], ['b'], ['a'], ['c'], ['d'], ['c'], ['0']],
schema=["name"]
)
w = Window.orderBy('name')
# create empty map
df_map = spark.createDataFrame([], schema='name string, id int')
df_map.show()
# get additional name->id map for df1
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df1.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()
# map can be saved to disk between runs
# get additional name->id map for df2
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df2.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()
# join to get the final dataframe
df2.join(df_map, on='name').show()
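To make the bookkeeping in the df_map approach explicit, here is a minimal plain-Python sketch of the same idea (no Spark involved). The helper name extend_id_map is made up for illustration, and it assumes the ids already in the map are contiguous, so continuing after the largest id matches the count-based offset used above.

```python
def extend_id_map(id_map, names):
    """Return a copy of id_map with new ids for names not seen before.

    New names are numbered in sorted order, continuing after the largest
    id already in use (this mirrors row_number() over an ordering by name).
    """
    extended = dict(id_map)
    next_id = max(extended.values(), default=0) + 1
    for name in sorted(set(names) - set(extended)):
        extended[name] = next_id
        next_id += 1
    return extended


first_load = ["a", "b", "b", "a"]
next_load = ["a", "b", "b", "a", "c", "d", "c"]

id_map = extend_id_map({}, first_load)      # map after the first load
id_map = extend_id_map(id_map, next_load)   # existing ids are kept as-is

# "join" the map back onto the second load
result = [(id_map[name], name) for name in next_load]
print(result)
```

The point of keeping the map between loads is that a name never changes its id once assigned, regardless of what arrives later.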
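You can use a window and dense_rank. The code below orders the dataframe by the name column and gives each unique name an incremental id. Note that the ids are recomputed from scratch on every load, so they stay stable between loads only if new names sort after the existing ones.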
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
window = W.orderBy('name')
(
df
.withColumn('id', F.dense_rank().over(window))
).show()
+----+---+
|name| id|
+----+---+
|   a|  1|
|   a|  1|
|   b|  2|
|   b|  2|
|   c|  3|
|   c|  3|
|   d|  4|
+----+---+
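One thing worth checking with the dense_rank approach is what happens to existing ids when a later load contains a name that sorts before the old ones. The following plain-Python sketch models dense ranking over sorted distinct names; dense_rank_ids is a hypothetical helper, not a Spark function.

```python
def dense_rank_ids(names):
    """Map each distinct name to its 1-based position in sorted order."""
    return {name: rank for rank, name in enumerate(sorted(set(names)), start=1)}


first = dense_rank_ids(["a", "b", "b", "a"])
later = dense_rank_ids(["a", "b", "b", "a", "c", "d", "c"])
shifted = dense_rank_ids(["a", "b", "0"])  # "0" sorts before "a"

print(first)
print(later)
print(shifted)
```

With the data from the question the ids of a and b stay the same, because c and d sort after them. In the last call, a and b are renumbered, which is the case where a persisted name-to-id map is the safer option.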

Advanced join two dataframe spark scala

I have to join two Dataframes.
Sample:
Dataframe1 looks like this
df1_col1 df1_col2
a ex1
b ex4
c ex2
d ex6
e ex3
Dataframe2
df2_col1 df2_col2
1 a,b,c
2 d,c,e
3 a,e,c
In the resulting Dataframe I would like to get this:
res_col1 res_col2 res_col3
a ex1 1
a ex1 3
b ex4 1
c ex2 1
c ex2 2
c ex2 3
d ex6 2
e ex3 2
e ex3 3
What will be the best way to achieve this join?
I have updated the code below:
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF
val df2 = sc.parallelize(Seq(("1","a,b,c"),("2","d,c,e"))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ","))).select($"_1".as("df2_col1"),$"df2_col2_explode").join(df1.select($"_1".as("df1_col1"),$"_2".as("df1_col2")), $"df1_col1"===$"df2_col2_explode","inner").show
You just need to split the values, explode them into multiple rows, and then join the result with the other dataframe.
See also: How to split pipe-separated column into multiple rows?
I used Spark SQL for this join; here is part of the code:
df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select
| b.df1_col1 as res_col1,
| b.df1_col2 as res_col2,
| a.df2_col1 as res_col3
| from (select df2_col1, exp_col
| from temp_v_df2
| lateral view explode(split(df2_col2,",")) dummy as exp_col) a
| join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)
I used the Spark Scala DataFrame API to achieve your desired output.
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2")
val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2")
df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp").join (df1,$"temp._tmp"===df1("df1_col1"),"inner").drop("_tmp","df2_col2").show
Desired output:
+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
|       2|       e|     ex3|
|       3|       e|     ex3|
|       2|       d|     ex6|
|       1|       c|     ex2|
|       2|       c|     ex2|
|       3|       c|     ex2|
|       1|       b|     ex4|
|       1|       a|     ex1|
|       3|       a|     ex1|
+--------+--------+--------+
Rename the columns according to your requirements.
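For readers who want to see what split, explode and join do to the data, here is a small plain-Python model of the same three steps using the sample rows from the question. It is only a sketch of the logic, not Spark code, and it assumes the values in df1_col1 are unique so that a dict can stand in for the join.

```python
df1_rows = [("a", "ex1"), ("b", "ex4"), ("c", "ex2"), ("d", "ex6"), ("e", "ex3")]
df2_rows = [(1, "a,b,c"), (2, "d,c,e"), (3, "a,e,c")]

# split + explode: one (key, id) pair per comma-separated value
exploded = [(key, row_id) for row_id, csv in df2_rows for key in csv.split(",")]

# inner join on the exploded key (assumes df1_col1 is unique)
lookup = dict(df1_rows)
joined = sorted(
    (key, lookup[key], row_id) for key, row_id in exploded if key in lookup
)

for row in joined:
    print(*row)
```

Sorted by the first column, the rows line up with the res_col1, res_col2, res_col3 table in the question.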

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by the id column and, for any id that has both Y and N in val, remove the rows where val is "N".
Please help me resolve this issue, as I am a beginner with PySpark.
You can first identify the problematic ids with a filter for val=="Y" and then left-join that dataframe back to the original one. Finally, keep the rows where the joined value is null (the id has no Y) or where the id has a Y and val is not N. PySpark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
from pyspark.sql.functions import col
df_new = spark.createDataFrame([
(1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
(2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()
The result is what you asked for:
+---+---+
| id|val|
+---+---+
|  1|  X|
|  1|  Y|
|  1|  Z|
|  3|  N|
|  2|  a|
|  2|  b|
+---+---+
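The rule itself can be stated compactly: drop a row only when its val is "N" and the same id also has a "Y". A plain-Python sketch of that rule on the input from the question (no Spark, purely to check the logic) could look like this:

```python
from collections import defaultdict

rows = [(1, "Y"), (1, "N"), (2, "a"), (2, "b"), (3, "N")]

# collect the set of values seen for each id (the "group by" step)
vals_by_id = defaultdict(set)
for row_id, val in rows:
    vals_by_id[row_id].add(val)

# keep everything except "N" rows whose id also has a "Y"
kept = [
    (row_id, val)
    for row_id, val in rows
    if not (val == "N" and "Y" in vals_by_id[row_id])
]
print(kept)
```

Note that id 3 keeps its N row because it has no Y, which is the behaviour the null check in the join-based answer is there to preserve.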

Scala Spark: splitting dataframe column dynamically

I am very new to scala and spark.
I have read a text file into a dataframe, and successfully split the single column into columns (essentially the file is SPACE delimited csv)
val irisDF:DataFrame = spark.read.csv("src/test/resources/iris-in.txt")
irisDF.show()
val dfnew:DataFrame = irisDF.withColumn("_tmp", split($"_c0", " ")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3"),
$"_tmp".getItem(3).as("col4")
).drop("_tmp")
This works.
BUT what if I do not know how many columns there are in the datafile? How do I dynamically generate the columns depending on the number of items generated by the split function?
You can create a sequence of select expressions, and then apply all of them to select method with :_* syntax:
Example Data:
val df = Seq("a b c d", "e f g").toDF("c0")
df.show
+-------+
|     c0|
+-------+
|a b c d|
|  e f g|
+-------+
Suppose you want five columns from the c0 column. The number of columns has to be determined before you build the select expressions:
val selectExprs = 0 until 5 map (i => $"temp".getItem(i).as(s"col$i"))
df.withColumn("temp", split($"c0", " ")).select(selectExprs:_*).show
+----+----+----+----+----+
|col0|col1|col2|col3|col4|
+----+----+----+----+----+
|   a|   b|   c|   d|null|
|   e|   f|   g|null|null|
+----+----+----+----+----+
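The remaining question is where the column count comes from when it is not known in advance. One option is to take the largest number of items that any row splits into and use that as the width. The plain-Python sketch below shows the idea on the example data; in Spark the analogous step would presumably be an aggregation such as max(size(split(...))) collected to the driver before building the select expressions, which is an assumption on my part rather than something shown above.

```python
lines = ["a b c d", "e f g"]

# split every line and find the widest row
tokens = [line.split(" ") for line in lines]
width = max(len(row) for row in tokens)

# generate the column names and pad short rows with None
header = [f"col{i}" for i in range(width)]
table = [row + [None] * (width - len(row)) for row in tokens]

print(header)
for row in table:
    print(row)
```

Computing the width costs one extra pass over the data, so for large files it may be worth caching the split result first.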

Get Unique records in Spark [duplicate]

This question already has answers here:
How to select the first row of each group?
I have a dataframe df as mentioned below:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
I want to create a new dataframe df2 that has only unique customer ids. Because rule_name and rule_id differ for the same customer, I want to pick, for each customer, the record with the highest priority (the lowest priority number). My final outcome should be:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in PySpark, but it should be straightforward to translate to Scala:
from pyspark.sql import functions as F
# find the best (minimum) priority for each customer. this DF has only two columns.
cusPriDF = df.groupBy("customers").agg( F.min(df["priority"]).alias("priority") )
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers","priority"], how="inner")
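To see the tie behaviour mentioned above, here is a plain-Python sketch of the same two steps (minimum priority per customer, then keep the matching rows). The helper name top_priority_rows and the extra tied row are made up for illustration; the other rows are the sample data from the question.

```python
def top_priority_rows(rows):
    """Keep the rows whose priority equals the minimum for their customer.

    Each row is (customers, product, val_id, rule_name, rule_id, priority).
    """
    best = {}
    for row in rows:
        customer, priority = row[0], row[5]
        best[customer] = min(priority, best.get(customer, priority))
    return [row for row in rows if row[5] == best[row[0]]]


rows = [
    (1, "A", "1", "ABC", 123, 1),
    (3, "Z", "r", "ERF", 789, 2),
    (2, "B", "X", "ABC", 123, 2),
    (2, "B", "X", "DEF", 456, 3),
    (1, "A", "1", "DEF", 456, 2),
]
top_rows = top_priority_rows(rows)

# hypothetical extra row that ties with customer 3's best priority
tied_rows = top_priority_rows(rows + [(3, "Z", "r", "XYZ", 999, 2)])

print(top_rows)
print(len(tied_rows))
```

If exactly one row per customer is required, a tie-breaker (for example on rule_id) has to be added on top of this.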
To create df2, first order df by priority and then keep the first row for each customer. Note that first after orderBy and groupBy relies on a row order that Spark does not guarantee to preserve through the shuffle, so verify the result on your data:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(c => first(c).as(c))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail:_*)
df2.show
It gives the expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
|         1|       A|      1|       ABC|     123|        1|
|         3|       Z|      r|       ERF|     789|        2|
|         2|       B|      X|       ABC|     123|        2|
+----------+--------+-------+----------+--------+---------+
Corey beat me to it, but here's the Scala version:
import org.apache.spark.sql.functions.min
val df = Seq(
(1,"A","1","ABC",123,1),
(3,"Z","r","ERF",789,2),
(2,"B","X","ABC",123,2),
(2,"B","X","DEF",456,3),
(1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
top_rows.show
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
|        1|       1|      A|     1|      ABC|    123|
|        3|       2|      Z|     r|      ERF|    789|
|        2|       2|      B|     X|      ABC|    123|
+---------+--------+-------+------+---------+-------+
You will have to use a min aggregation on the priority column, grouping the dataframe by customers, and then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
.withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should then have the desired result.