Coalesce in Spark Scala

Is there a built-in way in Spark (Scala) to make coalesce skip empty strings as well as NULLs?
For example, I have the DataFrame below:
val df2 = Seq(
  ("", "1"),
  ("null", "15_20")
).toDF("c1", "c2")
+----+-----+
| c1| c2|
+----+-----+
| | 1|
|null|15_20|
+----+-----+
The code below handles only NULL values, but I need coalesce to skip empty strings as well.
df2.withColumn("FirstNonNullOrBlank", coalesce(col("c1"), col("c2"))).show()
+----+-----+-------------------+
| c1| c2|FirstNonNullOrBlank|
+----+-----+-------------------+
| | 1| |
|null|15_20| 15_20|
+----+-----+-------------------+
Expected output:
+----+-----+-------------------+
| c1| c2|FirstNonNullOrBlank|
+----+-----+-------------------+
| | 1| 1 |
|null|15_20| 15_20|
+----+-----+-------------------+
What is the best approach here?

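You need a helper function to "nullify" these records:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, not, when}
// Keep the value only if it is neither empty nor the literal string "null"; otherwise yield NULL
def nullify(c: Column): Column = when(not(c === "" or c === "null"), c)
df2
  .withColumn("FirstNonNullOrBlank", coalesce(
    nullify(col("c1")),
    nullify(col("c2")))
  )
  .show()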
+----+-----+-------------------+
| c1| c2|FirstNonNullOrBlank|
+----+-----+-------------------+
| | 1| 1|
|null|15_20| 15_20|
+----+-----+-------------------+
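To make the semantics of the helper concrete, here is a minimal plain-Python sketch of the same idea, with no Spark involved. The names BLANKS and coalesce_non_blank are made up for illustration, and treating the literal string "null" as missing is an assumption carried over from the sample data above.

```python
from typing import Optional

# Values treated as missing in addition to None. The literal string "null"
# is included only because the sample data stores it as text (assumption).
BLANKS = {"", "null"}


def nullify(value: Optional[str]) -> Optional[str]:
    """Return None for missing-like values, otherwise the value unchanged."""
    if value is None or value in BLANKS:
        return None
    return value


def coalesce_non_blank(*values: Optional[str]) -> Optional[str]:
    """Return the first value that survives nullify, or None if none does."""
    for value in values:
        cleaned = nullify(value)
        if cleaned is not None:
            return cleaned
    return None


rows = [("", "1"), ("null", "15_20")]
result = [coalesce_non_blank(c1, c2) for c1, c2 in rows]
print(result)
```

Each row falls through to c2 because c1 is either empty or the text "null", which mirrors what coalesce over the nullified columns does.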

Related

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any other imports other than pyspark.sql.functions import col
You can join the two tables and specify how="left_semi".
A left semi-join returns the rows from the left side of the relation that have a match on the right. Only the left table's columns are kept, so there is no BID column to drop.
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
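As a rough illustration of what a left semi-join does, here is a plain-Python sketch without Spark. The function name left_semi_join is hypothetical, and the duplicate key in table_b is added here only to show that duplicates on the right do not multiply rows on the left.

```python
# (AID, foo) rows, as in the question.
table_a = [(1, "bar"), (2, "bar"), (3, "bar"), (4, "bar")]
# (BID,) rows; the repeated 2 is an assumption added for illustration.
table_b = [(1,), (2,), (2,)]


def left_semi_join(left_rows, right_keys):
    """Keep each left row at most once if its first field appears among right_keys."""
    keys = set(right_keys)
    return [row for row in left_rows if row[0] in keys]


result = left_semi_join(table_a, (bid for (bid,) in table_b))
print(result)
```

Only the columns of table_a appear in the result, and the row with AID 2 is returned once even though 2 occurs twice in table_b.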
If the second DataFrame contains duplicate or multiple values and you only want the distinct ones, the approach below can help.
Create the DataFrames:
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
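Collect the unique values of the val column of the second DataFrame into a list:
from pyspark.sql import functions as F
# collect_set de-duplicates; [0][1] is the collected list for the first (here the only) id group
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
# Flag the rows whose col1 appears in the collected values, then keep only the flagged rows
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()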
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
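This should work too, as long as table_b is small enough to collect to the driver:
table_a.where(col("AID").isin([row.BID for row in table_b.select("BID").distinct().collect()]))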

Filter one data frame using other data frame in spark scala

I am going to demonstrate my question using the following two data frames.
val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()
+---+-----------+-----+
| ID| token|value|
+---+-----------+-----+
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
+---+-----------+-----+
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
| token1|val2|
+--------+----+
| life |0.71|
|learning|0.75|
+--------+----+
I want to filter the ID and value of datF1 based on token1 of dataF2. For each word in token1 of dataF2, if an ID has that word as a token, the output should carry the value from datF1; otherwise it should be zero.
In other words, my desired output looks like this:
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Since learning is not present for ID 2, val is zero. Similarly, since life is not present for ID 3, val2 is zero.
I did it manually as follows:
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
tf4.show()
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Instead of doing this manually, is there a more efficient way to do it by looking up the entries of one data frame from within the other? In real-life situations there can be more than two words, so handling each word manually would be very hard.
Thank you
UPDATE
When I use a leftsemi join, my output looks like this:
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID| token|value|
+---+--------+-----+
| 1|learning| 0.69|
| 3|learning| 0.69|
+---+--------+-----+
I believe a left outer join followed by a pivot on token can work here:
val ans = datF1.join(dataF2, $"token" === $"token1", "left_outer")
  .filter($"token1".isNotNull) // keep only the tokens that appear in dataF2
  .select("ID", "token", "value")
  .groupBy("ID")
  .pivot("token")
  .agg(first("value"))
  .na.fill(0) // an ID that lacks a token gets 0 instead of null
The result (nulls filled with 0):
ans.show
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
+---+--------+----+
UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join plus filter.
I think an inner join is enough. By the way, I found a typo in your test case (a trailing space in "life ") that makes the result wrong.
val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),
(1,"learning",0.69),
(2,"living",1.38),
(2,"worth",1.38),
(2,"life",0.69),
(3,"learning",0.69),
(3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
resultDF.groupBy("ID").pivot("token").agg(first("value"))
.na.fill(0).orderBy("ID").show
This gives the following result:
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
+---+--------+----+
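For readers who want to check the join-then-pivot logic without a Spark session, here is a small plain-Python sketch of the same steps. The function name join_and_pivot is hypothetical, and the token list assumes the trailing space in "life " has already been removed.

```python
# Rows of dataF1 as (ID, token, value), copied from the question.
data_f1 = [
    (1, "everlasting", 1.39), (1, "game", 2.7), (1, "life", 0.69),
    (1, "learning", 0.69), (2, "living", 1.38), (2, "worth", 1.38),
    (2, "life", 0.69), (3, "learning", 0.69), (3, "never", 1.38),
]
# token1 values of dataF2, with the "life " typo corrected (assumption).
wanted = ["life", "learning"]


def join_and_pivot(rows, tokens):
    """Keep rows whose token is wanted, then build one zero-filled record per ID."""
    columns = sorted(tokens)
    pivot = {}
    for row_id, token, value in rows:
        if token in columns:  # plays the role of the inner-join condition
            # Keep the first value seen per (ID, token), like first("value").
            pivot.setdefault(row_id, {}).setdefault(token, value)
    # Fill tokens an ID never had with 0.0, like na.fill(0).
    return {
        row_id: {c: record.get(c, 0.0) for c in columns}
        for row_id, record in pivot.items()
    }


result = join_and_pivot(data_f1, wanted)
for row_id in sorted(result):
    print(row_id, result[row_id])
```

As with the inner join, an ID that has none of the wanted tokens would not appear in the result at all.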
It seems you need a left semi-join, which filters one dataframe based on another.
Try using it like this:
datF1.join(dataF2, $"token" === $"token1", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5

Group rows that match sub string in a column using scala

I have a fol df:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' based on same Zip using scala
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need is a filter, a groupBy and an aggregation:
import org.apache.spark.sql.functions._
fol.filter(col("Name").contains("x")) // keep the rows whose Name contains "x"
  .groupBy("Zip") // group by the Zip column
  .agg(count("Zip").as("Count")) // count the rows in each group
  .show(false)
and you should get the desired output:
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
You want to group the data frame below.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
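Calling groupBy with the column name followed by count gives the number of rows per zip. To count only the names that contain 'x', as the question asks, filter first.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
val groupedDf: DataFrame = df.filter(col("name").contains("x")).groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |abc|    3|
// |def|    2|
// +---+-----+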

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out the records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can anyone please help with this?
I know that df.filter($"c2".rlike("MSL")) selects these records, but how do I exclude them?
Version: Spark 1.6.2
Scala : 2.10
This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
df.filter(not(
  substring(col("c2"), 1, 3).isin("MSL", "HCP")) // substring positions are 1-based
)
I used the code below to filter rows from a dataframe and it worked for me (Spark 2.2).
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )
// Anchor the patterns with ^ so that only values starting with MSL or HCP are excluded
val df1 = df.filter(not(df("c2").rlike("^MSL")) && not(df("c2").rlike("^HCP")))
This worked.

Spark DataFrame Add Column with Value

I have a DataFrame with below data
scala> nonFinalExpDF.show
+---+----------+
| ID| DATE|
+---+----------+
| 1| null|
| 2|2016-10-25|
| 2|2016-10-26|
| 2|2016-09-28|
| 3|2016-11-10|
| 3|2016-10-12|
+---+----------+
From this DataFrame I want to get below DataFrame
+---+----------+----------+
| ID| DATE| INDICATOR|
+---+----------+----------+
| 1| null| 1|
| 2|2016-10-25| 0|
| 2|2016-10-26| 1|
| 2|2016-09-28| 0|
| 3|2016-11-10| 1|
| 3|2016-10-12| 0|
+---+----------+----------+
Logic:
For the latest DATE (max date) of an ID, the indicator is 1, and for the other rows it is 0.
For an account (ID) whose DATE is null, the indicator is 1.
Please suggest a simple way to do this.
Try
df.createOrReplaceTempView("df")
spark.sql("""
SELECT id, date,
CAST(LEAD(COALESCE(date, TO_DATE('1900-01-01')), 1)
OVER (PARTITION BY id ORDER BY date) IS NULL AS INT) AS indicator
FROM df""")
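The stated rules can also be checked on the sample data with a short plain-Python sketch, without Spark. It is one reading of the rules rather than a translation of the SQL above: any row with a null date gets 1, and otherwise a row gets 1 only if it holds the latest date for its ID. The names latest and indicator are made up for illustration.

```python
# (ID, DATE) rows from the question; None stands for null.
rows = [
    (1, None),
    (2, "2016-10-25"), (2, "2016-10-26"), (2, "2016-09-28"),
    (3, "2016-11-10"), (3, "2016-10-12"),
]

# Latest non-null date per ID. ISO yyyy-MM-dd strings sort chronologically,
# so plain string comparison is enough here.
latest = {}
for row_id, date in rows:
    if date is not None and (row_id not in latest or date > latest[row_id]):
        latest[row_id] = date


def indicator(row_id, date):
    """1 for a null date or the latest date of the ID, otherwise 0."""
    if date is None:
        return 1
    return 1 if date == latest[row_id] else 0


result = [(row_id, date, indicator(row_id, date)) for row_id, date in rows]
for row in result:
    print(row)
```

On the sample data this reproduces the INDICATOR column of the desired output.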