How to get the number of rows produced by a join in Spark (Scala)

Consider these two Dataframes:
+---+
|id |
+---+
|1 |
|2 |
|3 |
+---+
+---+-----+
|idz|word |
+---+-----+
|1 |bat |
|1 |mouse|
|2 |horse|
+---+-----+
I am doing a left join on id = idz:
val r = df1.join(df2, (df1("id") === df2("idz")), "left_outer").
withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= ("null"), col("word")).otherwise(null)).drop("word")
r.show(false)
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |mouse |
|1 |1 |bat |
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
But what if I only want to keep the word for an id that matches exactly one idz? When an id matches more than one idz, I would like to have null in ID_EMPLOYE_VENDEUR. The desired output is:
+---+----+------------------+
|id |idz |ID_EMPLOYE_VENDEUR|
+---+----+------------------+
|1 |1 |null | -- because the join produced two different rows
|2 |2 |horse |
|3 |null|null |
+---+----+------------------+
I should point out that I am working on a large DataFrame, so the solution should not be too expensive in time.
Thank you

Since you mentioned that your data is large, groupBy is not a good option for grouping the data before the join. Use a window function with over instead, as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Count the rows per idz in df2 (the DataFrame that holds idz and word).
val windowSpec = Window.partitionBy("idz")
val newDF = df2.withColumn("count", count("idz").over(windowSpec))
  .dropDuplicates("idz")
  .withColumn("word", when(col("count") >= 2, lit(null)).otherwise(col("word")))
  .drop("count")
val r = df1.join(newDF, (df1("id") === newDF("idz")), "left_outer").withColumn("ID_EMPLOYE_VENDEUR", when(col("word") =!= ("null"), col("word")).otherwise(null)).drop("word")
r.show()
+---+----+------------------+
| id| idz|ID_EMPLOYE_VENDEUR|
+---+----+------------------+
| 1| 1| null|
| 3|null| null|
| 2| 2| horse|
+---+----+------------------+

You can easily find out whether more than one idz of df2 matched a single id of df1 with a groupBy and a join.
r.join(
    r.groupBy("id").count().as("g"),
    $"g.id" === r("id")
  )
  .withColumn(
    "ID_EMPLOYE_VENDEUR",
    expr("if(count != 1, null, ID_EMPLOYE_VENDEUR)")
  )
  .drop($"g.id").drop("count")
  .distinct()
  .show()
Note: in this case neither the groupBy nor the join should trigger an additional exchange step (a shuffle across the network), because the DataFrame r is already partitioned on id (it is the result of a join on id). You can check the physical plan with explain() to confirm this for your data.
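As a rough sketch of a related variant (not taken from either answer), the counting can also be done on df2 before the join, so that the join itself sees at most one row per idz. The sample data mirrors the question; the local SparkSession and the names single and n are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, first, lit, when}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-count").getOrCreate()
import spark.implicits._

val df1 = Seq(1, 2, 3).toDF("id")
val df2 = Seq((1, "bat"), (1, "mouse"), (2, "horse")).toDF("idz", "word")

// One row per idz; `when` without `otherwise` yields null, so the word
// survives only for an idz that occurs exactly once.
val single = df2.groupBy("idz")
  .agg(count(lit(1)).as("n"), first("word").as("word"))
  .select(col("idz"), when(col("n") === 1, col("word")).as("ID_EMPLOYE_VENDEUR"))

val r = df1.join(single, df1("id") === single("idz"), "left_outer")
r.show(false)
```

Whether this is cheaper than the window version depends on the data, so it is worth comparing the two plans with explain() on a realistic sample.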

Related

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to keep only the rows of the first DataFrame whose country appears in the second DataFrame. An example:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |ARG |5 |
|2015-12-14 |GER |1 |
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
|2015-12-14 |USA |1 |
+--------------+------------+-------+
+--------------+------------+
|USE | country_id |
+--------------+------------+
| F |RUS |
| F |CHN |
+--------------+------------+
Expected:
+--------------+------------+-------+
|Date | country_id | value |
+--------------+------------+-------+
|2015-12-14 |RUS |1 |
|2015-12-14 |CHN |3 |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect. Or would another method be more efficient?
Thanks in advance!
You can use a left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")
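The two joins give the same result only while country_id is unique in DF2. The sketch below illustrates the difference with a hypothetical DF2dup in which RUS appears twice; the data and the local SparkSession are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-vs-inner").getOrCreate()
import spark.implicits._

val DF1 = Seq(
  ("2015-12-14", "ARG", 5),
  ("2015-12-14", "RUS", 1),
  ("2015-12-14", "CHN", 3)
).toDF("Date", "country_id", "value")

// Hypothetical variant of DF2 with a duplicated country.
val DF2dup = Seq(("F", "RUS"), ("F", "RUS"), ("F", "CHN")).toDF("USE", "country_id")

// left_semi keeps each matching DF1 row once, however many matches DF2dup has.
val semi = DF1.join(DF2dup, Seq("country_id"), "left_semi")

// The inner join emits one row per matching pair, so the RUS row appears twice.
val inner = DF1.alias("a").join(DF2dup.alias("b"), Seq("country_id")).select("a.*")

semi.show(false)
inner.show(false)
```

If DF2 may contain duplicates, the semi join (or an inner join against DF2.select("country_id").distinct()) keeps the row count of DF1 from growing.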

column split in Spark Scala dataframe

I have the following DataFrame:
scala> val df1=Seq(
| ("1_10","2_20","3_30"),
| ("7_70","8_80","9_90")
| )toDF("c1","c2","c3")
scala> df1.show
+----+----+----+
| c1| c2| c3|
+----+----+----+
|1_10|2_20|3_30|
|7_70|8_80|9_90|
+----+----+----+
How can I split each of these columns into separate columns on the delimiter "_"?
Expected output:
+----+----+----+----+----+----+
| c1| c2| c3|c1_1|c2_1|c3_1|
+----+----+----+----+----+----+
|1 |2 |3 | 10| 20| 30|
|7 |8 |9 | 70| 80| 90|
+----+----+----+----+----+----+
Also, I have 50+ columns in the DataFrame. Thanks in advance.
This is a good use case for foldLeft. Split each column and create a new column for each split value:
val cols = df1.columns
cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols:_*)
  .show(false)
If you need the column names to be exactly as in your expected output, you need to pick the columns that end with _1 and rename them, again with foldLeft.
Output:
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
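The renaming step mentioned above could look roughly like the following. This is a sketch, not the answerer's code: the names splitDf, renamed and ordered and the local SparkSession are assumptions, and the target names (c1 for the first part, c1_1 for the second) are taken from the expected output in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-rename").getOrCreate()
import spark.implicits._

val df1 = Seq(("1_10", "2_20", "3_30"), ("7_70", "8_80", "9_90")).toDF("c1", "c2", "c3")
val cols = df1.columns

// Same split as in the answer: every column c becomes c_1 and c_2.
val splitDf = cols.foldLeft(df1) { (acc, name) =>
  acc.withColumn(name, split(col(name), "_"))
    .withColumn(s"${name}_1", col(name).getItem(0))
    .withColumn(s"${name}_2", col(name).getItem(1))
}.drop(cols: _*)

// Rename c_1 -> c first and only then c_2 -> c_1, so the second rename cannot collide.
val renamed = cols.foldLeft(splitDf) { (acc, name) =>
  acc.withColumnRenamed(s"${name}_1", name)
    .withColumnRenamed(s"${name}_2", s"${name}_1")
}

// Reorder to c1, c2, c3, c1_1, c2_1, c3_1 as in the question.
val ordered = renamed.select((cols ++ cols.map(c => s"${c}_1")).map(col): _*)
ordered.show(false)
```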
You can use the split function:
split(col("c1"), "_")
This returns a column of type ArrayType(StringType).
You can then access the items with the .getItem(index) method.
This works cleanly if every value yields the same number of elements after splitting; if that is not the case, you will get null values wherever the requested index is not present in the array.
Example code:
df.select(
  split(col("c1"), "_").alias("c1_items"),
  split(col("c2"), "_").alias("c2_items"),
  split(col("c3"), "_").alias("c3_items")
).select(
  col("c1_items").getItem(0).alias("c1"),
  col("c1_items").getItem(1).alias("c1_1"),
  col("c2_items").getItem(0).alias("c2"),
  col("c2_items").getItem(1).alias("c2_1"),
  col("c3_items").getItem(0).alias("c3"),
  col("c3_items").getItem(1).alias("c3_1")
)
Since you need to do this for 50+ columns, I would suggest wrapping the logic for a single column in a method built from withColumn calls, like this:
def splitMyCol(df: Dataset[_], name: String) = {
  df.withColumn(
    // split the column passed in as `name`, not a column literally called "name"
    s"${name}_items", split(col(name), "_")
  ).withColumn(
    name, col(s"${name}_items").getItem(0)
  ).withColumn(
    s"${name}_1", col(s"${name}_items").getItem(1)
  ).drop(s"${name}_items")
}
Note that I assume you do not need the intermediate items column, so I drop it. Also note that because the variable is immediately followed by _ inside the s"" string, you need to wrap it in {}: $name_items would be read as a variable called name_items. Where the variable is not followed by an identifier character, $name alone is enough.
You can then apply this to every column with a fold:
// columnsToExpand holds the names of the columns to split
val result = columnsToExpand.foldLeft(df)(
  (acc, next) => splitMyCol(acc, next)
)
pyspark solution:
import pyspark.sql.functions as F
df1=sqlContext.createDataFrame([("1_10","2_20","3_30"),("7_70","8_80","9_90")]).toDF("c1","c2","c3")
expr = [F.split(coln,"_") for coln in df1.columns]
df2=df1.select(*expr)
#%%
df3= df2.withColumn("clctn",F.flatten(F.array(df2.columns)))
#%% assuming all columns will have data in the same format x_y
arr_size = len(df1.columns)*2
df_fin= df3.select([F.expr("clctn["+str(x)+"]").alias("c"+str(x//2)+'_'+str(x%2)) for x in range(arr_size)])
Results:
+----+----+----+----+----+----+
|c0_0|c0_1|c1_0|c1_1|c2_0|c2_1|
+----+----+----+----+----+----+
| 1| 10| 2| 20| 3| 30|
| 7| 70| 8| 80| 9| 90|
+----+----+----+----+----+----+
Try to use select instead of foldLeft for better performance, since chaining withColumn calls through foldLeft may take longer than a single select.
Check this post: foldLeft, select
val expr = df
  .columns
  .flatMap(c => Seq(
    split(col(c),"_")(0).as(s"${c}_1"),
    split(col(c),"_")(1).as(s"${c}_2")
  ))
  .toSeq
Result
df.select(expr:_*).show(false)
+----+----+----+----+----+----+
|c1_1|c1_2|c2_1|c2_2|c3_1|c3_2|
+----+----+----+----+----+----+
|1 |10 |2 |20 |3 |30 |
|7 |70 |8 |80 |9 |90 |
+----+----+----+----+----+----+
You can do it like this:
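var df=Seq(("1_10","2_20","3_30"),("7_70","8_80","9_90")).toDF("c1","c2","c3")
for (cl <- df.columns) {
  // keep the first part in a temporary column, the second part in a new column
  df=df.withColumn(cl+"_temp",split(df.col(cl),"_")(0))
  df=df.withColumn(cl+"_"+cl.substring(1),split(df.col(cl),"_")(1))
  // overwrite the original column with the first part and drop the temporary one
  df=df.withColumn(cl,df.col(cl+"_temp")).drop(cl+"_temp")
}
df.show(false)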
//Sample output
+---+---+---+----+----+----+
|c1 |c2 |c3 |c1_1|c2_2|c3_3|
+---+---+---+----+----+----+
|1 |2 |3 |10 |20 |30 |
|7 |8 |9 |70 |80 |90 |
+---+---+---+----+----+----+

Add a new Column in Spark DataFrame which contains the sum of all values of one column-Scala/Spark

This is a snapshot taken after adding a column, but the column does not contain the sum of all values of one column.
I am trying to add a column to the DataFrame that contains the sum of all values of one column of the same DataFrame.
For example:
In the picture there are the columns UserID, MovieID, Rating, UnixTimeStamp.
Now I want to add a column named Sum that contains the sum of all values of the Rating column.
I have a Ratings DataFrame with the columns UserID, MovieID, Rating, UnixTimeStamp:
+------+-------+------+-------------+
|UserID|MovieID|Rating|UnixTimeStamp|
+------+-------+------+-------------+
| 196| 242| 3| 881250949|
| 186| 302| 3| 891717742|
| 22| 377| 1| 878887116|
| 244| 51| 2| 880606923|
| 166| 346| 1| 886397596|
+------+-------+------+-------------+
only showing top 5 rows
I have to calculate wa_rating and store it in a DataFrame:
wa_rating = (rating > 3) / total ratings
Please help me build the wa_rating DataFrame, which contains a new column with that value, using Scala Spark.
Check this out:
scala> val df = Seq((196,242,3,881250949),(186,302,3,891717742),(22,377,1,878887116),(244,51,2,880606923),(166,346,1,886397596)).toDF("userid","movieid","rating","unixtimestamp")
df: org.apache.spark.sql.DataFrame = [userid: int, movieid: int ... 2 more fields]
scala> df.show(false)
+------+-------+------+-------------+
|userid|movieid|rating|unixtimestamp|
+------+-------+------+-------------+
|196 |242 |3 |881250949 |
|186 |302 |3 |891717742 |
|22 |377 |1 |878887116 |
|244 |51 |2 |880606923 |
|166 |346 |1 |886397596 |
+------+-------+------+-------------+
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val df2 = df.withColumn("total_rating",sum('rating).over())
df2: org.apache.spark.sql.DataFrame = [userid: int, movieid: int ... 3 more fields]
scala> df2.show(false)
19/01/23 08:38:46 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-------+------+-------------+------------+
|userid|movieid|rating|unixtimestamp|total_rating|
+------+-------+------+-------------+------------+
|22 |377 |1 |878887116 |10 |
|244 |51 |2 |880606923 |10 |
|166 |346 |1 |886397596 |10 |
|196 |242 |3 |881250949 |10 |
|186 |302 |3 |891717742 |10 |
+------+-------+------+-------------+------------+
scala> df2.withColumn("wa_rating",coalesce( when('rating >= 3,'rating),lit(0))/'total_rating).show(false)
19/01/23 08:47:49 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-------+------+-------------+------------+---------+
|userid|movieid|rating|unixtimestamp|total_rating|wa_rating|
+------+-------+------+-------------+------------+---------+
|22 |377 |1 |878887116 |10 |0.0 |
|244 |51 |2 |880606923 |10 |0.0 |
|166 |346 |1 |886397596 |10 |0.0 |
|196 |242 |3 |881250949 |10 |0.3 |
|186 |302 |3 |891717742 |10 |0.3 |
+------+-------+------+-------------+------------+---------+
scala>
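The WindowExec warning in the session above appears because over() without a partition moves every row to a single partition. One possible way around it, sketched here as an assumption rather than as part of the answer, is to compute the total with an ordinary aggregation and attach the resulting one-row DataFrame with crossJoin. The names ratings, total and withTotal are illustrative. The wa_rating expression keeps the rating >= 3 condition used in the answer; the question itself says rating > 3, so adjust the comparison if that is what you need.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, sum, when}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("total-rating").getOrCreate()
import spark.implicits._

val ratings = Seq(
  (196, 242, 3, 881250949),
  (186, 302, 3, 891717742),
  (22, 377, 1, 878887116),
  (244, 51, 2, 880606923),
  (166, 346, 1, 886397596)
).toDF("userid", "movieid", "rating", "unixtimestamp")

// One-row DataFrame holding the grand total, computed as a normal distributed aggregation.
val total = ratings.agg(sum("rating").as("total_rating"))

// Attach that single row to every rating row.
val withTotal = ratings.crossJoin(total)

// Ratings below 3 contribute 0; the division yields a double.
val result = withTotal.withColumn(
  "wa_rating",
  when(col("rating") >= 3, col("rating")).otherwise(lit(0)) / col("total_rating")
)
result.show(false)
```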

Spark: Extract domain from email address in dataframe

I am having difficulty extracting email domains. I have the following DataFrame:
+---+----------------+
|id |email |
+---+----------------+
|1 |ii#koko.com |
|2 |lol#fsa.org |
|3 |kokojambo#mon.eu|
+---+----------------+
Now I want to add a new field containing the domain, so that I get:
+---+----------------+------+
|id |email |domain|
+---+----------------+------+
|1 |ii#koko.com |koko |
|2 |lol#fsa.org |fsa |
|3 |kokojambo#mon.eu|mon |
+---+----------------+------+
I tried to do something like this:
val test = df_short.withColumn("email", split($"email", "#."))
But I got the wrong output. Can anybody point me in the right direction?
You can simply use the built-in regexp_extract function to get the domain name from the email address.
//create an example dataframe
val df = Seq((1, "ii#koko.com"),
(2, "lol#fsa.org"),
(3, "kokojambo#mon.eu"))
.toDF("id", "email")
//original dataframe
df.show(false)
//output
// +---+----------------+
// |id |email |
// +---+----------------+
// |1 |ii#koko.com |
// |2 |lol#fsa.org |
// |3 |kokojambo#mon.eu|
// +---+----------------+
//using regex get the domain name
df.withColumn("domain",
regexp_extract($"email", "(?<=#)[^.]+(?=\\.)", 0))
.show(false)
//output
// +---+----------------+------+
// |id |email |domain|
// +---+----------------+------+
// |1 |ii#koko.com |koko |
// |2 |lol#fsa.org |fsa |
// |3 |kokojambo#mon.eu|mon |
// +---+----------------+------+
You can do it like this:
import org.apache.spark.sql.functions._
df.withColumn("domain", split(split(df.col("email"),"#")(1),"\\.")(0)).show
Sample Input:
+---------------+
| email|
+---------------+
|manoj#gmail.com|
| abc#ac.in|
+---------------+
Sample Output:
+---------------+------+
| email|domain|
+---------------+------+
|manoj#gmail.com| gmail|
| abc#ac.in| ac|
+---------------+------+

Filtering out rows of a table based on a column

I am trying to filter out table rows based on a column value.
I have a dataframe:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |1 |
|3 |0 |
|4 |1 |
|4 |0 |
|4 |0 |
+---+-----+
I want to create a new dataframe deleting all rows with value!=0:
+---+-----+
|id |value|
+---+-----+
|3 |0 |
|3 |0 |
|4 |0 |
|4 |0 |
+---+-----+
I figured the syntax should be something like this but couldn't get it right:
val newDataFrame = OldDataFrame.filter($"value"==0)
The correct way is as follows. You just forgot one = sign:
val newDataFrame = OldDataFrame.filter($"value" === 0)
There are various ways to do the filtering:
val newDataFrame = OldDataFrame.filter($"value"===0)
val newDataFrame = OldDataFrame.filter(OldDataFrame("value") === 0)
val newDataFrame = OldDataFrame.filter("value = 0")
You can also use the where function instead of filter.
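For illustration, a short sketch of the where form on the data from the question. The local SparkSession and the names viaWhere and viaSql are assumptions for the example. Note that the string variant is a SQL expression, so it uses SQL's = rather than the Column operator ===.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("where-filter").getOrCreate()
import spark.implicits._

val OldDataFrame = Seq((3, 0), (3, 1), (3, 0), (4, 1), (4, 0), (4, 0)).toDF("id", "value")

// Column-based condition: === builds a Column expression.
val viaWhere = OldDataFrame.where($"value" === 0)

// SQL-string condition: parsed as a SQL expression.
val viaSql = OldDataFrame.where("value = 0")

viaWhere.show(false)
```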