Spark SQL 1.5.2: left excluding join

Given dataframes df_a and df_b, how can I achieve the same result as left excluding join:
SELECT df_a.*
FROM df_a
LEFT JOIN df_b
ON df_a.id = df_b.id
WHERE df_b.id is NULL
I've tried:
df_a.join(df_b, df_a("id")===df_b("id"), "left")
.select($"df_a.*")
.where(df_b.col("id").isNull)
I get an exception from the above:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()

If you want to do it with DataFrames, try the example below:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
val df2 = sc.parallelize(List("a", "b")).toDF("key2")

df1.join(df2,
    df1.col("key1") <=> df2.col("key2"),  // <=> is null-safe equality
    "left")
  .filter(col("key2").isNull)  // keep only df1 rows with no match in df2
  .show
You would get this output:
+----+----+
|key1|key2|
+----+----+
| c|null|
+----+----+
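For what it's worth, Spark 2.0 and later (not the 1.5.2 from the question) supports a left anti join, which expresses "left excluding" directly and returns only the left-side columns:
// Spark 2.0+ only: anti join keeps the df1 rows that have no match in df2
df1.join(df2, df1("key1") <=> df2("key2"), "left_anti").show
// +----+
// |key1|
// +----+
// |   c|
// +----+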

You can also execute the SQL query itself, keeping it simple:
df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")

val result = sqlContext.sql("""
  SELECT A.*
  FROM TableA A
  LEFT JOIN TableB B
    ON A.id = B.id
  WHERE B.id IS NULL
""")
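Note that on Spark 2.0+, registerTempTable is deprecated in favour of createOrReplaceTempView, and the SparkSession (spark) takes over from sqlContext:
// Spark 2.0+ equivalent of the snippet above
df_a.createOrReplaceTempView("TableA")
df_b.createOrReplaceTempView("TableB")
val result = spark.sql("SELECT A.* FROM TableA A LEFT JOIN TableB B ON A.id = B.id WHERE B.id IS NULL")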

Replacing Spark SQL with Scala API calls

I have a simple LEFT JOIN:
spark.sql(
s"""
|SELECT
| a.*,
| b.company_id AS companyId
|FROM profile_views a
|LEFT JOIN companies_info b
| ON a.memberId = b.member_id
|""".stripMargin
).createOrReplaceTempView("company_views")
How do I replace this with the scala API?
Try the code below; it will work for temp views as well as Hive tables.
val profile_views = spark
  .table("profile_views")
  .as("a")

val companies_info = spark
  .table("companies_info")
  .select($"company_id".as("companyId"), $"member_id".as("memberId"))
  .as("b")

profile_views
  .join(companies_info, Seq("memberId"), "left")
  .createOrReplaceTempView("company_views")
If you already have the data in DataFrames, you can use the code below.
profile_viewsDF.as("a")
  .join(
    companies_infoDF.select($"company_id".as("companyId"), $"member_id".as("memberId")).as("b"),
    Seq("memberId"),
    "left"
  )
  .createOrReplaceTempView("company_views")
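If you would rather mirror the SQL one-to-one, with an explicit join condition instead of renaming member_id up front, here is a sketch using the same hypothetical DataFrames:
profile_viewsDF.as("a")
  .join(companies_infoDF.as("b"), $"a.memberId" === $"b.member_id", "left")
  .select($"a.*", $"b.company_id".as("companyId"))
  .createOrReplaceTempView("company_views")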
Update: temp views can also be read using spark.table(). Please check the code below.
scala> val df = Seq(("srinivas",10)).toDF("name","age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.createTempView("person")
scala> spark.table("person").show
+--------+---+
| name|age|
+--------+---+
|srinivas| 10|
+--------+---+

Searching for strings from column and creating new column in spark dataframe with results

I have a dataframe with several columns. I need to inspect the http links in colD and store the extracted value codes in a new column of the Spark dataframe.
Dataframe :
colA colB ColC colD
A B C a1.abc.com/823659) a1.abc.com/823521)
B C D go.xyz.com/?LinkID=226971 a1.abc.com/823521)
C D E a1.abc.com/?LinkID=226971 go.xyz.com/?LinkID=226975
Required Output:
colA colB ColC colD ColE
A B C a1.abc.com/823659) a1.abc.com/823521) 823659,823521
B C D go.xyz.com/?LinkID=226971 a1.abc.com/823521) 226971,823521
C D E a1.abc.com/?LinkID=226971 go.xyz.com/?LinkID=226975 226971,226975
df.withColumn("colE", regexp_extract(col("colD"), "regex", 0))  // "regex" is a placeholder pattern
I have tried using regexp_extract, but the pattern is not getting matched. Could you please help me get the required output?
Anyway, run this; you can add the other columns. I am not sure of your exact string format, but I assume space-separated links. (Added to the original answer.)
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("A", "a1.abc.com/823659) a1.abc.com/823521 a1.abc.com/9999)"), ("B", "go.xyz.com/?LinkID=226971 a1.abc.com/823521)")).toDF("col1", "col2")
// Split the link string on spaces, then drop any empty entries
val df2 = df.withColumn("col3", split(col("col2"), " "))
val df3 = df2.select($"col1", $"col2", array_except($"col3", array(lit(""))).as("col3"))
// Explode to one row per link so each can be processed individually
val df4 = df3.withColumn("xyz", explode($"col3")).withColumn("leng", length($"xyz"))
// Get values to the right of ".com"
val df5 = df4.withColumn("xyz2", substring_index($"xyz", ".com", -1))
// Strip every non-digit character, leaving just the code
val df6 = df5.withColumn("col4", regexp_replace(df5("xyz2"), "[^0-9]", ""))
val df7 = df6.select($"col1", $"col4")
// Collect the codes back into one list per original row
val dfres = df7.groupBy("col1").agg(collect_list(col("col4")).as("col2"))
dfres.show(false)
This returns, on my data:
+----+----------------------+
|col1|col2 |
+----+----------------------+
|B |[226971, 823521] |
|A |[823659, 823521, 9999]|
+----+----------------------+
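The required output in the question is a comma-separated string rather than an array; assuming the dfres above, one extra concat_ws step produces it:
// Join the collected codes into a single comma-separated string per row
dfres.withColumn("col2", concat_ws(",", $"col2")).show(false)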

Joining multiple dataframes horizontally

I have the following dataframes:
val count: DataFrame = spark.sql(s"select 1, '$database_name', '$table_name', count(*) from $table_name")
Output :
1,stock,T076p,4332
val dist_count: DataFrame = spark.sql(s"select 1, count(*) from (select distinct * from $table_name) t")
Output :
4112 or 4332 (can be the same)
val truecount: DataFrame = spark.sql(s"select 1, count(*) from $table_name where flag = true")
Output :
4330
val Falsecount: DataFrame = spark.sql(s"select 1, count(*) from $table_name where flag = false")
Output :
4332
Question: How do I join the above dataframes to get a resultant dataframe that gives me the output below?
stock,T076p,4332,4332,4330
Here the comma is the column separator.
P.S. I have added the literal 1 to every dataframe so I can join the dataframes (so the 1 itself is not mandatory in the output).
Just check this example; I have mimicked your requirement with dummy dataframes like below.
package com.examples

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object MultiDFJoin {
  def main(args: Array[String]) {
    import org.apache.spark.sql.functions._
    Logger.getLogger("org").setLevel(Level.OFF)
    val spark = SparkSession.builder
      .master("local")
      .appName(this.getClass.getName)
      .getOrCreate()
    import spark.implicits._
    val columns = Array("column1", "column2", "column3", "column4")
    val df1 = Seq((1, "stock", "T076p", 4332)).toDF(columns: _*).as("first")
    df1.show()
    val df2 = Seq((1, 4332)).toDF(columns.slice(0, 2): _*).as("second")
    df2.show()
    val df3 = Seq((1, 4330)).toDF(columns.slice(0, 2): _*).as("third")
    df3.show()
    val df4 = Seq((1, 4332)).toDF(columns.slice(0, 2): _*).as("four")
    df4.show()
    val finalcsv = df1.join(df2, col("first.column1") === col("second.column1"))
      .selectExpr("first.*", "second.column2")
      .join(df3, Seq("column1")).selectExpr("first.*", "third.column2")
      .join(df4, Seq("column1"))
      .selectExpr("first.*", "third.column2", "four.column2")
      .drop("column1") // this column was used just for joining, hence dropped
      .collect.mkString(",")
    print(finalcsv)
  }
}
Result :
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
| 1| stock| T076p| 4332|
+-------+-------+-------+-------+
+-------+-------+
|column1|column2|
+-------+-------+
| 1| 4332|
+-------+-------+
+-------+-------+
|column1|column2|
+-------+-------+
| 1| 4330|
+-------+-------+
+-------+-------+
|column1|column2|
+-------+-------+
| 1| 4332|
+-------+-------+
[stock,T076p,4332,4330,4332]
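As an aside, since each of these aggregate queries returns exactly one row, Spark 2.1+ offers crossJoin, which avoids the dummy 1 column entirely. A sketch against the dummy df1/df3/df4 above (df2's count is dropped by the final select in the original code as well):
// Single-row DataFrames, so a cartesian product is safe and needs no join key
val finalcsv2 = df1.drop("column1")
  .crossJoin(df3.drop("column1"))
  .crossJoin(df4.drop("column1"))
  .collect.mkString(",")   // [stock,T076p,4332,4330,4332]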

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark data frames, df1 and df2. Is there a way to select the output columns dynamically while joining these two dataframes? The definition below outputs all columns from df1 and df2 in the case of an inner join.
def joinDF(df1: DataFrame, df2: DataFrame, joinExprs: Column, joinType: String): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema:
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*), but it does not allow selecting columns from both DFs.
Is there a way to pass selectExpr columns dynamically along with dataframe details that we want to select it from in my def? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr: _*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
| type|account|value|
+------+-------+-----+
| new|current| 7|
|closed| saving| 5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail: _*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String] the column names of the resulting dataframe will be different as compared to using an expression. When there are columns with the same name present, there will be no way to separately select these afterwards.
A slightly modified solution from the one given above: before performing the join, select the required columns from the DataFrames, since fewer columns then take part in the JOIN operation and the overhead is a little lower.
val dfJoinResult = df1.select("column1", "column2").join(df2.select("col1"), joinExpr, joinType)
But remember to also select the columns the join will be performed on: the select runs first, and the join then operates on the resulting data.
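To make this concrete with the df1/df2 from the question's input data (a sketch; note that the join key id must survive both selects, or the join condition cannot be evaluated):
val slimResult = df1.select("id", "type", "account")
  .join(df2.select("id", "value"), Seq("id"), "inner")
  .select("type", "account", "value")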

Merging Dataframes in Spark

I have 2 DataFrames, say A and B. I would like to join them on a key column and create another DataFrame. When the keys match in A and B, I need to take the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
coalesce(dfB.salary, dfA.salary) FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
              ELSE dfA.salary
         END
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
Assuming dfA and dfB have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
import spark.implicits._  // or sqlContext.implicits._ on older versions; needed for the 'emp symbol syntax

val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),  // take B's key when a match exists
    f.coalesce('salB, 'sal).as("sal")   // take B's salary when a match exists
  )
NB: each DataFrame should contain only one row per value of emp.
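A minimal usage sketch with the sample data from the question, assuming the emp/sal column names used above:
val dfA = Seq(("Employee1", 100), ("Employee2", 50), ("Employee3", 200)).toDF("emp", "sal")
val dfB = Seq(("Employee1", 150), ("Employee2", 100), ("Employee4", 300)).toDF("emp", "sal")
// joined.orderBy("emp").show then prints Employee1 -> 150, Employee2 -> 100,
// Employee3 -> 200, Employee4 -> 300, matching DataFrame C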