Equivalent to left outer join in Spark Scala

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation, which is equivalent to a database inner join.

Spark Scala does support left outer joins. Have a look here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
Usage is quite simple:
rdd1.leftOuterJoin(rdd2)

It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure both RDDs consist of (key, value) pairs.
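For example, a minimal sketch with made-up data:
// Both RDDs must consist of (key, value) pairs before calling leftOuterJoin.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")))
val rdd2 = sc.parallelize(Seq((1, "x")))
// The right-hand value is wrapped in Option because a key may have no match.
val joined: org.apache.spark.rdd.RDD[(Int, (String, Option[String]))] = rdd1.leftOuterJoin(rdd2)
// e.g. (1, ("a", Some("x"))), (2, ("b", None))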

Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.
If you have a stream of type, let's say, 'Record', and you wish to join two streams of records, then you can do this like:
var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
As the APIs say, the left and right streams have to be hash partitioned, i.e. you can take some attributes from a Record (or derive it in any other way) to calculate a hash value and convert each stream to a pair DStream. Both left and right will be of type DStream[(Long, Record)] before you call the join function. (This is just an example; the key type can be something other than Long as well.)
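A minimal sketch of that conversion, assuming hypothetical input streams leftRecords and rightRecords of type DStream[Record]:
import org.apache.spark.streaming.dstream.DStream

// Hypothetical Record type; any stable attribute can be hashed to form the key.
case class Record(key: String, payload: String)

// Key both streams to get pair DStreams that leftOuterJoin can work on.
val left: DStream[(Long, Record)] = leftRecords.map(r => (r.key.hashCode.toLong, r))
val right: DStream[(Long, Record)] = rightRecords.map(r => (r.key.hashCode.toLong, r))

val res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)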

The Spark SQL / DataFrame API also supports LEFT/RIGHT/FULL outer joins directly:
https://spark.apache.org/docs/latest/sql-programming-guide.html
Because of this bug (https://issues.apache.org/jira/browse/SPARK-11111), outer joins in Spark prior to 1.6 could be very slow (unless you have really small data sets to join): before 1.6 they were executed as a Cartesian product followed by filtering. Since 1.6, SortMergeJoin is used instead.
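For example, a rough sketch with hypothetical DataFrames dfA and dfB that share an "id" column:
// Rows of dfA with no match get nulls for dfB's columns.
val joined = dfA.join(dfB, Seq("id"), "left_outer")
// Equivalent, with an explicit join expression:
val joined2 = dfA.join(dfB, dfA("id") === dfB("id"), "left_outer")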

Related

Spark Scala join dataframe within a dataframe

I have a requirement where I need to join dataframes A and B, calculate a column, and use that calculated value in another join between the same two dataframes with different join conditions.
e.g.:
DF_Combined = A_DF.join(B_DF,'Join-Condition',"left_outer").withColumn(col1,'value')
After doing the above I need to do the same join again, but using the value calculated in the previous join:
DF_Final=A_DF.join(B_DF,'New join condition',"left_outer").withColumn(col2,DF_Combined.col1*vol1*10)
When I try to do this I get a Cartesian product issue.
You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ...), the resulting dataframe only has columns from A_DF and B_DF. If you want to have the new column, you need to use DF_Combined.
From your question I believe you don't need another join; you have two possible options:
1. When you do the first join, calculate vol1*10 at that point (see the sketch below).
2. After the join, do DF_Combined.withColumn(...).
But please remember: withColumn(name, expr) creates a column with the given name, setting its value to the result of expr. So .withColumn(DF_Combined.col1, vol1*10) does not make sense.
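A rough sketch of option 1, reusing the placeholder names from the question (joinCondition, "value" and "vol1" stand in for whatever the real expressions are):
import org.apache.spark.sql.functions.col

// Compute both derived columns while doing the single left outer join.
val DF_Combined = A_DF
  .join(B_DF, joinCondition, "left_outer")
  .withColumn("col1", col("value"))
  .withColumn("col2", col("col1") * col("vol1") * 10)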

leftOuterJoin throws TableException: Unsupported join type 'LEFT'

I'm trying to run a left outer join on two tables and convert the results to a DataStream.
All the joins I've done before using Flink have been inner joins, and I have always followed the join with a .toRetractStream[MyCaseClass](someQueryConfig). However, with the introduction of null values due to the left join, my understanding from the Flink docs is that I can no longer use case classes, because they don't support null values when converting a table to a DataStream.
So, I'm trying to accomplish this using a POJO. Here is my code:
class EnrichedTaskUpdateJoin(
  val enrichedTaskId: String,
  val enrichedTaskJobId: String,
  val enrichedTaskJobDate: String,
  val enrichedTaskJobMetadata: Json,
  val enrichedTaskStartedAt: String,
  val enrichedTaskTaskMetadata: Json,
  val taskUpdateMetadata: Json = Json.Null) {}
val qConfig = tableEnv.queryConfig
qConfig.withIdleStateRetentionTime(IDLE_STATE_RETENTION_TIME)
val updatedTasksUpsertTable = enrichedTasksUpsertTable
  .leftOuterJoin(taskUpdatesUpsertTable, 'enrichedTaskId === 'taskUpdateId)
  .select(
    'enrichedTaskId,
    'enrichedTaskJobId,
    'enrichedTaskJobDate,
    'enrichedTaskJobMetadata,
    'enrichedTaskStartedAt,
    'enrichedTaskTaskMetadata,
    'taskUpdateMetadata
  )
val updatedEnrichedTasksStream: KeyedStream[String, String] = updatedTasksUpsertTable
  .toAppendStream[EnrichedTaskUpdateJoin](qConfig)
  .map(toEnrichedTask(_))
  .map(encodeTask(_))
  .keyBy(x => parse(x).getOrElse(Json.Null).hcursor.get[String]("id").getOrElse(""))
This compiles just fine, but when I try to run it, I get org.apache.flink.table.api.TableException: Unsupported join type 'LEFT'. Currently only non-window inner joins with at least one equality predicate are supported. However, according to these docs, it seems like I should be able to run a left join. It also seems worth noting that the error gets thrown from the .toAppendStream[EnrichedTaskUpdateJoin](qConfig).
I thought perhaps the non-window portion of the error implied that my idle state retention time was a problem, so I took the query config out, but got the same error.
Hopefully this has enough context, but if I need to add anything else, please let me know. Also, I'm running Flink 1.5-SNAPSHOT and using Circe for JSON parsing. I'm also quite new to Scala, so it's very possible that this is just some dumb syntax error.
Non-windowed outer joins are not supported in Flink 1.5-SNAPSHOT. As you can see in the link that you have posted, there is no "Streaming" tag next to "Outer Joins". Time-windowed joins (which work on time attributes) are supported in 1.5.
Flink 1.6 will provide LEFT, RIGHT, and FULL outer joins (see also FLINK-5878).
By the way, make sure that EnrichedTaskUpdateJoin is really a POJO, because POJOs need a default constructor and, I think, var fields instead of val.
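For illustration only, a rough sketch of what a Flink-friendly POJO version might look like (not verified against Flink's type extractor; @BeanProperty is one way to expose Java-style getters/setters):
import scala.beans.BeanProperty
import io.circe.Json

class EnrichedTaskUpdateJoin(
  @BeanProperty var enrichedTaskId: String,
  @BeanProperty var enrichedTaskJobId: String,
  @BeanProperty var enrichedTaskJobDate: String,
  @BeanProperty var enrichedTaskJobMetadata: Json,
  @BeanProperty var enrichedTaskStartedAt: String,
  @BeanProperty var enrichedTaskTaskMetadata: Json,
  @BeanProperty var taskUpdateMetadata: Json) {
  // Public no-argument constructor, as required for POJO types.
  def this() = this(null, null, null, null, null, null, Json.Null)
}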

Is join and except the best way for taking rows by DataFrame priority?

I am using Spark 1.6.0 in Scala; normally I use DataFrames for processing the data.
In this case I need to create a DataFrame from these sources: dfEu, dfFlow and dfLt.
The relationships between those are: dfEu with dfFlow, and dfEu with dfLt.
The logic that needs to be implemented is the following:
Take all rows that join between dfFlow and dfEu
Take all rows that dfFlow has and that I haven't taken in the join above
Take all rows that dfEu has and that I haven't taken in the join above
Take all rows that dfLt has and that are not in dfEu
What is the best way to implement this?
Use DataFrame join and except (roughly as in the sketch below)?
Use a key-value data structure, and insert from low priority to high (to keep the highest priority)?
Or would you advise another strategy?
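For reference, a rough sketch of the join-and-except idea (assuming a hypothetical common key column "id"):
// Rows that join between dfFlow and dfEu.
val flowJoinEu = dfFlow.join(dfEu, dfFlow("id") === dfEu("id"))
// Rows of dfFlow not taken in the join above ("leftsemi" keeps dfFlow's schema, so except is well-typed).
val flowOnly = dfFlow.except(dfFlow.join(dfEu, dfFlow("id") === dfEu("id"), "leftsemi"))
// Rows of dfEu not taken in the join above.
val euOnly = dfEu.except(dfEu.join(dfFlow, dfEu("id") === dfFlow("id"), "leftsemi"))
// Rows of dfLt that are not in dfEu.
val ltOnly = dfLt.except(dfLt.join(dfEu, dfLt("id") === dfEu("id"), "leftsemi"))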
Best regards, and thank you for your time.

Difference between sc.broadcast and the broadcast function in Spark SQL

I have used sc.broadcast for lookup files to improve performance.
I also came to know that there is a function called broadcast in the Spark SQL functions.
What is the difference between the two?
Which one should I use for broadcasting reference/lookup tables?
Short answer:
1) The org.apache.spark.sql.functions.broadcast() function is a user-supplied, explicit hint for a given SQL join.
2) sc.broadcast is for broadcasting a read-only shared variable.
More details about the broadcast function (#1):
Here is the Scaladoc from
sql/execution/SparkStrategies.scala
which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the ...
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there is no joining keys, join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for inner join
BroadcastNestedLoopJoin
The method below controls the behavior based on the size we set via spark.sql.autoBroadcastJoinThreshold; by default it is 10 MB.
Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.
/** Matches a plan whose output should be small enough to be used in broadcast join. */
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.statistics.isBroadcastable ||
    plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
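A rough sketch of tuning that threshold (the spark.conf API shown is for a 2.x SparkSession; on 1.6 the equivalent is sqlContext.setConf):
// Raise the automatic broadcast threshold to roughly 50 MB...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
// ...or disable automatic broadcasting entirely; an explicit broadcast() hint still works.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)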
Note that this configuration may be deprecated in coming versions of Spark.
If you want to achieve a broadcast join in Spark SQL you should use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will:
Mark the given relation for broadcasting.
Adjust the SQL execution plan.
When the output relation is evaluated, it will take care of collecting the data, broadcasting it, and applying the correct join mechanism.
SparkContext.broadcast, on the other hand, is used to handle local read-only objects (for example lookup maps used inside RDD operations or UDFs); it is not the mechanism that marks a DataFrame for a broadcast join.
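To make the contrast concrete, a hedged sketch with hypothetical inputs (largeDF and lookupDF are DataFrames sharing a "key" column; someRDD is an RDD[String]):
import org.apache.spark.sql.functions.broadcast

// 1) SQL-side hint: mark lookupDF for a broadcast hash join.
val joined = largeDF.join(broadcast(lookupDF), Seq("key"))

// 2) sc.broadcast: ship a read-only local object (here a lookup Map) to every
//    executor, typically for use inside RDD operations or UDFs.
val lookupMap = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = someRDD.map(x => lookupMap.value.getOrElse(x, 0))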

How to obtain the symmetric difference between two DataFrames?

In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
Why not the below?
df1.except(df2)
If you are looking for a PySpark solution, you should use subtract() (docs).
Also, unionAll is deprecated in 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to be equal to the original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT a, b, c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
I think it could be more efficient to use a left join and then filter out the nulls:
df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
  .where(col("column_just_present_in_df2").isNull)
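A possible follow-up sketch: applying the same anti-join trick in both directions gives the full symmetric difference (key names are hypothetical, and a lit(true) marker column avoids relying on a column that exists only in df2):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

val keys = Seq("some_join_key", "some_other_join_key")

// Keep only the rows of `left` whose keys have no match in `right`.
def minus(left: DataFrame, right: DataFrame): DataFrame =
  left.join(right.select(keys.map(col): _*).withColumn("_matched", lit(true)), keys, "left")
    .where(col("_matched").isNull)
    .drop("_matched")

// Assumes df1 and df2 share the same schema, as a symmetric difference requires.
val symmetricDiff = minus(df1, df2).union(minus(df2, df1))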