Kapacitor Join using 'on' Property and Single Dimension - kapacitor

I'm joining streams in Kapacitor using the on property. The join only seems to work if one of the streams has multiple groupBy dimensions, even if only one dimension is actually needed. Why is that?
For example, in the code below, the join returns nothing if floor is removed from .groupBy('building', 'floor'). Why doesn't the join work with building alone?
var building = stream
    |from()
        .measurement('building_power')
        .groupBy('building')

var floor = stream
    |from()
        .measurement('floor_power')
        .groupBy('building', 'floor')

building
    |join(floor)
        .as('building', 'floor')
        .on('building')

Related

How to use OPTIMIZE ZORDER BY in Databricks

I have two dataframes (from a Delta Lake table), sd1 and sd2, that do a left join via an id column:
%sql
select
  a.columnA,
  b.columnB
from sd1 a
left outer join sd2 b
  on a.id = b.id
The problem is that my query takes a long time. Looking for ways to improve the results, I found an OPTIMIZE ZORDER BY YouTube video;
according to the video it seems to be useful for ordering columns that are going to be part of the where condition.
But since the two dataframes use the id column in the join condition, would it be worthwhile to order that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
The logic that follows in my head is that if we order that column first, then it will take less time to look up the rows to make the match. Is this correct?
Thanks in advance
OPTIMIZE ZORDER may help a bit by placing related data together, but its usefulness may depend on the data type used for the ID column. OPTIMIZE ZORDER relies on the data skipping functionality, which just gives you min & max statistics per file and may not be useful when your joins cover big ranges of values.
You can also tune file sizes, to avoid scanning too many smaller files.
But from my personal experience, for joins, bloom filters give better performance because they allow files to be skipped more efficiently than data skipping alone. Just build a bloom filter on the ID column...
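A minimal sketch of both techniques, assuming Databricks Runtime with the sd1 Delta table at a hypothetical path; the path and the fpp/numItems options are illustrative, and CREATE BLOOMFILTER INDEX is Databricks-specific:

import org.apache.spark.sql.SparkSession

// In a Databricks notebook a SparkSession already exists; getOrCreate just returns it.
val spark = SparkSession.builder().getOrCreate()

// Hypothetical location of the sd1 Delta table.
val sd1DeltaTablePath = "/mnt/delta/sd1"

// Co-locate rows with similar id values so data skipping can prune whole files.
spark.sql(s"OPTIMIZE delta.`${sd1DeltaTablePath}` ZORDER BY (id)")

// Bloom filter index on the join column: lets the engine skip files that
// cannot contain a given id (the OPTIONS values are placeholders).
spark.sql(
  s"""CREATE BLOOMFILTER INDEX ON TABLE delta.`${sd1DeltaTablePath}`
     |FOR COLUMNS(id OPTIONS (fpp=0.1, numItems=50000000))""".stripMargin)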

Data Flow left outer join results in empty data set when executed, but not when debugging

We have a left outer join configured between two data sets. When executed, the join shows no results, although when debugging it does. Both datasets in the join contain data and the condition is fulfilled; in any case, with a left outer join I would expect at least the content of the first data set.
The data flow pipeline contains two similar joins, and both show the same behavior.
The datasets involved in the joins contain between 20K and 60K records. However, the flow also loads a couple of datasets of around 1 million records each. We would expect some error if this were related to memory, though...

KSQL Join streams with condition on struct field

I have two streams, each defined from a topic on which JSON messages a bit like this are published:
{"payload": {"some_id": "123"}}
Their corresponding streams are defined like this:
CREATE STREAM mystream
  (payload STRUCT<some_id VARCHAR>)
  WITH (kafka_topic='mytopic', value_format='JSON');
When I try to JOIN the two streams together:
SELECT
s.payload->some_id,
o.payload->other_id
FROM mystream s
LEFT JOIN otherstream o ON s.payload->some_id = o.payload->other_id;
I get the following error:
Invalid comparison expression 'S.PAYLOAD->SOME_ID'
in join '(S.PAYLOAD->SOME_ID = O.PAYLOAD->OTHER_ID)'.
Joins must only contain a field comparison.
Is it not possible to join two streams based on a struct field? Do I first need to publish a stream that flattens each source stream before I can perform the JOIN?
Correct, this is not currently possible. Feel free to upvote the issue tracking it here: https://github.com/confluentinc/ksql/issues/4051
As you say, the workaround is to flatten it in another stream first and then join it.

Difference between sc.broadcast and broadcast function in spark sql

I have used sc.broadcast for lookup files to improve the performance.
I also came to know there is a function called broadcast in Spark SQL Functions.
What is the difference between the two?
Which one should I use for broadcasting reference/lookup tables?
Short answer:
1) The org.apache.spark.sql.functions.broadcast() function is a user-supplied, explicit hint for a given SQL join.
2) sc.broadcast is for broadcasting a read-only shared variable.
More details about the broadcast function (#1):
Here is the Scala doc from sql/execution/SparkStrategies.scala, which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the ...
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there are no joining keys, join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for inner join
BroadcastNestedLoopJoin
The method below controls the behavior based on the size we set for spark.sql.autoBroadcastJoinThreshold; by default it is 10 MB.
Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.
/**
 * Matches a plan whose output should be small enough to be used in broadcast join.
 */
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.statistics.isBroadcastable ||
    plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
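A minimal sketch of the explicit hint (#1); the DataFrame names and sample rows are illustrative, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: a large fact table and a small lookup table.
val factDf = Seq((1, 100.0), (2, 250.0), (3, 75.0)).toDF("id", "amount")
val lookupDf = Seq((1, "retail"), (2, "wholesale")).toDF("id", "channel")

// Explicit hint: ship the small side to every executor so the large side is
// not shuffled, regardless of spark.sql.autoBroadcastJoinThreshold.
val joined = factDf.join(broadcast(lookupDf), Seq("id"), "left_outer")
joined.explain() // the plan should show a BroadcastHashJoin on the hinted side
joined.show()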
Note: these configurations may be deprecated in coming versions of Spark.
If you want to achieve a broadcast join in Spark SQL you should use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will:
Mark the given relation for broadcasting.
Adjust the SQL execution plan.
When the output relation is evaluated, take care of collecting the data, broadcasting it, and applying the correct join mechanism.
SparkContext.broadcast is used to handle local objects and is applicable for use with Spark DataFrames.
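And a minimal sketch of sc.broadcast (#2) for a read-only lookup; the map and RDD contents are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-variable").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Small, read-only lookup built on the driver.
val countryNames = Map("DE" -> "Germany", "FR" -> "France", "IT" -> "Italy")

// One copy is shipped per executor instead of one per task.
val countryNamesBc = sc.broadcast(countryNames)

val orders = sc.parallelize(Seq(("DE", 10), ("FR", 20), ("ES", 5)))

// Tasks read the broadcast value; unknown codes fall back to "unknown".
val resolved = orders.map { case (code, qty) =>
  (countryNamesBc.value.getOrElse(code, "unknown"), qty)
}
resolved.collect().foreach(println)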

Equivalent to left outer join in SPARK

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation which is equivalent to a database inner join.
Spark Scala does support left outer joins. Have a look here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
Usage is quite simple:
rdd1.leftOuterJoin(rdd2)
It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure both RDDs are in the form of (key, value) pairs.
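A minimal sketch of that pair-RDD form, with illustrative data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-left-outer-join").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Both RDDs must be keyed, i.e. RDD[(K, V)].
val rdd1 = sc.parallelize(Seq((1, "alice"), (2, "bob"), (3, "carol")))
val rdd2 = sc.parallelize(Seq((1, "engineering"), (3, "sales")))

// Result type is RDD[(Int, (String, Option[String]))]; unmatched left keys get None.
val joined = rdd1.leftOuterJoin(rdd2)
joined.collect().foreach(println)
// e.g. (2,(bob,None)), (1,(alice,Some(engineering))), (3,(carol,Some(sales)))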
Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.
If you have a stream of type, let's say, Record, and you wish to join two streams of records, then you can do it like this:
var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
As the APIs say, the left and right streams have to be hash partitioned, i.e. you can take some attributes from a Record (or compute it in any other way) to calculate a hash value and convert each stream into a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function. (This is just an example; the key type can be something other than Long as well.)
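A minimal sketch of keying two DStreams and joining them, assuming a simple Record(id, payload) case class and using queueStream as a stand-in for a real source:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable

case class Record(id: Long, payload: String)

val conf = new SparkConf().setAppName("dstream-left-outer-join").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// queueStream stands in for a real source such as Kafka.
val leftQueue = mutable.Queue[RDD[Record]](ssc.sparkContext.parallelize(Seq(Record(1L, "a"), Record(2L, "b"))))
val rightQueue = mutable.Queue[RDD[Record]](ssc.sparkContext.parallelize(Seq(Record(1L, "x"))))

// Key both streams by the attribute used for the join (here the id).
val left: DStream[(Long, Record)] = ssc.queueStream(leftQueue).map(r => (r.id, r))
val right: DStream[(Long, Record)] = ssc.queueStream(rightQueue).map(r => (r.id, r))

// Unmatched left records get None on the right side.
val res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
res.print()

ssc.start()
ssc.awaitTerminationOrTimeout(5000)
ssc.stop(stopSparkContext = true, stopGracefully = false)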
Spark SQL / Data Frame API also supports LEFT/RIGHT/FULL outer joins directly:
https://spark.apache.org/docs/latest/sql-programming-guide.html
Because of this bug (https://issues.apache.org/jira/browse/SPARK-11111), outer joins in Spark prior to 1.6 might be very slow (unless your data sets are really small). Before 1.6 it used a cartesian product followed by filtering; now it uses SortMergeJoin instead.
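For completeness, a minimal sketch of the DataFrame form, with illustrative data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-left-outer-join").master("local[*]").getOrCreate()
import spark.implicits._

val employees = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("dept_id", "name")
val departments = Seq((1, "engineering"), (3, "sales")).toDF("dept_id", "dept_name")

// "left_outer" (or "left") keeps every left-side row; unmatched rows get nulls.
employees.join(departments, Seq("dept_id"), "left_outer").show()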