Outer join two Datasets (not DataFrames) in Spark Structured Streaming - scala

I have some code that joins two streaming DataFrames and outputs to console.
import org.apache.spark.sql.functions.expr

val dataFrame1 = df1Input.withWatermark("timestamp", "40 seconds").as("A")
val dataFrame2 = df2Input.withWatermark("timestamp", "40 seconds").as("B")
val finalDF: DataFrame = dataFrame1.join(dataFrame2,
  expr("A.id = B.id" +
    " AND " +
    "B.timestamp >= A.timestamp " +
    " AND " +
    "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")
What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.
So what I tried was pretty straightforward:
val finalDS: Dataset[(A,B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr("A.id = B.id" +
    " AND " +
    "B.timestamp >= A.timestamp " +
    " AND " +
    "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")
However, this gives the following error:
org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;
As you can see, the join code hasn't changed, so there is a watermark on both sides and a range condition. The only change was to use the Dataset API instead of DataFrame.
Also, it is fine when I use inner join:
val finalDS: Dataset[(A,B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr("A.id = B.id" +
    " AND " +
    "B.timestamp >= A.timestamp " +
    " AND " +
    "B.timestamp <= A.timestamp + interval 1 hour"))
Does anyone know how this can happen?

When you use the joinWith method instead of join, you rely on a different implementation, and it seems that this implementation does not support leftOuter joins for streaming Datasets.
You can check the "Outer Joins with Watermarking" section of the official documentation: it uses join, not joinWith. Note that the result type will be DataFrame, which means you will most likely have to map the fields manually:
val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
  expr("A.key = B.key" +
    " AND " +
    "B.timestamp >= A.timestamp " +
    " AND " +
    "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter").select(/* useful fields */).as[C]

If you are here to understand why this exception
org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;
still appears even though you have added a watermark to the join and Spark 3 already supports stream-stream joins: you have probably added the watermark AFTER the join, but Spark wants you to add the watermark BEFORE the join, on each stream.


for loop alternative in scala? (Improve performance)

I'm new to Scala. My requirement is to delete the records matching a particular column's values from almost 100 tables. I read the data from a CSV (my source), selected that column, and converted it to a List.
val csvDF = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("inferSchema", true).option("escape", "\"").option("multiline", "true").option("quotes", "").load(inputPath)
val bad_records = csvDF.select("corrput_id").collect().map(_ (0)).toList
Then I read the metadata from the Postgres schema to get the list of all tables. Here I wrote two for loops, which work fine, but the performance is very bad. How can I improve this?
val query = "(select table_name from information_schema.tables where table_schema = '" + db + "' and table_name not in " + excludetables + ") temp "
val tablesdf = spark.read.jdbc(jdbcUrl, table = query, connectionProperties)
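// The metadata query above returns a single column, table_name
val tablelist = tablesdf.select($"table_name").collect().map(_(0)).toList
for (i <- tablelist) {
  val s2 = dbconnection.createStatement()
  for (j <- bad_records) {
    // One DELETE per (table, id) pair: this is the slow part
    s2.execute("delete from " + db + "." + i + " where corrput_id = '" + j + "' ")
  }
}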
Thanks in advance
If you want to improve performance, I think you should first consider optimizing your queries. Executing one query per row WILL hurt your performance; something like
" where corrput_id IN " + bad_records.map(str => s" '$str' ").mkString("(", ",", ")")
would be better. Second, why not just use the Spark APIs? Calling collect on a DataFrame and then processing the result in a single thread is a bit like blocking on a Future: you are not using the power available to you. Spark is made for this kind of work and can, I believe, do it efficiently.
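The IN-clause idea above can be sketched as follows. This is only a minimal illustration in Java, not the asker's code: the class and method names (BatchDelete, buildDelete) and the table name mydb.some_table are hypothetical, and the sketch emits ? placeholders rather than inlined values, on the assumption that the ids would then be bound through a PreparedStatement instead of being concatenated into the SQL text.

```java
import java.util.Collections;

public class BatchDelete {
    // Builds a single DELETE statement with one "?" placeholder per id,
    // so one round trip per table replaces one round trip per (table, id) pair.
    // Table and column names are inserted as-is and must come from a trusted source.
    public static String buildDelete(String qualifiedTable, String column, int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("idCount must be positive");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "DELETE FROM " + qualifiedTable
                + " WHERE " + column + " IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // Hypothetical table name; three ids to delete.
        System.out.println(buildDelete("mydb.some_table", "corrput_id", 3));
        // prints: DELETE FROM mydb.some_table WHERE corrput_id IN (?, ?, ?)
    }
}
```

With very long id lists it may be worth splitting them into chunks, since databases and drivers can limit statement size or parameter count.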

JPA and Sysdate issue - forcefully accepting Double data types

I am using Spring Data JPA and developed the query below, which dynamically takes day offsets and fetches data, but it looks like its arguments have to be of the Double data type. Is there any reason why it takes a double?
@Query("SELECT new com.XXX.SomeDTODto(p.visitDate, ..........) "
+ "FROM PatientData p "
+ "INNER JOIN ................. "
+ "INNER JOIN ................."
+ "INNER JOIN .................. "
+ "INNER ..... "
+ "WHERE (TRUNC (p.visitDate) >= TRUNC (SYSDATE + :startDay) AND TRUNC (p.visitDate) <= TRUNC (SYSDATE + :endDay))")
Page<SomeDTODto> findByvisitDateBetweenDays(@Param("startDay") Double startDay,
@Param("endDay") Double endDay, Pageable pageable);
I am getting the error below; the functional test cases run against an H2 DB and fail:
here TRUNC(patient0_.patient_dt)>=TRUNC(SYSDATE+?) and TRUNC(patient0_.patient_dt)<=TRUNC(SYSDATE+?) order by patient0_.patient_dt asc limit ? [50004-196]
You can use the condition below, which works fine for me; an Integer is also cast to Double:
"WHERE (TRUNC (p.visitDate) >= TRUNC (SYSDATE + CAST(:startDay AS double) + 0) AND TRUNC (p.visitDate) <= TRUNC (SYSDATE + CAST(:endDay AS double) + 0))")

multiple use of expression via jpql alias keyword

I'm using Spring Data with a PostgreSQL server and I want to perform some range queries on GPS data. That is, given a coordinate, I compute on the fly the distance from each entry to the given point and check it against a certain range.
Since I also want to order my data by distance and retrieve the actual distance as well, in SQL I would use the AS keyword to compute the expression only once and then reuse this auxiliary expression in the WHERE and ORDER BY parts.
However, so far I haven't figured out how to do this in JPQL. My query should do something like this:
SELECT NEW Result(p, <distance-expression>) FROM MyModel p where <distance-expression> <= :rangeParam order by <distance-expression>
However, I'm afraid that the <distance-expression> will be evaluated more than once for each entry, which would have a negative impact on the runtime/response time of the query.
Is there any way in JPQL to use the AS keyword to avoid the multiple evaluation of <distance-expression>?
Best regards
A native query with an inner view should get the job done. Assuming class Location(id, latitude, longitude) and the Haversine formula for finding distances between points on great circles, the following repository method declaration with a custom native query should be sufficient:
@Query(nativeQuery = true
, value = "SELECT "
+ " r.id "
+ " , r.latitude "
+ " , r.longitude "
+ "FROM "
+ " (SELECT "
+ " l.id AS id "
+ " , l.latitude AS latitude "
+ " , l.longitude AS longitude "
+ " , 2 * 6371 * ASIN(SQRT(POWER(SIN(RADIANS((l.latitude - ?1) / 2)), 2) + COS(RADIANS(l.latitude))*COS(RADIANS(?1))*POWER(SIN(RADIANS((l.longitude - ?2) / 2)), 2))) AS distance "
+ " FROM "
+ " location l) AS r "
+ "WHERE "
+ " r.distance < ?3")
List<Location> findAllByProximity(BigDecimal latitude
, BigDecimal longitude
, BigDecimal distance);
A sample is available on GitHub (metric units assumed).
Note: The reason behind using a native query in the example as opposed to JPQL is the lack of support for trigonometric functions in JPQL. In cases where the expression is simpler and can be coded using native JPQL functions, the native query can be replaced with a JPA query.
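For readers who want to check the distance expression outside the database, here is a minimal Java sketch of the same Haversine computation as in the native query above. The class and method names (Haversine, distanceKm) are hypothetical; the Earth radius of 6371 km is taken from the query, and the clamp to 1.0 is an added guard against floating-point rounding, not something the SQL does.

```java
public class Haversine {
    // Mean Earth radius in kilometres, the same constant as in the SQL above.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in km between two points given in decimal degrees.
    public static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat1 - lat2);
        double dLon = Math.toRadians(lon1 - lon2);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        // Clamp to 1.0 so rounding errors cannot push asin outside its domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    public static void main(String[] args) {
        // A quarter of the equator: from (0, 0) to (0, 90).
        System.out.println(distanceKm(0, 0, 0, 90));
    }
}
```

Computing the value once in an inner query, as the answer does, lets the outer WHERE clause (and an ORDER BY, if one is added) refer to the distance alias instead of repeating the expression.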

concat columns by joining multiple DataFrames

I have multiple DataFrames and need to concatenate the addresses and zip codes based on a condition. I actually have a SQL query that I need to convert to DataFrame joins.
I have written a UDF that works fine for concatenating multiple columns into a single column:
val getConcatenated = udf( (first: String, second: String,third: String,fourth: String,five: String,six: String) => { first + "," + second + "," +third + "," +fourth + "," +five + "," +six } )
MySQL query:
CONCAT(al.Address1,',',al.Address2,',',al.Zip) AS AtAddress,
CONCAT(rl.Address1,',',rl.Address2,',',rl.Zip) AS RtAddress,
CONCAT(d.Address1,',',d.Address2,',',d.Zip) AS DAddress,
CONCAT(s.Address1,',',s.Address2,',',s.Zip) AS SAGddress,
CONCAT(vl.Address1,',',vl.Address2,',',vl.Zip) AS VAddress,
CONCAT(sg.Address1,',',sg.Address2,',',sg.Zip) AS SAGGddress
si s inner join
at a on s.cid = a.cid and s.cid =a.cid
inner join De d on s.cid = d.cid AND d.aid = a.aid
inner join SGrpM sgm on s.cid = sgm.cid and s.sid =sgm.sid and sgm.status=1
inner join SeG sg on sgm.cid =sg.cid and sgm.gid =sg.gid
inner join bd bu on s.cid = bu.cid and s.sid =bu.sid
inner join locas al on a.ALId = al.lid
inner join locas rl on a.RLId = rl.lid
inner join locas vl on a.VLId = vl.lid
I am facing an issue when joining the DataFrames: the result gives me null values.
val DS = DS_SI.join(at,Seq("cid","sid"),"inner").join(DS_DE,Seq("cid","aid"),"inner") .join(DS_SGrpM,Seq("cid","sid"),"inner").join(DS_SG,Seq("cid","gid"),"inner") .join(at,Seq("cid","sid"),"inner")
.join(DS_BD,Seq("cid","sid"),"inner").join(DS_LOCAS("ALId") <=> DS_LOCATION("lid") && at("RLId") <=> DS_LOCAS("lid")&& at("VLId") <=> DS_LOCAS("lid"),"inner")
I am trying to join my DataFrames as shown above, which does not give me proper results, and then I want to concatenate by adding the column.
Can anyone tell me how to achieve this effectively, and whether I am joining the DataFrames correctly or whether there is a better approach?
You can use concat_ws(separator, columns_to_concat).
import org.apache.spark.sql.functions._
df.withColumn("title", concat_ws(", ", DS_DE("Address1"), DS_DE("Address2"), DS_DE("Zip")))

JPQL "DISTINCT" returns only one result

I am confused by DISTINCT in JPQL. I have two JPQL queries identical except for "DISTINCT" in one of them:
String getObjectsForFlow =
"SELECT " +
" se.componentID " +
"FROM " +
" StatisticsEvent se " +
"WHERE " +
" se.serverID IS NOT NULL " +
" AND se.flowID = :uuid " +
" AND se.componentID IS NOT NULL " +
"ORDER BY " +
" se.timeStamp desc ";
String getObjectsForFlowDistinct =
"SELECT DISTINCT " +
" se.componentID " +
"FROM " +
" StatisticsEvent se " +
"WHERE " +
" se.serverID IS NOT NULL " +
" AND se.flowID = :uuid " +
" AND se.componentID IS NOT NULL " +
"ORDER BY " +
" se.timeStamp desc ";
I run a little code to get the results from each query and dump them to stdout, and I get many rows with some duplicates for non-distinct, but for distinct I get only one row which is part of the non-distinct list.
::: 01e2e915-35c1-6cf0-9d0e-14109fdb7235
::: 01e2e915-35c1-6cf0-9d0e-14109fdb7235
::: 01e2e915-35d9-afe0-9d0e-14109fdb7235
::: 01e2e915-35d9-afe0-9d0e-14109fdb7235
::: 01e2e915-35bd-c370-9d0e-14109fdb7235
::: 01e2e915-35bd-c370-9d0e-14109fdb7235
::: 01e2e915-35aa-1460-9d0e-14109fdb7235
::: 01e2e915-35d1-2460-9d0e-14109fdb7235
::: 01e2e915-35e1-7810-9d0e-14109fdb7235
::: 01e2e915-35e1-7810-9d0e-14109fdb7235
::: 01e2e915-35d0-12f0-9d0e-14109fdb7235
::: 01e2e915-35b0-cb20-9d0e-14109fdb7235
::: 01e2e915-35a8-66b0-9d0e-14109fdb7235
::: 01e2e915-35a8-66b0-9d0e-14109fdb7235
::: 01e2e915-35e2-6270-9d0e-14109fdb7235
::: 01e2e915-357f-33d0-9d0e-14109fdb7235
::: 01e2e915-35e2-6270-9d0e-14109fdb7235
Where are the other entries? I would expect a DISTINCT list containing eleven (I think) entries.
Double-check the equals() method on your StatisticsEvent entity class. Perhaps those semantically different values compare as equal when equals() is called, which would produce this behavior.
The problem was the "ORDER BY se.timeStamp" clause. To fulfill the request, JPQL added the ORDER BY field to the SELECT DISTINCT clause.
This is like a border case in the interplay between JPQL and SQL. The JPQL syntax clearly applies the DISTINCT modifier only to se.componentID, but when translated into SQL the ORDER BY field gets inserted.
I am surprised that the ORDER BY field had to be selected at all. Some databases can return a data set ORDERed by a field not in the SELECTion. Oracle can do so. My underlying database is Derby -- could this be a limitation in Derby?
Oracle does not support SELECT DISTINCT with an order by unless the order by columns are in the SELECT. Not sure if any databases do. It will work in Oracle if the DISTINCT is not required (does not run because rows are unique), but if it needs to run you will get an error.
You will get, "ORA-01791: not a SELECTed expression"
If you are using EclipseLink, this functionality is controlled by the DatabasePlatform method,
You can extend your platform to return false if your database does not require this.
Still, I don't see how adding the TIMESTAMP will change the query results?
Both queries are incorrect JPQL queries, because the ORDER BY clause refers to an item that is not in the select list. The JPA 2.0 specification contains an example that matches this case:
The following two queries are not legal because the orderby_item is
not reflected in the SELECT clause of the query.
SELECT p.product_name
FROM Order o JOIN o.lineItems l JOIN l.product p JOIN o.customer c
WHERE c.lastname = 'Smith' AND c.firstname = 'John'
ORDER BY p.price
SELECT p.product_name
FROM Order o, IN(o.lineItems) l JOIN o.customer c
WHERE c.lastname = 'Smith' AND c.firstname = 'John'
ORDER BY o.quantity
Of course, it would be nicer if the implementation could give a clear error message instead of trying to guess the expected result of an incorrect query.