How to obtain the symmetric difference between two DataFrames? - scala

In the SparkSQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for symmetric difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.

You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
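To see that the two formulations agree, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption made only so the example runs without a cluster). The tables df1 and df2 and their contents are hypothetical.

```python
import sqlite3

def symmetric_difference_demo():
    # SQLite stands in for Spark SQL here; df1/df2 are made-up sample tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE df1 (id INTEGER);
        CREATE TABLE df2 (id INTEGER);
        INSERT INTO df1 VALUES (1), (2), (3);
        INSERT INTO df2 VALUES (2), (3), (4);
    """)
    # (df1 EXCEPT df2) UNION (df2 EXCEPT df1)
    two_excepts = conn.execute("""
        SELECT id FROM (SELECT id FROM df1 EXCEPT SELECT id FROM df2)
        UNION
        SELECT id FROM (SELECT id FROM df2 EXCEPT SELECT id FROM df1)
        ORDER BY id
    """).fetchall()
    # (df1 UNION ALL df2) EXCEPT (df1 INTERSECT df2)
    union_minus_intersect = conn.execute("""
        SELECT id FROM (SELECT id FROM df1 UNION ALL SELECT id FROM df2)
        EXCEPT
        SELECT id FROM (SELECT id FROM df1 INTERSECT SELECT id FROM df2)
        ORDER BY id
    """).fetchall()
    conn.close()
    return two_excepts, union_minus_intersect
```

With this sample data both queries return the ids 1 and 4, i.e. the rows that appear in exactly one of the two tables.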

why not the below?
df1.except(df2)

If you are looking for a PySpark solution, you should use subtract().
Also, unionAll is deprecated in 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))

Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to be equal to the original DataFrame, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT t1.a, t1.b, t1.c
FROM tab1 t1
LEFT OUTER JOIN
tab2 t2
ON (
(t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
)
WHERE
COALESCE(t2.a, t2.b, t2.c) IS NULL
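The difference in duplicate handling can be checked with a small sketch. It uses Python's sqlite3 module in place of Spark SQL (an assumption, chosen only to keep the example self-contained), and the two-column tables tab1 and tab2 with their rows are hypothetical.

```python
import sqlite3

def except_vs_anti_join():
    # SQLite stands in for Spark SQL; table contents are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tab1 (a INTEGER, b TEXT);
        CREATE TABLE tab2 (a INTEGER, b TEXT);
        INSERT INTO tab1 VALUES (1, 'x'), (1, 'x'), (2, 'y');
        INSERT INTO tab2 VALUES (2, 'y');
    """)
    # EXCEPT collapses the duplicated (1, 'x') row into one.
    deduplicated = sorted(conn.execute(
        "SELECT a, b FROM tab1 EXCEPT SELECT a, b FROM tab2"
    ).fetchall())
    # The left outer join plus IS NULL filter keeps both copies.
    kept = sorted(conn.execute("""
        SELECT t1.a, t1.b
        FROM tab1 t1
        LEFT OUTER JOIN tab2 t2 ON t1.a = t2.a AND t1.b = t2.b
        WHERE t2.a IS NULL
    """).fetchall())
    conn.close()
    return deduplicated, kept
```

Here EXCEPT yields a single (1, 'x') row while the join yields two. Note that a strict EXCEPT ALL subtracts multiplicities, whereas this kind of anti-join drops every copy of a row that has any match in the second table, and an equality join never matches rows whose compared columns are NULL.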

I think it could be more efficient to use a left join and then keep only the rows where the df2 side is null.
df1.join(df2, Seq("some_join_key", "some_other_join_key"),"left")
.where(col("column_just_present_in_df2").isNull)

How to access Spark nested struct fields from sql (not the DSL)

The following SQL needs the syntax for accessing the fields of a nested struct.
Specifically, it concerns the struct built on the third line:
collect_list(struct( .. ) )
I have put rec.* but that is certainly not the correct way.
select matchMethod, rec.* from
(select first(matchMethod) matchMethod,
collect_list(struct(rawTp,tp,fp,fn,
precision,recall,weight,F1,
truthGrpId,entityId,
tpIds,fpIds, fnIds,truthIds,actuals)) rec
from scoring5
where entityId is not null and truthGrpId is not null
group by truthGrpId
) order by rec.truthGrpId, rec.recall desc
This results in :
org.apache.spark.sql.AnalysisException:
Can only star expand struct data types. Attribute: `ArrayBuffer(rec)`;
Many other ways have been attempted. I have also perused about ten other questions here on SOF, but none address this directly for SQL as opposed to the DSL. Is this at all possible?
I am uncertain whether the message Can only star expand struct data types means that there may be a different syntax to achieve this, or whether Spark SQL simply has a deficiency here.
We are using Spark 2.3.X.
Given the significant research as well as trials of various combinations of syntax, I tend to agree with @user6910411 that the above is not presently supported. It seems there is some help coming along in the form of Spark 2.4: see this answer by Jacek Laskowski:
In any case I found a more straightforward approach using windowing functions as follows:
select * from
(select row_number() over (partition by truthGrpId order by recall desc) rownum,*
from
(select matchMethod, rawTp,tp,fp,fn,
precision,recall,weight,F1,
truthGrpId,entityId,
tpIds,fpIds, fnIds,truthIds,actuals
from scoring5
where entityId is not null and truthGrpId is not null
) order by truthGrpId, recall desc
) where rownum=1 order by truthGrpId
The obvious follow-up here is to dig down deeper into windowing functions and incorporate them as first class citizens into my exploratory work.
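As a reduced illustration of the row_number() pattern above, here is a sketch that runs against SQLite through Python's sqlite3 module rather than Spark (an assumption; it needs SQLite 3.25 or newer for window functions). The scoring table is cut down to three hypothetical columns and four made-up rows.

```python
import sqlite3

def best_row_per_group():
    # SQLite (3.25+) stands in for Spark SQL; the data is made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE scoring (truthGrpId INTEGER, matchMethod TEXT, recall REAL);
        INSERT INTO scoring VALUES
            (1, 'exact', 0.70), (1, 'fuzzy', 0.90),
            (2, 'exact', 0.80), (2, 'fuzzy', 0.60);
    """)
    # Number the rows inside each truthGrpId by descending recall,
    # then keep only the first row of every group.
    rows = conn.execute("""
        SELECT truthGrpId, matchMethod, recall
        FROM (
            SELECT *,
                   row_number() OVER (
                       PARTITION BY truthGrpId ORDER BY recall DESC
                   ) AS rownum
            FROM scoring
        )
        WHERE rownum = 1
        ORDER BY truthGrpId
    """).fetchall()
    conn.close()
    return rows
```

For this data the function returns the 'fuzzy' row for group 1 and the 'exact' row for group 2, the highest-recall row in each group.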

Sort data within a subquery with another subquery?

I am trying to sort the OUN.note column by OUN.outcomeKey, since
the way it is working right now puts the notes in the wrong order (sorting alphabetically). Any idea how to go about this? I've been trying to sort the data using another sub-query within, but I haven't had much luck (I don't have a plethora of experience).
Here's my current query:
SELECT DISTINCT OC.outcomeKey [Outcome Key], OC.outcome [Result],
STUFF((SELECT ','+' '+ OUN.note
FROM
Outcome AS OUT
JOIN OutcomeNote AS OUN
ON OUT.outcomeKey = OUN.outcomeKey
WHERE OUN.outcomeKey = OC.outcomeKey
GROUP BY OUN.note
FOR XML PATH ('')), 1, 1, '') [Outcome Note]
FROM Outcome AS OC
Any help or tips would be greatly appreciated! Also, please let me know if any more info is needed.
You may replace the line
GROUP BY OUN.note
with the line
ORDER BY OUN.outcomeKey
Also, because the concatenation starts with ', ', you may want to use 1, 2, '' as the additional arguments of the STUFF function. Otherwise, the values in your [Outcome note] column always start with a space.
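The effect of the two argument choices can be illustrated with a simplified Python stand-in for STUFF. This is only a sketch: the helper stuff below is hypothetical, it mimics the 1-based start position but ignores the NULL handling of the real T-SQL function, and the note values are made up.

```python
def stuff(text, start, length, replacement):
    """Simplified stand-in for T-SQL STUFF: 1-based start, no NULL handling."""
    begin = start - 1
    return text[:begin] + replacement + text[begin + length:]

# Mimics what the FOR XML PATH('') subquery builds: every note prefixed by ', '.
notes = ["first", "second"]
concatenated = "".join(", " + note for note in notes)

with_leading_space = stuff(concatenated, 1, 1, "")  # removes only the comma
clean = stuff(concatenated, 1, 2, "")               # removes comma and space
```

Removing one character leaves " first, second" with a leading space, while removing two gives "first, second".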
Edit:
By the way, sorting the notes by outcomeKey in the subquery that generates the values for the [Outcome note] column has no effect... since all the notes in each subquery result will have the same outcomeKey value...
But you may sort on any column you want, of course. Perhaps there are other columns in your OutcomeNote table that can serve as a useful sorting column for your outcome notes.
If I misunderstood your question, please provide definitions of the Outcome and OutcomeNote tables, together with a demo population of those tables and the desired/expected query result.
Edit 2:
Starting with SQL Server 2017, Transact-SQL contains a function called STRING_AGG, which seems to be functionally equivalent (more or less) to MySQL's GROUP_CONCAT function. Using this function, your query would become something like this:
SELECT
OUN.outcomeKey [Outcome Key],
OC.outcome [Result],
STRING_AGG(OUN.[Note], ', ') WITHIN GROUP (ORDER BY OUN.outcomeKey) [Outcome Note]
FROM
Outcome AS OC
JOIN OutcomeNote AS OUN ON OUN.outcomeKey = OC.outcomeKey
GROUP BY
OUN.outcomeKey,
OC.outcome;
When using SQL Server 2017 or SQL Azure, this might be a more fitting choice, since it not only makes the query more readable, but also eliminates the use of (way less efficient) XML functions in your query.
I too have used the XML-functionality for field concatenation (the way you use it) intensively in the past, but I noticed a considerable drop in performance of my queries (which sometimes contained up to 10 columns with concatenated data). Since then, I tend to go for recursive common table expressions or scalar UDF with recursion approaches in pre SQL Server 2017 environments.

Equivalent to left outer join in SPARK

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation which is equivalent to a database inner join.
Spark Scala does support left outer join. Have a look here
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD
Usage is quite simple as
rdd1.leftOuterJoin(rdd2)
It is as simple as rdd1.leftOuterJoin(rdd2) but you have to make sure both rdd's are in the form of (key, value) for each element of the rdd's.
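To make the (key, value) requirement and the shape of the result concrete, here is a plain-Python emulation of the join semantics. It is not Spark code: the function left_outer_join is a hypothetical helper, and None plays the role of Scala's None in the Option on the right side.

```python
from collections import defaultdict

def left_outer_join(left, right):
    """Emulate pair-RDD leftOuterJoin on lists of (key, value) tuples."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    joined = []
    for key, value in left:
        matches = right_by_key.get(key)
        if matches:
            # One output pair per matching right-hand value.
            joined.extend((key, (value, match)) for match in matches)
        else:
            # Unmatched left rows are kept, with None on the right side.
            joined.append((key, (value, None)))
    return joined
```

For example, joining [(1, 'a'), (2, 'b')] with [(1, 'x'), (1, 'y')] keeps key 2 with a missing right value and emits key 1 twice, once per match.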
Yes, there is. Have a look at the DStream APIs and they have provided left as well as right outer joins.
If you have a stream of type, let's say, 'Record', and you wish to join two streams of records, then you can do it like this:
var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
As the APIs say, the left and right streams have to be hash partitioned, i.e., you can take some attributes from a Record (or maybe use some other way) to calculate a hash value and convert it to a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function. (It is just an example. The hash type can be some type other than Long as well.)
Spark SQL / Data Frame API also supports LEFT/RIGHT/FULL outer joins directly:
https://spark.apache.org/docs/latest/sql-programming-guide.html
Because of this bug: https://issues.apache.org/jira/browse/SPARK-11111 outer joins in Spark prior to 1.6 might be very slow (unless you have really small data sets to join). Before 1.6 it used a cartesian product followed by filtering. Now it uses SortMergeJoin instead.

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual paging on number of results.
And I'm now faced with the choice of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like": one should always use "Like" when possible where indexes are implemented, because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is a function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is that SQL Server has to evaluate the function for every record it scans.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
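A minimal sketch of this trigger-maintained column, written against SQLite through Python's sqlite3 module rather than SQL Server (an assumption; trigger syntax differs between the two). The table people, the trigger names, and the sample names are all hypothetical.

```python
import sqlite3

def first_letter_lookup(letter):
    # SQLite stands in for SQL Server; schema and data are made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (name TEXT, name_first_char_lower TEXT);
        CREATE INDEX idx_first_char ON people (name_first_char_lower);

        -- Keep the redundant column in sync on insert.
        CREATE TRIGGER people_ai AFTER INSERT ON people
        BEGIN
            UPDATE people
            SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
            WHERE rowid = NEW.rowid;
        END;

        -- Keep it in sync whenever the name itself changes.
        CREATE TRIGGER people_au AFTER UPDATE OF name ON people
        BEGIN
            UPDATE people
            SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
            WHERE rowid = NEW.rowid;
        END;

        INSERT INTO people (name) VALUES ('Alice'), ('adam'), ('Bob');
        UPDATE people SET name = 'Carol' WHERE name = 'Bob';
    """)
    # Plain equality on the indexed helper column; no function call per row.
    rows = conn.execute(
        "SELECT name FROM people WHERE name_first_char_lower = ? ORDER BY name",
        (letter,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

Looking up 'a' finds both Alice and adam regardless of case, and after the rename the former Bob row is found under 'c' instead of 'b', which shows that the update trigger keeps the helper column current.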
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest using the LIKE operator when the searched column is indexed. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)='ABA' OR ... up to 9 OR clauses. The count came to 7301477 records, taking 4 seconds with LEFT and 1 second with LIKE, i.e. where column_name like 'AAA%' OR column_name like 'ABA%' OR ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. Refer to http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
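The scan-versus-seek difference can be observed by asking the engine for its query plan. The sketch below does this with SQLite via Python's sqlite3 module, which is only a stand-in: SQL Server's optimizer and plan output are different, substr replaces LEFT because SQLite has no LEFT function, and the NOCASE collation is there because SQLite's prefix-LIKE optimization depends on it. Table and index names are hypothetical.

```python
import sqlite3

def query_plans():
    # SQLite stands in for SQL Server; only the general idea carries over.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (name TEXT COLLATE NOCASE);
        CREATE INDEX idx_people_name ON people (name);
    """)

    def plan(sql):
        # The fourth column of EXPLAIN QUERY PLAN holds the readable detail.
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        return " ".join(row[3] for row in rows)

    # Prefix LIKE: the planner can turn this into an index range search.
    like_plan = plan("SELECT name FROM people WHERE name LIKE 'a%'")
    # Function on the column: every row has to be visited and evaluated.
    func_plan = plan("SELECT name FROM people WHERE substr(name, 1, 1) = 'a'")
    conn.close()
    return like_plan, func_plan
```

Printing the two plans should show a SEARCH using the index for the LIKE query and a SCAN for the function-based one. The exact wording of the plan text varies between SQLite versions, so treat it as indicative and still measure on your own data.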
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.StartsWith(...) and you'll get just a LIKE operator in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.

Optimising (My)SQL Query

I usually use ORM instead of SQL and I am slightly out of touch on the different JOINs...
SELECT `order_invoice`.*
, `client`.*
, `order_product`.*
, SUM(product.cost) as net
FROM `order_invoice`
LEFT JOIN `client`
ON order_invoice.client_id = client.client_id
LEFT JOIN `order_product`
ON order_invoice.invoice_id = order_product.invoice_id
LEFT JOIN `product`
ON order_product.product_id = product.product_id
WHERE (order_invoice.date_created >= '2009-01-01')
AND (order_invoice.date_created <= '2009-02-01')
GROUP BY `order_invoice`.`invoice_id`
The tables/columns are logically named... it's a shop-type application... the query works... it's just very, very slow...
I use the Zend Framework and would usually use Zend_Db_Table_Row::find(Parent|Dependent)Row(set)('TableClass'), but I have to make lots of joins and I thought it would improve performance to do it all in one query instead of hundreds...
Can I improve the above query by using more appropriate JOINs or a different implementation? Many thanks.
The query is wrong: the GROUP BY is wrong. All columns in the SELECT part that are not in an aggregate function have to be in the GROUP BY. You mention only one column.
Change the SQL Mode, set it to ONLY_FULL_GROUP_BY.
When this is done and you have a correct query, use EXPLAIN to find out how the query is executed and what indexes are used. Then start optimizing.
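As a sketch of what a query satisfying that rule might look like, the example below selects only the grouped column and the aggregate. It runs on SQLite through Python's sqlite3 module instead of MySQL (an assumption), the tables are reduced to the columns needed, and the sample rows are made up. Client and product details could then be joined onto this per-invoice result in an outer query.

```python
import sqlite3

def net_per_invoice():
    # SQLite stands in for MySQL; schema is reduced and data is made up.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE order_invoice (invoice_id INTEGER PRIMARY KEY, client_id INTEGER);
        CREATE TABLE order_product (invoice_id INTEGER, product_id INTEGER);
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, cost REAL);
        INSERT INTO order_invoice VALUES (1, 10), (2, 11);
        INSERT INTO product VALUES (100, 5.0), (101, 7.5);
        INSERT INTO order_product VALUES (1, 100), (1, 101);
    """)
    # Every selected column is either in GROUP BY or inside an aggregate.
    rows = conn.execute("""
        SELECT oi.invoice_id, SUM(p.cost) AS net
        FROM order_invoice oi
        LEFT JOIN order_product op ON oi.invoice_id = op.invoice_id
        LEFT JOIN product p ON op.product_id = p.product_id
        GROUP BY oi.invoice_id
        ORDER BY oi.invoice_id
    """).fetchall()
    conn.close()
    return rows
```

Invoice 1 sums to 12.5, and invoice 2, which has no products, is kept by the LEFT JOINs with a NULL (None in Python) net.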