In the following SQL I need the syntax to access the nested struct, specifically the following on the third line:
collect_list(struct( .. ) )
I have put rec.*, but that is certainly not the correct way.
select matchMethod, rec.* from
(select first(matchMethod) matchMethod,
collect_list(struct(rawTp,tp,fp,fn,
precision,recall,weight,F1,
truthGrpId,entityId,
tpIds,fpIds, fnIds,truthIds,actuals)) rec
from scoring5
where entityId is not null and truthGrpId is not null
group by truthGrpId
) order by rec.truthGrpId, rec.recall desc
This results in:
org.apache.spark.sql.AnalysisException:
Can only star expand struct data types. Attribute: `ArrayBuffer(rec)`;
Many other approaches have been attempted. I have also perused about ten other questions here on SO, but none address this directly for the SQL (as opposed to the DSL). Is this at all possible?
I am uncertain whether the message Can only star expand struct data types means there may be a different syntax to achieve this, or whether Spark SQL simply has a deficiency here.
We are using Spark 2.3.x.
Given the significant research, as well as trials of various combinations of syntax, I tend to agree with @user6910411 that the above is not presently supported. It seems some help is coming along in Spark 2.4; see the answer by Jacek Laskowski.
In any case, I found a more straightforward approach using window functions:
select * from
(select row_number() over (partition by truthGrpId order by recall desc) rownum,*
from
(select matchMethod, rawTp,tp,fp,fn,
precision,recall,weight,F1,
truthGrpId,entityId,
tpIds,fpIds, fnIds,truthIds,actuals
from scoring5
where entityId is not null and truthGrpId is not null
) order by truthGrpId, recall desc
) where rownum=1 order by truthGrpId
The obvious follow-up is to dig deeper into window functions and incorporate them as first-class citizens in my exploratory work.
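For reference, the same result can likely be obtained with one less level of nesting, since the inner order by is redundant once row_number picks the top row per group (a sketch against the same scoring5 table):
select * from (
  select s.*,
         row_number() over (partition by truthGrpId order by recall desc) rownum
  from scoring5 s
  where entityId is not null and truthGrpId is not null
) t
where rownum = 1
order by truthGrpId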
Related
I'm trying to add named parameters to a dataset query in an SSRS report (I'm using Report Builder), but I have had no luck discovering the correct syntax. I have tried @parameter, $1, $parameter and others, all without success. I suspect the syntax is just different for PostgreSQL versus normal SQL.
The only success I have had with passing parameters was based on this answer.
It involves using ? for every single parameter.
My query might look something like this:
SELECT address, code, remarks FROM table_1 WHERE date BETWEEN ? AND ? AND apt_num IS NULL AND ADDRESS = ?
This does work, but in the case of a query where I pass the same parameter to more than one part of the SELECT statement, I have to add the same parameter to the list multiple times as shown here. They are passed in this order, so adding a new parameter to an existing query results in having to reshuffle, and sometimes completely rebuild, the query parameters tab.
What are the proper syntax and naming requirements for adding named Parameters when using a PostgreSQL data source in SSRS?
From my comment, this is what it would look like with a regular join:
with inparms as (
select ? as from_date, ? as to_date, ? as address
)
select t.address, t.code, t.remarks
from inparms i
join table_1 t
on t.date between i.from_date and i.to_date
and t.apt_num is null
and t.address = i.address;
I said cross join in my comment because it is sometimes quicker when retrofitting somebody else's SQL instead of trying to untangle things (thinking of a friend who uses right join sometimes just to ruin my day).
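For completeness, the cross join variant mentioned in the comment differs only in how the one-row CTE is attached; the filters move into the WHERE clause (a sketch using the same hypothetical table_1 columns):
with inparms as (
  select ? as from_date, ? as to_date, ? as address
)
select t.address, t.code, t.remarks
from inparms i
cross join table_1 t
where t.date between i.from_date and i.to_date
  and t.apt_num is null
  and t.address = i.address;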
I have a custom query along these lines. I get the list of orderIds from outside. I have the entire order object list with me, so I can change the query in any way, if needed.
@Query("SELECT p FROM Person p INNER JOIN p.orders o WHERE o.orderId IN :orderIds")
public List<Person> findByOrderIds(@Param("orderIds") List<String> orderIds);
This query works fine, but sometimes it may have anywhere between 50 and 1000 entries in the orderIds list sent from the calling function. It then becomes very slow, taking as much as 5-6 seconds, which is not fast enough. My question is: is there a better, faster way to do this? When I googled, and on this site, I see we can use ANY or EXISTS (Postgresql: alternative to WHERE IN respective WHERE NOT IN), create a temporary table (https://dba.stackexchange.com/questions/12607/ways-to-speed-up-in-queries-under-postgresql), or join to a VALUES clause (Alternative when IN clause is inputed A LOT of values (postgreSQL)). All these answers are tailored towards direct SQL calls, nothing based on JPA. The ANY keyword is not supported by spring-data, and I am not sure about creating temporary tables in custom queries. I think I can do it with native queries, but have not tried it. I am using spring-data + OpenJPA + PostgreSQL.
Can you please suggest a solution or give pointers? I apologize if I missed anything.
thanks,
Alice
You can use WHERE EXISTS instead of an IN clause in a native SQL query, as well as in HQL/JPQL, which can result in significant performance benefits. Please see the sample below.
Sample JPA Query:
SELECT emp FROM Employee emp JOIN emp.projects p where NOT EXISTS (SELECT project from Project project where p = project AND project.status <> 'Active')
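Applied to the Person/orders case above, a native SQL version (e.g. via a Spring Data native @Query) might look roughly like this. This is only a sketch: the person and orders table and column names are assumptions, and the orderIds list is still bound as a parameter.
SELECT p.*
FROM person p
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.person_id = p.id
    AND o.order_id IN (:orderIds)
);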
In the Spark SQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
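For instance, with the two frames registered as temporary views named df1 and df2 (names assumed here), the symmetric difference can be written in plain Spark SQL from those standard operators:
SELECT * FROM (SELECT * FROM df1 EXCEPT SELECT * FROM df2) a
UNION
SELECT * FROM (SELECT * FROM df2 EXCEPT SELECT * FROM df1) b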
Why not the below?
df1.except(df2)
If you are looking for a PySpark solution, you should use subtract() (see the docs).
Also, unionAll is deprecated in 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-dups results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to be equal to the original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT a,b,c
FROM tab1 t1
LEFT OUTER JOIN
tab2 t2
ON (
(t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
)
WHERE
COALESCE(t2.a, t2.b, t2.c) IS NULL
I think it could be more efficient using a left join and then filtering out the nulls.
df1.join(df2, Seq("some_join_key", "some_other_join_key"),"left")
.where(col("column_just_present_in_df2").isNull)
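The same idea is also expressible directly in Spark SQL as a left anti join, which keeps only the left-side rows that have no match on the right (the temporary view names df1 and df2 and the join keys here are assumptions):
SELECT *
FROM df1
LEFT ANTI JOIN df2
  ON df1.some_join_key = df2.some_join_key
 AND df1.some_other_join_key = df2.some_other_join_key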
I've written a SQL query that basically selects from a number of tables to determine which ones have rows that were created since a particular date. My SQL looks something like this:
SELECT widget_type FROM (
SELECT 'A' as widget_type
FROM widget_a
WHERE creation_timestamp > :cutoff
UNION
SELECT 'B' as widget_type
FROM widget_b
WHERE creation_timestamp > :cutoff
) types
GROUP BY widget_type
HAVING count(*)>0
That works well in SQL but I recently found that, while JPA may use unions to perform "table per class" polymorphic queries, JPQL does not support unions in queries. So that leaves me wondering whether JPA has an alternative I could use to accomplish the same thing.
In reality, I would be querying against a dozen tables, not just two, so I would like to avoid doing separate queries. I would also like to avoid doing a native SQL query for portability reasons.
In the question I linked to above, it was asked whether the entities that map to widget_a and widget_b are part of the same inheritance tree. Yes, they are. However, if I selected from their base class, I don't believe I would have a way of specifying different string constants for the different child entities, would I? If I could select an entity's class name instead of a string I provide, that might serve my purpose too. But I don't know if that's possible either. Thoughts?
I did a little more searching and found a (seemingly obscure) feature of JPA that serves my purpose perfectly. What I found is that JPA 2 has a type keyword that allows you to limit polymorphic queries to a particular subclass, like so:
SELECT widget
FROM BaseWidget widget
WHERE TYPE(widget) in (WidgetB, WidgetC)
I've found that JPA (or at least Hibernate as a JPA implementation) allows you to use type not only in constraints but also in select lists. This is approximately what my query ended up looking like:
SELECT DISTINCT TYPE(widget)
FROM BaseWidget widget
WHERE widget.creationTimestamp > :cutoff
That query returns a list of Class objects. My original query was selecting string literals because that's closest to what I might have done in SQL. Selecting Class is actually preferable in my case. But if I did prefer to select a constant based on an entity's type, that is the exact scenario that Oracle's documentation uses to illustrate case statements:
SELECT p.name,
CASE TYPE(p)
WHEN Student THEN 'kid'
WHEN Guardian THEN 'adult'
WHEN Staff THEN 'adult'
ELSE 'unknown'
END
FROM Person p
Some JPA providers do support UNION,
http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Basic_JPA_Development/Querying/JPQL#UNION
but your query seems very complex and not object-oriented, so using a native SQL query would probably be best.
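If the native-SQL route is taken, the existence check can also be written without scanning and grouping every matching row, e.g. on a database that allows SELECT without FROM, such as PostgreSQL (a sketch; only two of the dozen widget tables are shown):
SELECT 'A' AS widget_type
WHERE EXISTS (SELECT 1 FROM widget_a WHERE creation_timestamp > :cutoff)
UNION ALL
SELECT 'B' AS widget_type
WHERE EXISTS (SELECT 1 FROM widget_b WHERE creation_timestamp > :cutoff)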
This is my simple JPQL:
SELECT s
FROM Site s
GROUP BY s.siteType
siteResult = q.getResultList();
for (Site site : siteResult) {
// loops all sites
}
This query returns all sites, including sites of the same siteType.
I'm using JPA 2.0 Eclipselink.
What's wrong here?
Such a query does not make sense. If you use GROUP BY, the other attributes in the SELECT should be aggregated. As the JPA specification says:
The requirements for the SELECT clause when GROUP BY is used follow
those of SQL: namely, any item that appears in the SELECT clause
(other than as an aggregate function or as an argument to an aggregate
function) must also appear in the GROUP BY clause. In forming the
groups, null values are treated as the same for grouping purposes.
If you consider the SQL counterpart of your query:
SELECT s.attr1, s.attr2, s.siteType
FROM site s
GROUP BY (s.siteType)
you will notice it is hard to say which value of attr1 and attr2 should be chosen for each group.
In such a case, EclipseLink with Derby just drops the GROUP BY from the query, which is of course a somewhat questionable way to handle invalid JPQL. I prefer how Hibernate+MySQL behaves with such invalid JPQL: it fails with a fairly clear error message:
java.sql.SQLSyntaxErrorException: The SELECT list of a grouped query
contains at least one invalid expression. If a SELECT list has a GROUP
BY, the list may only contain valid grouping expressions and valid
aggregate expressions.
Answer to comment:
A Site probably contains other attributes besides siteType as well. Let's use the following example:
public class Site {
int id;
String siteType;
}
and two instances: (id=1, siteType="same") and (id=2, siteType="same"). Now, when the select type is Site itself (or all of its attributes) and you group by siteType, it is impossible to say whether the result should contain the instance with id 1 or the one with id 2. That's why you have to use some aggregate function (like AVG, which gives you the average of the attribute values) for the remaining attributes (id in our case).
Behind this link (ObjectDB GROUP BY) you can find some examples with GROUP BY and aggregates.
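For example, a valid grouped version of the SQL counterpart above would aggregate the non-grouped columns, along these lines (a sketch against the example site table, picking MAX over id arbitrarily):
SELECT s.siteType, COUNT(*) AS site_count, MAX(s.id) AS max_id
FROM site s
GROUP BY s.siteType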