Scala & Spark: Recycling SQL statements

I spent quite some time coding multiple SQL queries that were formerly used to fetch the data for various R scripts. This is how it worked:
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
The catch is that some queries require the result of a prior query, which is why creating VIEWs in the database itself does not solve this problem. With Spark 2.0 I already figured out a way to do just that:
import scala.io.Source

// create a DataFrame using a JDBC connection to the database
val tableDf = spark.read.jdbc(...)
// random, unique name for the temp view
val tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
// load the SQL text from the classpath and fill in its variables
var sqlQuery = Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
// point the query at the temp view instead of the original table
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)
tableDf.createOrReplaceTempView(tempTableName)
val data = spark.sql(sqlQuery)
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring, currently don't work. Is there a more robust way, like re-implementing the SQL code as Spark SQL code using .filter($""), .select($""), etc.?
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (each of which always contains a few JOINs, WITHs, etc.). So n queries lead to n DataFrames.
Is there a better option than the provided two?
Setup: Hadoop v.2.7.3, Spark 2.0.0, IntelliJ IDEA 2016.2, Scala 2.11.8, test cluster on a Win7 workstation

It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that I would enhance the input file so each sql statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES=SELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS=SELECT * FROM PEOPLEPLACES WHERE age>20
Then execute in a loop like
parseSqlFile().foreach { case (name, query) =>
  // run the query, then register its result under the given name for later queries
  val data: DataFrame = execute(query)
  data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order, so that all required tables have been created before they are used. Otherwise, do a little more parsing and sort by dependencies.
In an RDBMS I'd call these tables materialised views, i.e. a transform on other data, like a view, but with the result cached for later reuse.
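A minimal sketch of the "sort by dependencies" step, in plain Scala without Spark. It assumes one NAME=QUERY entry per line and treats any other query name that appears as a whole word in the SQL text as a dependency. QueryOrdering and its methods are hypothetical names, not part of Spark, and the matching is case-sensitive and deliberately naive.

```scala
// Hypothetical helpers; the names and the NAME=QUERY line format are assumptions.
object QueryOrdering {
  /** Splits each non-empty line at the first '=' into (name, query). */
  def parseNamedQueries(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      val idx = line.indexOf('=')
      require(idx > 0, s"expected NAME=QUERY but got: $line")
      (line.substring(0, idx).trim, line.substring(idx + 1).trim)
    }

  /** Names of other queries that appear as whole words in this query's text. */
  def dependencies(name: String, query: String, allNames: Set[String]): Set[String] = {
    val words = query.split("\\W+").toSet
    (allNames - name).intersect(words)
  }

  /** Orders queries so that each one comes after the queries it references. */
  def sortByDependencies(queries: Seq[(String, String)]): Seq[(String, String)] = {
    val names = queries.map(_._1).toSet
    var remaining = queries
    var done = Vector.empty[(String, String)]
    while (remaining.nonEmpty) {
      val doneNames = done.map(_._1).toSet
      // a query is ready once everything it references has already been emitted
      val (ready, blocked) = remaining.partition { case (n, q) =>
        dependencies(n, q, names).subsetOf(doneNames)
      }
      require(ready.nonEmpty,
        "cyclic dependency between queries: " + blocked.map(_._1).mkString(", "))
      done = done ++ ready
      remaining = blocked
    }
    done
  }
}
```

The sorted pairs can then be fed to the loop above, so that every temp view exists before a query that selects from it runs.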

Related

Selecting identical named columns in jOOQ

I'm currently using jOOQ to build my SQL (with code generation via the mvn plugin).
Executing the created query is not done by jOOQ though (I'm using the Vert.x SqlClient for that).
Let's say I want to select all columns of two tables which share some identical column names, e.g. UserAccount(id,name,...) and Product(id,name,...). When executing the following code
val userTable = USER_ACCOUNT.`as`("u")
val productTable = PRODUCT.`as`("p")
create().select().from(userTable).join(productTable).on(userTable.ID.eq(productTable.AUTHOR_ID))
query.getSQL(ParamType.NAMED) returns a query like
SELECT "u"."id", "u"."name", ..., "p"."id", "p"."name", ... FROM ...
The problem here is that the result set will contain the columns id and name twice, without the prefix "u." or "p.", so I can't map/parse it correctly.
Is there a way to tell jOOQ to alias these columns like the following, without any further manual effort?
SELECT "u"."id" AS "u.id", "u"."name" AS "u.name", ..., "p"."id" AS "p.id", "p"."name" AS "p.name" ...
I'm using the holy Postgres database :)
EDIT: My current approach would be something like
val productFields = productTable.fields().map { it.`as`(name("p.${it.name}")) }
val userFields = userTable.fields().map { it.`as`(name("u.${it.name}")) }
create().select(productFields,userFields,...)...
This feels really hacky though.

How to correctly dereference tables from records
You should always use the column references that you passed to the query to dereference values from records in your result. If you didn't pass column references explicitly, then the ones from your generated table via Table.fields() are used.
In your code, that would correspond to:
userTable.NAME
productTable.NAME
So, in a resulting record, do this:
val rec = ...
rec[userTable.NAME]
rec[productTable.NAME]
Using Record.into(Table)
Since you seem to be projecting all the columns (do you really need all of them?) to the generated POJO classes, you can still do this intermediary step if you want:
val rec = ...
val userAccount: UserAccount = rec.into(userTable).into(UserAccount::class.java)
val product: Product = rec.into(productTable).into(Product::class.java)
Because the generated table has all the necessary meta data, it can decide which columns belong to it, and which ones don't. The POJO doesn't have this meta information, which is why it can't disambiguate the duplicate column names.
Using nested records
You can always use nested records directly in SQL as well in order to produce one of these 2 types:
Record2<Record[N], Record[N]> (e.g. using DSL.row(table.fields()))
Record2<UserAccountRecord, ProductRecord> (e.g. using DSL.row(table.fields()).mapping(...), or starting from jOOQ 3.17 directly using a Table<R> as a SelectField<R>)
The second jOOQ 3.17 solution would look like this:
// Using an implicit join here, for convenience
create().select(productTable.userAccount(), productTable)
.from(productTable)
.fetch();
The above uses implicit joins, for additional convenience.
Auto aliasing all columns
There are a ton of flavours that users could like to have when "auto-aliasing" columns in SQL. Any solution offered by jOOQ would be no better than the one you've already found, so if you still want to auto-alias all columns, then just do what you did.
But usually, the desire to auto-alias is a derived feature request from a misunderstanding of what's the best approach to do something in jOOQ (see the options above), so ideally, you don't go down the auto-aliasing road.

slick 3.3.2 not joining when explicitly writing a join query

I'm trying to get two tables joined with Slick 3.3.2 and Play 2.7.x, but I'm having a hard time understanding why my code doesn't do what I want it to.
I have two tables, Foo and Bar, that I need to join on a string column; let's call it fooBar.
val innerJoin = for {
(f, b) <- Foo join Bar on (_.fooBar === _.fooBar)
} yield (f, b)
db.run(innerJoin.result)
Docs say this is the way to do it: http://scala-slick.org/doc/3.3.2/queries.html#applicative-joins
But the query Slick generated when debugging doesn't actually use a join; it simply selects the properties from the two tables and filters them with a where clause, like so (simplified): select * from Foo, Bar where (x2.fooBar = x3.fooBar)
What is going on here?

Slick has generated a join there, but it's in a form known as an "implicit join" (in SQL).
It's a syntax difference, and you can check with your database documentation to see if the query optimiser will treat them the same.
As a rule, I would not worry about the surface SQL text Slick generates, unless there's a performance issue that you can identify by inspecting the query plan in your database.

Get aggregate sum of difference of two dates in JOOQ

I am using jOOQ for writing SQL in my Java code. I have the following query written for a PostgreSQL database:
Query: fetches the total number of checked tasks and the total time taken to complete the tasks.
The total time for a task is calculated from the table "workevents" as (endtime - starttime). But here I am fetching the total time spent on all the tasks.
with taskdata as (
select taskid from unittest.tasks
where projectname='test'and status='checked'
),
workevents as(
select (endtime-starttime) diff ,unittest.workevents.taskid as
workeventtaskid from unittest.workevents ,taskdata
where taskdata.taskid=unittest.workevents.taskid
)
select sum(workevents.diff),count(distinct workeventtaskid) from
workevents;
I have converted it into jOOQ as below:
final String sql =
with(TASK_INFO_WRAPPER)
.as(select(TASK_ID).from(TASK_TABLE)
.where(PROJECT_NAME.eq(param()).and(TASK_STATUS.eq("checked"))))
.with(WORKEVENT_INFO_WRAPPER)
.as(select(TASK_END_TIME.sub(TASK_START_TIME).as("diff"),
WORKEVENT_TASK_ID.as("workeventtaskid"))
.from(WORKEVENT_TABLE, table(name(TASK_INFO_WRAPPER)))
.where("workeventinfo.taskid=taskinfo.taskid"))
.select().getSQL(ParamType.INDEXED);
But I am not able to get the aggregate sum of "diff" (the difference of the dates). Is there any function in jOOQ that can express the SQL statement "select sum(workevents.diff)"?
I have tried the sum(field) function, but it gives a compile-time error because sum is meant for numbers,
and here I am calculating the cumulative sum of the difference of two dates (diff).

All RDBMS behave subtly differently when implementing a date difference using the - operator, which is why it is generally recommended to use jOOQ's DSL.dateDiff() or DSL.timestampDiff() instead.
A side note on using WITH
WITH is often used to decompose a problem into smaller problems in SQL. But at some point, that decomposition leads to more complicated queries than necessary, as in your case. Especially when using jOOQ, it is often recommended to avoid common table expressions (WITH) or derived tables, not only because they're a bit more difficult to express in jOOQ, but also because they don't really add value to your query.
Your query could be written like this instead:
select
sum(e.endtime - e.starttime),
count(distinct e.taskid)
from unittest.tasks t
join unittest.workevents e on t.taskid = e.taskid
where t.projectname = 'test' and t.status = 'checked'
And that would obviously be quite a bit easier to translate to jOOQ.

Iterating in Scala DataFrame

I have a DataFrame in Spark with sample accounts, which has 5 different columns.
val sampledf = sqlContext.sql("select * from Sampledf")
I have another table in an Oracle DB, OracleTable, which has millions of records.
I want to filter the accounts present in OracleTable with respect to SampleDF:
Select * from OracleTable where column in (select column from SamplesDf)
I realized that in Oracle we cannot provide more than 1000 values in an IN condition.
And the subquery above is not working, due to the huge amount of data in OracleTable.
I want to achieve the query below:
select column from OracleTable where (acctnum in (1,2,3,...1000) or acctnum in (1001,....2000) ....
Basically all the accounts from SampleDF (every 1000 accounts)
Since we can't give more than 1000 at once (that's the limitation in Oracle), we can give 1000 every time.
How can I generate this kind of dynamic query? Do I need to create an Array from the DataFrame?
I just need a workaround for how to proceed. Any suggestions will be helpful.

A broadcast join is the best option; it broadcasts the smaller DataFrame across the cluster. As mentioned, reading the Oracle data takes time, which might be due to profile restrictions on the number of parallel sessions.
See the workaround below to build a dynamic IN condition.
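import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// number the rows so they can be sliced into batches of fewer than 1000 values
val newsampledf = sampledf.withColumn("seq", row_number().over(Window.orderBy("yourcolumn"))).select("yourcolumn", "seq")
val cnt = newsampledf.count()
var j = 1L // row_number() starts at 1
while (j <= cnt) {
  val batch = newsampledf.select("yourcolumn").where(col("seq") >= j && col("seq") < j + 999) // one batch of up to 999 values
  j += 999
}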
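If you do want to build the IN condition as a string, the batching can be sketched in plain Scala as below. It assumes the account numbers have already been collected to the driver as numeric values (so no quoting is needed). InClauseBuilder and chunkedIn are hypothetical names, and the limit of 1000 values per IN list is the one stated in the question.

```scala
// Hypothetical helper; the name and signature are assumptions for illustration.
object InClauseBuilder {
  /** Builds "(col IN (...) OR col IN (...))" with at most chunkSize values per list. */
  def chunkedIn(column: String, values: Seq[Long], chunkSize: Int = 1000): String = {
    require(values.nonEmpty, "values must not be empty")
    values
      .grouped(chunkSize) // consecutive slices of up to chunkSize values
      .map(chunk => s"$column IN (${chunk.mkString(", ")})")
      .mkString("(", " OR ", ")")
  }
}
```

Calling InClauseBuilder.chunkedIn("acctnum", ids) gives a condition that can be appended after WHERE. For string values, bind parameters or proper escaping would be needed instead of plain concatenation.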

You can try to join both tables on the column.
Load the Oracle table as a DataFrame
Join oracleDF with sampleDF
val resultDF = oracleDF.join(sampleDF, Seq("column"))
Use broadcast if sampleDF is small, for better performance
val resultDF = oracleDF.join(broadcast(sampleDF), Seq("column"))
Hope it helps you.

Faster/efficient alternative to IN clause in custom/native queries in spring data jpa

I have a custom query along these lines. I get the list of orderIds from outside. I have the entire order object list with me, so I can change the query in any way, if needed.
@Query("SELECT p FROM Person p INNER JOIN p.orders o WHERE o.orderId IN :orderIds")
public List<Person> findByOrderIds(@Param("orderIds") List<String> orderIds);
This query works fine, but the orderIds list passed in from the caller can contain anywhere between 50 and 1000 entries. The query then becomes very slow, taking as much as 5-6 seconds, which is not fast enough. Is there a better, faster way to do this? Searching the web and this site, I see we can use ANY or EXISTS (Postgresql: alternative to WHERE IN respective WHERE NOT IN), create a temporary table (https://dba.stackexchange.com/questions/12607/ways-to-speed-up-in-queries-under-postgresql), or join to a VALUES clause (Alternative when IN clause is inputed A LOT of values (postgreSQL)). All these answers are tailored towards direct SQL calls, nothing based on JPA. The ANY keyword is not supported by spring-data. I am not sure about creating temporary tables in custom queries; I think I can do it with native queries, but have not tried it. I am using spring-data + OpenJPA + PostgreSQL.
Can you please suggest a solution or give pointers? I apologize if I missed anything.
thanks,
Alice

You can use WHERE EXISTS instead of an IN clause, in a native SQL query as well as in an HQL/JPQL query in JPA, which can bring a considerable performance benefit. Please see the sample below.
Sample JPA Query:
SELECT emp FROM Employee emp JOIN emp.projects p where NOT EXISTS (SELECT project from Project project where p = project AND project.status <> 'Active')
