I am new to jOOQ and am considering replacing some JDBC code with jOOQ.
Looking at the jOOQ Java 8 streams examples, I started wondering whether I can get a performance improvement by using jOOQ.
I have a PostgreSQL query with these characteristics:
Merge Join (cost=1.34..7649.90 rows=30407 width=333) (actual time=0.042..46.644 rows=28264 loops=1)
On the database server, the first row is returned after 0.042 ms, while the last row is returned after 46.644 ms.
But my JDBC code does not return the ResultSet until it is complete.
Is jOOQ (with Java 8 streams) able to start handling tuples as soon as they are ready, or is jOOQ limited by JDBC?
jOOQ's Java 8 integration has two methods that might be interesting to you:
fetchStream(), which returns a Stream
fetchAsync(), which returns a CompletionStage
As of jOOQ 3.8, both APIs are limited by the blocking nature of the underlying JDBC API, i.e. both APIs internally iterate on ResultSet.next().
In particular, you can turn on server-side cursors by setting a fetch size:
// JDBC
statement.setFetchSize(50);
// jOOQ, which delegates this call to JDBC
query.fetchSize(50);
See also Statement.setFetchSize() or this question for more details.
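For example, a minimal sketch of streaming with a small fetch size might look like this (it assumes an open PostgreSQL connection with auto-commit disabled, which the PostgreSQL driver requires for cursor-based fetching; the table name is illustrative):

import static org.jooq.impl.DSL.table;

import java.sql.Connection;
import java.util.stream.Stream;
import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.ResultQuery;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

class StreamingExample {
    static void streamRows(Connection connection) throws Exception {
        // The PostgreSQL driver only uses cursor-based fetching when auto-commit is off.
        connection.setAutoCommit(false);

        DSLContext ctx = DSL.using(connection, SQLDialect.POSTGRES);
        ResultQuery<Record> query = ctx.select()
                                       .from(table("generic_operation"))
                                       .fetchSize(50); // delegated to Statement.setFetchSize()

        // fetchStream() wraps a lazily consumed ResultSet, so rows can be
        // processed as soon as the driver delivers them.
        try (Stream<Record> stream = query.fetchStream()) {
            stream.forEach(r -> {
                // handle each record here
            });
        }
    }
}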
Related
I want to execute multiple select statements in the same JDBC query, in order to amortize network latency over a number of related queries, as described in this question:
Multiple queries executed in java in single statement
The accepted answer is to use allowMultiQueries=true. Unfortunately, this is a feature specific to the MySQL JDBC driver.
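For reference, the multi-result pattern on the MySQL driver looks roughly like this (a sketch; the URL, credentials, and table names are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class MultiQueryExample {
    static void run() throws Exception {
        // allowMultiQueries=true lets the MySQL driver accept several
        // semicolon-separated statements in a single call.
        String url = "jdbc:mysql://localhost/test?allowMultiQueries=true";

        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             Statement stmt = conn.createStatement()) {

            boolean isResultSet = stmt.execute(
                "SELECT id FROM orders; SELECT id FROM customers");

            // Walk through the returned result sets one by one.
            while (isResultSet) {
                try (ResultSet rs = stmt.getResultSet()) {
                    while (rs.next()) {
                        // process the current result set
                    }
                }
                isResultSet = stmt.getMoreResults();
            }
        }
    }
}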
What is the equivalent in PostgreSQL JDBC?
I've set up a sample project using Spring Boot, WebFlux, and R2DBC. I've been able to stream rows from a Postgres DB table to the client.
Is there a memory bottleneck on this server implementation (for storing the results of the query)? Do the rows stream through?
PS: I'm not claiming any level of quality here; I know pagination and so on would be essential. I'm just wondering how the DB query interacts with the reactive framework.
Pagination is not essential with R2DBC. If you have a lot of rows to process you can issue a single query instead of fetching batches. The driver uses back-pressure to allow flow control so it does not overwhelm your application. You could read here about how backpressure is applied on such queries.
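As an illustration, a minimal sketch of such a streaming endpoint using Spring Data R2DBC's DatabaseClient might look like this (the table name, route, and media type are assumptions, not taken from the project above):

import java.util.Map;

import org.springframework.http.MediaType;
import org.springframework.r2dbc.core.DatabaseClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import reactor.core.publisher.Flux;

@RestController
class ItemController {

    private final DatabaseClient client;

    ItemController(DatabaseClient client) {
        this.client = client;
    }

    // Rows are emitted as the driver produces them; the Flux only requests more
    // rows when the downstream HTTP client signals demand, so the full result
    // set is never buffered in memory.
    @GetMapping(value = "/items", produces = MediaType.APPLICATION_NDJSON_VALUE)
    Flux<Map<String, Object>> items() {
        return client.sql("SELECT * FROM item")
                     .fetch()
                     .all();
    }
}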
Kudu tables can be accessed via Impala and thus via its JDBC driver. Thanks to that, they are accessible through the standard Java/Scala JDBC API. I was wondering whether it is possible to use Slick for this. Or, if not, does any other high-level Scala DB framework support Impala/Kudu?
Slick can be used with any JDBC database
http://slick.lightbend.com/doc/3.3.0/database.html
At least for me, Slick is not fully compatible with Impala/Kudu. Using Slick, I cannot modify DB entities: I cannot create, update, or delete any item. It only works for reading data.
There are two ways you could use Slick with an arbitrary JDBC driver (and SQL dialect).
The first is to use low-level JDBC calls. The SimpleDBIO class gives you access to a JDBC connection:
val getAutoCommit = SimpleDBIO[Boolean](_.connection.getAutoCommit)
That example is from the Slick manual.
However, I think you're more interested in working at a higher level than that. In that case, for Slick, you'd need to implement a custom Profile. If Impala is similar enough to an existing database profile, you may be able to extend an existing profile and adjust it to account for any differences. For example, this would allow you to customize how SQL is formatted for Impala, how timestamps are represented, how column names are quoted. The documentation on Porting SQL from Other Database Systems to Impala would give you an idea of what needs to change in a driver.
Or, if not, does any other high-level Scala DB framework support Impala/Kudu?
None of the mainstream libraries seem to support Impala as a feature. Having said that, the Doobie documentation mentions customising connections for Hive. So it may be worth quickly trying Doobie to see if you can query and insert, for example.
I am not an expert of Spark SQL API, nor of the underlying RDD one.
But, knowing of the Catalyst optimization engine, I would expect Spark to try and minimize in-memory effort.
This is my situation:
I have, let's say, two tables:
TABLE GenericOperation (ID, CommonFields...)
TABLE SpecificOperation (OperationID, SpecificFields...)
They are both quite large (~500M, not big data, but infeasible to hold entirely in memory on a standard application server).
That said, suppose I have to retrieve using Spark (part of a larger use case) all the SpecificOperation instances that match some particular condition on fields that belong to GenericOperation.
This is the code that I am using:
val gOps = spark.read.jdbc(db.connection, "GenericOperation", db.properties)
val sOps = spark.read.jdbc(db.connection, "SpecificOperation", db.properties)
val joined = sOps.join(gOps).where("ID = OperationID")
joined.where("CommonField= 'SomeValue'").select("SpecificField").show()
The problem is that when I run the above, I can see from SQL Profiler that Spark does not execute the join on the database, but rather retrieves all the OperationID values from SpecificOperation and then, I assume, runs the merge in memory. Since no filter is applicable to SpecificOperation, such a retrieval would bring far too much data to the end system.
Is it possible to write the above so that the join is delegated directly to the DBMS?
Or does it depend on some magic Spark configuration I am not aware of?
Of course, I could simply hardcode the join as a subquery when retrieving, but that's not feasible in my case: statements have to be created at runtime starting from simple building blocks. Hence, I need to implement this starting from two spark.sql.DataFrames that are already built up.
As a side note, I am running this with Spark 2.3.0 for Scala 2.11, against a SQL Server 2016 database instance.
Is it possible to write the above so that the join is delegated directly to the DBMS? Or does it depend on some magic Spark configuration I am not aware of?
Excluding statically generated queries (In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?), Spark doesn't support join pushdown. Only predicates and selection can be delegated to the source.
There is no magic configuration or code that could even support this type of process.
In general, if the server can handle the join, the data is usually not large enough to benefit from Spark.
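For completeness, the "statically generated query" workaround mentioned above would look roughly like this (a sketch in Java; names are taken from the question, and it does not satisfy the requirement of composing statements from already-built DataFrames at runtime):

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class PushdownExample {
    static Dataset<Row> loadJoined(SparkSession spark, String url, Properties props) {
        // The join is written as a subquery and passed where a table name would go,
        // so the database executes it and Spark only receives the joined rows.
        String pushedDownJoin =
            "(SELECT s.* FROM SpecificOperation s " +
            "JOIN GenericOperation g ON g.ID = s.OperationID " +
            "WHERE g.CommonField = 'SomeValue') AS q";

        return spark.read().jdbc(url, pushedDownJoin, props);
    }
}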
I'd like to use JPA over JDBC for a new application. I'm strictly using named queries and the basic CRUD methods of the JPA EntityManager, which allows me (with the help of Hibernate, or any other JPA implementation) to extract all the native SQL queries that will be performed on the database. With this list of static queries, I understand that I can build a DB2 package that contains the execution plans for my queries.
So my question is: will performing those queries through JDBC against DB2 take advantage of those execution plans, or not? I understand that the PureQuery product can capture the list of SQL statements. Does it, still through JDBC and not through the PureQuery-specific API, provide more, such as the DB2 static bind feature, or is it equivalent to plain JDBC?
Thank you for any partial answer.
JDBC applications execute dynamic SQL only (i.e. DB2 does not use static packages).
There are only two ways to get static SQL (where the queries are stored in a package in the database): write your application using SQLJ (which eliminates JPA/Hibernate), or use pureQuery (which sits between JDBC and the database).
Keep in mind that even with dynamic SQL, DB2 does cache the execution plans for queries, so if they are executed frequently enough (i.e., they remain in the cache), then you won't see the overhead from query compilation. The cache is only useful if the queries are an exact byte-for-byte match, so select * from t1 where c1 = 1 is not the same as select * from t1 where c1 = 2, nor is select * from t1 where C1 = 1 (which gives the same result, but the query differs). Using parameter markers (select * from t1 where c1 = ?) is key. Your DBA can tune the size of the catalog cache to help maximize the hit ratio on this cache.
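For example, a minimal sketch of the parameter-marker approach (connection handling and row processing are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class ParameterMarkerExample {
    static void query(Connection connection) throws Exception {
        // The statement text stays byte-for-byte identical across executions,
        // so DB2's dynamic statement cache can reuse the compiled plan.
        String sql = "select * from t1 where c1 = ?";

        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setInt(1, 1);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process row
                }
            }

            // A different value reuses the same cached plan instead of
            // compiling "... where c1 = 2" as a new statement.
            ps.setInt(1, 2);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process row
                }
            }
        }
    }
}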
Although caching helps avoid repeatedly compiling a query, it does not offer the plan stability that static SQL does, so YMMV.