JPA 2.0: Batch query, safe and performant?

I am looking for a vendor-independent JPA solution to execute a query in batches. The challenge is to make this performant as well as thread-safe.
Example query:
Query query = em.createQuery("select e from Entity e where e.property in :list");
The list is a collection with between 1 and 385,000 elements, hence the requirement to batch this query.
The initial naive approach was to take sublists of the original list and loop through them until done. This was safe and worked well, except that it was not performant.
The second approach was to load everything from the list into a temp table (permanent in existence, but used as a temporary table) and then join the original query against it. This is definitely performant, but it is not thread-safe: I need to clear the temp table after each batch, and without a thread id or something of that sort in the temp table it is pretty unsafe (which it is at the moment).
I would really appreciate suggestions to arrive at a performant and safe way to tackle this issue.
Thanks

First of all, make sure the query is valid JPQL: it needs a select clause and, strictly speaking, the IN argument should be parenthesized.
Second, it should be where e.property in (:list).
Your strategy of populating a temp table looks fine to me. You could just make it contain an additional uuid column, and generate a new UUID each time you want to perform such a query:
generate a UUID
insert all the elements of the list in the table, with the uuid column set to the generated UUID
execute a query such as select e from Entity e, TempEntity temp where e.property = temp.property and temp.uuid = :uuid
execute a query to delete all the rows from the temp table (not absolutely necessary): delete from TempEntity temp where temp.uuid = :uuid
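For concreteness, here is a minimal sketch of the SQL side of that approach; the table and column names (temp_entity, property, uuid) and the types are illustrative only, and the values would be bound as JPA parameters:
-- one-time setup: a permanent table used as a scratch area, keyed by a per-request UUID
create table temp_entity (
    uuid     char(36)     not null,
    property varchar(100) not null
);
-- an index on (uuid, property) supports both the join and the cleanup delete
create index idx_temp_entity_uuid on temp_entity (uuid, property);
-- per request: insert one row per list element under a freshly generated UUID,
-- run the join query above with that UUID, then clean up
insert into temp_entity (uuid, property) values (?, ?);
delete from temp_entity where uuid = ?;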

Related

CREATE TABLE WITH NO DATA is very slow

I'm running this SQL command on PostgreSQL 11:
CREATE TABLE IF NOT EXISTS my_temp_table AS TABLE my_enormous_table WITH NO DATA;
It takes 5 minutes to make the new table.
The EXPLAIN ... is:
Seq Scan on my_enormous_table (cost=0.00..35999196.34 rows=143407234 width=3278)
Moving to a query like CREATE TABLE ... AS (SELECT * FROM my_enormous_table WHERE FALSE); is orders of magnitude faster - there is no seq scan, and the outcome is the same.
Any ideas what could be causing this issue?
WITH NO DATA still executes the query; it just ignores the result.
The better way to do that would be to avoid CREATE TABLE ... AS:
CREATE TABLE my_temp_table (LIKE my_enormous_table);
That also allows you to use the INCLUDING clause to copy default values, storage parameters, constraints and other things from the original table:
CREATE TABLE my_temp_table (LIKE my_enormous_table
INCLUDING CONSTRAINTS INCLUDING DEFAULTS);

INSERT INTO .. SELECT causing possible race condition?

INSERT INTO A
SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A);
or, written differently:
WITH selection AS
(SELECT * FROM B WHERE timestamp > (SELECT max(timestamp) FROM A))
INSERT INTO A SELECT * FROM selection;
If these queries run multiple times simultaneously, is it possible that I will end up with duplicated rows in A?
How does Postgres process these queries? Is it one or multiple?
If it is multiple queries (find max(timestamp)[1], select[2] then insert[3]) I can imagine this will cause duplicated rows.
If that is correct, would wrapping it in BEGIN/END (a transaction) help?
Yes, that might result in duplicate values.
A single statement sees a consistent view of the data in all tables as of the point in time when the statement started.
Wrapping that single statement into a transaction won't change that (a single statement is always executed atomically, regardless of the number of sub-queries involved).
The statement will never see uncommitted data from other transactions, which is the root cause of why you can wind up with duplicate values.
The only safe way to avoid duplicate values is to create a unique constraint (or index) on that column. In that case the INSERT would result in an error if such a value already exists.
If you want to avoid the error, use insert ... on conflict.
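For illustration, a minimal sketch of that approach; the id column and the index name are assumptions, since the question doesn't show the full schema of A:
create unique index if not exists a_id_uniq on a (id);
insert into a
select * from b where timestamp > (select max(timestamp) from a)
on conflict (id) do nothing;  -- concurrent runs silently skip rows that were already inserted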
This depends on the isolation level set in your database.
The relevant background is the transaction isolation chapter of the postgres documentation.
By default Postgres runs at Read committed, where each statement sees a snapshot of the data taken when that statement started; under Repeatable read the snapshot is taken when the transaction first reads data. Either way, if two of these inserts take their snapshot before either one writes, you will get duplicate data in these tables.
If you want to avoid having duplicate entries, you have a few options (sketches below).
Try using the isolation level Serializable.
Apply a unique index on a suitable column of A. The timestamp is not a great candidate, as you might legitimately have two rows with the same timestamp; the id column is probably a good option.
Take a lock at the application level before performing such a query.
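For illustration, rough sketches of the Serializable option and the locking option; the advisory-lock key (42) is arbitrary, it just has to be the same in every writer:
-- Serializable: be prepared to retry on serialization failures (SQLSTATE 40001)
begin isolation level serializable;
insert into a
select * from b where timestamp > (select max(timestamp) from a);
commit;
-- application-level lock, here done with a PostgreSQL advisory lock held until commit
begin;
select pg_advisory_xact_lock(42);
insert into a
select * from b where timestamp > (select max(timestamp) from a);
commit;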

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store a significant number of their columns as numeric codes; these numbers correspond to a lookup table with the real values. We are talking about 9,500 different values (e.g. '502 = yes' or '1413 = Graduate Student').
Normally I would just write a WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, and some need DECODE statements with 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment, I would stay away from using DECODE as described in your post. I would start by doing it with the usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities for the joins (this may help the optimizer create a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license admits it, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option, if your lookup table is fairly static, is to cache the lookup table in the application: read the TEST_TABLE from the database and look up the descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to do all these joins you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if none is found.
With your mentioned 10k id/text pairs and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
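Assuming the LOOKUP function and the TEST.LOOKUP table above, usage against the TEST_TABLE from the question would then look something like this (the column names are taken from the question and may differ in your schema):
select TEST_ID,
       lookup(TEST_STATUS)    as TEST_STATUS,
       lookup(ANOTHER_STATUS) as ANOTHER_STATUS
from TEST_TABLE;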

Postgresql get references from a dictionary

I'm trying to build a query to get the data from a table, but some of its columns are foreign keys that I would like to replace with the associated name, all in one query.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a query with a subquery plus an alias to get that, but of course I get more than one result per subquery, and I can't find a way to correlate that subquery with the ID of table B from the main query [might be exhausted, dumb or both]. I did something like this:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
It feels like such a simple query, yet...
You don't need a join in the subselect.
SELECT pkb_id,
(SELECT name FROM a WHERE a.pka_id = b.fka_id),
amount
FROM b;
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b, a
WHERE a.pka_id = b.fka_id;
Note that the join version puts fewer restrictions on the PostgreSQL query optimizer, so in some cases the join version might work faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU units, cf. Parallel Query.)

IN clause with large list in OpenJpa causing too complex statement

I have to create a named query where I need to group my results by some fields and also use an IN clause to limit my results.
It looks something like this:
SELECT new MyDTO(e.objID) FROM Entity e WHERE e.objId IN (:listOfIDs) GROUP BY e.attr1, e.attr2
I'm using OpenJPA and IBM DB2. In some cases my list of IDs can be very large (>80,000 IDs), and then the generated SQL statement becomes too complex for DB2, because the final generated statement prints out all IDs, like this:
SELECT new MyDTO(e.objID) FROM Entity e WHERE e.objId IN (1,2,3,4,5,6,7,...) GROUP BY e.attr1, e.attr2
Is there any good way to handle this kind of query? A possible workaround would be to write the IDs into a temporary table and then use the IN clause on that table.
You should put all of the values in a table and rewrite the query as a join. This will not only solve your query problem, it should be more efficient as well.
declare global temporary table ids (
objId int
) with replace on commit preserve rows;
--If this statement is too long, use a couple of insert statements.
insert into session.ids values
(1), (2), (3), (4), ...;
select new mydto(e.objID)
from entity e
join session.ids i on
e.objId = i.objId
group by e.attr1, e.attr2;