CREATE TABLE WITH NO DATA is very slow - postgresql

I'm running this SQL command on PostgreSQL 11:
CREATE TABLE IF NOT EXISTS my_temp_table AS TABLE my_enormous_table WITH NO DATA;
It takes 5 minutes to make the new table.
The EXPLAIN ... is:
Seq Scan on my_enormous_table (cost=0.00..35999196.34 rows=143407234 width=3278)
Moving to a query like CREATE TABLE ... (SELECT * FROM my_enormous_table WHERE FALSE); is orders of magnitude faster - there is no seq scan, and the outcome is the same.
Any ideas what could be causing this issue?

WITH NO DATA still executes the query, it just ignores the result.
The better way to do that would be to avoid CREATE TABLE ... AS:
CREATE TABLE my_temp_table (LIKE my_enormous_table);
That also allows you to use the INCLUDING clause to copy default values, storage parameters, constraints and other things from the original table:
CREATE TABLE my_temp_table (LIKE my_enormous_table
INCLUDING CONSTRAINTS INCLUDING DEFAULTS);
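As a rough sketch of the same approach, assuming you want to keep the IF NOT EXISTS guard from the original statement: INCLUDING ALL is shorthand for every INCLUDING option, and EXCLUDING INDEXES is an illustrative choice (not something the question asked for) that skips index creation on the copy.

```sql
-- Only the table definition is copied from the catalog;
-- the rows of my_enormous_table are not scanned.
CREATE TABLE IF NOT EXISTS my_temp_table
    (LIKE my_enormous_table INCLUDING ALL EXCLUDING INDEXES);
```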

Related

Multiple Text-Column Index

I'm exploring how postgres works in different circumstances. My question concerns multi-column indexes. Let's say I create and populate the following table.
DROP TABLE IF EXISTS perf_test;
create table perf_test(id int, reason text, annotation text);
insert into perf_test(id, reason, annotation)
select s.id, md5(random()::text), null
from generate_series(1,10000000) as s(id)
order by random();
-- do this separately to avoid the same values for both columns
UPDATE perf_test
SET annotation = md5(random()::text);
If I run a query such as:
EXPLAIN ANALYZE
SELECT *
FROM perf_test
WHERE reason LIKE 'bc%' AND annotation LIKE 'ab%';
I get a parallel sequential scan, not surprisingly. If I build an index for both columns:
CREATE INDEX idx_perf_test_reason_annotation
ON perf_test(reason,annotation);
and then run the same query again, I get the same result: a sequential scan. Why doesn't the planner switch to an index scan? The result consists of about 100 records (not so many, I guess). I tried other characters that match just a few records, but it still used a sequential scan. Why doesn't it use the index even when the result consists of only a couple of rows?
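One common cause worth checking, offered as an assumption because the question does not state the database's collation: with the default operator class, a b-tree index on a text column can serve left-anchored LIKE patterns such as 'bc%' only when the collation is C. In other collations the index needs the text_pattern_ops operator class. A minimal sketch (the index name idx_perf_test_reason_annotation_pattern is made up for illustration):

```sql
-- Index whose operator class supports left-anchored LIKE patterns
-- regardless of the database collation.
CREATE INDEX idx_perf_test_reason_annotation_pattern
    ON perf_test (reason text_pattern_ops, annotation text_pattern_ops);

-- Refresh statistics so the planner can estimate selectivity.
ANALYZE perf_test;

EXPLAIN ANALYZE
SELECT *
FROM perf_test
WHERE reason LIKE 'bc%' AND annotation LIKE 'ab%';
```

Whether the planner then picks an index or bitmap scan still depends on its cost estimates for the pattern in question.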

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all have exactly 20 bytes of data, C sometimes contains NULLs
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A is created in around 10 minutes, while B and C took over 10 hours, at which point I aborted them. I ran every CREATE INDEX on its own, so no indices were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the second index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>';
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see that column A differs from the other two columns in its number of distinct values and/or has a higher correlation.
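For the table in the question, the two steps above might look like this. It is a sketch that assumes the table and columns were created without quoted identifiers, so their names are stored in lower case in the catalog:

```sql
-- Refresh planner statistics for the table.
ANALYZE transactions;

-- Compare distinct-value estimates and physical ordering of the three columns.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'transactions'
  AND attname IN ('a', 'b', 'c');
```

A negative n_distinct is a fraction of the row count, so -1 means every value is estimated to be distinct.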
Edit: Basically, creating an index = FULL SCAN of the table and create entries in the index as you progress. With the stats you have shared below that means:
Column A: it was detected as unique
A single scan is enough as the DB knows 1 record = 1 index entry.
Columns B & C : it was detected as having very few distinct values + abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the matching entries are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: with insert/update/delete records, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution 3:
Create partitions and enjoy partition pruning.
A lot of effort has gone into partitioning in PostgreSQL recently. It could be worth a look.

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns.
Currently, the database has a number of tables, most of which store many of their columns as numeric codes; these codes correspond to rows in a table holding the real values. We are talking about 9,500 different values (e.g. '502=yes' or '1413= Graduate Student').
In any other situation I would just use a WHERE clause and match on equality, but since there are 20-30 columns that need to be decoded per table, I can't really do this (as far as I know).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine.......but I manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this AND some need DECODE statements that would be 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment I would stay away from using DECODE as described in your post. I would start by doing it as usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (may help the optimizer creating a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license permits, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option, if your lookup table is fairly static, is to cache the lookup table in the application. Read TEST_TABLE from the database, and look up descriptions in the application. A further improvement may be to add triggers that invalidate the cache when the lookup table is modified.
If you don't want to do all these joins you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs this will return the text value corresponding to an id, or NULL if not found.
With your mentioned 10k id/text pairs and an index on the ID field this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
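With that function in place, the query from the question could be written roughly as follows. This is a sketch: ANOTHER_STATUS is the example column from the join version above, and the index name lookup_id_idx is made up.

```sql
-- Index on the lookup key so each function call is a cheap index probe.
create index lookup_id_idx on test.lookup(id);

-- One function call per coded column instead of one join per column.
select TEST_ID,
       lookup(TEST_STATUS)    as TEST_STATUS,
       lookup(ANOTHER_STATUS) as ANOTHER_STATUS
from TEST_TABLE;
```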

Optimization of count query for PostgreSQL

I have a table in postgresql that contains an array which is updated constantly.
In my application I need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But as the number of rows and the number of executions of that query grow (several times per second, possibly hundreds or thousands), performance decreases a lot. It seems to me that counting in PostgreSQL might take time linear in the number of rows (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, such an index doesn't seem to be usable for NOT ARRAY[...] <@ indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires, it also needs a GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective like ARRAY[1] - 4s vs 2s for the original query. Where the index is highly selective (ie: not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as @debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as @maniek suggests.
Is there an existing pattern I’m not aware of that applies to this
situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
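The DDL above leaves properties empty. One way to fill it from the existing array column is sketched below. DISTINCT is there because the sample arrays can contain the same value twice, which would otherwise violate the primary key, and the sketch assumes the arrays contain no NULL elements.

```sql
-- One row per (row id, distinct array element).
INSERT INTO properties (demo_id, property)
SELECT DISTINCT id, unnest(array_column)
FROM arrtable;
```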
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database has to execute for your query. There is no way it could work without sequential scanning of data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? This query would use a gin/gist index.

JPA 2.0: Batch query, safe and performant?

I am looking for a JPA-solution (vendor-independent) to execute a query in batches. The challenge is to make this performant as well as thread-safe.
Example query:
Query query = em.createQuery("select e from Entity e where e.property in :list");
The list is a collection of size between 1 and 385000. Hence, the requirement to batch this query.
Initial naive approach was to get a sublist from the original list and loop through until done. This was safe and working well except that it was not performant.
The second approach was to load everything from the list into a temp table (permanent in existence, but used as a temporary table) and then use the original query joined with the temp table. This is definitely performant, but it is not thread-safe: I need to clear the temp table after each batch, and without a thread id or something of that sort in the temp table it's pretty unsafe (which it is at the moment).
I would really appreciate suggestions to arrive at a performant and safe way to tackle this issue.
Thanks
First of all, the query is not valid JPQL, because it doesn't have a select clause.
Second, it should be where e.property in (:list).
Your strategy of populating a temp table looks fine to me. You could just make it contain an additional uuid column, and generate a new UUID each time you want to perform such a query:
generate a UUID
insert all the elements of the list in the table, with the uuid column set to the generated UUID
execute a query such as select e from Entity e, TempEntity temp where e.property = temp.property and temp.uuid = :uuid
execute a query to delete this batch's rows from the temp table (not absolutely necessary): delete from TempEntity temp where temp.uuid = :uuid
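In plain SQL terms, the steps above might look like the sketch below. The table and column names (temp_entity, entity, batch_uuid), the column types, and the sample values are assumptions for illustration, not taken from the question. The UUID literal stands in for a value the application generates for each request.

```sql
-- Shared staging table; rows of concurrent requests are kept apart by batch_uuid.
CREATE TABLE temp_entity (
    batch_uuid CHAR(36)     NOT NULL,
    property   VARCHAR(255) NOT NULL
);
CREATE INDEX temp_entity_batch_idx ON temp_entity (batch_uuid, property);

-- Insert the list elements, each tagged with this request's UUID.
INSERT INTO temp_entity (batch_uuid, property)
VALUES ('3f0e7c52-9a41-4c1e-8b57-2d6a1f0c9e11', 'value-1');
INSERT INTO temp_entity (batch_uuid, property)
VALUES ('3f0e7c52-9a41-4c1e-8b57-2d6a1f0c9e11', 'value-2');

-- Fetch only the entities matching this request's rows.
SELECT e.*
FROM entity e
JOIN temp_entity t ON e.property = t.property
WHERE t.batch_uuid = '3f0e7c52-9a41-4c1e-8b57-2d6a1f0c9e11';

-- Clean up this request's rows without touching other requests.
DELETE FROM temp_entity
WHERE batch_uuid = '3f0e7c52-9a41-4c1e-8b57-2d6a1f0c9e11';
```

Because every statement filters on the request's own UUID, two threads can use the table at the same time without seeing or deleting each other's rows.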