How to minimize the use of CTEs in a PostgreSQL query - postgresql

The following code works well with a small number of vector features. However, when I run the query against a larger table (~35,000 rows), memory use climbs to 100% (32 GB) and pgAdmin then reports "Connection to the server has been lost". I am running on localhost, so the issue is not network related. I'm guessing it's because I am using too many CTEs (WITH queries). I was thinking of nesting the query in a PL/pgSQL loop and updating a table with the results, thereby discarding the temporary results after each iteration. This seems like an inelegant solution, and I was hoping someone could show me how to minimize the use of CTEs in the query below.
CREATE TABLE dem_stats AS
WITH
-- Select Features using lookup table and determine the raster tiles said features are intersecting
feat AS
(SELECT title_no,
a.grid_tile_name || '.asc' AS tile_name,
a.wkb_geometry as geom
FROM test_polygons a, parcels_all_shapefile_lookup_osgb_grid_5km b
WHERE a.title_no = b.olp_title_no
),
-- Merge rasters tiles from main raster file that intersect features
merged_rast AS
(SELECT ST_Union(rast,1) AS rast
FROM dem, feat
WHERE filename
IN (tile_name)
),
-- As the tiles are now merged duplicates are not required
feat_temp AS
(SELECT DISTINCT ON (title_no) * FROM feat
),
-- Clip merged raster and obtain pixel statistics
b_stats AS
(SELECT title_no, (stats).*
FROM (SELECT title_no, ST_SummaryStats(ST_Clip(a.rast,1,b.geom,-9999,true)) AS stats
FROM merged_rast a
INNER JOIN feat_temp b
ON ST_Intersects(b.geom,a.rast)
) AS foo
)
-- Summarise statistics for each title number
SELECT title_no,
count As pixel_val_count,
min AS pixel_val_min,
max AS pixel_val_max,
mean AS pixel_val_mean,
stddev AS pixel_val_stddev
FROM b_stats
WHERE count > 0;

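Although the result is not exactly readable, you can always inline the CTEs. This is sometimes a good idea because CTEs are optimization fences in PostgreSQL before version 12 (from version 12 on, a non-recursive, side-effect-free CTE that is referenced only once is inlined automatically): https://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/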
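The mechanical rewrite described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the `parcels` table and its rows are hypothetical. It shows that moving a CTE body into a subquery in FROM returns the same rows; it says nothing about how either planner treats the two forms.

```python
import sqlite3

# In-memory SQLite database as a stand-in; `parcels` is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parcels (title_no TEXT, tile_name TEXT);
    INSERT INTO parcels VALUES ('T1', 'a.asc'), ('T1', 'b.asc'), ('T2', 'b.asc');
""")

# CTE form: the intermediate result gets a name in a WITH clause.
cte_sql = """
    WITH feat AS (SELECT title_no, tile_name FROM parcels)
    SELECT title_no, count(*) FROM feat GROUP BY title_no ORDER BY title_no
"""

# Inlined form: the same query body moved into a subquery in FROM.
inline_sql = """
    SELECT title_no, count(*)
    FROM (SELECT title_no, tile_name FROM parcels) AS feat
    GROUP BY title_no ORDER BY title_no
"""

cte_rows = conn.execute(cte_sql).fetchall()
inline_rows = conn.execute(inline_sql).fetchall()
print(cte_rows == inline_rows)
```

A CTE that is referenced more than once (like feat in the question) has to be repeated at each reference when inlined, which is where readability suffers.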

Related

Can't count() a PostgreSql table [duplicate]

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close, depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/optimizer/util/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
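To make the arithmetic of the "safe and explicit" query easier to follow, here is a minimal Python sketch of the same calculation. The function name `estimate_rows` and its parameters are assumptions made for illustration; the inputs stand for reltuples, relpages, the result of pg_relation_size() and the block size.

```python
from typing import Optional


def estimate_rows(reltuples: float, relpages: int,
                  relation_size_bytes: int,
                  block_size: int = 8192) -> Optional[int]:
    """Scale the stored tuple density by the current number of pages."""
    if reltuples < 0:
        # Never vacuumed or analyzed: the density is unknown (NULL in SQL).
        return None
    if relpages == 0:
        # Empty table according to the stored statistics.
        density = 0.0
    else:
        density = reltuples / relpages
    # Integer division, like bigint / int in the SQL version.
    current_pages = relation_size_bytes // block_size
    return round(density * current_pages)


# Statistics recorded 1000 rows in 10 pages; the table has since grown to 20 pages.
print(estimate_rows(1000.0, 10, 20 * 8192))
```

The table doubled in pages since the last ANALYZE, so the stored density of 100 rows per page is applied to 20 pages.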
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit: you get an exact and current count for up to n rows (500000 in the example), and n otherwise. This is not nearly as fast as the estimate in pg_class, though.
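The capped count can be tried out with a small sketch. It assumes SQLite (via Python's sqlite3 module) as a stand-in for Postgres, and the helper name `capped_count` and the sample rows are made up for illustration; the query itself has the same shape as the one above.

```python
import sqlite3


def capped_count(conn: sqlite3.Connection, limit: int) -> int:
    """Count rows in token, but never scan more than `limit` of them."""
    (n,) = conn.execute(
        "SELECT count(*) FROM (SELECT 1 FROM token LIMIT ?)", (limit,)
    ).fetchone()
    return n


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (text TEXT)")
conn.executemany("INSERT INTO token VALUES (?)",
                 [("w%d" % i,) for i in range(1000)])

print(capped_count(conn, 500))   # limit is below the table size
print(capped_count(conn, 5000))  # limit is above the table size
```

When the limit is below the table size the result is the limit itself; otherwise it is the exact row count.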
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless, even for knowing to within a factor of 100 how many rows would be returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
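A minimal sketch of the parsing step described above, in Python. The function names `estimated_rows` and `needs_exact_count` are hypothetical, and the 1.5 factor is simply the example value mentioned in the text.

```python
import re
from typing import Optional

ROWS_RE = re.compile(r"rows=(\d+)")


def estimated_rows(explain_first_line: str) -> Optional[int]:
    """Pull the planner's row estimate out of the first EXPLAIN line."""
    match = ROWS_RE.search(explain_first_line)
    return int(match.group(1)) if match else None


def needs_exact_count(estimate: int, threshold: int, factor: float = 1.5) -> bool:
    """Run the real COUNT(*) only when the estimate is near or below the threshold."""
    return estimate < factor * threshold


line = "Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)"
print(estimated_rows(line))
```

If the estimate is far above the threshold, the exact count can be skipped and the threshold used directly.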
Reference taken from this blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column.
Again, this is to decrease the workload (scan through a narrow column index)
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
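You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is. This only approximates the row count if the key is a sequential integer with few gaps (for example, few deleted rows).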
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and then apply a count(*):
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.

How to avoid skewing in redshift for Big Tables?

I want to load a table of more than 1 TB from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, where EVEN distribution causes performance issues.
Columns on my table are
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our Redshift cluster has 20 nodes.
So I tried using workday as the distribution key, but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How can I avoid the skew in such cases?
How do we avoid skew when the row counts per key value are uneven (say hour1 has 1 million rows, hour2 has 1.5 million rows, hour3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either SORTKEY or COMPOUND SORTKEY. Sort Key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. From the columns used in your queries, it is advised to choose as the DISTKEY the column that causes the least amount of skew. A column that has many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
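Why a low-cardinality key skews the table can be illustrated with a small simulation. This is only a sketch under loud assumptions: Redshift's actual hash function and slice layout are not modeled, MD5 is used purely as a stand-in hash, and the names `node_for` and `skew` and the 70,000 synthetic rows are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(key, nodes: int) -> int:
    """Assign a key to a node with a stand-in hash (not Redshift's real one)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % nodes


def skew(keys, nodes: int = 20):
    """Return (nodes that received rows, largest node / mean rows per node)."""
    counts = Counter(node_for(k, nodes) for k in keys)
    mean = len(keys) / nodes
    return len(counts), max(counts.values()) / mean


rows = range(70_000)
# Distribution key with 7 distinct values, like workday.
workday_used, workday_ratio = skew([i % 7 for i in rows])
# Distribution key with one distinct value per row, like id.
id_used, id_ratio = skew(list(rows))

print(workday_used, round(workday_ratio, 2))
print(id_used, round(id_ratio, 2))
```

With only 7 distinct key values, at most 7 of the 20 nodes can ever hold rows, whatever the hash function is. A high-cardinality key spreads the rows over all nodes almost evenly.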
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
Here is the architecture that I recommend
1) Load to a staging table with DISTSTYLE EVEN, sorted by whatever your S3 data is already sorted on. This means you will not have to vacuum the staging table.
2) Set up a production table with the sort and distribution keys you need for your queries. After each copy from S3, load the new data into the production table and vacuum.
3) You may wish to have two mirror production tables and flip-flop between them using a late-binding view.
This is a bit complex to do, and you may need some professional help. There may be specifics to your use case.
As of this writing (just after re:Invent 2018), Redshift has automatic distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, if you don't like what automatic distribution is doing, try a few combinations by replicating the same table with different DIST keys. After the tables are created, run the admin utility from the Git repo (preferably by creating a view on the SQL script in the Redshift DB).
Also, if you have a clear picture of your query usage patterns, you can use the SQL below to check how well the sort keys are performing.
/**Queries on tables that are not utilizing SORT KEYs**/
SELECT t.database, t.table_id,t.schema, t.schema || '.' || t.table AS "table", t.size, nvl(s.num_qs,0) num_qs
FROM svv_table_info t
LEFT JOIN (
SELECT tbl, COUNT(distinct query) num_qs
FROM stl_scan s
WHERE s.userid > 1
AND s.perm_table_name NOT IN ('Internal Worktable','S3')
GROUP BY tbl) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 desc;
/**INTERLEAVED SORT KEY**/
--check skew
select tbl as tbl_id, stv_tbl_perm.name as table_name,
col, interleaved_skew, last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
and interleaved_skew is not null;
Of course, there is always room for improvement in the SQL above, depending on the specific stats that you want to look at or drill down into.
Hope this helps.

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no rootrecord
The answer has been provided to that question and I now have the data I require. However, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. This looked like the following:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above, and the other to do the same recursive work but referencing the initial CTE rather than a temp table:
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Both scripts are run on the same server; however, the temp table approach yields the results in approximately 15 seconds.
The multiple-CTE approach takes upwards of 5 minutes (so long, in fact, that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth I believe it is to do with the record counts. The base table has 200k records in it and from memory CTE performance is severely degraded when dealing with large data sets but I cannot seem to prove that so thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer for this, some further research into the subject turned up a number of other threads with similar problems.
This one seems to cover many of the differences between temp tables and CTEs, so it is the most useful for people looking to read around their issues:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs causes the issue: the CTE result is not cached anywhere, so it is recomputed each time it is referenced later, which has a large impact.
This might not be exactly the same issue you experienced, but a few days ago I came across a similar one, and the queries did not even process that many records (a few thousand).
And yesterday my colleague had a similar problem.
Just to be clear, we are using SQL Server 2008 R2.
The pattern that I identified, and that seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are joined with other temporary tables in the main SELECT statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end you can say that other database systems have their own quirks.

SQL limit query

I'm having an issue with limiting a SQL query. I'm using SQL Server 2000, so I can't use features like ROW_NUMBER(), CTEs, or OFFSET ... FETCH.
I have tried the SELECT TOP limit * FROM approach, excluding the results already shown, but the query is very slow because my result set sometimes contains more than 10,000 records.
Also I have tried the following approach:
SELECT * FROM (
SELECT DISTINCT TOP 100 PERCENT i.name, i.location, i.image ,
( SELECT count(DISTINCT i.id) FROM image WHERE i.id<= im.id ) AS recordnum
FROM images AS im
order by im.location asc, im.name asc) as tmp
WHERE recordnum between 5 AND 15
This has the same problem, plus another issue: I couldn't add an ORDER BY in the subquery that computes recordnum. I have placed both solutions in a stored procedure, but query execution is still very slow.
So my question is:
Is there an efficient way to limit the query to 20 records per page in SQL Server 2000 for large amounts of data, i.e. more than 10,000 records?
Thanks.
Here the subquery is run only once.
The condition where im2.id is null skips the first 40 rows.
SELECT top 25 im1.*
FROM images im1
left join ( select top 40 id from images order by id ) im2
on im1.id = im2.id
where im2.id is null
order by im1.id
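The anti-join paging pattern above can be sketched as follows. Assumptions: SQLite (through Python's sqlite3 module) stands in for SQL Server 2000, so LIMIT replaces TOP, and the helper name `page` and the 100 sample rows are hypothetical.

```python
import sqlite3


def page(conn: sqlite3.Connection, skip: int, size: int):
    """Return `size` ids after skipping the first `skip` ids, ordered by id."""
    sql = """
        SELECT im1.id
        FROM images im1
        LEFT JOIN (SELECT id FROM images ORDER BY id LIMIT ?) im2
               ON im1.id = im2.id
        WHERE im2.id IS NULL      -- drops the first `skip` rows
        ORDER BY im1.id
        LIMIT ?
    """
    return [row[0] for row in conn.execute(sql, (skip, size))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO images VALUES (?, ?)",
                 [(i, "img%d" % i) for i in range(1, 101)])

print(page(conn, 40, 25))
```

Rows whose id appears in the first-40 subquery find a join partner and are filtered out by the IS NULL test, so the outer limit starts at row 41.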
Query-wise, there is no great-performing way. If performance is critical and the data will always be grouped/ordered the same way, you could add an int column and set its value by trigger based on the grouping/ordering. Index it and it should be extremely fast for reads; writes will be a bit slower.
Also, make sure you have indexes on the Id columns on image and images.

way to reduce the cost in db2 for count(*)

Hi, I have a DB2 query as below:
select count(*) as count from
table_a,
table_b,
table c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
For the above query I even tried count(1), but when I looked at the access plan, the cost was the same.
select count(1) as count from
table_a,
table_b,
table c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
Is there any other way to reduce the cost?
PS: b.xxx, b.yyy and c.wedf are indexed.
Thanks in advance.
I think one of the problems is the statistics on the tables. Did you execute RUNSTATS? Probably the data distribution, or the number of rows that have to be read, is such that DB2 concludes it is better to read the whole table instead of processing an index and then fetching the rows from the table.
It seems that both queries use the same access plan, and I think they are doing table scans.
Are the three columns part of the same index, or are they indexed separately? If they are part of different indexes, is there any index ANDing in the access plan? If there is no ANDing of the different indexes, the columns have to be read from the table in order to process the predicates.
The reason count(1) and count(*) give the same cost is that both have to do a table scan.
Please take a look at the access plan: not only the cost in timerons, but also the steps. Does the access plan use the indexes? How many sorts does it execute?
Try changing the optimization level, and you will see that the access plan changes. I think you are executing with the default one (5).
If you want to force the query to take an index into account, you can create an optimization profile.
What is the relation between the (B, C) tables and table A? In your query you simply CROSS JOIN A with (B, C), and that is the MAIN performance issue.
If you really need this count, just multiply the counts for A and (B, C):
select
(select count(*) from a)
*
(select count(*) from b, c where b.xxx=234 AND b.yyy=c.wedf )
For DB2, use this:
select a1.cnt*
(select count(*) as cnt2 from b, c where b.xxx=234 AND b.yyy=c.wedf )
from
(select count(*) as cnt1 from a) a1
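That the product of the two counts equals the count over the cross join can be checked on a toy data set. This is a sketch using SQLite (Python's sqlite3 module) as a stand-in for DB2; the table contents are made up, and only the table and column names a, b, c, xxx, yyy and wedf come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (xxx INTEGER, yyy INTEGER);
    CREATE TABLE c (wedf INTEGER);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (234, 10), (234, 20), (999, 10);
    INSERT INTO c VALUES (10), (10), (20), (30);
""")

# Original shape: a is cross joined with the (b, c) join, then everything is counted.
joined = conn.execute(
    "SELECT count(*) FROM a, b, c WHERE b.xxx = 234 AND b.yyy = c.wedf"
).fetchone()[0]

# Rewritten shape: count each independent part once and multiply.
product = conn.execute("""
    SELECT (SELECT count(*) FROM a)
         * (SELECT count(*) FROM b, c WHERE b.xxx = 234 AND b.yyy = c.wedf)
""").fetchone()[0]

print(joined, product)
```

The rewritten form never materializes the cross product, so its cost grows with the sizes of the two parts rather than with their product.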