PostgreSQL: Statistics on partial index? - postgresql

PostgreSQL version: 9.3.13
Consider the following tables, index and data:
CREATE TABLE orders (
    order_id bigint,
    status smallint,
    owner int,
    CONSTRAINT orders_pkey PRIMARY KEY (order_id)
);
CREATE INDEX owner_index ON orders
    USING btree (owner)
    WHERE status > 0;
CREATE TABLE orders_appendix (
    order_id bigint,
    note text
);
Data
orders:
(IDs, 0, 1337) * 1000000 rows
(IDs, 10, 1337) * 1000 rows
(IDs, 10, 777) * 1000 rows
orders_appendix:
one row for each order
My problem is:
select * from orders where owner=1337 and status>0
The query planner estimated the number of rows to be 1000000, but the actual number of rows is 1000.
In the following, more complicated query:
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE owner=1337 AND status>0
Instead of using an inner join (which is preferable for a small number of rows), it picks a bitmap join plus a full table scan on orders_appendix, which is very slow.
If the condition is "owner=777", it chooses the preferable inner join instead.
I believe this is because of the statistics, as AFAIK Postgres can only collect and consider stats for each column independently.
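To see what the planner has to work with, the per-column statistics can be inspected in the pg_stats view; a small sketch:
-- Sketch: inspect the independently collected per-column statistics
-- that the planner uses for owner and status.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname IN ('owner', 'status');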
However, if I...
CREATE INDEX owner_abs ON orders (abs(owner)) WHERE status > 0;
Now, a slightly changed query...
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE abs(owner)=1337 AND status>0
will result in the inner join that I wanted.
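Presumably this works because ANALYZE collects statistics for the expressions of expression indexes, so the planner now has an estimate for abs(owner) over the rows the index covers. A minimal sketch of the workaround, re-analyzing after creating the index:
-- Sketch of the workaround described above: the expression index makes
-- ANALYZE gather statistics for abs(owner), so re-analyze after creating it.
CREATE INDEX owner_abs ON orders (abs(owner)) WHERE status > 0;
ANALYZE orders;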
Is there a better solution? Perhaps "statistics on partial index"?

Related

PostgreSQL: optimization of query for select with overlap

I created the following query to select data with overlapping periods (for campaigns that have the same business identifier):
select
campaign_instance_1.campaign_id,
campaign_instance_1.start_time
from campaign_instance as campaign_instance_1
inner join campaign_instance as campaign_instance_2
on campaign_instance_1.campaign_id = campaign_instance_2.campaign_id
and (
(campaign_instance_1.start_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
or (campaign_instance_1.finish_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
or (campaign_instance_1.start_time<campaign_instance_2.start_time and campaign_instance_1.finish_time>campaign_instance_2.finish_time)
or (campaign_instance_1.start_time>campaign_instance_2.start_time and campaign_instance_1.finish_time<campaign_instance_2.finish_time))
With an index created as:
CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_id_and_finish_time
ON public.campaign_instance_without_index USING btree
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST)
TABLESPACE pg_default;
Even on 100,000 rows it runs very slowly - 43 seconds!
For optimization I tried to add start_time to the index:
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST, start_time DESC NULLS LAST)
But the result is the same.
As I understand the EXPLAIN ANALYZE output, start_time is not used in the Index Condition.
I tried the query with this index on both 10,000 and 100,000 rows, so it does not seem to depend on the sample size (at least at these scales).
The source table has the following structure:
campaign_id bigint,
fire_time bigint,
start_time bigint,
finish_time bigint,
recap character varying,
details json
Why is my index not used, and what are possible ways to improve the query?
Joining campaign_instance to itself doesn't really serve anything here other than an "existence" check, and presumably your intention is not to get back duplicates for matching records. Thus you could simplify the query with EXISTS or a LATERAL join. Also, your join condition on time could be simplified; you seem to be looking for overlapping times:
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
The time overlap check would probably use < and > instead of <= and >=, but I don't know your exact requirements; BETWEEN implicitly uses <= and >=.
EDIT: Ensure that the match is not the row itself:
(This table should have a primary key to make things easier, but as it doesn't, I assume there is no duplication of campaign_id, start_time and finish_time, and that combination could be used as a composite key.)
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
This takes around 230-250 milliseconds on my system (iMac i5 7500, 3.4 GHz, 64 GB RAM).
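As a further thought (not part of the original answer), an index supporting the correlated EXISTS lookup might look like the sketch below; campaign_id handles the equality and start_time is available for the range comparison:
-- Sketch (assumed name): supports the EXISTS lookup by campaign_id,
-- with start_time available for the overlap comparison.
CREATE INDEX IF NOT EXISTS campaign_instance_campaign_id_start_time_idx
    ON campaign_instance (campaign_id, start_time);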

UPDATE query for 180k rows in 10M row table unexpectedly slow

I have a table that is getting too big and I want to reduce its size
with an UPDATE query. Some of the data in this table is redundant, and
I should be able to reclaim a lot of space by setting the redundant
"cells" to NULL. However, my UPDATE queries are taking excessive
amounts of time to complete.
Table details
-- table1 10M rows (estimated)
-- 45 columns
-- Table size 2200 MB
-- Toast Table size 17 GB
-- Indexes Size 1500 MB
-- **columns in query**
-- id integer primary key
-- testid integer foreign key
-- band integer
-- date timestamptz indexed
-- data1 real[]
-- data2 real[]
-- data3 real[]
This was my first attempt at an update query. I broke it up into some
temporary tables just to get the ids to update. Further, to narrow down the
query, I selected a date range covering June 2020:
CREATE TEMP TABLE A as
SELECT testid
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3;
CREATE TEMP TABLE B as -- this table has 180k rows
SELECT id
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND testid in (SELECT testid FROM A)
AND band > 1;
UPDATE table1
SET data1 = Null, data2 = Null, data3 = Null
WHERE id in (SELECT id FROM B)
Queries for creating TEMP tables execute in under 1 sec. I ran the UPDATE query for an hour(!) before I finally killed it. Only 180k
rows needed to be updated. It doesn't seem like it should take that much
time to update that many rows. Temp table B identifies exactly which
rows to update.
Here is the EXPLAIN from the above UPDATE query. One of the odd features of this explain is that it shows 4.88M rows, but there are only 180k rows to update.
Update on table1 (cost=3212.43..4829.11 rows=4881014 width=309)
-> Nested Loop (cost=3212.43..4829.11 rows=4881014 width=309)
-> HashAggregate (cost=3212.00..3214.00 rows=200 width=10)
-> Seq Scan on b (cost=0.00..2730.20 rows=192720 width=10)
-> Index Scan using table1_pkey on table1 (cost=0.43..8.07 rows=1 width=303)
Index Cond: (id = b.id)
Another way to run this query is in one shot:
WITH t as (
SELECT id from table1
WHERE testid in (
SELECT testid
from table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3
)
)
UPDATE table1 a
SET data1 = Null, data2 = Null, data3 = Null
FROM t
WHERE a.id = t.id
I only ran this one for about 10 minutes before I killed it. It feels like I should be able to run this query in much less time if I just knew the tricks. The EXPLAIN for this query is below. It shows 195k rows, which is closer to what I expected, but the cost is much higher, at 1.3M to 1.7M:
Update on testlog a (cost=1337986.60..1740312.98 rows=195364 width=331)
CTE t
-> Hash Join (cost=8834.60..435297.00 rows=195364 width=4)
Hash Cond: (testlog.testid = testlog_1.testid)
-> Seq Scan on testlog (cost=0.00..389801.27 rows=9762027 width=8)
-> Hash (cost=8832.62..8832.62 rows=158 width=4)
-> HashAggregate (cost=8831.04..8832.62 rows=158 width=4)
-> Index Scan using amptest_testlog_date_idx on testlog testlog_1 (cost=0.43..8820.18 rows=4346 width=4)
Index Cond: ((date >= '2020-06-01 00:00:00-07'::timestamp with time zone) AND (date <= '2020-07-01 00:00:00-07'::timestamp with time zone))
Filter: (band = 3)
-> Hash Join (cost=902689.61..1305015.99 rows=195364 width=331)
Hash Cond: (t.id = a.id)
-> CTE Scan on t (cost=0.00..3907.28 rows=195364 width=32)
-> Hash (cost=389801.27..389801.27 rows=9762027 width=303)
-> Seq Scan on testlog a (cost=0.00..389801.27 rows=9762027 width=303)
Edit: one of the suggestions in the accepted answer was to drop any indexes before the update and then add them back later. This is what I went with, with a twist: I needed another table to hold indexed data from the dropped indexes to make the A and B queries faster:
CREATE TABLE tempid AS
SELECT id, testid, band, date
FROM table1
I made indexes on this table for id, testid, and date. Then I replaced table1 in the A and B queries with tempid. It still went slower than I would have liked, but it did get the job done.
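The helper indexes mentioned above would look something like this (the index names are mine, not from the original post):
-- Sketch: indexes on the helper table used to speed up the A and B queries.
CREATE INDEX tempid_id_idx ON tempid (id);
CREATE INDEX tempid_testid_idx ON tempid (testid);
CREATE INDEX tempid_date_idx ON tempid (date);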
You might have another table with a foreign key referencing one or more of the columns you are setting to NULL, and that referencing table does not have an index on the column.
Each time you set a value to NULL, the database has to check the referencing table to see whether it has a row that references the value you are removing.
If this is the case, you should be able to speed it up by adding an index on that referencing table.
For example, if you have a table like this:
create table table2 (
    id serial primary key,
    band integer references table1(data1)
);
Then you can create a partial index: create index table2_band_nnull_idx on table2(band) where band is not null;
But you suggested that all columns you are setting to NULL have array type. This means that it is unlikely that they are referenced. Still it is worth checking.
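One way to check is to list any foreign keys that reference table1 in the system catalogs; a small sketch:
-- Sketch: list foreign-key constraints on other tables that reference table1.
SELECT conname, conrelid::regclass AS referencing_table
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'table1'::regclass;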
Another possibility is that you have a trigger on the table that works slowly.
Another possibility is that you have a lot of indexes on the table. Each index has to be updated for each row you update and it can use only a single processor core.
Sometimes it is faster to drop all indexes, do the bulk update and then recreate them all afterwards. Creating indexes can use multiple cores - one core per index.
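A rough sketch of that pattern (the index name here is hypothetical; keep the primary key in place so the WHERE id IN (...) lookup stays fast):
-- Sketch: drop secondary indexes, run the bulk UPDATE, then rebuild them.
DROP INDEX IF EXISTS table1_date_idx;   -- repeat for the other secondary indexes

UPDATE table1
SET data1 = NULL, data2 = NULL, data3 = NULL
WHERE id IN (SELECT id FROM B);

CREATE INDEX table1_date_idx ON table1 (date);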
Another possibility is that your query is waiting for some other query to finish and release its locks. You should check with:
select now()-query_start, * from pg_stat_activity where state<>'idle' order by 1;

Strange PostgreSQL index usage while using LIMIT..OFFSET

PostgreSQL 9.6.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
Table and indices:
create table if not exists orders
(
    id bigserial not null constraint orders_pkey primary key,
    partner_id integer,
    order_id varchar,
    date_created date,
    state_code integer,
    state_date timestamp,
    recipient varchar,
    phone varchar
);
create index if not exists orders_partner_id_index on orders (partner_id);
create index if not exists orders_order_id_index on orders (order_id);
create index if not exists orders_partner_id_date_created_index on orders (partner_id, date_created);
The task is to implement paging/sorting/filtering of the data.
The query for the first page:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 0;
The query plan:
QUERY PLAN
"Limit (cost=19495.48..38990.41 rows=10 width=91)"
" -> Index Scan using orders_order_id_index on orders (cost=0.56..**41186925.66** rows=21127 width=91)"
" Filter: ((date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date) AND (partner_id = 1))"
Index orders_partner_id_date_created_index is not used, so the cost is extremely high!
But starting from some offset value (the exact value differs from time to time; it looks like it depends on the total row count), the index starts to be used:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 40;
Plan:
QUERY PLAN
"Limit (cost=81449.76..81449.79 rows=10 width=91)"
" -> Sort (cost=81449.66..81502.48 rows=21127 width=91)"
" Sort Key: order_id"
" -> Bitmap Heap Scan on orders (cost=4241.93..80747.84 rows=21127 width=91)"
" Recheck Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
" -> Bitmap Index Scan on orders_partner_id_date_created_index (cost=0.00..4236.65 rows=21127 width=0)"
" Index Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
What's happening? Is there a way to force the server to use the index?
General answer:
Postgres stores some information (statistics) about your tables.
Before executing a query, the planner prepares an execution plan based on that information.
In your case, the planner thinks that for a certain offset value this sub-optimal plan will be better. Note that your desired plan requires sorting all selected rows by order_id, while this "worse" plan does not. I'd guess that Postgres bets there will be quite a few matching rows spread across many order_id values, so it just walks the order_id index from the lowest value, testing one row after another.
I can think of two solutions:
A) Provide more data to the planner by running
ANALYZE orders;
(https://www.postgresql.org/docs/9.6/sql-analyze.html)
or by changing the gathered statistics:
ALTER TABLE orders ALTER COLUMN ... SET STATISTICS ...;
(https://www.postgresql.org/docs/9.6/planner-stats.html)
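For example, a sketch of option A, where the column choice and the target value are my assumptions:
-- Sketch: raise the per-column statistics target for the filtered columns,
-- then re-gather statistics.
ALTER TABLE orders ALTER COLUMN partner_id SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN date_created SET STATISTICS 1000;
ANALYZE orders;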
B) Rewrite the query in a way that hints the desired index usage, like this:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
)
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Or maybe even better:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
),
all_data AS (
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
)
SELECT *
FROM all_data
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Disclaimer: I can't explain why the first query should be interpreted differently by the Postgres planner, I just think it could be. On the other hand, the second query separates the offset/limit from the join, and I'd be very surprised if Postgres still did it the "bad" (according to your benchmarks) way.

Cassandra filter with ordering query modeling

I am new to Cassandra and I am trying to model a table. My queries look like the following:
Query #1: select * from TableA where Id = "123"
Query #2: select * from TableA where name="test" order by startTime DESC
Query #3: select * from TableA where state="running" order by startTime DESC
I have been able to build the table for Query #1 which looks like
val tableAStatement = SchemaBuilder.createTable("tableA").ifNotExists.
addPartitionKey(Id, DataType.uuid).
addColumn(Name, DataType.text).
addColumn(StartTime, DataType.timestamp).
addColumn(EndTime, DataType.timestamp).
addColumn(State, DataType.text)
session.execute(tableAStatement)
but for Query #2 and #3 I have tried many different things and failed. Every time I get a different error from Cassandra.
Considering the above queries, what would be the right table model? What is the right way to model such queries?
Query #2: select * from TableB where name="test"
CREATE TABLE TableB (
    name text,
    start_time timestamp,
    PRIMARY KEY (name, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
Query #3: select * from TableC where state="running"
CREATE TABLE TableC (
    state text,
    start_time timestamp,
    PRIMARY KEY (state, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
In Cassandra you model your tables around your queries; data denormalization and duplication are expected. Notice the clustering order: this way you can omit the ORDER BY in your queries.
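A usage sketch against the tables above; thanks to the clustering order the ORDER BY is optional, but it can still be stated explicitly:
-- Sketch: the two queries run against the denormalized tables.
SELECT * FROM TableB WHERE name = 'test';
SELECT * FROM TableC WHERE state = 'running' ORDER BY start_time DESC;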

Random order by performance

What is the best way to get top n rows by random order?
I use the query like:
Select top(10) field1,field2 .. fieldn
from Table1
order by checksum(newid())
The problem with the above query is that it keeps getting slower as the table size increases. It always does a full clustered index scan to find the top(10) rows in random order.
Is there any other better way to do it?
I have tested this and got better performance by changing the query.
The DDL for the table I used in my tests.
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Col1] [nvarchar](100) NOT NULL,
[Col2] [nvarchar](38) NOT NULL,
[Col3] [datetime] NULL,
[Col4] [nvarchar](50) NULL,
[Col5] [int] NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable]
(
[Col5] ASC
)
The table has 722888 rows.
First query:
select top 10
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col5,
T.Col5
from TestTable as T
order by newid()
Statistics for first query:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 13 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1700 ms.
Execution plan first query:
Second query:
select
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col5,
T.Col5
from TestTable as T
inner join (select top 10 ID
from TestTable
order by newid()) as C
on T.ID = C.ID
Statistics for second query:
SQL Server parse and compile time:
CPU time = 125 ms, elapsed time = 183 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 706 ms.
Execution plan second query:
Summary:
The second query is using the index on Col5 to order the rows by newid() and then it does a Clustered Index Seek 10 times to get the values for the output.
The performance gain is there because the index on Col5 is narrower than the clustered key and that causes fewer reads.
Thanks to Martin Smith for pointing that out.
One way to reduce the size of the necessary scan is to combine TABLESAMPLE with ORDER BY newid() in order to select a random number of rows from a selection of pages in the table, rather than scanning the entire table.
The idea is to calculate the average number of rows per page, then use TABLESAMPLE to select one random page of data for each row you want to output. Then you run the ORDER BY newid() query on just that subset of data. This approach is slightly less random than the original approach, but it is much better than using TABLESAMPLE alone and involves reading much less data from the table.
Unfortunately, the TABLESAMPLE clause does not accept a variable, so dynamic SQL is necessary in order to use a ROWS value based on the record size of the input table.
declare @factor int
select @factor = 8000 / avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed')
where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + ' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows
This question is 7 years old and had no accepted answer, but it ranked high when I searched for SQL performance on selecting random rows. None of the current answers seems to give a simple, quick solution for large tables, so I want to add my suggestion.
Assumptions:
The primary key is a numeric data type (typically int, incremented by 1 per row)
The primary key is the clustered index
The table has many rows, and only a few should be selected
I think this is fairly common, so it would help in a lot of cases.
Given a typical set of data, my suggestion is to:
Find the max and min id
Pick a random number in that range
Check if the number is a valid id in the table
Repeat as needed
These operations should all be very fast as they all are on the clustered index. Only at the end will the rest of the data be read, by selecting a set based on a list of primary keys so we only pull in the data we actually need.
Example (MS SQL):
--
-- First, create a table with some dummy data to select from
--
DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
Id int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(50) NULL,
[Content] text NULL
)
GO
DECLARE @I INT = 0
WHILE @I < 40
BEGIN
    INSERT INTO MainTable VALUES('Foo', 'bar')
    SET @I = @I + 1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))
-- Create a gap in IDs at the end
DELETE FROM MainTable
WHERE ID < 10
-- Create a gap in IDs in the middle
DELETE FROM MainTable
WHERE ID >= 20 AND ID < 30
-- We now have our "source" data we want to select random rows from
--
-- Then we select random data from our table
--
-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable
DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable
DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10
-- Make temp table to insert ids from
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)
WHILE (@NumberOfIdsTofind > 0)
BEGIN
    SET @RandomId = ROUND(((@MaxId - @MinId - 1) * RAND() + @MinId), 0)
    -- Verify that the random ID is a real id in the main table
    IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
    BEGIN
        -- Verify that the random ID has not already been inserted
        IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
        BEGIN
            -- It's a valid, new ID, add it to the list.
            INSERT INTO #Ids VALUES (@RandomId)
            SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
        END
    END
END
-- Select the random rows of data by joining the main table with our random Ids
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id
No, there is no way to improve the performance here. Since you want the rows in "random" order, indexes will be useless. You could, however, try ordering by newid() instead of its checksum, but that's just an optimization of the random ordering, not of the sort itself.
The server has no way of knowing that you want a random selection of 10 rows from the table. The query is going to evaluate the order by expression for every row in the table, since it's a computed value that cannot be determined by index values. This is why you're seeing a full clustered index scan.