I have two postgres databases (one for development, one for test). Both have the same structure.
I'm running this query on both (merges one table into another)
WITH moved_rows AS
(
DELETE FROM mybigtable_20220322
RETURNING *
)
INSERT INTO mybigtable_202203
SELECT * FROM moved_rows;
But I get slightly different results for EXPLAIN on each version.
Development (Postgres 13.1) -
Insert on mybigtable_202203 (cost=363938.39..545791.17 rows=9092639 width=429)
CTE moved_rows
-> Delete on mybigtable_20220322 (cost=0.00..363938.39 rows=9092639 width=6)
-> Seq Scan on mybigtable_20220322 (cost=0.00..363938.39 rows=9092639 width=6)
-> CTE Scan on moved_rows (cost=0.00..181852.78 rows=9092639 width=429)
Test (Postgres 14.1) -
Insert on mybigtable_202203 (cost=372561.91..558377.73 rows=0 width=0)
CTE moved_rows
-> Delete on mybigtable_20220322 (cost=0.00..372561.91 rows=9290791 width=6)
-> Seq Scan on mybigtable_20220322 (cost=0.00..372561.91 rows=9290791 width=6)
-> CTE Scan on moved_rows (cost=0.00..185815.82 rows=9290791 width=429)
The big difference is the first line: on Development I get rows=9092639 width=429, on Test I get rows=0 width=0.
All the tables have the same definitions, with the same indexes (not that they seem to be used), the query succeeds on both databases, the EXPLAIN indicates similar costs on both databases, and the tables on each database have a similar record count (just over 9 million rows).
In practice the difference is that on Development the query takes a few minutes, while on Test it takes a few hours.
Both databases were created with the same scripts, so they should be 100% identical; my guess is there's some small, subtle difference that's crept in somewhere. Any suggestions on what the difference might be or how to find it? Thanks
Update
Both of the tables being merged (on both databases) have been VACUUM ANALYZEd in similar timeframes.
I used fc to compare both DBs. There was ONE difference: on the development database the table was clustered on one of the indexes. I did similar clustering on the test table, but the results didn't change.
In response to the comment 'the plans are the same, only the estimated rows are different': this difference is the only clue I currently have to an underlying problem. My development database is on a 10-year-old server struggling with a lack of resources; my test database is on a brand new server. The former takes a few minutes to actually run the query, the latter takes a few hours. Whenever I post a question on the forum I'm always told 'start with the explain plan'.
This change was made with commit f0f13a3a08b27, which didn't make it into the release notes, since it was considered a bug fix:
Fix estimates for ModifyTable paths without RETURNING.
In the past, we always estimated that a ModifyTable node would emit the
same number of rows as its subpaths. Without a RETURNING clause, the
correct estimate is zero. Fix, in preparation for a proposed parallel
write patch that is sensitive to that number.
A remaining problem is that for RETURNING queries, the estimated width
is based on subpath output rather than the RETURNING tlist.
Reviewed-by: Greg Nancarrow gregn4422@gmail.com
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com
This change only affects EXPLAIN output, not what actually happens during data modifications.
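For anyone who wants to see the difference for themselves, here is a minimal sketch (throwaway temp tables, not the tables from the question): on PostgreSQL 14+ the top-level Insert is estimated as rows=0 width=0 when there is no RETURNING clause, while EXPLAIN ANALYZE still reports the rows that were actually moved.
CREATE TEMP TABLE demo_src AS SELECT g AS id FROM generate_series(1, 1000) g;
CREATE TEMP TABLE demo_dst (id int);
ANALYZE demo_src;
-- The Insert node shows rows=0 width=0 on 14+; its children keep real estimates
EXPLAIN
WITH moved_rows AS
(
DELETE FROM demo_src
RETURNING *
)
INSERT INTO demo_dst
SELECT * FROM moved_rows;
-- EXPLAIN ANALYZE on the same statement still reports the actual row count,
-- so the rows=0 estimate has no effect on what gets executed.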
Related
I have 2 tables in AWS Redshift. The details are as below:
a) impressions (to count the number of impressions of a particular ad)
Number of rows (170 million)
distribution key(ad_campaign)
sort key (created_on)
b) clicks (to count the number of clicks of a particular ad).
Number of rows (80 million)
distribution key(ad_campaign)
sort key (created_on)
I have a single DC1 Large cluster with 2 slices.
I am trying to run the below query
select impressions.offer_id, count(imp_cnt)
from bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id = clicks.offer_id
and date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1
This query takes more than 8 minutes to run. I think this is quite long considering the volume of data, which I don't feel is huge.
The query plan looks something like this:
XN HashAggregate (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
-> XN Hash Left Join DS_DIST_NONE (cost=179619.84..2778170875920.65 rows=17362469555 width=12)
Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))
-> XN Seq Scan on tblimpressionlog_opt impressions (cost=0.00..724967.36 rows=57997389 width=20)
Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)
-> XN Hash (cost=119746.56..119746.56 rows=11974656 width=12)
-> XN Seq Scan on tbltrackinglinklog_opt clicks (cost=0.00..119746.56 rows=11974656 width=12)
Can anyone provide me guidance on the correct usage of distribution keys and sort keys?
How should I design my query?
Table setup:
1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate the table with an interleaved sort key on (offer_id, created_on) (interleaved keys are supposed to give equal and order-independent weight to the included columns and are known to have a positive effect on larger tables).
2) If you join by weeks you can materialize your week column (create a physical column and populate it with the date_trunc output). That might save you some computation effort to get these values dynamically during the join. However, this operation is cheap, and if your table is already sorted by the timestamp column Redshift might already scan only the appropriate blocks. Also, if each offer runs for a short period of time (meaning the offer column has high cardinality and high correlation with the time column) you can use a compound sort key on (offer_id, week_created), which will allow a merge join, which is faster, and the aggregate will run fast as well.
3) If you don't use ad_campaign in other queries you can distribute both tables by offer_id. Having the join column as the distribution key is good practice, though it's unlikely that your query will benefit from this since you have a single node, and distribution style mostly affects multi-node setups.
All recommendations are just assumptions made without knowing the exact nature of your data; they require running benchmarks (create a table with the recommended configuration, copy the data, vacuum, analyze, run the same query at least 3 times and compare times with the original setup); a sketch of such a setup is below. I would appreciate it if you do this and post the results here.
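Something along these lines, purely as a sketch: the column names and types are guesses based on your query, so adjust them to your real DDL. This variant uses DISTKEY(offer_id) with a compound sort key (options 2 and 3 above); swap in INTERLEAVED SORTKEY (offer_id, created_on) if you want to try option 1 instead.
-- Hypothetical recreation of the impressions table; columns are assumed
create table bidsflyer.tblImpressionLog_new (
    offer_id     bigint,
    ad_campaign  bigint,
    imp_cnt      int,
    created_on   timestamp,
    week_created timestamp   -- materialized date_trunc('week', created_on)
)
diststyle key
distkey (offer_id)
compound sortkey (offer_id, week_created);
-- Reload the data, then re-sort and refresh statistics before benchmarking
insert into bidsflyer.tblImpressionLog_new
select offer_id, ad_campaign, imp_cnt, created_on, date_trunc('week', created_on)
from bidsflyer.tblImpressionLog_Opt;
vacuum full bidsflyer.tblImpressionLog_new to 100 percent;
analyze bidsflyer.tblImpressionLog_new;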
Regarding the query itself, you can replace the FULL JOIN with a plain JOIN because you don't need it. FULL JOIN should be used when you want to get not only the intersection of both tables but also impressions that don't have related clicks and vice versa. That doesn't seem to be the case here, because you are filtering by impressions.created_on and grouping by impressions.offer_id, so all you need is just the intersection. Replacing the FULL JOIN with a simple JOIN might also improve query performance. If you want to see the offers that have zero clicks you can use a LEFT JOIN instead.
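For example (same tables and predicates as in your question, just with an inner join):
select impressions.offer_id, count(imp_cnt)
from bidsflyer.tblImpressionLog_Opt impressions
join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id = clicks.offer_id
and date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1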
A merge join is faster than a hash join, so you should try to achieve a merge join. Your sort key looks okay, but is your data actually sorted? Redshift does not automatically keep a table's rows sorted by the sort key, and if the data is unsorted there is no way for Redshift to perform a merge join on your table. After running a full vacuum on the table, Redshift will be able to perform a merge join.
select * from svv_table_info where "table" = 'impressions';
select * from svv_table_info where "table" = 'clicks';
Use the queries above to check the amount of unsorted data you have in your tables.
Run a full vacuum on both of your tables. Depending on the amount of unsorted data this might take a while and use a lot of your cluster's resources.
VACUUM impressions to 100 percent
VACUUM clicks to 100 percent
If I’ve made a bad assumption please comment and I’ll refocus my answer.
explain analyze shows that postgres will use index scanning for my query that fetches rows and performs filtering by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
However, running the same query but with a different date filter, '2016-04-14 05:27:51.039', shows that postgres will run the query using a seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does postgres decide on what to use, specifically when performing filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
The manual:
Most queries retrieve only a fraction of the rows in a table, due to
WHERE clauses that restrict the rows to be examined. The planner thus
needs to make an estimate of the selectivity of WHERE clauses, that
is, the fraction of rows that match each condition in the WHERE
clause. The information used for this task is stored in the
pg_statistic system catalog. Entries in pg_statistic are updated by
the ANALYZE and VACUUM ANALYZE commands, and are always approximate
even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
The more rows Postgres expects to find, the more likely it will switch to a sequential scan, which is cheaper for retrieving large portions of a table.
Generally, the planner moves from index scan to bitmap index scan to sequential scan as the number of rows expected to be retrieved grows.
For your particular example, the important statistic is histogram_bounds, which gives Postgres a rough idea of how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:
The engine uses the index to find the first value meeting the condition.
The engine then loops over the values, finishing when the condition is no longer true.
For each value in the index, the engine then looks up the data on the data page.
In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
When the engine does a full table scan it:
Starts with the first record on the first page
Does the comparison and accepts or rejects the record
Continues sequentially through all data pages
There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
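If you want to see where that balance sits for your own table, a quick diagnostic sketch (session-local, for investigation only, not a setting to leave on) is to discourage sequential scans and compare the planner's cost estimates for the same predicate:
explain select * from tbl t where updated > '2016-04-14 05:27:51.039';
set enable_seqscan = off;  -- heavily penalizes sequential scans for this session
explain select * from tbl t where updated > '2016-04-14 05:27:51.039';
reset enable_seqscan;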
I'm brand spanking new to relational databases, and need help setting up a basic working environment for querying a large (pre-existing) database. I've connected to our remote server through PGAdmin, but all my basic queries are extremely slow.
Query
SELECT to_char(created, 'YYYY-MM-DD'), user_id
FROM transactions
WHERE type = 'purchase'
AND created > NOW() AT TIME ZONE 'US/Mountain' - INTERVAL '1 month'
ORDER BY created;
EXPLAIN(BUFFERS, ANALYZE) output:
Index Scan using payments_transaction_created_42e34d6ca1e04ffe_uniq
on payments_transaction (cost=0.44..339376.18 rows=481811 width=24) (actual time=2.643..49422.733 rows=511058 loops=1)
Index Cond: (created > (timezone('US/Mountain'::text, now()) - '1 mon'::interval))
Filter: ((type)::text = 'purchase'::text)
Rows Removed by Filter: 955691
Buffers: shared hit=405597 read=295625 written=764
Planning time: 0.111 ms
Execution time: 49569.324 ms
In my limited knowledge, the execution time seems much too long to me.
What steps should I take to create the most efficient environment possible? Does creating a local copy of the database mean faster queries? Are there other factors that could lead to such inefficiencies?
Remember, I'm brand new to databases, so there are no answers too simple.
Your query seems to be using an index on transactions(created). You are returning around 0.5M rows while discarding about twice as many.
Depending on the distribution of values across the type column, you may benefit from adding an index on both your text and timestamp columns:
CREATE INDEX ON transactions(type, created);
A rule of thumb when adding indexes is to put columns used with equality operators first, followed by range conditions such as dates. It might actually speed up your query a lot, though as I've mentioned earlier that depends on the value distribution.
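If (an assumption on my part, not something stated in your question) 'purchase' is the only type value you ever filter on, a partial index is another option worth benchmarking:
CREATE INDEX ON transactions (created) WHERE type = 'purchase';
It only contains the rows matching the predicate, so it stays smaller than a full composite index.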
Remember to update table statistics after creating an index using:
ANALYZE transactions;
Testing on a local copy of the database does mean faster processing because you are not sending packets over the network; instead everything is processed locally. But this shouldn't matter much with your query, and it's better to always test in an environment as comparable to production as you can.
I'm trying to diagnose a slow query, using EXPLAIN ANALYZE. I'm new to the command so I've read http://www.postgresql.org/docs/9.3/static/using-explain.html . The query plan uses a "CTE scan", but I don't know what that is, compared to, say, a sequential scan - and more importantly, what a CTE scan means in general for query performance.
A "CTE scan" is a sequential scan of the materialized results of a CTE term (a named section like "blah" in a CTE like WITH blah AS (SELECT ...).
Materialized means that PostgreSQL has calculated the results and turned them into a temporary store of rows, it isn't just using the CTE like a view.
The main implication is that selecting a small subset from a CTE term and discarding the rest can do a lot of wasted work, because the parts you discard must still be fully calculated.
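A minimal sketch of that wasted-work pattern (table and column names are made up), on a version like yours where CTEs are always materialized: the whole CTE result is computed and stored even though only 10 rows are consumed, and the CTE Scan then reads from that temporary result.
WITH expensive AS (
    SELECT *
    FROM big_table          -- hypothetical large table
    ORDER BY some_column
)
SELECT *
FROM expensive
LIMIT 10;                   -- the remaining rows were still fully computed and materialized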
For details see a recent blog post I wrote on the topic.
There are a few discussions here and there (including official posts on the Postgres site) about slow count(*) prior to version 9.2; somehow I did not find a satisfying answer.
Basically I had Postgres 9.1 installed, and I observed slow count(*) on queries as simple as
select count(*) from restaurants;
on tables with 100k+ records. The average request takes around 850ms. I assumed that this was the symptom people have been talking about for slow counts on Postgres 9.1 and below, since Postgres 9.2 has new features like index-only scans. I wanted to experiment with this by taking the same dataset from 9.1 and putting it on 9.2. I ran the count statement, and it still gives as bad a result as on 9.1.
explain analyze select count(*) from restaurants;
------------------------------------------------------------------
Aggregate (cost=23510.35..23510.36 rows=1 width=0) (actual time=979.960..979.961 rows=1 loops=1)
-> Seq Scan on restaurants (cost=0.00..23214.88 rows=118188 width=0) (actual time=0.050..845.097 rows=118188 loops=1)
Total runtime: 980.037 ms
Can anyone suggest a feasible solution to this problem? Do I need to configure anything on Postgres to enable the feature?
P.S. A WHERE clause doesn't help in my case either.
See the index only scans wiki entries:
What types of queries may be satisfied by an index-only scan?
is count(*) much faster now?
Why isn't my query using an index-only scan?
In particular, I quote:
It is important to realise that the planner is concerned with
minimising the total cost of the query. With databases, the cost of
I/O typically dominates. For that reason, "count(*) without any
predicate" queries will only use an index-only scan if the index is
significantly smaller than its table. This typically only happens when
the table's row width is much wider than some indexes'.
See also the discussion of VACUUM and ANALYZE for maintaining the visibility map. Essentially, you probably want to make VACUUM more aggressive, and you'll want to manually VACUUM ANALYZE the table after you first load it.
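As a sketch of those maintenance steps (using the table name from your question), refresh the visibility map and statistics, then check whether the planner actually chooses an index-only scan:
VACUUM ANALYZE restaurants;
EXPLAIN ANALYZE SELECT count(*) FROM restaurants;
-- Look for an "Index Only Scan ... on restaurants" node in the output; if you
-- still see a Seq Scan, the available indexes may simply not be small enough
-- relative to the table for the planner to prefer them (see the quote above).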
This happens due to PostgreSQL's MVCC implementation. In short, in order to count the table rows, PostgreSQL needs to ensure that they exist. But given the multiple snapshots/versions of each record, PostgreSQL is unable to summarize the whole table directly. So instead PostgreSQL reads each row, performing a sequential scan.
How to fix this?
There are different approaches to fixing this, including a trigger-based mechanism (sketched further below). If it is acceptable for you to use an estimated count of the rows, you can check pg_class.reltuples:
SELECT reltuples::bigint AS estimate FROM pg_class WHERE relname = '<table_name>';
Reltuples:
[It is the] Number of rows in the table. This is only an estimate used by the planner. — PostgreSQL: Documentation: pg_class
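And here is a rough sketch of the trigger-based mechanism mentioned above (all names are made up): keep an exact row count in a side table that is maintained on every insert and delete, then read that instead of running count(*).
CREATE TABLE restaurants_rowcount (n bigint NOT NULL);
INSERT INTO restaurants_rowcount SELECT count(*) FROM restaurants;

CREATE OR REPLACE FUNCTION restaurants_count_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE restaurants_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE restaurants_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER restaurants_count
AFTER INSERT OR DELETE ON restaurants
FOR EACH ROW EXECUTE PROCEDURE restaurants_count_trg();

-- Reading the count is then a single-row lookup:
SELECT n FROM restaurants_rowcount;
Note that this serializes concurrent writers on the counter row, so it trades some write throughput for a cheap, exact count.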
More info:
https://medium.com/@vinicius_edipo/handling-slow-counting-with-elixir-postgresql-f5ff47f3d5b9
http://www.varlena.com/GeneralBits/120.php
https://www.postgresql.org/docs/current/static/catalog-pg-class.html