Beginner PostgreSQL: Setting up environment for querying large pre-existing database

I'm brand spanking new to relational databases, and need help setting up a basic working environment for querying a large (pre-existing) database. I've connected to our remote server through pgAdmin, but all my basic queries are extremely slow.
Query:
SELECT to_char(created, 'YYYY-MM-DD'), user_id
FROM transactions
WHERE type = 'purchase'
AND created > NOW() AT TIME ZONE 'US/Mountain' - INTERVAL '1 month'
ORDER BY created;
EXPLAIN (BUFFERS, ANALYZE) output:
Index Scan using payments_transaction_created_42e34d6ca1e04ffe_uniq
on payments_transaction (cost=0.44..339376.18 rows=481811 width=24) (actual time=2.643..49422.733 rows=511058 loops=1)
Index Cond: (created > (timezone('US/Mountain'::text, now()) - '1 mon'::interval))
Filter: ((type)::text = 'purchase'::text)
Rows Removed by Filter: 955691
Buffers: shared hit=405597 read=295625 written=764
Planning time: 0.111 ms
Execution time: 49569.324 ms
With my limited knowledge, the execution time seems much too long to me.
What steps should I take to create the most efficient environment possible? Does creating a local copy of the database mean faster queries? Are there other factors that could lead to such inefficiencies?
Remember, I'm brand new to databases, so there are no answers too simple.

Your query is already using an index on transactions(created). You are returning around 0.5M rows while discarding almost twice as many.
Depending on the distribution of values in the type column, you may benefit from an index on both the text and the timestamp column:
CREATE INDEX ON transactions(type, created);
A rule of thumb when building multicolumn indexes is to put columns tested for equality first, followed by columns used in range conditions (such as dates). It might actually speed up your query a lot - though, as mentioned earlier, that depends on the value distribution.
Remember to update table statistics after creating an index using:
ANALYZE transactions;
Testing on a local copy of the database does mean faster round trips because you are not sending packets over the network; everything is processed locally instead. But that shouldn't be much of a factor for this query, and it is always better to test in an environment as comparable to production as you can.
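To verify that the planner actually picks up the new index, you could re-run the query with timing and buffer details (a sketch, reusing the query from the question):
EXPLAIN (ANALYZE, BUFFERS)
SELECT to_char(created, 'YYYY-MM-DD'), user_id
FROM transactions
WHERE type = 'purchase'
AND created > now() AT TIME ZONE 'US/Mountain' - INTERVAL '1 month'
ORDER BY created;
If the new index is used, it should appear by name in the plan, ideally with far fewer rows removed by the filter.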


Postgres - same query two different EXPLAIN plans

I have two postgres databases (one for development, one for test). Both have the same structure.
I'm running this query on both (merges one table into another)
WITH moved_rows AS
(
DELETE FROM mybigtable_20220322
RETURNING *
)
INSERT INTO mybigtable_202203
SELECT * FROM moved_rows;
But I get slightly different results for EXPLAIN on each version.
Development (Postgres 13.1) -
Insert on mybigtable_202203 (cost=363938.39..545791.17 rows=9092639 width=429)
CTE moved_rows
-> Delete on mybigtable_20220322 (cost=0.00..363938.39 rows=9092639 width=6)
-> Seq Scan on mybigtable_20220322 (cost=0.00..363938.39 rows=9092639 width=6)
-> CTE Scan on moved_rows (cost=0.00..181852.78 rows=9092639 width=429)
Test (Postgres 14.1) -
Insert on mybigtable_202203 (cost=372561.91..558377.73 rows=0 width=0)
CTE moved_rows
-> Delete on mybigtable_20220322 (cost=0.00..372561.91 rows=9290791 width=6)
-> Seq Scan on mybigtable_20220322 (cost=0.00..372561.91 rows=9290791 width=6)
-> CTE Scan on moved_rows (cost=0.00..185815.82 rows=9290791 width=429)
The big difference is the first line, on Development I get rows=9092639 width=429 on Test I get rows=0 width=0
All the tables have the same definitions with the same indexes (not that they seem to be used), the query succeeds on both databases, the EXPLAIN indicates similar costs on both databases, and the tables on each database have a similar record count (just over 9 million rows).
In practice the difference is that on Development the query takes a few minutes, on Test it takes a few hours.
Both databases were created with the same scripts, so they should be 100% identical; my guess is there's some small, subtle difference that's crept in somewhere. Any suggestion on what the difference might be or how to find it? Thanks
Update
Both the tables being merged (on both databases) have been VACUUM ANALYZEd in similar timeframes.
I used fc to compare both DBs. There was ONE difference: on the development database the table was clustered on one of the indexes. I did similar clustering on the test table, but the results didn't change.
In response to the comment 'the plans are the same, only the estimated rows are different': this difference is the only clue I currently have to an underlying problem. My development database is on a 10-year-old server struggling with lack of resources; my test database is on a brand new server. The former takes a few minutes to actually run the query, the latter takes a few hours. Whenever I post a question on the forum I'm always told 'start with the explain plan'.
This change was made with commit f0f13a3a08b27, which didn't make it into the release notes, since it was considered a bug fix:
Fix estimates for ModifyTable paths without RETURNING.
In the past, we always estimated that a ModifyTable node would emit the
same number of rows as its subpaths. Without a RETURNING clause, the
correct estimate is zero. Fix, in preparation for a proposed parallel
write patch that is sensitive to that number.
A remaining problem is that for RETURNING queries, the estimated width
is based on subpath output rather than the RETURNING tlist.
Reviewed-by: Greg Nancarrow gregn4422#gmail.com
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com
This change only affects EXPLAIN output, not what actually happens during data modifications.
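If you want to see where the actual time goes on each server (which this estimate change cannot explain), one approach (not from the original answer, just a sketch) is to run the statement under EXPLAIN (ANALYZE, BUFFERS) inside a transaction you roll back, so the rows are not actually moved:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
WITH moved_rows AS
(
DELETE FROM mybigtable_20220322
RETURNING *
)
INSERT INTO mybigtable_202203
SELECT * FROM moved_rows;
ROLLBACK;
Comparing the buffer and timing numbers (rather than the cost estimates) on both servers should point at the real difference.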

How do I improve date-based query performance on a large table?

This is related to 2 other questions I posted (sounds like I should post this as a new question) - the feedback helped, but I think the same issue will come back the next time I need to insert data. Things were still running slowly, which forced me to temporarily remove some of the older data so that only 2 months' worth remained in the table that I'm querying.
Indexing strategy for different combinations of WHERE clauses incl. text patterns
How to get date_part query to hit index?
Giving further detail this time - hopefully it will help pinpoint the issue:
PG version 10.7 (running on Heroku)
Total DB size: 18.4GB (this contains 2 months worth of data, and it will grow at approximately the same rate each month)
15GB RAM
Total available storage: 512GB
The largest table (the one that the slowest query is acting on) is 9.6GB (it's the largest chunk of the total DB) - about 10 million records
Schema of the largest table:
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_feb2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-02-01 00:00:00'::timestamp without time zone AND datelocal < '2019-03-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_mar2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-03-01 00:00:00'::timestamp without time zone AND datelocal < '2019-04-01 00:00:00'::timestamp without time zone;
CREATE INDEX reportimpression_jan2019_index ON reportimpression(datelocal timestamp_ops) WHERE datelocal >= '2019-01-01 00:00:00'::timestamp without time zone AND datelocal < '2019-02-01 00:00:00'::timestamp without time zone;
Slow query:
SELECT
date_part('hour', datelocal) AS hour,
SUM(CASE WHEN gender = 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender = 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE
datelocal >= '3-1-2019' AND
datelocal < '4-1-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)
The date range in this query will generally be for an entire month (it accepts user input from a web based report) - as you can see, I tried creating an index for each month's worth of data. That helped, but as far as I can tell, unless the query has recently been run (putting the results into the cache), it can still take up to a minute to run.
Explain analyze results:
Finalize GroupAggregate (cost=1035890.38..1035897.86 rows=1361 width=24) (actual time=3536.089..3536.108 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1035890.38..1035891.06 rows=1361 width=24) (actual time=3536.083..3536.087 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1035735.34..1035876.21 rows=1361 width=24) (actual time=3535.926..3579.818 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1034735.34..1034740.11 rows=1361 width=24) (actual time=3532.917..3532.933 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Index Scan using reportimpression_mar2019_index on reportimpression (cost=0.09..1026482.42 rows=3301168 width=17) (actual time=0.045..2132.174 rows=2801158 loops=2)
Planning time: 0.517 ms
Execution time: 3579.965 ms
I wouldn't think 10 million records would be too much to handle, especially given that I recently bumped up the PG plan that I'm on to try to throw resources at it, so I assume that the issue is still just either my indexes or my queries not being very efficient.
A materialized view is the way to go for what you outlined. Querying past months of read-only data works without refreshing it. You may want to special-case the current month if you need to cover that, too.
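A minimal sketch of what such a materialized view could look like (the view name and the month/hour granularity are assumptions, not from your question):
CREATE MATERIALIZED VIEW reportimpression_hourly AS
SELECT date_trunc('month', datelocal) AS month,
date_part('hour', datelocal) AS hour,
SUM(views) FILTER (WHERE gender = 'male') AS male,
SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
GROUP BY 1, 2;
-- refresh after each data load:
REFRESH MATERIALIZED VIEW reportimpression_hourly;
Your report query then reads from the small pre-aggregated view instead of scanning millions of rows.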
The underlying query can still benefit from an index, and there are two directions you might take:
First off, partial indexes like you have now won't buy much in your scenario; not worth it. If you collect many more months of data and mostly query by month (and add / drop rows by month), table partitioning might be an idea; then you have your indexes partitioned automatically, too. I'd consider Postgres 11 or even the upcoming Postgres 12 for this, though.
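For illustration, a rough sketch of declarative range partitioning by month (table and partition names are assumptions; from Postgres 11, an index created on the partitioned parent is created on each partition automatically):
CREATE TABLE reportimpression_part (
datelocal timestamp,
gender text,
views integer
-- ... remaining columns as in the original table
) PARTITION BY RANGE (datelocal);
CREATE TABLE reportimpression_2019_03 PARTITION OF reportimpression_part
FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');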
If your rows are wide, create an index that allows index-only scans. Like:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal, views, gender);
Related:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
Or INCLUDE additional columns in Postgres 11 or later:
CREATE INDEX reportimpression_covering_idx ON reportimpression(datelocal) INCLUDE (views, gender);
Else, if your rows are physically sorted by datelocal, consider a BRIN index. It's extremely small and probably about as fast as a B-tree index for your case. (But being so small it will stay cached much easier and not push other data out as much.)
CREATE INDEX reportimpression_brin_idx ON reportimpression USING BRIN (datelocal);
You may be interested in CLUSTER or pg_repack to physically sort table rows. pg_repack can do it without exclusive locks on the table and even without a btree index (required by CLUSTER). But it's an additional module not shipped with the standard distribution of Postgres.
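For example, clustering on the covering index suggested above (a sketch; note that CLUSTER rewrites the table and holds an exclusive lock while it runs):
CLUSTER reportimpression USING reportimpression_covering_idx;
ANALYZE reportimpression;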
Related:
Optimize Postgres deletion of orphaned records
How to reclaim disk space after delete without rebuilding table?
Your execution plan seems to be doing the right thing.
Things you can do to improve, in descending order of effectiveness:
Use a materialized view that pre-aggregates the data
Don't use a hosted database, use your own iron with good local storage and lots of RAM.
Use only one index instead of several partial ones. This is not primarily performance advice (the query will probably not be measurably slower unless you have a lot of indexes), but it will ease the management burden; see the sketch below.
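A sketch of that consolidation (the new index name is an assumption):
DROP INDEX reportimpression_jan2019_index, reportimpression_feb2019_index, reportimpression_mar2019_index;
CREATE INDEX reportimpression_datelocal_idx ON reportimpression (datelocal);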

How does postgres decide whether to use index scan or seq scan?

explain analyze shows that Postgres will use an index scan for my query that fetches rows and filters by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
However, running the same query with a different date filter, '2016-04-14 05:27:51.039', shows that Postgres will run the query using a seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does Postgres decide what to use, specifically when filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
The manual:
Most queries retrieve only a fraction of the rows in a table, due to
WHERE clauses that restrict the rows to be examined. The planner thus
needs to make an estimate of the selectivity of WHERE clauses, that
is, the fraction of rows that match each condition in the WHERE
clause. The information used for this task is stored in the
pg_statistic system catalog. Entries in pg_statistic are updated by
the ANALYZE and VACUUM ANALYZE commands, and are always approximate
even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
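For instance, the planner's rough row count for the table can be inspected directly (a sketch, using the table from the question):
SELECT reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE relname = 'tbl';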
The more rows Postgres expects to find, the more likely it is to switch to a sequential scan, which is cheaper for retrieving large portions of a table.
Generally, the progression is index scan -> bitmap index scan -> sequential scan as the number of expected rows grows.
For your particular example, the important statistic is histogram_bounds, which give Postgres a rough idea how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:
The engine uses the index to find the first value meeting the condition.
The engine then loops over the values, finishing when the condition is no longer true.
For each value in the index, the engine then looks up the data on the data page.
In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
When the engine does a full table scan it:
Starts with the first record on the first page
Does the comparison and accepts or rejects the record
Continues sequentially through all data pages
There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
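If you want to see the plan the planner rejected and its estimated cost, you can temporarily discourage sequential scans in your session (a sketch for experimentation only, not something to leave enabled):
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM tbl t WHERE updated > '2016-04-14 05:27:51.039';
RESET enable_seqscan;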

PostgreSQL bitmap heap scan takes long time

I have a Case table which has almost 2,500,000 rows and 176 columns (mostly varchar)
select
count(ca.id)
from
salesforce.case ca
where
ca.accountid = '001i000000E'
AND ca.createddate BETWEEN current_date - interval '6 months' AND current_date
I am trying to get the number of records created in the last 6 months for a specific account. But when I look at EXPLAIN ANALYZE, I found that it first index-scans all the records created in that timespan, index-scans all records for that account, and then does a bitmap heap scan (which takes a long time).
https://explain.depesz.com/s/8Lje
Is there any way we can make it faster?
You can create an index on salesforce.case (accountid, createddate) that can be used for both conditions.
Given the numbers in your explain output, it may be that PostgreSQL still uses a bitmap index scan because it thinks that is faster. You can try lowering random_page_cost, see if different plans are chosen, and test which one is the fastest.
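A sketch of both suggestions combined (the index name is an assumption, and 1.1 is just a value to experiment with):
CREATE INDEX case_accountid_createddate_idx ON salesforce.case (accountid, createddate);
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(ca.id)
FROM salesforce.case ca
WHERE ca.accountid = '001i000000E'
AND ca.createddate BETWEEN current_date - interval '6 months' AND current_date;
RESET random_page_cost;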

Slow index scan

I have a table with an index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains about 400 million rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So, the first execution has a big actual time - how can I speed up the first execution?
UPDATE
Sorry, how can I accelerate the first query?
Why is the index scan so slow?
The difference in execution time is probably because the second time through, the table/index data from the first run of the query is in the shared buffers cache, so the subsequent run of the query takes less time, since it doesn't have to go all the way to disk for that information.
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
Another factor can be if there are long-ish idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC this can also slow even index-based queries down quite a bit.
Also, what the query planner is expecting for the results vs. the actual results is quite different, so you may want to run ANALYZE on the table beforehand to update the stats.
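A sketch of how to check for dead tuples and then refresh the statistics (table name from the question):
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'participates';
VACUUM ANALYZE participates;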
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and check out the options for tuning the server to use more memory. This can speed up your search, but there is no guarantee (depending on the answer above)!
2.) Move a part of your tables/indexes to a more powerful tablespace, for example a tablespace based on SSDs.
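For example (a sketch; the tablespace name and path are assumptions):
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';
ALTER INDEX index_participates_on_player_id SET TABLESPACE fast_ssd;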