Best AWS RDS machine instance for specified ETL

I would like to choose a PostgreSQL RDS instance for an ETL that I am working on. The idea is to load CSV files from S3 into Postgres every 4 hours and once a week. I run around 10 parallel loads of files ranging from KB to GB in size. Currently, on a t3.small, loading a partitioned folder of 30 GB takes more than 20 hours (and in most test runs it failed completely), while smaller folders of 1 GB can take up to 2 hours.
I am working on optimising the load from the data engineering side, for example by avoiding deletes and creating indexes after loading.
I am also experimenting with changing some parameters, as suggested in the official documentation.
Still, I am certain that this instance is not up to the job. Can someone please suggest the bare minimum instance that I should use for my case? Note that I don't have a large volume of data: every 4 hours I want to copy up to approximately 7 GB, and once a week 50 GB.
Execution plan for one of the copies from S3 to Postgres (a 1 GB file):
Result (cost=0.00..0.01 rows=1 width=32) (actual time=1225272.639..1225272.650 rows=1 loops=1)
Buffers: shared hit=48984552 read=82716 dirtied=1087394 written=1618757
I/O Timings: read=725497.871 write=11315.914
Planning Time: 0.031 ms
Execution Time: 1225449.639 ms
I am not sure how to interpret these values, though, as it is not a typical query plan that shows which parts take the longest.
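One way to read these numbers, as a rough sketch rather than a diagnosis: the buffer counts are in blocks, so they can be converted to bytes, and the I/O timings can be compared with the total execution time. The 8 kB block size is the PostgreSQL default and is an assumption about this instance. The variable names are illustrative.

```python
# Figures copied from the plan above. BLOCK_SIZE is an assumption:
# 8 kB is the PostgreSQL default, but the instance may differ.
BLOCK_SIZE = 8192  # bytes per buffer

shared_hit = 48_984_552       # buffer accesses served from shared_buffers
shared_read = 82_716          # buffers read from outside shared_buffers
io_read_ms = 725_497.871      # time spent waiting on reads
io_write_ms = 11_315.914      # time spent waiting on writes
execution_ms = 1_225_449.639  # total execution time

GIB = 1024 ** 3
hit_gib = shared_hit * BLOCK_SIZE / GIB
read_gib = shared_read * BLOCK_SIZE / GIB
read_share = io_read_ms / execution_ms
write_share = io_write_ms / execution_ms

print(f"buffer hits:  {hit_gib:.1f} GiB of cached buffer accesses")
print(f"buffer reads: {read_gib:.2f} GiB fetched")
print(f"read wait:    {read_share:.1%} of execution time")
print(f"write wait:   {write_share:.1%} of execution time")
```

Under that assumption, loading a 1 GB file involved several hundred GiB worth of buffer accesses, and more than half of the roughly 20 minutes was spent waiting on reads. That suggests the time goes into read I/O and repeated page access rather than planning, although the plan alone does not show the cause.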

Related

Postgres EXPLAIN ANALYSE Planning Time slow for first query per connection

When running the query the first time in psql, it is a bit slow. The second time it's a lot faster since the Planning Time goes down substantially.
> EXPLAIN ANALYSE SELECT * FROM public.my_custom_function(10, 10, 'Personal');
The first time:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on my_custom_function (cost=0.25..10.25 rows=1000 width=32) (actual time=4.900..4.901 rows=1 loops=1)
Planning Time: 30.870 ms
Execution Time: 3.410 ms
(3 rows)
All subsequent queries:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on my_custom_function (cost=0.25..10.25 rows=1000 width=32) (actual time=4.900..4.901 rows=1 loops=1)
Planning Time: 0.620 ms
Execution Time: 4.920 ms
(3 rows)
This is the case any time I make a new connection to the DB: the first call has considerable Planning Time and all others are fine.
Additional Context
Deployment: Docker
Postgres version: 12
SQL logic: Uses indexed JOINs and WHERE lookups. I know the logic there is fast and solid, and it's not the query itself that needs to be optimised.
Whether I run the query by itself or via the function, the same Planning Time issue remains.
Problem:
I have an HTTP API making a connection per request, calling the function once and then returning. Hence every API request has the performance of a non-planned query.
Question:
How can I make this query be planned once and never again? Maybe using a PREPARE statement?

While there might be ways to speed this up (if we could see your function), fundamentally if you are very sensitive to performance, then you need to choose some technology that doesn't make one connection per request. Like mod_perl or FastCGI or maybe pgbouncer.

We have intermittent slow queries. Is our PostgreSQL struggling with memory?

I am investigating a few slow queries and I need some help reading the data I got.
We have this one particular query which uses an index and runs pretty fast most of the time, however from time to time it runs slow (700ms+), not sure why.
Limit (cost=8.59..8.60 rows=1 width=619) (actual time=5.653..5.654 rows=1 loops=1)
-> Sort (cost=8.59..8.60 rows=1 width=619) (actual time=5.652..5.652 rows=1 loops=1)
Sort Key: is_main DESC, id
Sort Method: quicksort Memory: 25kB
-> Index Scan using index_pictures_on_imageable_id_and_imageable_type on pictures (cost=0.56..8.58 rows=1 width=619) (actual time=3.644..5.587 rows=1 loops=1)
Index Cond: ((imageable_id = 12345) AND ((imageable_type)::text = 'Product'::text))
Filter: (tag = 30)
Rows Removed by Filter: 2
Planning Time: 1.699 ms
Execution Time: 5.764 ms
If I understand that correctly, almost the entire cost of the query is in the index scan, right? That sounds good to me, so why does the same query run pretty slowly sometimes?
I started to think that maybe our instance is not able to keep the entire index in memory, so it reads from disk from time to time. That would explain the slow queries. However, that is way over my head. Does that make sense?
That table has around 15 million rows and 5156 MB in size. Index is 1752 MB. BTW, it is a btree index.
Our PostgreSQL is on a "Highly available" Google Cloud SQL instance. It has 2 vCPUs and 7.5 GB of RAM. Our entire database is around 35 GB in size.
CPU consumption almost never goes beyond 40%. It usually settles around 20-30%.
Checking instance memory graph, I noticed that consumption grows until ~4 GB, then it drops down ~700 MB and it starts growing again. That is a repetitive pattern.
In theory, the instance has 7.5 GB of RAM, but I don't know if all of it is supposed to be available for PostgreSQL. Anyway, ~3.5 GB just for OS sounds pretty high, right?
Memory graph
I read that these configs are important, so throwing them here (Cloud SQL defaults):
shared_buffers | 318976
temp_buffers | 1024
work_mem | 4096
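For readers comparing these settings with the table and index sizes, here is a small conversion sketch. It assumes the numbers are raw pg_settings values, where shared_buffers and temp_buffers are counted in 8 kB blocks and work_mem in kB. Those are the PostgreSQL defaults, not something the post confirms.

```python
# Assumption: values are raw pg_settings numbers, where shared_buffers and
# temp_buffers count 8 kB blocks and work_mem counts kB (PostgreSQL defaults).
BLOCK_KB = 8

settings_kb = {
    "shared_buffers": 318_976 * BLOCK_KB,
    "temp_buffers": 1_024 * BLOCK_KB,
    "work_mem": 4_096,
}

# Convert every setting to MiB for easier comparison with table sizes.
settings_mib = {name: kb / 1024 for name, kb in settings_kb.items()}

for name, mib in settings_mib.items():
    print(f"{name:15s} {mib:8.0f} MiB")
```

Read this way, shared_buffers is about 2.4 GiB, which is larger than the 1752 MB index but smaller than the 5156 MB table.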
Considering that we have a bunch of other tables and indexes, is it reasonable to assume that if one index alone is 1.7 GB, 7.5 GB for the entire instance is too low?
Is there any way I can assert whether we have a memory issue or not?
I appreciate your help.

Three things that can help you:
1. Use pg_prewarm. This function loads a table or index into memory ahead of time, which drastically reduces disk access and helps performance a lot. The limitation is resources: not every table fits in memory, and prewarming is not recommended for tables that are small or not constantly accessed. The prewarmed data is not pinned in memory, so every time your database is restarted you need to run pg_prewarm() again.
Reference: https://www.postgresql.org/docs/current/pgprewarm.html
2. CLUSTER the table on your index. A table can be clustered on only one index. Clustering physically reorders the table rows to follow the index, so reading rows that are adjacent in index order becomes much faster.
CLUSTER [VERBOSE] table_name [ USING index_name ]
Reference: https://www.postgresql.org/docs/current/sql-cluster.html
3. Run VACUUM ANALYZE on the table periodically. With the ANALYZE option, PostgreSQL collects statistics about the table's contents, which the planner uses to optimize your queries.

I think it is more of a memory problem, as you say. Judging from your graph, most of the time your database is using the 4 GB of memory assigned to it, and when you run your query Postgres has to go to disk.
I suppose your query runs faster when usage is under the memory limit. Another thing to consider is that some time ago your database may not have been as big as it is now, so the default memory assignment (4 GB) was enough.
You can change the memory assigned to Postgres by configuring the database flags, in particular the work_mem flag. I suggest assigning 2 GB of extra memory and checking the results. If you see that your database again uses 100% of the memory, consider increasing both the instance memory and the memory assigned to the database.

explain analyze - cost to actual time relation

Usually, when improving my queries, I see a matching improvement in both cost and actual time when running EXPLAIN ANALYZE on the before and after queries.
However, in one case, the before query reports
"Hash Join (cost=134.06..1333.57 rows=231 width=70)
(actual time=115.349..115.650 rows=231 loops=1)"
<cut...>
"Planning time: 4.060 ms"
"Execution time: 115.787 ms"
and the after reports
"Hash Join (cost=4.63..1202.61 rows=77 width=70)
(actual time=0.249..0.481 rows=231 loops=1)"
<cut...>
"Planning time: 2.079 ms"
"Execution time: 0.556 ms"
So as you can see, the costs are similar but actual and real execution times are vastly different, regardless of the order in which I run the tests.
Using Postgres 8.4.
Can anyone clear up my understanding as to why the cost does not show an improvement?

There isn't much information in the question, but a few pointers may help others who come here searching on the topic.
The cost is a numerical estimate based on table statistics, which are calculated when ANALYZE is run on the tables involved in the query. If a table has never been analyzed, the plan and the cost may be far from optimal. The query plan is affected by the table statistics.
The actual time is the time actually taken to run the query. Again, this may not correlate well with the cost, depending on how fresh the table statistics are. The plan is chosen based on the current table statistics, but the actual execution may encounter data conditions that differ from what the statistics say, resulting in a skewed execution time.
The point to note here is that table statistics affect the plan and the cost estimate, whereas the plan and the actual data conditions affect the actual time. So, as a best practice, always run ANALYZE on the tables before working on query optimization.
A few notes:
analyze <table> - updates the statistics of the table.
vacuum analyze <table> - removes stale versions of the updated records from the table and then updates the statistics of the table.
explain <query> - only generates a plan for the query using statistics of the tables involved in the query.
explain (analyze) <query> - generates a plan for the query using existing statistics of the tables involved in the query, and also runs the query collecting actual run time data. Since the query is actually run, if the query is a DML query, then care should be taken to enclose it in begin and rollback if the changes are not intended to be persisted.

Cost meaning
The costs are in an arbitrary unit. A common misunderstanding is that they are in milliseconds or some other unit of time, but that’s not the case.
The cost units are anchored (by default) to a single sequential page read costing 1.0 units (seq_page_cost).
Each row processed adds 0.01 (cpu_tuple_cost)
Each non-sequential page read adds 4.0 (random_page_cost).
There are many more constants like this, all of which are configurable.
Startup cost
The first number you see after cost= is known as the “startup cost”. This is an estimate, in cost units, of how much work it takes to fetch the first row.
The startup cost of an operation includes the cost of its children.
Total cost
The number after the startup cost and the two dots is known as the “total cost”. This estimates, in the same units, how much work it takes to return all the rows.
Example
QUERY PLAN |
--------------------------------------------------------------+
Sort (cost=66.83..69.33 rows=1000 width=17) |
Sort Key: username |
-> Seq Scan on users (cost=0.00..17.00 rows=1000 width=17)|
We can see that the total cost of the Seq Scan operation is 17.00, and the startup cost of the Seq Scan is 0.00. For the Sort operation, the total cost is 69.33, which is not much more than its startup cost (66.83).
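As a minimal worked example of how such numbers arise, the sketch below applies the default constants listed earlier to the Seq Scan node. The row count comes from the plan, while the page count of 7 is an assumption chosen because it reproduces the 17.00 shown. The function name is illustrative.

```python
# Planner constants at their PostgreSQL defaults, as listed above.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01


def seq_scan_total_cost(pages: int, rows: int) -> float:
    """Estimated total cost of an unfiltered sequential scan."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


# rows=1000 comes from the plan; pages=7 is an assumed table size chosen
# because it reproduces the 17.00 shown for the Seq Scan node.
seq_cost = seq_scan_total_cost(pages=7, rows=1000)
print(f"Seq Scan total cost: {seq_cost:.2f}")

# The Sort node's startup cost already contains its child's total cost.
sort_startup, sort_total = 66.83, 69.33
print(f"Sort work before first row: {sort_startup - seq_cost:.2f}")
print(f"Sort work to emit rows:     {sort_total - sort_startup:.2f}")
```

Most of the Sort node's cost is incurred before it can return its first row, which is why its startup and total costs are so close.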
Actual time meaning
The “actual time” values are in milliseconds of real time and are produced by EXPLAIN's ANALYZE option. Note: EXPLAIN ANALYZE actually executes the query, so be careful with UPDATE and DELETE.
EXPLAIN ANALYZE could be used to compare the estimated number of rows with the actual rows returned by each operation.
Helping the planner estimate more accurately
Gather better statistics
Tables also change over time, so tuning the autovacuum settings to make sure it runs frequently enough for your workload can be very helpful.
If you’re having trouble with bad estimates for a column with a skewed distribution, you may benefit from increasing the amount of information Postgres gathers by using the ALTER TABLE SET STATISTICS command, or even the default_statistics_target for the whole database.
Another common cause of bad estimates is that, by default, Postgres will assume that two columns are independent. You can fix this by asking it to gather correlation data on two columns from the same table via extended statistics.
Tune the constants it uses for the calculations
Assuming you’re running on SSDs, you’ll likely at minimum want to tune your setting of random_page_cost. This defaults to 4, which is 4x more expensive than the seq_page_cost we looked at earlier. This ratio made sense on spinning disks, but on SSDs it tends to penalize random I/O too much.
Source:
PG doc - using explain
Postgres explain cost

Identify queries leading to seq_scan shown in pg_stat_user_tables

I have verified all the queries using EXPLAIN, and they show an index scan node in the plan.
But when I look at the stats in pg_stat_user_tables, I see a non-zero value of seq_scan.
It is possible that PostgreSQL is doing a bitmap heap scan rather than just an index scan, but I am not completely sure.
I have the following questions:
Does a bitmap heap scan count as seq_scan in the stats above?
How to identify the queries that perform the sequential scan? The traffic to the database is non-uniform, hence monitoring pg_stat_activity is not helpful.

A bitmap index scan is counted under idx_scan.
Finding the query that performs the sequential scan is harder.
Let's assume that the table is fairly big, so that a sequential scan takes a certain time (for this answer, I assume at least 500 ms, but that may be different of course). If the sequential scan is very short, you wouldn't and shouldn't worry.
Now put auto_explain into shared_preload_libraries and add the following parameters to postgresql.conf:
auto_explain.log_min_duration = 500
auto_explain.log_nested_statements = on
Then restart the server.
Now you will get the execution plans of all statements exceeding a duration of 500 ms in the PostgreSQL log, and you should be able to find the query.
By the way, sequential scans are not always something you should worry about. If they occur only rarely, that is usually fine. It might be that you are hunting your own database backup that uses pg_dump! Only expensive sequential scans that happen often are a problem.

PostgreSQL - long running SELECT on Big XML - Data in TOAST

I am currently analyzing why the application installed on top of the PostgreSQL database we are using is sometimes so slow. The log files show that queries against a specific table have extremely long execution times.
I further found out, that it is one column on the table, which contains XML documents (ranging from a few bytes to one entry with ~7MB XML data), which is the cause of the slow query.
There are 1100 Rows in the table and a
SELECT * FROM mytable
has the same query execution time of 5 Seconds as
SELECT [XML-column-only] FROM mytable
But in contrast, a
SELECT [id-only] FROM mytable
has a query execution time of only 0.2s!
I couldn't produce any noticeable differences depending on the settings (the usual ones, work_mem, shared_buffers,...), there is even almost no difference in comparison between our production server (PostgreSQL 9.3) and running it in a VM on PostgreSQL 9.4 on my workstation PC.
Disk monitoring shows almost no I/O activity for the query.
So the last thing I went to analyze was the Network I/O.
Of course, as mentioned before, there is a lot of data in this XML column. The total size for the 1100 rows (XML column only) is 36 MB. Divided by the 5 seconds of running time, this is a mere 7.2 MB/s of network transfer, which equals around 60 Mbit/s. That is a little bit slow, as we are all on Gbit Ethernet, aren't we? :D Windows Task Manager also shows a network utilization of 6% during the runtime of the query, which agrees with the manual calculation above.
Furthermore, the query execution time is almost linear to the amount of XML data in the table. For testing I deleted the top 10% rows with the largest amount of data on the XML column, and the execution time (now ~18 instead of 36MB to transfer) dropped to 2.5s instead of 5s.
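The throughput arithmetic above can be checked with a short sketch. It assumes decimal units (1 MB = 10^6 bytes) and a nominal 1000 Mbit/s link. The function name is illustrative.

```python
# Figures from the question; decimal units (1 MB = 10**6 bytes) are assumed.
def throughput_mbit_per_s(megabytes: float, seconds: float) -> float:
    """Average transfer rate in Mbit/s."""
    return megabytes / seconds * 8


LINK_MBIT = 1000  # nominal Gigabit Ethernet, an assumption about the network

full = throughput_mbit_per_s(36, 5.0)     # all 1100 rows
reduced = throughput_mbit_per_s(18, 2.5)  # after deleting the largest rows

for label, rate in (("full table", full), ("reduced table", reduced)):
    print(f"{label:13s} {rate:.1f} Mbit/s = {rate / LINK_MBIT:.1%} of the link")
```

Both runs work out to the same rate, a little under 6% of the link. That fits the observation that run time scales linearly with the amount of XML data and suggests the link itself is not saturated.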
So, to get to the point: what are my options on the database administration side (we cannot touch or change the application itself) to make the simple SELECT for this XML data noticeably faster? Is there any bottleneck I haven't taken into account yet? Or is this normal PostgreSQL behaviour?
EDIT: I use pgAdmin III. The Execution plan (explain (analyze, verbose) select * from gdb_items) shows a much shorter total runtime, than the actual query and the statement duration entry in the log:
Seq Scan on sde.gdb_items (cost=0.00..181.51 rows=1151 width=1399) (actual time=0.005..0.193 rows=1151 loops=1)
Output: objectid, uuid, type, name, physicalname, path, url, properties, defaults, datasetsubtype1, datasetsubtype2, datasetinfo1, datasetinfo2, definition, documentation, iteminfo, shape
Total runtime: 0.243 ms
