Redshift latency, i.e. discrepancy between "execution time" and "total runtime"

I'm currently experimenting with Redshift and I've noticed that for a simple query like:
SELECT COUNT(*) FROM table WHERE column = 'value';
The execution time reported by Redshift is only 84ms, which is expected and pretty good with the table at ~33M rows. However, the total runtime, observed both on my local psql client and in Redshift's console UI, is 5 seconds. I've tried the query on both a single-node cluster and multi-node (2-node and 4-node) clusters.
In addition, when I try with more realistic, complicated queries, I can see similarly that the query execution itself is only ~500ms in a lot of cases, but the total runtime is ~7 seconds.
What causes this discrepancy? Is there any way to reduce this latency? Is there an internal table I can use to dive deeper into the time distribution across the entire end-to-end runtime?
I read about the cold-query performance improvements that Amazon recently introduced, but this latency seems to be there even on queries past the first cold one, as long as I alter the value in my WHERE clause. The latency is somewhat inconsistent, but it definitely still goes all the way up to 5 seconds.
-- Edited to give more details based on Bill Weiner's answer below --
There is no difference between doing SELECT COUNT(*) vs SELECT COUNT(column) (where column is a dist key to avoid skew).
There are absolutely zero other activities happening on the cluster because this is for exploration only. I'm the only one issuing queries and making connections to the DB, so there should be no queueing or locking delays.
The data resides in the Redshift database, with a normal schema and common-sense dist key and sort key. I have not added explicit compression to any columns, so everything is just AUTO right now.
Looks like compile time is the culprit!
STL_WLM_QUERY shows that for query 12599, this is the exec_start_time/exec_end_time:
-[ RECORD 1 ]------------+-----------------------------------------------------------------
userid | 100
xid | 14812605
task | 7289
query | 12599
service_class | 100
slot_count | 1
service_class_start_time | 2021-04-22 21:46:49.217
queue_start_time | 2021-04-22 21:46:49.21707
queue_end_time | 2021-04-22 21:46:49.21707
total_queue_time | 0
exec_start_time | 2021-04-22 21:46:49.217077
exec_end_time | 2021-04-22 21:46:53.762903
total_exec_time | 4545826
service_class_end_time | 2021-04-22 21:46:53.762903
final_state | Completed
est_peak_mem | 2097152
query_priority | Normal
service_class_name | Default queue
And from SVL_COMPILE, we have:
userid | xid | pid | query | segment | locus | starttime | endtime | compile
--------+----------+-------+-------+---------+-------+----------------------------+----------------------------+---------
100 | 14812605 | 30442 | 12599 | 0 | 1 | 2021-04-22 21:46:49.218872 | 2021-04-22 21:46:53.744529 | 1
100 | 14812605 | 30442 | 12599 | 2 | 2 | 2021-04-22 21:46:53.745711 | 2021-04-22 21:46:53.745728 | 0
100 | 14812605 | 30442 | 12599 | 3 | 2 | 2021-04-22 21:46:53.761989 | 2021-04-22 21:46:53.762015 | 0
100 | 14812605 | 30442 | 12599 | 1 | 1 | 2021-04-22 21:46:53.745476 | 2021-04-22 21:46:53.745503 | 0
(4 rows)
It shows that compilation ran from 21:46:49.218872 to 21:46:53.744529, i.e. the overwhelming majority of the 4,545 ms total exec time.
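For anyone doing the same digging, something along these lines (joining the two system tables shown above on the query id) should summarize the split between compile time and total execution time; 12599 is the query id from above:
SELECT w.query,
       DATEDIFF(ms, w.exec_start_time, w.exec_end_time) AS total_exec_ms,
       SUM(DATEDIFF(ms, c.starttime, c.endtime))        AS compile_ms
FROM stl_wlm_query w
JOIN svl_compile c ON c.query = w.query
WHERE w.query = 12599
GROUP BY w.query, w.exec_start_time, w.exec_end_time;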

There's a lot that could be taking up this time. Looking at more of the query and queuing statistics will help track down what is happening. Here are a few possibilities that I've seen be significant in the past:
Data return time. Since your query is an open select and could be returning a meaningful amount of data, moving this over the network to the requesting computer takes time.
Queuing delays. What else is happening on your cluster? Does your query start right away, or does it need to wait for a slot?
Locking delays. What else is happening on your cluster? Are data/tables changing? Is the data your query needs being committed elsewhere?
Compile time. Is this the first time this query is run?
Is the table external, i.e. stored in S3 as an external table? Or are you using the new RA3 instance type, where all the source data lives in S3? (I'm guessing you are not on RA3 nodes, but it doesn't hurt to ask.)
A place to start is STL_WLM_QUERY to see where the query is spending this extra time.
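For example, something along these lines gives a quick view of queue time versus execution time for recent queries (both totals are in microseconds, hence the division):
SELECT query,
       service_class,
       total_queue_time / 1000.0 AS queue_ms,
       total_exec_time  / 1000.0 AS exec_ms
FROM stl_wlm_query
ORDER BY exec_start_time DESC
LIMIT 20;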

Related

PostgreSQL index bloat ratio more than table bloat ratio and autovacuum_vacuum_scale_factor

Index bloat is reaching 57%, while table bloat is only 9% and autovacuum_vacuum_scale_factor is only 10%.
What is more surprising is that even the primary key has 57% bloat. My understanding is that since my primary key is an auto-incrementing, single-column key, once 10% of the table's tuples are dead, the primary key index should also have about 10% dead tuples.
Now, when autovacuum runs at 10% dead tuples, it cleans them up. The dead-tuple space then becomes bloat, which should be reused by new updates and inserts. But this isn't happening in my database; the bloat size keeps increasing.
FYI:
Index Bloat:
-[ RECORD 1 ]----+------------------
current_database | stackdb
schemaname       | public
tblname          | data_entity
idxname          | data_entity_pkey
real_size        | 2766848000
extra_size       | 1704222720
extra_ratio      | 61.5943745373797
fillfactor       | 90
bloat_size       | 1585192960
bloat_ratio      | 57.2923760177646
Table Bloat:
-[ RECORD 1 ]----+------------------
current_database | stackdb
schemaname       | public
tblname          | data_entity
real_size        | 10106732544
extra_size       | 1007288320
extra_ratio      | 9.96650812332014
fillfactor       | 100
bloat_size       | 1007288320
bloat_ratio      | 9.96650812332014
is_na            | f
Autovacuum Settings:
stackdb=> show autovacuum_vacuum_scale_factor;
autovacuum_vacuum_scale_factor
--------------------------------
0.1
(1 row)
stackdb=> show autovacuum_vacuum_threshold;
autovacuum_vacuum_threshold
-----------------------------
50
(1 row)
Note:
autovacuum is on
autovacuum is running successfully at defined intervals.
PostgreSQL is running version 10.6. The same issue has been found with version 12.x.
First: an index bloat of 57% is totally healthy. Don't worry.
Indexes become more bloated than tables, because the empty space cannot be reused as freely as it can be in a table. The table, also known as the “heap”, has no predetermined ordering: if a new row is written as the result of an INSERT or UPDATE, it ends up in the first page that has enough free space, so it is easy to keep bloat low if VACUUM does its job.
B-tree indexes are different: their entries have a certain ordering, so the database is not free to choose where to put a new entry. It may have to go into a page that is already full, causing a page split, while elsewhere in the index there are pages that are almost empty.
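If you want exact numbers rather than estimates from a bloat query, the pgstattuple extension can measure the index directly (the index name is taken from your output above):
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- exact physical statistics for the primary key index
SELECT index_size,
       avg_leaf_density,     -- how full the leaf pages really are
       leaf_fragmentation    -- share of leaf pages that are out of physical order
FROM pgstatindex('data_entity_pkey');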

Know which tables are affected by a connection

I want to know if there is a way to retrieve which tables are affected by the requests made from a connection in PostgreSQL 9.5 or higher.
The purpose is to have the information in a way that allows me to know which tables were affected, in which order, and in what way.
More precisely, something like this would suffice:
id | datetime | id_conn | id_query | table | action
---+----------+---------+----------+---------+-------
1 | ... | 2256 | 125 | user | select
2 | ... | 2256 | 125 | order | select
3 | ... | 2256 | 125 | product | select
(this would be the result of a SELECT query from user JOIN order JOIN product).
I know I can retrieve id_conn through pg_stat_activity, and I can see if there is a currently running query, but I can't find a "history" of past queries.
The final purpose is to debug the database when incoherent data is inserted into a table (due to a lack of constraints). Knowing which connection did the insert will lead me to the faulty script (as I already have the script name linked to the connection id).
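For reference, this is roughly what I can already get from pg_stat_activity (a point-in-time snapshot of each connection and its current statement, with no history):
SELECT pid,          -- the backend / connection id
       usename,
       state,
       query_start,
       query          -- only the currently running (or most recent) statement
FROM pg_stat_activity
WHERE datname = current_database();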

Refreshing materialized view CONCURRENTLY causes table bloat

In PostgreSQL 9.5 I've decided to create a materialized view "effects" and scheduled an hourly concurrent refresh, since I wanted it to be always available:
REFRESH MATERIALIZED VIEW CONCURRENTLY effects;
In the beginning everything worked well: the materialized view was refreshing, and disk space usage remained more or less constant.
The Issue
After some time though, disk usage started to linearly grow.
I've concluded that the reason for this growth is the materialized view and ran the query from this answer to get the following result:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+-------------+--------------+---------------
core_relation_size | 32224567296 | 30 GB | 21140
visibility_map | 991232 | 968 kB | 0
free_space_map | 7938048 | 7752 kB | 5
table_size_incl_toast | 32233504768 | 30 GB | 21146
indexes_size | 22975922176 | 21 GB | 15073
total_size_incl_toast_and_indexes | 55209426944 | 51 GB | 36220
live_rows_in_text_representation | 316152215 | 302 MB | 207
------------------------------ | | |
row_count | 1524278 | |
live_tuples | 676439 | |
dead_tuples | 1524208 | |
(11 rows)
Then, I found that the last time this table was autovacuumed was two days ago, by running:
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup desc;
I decided to manually call vacuum (VERBOSE) effects. It ran for about half an hour and produced the following output:
vacuum (VERBOSE) effects;
INFO: vacuuming "public.effects"
INFO: scanned index "effects_idx" to remove 129523454 row versions
DETAIL: CPU 12.16s/55.76u sec elapsed 119.87 sec
INFO: scanned index "effects_campaign_created_idx" to remove 129523454 row versions
DETAIL: CPU 19.11s/154.59u sec elapsed 337.91 sec
INFO: scanned index "effects_campaign_name_idx" to remove 129523454 row versions
DETAIL: CPU 28.51s/151.16u sec elapsed 315.51 sec
INFO: scanned index "effects_campaign_event_type_idx" to remove 129523454 row versions
DETAIL: CPU 38.60s/373.59u sec elapsed 601.73 sec
INFO: "effects": removed 129523454 row versions in 3865537 pages
DETAIL: CPU 59.02s/36.48u sec elapsed 326.43 sec
INFO: index "effects_idx" now contains 1524208 row versions in 472258 pages
DETAIL: 113679000 index row versions were removed.
463896 index pages have been deleted, 60386 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: index "effects_campaign_created_idx" now contains 1524208 row versions in 664910 pages
DETAIL: 121637488 index row versions were removed.
41014 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "effects_campaign_name_idx" now contains 1524208 row versions in 711391 pages
DETAIL: 125650677 index row versions were removed.
696221 index pages have been deleted, 28150 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "effects_campaign_event_type_idx" now contains 1524208 row versions in 956018 pages
DETAIL: 127659042 index row versions were removed.
934288 index pages have been deleted, 32105 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "effects": found 0 removable, 493 nonremovable row versions in 3880239 out of 3933663 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 666922 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 180.49s/788.60u sec elapsed 1799.42 sec.
INFO: vacuuming "pg_toast.pg_toast_1371723"
INFO: index "pg_toast_1371723_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_1371723": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
At this point I thought the problem was resolved and started thinking about what could be interfering with autovacuum. To be sure, I ran the space-usage query again, and to my surprise the numbers hadn't changed.
Only after I called REFRESH MATERIALIZED VIEW effects; (without CONCURRENTLY) did the size drop. Now the output of the size query was:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+-----------+--------------+---------------
core_relation_size | 374005760 | 357 MB | 245
visibility_map | 0 | 0 bytes | 0
free_space_map | 0 | 0 bytes | 0
table_size_incl_toast | 374013952 | 357 MB | 245
indexes_size | 213843968 | 204 MB | 140
total_size_incl_toast_and_indexes | 587857920 | 561 MB | 385
live_rows_in_text_representation | 316175512 | 302 MB | 207
------------------------------ | | |
row_count | 1524385 | |
live_tuples | 676439 | |
dead_tuples | 1524208 | |
(11 rows)
And everything went back to normal...
Questions
The problem is solved, but there's still a fair amount of confusion:
Could anyone please explain what was the issue I experienced?
How could I avoid this in the future?
First, let's explain the bloat
REFRESH MATERIALIZED VIEW CONCURRENTLY is implemented in src/backend/commands/matview.c, and the comment there is enlightening:
/*
* refresh_by_match_merge
*
* Refresh a materialized view with transactional semantics, while allowing
* concurrent reads.
*
* This is called after a new version of the data has been created in a
* temporary table. It performs a full outer join against the old version of
* the data, producing "diff" results. This join cannot work if there are any
* duplicated rows in either the old or new versions, in the sense that every
* column would compare as equal between the two rows. It does work correctly
* in the face of rows which have at least one NULL value, with all non-NULL
* columns equal. The behavior of NULLs on equality tests and on UNIQUE
* indexes turns out to be quite convenient here; the tests we need to make
* are consistent with default behavior. If there is at least one UNIQUE
* index on the materialized view, we have exactly the guarantee we need.
*
* The temporary table used to hold the diff results contains just the TID of
* the old record (if matched) and the ROW from the new table as a single
* column of complex record type (if matched).
*
* Once we have the diff table, we perform set-based DELETE and INSERT
* operations against the materialized view, and discard both temporary
* tables.
*
* Everything from the generation of the new data to applying the differences
* takes place under cover of an ExclusiveLock, since it seems as though we
* would want to prohibit not only concurrent REFRESH operations, but also
* incremental maintenance. It also doesn't seem reasonable or safe to allow
* SELECT FOR UPDATE or SELECT FOR SHARE on rows being updated or deleted by
* this command.
*/
So the materialized view is refreshed by deleting rows and inserting new ones from a temporary table. This can of course lead to dead tuples and table bloat, which is confirmed by your VACUUM (VERBOSE) output.
In a way, that's the price you pay for CONCURRENTLY.
Second, let's debunk the myth that VACUUM cannot remove the dead tuples
VACUUM will remove the dead rows, but it cannot reduce the bloat (that can be done with VACUUM (FULL), but that would lock the view just like REFRESH MATERIALIZED VIEW without CONCURRENTLY).
I suspect that the query you use to determine the number of dead tuples is just an estimate that gets the number of dead tuples wrong.
An example that demonstrates all that
CREATE TABLE tab AS SELECT id, 'row ' || id AS val FROM generate_series(1, 100000) AS id;
-- make sure autovacuum doesn't spoil our demonstration
CREATE MATERIALIZED VIEW tab_v WITH (autovacuum_enabled = off)
AS SELECT * FROM tab;
-- required for CONCURRENTLY
CREATE UNIQUE INDEX ON tab_v (id);
Use the pgstattuple extension to accurately measure table bloat:
CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 100000
tuple_len | 3788895
tuple_percent | 85.49
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 16724
free_percent | 0.38
Now let's delete some rows in the table, refresh and measure again:
DELETE FROM tab WHERE id BETWEEN 40001 AND 80000;
REFRESH MATERIALIZED VIEW CONCURRENTLY tab_v;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 60000
tuple_len | 2268895
tuple_percent | 51.19
dead_tuple_count | 40000
dead_tuple_len | 1520000
dead_tuple_percent | 34.3
free_space | 16724
free_percent | 0.38
Lots of dead tuples. VACUUM gets rid of these:
VACUUM tab_v;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 60000
tuple_len | 2268895
tuple_percent | 51.19
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 1616724
free_percent | 36.48
The dead tuples are gone, but now there is a lot of empty space.
I'm adding to Laurenz Albe's full answer above. There is another possibility for the bloating. Consider the following scenario:
You have a view that mostly changes very little (1,000,000 records, only 100 records change per refresh), and yet you still get 500,000 dead tuples. The reason for that can be NULLs in the indexed column.
As described in the answer above, when a view is refreshed concurrently, a copy is recreated and compared. The comparison uses the mandatory unique index, but what about NULLs? NULLs are never equal to each other in SQL. So if your unique index allows NULLs, the records that include NULLs, even if unchanged, will always be recreated and added to the table.
To fix this and remove the bloat, you can add an additional column that coalesces the nullable column to some never-used value (-1, to_timestamp(0), ...) and use that column in the unique index instead.
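A rough sketch of that idea (table and column names here are made up; the point is the extra never-NULL column that the unique index uses):
-- Hypothetical base table: "ts" can be NULL and used to be part of the unique index.
CREATE TABLE events (id integer, ts timestamptz);

CREATE MATERIALIZED VIEW effects AS
SELECT id,
       ts,
       COALESCE(ts, to_timestamp(0)) AS ts_key   -- never-NULL stand-in for the index
FROM events;

-- REFRESH ... CONCURRENTLY uses this unique index; because ts_key is never NULL,
-- unchanged rows with a NULL ts are no longer rewritten on every refresh.
CREATE UNIQUE INDEX ON effects (id, ts_key);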

What is the column limit for Spark Data Frames?

Our team is having a lot of issues with the Spark API, particularly with large-schema tables. We currently have a program written in Scala that uses the Apache Spark API to create two Hive tables from raw files. We have one particularly large raw data file that is giving us issues; it contains around ~4700 columns and ~200,000 rows.
Every week we get a new file that shows the updates, inserts and deletes that happened in the last week. Our program creates two tables – a master table and a history table. The master table is the most up-to-date version of this table, while the history table shows all changes (inserts and updates) that happened to this table and what changed. For example, if we have the following schema where A and B are the primary keys:
Week 1 Week 2
|-----|-----|-----| |-----|-----|-----|
| A | B | C | | A | B | C |
|-----|-----|-----| |-----|-----|-----|
| 1 | 2 | 3 | | 1 | 2 | 4 |
|-----|-----|-----| |-----|-----|-----|
Then the master table will now be
|-----|-----|-----|
| A | B | C |
|-----|-----|-----|
| 1 | 2 | 4 |
|-----|-----|-----|
And The history table will be
|-----|-----|-------------------|----------------|-------------|-------------|
| A | B | changed_column | change_type | old_value | new_value |
|-----|-----|-------------------|----------------|-------------|-------------|
| 1 | 2 | C | Update | 3 | 4 |
|-----|-----|-------------------|----------------|-------------|-------------|
This process works flawlessly for tables with shorter schemas. We have a table that has 300 columns but over 100,000,000 rows, and this code still runs as expected. For the larger-schema table, the process above runs for around 15 hours and then crashes with the following error:
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.generic.Growable$class.loop$1(Growable.scala:52)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
Here is a code example that takes around 4 hours to run for this larger table, but runs in 20 seconds for other tables:
var dataframe_result = dataframe1.join(broadcast(dataframe2), Seq(listOfUniqueIds:_*)).repartition(100).cache()
We have tried all of the following with no success:
Using broadcast hash joins (dataframe2 is smaller, dataframe1 is huge)
Repartitioning with different numbers of partitions, as well as not repartitioning at all
Caching the result of the dataframe (we originally did not do this).
What is causing this error and how can we fix it? The only difference between this problem table is that it has so many columns. Is there an upper limit to how many columns Spark can handle?
Note: We are running this code on a very large MapR cluster, and we tried giving the code 500 GB of RAM and it's still failing.

How to create a PostgreSQL partitioned sequence?

Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL? Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent of the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your projects table that uses a call to multinextval('projects_table', NEW.Project_ID) for insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...
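For what it's worth, here is a rough, untested sketch of steps 1 to 3 (all names are made up). It skips the dblink trick and runs in the current transaction, so concurrent inserts for the same partition will serialize on the row lock until commit; it also relies on ON CONFLICT, which needs PostgreSQL 9.5 or later:
-- One counter row per partition.
CREATE TABLE multiseq (
    partition_id integer PRIMARY KEY,
    next_val     bigint  NOT NULL
);

CREATE TABLE issues (
    project_id integer NOT NULL,
    issue      bigint,
    title      text,
    PRIMARY KEY (project_id, issue)
);

CREATE FUNCTION multinextval(p_partition integer) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    v bigint;
BEGIN
    -- Claim the next value: insert the counter row on first use, otherwise increment it.
    -- The counter row stays locked until the calling transaction commits.
    INSERT INTO multiseq (partition_id, next_val)
         VALUES (p_partition, 2)
    ON CONFLICT (partition_id)
         DO UPDATE SET next_val = multiseq.next_val + 1
      RETURNING next_val - 1 INTO v;
    RETURN v;
END;
$$;

-- Fill Issue from the per-project counter on insert.
CREATE FUNCTION set_issue_number() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.issue := multinextval(NEW.project_id);
    RETURN NEW;
END;
$$;

CREATE TRIGGER issues_issue_number
    BEFORE INSERT ON issues
    FOR EACH ROW EXECUTE PROCEDURE set_issue_number();
With this in place, INSERT INTO issues (project_id, title) VALUES (1, 'first'); gets Issue 1, and the first insert for project 2 starts again at Issue 1.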