I'm trying to figure out the different types of sort keys in Amazon Redshift, and I encountered a strange warning here that is not explained:
Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
And yet, in their own example, Amazon uses an interleaved key on a date column with good performance.
So, my question is: what's the explanation for this warning, and should I take it seriously? More precisely, is there a problem with using an interleaved key on a timestamp column?
I think it might have been explained later on when they describe issues around vacuuming/reindexing:
When tables are initially loaded, Amazon Redshift analyzes the
distribution of the values in the sort key columns and uses that
information for optimal interleaving of the sort key columns. As a
table grows, the distribution of the values in the sort key columns
can change, or skew, especially with date or timestamp columns. If the
skew becomes too large, performance might be affected.
So if that is the only reason, then it just means you will have increased maintenance on the index.
From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
As you add rows to a sorted table that already contains data, the
unsorted region grows, which has a significant effect on performance.
The effect is greater when the table uses interleaved sorting,
especially when the sort columns include data that increases
monotonically, such as date or timestamp columns.
The key point in the original quote is not that the data is a date or timestamp; it's that it increases "monotonically", which in this context presumably means increasing sequentially, such as an event timestamp or an ID number.
A date (not timestamp) column as an interleaved sort key makes sense when you know that, on average, X rows are processed every day and you are going to filter on it; if you are not going to use it for filtering, leave it out.
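For illustration, here is a minimal sketch (table and column names are hypothetical) of the two styles of sort key DDL in Redshift; the interleaved variant only pays off when different queries filter on different subsets of the sort columns, rather than always on the leading one:

-- Hypothetical table; a compound sort key favours filters on the leading column(s).
CREATE TABLE events_compound (
    tenant_id  INT,
    event_date DATE,
    metric     DOUBLE PRECISION
)
COMPOUND SORTKEY (tenant_id, event_date);

-- An interleaved sort key gives equal weight to each column, at the cost of
-- heavier VACUUM REINDEX maintenance as monotonically increasing values skew.
CREATE TABLE events_interleaved (
    tenant_id  INT,
    event_date DATE,
    metric     DOUBLE PRECISION
)
INTERLEAVED SORTKEY (tenant_id, event_date);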
Also a note on vacuum: while the VACUUM process is in progress, it needs temporary space to complete the task of sorting and then merging the data in chunks. Cancelling the VACUUM process mid-flight will leave that extra space unreclaimed, so if any VACUUM has ever been cancelled in your cluster, this can account for the space increase. See https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html#r_VACUUM_usage_notes; point 3 (the last point) is of particular interest.
In my case the tables ended up growing very rapidly compared to the number of rows inserted, and I had to build automated table re-creation using a deep copy.
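For reference, the deep copy follows this general shape (a sketch with placeholder table and column names, not the exact script):

-- Hypothetical deep copy: rebuild the table so all rows land in the sorted region.
CREATE TABLE my_table_new (LIKE my_table INCLUDING DEFAULTS);

INSERT INTO my_table_new
SELECT * FROM my_table
ORDER BY my_sort_col;       -- insert in sort key order

DROP TABLE my_table;
ALTER TABLE my_table_new RENAME TO my_table;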
A timestamp column goes down to hours, minutes, seconds and milliseconds, which makes the data costly to sort. Data with millisecond granularity requires far more zone-map entries to keep track of where the data starts and ends within the dataset. The same is not true of a date column in the sort key: a date column needs far fewer zone-map entries to track the data residing in the table.
Related
My team is looking at moving our non-partitioned table with ~1TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. a partition for every day), would this act in a similar way to an index?
We would only be doing queries on a maximum resolution of one day.
I am reluctant to add an index, as we've tried this in the past and it never completed (probably because we didn't turn off writes; turning off writes for an extended period isn't really an option).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12 you can probably go up to a few thousand; in earlier versions you will hit performance problems sooner.
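As a rough sketch of what the question describes, using v12's declarative partitioning (all names are placeholders):

-- Hypothetical parent table, range-partitioned by day on a timestamp.
CREATE TABLE measurements (
    id          bigint      NOT NULL,
    recorded_at timestamptz NOT NULL,
    payload     jsonb
) PARTITION BY RANGE (recorded_at);

-- One partition per day; for day-level queries, partition pruning already
-- narrows the scan to a single partition, so no index on recorded_at is needed.
CREATE TABLE measurements_2020_01_01 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');

-- Mass deletion along partition boundaries is just a DROP (or DETACH).
DROP TABLE measurements_2020_01_01;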
I have the following table 'medicion' with the following fields:
id_variable[int](PK),
id_departamento[int](PK),
fecha [date](PK),
valor [number].
I want to get the minimum, maximum and average of valor, grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
How can I optimize this query? Should I create a new index only on id_variable, or is the default index valid for this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there were, say, 50 columns, or TEXT columns that make the table file huge. In that case, reading the whole table just to average a few ints would mean grinding through tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep that small column set in cache. But that is not the case here; you've only got small columns, so this reason is out.
...and of course slightly slower UPDATEs, since the index needs to be updated. Also, the index needs to be cached, so it's going to use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash aggregate, which is super fast anyway, it's not so useful.
Now, if you have relatively few distinct values of id_variable... say, few enough to fit into a hash aggregate, which can be a sizable amount depending on your work_mem... then it'll be difficult to beat.
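If you do want to test the covering-index route anyway, a sketch would be something like this (the index name is made up; whether an index-only scan actually happens also depends on the visibility map being reasonably up to date):

-- Hypothetical covering index so the aggregate reads only these two columns.
CREATE INDEX medicion_idvar_valor_idx ON medicion (id_variable, valor);

-- Check whether the planner prefers it over a seq scan + hash aggregate.
EXPLAIN
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;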
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this tradeoff is only worthwhile if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
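A sketch of the materialized-view option (the view name is made up; when and how you refresh it is up to you):

-- Hypothetical materialized view holding the pre-computed statistics.
CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       AVG(valor) AS avg_valor,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor
FROM medicion
GROUP BY id_variable;

-- Stale until refreshed; re-run whenever the data has changed enough to matter.
REFRESH MATERIALIZED VIEW medicion_stats;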
In an attempt to handle custom fields for specific objects in a multi-tenant dimensional DW, I created an ultra-wide denormalized dimension table (hundreds of columns, near the hard-coded column limit) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even an innocent update query on a single column for a handful of records takes approximately 20 seconds. (Which is kind of surprising, as I would guess it shouldn't be such a problem on a columnar database.)
Any pointers on how to modify the design for better reporting, going from the normalized source table (one user has multiple different attributes, one attribute per line) to a denormalized one (one row per user with generic columns, different for each tenant)?
Or has anyone tried transposing (pivoting) normalized records into a denormalized view (table) in Redshift? I am worried about performance.
It's probably important to think about how Redshift stores data and then implements updates on that data.
Each column is stored in its own sequence of 1MB blocks, and the content of those blocks is determined by the SORTKEY. So, however many rows of the sort key's values fit in 1MB determines how many (and which) values are in the corresponding 1MB block for every other column.
When you ask Redshift to UPDATE a row it actually writes a new version of the entire block for all columns that correspond to that row - not just the block(s) which change. If you have 1,600 columns that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.
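On the pivoting part of the question: the usual approach is conditional aggregation, and rebuilding the wide table with CREATE TABLE AS rather than updating it in place also sidesteps the block-rewrite cost described above. A minimal sketch, with hypothetical table and attribute names (each attribute still needs its own hard-coded column):

-- Hypothetical pivot of a normalized table (user_id, attr_name, attr_value)
-- into one row per user with generic columns.
CREATE TABLE user_attrs_wide AS
SELECT user_id,
       MAX(CASE WHEN attr_name = 'attr1' THEN attr_value END) AS attr1,
       MAX(CASE WHEN attr_name = 'attr2' THEN attr_value END) AS attr2,
       MAX(CASE WHEN attr_name = 'attr3' THEN attr_value END) AS attr3
FROM user_attrs
GROUP BY user_id;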
I am trying to load very large volumes of data into Redshift, into a single table that will be too cost-prohibitive to vacuum once loaded. To avoid having to vacuum this table, I am loading data using the COPY command from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However, after loading the first two files, I find that Redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would Redshift not recognize the new incoming data as already sorted?
Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
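For reference, the check is roughly this (the table name here is a placeholder):

-- Report the unsorted percentage and row count for the table being loaded.
SELECT "table", unsorted, tbl_rows
FROM svv_table_info
WHERE "table" = 'my_table';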
Official Answer by Redshift Support:
Here is what we say officially:
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into 2
regions:
sorted, and
unsorted
As long as you load data in sorted key order, even though the data is
in the unsorted region, it is still in sort key order, so there is no
need for VACUUM to ensure the data is sorted. A VACUUM is still needed
to move the data from the unsorted region to the sorted region,
however this is less critical as the data in the unsorted region is
already sorted.
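In practice that means something like the following sketch (bucket, IAM role, and table names are placeholders): keep loading the pre-sorted files in sort key order, and run an occasional VACUUM to merge the already-sorted unsorted region into the sorted region.

-- Hypothetical load; each COPY appends to the unsorted region,
-- but the appended data remains in sort key order.
COPY my_table
FROM 's3://my-bucket/presorted/batch-001/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
CSV;

-- Less critical than usual, since the unsorted region is already sorted.
VACUUM SORT ONLY my_table;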
Storing sorted data in Amazon Redshift tables has an impact in several areas:
How data is displayed when ORDER BY is not used
The speed with which data is sorted when using ORDER BY (eg sorting is faster if it is mostly pre-sorted)
The speed of reading data from disk, based upon Zone Maps
While you might want to choose a SORTKEY that improves sort speed (eg Change order of sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.
I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.
Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.
Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.
Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.
Queries will run the fastest in Redshift when it only has to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as timestamps where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY even though they are used for different purposes.
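As a sketch with hypothetical names, the pattern looks like this: the sort key matches the column used in range-restricted WHERE clauses, so zone maps let most blocks be skipped.

-- Hypothetical table sorted by timestamp so zone maps can prune by date range.
CREATE TABLE page_views (
    view_ts TIMESTAMP,
    user_id BIGINT,
    url     VARCHAR(2048)
)
DISTKEY (user_id)
SORTKEY (view_ts);

-- Only blocks whose zone map overlaps this day need to be read from disk.
SELECT COUNT(*)
FROM page_views
WHERE view_ts >= '2020-06-01' AND view_ts < '2020-06-02';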
Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:
Table ID
Column ID
Block Number
Minimum Value of the field stored in this block
Maximum Value of the field stored in this block
Sorted/Unsorted flag
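A query along these lines shows those zone maps for one table (the table name is a placeholder; note that for non-numeric columns the stored min/max values are internal numeric encodings):

-- Inspect per-block min/max values and the sorted/unsorted flag for one table.
SELECT b.col, b.blocknum, b.minvalue, b.maxvalue, b.unsorted
FROM stv_blocklist b
JOIN svv_table_info t ON t.table_id = b.tbl
WHERE t."table" = 'my_table'
ORDER BY b.col, b.blocknum;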
I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.
You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.
I would like some suggestions on how to design a table that gets something like 10k to 50k inserts a day and needs to respond quickly to selects... Should I use indexes, or would the overhead cost be too great?
Edit: I'm not worried about the transaction volume... this is actually an assignment... and I need to figure out a design for a table that "must respond very well to selects not based on the primary key, knowing that this table will receive an enormous amount of inserts day in, day out".
Definitely. At least the primary key, foreign keys, and then whatever you need for reporting; just don't overdo it. 10k-50k inserts a day is not a problem. If it were, I don't know, a million inserts, then you could start thinking about separate tables, data dictionaries and whatnot, but for your needs I wouldn't worry.
Even if you did 50,000 per day and your day was an 8 hour work day, that would still be less than two inserts per second on average. I suppose you might get peaks that are much higher than that, but in general, SQL Server can handle much higher transaction rates than what you seem to have.
If your table is fairly wide (lots of columns or a few really long ones) then you might want to consider clustering by a surrogate (IDENTITY) column. Your volumes aren't enough to make for a bad hot-spot at the end of the table. In combination with this, use indexes for any keys needed for data consistency (i.e. FK's) and retrieval (PK, natural key, etc). Be careful about setting the fill factor on your indexes and consider rebuilding them during a periodic down-time window.
If your table is fairly narrow, then you could possibly consider clustering on the natural key, but you'll have to make sure that your response time expectations can be met.
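A sketch of the wide-table layout (all names and the fill factor are illustrative, not a prescription):

-- Hypothetical wide table clustered on a surrogate IDENTITY key,
-- with nonclustered indexes only where consistency and retrieval need them.
CREATE TABLE dbo.Orders (
    OrderId     INT IDENTITY(1,1) NOT NULL,
    CustomerId  INT NOT NULL,
    OrderNumber VARCHAR(30) NOT NULL,
    CreatedAt   DATETIME2 NOT NULL,
    Notes       NVARCHAR(MAX) NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);

-- FK lookup support and the natural key, with free space to slow fragmentation.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId) WITH (FILLFACTOR = 90);
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_OrderNumber
    ON dbo.Orders (OrderNumber) WITH (FILLFACTOR = 90);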
The best insert rate comes from having the PK sorted the same as the insert order and no other indexes. 10-50 thousand a day is not that much. If there are only inserts, then I don't see any downside to dirty reads.
If you are optimizing for select then use row level locking for inserts.
Measure index fragmentation. Defragment the indexes on a regular basis with a proper fill factor. The fill factor determines how fast the indexes fragment and how often you need to defragment.
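A sketch of that maintenance routine (object names and the fill factor are placeholders):

-- Measure fragmentation for one table's indexes.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Rebuild (or reorganize, for light fragmentation) during a maintenance window.
ALTER INDEX ALL ON dbo.Orders REBUILD WITH (FILLFACTOR = 90);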