Redshift and ultra wide tables - amazon-redshift

In attempt to handle custom fields for specific objects in multi-tenant dimensional DW I created ultra wide denormalized dimension table (hundreds of columns, hard coded limit of column) that Redshift is not liking too much ;).
user1|attr1|attr2...attr500
Even innocent update query on single column on handful of records takes approximately 20 seconds. (Which is kind of surprising as I would guess it shouldn't be such a problem on columnar database.)
Any pointer how to modify design for better reporting from normalized source table (one user has multiple different attributes, one attribute is one line) to denormalized (one row per user with generic columns, different for each of the tenants)?
Or anyone tried to perform transposing (pivoting) of normalized records into denormalized view (table) in Redshift? I am worried about performance.

Probably important to think about how Redshift stores data and then implements updates on that data.
Each column is stored in it's own sequence of 1MB blocks and the content of those blocks is determined by the SORTKEY. So, how ever many rows of the sort key's values can fit in 1MB is how many (and which) values are in corresponding 1MB for all other columns.
When you ask Redshift to UPDATE a row it actually writes a new version of the entire block for all columns that correspond to that row - not just the block(s) which change. If you have 1,600 columns that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.

Related

Why not to use timestamp with Interleaved Sortkey?

I'm trying to figure out the different types of sortkeys in Amazon Redshift and I encountered a strange warning here, which is not explained:
Important: Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
And yet, in their own example, Amazon uses interleaved key on a date column with good performance.
So, my question is - what's the explanation to this warning and should I take it seriously? More precisely - is there a problem with using interleaved key over a timestamp column?
I think it might have been explained later on when they describe issues around vacuuming/reindexing:
When tables are initially loaded, Amazon Redshift analyzes the
distribution of the values in the sort key columns and uses that
information for optimal interleaving of the sort key columns. As a
table grows, the distribution of the values in the sort key columns
can change, or skew, especially with date or timestamp columns. If the
skew becomes too large, performance might be affected.
So if that is the only reason, then it just means you will have increased maintenance on index.
From https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
As you add rows to a sorted table that already contains data, the
unsorted region grows, which has a significant effect on performance.
The effect is greater when the table uses interleaved sorting,
especially when the sort columns include data that increases
monotonically, such as date or timestamp columns.
The key point in the original quote is not that that data is a date or timestamp, it's that it increases "monotonically", which in this context presumably means increasing sequentially such as an event timestamp or an Id number.
The date(not timestamp) column as a interleaved sort key makes sense when you know in an average X number of rows are processed everyday and you are going to filter based on it, if you are not going to use it then leave it out.
Also a note on vacuum - when the VACUUM process is in progress, it needs temporary space to be able to complete the task by sorting and then merging the data in chunks. Cancelling the VACUUM process mid flight will cause extra spaces to be not reclaimed so if for some reason any Vacuum has ever been cancelled in your cluster this can be accounted to the space increase. See the link https://docs.aws.amazon.com/redshift/latest/dg/r_VACUUM_command.html#r_VACUUM_usage_notes and point 3 the last point is of particular interest.
In my case the tables ended up growing very rapidly compared to the amount of rows inserted and had to build an auto table creation using deep copy
Timestamp column may go to hours, minutes, seconds and milliseconds which is costly to sort the data. A data with milliseconds granularity is like having too much degree of zone maps to keep a record where the data starts and ends within dataset. Same is not true with date column in sort key. Date column has less degress of zone maps to keep a track of data residing in the table.

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create dimensional model on a flat OLTP tables (not in 3NF).
There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across and each node typically does a portion of the work required to answer each query. There for the way data is distributed across the nodes becomes important, the aim is usually to have the data distributed in a fairly even manner so that each node does about equal amounts of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attibutes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for causal users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Redshift doesn't recognize newly loaded data as pre-sorted

I am trying to load very large volumes of data into redshift into a single table that will be too cost prohibitive to Vacuum once loaded. To avoid having to vacuum this table, I am loading data using COPY command, from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However after loading the first two files, I find that redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would redshift not recognize the new incoming data as already sorted?
Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
Official Answer by Redshift Support:
Here is what we say officially:
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into 2
regions:
sorted, and
unsorted
As long as you load data in sorted key order, even though the data is
in the unsorted region, it is still in sort key order, so there is no
need for VACUUM to ensure the data is sorted. A VACUUM is still needed
to move the data from the unsorted region to the sorted region,
however this is less critical as the data in the unsorted region is
already sorted.
Storing sorted data in Amazon Redshift tables has an impact in several areas:
How data is displayed when ORDER BY is not used
The speed with which data is sorted when using ORDER BY (eg sorting is faster if it is mostly pre-sorted)
The speed of reading data from disk, based upon Zone Maps
While you might want to choose a SORTKEY that improves sort speed (eg Change order of sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.
I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.
Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.
Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.
Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.
Queries will run the fastest in Redshift when it only has to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as timestamps where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY even though they are used for different purposes.
Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:
Table ID
Column ID
Block Number
Minimum Value of the field stored in this block
Maximum Value of the field stored in this block
Sorted/Unsorted flag
I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.
You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
OIDS = FALSE,appendonly=true, orientation=column,compresstype=quicklz, compresslevel=1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows to this table but every query works very slow.
I need to do queries on all fields (not only one),
What can I do to speed up my queries?
Using PARTITION? using indexes?
maybe using a different DB like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields the selectivity gained by going columnar orientation is probably hurting you more than helping, as you needs to scan all the data anyway. I would remove columnar orientation.
Generally speaking indexes don't help in a Greenplum system. Usually the amount of hardware that is involved tends to make scanning the data directory faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
On top of all of that there is the question of what the hardware is and if you have a proper amount of segments setup and resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on a light hardware you need to find the sweet spot where number of segments matches what the underlying system can provide.
#Dor,
I have same type of data where CDR info is stored for a telecom company, and daily 10-12 millions rows inserted and also heavy queries running on those CDRs related tables, I was also facing the same issue last year, and i have created partitions on those tables on the CDR timings column.
As per My understanding GP creates physical tables for each partition whereas logical tables created in other RDBMS. After this I got better performance with all SELECTs on these tables. Also I think you should convert text datatype to Character Varying for all columns (if text is really not required) I felt DB operations on Text field is very slow(specially order by, group by)
index will help you depends on your queries in my case i have huge inserts so i didnt try yet
If you are selecting all the columns in select so no need of Column Oriented table
Regards

TSQL Optimal / Acceptable table column count

What is the optimal / acceptable column count when you are designing a table that serves as a datastore for huge forms to be filled?
I don't think there is any optimal number of columns, as such.
Obviously you need to consider the maximum width of a row, which is just over 8000 bytes.
Consider the way you're going to be handling the data. Will most of the columns be used all the time, or are there groups of columns that are sometimes unused (e.g., on the form, questions 8-12 are only used when the answer to question 7 is "Yes"). In those cases, it might make sense to put the optional questions into a separate table.
Making the rows wider means that your data pages, and hence your clustered key (if any) will take up more space; hence you can fit fewer records onto a page, and that means you'll have to do more IO operations when data is retrieved from the table. Thus, if there are logical ways to partition the data so that the rows are narrower, you may improve your read access speeds.
You might even consider going one step farther in normalization, so that a single form entry has a header row in a parent table, and then each question in the form gets its own row in a child table.