According to Amazon:
Load your data in sort key order to avoid needing to vacuum.
As long as each batch of new data follows the existing rows in your
table, your data will be properly stored in sort order, and you will
not need to run a vacuum. You don't need to presort the rows in each
load because COPY sorts each batch of incoming data as it loads.
The sort key is a timestamp and the data is loaded as it comes in, 200 rows at a time. However, the rows are 99% unsorted. Why are so many rows unsorted?
Double-check the data you insert. VACUUM is unnecessary only if the new data is sorted by the SORTKEY and belongs at the end of the table. If the table already contains at least one row whose sort key falls after the data being loaded, the new data is placed in an unsorted region in Redshift, which leaves your data unsorted.
See this example for further information.
Also, read about the VACUUM process.
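The rule above can be sketched as a small check. This is only a simplified model of the "each batch follows the existing rows" condition, not Redshift's actual implementation; the function names are hypothetical and integers stand in for timestamps.

```python
from typing import Iterable, List, Optional


def follows_existing_rows(existing_max: Optional[int], batch: Iterable[int]) -> bool:
    """True when every sort key in the batch is >= the largest key already loaded."""
    keys = list(batch)
    if not keys or existing_max is None:
        return True
    # COPY sorts each batch itself, so only the smallest key matters here.
    return min(keys) >= existing_max


def count_out_of_order_batches(batches: List[List[int]]) -> int:
    """Count batches that would not simply extend the table in sort order."""
    existing_max: Optional[int] = None
    late = 0
    for batch in batches:
        if not follows_existing_rows(existing_max, batch):
            late += 1
        if batch:
            batch_max = max(batch)
            existing_max = batch_max if existing_max is None else max(existing_max, batch_max)
    return late
```

Running a check like this over the 200-row batches before loading would show whether late-arriving timestamps are the reason rows end up unsorted.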
Related
I have some time-series data that I am about to import into TimescaleDB, as (time, item_id, value) tuples in a hypertable.
I have created an index:
CREATE INDEX ON time_series (item_id, timestamp DESC);
Does TimescaleDB have different performance characteristics when inserting rows in the middle of a time series vs. appending them at the end of the time series? I know this is an issue for some native PostgreSQL data structures like BRIN indexes.
I am asking because for some item_ids I might have patchy data and I need to insert those values after other item_ids have filled the tip of time series. Basically, some items might be old data that is seriously behind the rest of the items.
I don't think it reacts differently.
In your case, insert performance will depend on the following:
How many indexes do you have on that table? Are they all really needed?
Do those indexes contain only the minimum required columns?
Use parallel insert/copy; see here for more info.
Insert rows in batches.
Configure shared_buffers properly (the documentation recommends 25% of available RAM).
But this tip is going to help you the most:
Write data in loose time order
When chunks are sized appropriately, the latest chunk(s) and their associated indexes are naturally maintained in memory. New rows inserted with recent timestamps will be written to these chunks and indexes already in memory.
If a row with a sufficiently older timestamp is inserted – i.e., it's an out-of-order or backfilled write – the disk pages corresponding to the older chunk (and its indexes) will need to be read in from disk. This will significantly increase write latency and lower insert throughput.
Particularly, when you are loading data for the first time, try to load data in sorted, increasing timestamp order.
Be careful if you're bulk loading data about many different servers, devices, and so forth:
Do not bulk insert data sequentially by server (i.e., all data for server A, then server B, then C, and so forth). This will cause disk thrashing as loading each server will walk through all chunks before starting anew.
Instead, arrange your bulk load so that data from all servers are inserted in loose timestamp order (e.g., day 1 across all servers in parallel, then day 2 across all servers in parallel, etc.)
source: TimeScale blog
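A minimal sketch of the "loose timestamp order" advice, assuming each server's rows are already sorted by time. The row layout (timestamp, server_id, value) and the function names are illustrative assumptions, not part of TimescaleDB; the interleaving uses Python's standard heapq.merge.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str, float]  # (timestamp, server_id, value), hypothetical layout


def loose_time_order(per_server: Dict[str, Iterable[Row]]) -> Iterator[Row]:
    """Interleave per-server streams (each sorted by time) into one time-ordered stream."""
    return heapq.merge(*per_server.values(), key=lambda row: row[0])


def in_batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group the merged stream into insert batches of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch then covers a narrow time window across all servers, so inserts keep hitting the most recent chunks instead of walking through every chunk once per server.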
I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.
The first time I run this flow everything goes ok and the table gets populated. However, if later I want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to only load new/updated rows?
In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.
See my implementation:
Thanks in advance!!
Looking around for possibilities, some users say that the only advantage of the Bulk Loader is performance with very large batches of rows (upwards of millions). But there are ways of countering this.
Try using the Table output step with a batch size ("Commit size" in the step) of 5000, and change the number of copies executing the step to, say, 4 (for a dual-core CPU with 2 logical cores each); the right number depends on how many cores your processor has. You can change the number of copies by right-clicking the step in the GUI and setting the desired number.
This parallelizes the output into 4 groups of inserts of 5000 rows per 'cycle' each. If this causes memory overload in the JVM, you can increase the available memory through the PENTAHO_DI_JAVA_OPTIONS option: simply double the amounts set for Xms (minimum) and Xmx (maximum). Mine is set to "-Xms2048m" "-Xmx4096m".
The only peculiarity I found with this step and PostgreSQL is that you need to specify the database fields in the step, even if the incoming rows have exactly the same layout as the table.
You are looking for an incremental load. You can do it in two ways.
There is a step called "Insert/Update", which can be used for an incremental load.
You have the option to specify key columns to compare. Then, under the fields section, select "Y" for update. Select "N" for the columns you chose for the key comparison.
Use Table output and uncheck the "Truncate table" option. When retrieving the data from the source table, use a variable in the WHERE clause: first get the max value from your target table, set it to a variable, and include that variable in the WHERE clause of your query.
Editing here..
If your data source is a flat file then, as I said, get the max value (date/int) from the target table and join it with your data. After that, use Filter rows to keep only the incremental data.
Hope this will help.
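The "max value" approach can be sketched outside Pentaho as follows. This is only an illustration of the idea: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name target, its columns, and both function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, str]  # (id, updated_at as ISO date text, value), hypothetical layout


def create_target(conn: sqlite3.Connection) -> None:
    """Create the hypothetical target table used in this sketch."""
    conn.execute("CREATE TABLE target (id INTEGER, updated_at TEXT, value TEXT)")


def incremental_load(conn: sqlite3.Connection, source_rows: List[Row]) -> int:
    """Insert only the source rows that are newer than the target's current maximum."""
    # Step 1: read the high-water mark from the target table (None when it is empty).
    watermark = conn.execute("SELECT MAX(updated_at) FROM target").fetchone()[0]
    # Step 2: filter the source (the flat file rows) against that value.
    new_rows = [r for r in source_rows if watermark is None or r[1] > watermark]
    # Step 3: append without truncating.
    conn.executemany(
        "INSERT INTO target (id, updated_at, value) VALUES (?, ?, ?)", new_rows
    )
    conn.commit()
    return len(new_rows)
```

Note that this variant only picks up rows newer than the maximum; rows that were changed in place without a newer date still need the "Insert/Update" style comparison on key columns.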
In an attempt to handle custom fields for specific objects in a multi-tenant dimensional DW, I created an ultra-wide denormalized dimension table (hundreds of columns, with a hard-coded column limit) that Redshift does not like too much ;).
user1|attr1|attr2...attr500
Even an innocent update query on a single column of a handful of records takes approximately 20 seconds. (Which is kind of surprising, as I would guess it shouldn't be such a problem on a columnar database.)
Any pointers on how to modify the design for better reporting, going from a normalized source table (one user has multiple different attributes, one attribute per line) to a denormalized one (one row per user with generic columns, different for each of the tenants)?
Or has anyone tried to perform transposing (pivoting) of normalized records into a denormalized view (table) in Redshift? I am worried about performance.
It is probably important to think about how Redshift stores data and then implements updates on that data.
Each column is stored in its own sequence of 1MB blocks, and the content of those blocks is determined by the SORTKEY. So, however many rows of the sort key's values can fit in 1MB is how many (and which) values are in the corresponding 1MB block for all other columns.
When you ask Redshift to UPDATE a row it actually writes a new version of the entire block for all columns that correspond to that row - not just the block(s) which change. If you have 1,600 columns that means updating a single row requires Redshift to write a minimum of 1,600MB of new data to disk.
This issue can be amplified if your update touches many rows that are not located together. I'd strongly suggest choosing a SORTKEY that corresponds closely to the range of data being updated to minimise the volume of writes.
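The arithmetic in this answer can be written out as a back-of-the-envelope estimate. It simply follows the model described above (one 1MB block rewritten per column for each group of rows that share a block); the function name and parameters are hypothetical.

```python
BLOCK_SIZE_MB = 1  # Redshift block size, as stated in the answer above


def min_update_write_mb(num_columns: int, row_groups_touched: int = 1) -> int:
    """Estimate the minimum data rewritten by an UPDATE under the model above.

    row_groups_touched is the number of distinct block positions the updated
    rows fall into; rows located together share one position.
    """
    return num_columns * row_groups_touched * BLOCK_SIZE_MB


single_row_1600_cols = min_update_write_mb(1600)     # one row, 1,600 columns
scattered_rows_500_cols = min_update_write_mb(500, 10)  # 10 rows in 10 different blocks
```

Under this model a 500-column table with ten scattered rows rewrites more data than a single row in a 1,600-column table, which is why a SORTKEY that keeps updated rows together matters.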
I am trying to load very large volumes of data into a single Redshift table that will be cost-prohibitive to vacuum once loaded. To avoid having to vacuum this table, I am loading data using the COPY command from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However, after loading the first two files, I find that Redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would Redshift not recognize the new incoming data as already sorted?
Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
Official Answer by Redshift Support:
Here is what we say officially:
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into 2
regions:
sorted, and
unsorted
As long as you load data in sorted key order, even though the data is
in the unsorted region, it is still in sort key order, so there is no
need for VACUUM to ensure the data is sorted. A VACUUM is still needed
to move the data from the unsorted region to the sorted region,
however this is less critical as the data in the unsorted region is
already sorted.
Storing sorted data in Amazon Redshift tables has an impact in several areas:
How data is displayed when ORDER BY is not used
The speed with which data is sorted when using ORDER BY (eg sorting is faster if it is mostly pre-sorted)
The speed of reading data from disk, based upon Zone Maps
While you might want to choose a SORTKEY that improves sort speed (eg Change order of sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.
I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.
Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.
Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.
Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.
Queries will run the fastest in Redshift when it only has to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as timestamps where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY even though they are used for different purposes.
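To make the block-skipping idea concrete, here is a toy model of zone maps. It is only an illustration of the min/max principle described above, not how Redshift stores or scans data; the function names and the rows-per-block figure are assumptions.

```python
from typing import List, Sequence, Tuple

ZoneMap = Tuple[int, int]  # (minimum value, maximum value) of one block


def build_zone_maps(values: Sequence[int], rows_per_block: int) -> List[ZoneMap]:
    """Split a column into fixed-size blocks and record each block's min and max."""
    zone_maps = []
    for start in range(0, len(values), rows_per_block):
        block = values[start:start + rows_per_block]
        zone_maps.append((min(block), max(block)))
    return zone_maps


def blocks_to_read(zone_maps: List[ZoneMap], lo: int, hi: int) -> int:
    """Count blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for bmin, bmax in zone_maps if bmax >= lo and bmin <= hi)


# The same 100 values stored in sorted order and in an interleaved order.
sorted_column = list(range(100))
interleaved_column = [(i % 10) * 10 + i // 10 for i in range(100)]

sorted_reads = blocks_to_read(build_zone_maps(sorted_column, 10), 20, 29)
interleaved_reads = blocks_to_read(build_zone_maps(interleaved_column, 10), 20, 29)
```

With the sorted column only one of the ten blocks overlaps the range 20 to 29, while with the interleaved column every block's range covers it, so nothing can be skipped.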
Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:
Table ID
Column ID
Block Number
Minimum Value of the field stored in this block
Maximum Value of the field stored in this block
Sorted/Unsorted flag
I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.
You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.
I have a large set of data on S3 in the form of a few hundred CSV files that are ~1.7 TB in total (uncompressed). I am trying to copy it to an empty table on a Redshift cluster.
The cluster is empty (no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command uses up all available disk space about 25% of the way through and aborts. If there's no sort key, the copy completes successfully and never uses more than 45% of the available disk space. This behavior is consistent whether or not I also set a distribution key.
I don't really know why this happens, or if it's expected. Has anyone seen this behavior? If so, do you have any suggestions for how to get around it? One idea would be to try importing each file individually, but I'd love to find a way to let Redshift deal with that part itself and do it all in one query.
Got an answer to this from the Redshift team. The cluster needs free space of at least 2.5x the incoming data size to use as temporary space for the sort. You can upsize your cluster, copy the data, and resize it back down.
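As a rough check of that 2.5x rule against the numbers in this thread, the calculation can be sketched as below. It assumes 1 TB = 1000 GB for round numbers and 160 GB of disk per dw2.large node (the 0.16 TB figure quoted in this thread); the function names are hypothetical, and it applies the rule to the uncompressed size as a worst case.

```python
import math


def required_free_space_gb(incoming_gb: float, factor: float = 2.5) -> float:
    """Free space needed for the sort, per the 2.5x rule quoted above."""
    return incoming_gb * factor


def nodes_needed(incoming_gb: float, node_capacity_gb: float, factor: float = 2.5) -> int:
    """Smallest node count whose combined disk covers the required free space."""
    return math.ceil(required_free_space_gb(incoming_gb, factor) / node_capacity_gb)


incoming_gb = 1700        # ~1.7 TB of CSV data, uncompressed
cluster_gb = 10 * 160     # 10 dw2.large nodes

needed_gb = required_free_space_gb(incoming_gb)
enough_space = cluster_gb >= needed_gb
upsized_nodes = nodes_needed(incoming_gb, 160)
```

On these assumptions the 10-node cluster falls well short of the required free space, which is consistent with the COPY aborting when a sort key is set.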
Each dw2.large node has 0.16 TB of disk space. Since you said your cluster has 10 nodes, the total space available is around 1.6 TB.
You mentioned that you have around 1.7 TB of raw (uncompressed) data to load into Redshift.
When you load data into Redshift with the COPY command, Redshift automatically compresses your data as it loads it into the table.
Once a table is loaded, you can see its compression encodings with the query below:
Select "column", type, encoding
from pg_table_def where tablename = 'my_table_name'
Load your data once while the table has no sort key, then see which compression encodings are being applied.
I suggest you drop and re-create the table each time you load data for testing, so that the compression encodings are analysed each time. Once you have loaded the table using the COPY command, see the link below and run the script to determine the table size:
http://docs.aws.amazon.com/redshift/latest/dg/c_analyzing-table-design.html
When you apply a sort key to your table and load data, the sort key also occupies some disk space,
so a table without a sort key needs less disk space than a table with one.
You need to make sure that compression is being applied to the table.
With a sort key applied, the table needs more space to store. When you apply a sort key, also check that you are loading data in sorted order, so that it is stored in sorted fashion. That avoids having to run the VACUUM command to sort the table after the data has been loaded.