Distribution in Amazon Redshift - BTree or Hash

What does Amazon Redshift use to distribute values in the cluster: a hash or a B-Tree?
For example, if my distribution key is a date in the format "yyyy-MM-dd", will two consecutive days be stored on the same node (as they would be with a B-Tree) or, more likely, on different nodes (as they probably would be with a hash)?
Thank you

Smart question. You know how most RDBMSs work.
There are no B-Trees.
A hashing function is applied to your distribution key, and the outcome of the hashing function determines what slice receives your data.
There are no indexes, in the traditional sense of the word. Redshift uses information in its "super block" to determine if it can avoid doing a full table scan for certain queries.
For large data sets there are 4 practices that will dramatically improve your performance:
DISTRIBUTION KEY -- The most important design decision in an MPP system.
COMPRESSION -- This can be done automatically as you load the database.
SORT KEY -- Getting a good sort key is extremely important for large tables.
ANALYZE and VACUUM -- This ensures that the SORT KEY is optimized and the database has good statistics.
Notice what is missing from my list? Yes, I did not say indexes. Redshift does not have indexes.
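To tie this back to the original question, here is a minimal sketch (the table and column names are invented for illustration) of how the distribution key, sort key and maintenance steps above appear in Redshift DDL:

-- Hypothetical table distributed on the date column, as in the question.
-- The hash of event_date picks the slice, so '2017-01-01' and '2017-01-02'
-- will most likely land on different slices (and therefore possibly different nodes).
CREATE TABLE events (
    event_date  date,
    customer_id bigint,
    payload     varchar(256)
)
DISTSTYLE KEY
DISTKEY (event_date)
SORTKEY (event_date);

-- Compression encodings can be chosen automatically during the load (COPY ... COMPUPDATE ON).
-- ANALYZE and VACUUM keep the planner statistics and the sort order in good shape:
ANALYZE events;
VACUUM events;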

Related

Staging Table Design for Performance

I have a typical star schema in my Azure SQL Data Warehouse. Data is first dumped into staging tables via Data Factory, which then calls a master procedure that calls other procedures to transform the data into the appropriate format and then clear out the staging tables for that chunk of data.
Should these staging tables have indexes? Should they have statistics? I recently upgraded to Gen 2, but don't have auto create statistics turned on. I worry that statistics will get created but not updated, and so will end up slowing things down more than anything.
For more context, there is a procedure to rebuild indexes and update statistics which is run overnight, once a day. The data load process is run hourly.
Given that these are staging tables, the biggest impacts will come from the following.
Where possible, use a hash distribution. This will give the best performance when you process the table in subsequent steps. While the documentation sometimes suggests round_robin distribution, which is slightly faster for ingestion, the next query on the table will be slower.
Always use statistics. I suggest creating them manually, based on expected usage, for greater predictability in your ELT performance. If you don't create and update statistics you're going to get dreadful performance at some point in the future. If you don't want to undertake the effort of manually managing statistics, then definitely turn on auto statistics.
Consider the use of HEAP vs CLUSTERED COLUMNSTORE table structures for staging tables. In general, staging tables are processed on a whole-row basis, and you may find that your performance is better at the staging layer if you use a HEAP. This needs to be tested on your data, as the Gen2 caching that gives much greater performance does not apply to Heap tables.
Definitely create your fact and dimension tables as clustered columnstore indexes. Hash distribute your fact/s, and replicate your dimensions (unless you have billion row dimensions, in which case a hash distribution may be more appropriate).
If you're using CTAS algorithms your need for non-clustered indexes should be very low. I generally add indexes only when I see a performance problem with a query that I can't solve by any other technique.
Finally, make sure that you're using a reasonable DWU and Resource Class. A general rule of thumb is that you shouldn't be running your ELT at less than DWU500, and LARGERC. If you don't do this, you'll find that you get bad clustered columnstore indexes which will lead to future performance problems.
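As a sketch only (stg.SalesStaging, ext.SalesLanding and CustomerKey are invented names, not the poster's schema), the hash-distributed HEAP staging pattern with manually created statistics could look like this:

-- Stage with CTAS: hash-distributed so the next transformation step is co-located,
-- and HEAP because the staging data is processed whole-row and read only once.
CREATE TABLE stg.SalesStaging
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    HEAP
)
AS
SELECT CustomerKey, OrderDate, Amount
FROM ext.SalesLanding;

-- Manually created statistics on the columns used in downstream joins and filters.
CREATE STATISTICS stat_SalesStaging_CustomerKey ON stg.SalesStaging (CustomerKey);
CREATE STATISTICS stat_SalesStaging_OrderDate ON stg.SalesStaging (OrderDate);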
Some input from my side:
Your fact table should be partitioned. In fact, you should have a job that creates the partitions in the fact table automatically.
How big is the fact table? If it is becoming too big, then, based on your requirements, you can think of archiving old data that is no longer needed in the fact table.

How will data be distributed when using a dist key on a column in Redshift?

I am new to Redshift. I don't understand which column will be suitable as a distribution key to get improved query performance. How do I find the best column? And how will the data be distributed across the nodes using the dist key?
It's a very broad question, so it's hard to give a short answer. Anyway, let me try to summarize here: in Redshift there are two types of keys, the distkey and the sortkey.
distkey - a table's distkey is the column on which the table is distributed across the nodes. Rows with the same value in this column are guaranteed to be on the same node.
sortkey - a table's sortkey is the column by which the rows are sorted within each node. It should be applied to columns you usually order by.
Let's focus on the distkey here.
A table's distribution style can be EVEN, KEY, or ALL. Distribution styles are used to achieve the following:
Distribute data evenly for parallel processing
Minimize data movement
The ALL distribution style should be used for tables that have slowly changing data, are of reasonable size (i.e., a few million rows, but not hundreds of millions), and lack a common distribution key for frequent joins.
The EVEN distribution style should be used for tables that are not frequently joined or aggregated, and for large tables without an acceptable candidate key.
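As a rough sketch (the table and column names below are hypothetical), the distribution style and key are declared directly in the table DDL:

-- KEY distribution: rows are hashed on customer_id, co-locating joins on that column.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      decimal(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);

-- ALL distribution: a full copy of a small, slowly changing dimension on every node.
CREATE TABLE dim_country (
    country_code char(2),
    country_name varchar(64)
)
DISTSTYLE ALL;

-- EVEN distribution: round-robin for a large table with no good join or aggregation key.
CREATE TABLE web_log (
    log_line varchar(2000)
)
DISTSTYLE EVEN;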
Here are some good materials to read.
https://www.slideshare.net/AmazonWebServices/deep-dive-on-amazon-redshift-64919704
https://www.youtube.com/watch?v=iuQgZDs-W7A
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
https://docs.aws.amazon.com/redshift/latest/dg/c_Distribution_examples.html
I hope this gives some way for you to move forward.

When should I use the PostgreSQL 10 declarative partitioning feature?

The official explanation is:
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
When is a table very large? How do I judge whether a table is very large?
And what does the rule of thumb, that the size of the table should exceed the physical memory of the database server, actually mean?
The typical use cases for table partitioning (not limited to Postgres) are:
Cleanup data
If you need to delete rows from large tables and those rows can be identified by a single partition, then dropping the partition is a lot faster than using DELETE. A typical use case is a table range-partitioned on a timespan (week, month, year).
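A minimal sketch of that pattern with PostgreSQL 10 declarative partitioning (the names measurements and logdate are made up for illustration):

-- Parent table, range-partitioned by month on logdate.
CREATE TABLE measurements (
    logdate date NOT NULL,
    value   numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2018_01 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
CREATE TABLE measurements_2018_02 PARTITION OF measurements
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');

-- Cleanup: dropping (or detaching) one month is far cheaper than DELETE ... WHERE.
DROP TABLE measurements_2018_01;
-- or keep the data around outside the partitioned table:
ALTER TABLE measurements DETACH PARTITION measurements_2018_02;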
Improve queries
If all (or nearly all) of the queries you run contain a condition on the partition key.
A typical use case would be partitioning an "orders" table on e.g. the country, where all queries involve a condition like where country_code = 'de' or something similar. Queries that do not include the partitioning key will, however, be slower than the same query on a non-partitioned table.
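For that use case, a sketch with a hypothetical orders table list-partitioned by country_code might look like this:

CREATE TABLE orders (
    order_id     bigint NOT NULL,
    country_code text   NOT NULL,
    amount       numeric
) PARTITION BY LIST (country_code);

CREATE TABLE orders_de PARTITION OF orders FOR VALUES IN ('de');
CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('us');

-- Only the orders_de partition is scanned, because the condition matches the partition key.
SELECT count(*), sum(amount)
FROM orders
WHERE country_code = 'de';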
What is "large"? That depends very much on your hardware and system, but I would not consider a table with fewer than 100 million rows "large". Indexing (including partial indexes) can get you a long way in Postgres.
Note that Postgres 10 partitioning is still severely limited compared to e.g. Oracle or SQL Server. One of the biggest limitations is the lack of support for foreign keys and global indexes (i.e. a primary key ensuring uniqueness across all partitions). So if you need that, partitioning is not for you.

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
    mcc         text,
    mnc         text,
    lac         text,
    cell        text,
    from_number text,
    to_number   text,
    cdr_time    timestamp without time zone
)
WITH (
    OIDS = FALSE,
    appendonly = true,
    orientation = column,
    compresstype = quicklz,
    compresslevel = 1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query runs very slowly.
I need to query on all fields (not only one).
What can I do to speed up my queries?
Should I use partitioning? Indexes?
Maybe a different database like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the selectivity gained by going to a columnar orientation is probably hurting you more than helping, as you need to scan all the data anyway. I would remove the columnar orientation.
Generally speaking indexes don't help in a Greenplum system. Usually the amount of hardware that is involved tends to make scanning the data directory faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
On top of all of that there is the question of what the hardware is and if you have a proper amount of segments setup and resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on a light hardware you need to find the sweet spot where number of segments matches what the underlying system can provide.
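Pulling those suggestions together, a possible revised DDL might look like the sketch below. This is only an illustration; the row orientation, the random distribution and the monthly partition range are assumptions that would need to be validated against your real queries and hardware.

-- Row-oriented (orientation=column dropped, since all columns are read),
-- random distribution so selective from_number lookups still spread across all segments,
-- and monthly range partitions on cdr_time so time-bounded queries can prune partitions.
CREATE TABLE data."CDR"
(
    mcc         text,
    mnc         text,
    lac         text,
    cell        text,
    from_number text,
    to_number   text,
    cdr_time    timestamp without time zone
)
WITH (appendonly = true, compresstype = quicklz, compresslevel = 1)
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (cdr_time)
(
    START (date '2017-01-01') INCLUSIVE
    END   (date '2018-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);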
#Dor,
I have the same type of data, where CDR info is stored for a telecom company; 10-12 million rows are inserted daily and heavy queries run on the CDR-related tables. I was facing the same issue last year, and I created partitions on those tables on the CDR timing column.
As per my understanding, Greenplum creates physical tables for each partition, whereas other RDBMSs create logical partitions. After this I got better performance with all SELECTs on these tables. I also think you should convert the text data type to character varying for all columns (if text is really not required); I found DB operations on text fields to be very slow (especially ORDER BY and GROUP BY).
Whether an index will help depends on your queries; in my case I have huge inserts, so I haven't tried one yet.
If you are selecting all the columns, there is no need for a column-oriented table.
Regards

PostgreSQL immutable read workload tuning

I have a table where the non-primary key columns are deterministic given the primary key.
I think this could be pretty common, for example a table representing memoization/caching of an expensive function, or where the primary key is a hash of the other columns.
Further assume that the workload is mostly reads of 1-100 individual rows, and that writes can be batched or "async" based on what gives the best performance.
What are interesting tuning options on the table/database in this case?
This would be an ideal candidate for index-only scans, available in versions 9.2 and up: create an index on all the primary key columns plus the frequently queried other columns. Aggressively vacuum the table (i.e. manually after every batch update), because the default autovacuum settings are not aggressive enough to get the maximal benefit from index-only scans.
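A minimal sketch of that setup, using a hypothetical memoization table (fn_cache) rather than the poster's actual schema:

-- Cache table: the result column is fully determined by the primary key.
CREATE TABLE fn_cache (
    arg_hash bytea PRIMARY KEY,
    result   text  NOT NULL
);

-- Multicolumn index covering the key plus the queried column,
-- so point lookups can be answered by an index-only scan.
CREATE INDEX fn_cache_covering_idx ON fn_cache (arg_hash, result);

-- After each batch of writes, vacuum so the visibility map stays current
-- and index-only scans do not fall back to heap fetches.
VACUUM (ANALYZE) fn_cache;

On PostgreSQL 11 or later a covering index created with INCLUDE (result) would be the more idiomatic choice, but the plain multicolumn index above works from 9.2 onwards.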