How does Amazon Redshift store nulls?

In Amazon Redshift, how are nulls stored? Do they take up physical space?
I'm trying to work out how best to design a table. The data behind it may mean it has many columns that are sparsely populated, so I would like to know whether this has a negative impact (even after compression) or whether nulls take up no space at all (as in MySQL 5.0.3 and later, for example).
Thank you.

Yes, columns with lots of nulls will compress extremely well, and consequently give great performance on Amazon Redshift.
Amazon Redshift is a columnar database engine. Columnar databases are highly optimized for data with repeating values, and nulls are just another repeating value.
So if you have a table where some of the columns contain lots of nulls, those columns will more than likely compress extremely well, providing savings in storage as well as processing speed.
To achieve the proper compression you have two options (both sketched below):
DDL design - manually choose the encoding for each column.
Automatic - have the COPY command automatically choose the optimal encoding settings for your table.
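As a rough sketch of both options (the table, columns, S3 bucket and IAM role below are invented for illustration, not taken from the question):

    -- Option 1: DDL design - pick an encoding per column yourself
    CREATE TABLE events (
        event_id      BIGINT       ENCODE az64,
        event_type    VARCHAR(50)  ENCODE bytedict,
        optional_note VARCHAR(200) ENCODE zstd   -- sparsely populated, mostly NULL
    );

    -- Option 2: Automatic - create the table without explicit encodings (here events_raw)
    -- and let COPY choose them; COMPUPDATE ON applies automatic compression when
    -- loading into an empty table
    COPY events_raw
    FROM 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    COMPUPDATE ON;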

Related

Data modeling in columnar database vs multi-dimensional for reporting

While learning Redshift (my first columnar database), I am struggling to figure out the right approach for designing the data model. Columnar databases do promote a flat table design, yet it is acknowledged that a star schema or snowflake schema could be a better choice in some cases.
Here is a simple example of where I am struggling
As you can see, the multi-dimensional approach has a few dimensions and one fact table. I could have made it a snowflake design, but I kept it as a simple star schema.
Approach 1: Use the common columns from the tables (in this scenario, demographics) as a shared dimension. This could reduce the table size for Customer & Store but adds an extra dimension.
Approach 2: Flat table design with all the columns.
My Questions:
Which approach do data modelers use to design data models in columnar databases like Redshift? Or do they use a different approach?
Considering this example, what is the best way to design a data model for data warehousing?
Which approach is good for reporting (considering that the client PC/laptop would have limited memory)? Even cloud reporting may become costly when a heavy data set is used.
Approach 3 will produce a massive data set for reporting. This could be a costly affair when reporting with Power BI, Tableau or any other self-service reporting tool.
The multi-dimensional approach is best for self-service reporting (cost & performance), but then it defeats the purpose of a columnar database.
Approach 1 is also good for reporting, but with more joins & complexity.
Sorry, late to the party.
I will post this as an answer, because it is too long for a comment.
I saw in chat that test results show the star schema is better, but it was tested on a regular row-oriented database (MSSQL), not a columnar one (such as Vertica, Redshift, Snowflake, BigQuery...).
I have some experience from a project implementation where I tested both approaches, OBT (one big table) and star schema, while building a DWH for reporting. This was already more than 2 years ago, so don't expect much detail.
Database: Redshift, 2 nodes of dc2.8xlarge. Might be a bit of an overkill, but the other option was a bunch of lower-level nodes, which wouldn't have been more cost efficient. This example covers just one data area.
Data: ~6 tables which could be joined into something resembling a star schema, consisting of 3 fact tables and, depending on the denormalization level, 5-8 dimensions.
With various approaches and different optimization paths, star-schema queries would commonly take about 30 seconds. Which is not bad, but also not very responsive from the user's perspective.
SQLs on the flat denormalized fact tables rarely exceed 5 seconds. Some tables contain more than 100 columns, and row counts are between 50M and 100M. To keep things simple, we use zstd compression for all columns.
In columnar databases data compresses very well, because many similar or identical values sit in a single column.
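For illustration only, a minimal sketch of such a wide, denormalized table with zstd on every column (all names and types are invented, not from the project described above):

    CREATE TABLE sales_obt (
        sale_id       BIGINT        ENCODE zstd,
        sale_date     DATE          ENCODE zstd,
        customer_name VARCHAR(100)  ENCODE zstd,
        customer_city VARCHAR(50)   ENCODE zstd,
        store_name    VARCHAR(100)  ENCODE zstd,
        product_name  VARCHAR(100)  ENCODE zstd,
        quantity      INTEGER       ENCODE zstd,
        amount        DECIMAL(12,2) ENCODE zstd
    )
    DISTSTYLE EVEN        -- no joins to co-locate, so spread rows evenly across nodes
    SORTKEY (sale_date);  -- assuming most reports filter on a date range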
We took the OBT approach, and there are some pros and cons:
pros:
Responsive reports in reporting tool (most important one)
Fewer objects for ETL developers to handle.
Analysts who query the database directly can write simpler queries using fewer tables.
No need to worry about data inconsistencies when some dimensions are outdated, which could happen in a star schema.
Easier approach for reporting tool cache clearing.
Easier reporting performance tuning.
Easier modeling in reporting tool, do not need to define table join strategies.
cons:
Might take more space. Didn't really test this closely, as storage space is not an issue for us.
Filters in reporting tools might take a bit longer to return the list of values (select distinct one_column from table).
Table refresh might take a bit longer for one big table compared to multiple smaller tables.
Hopefully this helps.

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is present in a single table. But that table contains more than we need, like 300 columns. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore, the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed fairly evenly so that each node does about an equal amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
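To make the distribution options above concrete, here is a hedged sketch (table and column names are invented for the example):

    -- Fact table and the largest dimension share a distribution key, so the join stays node-local
    CREATE TABLE fact_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_total DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (order_id);

    CREATE TABLE dim_customer (
        customer_id   BIGINT,
        customer_name VARCHAR(100)
    )
    DISTKEY (customer_id);

    -- A small dimension can simply be copied to every node
    CREATE TABLE dim_order_status (
        status_id   SMALLINT,
        status_name VARCHAR(30)
    )
    DISTSTYLE ALL;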
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
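As a hypothetical sketch of that daily recalculation (customer_report, customer_is_active and last_activity_date are invented names, using Redshift's DATEADD):

    -- Recompute the convenience flag once a day from a stored last-activity date
    UPDATE customer_report
    SET customer_is_active =
        CASE WHEN last_activity_date >= DATEADD(day, -30, CURRENT_DATE)
             THEN true
             ELSE false
        END;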
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Normalization vs compression

I'm receiving messages from sensors into Kinesis, processing them using Lambda and loading them into Redshift using Kinesis Firehose. All messages are parsed and inserted into one large staging table. We need to do aggregation/analytics of the sensor data. Besides the sensor data, there is also a lot of info in the header that we store but currently don't use.
Does it make sense to load data from this staging table into a normalized star schema, or just enable compression on the columns and use one huge denormalized table instead? How well does Redshift work with denormalized data? What are the pros and cons of both options?
In my experience, huge tables with lots of columns cause slow queries. If you create narrower tables instead of wide ones you might get better performance. Before deciding what to do, you should consider the queries for analysis and the queries for creating aggregate tables, as well as the sparsity of the data. On the other hand, joins are expensive overall, and if you need a structure requiring a lot of joins then you should adjust the sort and dist keys accordingly.
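A hedged sketch of what adjusting the sort and dist keys might look like for such a sensor staging table (all names are made up):

    -- Distribute on the column used for joins/grouping, sort on the time column used in filters
    CREATE TABLE sensor_readings (
        sensor_id  BIGINT,
        reading_ts TIMESTAMP,
        value      DOUBLE PRECISION
    )
    DISTKEY (sensor_id)
    SORTKEY (reading_ts);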
Here is a relevant AWS blog post: https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/

Do Redshift column encodings affect query execution speed?

When creating data tables in Amazon Redshift, you can specify various encodings such as MOSTLY32 or BYTEDICT or LZO. Those are the compressions used when storing the columnar values on disk.
I am wondering if my choice of encoding is supposed to make a difference in query execution times. For example, if I make a column BYTEDICT would that make a difference over LZO when it comes to SELECTs, GROUP BYs or FILTERs?
Yes. The compression encoding used translates to the amount of disk storage. Generally, the lower the storage, the better the query performance.
But which encoding would be more beneficial to you depends on your data type and its distribution. There is no guarantee that LZO will always be better than BYTEDICT, or vice versa. In my experience, I usually load some sample data into the intended table, then run ANALYZE COMPRESSION. Whatever Redshift suggests, I go with it. That has worked for me.
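A small sketch of that workflow (the table name, bucket and IAM role are placeholders):

    -- Load a representative sample, then ask Redshift which encodings it would recommend
    COPY my_table
    FROM 's3://my-bucket/sample/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV;

    ANALYZE COMPRESSION my_table;
    -- The output lists a suggested encoding and estimated space saving per column,
    -- which can then be applied in the table's DDL.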
Amazon has actually released a Python script that can apply this automatically to your database. You can find the script here: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/ColumnEncodingUtility/analyze-schema-compression.py
Bit late but likely useful to anyone taking a look here:
Amazon can now decide on the best compression to use (see Loading Tables with Automatic Compression) if you are using a COPY command to load your table and there is no compression already defined on the table.
You just have to add COMPUPDATE ON to your COPY command.
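For example (the bucket and IAM role are placeholders; the target table should have no encodings defined):

    COPY my_table
    FROM 's3://my-bucket/data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV
    COMPUPDATE ON;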

Does PostgreSQL support transparent compressing of tables (fragments)?

I'm going to store large amount of data (logs) in fragmented PostgreSQL tables (table per day). I would like to compress some of them to save some space on my discs, but I don't want to lose the ability to query them in the usual manner.
Does PostgreSQL support such a transparent compression and where can I read about it in more detail? I think there should be some well-known magic name for such a feature.
Yes, PostgreSQL will do this automatically for you when individual values go above a certain size. Compression is applied to each individual data value though, not at the full table level. Meaning that if you have a billion rows that are very narrow, they won't get compressed. Or if you have very many columns, each with only a small value in it, they won't get compressed. Details about this scheme are in the manual.
If you need it on the full table level, a solution is to create a TABLESPACE for those tables that you want compressed, and point it to a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Details about this in the manual.
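A hedged sketch of that tablespace approach (the mount point and table names are hypothetical; the directory would need to sit on a filesystem with transparent compression, e.g. ZFS or Btrfs with compression enabled):

    -- The tablespace directory lives on a compressed filesystem
    CREATE TABLESPACE compressed_logs LOCATION '/mnt/compressed_fs/pgdata';

    -- One of the per-day log tables placed on it
    CREATE TABLE logs_2015_06_01 (
        log_time timestamptz,
        message  text
    ) TABLESPACE compressed_logs;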
Probably not what you have in mind but still useful info - Chapter 53. Database Physical Storage of the fine manual. The TOAST section warrants further attention.