Data modeling in columnar database vs multi-dimensional for reporting - amazon-redshift

While learning Redshift (my first columnar database), I am struggling to figure out how to approach designing the model. Columnar databases are said to favor flat table design, yet a star or snowflake schema can still be the better choice in some cases.
Here is a simple example of where I am struggling.
As you can see, the multi-dimensional approach has a few dimensions and one fact table. I could have made it a snowflake design, but I kept it simple with a star schema.
Approach 1: Pull the common columns out of the tables (in this scenario, demographics) into their own dimension. This could reduce the size of the Customer and Store tables, but it adds an extra dimension.
Approach 2: Flat table design with all the columns
My Questions:
Which approach do data modelers use to design data models in columnar databases like Redshift? Or do they use a different approach?
Considering this example, what is the best way to design a data model for data warehousing?
Which approach is good for reporting (considering that a client PC/laptop would have limited memory)? Even cloud reporting may become costly when a heavy data set is used.
Approach 3 will produce a massive data set for reporting. This could be costly when reporting with Power BI, Tableau, or any other self-service reporting tool.
The multi-dimensional approach is best for self-service reporting (cost and performance), but then it defeats the purpose of a columnar database.
Approach 1 is also good for reporting, but with more joins and complexity.

Sorry, late to the party.
I will post this as an answer because it is too long for a comment.
I saw in chat that test results show that star schema is better. But it was tested on a regular database (MSSQL), not a columnar one (such as Vertica, Redshift, Snowflake, BigQuery...).
Here is some experience from a project where I tested both approaches, OBT (one big table) and star schema, while implementing a DWH for reporting. This was more than 2 years ago, so don't expect much detail.
Database: Redshift, 2 nodes of dc2.8xlarge. This might be a bit of overkill, but the other option was a bunch of smaller nodes, which wouldn't have been more cost efficient. This example covers just one data area.
Data: ~6 tables which could be joined into something similar to a star schema, consisting of 3 fact tables and, depending on the denormalization level, 5-8 dimensions.
With various approaches and different optimization paths, SQL times of about 30 seconds were common with the star schema. That is not bad, but also not very responsive from a user's perspective.
SQL on flat denormalized fact tables rarely exceeds 5 seconds. Some tables contain more than 100 columns, and row counts are between 50M and 100M. To keep things simple, we use zstd compression for all columns.
In columnar databases, data compresses very well because many similar or identical values are stored together in a single column.
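To illustrate why repeated values in a single column compress so well, here is a minimal sketch. It uses run-length encoding as a simple stand-in (it is not zstd, which the setup above uses), and the column contents are made up for the example.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


# Hypothetical denormalized column: a store's country repeated on every fact row.
country_column = ["DE"] * 5 + ["FR"] * 3 + ["DE"] * 2

encoded = run_length_encode(country_column)
print(encoded)
print(len(country_column), "values stored as", len(encoded), "runs")
```

The more often a dimension attribute repeats down a column, the fewer runs there are to store, which is roughly why denormalized attributes cost less space in a columnar store than one might expect.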
We took the OBT approach, and there are some pros and cons:
pros:
Responsive reports in the reporting tool (the most important one).
Fewer objects for ETL developers to handle.
Analysts who query the database directly can write simpler queries using fewer tables.
No need to worry about data inconsistencies when some dimensions are outdated, which could happen in a star schema.
Easier approach for reporting tool cache clearing.
Easier reporting performance tuning.
Easier modeling in the reporting tool; no need to define table join strategies.
cons:
Might take more space. We didn't really test this closely, as storage space is not an issue for us.
Filters in reporting tools might take a bit longer to provide a list of values (select distinct one_column from table).
Table refresh might take a bit longer for one big table compared to multiple smaller tables.
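As a rough illustration of the trade-off in the list above, the sketch below builds the same data once as a tiny star schema and once as one big table, then runs the same report against both. It uses SQLite from the Python standard library purely as a stand-in, so it says nothing about Redshift timings, and all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: one fact table plus one dimension.
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);

    -- One big table: the dimension attribute is repeated on every row.
    CREATE TABLE obt_sales AS
    SELECT f.store_id, d.region, f.amount
    FROM fact_sales f JOIN dim_store d ON d.store_id = f.store_id;
""")

# The star schema report needs a join and a join strategy.
star_result = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store d ON d.store_id = f.store_id
    GROUP BY d.region ORDER BY d.region
""").fetchall()

# The OBT report reads a single table.
obt_result = conn.execute(
    "SELECT region, SUM(amount) FROM obt_sales GROUP BY region ORDER BY region"
).fetchall()

# Filter values for a reporting tool: scans the big table instead of a small dimension.
filter_values = [
    row[0]
    for row in conn.execute("SELECT DISTINCT region FROM obt_sales ORDER BY region")
]

print(star_result == obt_result, filter_values)
```

Both queries return the same totals; the difference is that the OBT query has no join to model or tune, while the filter list has to be derived from the large table.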
Hopefully this helps.

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is present in a single table. But that table contains more than we need, around 300 columns. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed fairly evenly so that each node does about an equal amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
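The co-location point can be sketched with a toy simulation. This is not how Redshift hashes rows internally; the modulo function, the node count, and the sample rows are all assumptions made for the illustration. It counts how many fact rows would have to be shipped to another node to meet their matching dimension row.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def distribute_by_key(rows, key_index):
    """KEY-style distribution: rows with the same key value land on the same node.

    Modulo is a simple stand-in for the database's internal hash.
    """
    nodes = defaultdict(list)
    for row in rows:
        nodes[row[key_index] % NUM_NODES].append(row)
    return nodes


def distribute_evenly(rows):
    """EVEN-style distribution: rows are dealt out round-robin, ignoring their values."""
    nodes = defaultdict(list)
    for position, row in enumerate(rows):
        nodes[position % NUM_NODES].append(row)
    return nodes


def rows_to_move(fact_nodes, dim_nodes):
    """Count fact rows whose matching dimension row sits on a different node."""
    dim_location = {row[0]: node for node, rows in dim_nodes.items() for row in rows}
    return sum(
        1
        for node, rows in fact_nodes.items()
        for row in rows
        if dim_location[row[0]] != node
    )


facts = [(1, 10.0), (2, 4.0), (5, 3.0), (6, 8.0)]  # (user_id, amount)
users = [(1, "DE"), (2, "FR"), (5, "DE"), (6, "US")]  # (user_id, country)

fact_nodes = distribute_by_key(facts, 0)
colocated_moves = rows_to_move(fact_nodes, distribute_by_key(users, 0))
scattered_moves = rows_to_move(fact_nodes, distribute_evenly(users))
print(colocated_moves, scattered_moves)
```

When both tables are distributed on user_id, every join partner is already on the same node; when the dimension is spread without regard to the key, rows have to travel before the join can run.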
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
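A minimal sketch of such a flag column follows, using SQLite from the Python standard library as a stand-in for the warehouse. The customers table, the last_used column, and the fixed "today" date are assumptions for the example; in a real warehouse the flag would more likely be derived from a separate usage table during the daily load.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, last_used TEXT, customer_is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (customer_id, last_used) VALUES (?, ?)",
    [(1, "2024-01-20"), (2, "2023-11-02"), (3, "2024-01-05")],
)


def refresh_active_flag(connection, as_of):
    """Daily job: mark customers who used the service in the past 30 days."""
    cutoff = (as_of - timedelta(days=30)).isoformat()
    connection.execute(
        "UPDATE customers SET customer_is_active = (last_used >= ?)", (cutoff,)
    )


# Fixed date so the example is reproducible.
refresh_active_flag(conn, date(2024, 1, 31))

# Report queries can now filter on the flag without joins or date arithmetic.
active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customers WHERE customer_is_active "
        "ORDER BY customer_id"
    )
]
print(active_ids)
```

The date logic lives in one place (the refresh job), and everyone querying the table gets the same definition of "active".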
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

When to use dynamoDB -UseCases

I've been trying to figure out which use cases best suit Amazon DynamoDB.
When I googled, most of the blogs say DynamoDB should be used only for large amounts of data (big data).
I have a relational DB background, and NoSQL is new to me, so I've tried to relate this to what I know about relational DBs.
Most of the concepts related to DynamoDB are about creating a schema-less table with partition keys/sort keys and querying it based on those keys. Also, there is no concept of a stored procedure, which would make queries easier and simpler.
If we are managing such huge data, is running such complex queries every time to retrieve data the correct approach without stored procedures?
Note: I may have misunderstood the concept, so please correct me if so.
Thanks in advance
Jay
In short, systems like DynamoDB are designed to support big data sets (too big to fit a single server) and high write/read throughput by scaling horizontally, as opposed to scaling vertically, which is the more common approach for relational databases historically.
The main approach to support horizontal scalability is by partitioning data, i.e. a data set is split into multiple pieces and distributed among multiple servers. This way it may use more storage and more IOPS, allowing bigger data sets and higher read/write throughput.
However, data partitioning makes it difficult to support complex queries, such as joins etc., as data is distributed among multiple physical servers. As for stored procedures, they are not supported for the same reason - historically the idea behind stored procedures is data locality, i.e. they run on the server near the data without network operations, however, if data is distributed among multiple servers, this benefit disappears (at least in the form of stored procedure).
Therefore the most efficient way to query data from such systems is by record key, as data partitioning is based on a key and it's easy to figure out where a record lives physically for a given key. While many such systems also support secondary indexes, they are usually restricted in some way or expensive and may not be enough to satisfy requirements in a complex software solution. A quite common approach is to have a complementary indexing/query solution (I've seen solutions based on Elasticsearch and Solr), which allows running complex queries over some fragments of records to figure out a record key, which is then used to load the record.
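The idea can be sketched in a few lines. This is a toy model, not DynamoDB's API or its actual partitioning scheme: the PartitionedStore class, the SHA-256 based placement, and the plain dictionary standing in for an external index such as Elasticsearch or Solr are all assumptions for illustration.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that spreads records over several partitions."""

    def __init__(self, num_partitions):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # The key alone determines where a record lives.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, record):
        self.partitions[self._partition_for(key)][key] = record

    def get(self, key):
        # Lookup by key touches exactly one partition.
        return self.partitions[self._partition_for(key)].get(key)

    def scan(self, predicate):
        # Any other access pattern has to visit every partition.
        return [r for p in self.partitions for r in p.values() if predicate(r)]


store = PartitionedStore(num_partitions=3)
city_index = {}  # stand-in for a complementary index kept outside the store

users = {
    "user#1": {"name": "Ann", "city": "Oslo"},
    "user#2": {"name": "Bob", "city": "Lima"},
    "user#3": {"name": "Cy", "city": "Oslo"},
}
for key, record in users.items():
    store.put(key, record)
    city_index.setdefault(record["city"], []).append(key)

by_key = store.get("user#2")
# Complex query answered by the index, records then loaded by key.
oslo_names = sorted(store.get(k)["name"] for k in city_index["Oslo"])
# Same answer via a full scan of all partitions.
scanned_names = sorted(r["name"] for r in store.scan(lambda r: r["city"] == "Oslo"))
print(by_key, oslo_names, scanned_names)
```

The index lookup followed by key reads gives the same result as the scan, but only the scan has to touch every partition.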

Normalization vs compression

I'm receiving messages from sensors into Kinesis, processing them with Lambda, and loading them into Redshift using Kinesis Firehose. All messages are parsed and inserted into one large staging table. We need to do aggregation/analytics of sensor data. Besides sensor data, there is also a lot of info in the header that we store but currently don't use.
Does it make sense for me to load data from this staging table into a normalized star schema, or just enable compression on columns and use one huge denormalized table instead? How well does Redshift work with denormalized data? What are the pros and cons of both options?
In my experience huge tables with lots of columns cause slow queries. If you create narrower tables instead of wide ones you might get better performance. Before deciding what to do you should consider the queries for analysis and the queries for creating aggregate tables, as well as the sparsity of the data. On the other hand, joins are expensive overall. If you need a structure requiring a lot of joins, then you should adjust the sort and dist keys accordingly.
Here is the relevant AWS blog post: https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/

Guideline when designing a database in Postgresql

I am designing a database in PostgreSQL and I would like some expert advice before refactoring my work.
The database naturally contains different parts that I plan to separate into schemas so that object names reflect their logical organization. About 20 tables are for scientific purposes, 20 others are technical, and 20 more are about administrative tasks.
Is that a good idea, or am I leading myself into a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean there are more than 60 million rows in them and they might grow a little. I think I will create a special tablespace for those tables. I would like to do this to logically separate where the data are stored, because the rest of the database should be backed up differently from those three tables.
Furthermore, one of those 3 tables contains binary data that are not heavy individually but add up when multiplied by the number of rows, and this table also grows faster than the other 2. So I will periodically purge it after backing up the table.
Is it a good idea to have more than one tablespace in a database? If so, are there any precautions to take when proceeding this way?
Thank you in advance for your advice.
Choosing good names and grouping database objects is always wise, and the overhead is usually not considerable.
As for using a separate tablespace within a single database, it should not cause any special problem either. I have a similar database (but in MySQL) with a large file table; I had to move all of its content to another server for optimization reasons, and I have had no problems with it so far.
There is a much more important matter in RDBMS design, and that's CORRECT TABLE INDEXING. I think choosing good indexes is the most critical phase of designing a relational database, and you'll see its effect soon (when you begin to write JOIN queries!).
In general, designing and implementing a database is an experimental job that depends on your situation and expertise, so you can't expect a one-size-fits-all set of instructions.

PostgreSQL: What is the maximum number of tables can store in postgreSQL database?

Q1: What is the maximum number of tables that can be stored in a database?
Q2: What is the maximum number of tables that can be unioned in a view?
Q1: There's no explicit limit in the docs. In practice, some operations are O(n) on number of tables; expect planning times to increase, and problems with things like autovacuum as you get to many thousands or tens of thousands of tables in a database.
Q2: It depends on the query. Generally, huge unions are a bad idea. Table inheritance will work a little better, but if you're using constraint_exclusion it will result in greatly increased planning times.
Both these questions suggest an underlying problem with your design. You shouldn't need massive numbers of tables, and giant unions.
Going by the comment in the other answer, you should really just be creating a few tables. You seem to want to create one table per phone number, which is nonsensical, and to create views per number on top of that. Do not do this, it's mismodelling the data and will make it harder, not easier, to work with. Indexes, where clauses, and joins will allow you to use the data more effectively when it's logically structured into a few tables. I suggest studying basic relational modelling.
If you run into scalability issues later, you can look at partitioning, but you won't need thousands of tables for that.
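To make the suggestion concrete, here is a minimal sketch of one table with an index replacing a table per phone number. It uses SQLite from the Python standard library rather than PostgreSQL so that it runs anywhere, and the calls table, its columns, and the sample numbers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table for all numbers; the phone number is just a column.
    CREATE TABLE calls (
        phone_number TEXT NOT NULL,
        called_at    TEXT NOT NULL,
        duration_s   INTEGER NOT NULL
    );
    -- The index gives per-number access without a table per number.
    CREATE INDEX calls_by_number ON calls (phone_number, called_at);
""")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("555-0100", "2024-01-02T09:00", 60),
        ("555-0199", "2024-01-02T09:05", 30),
        ("555-0100", "2024-01-01T18:30", 120),
    ],
)

# What a per-number view would have returned, expressed as a WHERE clause.
calls_for_number = conn.execute(
    "SELECT called_at, duration_s FROM calls "
    "WHERE phone_number = ? ORDER BY called_at",
    ("555-0100",),
).fetchall()

# An aggregate across all numbers needs no UNION at all.
totals = conn.execute(
    "SELECT phone_number, SUM(duration_s) FROM calls "
    "GROUP BY phone_number ORDER BY phone_number"
).fetchall()
print(calls_for_number, totals)
```

Adding a new phone number is then an INSERT rather than a schema change, and cross-number reports are ordinary GROUP BY queries.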
Both are, in a practical sense, without limit.
The number of tables a database can hold is restricted by the space on your disk system. However, having a database with more than a few thousand tables is probably more an expression of an incorrect analysis of your application domain. Same goes for unions: if you have to union more than a handful of tables you probably should look at your table structure.
One practical scenario where this can happen is with Postgis: having many tables with similar attributes that could be joined in a single view (this is a flaw in the design of Postgis IMHO), but that would typically be handled at the application side (e.g. a GIS).
Can you explain your scenario where you would need a very large number of tables that need to be queried in one sweep?