I am trying to create dimensional model on a flat OLTP tables (not in 3NF).
There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across and each node typically does a portion of the work required to answer each query. There for the way data is distributed across the nodes becomes important, the aim is usually to have the data distributed in a fairly even manner so that each node does about equal amounts of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attibutes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for causal users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.
Related
My source is JSON with nested arrays & structures. (examples at end of post)
Large volume of new data streaming real-time (20m/day)
I have to decide how to store this data, considering.....
-- End users want to use 'traditional' SQL
-- Performance (ingestion & query)
-- Load on Cluster
As far as I can see, my options, are to make use of the SUPER data type, or just convert everything to traditional relational tables & types.
(Even if I store the full JSON as a super, I still have to serialize critical attributes into regular columns for the purposes of Distribution/Sort.)
Regardless, been trying to weigh up the pros & cons of super vs. 'traditional'.
(1) Store full JSON as SUPER type
-- Very easy to ingest data with low load on cluster
-- Maybe an additional load on cluster & performance impact to execute end user queries?
-- End users would have to learn PartiQL and deal with unnesting & serialization etc
(2) 'Traditional' Relational Tables & Types
(a) Load as super, but then use PartiQL to unnest, serialize and store in relational tables
-- Additional continuous load on cluster
-- Easy to implement (insert into)
-- Would result in some massive tables for the 'tag' nodes
(b) Use lambda to pre-unnest & serialize json, insert/copy directly into relational tables
-- Lambda would be invoked continuously
(3) Redshift Spectrum
If I am converting to relational structure (2b), could simply store in S3 and utilze Redshift Spectrum to query
-- No load on cluster to ingest/process incoming data
-- Cost/maintenance of lambda/other process to transform JSON
Questions:
Are the above understandings correct ?
Any other considerations not listed ?
Is there a 'standard' for this scenario ?
Any other guidance/wisdeom welcome !
Background info
Schema:
events:array[ struct{
channels:array[
struct{
tags:array[struct{}]
}
tags:array[struct{}]
]
}
]
Schema Exploded view:
You will not want to leave things as a monolithic json. Any data that will be repeatedly queries in analytics will want to be its own column. The database work to expand the json at ingestion will be dwarfed by the work to repeatedly expand it for every query.
Any data that will be commonly used in a where clause, group by, partition, join condition etc will likely need to be its own column. I'd expect any data that is common for 90% of the json elements you will want to be in unique columns. Json element that are rare, unique, or of little analytic interest can be kept in super columns that have just these subset parts of the json.
The data size increase will be less than you think, Redshift is good a compressing columns. The ingestion load is unlikely to be a major concern but if it is then the Lambda approach is a reasonable way to extend the compute resources to address. I really don't expect this but if needed can be folded in easily to the existing ETL processes.
A hazard you will face is that users will only reference the json and not the extracted columns. Re-extracting the same data repeatedly costs. I'd consider NOT keeping the entire json in the main fact tables, only json pieces that represent the data not otherwise in columns. Keeping the original jsons in a separate table keyed with an identity column will allow joining if some need arises but the goal will be to not need to do this.
Spectrum does not look like a good fit for this use case. Spectrum does well when the compute elements in S3 can apply the first level where clauses and simple aggregations. I just don't see Spectrum working on this data so it will just send the entire json data to Redshift repeatedly. This will make things slow and tie up a ton of network bandwidth. Now storing the full original json with identity column in S3 and having all the expanded columns plus left-over json elements in native Redshift table does make sense. This way if some need to reference the full json arises it is just an external table reference away.
In my way of learning Redshift (my first columnar database), I am struggling to figure out the approach for designing the model. Columnar database does promote flat table design, yet admits that star schema or snowflake could be a better choice for some cases.
Here is a simple example of where I am struggling
As you can see multi-dimensional approach have few dimensions and 1 fact table. I could have made it snowflake design but I kept it simple for star schema.
Approach 1: Used common columns from tables (in this scenario demographics). This could reduce the table size for Customer & Store but will include the extra dimension.
Approach 2: Flat table design with all the columns
My Questions:
Which approach data modeler use to design data model in columnar databases like Redshift? Or they use different approach?
Considering this example, what is the best way to design a data model for data warehousing.
Which approach is good for reporting (considering that client PC\Laptop would have limited memory). Or even cloud reporting may become costly when heavy data set is used.
Approach 3 will produce a massive amount of data set for reporting. This could be a costly affair if doing reporting (using Power BI or Tableau or any other Self reporting tool)
Multidimenion approach is best for self reporting (cost & performance) but then it defeats the purpose of columnar database.
Approach 1 is also good for reporting but with more joins & complexity.
Sorry, late to the party.
I will post is as answer, because it is too long for a comment.
I saw in chat that test results show that star schema is better. But it was tested on regular (MSSQL), not columnar database (just as vertica, redshift, snowflake, bigquery..).
There is some experience from project implementation where I tested both approaches - OBT and star schema while implementing dwh for reporting. Ths was already more than 2 years ago, so don't expect much details.
Database: Redshift 2 nodes of dc2.8xlarge. Might be a bit overkill, but other option was to have a bunch of lower level nodes, which wouldn't be more cost efficient. This example will be just for one data area.
Data: ~ 6 tables which could be joined as somewhat similar to star schemas. Containing of 3 fact tables and based on denormalization level 5-8 dimensions.
With various approaches and different optimization paths, using star schema it would be common to reach SQL times to about 30 seconds. Which is not bad, but also not too responsive from user perspective.
SQLs on flat denormalized fact tables rarely exceed 5 seconds. Some tables contain more than 100 columns, row counts are between 50M and 100M. To not overcomplicate, we use zstd compression for all columns.
In columnar databases data compresses very well as many similar or same values are used in single column.
We took OBT table approach and there are some pros and cons:
pros:
Responsive reports in reporting tool (most important one)
Fewer objects for ETL developers to handle.
Analysts which query database directly can create simpler queries using less tables.
Don't need to worry about data inconsistencies if some dimensions are outdated, which could happen in star schema.
Easier approach for reporting tool cache clearing.
Easier reporting performance tuning.
Easier modeling in reporting tool, do not need to define table join strategies.
cons:
Might take more space. Didn't really tested this closely as storage space is not an issue for us.
Filters in reporting tools might take a bit longer to provide list of values (select distinct one_column from table)
Table refresh might take a bit longer for one big table compared to multiple smaller tables.
Hopefully this helps.
I've tried to figure out what will be the best use cases that suit for Amazon dynamoDB.
When I googled most of the blogs says DyanmoDb will be used only for a large amount of data (BigData).
I'm having a background of relational DB. NoSQL DB is new for me.So when I've tried to relate this to normal relation DB knowledge.
Most of the concepts related to DynamoDb is to create a schema-less table with partition keys/sort keys. And try to query them based on the keys.Also, there is no such concept of stored procedure which makes queries easier and simple.
If we managing such huge Data's doing such complex queries each and every time to retrieve data will be the correct approach without a stored procedure?
Note: I've maybe had a wrong understanding of the concept. So, please anyone clear my thoughts here
Thanks in advance
Jay
In short, systems like DynamoDB are designed to support big data sets (too big to fit a single server) and high write/read throughput by scaling horizontally, as opposed to scaling vertically, which is the more common approach for relational databases historically.
The main approach to support horizontal scalability is by partitioning data, i.e. a data set is split into multiple pieces and distributed among multiple servers. This way it may use more storage and more IOPS, allowing bigger data sets and higher read/write throughput.
However, data partitioning makes it difficult to support complex queries, such as joins etc., as data is distributed among multiple physical servers. As for stored procedures, they are not supported for the same reason - historically the idea behind stored procedures is data locality, i.e. they run on the server near the data without network operations, however, if data is distributed among multiple servers, this benefit disappears (at least in the form of stored procedure).
Therefore the most efficient way to query data from such systems is by record key, as data partitioning is based on a key and it's easy to figure out where a record lives physically for a given key. While many such systems also support secondary indexes, they are usually restricted in some way or expensive and may not be enough to satisfy requirements in a complex software solution. A quite common approach is to have a complementary indexing/query solution (I've seen solutions based on Elasticsearch and Solr), which allows running complex queries over some fragments of records to figure out a record key, which then used to load the record.
I'm receiving messages from sensors into Kinesis, process it using lambda and load to Redshift using Kinesis Firehose. All messages are parsed and inserted into one large staging table. We need to do aggregation/analytics of sensor data. Beside sensor data, there are also a lot of info in the header we store but currently don't use.
Does it make sense for me to load data from this staging table into normalized star schema or just enable compression on columns and use one huge denormalized table instead? How well Redshift works with denormalized data? Pros and cons of both options?
In my experience huge tables with lots of columns cause slow queries. If you create narrower tables instead of a wide ones you might get better performance. Before deciding what to do you should consider the queries for analysis and the queries for creating aggregate tables as well as sparsity of the data. On the other hand Joins are expensive overall. And if you need a structure requiring a lot of 'join' then you should adjust the sort and dist keys accordingly.
Here is the documentation https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/
I have data that correspond to 400 millions of rows in a table and it will certainly keep increasing, I would like to know what can I do to have such a table in PostgreSQL in a way that it would still be posible to make complex queries using it. In other words what should I do to have all the data in the most performative way?
Try to find a way to split your data into partitons (e.g. by day/month/week/year).
In Postgres, it is implemented using inheritance.
This way, if your queries are able to just use certain partitions, you'll have to handle less data at a time (e.g. read less data from disk).
You'll have to design your tables/indexes/partitions together with your queries - their struture will depend on how you want to use them.
Also, you could have overnight jobs preparing materialised views based on historical data. This way you don't have to delete you old data and you can deal with an aggregated view and most recent data only.