Star Schema horizontal scaling - amazon-redshift

AFAIK, for a relational database on MPP hardware, the key to performance is correct data distribution. Yet Dimensional Modeling is all about query flexibility: you don't even know how the data will be queried (and therefore shuffled) in the future.
For example, say you have an MPP data warehouse (Greenplum, Redshift, Synapse Analytics) and expect that within 1-2 years your fact table will grow to 10 billion rows, with 15-30 dimension tables of tens of millions of rows each. How should the data be distributed across DW nodes? Are there any common techniques, like sharding the fact table and replicating the dimension tables? Or should I minimize the number of nodes in the MPP DW?
I could describe a specific use case, but I believe the question arises from my misunderstanding of how Dimensional Modeling can be paired with scaling out.

One technique I’ve seen applied with success in the past is: segment the fact table (e.g., by mod’ing the date key), and replicate all dimension tables to every node. That way all joins can be done locally.
Note that even with large dimensions, their total size on disk should be a small fraction of the total needed for the fact table.
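In Redshift terms, that pattern maps to DISTKEY on the fact table and DISTSTYLE ALL on the dimensions. A minimal sketch, assuming hypothetical table and column names:

-- Hypothetical star-schema tables, for illustration only.
-- Fact rows are hash-distributed on the date key, so rows sharing
-- a date_key land on the same slice.
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    sales_amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (date_key)
SORTKEY (date_key);

-- A full copy of each dimension is kept on every node,
-- so joins against it never need to shuffle data.
CREATE TABLE dim_product (
    product_key  INTEGER NOT NULL,
    product_name VARCHAR(255)
)
DISTSTYLE ALL;

In practice the distribution key is often a high-cardinality column used in joins rather than the date key, but the replicated-dimension half of the pattern is the same.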

Related

What are space partitioning and dimensions in TimescaleDB

I am new to TimescaleDB. I was learning about chunks and how to create chunks based on time.
But there is also time/space chunking, which is confusing me a lot. Please help me with the questions below.
What is a dimension in TimescaleDB?
What is space chunking and how does it work?
Thanks in advance.
A dimension in TimescaleDB is associated with a column. Each hypertable requires at least a time dimension, which is a time column for the time series. The hypertable is divided into chunks, where each chunk contains the data for one interval of the time dimension. As a result, new data usually arrives in the latest chunk, while the other chunks hold older data.
It is then possible to define space dimensions on other columns, for example a device column and/or a location column. No interval is defined for a space dimension; instead, a number of partitions is defined. So for the same time interval, several chunks will be created, one per partition. Data is distributed among the partitions by a hash function on the values of the space dimension. For example, if 3 partitions are defined for a space dimension on the device column and 12 distinct device values are present in the data, each chunk will contain 4 distinct values, assuming the hash function distributes the values uniformly.
Space dimensions are especially useful for parallel I/O, when data are stored on several disks. Another scenario is multinode, i.e., the distributed version of hypertables (a beta feature coming to release in 2.0).
There are also some more complex use cases where space partitioning is helpful.
You can read more in the add_dimension docs and the cloud KB article about space partitioning.
A note in the doc:
Supporting more than one additional dimension is currently experimental.
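A minimal sketch of both kinds of dimensions, assuming a hypothetical conditions table with a time column and a device_id column:

-- Hypothetical example table.
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- Time dimension: each chunk covers one day of data.
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- Space dimension: hash-partition on device_id into 3 partitions,
-- so each day now produces 3 chunks, one per partition.
SELECT add_dimension('conditions', 'device_id', number_partitions => 3);

Note that add_dimension is typically run right after create_hypertable, before any data has been inserted.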

MongoDB Migration Threshold Controls?

I'm seeking a way to control sharded collection migration thresholds in mongodb. These thresholds are described at https://docs.mongodb.com/manual/core/sharding-balancer-administration/#sharding-migration-thresholds
What I see in those values is that the migration thresholds are tuned to roughly 10% of the chunk count for small numbers of chunks (fewer than 20 chunks: 2; 20-79: 4; 80 and above: 8). Above that, the threshold is locked at 8 chunks: a difference of just 8 chunks between shard members will trigger migration activity.
For our collections with high activity rates and large bodies of data, this causes balancing thrash: there is almost always a difference of 8 chunks. With high transaction rates on a sharded collection, there is a range of perfectly acceptable causes of temporary imbalance (which I won't go into here). When we shut off the balancer, small temporary imbalances are often corrected organically as activity across the cluster shifts. With the balancer turned on, by the time it finishes one migration, another (or many in parallel) triggers right away.
With the thresholds locked down like this, our larger collections thrash all the time - consuming IOPS and network bandwidth that we would really like to use in other ways. These tiny migrations have no practical benefit, either: if we're talking about a large collection, then 8 chunks can be a vanishingly small quantity of data relative to any real workload. So we're spending a lot of energy moving lots of small snippets around for zero effective benefit.
I would love to find a config setting that, at a minimum, allows me to redefine those values. Even better would be a way to enforce a fractional policy, like 10% of the number of chunks in the collection. I don't see any controls of this type in the MongoDB documentation, but I could be missing them.
Failing that, I'll have to dig into the code and retool it myself to build from source, so I'm hoping someone has already solved this and I just can't see where to control it. Thanks in advance!

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resource I found covers calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e., it will determine how many nodes we end up using):
Rows: ~500 billion (the exact number of rows is known)
Columns: 15 (the data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries via Zone Maps, which identify the minimum and maximum values stored in each 1MB block. Highly compressed data is stored in fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (e.g., 1 billion rows), allow Redshift to automatically select the compression types, and then extrapolate to your full data size.
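A minimal sketch of that extrapolation, assuming a hypothetical sample table name; SVV_TABLE_INFO reports table size in 1 MB blocks:

-- Compressed footprint of the ~1 billion row sample (size is in MB).
SELECT "table", tbl_rows, size AS size_mb
FROM svv_table_info
WHERE "table" = 'my_table_sample';

-- Linear extrapolation to the full ~500 billion rows, converted to TB.
SELECT size * (500000000000.0 / tbl_rows) / 1024 / 1024 AS estimated_tb
FROM svv_table_info
WHERE "table" = 'my_table_sample';

The extrapolation is linear, so the per-column minimum block allocation described in the linked article becomes negligible once the sample is reasonably large.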

Is kdb fast solely due to processing in memory?

I've heard people say quite a few times that kdb can deal with millions of rows in nearly no time. Why is it that fast? Is that solely because the data is all organized in memory?
Another question: are there alternatives to it? Do any of the big database vendors provide in-memory databases?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
Source(s): http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
As for speed, the in-memory aspect does play a big part, but there are several other factors: fast reads from disk for the HDB, splaying, etc. From personal experience I can say you can get pretty good speeds from C++, provided you want to write that much code. With kdb you get all that and some more.
Another aspect of speed is speed of coding. It has a steep learning curve, but once you get it, complex problems can be coded in minutes.
As for alternatives, you can look at OneTick, or Google "in-memory databases".
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.

OLAP Cube: per Business Process? per Fact table?

So I've finished my dimensional modeling. It resulted in 2 business processes: one simple, with only one fact table and a few dimensions; the other a bit more complex, with 2 fact tables (related in a similar way as Invoice and InvoiceRecord) and a lot more dimensions.
My question now is how to start building the OLAP cube(s): one for each business process? Or one for each fact table within each business process?
You need to consider all the fact tables and dimension tables when creating a common star schema. You should consider creating a single cube unless the fact and dimension pairs are not interrelated at all. It all depends on your design.