Two performance-enhancement features of DB2, PARTITION and DISTRIBUTION, are confusing me. How can I understand the exact difference between them? And what type of field should be used for PARTITION and what for DISTRIBUTION?
Please refer to the online DB2 Knowledge Centre for your version and operating-system platform, which explains this in depth and gives the syntax. Below is only a summary.
For DB2 on Linux/Unix/Windows, a partitioned DB2 instance can run on multiple physical or logical hostnames, but a database in that partitioned instance appears as a single database to applications. There can be logical partitions (running on the same hostname) or physical partitions (running on different hostnames) in a shared-nothing arrangement, i.e. different CPUs, different disks, different RAM, etc. In a partitioned DB2 instance, tables can be distributed by hash ("hash partitioning") on a column chosen by the designer to spread the table data evenly over all chosen partitions, so a column with only 2 discrete values would be unsuitable. The designer can group partitions into as many groupings (partition groups) as make sense for the workload. To partition your DB2 instance you need a special licence; this configuration is also known as DPF (Database Partitioning Feature), and IBM sells (or at least used to sell) a hardware/software solution (the IBM Smart Analytics series) with configurations to suit a particular workload. This configuration is common for warehouse and decision-support/OLAP workloads on very large databases.
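For example, a minimal sketch of the DDL in a DPF environment (table and column names are purely illustrative) hashes on a column with many distinct values:

    -- Hypothetical table in a partitioned (DPF) DB2 instance.
    -- customer_id has many distinct values, so rows spread evenly
    -- across the database partitions of the partition group.
    CREATE TABLE sales (
        sale_id     BIGINT        NOT NULL,
        customer_id INTEGER       NOT NULL,
        sale_date   DATE          NOT NULL,
        amount      DECIMAL(12,2)
    )
    DISTRIBUTE BY HASH (customer_id);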
On large warehouses it is common to combine hash partitioning and range partitioning, but they can also be implemented separately.
Range partitioning (PARTITION BY RANGE) is a common technique to logically split a table into multiple separate data partitions (which can be in different tablespaces/storage objects). In this case it is the table that is partitioned, as distinct from the DB2 instance. The designer chooses a partitioning column that suits the workload; often the column has geographic scope or temporal scope (one partition per day/week/month/hour, etc.) or whatever makes sense logically. The designer usually arranges for the indexes to be partitioned as well, although global indexes are allowed. Range partitioning supports easier roll-in of new partitions on demand, and roll-out of old partitions (as part of table clean-up) with minimal concurrency overhead. This is crucial if databases need to remain within a certain size, with regular archival of old content that can be sent to tape or long-term, less costly storage outside of DB2.
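As a hedged sketch of DB2 LUW syntax (table names, partition names, and dates are illustrative), range partitioning plus the roll-in/roll-out of partitions might look like this:

    -- One data partition per month; old months can be detached (rolled out)
    -- and new months attached (rolled in) with minimal disruption.
    CREATE TABLE sales_history (
        sale_id   BIGINT NOT NULL,
        sale_date DATE   NOT NULL,
        amount    DECIMAL(12,2)
    )
    PARTITION BY RANGE (sale_date)
      (PARTITION p2024_01 STARTING '2024-01-01' ENDING '2024-01-31',
       PARTITION p2024_02 STARTING '2024-02-01' ENDING '2024-02-29');

    -- Roll-out: turn an old partition into its own table for archiving.
    ALTER TABLE sales_history
      DETACH PARTITION p2024_01 INTO sales_archive_2024_01;

    -- Roll-in: attach a pre-loaded staging table as a new partition.
    ALTER TABLE sales_history
      ATTACH PARTITION p2024_03 STARTING '2024-03-01' ENDING '2024-03-31'
      FROM sales_staging_2024_03;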
I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is present in a single table. But that table contains more than what we need, around 300 columns. Should I still separate the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed in a fairly even manner so that each node does about an equal amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't, Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), and any other dimensions may need to be distributed as ALL (copied to every node).
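A hedged sketch of what that can look like in Redshift DDL (table and column names are purely illustrative):

    -- Fact table and its largest dimension share a distribution key,
    -- so joins on user_id are collocated (no data movement between nodes).
    CREATE TABLE fact_orders (
        order_id   BIGINT,
        user_id    BIGINT,
        order_date DATE,
        amount     DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (order_date);

    CREATE TABLE dim_users (
        user_id BIGINT,
        segment VARCHAR(50)
    )
    DISTSTYLE KEY
    DISTKEY (user_id);

    -- A small dimension is copied to every node instead.
    CREATE TABLE dim_country (
        country_code CHAR(2),
        country_name VARCHAR(100)
    )
    DISTSTYLE ALL;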
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
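A minimal sketch of how such a flag might be refreshed each night (the table and column names here are hypothetical):

    -- Recalculate the convenience flag used in WHERE customer_is_active.
    UPDATE customer_report
    SET    customer_is_active = CASE
               WHEN last_order_date >= CURRENT_DATE - 30 THEN TRUE
               ELSE FALSE
           END;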
Bottom line: You should put ease of use above performance when building a Data Warehouse. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.
I am still not able to see, in real-world terms, how NoSQL is beneficial when we have indexes in traditional RDBMSs too. Could someone explain the advantages of columnar databases in a real application, particularly in terms of using structured, semi-structured or unstructured data?
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, an RDBMS can only take you so far.
It comes down to how you can scale to meet demand. An RDBMS is really only capable of scaling vertically: add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3-node cluster with 200GB drives. If I did nothing, those drives would pretty much fill up (and even if they didn't, they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
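In CQL terms, the replication factor is a property of the keyspace, so a hedged sketch (keyspace and data-centre names are illustrative) would be:

    -- RF = 3: every piece of data has three copies in the cluster.
    -- On 3 nodes each node owns 100% of the data; on 6 nodes, 50%.
    CREATE KEYSPACE product_data
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};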
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
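As a hedged CQL sketch (table and column names are made up), a table keyed around the natural question "give me Bob's data" might look like this:

    -- The partition key (user_name) matches the query, so the request is
    -- served by the node(s) owning that partition; created_at clusters
    -- the rows within the partition.
    CREATE TABLE users_by_name (
        user_name  text,
        created_at timestamp,
        user_id    bigint,
        email      text,
        PRIMARY KEY ((user_name), created_at)
    );

    SELECT * FROM users_by_name WHERE user_name = 'Bob Jones';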
Also, you want your data to have structure. Unstructured data is un-queryable data, simple as that. If you want unstructured data to be queryable, you need to parse out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource-consuming than their RDBMS counterparts, because they have to check each node and sort through replicas, and thus incur extra network time.
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which an RDBMS might be good at) more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different collections (the Mongo equivalent of tables) relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
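For comparison, a minimal sketch of the kind of constraint a relational database enforces for you (generic SQL, illustrative names):

    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100)
    );

    -- The database rejects any order whose customer_id does not exist.
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customers (customer_id)
    );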
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race conditions on inserts or updates. In a word: Mongo makes few guarantees with regard to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work in a more decentralised way.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time; it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transaction consistency during writing also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, this comes at the cost of allowing inconsistent data, which is why writing mostly means atomically updating a single document in a collection with no guarantees about other documents. That is something of a different paradigm from RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.
I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results since we allow customized filters on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected to the application at the same time
dataset growing at a rate of 10M rows/day
typical queries are SELECTs with the aggregate functions COUNT and AVG, and 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP there is a very nice open-source implementation called Mondrian that can run over a variety of databases (including Redshift AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans - you should not have any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to have.
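A hedged sketch of how to check this (the table names are illustrative; the DS_* labels are what Redshift prints in its plans):

    EXPLAIN
    SELECT f.user_id, COUNT(*), AVG(f.amount)
    FROM   fact_events f
    JOIN   dim_users u ON u.user_id = f.user_id
    GROUP  BY f.user_id;

    -- In the join step, DS_DIST_NONE means the tables are collocated;
    -- DS_BCAST_INNER or DS_DIST_BOTH means rows are being broadcast or
    -- redistributed between nodes at query time.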
The requirement to be able to filter/aggregate on arbitrary sets of 100 columns can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift optimisations. However, these are scalability problems - you can increase the number of nodes to match your performance; you just might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering. And it can store petabytes of data, so it's OK if you store data in excess.
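A hedged sketch of a pre-aggregated summary table, rebuilt on a schedule (all names are illustrative):

    -- Daily rollup kept alongside the raw fact table; ad-hoc queries that
    -- can be answered from it avoid scanning the detail rows.
    CREATE TABLE daily_summary
      DISTKEY (customer_id)
      SORTKEY (event_date)
    AS
    SELECT customer_id,
           event_date,
           COUNT(*)    AS events,
           AVG(amount) AS avg_amount
    FROM   fact_events
    GROUP  BY customer_id, event_date;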
So in summary, I don't think your use case is unsuitable based on just the definition you provided. It might require some work, and the details depend on the exact usage patterns.
I've seen two contradictory pieces of advice when it comes to designing row keys in HBase (specifically, but I think it applies to Cassandra as well):
Group keys that you'll be aggregating together often to take advantage of data locality. (White, Hadoop: The Definitive Guide and I recall seeing it on the HBase site, but can't find it...)
Spread keys around so that work can be distributed across multiple machines (Twitter, Pig, and HBase at Twitter slide 14)
I'm guessing which one is optimal can depend on your use case, but does anyone have any experience with either strategy?
In HBase, a table is partitioned into regions by dividing up the key space, which is sorted lexicographically. Each region of the table belongs to a single region server, so all reads and writes for that region are handled by that server (which allows for a strong consistency guarantee). This means that if all of your reads or writes are concentrated on a small range of your keyspace, you will only be able to scale to what a single region server can handle. For example, if your data is a time series and keyed by the timestamp, then all writes go to the last region in the table, and you will be constrained to writing at the rate that a single server can handle.
On the other hand, if you can choose your keys such that any given query only needs to scan a small range of rows, while the overall set of reads and writes is spread across your keyspace, then the total load will be distributed and scale nicely, and you can still enjoy the locality benefits for your query.
Please explain what a cluster is in an RDBMS.
In SQL, a cluster can also refer to a specific physical ordering of rows.
For example, consider a database with two tables: INVOICES and INVOICE_ITEMS. If many INVOICE_ITEMS are inserted concurrently, chances are that items of the same invoice end up in multiple physical blocks of the underlying storage. When reading such an invoice, unneeded data will be read together with the interesting rows. Clustering INVOICE_ITEMS on the foreign key to INVOICES groups the item rows of the same invoice together in the same block, thus reducing the number of read operations needed when accessing the invoice.
Read about clustered index on wikipedia.
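As a hedged example (SQL Server flavoured syntax; the column name is illustrative), declaring that physical ordering looks like this:

    -- Rows of INVOICE_ITEMS are physically ordered by their invoice,
    -- so all items of one invoice sit in adjacent pages/blocks.
    CREATE CLUSTERED INDEX ix_invoice_items_invoice
        ON INVOICE_ITEMS (invoice_id);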
In system administration, a "cluster" is a number of servers configured to provide the same service, but which look like one server to the users.
This can be done for performance reasons (two servers can answer more requests than a single one) or redundancy (if one server crashes, the others still work).
Such configurations often need special software or setup to work. Some services, like serving static web content, can be clustered very easily. Others, like RDBMS, need complicated replication schemes to coordinate.
Read about computer clusters on wikipedia.
In statistics, a cluster is a "group of items so that objects from the same cluster are more similar to each other than objects from different clusters."
Read about Cluster analysis on wikipedia.
From here:
High-availability clusters (also known as HA Clusters or Failover Clusters) are computer clusters that are implemented primarily for the purpose of providing high availability of services which the cluster provides. They operate by having redundant computers or nodes which are then used to provide service when system components fail. Normally, if a server with a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware/software faults, and immediately restarting the application on another system without requiring administrative intervention, a process known as failover.
In a database context it can have two completely different meanings:
it may either mean data clustering or index clustering, which is the grouping of similar rows. This is useful for data mining; some databases (e.g. Oracle) also use it to optimize physical data organization;
or a cluster as in a database running on many closely linked servers.
Clustering, in the context of databases, refers to the ability of several servers or instances to connect to a single database.
An instance is the collection of memory and processes that interacts with a database, which is the set of physical files that actually store data.