Is it possible to alter an existing table in DB2 to add a hash partition? Something like...
ALTER TABLE EXAMPLE.TEST_TABLE
PARTITION BY HASH(UNIQUE_ID)
Thanks!
If you run a Db2-LUW local server on zLinux, the following syntax may be available:
ALTER TABLE .. ADD DISTRIBUTE BY HASH (...)
This syntax is not available if the zLinux is not running a Db2-LUW server but is instead only a client of Db2-for-z/OS.
For this syntax to be meaningful, there are various prerequisites. Refer to the documentation for details of partitioned instances, database partition groups, distribution key rules, default behaviours, etc.
The intention of distributed tables (spread over multiple physical and/or logical partitions of a partitioned database in a partitioned Db2-instance) is to exploit hardware capabilities. So if your Db2-instance, database, and tablespaces are not appropriately configured, this syntax has limited value.
Depending on your true motivations, partition by range may offer functionality that is useful. Note that partition by range can be combined with distribute by hash if the configuration is appropriate.
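For illustration, here is a minimal sketch of that combination on a Db2-LUW partitioned instance. The new table name, the CREATED_DT column, and the range values are made up for the example, and the ALTER form is only accepted under the prerequisites mentioned above (for example, the table must not already have a distribution key):

CREATE TABLE EXAMPLE.TEST_TABLE_NEW (
    UNIQUE_ID  BIGINT NOT NULL,
    CREATED_DT DATE   NOT NULL
)
DISTRIBUTE BY HASH (UNIQUE_ID)
PARTITION BY RANGE (CREATED_DT)
    (STARTING '2023-01-01' ENDING '2023-12-31' EVERY 3 MONTHS);

-- adding a distribution key to an existing, not-yet-distributed table
ALTER TABLE EXAMPLE.TEST_TABLE
    ADD DISTRIBUTE BY HASH (UNIQUE_ID);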
I've been reading about logical replication in PostgreSQL, which seems to be a very good solution for sharing a small number of tables among several databases. My case is even simpler, as my subscribers will only use source tables in a read-only fashion.
I know that I can add extra columns to a subscribed table in the subscribing node, but what if I only want to import a subset of the whole set of columns of a source table? Is it possible or will it throw an error?
For example, my source table product has a lot of columns, many of them irrelevant to my subscriber databases. Would it be feasible to create replicas of product with only the really needed columns at each subscriber?
The built-in publication/subscription method does not support this. But the logical replication framework also supports any other decoding plugin you can write (or get someone else to write) and install, so you could make this happen that way. It looks like pglogical already supports this ("Selective replication of table columns at publisher side"), but I have never tried to use this feature myself.
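If you do go the pglogical route, the column filter is, as far as I understand the pglogical documentation, passed when the table is added to a replication set. Treat the exact parameter names and column list below as an assumption, since I have not used this myself:

SELECT pglogical.replication_set_add_table(
    set_name := 'default',
    relation := 'public.product',
    columns  := ARRAY['product_id', 'name', 'price']
);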
As of v15, PostgreSQL supports publishing a table partially, indicating which columns must be replicated out of the whole list of columns.
A case like this can be done now:
CREATE PUBLICATION users_filtered FOR TABLE users (user_id, firstname);
See https://www.postgresql.org/docs/15/sql-createpublication.html
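On the subscribing node you then create a table that contains (at least) the published columns and subscribe as usual. The connection string below is only a placeholder:

CREATE TABLE users (
    user_id   integer PRIMARY KEY,
    firstname text
);

CREATE SUBSCRIPTION users_sub
    CONNECTION 'host=source-host dbname=sourcedb user=repuser'
    PUBLICATION users_filtered;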
I have read this in the Distributed Engine documentation about internal_replication setting.
If this parameter is set to ‘true’, the write operation selects the first healthy replica and writes data to it. Use this alternative if the Distributed table “looks at” replicated tables. In other words, if the table where data will be written is going to replicate them itself.
If it is set to ‘false’ (the default), data is written to all replicas. In essence, this means that the Distributed table replicates data itself. This is worse than using replicated tables, because the consistency of replicas is not checked, and over time they will contain slightly different data.
I am using the typical KafkaEngine with Materialized View (MV) setup, plus Distributed tables.
I have a cluster of instances, where there are ReplicatedReplacingMergeTree and Distributed tables over them as you can see below:
CREATE TABLE IF NOT EXISTS pageviews_kafka (
    -- .. fields
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = '%%BROKER_LIST%%',
    kafka_topic_list = 'pageviews',
    kafka_group_name = 'clickhouse-%%DATABASE%%-pageviews',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n';

CREATE TABLE IF NOT EXISTS pageviews (
    -- .. fields
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/%%DATABASE%%/pageviews', '{replica}', processingTimestampNs)
PARTITION BY toYYYYMM(dateTime)
ORDER BY (clientId, toDate(dateTime), userId, pageviewId);

CREATE TABLE IF NOT EXISTS pageviews_d AS pageviews
ENGINE = Distributed('my-cluster', %%DATABASE%%, pageviews, sipHash64(toString(pageviewId)));

CREATE MATERIALIZED VIEW IF NOT EXISTS pageviews_mv TO pageviews_d AS
SELECT
    -- fields
FROM pageviews_kafka;
Questions:
I am using the default value of internal_replication in the Distributed table, which is false. Does this mean that the Distributed table is writing all data to all replicas? So, if I set internal_replication to true, then each instance of ReplicatedReplacingMergeTree will have only its share of the whole table, instead of the whole dataset, hence optimizing data storage? But if it's like that, wouldn't replication be compromised too - how can you define a certain number of replicas?
I am using the id of the entity as the distribution hash. I've read in the ClickHouse Kafka Engine FAQ by Altinity, question "Q. How can I use a Kafka engine table in a cluster?", the following:
Another possibility is to flush data from a Kafka engine table into a Distributed table. It requires more careful configuration, though. In particular, the Distributed table needs to have some sharding key (not a random hash). This is required in order for the deduplication of ReplicatedMergeTree to work properly. Distributed tables will retry inserts of the same block, and those can be deduped by ClickHouse.
However, I am using a semi-random hash here (it is the entity id, the idea being that different copies of the same entity instance - pageview, in this example case - are grouped together). What is the actual problem with it? Why is it discouraged?
I am using the default value of internal_replication in the Distributed table, which is false.
You SHOULD NOT. It MUST BE TRUE.
You were lucky and the data has not been duplicated yet, because of insert deduplication.
But eventually it will be duplicated, because your Distributed table does 2 identical inserts into 2 replicas, and the replicated table replicates the inserted data to the other replica as well (in your case the second replica skips the insert from the Distributed table only because you are lucky).
then each instance of ReplicatedReplacingMergeTree will have only its share of the whole table
you are mistaken.
Distributed (internal_replication=true) inserts into ALL SHARDS.
Distributed (internal_replication=false) inserts into ALL SHARDS + into ALL REPLICAS.
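To see how many shards and replicas a Distributed insert fans out to in your setup, you can check the cluster definition from SQL (just a sanity check, using the cluster name from your DDL):

SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'my-cluster';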
It requires more careful configuration
I am using a semi-random hash here
It requires more careful configuration and you did it, using sipHash64(toString(pageviewId)).
You stabilized the order: if an insert is repeated, the same rows go to the same shard, because the shard number for a row is calculated from pageviewId, not rand().
I am trying to find the best solution for building a database relation. I need something to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same column count, names and types).
In a single database, I would create a parent table with partitions. However, the volume of the data is too big to keep it in a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment the only solution I can think of is to build an API over the database addresses and use it to pull data across the network into the main parent database when needed. I also found a Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed, partitioned table in Citus: a table partitioned on some column (a timestamp?) and hash-distributed on some other column (like what you use in your existing approach). Query parallelization and data collection would be handled by Citus for you.
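As a minimal sketch of such a table (the events table, its columns, and the monthly partition are hypothetical, not taken from your schema):

CREATE TABLE events (
    event_id   bigint NOT NULL,
    client_id  bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (client_id, created_at, event_id)
) PARTITION BY RANGE (created_at);

-- hash-distribute the partitioned table on the distribution column
SELECT create_distributed_table('events', 'client_id');

-- partitions are declared as usual; Citus shards each of them
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

The primary key is valid here because it contains both the distribution column (client_id) and the partition column (created_at).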
I am reading a legacy data warehouse in PostgreSQL and found a list of tables named like
command
\list
result:
abc_1
abc_2
abc_3
...
abc_10000
What do these sequentially named tables in PostgreSQL mean in the context of a data warehouse? Why don't we just merge them into one table abc?
It is extremely likely that these will be partitions of a parent table abc. Check with \d+ abc_1. Does it mention any inheritance or parent table?
Even if they aren't part of an inheritance scheme it's likely to be partitioning, just handled at the application level.
Partitioning is a useful workaround for limitations in the database engine. In an ideal world it wouldn't be necessary, but in reality it can help some workloads and query patterns.
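If the \d+ output is not conclusive, the catalog will tell you directly. This query returns the parent of abc_1 for both old-style inheritance and declarative partitioning, and no rows if the table stands alone:

SELECT inhparent::regclass AS parent
FROM pg_inherits
WHERE inhrelid = 'abc_1'::regclass;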
I've got a problem with a PostgreSQL dump / restore. We have a production application running with PostgreSQL 8.4. I need to create some values in the database in the testing environment and then import just this chunk of data into the production environment. The data is generated by the application, and I need to use this approach because it needs testing before going into production.
Now that I described the environment, here is my problem:
In the testing database, I leave nothing but the data I need to move to the production database. The data is spread across multiple tables linked with foreign keys with multiple levels (like a tree). I then use pg_dump to export the desired tables into binary format.
When I try to import, the database will correctly import the root table entries with new primary key values, but does not import any of the data from the other tables. I believe the problem is that the foreign keys on the child tables no longer match the new primary key values.
Is there a way to achieve such an import which will update all the primary key values of all affected tables in the tree to correct serial (auto increment) values automatically and also update all foreign keys according to these new primary key values?
I have an idea how to do this with the assistance of a programming language while connected to both databases, but that would be very problematic for me to achieve since I don't have direct access to the customer's production server.
Thanks in advance!
That one seems to me like a complex migration issue. You can create PL/pgSQL migration scripts with inserts and use RETURNING to get the new serial values and use them as foreign keys for the other tables up the tree. I do not know the structure of your tree, but in some cases reading sequence values in advance into arrays may be required for complexity or performance reasons.
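As a minimal sketch of that idea (table and column names are hypothetical, and on 8.4 the logic has to live in a function, since DO blocks only arrived in 9.0):

CREATE FUNCTION import_chunk() RETURNS void AS $$
DECLARE
    new_root_id integer;
BEGIN
    -- let the production sequence assign a fresh primary key
    INSERT INTO root_table (name)
    VALUES ('imported root')
    RETURNING id INTO new_root_id;

    -- use the returned value as the foreign key one level down the tree
    INSERT INTO child_table (root_id, payload)
    VALUES (new_root_id, 'imported child');
END;
$$ LANGUAGE plpgsql;

SELECT import_chunk();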
Another approach could be to examine production sequence values and estimate sequence values that will not be used in the near future. Fabricate the test data in the test environment so that its serial values will not collide with production sequence values. Then load that data into the prod database and adjust the sequence values of the prod environment so that the test sequence values will not be reused. This will leave a gap in your ID sequence, so you must check whether anything (like other processes) relies on the sequence values being continuous.
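For that second approach, the sequence inspection and adjustment would look roughly like this (the sequence name and the target value are assumptions):

-- see where production currently is
SELECT last_value FROM root_table_id_seq;

-- after loading the fabricated test rows, jump the production sequence past them
SELECT setval('root_table_id_seq', 200000);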