ClickHouse: Usage of hash and internal_replication in Distributed & Replicated tables

I have read this in the Distributed engine documentation about the internal_replication setting:
If this parameter is set to ‘true’, the write operation selects the first healthy replica and writes data to it. Use this alternative if the Distributed table “looks at” replicated tables. In other words, if the table where data will be written is going to replicate them itself.
If it is set to ‘false’ (the default), data is written to all replicas. In essence, this means that the Distributed table replicates data itself. This is worse than using replicated tables, because the consistency of replicas is not checked, and over time they will contain slightly different data.
I am using the typical Kafka engine with materialized view (MV) setup, plus Distributed tables.
I have a cluster of instances with ReplicatedReplacingMergeTree tables and Distributed tables on top of them, as you can see below:
CREATE TABLE IF NOT EXISTS pageviews_kafka (
    -- .. fields
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = '%%BROKER_LIST%%',
    kafka_topic_list = 'pageviews',
    kafka_group_name = 'clickhouse-%%DATABASE%%-pageviews',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n';

CREATE TABLE IF NOT EXISTS pageviews (
    -- .. fields
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/%%DATABASE%%/pageviews', '{replica}', processingTimestampNs)
PARTITION BY toYYYYMM(dateTime)
ORDER BY (clientId, toDate(dateTime), userId, pageviewId);

CREATE TABLE IF NOT EXISTS pageviews_d AS pageviews
ENGINE = Distributed('my-cluster', %%DATABASE%%, pageviews, sipHash64(toString(pageviewId)));

CREATE MATERIALIZED VIEW IF NOT EXISTS pageviews_mv TO pageviews_d AS
SELECT
    -- .. fields
FROM pageviews_kafka;
Questions:
I am using the default value of internal_replication in the Distributed table, which is false. Does this mean that the Distributed table writes all data to all replicas? So, if I set internal_replication to true, then each instance of ReplicatedReplacingMergeTree will have only its share of the whole table instead of the whole dataset, hence optimizing data storage? If that is the case, wouldn't replication be compromised too? How can you define a certain number of replicas?
I am using the id of the entity as the distribution hash. I've read in the ClickHouse Kafka Engine FAQ by Altinity, question "Q. How can I use a Kafka engine table in a cluster?", the following:
Another possibility is to flush data from a Kafka engine table into a Distributed table. It requires more careful configuration, though. In particular, the Distributed table needs to have some sharding key (not a random hash). This is required in order for the deduplication of ReplicatedMergeTree to work properly. Distributed tables will retry inserts of the same block, and those can be deduped by ClickHouse.
However, I am using a semi-random hash here (it is the entity id; the idea is that different copies of the same entity instance, a pageview in this example, are grouped together). What is the actual problem with it? Why is it discouraged?

I am using default value of internal_replication in the Distributed table, which is false.
You SHOULD NOT. It MUST BE TRUE.
You have been lucky and the data has not been duplicated yet because of insert deduplication.
But eventually it will be duplicated, because your Distributed table performs two identical inserts into two replicas, while the Replicated table also replicates the inserted data to the other replica (in your case the second replica has skipped the insert coming from the Distributed table only because of that deduplication, i.e. because you are lucky).
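For reference, internal_replication is a per-shard flag in the cluster definition (the remote_servers section of the server config), not a table-level setting. If you want to spot-check whether the double writes have already caused drift, a hedged way (assuming your ClickHouse version has the clusterAllReplicas table function) is to compare row counts of the local table per host:
-- Diverging counts between replicas of the same shard indicate drift.
SELECT hostName() AS host, count() AS row_count
FROM clusterAllReplicas('my-cluster', %%DATABASE%%, pageviews)
GROUP BY host
ORDER BY host;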
then each instance of ReplicatedReplacingMergeTree will have only its share of the whole table
You are mistaken.
Distributed with internal_replication=true inserts into ALL SHARDS (one replica per shard).
Distributed with internal_replication=false inserts into ALL SHARDS + into ALL REPLICAS of each shard.
It requires more careful configuration
I am using a semi-random hash here
It requires more careful configuration, and you did it, by using sipHash64(toString(pageviewId)).
You stabilized the ordering: if an insert is retried, the same rows go to the same shard, because the shard number for a row is calculated from pageviewId rather than rand().
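To illustrate (assuming, for the sake of example, two shards of equal weight): the Distributed engine picks the shard as the sharding expression modulo the total shard weight, so a retried insert of the same pageviewId always targets the same shard.
-- Deterministic shard assignment per pageviewId (2 equal-weight shards assumed).
SELECT
    pageviewId,
    sipHash64(toString(pageviewId)) % 2 AS target_shard
FROM pageviews_d
LIMIT 10;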

Related

Sharding from application development point of view

I have read a lot about sharding; what I understand is that it is a DB management concept. Coming to the application side, let's take as an example a Spring Boot microservice with a huge orders table that needs to be sharded with a shard key (K1) in the table.
Let's say I decide to shard based on the K1 field using range-based sharding, across multiple nodes of my MySQL DB.
Now I have the following questions:
How is this sharding performed on existing data? Is it a background job?
What changes need to be done in my existing application, since currently it connects to the first instance of the MySQL DB? And when fetching data based on my shard key, how can the application decide which instance it needs to query?
With application-level sharding you have a lot of options, since as the application developer/architect you have full control over it. Here is one idea that could lead you in the right direction:
How is this sharding performed on existing data? Is it a background job?
I guess by this you mean: how do I separate or migrate the data from one database to another database shard?
Background job. Yes, having a background job is an option. With this background job you can move the data from one DB shard to another DB shard.
Migration script. You can also write a migration script at the database level (an SQL script) which migrates all the data to the other DB shard (a sketch follows below).
With both of these options you have to think about whether your system needs to stay running and operational all the time, or whether you can live with downtime.
If it must stay operational, this is more challenging, since you have to keep serving traffic while you are migrating. Doing it during non-business hours and in chunks, key by key, can help; it still depends on your business.
If you can afford downtime, it is much easier to separate the data into the appropriate shards based on the key, because you do not have to account for a running system and data mismatches along the way. If you can do it this way, it will be much easier.
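A hedged sketch of the chunked, key-by-key approach (MySQL, purely illustrative schema and column names; it assumes both schemas are reachable from the same session, e.g. two databases on the same server; for separate servers you would dump/load or use a replication tool instead):
-- Copy one shard key at a time to the target shard, verify, then remove it
-- from the source.
INSERT INTO shard2_db.orders
SELECT * FROM shard1_db.orders
WHERE shard_key = 'key2';

-- After verifying that the row counts match:
DELETE FROM shard1_db.orders
WHERE shard_key = 'key2';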
What changes need to be done in my existing application, since currently it connects to the first instance of the MySQL DB? And when fetching data based on my shard key, how can the application decide which instance it needs to query?
You have to provide that logic. Since sharding happens at the application level, you need to make that decision in code. In your data access layer you need to know where to send your queries (or other SQL statements): Service-Db-Shard1 or Service-Db-Shard2.
What you can do, for example, is keep a table called shards in your main instance (Db-Shard-1):
shards table

shard_key | database_instance
----------+------------------
key1      | Service-Db-Shard1
key2      | Service-Db-Shard2
The shards table
This table contains the information about where each shard's data can be found; for example, the data sharded by key2 can be found in Service-Db-Shard2. Depending on your architecture you can put this table in one main/master shard (the preferred option, especially if you have read replicas to cover downtime of the main instance) or in all shards (not preferred, as it creates duplication). In addition, you can cache this information in your microservice on startup and reuse the cached values, so you do not have to read this table every time you need to execute an SQL statement against any other table.
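A minimal sketch of such a lookup table (MySQL, illustrative names):
-- Lookup table on the main shard mapping each shard key to the instance that
-- holds its data.
CREATE TABLE shards (
    shard_key         VARCHAR(64)  NOT NULL PRIMARY KEY,
    database_instance VARCHAR(128) NOT NULL
);

INSERT INTO shards (shard_key, database_instance) VALUES
    ('key1', 'Service-Db-Shard1'),
    ('key2', 'Service-Db-Shard2');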
The good thing about this is that you can control it and evolve it over time. For example, in the beginning, when you do not have much data, spread all your keys over 2 instances (to save money), and as the data grows you can increase the number of instances. Example:
shards table

shard_key | database_instance
----------+------------------
key1      | Service-Db-Shard1
key2      | Service-Db-Shard2
key3      | Service-Db-Shard1
key4      | Service-Db-Shard1
key5      | Service-Db-Shard2
Multiple shards in one instance
Doing it like this gives you the option to keep the data for multiple shard keys on the same instance, to save money by not running too many resources. Keep in mind that this does not work well with every key type. For example, it can work quite well if you have a multi-customer/tenant system where the data grows with the number of tenants: usually not all tenants have the same amount of data, so giving each one a dedicated instance is not always the most efficient way to shard. This approach gives you that additional flexibility.
Keep the shard key column in every table
In addition, you want to add the shard key column to each of your tables so that you can identify what needs to be moved where. Even when your data is distributed across multiple shards (instances), you may still want to keep this column, both because you might have multiple shard keys on the same instance and to keep the option of migrating further if needed.
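For example (illustrative table and column names), adding the column to an existing table could look like this; the values would then be backfilled before relying on the column for routing or migration:
-- Add and index the shard key column on an existing sharded table.
ALTER TABLE orders ADD COLUMN shard_key VARCHAR(64) NOT NULL DEFAULT '';
CREATE INDEX idx_orders_shard_key ON orders (shard_key);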
Before executing sql statements
Before each SQL statement against your DB you need to look up the instance information from the shards table, and every SQL statement against your orders table (or any other sharded table) should contain the sharding key in its filters.
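A sketch of that two-step flow (illustrative names; in practice step 1 is usually answered from the service's cache rather than the database):
-- Step 1: resolve which instance holds the data for this shard key.
SELECT database_instance FROM shards WHERE shard_key = 'key2';

-- Step 2: run the actual statement against the resolved instance, always
-- including the shard key in the filter.
SELECT * FROM orders WHERE shard_key = 'key2' AND customer_id = 42;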
Data Access layer
Consider the data access layer in your microservice code: this is a good example of why SOLID principles and a properly loose, well-designed set of data access classes/modules help. It is much easier to adjust a couple of classes to add the extra step of finding the instance based on the key, and to include the key in each query, if your data access layer code is done well.
Conclusion
This was just to give you an idea of how you could approach this. There are many ways to do it, and it will heavily depend on your domain, your current service structure, your data and its architecture, your infrastructure setup, and your DB deployment and migration strategy.

DB2 Alter Table and add Hash

Is it possible to alter an existing table in DB2 to add a hash partition? Something like...
ALTER TABLE EXAMPLE.TEST_TABLE
PARTITION BY HASH(UNIQUE_ID)
Thanks!
If you run a Db2-LUW local server on zLinux, the following syntax may be available:
ALTER TABLE .. ADD DISTRIBUTE BY HASH (...)
This syntax is not available if the zLinux is not running a Db2-LUW server but is instead only a client of Db2-for-z/OS.
For this syntax to be meaningful, there are various prerequisites. Refer to the documentation for details on partitioned instances, database partition groups, distribution key rules, default behaviours, etc.
The intention of distributed tables (spread over multiple physical and/or logical partitions of a partitioned database in a partitioned Db2-instance) is to exploit hardware capabilities. So if your Db2-instance and database and tablespaces are not appropriately configured, this syntax has limited value.
Depending on your true motivations, partition by range may offer functionality that is useful. Note that partition by range can be combined with distribute by hash if the configuration is appropriate.
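As a hedged sketch of how the two can be combined on Db2-LUW (it assumes a partitioned instance, a suitable database partition group and table spaces; table and column names are illustrative):
-- New table distributed by hash across database partitions and additionally
-- range-partitioned by date.
CREATE TABLE EXAMPLE.TEST_TABLE_NEW (
    UNIQUE_ID  BIGINT NOT NULL,
    CREATED_DT DATE   NOT NULL
)
DISTRIBUTE BY HASH (UNIQUE_ID)
PARTITION BY RANGE (CREATED_DT)
    (STARTING '2024-01-01' ENDING '2024-12-31' EVERY 3 MONTHS);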

Postgresql can a logical replica have DB objects of its own?

I have 2 postgresql 12 DB servers, say A and B. A is the main DB.
B consists of some foreign tables pointing to A and some materialized views built from joins over those foreign tables. The materialized views refresh nightly, and with increasing data the refreshes over FDWs are taking awfully long, since SQL over FDWs can't be parallelized.
I wanted to know:
Can a logical replica (which gives me the ability to replicate only a few tables) have some objects of its own (materialized views in my case, so that the refresh does not have to pull and join tables over FDW)?
For those familiar with Oracle GoldenGate, is there anything similar for Postgres, i.e. log-based rather than trigger-based? Open source would be better!
Thanks
It sounds like logical replication would indeed be the solution for you:
You can replicate individual tables with it; you should not modify those tables on B, but otherwise B is a normal database where you can have other objects.
Logical replication works by parsing the transaction log, just like you want. So all data modifications are replicated incrementally.
The replicated tables on B will be duplicates of the tables on A, so they are physically present on B (with foreign tables, there are no data on B, and accessing the foreign tables will actually access data on A). So there is no immediate need for materialized views.
Note that there are some limitations to logical replication. Most notably, ALTER TABLE and other DDL statements are not replicated.
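A minimal sketch of that setup on PostgreSQL 12 (connection string and names are illustrative; the table definitions must already exist on B, since DDL is not replicated):
-- On A (the publisher): publish only the tables B needs.
CREATE PUBLICATION nightly_pub FOR TABLE customers, orders;

-- On B (the subscriber): B stays a normal database, so it can also hold its
-- own objects, such as materialized views over the replicated copies.
CREATE SUBSCRIPTION nightly_sub
    CONNECTION 'host=server_a dbname=maindb user=repl password=secret'
    PUBLICATION nightly_pub;

-- Local object on B, now joining local tables instead of foreign tables.
CREATE MATERIALIZED VIEW customer_order_counts AS
SELECT c.id, count(*) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;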

How does pglogical-2 handle logical replication on same table while allowing it to be writeable on both databases?

Based on the above image, there are certain tables I want to be in the Internal Database (right hand side). The other tables I want to be replicated in the external database.
In reality there's only one set of values that SHOULD NOT be replicated across. The rest of the database can be replicated. Basically the actual price columns in the prices table cannot be replicated across. It should stay within the internal database.
Because the vendors are external to the network, they have no access to the internal app.
My plan is to create a replicated version of the same app and allow vendors to submit quotations and picking items.
Let's say the replicated tables are at least quotations and quotation_line_items. These tables should be writeable (INSERTs, UPDATEs, and DELETEs) in both the external database and the internal database, and the data in them should be replicated across in both directions.
The data in the other tables are going to be replicated in a single direction (from internal to external) except for the actual raw prices columns in the prices table.
The quotation_line_items table will have a price_id column. However, the raw price values in the prices table should not appear in the external database.
Ultimately, I want the data to be consistent for the replicated tables on both databases. I am okay with synchronous replication, so a bit of delay (say, a couple of seconds for the write operations) is fine.
I came across pglogical https://github.com/2ndQuadrant/pglogical/tree/REL2_x_STABLE
and they have the concept of PUBLISHER and SUBSCRIBER.
I cannot tell, based on the readme, which one would be acting as the publisher and which as the subscriber, nor how to configure it for my situation.
That won't work. With the setup you are dreaming of, you will necessarily end up with replication conflicts.
How do you want to prevent data from being modified in a conflicting fashion in the two databases? If you say that won't happen, think again.
I believe that you would be much better off using a single database with two users: one that can access the “secret” table and one that cannot.
If you want to restrict access only to certain columns, use a view. Simple views are updateable in PostgreSQL.
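A hedged sketch of that (illustrative table, column and role names): expose prices through a simple, automatically updatable view that omits the raw price column, and grant the vendor role access only to the view.
-- View over prices without the actual price column.
CREATE VIEW prices_for_vendors AS
SELECT id, item_name, currency   -- raw price column deliberately left out
FROM prices;

REVOKE ALL ON prices FROM vendor_role;
GRANT SELECT, UPDATE ON prices_for_vendors TO vendor_role;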
It is possible with BDR replication, which uses pglogical. On a basic level, it works by allocating ranges of key IDs to each node so that writes are possible in both locations without conflict. However, BDR is now a commercial, paid-for product.

How to partition existing table in postgres?

I would like to partition a table with 1M+ rows by date range. How is this commonly done without requiring much downtime or risking losing data? Here are the strategies I am considering, but I am open to suggestions:
1. The existing table is the master and children inherit from it. Over time, move data from the master to the children, but there will be a period where some of the data is in the master table and some in the children (a sketch of this approach follows below).
2. Create new master and child tables. Create a copy of the data from the existing table in the child tables (so the data resides in two places). Once the child tables have the most recent data, change all inserts going forward to point to the new master table and delete the existing table.
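A minimal sketch of strategy 1 with inheritance-based partitioning (pre-PostgreSQL 10), using an entirely hypothetical events table partitioned by month:
-- Child table inherits from the existing master; the CHECK constraint lets
-- constraint exclusion skip it for non-matching date ranges.
CREATE TABLE events_2017_01 (
    CHECK (created_at >= DATE '2017-01-01' AND created_at < DATE '2017-02-01')
) INHERITS (events);

-- Move one month at a time from the master into the child.
WITH moved AS (
    DELETE FROM ONLY events
    WHERE created_at >= DATE '2017-01-01'
      AND created_at <  DATE '2017-02-01'
    RETURNING *
)
INSERT INTO events_2017_01
SELECT * FROM moved;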
First you have to ask yourself whether a partitioned table is really warranted. Go through the partitioning documentation:
https://www.postgresql.org/docs/9.6/static/ddl-partitioning.html
Remember this very important piece of information about partitioning data (from the link above):
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
You can check the size of your table with this SQL
SELECT pg_size_pretty(pg_total_relation_size('<table_name>'));
If you are having performance problems, try reindexing or re-evaluating your indexes, and check your Postgres log for autovacuum activity.
1M+ rows does not need partitioning.
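As for checking autovacuum from SQL in addition to the log, the standard statistics view can help (table name is illustrative):
-- Last autovacuum/autoanalyze runs and the current dead-tuple count.
SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';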