Maxwell hash(sql_column) write to Kafka partition

I am using Maxwell to write changes from a MySQL database to Kafka 0.9.0. Is there a way to configure Maxwell so that it can partition the data by a hash function over a column?

You can specify database | table | primary_key | column for the producer_partition_by configuration parameter.
As described in the documentation, the topic partition is chosen as
HASH_FUNCTION(HASH_STRING) % TOPIC.NUMBER_OF_PARTITIONS
The HASH_FUNCTION is either Java's hashCode or murmurhash3. The default HASH_FUNCTION is hashCode. Murmurhash3 may be set with the kafka_partition_hash option. The seed value for the murmurhash function is hardcoded to 25342 in the MaxwellKafkaPartitioner class.
The HASH_STRING may be database, table, primary_key, or column. The default HASH_STRING is the database. The partitioning field can be configured using the producer_partition_by option.
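For example, a minimal config.properties sketch for partitioning by a column could look like this (the column name user_id and the broker address are placeholders, and producer_partition_columns / producer_partition_by_fallback are assumed to be supported by your Maxwell version):
# Maxwell producer settings (sketch, not a complete config)
producer=kafka
kafka.bootstrap.servers=localhost:9092
kafka_topic=maxwell
# hash a column value to pick the topic partition
producer_partition_by=column
producer_partition_columns=user_id
# used when a row does not contain the column above
producer_partition_by_fallback=database
# optional: use murmurhash3 instead of the default hashCode
kafka_partition_hash=murmur3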

Related

Why is Pyspark/Snowflake insensitive to underscores in column names?

Working in Python 3.7 and pyspark 3.2.0, we're writing out a PySpark dataframe to a Snowflake table using the following statement, where mode usually equals 'append'.
df.write \
.format('snowflake') \
.option('dbtable', table_name) \
.options(**sf_configs) \
.mode(mode)\
.save()
We've made the surprising discovery that this write can be insensitive to underscores in column names -- specifically, a dataframe with the column "RUN_ID" is successfully written out to a table with the column "RUNID" in Snowflake, with the data mapped to that column. We're curious why this is so (I'm wondering in particular whether the pathway runs through a LIKE statement somewhere, or whether there's something interesting in the Snowflake table definition) and are looking for documentation of this behavior (assuming that it's a feature, not a bug).
According to the docs, the Snowflake connector defaults to mapping columns by order instead of by name; see the column_mapping parameter.
The connector must map columns from the Spark data frame to the Snowflake table. This can be done based on column names (regardless of order), or based on column order (i.e. the first column in the data frame is mapped to the first column in the table, regardless of column name).
By default, the mapping is done based on order. You can override that by setting this parameter to name, which tells the connector to map columns based on column names. (The name mapping is case-insensitive.)
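If you want name-based mapping instead, a hedged sketch of the same write with that option set (reusing df, sf_configs, table_name and mode from the question) would be:
# same write as above, but mapping columns by name (case-insensitive)
df.write \
    .format('snowflake') \
    .options(**sf_configs) \
    .option('dbtable', table_name) \
    .option('column_mapping', 'name') \
    .mode(mode) \
    .save()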

Is it possible to define a different Reroute.key.field.name per postgres table?

Debezium by default uses the primary key as the partition key; however, some of my tables should be partitioned by a different key (e.g. user),
hence I wanted to use transforms.Reroute.key.field.name=user_id for that specific table only, and all of the rest would keep using the primary key.
Docs:
https://debezium.io/documentation/reference/configuration/topic-routing.html#_example
However, it is not clear to me how to apply that transform to only one table and not all the others.
Instead of re-routing, you could specify the message.key.columns connector option for customizing the columns that make up the message key for specific tables.
message.key.columns=inventory.customers:user_id
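If several tables need custom keys, entries can be chained with semicolons (the second table and column below are made up for illustration); any table not listed keeps its primary key as the message key:
message.key.columns=inventory.customers:user_id;inventory.orders:order_number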

From Postgres table to KSQL table with updates tracking

My task is to transfer data from a Postgres table to a KSQL table (for future joins with streams). Let's imagine the table has three records:
id | name | description
-------------------------
1 | name1 | description1
2 | name2 | description2
3 | name3 | description3
It is easy to do by means of the Kafka JdbcSourceConnector. But there is one little problem: the data in the table may change, and the changes must be reflected in the KTable too.
According to the documentation there is no way to track changes except bulk mode, but bulk mode takes absolutely all rows and inserts them into the topic.
I thought of setting up bulk mode for the connector, creating a KStream for that topic, and creating a KTable from that stream...
And here I do not know what to do. How can I make sure that changes in the Postgres table end up in the KTable too?
Bulk mode would work: you just define the key of the stream, and then new bulk writes will update the KTable entries with the same key (one way to set that key is sketched below). In other words, you need to ensure the primary keys don't change in your database.
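As a sketch, assuming the id column is present in the record value and that the standard Kafka Connect transforms are available, the JDBC source config could promote that column to the record key:
# promote the "id" column from the value into the record key
transforms=createKey,extractId
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractId.field=id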
Alternatively, there is Debezium, the CDC (change data capture) option for Kafka Connect.
JDBC source doesn't capture UPDATE queries, as you've stated.
Debezium will produce records that contain the previous and new versions of the modified rows.

PostgreSQL sequence connects to columns

So I'm working on a database at the moment, and I can see there are loads of sequences. I was wondering how sequences link up to their corresponding column in order to increment the value.
For example, if I create a new table with a column named ID, how would I apply a sequence to that column?
Typically, sequences are created implicitly: with a serial column or (alternatively) with an IDENTITY column in Postgres 10 or later. Details:
Auto increment table column
Sequences are separate objects internally and can be "owned" by a column, which happens automatically for the above examples. (But you can also have free-standing sequences.) They are incremented with the dedicated function nextval(), which is used in the column default of such columns automatically. There are more sequence manipulation functions in the manual.
Details:
Safely and cleanly rename tables that use serial primary key columns in Postgres?
Or you can use ALTER SEQUENCE to manipulate various properties.
Privileges on sequences have to be changed explicitly for serial columns, while that happens implicitly for the newer IDENTITY columns.
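As a short sketch (table and column names are made up), both variants attach an owned sequence to the column and fill it via nextval() in the column default:
-- serial column: implicitly creates the sequence items_serial_id_seq
CREATE TABLE items_serial (
    id   serial PRIMARY KEY,
    name text
);
-- IDENTITY column (Postgres 10 or later)
CREATE TABLE items_identity (
    id   int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    name text
);
-- find the sequence owned by a serial column
SELECT pg_get_serial_sequence('items_serial', 'id');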

DB2 Partitioning

I know how partitioning in DB2 works, but I am not aware of where these partition values actually get stored. After writing a create partition query, for example:
CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
STARTING '1/1/2006' ENDING '12/31/2006'
EVERY 3 MONTHS
)
After running the above query we know that partitions are created on orders for every 3 months, and when we run a SELECT query the query engine refers to these partitions. I am curious to know where this information actually gets stored: in the same table, or does DB2 have a different table where the partition values for every table get stored?
Table partitions in DB2 are stored in tablespaces.
For regular tables (if table partitioning is not used) table data is stored in a single tablespace (not considering LOBs).
For partitioned tables, multiple tablespaces can be used for their partitions.
This is achieved by the IN clause of the CREATE TABLE statement.
CREATE TABLE parttab
...
in TBSP1, TBSP2, TBSP3
In this example the first partition will be stored in TBSP1, the second in TBSP2, the third in TBSP3, the fourth in TBSP1, and so on.
Table partitions are named in DB2 (by default PART1 .. PARTn), and all these details, including the specified partition ranges, can be looked up in the system catalog view SYSCAT.DATAPARTITIONS.
See also http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?cp=SSEPGG_10.5.0%2F2-12-8-27&lang=en
The column used as partitioning key can be looked up in syscat.datapartitionexpression.
There is also a long syntax for creating partitioned tables where partition names can be explicitly specified, as well as the tablespace where each partition will get stored.
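A hedged sketch of that long form, reusing the orders example from the question (partition and tablespace names are illustrative):
-- each partition gets an explicit name and its own tablespace
CREATE TABLE orders (id INT, shipdate DATE)
  PARTITION BY RANGE (shipdate)
  (PARTITION q1_2006 STARTING '1/1/2006' ENDING '3/31/2006' IN tbsp1,
   PARTITION q2_2006 STARTING '4/1/2006' ENDING '6/30/2006' IN tbsp2)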
For applications partitioned tables look like a single normal table.
Partitions can be detached from a partitioned table. In this case a partition is "disconnected" from the partitioned table and converted to a table without moving the data (or vice versa).
After a bit of research I finally figured it out and want to share this information; I hope it may be useful to others.
How to see these key values? For LUW (Linux/Unix/Windows) you can see the keys in the Table Object Editor or the Object Viewer Script tab. For z/OS there is an Object Viewer tab called "Limit Keys". I've opened issue TDB-885 to create an Object Viewer tab for LUW tables.
A simple query to check these values:
SELECT * FROM SYSCAT.DATAPARTITIONS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY SEQNO
reference: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?lang=en
DB2 will create separate physical locations for each partition, so each partition will have its own tablespace. When you SELECT from this partitioned table, your SQL may go directly to a single partition or it may span many, depending on how the SQL is written. Also, this may allow your SQL to run in parallel, i.e. many tablespaces can be accessed concurrently to speed up the SELECT.