Debezium is very slow, how to improve? - apache-kafka

First-time user of Debezium here. I get only around 1000 messages per MINUTE from Debezium (which is very slow compared to online benchmarks). There is no throttling on Kafka Connect, MySQL, or the Kafka broker, so I am not sure what I am doing wrong. I will post the config here for reference.
Config of Kafka-Connect Worker:
-e CONNECT_GROUP_ID="quickstart" \
-e CONNECT_CONFIG_STORAGE_TOPIC="quickstart-config" \
-e CONNECT_OFFSET_STORAGE_TOPIC="quickstart-offsets" \
-e CONNECT_STATUS_STORAGE_TOPIC="quickstart-status" \
-e CONNECT_KEY_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_VALUE_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_INTERNAL_KEY_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_INTERNAL_VALUE_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_REST_ADVERTISED_HOST_NAME="localhost" \
-e CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1 \
-e CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1 \
-e CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1
Config of Kafka Debezium MySQL Connector
I use all default config for the Debezium MySQL connector.

The best option is to reconfigure the connector to use Avro serialization, which reduces the size of the messages and tracks schema changes. This should give you a very significant improvement.
The Avro binary format is compact and efficient. Avro schemas make it
possible to ensure that each record has the correct structure. Avro’s
schema evolution mechanism enables schemas to evolve. This is
essential for Debezium connectors, which dynamically generate each
record’s schema to match the structure of the database table that was
changed.
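For example, on a Confluent-style worker like the one above, switching to Avro would look roughly like this (a sketch; the Schema Registry URL is a placeholder for your own deployment):
-e CONNECT_KEY_CONVERTER="io.confluent.connect.avro.AvroConverter" \
-e CONNECT_VALUE_CONVERTER="io.confluent.connect.avro.AvroConverter" \
-e CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL="http://schema-registry:8081" \
-e CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL="http://schema-registry:8081" \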
To see what difference removing the schema from each message makes in your case, without Avro and without setting up a Schema Registry, set the settings below to false. Do not use it this way in production.
The default behavior is that the JSON converter includes the record’s
message schema, which makes each record very verbose.
If you want records to be serialized with JSON, consider setting the
following connector configuration properties to false:
key.converter.schemas.enable
value.converter.schemas.enable
Setting these properties to false excludes the verbose schema information from each record.
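For example, with the JSON converters from the worker config above, that is (a sketch using the same environment-variable style):
-e CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE="false" \
-e CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE="false" \
The same two properties can also be set per connector in its JSON configuration.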
See also: "Kafka Connect Deep Dive – Converters and Serialization Explained"

Related

Debezium PostgreSQL connector not creating topic

A Kafka topic is not created when I start the Debezium connector for PostgreSQL. Here's what I have in my properties file:
name=testdb
connector.class=io.debezium.connector.postgresql.PostgresConnector
topic.prefix=test
database.hostname=localhost
database.port=5432
database.user=postgres
database.password=root
database.dbname=testdb
database.server.name=testdb
table.include.list=ipaddrs
plugin.name=pgoutput
According to this, the topic should be named testdb.myschema.ipaddrs (myschema is the name of my schema). However, bin/kafka-topics.sh --list --bootstrap-server 192.168.56.1:9092 returns nothing, and a topic is not created when I add a row to the ipaddrs table.
Kafka Connect starts up successfully when I run bin/connect-standalone.sh config/connect-standalone.properties config/postgres.properties, without any exceptions.
I have auto.create.topics.enable = true. http://localhost:8083/connectors/testdb/status shows this:
{"name":"testdb","connector":{"state":"RUNNING","worker_id":"10.0.0.48:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"10.0.0.48:8083"}],"type":"source"}
I am not running Zookeeper. I am running Kafka with KRaft.
table.include.list was not correct. It has to include the schema, so it should have been myschema.ipaddrs in my example. In addition, the documentation seems to be incorrect about the topic name: on my system it is <topic.prefix>.<schema name>.<table name>, so in my example it's test.myschema.ipaddrs.
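For reference, the properties file from the question would then look like this (everything else unchanged):
name=testdb
connector.class=io.debezium.connector.postgresql.PostgresConnector
topic.prefix=test
database.hostname=localhost
database.port=5432
database.user=postgres
database.password=root
database.dbname=testdb
database.server.name=testdb
table.include.list=myschema.ipaddrs
plugin.name=pgoutput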
Assuming the database has the correct configuration to work with the Debezium PostgreSQL connector, you can remove "database.dbname" if ipaddrs is a table name that is not repeated in another schema.

Using mongomirror to sync collections within Atlas

I want to migrate a collection from one Mongo Atlas cluster to another. How do I go about doing that?
There are two possible approaches here:
Migration with downtime: stop the service, export the data from the collection to a third location, import the data into the new collection on the new cluster, and resume the service.
But there's a better way: using the mongomirror utility. With this utility, you can sync collections across clusters without any downtime. The utility first syncs the database (or selected collections from it) and then ensures subsequent writes to the source are synced to the destination.
Following is the syntax I used to get it to run:
./mongomirror --host atlas-something-shard-0/prod-mysourcedb--shard-00-02-pri.abcd.gcp.mongodb.net:27017 \
--username myUserName \
--password PASSWORD \
--authenticationDatabase admin \
--destination prod-somethingelse-shard-0/prod-mydestdb-shard-00-02-pri.abcd.gcp.mongodb.net:27017 \
--destinationUsername myUserName \
--destinationPassword PASSWORD \
--includeNamespace dbname.collection1 \
--includeNamespace dbname.collection2 \
--ssl \
--forceDump
Unfortunately, there are MANY pitfalls here:
Ensure your user has the correct role. This is actually covered in the docs, so read the relevant section closely.
To correctly specify the host and destination fields, you'll need to obtain both the replica set (RS) name and the primary instance's hostname. One way to get these is to use the mongosh tool and run rs.conf() on both the source and destination clusters (see the sketch after this list). The RS name is specified as "_id" in the command's output, and the instances are listed under "members"; take the primary instance's "host" field. The end result should look like RS-name/primary-instance-host:port.
If you specify a replica set, you MUST specify the PRIMARY instance. Failing to do so will result in an obscure error (EOF something).
I recommend adding the --forceDump flag (at least until you manage to get it to run for the first time).
If you specify non-existent collections, the utility will only give one indication that they don't exist and then go on to "sync" them, rather than failing.
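A sketch of the rs.conf() step mentioned above (the host, username and password are placeholders, not values from a real cluster):
mongosh "mongodb://<cluster-host>:27017/admin" \
  --username myUserName --password PASSWORD --tls \
  --eval 'cfg = rs.conf(); print(cfg._id); cfg.members.forEach(m => print(m.host))'
Run it against both the source and the destination cluster, then combine the printed _id with the primary member's host into the RS-name/host:port form expected by --host and --destination.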

How to create a table in KAFKA using Confluent KSQLDB with a console / command line / cli?

I am trying to create a table in Kafka using Confluent's ksqlDB. I can easily use the UI provided by Confluent, but I am trying to automate this process and hence need command-line scripting capability to execute this task.
I have been able to use Confluent's ksqlDB REST interface via curl for simple commands ("LIST STREAMS"), but more complicated statements, such as creating a table, seem to be unavailable, or there are no examples available to clone/use.
Does anyone have a command-line script or know of a console command to create a ksqlDB table (in Confluent's version of Kafka)?
You can use the /ksql endpoint. For example:
curl -s -X "POST" "http://localhost:8088/ksql" \
-H "Content-Type: application/vnd.ksql.v1+json; charset=utf-8" \
-d '{"ksql":"CREATE TABLE MOVEMENTS (PERSON VARCHAR KEY, LOCATION VARCHAR) WITH (VALUE_FORMAT='\''JSON'\'', KAFKA_TOPIC='\''movements'\'');"}'

How to configure Kafka topics so an interconnected entity schema can be consumed in the form of events by databases like RDMS and graph

I have a case where Information objects contain Element objects. When I store an Information object, it tries to find pre-existing Element objects based on a unique value field and otherwise inserts them. Information objects and Element objects can't be deleted for now. Adding a parent requires two pre-existing Element objects. I was planning to use three topics, CreateElement, CreateInformation and AddParentOfElement, for the events Created Element Event, Created Information Event and Added Parent Event. I realized that, since there are no order guarantees between topics or between topic partitions, those events (as shown in the picture) could be consumed in a different order, so the schema couldn't be persisted to an RDBMS, for example. I assume that ids are used for partition assignment of the topics, as usual.
Here is my diagram:
The scenario is:
1. Element with (id=1) was created by the user.
2. Information with (id=1) containing Elements (1, 2, 3) was created by the user.
3. Element with (id=5) was created by the user.
4. Parent of Element with (id=5) was set to be Element with (id=3) by the user.
5. Information with (id=2) containing Elements (1, 3 and 5) was created by the user.
I am curious whether my topic selections make sense, and I would appreciate any suggestions on how to design events so that, when they are processed by the consuming database services, they are idempotent and don't put the system in a wrong state.
Thanks!
After considering this solution: How to implement a microservice Event Driven architecture with Spring Cloud Stream Kafka and Database per service, but not being satisfied with the suggestions, I investigated Confluent's Bottled Water (https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/) and later the more active but similar Debezium (http://debezium.io/).
I decided to follow the Debezium way. Debezium is a connector that reads directly from the MySQL/Postgres transaction log and publishes those changes (schema and data) to Kafka.
The example setup I am using involves Docker, and here is how I set it up for Docker Toolbox (Windows) and Docker (Linux).
1a) Linux (Docker)
sudo docker stop $(sudo docker ps -a -q)
sudo docker rm -f $(sudo docker ps -a -q)
sudo docker run -d --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:0.5
sudo docker run -d --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
sudo docker run -d --name kafka -e ADVERTISED_HOST_NAME=<YOUR_IP> -e ZOOKEEPER_CONNECT=<YOUR_IP> --link zookeeper:zookeeper -p 9092:9092 debezium/kafka
sudo docker run -d --name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs -e OFFSET_STORAGE_TOPIC=my-connect-offsets -e ADVERTISED_HOST_NAME=<YOUR_IP> --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql debezium/connect
sudo docker run -d --net=host -e "PROXY=true" -e ADV_HOST=<YOUR_IP> -e "KAFKA_REST_PROXY_URL=http://<YOUR_IP>:8082" -e "SCHEMAREGISTRY_UI_URL=http://<YOUR_IP>:8081" landoop/kafka-topics-ui
sudo docker run -p 8082:8082 --name kafka-rest --env ZK_CONNECTION_STRING=<YOUR_IP>:2181 frontporch/kafka-rest:latest
1b) Windows (Docker Toolbox)
docker stop $(docker ps -a -q) ;
docker rm -f $(docker ps -a -q) ;
docker run -d --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw debezium/example-mysql:0.5 ;
docker run -d --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper ;
docker run -d --name kafka -e ADVERTISED_HOST_NAME=192.168.99.100 -e ZOOKEEPER_CONNECT=192.168.99.100 --link zookeeper:zookeeper -p 9092:9092 debezium/kafka ;
docker run -d --name connect -p 8083:8083 -e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=my-connect-configs -e OFFSET_STORAGE_TOPIC=my-connect-offsets -e ADVERTISED_HOST_NAME=192.168.99.100 --link zookeeper:zookeeper --link kafka:kafka --link mysql:mysql debezium/connect ;
docker run -d --net=host -e "PROXY=true" -e ADV_HOST=192.168.99.100 -e "KAFKA_REST_PROXY_URL=http://192.168.99.100:8082" -e "SCHEMAREGISTRY_UI_URL=http://192.168.99.100:8081" landoop/kafka-topics-ui ;
docker run -p 8082:8082 --name kafka-rest --env ZK_CONNECTION_STRING=192.168.99.100:2181 frontporch/kafka-rest:latest ;
2) Connect the database to Debezium Connect
Send a POST request with Content-Type application/json to <YOUR_IP>:8083/connectors (for Linux) or 192.168.99.100:8083/connectors (for Windows Docker Toolbox) with the body below (a curl example follows the JSON):
{
  "name": "inventory-connector",
  "config": {
    "name": "inventory-connector",
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
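For example, with curl (a sketch; it assumes the JSON above is saved as register-mysql.json, and on Docker Toolbox <YOUR_IP> is 192.168.99.100):
curl -i -X POST -H "Accept: application/json" -H "Content-Type: application/json" \
  http://<YOUR_IP>:8083/connectors/ -d @register-mysql.json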
Debezium creates one Kafka topic per table. By navigating to the landoop/kafka-topics-ui server on port 8000 you can have a look at what the message payloads look like. The important parts of the payload are the before and after fields, which carry the old and new values of the corresponding database row, and the op field, which is 'c' for create, 'u' for update, and so on; a trimmed example is sketched below.
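For illustration, a heavily trimmed change event value for an UPDATE on a hypothetical customers table looks roughly like this (the schema portion and several source fields are omitted, and the row values are made up):
{
  "payload": {
    "before": { "id": 1004, "email": "old@example.com" },
    "after":  { "id": 1004, "email": "new@example.com" },
    "source": { "name": "dbserver1", "db": "inventory", "table": "customers" },
    "op": "u",
    "ts_ms": 1486500577691
  }
}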
Each consuming microservice uses the Spring Cloud Kafka binders via these Maven dependencies:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-dependencies</artifactId>
      <version>Brixton.SR7</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-parent</artifactId>
      <version>1.5.2.RELEASE</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-stream-binder-kafka-parent</artifactId>
      <version>1.2.0.RELEASE</version>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  [...]
  <dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
  </dependency>
  [...]
</dependencies>
Then, in each of my consuming Spring Cloud microservices, I have a Listener that listens to all the topics it is interested in and delegates each topic's events to a dedicated event handler:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

import java.util.concurrent.CountDownLatch;

@Component
public class Listener {

    public final CountDownLatch countDownLatch1 = new CountDownLatch(1);

    @KafkaListener(id = "listener", topics = {
            "dbserver1.inventory.entity",
            "dbserver1.inventory.attribute",
            "dbserver1.inventory.entity_types"
    }, group = "group1")
    public void listen(ConsumerRecord<?, ?> record) {
        String topic = record.topic();
        if (topic.equals("dbserver1.inventory.entity")) {
            // delegate to the appropriate handler
            // EntityEventHandler.handle(record);
        }
        else if (...) {}
    }
}
In my case I wanted to update a graph based on the changes that happen on the RDBMS side. Of course the graph database will only be eventually consistent with the RDBMS. My concern was that, since there are topics for join tables as well as for the joined tables themselves, I wouldn't be able to create the corresponding edges and vertices without knowing that both vertices of an edge already exist. So I decided to ask on the Debezium Gitter (https://gitter.im/debezium/dev):
From the discussion below, two ways exist: either create edges and vertices using placeholders for topics that haven't been consumed yet, or use Kafka Streams to stitch the topics back into their original structures, which seems more painful to me than the first way. So I decided to go with the first way :)
Michail Michailidis @zifnab87 Apr 17 11:23 Hi I was able to integrate
Mysql with Debezium Connect and using landoop/topics-ui I am able to
see that the topics are picked up properly and messages are sent the
way they have to. I saw that for each of the tables there is a topic.
e.g join tables are separate topics too.. If lets say I have three
tables order, product and order_product and I have a service consuming
all three topics.. I might get first the insertion on order_product
and then the insertion of order.. That may cause a problem if I am
trying to push this information to a graph database.. I will try to
create an edge on vertex that is not there yet.. how can I make
consumers that consume events lets say based on a transactionId or at
least are aware of the boundary context.. is there an easy way to
listen to those events and then deserialize them to a real java object
so I can push that to a graph database or search index? If not how
would you approach this problem? Thanks!
Randall Hauch @rhauch Apr 17 19:19 @zifnab87 Debezium CDC is purely a
row-based approach, so by default all consumers see the row-level
change events in an eventually consistent manner. Of course, the
challenge to eventual consistency of downstream systems is that they
might potentially leak data states that never really existed in the
upstream source. But with that come lots of other really huge
benefits: downstream consumers are much simpler, more resilient to
failure, have lower latency (since there’s no need to wait for the
appearance of the upstream transaction’s completion before
processing), and are less decoupled to the upstream system. You gave
the example of an order and product tables with an order_product
intersect table. I agree that when thinking transactionally it does
not make sense for an order_product relationship to be added before
both the order and product instances exist. But are you required to
live with that constraint? Can the order_product consumer create
placeholder nodes in the graph database for any missing order and/or
product values referenced by the relationship? In this case when the
order_product consumer is a bit ahead of the order consumer, it might
create an empty order node with the proper key or identifier, and when
the order consumer finally processes the new order it would find the
existing placeholder node and fill in the details. Of course, when the
order arrives before the order_product relationships, then everything
works as one might expect. This kind of approach might not be allowed
by the downstream graph database system or the business-level
constraints defined in the graph database. But if it is allowed and
the downstream applications and services are designed to handle such
states, then you’ll benefit from the significant simplicity that this
approach affords, as the consumers become almost trivial. You’ll be
managing less intermediate state and your system will be more likely
to continue operating when things go wrong (e.g., consumers crash or
are taken down for maintenance). If your downstream consumers do have
to stick with adhering to the transaction boundaries in the source
database, then you might consider using Kafka Streams to join the
order and order_product topics and produce a single aggregate order
object with all the relationships to the referenced products. If you
can’t assume the product already exists, then you could also join with
the product topic to add more product detail to the aggregate order
object. Of course, there still are lots of challenges, since the only
way for a stream processor consuming these streams to know it’s seen
all of the row-level change events for a given transaction is when a
subsequent transaction is seen on each of the streams. As you might
expect, this is not ideal, since the last transaction prior to any
quiet period will not complete immediately.
Michail Michailidis @zifnab87 Apr 17 23:49 Thanks @rhauch really well
explained! I was investigating Kafka Streams while waiting for your
answers! now I am thinking I will try to code the placeholder
variation e.g when a vertex is not there etc
Randall Hauch @rhauch Apr 17 23:58 @zifnab87 glad it helped, at least
a bit! Be sure you also consider the fact that the consumer might see
a sequence of messages that it already has consumed. That will only
happen when something goes wrong (e.g., with the connector or the
process(es) where the connector is running, or the broker, network
partition, etc.); when everything is operating normally, the consumer
should see no duplicate messages.
Michail Michailidis @zifnab87 Apr 18 01:15 @rhauch Sure it helped!
Yeap I have that in mind - consumer processes need to be idempotent. I
am curious if for example sinks for lets say elastic search, mongodb
and graph databases can be implemented to consolidate events produced
from debezium-mysql no matter what the order by using placeholders for
missing things.. e.g the mountaineer sinks are doing that already if
you know by any chance? I am trying to avoid reimplementing things
that already exist.. Also my solutions might be very fragile if mysql
schema changes and I dont consume the new events.. I feel so many
things are missing around the microservices world
Randall Hauch @rhauch Apr 18 03:30 I'm not sure how those sinks work.
Ideally they should handle create, update, and delete events
correctly. But because Debezium events have an envelope at the top
level of every event, you'll probably have to use SMTs to grab the
contents of the after field (or exclude the before field) so the
"meaningful" parts are put into the sink system. This will probably
get easier as more SMTs get added to KC. If you find that it takes too
many SMTs and would rather Debezium added an SMT that did this, please
log a feature request in JIRA.
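Newer Debezium versions ship exactly such an SMT, io.debezium.transforms.ExtractNewRecordState, which unwraps the envelope so that only the after state reaches the sink. A minimal sketch of the relevant connector properties (assuming a recent Debezium version):
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# optional: keep delete tombstones so downstream sinks/compaction can handle deletes
transforms.unwrap.drop.tombstones=false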
Hopefully this answer/guide will help others jump-start event sourcing with a message broker like Kafka as the central piece.

How to migrate data to remote server with PostgreSQL?

How can I dump my database schema and data in such a way that the usernames, database names and schema names of the dumped data match those on the servers I deploy to?
My current process entails moving the data in two steps. First, I dump the schema of the database (pg_dump --schema-only -C -c), then I dump the data with pg_dump --data-only -C, and I restore both on the remote server using the psql command. But there has to be a better way than this.
We use the following to replicate databases.
pg_basebackup -x -P -D /var/lib/pgsql/9.2/data -h OTHER_DB_IP_ADDR -U postgres
It requires the "master" server at OTHER_DB_IP_ADDR to be running the replication service and pg_hba.conf must allow replication connections. You do not have to run the "slave" service as a hot/warm stand by in order to replicate. One downside of this method compared with a dump/restore, the restore operation effectively vacuums and re-indexes and resets EVERYTHING, while the replication doesn't, so replicating can use a bit more disk space if your database has been heavily edited. On the other hand, replicating is MUCH faster (15 mins vs 3 hours in our case) since indexes do not have to be rebuilt.
Some useful references:
http://opensourcedbms.com/dbms/setup-replication-with-postgres-9-2-on-centos-6redhat-el6fedora/
http://www.postgresql.org/docs/9.2/static/high-availability.html
http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgresql-streaming-replication-in-10-mins/