Debezium with Postgres | Kafka Consumer not able to consume any message

Here is my docker-compose file:
version: '3.7'
services:
  postgres:
    image: debezium/postgres:12
    container_name: postgres
    networks:
      - broker-kafka
    environment:
      POSTGRES_PASSWORD: admin
      POSTGRES_USER: antriksh
    ports:
      - 5499:5432
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    container_name: zookeeper
    networks:
      - broker-kafka
    ports:
      - 2181:2181
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    container_name: kafka
    networks:
      - broker-kafka
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_LOG_CLEANER_DELETE_RETENTION_MS: 5000
      KAFKA_BROKER_ID: 1
      KAFKA_MIN_INSYNC_REPLICAS: 1
  connector:
    image: debezium/connect:latest
    container_name: kafka_connect_with_debezium
    networks:
      - broker-kafka
    ports:
      - "8083:8083"
    environment:
      GROUP_ID: 1
      CONFIG_STORAGE_TOPIC: my_connect_configs
      OFFSET_STORAGE_TOPIC: my_connect_offsets
      BOOTSTRAP_SERVERS: kafka:29092
    depends_on:
      - zookeeper
      - kafka
networks:
  broker-kafka:
    driver: bridge
I am able to create a table and insert data into it. I am also able to initialise the connector using the following config:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
  "name": "payment-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "antriksh",
    "database.password": "admin",
    "database.dbname": "payment",
    "database.server.name": "dbserver1",
    "database.whitelist": "payment",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "schema-changes.payment",
    "publication.name": "mytestpub",
    "publication.autocreate.mode": "all_tables"
  }
}'
I start my Kafka consumer like this:
kafka-console-consumer --bootstrap-server kafka:29092 --from-beginning --topic dbserver1.public.transaction --property print.key=true --property key.separator="-"
But whenever I insert or update any rows in my DB, I don't see the messages being relayed to the Kafka consumer.
I have set the config property "publication.autocreate.mode": "all_tables", which should automatically create a publication for all tables. But when I run select * from pg_publication I see nothing; it's an empty table.
There is a replication slot named debezium, so I don't know which config or step I am missing that is preventing the Kafka consumer from consuming the messages.
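For reference, the logical-decoding prerequisites can be checked with a few standard catalog queries inside the postgres container; a minimal sketch, reusing the container name, user and database from the compose file and connector config above:

# verify wal_level, publications and replication slots
docker exec -it postgres psql -U antriksh -d payment -c "SHOW wal_level;"   # must be 'logical' for Debezium
docker exec -it postgres psql -U antriksh -d payment -c "SELECT * FROM pg_publication;"
docker exec -it postgres psql -U antriksh -d payment -c "SELECT slot_name, plugin, active FROM pg_replication_slots;"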
Update:
I found out that in order for Debezium to create publications automatically, we need pgoutput as the logical decoding output plugin. Also, after OneCricketer's comment, this is my connector's config:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '
{
  "name": "payment-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "antriksh",
    "database.password": "admin",
    "database.dbname": "payment",
    "database.server.name": "dbserver1",
    "database.whitelist": "payment",
    "database.history.kafka.bootstrap.servers": "kafka:29092",
    "database.history.kafka.topic": "schema-changes.payment",
    "plugin.name": "pgoutput",
    "publication.autocreate.mode": "all_tables",
    "publication.name": "my_publication"
  }
}'
Now I am able to see the publication being created:
16395 | my_publication | 10 | t | t | t | t | t
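To double-check which tables the auto-created publication actually contains, a query against the standard pg_publication_tables view should work (container and credentials as above):

docker exec -it postgres psql -U antriksh -d payment \
  -c "SELECT * FROM pg_publication_tables WHERE pubname = 'my_publication';"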
The issue now seems to be that the LSN is not moving ahead when I check the replication slots:
select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
debezium | pgoutput | logical | 16385 | payment | f | t | 260 | | 491 | 0/176F268 | 0/176F268
(1 row)
It's stuck at 0/176F268 ever since the payment db was created.
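One way to narrow this down, as a sketch using only standard Postgres functions and the stock Kafka Connect REST API, is to compare the slot position with the server's current WAL position and to check whether the connector task is actually running:

# is the server writing WAL past the slot's confirmed position?
docker exec -it postgres psql -U antriksh -d payment \
  -c "SELECT pg_current_wal_lsn(), confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'debezium';"

# is the connector task RUNNING, or does its status show a stack trace?
curl -s localhost:8083/connectors/payment-connector/status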
When I list the topics, I can see that Debezium has created the topic for the transaction table:
[appuser@a112a33992d1 ~]$ kafka-topics --zookeeper zookeeper:2181 --list
__consumer_offsets
connect-status
dbserver1.public.transaction
my_connect_configs
my_connect_offsets
I am unable to understand where it is going wrong.

Related

DOCKER: Pyspark reading from Postgresql doesn't show data

I am trying to read data from a table in a PostgreSQL database and proceed with an ETL project. I have a Docker environment using this docker-compose:
version: "3.3"
services:
spark-master:
image: docker.io/bitnami/spark:3.3
ports:
- "9090:8080"
- "7077:7077"
volumes:
- /opt/spark-apps
- /opt/spark-data
environment:
- SPARK_LOCAL_IP=spark-master
- SPARK_WORKLOAD=master
spark-worker-a:
image: docker.io/bitnami/spark:3.3
ports:
- "9091:8080"
- "7000:7000"
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=1
- SPARK_WORKER_MEMORY=1G
- SPARK_DRIVER_MEMORY=1G
- SPARK_EXECUTOR_MEMORY=1G
- SPARK_WORKLOAD=worker
- SPARK_LOCAL_IP=spark-worker-a
volumes:
- /opt/spark-apps
- /opt/spark-data
spark-worker-b:
image: docker.io/bitnami/spark:3.3
ports:
- "9092:8080"
- "7001:7000"
depends_on:
- spark-master
environment:
- SPARK_MASTER=spark://spark-master:7077
- SPARK_WORKER_CORES=1
- SPARK_WORKER_MEMORY=1G
- SPARK_DRIVER_MEMORY=1G
- SPARK_EXECUTOR_MEMORY=1G
- SPARK_WORKLOAD=worker
- SPARK_LOCAL_IP=spark-worker-b
volumes:
- /opt/spark-apps
- /opt/spark-data
postgres:
container_name: postgres_container
image: postgres:11.7-alpine
environment:
POSTGRES_USER: admin
POSTGRES_PASSWORD: admin
volumes:
- /data/postgres
ports:
- "4560:5432"
restart: unless-stopped
# jupyterlab with pyspark
jupyter-pyspark:
image: jupyter/pyspark-notebook:latest
environment:
JUPYTER_ENABLE_LAB: "yes"
ports:
- "9999:8888"
volumes:
- /app/data
I was successful in connecting to the DB, but I can't print any data. Here's my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("salesETL")\
    .config("spark.driver.extraClassPath", "./postgresql-42.5.1.jar")\
    .getOrCreate()

df = spark.read.format("jdbc").option("url", "jdbc:postgresql://postgres_container:5432/postgres")\
    .option("dbtable", "sales")\
    .option("driver", "org.postgresql.Driver")\
    .option("user", "admin")\
    .option("password", "admin").load()
df.show(10).toPandas()
With .toPandas() it gives me this error:
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 df.show(10).toPandas()
AttributeError: 'NoneType' object has no attribute 'toPandas'
Without .toPandas() it prints the columns but no data:
+--------+----------+-----------+-------------+-----------------+-------------+--------------+----------+--------+-----------+
|order_id|order_date|customer_id|customer_name|customer_lastname|customer_city|customer_state|product_id|quantity|order_value|
+--------+----------+-----------+-------------+-----------------+-------------+--------------+----------+--------+-----------+
+--------+----------+-----------+-------------+-----------------+-------------+--------------+----------+--------+-----------+
I am new to Pyspark/Spark so I can't figure out what I am missing. It's my very first project. What can it be?
PS: when I run type(df) it returns pyspark.sql.dataframe.DataFrame.
show returns nothing, so you should call the pandas conversion on the DataFrame directly. Moreover, I think it's to_pandas, not toPandas (https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_pandas.html). So the error should go away with something like:
df.to_pandas()
About the empty dataset, is there any error? If there is no error, are you sure that any records exist in the table?
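If it is unclear whether the table has any rows at all, a quick check directly against Postgres is one option; a sketch using the container name and credentials from the compose file (adjust the table/schema name if sales actually lives in a schema, as in the workaround below):

docker exec -it postgres_container psql -U admin -d postgres -c "SELECT count(*) FROM sales;"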
Well, I couldn't find out why this happened or how to fix it. Instead, I used a workaround: I loaded the data into Python using pandas and then converted the pandas DataFrame to a PySpark DataFrame.
Here's my code:
import psycopg2
import pandas as pd
from pyspark.sql import SparkSession
from sqlalchemy import create_engine

appName = "salesETL"
master = "local"

spark = SparkSession.builder.master(master).appName(appName).getOrCreate()

engine = create_engine(
    "postgresql+psycopg2://admin:admin@postgres_container/postgres?client_encoding=utf8")
pdf = pd.read_sql('select * from sales.sales', engine)

# Convert pandas dataframe to Spark DataFrame
df = spark.createDataFrame(pdf)

Model JSON input data for JDBC Sink Connector

My goal is to process an input stream that contains DB updates coming from a legacy system and replicate it into a different DB. The messages have the following format:
Insert - The after struct contains the state of the table entry after the insert operation is complete. No before struct present;
Update - The before struct contains the state of the table entry before the update operation and the after struct only lists the modified fields along with the table key;
Delete - The before struct contains the state of the table entry before the delete operation and the after struct is not present in the message;
Examples:
Insert
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"}}
Update
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":8934,"test_value_1":26910,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:13.781217"},"after":{"id_test":8934,"test_value_1":null,"test_value_3":"2023-01-12T18:32:18.787337"}}
Delete
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":8934,"test_value_1":1499,"test_value_2":"XYZ","test_value_3":"2023-01-12T18:32:18.787337"}}
After struggling a lot with the lack of a schema in the input messages, I turned to ksqlDB to overcome this and was able (not sure if by the best approach) to get ksqlDB to generate an output topic that can be processed by a Kafka JDBC Sink Connector using the JSON_SR format.
The problem is that the output format I get in the output topic from ksqlDB does not allow for a correct interpretation of all the scenarios: only the insert scenario is correctly interpreted.
The flow in ksqlDB was:
Process the input topic to map it as a JSON object by defining a schema. Created a stream (TEST_TABLE_INPUT) where the input object is defined:
CREATE OR REPLACE STREAM TEST_TABLE_INPUT (
    "table" VARCHAR,
    op_type VARCHAR,
    after STRUCT<
        id_test INTEGER, test_value_1 INTEGER, test_value_2 VARCHAR, test_value_3 VARCHAR
    >,
    before STRUCT<
        id_test INTEGER
    >
) WITH (kafka_topic='input_topic', value_format='JSON');
Process TEST_TABLE_INPUT to flatten the input structures (tried using the BEFORE->* and AFTER->* strategy instead but got a Caused by: line 3:11: mismatched input '*' expecting ... error):
CREATE OR REPLACE STREAM TEST_TABLE_INTERNAL
AS SELECT
    OP_TYPE,
    BEFORE->id_test AS B_id_test,
    AFTER->id_test AS A_id_test,
    AFTER
FROM TEST_TABLE_INPUT;
With this the output stream is created:
CREATE OR REPLACE STREAM TEST_TABLE_UPDATES
WITH (
    VALUE_FORMAT='JSON_SR',
    KEY_FORMAT='JSON_SR',
    KAFKA_TOPIC='T_TEST_UPDATES'
)
AS SELECT
    COALESCE(B_id_test, A_id_test) AS id_test,
    AFTER->test_value_1 AS test_value_1,
    AFTER->test_value_2 AS test_value_2,
    AFTER->test_value_3 AS test_value_3
FROM TEST_TABLE_INTERNAL
PARTITION BY COALESCE(B_id_test, A_id_test)
;
When processing the example input topic messages below
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":0,"test_value_1":17315,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:49.694383"},"after":{"id_test":0,"test_value_1":null,"test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"D","before":{"id_test":0,"test_value_1":28605,"test_value_2":"XYZ","test_value_3":"2023-01-12T19:06:54.702107"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"U","before":{"id_test":1,"test_value_1":15601,"test_value_2":"KLM","test_value_3":"2023-01-12T19:07:04.716303"},"after":{"id_test":1,"test_value_1":null,"test_value_3":"2023-01-12T19:07:09.723473"}}
{"table":"SCHEMA.TEST_TABLE","op_type":"I","after":{"id_test":2,"test_value_1":13386,"test_value_2":"ABC","test_value_3":"2023-01-12T19:14:23.633204"}}
to simulate the I, I-U, and I-U-D scenarios, the output I get in ksqlDB is:
ksql> SELECT * FROM TEST_TABLE_UPDATES EMIT CHANGES;
+----------+-------------+--------------+----------------------------+
|ID_TEST |TEST_VALUE_1 |TEST_VALUE_2 |TEST_VALUE_3 |
+----------+-------------+--------------+----------------------------+
|0 |17315 |XYZ |2023-01-12T19:06:49.694383 |
|0 |null |null |2023-01-12T19:06:54.702107 |
|0 |null |null |null |
|1 |15601 |KLM |2023-01-12T19:07:04.716303 |
|1 |null |null |2023-01-12T19:07:09.723473 |
|2 |13386 |ABC |2023-01-12T19:14:23.633204 |
And outside of ksqlDB:
kafka-console-consumer.sh --property print.key=true --topic TEST_TABLE_UPDATES --from-beginning --bootstrap-server localhost:9092
0 {"TEST_VALUE_1":17315,"TEST_VALUE_2":"XYZ","TEST_VALUE_3":"2023-01-12T19:06:49.694383"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:06:54.702107"}
0 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":null}
1 {"TEST_VALUE_1":15601,"TEST_VALUE_2":"KLM","TEST_VALUE_3":"2023-01-12T19:07:04.716303"}
1 {"TEST_VALUE_1":null,"TEST_VALUE_2":null,"TEST_VALUE_3":"2023-01-12T19:07:09.723473"}
2 {"TEST_VALUE_1":13386,"TEST_VALUE_2":"ABC","TEST_VALUE_3":"2023-01-12T19:14:23.633204"}
On top of this, the connector is instantiated with the configuration below:
{
  "name": "t_test_updates-v2",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "errors.log.enable": true,
    "errors.log.include.messages": true,
    "topics": "TEST_TABLE_UPDATES",
    "key.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "key.converter.schemas.enable": false,
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "value.converter.schemas.enable": false,
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "connection.url": "jdbc:oracle:thin:@database:1521/xe",
    "connection.user": "<REDACTED>",
    "connection.password": "<REDACTED>",
    "dialect.name": "OracleDatabaseDialect",
    "insert.mode": "upsert",
    "delete.enabled": true,
    "table.name.format": "TEST_TABLE_REPL",
    "pk.mode": "record_key",
    "pk.fields": "ID_TEST",
    "auto.create": false,
    "error.tolerance": "all"
  }
}
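For completeness, a payload in this shape (name plus config) is typically registered with the Connect REST API roughly like this (default Connect port; the JSON file name is just a placeholder for the config above):

curl -s -X POST -H "Content-Type: application/json" \
     --data @t_test_updates-v2.json \
     http://localhost:8083/connectors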
It processes the topic, apparently without a problem, but the output in the DB does not match the expected result:
SQL> select * from test_table_repl;
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 2023-01-12T19:07:09.723473
0
2 13386 ABC 2023-01-12T19:14:23.633204
I would expect the DB query to return:
ID_TEST TEST_VALUE_1 TEST_VALUE_2 TEST_VALUE_3
__________ _______________ __________________ _____________________________
1 KLM 2023-01-12T19:07:09.723473
2 13386 ABC 2023-01-12T19:14:23.633204
As can be seen, only the insert scenario is working correctly.
I cannot understand what I am doing wrong.
Is it the data format, the connector configuration, both, or something else I am missing?

Is there a way to setup a sink and source connector for this debezium connector?

I'm using the debezium-connector found here: https://repo1.maven.org/maven2/io/debezium/debezium-connector-oracle/1.4.0.Final/debezium-connector-oracle-1.4.0.Final-plugin.tar.gz
And I'm following these instructions for docker-compose: https://github.com/confluentinc/demo-scene/blob/master/oracle-and-kafka/docker-compose.yml
I did it for the JDBC connector by using confluent-hub, but I don't know how to do it for Debezium. It's not solved by adding it into /usr/share/java and running.
So my docker-compose is:
---
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:6.0.1
    hostname: zookeeper
    container_name: zookeeper
    volumes:
      - /dados/persistence/zookeeper/data:/var/lib/zookeeper/data
      - /dados/persistence/zookeeper/log:/var/lib/zookeeper/log
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-server:6.0.1
    hostname: broker
    container_name: broker
    volumes:
      - /dados/persistence/broker/data:/var/lib/kafka/data
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_METRIC_REPORTERS: io.confluent.metrics.reporter.ConfluentMetricsReporter
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_CONFLUENT_LICENSE_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_CONFLUENT_BALANCER_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
      KAFKA_CONFLUENT_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONFLUENT_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka:29092
      CONFLUENT_METRICS_REPORTER_TOPIC_REPLICAS: 1
      CONFLUENT_METRICS_ENABLE: 'true'
      CONFLUENT_SUPPORT_CUSTOMER_ID: 'anonymous'
  schema-registry:
    image: confluentinc/cp-schema-registry:6.0.1
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: zookeeper:2181
  kafka-connect:
    image: cnfldemos/cp-server-connect-datagen:0.4.0-6.0.1
    hostname: connect
    container_name: kafka-connect
    volumes:
      - /dados/packages/confluent-hub/share/confluent-hub-components:/usr/share/confluent-hub-components/custom
      - /dados/persistence/kafka-connect/jars:/etc/kafka-connect/jars
    depends_on:
      - zookeeper
      - kafka
      - schema-registry
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: 'kafka:29092'
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      CONNECT_INTERNAL_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_INTERNAL_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect"
      CONNECT_LOG4J_ROOT_LOGLEVEL: "INFO"
      CONNECT_LOG4J_APPENDER_STDOUT_LAYOUT_CONVERSIONPATTERN: "[%d] %p %X{connector.context}%m (%c:%L)%n"
      CONNECT_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.rest=WARN,org.reflections=ERROR"
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: "1"
      CONNECT_PLUGIN_PATH: "/usr/share/java,/usr/share/confluent-hub-components,/usr/share/confluent-hub-components/custom"
      LD_LIBRARY_PATH: '/usr/share/java/debezium-connector-oracle/instantclient_19_6/'
  control-center:
    image: confluentinc/cp-enterprise-control-center:6.0.1
    hostname: control-center
    container_name: control-center
    depends_on:
      - kafka
      - schema-registry
      - kafka-connect
      - ksqldb
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'kafka:29092'
      CONTROL_CENTER_CONNECT_CLUSTER: 'kafka-connect:8083'
      CONTROL_CENTER_KSQL_KSQLDB1_URL: "http://10.58.0.207:8088"
      CONTROL_CENTER_KSQL_KSQLDB1_ADVERTISED_URL: "http://10.58.0.207:8088"
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://10.58.0.207:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021
  ksqldb:
    image: confluentinc/cp-ksqldb-server:6.0.1
    hostname: ksqldb
    container_name: ksqldb-server
    depends_on:
      - kafka
      - kafka-connect
    ports:
      - "8088:8088"
    environment:
      KSQL_CONFIG_DIR: "/etc/ksql"
      KSQL_LISTENERS: "http://0.0.0.0:8088"
      KSQL_BOOTSTRAP_SERVERS: kafka:29092
      KSQL_KSQL_LOGGING_PROCESSING_STREAM_AUTO_CREATE: "true"
      KSQL_KSQL_LOGGING_PROCESSING_TOPIC_AUTO_CREATE: "true"
      KSQL_KSQL_CONNECT_URL: http://kafka-connect:8083
      KSQL_KSQL_SCHEMA_REGISTRY_URL: http://schema-registry:8081
  ksqldb-cli:
    image: confluentinc/cp-ksqldb-cli:6.0.1
    container_name: ksqldb-cli
    depends_on:
      - kafka
      - kafka-connect
      - ksqldb
    entrypoint: /bin/sh
    tty: true
  ksql-datagen:
    image: confluentinc/ksqldb-examples:6.0.1
    hostname: ksql-datagen
    container_name: ksql-datagen
    depends_on:
      - ksqldb
      - kafka
      - schema-registry
      - kafka-connect
    command: "bash -c 'echo Waiting for Kafka to be ready... && \
                       cub kafka-ready -b broker:29092 1 40 && \
                       echo Waiting for Confluent Schema Registry to be ready... && \
                       cub sr-ready schema-registry 8081 40 && \
                       echo Waiting a few seconds for topic creation to finish... && \
                       sleep 11 && \
                       tail -f /dev/null'"
    environment:
      KSQL_CONFIG_DIR: "/etc/ksql"
      STREAMS_BOOTSTRAP_SERVERS: kafka:29092
      STREAMS_SCHEMA_REGISTRY_HOST: schema-registry
      STREAMS_SCHEMA_REGISTRY_PORT: 8081
  rest-proxy:
    image: confluentinc/cp-kafka-rest:6.0.1
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8082:8082
    hostname: rest-proxy
    container_name: rest-proxy
    environment:
      KAFKA_REST_HOST_NAME: rest-proxy
      KAFKA_REST_BOOTSTRAP_SERVERS: 'kafka:29092'
      KAFKA_REST_LISTENERS: "http://0.0.0.0:8082"
      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
You need to add /etc/kafka-connect/jars to CONNECT_PLUGIN_PATH
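Concretely, that means extending the CONNECT_PLUGIN_PATH value of the kafka-connect service so it also contains /etc/kafka-connect/jars (where the Debezium Oracle jars are mounted) and restarting the service. One way to verify the plugin was picked up afterwards is to list the installed plugins via the Connect REST API:

curl -s http://localhost:8083/connector-plugins
# io.debezium.connector.oracle.OracleConnector should appear in the returned list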

Kafka Connect topic.key.ignore not works as expected

As I understand from the Kafka Connect documentation, this configuration should ignore the keys for the metricbeat and filebeat topics but not for alarms. But Kafka Connect does not ignore any key.
This is the full JSON config that I am pushing to Kafka Connect over REST:
{
  "auto.create.indices.at.start": false,
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "connection.url": "http://elasticsearch:9200",
  "connection.timeout.ms": 5000,
  "read.timeout.ms": 5000,
  "tasks.max": "5",
  "topics": "filebeat,metricbeat,alarms",
  "behavior.on.null.values": "delete",
  "behavior.on.malformed.documents": "warn",
  "flush.timeout.ms": 60000,
  "max.retries": 42,
  "retry.backoff.ms": 100,
  "max.in.flight.requests": 5,
  "max.buffered.records": 20000,
  "batch.size": 4096,
  "drop.invalid.message": true,
  "schema.ignore": true,
  "topic.key.ignore": "metricbeat,filebeat",
  "key.ignore": false,
  "name": "elasticsearch-ecs-connector",
  "type.name": "_doc",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "transforms": "routeTS",
  "transforms.routeTS.type": "org.apache.kafka.connect.transforms.TimestampRouter",
  "transforms.routeTS.topic.format": "${topic}-${timestamp}",
  "transforms.routeTS.timestamp.format": "YYYY.MM.dd",
  "errors.tolerance": "all",
  "errors.log.enable": false,
  "errors.log.include.messages": false,
  "errors.deadletterqueue.topic.name": "logstream-dlq",
  "errors.deadletterqueue.context.headers.enable": true,
  "errors.deadletterqueue.topic.replication.factor": 1
}
This is the log output during startup of the connector:
[2020-05-01 21:07:49,960] INFO ElasticsearchSinkConnectorConfig values:
auto.create.indices.at.start = false
batch.size = 4096
behavior.on.malformed.documents = warn
behavior.on.null.values = delete
compact.map.entries = true
connection.compression = false
connection.password = null
connection.timeout.ms = 5000
connection.url = [http://elasticsearch:9200]
connection.username = null
drop.invalid.message = true
elastic.https.ssl.cipher.suites = null
elastic.https.ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
elastic.https.ssl.endpoint.identification.algorithm = https
elastic.https.ssl.key.password = null
elastic.https.ssl.keymanager.algorithm = SunX509
elastic.https.ssl.keystore.location = null
elastic.https.ssl.keystore.password = null
elastic.https.ssl.keystore.type = JKS
elastic.https.ssl.protocol = TLS
elastic.https.ssl.provider = null
elastic.https.ssl.secure.random.implementation = null
elastic.https.ssl.trustmanager.algorithm = PKIX
elastic.https.ssl.truststore.location = null
elastic.https.ssl.truststore.password = null
elastic.https.ssl.truststore.type = JKS
elastic.security.protocol = PLAINTEXT
flush.timeout.ms = 60000
key.ignore = false
linger.ms = 1
max.buffered.records = 20000
max.in.flight.requests = 5
max.retries = 42
read.timeout.ms = 5000
retry.backoff.ms = 100
schema.ignore = true
topic.index.map = []
topic.key.ignore = [metricbeat, filebeat]
topic.schema.ignore = []
type.name = _doc
write.method = insert
I am using Confluent Platform 5.5.0.
Let's recap here, because there have been several edits to your question and problem statement :)
You want to stream multiple topics to Elasticsearch with a single connector
You want to use the message key for some topics as the Elasticsearch document ID, and for others you don't and want to use the Kafka message coordinates instead (topic+partition+offset)
You are trying to do this with key.ignore and topic.key.ignore settings
Here's my test data in three topics, test01, test02, test03:
ksql> PRINT test01 from beginning;
Key format: KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:08:32.441 Z, key: X, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:08:32.594 Z, key: Y, value: {"COL1": 2, "COL2": "BAR"}
ksql> PRINT test02 from beginning;
Key format: KAFKA_STRING
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:08:50.865 Z, key: X, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:08:50.936 Z, key: Y, value: {"COL1": 2, "COL2": "BAR"}
ksql> PRINT test03 from beginning;
Key format: ¯\_(ツ)_/¯ - no data processed
Value format: AVRO or KAFKA_STRING
rowtime: 2020/05/12 11:16:15.166 Z, key: <null>, value: {"COL1": 1, "COL2": "FOO"}
rowtime: 2020/05/12 11:16:46.404 Z, key: <null>, value: {"COL1": 2, "COL2": "BAR"}
With this data I create a connector (I'm using ksqlDB but it's the same as if you use the REST API directly):
CREATE SINK CONNECTOR SINK_ELASTIC_TEST WITH (
    'connector.class'  = 'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
    'connection.url'   = 'http://elasticsearch:9200',
    'key.converter'    = 'org.apache.kafka.connect.storage.StringConverter',
    'type.name'        = '_doc',
    'topics'           = 'test02,test01,test03',
    'key.ignore'       = 'false',
    'topic.key.ignore' = 'test02,test03',
    'schema.ignore'    = 'false'
);
The resulting indices are created and populated in Elasticsearch. Here's the index and document ID of the documents:
➜ curl -s http://localhost:9200/test01/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test01","Y"]
["test01","X"]
➜ curl -s http://localhost:9200/test02/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test02","test02+0+0"]
["test02","test02+0+1"]
➜ curl -s http://localhost:9200/test03/_search \
-H 'content-type: application/json' \
-d '{ "size": 5 }' |jq -c '.hits.hits[] | [._index, ._id]'
["test03","test03+0+0"]
["test03","test03+0+1"]
So key.ignore=false (the default) is in effect for test01, which means that the key of each message is used as the document ID.
Topics test02 and test03 are listed for topic.key.ignore which means that the key of the message is ignored (i.e. in effect key.ignore=true), and thus the document ID is the topic/partition/offset of the message.
I would recommend, given that I've proven out above that this does work, that you start your test again from scratch to double-check your working.

Unable to initialize MongoDB When the Container Starts

Here is my docker-compose.yaml:
version: '3.3'
mongo:
  build:
    context: '.'
    dockerfile: 'Dockerfile'
  environment:
    MONGO_INITDB_DATABASE: 'mydb'
  ports:
    - '27017:27017'
  volumes:
    - 'data-storage:/data/db'
  networks:
    mynet:
volumes:
  data-storage:
networks:
  mynet:
Here is my Dockerfile:
FROM mongo:latest
COPY ./initdb.js /docker-entrypoint-initdb.d/
And finally, here is my initdb.js:
db.createCollection("strategyitems");
db.strategyitems.createIndex( {strategy: 1 }, { unique: false } );
db.strategyitems.createIndex( {strategy: 1, symbol: 1 }, { unique: true } );
db.strategyitems.insertMany([
    { strategy: "crypto", symbol: "btcusd", eval_period: 15, buy_booster: 8.0, sell_booster: 5.0, buy_lot: 0.2, sell_lot: 0.2 },
    { strategy: "crypto", symbol: "ethusd", eval_period: 15, buy_booster: 8.0, sell_booster: 5.0, buy_lot: 0.2, sell_lot: 0.2 },
    { strategy: "crypto", symbol: "neousd", eval_period: 15, buy_booster: 8.0, sell_booster: 5.0, buy_lot: 0.2, sell_lot: 0.2 }
]);
The container builds and starts successfully, but there is no way to get the DB statements above executed.
If I log into the container, the folder /docker-entrypoint-initdb.d/ contains initdb.js, so I'd expect the DB to get initialized.
Am I missing something?
So the supplied compose file doesn't work for me; I had to edit it to get it up and running (v18.06 CE), so heads-up on that.
version: '3.3'
services:
  mongo:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      MONGO_INITDB_DATABASE: 'mydb'
    ports:
      - '27017:27017'
    volumes:
      - 'data-storage:/data/db'
    networks:
      mynet:
volumes:
  data-storage:
networks:
  mynet:
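As a side note, docker-compose config is a handy way to catch this kind of structural problem early, since it validates the file and prints the normalized result:

docker-compose config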
Next, if you ran docker-compose up before adding the initdb.js file and then stopped it with docker-compose down, note that docker-compose down stops and removes the containers but doesn't remove the volume:
docker ps
CONTAINER ID   IMAGE              COMMAND                  CREATED         STATUS         PORTS                      NAMES
c412bbd9a22b   lumberjack_mongo   "docker-entrypoint.s…"   7 minutes ago   Up 6 minutes   0.0.0.0:27017->27017/tcp   lumberjack_mongo_1

docker volume ls
DRIVER    VOLUME NAME
local     lumberjack_data-storage

docker-compose down
Removing lumberjack_mongo_1 ... done
Removing network lumberjack_mynet

docker volume ls
DRIVER    VOLUME NAME
local     lumberjack_data-storage
The problem arises when docker-compose up is run while the volume already exists: Docker mounts the volume before the container starts up, Mongo does its pre-checks, and if it finds that the data directories are already present it skips the initdb sequence.
If you remove the volume after docker-compose down and do a docker-compose up, the volume will be created from scratch, the pre-check finds nothing, and MongoDB is initialized:
docker volume rm lumberjack_data-storage
lumberjack_data-storage
docker-compose up
Creating network "lumberjack_mynet" with the default driver
Creating volume "lumberjack_data-storage" with default driver
Creating lumberjack_mongo_1 ... done
Attaching to lumberjack_mongo_1
[....]
mongo_1 | /usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/initdb.js
mongo_1 | 2018-08-04T18:08:47.699+0000 I INDEX [LogicalSessionCacheRefresh] build index on: config.system.sessions properties: { v: 2, key: { lastUse: 1 }, name: "lsidTTLIndex", ns: "config.system.sessions", expireAfterSeconds: 1800 }
mongo_1 | 2018-08-04T18:08:47.745+0000 I NETWORK [conn2] received client metadata from 127.0.0.1:45324 conn2: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "4.0.0" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
mongo_1 | 2018-08-04T18:08:47.747+0000 I STORAGE [conn2] createCollection: initdb.strategyitems with generated UUID: 585edb14-bc63-4879-bc5d-504867fb5e12
mongo_1 | 2018-08-04T18:08:47.851+0000 I INDEX [conn2] build index on: initdb.strategyitems properties: { v: 2, key: { strategy: 1.0 }, name: "strategy_1", ns: "initdb.strategyitems" }
mongo_1 | 2018-08-04T18:08:47.851+0000 I INDEX [conn2] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
mongo_1 | 2018-08-04T18:08:47.852+0000 I INDEX [conn2] build index done. scanned 0 total records. 0 secs
mongo_1 | 2018-08-04T18:08:47.881+0000 I INDEX [conn2] build index on: initdb.strategyitems properties: { v: 2, unique: true, key: { strategy: 1.0, symbol: 1.0 }, name: "strategy_1_symbol_1", ns: "initdb.strategyitems" }
mongo_1 | 2018-08-04T18:08:47.881+0000 I INDEX [conn2] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
mongo_1 | 2018-08-04T18:08:47.882+0000 I INDEX [conn2] build index done. scanned 0 total records. 0 secs
mongo_1 | 2018-08-04T18:08:47.886+0000 I NETWORK [conn2] end connection 127.0.0.1:45324 (0 connections now open)
[....]
mongo_1 | MongoDB init process complete; ready for start up.
mongo_1 |
mongo_1 | 2018-08-04T18:08:48.933+0000 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
mongo_1 | 2018-08-04T18:08:48.939+0000 I CONTROL [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=e90c80083360
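Equivalently, assuming nothing else needs the data, the whole reset can be collapsed into one step, since docker-compose down -v also removes the named volumes declared in the compose file:

docker-compose down -v    # stop containers and remove the data-storage volume
docker-compose up --build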