Optimal Kafka Connect Hourly S3 Avro Sink Config

{
"name":"{{name}}",
"tasks.max": "6", //have 6 partitions for this topic
"topics": "{{topic}}",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schemas.enable": "true",
"key.converter.schema.registry.url": "xx",
"key.converter.key.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "xx",
"value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy",
"errors.retry.timeout":"600000",
"errors.log.enable":"true",
"errors.log.include.messages":"true",
"schema.compatibility": "BACKWARD",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"flush.size": "100000",
"rotate.schedule.interval.ms": "3600000",
"rotate.interval.ms": "3600000",
"enhanced.avro.schema.support": "true",
"connect.meta.data": "false",
"partitioner.class": "{{partitioner}}somepartitioner",
"partition.duration.ms": "3600000",
"path.format": "'avro/event=?eventClass?/tenant=?tenant?/date'=YYYY-MM-dd/'hour'=HH",
"locale": "en",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "{{timestampField}}",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"s3.bucket.name": "somebucket",
"s3.region": "region",
"s3.part.size": "5242880",
"offset.flush.interval.ms": "1200000"
}
The topic has around 739,180 records, totalling about 1.1 GB.
I'm not fully sure whether my config is correct or whether I can improve it somewhere. I want to flush in two cases: hourly, or when the size hits 5 GB.
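For reference, a sketch of how the rotation-related settings interact in the Confluent S3 sink, as far as I understand them: flush.size counts records, not bytes, so there is no direct "flush at 5 GB" setting; rotate.schedule.interval.ms rotates on wall-clock time (and requires timezone, which is already set); rotate.interval.ms rotates based on the timestamps from timestamp.extractor. Going by the numbers above (~739,180 records for ~1.1 GB, roughly 1.5 KB per record), a 5 GB cap works out to roughly 3.3 million records, so an approximation of "hourly or roughly 5 GB" might look like this - the 3300000 figure is an estimate derived from those numbers, not a documented value:
"flush.size": "3300000",
"rotate.schedule.interval.ms": "3600000",
"rotate.interval.ms": "3600000"
One difference worth noting: rotate.schedule.interval.ms closes files on wall-clock time even when no new records arrive, while rotate.interval.ms only closes a file once a later record's extracted timestamp crosses the boundary.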

Related

Syncing Postgres & Elasticsearch with Kafka Connect (Debezium)

I have set up, using Docker, a Postgres image as well as an Elasticsearch one.
What I'm trying to achieve: I have a Vehicle entity (on a microservice with Spring Data JPA), as well as a Vehicle document (on a microservice with Spring Data Elasticsearch).
import java.util.UUID;

import lombok.Builder;
import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;

@Document(indexName = "vehicles")
@Builder
@Data
public class Vehicle {
    @Id
    private UUID id;
    @Field(name = "vin")
    private String vin;
    @Field(name = "brand")
    private String brand;
    @Field(name = "model")
    private String model;
}
I also have JSON configs for Kafka Connect, for Elasticsearch and Postgres:
{
"name": "eh-vehicles-sink",
"config": {
"connector.class":
"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "vehicles",
"connection.url": "http://elasticsearch:9200",
"key.ignore": "true",
"type.name": "vehicles",
"index.mapping.dynamic": false,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false"
}
}
{
"name": "postgres-vehicles-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"plugin.name": "pgoutput",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname": "postgres",
"schema.include.list": "public",
"include.schema.changes": "true",
"database.server.name": "Vehicles",
"database.server.id": "5401",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "public.history",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"transforms":"Reroute",
"transforms.Reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.Reroute.topic.regex":"(.*)vehicles",
"transforms.Reroute.topic.replacement": "vehicles",
"transforms.Reroute.key.field.name": "id",
"transforms.Reroute.key.enforce.uniqueness":"false"
}
}
The problem is that after an entity is persisted in Postgres, Kafka will send it to Elasticsearch, but it will be stored in the following format:
"hits": [
{
"_index": "vehicles",
"_type": "_doc",
"_id": "vehicles+0+0",
"_score": 1,
"_source": {
"op": "c",
"before": null,
"after": {
"generation": "F10",
"cylindrical_capacity": 2000,
"country": "Germany",
"tva_deductible": false,
"km": "100000",
"fuel": "Diesel",
"first_owner": "John Doe",
"production_date": 15126,
"created_at": null,
"traction": "ALLWHEELS",
"owner_account_id": "4a2ac2a2-3323-4b42-8960-b044897a180c",
"first_registration_date": 15354,
"colour": "Black",
"soft_deleted": false,
"transmission": "AUTOMATIC",
"accident_free": true,
"vin": "WBAJB51090B513560",
"model": "5 Series",
"id": "739c9d56-d50b-4b17-a86a-e2561f54c1a9",
"power": 180,
"brand": "BMW",
"favorite_accounts": null
},
"source": {
"schema": "public",
"sequence": "[\"24015744\",\"24015744\"]",
"xmin": null,
"connector": "postgresql",
"lsn": 24015744,
"name": "Vehicles",
"txId": 506,
"version": "1.8.1.Final",
"ts_ms": 1676855693566,
"snapshot": "false",
"db": "postgres",
"table": "vehicles"
},
"ts_ms": 1676855694066,
"transaction": null
}
}
]
This is a problem when fetching it in the Elasticsearch microservice, because the payload is wrapped in another object and the wrong id will be fetched unless I do some additional processing, which I'd rather avoid since it feels like boilerplate.
How can I configure Kafka/Debezium so that only the entity is stored in the vehicles index, without additional metadata like "after"?
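One approach worth trying (a sketch, not verified against this setup): Debezium ships an ExtractNewRecordState SMT that unwraps the change-event envelope so only the "after" state is forwarded. Added to the source connector config alongside the existing Reroute transform, it would look roughly like this; the drop.tombstones value is an assumption:
"transforms": "Reroute,unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false"
On the sink side, key.ignore=true makes the Elasticsearch connector derive _id from topic+partition+offset (hence "vehicles+0+0" above); setting it to false would use the Kafka record key instead, which for Debezium is the primary-key struct, so an org.apache.kafka.connect.transforms.ExtractField$Key transform on the sink may also be needed to reduce it to the bare id.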

Debezium with MongoDB - Produced record's payload contains backslashes

I'm implementing data extraction using the Debezium MongoDB connector, building on the official documentation: https://debezium.io/documentation/reference/stable/connectors/mongodb.html
Everything is working quite well, except that the payload contains backslashes, as you can see in the after attribute. Oddly enough, the source attribute is fine.
{
"after": "{\"_id\": {\"$oid\": \"63626d5993801d8fd1140993\"},\"document\": \"29973569000204\",\"document_type\": \"CNPJ\"}",
"patch": null,
"filter": null,
"source": {
"version": "1.7.1.Final",
"connector": "mongodb",
"name": "xxxxxxxxxx",
"ts_ms": 8466513,
"snapshot": "false",
"db": "database",
"sequence": null,
"rs": "atlas-iurhise-shard-0",
"collection": "mongo_collection",
"ord": 1,
"h": null,
"tord": 4,
"stxnid": "281f4230-d8cc-3d23-a556-89923b45e25f:168"
},
"op": "c",
"ts_ms": 1667394905422,
"transaction": null
}
I tried this solution, but it doesn't work for me: Debezium Outbox Pattern property transforms.outbox.table.expand.json.payload not working
These are my settings:
{
"name": "DebeziumDataExtract",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "3",
"mongodb.hosts": "removed",
"mongodb.name": "removed",
"mongodb.user": "removed",
"mongodb.password": "removed",
"mongodb.ssl.enabled": "true",
"collection.whitelist": "removed",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"hstore.handling.mode": "json",
"decimal.handling.mode": "string",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"heartbeat.interval.ms": "1000",
"heartbeat.topics.prefix": "removed",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 1,
"topic.creation.default.cleanup.policy": "compact",
"topic.creation.default.compression.type": "lz4",
"transforms": "unwrap",
"transforms.unwrap.collection.expand.json.payload": "true"
}
}
and I'm expecting a payload like this:
{
"after": {
"_id": {
"$oid": "63626d5993801d8fd1140993"
},
"document": "29973585214796",
"document_type": "CNPJ"
},
"patch": null,
"filter": null,
"source": {
"version": "1.7.1.Final",
"connector": "mongodb",
"name": "xxxxxxxxxx",
"ts_ms": 8466513,
"snapshot": "false",
"db": "database",
"sequence": null,
"rs": "atlas-iurhise-shard-0",
"collection": "mongo_collection",
"ord": 1,
"h": null,
"tord": 4,
"stxnid": "281f4230-d8cc-3d23-a556-89923b45e25f:168"
},
"op": "c",
"ts_ms": 1667394905422,
"transaction": null
}
Could someone help me?
########## UPDATES ##########
After @onecricketeer's comments I tried this:
{
"name": "DebeziumTransportPlanner",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "3",
"mongodb.hosts": "stg-transport-planner-0-shard-00-00-00.xmapa.mongodb.net,stg-transport-planner-0-shard-00-01.xmapa.mongodb.net,stg-transport-planner-0-shard-00-02.xmapa.mongodb.net",
"mongodb.name": "stg-transport-planner-01",
"mongodb.user": "oploguser-stg",
"mongodb.password": "vCh1NtV4PoY8PeSJ",
"mongodb.ssl.enabled": "true",
"collection.whitelist": "stg-transport-planner-01[.]aggregated_transfers",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"hstore.handling.mode": "json",
"decimal.handling.mode": "string",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"heartbeat.interval.ms": "1000",
"heartbeat.topics.prefix": "__debeziumtransport-planner-heartbeat",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 1,
"topic.creation.default.cleanup.policy": "compact",
"topic.creation.default.compression.type": "lz4",
"transforms": "unwrap",
"transforms.unwrap.type":"io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
"transforms.unwrap.collection.expand.json.payload": "true",
"transforms.unwrap.collection.fields.additional.placement": "route_external_id:header,transfer_index:header"
}
}
You need to use JsonConverter instead of StringConverter if you want the data to be a JSON object rather than a String.
Also, you are missing transforms.unwrap.type.

Kafka HDFS Sink Connector with constant offset lag

I have a Kafka HDFS Sink Connector with a constant offset lag. The kafka_consumergroup_group_lag from Kafka Lag Exporter is illustrated in the following figure:
Note that the topic receives messages once a day, hence the spike. I would like the offset lag to go to zero, but as seen, the offset lag stabilizes at a value of ~833. How can I configure the connector to reach an offset lag of zero?
The connector configuration is given below
{
"name": "my_connector",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"retries": "2147483647",
"topics": "my_kafka_topic",
"format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "86400000",
"path.format": "'date_id'=YYYYMMdd",
"timezone": "UTC",
"locale": "en-US",
"timestamp.extractor": "RecordField",
"timestamp.field": "message_timestamp",
"compression.type": "snappy",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.retry.delay.max.ms": "60000",
"hadoop.conf.dir": "/var/run/configmaps/{{ stage }}",
"hdfs.url": "{{ hdfs_url }}",
"logs.dir": "{{ logs_dir }}",
"topics.dir": "my_hdfs_path",
"hdfs.authentication.kerberos": "true",
"hdfs.namenode.principal": "{{ hdfs_namenode_principal }}",
"connect.hdfs.principal": "{{ connect_hdfs_principal }}",
"connect.hdfs.keytab": "{{ connect_hdfs_keytab }}",
"flush.size": "600000",
"rotate.interval.ms": "1600000",
"transforms": "insertTS,formatTS",
"transforms.insertTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertTS.timestamp.field": "message_timestamp",
"transforms.formatTS.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.formatTS.format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
"transforms.formatTS.field": "message_timestamp",
"transforms.formatTS.target.type": "string"
}
}
For topics that receive records more frequently, the connectors have no problem reaching an offset lag of zero (or close to zero).
The configuration for this connector is identical:
{
"name": "my_other_connector",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"retries": "2147483647",
"topics": "my_other_topic",
"format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "86400000",
"path.format": "'date_id'=YYYYMMdd",
"timezone": "UTC",
"locale": "en-US",
"timestamp.extractor": "RecordField",
"timestamp.field": "message_timestamp",
"compression.type": "snappy",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.retry.delay.max.ms": "60000",
"hadoop.conf.dir": "/var/run/configmaps/{{ stage }}",
"hdfs.url": "{{ hdfs_url }}",
"logs.dir": "{{ logs_dir }}",
"topics.dir": "my_other_hdfs_location",
"hdfs.authentication.kerberos": "true",
"hdfs.namenode.principal": "{{ hdfs_namenode_principal }}",
"connect.hdfs.principal": "{{ connect_hdfs_principal }}",
"connect.hdfs.keytab": "{{ connect_hdfs_keytab }}",
"flush.size": "600000",
"rotate.interval.ms": "1600000",
"transforms": "insertTS,formatTS",
"transforms.insertTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertTS.timestamp.field": "message_timestamp",
"transforms.formatTS.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.formatTS.format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
"transforms.formatTS.field": "message_timestamp",
"transforms.formatTS.target.type": "string"
}
}
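A guess at the cause, based only on the configs above: with timestamp.extractor=RecordField and only rotate.interval.ms, an open file is not committed until a later record's timestamp pushes past the rotation window, so on a topic that only gets one burst per day the tail of that burst (here ~833 records, well below flush.size) can stay uncommitted until the next day's data arrives. If the HDFS sink supports scheduled, wall-clock rotation the same way the S3 sink does (an assumption worth checking against the connector's documentation), adding something like the following should let the lag drain to zero between bursts; the 30-minute value is arbitrary:
"rotate.schedule.interval.ms": "1800000"
This would work together with the timezone setting already present in the config.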

How to update Azure DevOps build definition source from Azure Repos to GitHub using the REST API

I am trying to change the source from Azure Repos Git to GitHub in an Azure DevOps build definition using the REST API.
This is the response I get for Azure Repos using the Azure DevOps build definitions REST API - GET https://dev.azure.com/{org_name}/{project_name}/_apis/build/definitions/{Build_Id}?api-version=6.0
"repository": {
"properties": {
"cleanOptions": "0",
"labelSources": "0",
"labelSourcesFormat": "$(build.buildNumber)",
"reportBuildStatus": "true",
"gitLfsSupport": "false",
"skipSyncSource": "false",
"checkoutNestedSubmodules": "false",
"fetchDepth": "0"
},
"id": "xxxx",
"type": "TfsGit",
"name": "{repo_name}",
"url": "https://dev.azure.com/{org_name}/{project_name}/_git/{repo_name}",
"defaultBranch": "refs/heads/master",
"clean": "false",
"checkoutSubmodules": false
},
If I manually change the source from Azure Repos to GitHub, this is the JSON response I get for the GitHub repo:
"repository": {
"properties": {
"apiUrl": "https://api.github.com/repos/{github_id}/{repo_name}",
"branchesUrl": "https://api.github.com/repos/{github_id}/{repo_name}/branches",
"cloneUrl": "https://github.com/{github_id}/{repo_name}.git",
"connectedServiceId": "xxxxxxx",
"defaultBranch": "master",
"fullName": "{github_id}/{repo_name}",
"hasAdminPermissions": "True",
"isFork": "False",
"isPrivate": "False",
"lastUpdated": "10/16/2019 17:28:29",
"manageUrl": "https://github.com/{github_id}/{repo_name}",
"nodeId": "xxxxxx",
"ownerId": "xxxxx",
"orgName": "{github_id}",
"refsUrl": "https://api.github.com/repos/{github_id}/pyapp/git/refs",
"safeRepository": "{github_id}/pyapp",
"shortName": "{repo_name}",
"ownerAvatarUrl": "https://avatars2.githubusercontent.com/u/xxxxx?v=4",
"archived": "False",
"externalId": "xxxxxx",
"ownerIsAUser": "True",
"checkoutNestedSubmodules": "false",
"cleanOptions": "0",
"fetchDepth": "0",
"gitLfsSupport": "false",
"reportBuildStatus": "true",
"skipSyncSource": "false",
"labelSourcesFormat": "$(build.buildNumber)",
"labelSources": "0"
},
"id": "{github_id}/{repo_name}",
"type": "GitHub",
"name": "{github_id}/{repo_name}",
"url": "https://github.com/{github_id}/{repo_name}.git",
"defaultBranch": "master",
"clean": "false",
"checkoutSubmodules": false
I tried to change the Azure repo to GitHub using Postman, by copying the GitHub JSON response body into the request and calling PUT https://dev.azure.com/{org_name}/{project_name}/_apis/build/definitions/{Build_Id}?api-version=6.0
But this does not work.
How can I achieve this using a script or Postman? What am I missing here?
You could copy the content of the Get Build Definition API.
Here is my example:
URL:
PUT https://dev.azure.com/{OrganizationName}/{ProjectName}/_apis/build/definitions/{DefinitionID}?api-version=5.0-preview.6
Request Body sample:
{
"process": {
"phases": [
{
"steps": [
],
"name": "Phase 1",
"refName": "Phase_1",
"condition": "succeeded()",
"target": {
"executionOptions": {
"type": 0
},
"allowScriptsAuthAccessOption": false,
"type": 1
},
"jobAuthorizationScope": "projectCollection",
"jobCancelTimeoutInMinutes": 1
}
],
"type": 1
},
"repository": {
"properties": {
"cleanOptions": "0",
"labelSources": "0",
"labelSourcesFormat": "$(build.buildNumber)",
"reportBuildStatus": "true",
"gitLfsSupport": "false",
"skipSyncSource": "false",
"checkoutNestedSubmodules": "false",
"fetchDepth": "0"
},
"id": "{github_id}/{repo_name}",
"type": "GitHub",
"name": "{github_id}/{repo_name}",
"url": "https://github.com/{github_id}/{repo_name}.git",
"defaultBranch": "master",
"clean": "false",
"checkoutSubmodules": false
},
"id": {DefinitionID},
"revision": {revisionID},
"name": "definitionCreatedByRESTAPI",
"type": "build",
"queueStatus": "enabled"
}
In the Request Body, there are the following key points:
The process field is required. You could copy its content from the Get Build Definition REST API response.
The "id": {DefinitionID} is required.
"revision": {revisionID} - you need to input the valid revision. This is very important.
To get the correct revision, navigate to Azure Pipelines -> Target Build Definition -> History.
Count how many update records there are; the correct revision is that total plus 1.
For example, with 9 update records the correct revision is 10 (9+1=10).
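For the repository block itself, here is a minimal sketch of what to swap in when pointing the definition at GitHub, built from the GET response shown earlier; the connectedServiceId placeholder is an assumption and should be the ID of an existing GitHub service connection in the project, since that is what links the definition to GitHub:
"repository": {
    "properties": {
        "connectedServiceId": "{github_service_connection_id}",
        "fullName": "{github_id}/{repo_name}",
        "defaultBranch": "master",
        "reportBuildStatus": "true"
    },
    "id": "{github_id}/{repo_name}",
    "type": "GitHub",
    "name": "{github_id}/{repo_name}",
    "url": "https://github.com/{github_id}/{repo_name}.git",
    "defaultBranch": "master",
    "clean": "false",
    "checkoutSubmodules": false
}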

Kafka sink to MongoDB: how do I set the "_id" field to an existing value in one of the columns in my topic?

I have the following topic (JSON, not Avro) generated by Debezium:
"payload": {"id": 1, "name": "test", "uuid": "f9a96ea4-3ff9-480f-bf8a-ee53a1e6e583"}
How do I set the "_id" field (in the Mongo collection) to the same value as "uuid"?
This is my SINK config:
{
"name": "mongo-sink",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max": 3,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"topics": "s4farm.animal",
"connection.uri": "mongodb://user:password#host:port/?authSource=database",
"database": "database",
"collection": "s4farm_animal",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "id",
"value.projection.type": "whitelist",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy"
}
}
Can you help me?
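A sketch of one way to do this with the strategies already in the config (the projection change is the assumption here): with PartialValueStrategy, the fields listed in value.projection.list are what end up in _id, so projecting uuid instead of id should carry that value over:
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "uuid",
"value.projection.type": "whitelist"
Note that, as far as I understand it, PartialValueStrategy builds _id as a sub-document containing the projected field(s); if _id needs to be the bare uuid string, the connector's ProvidedInValueStrategy (which expects an _id field in the value, e.g. produced by a rename SMT) may be a better fit.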