Optimal Kafka Connect Hourly S3 Avro Sink Config

{
"name":"{{name}}",
"tasks.max": "6", //have 6 partitions for this topic
"topics": "{{topic}}",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schemas.enable": "true",
"key.converter.schema.registry.url": "xx",
"key.converter.key.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "xx",
"value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy",
"errors.retry.timeout":"600000",
"errors.log.enable":"true",
"errors.log.include.messages":"true",
"schema.compatibility": "BACKWARD",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"flush.size": "100000",
"rotate.schedule.interval.ms": "3600000",
"rotate.interval.ms": "3600000",
"enhanced.avro.schema.support": "true",
"connect.meta.data": "false",
"partitioner.class": "{{partitioner}}somepartitioner",
"partition.duration.ms": "3600000",
"path.format": "'avro/event=?eventClass?/tenant=?tenant?/date'=YYYY-MM-dd/'hour'=HH",
"locale": "en",
"timezone": "UTC",
"timestamp.extractor": "RecordField",
"timestamp.field": "{{timestampField}}",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"s3.bucket.name": "somebucket",
"s3.region": "region",
"s3.part.size": "5242880",
"offset.flush.interval.ms": "1200000"
}
The topic has around 739,180 records, totalling about 1.1 GB.
I'm not fully sure whether my config is correct or whether I can improve it somewhere. I want to flush in two cases: hourly, or when the size hits 5 GB.
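For reference, a sketch of how the rotation-related settings interact in the Confluent S3 sink, as far as I understand them: flush.size counts records, not bytes, so there is no direct "flush at 5 GB" setting; rotate.schedule.interval.ms rotates on wall-clock time (and requires timezone, which is already set); rotate.interval.ms rotates based on the timestamps from timestamp.extractor. Going by the numbers above (~739,180 records for ~1.1 GB, roughly 1.5 KB per record), a 5 GB cap works out to roughly 3.3 million records, so an approximation of "hourly or roughly 5 GB" might look like this - the 3300000 figure is an estimate derived from those numbers, not a documented value:
"flush.size": "3300000",
"rotate.schedule.interval.ms": "3600000",
"rotate.interval.ms": "3600000"
One difference worth noting: rotate.schedule.interval.ms closes files on wall-clock time even when no new records arrive, while rotate.interval.ms only closes a file once a later record's extracted timestamp crosses the boundary.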

Related

Syncing Postgres & Elasticsearch with Kafka Connect (Debezium)

I have set up, using Docker, a Postgres image as well as an Elasticsearch one.
What I'm trying to achieve: I have a Vehicle entity (on a microservice with Spring Data JPA), as well as a Vehicle document (on a microservice with Spring Data Elasticsearch).
import java.util.UUID;

import lombok.Builder;
import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;

@Document(indexName = "vehicles")
@Builder
@Data
public class Vehicle {
    @Id
    private UUID id;
    @Field(name = "vin")
    private String vin;
    @Field(name = "brand")
    private String brand;
    @Field(name = "model")
    private String model;
}
I also have JSON configs for Kafka Connect, for Elasticsearch and Postgres:
{
"name": "eh-vehicles-sink",
"config": {
"connector.class":
"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "vehicles",
"connection.url": "http://elasticsearch:9200",
"key.ignore": "true",
"type.name": "vehicles",
"index.mapping.dynamic": false,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false"
}
}
{
"name": "postgres-vehicles-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"plugin.name": "pgoutput",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname": "postgres",
"schema.include.list": "public",
"include.schema.changes": "true",
"database.server.name": "Vehicles",
"database.server.id": "5401",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "public.history",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"transforms":"Reroute",
"transforms.Reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.Reroute.topic.regex":"(.*)vehicles",
"transforms.Reroute.topic.replacement": "vehicles",
"transforms.Reroute.key.field.name": "id",
"transforms.Reroute.key.enforce.uniqueness":"false"
}
}
The problem is that after an entity is persisted in Postgres, Kafka will send it to Elasticsearch, but it will be stored in the following format:
"hits": [
{
"_index": "vehicles",
"_type": "_doc",
"_id": "vehicles+0+0",
"_score": 1,
"_source": {
"op": "c",
"before": null,
"after": {
"generation": "F10",
"cylindrical_capacity": 2000,
"country": "Germany",
"tva_deductible": false,
"km": "100000",
"fuel": "Diesel",
"first_owner": "John Doe",
"production_date": 15126,
"created_at": null,
"traction": "ALLWHEELS",
"owner_account_id": "4a2ac2a2-3323-4b42-8960-b044897a180c",
"first_registration_date": 15354,
"colour": "Black",
"soft_deleted": false,
"transmission": "AUTOMATIC",
"accident_free": true,
"vin": "WBAJB51090B513560",
"model": "5 Series",
"id": "739c9d56-d50b-4b17-a86a-e2561f54c1a9",
"power": 180,
"brand": "BMW",
"favorite_accounts": null
},
"source": {
"schema": "public",
"sequence": "[\"24015744\",\"24015744\"]",
"xmin": null,
"connector": "postgresql",
"lsn": 24015744,
"name": "Vehicles",
"txId": 506,
"version": "1.8.1.Final",
"ts_ms": 1676855693566,
"snapshot": "false",
"db": "postgres",
"table": "vehicles"
},
"ts_ms": 1676855694066,
"transaction": null
}
}
]
This is a problem when fetching it in the Elasticsearch microservice, because the payload is wrapped in another object and the wrong id will be fetched unless I do some additional processing, which I'd rather avoid since it feels like boilerplate.
How can I configure Kafka/Debezium so that only the entity is stored in the vehicles index, without additional metadata like "after"?
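One approach worth trying (a sketch, not verified against this setup): Debezium ships an ExtractNewRecordState SMT that unwraps the change-event envelope so only the "after" state is forwarded. Added to the source connector config alongside the existing Reroute transform, it would look roughly like this; the drop.tombstones value is an assumption:
"transforms": "Reroute,unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false"
On the sink side, key.ignore=true makes the Elasticsearch connector derive _id from topic+partition+offset (hence "vehicles+0+0" above); setting it to false would use the Kafka record key instead, which for Debezium is the primary-key struct, so an org.apache.kafka.connect.transforms.ExtractField$Key transform on the sink may also be needed to reduce it to the bare id.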

Debezium with MongoDB - Produced record's payload contains backslashes

I'm implementing data extraction using the Debezium MongoDB connector, building on the official documentation: https://debezium.io/documentation/reference/stable/connectors/mongodb.html
Everything is working quite well, except that the payload contains backslashes, as you can see in the after attribute. Oddly enough, the source attribute is fine.
{
"after": "{\"_id\": {\"$oid\": \"63626d5993801d8fd1140993\"},\"document\": \"29973569000204\",\"document_type\": \"CNPJ\"}",
"patch": null,
"filter": null,
"source": {
"version": "1.7.1.Final",
"connector": "mongodb",
"name": "xxxxxxxxxx",
"ts_ms": 8466513,
"snapshot": "false",
"db": "database",
"sequence": null,
"rs": "atlas-iurhise-shard-0",
"collection": "mongo_collection",
"ord": 1,
"h": null,
"tord": 4,
"stxnid": "281f4230-d8cc-3d23-a556-89923b45e25f:168"
},
"op": "c",
"ts_ms": 1667394905422,
"transaction": null
}
I tried this solution, but it doesn't work for me: Debezium Outbox Pattern property transforms.outbox.table.expand.json.payload not working
These are my settings:
{
"name": "DebeziumDataExtract",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "3",
"mongodb.hosts": "removed",
"mongodb.name": "removed",
"mongodb.user": "removed",
"mongodb.password": "removed",
"mongodb.ssl.enabled": "true",
"collection.whitelist": "removed",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"hstore.handling.mode": "json",
"decimal.handling.mode": "string",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"heartbeat.interval.ms": "1000",
"heartbeat.topics.prefix": "removed",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 1,
"topic.creation.default.cleanup.policy": "compact",
"topic.creation.default.compression.type": "lz4",
"transforms": "unwrap",
"transforms.unwrap.collection.expand.json.payload": "true"
}
}
and I'm expecting a payload like this:
{
"after": {
"_id": {
"$oid": "63626d5993801d8fd1140993"
},
"document": "29973585214796",
"document_type": "CNPJ"
},
"patch": null,
"filter": null,
"source": {
"version": "1.7.1.Final",
"connector": "mongodb",
"name": "xxxxxxxxxx",
"ts_ms": 8466513,
"snapshot": "false",
"db": "database",
"sequence": null,
"rs": "atlas-iurhise-shard-0",
"collection": "mongo_collection",
"ord": 1,
"h": null,
"tord": 4,
"stxnid": "281f4230-d8cc-3d23-a556-89923b45e25f:168"
},
"op": "c",
"ts_ms": 1667394905422,
"transaction": null
}
Could someone help me?
########## UPDATES ##########
After @onecricketeer's comments I tried this:
{
"name": "DebeziumTransportPlanner",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"tasks.max": "3",
"mongodb.hosts": "stg-transport-planner-0-shard-00-00-00.xmapa.mongodb.net,stg-transport-planner-0-shard-00-01.xmapa.mongodb.net,stg-transport-planner-0-shard-00-02.xmapa.mongodb.net",
"mongodb.name": "stg-transport-planner-01",
"mongodb.user": "oploguser-stg",
"mongodb.password": "vCh1NtV4PoY8PeSJ",
"mongodb.ssl.enabled": "true",
"collection.whitelist": "stg-transport-planner-01[.]aggregated_transfers",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"hstore.handling.mode": "json",
"decimal.handling.mode": "string",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"heartbeat.interval.ms": "1000",
"heartbeat.topics.prefix": "__debeziumtransport-planner-heartbeat",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 1,
"topic.creation.default.cleanup.policy": "compact",
"topic.creation.default.compression.type": "lz4",
"transforms": "unwrap",
"transforms.unwrap.type":"io.debezium.connector.mongodb.transforms.ExtractNewDocumentState",
"transforms.unwrap.collection.expand.json.payload": "true",
"transforms.unwrap.collection.fields.additional.placement": "route_external_id:header,transfer_index:header"
}
}
You need to use JsonConverter instead of StringConverter if you want the data to be a JSON object rather than a String.
Also, you are missing transforms.unwrap.type.

Kafka HDFS Sink Connector with constant offset lag

I have a Kafka HDFS Sink Connector with a constant offset lag. The kafka_consumergroup_group_lag from Kafka Lag Exporter is illustrated in the following figure:
Note that the topic receives messages once a day, hence the spike. I would like the offset lag to go to zero, but as seen, the offset lag stabilizes at a value of ~833. How can I configure the connector to reach an offset lag of zero?
The connector configuration is given below
{
"name": "my_connector",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"retries": "2147483647",
"topics": "my_kafka_topic",
"format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "86400000",
"path.format": "'date_id'=YYYYMMdd",
"timezone": "UTC",
"locale": "en-US",
"timestamp.extractor": "RecordField",
"timestamp.field": "message_timestamp",
"compression.type": "snappy",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.retry.delay.max.ms": "60000",
"hadoop.conf.dir": "/var/run/configmaps/{{ stage }}",
"hdfs.url": "{{ hdfs_url }}",
"logs.dir": "{{ logs_dir }}",
"topics.dir": "my_hdfs_path",
"hdfs.authentication.kerberos": "true",
"hdfs.namenode.principal": "{{ hdfs_namenode_principal }}",
"connect.hdfs.principal": "{{ connect_hdfs_principal }}",
"connect.hdfs.keytab": "{{ connect_hdfs_keytab }}",
"flush.size": "600000",
"rotate.interval.ms": "1600000",
"transforms": "insertTS,formatTS",
"transforms.insertTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertTS.timestamp.field": "message_timestamp",
"transforms.formatTS.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.formatTS.format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
"transforms.formatTS.field": "message_timestamp",
"transforms.formatTS.target.type": "string"
}
}
For topics that receive records more frequently, the connectors have no problem reaching an offset lag of zero (or close to zero).
The configuration for this connector is identical:
{
"name": "my_other_connector",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"retries": "2147483647",
"topics": "my_other_topic",
"format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "86400000",
"path.format": "'date_id'=YYYYMMdd",
"timezone": "UTC",
"locale": "en-US",
"timestamp.extractor": "RecordField",
"timestamp.field": "message_timestamp",
"compression.type": "snappy",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.retry.delay.max.ms": "60000",
"hadoop.conf.dir": "/var/run/configmaps/{{ stage }}",
"hdfs.url": "{{ hdfs_url }}",
"logs.dir": "{{ logs_dir }}",
"topics.dir": "my_other_hdfs_location",
"hdfs.authentication.kerberos": "true",
"hdfs.namenode.principal": "{{ hdfs_namenode_principal }}",
"connect.hdfs.principal": "{{ connect_hdfs_principal }}",
"connect.hdfs.keytab": "{{ connect_hdfs_keytab }}",
"flush.size": "600000",
"rotate.interval.ms": "1600000",
"transforms": "insertTS,formatTS",
"transforms.insertTS.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertTS.timestamp.field": "message_timestamp",
"transforms.formatTS.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.formatTS.format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ",
"transforms.formatTS.field": "message_timestamp",
"transforms.formatTS.target.type": "string"
}
}
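A guess at the cause, based only on the configs above: with timestamp.extractor=RecordField and only rotate.interval.ms, an open file is not committed until a later record's timestamp pushes past the rotation window, so on a topic that only gets one burst per day the tail of that burst (here ~833 records, well below flush.size) can stay uncommitted until the next day's data arrives. If the HDFS sink supports scheduled, wall-clock rotation the same way the S3 sink does (an assumption worth checking against the connector's documentation), adding something like the following should let the lag drain to zero between bursts; the 30-minute value is arbitrary:
"rotate.schedule.interval.ms": "1800000"
This would work together with the timezone setting already present in the config.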

How to update Azure DevOps build definition source from Azure Repos to GitHub using the REST API

I am trying to change the source from Azure Repos Git to GitHub in an Azure DevOps build definition using the REST API.
This is the response I get for Azure Repos using the Azure DevOps build definitions REST API - GET https://dev.azure.com/{org_name}/{project_name}/_apis/build/definitions/{Build_Id}?api-version=6.0
"repository": {
"properties": {
"cleanOptions": "0",
"labelSources": "0",
"labelSourcesFormat": "$(build.buildNumber)",
"reportBuildStatus": "true",
"gitLfsSupport": "false",
"skipSyncSource": "false",
"checkoutNestedSubmodules": "false",
"fetchDepth": "0"
},
"id": "xxxx",
"type": "TfsGit",
"name": "{repo_name}",
"url": "https://dev.azure.com/{org_name}/{project_name}/_git/{repo_name}",
"defaultBranch": "refs/heads/master",
"clean": "false",
"checkoutSubmodules": false
},
If I manually change the source from Azure Repos to GitHub, this is the JSON response I get for the GitHub repo:
"repository": {
"properties": {
"apiUrl": "https://api.github.com/repos/{github_id}/{repo_name}",
"branchesUrl": "https://api.github.com/repos/{github_id}/{repo_name}/branches",
"cloneUrl": "https://github.com/{github_id}/{repo_name}.git",
"connectedServiceId": "xxxxxxx",
"defaultBranch": "master",
"fullName": "{github_id}/{repo_name}",
"hasAdminPermissions": "True",
"isFork": "False",
"isPrivate": "False",
"lastUpdated": "10/16/2019 17:28:29",
"manageUrl": "https://github.com/{github_id}/{repo_name}",
"nodeId": "xxxxxx",
"ownerId": "xxxxx",
"orgName": "{github_id}",
"refsUrl": "https://api.github.com/repos/{github_id}/pyapp/git/refs",
"safeRepository": "{github_id}/pyapp",
"shortName": "{repo_name}",
"ownerAvatarUrl": "https://avatars2.githubusercontent.com/u/xxxxx?v=4",
"archived": "False",
"externalId": "xxxxxx",
"ownerIsAUser": "True",
"checkoutNestedSubmodules": "false",
"cleanOptions": "0",
"fetchDepth": "0",
"gitLfsSupport": "false",
"reportBuildStatus": "true",
"skipSyncSource": "false",
"labelSourcesFormat": "$(build.buildNumber)",
"labelSources": "0"
},
"id": "{github_id}/{repo_name}",
"type": "GitHub",
"name": "{github_id}/{repo_name}",
"url": "https://github.com/{github_id}/{repo_name}.git",
"defaultBranch": "master",
"clean": "false",
"checkoutSubmodules": false
I tried to change the Azure repo to GitHub using Postman, by copying the GitHub JSON response body into the request and calling PUT https://dev.azure.com/{org_name}/{project_name}/_apis/build/definitions/{Build_Id}?api-version=6.0
But this does not work.
How can I achieve this using a script or Postman? What am I missing here?
You could copy the content of the Get Build Definition API.
Here is my example:
URL:
PUT https://dev.azure.com/{OrganizationName}/{ProjectName}/_apis/build/definitions/{DefinitionID}?api-version=5.0-preview.6
Request Body sample:
{
"process": {
"phases": [
{
"steps": [
],
"name": "Phase 1",
"refName": "Phase_1",
"condition": "succeeded()",
"target": {
"executionOptions": {
"type": 0
},
"allowScriptsAuthAccessOption": false,
"type": 1
},
"jobAuthorizationScope": "projectCollection",
"jobCancelTimeoutInMinutes": 1
}
],
"type": 1
},
"repository": {
"properties": {
"cleanOptions": "0",
"labelSources": "0",
"labelSourcesFormat": "$(build.buildNumber)",
"reportBuildStatus": "true",
"gitLfsSupport": "false",
"skipSyncSource": "false",
"checkoutNestedSubmodules": "false",
"fetchDepth": "0"
},
"id": "{github_id}/{repo_name}",
"type": "GitHub",
"name": "{github_id}/{repo_name}",
"url": "https://github.com/{github_id}/{repo_name}.git",
"defaultBranch": "master",
"clean": "false",
"checkoutSubmodules": false
},
"id": {DefinitionID},
"revision": {revisionID},
"name": "definitionCreatedByRESTAPI",
"type": "build",
"queueStatus": "enabled"
}
In the Request Body, there are the following key points:
The process field is required. You could copy its content from the Get Build Definition REST API response.
The "id": {DefinitionID} is required.
"revision": {revisionID} - you need to input the valid revision. This is very important.
To get the correct revision, navigate to Azure Pipelines -> Target Build Definition -> History.
Count how many update records there are; the correct revision is that total plus 1.
For example, with 9 update records the correct revision is 10 (9+1=10).
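For the repository block itself, here is a minimal sketch of what to swap in when pointing the definition at GitHub, built from the GET response shown earlier; the connectedServiceId placeholder is an assumption and should be the ID of an existing GitHub service connection in the project, since that is what links the definition to GitHub:
"repository": {
    "properties": {
        "connectedServiceId": "{github_service_connection_id}",
        "fullName": "{github_id}/{repo_name}",
        "defaultBranch": "master",
        "reportBuildStatus": "true"
    },
    "id": "{github_id}/{repo_name}",
    "type": "GitHub",
    "name": "{github_id}/{repo_name}",
    "url": "https://github.com/{github_id}/{repo_name}.git",
    "defaultBranch": "master",
    "clean": "false",
    "checkoutSubmodules": false
}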

Kafka sink to MongoDB: how do I set the "_id" field to an existing value in one of the columns in my topic?

I have the following topic (JSON, not Avro) generated by Debezium:
"payload": {"id": 1, "name": "test", "uuid": "f9a96ea4-3ff9-480f-bf8a-ee53a1e6e583"}
How do I set the "_id" field (in the Mongo collection) to the same value as "uuid"?
This is my SINK config:
{
"name": "mongo-sink",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max": 3,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"topics": "s4farm.animal",
"connection.uri": "mongodb://user:password#host:port/?authSource=database",
"database": "database",
"collection": "s4farm_animal",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "id",
"value.projection.type": "whitelist",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy"
}
}
Can you help me?
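A sketch of one way to do this with the strategies already in the config (the projection change is the assumption here): with PartialValueStrategy, the fields listed in value.projection.list are what end up in _id, so projecting uuid instead of id should carry that value over:
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "uuid",
"value.projection.type": "whitelist"
Note that, as far as I understand it, PartialValueStrategy builds _id as a sub-document containing the projected field(s); if _id needs to be the bare uuid string, the connector's ProvidedInValueStrategy (which expects an _id field in the value, e.g. produced by a rename SMT) may be a better fit.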