Kafka Connect Transformations - RegexRouter replacement topic names in lowercase - apache-kafka

We are trying to set up a connector (Debezium) in Kafka Connect and transform all the topic names generated by this connector via regular expressions. The regex below works and matches the patterns we want, but we also need all the topic names to be created in lowercase.
We have tried to put this in the replacement expression as \L$1, but it just prints an L in front of our topic names, for example LOutbound.Policy instead of outbound.policy.
Does anybody know how to do this? Thanks in advance!
This is the connector curl command
curl -i -X PUT http://kafka-alpha-cp-kafka-connect:8083/connectors/kafka-bi-datacontract/config -H "Content-Type: application/json" -d '{
"name": "kafka-bi-datacontract",
"connector.class" : "io.debezium.connector.sqlserver.SqlServerConnector",
"database.hostname" : "ukdb3232123",
"database.server.name" : "ukdb3232123\\perf",
"database.port" : "12442",
"database.user" : "KafkaConnect-BI",
"database.password" : "*******",
"database.dbname" : "BeazleyIntelligenceDataContract",
"snapshot.lock.timeout.ms" : "10000000",
"table.whitelist" : "Outbound.Policy,Outbound.Section",
"database.history.kafka.bootstrap.servers" : "kafka-alpha-cp-kafka-headless:9092",
"database.history.kafka.topic": "schema-changes.bidatacontract",
"transforms": "dropTopicPrefix",
"transforms.dropTopicPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropTopicPrefix.regex":"^[^.]+.(.*)",
"transforms.dropTopicPrefix.replacement":"\\L$1"
}'

In the Java regex engine that RegexRouter uses, \L is not the Perl-style case-conversion operator; the backslash just escapes the L, so \L$1 or \\L$1 is the same as L$1.
You would need to create/find your own transform for lowercasing.
Once you do, you can do something like this:
"transforms": "dropTopicPrefix,lowertopic",
"transforms.dropTopicPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropTopicPrefix.regex":"^[^.]+.(.*)",
"transforms.dropTopicPrefix.replacement":"$1",
"transforms.lowerTopic.type":"com.example.foo.LowerCase$Topic",

Related

Kafka REST API source connector with authentication header

I need to create a Kafka source connector for a REST API with header authentication, like:
curl -H "Authorization: Basic " -H "clientID: " "https://<url for source>"
I am using Apache Kafka, and I used the connector class com.github.castorm.kafka.connect.http.HttpSourceConnector.
Here is my JSON file for the connector:
{
"name": "rest_data6",
"config": {
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"true",
"value.converter.schemas.enable":"true",
"connector.class": "com.github.castorm.kafka.connect.http.HttpSourceConnector",
"tasks.max": "1",
"http.request.headers": "Authorization: Basic <key1>",
"http.request.headers": "clientID: <key>",
"http.request.url": "https:<url for source ?",
"kafka.topic": "mysqltopic2"
}
}
I also tried with "connector.class": "com.tm.kafka.connect.rest.RestSourceConnector"; my JSON file is as below:
{
"name": "rest_data2",
"config": {
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"true",
"value.converter.schemas.enable":"true",
"connector.class": "com.tm.kafka.connect.rest.RestSourceConnector",
"rest.source.poll.interval.ms": "900",
"rest.source.method": "GET",
"rest.source.url":"URL of source ",
"tasks.max": "1",
"rest.source.headers": "Authorization: Basic <key> , clientId :<key2>",
"rest.source.topic.selector": "com.tm.kafka.connect.rest.selector.SimpleTopicSelector",
"rest.source.destination.topics": "mysql1"
}
}
But no luck. Any idea how to GET REST API data with authentication? My authentication parameters are the Authorization: Basic and clientID headers shown above.
Just to mention: both files work with the REST API without authentication. Once I add the authentication parameters, either the connector status goes to failed or it produces a "Cannot route. Codebase/company is invalid" message in the topic.
Can anyone suggest a way to solve this?
I emailed the original developer, Cástor Rodríguez. As per his solution, I modified my JSON.
Putting both headers into a single http.request.headers property works (a JSON object cannot contain the same key twice, so the second entry was overriding the first):
{
"name": "rest_data6",
"config": {
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"true",
"value.converter.schemas.enable":"true",
"connector.class": "com.github.castorm.kafka.connect.http.HttpSourceConnector",
"tasks.max": "1",
"http.request.headers": "Authorization: Basic <key1>, clientID: <key>"
"http.request.url": "https:<url for source ?",
"kafka.topic": "mysqltopic2"
}
}

Kafka Connect issue when reading from a RabbitMQ queue

I'm trying to read data into my topic from a RabbitMQ queue using the Kafka connector with the configuration below:
{
"name" : "RabbitMQSourceConnector1",
"config" : {
"connector.class" : "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max" : "1",
"kafka.topic" : "rabbitmqtest3",
"rabbitmq.queue" : "taskqueue",
"rabbitmq.host" : "localhost",
"rabbitmq.username" : "guest",
"rabbitmq.password" : "guest",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true"
}
}
But I'm having trouble converting the source stream to JSON format, as I'm losing the original message.
Original:
{'id': 0, 'body': '010101010101010101010101010101010101010101010101010101010101010101010'}
Received:
{"schema":{"type":"bytes","optional":false},"payload":"eyJpZCI6IDEsICJib2R5IjogIjAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMCJ9"}
Does anyone have an idea why this is happening?
EDIT: I tried to convert the message to a String using "value.converter": "org.apache.kafka.connect.storage.StringConverter", but the result is the same:
11/27/19 4:07:37 PM CET , 0 , [B@1583a488
EDIT2:
I'm now receiving the JSON file, but the content is still encoded in Base64.
Any idea on how to convert it back to UTF-8 directly?
{
"name": "adls-gen2-sink",
"config": {
"connector.class":"io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector",
"tasks.max":"1",
"topics":"rabbitmqtest3",
"flush.size":"3",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
"value.converter":"org.apache.kafka.connect.converters.ByteArrayConverter",
"internal.value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"topics.dir":"sw66jsoningest",
"confluent.topic.bootstrap.servers":"localhost:9092",
"confluent.topic.replication.factor":"1",
"partitioner.class" : "io.confluent.connect.storage.partitioner.DefaultPartitioner"
}
}
UPDATE:
I got the solution, considering this flow:
Message (JSON) --> RabbitMQ (ByteArray) --> Kafka (ByteArray) --> ADLS (JSON)
I used this converter on the RabbitMQ-to-Kafka connector, so the message bytes are passed through untouched instead of being Base64-encoded inside a JSON envelope:
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
Afterwards, in the sink connector, I treated the message as a String and saved it as JSON:
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
Many thanks!
If you set "schemas.enable": "false" on the converter, you shouldn't get the schema and payload wrapper fields (see the example below).
If you want no translation to happen at all, use ByteArrayConverter.
If your data is just a plain string (which includes JSON), use StringConverter.
It's not clear how you're printing the resulting message, but it looks like you're printing the raw byte-array object rather than decoding it to a String.
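For example, a sketch of the source-connector settings that keep JSON records but drop the schema/payload envelope (this assumes the payload really is valid JSON):
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"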

How to convert Livy curl call to Livy Rest API call

I am getting started with Livy. In my setup, the Livy server is running on a Unix machine and I am able to curl to it and execute jobs. I have created a fat jar, uploaded it to HDFS, and I am simply calling its main method from Livy. My JSON payload for Livy looks like below:
{
"file" : "hdfs:///user/data/restcheck/spark_job_2.11-3.0.0-RC1-SNAPSHOT.jar",
"proxyUser" : "test_user",
"className" : "com.local.test.spark.pipeline.path.LivyTest",
"files" : ["hdfs:///user/data/restcheck/hivesite.xml","hdfs:///user/data/restcheck/log4j.properties"],
"driverMemory" : "5G",
"executorMemory" : "10G",
"executorCores" : 5,
"numExecutors" : 10,
"queue" : "user.queue",
"name" : "LivySampleTest2",
"conf" : {
"spark.master" : "yarn",
"spark.executor.extraClassPath" : "/etc/hbase/conf/",
"spark.executor.extraJavaOptions" : "-Dlog4j.configuration=file:log4j.properties",
"spark.driver.extraJavaOptions" : "-Dlog4j.configuration=file:log4j.properties",
"spark.ui.port" : 4100,
"spark.port.maxRetries" : 100,
"JAVA_HOME" : "/usr/java/jdk1.8.0_60",
"HADOOP_CONF_DIR" : "/etc/hadoop/conf:/etc/hive/conf:/etc/hbase/conf",
"HIVE_CONF_DIR" : "/etc/hive/conf"
}
}
and below is my curl call to it:
curl -X POST --negotiate -u:"test_user" --data @/user/data/Livy/SampleFile.json -H "Content-Type: application/json" https://livyhost:8998/batches
I am trying to convert this to a REST API call, following the WordCount example provided by Cloudera, but I am not able to convert my curl call into a REST API call. I have all the jars already added in HDFS, so I don't think I need to do the upload-jar call.
It should work with curl as well.
Please try the JSON below:
curl -H "Content-Type: application/json" https://livyhost:8998/batches
-X POST --data '{
"name" : "LivyREST",
"className" : "com.local.test.spark.pipeline.path.LivyTest",
"file" : "/user/data/restcheck/spark_job_2.11-3.0.0-RC1-
SNAPSHOT.jar"
}'
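If by "REST API call" you mean making the request from code rather than from curl, here is a minimal sketch in plain Java. Note that it skips the Kerberos/SPNEGO negotiation that your --negotiate flag performs, so it assumes an unsecured Livy endpoint:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LivyBatchSubmit {
    public static void main(String[] args) throws Exception {
        // same payload as the curl example above
        String payload = "{"
                + "\"name\": \"LivyREST\","
                + "\"className\": \"com.local.test.spark.pipeline.path.LivyTest\","
                + "\"file\": \"/user/data/restcheck/spark_job_2.11-3.0.0-RC1-SNAPSHOT.jar\""
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://livyhost:8998/batches").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }

        // Livy returns 201 Created when the batch is accepted
        System.out.println("HTTP " + conn.getResponseCode());
    }
}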
Also, here are some more references:
http://gethue.com/how-to-use-the-livy-spark-rest-job-server-api-for-submitting-batch-jar-python-and-streaming-spark-jobs/

Unable to create kafka connector using REST API

I am trying to run Kafka Connect workers in distributed mode. Unlike standalone mode, we cannot pass a connector properties file while starting the worker in distributed mode; workers are started separately, and we deploy and manage the connectors on those workers using the REST API.
Reference Link - https://docs.confluent.io/current/connect/managing/configuring.html#connect-managing-distributed-mode
I tried building a connector by passing the below values in a curl command and executed it:
curl -X POST -H "Content-Type: application/json" --data '{"name":"sailpointdb","connector.class":"io.confluent.connect.jdbc.JdbcSourceConnector","tasks.max":"1","connection.password " : " abc","connection.url " : "jdbc:mysql://localhost:3306/db","connection.user " : "abc" ,"query" : " SELECT * FROM (SELECT NAME, FROM_UNIXTIME(completed/1000) AS TASKFAILEDON FROM abc WHERE COMPLETION_STATUS = 'Error') as A","mode" : " timestamp","timestamp.column.name" : "TASKFAILEDON","topic.prefix" : "dbevents","validate.non.null" : "false" }}' http://localhost:8089/connectors/
I am getting the below error: curl: (3) URL using bad/illegal format or missing URL
Please let me know what is wrong with the above curl statement. Am I missing anything here?
You have an extra closing curly brace at the end of your JSON, which won't help, and several of your key names contain stray spaces (e.g. "connection.password "), so the intended properties are never actually set.
If you're POSTing to /connectors you need the name and config root-level elements. But I recommend using PUT /connectors/<name>/config instead, because you can re-run it to update the config if you need to.
Try this:
curl -X PUT -H "Content-Type:application/json" \
http://localhost:8089/connectors/source-jdbc-sailpointdb-00/config \
-d '{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.password ": " abc",
"connection.url ": "jdbc:mysql://localhost:3306/db",
"connection.user ": "abc",
"query": " SELECT * FROM (SELECT NAME, FROM_UNIXTIME(completed/1000) AS TASKFAILEDON FROM abc WHERE COMPLETION_STATUS = 'Error') as A",
"mode": " timestamp",
"timestamp.column.name": "TASKFAILEDON",
"topic.prefix": "dbevents",
"validate.non.null": "false"
}'
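Once the PUT returns successfully, you can check that the connector and its task are running via the status endpoint:
curl http://localhost:8089/connectors/source-jdbc-sailpointdb-00/status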

get task id's from kafka connect API to print in logs

I have Kafka Connect sink code for which the below JSON is passed in a curl command to register the tasks.
Please let me know if anyone has an idea of how to get the task IDs of my connector. For example, below we have defined tasks.max as 3, so I need to know
the names of the 3 tasks for logging, i.e. I need to know which line of my log belongs to which task.
In the below example, I know I have 3 tasks - TestCheck-1, TestCheck-2 and TestCheck-3 - based on the Kafka Connect logs. I want to know how to get the task names so that I can print them in my Kafka Connect log lines.
{
"name": "TestCheck",
"config": {
"topics": "topic1",
"connector.class": "ApplicationSinkTask Class package",
"tasks.max": "3",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"connector.url": "jdbc connection url",
"driver.name": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"username": "myusername",
"password": "mypassword",
"table.name": "test_table",
"database.name": "test",
}
}
When I register it, I get the below details:
curl -X POST -H "Content-Type: application/json" --data @myjson.json http://service:8082/connectors
{"name":"TestCheck","config":{"topics":"topic1","connector.class":"ApplicationSinkTask Class package","tasks.max":"3","key.converter":"org.apache.kafka.connect.storage.StringConverter","value.converter":"org.apache.kafka.connect.storage.StringConverter","connector.url":"jdbc:sqlserver://datahubprod.database.windows.net:1433;","driver.name":"jdbc connection url","username":"myuser","password":"mypassword","table.name":"test_table","database.name":"test","name":"TestCheck"},"tasks":[{"connector":"TestCheck","task":0},{"connector":"TestCheck","task":1},{"connector":"TestCheck","task":2}],"type":null}
You can manage the connectors with the Kafka Connect REST API. There's a whole heap of commands, which you can find here.
The example given in the above link shows you can retrieve all the tasks for a given connector using the command:
$ curl localhost:8083/connectors/local-file-sink/tasks
[
{
"id": {
"connector": "local-file-sink",
"task": 0
},
"config": {
"task.class": "org.apache.kafka.connect.file.FileStreamSinkTask",
"topics": "connect-test",
"file": "test.sink.txt"
}
}
]
You can use a language of your choice to send the request and load the JSON response into a variable/dictionary for further use, such as printing to a log. Here's a very simple example using Python which fetches the task list and prints each task ID:
import requests

# fetch the task list for a given connector
tasks_url = 'http://localhost:8083/connectors/local-file-sink/tasks'
tasks = requests.get(tasks_url).json()

# each entry is a dictionary; the task ID lives under the 'id' key
for task in tasks:
    print(task['id']['connector'], task['id']['task'])
Each element of the parsed response is a dictionary, so you can access the connector name and task number directly and include them in your log lines.
I hope this helps!