Cassandra Sink Connector for Confluent Platform - apache-kafka

I am trying to run the Cassandra sink connector for Confluent Platform. The cassandra-sink.json file is as below:
{
  "name" : "cassandra-sink",
  "config" : {
    "connector.class" : "io.confluent.connect.cassandra.CassandraSinkConnector",
    "tasks.max" : "1",
    "topics" : "topic1",
    "cassandra.contact.points" : "127.0.0.1",
    "cassandra.keyspace" : "test",
    "confluent.topic.bootstrap.servers": "127.0.0.1:9092",
    "cassandra.write.mode" : "Update",
    "connect.cassandra.port":"127.0.0.1:9042"
  }
}
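(For reference, a file in this name/config shape is registered through the Connect worker's REST API; a minimal sketch, assuming the default worker port 8083 and that the file is saved as cassandra-sink.json:)
curl -X POST -H "Content-Type: application/json" --data @cassandra-sink.json http://localhost:8083/connectors
curl http://localhost:8083/connectors/cassandra-sink/status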
I downloaded the connector with confluent-hub install confluentinc/kafka-connect-cassandra:latest, as per the link.
I am able to load the file, but when I check the status I get the error below. I am unable to figure out what the issue is.
FAILED worker_id:127.0.0.1:8083,trace:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed
com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1:9042] Cannot connect
com.datastax.driver.core.ControlConnection.reconnectInternal
com.datastax.driver.core.ControlConnection.connect
com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect
com.datastax.driver.core.Cluster$Manager.init
com.datastax.driver.core.Cluster.init
com.datastax.driver.core.SessionManager.initAsync
com.datastax.driver.core.SessionManager.executeAsync
com.datastax.driver.core.AbstractSession.execute
io.confluent.connect.cassandra.CassandraSessionImpl.executeStatement
io.confluent.connect.cassandra.CassandraSinkConnector.doStart
io.confluent.connect.cassandra.CassandraSinkConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.doStart
org.apache.kafka.connect.runtime.WorkerConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.transitionTo
org.apache.kafka.connect.runtime.Worker.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1300
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
java.util.concurrent.FutureTask.run
java.util.concurrent.ThreadPoolExecutor.runWorker
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run
Please guide.
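A TransportException of the form [/127.0.0.1:9042] Cannot connect generally means nothing was reachable on that host and port from the Connect worker. A quick generic connectivity check, assuming nc or cqlsh is available on the worker host:
nc -vz 127.0.0.1 9042
cqlsh 127.0.0.1 9042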

Related

Error With RowKey Definition on Confluent BigTable Sink Connector

I'm trying to use the BigTable Sink Connector from Confluent to read data from Kafka and write it into my BigTable instance, but I'm receiving the following error message:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:614)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:329)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.ConnectException: Error with RowKey definition: Row key definition was defined, but received, deserialized kafka key is not a struct. Unable to construct a row key.
at io.confluent.connect.bigtable.client.RowKeyExtractor.getRowKey(RowKeyExtractor.java:69)
at io.confluent.connect.bigtable.client.BufferedWriter.addWriteToBatch(BufferedWriter.java:84)
at io.confluent.connect.bigtable.client.InsertWriter.write(InsertWriter.java:47)
at io.confluent.connect.bigtable.BaseBigtableSinkTask.put(BaseBigtableSinkTask.java:99)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
... 10 more
Due to some technical limitations, the message producer cannot produce the messages with a key, so I'm using some transforms to take information from the payload and set it as the message key.
Here's my connector payload:
{
  "name" : "DATALAKE.BIGTABLE.SINK.QUEUEING.ZTXXD",
  "config" : {
    "connector.class" : "io.confluent.connect.gcp.bigtable.BigtableSinkConnector",
    "key.converter" : "org.apache.kafka.connect.storage.StringConverter",
    "value.converter" : "org.apache.kafka.connect.json.JsonConverter",
    "topics" : "APP-DATALAKE-QUEUEING-ZTXXD_DATALAKE-V1",
    "transforms" : "HoistField,AddKeys,ExtractKey",
    "gcp.bigtable.project.id" : "bigtable-project-id",
    "gcp.bigtable.instance.id" : "bigtable-instance-id",
    "gcp.bigtable.credentials.json" : "XXXXX",
    "transforms.ExtractKey.type" : "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.HoistField.field" : "raw_data_cf",
    "transforms.ExtractKey.field" : "KEY1,ATT1",
    "transforms.HoistField.type" : "org.apache.kafka.connect.transforms.HoistField$Value",
    "transforms.AddKeys.type" : "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.AddKeys.fields" : "KEY1,ATT1",
    "row.key.definition" : "KEY1,ATT1",
    "table.name.format" : "raw_ZTXXD_DATALAKE",
    "consumer.override.group.id" : "svc-datalake-KAFKA_2_BIGTABLE",
    "confluent.topic.bootstrap.servers" : "xxxxxx:9092",
    "input.data.format" : "JSON",
    "confluent.topic" : "_dsp-confluent-license",
    "input.key.format" : "STRING",
    "key.converter.schemas.enable" : "false",
    "confluent.topic.security.protocol" : "SASL_SSL",
    "row.key.delimiter" : "/",
    "confluent.topic.sasl.jaas.config" : "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"XXXXX\" password=\"XXXXXX\";",
    "value.converter.schemas.enable" : "false",
    "auto.create.tables" : "true",
    "auto.create.column.families" : "true",
    "confluent.topic.sasl.mechanism" : "PLAIN"
  }
}
And here's my message produced to Kafka:
{
  "MANDT": "110",
  "KEY1": "1",
  "KEY2": null,
  "ATT1": "1M",
  "ATT2": "0000000000",
  "TABLE_NAME": "ZTXXD_DATALAKE",
  "IUUC_OPERATION": "I",
  "CREATETIMESTAMP": "2022-01-24T20:26:45.247Z"
}
In my transforms I'm doing three operations:
HoistField puts my payload inside a two-level structure (the Connect docs for BigTable say that Connect expects a two-level structure in order to be able to infer the column families).
AddKeys adds the fields that I consider the key to the message key.
ExtractKey removes the field names from the key added in the previous step, leaving only the values themselves.
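As a rough sketch of the intent (not captured connector output), applied to the example message above: after HoistField (HoistField$Value with field raw_data_cf) the value is wrapped one level down,
{ "raw_data_cf": { "MANDT": "110", "KEY1": "1", "KEY2": null, "ATT1": "1M", "ATT2": "0000000000", "TABLE_NAME": "ZTXXD_DATALAKE", "IUUC_OPERATION": "I", "CREATETIMESTAMP": "2022-01-24T20:26:45.247Z" } }
and the key I want to end up with, built from KEY1 and ATT1, would be:
{ "KEY1": "1", "ATT1": "1M" }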
I've been reading the documentation for this connector for Bigtable and it's not clear to me if the connector works well with the JSON format. Could you let me know?
JSON should work, but...
deserialized kafka key is not a struct
This is because you have set the schemas.enable=false property on the value converter, such that when you do ValueToKey, it's not a Connect Struct type; the HoistField makes a Java Map instead.
If you're not able to use the Schema Registry and switch the serialization format, then you'll need to try and find a way to get the REST Proxy to infer the schema of the JSON message before it produces the data (I don't think it can). Otherwise, your records need to include schema and payload fields, and you need to enable schemas on the converters. Explained here
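For reference, the envelope that JsonConverter expects when schemas are enabled looks roughly like this (a sketch based on the example message above, trimmed to two fields; the field types are assumptions):
{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "KEY1", "type": "string", "optional": true },
      { "field": "ATT1", "type": "string", "optional": true }
    ],
    "optional": false
  },
  "payload": {
    "KEY1": "1",
    "ATT1": "1M"
  }
}
With records shaped like that, and the converters' schemas.enable set back to true, the converter produces a Connect Struct that the row-key extractor can work with.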
Another option: there may be a transform project around that sets the schema of the record, but it's not built in (it's not part of SetSchemaMetadata).

Spark Job SUBMITTED but not RUNNING after submit via REST API

Following the instructions on this website, I'm trying to submit a job to Spark via the REST API /v1/submissions.
I tried to submit SparkPi in the example:
$ ./create.sh
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20211212044718-0003",
  "serverSparkVersion" : "3.1.2",
  "submissionId" : "driver-20211212044718-0003",
  "success" : true
}
$ ./status.sh driver-20211212044718-0003
{
  "action" : "SubmissionStatusResponse",
  "driverState" : "SUBMITTED",
  "serverSparkVersion" : "3.1.2",
  "submissionId" : "driver-20211212044718-0003",
  "success" : true
}
create.sh:
curl -X POST http://172.17.197.143:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "/home/ruc/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar",
  "sparkProperties": {
    "spark.master": "spark://172.17.197.143:7077",
    "spark.driver.memory": "1g",
    "spark.driver.cores": "1",
    "spark.app.name": "REST API - PI",
    "spark.jars": "/home/ruc/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar",
    "spark.driver.supervise": "true"
  },
  "clientSparkVersion": "3.1.2",
  "mainClass": "org.apache.spark.examples.SparkPi",
  "action": "CreateSubmissionRequest",
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "appArgs": [
    "400"
  ]
}'
status.sh:
export DRIVER_ID=$1
curl http://172.17.197.143:6066/v1/submissions/status/$DRIVER_ID
But when I try to get the status of the job (even after a few minutes), I get "SUBMITTED" rather than "RUNNING" or "FINISHED".
Then I looked at the log and found:
21/12/12 04:47:18 INFO master.Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
21/12/12 04:47:18 WARN master.Master: Driver driver-20211212044718-0003 requires more resource than any of Workers could have.
# ...
21/12/12 04:49:02 WARN master.Master: Driver driver-20211212044718-0003 requires more resource than any of Workers could have.
However, in my spark-env.sh, I have
export SPARK_WORKER_MEMORY=10g
export SPARK_WORKER_CORES=2
I have no idea what happened. How can I make it run normally?
Since you've checked the resources and you have enough, it might be a network issue: the executor may not be able to connect back to the driver program. Allow traffic on both the master and the workers.
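One way to sanity-check both points, as a generic sketch (assuming the standalone master web UI is on its default port 8080):
# List registered workers with their cores/memory as seen by the master
curl -s http://172.17.197.143:8080/json/
# Confirm the master's cluster port is reachable from each worker host
nc -vz 172.17.197.143 7077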

Setting kafka mirroring using Brooklin

I am trying to test out Brooklin for mirroring data between Kafka clusters. I am following the wiki https://github.com/linkedin/brooklin/wiki/mirroring-kafka-clusters
Unlike the wiki, I am trying to set up mirroring between two different clusters. I am able to start the Brooklin process and the Datastream, but I cannot manage to mirror messages. Brooklin is currently running on the source Kafka cluster. I am trying to mirror the topic 'test'.
The server.properties for brooklin is
############################# Server Basics #############################
brooklin.server.coordinator.cluster=brooklin-cluster
brooklin.server.coordinator.zkAddress=localhost:2181
brooklin.server.httpPort=32311
brooklin.server.connectorNames=file,test,kafkaMirroringConnector
brooklin.server.transportProviderNames=kafkaTransportProvider
brooklin.server.csvMetricsDir=/tmp/brooklin-example/
########################### Transport provider configs ######################
brooklin.server.transportProvider.kafkaTransportProvider.factoryClassName=com.linkedin.datastream.kafka.KafkaTransportProviderAdminFactory
brooklin.server.transportProvider.kafkaTransportProvider.bootstrap.servers=kafka-dest:9092
brooklin.server.transportProvider.kafkaTransportProvider.zookeeper.connect=kafka-dest:2181
brooklin.server.transportProvider.kafkaTransportProvider.client.id=datastream-producer
########################### File connector Configs ######################
brooklin.server.connector.file.factoryClassName=com.linkedin.datastream.connectors.file.FileConnectorFactory
brooklin.server.connector.file.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.BroadcastStrategyFactory
brooklin.server.connector.file.strategy.maxTasks=1
########################### Test event producing connector Configs ######################
brooklin.server.connector.test.factoryClassName=com.linkedin.datastream.connectors.TestEventProducingConnectorFactory
brooklin.server.connector.test.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.LoadbalancingStrategyFactory
brooklin.server.connector.test.strategy.TasksPerDatastream = 4
########################### Kafka Mirroring connector Configs ######################
brooklin.server.connector.kafkaMirroringConnector.factoryClassName=com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorFactory
brooklin.server.connector.kafkaMirroringConnector.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.BroadcastStrategyFactory
I then try to start the following Datastream:
bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-mirroring-stream -s "kafka://localhost:9092/test" -c kafkaMirroringConnector -t kafkaTransportProvider -m '{"owner":"root","system.reuseExistingDestination":"false"}' 2>/dev/null
Trying to check the Datastream:
bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/ 2>/dev/null
[2020-10-14 05:55:45,087] INFO Creating RestClient for http://localhost:32311/ with {}, count=1 (com.linkedin.datastream.DatastreamRestClientFactory)
[2020-10-14 05:55:45,113] INFO The service 'null' has been assigned to the ChannelPoolManager with key 'noSpecifiedNamePrefix 1138266797 ' (com.linkedin.r2.transport.http.client.HttpClientFactory)
[2020-10-14 05:55:45,215] INFO DatastreamRestClient created with retryPeriodMs=6000 retryTimeoutMs=90000 (com.linkedin.datastream.DatastreamRestClient)
[2020-10-14 05:55:45,502] INFO getAllDatastreams took 272 ms (com.linkedin.datastream.DatastreamRestClient)
{
  "name" : "first-mirroring-stream",
  "connectorName" : "kafkaMirroringConnector",
  "transportProviderName" : "kafkaTransportProvider",
  "source" : {
    "connectionString" : "kafka://localhost:9092/test"
  },
  "Status" : "READY",
  "destination" : {
    "connectionString" : "kafka://kafka-dest:9092/*"
  },
  "metadata" : {
    "datastreamUUID" : "df081002-fc7b-4f3a-b1ce-016e879d4b29",
    "group.id" : "first-mirroring-stream",
    "owner" : "root",
    "system.IsConnectorManagedDestination" : "true",
    "system.creation.ms" : "1602665999603",
    "system.destination.KafkaBrokers" : "kafka-dest:9092",
    "system.reuseExistingDestination" : "false",
    "system.taskPrefix" : "first-mirroring-stream"
  }
}
After this is running I try to produce on the source and consume on the destination but I do not get any mirroring.
Does anyone have a clue what I'm missing/what I did wrong?
Thanks!
This was an issue on my end - I had a typo in the topic name configured for mirroring.
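For anyone checking the same thing: the topic embedded in the datastream source (kafka://localhost:9092/test) has to match an existing topic on the source cluster exactly. A quick way to compare, assuming the standard Kafka CLI tools are available on the broker host:
# List topics on the source cluster and confirm the spelling used in the datastream
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list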

Kafka Connect issue when reading from a RabbitMQ queue

I'm trying to read data into my topic from a RabbitMQ queue using the Kafka connector with the configuration below:
{
  "name" : "RabbitMQSourceConnector1",
  "config" : {
    "connector.class" : "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max" : "1",
    "kafka.topic" : "rabbitmqtest3",
    "rabbitmq.queue" : "taskqueue",
    "rabbitmq.host" : "localhost",
    "rabbitmq.username" : "guest",
    "rabbitmq.password" : "guest",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true"
  }
}
But I'm having trouble converting the source stream to JSON format, as I'm losing the original message.
Original:
{'id': 0, 'body': '010101010101010101010101010101010101010101010101010101010101010101010'}
Received:
{"schema":{"type":"bytes","optional":false},"payload":"eyJpZCI6IDEsICJib2R5IjogIjAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMCJ9"}
Does anyone have an idea why this is happening?
EDIT: I tried to convert the message to String using the "value.converter": "org.apache.kafka.connect.storage.StringConverter", but the result is the same:
11/27/19 4:07:37 PM CET , 0 , [B#1583a488
EDIT2:
I'm now receiving the JSON file, but the content is still encoded in Base64.
Any idea how to convert it back to UTF-8 directly?
{
  "name": "adls-gen2-sink",
  "config": {
    "connector.class":"io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector",
    "tasks.max":"1",
    "topics":"rabbitmqtest3",
    "flush.size":"3",
    "format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
    "value.converter":"org.apache.kafka.connect.converters.ByteArrayConverter",
    "internal.value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "topics.dir":"sw66jsoningest",
    "confluent.topic.bootstrap.servers":"localhost:9092",
    "confluent.topic.replication.factor":"1",
    "partitioner.class" : "io.confluent.connect.storage.partitioner.DefaultPartitioner"
  }
}
UPDATE:
I got the solution, considering this flow:
Message (JSON) --> RabbitMq (ByteArray) --> Kafka (ByteArray) -->ADLS (JSON)
I used this converter on the RabbitMQ-to-Kafka connector so the message bytes are written to Kafka as-is, instead of being wrapped and Base64-encoded:
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
Afterwards I treated the message as a string and saved it as JSON:
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
Many thanks!
If you set "schemas.enable": "false", you shouldn't be getting the schema and payload fields.
If you want no translation to happen at all, use ByteArrayConverter
If your data is just a plain string (which includes JSON), use StringConverter
It's not clear how you're printing the resulting message, but looks like you're printing the byte array and not decoding it to a String
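As a quick check that the bytes themselves are intact: the Base64 payload in the received record is simply the produced JSON encoded as bytes, so decoding it brings the readable message back (a sketch using the payload shown above and the GNU coreutils base64 tool):
echo 'eyJpZCI6IDEsICJib2R5IjogIjAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMCJ9' | base64 --decode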

Spark REST API: Failed to find data source: com.databricks.spark.csv

I have a PySpark file stored on S3. I am trying to run it using the Spark REST API.
I am running the following command:
curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "testing.py"],
  "appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
  "clientSparkVersion" : "1.6.1",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "org.apache.spark.deploy.SparkSubmit",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://<ip-address>:6066",
    "spark.jars" : "spark-csv_2.10-1.4.0.jar",
    "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
  }
}'
and the testing.py file has a code snippet:
from pyspark.sql import SQLContext  # sc is the SparkContext created by spark-submit

myContext = SQLContext(sc)
format = "com.databricks.spark.csv"
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue","").save(outLocation)
But on this line:
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
I get exception:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
I tried different things, and one of them was logging into the ip-address machine and running this command:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?
(Posted on behalf of the OP).
I first added spark-csv_2.10-1.4.0.jar on the driver and worker machines, and added:
"spark.driver.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
Then I got the following error:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
And then I added commons-csv-1.4.jar on both machines and added:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
And that solved my problem.
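Putting it together, the sparkProperties block of the submission request then looks roughly like this (jar paths are placeholders, as above):
"sparkProperties" : {
  "spark.driver.supervise" : "false",
  "spark.app.name" : "Simple App",
  "spark.eventLog.enabled": "true",
  "spark.submit.deployMode" : "cluster",
  "spark.master" : "spark://<ip-address>:6066",
  "spark.jars" : "spark-csv_2.10-1.4.0.jar",
  "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0",
  "spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
  "spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar"
}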