Spring Cloud Stream Kafka application not generating messages with the correct Avro schema - apache-kafka

I have an application (spring-boot-shipping-service) with a KStream that gets OrderCreatedEvent messages generated by an external producer (spring-boot-order-service). This producer uses the following schema:
order-created-event.avsc
{
  "namespace" : "com.codependent.statetransfer.order",
  "type" : "record",
  "name" : "OrderCreatedEvent",
  "fields" : [
    {"name":"id","type":"int"},
    {"name":"productId","type":"int"},
    {"name":"customerId","type":"int"}
  ]
}
My KStream<Int, OrderCreatedEvent> is joined with a KTable<Int, Customer> and publishes to the order topic a new kind of message: OrderShippedEvent.
order-shipped-event.avsc
{
  "namespace" : "com.codependent.statetransfer.order",
  "type" : "record",
  "name" : "OrderShippedEvent",
  "fields" : [
    {"name":"id","type":"int"},
    {"name":"productId","type":"int"},
    {"name":"customerName","type":"string"},
    {"name":"customerAddress","type":"string"}
  ]
}
For some reason the new OrderShippedEvent messages aren't generated with a header application/vnd.ordershippedevent.v1+avro but application/vnd.ordercreatedevent.v1+avro.
This is the original OrderCreatedEvent in the order topic:
Key (4 bytes): +
Value (4 bytes): V?
Timestamp: 1555943926163
Partition: 0
Offset: 34
Headers: contentType="application/vnd.ordercreatedevent.v1+avro",spring_json_header_types={"contentType":"java.lang.String"}
And the produced OrderShippedEvent with the incorrect schema:
Key (4 bytes): +
Value (26 bytes): V?
JamesHill Street
Timestamp: 1555943926163
Partition: 0
Offset: 35
Headers: contentType="application/vnd.ordercreatedevent.v1+avro",spring_json_header_types={"contentType":"java.lang.String"}
I've checked the Confluent Schema Registry contents, and the order-shipped-event.avsc schema is registered there.
Why isn't it using the correct schema in the generated message?
Below you can see the full configuration and code of the example, which is also available on GitHub (https://github.com/codependent/event-carried-state-transfer/tree/avro).
In order to test it, just start a Confluent Platform (v5.2.1), spring-boot-customer-service, spring-boot-order-service and spring-boot-shipping-service, and execute the following curl commands:
curl -X POST http://localhost:8080/customers -d '{"id":1,"name":"James","address":"Hill Street"}' -H "content-type: application/json"
curl -X POST http://localhost:8084/orders -H "content-type: application/json" -d '{"id":1,"productId":1001,"customerId":1}'
application.yml
server:
  port: 8085
spring:
  application:
    name: spring-boot-shipping-service
  cloud:
    stream:
      kafka:
        streams:
          binder:
            configuration:
              default:
                key:
                  serde: org.apache.kafka.common.serialization.Serdes$IntegerSerde
      bindings:
        input:
          destination: customer
          contentType: application/*+avro
        order:
          destination: order
          contentType: application/*+avro
        output:
          destination: order
          contentType: application/*+avro
      schema-registry-client:
        endpoint: http://localhost:8081
ShippingKStreamProcessor
interface ShippingKStreamProcessor {

    @Input("input")
    fun input(): KStream<Int, Customer>

    @Input("order")
    fun order(): KStream<String, OrderCreatedEvent>

    @Output("output")
    fun output(): KStream<String, OrderShippedEvent>
}
ShippingKStreamConfiguration
@StreamListener
@SendTo("output")
fun process(@Input("input") input: KStream<Int, Customer>, @Input("order") orderEvent: KStream<Int, OrderCreatedEvent>): KStream<Int, OrderShippedEvent> {

    val serdeConfig = mapOf(
            AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG to "http://localhost:8081")

    val intSerde = Serdes.IntegerSerde()
    val customerSerde = SpecificAvroSerde<Customer>()
    customerSerde.configure(serdeConfig, true)
    val orderCreatedSerde = SpecificAvroSerde<OrderCreatedEvent>()
    orderCreatedSerde.configure(serdeConfig, true)
    val orderShippedSerde = SpecificAvroSerde<OrderShippedEvent>()
    orderShippedSerde.configure(serdeConfig, true)

    val stateStore: Materialized<Int, Customer, KeyValueStore<Bytes, ByteArray>> =
            Materialized.`as`<Int, Customer, KeyValueStore<Bytes, ByteArray>>("customer-store")
                    .withKeySerde(intSerde)
                    .withValueSerde(customerSerde)

    val customerTable: KTable<Int, Customer> = input.groupByKey(Serialized.with(intSerde, customerSerde))
            .reduce({ _, y -> y }, stateStore)

    return (orderEvent.filter { _, value -> value is OrderCreatedEvent && value.id != 0 }
            .selectKey { _, value -> value.customerId } as KStream<Int, OrderCreatedEvent>)
            .join(customerTable, { orderIt, customer ->
                OrderShippedEvent(orderIt.id, orderIt.productId, customer.name, customer.address)
            }, Joined.with(intSerde, orderCreatedSerde, customerSerde))
            .selectKey { _, value -> value.id }
}
UPDATE: I've set the logging level for org.springframework.messaging to TRACE and apparently it looks OK:
2019-04-22 23:40:39.953 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP GET http://localhost:8081/subjects/ordercreatedevent/versions/1
2019-04-22 23:40:39.971 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:39.972 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [] as "application/vnd.schemaregistry.v1+json"
2019-04-22 23:40:39.984 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:39.985 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.186 INFO 46039 --- [read-1-producer] org.apache.kafka.clients.Metadata : Cluster ID: 5Sw6sBD0TFOaximF3Or-dQ
2019-04-22 23:40:40.318 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Obtaining schema for class class com.codependent.statetransfer.order.OrderShippedEvent
2019-04-22 23:40:40.318 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Avro type detected, using schema from object
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP POST http://localhost:8081/subjects/ordershippedevent/versions
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [{"schema":"{\"type\":\"record\",\"name\":\"OrderShippedEvent\",\"namespace\":\"com.codependent.statetransfer.order\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"productId\",\"type\":\"int\"},{\"name\":\"customerName\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"customerAddress\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}]}"}] as "application/json"
2019-04-22 23:40:40.348 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:40.348 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP POST http://localhost:8081/subjects/ordershippedevent
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [{"schema":"{\"type\":\"record\",\"name\":\"OrderShippedEvent\",\"namespace\":\"com.codependent.statetransfer.order\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"productId\",\"type\":\"int\"},{\"name\":\"customerName\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"customerAddress\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}]}"}] as "application/json"
2019-04-22 23:40:40.361 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:40.362 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.362 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Finding correct DatumWriter for type com.codependent.statetransfer.order.OrderShippedEvent
How come the message is written with an incorrect content type header then?
UPDATE 2:
I've kept digging into the source code and found this:
KafkaStreamsMessageConversionDelegate correctly converts and determines the right header values, as seen in the logs above.
However, in the serializeOnOutbound method we can see that only the payload is returned to the Kafka API, so the headers aren't taken into account:
return messageConverter.toMessage(message.getPayload(),
        messageHeaders).getPayload();
Moving forward in the record processing org.apache.kafka.streams.processor.internals.SinkNode.process() accesses the headers present in the context, which incorrectly contain application/vnd.ordercreatedevent.v1+avro instead of application/vnd.ordershippedevent.v1+avro (?):
collector.send(topic, key, value, context.headers(), timestamp, keySerializer, valSerializer, partitioner);
UPDATE 3:
Steps to reproduce:
Download and start Confluent 5.2.1
confluent start
Start the applications spring-boot-order-service, spring-boot-customer-service, spring-boot-shipping-service
Create a customer curl -X POST http://localhost:8080/customers -d '{"id":1,"name":"John","address":"Some Street"}' -H "content-type: application/json"
Create an order that will be joined with the customer: curl -X POST http://localhost:8084/orders -H "content-type: application/json" -d '{"id":1,"productId":1,"customerId":1}'
ShippingKStreamConfiguration's process() will create a KTable for the Customer and a state store (customer-store). Besides, it will join the order stream with the customer KTable to transform an OrderCreatedEvent into an OrderShippedEvent.
You can check that the newly created OrderShippedEvent message added to the order topic has an incorrect header. This can be seen either in the Confluent Control Center (localhost:9092 -> topics -> order) or running kafkacat:
$> kafkacat -b localhost:9092 -t order -C \
-f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o
Headers: %h\n'

@codependent It is indeed an issue that we need to address in the binder, which we will fix soon. In the meantime, as a workaround, can you make your processor not return a KStream but rather do the sending in the method itself? You can call to(TopicNameExtractor) on the KStream you currently return. TopicNameExtractor gives you access to the record context, with which you can manually set the content type.
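For reference, a rough sketch of that workaround (not the binder's eventual fix) might look like the following in the question's Kotlin code. Here orderShippedStream is a stand-in for the joined stream that process() currently returns; the topic name, header value and serdes are taken from the question, so treat it purely as an illustration under those assumptions:

import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.processor.TopicNameExtractor

// Instead of returning the KStream from the @StreamListener method, send it
// explicitly and overwrite the stale contentType header that was carried over
// from the inbound OrderCreatedEvent record.
orderShippedStream.to(TopicNameExtractor<Int, OrderShippedEvent> { _, _, recordContext ->
    recordContext.headers()
        .remove("contentType")
        .add("contentType", "application/vnd.ordershippedevent.v1+avro".toByteArray())
    "order" // always route to the order topic
}, Produced.with(intSerde, orderShippedSerde))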

Related

kafka mongodb sink connector issue while writing to mongodb

I am facing an issue while writing to MongoDB using the MongoDB Kafka sink connector. I am using MongoDB v5.0.3 and Strimzi Kafka v2.8.0. I have added p1/mongo-kafka-connect-1.7.0-all.jar and p2/mongodb-driver-core-4.5.0.jar to the Connect cluster plugins path. I created the connector using the below:
{
  "name": "mongo-sink",
  "config": {
    "topics": "sinktest2",
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "tasks.max": "1",
    "connection.uri": "mongodb://mm-0.mongoservice.st.svc.cluster.local:27017,mm-1.mongoservice.st.svc.cluster.local:27017",
    "database": "sinkdb",
    "collection": "sinkcoll",
    "mongo.errors.tolerance": "all",
    "mongo.errors.log.enable": true,
    "errors.log.include.messages": true,
    "errors.deadletterqueue.topic.name": "sinktest2.deadletter",
    "errors.deadletterqueue.context.headers.enable": true
  }
}
root@ubuntuserver-0:/persistent# curl http://localhost:8083/connectors/mongo-sink/status
{"name":"mongo-sink","connector":{"state":"RUNNING","worker_id":"localhost:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"localhost:8083"}],"type":"sink"}
When I check the status after creating the connector it shows RUNNING, but when I start sending records to the Kafka topic the connector runs into issues. The connector status is shown below.
root@ubuntuserver-0:/persistent# curl http://localhost:8083/connectors/mongo-sink/status
{
  "name":"mongo-sink",
  "connector":{
    "state":"RUNNING",
    "worker_id":"localhost:8083"
  },
  "tasks":[
    {
      "id":0,
      "state":"FAILED",
      "worker_id":"localhost:8083",
"trace":"org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:206)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:473)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:328)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error: \n\tat org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:324)\n\tat org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertValue(WorkerSinkTask.java:540)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)\n\t... 
13 more\nCaused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])\"{ \"; line: 1, column: 1])\n at [Source: (byte[])\"{ \"; line: 1, column: 4]\nCaused by: com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])\"{ \"; line: 1, column: 1])\n at [Source: (byte[])\"{ \"; line: 1, column: 4]\n\tat com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:664)\n\tat com.fasterxml.jackson.core.base.ParserBase._handleEOF(ParserBase.java:486)\n\tat com.fasterxml.jackson.core.base.ParserBase._eofAsNextChar(ParserBase.java:498)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd2(UTF8StreamJsonParser.java:3033)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd(UTF8StreamJsonParser.java:3003)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextFieldName(UTF8StreamJsonParser.java:989)\n\tat com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:249)\n\tat com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68)\n\tat com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)\n\tat com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4270)\n\tat com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2734)\n\tat org.apache.kafka.connect.json.JsonDeserializer.deserialize(JsonDeserializer.java:64)\n\tat org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:322)\n\tat org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertValue(WorkerSinkTask.java:540)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:473)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:328)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"
    }
  ],
  "type":"sink"
}
I am writing a sample JSON record to the Kafka topic:
./kafka-console-producer.sh --topic sinktest2 --bootstrap-server sample-kafka-kafka-bootstrap:9093 --producer.config /persistent/client.txt < /persistent/emp.json
emp.json is the following file:
{
  "employee": {
    "name": "abc",
    "salary": 56000,
    "married": true
  }
}
I don't see any logs in the connector pod, and no database or collection is being created in MongoDB.
Please help resolve this issue. Thank you!
I think you are missing some configuration parameters, namely the converter and schema settings.
Update your config to add the following:
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
If you are using Kafka Connect on Kubernetes, you may create the sink connector as shown below. Create a file named something like mongo-sink-connector.yaml:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: mongodb-sink-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: com.mongodb.kafka.connect.MongoSinkConnector
  tasksMax: 2
  config:
    connection.uri: "mongodb://root:password@mongodb-0.mongodb-headless.default.svc.cluster.local:27017"
    database: test
    collection: sink
    topics: sink-topic
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: false
    value.converter.schemas.enable: false
Execute the command:
$ kubectl apply -f mongo-sink-connector.yaml
You should see the output:
kafkaconnector.kafka.strimzi.io/mongo-apps-sink-connector created
Before starting the producer, check the status of the connector and verify that the topic has been created, as follows:
Status:
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$ curl http://localhost:8083/connectors/mongodb-sink-connector/status
{"name":"mongodb-sink-connector","connector":{"state":"RUNNING","worker_id":"IP-ADDRESS:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"IP-ADDRESS:8083"},{"id":1,"state":"RUNNING","worker_id":"IP-ADDRESS:8083"}],"type":"sink"}
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$
Check topic creation; you should see sink-topic:
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$ bin/kafka-topics.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
connect-cluster-configs
connect-cluster-offsets
connect-cluster-status
sink-topic
Now, go to the Kafka server and run the producer:
[kafka@my-cluster-kafka-0 kafka]$ bin/kafka-console-producer.sh --broker-list my-cluster-kafka-bootstrap:9092 --topic sink-topic
Successful execution will show a > prompt where you can enter/test data:
>{"employee": {"name": "abc", "salary": 56000, "married": true}}
>
In another terminal, connect to the Kafka server and start a consumer to verify the data:
[kafka@my-cluster-kafka-0 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sink-topic --from-beginning
{"employee": {"name": "abc", "salary": 56000, "married": true}}
If you see this data, everything is working fine. Now let us check MongoDB. Connect to your MongoDB server and check:
rs0:PRIMARY> use test
switched to db test
rs0:PRIMARY> show collections
sink
rs0:PRIMARY> db.sink.find()
{ "_id" : ObjectId("6234a4a0dad1a2638f57a6b2"), "employee" : { "name" : "abc", "salary" : NumberLong(56000), "married" : true } }
Et voilà!
You're hitting a serialization exception. I'll break the message out a bit:
com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input:
expected close marker for Object (start marker at [Source: (byte[])"{ "; line: 1, column: 1])
at [Source: (byte[])"{ "; line: 1, column: 4]
Caused by: com.fasterxml.jackson.core.io.JsonEOFException:
Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])"{ "; line: 1, column: 1])
at [Source: (byte[])"{ "; line: 1, column: 4]
"expected close marker for Object" suggests to me that the parser is expecting to see the entire JSON object as one line, rather than pretty-printed.
{"employee": {"name": "abc", "salary": 56000, "married": true}}

TimeoutException when trying to run a Pulsar source connector

I'm trying to run a Pulsar DebeziumPostgresSource connector.
This is the command I'm running:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
source localrun \
--broker-service-url pulsar+ssl://my-pulsar-server:6651 \
--client-auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--client-auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
--source-config-file /pulsar/debezium-config/my-source-config.yaml
Here's the /pulsar/debezium-config/my-source-config.yaml file:
tenant: my-tenant
namespace: my-namespace
name: my-source
topicName: my-topic
archive: connectors/pulsar-io-debezium-postgres-2.6.0-SNAPSHOT.nar
parallelism: 1
configs:
  plugin.name: pgoutput
  database.hostname: my-db-server
  database.port: "5432"
  database.user: my-db-user
  database.password: my-db-password
  database.dbname: my-db
  database.server.name: my-db-server-name
  table.whitelist: my_schema.my_table
  pulsar.service.url: pulsar+ssl://my-pulsar-server:6651/
And here's the output from the command above:
11:47:29.924 [main] INFO org.apache.pulsar.functions.runtime.RuntimeSpawner - my-tenant/my-namespace/my-source-0 RuntimeSpawner starting function
11:47:29.925 [main] INFO org.apache.pulsar.functions.runtime.thread.ThreadRuntime - ThreadContainer starting function with instance config InstanceConfig(instanceId=0, functionId=4073a1d9-1312-4570-981b-6723626e394a, functionVersion=01d5a3a7-c6d7-4f79-8717-403ad1371411, functionDetails=tenant: "my-tenant"
namespace: "my-namespace"
name: "my-source"
className: "org.apache.pulsar.functions.api.utils.IdentityFunction"
autoAck: true
parallelism: 1
source {
className: "org.apache.pulsar.io.debezium.postgres.DebeziumPostgresSource"
configs: "{\"database.user\":\"my-db-user\",\"database.dbname\":\"my-db\",\"database.hostname\":\"my-db-server\",\"database.password\":\"my-db-password\",\"database.server.name\":\"my-db-server-name\",\"plugin.name\":\"pgoutput\",\"database.port\":\"5432\",\"pulsar.service.url\":\"pulsar+ssl://my-pulsar-server:6651/\",\"table.whitelist\":\"my_schema.my_table\"}"
typeClassName: "org.apache.pulsar.common.schema.KeyValue"
}
sink {
topic: "my-topic"
typeClassName: "org.apache.pulsar.common.schema.KeyValue"
}
resources {
cpu: 1.0
ram: 1073741824
disk: 10737418240
}
componentType: SOURCE
, maxBufferedTuples=1024, functionAuthenticationSpec=null, port=39135, clusterName=local, maxPendingAsyncRequests=1000)
11:47:32.552 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xf8ffbf24, L:/redacted-ip-l:43802 - R:my-pulsar-server/redacted-ip-r:6651]] Connected to server
11:47:33.240 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Starting Pulsar producer perf with config: {
"topicName" : "my-topic",
"producerName" : null,
"sendTimeoutMs" : 0,
"blockIfQueueFull" : true,
"maxPendingMessages" : 1000,
"maxPendingMessagesAcrossPartitions" : 50000,
"messageRoutingMode" : "CustomPartition",
"hashingScheme" : "Murmur3_32Hash",
"cryptoFailureAction" : "FAIL",
"batchingMaxPublishDelayMicros" : 10000,
"batchingPartitionSwitchFrequencyByPublishDelay" : 10,
"batchingMaxMessages" : 1000,
"batchingMaxBytes" : 131072,
"batchingEnabled" : true,
"chunkingEnabled" : false,
"compressionType" : "LZ4",
"initialSequenceId" : null,
"autoUpdatePartitions" : true,
"multiSchema" : true,
"properties" : {
"application" : "pulsar-source",
"id" : "my-tenant/my-namespace/my-source",
"instance_id" : "0"
}
}
11:47:33.259 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Pulsar client config: {
"serviceUrl" : "pulsar+ssl://my-pulsar-server:6651",
"authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationToken",
"authParams" : "file:///pulsar/tokens/broker/token",
"authParamMap" : null,
"operationTimeoutMs" : 30000,
"statsIntervalSeconds" : 60,
"numIoThreads" : 1,
"numListenerThreads" : 1,
"connectionsPerBroker" : 1,
"useTcpNoDelay" : true,
"useTls" : true,
"tlsTrustCertsFilePath" : null,
"tlsAllowInsecureConnection" : true,
"tlsHostnameVerificationEnable" : false,
"concurrentLookupRequest" : 5000,
"maxLookupRequest" : 50000,
"maxLookupRedirects" : 20,
"maxNumberOfRejectedRequestPerConnection" : 50,
"keepAliveIntervalSeconds" : 30,
"connectionTimeoutMs" : 10000,
"requestTimeoutMs" : 60000,
"initialBackoffIntervalNanos" : 100000000,
"maxBackoffIntervalNanos" : 60000000000,
"listenerName" : null,
"useKeyStoreTls" : false,
"sslProvider" : null,
"tlsTrustStoreType" : "JKS",
"tlsTrustStorePath" : null,
"tlsTrustStorePassword" : null,
"tlsCiphers" : [ ],
"tlsProtocols" : [ ],
"proxyServiceUrl" : null,
"proxyProtocol" : null
}
11:47:33.418 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651]] Connected to server
11:47:33.422 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ClientCnx - [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651] Connected through proxy to target broker at my-broker:6651
11:47:33.484 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [null] Creating producer on cnx [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651]
11:48:33.434 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [null] Failed to create producer: 3 lookup request timedout after ms 30000
11:48:33.438 [pulsar-client-io-1-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651] request 3 timed out after 30000 ms
11:48:33.629 [main] INFO org.apache.pulsar.functions.LocalRunner - RuntimeSpawner quit because of
java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtMostOnceProcessor.<init>(PulsarSink.java:177) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtLeastOnceProcessor.<init>(PulsarSink.java:206) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink.open(PulsarSink.java:284) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.setupOutput(JavaInstanceRunnable.java:819) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.setup(JavaInstanceRunnable.java:224) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:246) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_252]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000
at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:821) ~[org.apache.pulsar-pulsar-client-api-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.client.impl.ProducerBuilderImpl.create(ProducerBuilderImpl.java:93) ~[org.apache.pulsar-pulsar-client-original-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkProcessorBase.createProducer(PulsarSink.java:106) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtMostOnceProcessor.<init>(PulsarSink.java:174) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
... 6 more
11:48:59.956 [function-timer-thread-5-1] ERROR org.apache.pulsar.functions.runtime.RuntimeSpawner - my-tenant/my-namespace/my-source-java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000 Function Container is dead with exception.. restarting
As you can see, it failed to create a producer due to a TimeoutException. What are the likely causes of this error? What's the best way to further investigate this issue?
Additional info:
I have also tried the --tls-trust-cert-path /my/ca-certificates.crt option instead of --tls-allow-insecure, but got the same error.
I am able to list tenants:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
tenants list
# Output:
# "public"
# "pulsar"
# "my-topic"
But I am not able to get an OK broker health-check:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
brokers healthcheck
# Output:
# null
# Reason: java.util.concurrent.TimeoutException
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
brokers healthcheck
# Output:
# HTTP 500 Internal Server Error
# Reason: HTTP 500 Internal Server Error
In my case, the root cause was an expired TLS certificate.
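If you suspect the same cause, a quick way to check the certificate dates on each TLS endpoint (hostnames and ports taken from the commands above) is:
# admin endpoint
openssl s_client -connect localhost:8443 </dev/null 2>/dev/null | openssl x509 -noout -dates
# broker endpoint
openssl s_client -connect my-pulsar-server:6651 </dev/null 2>/dev/null | openssl x509 -noout -dates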

Setting kafka mirroring using Brooklin

I am trying to test out Brooklin for mirroring data between kafka clusters. I am following the wiki https://github.com/linkedin/brooklin/wiki/mirroring-kafka-clusters
Unlike the wiki, I am trying to set up mirroring between two different clusters. I am able to start the Brooklin process and the Datastream, but I cannot manage to mirror any messages. Brooklin is currently running on the source Kafka cluster. I am trying to mirror the topic 'test'.
The server.properties for Brooklin is:
############################# Server Basics #############################
brooklin.server.coordinator.cluster=brooklin-cluster
brooklin.server.coordinator.zkAddress=localhost:2181
brooklin.server.httpPort=32311
brooklin.server.connectorNames=file,test,kafkaMirroringConnector
brooklin.server.transportProviderNames=kafkaTransportProvider
brooklin.server.csvMetricsDir=/tmp/brooklin-example/
########################### Transport provider configs ######################
brooklin.server.transportProvider.kafkaTransportProvider.factoryClassName=com.linkedin.datastream.kafka.KafkaTransportProviderAdminFactory
brooklin.server.transportProvider.kafkaTransportProvider.bootstrap.servers=kafka-dest:9092
brooklin.server.transportProvider.kafkaTransportProvider.zookeeper.connect=kafka-dest:2181
brooklin.server.transportProvider.kafkaTransportProvider.client.id=datastream-producer
########################### File connector Configs ######################
brooklin.server.connector.file.factoryClassName=com.linkedin.datastream.connectors.file.FileConnectorFactory
brooklin.server.connector.file.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.BroadcastStrategyFactory
brooklin.server.connector.file.strategy.maxTasks=1
########################### Test event producing connector Configs ######################
brooklin.server.connector.test.factoryClassName=com.linkedin.datastream.connectors.TestEventProducingConnectorFactory
brooklin.server.connector.test.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.LoadbalancingStrategyFactory
brooklin.server.connector.test.strategy.TasksPerDatastream = 4
########################### Kafka Mirroring connector Configs ######################
brooklin.server.connector.kafkaMirroringConnector.factoryClassName=com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorFactory
brooklin.server.connector.kafkaMirroringConnector.assignmentStrategyFactory=com.linkedin.datastream.server.assignment.BroadcastStrategyFactory
I then try to start the following Datastream:
bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-mirroring-stream -s "kafka://localhost:9092/test" -c kafkaMirroringConnector -t kafkaTransportProvider -m '{"owner":"root","system.reuseExistingDestination":"false"}' 2>/dev/null
Trying to check the Datastream:
bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/ 2>/dev/null
[2020-10-14 05:55:45,087] INFO Creating RestClient for http://localhost:32311/ with {}, count=1 (com.linkedin.datastream.DatastreamRestClientFactory)
[2020-10-14 05:55:45,113] INFO The service 'null' has been assigned to the ChannelPoolManager with key 'noSpecifiedNamePrefix 1138266797 ' (com.linkedin.r2.transport.http.client.HttpClientFactory)
[2020-10-14 05:55:45,215] INFO DatastreamRestClient created with retryPeriodMs=6000 retryTimeoutMs=90000 (com.linkedin.datastream.DatastreamRestClient)
[2020-10-14 05:55:45,502] INFO getAllDatastreams took 272 ms (com.linkedin.datastream.DatastreamRestClient)
{
  "name" : "first-mirroring-stream",
  "connectorName" : "kafkaMirroringConnector",
  "transportProviderName" : "kafkaTransportProvider",
  "source" : {
    "connectionString" : "kafka://localhost:9092/test"
  },
  "Status" : "READY",
  "destination" : {
    "connectionString" : "kafka://kafka-dest:9092/*"
  },
  "metadata" : {
    "datastreamUUID" : "df081002-fc7b-4f3a-b1ce-016e879d4b29",
    "group.id" : "first-mirroring-stream",
    "owner" : "root",
    "system.IsConnectorManagedDestination" : "true",
    "system.creation.ms" : "1602665999603",
    "system.destination.KafkaBrokers" : "kafka-dest:9092",
    "system.reuseExistingDestination" : "false",
    "system.taskPrefix" : "first-mirroring-stream"
  }
}
After this is running I try to produce on the source and consume on the destination but I do not get any mirroring.
Does anyone have a clue what I'm missing/what I did wrong?
Thanks!
This was an issue on my end - I had a typo in the topic name configured for mirroring.
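A quick sanity check for that kind of mistake is to describe the topic on the source cluster and compare its exact name with the datastream's connection string (broker address taken from the setup above; on older Kafka CLI versions use --zookeeper instead of --bootstrap-server):
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic test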

Scala/Spark Streaming Store transformed kafka messages to Hive

As a data source I am using a Kafka stream to consume tweets.
I have written a simple Spark Streaming application.
I am able to consume the tweets and I am able to convert the records to my own case class.
But I am not able to write to a Hive table located in the same Docker container as Spark.
Can you give me a hint on how to solve this?
import com.fhjoanneum.swd18.grp3.bigdata.convert.TweetToTwitterDbRecord
import com.fhjoanneum.swd18.grp3.bigdata.domain.Tweet
import com.typesafe.scalalogging.LazyLogging
import org.apache.spark.sql._
import org.json4s.jackson.JsonMethods.parse
import org.json4s.{DefaultFormats}

case object TwitterInputStream extends App with LazyLogging {

  val spark = SparkSession
    .builder()
    .appName(s"TestApp")
    .master("local[*]")
    .config("hive.metastore.uris", "thrift://0.0.0.0:9083")
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
  spark.sql("SET hive.exec.parallel=true")
  spark.sql("SET hive.exec.parallel.thread.number=16")

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.1.156:9092")
    .option("subscribe", "twitter-status")
    .option("startingOffsets", "latest") // From starting
    .load()

  import spark.implicits._

  val testerDF = df.selectExpr("CAST(value AS STRING)").as[String]

  val parsedMsgs = testerDF.map(value => {
    implicit val formats = DefaultFormats
    val tweet = parse(value).extract[Tweet]
    tweet
  })

  // the following part causes my problems:
  val query = parsedMsgs.map(TweetToTwitterDbRecord)
    .writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) =>
      batchDs.write
        .format("parquet")
        .mode(SaveMode.Append)
        .insertInto("grp3.tweets")
    ).start().awaitTermination()

  // the commented part works:
  // parsedMsgs.writeStream
  //   .format("console")
  //   .outputMode("append")
  //   .start()
  //   .awaitTermination()
}
The table I want to write to was created by this statement:
CREATE EXTERNAL TABLE `tweets`(
`id` BigInt,
`createdAt` String,
`text` String,
`userId` Int,
`geo` String,
`coordinates` String,
`place` String,
`quoteCount` Int,
`replyCount` Int,
`retweetCount` Int,
`favoriteCount` Int,
`timestampMs` BigInt
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET LOCATION '/data/bigdata/tweets'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
It does not break; I have to switch my logger to DEBUG to see what's going on.
The following are the last lines of my logger output:
2020-01-28 21:38:53 INFO ParquetWriteSupport:54 - Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "long",
"nullable" : true,
"metadata" : { }
}, {
"name" : "createdat",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "text",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "userid",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "geo",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "coordinates",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "place",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "quotecount",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "replycount",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "retweetcount",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "favoritecount",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "timestampms",
"type" : "long",
"nullable" : true,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
optional int64 id;
optional binary createdat (UTF8);
optional binary text (UTF8);
required int32 userid;
optional binary geo (UTF8);
optional binary coordinates (UTF8);
optional binary place (UTF8);
required int32 quotecount;
required int32 replycount;
required int32 retweetcount;
required int32 favoritecount;
optional int64 timestampms;
}
2020-01-28 21:38:53 DEBUG DFSClient:1646 - /data/bigdata/tweets/_temporary/0/_temporary/attempt_20200128213853_0001_m_000000_1/part-00000-72cb3fde-ecbe-4969-aac5-becbd65a147d-c000.snappy.parquet: masked=rw-r--r--
2020-01-28 21:38:53 DEBUG Client:1026 - IPC Client (1473539708) connection to nodemaster/0.0.0.0:9000 from andreas sending #7
2020-01-28 21:38:53 DEBUG Client:1083 - IPC Client (1473539708) connection to nodemaster/0.0.0.0:9000 from andreas got value #7
2020-01-28 21:38:53 DEBUG ProtobufRpcEngine:253 - Call: create took 4ms
2020-01-28 21:38:53 DEBUG DFSClient:1802 - computePacketChunkSize: src=/data/bigdata/tweets/_temporary/0/_temporary/attempt_20200128213853_0001_m_000000_1/part-00000-72cb3fde-ecbe-4969-aac5-becbd65a147d-c000.snappy.parquet, chunkSize=516, chunksPerPacket=127, packetSize=65532
2020-01-28 21:38:53 DEBUG LeaseRenewer:301 - Lease renewer daemon for [DFSClient_NONMAPREDUCE_469557971_72] with renew id 1 started
2020-01-28 21:38:53 DEBUG ParquetFileWriter:281 - 0: start
2020-01-28 21:38:53 DEBUG MemoryManager:63 - Allocated total memory pool is: 3626971910
2020-01-28 21:38:53 INFO CodecPool:151 - Got brand-new compressor [.snappy]
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG RunLengthBitPackingHybridEncoder:119 - Encoding: RunLengthBitPackingHybridEncoder with bithWidth: 1 initialCapacity 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 64
2020-01-28 21:38:53 DEBUG CapacityByteArrayOutputStream:276 - initial slab of size 1024
2020-01-28 21:38:53 DEBUG MemoryManager:138 - Adjust block size from 134,217,728 to 134,217,728 for writer: org.apache.parquet.hadoop.InternalParquetRecordWriter#61628754
2020-01-28 21:38:53 DEBUG RecordConsumerLoggingWrapper:69 - <!-- flush -->
2020-01-28 21:38:53 INFO InternalParquetRecordWriter:165 - Flushing mem columnStore to file. allocated memory: 0
2020-01-28 21:38:53 DEBUG ParquetFileWriter:682 - 4: end
2020-01-28 21:38:54 DEBUG ParquetFileWriter:692 - 1209: footer length = 1205
2020-01-28 21:38:54 DEBUG BytesUtils:159 - write le int: 1205 => 181 4 0 0
2020-01-28 21:38:54 DEBUG DFSClient:1869 - DFSClient writeChunk allocating new packet seqno=0, src=/data/bigdata/tweets/_temporary/0/_temporary/attempt_20200128213853_0001_m_000000_1/part-00000-72cb3fde-ecbe-4969-aac5-becbd65a147d-c000.snappy.parquet, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
2020-01-28 21:38:54 DEBUG DFSClient:1815 - Queued packet 0
2020-01-28 21:38:54 DEBUG DFSClient:1815 - Queued packet 1
2020-01-28 21:38:54 DEBUG DFSClient:2133 - Waiting for ack for: 1
2020-01-28 21:38:54 DEBUG DFSClient:585 - Allocating new block
2020-01-28 21:38:54 DEBUG Client:1026 - IPC Client (1473539708) connection to nodemaster/0.0.0.0:9000 from andreas sending #8
2020-01-28 21:38:54 DEBUG Client:1083 - IPC Client (1473539708) connection to nodemaster/0.0.0.0:9000 from andreas got value #8
2020-01-28 21:38:54 DEBUG ProtobufRpcEngine:253 - Call: addBlock took 6ms
2020-01-28 21:38:54 DEBUG DFSClient:1390 - pipeline = 172.18.1.3:9866
2020-01-28 21:38:54 DEBUG DFSClient:1390 - pipeline = 172.18.1.2:9866
2020-01-28 21:38:54 DEBUG DFSClient:1601 - Connecting to datanode 172.18.1.3:9866
2020-01-28 21:38:54 DEBUG AbstractCoordinator:833 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Sending Heartbeat request to coordinator 192.168.1.156:9092 (id: 2147482646 rack: null)
2020-01-28 21:38:54 DEBUG AbstractCoordinator:846 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Received successful Heartbeat response
2020-01-28 21:38:57 DEBUG AbstractCoordinator:833 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Sending Heartbeat request to coordinator 192.168.1.156:9092 (id: 2147482646 rack: null)
2020-01-28 21:38:57 DEBUG AbstractCoordinator:846 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Received successful Heartbeat response
2020-01-28 21:39:00 DEBUG AbstractCoordinator:833 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Sending Heartbeat request to coordinator 192.168.1.156:9092 (id: 2147482646 rack: null)
2020-01-28 21:39:00 DEBUG AbstractCoordinator:846 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Received successful Heartbeat response
2020-01-28 21:39:03 DEBUG AbstractCoordinator:833 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Sending Heartbeat request to coordinator 192.168.1.156:9092 (id: 2147482646 rack: null)
2020-01-28 21:39:03 DEBUG AbstractCoordinator:846 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-d50ae41c-0b12-45c2-838f-c83c7a7e856d-1198433466-driver-0] Received successful Heartbeat response
I am really stuck. I would be grateful for any hint.
Thank you.
Andreas
edit
OK, after some time it breaks with the following message:
2020-01-28 22:20:21 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/9g/24386ccd2lg11pqzxj2w5f0r0000gn/T/spark-f5efe1d0-c8d1-4b6b-bb60-352114a9cf2d
2020-01-28 22:20:21 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/9g/24386ccd2lg11pqzxj2w5f0r0000gn/T/temporaryReader-3bdb4248-1e34-460c-b0d0-78c01460ff63
2020-01-28 22:20:21 INFO ShutdownHookManager:54 - Deleting directory /private/var/folders/9g/24386ccd2lg11pqzxj2w5f0r0000gn/T/temporary-d59aa3d8-d255-4134-9793-20a892abaf38
2020-01-28 22:20:21 ERROR DFSClient:930 - Failed to close inode 16620
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /data/bigdata/tweets/_temporary/0/_temporary/attempt_20200128221820_0001_m_000000_1/part-00000-3e620934-eae5-4235-8a7b-2de07b269e8e-c000.snappy.parquet could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2135)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2771)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
So is this a Docker issue? How can I map my containers to my host?
My starting script:
#!/bin/bash
# Bring the services up
function startServices {
docker start nodemaster node2 node3
sleep 5
echo ">> Starting hdfs ..."
docker exec -u hadoop -it nodemaster start-dfs.sh
sleep 5
echo ">> Starting yarn ..."
docker exec -u hadoop -d nodemaster start-yarn.sh
sleep 5
echo ">> Starting MR-JobHistory Server ..."
docker exec -u hadoop -d nodemaster mr-jobhistory-daemon.sh start historyserver
sleep 5
echo ">> Starting Spark ..."
docker exec -u hadoop -d nodemaster start-master.sh
docker exec -u hadoop -d node2 start-slave.sh nodemaster:7077
docker exec -u hadoop -d node3 start-slave.sh nodemaster:7077
sleep 5
echo ">> Starting Spark History Server ..."
docker exec -u hadoop nodemaster start-history-server.sh
sleep 5
echo ">> Preparing hdfs for hive ..."
docker exec -u hadoop -it nodemaster hdfs dfs -mkdir -p /tmp
docker exec -u hadoop -it nodemaster hdfs dfs -mkdir -p /user/hive/warehouse
docker exec -u hadoop -it nodemaster hdfs dfs -chmod g+w /tmp
docker exec -u hadoop -it nodemaster hdfs dfs -chmod g+w /user/hive/warehouse
sleep 5
echo ">> Starting Hive Metastore ..."
docker exec -u hadoop -d nodemaster hive --service metastore
echo "Hadoop info # nodemaster: http://172.18.1.1:8088/cluster"
echo "DFS Health # nodemaster : http://172.18.1.1:50070/dfshealth"
echo "MR-JobHistory Server # nodemaster : http://172.18.1.1:19888"
echo "Spark info # nodemaster : http://172.18.1.1:8080"
echo "Spark History Server # nodemaster : http://172.18.1.1:18080"
}
function stopServices {
echo ">> Stopping Spark Master and slaves ..."
docker exec -u hadoop -d nodemaster stop-master.sh
docker exec -u hadoop -d node2 stop-slave.sh
docker exec -u hadoop -d node3 stop-slave.sh
echo ">> Stopping containers ..."
docker stop nodemaster node2 node3 psqlhms
}
if [[ $1 = "start" ]]; then
docker network create --subnet=172.18.0.0/16 hadoopnet # create custom network
# Starting Postresql Hive metastore
echo ">> Starting postgresql hive metastore ..."
docker run -d --net hadoopnet --ip 172.18.1.4 --hostname psqlhms --name psqlhms -it postgresql-hms
sleep 5
# 3 nodes
echo ">> Starting nodes master and worker nodes ..."
docker run -d --net hadoopnet --ip 172.18.1.1 --hostname nodemaster -p 9083:9083 -p 9000:9000 -p 7077:7077 -p 8080:8080 -p 8088:8088 -p 50070:50070 -p 6066:6066 -p 4040:4040 -p 20002:20002 --add-host node2:172.18.1.2 --add-host node3:172.18.1.3 --name nodemaster -it hive
docker run -d --net hadoopnet --ip 172.18.1.2 --hostname node2 -p 8081:8081 --add-host nodemaster:172.18.1.1 --add-host node3:172.18.1.3 --name node2 -it spark
docker run -d --net hadoopnet --ip 172.18.1.3 --hostname node3 -p 8082:8081 --add-host nodemaster:172.18.1.1 --add-host node2:172.18.1.2 --name node3 -it spark
# Format nodemaster
echo ">> Formatting hdfs ..."
docker exec -u hadoop -it nodemaster hdfs namenode -format
startServices
exit
fi
if [[ $1 = "stop" ]]; then
stopServices
docker rm nodemaster node2 node3 psqlhms
docker network rm hadoopnet
exit
fi
if [[ $1 = "uninstall" ]]; then
stopServices
docker rmi hadoop spark hive postgresql-hms -f
docker network rm hadoopnet
docker system prune -f
exit
fi
echo "Usage: cluster.sh start|stop|uninstall"
echo " start - start existing containers"
echo " stop - stop running processes"
echo " uninstall - remove all docker images"

Geting messages of Offset is getting reset in structured streaming mode in Spark

Spark (v2.4) program function:
- Read JSON data from a Kafka queue in structured streaming mode in Spark
- Print the read data on the console as it is
Issue:
- Repeatedly getting "Resetting offset for partition nifi-log-batch-0 to offset 2826180."
Source code:
package io.xyz.streaming

import org.apache.spark.sql.avro._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions._

object readKafkaJson {

  private val topic = "nifi-log-batch"
  private val kafkaUrl = "http://<hostname>:9092"
  private val chk = "/home/xyz/tmp/checkpoint"
  private val outputFileLocation = "/home/xyz/abc/data"
  private val sparkSchema = StructType(Array(
    StructField("timestamp", StringType),
    StructField("level", StringType),
    StructField("thread", StringType),
    StructField("class", StringType),
    StructField("message", StringType),
    StructField("updatedOn", StringType),
    StructField("stackTrace", StringType)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // ===================Read Kafka data in JSON==================
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaUrl)
      .option("startingOffsets", "latest")
      .option("subscribe", topic)
      .load()

    val dfs1 = df
      .selectExpr("CAST(value AS STRING)")
      .select(from_json(col("value"), sparkSchema).alias("my_column"))
      .select("my_column.*")

    // ===================Write to console==================
    dfs1
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
Detailed issue log on console:
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Start processing data source writer: org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#622d0057. The input RDD has 0 partitions.
2019-04-10 01:12:58 INFO SparkContext:54 - Starting job: start at readKafkaJson.scala:70
2019-04-10 01:12:58 INFO DAGScheduler:54 - Job 0 finished: start at readKafkaJson.scala:70, took 0.003870 s
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#622d0057 is committing.
-------------------------------------------
Batch: 0
-------------------------------------------
2019-04-10 01:12:58 INFO CodeGenerator:54 - Code generated in 41.952695 ms
+---------+-----+------+-----+-------+---------+----------+
|timestamp|level|thread|class|message|updatedOn|stackTrace|
+---------+-----+------+-----+-------+---------+----------+
+---------+-----+------+-----+-------+---------+----------+
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#622d0057 committed.
2019-04-10 01:12:58 INFO SparkContext:54 - Starting job: start at readKafkaJson.scala:70
2019-04-10 01:12:58 INFO DAGScheduler:54 - Job 1 finished: start at readKafkaJson.scala:70, took 0.000104 s
2019-04-10 01:12:58 INFO CheckpointFileManager:54 - Writing atomically to file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/0 using temp file file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/.0.eb290a31-1965-40e7-9028-d18f2eea0627.tmp
2019-04-10 01:12:58 INFO CheckpointFileManager:54 - Renamed temp file file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/.0.eb290a31-1965-40e7-9028-d18f2eea0627.tmp to file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/0
2019-04-10 01:12:58 INFO MicroBatchExecution:54 - Streaming query made progress: {
"id" : "fb44fbef-5d05-4bb8-ae72-3327b98af261",
"runId" : "ececfe49-bbc6-4964-8798-78980cbec525",
"name" : null,
"timestamp" : "2019-04-10T06:12:56.414Z",
"batchId" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 1324,
"getBatch" : 10,
"getEndOffset" : 1,
"queryPlanning" : 386,
"setOffsetRange" : 609,
"triggerExecution" : 2464,
"walCommit" : 55
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[nifi-log-batch]]",
"startOffset" : null,
"endOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider#6ced6212"
}
}
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO MicroBatchExecution:54 - Streaming query made progress: {
"id" : "fb44fbef-5d05-4bb8-ae72-3327b98af261",
"runId" : "ececfe49-bbc6-4964-8798-78980cbec525",
"name" : null,
"timestamp" : "2019-04-10T06:12:58.935Z",
"batchId" : 1,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getEndOffset" : 1,
"setOffsetRange" : 11,
"triggerExecution" : 15
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[nifi-log-batch]]",
"startOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"endOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider#6ced6212"
}
}
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
Even when I run equivalent code in PySpark, I face the same issue.
Please suggest how to resolve this issue.
Kafka: v2.1.0 cpl, confluent
Spark: 2.4
The job is submitted through the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 --jars /home/xyz/Softwares/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar --class io.xyz.streaming.readKafkaJson --master local[*] /home/xyz/ScalaCode/target/SparkSchemaKafka-0.0.1-SNAPSHOT-jar-with-dependencies.jar
It seems the asker already found the solution; here are the relevant parts from the comments:
Main resolution
It was an issue with the schema structure in Scala. After correcting the schema, the issue was resolved.
Secondary topic
In the PySpark code the processing is happening but the messages are not stopping, i.e. I am able to run the code and write the stream data into a JSON file, but the console is filled with the above-mentioned "Resetting offset for ..." log messages.
That PySpark issue was actually just INFO messages getting printed, which I disabled.
After which all was good.
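The comments do not say exactly how the INFO messages were disabled; one common way is to lower the log level of the Kafka consumer's Fetcher logger (the logger name appears in the log lines above), for example in Spark's conf/log4j.properties:
log4j.logger.org.apache.kafka.clients.consumer.internals.Fetcher=WARN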