Getting messages that the offset is being reset in structured streaming mode in Spark - scala

Spark (v2.4) Program function:
Read JSON data from Kafka queue in structured streaming mode in spark
Print the read data on the console as it is
Issue:
- The console keeps logging: Resetting offset for partition nifi-log-batch-0 to offset 2826180.
Source code:
package io.xyz.streaming
import org.apache.spark.sql.avro._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions._
object readKafkaJson {
private val topic = "nifi-log-batch"
private val kafkaUrl = "http://<hostname>:9092"
private val chk = "/home/xyz/tmp/checkpoint"
private val outputFileLocation = "/home/xyz/abc/data"
private val sparkSchema = StructType(Array(
StructField("timestamp", StringType),
StructField("level", StringType),
StructField("thread", StringType),
StructField("class", StringType),
StructField("message", StringType),
StructField("updatedOn", StringType),
StructField("stackTrace", StringType)))
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("ConfluentConsumer")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// ===================Read Kafka data in JSON==================
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaUrl)
.option("startingOffsets", "latest")
.option("subscribe", topic)
.load()
val dfs1 = df
.selectExpr("CAST(value AS STRING)")
.select(from_json(col("value"), sparkSchema).alias("my_column"))
.select("my_column.*")
// ===================Write to console==================
dfs1
.writeStream
.format("console")
.start()
.awaitTermination()
}
}
Detailed issue log on console:
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Start processing data source writer: org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@622d0057. The input RDD has 0 partitions.
2019-04-10 01:12:58 INFO SparkContext:54 - Starting job: start at readKafkaJson.scala:70
2019-04-10 01:12:58 INFO DAGScheduler:54 - Job 0 finished: start at readKafkaJson.scala:70, took 0.003870 s
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@622d0057 is committing.
-------------------------------------------
Batch: 0
-------------------------------------------
2019-04-10 01:12:58 INFO CodeGenerator:54 - Code generated in 41.952695 ms
+---------+-----+------+-----+-------+---------+----------+
|timestamp|level|thread|class|message|updatedOn|stackTrace|
+---------+-----+------+-----+-------+---------+----------+
+---------+-----+------+-----+-------+---------+----------+
2019-04-10 01:12:58 INFO WriteToDataSourceV2Exec:54 - Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@622d0057 committed.
2019-04-10 01:12:58 INFO SparkContext:54 - Starting job: start at readKafkaJson.scala:70
2019-04-10 01:12:58 INFO DAGScheduler:54 - Job 1 finished: start at readKafkaJson.scala:70, took 0.000104 s
2019-04-10 01:12:58 INFO CheckpointFileManager:54 - Writing atomically to file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/0 using temp file file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/.0.eb290a31-1965-40e7-9028-d18f2eea0627.tmp
2019-04-10 01:12:58 INFO CheckpointFileManager:54 - Renamed temp file file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/.0.eb290a31-1965-40e7-9028-d18f2eea0627.tmp to file:/tmp/temporary-df2fea18-7b2f-4146-bcfd-7923cfab65e7/commits/0
2019-04-10 01:12:58 INFO MicroBatchExecution:54 - Streaming query made progress: {
"id" : "fb44fbef-5d05-4bb8-ae72-3327b98af261",
"runId" : "ececfe49-bbc6-4964-8798-78980cbec525",
"name" : null,
"timestamp" : "2019-04-10T06:12:56.414Z",
"batchId" : 0,
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 1324,
"getBatch" : 10,
"getEndOffset" : 1,
"queryPlanning" : 386,
"setOffsetRange" : 609,
"triggerExecution" : 2464,
"walCommit" : 55
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[nifi-log-batch]]",
"startOffset" : null,
"endOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@6ced6212"
}
}
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO MicroBatchExecution:54 - Streaming query made progress: {
"id" : "fb44fbef-5d05-4bb8-ae72-3327b98af261",
"runId" : "ececfe49-bbc6-4964-8798-78980cbec525",
"name" : null,
"timestamp" : "2019-04-10T06:12:58.935Z",
"batchId" : 1,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getEndOffset" : 1,
"setOffsetRange" : 11,
"triggerExecution" : 15
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaV2[Subscribe[nifi-log-batch]]",
"startOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"endOffset" : {
"nifi-log-batch" : {
"0" : 2826180
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@6ced6212"
}
}
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
2019-04-10 01:12:58 INFO Fetcher:583 - [Consumer clientId=consumer-1, groupId=spark-kafka-source-9a027b2b-0a3a-4773-a356-a585e488062c--81433247-driver-0] Resetting offset for partition nifi-log-batch-0 to offset 2826180.
I face the same issue even when I run equivalent code in PySpark.
Please suggest how to resolve this issue.
Kafka: v2.1.0 cpl, confluent
Spark: 2.4
Job submitted through following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 --jars /home/xyz/Softwares/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar --class io.xyz.streaming.readKafkaJson --master local[*] /home/xyz/ScalaCode/target/SparkSchemaKafka-0.0.1-SNAPSHOT-jar-with-dependencies.jar

It seems the asker already found the solution. Here are the relevant parts from the comments:
Main resolution
It was an issue with the schema structure in Scala. After correcting the
schema, the issue was resolved.
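The comments do not say what exactly was wrong with the schema. As a rough diagnostic sketch, assuming the mismatch was between field names, one sample message copied from the topic can be compared with the field names listed in sparkSchema before the stream is started. The helper name compare_with_schema and the sample message are hypothetical; only the field names come from the question.

```python
import json

# Field names copied from sparkSchema in the question.
EXPECTED_FIELDS = ("timestamp", "level", "thread", "class",
                   "message", "updatedOn", "stackTrace")


def compare_with_schema(raw_message, expected_fields=EXPECTED_FIELDS):
    """Return (missing, unexpected) field names for one JSON message."""
    record = json.loads(raw_message)
    if not isinstance(record, dict):
        raise ValueError("top-level JSON value is not an object")
    # Fields the schema expects but the message does not carry.
    missing = [name for name in expected_fields if name not in record]
    # Fields the message carries but the schema does not declare.
    unexpected = sorted(set(record) - set(expected_fields))
    return missing, unexpected


# Hypothetical sample message; a real one could be copied from the topic.
sample = '{"timestamp": "2019-04-10 01:12:58", "level": "INFO", "msg": "started"}'
missing, unexpected = compare_with_schema(sample)
print("missing:", missing)
print("unexpected:", unexpected)
```

Fields reported as missing would come back as null columns from from_json, so a long "missing" list is a hint that the schema and the data do not line up.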
Secondary topic
In the PySpark code the processing is happening, but the messages do not
stop, i.e. I am able to run the code and write the stream
data into a JSON file, but the console is filled with the
above-mentioned Resetting offset for ... log messages.
That PySpark issue was actually just INFO messages getting printed,
which I disabled.
After that, all was good.

Related

TimeoutException when trying to run a Pulsar source connector

I'm trying to run a Pulsar DebeziumPostgresSource connector.
This is the command I'm running:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
source localrun \
--broker-service-url pulsar+ssl://my-pulsar-server:6651 \
--client-auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--client-auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
--source-config-file /pulsar/debezium-config/my-source-config.yaml
Here's the /pulsar/debezium-config/my-source-config.yaml file:
tenant: my-tenant
namespace: my-namespace
name: my-source
topicName: my-topic
archive: connectors/pulsar-io-debezium-postgres-2.6.0-SNAPSHOT.nar
parallelism: 1
configs:
plugin.name: pgoutput
database.hostname: my-db-server
database.port: "5432"
database.user: my-db-user
database.password: my-db-password
database.dbname: my-db
database.server.name: my-db-server-name
table.whitelist: my_schema.my_table
pulsar.service.url: pulsar+ssl://my-pulsar-server:6651/
And here's the output from the command above:
11:47:29.924 [main] INFO org.apache.pulsar.functions.runtime.RuntimeSpawner - my-tenant/my-namespace/my-source-0 RuntimeSpawner starting function
11:47:29.925 [main] INFO org.apache.pulsar.functions.runtime.thread.ThreadRuntime - ThreadContainer starting function with instance config InstanceConfig(instanceId=0, functionId=4073a1d9-1312-4570-981b-6723626e394a, functionVersion=01d5a3a7-c6d7-4f79-8717-403ad1371411, functionDetails=tenant: "my-tenant"
namespace: "my-namespace"
name: "my-source"
className: "org.apache.pulsar.functions.api.utils.IdentityFunction"
autoAck: true
parallelism: 1
source {
className: "org.apache.pulsar.io.debezium.postgres.DebeziumPostgresSource"
configs: "{\"database.user\":\"my-db-user\",\"database.dbname\":\"my-db\",\"database.hostname\":\"my-db-server\",\"database.password\":\"my-db-password\",\"database.server.name\":\"my-db-server-name\",\"plugin.name\":\"pgoutput\",\"database.port\":\"5432\",\"pulsar.service.url\":\"pulsar+ssl://my-pulsar-server:6651/\",\"table.whitelist\":\"my_schema.my_table\"}"
typeClassName: "org.apache.pulsar.common.schema.KeyValue"
}
sink {
topic: "my-topic"
typeClassName: "org.apache.pulsar.common.schema.KeyValue"
}
resources {
cpu: 1.0
ram: 1073741824
disk: 10737418240
}
componentType: SOURCE
, maxBufferedTuples=1024, functionAuthenticationSpec=null, port=39135, clusterName=local, maxPendingAsyncRequests=1000)
11:47:32.552 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xf8ffbf24, L:/redacted-ip-l:43802 - R:my-pulsar-server/redacted-ip-r:6651]] Connected to server
11:47:33.240 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Starting Pulsar producer perf with config: {
"topicName" : "my-topic",
"producerName" : null,
"sendTimeoutMs" : 0,
"blockIfQueueFull" : true,
"maxPendingMessages" : 1000,
"maxPendingMessagesAcrossPartitions" : 50000,
"messageRoutingMode" : "CustomPartition",
"hashingScheme" : "Murmur3_32Hash",
"cryptoFailureAction" : "FAIL",
"batchingMaxPublishDelayMicros" : 10000,
"batchingPartitionSwitchFrequencyByPublishDelay" : 10,
"batchingMaxMessages" : 1000,
"batchingMaxBytes" : 131072,
"batchingEnabled" : true,
"chunkingEnabled" : false,
"compressionType" : "LZ4",
"initialSequenceId" : null,
"autoUpdatePartitions" : true,
"multiSchema" : true,
"properties" : {
"application" : "pulsar-source",
"id" : "my-tenant/my-namespace/my-source",
"instance_id" : "0"
}
}
11:47:33.259 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerStatsRecorderImpl - Pulsar client config: {
"serviceUrl" : "pulsar+ssl://my-pulsar-server:6651",
"authPluginClassName" : "org.apache.pulsar.client.impl.auth.AuthenticationToken",
"authParams" : "file:///pulsar/tokens/broker/token",
"authParamMap" : null,
"operationTimeoutMs" : 30000,
"statsIntervalSeconds" : 60,
"numIoThreads" : 1,
"numListenerThreads" : 1,
"connectionsPerBroker" : 1,
"useTcpNoDelay" : true,
"useTls" : true,
"tlsTrustCertsFilePath" : null,
"tlsAllowInsecureConnection" : true,
"tlsHostnameVerificationEnable" : false,
"concurrentLookupRequest" : 5000,
"maxLookupRequest" : 50000,
"maxLookupRedirects" : 20,
"maxNumberOfRejectedRequestPerConnection" : 50,
"keepAliveIntervalSeconds" : 30,
"connectionTimeoutMs" : 10000,
"requestTimeoutMs" : 60000,
"initialBackoffIntervalNanos" : 100000000,
"maxBackoffIntervalNanos" : 60000000000,
"listenerName" : null,
"useKeyStoreTls" : false,
"sslProvider" : null,
"tlsTrustStoreType" : "JKS",
"tlsTrustStorePath" : null,
"tlsTrustStorePassword" : null,
"tlsCiphers" : [ ],
"tlsProtocols" : [ ],
"proxyServiceUrl" : null,
"proxyProtocol" : null
}
11:47:33.418 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651]] Connected to server
11:47:33.422 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ClientCnx - [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651] Connected through proxy to target broker at my-broker:6651
11:47:33.484 [pulsar-client-io-1-1] INFO org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [null] Creating producer on cnx [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651]
11:48:33.434 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [my-topic] [null] Failed to create producer: 3 lookup request timedout after ms 30000
11:48:33.438 [pulsar-client-io-1-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0xab39f703, L:/redacted-ip-l:43806 - R:my-pulsar-server/redacted-ip-r:6651] request 3 timed out after 30000 ms
11:48:33.629 [main] INFO org.apache.pulsar.functions.LocalRunner - RuntimeSpawner quit because of
java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtMostOnceProcessor.<init>(PulsarSink.java:177) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtLeastOnceProcessor.<init>(PulsarSink.java:206) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink.open(PulsarSink.java:284) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.setupOutput(JavaInstanceRunnable.java:819) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.setup(JavaInstanceRunnable.java:224) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:246) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_252]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000
at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:821) ~[org.apache.pulsar-pulsar-client-api-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.client.impl.ProducerBuilderImpl.create(ProducerBuilderImpl.java:93) ~[org.apache.pulsar-pulsar-client-original-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkProcessorBase.createProducer(PulsarSink.java:106) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtMostOnceProcessor.<init>(PulsarSink.java:174) ~[org.apache.pulsar-pulsar-functions-instance-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
... 6 more
11:48:59.956 [function-timer-thread-5-1] ERROR org.apache.pulsar.functions.runtime.RuntimeSpawner - my-tenant/my-namespace/my-source-java.lang.RuntimeException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 3 lookup request timedout after ms 30000 Function Container is dead with exception.. restarting
As you can see, it failed to create a producer due to a TimeoutException. What are the likely causes of this error? What's the best way to further investigate this issue?
Additional info:
I have also tried the --tls-trust-cert-path /my/ca-certificates.crt option instead of --tls-allow-insecure, but got the same error.
I am able to list tenants:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
tenants list
# Output:
# "public"
# "pulsar"
# "my-topic"
But I am not able to get an OK broker health-check:
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
brokers healthcheck
# Output:
# null
# Reason: java.util.concurrent.TimeoutException
bin/pulsar-admin \
--admin-url https://localhost:8443 \
--auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationToken \
--auth-params file:///pulsar/tokens/broker/token \
--tls-allow-insecure \
brokers healthcheck
# Output:
# HTTP 500 Internal Server Error
# Reason: HTTP 500 Internal Server Error
In my case, the root cause was an expired TLS certificate.
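A quick way to test for this kind of problem is to open a TLS connection to the service and look at the certificate's expiry date. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it assumes the server certificate is otherwise trusted by the default CA bundle.

```python
import socket
import ssl
import time


def seconds_until_expiry(not_after, now):
    """not_after uses the getpeercert() format, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    return ssl.cert_time_to_seconds(not_after) - now


def check_server_cert(host, port, timeout=5.0):
    """Connect with certificate verification enabled and describe the result."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as err:
        # An expired certificate is reported here by the TLS library.
        return f"verification failed: {err.verify_message}"
    remaining = seconds_until_expiry(cert["notAfter"], time.time())
    return f"certificate valid for another {remaining / 86400:.0f} days"
```

With the names from the question, the call would be check_server_cert("my-pulsar-server", 6651). Note that options such as --tls-allow-insecure only affect the client that sets them, so a component further along the chain can still reject an expired certificate.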

Spring Cloud Stream Kafka application not generating messages with the correct Avro schema

I have an application (spring-boot-shipping-service) with a KStream that gets OrderCreatedEvent messages generated by an external producer (spring-boot-order-service). This producer uses the following schema:
order-created-event.avsc
{
"namespace" : "com.codependent.statetransfer.order",
"type" : "record",
"name" : "OrderCreatedEvent",
"fields" : [
{"name":"id","type":"int"},
{"name":"productId","type":"int"},
{"name":"customerId","type":"int"}
]
}
My KStream<Int, OrderCreatedEvent> is joined with a KTable<Int, Customer> and publishes to the order topic a new kind of message: OrderShippedEvent.
order-shipped-event.avsc
{
"namespace" : "com.codependent.statetransfer.order",
"type" : "record",
"name" : "OrderShippedEvent",
"fields" : [
{"name":"id","type":"int"},
{"name":"productId","type":"int"},
{"name":"customerName","type":"string"},
{"name":"customerAddress","type":"string"}
]
}
For some reason the new OrderShippedEvent messages aren't generated with a header application/vnd.ordershippedevent.v1+avro but application/vnd.ordercreatedevent.v1+avro.
This is the original OrderCreatedEvent in the order topic:
Key (4 bytes): +
Value (4 bytes): V?
Timestamp: 1555943926163
Partition: 0
Offset: 34
Headers: contentType="application/vnd.ordercreatedevent.v1+avro",spring_json_header_types={"contentType":"java.lang.String"}
And the produced OrderShippedEvent with the incorrect schema:
Key (4 bytes): +
Value (26 bytes): V?
JamesHill Street
Timestamp: 1555943926163
Partition: 0
Offset: 35
Headers: contentType="application/vnd.ordercreatedevent.v1+avro",spring_json_header_types={"contentType":"java.lang.String"}
I've checked the Confluent Schema Registry contents, and the order-shipped-event.avsc schema is there.
Why isn't it using the correct schema in the generated message?
Below you can see the full configuration and code of the example, which is also available on Github (https://github.com/codependent/event-carried-state-transfer/tree/avro)
In order to test it just start a Confluent Platform (v5.2.1), spring-boot-customer-service, spring-boot-order-service, spring-boot-shipping-service and execute the following curl commands:
curl -X POST http://localhost:8080/customers -d '{"id":1,"name":"James","address":"Hill Street"}' -H "content-type: application/json"
curl -X POST http://localhost:8084/orders -H "content-type: application/json" -d '{"id":1,"productId":1001,"customerId":1}'
application.yml
server:
port: 8085
spring:
application:
name: spring-boot-shipping-service
cloud:
stream:
kafka:
streams:
binder:
configuration:
default:
key:
serde: org.apache.kafka.common.serialization.Serdes$IntegerSerde
bindings:
input:
destination: customer
contentType: application/*+avro
order:
destination: order
contentType: application/*+avro
output:
destination: order
contentType: application/*+avro
schema-registry-client:
endpoint: http://localhost:8081
ShippingKStreamProcessor
interface ShippingKStreamProcessor {
@Input("input")
fun input(): KStream<Int, Customer>
@Input("order")
fun order(): KStream<String, OrderCreatedEvent>
@Output("output")
fun output(): KStream<String, OrderShippedEvent>
}
ShippingKStreamConfiguration
@StreamListener
@SendTo("output")
fun process(@Input("input") input: KStream<Int, Customer>, @Input("order") orderEvent: KStream<Int, OrderCreatedEvent>): KStream<Int, OrderShippedEvent> {
val serdeConfig = mapOf(
AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG to "http://localhost:8081")
val intSerde = Serdes.IntegerSerde()
val customerSerde = SpecificAvroSerde<Customer>()
customerSerde.configure(serdeConfig, true)
val orderCreatedSerde = SpecificAvroSerde<OrderCreatedEvent>()
orderCreatedSerde.configure(serdeConfig, true)
val orderShippedSerde = SpecificAvroSerde<OrderShippedEvent>()
orderShippedSerde.configure(serdeConfig, true)
val stateStore: Materialized<Int, Customer, KeyValueStore<Bytes, ByteArray>> =
Materialized.`as`<Int, Customer, KeyValueStore<Bytes, ByteArray>>("customer-store")
.withKeySerde(intSerde)
.withValueSerde(customerSerde)
val customerTable: KTable<Int, Customer> = input.groupByKey(Serialized.with(intSerde, customerSerde))
.reduce({ _, y -> y }, stateStore)
return (orderEvent.filter { _, value -> value is OrderCreatedEvent && value.id != 0 }
.selectKey { _, value -> value.customerId } as KStream<Int, OrderCreatedEvent>)
.join(customerTable, { orderIt, customer ->
OrderShippedEvent(orderIt.id, orderIt.productId, customer.name, customer.address)
}, Joined.with(intSerde, orderCreatedSerde, customerSerde))
.selectKey { _, value -> value.id }
}
UPDATE: I've set trace logging level for org.springframework.messaging and apparently it looks ok:
2019-04-22 23:40:39.953 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP GET http://localhost:8081/subjects/ordercreatedevent/versions/1
2019-04-22 23:40:39.971 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:39.972 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [] as "application/vnd.schemaregistry.v1+json"
2019-04-22 23:40:39.984 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:39.985 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.186 INFO 46039 --- [read-1-producer] org.apache.kafka.clients.Metadata : Cluster ID: 5Sw6sBD0TFOaximF3Or-dQ
2019-04-22 23:40:40.318 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Obtaining schema for class class com.codependent.statetransfer.order.OrderShippedEvent
2019-04-22 23:40:40.318 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Avro type detected, using schema from object
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP POST http://localhost:8081/subjects/ordershippedevent/versions
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:40.342 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [{"schema":"{\"type\":\"record\",\"name\":\"OrderShippedEvent\",\"namespace\":\"com.codependent.statetransfer.order\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"productId\",\"type\":\"int\"},{\"name\":\"customerName\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"customerAddress\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}]}"}] as "application/json"
2019-04-22 23:40:40.348 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:40.348 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : HTTP POST http://localhost:8081/subjects/ordershippedevent
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Accept=[application/json, application/*+json]
2019-04-22 23:40:40.349 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Writing [{"schema":"{\"type\":\"record\",\"name\":\"OrderShippedEvent\",\"namespace\":\"com.codependent.statetransfer.order\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"productId\",\"type\":\"int\"},{\"name\":\"customerName\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"customerAddress\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}}]}"}] as "application/json"
2019-04-22 23:40:40.361 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Response 200 OK
2019-04-22 23:40:40.362 DEBUG 46039 --- [-StreamThread-1] o.s.web.client.RestTemplate : Reading to [java.util.Map<?, ?>]
2019-04-22 23:40:40.362 DEBUG 46039 --- [-StreamThread-1] AvroSchemaRegistryClientMessageConverter : Finding correct DatumWriter for type com.codependent.statetransfer.order.OrderShippedEvent
How come the message is written with an incorrect content type header then?
UPDATE 2:
I've kept digging into the source code and found this:
KafkaStreamsMessageConversionDelegate correctly converts and determines the right header values, as seen in the logs above.
However in the serializeOnOutbound method we can find that it returns to the Kafka API only the payload, so the headers aren't taken into account:
return
messageConverter.toMessage(message.getPayload(),
messageHeaders).getPayload();
Moving forward in the record processing org.apache.kafka.streams.processor.internals.SinkNode.process() accesses the headers present in the context, which incorrectly contain application/vnd.ordercreatedevent.v1+avro instead of application/vnd.ordershippedevent.v1+avro (?):
collector.send(topic, key, value, context.headers(), timestamp, keySerializer, valSerializer, partitioner);
UPDATE 3:
Steps to reproduce:
Download and start Confluent 5.2.1
confluent start
Start the applications spring-boot-order-service, spring-boot-customer-service, spring-boot-shipping-service
Create a customer curl -X POST http://localhost:8080/customers -d '{"id":1,"name":"John","address":"Some Street"}' -H "content-type: application/json"
Create an order that will be joined with the customer: curl -X POST http://localhost:8084/orders -H "content-type: application/json" -d '{"id":1,"productId":1,"customerId":1}'
ShippingKStreamConfiguration's process() will create a KTable for the Customer and a state store (customer-store). Besides, it will join the order stream with the customer KTable to transform an OrderCreatedEvent into an OrderShippedEvent.
You can check that the newly created OrderShippedEvent message added to the order topic has an incorrect header. This can be seen either in the Confluent Control Center (localhost:9092 -> topics -> order) or running kafkacat:
$> kafkacat -b localhost:9092 -t order -C \
-f '\nKey (%K bytes): %k
Value (%S bytes): %s
Timestamp: %T
Partition: %p
Offset: %o
Headers: %h\n'
@codependent It is indeed an issue that we need to address in the binder, and we will fix it soon. In the meantime, as a workaround, can you make your processor not return a KStream, but rather do the sending in the method itself? You can call to(TopicNameExtractor) on the currently returned KStream. TopicNameExtractor gives you access to the record context, which you can use to set the content type manually.
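The behaviour found in UPDATE 2, and the idea behind the workaround, can be illustrated with a toy model. This is plain Python, not Kafka Streams or Spring Cloud Stream code: Record, to_message, serialize_on_outbound and sink_send are hypothetical stand-ins named after the steps described above.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    value: str
    headers: dict = field(default_factory=dict)


def content_type_for(event_name):
    return f"application/vnd.{event_name.lower()}.v1+avro"


def to_message(event_name):
    # Stand-in for the converter: it builds the payload and the right headers.
    return Record(f"<{event_name} bytes>",
                  {"contentType": content_type_for(event_name)})


def serialize_on_outbound(event_name):
    # Only the payload is handed back, so the converter's headers are lost here.
    return to_message(event_name).value


def sink_send(context_headers, value, content_type=None):
    # The sink starts from the headers held by the processing context,
    # which are those of the record that was read.
    headers = dict(context_headers)
    if content_type is not None:
        # Workaround idea: set the content type by hand before sending.
        headers["contentType"] = content_type
    return Record(value, headers)


incoming = to_message("OrderCreatedEvent")
payload = serialize_on_outbound("OrderShippedEvent")

inherited = sink_send(incoming.headers, payload)
overridden = sink_send(incoming.headers, payload,
                       content_type_for("OrderShippedEvent"))

print(inherited.headers["contentType"])
print(overridden.headers["contentType"])
```

In the model, the record sent without an override keeps the ordercreatedevent content type of the input record, which matches the header seen in the order topic.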

Unable to write data to Mongo from Spark 2.2.0 Structured Streaming?

I have the following code and I am unable to write data to Mongo with it. I don't even see the database or collection names being populated in MongoDB. Something seems to be wrong. There are no exceptions when I run this code.
private SparkSession sparkSession;
SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(Configuration.getConfig().getString("spark.master"));
sparkConf.set("spark.mongodb.input.uri", "mongodb://localhost/analytics.counters");
sparkConf.set("spark.mongodb.output.uri", "mongodb://localhost/analytics.counters");
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
sparkSession.sparkContext().setLogLevel("INFO");
this.sparkSession = sparkSession;
MongoConnector mongoConnector = MongoConnector.apply(sparkSession.sparkContext());
WriteConfig writeConfig = getMongoWriteConfig(sparkSession, "hello");
ReadConfig readConfig = getMongoReadConfig(sparkSession, "hello");
Dataset<String> jsonDS = newDS.select(to_json(struct(col("*")))).as(Encoders.STRING());
Dataset<String> dataset = jsonDS
.map(new MapFunction<String, Boolean>() {
@Override
public Boolean call(String kafkaPayload) throws Exception {
System.out.println(kafkaPayload);
Document jsonDocument = Document.parse(kafkaPayload);
String id = jsonDocument.getString("ID");
jsonDocument.put("_id", id);
return mongoConnector.withCollectionDo(writeConfig, Document.class, new Function<MongoCollection<Document>, Boolean>() {
@Override
public Boolean call(MongoCollection<Document> collection) throws Exception {
return collection.replaceOne(and(eq("_id", id), lt("TIMESTAMP", jsonDocument.getString("TIMESTAMP"))),
jsonDocument, new UpdateOptions().upsert(true)).wasAcknowledged();
}
});
}
}, Encoders.BOOLEAN());
StreamingQuery query1 = dataset
.writeStream()
.trigger(Trigger.ProcessingTime(1000))
.foreach(new KafkaSink("metrics"))
.option("checkpointLocation", getCheckpointPath(CheckpointPath.LOCAL_WRITE) + "/metrics")
.start();
query1.awaitTermination();
private static ReadConfig getMongoReadConfig(SparkSession sparkSession, String collectionName){
ReadConfig readConfig = ReadConfig.create(sparkSession);
Map<String, String> readOverrides = new HashMap<String, String>();
readOverrides.put("readConcern.level", "majority");
readConfig.withOptions(readOverrides);
return readConfig;
}
private static WriteConfig getMongoWriteConfig(SparkSession sparkSession, String collectionName) {
WriteConfig writeConfig = WriteConfig.create(sparkSession);
Map<String, String> writeOverrides = new HashMap<String, String>();
writeOverrides.put("writeConcern.w", "majority");
writeConfig.withOptions(writeOverrides);
return writeConfig;
}
I use spark-submit and pass in the following parameters:
spark-submit --master local[*] \
--driver-memory 4g \
--executor-memory 2g \
--class com.hello.stream.app.Hello \
--conf "spark.mongodb.input.uri=mongodb://localhost/analytics.counters" \
--conf "spark.mongodb.output.uri=mongodb://localhost/analytics.counters" \
build/libs/hello-stream.jar
Here is the list of jars I use:
def sparkVersion = '2.2.0'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-streaming-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', version: sparkVersion
compile group: 'org.apache.kafka', name: 'kafka-clients', version: '0.10.0.1'
compile group: 'org.mongodb.spark', name: 'mongo-spark-connector_2.11', version: sparkVersion
compile 'org.mongodb:mongodb-driver:3.0.4'
When I run my Job I get the following output (shorter version of my info log)
17/09/12 10:16:12 INFO MongoClientCache: Closing MongoClient: [localhost:27017]
17/09/12 10:16:12 INFO connection: Closed connection [connectionId{localValue:2, serverValue:2897}] to localhost:27017 because the pool has been closed.
17/09/12 10:16:18 INFO StreamExecution: Streaming query made progress: {
"id" : "ddc38876-c44d-4370-a2e0-3c96974e6f24",
"runId" : "2ae73227-b9e1-4908-97d6-21d9067994c7",
"name" : null,
"timestamp" : "2017-09-12T17:16:18.001Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 2,
"triggerExecution" : 2
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[hello]]",
"startOffset" : {
"pn_ingestor_json" : {
"0" : 826404
}
},
"endOffset" : {
"pn_ingestor_json" : {
"0" : 826404
}
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ForeachSink#7656801e"
}
}
... and it keeps printing "INFO StreamExecution: Streaming query made progress", but I don't see any database or collection being created in Mongo.
You cannot use map in that way with a Structured Stream. I believe you should use the foreach method instead.
There is a Scala example in the repo, SparkStructuredStreams.scala, which may be of some help.

Spark + Flume : "Unable to create Rpc client"

I am trying to connect a Scala spark-shell with a Flume agent.
I launch the shell:
./bin/spark-shell \
--jars /Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume-sink_2.10-1.6.0.jar,\
/Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume_2.10-1.6.0.jar,\
/Users/romain/Informatique/zoo/spark-1.6.0-bin-hadoop2.4/lib/spark-streaming-flume-assembly_2.10-1.6.0.jar
Then I run a Scala script that listens on port 10000 of localhost:
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
 
 
val host = "localhost"
val port = 10000
val batchInterval = Milliseconds(5000)
 
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
val ssc = new StreamingContext(sc, batchInterval)
 
// Create a flume stream that polls the Spark Sink running in a Flume agent
val stream = FlumeUtils.createPollingStream(ssc, host, port)
 
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received " + cnt + " flume events." ).print()
 
ssc.start()
ssc.awaitTermination()
 
Then I configure and start a Flume agent:
1) Configuration: the source runs tail -f on a file that another running script appends to (I don't think the details of that script matter here)
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = spark
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /Users/romain/Informatique/notebooks/spark_scala/velib/logs/trajets.csv
agent1.sources.source1.channels = channel1
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 2000000
agent1.channels.channel1.transactionCapacity = 1000000
agent1.sinks = avroSink
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = channel1
agent1.sinks.avroSink.hostname = localhost
agent1.sinks.avroSink.port = 10000
2) Start:
./bin/flume-ng agent --conf conf --conf-file ./conf/avro_velib.conf --name agent1 -Dflume.root.logger=INFO,console
Everything goes fine until this error:
2017-02-01 15:09:55,688 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:173)] Starting Sink avroSink
2017-02-01 15:09:55,688 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:289)] Starting RpcSink avroSink { host: localhost, port: 10000 }...
2017-02-01 15:09:55,688 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:184)] Starting Source source1
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.ExecSource.start(ExecSource.java:169)] Exec source starting with command:tail -f /Users/romain/Informatique/notebooks/spark_scala/velib/logs/trajets.csv
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SINK, name: avroSink: Successfully registered new MBean.
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SINK, name: avroSink started
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:206)] Rpc sink avroSink: Building RpcClient with hostname: localhost, port: 10000
2017-02-01 15:09:55,689 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:126)] Attempting to create Avro Rpc client.
2017-02-01 15:09:55,691 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:120)] Monitored counter group for type: SOURCE, name: source1: Successfully registered new MBean.
2017-02-01 15:09:55,692 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:96)] Component type: SOURCE, name: source1 started
2017-02-01 15:09:55,710 (lifecycleSupervisor-1-1) [WARN - org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:634)] Using default maxIOWorkers
2017-02-01 15:10:15,802 (lifecycleSupervisor-1-1) [WARN - org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:294)] Unable to create Rpc client using hostname: localhost, port: 10000
org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 10000 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:89)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:211)
at org.apache.flume.sink.AbstractRpcSink.start(AbstractRpcSink.java:292)
at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:10000
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:168)
... 16 more
Any ideas are welcome.
You should start the "listener" (the Spark app in your case) first and then launch the "writer" (flume-ng).
Instead of localhost, use your machine name (the one used in the Spark configuration) in both the Scala script and the Avro sink configuration:
agent1.sinks.avroSink.hostname = <Machine name>
agent1.sinks.avroSink.port = 10000
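The "Error connecting to localhost/127.0.0.1:10000" in the log means that nothing was accepting TCP connections on that port when the Avro sink started. A quick way to confirm whether the listener is up before launching flume-ng is to probe the port. The following is only a minimal sketch using the Python standard library; port_is_open and wait_for_port are hypothetical helper names, not part of Spark or Flume. If the port never opens even with the Spark job running, one thing worth checking is the pairing of receiver and sink: as far as the Spark 1.6 Flume integration guide describes it, FlumeUtils.createPollingStream pulls from a SparkSink running inside the agent, while an avro sink pushes to a receiver created with FlumeUtils.createStream.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unresolvable hosts.
        return False


def wait_for_port(host, port, attempts=10, delay=1.0):
    """Probe the port up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False
```

For the setup in the question, calling wait_for_port("localhost", 10000) after starting the Spark script and before starting the agent would show whether anything is listening for the sink to connect to.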

Spark REST API: Failed to find data source: com.databricks.spark.csv

I have a PySpark file stored on S3. I am trying to run it using the Spark REST API.
I am running the following command:
curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
"action" : "CreateSubmissionRequest",
"appArgs" : [ "testing.py"],
"appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
"clientSparkVersion" : "1.6.1",
"environmentVariables" : {
"SPARK_ENV_LOADED" : "1"
},
"mainClass" : "org.apache.spark.deploy.SparkSubmit",
"sparkProperties" : {
"spark.driver.supervise" : "false",
"spark.app.name" : "Simple App",
"spark.eventLog.enabled": "true",
"spark.submit.deployMode" : "cluster",
"spark.master" : "spark://<ip-address>:6066",
"spark.jars" : "spark-csv_2.10-1.4.0.jar",
"spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
}
}'
and the testing.py file contains this snippet:
myContext = SQLContext(sc)
format = "com.databricks.spark.csv"
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue","").save(outLocation)
But on this line:
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
I get exception:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
Among the things I tried, I logged into the <ip-address> machine and ran this command:
./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?
(Posted on behalf of the OP).
I first added spark-csv_2.10-1.4.0.jar on the driver and worker machines, and added:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar",
Then I got the following error:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat
And then I added commons-csv-1.4.jar on both machines and added:
"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
And that solved my problem.
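The two classpath properties above take the same colon-separated list of jar paths, so they can be generated from one list when building the "sparkProperties" part of the submission request. This is only a sketch in plain Python; with_extra_classpath is a hypothetical helper name, the jar paths are the placeholders from the post, and the ':' separator assumes Linux or macOS hosts.

```python
import json


def with_extra_classpath(spark_properties, jar_paths):
    """Return a copy of spark_properties with driver and executor classpaths set."""
    classpath = ":".join(jar_paths)  # ':' separates entries on Linux/macOS
    updated = dict(spark_properties)  # leave the caller's dict untouched
    updated["spark.driver.extraClassPath"] = classpath
    updated["spark.executor.extraClassPath"] = classpath
    return updated


# Placeholder values taken from the post; adjust them to the real cluster.
base_properties = {
    "spark.app.name": "Simple App",
    "spark.submit.deployMode": "cluster",
}
jars = [
    "/absolute/path/to/spark-csv_2.10-1.4.0.jar",
    "/absolute/path/to/commons-csv-1.4.jar",
]
properties = with_extra_classpath(base_properties, jars)
print(json.dumps(properties, indent=2))
```

The jars still have to exist at those paths on the driver and every worker machine, as described above; the helper only keeps the two properties in sync.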