Spark Streaming unable to read data from Kafka topic - Scala

I am a beginner with Kafka and I am trying to read data from a Kafka topic using Spark in Scala.
My main function is:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .appName("testKafka")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sample_topic")
    .load()

  df.printSchema()

  df
    .writeStream
    .outputMode("append")
    .format("com.databricks.spark.csv")
    .option("checkpointLocation", "/home/wintersoldier/Desktop/checkpoint")
    .option("path", "/home/wintersoldier/Documents/tookitaki/sparkTest/src/main/scala/kafka_out/outCSV")
    .start()
    .awaitTermination()

  spark.stop()
  spark.close()
}
Then I send messages through the Kafka console producer terminal:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sample_topic
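To confirm that the messages actually reach the topic, the console consumer can be run in a separate terminal (depending on the Kafka version, the connection flag is either --bootstrap-server or, on older releases, --zookeeper):
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sample_topic --from-beginning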
The application runs without any error, but no CSV file with the topic's message data is created.
Instead of writing to CSV, I also tried printing to the terminal with df.format("console"), but I still could not get any output.
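For reference, the console sink in Structured Streaming is attached through writeStream rather than called on the DataFrame itself; a minimal debugging sketch would look roughly like this:
// print each micro-batch to stdout instead of writing CSV
df.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()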
My Kafka version is: kafka_2.11-0.9.0.0
My build.sbt contains:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "2.4.5",
  "org.apache.spark" %% "spark-mllib"           % "2.4.5",
  "org.apache.spark" %% "spark-sql"             % "2.4.5",
  "org.apache.spark" %% "spark-hive"            % "2.4.5",
  "org.apache.spark" %% "spark-streaming"       % "2.4.5",
  "org.apache.spark" %% "spark-graphx"          % "2.4.5",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
  "org.apache.spark" %% "spark-sql-kafka-0-10"  % "2.4.7"
)
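For what it's worth, the Structured Streaming Kafka source lives in the spark-sql-kafka-0-10 module, and its version is normally kept in line with the Spark version itself; a trimmed sketch of the dependency list under that assumption (Spark 2.4.5 on Scala 2.11) would be:
val sparkVersion = "2.4.5"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  // Structured Streaming Kafka source/sink; the DStream-era spark-streaming-kafka artifact is not needed here
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
)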
UPDATE:
The terminal logs:
[info] Running com.test.kafka.testKafka
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/11/29 16:35:52 INFO SparkContext: Running Spark version 2.4.5
20/11/29 16:35:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/29 16:35:52 INFO SparkContext: Submitted application: testKafka
20/11/29 16:35:52 INFO SecurityManager: Changing view acls to: wintersoldier
20/11/29 16:35:52 INFO SecurityManager: Changing modify acls to: wintersoldier
20/11/29 16:35:52 INFO SecurityManager: Changing view acls groups to:
20/11/29 16:35:52 INFO SecurityManager: Changing modify acls groups to:
20/11/29 16:35:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(wintersoldier); groups with view permissions: Set(); users with modify permissions: Set(wintersoldier); groups with modify permissions: Set()
20/11/29 16:35:53 INFO Utils: Successfully started service 'sparkDriver' on port 46047.
20/11/29 16:35:53 INFO SparkEnv: Registering MapOutputTracker
20/11/29 16:35:53 INFO SparkEnv: Registering BlockManagerMaster
20/11/29 16:35:53 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/11/29 16:35:53 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/11/29 16:35:53 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8579f1c4-621e-428c-ba2f-aaa457a9b1d4
20/11/29 16:35:53 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/11/29 16:35:53 INFO SparkEnv: Registering OutputCommitCoordinator
20/11/29 16:35:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/11/29 16:35:53 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://anonymouspirates:4040
20/11/29 16:35:53 INFO Executor: Starting executor ID driver on host localhost
20/11/29 16:35:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35529.
20/11/29 16:35:53 INFO NettyBlockTransferService: Server created on anonymouspirates:35529
20/11/29 16:35:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/11/29 16:35:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, anonymouspirates, 35529, None)
20/11/29 16:35:53 INFO BlockManagerMasterEndpoint: Registering block manager anonymouspirates:35529 with 366.3 MB RAM, BlockManagerId(driver, anonymouspirates, 35529, None)
20/11/29 16:35:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, anonymouspirates, 35529, None)
20/11/29 16:35:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, anonymouspirates, 35529, None)
20/11/29 16:35:54 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/wintersoldier/Documents/tookitaki/sparkTest/spark-warehouse').
20/11/29 16:35:54 INFO SharedState: Warehouse path is 'file:/home/wintersoldier/Documents/tookitaki/sparkTest/spark-warehouse'.
20/11/29 16:35:54 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
20/11/29 16:35:56 INFO MicroBatchExecution: Starting [id = 3918834a-7b1d-41d6-9069-60bfb807019f, runId = 9a8756a8-de0e-4bc1-8e37-91fc02bfaeb0]. Use file:///home/wintersoldier/Desktop/checkpoint to store the query checkpoint.
20/11/29 16:35:56 INFO MicroBatchExecution: Using MicroBatchReader [KafkaV2[Subscribe[sample_topic]]] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider#ebf74a2]
20/11/29 16:35:56 INFO MicroBatchExecution: Starting new streaming query.
20/11/29 16:35:56 INFO MicroBatchExecution: Stream started from {}
20/11/29 16:35:56 INFO ConsumerConfig: ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [localhost:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = spark-kafka-source-dba9525c-4dfe-4cde-bcc7-0d54fde3e897-810093942-driver-0
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 1
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
20/11/29 16:35:56 INFO AppInfoParser: Kafka version : 2.0.0
20/11/29 16:35:56 INFO AppInfoParser: Kafka commitId : 3402a8361b734732

Related

Kafka Spark Structured Streaming with SASL_SSL authentication

I have been trying to use the Spark Structured Streaming API to connect to a Kafka cluster secured with SASL_SSL. I have passed the jaas.conf file to the executors, but it seems I could not set the keystore and truststore authentication values.
I tried passing the values as mentioned in this Spark link,
and also tried passing them through the code as in this link.
Still no luck.
Here is the log:
20/02/28 10:00:53 INFO streaming.StreamExecution: Starting [id = e176f5e7-7157-4df5-93ce-1e267bae6125, runId = 03225a69-ec00-45d9-8092-1467da34980f]. Use flight/checkpoint to store the query checkpoint.
20/02/28 10:00:53 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
20/02/28 10:00:53 INFO spark.SparkContext: Invoking stop() from shutdown hook
20/02/28 10:00:53 INFO server.AbstractConnector: Stopped Spark#46202f7b{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
20/02/28 10:00:53 INFO consumer.ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [broker1:9093, broker2:9093]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id =
ssl.endpoint.identification.algorithm = null
max.poll.records = 1
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
group.id = spark-kafka-source-93d170e9-977c-40fc-9e5d-790d253fcff5-409016337-driver-0
retry.backoff.ms = 100
ssl.secure.random.implementation = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = SASL_SSL
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = earliest
20/02/28 10:00:53 INFO ui.SparkUI: Stopped Spark web UI at http://<Server>:41037
20/02/28 10:00:53 ERROR streaming.StreamExecution: Query [id = e176f5e7-7157-4df5-93ce-1e267bae6125, runId = 03225a69-ec00-45d9-8092-1467da34980f] terminated with error
org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:702)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:557)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:540)
at org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:297)
at org.apache.spark.sql.kafka010.KafkaOffsetReader.<init>(KafkaOffsetReader.scala:78)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.createSource(KafkaSourceProvider.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:243)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:158)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:155)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:155)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:153)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:153)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:147)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:276)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:86)
at org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:70)
at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:83)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:623)
... 22 more
Caused by: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Unknown Source)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Unknown Source)
at com.sun.security.auth.module.Krb5LoginModule.login(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at javax.security.auth.login.LoginContext.invoke(Unknown Source)
at javax.security.auth.login.LoginContext.access$000(Unknown Source)
at javax.security.auth.login.LoginContext$4.run(Unknown Source)
at javax.security.auth.login.LoginContext$4.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(Unknown Source)
at javax.security.auth.login.LoginContext.login(Unknown Source)
at org.apache.kafka.common.security.authenticator.AbstractLogin.login(AbstractLogin.java:69)
at org.apache.kafka.common.security.kerberos.KerberosLogin.login(KerberosLogin.java:110)
at org.apache.kafka.common.security.authenticator.LoginManager.<init>(LoginManager.java:46)
at org.apache.kafka.common.security.authenticator.LoginManager.acquireLoginManager(LoginManager.java:68)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:78)
... 25 more
20/02/28 10:00:53 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
20/02/28 10:00:53 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
20/02/28 10:00:53 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
20/02/28 10:00:53 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/02/28 10:00:53 INFO memory.MemoryStore: MemoryStore cleared
20/02/28 10:00:53 INFO storage.BlockManager: BlockManager stopped
20/02/28 10:00:53 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
20/02/28 10:00:53 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/02/28 10:00:53 INFO spark.SparkContext: Successfully stopped SparkContext
20/02/28 10:00:53 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
20/02/28 10:00:53 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/02/28 10:00:53 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://nameservice1/user/hasif.subair/.sparkStaging/application_1582866369627_0029
20/02/28 10:00:53 INFO util.ShutdownHookManager: Shutdown hook called
20/02/28 10:00:53 INFO util.ShutdownHookManager: Deleting directory /yarn/nm/usercache/hasif.subair/appcache/application_1582866369627_0029/spark-5addfec0-a99f-49e1-b9d1-671c331efb40
Code
val rawData = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093, broker2:9093")
  .option("subscribe", "hasif_test")
  .option("spark.executor.extraJavaOptions", "-Djava.security.auth.login.config=jaas.conf")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("ssl.truststore.location", "/etc/connect_ts/truststore.jks")
  .option("ssl.truststore.password", "<PASSWORD>")
  .option("ssl.keystore.location", "/etc/connect_ts/keystore.jks")
  .option("ssl.keystore.password", "<PASSWORD>")
  .option("ssl.key.password", "<PASSWORD>")
  .load()

rawData.writeStream.option("path", "flight/output")
  .option("checkpointLocation", "flight/checkpoint").format("csv").start()
spark-submit
spark2-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.keytab=hasif.subair.keytab \
  --conf spark.yarn.principal=hasif.subair@TEST.ABC \
  --files /home/hasif.subair/jaas.conf \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
  --conf "spark.kafka.clusters.hasif.ssl.truststore.location=/etc/ts/truststore.jks" \
  --conf "spark.kafka.clusters.hasif.ssl.truststore.password=testcluster" \
  --conf "spark.kafka.clusters.hasif.ssl.keystore.location=/etc/ts/keystore.jks" \
  --conf "spark.kafka.clusters.hasif.ssl.keystore.password=testcluster" \
  --conf "spark.kafka.clusters.hasif.ssl.key.password=testcluster" \
  --jars spark-sql-kafka-0-10_2.11-2.2.0.jar \
  --class TestApp test_app_2.11-0.1.jar
jaas.conf
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  principal="hasif.subair@TEST.ABC"
  useKeyTab=true
  serviceName="kafka"
  keyTab="hasif.subair.keytab"
  client=true;
};
Any help will be deeply appreciated.
Kafka’s own configurations can be set via DataStreamReader.option with kafka. prefix, e.g.
val clusterName = "hasif"
stream.option(s"spark.kafka.clusters.${clusterName}.kafka.ssl.keystore.location", "/etc/connect_ts/keystore.jks")
Use kafka.ssl.truststore.location instead of ssl.truststore.location.
Similarly, you can prefix the other Kafka properties with kafka. and try again.
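As a concrete sketch of the kafka. prefix applied directly to the reader options (reusing the paths and placeholders from the question):
val rawData = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093, broker2:9093")
  .option("subscribe", "hasif_test")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.truststore.location", "/etc/connect_ts/truststore.jks")
  .option("kafka.ssl.truststore.password", "<PASSWORD>")
  .option("kafka.ssl.keystore.location", "/etc/connect_ts/keystore.jks")
  .option("kafka.ssl.keystore.password", "<PASSWORD>")
  .option("kafka.ssl.key.password", "<PASSWORD>")
  .load()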
I suspect the SSL values are not getting picked up. As you can see in your log, the values are shown as null:
ssl.truststore.location = null
ssl.truststore.password = null
ssl.keystore.password = null
ssl.keystore.location = null
If the values were set properly, they would show up as:
ssl.truststore.location = /etc/connect_ts/truststore.jks
ssl.truststore.password = [hidden]
ssl.keystore.password = [hidden]
ssl.keystore.location = /etc/connect_ts/keystore.jks

Problem with sending messages from Flume to Kafka

I have two hosts in two different VMs:
Flume on a CentOS 7 host and Kafka on a Cloudera host.
I connected the two and configured Kafka on the Cloudera host as the sink in Flume, as shown in the configuration below:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/Bureau/V1/outputCV
a1.sources.r1.fileHeader = true
a1.sources.r1.interceptors = timestampInterceptor
a1.sources.r1.interceptors.timestampInterceptor.type = timestamp
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = flume-topic
a1.sinks.k1.kafka.bootstrap.servers = 192.168.5.129:9090,192.168.5.129:9091,192.168.5.129:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
I get this error when Flume tries to send messages to Kafka:
[root@localhost V1]# /root/flume/bin/flume-ng agent --name a1 --conf-file /root/Bureau/V1/flumeconf/flumetest.conf
Warning: No configuration directory set! Use --conf <dir> to override.
Warning: JAVA_HOME is not set!
Info: Including Hadoop libraries found via (/root/hadoop/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/bin/java -Xmx20m -cp '/root/flume/lib/*:/root/hadoop/etc/hadoop:/root/hadoop/share/hadoop/common/lib/*:/root/hadoop/share/hadoop/common/*:/root/hadoop/share/hadoop/hdfs:/root/hadoop/share/hadoop/hdfs/lib/*:/root/hadoop/share/hadoop/hdfs/*:/root/hadoop/share/hadoop/mapreduce/lib/*:/root/hadoop/share/hadoop/mapreduce/*:/root/hadoop/share/hadoop/yarn:/root/hadoop/share/hadoop/yarn/lib/*:/root/hadoop/share/hadoop/yarn/*:/lib/*' -Djava.library.path=:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib org.apache.flume.node.Application --name a1 --conf-file /root/Bureau/V1/flumeconf/flumetest.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/flume/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-05-07 19:04:41,363 INFO node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
2019-05-07 19:04:41,366 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/root/Bureau/V1/flumeconf/flumetest.conf
2019-05-07 19:04:41,374 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:c1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Added sinks: k1 Agent: a1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:c1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:r1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:c1
2019-05-07 19:04:41,375 INFO conf.FlumeConfiguration: Processing:k1
2019-05-07 19:04:41,375 WARN conf.FlumeConfiguration: Agent configuration for 'a1' has no configfilters.
2019-05-07 19:04:41,392 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [a1]
2019-05-07 19:04:41,392 INFO node.AbstractConfigurationProvider: Creating channels
2019-05-07 19:04:41,397 INFO channel.DefaultChannelFactory: Creating instance of channel c1 type memory
2019-05-07 19:04:41,400 INFO node.AbstractConfigurationProvider: Created channel c1
2019-05-07 19:04:41,401 INFO source.DefaultSourceFactory: Creating instance of source r1, type spooldir
2019-05-07 19:04:41,415 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: org.apache.flume.sink.kafka.KafkaSink
2019-05-07 19:04:41,420 INFO kafka.KafkaSink: Using the static topic flume-topic. This may be overridden by event headers
2019-05-07 19:04:41,426 INFO node.AbstractConfigurationProvider: Channel c1 connected to [r1, k1]
2019-05-07 19:04:41,432 INFO node.Application: Starting new configuration:{ sourceRunners:{r1=EventDrivenSourceRunner: { source:Spool Directory source r1: { spoolDir: /root/Bureau/V1/outputCV } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#492cf580 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
2019-05-07 19:04:41,432 INFO node.Application: Starting Channel c1
2019-05-07 19:04:41,492 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
2019-05-07 19:04:41,492 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c1 started
2019-05-07 19:04:41,492 INFO node.Application: Starting Sink k1
2019-05-07 19:04:41,494 INFO node.Application: Starting Source r1
2019-05-07 19:04:41,494 INFO source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /root/Bureau/V1/outputCV
2019-05-07 19:04:41,525 INFO producer.ProducerConfig: ProducerConfig values:
acks = 1
batch.size = 16384
bootstrap.servers = [192.168.5.129:9090, 192.168.5.129:9091, 192.168.5.129:9092]
buffer.memory = 33554432
client.id =
compression.type = snappy
connections.max.idle.ms = 540000
enable.idempotence = false
interceptor.classes = []
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
linger.ms = 1
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 0
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = null
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
2019-05-07 19:04:41,536 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: r1: Successfully registered new MBean.
2019-05-07 19:04:41,536 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: r1 started
2019-05-07 19:04:41,591 INFO utils.AppInfoParser: Kafka version : 2.0.1
2019-05-07 19:04:41,591 INFO utils.AppInfoParser: Kafka commitId : fa14705e51bd2ce5
2019-05-07 19:04:41,592 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
2019-05-07 19:04:41,592 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k1 started
2019-05-07 19:05:01,780 INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
2019-05-07 19:05:01,781 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /root/Bureau/V1/outputCV/cv.txt to /root/Bureau/V1/outputCV/cv.txt.COMPLETED
2019-05-07 19:05:03,673 WARN clients.NetworkClient: [Producer clientId=producer-1] Connection to node -2 could not be established. Broker may not be available.
2019-05-07 19:05:03,776 WARN clients.NetworkClient: [Producer clientId=producer-1] Connection to node -2 could not be established. Broker may not be available.
2019-05-07 19:05:03,856 WARN clients.NetworkClient: [Producer clientId=producer-1] Connection to node -2 could not be established. Broker may not be available.
2019-05-07 19:05:03,910 INFO clients.Metadata: Cluster ID: F_Byx5toQju8jaLb3zFwAA
2019-05-07 19:05:04,084 WARN clients.NetworkClient: [Producer clientId=producer-1] Error connecting to node quickstart.cloudera:9092 (id: 0 rack: null)
java.io.IOException: Can't resolve address: quickstart.cloudera:9092
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:235)
at org.apache.kafka.common.network.Selector.connect(Selector.java:214)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:266)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:233)
... 7 more
2019-05-07 19:05:04,140 WARN clients.NetworkClient: [Producer clientId=producer-1] Error connecting to node quickstart.cloudera:9092 (id: 0 rack: null)
java.io.IOException: Can't resolve address: quickstart.cloudera:9092
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:235)
at org.apache.kafka.common.network.Selector.connect(Selector.java:214)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:266)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:233)
... 7 more
2019-05-07 19:05:04,231 WARN clients.NetworkClient: [Producer clientId=producer-1] Error connecting to node quickstart.cloudera:9092 (id: 0 rack: null)
java.io.IOException: Can't resolve address: quickstart.cloudera:9092
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:235)
at org.apache.kafka.common.network.Selector.connect(Selector.java:214)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:265)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:266)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.apache.kafka.common.network.Selector.doConnect(Selector.java:233)
... 7 more
2019-05-07 19:05:04,393 WARN clients.NetworkClient: [Producer clientId=producer-1] Error connecting to node quickstart.cloudera:9092 (id: 0 rack: null)
The "Error connecting to node" message shows up only after sending messages.
Can anyone help me solve this problem?
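Judging from the "Can't resolve address: quickstart.cloudera:9092" lines above, the broker advertises a hostname that the Flume VM cannot resolve. One possible workaround, assuming the Cloudera VM's address is 192.168.5.129 as in the bootstrap list, is to map that hostname on the Flume host:
# /etc/hosts on the Flume (CentOS 7) host
192.168.5.129   quickstart.cloudera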

Kafka-Spark Batch Streaming: WARN clients.NetworkClient: Bootstrap broker disconnected

I am trying to write the rows of a DataFrame into a Kafka topic. The Kafka cluster is Kerberized and I am providing the jaas.conf through the --conf arguments to be able to authenticate and connect to the cluster. Below is my code:
object app {
  val conf = new SparkConf().setAppName("Kerberos kafka ")
  val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  System.setProperty("java.security.auth.login.config", "path to jaas.conf")
  spark.sparkContext.setLogLevel("ERROR")

  def main(args: Array[String]): Unit = {
    val test = spark.sql("select * from testing.test")
    test.show()

    println("publishing to kafka...")
    val test_final = test.selectExpr("cast(to_json(struct(*)) as string) AS value")
    test_final.show()

    test_final.write.format("kafka")
      .option("kafka.bootstrap.servers", "XXXXXXXXX:9093")
      .option("topic", "test")
      .option("security.protocol", "SASL_SSL")
      .option("sasl.kerberos.service.name", "kafka")
      .save()
  }
}
When I run the above code, it fails with this error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
When I look into the error logs on the executors, I see this:
18/08/20 22:06:05 INFO producer.ProducerConfig: ProducerConfig values:
compression.type = none
metric.reporters = []
metadata.max.age.ms = 300000
metadata.fetch.timeout.ms = 60000
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
bootstrap.servers = [xxxxxxxxx:9093]
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
buffer.memory = 33554432
timeout.ms = 30000
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.keystore.type = JKS
ssl.trustmanager.algorithm = PKIX
block.on.buffer.full = false
ssl.key.password = null
max.block.ms = 60000
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
ssl.truststore.password = null
max.in.flight.requests.per.connection = 5
metrics.num.samples = 2
client.id =
ssl.endpoint.identification.algorithm = null
ssl.protocol = TLS
request.timeout.ms = 30000
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
acks = 1
batch.size = 16384
ssl.keystore.location = null
receive.buffer.bytes = 32768
ssl.cipher.suites = null
ssl.truststore.type = JKS
security.protocol = PLAINTEXT
retries = 0
max.request.size = 1048576
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location = null
ssl.keystore.password = null
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
send.buffer.bytes = 131072
linger.ms = 0
18/08/20 22:06:05 INFO utils.AppInfoParser: Kafka version : 0.9.0-kafka-2.0.2
18/08/20 22:06:05 INFO utils.AppInfoParser: Kafka commitId : unknown
18/08/20 22:06:05 INFO datasources.FileScanRDD: Reading File path: hdfs://nameservice1/user/test5/dt=2017-08-04/5a8bb121-3cab-4bed-a32b-9d0fae4a4e8b.parquet, range: 0-142192, partition values: [2017-08-04]
18/08/20 22:06:05 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
18/08/20 22:06:05 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 33.9 KB, free 5.2 GB)
18/08/20 22:06:05 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 224 ms
18/08/20 22:06:05 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 472.2 KB, free 5.2 GB)
18/08/20 22:06:06 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:06 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:07 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:07 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:07 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
18/08/20 22:06:08 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 4)
18/08/20 22:06:08 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:08 INFO datasources.FileScanRDD: Reading File path: hdfs://nameservice1/user/test5/dt=2017-08-10/2175e5d9-e969-41e9-8aa2-f329b5df06bf.parquet, range: 0-77484, partition values: [2017-08-10]
18/08/20 22:06:08 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:09 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:09 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:10 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
18/08/20 22:06:10 WARN clients.NetworkClient: Bootstrap broker xxxxxxxxx:9093:9093 disconnected
In the above log, I see three conflicting entries:
security.protocol = PLAINTEXT
sasl.kerberos.service.name = null
INFO utils.AppInfoParser: Kafka version : 0.9.0-kafka-2.0.2
I am setting the security.protocol and sasl.kerberos.service.name values in my test_final.write..... Does that mean the configs are not being passed? The Kafka dependency I am using in my jar is:
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.11</artifactId>
  <version>0.10.2.1</version>
</dependency>
Does the 0.10.2.1 version conflict with 0.9.0-kafka-2.0.2, and could this be causing the issue?
Here is my jaas.conf:
/* $Id$ */
kinit {
  com.sun.security.auth.module.Krb5LoginModule required;
};

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useTicketCache=true
  useKeyTab=true
  principal="user@CORP.COM"
  serviceName="kafka"
  keyTab="/data/home/keytabs/user.keytab"
  client=true;
};
Here is my spark-submit command:
spark2-submit --master yarn --class app --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=path to jaas.conf" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=path to jaas.conf" --files path to jaas.conf --conf "spark.driver.extraClassPath=path to spark-sql-kafka-0-10_2.11-2.2.0.jar" --conf "spark.executor.extraClassPath=path to spark-sql-kafka-0-10_2.11-2.2.0.jar" --num-executors 2 --executor-cores 4 --executor-memory 10g --driver-memory 5g ./KerberizedKafkaConnect-1.0-SNAPSHOT-shaded.jar
Any help would be appreciated. Thank you!
I am not completely sure whether this will fix your specific issue, but in Spark Structured Streaming the security options must be prefixed with kafka.
So you will have the following:
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.common.config.SslConfigs
// `security` below is assumed to be your own settings object holding the SSL values
val kafkaSecurityOptions = Map(
  CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> security.securityProtocol,
  SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG -> security.sslTrustStoreLocation,
  SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG -> security.sslTrustStorePassword,
  SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG -> security.sslKeyStoreLocation,
  SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG -> security.sslKeyStorePassword,
  SslConfigs.SSL_KEY_PASSWORD_CONFIG -> security.sslKeyPassword
).map(x => "kafka." + x._1 -> x._2) // prefix each Kafka client property with "kafka."

test_final.write.format("kafka")
  .option("kafka.bootstrap.servers", "XXXXXXXXX:9093")
  .option("topic", "test")
  .options(kafkaSecurityOptions)
  .save()
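The same prefixing also works when the options are set inline; a sketch using the question's placeholders:
test_final.write.format("kafka")
  .option("kafka.bootstrap.servers", "XXXXXXXXX:9093")
  .option("topic", "test")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.kerberos.service.name", "kafka")
  .save()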

Kafka Connect Distributed is not working

I am trying to run Kafka Connect in distributed mode to read from a topic and write to HDFS.
The process starts successfully with the required properties files, but I am not able to commit the events to HDFS.
Please see the full logs below.
Note: Broker details have been masked and all other details are mentioned in the logs.
[2018-06-26 11:49:48,474] INFO ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [****:****, ****:****]
check.crcs = true
client.id = consumer-3
connections.max.idle.ms = 540000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = connect-cluster
heartbeat.interval.ms = 3000
interceptor.classes = null
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.ms = 50
request.timeout.ms = 305000
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
session.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
(org.apache.kafka.clients.consumer.ConsumerConfig:180)
[2018-06-26 11:49:48,476] WARN The configuration 'config.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'status.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'internal.key.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'rest.port' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'value.converter.schema.registry.url' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'internal.key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'internal.value.converter.schemas.enable' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'internal.value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'offset.storage.topic' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'value.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,476] WARN The configuration 'key.converter' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,477] WARN The configuration 'key.converter.schema.registry.url' was supplied but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig:188)
[2018-06-26 11:49:48,477] INFO Kafka version : 0.10.1.0-IBM-8 (org.apache.kafka.common.utils.AppInfoParser:83)
[2018-06-26 11:49:48,477] INFO Kafka commitId : unknown (org.apache.kafka.common.utils.AppInfoParser:84)
[2018-06-26 11:49:48,682] INFO Discovered coordinator ****:**** (id: 2147483644 rack: null) for group connect-cluster. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:555)
[2018-06-26 11:49:48,686] INFO Finished reading KafkaBasedLog for topic connect-test-configs (org.apache.kafka.connect.util.KafkaBasedLog:148)
[2018-06-26 11:49:48,686] INFO Started KafkaBasedLog for topic connect-test-configs (org.apache.kafka.connect.util.KafkaBasedLog:150)
[2018-06-26 11:49:48,686] INFO Started KafkaConfigBackingStore (org.apache.kafka.connect.storage.KafkaConfigBackingStore:261)
[2018-06-26 11:49:48,687] INFO Herder started (org.apache.kafka.connect.runtime.distributed.DistributedHerder:171)
[2018-06-26 11:49:48,689] INFO Discovered coordinator ****:**** (id: 2147483644 rack: null) for group connect-cluster. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:555)
[2018-06-26 11:49:48,691] INFO (Re-)joining group connect-cluster (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:381)
[2018-06-26 11:49:48,702] INFO Successfully joined group connect-cluster with generation 3 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:349)
[2018-06-26 11:49:48,702] INFO Joined group and got assignment: Assignment{error=0, leader='connect-1-67ad9053-4b20-41dd-9a6a-ec8e4f4d622e', leaderUrl='http://****:****/', offset=-1, connectorIds=[], taskIds=[]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1009)
[2018-06-26 11:49:48,702] INFO Starting connectors and tasks using config offset -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:745)
[2018-06-26 11:49:48,702] INFO Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:752)
[2018-06-26 11:49:55,748] INFO Reflections took 7022 ms to scan 225 urls, producing 15521 keys and 98977 values (org.reflections.Reflections:229)
but I am not able to commit the events to HDFS
When you start distributed mode, it doesn't take any individual connector properties; you load those later.
Just want to know: where should we give the properties of the topics, the HDFS URL, and the tasks in distributed mode?
You POST JSON to the Connect REST API.
Make a connect-hdfs.json file, for example:
{
  "name": "quickstart-hdfs-sink",
  "config": {
    "store.url": "hdfs:///apps/kafka-connect",
    "hadoop.conf.dir": "/etc/hadoop/conf",
    "topics" : ...
  }
}
Send it to Connect
curl -XPOST -d@connect-hdfs.json http://connect-server:8083/connectors
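Once the connector is created, its state and any task errors can be checked through the same REST API, for example:
curl http://connect-server:8083/connectors/quickstart-hdfs-sink/status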

Spark Streaming NoClassDefFoundError

I am trying to create a Spark-Kafka-Cassandra integration. I am able to connect to Cassandra, but when I try to create the StreamingContext object using
val ssc = new StreamingContext(sparkConf, Seconds(60))
I am able to import and write the above code, but when I try to build and run it, I get the error below:
java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at KafkaSparkCassandra$.main(KafkaSparkCassandra.scala:38)
at KafkaSparkCassandra.main(KafkaSparkCassandra.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
Now I am not able to understand why I am unable to create the StreamingContext object at runtime.
Please help, as I am new to the whole Scala and lambda architecture stack.
Below is the configuration inside build.sbt:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.4.1",
  "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.4.0",
  "org.apache.spark" % "spark-sql_2.10" % "1.4.1",
  "mysql" % "mysql-connector-java" % "5.1.12")

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided"

libraryDependencies += ("org.apache.spark" %% "spark-streaming-kafka" % "1.6.0").exclude("org.spark-project.spark", "unused")

/*
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided"
*/

javaOptions ++= Seq("-Xmx5G", "-XX:MaxPermSize=5G", "-XX:+CMSClassUnloadingEnabled")
Below are the logs. I am unable to print the word count and also unable to store it into the Cassandra DB.
log4j:WARN No appenders could be found for logger (com.datastax.driver.core.SystemProperties).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/15 18:54:52 INFO SparkContext: Running Spark version 1.6.0
16/11/15 18:54:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/15 18:54:52 INFO SecurityManager: Changing view acls to: romit.srivastava
16/11/15 18:54:52 INFO SecurityManager: Changing modify acls to: romit.srivastava
16/11/15 18:54:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(romit.srivastava); users with modify permissions: Set(romit.srivastava)
16/11/15 18:54:53 INFO Utils: Successfully started service 'sparkDriver' on port 53789.
16/11/15 18:54:53 INFO Slf4jLogger: Slf4jLogger started
16/11/15 18:54:53 INFO Remoting: Starting remoting
16/11/15 18:54:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.56.1:53802]
16/11/15 18:54:53 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 53802.
16/11/15 18:54:54 INFO SparkEnv: Registering MapOutputTracker
16/11/15 18:54:54 INFO SparkEnv: Registering BlockManagerMaster
16/11/15 18:54:54 INFO DiskBlockManager: Created local directory at C:\Users\romit.srivastava\AppData\Local\Temp\blockmgr-c60aeba8-a317-4066-99ce-71ec3595bdf3
16/11/15 18:54:54 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/11/15 18:54:54 INFO SparkEnv: Registering OutputCommitCoordinator
16/11/15 18:54:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/11/15 18:54:54 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
16/11/15 18:54:54 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
16/11/15 18:54:54 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
16/11/15 18:54:54 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
16/11/15 18:54:54 INFO Utils: Successfully started service 'SparkUI' on port 4045.
16/11/15 18:54:54 INFO SparkUI: Started SparkUI at http://192.168.56.1:4045
16/11/15 18:54:54 INFO Executor: Starting executor ID driver on host localhost
16/11/15 18:54:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53814.
16/11/15 18:54:54 INFO NettyBlockTransferService: Server created on 53814
16/11/15 18:54:54 INFO BlockManagerMaster: Trying to register BlockManager
16/11/15 18:54:54 INFO BlockManagerMasterEndpoint: Registering block manager localhost:53814 with 2.4 GB RAM, BlockManagerId(driver, localhost, 53814)
16/11/15 18:54:54 INFO BlockManagerMaster: Registered BlockManager
16/11/15 18:54:55 INFO VerifiableProperties: Verifying properties
16/11/15 18:54:55 INFO VerifiableProperties: Property group.id is overridden to
16/11/15 18:54:55 INFO VerifiableProperties: Property zookeeper.connect is overridden to
16/11/15 18:54:58 INFO Cluster: New Cassandra host /136.243.174.23:9042 added
16/11/15 18:54:58 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/11/15 18:55:00 INFO ForEachDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO ShuffledDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO FilteredDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO FlatMappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO DirectKafkaInputDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.DirectKafkaInputDStream#2d1a0e90
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#678a042d
16/11/15 18:55:00 INFO FlatMappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO FlatMappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO FlatMappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO FlatMappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO FlatMappedDStream: Initialized and validated org.apache.spark.streaming.dstream.FlatMappedDStream#7d8e7cf5
16/11/15 18:55:00 INFO FilteredDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO FilteredDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO FilteredDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO FilteredDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO FilteredDStream: Initialized and validated org.apache.spark.streaming.dstream.FilteredDStream#183e79df
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#652d8ac6
16/11/15 18:55:00 INFO ShuffledDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO ShuffledDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO ShuffledDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO ShuffledDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO ShuffledDStream: Initialized and validated org.apache.spark.streaming.dstream.ShuffledDStream#52b15122
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#5c56f655
16/11/15 18:55:00 INFO ForEachDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO ForEachDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO ForEachDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream#37cd8c81
16/11/15 18:55:00 INFO ForEachDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO ShuffledDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO FilteredDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO FlatMappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO MappedDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO DirectKafkaInputDStream: metadataCleanupDelay = -1
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.DirectKafkaInputDStream#2d1a0e90
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#678a042d
16/11/15 18:55:00 INFO FlatMappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO FlatMappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO FlatMappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO FlatMappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO FlatMappedDStream: Initialized and validated org.apache.spark.streaming.dstream.FlatMappedDStream#7d8e7cf5
16/11/15 18:55:00 INFO FilteredDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO FilteredDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO FilteredDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO FilteredDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO FilteredDStream: Initialized and validated org.apache.spark.streaming.dstream.FilteredDStream#183e79df
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#652d8ac6
16/11/15 18:55:00 INFO ShuffledDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO ShuffledDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO ShuffledDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO ShuffledDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO ShuffledDStream: Initialized and validated org.apache.spark.streaming.dstream.ShuffledDStream#52b15122
16/11/15 18:55:00 INFO MappedDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO MappedDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO MappedDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream#5c56f655
16/11/15 18:55:00 INFO ForEachDStream: Slide time = 60000 ms
16/11/15 18:55:00 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/11/15 18:55:00 INFO ForEachDStream: Checkpoint interval = null
16/11/15 18:55:00 INFO ForEachDStream: Remember duration = 60000 ms
16/11/15 18:55:00 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream#3e3f4b04
16/11/15 18:55:00 INFO RecurringTimer: Started timer for JobGenerator at time 1479216360000
16/11/15 18:55:00 INFO JobGenerator: Started JobGenerator at 1479216360000 ms
16/11/15 18:55:00 INFO JobScheduler: Started JobScheduler
16/11/15 18:55:00 INFO StreamingContext: StreamingContext started
16/11/15 18:55:00 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
16/11/15 18:55:30 INFO JobGenerator: Stopping JobGenerator immediately
16/11/15 18:55:30 INFO RecurringTimer: Stopped timer for JobGenerator after time -1
16/11/15 18:55:30 INFO JobGenerator: Stopped JobGenerator
16/11/15 18:55:30 INFO JobScheduler: Stopped JobScheduler
16/11/15 18:55:30 INFO StreamingContext: StreamingContext stopped successfully
16/11/15 18:55:30 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4045
16/11/15 18:55:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/15 18:55:30 INFO MemoryStore: MemoryStore cleared
16/11/15 18:55:30 INFO BlockManager: BlockManager stopped
16/11/15 18:55:30 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/15 18:55:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/15 18:55:30 INFO SparkContext: Successfully stopped SparkContext
16/11/15 18:55:30 WARN StreamingContext: StreamingContext has already been stopped
16/11/15 18:55:30 INFO SparkContext: SparkContext already stopped.
16/11/15 18:55:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/11/15 18:55:30 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/11/15 18:55:30 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
This was primarily due to mixing different versions of the Spark modules. By aligning the versions I was able to run the code.
Further, I was able to do the word count and save it to Cassandra.
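For reference, a sketch of what an aligned dependency list could look like, assuming everything is pinned to Spark 1.6.x on Scala 2.10 (the Cassandra connector version here is an assumption and should be matched to the actual cluster):
val sparkVersion = "1.6.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % sparkVersion,
  "org.apache.spark" %% "spark-sql"             % sparkVersion,
  "org.apache.spark" %% "spark-streaming"       % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion,
  // assumed: Cassandra connector line matching Spark 1.6.x
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
  "mysql" % "mysql-connector-java" % "5.1.12"
)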