Since Spark is built on top of Akka, I want to use Akka to send and receive messages between Spark clusters.
According to this tutorial, https://github.com/jaceklaskowski/spark-activator/blob/master/src/main/scala/StreamingApp.scala, I can run StreamingApp locally and send messages to its actorStream.
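For context, here is a rough sketch of what the receiving side looks like with the Spark 1.x actorStream API; the tutorial's actual StreamingApp may differ in its details. The "helloer" actor hands each message to Spark Streaming via store, and Spark wraps it under a supervisor actor, which appears to be where the /user/Supervisor0/helloer path used below comes from.

// Sketch of the receiving side, assuming the Spark 1.x actorStream API;
// the actual StreamingApp in the linked tutorial may differ in details.
import akka.actor.{Actor, Props}
import org.apache.spark.streaming.receiver.ActorHelper

class Helloer extends Actor with ActorHelper {
  def receive = {
    case s: String => store(s) // hand each message to Spark Streaming
  }
}

// inside StreamingApp, after creating a StreamingContext `ssc`:
// val lines = ssc.actorStream[String](Props[Helloer], "helloer")
// lines.print()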
Then I tried to attach the sender part to another Spark master of mine and send messages from that Spark master to the remote actor in StreamingApp. The code is as follows:
import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object SenderApp extends Serializable {
  def main(args: Array[String]) {
    val driverPort = 12345
    val driverHost = "xxxx"
    val conf = new SparkConf(false)
      .setMaster("spark://localhost:8888") // connecting to my Spark master
      .setAppName("Spark Akka Streaming Sender")
      .set("spark.logConf", "true")
      .set("spark.akka.logLifecycleEvents", "true")
    val actorName = "helloer"
    val sc = new SparkContext(conf)
    val actorSystem = SparkEnv.get.actorSystem
    val url = s"akka.tcp://sparkDriver@$driverHost:$driverPort/user/Supervisor0/$actorName"
    val helloer = actorSystem.actorSelection(url)
    helloer ! "Hello"
    helloer ! "from"
    helloer ! "Spark Streaming"
    helloer ! "with"
    helloer ! "Scala"
    helloer ! "and"
    helloer ! "Akka"
  }
}
Then I got messages from StreamingApp saying it encountered dead letters.
The detailed message is:
INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkDriver/deadLetters] to Actor[akka://sparkDriver/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkDriver%40111.22.33.444%3A56840-4#-2094758237] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
According to this article:
http://typesafe.com/activator/template/spark-streaming-scala-akka
I changed the helloer, and it works now:
val timeout = 100 seconds
val helloer = Await.result(actorSystem.actorSelection(url).resolveOne(timeout), timeout)
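For completeness, here is a minimal sketch of the resolved version with the imports it needs; it assumes the same actorSystem and url values as above.

import scala.concurrent.Await
import scala.concurrent.duration._

val timeout = 100.seconds
// resolveOne looks up the remote actor and fails fast if it does not exist,
// instead of letting tells go silently to dead letters
val helloer = Await.result(actorSystem.actorSelection(url).resolveOne(timeout), timeout)
helloer ! "Hello"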
I am trying to write a simple consumer of messages from Kafka using Akka Streams.
build.sbt
"com.typesafe.akka" %% "akka-stream-kafka" % "0.17"
My code
object AkkaStreamskafka extends App {
// producer settings
implicit val system = ActorSystem()
implicit val actorMaterializer = ActorMaterializer()
val consumerSettings = ConsumerSettings(system, Some(new ByteArrayDeserializer), Some(new StringDeserializer))
.withBootstrapServers("foo:9092")
.withGroupId("abhi")
.withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
val source = Consumer
.committableSource(consumerSettings, Subscriptions.topics("my-topic))
val flow = Flow[ConsumerMessage.CommittableMessage[Array[Byte], String]].mapAsync(1){msg =>
msg.committableOffset.commitScaladsl().map(_ => msg.record.value);
}
val sink = Sink.foreach[String](println)
val graph = RunnableGraph.fromGraph(GraphDSL.create(sink){implicit builder =>
s =>
import GraphDSL.Implicits._
source ~> flow ~> s.in
ClosedShape
})
val future = graph.run()
Await.result(future, Duration.Inf)
}
But I get an error:
[WARN] [09/28/2017 13:12:52.333] [default-akka.kafka.default-dispatcher-7]
[akka://default/system/kafka-consumer-1] Consumer interrupted with WakeupException after timeout.
Message: null. Current value of akka.kafka.consumer.wakeup-timeout is 3000 milliseconds
Edit:
I can ssh to foo and then run ./kafka-console-consumer --zookeeper localhost:2181 --topic my-topic on the server terminal, and I can see data. So I guess my server name foo is correct and Kafka is up and running on that machine.
Edit 2:
On the Kafka server I am running Cloudera 5.7.1. The Kafka version is jars/kafka_2.10-0.9.0-kafka-2.0.0.jar.
I was able to solve the problem myself.
The library "com.typesafe.akka" %% "akka-stream-kafka" only works for Kafka 0.10 and beyond; it does not work for earlier versions of Kafka. When I listed the Kafka jars on my Kafka server, I found that I am using Cloudera 5.7.1, which comes with Kafka 0.9.
In order to create an Akka Streams source for this version, I needed to use
"com.softwaremill.reactivekafka" % "reactive-kafka-core_2.11" % "0.10.0"
They also have an example here: https://github.com/kciesielski/reactive-kafka
This code worked perfectly for me:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Flow, GraphDSL, RunnableGraph, Sink, Source}
import com.softwaremill.react.kafka.{ConsumerProperties, ReactiveKafka}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer

implicit val actorSystem = ActorSystem()
implicit val actorMaterializer = ActorMaterializer()
import actorSystem.dispatcher // execution context for onComplete

val kafka = new ReactiveKafka()
val consumerProperties = ConsumerProperties(
  bootstrapServers = "foo:9092",
  topic = "my-topic",
  groupId = "abhi",
  valueDeserializer = new StringDeserializer()
)

val source = Source.fromPublisher(kafka.consume(consumerProperties))
val flow = Flow[ConsumerRecord[Array[Byte], String]].map(r => r.value())
val sink = Sink.foreach[String](println)

val graph = RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder => s =>
  import GraphDSL.Implicits._
  source ~> flow ~> s.in
  ClosedShape
})

val future = graph.run()
future.onComplete { _ =>
  actorSystem.terminate()
}
Await.result(actorSystem.whenTerminated, Duration.Inf)
I'm working on a Spark Streaming application; I'm just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))

  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and gets printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking into the API docs leads me to believe that this is not its intended behavior. I've been looking around online for a solution, but nothing involving Spark has mentioned this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.
I am trying to consume messages from Kafka using Akka's Reactive Kafka library. I get one message printed, and after that I get:
[INFO] [01/24/2017 10:36:52.934] [CommittableSourceConsumerMain-akka.actor.default-dispatcher-5] [akka://CommittableSourceConsumerMain/system/kafka-consumer-1] Message [akka.kafka.KafkaConsumerActor$Internal$Stop$] from Actor[akka://CommittableSourceConsumerMain/deadLetters] to Actor[akka://CommittableSourceConsumerMain/system/kafka-consumer-1#-1726905274] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
This is the code I am executing:
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import play.api.libs.json._

object CommittableSourceConsumerMain extends App {
  implicit val system = ActorSystem("CommittableSourceConsumerMain")
  implicit val materializer = ActorMaterializer()

  val consumerSettings = ConsumerSettings(system, new ByteArrayDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("CommittableSourceConsumer")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val done =
    Consumer.committableSource(consumerSettings, Subscriptions.topics("topic1"))
      .mapAsync(1) { msg =>
        val record = msg.record.value()
        val data = Json.parse(record)
        val recordType = data \ "data" \ "event" \ "type"
        val actualData = data \ "data" \ "row"
        if (recordType.as[String] == "created") {
          "Some saving logic"
        } else {
          "Some logic"
        }
        msg.committableOffset.commitScaladsl()
      }
      .runWith(Sink.ignore)
}
I finally figured out the solution. Due to a runtime exception in the stream, a failed Future is returned, which terminates the stream immediately.
Akka Streams does not surface the runtime exception on its own. To see the exception:
import scala.util.control.NonFatal
import system.dispatcher // execution context for the callback

done.onFailure {
  case NonFatal(e) => println(e)
}
The exception was in the if-else block.
One can also use a supervision strategy to resume the stream when an exception occurs, as sketched below.
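A minimal sketch of that idea, assuming a recent Akka Streams version: replace the plain ActorMaterializer() above with one whose supervision decider resumes on non-fatal exceptions instead of failing the stream.

import akka.stream.{ActorMaterializer, ActorMaterializerSettings, Supervision}
import scala.util.control.NonFatal

// Resume the stream on non-fatal exceptions instead of failing it outright.
val decider: Supervision.Decider = {
  case NonFatal(e) =>
    println(s"Dropping element due to: $e")
    Supervision.Resume
  case _ => Supervision.Stop
}

implicit val materializer = ActorMaterializer(
  ActorMaterializerSettings(system).withSupervisionStrategy(decider)
)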
I've been using Spark to stream data from Kafka, and it's pretty easy.
I thought using the MQTT utils would also be easy, but for some reason it is not.
I'm trying to execute the following piece of code:
import akka.actor.{ActorSystem, Props}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))

val actorSystem = ActorSystem()
// KafkaProducerActor is an application-defined actor (not shown here)
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])

MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    println("got rdd: " + rdd.toString())
    rdd.foreach { msg =>
      println("got msg: " + msg)
    }
  }

ssc.start()
ssc.awaitTermination()
The weird thing is that Spark logs the msg I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG o.a.s.s.receiver.BlockGenerator - Last element in input-0-1435790298600 is SOME MESSAGE
foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built-in print function on the DStream, or, instead of your foreachRDD, collect (or take) some of the elements back to the driver and print them there. Hope that helps, and best of luck with Spark Streaming :)
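For instance, a minimal sketch of the second suggestion, using the broker URL and topic from the question: take a few elements back to the driver inside foreachRDD and print them there.

// Sketch: pull a handful of elements back to the driver and print them there,
// instead of calling println inside rdd.foreach (which runs on the workers).
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    rdd.take(10).foreach(msg => println("got msg: " + msg))
  }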
If you wish to just print incoming messages, try something like this instead of the foreachRDD (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()
I am trying to integrate Spark Streaming and Kafka. I wrote my source code in the IntelliJ IDEA IDE; the compiler compiled the code without any error, but when I try to build the jar file, an error message is generated:
Error: scalac: bad symbolic reference. A signature in KafkaUtils.class refers to term kafka in package <root> which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling KafkaUtils.class.
I did some research on Google; many people say this is because of a version mismatch between the Scala version and the spark-streaming-kafka jar file. But I have checked the versions, and they are the same.
Does someone know why this error happens?
Here are more details:
Scala version: 2.10
Spark Streaming Kafka jar versions: spark_streaming_kafka_2.10-1.20.jar, spark_streaming_2.10-1.20.jar
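For what it's worth, this error message suggests that classes KafkaUtils refers to (the kafka package from the Kafka client) are not on the compile classpath; one way to rule that out is to let sbt resolve matching artifacts and their transitive dependencies instead of adding jars by hand. A hedged build.sbt sketch for Scala 2.10 and Spark 1.2.0 (the exact version numbers are my reading of the jar names above):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"
)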
My source code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Kafka {
  def main(args: Array[String]) {
    val master = "local[*]"
    val zkQuorum = "localhost:2181"
    val group = ""
    val topics = "test"
    val numThreads = 1

    val conf = new SparkConf().setAppName("Kafka")
    val ssc = new StreamingContext(conf, Seconds(2))

    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}