Stopping Spark Streaming after reading the first batch of data - apache-kafka

I am using Spark Streaming to consume Kafka messages. I want to get some messages from Kafka as a sample instead of reading all of them. So I want to read one batch of messages, return them to the caller, and stop Spark Streaming. Currently I am passing the batchInterval time to the awaitTermination method of the streaming context. I don't know how to return the processed data to the caller from Spark Streaming. Here is the code I am currently using:
def getsample(params: scala.collection.immutable.Map[String, String]): Unit = {
  if (params.contains("zookeeperQourum"))
    zkQuorum = params.get("zookeeperQourum").get
  if (params.contains("userGroup"))
    group = params.get("userGroup").get
  if (params.contains("topics"))
    topics = params.get("topics").get
  if (params.contains("numberOfThreads"))
    numThreads = params.get("numberOfThreads").get
  if (params.contains("sink"))
    sink = params.get("sink").get
  if (params.contains("batchInterval"))
    interval = params.get("batchInterval").get.toInt

  val sparkConf = new SparkConf().setAppName("KafkaConsumer").setMaster("spark://cloud2-server:7077")
  val ssc = new StreamingContext(sparkConf, Seconds(interval))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

  var consumerConfig = scala.collection.immutable.Map.empty[String, String]
  consumerConfig += ("auto.offset.reset" -> "smallest")
  consumerConfig += ("zookeeper.connect" -> zkQuorum)
  consumerConfig += ("group.id" -> group)

  val data = KafkaUtils.createStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    ssc, consumerConfig, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
  val streams = data.window(Seconds(interval), Seconds(interval)).map(x => new String(x))

  streams.foreachRDD(rdd => rdd.foreachPartition(itr => {
    while (itr.hasNext && size >= 0) {
      val msg = itr.next
      println(msg)
      sample.append(msg)
      sample.append("\n")
      size -= 1
    }
  }))

  ssc.start()
  ssc.awaitTermination(5000)
  ssc.stop(true)
}
So instead of saving the messages in a StringBuilder called "sample", I want to return them to the caller.

You can implement a StreamingListener and, inside its onBatchCompleted callback, call ssc.stop():
private class MyJobListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) = synchronized {
    ssc.stop(true)
  }
}
This is how you attach the listener to your StreamingContext:
val listen = new MyJobListener(ssc)
ssc.addStreamingListener(listen)
ssc.start()
ssc.awaitTermination()

We can get the sample messages using the following piece of code:
val sampleMessages = streams.repartition(1).mapPartitions(x => x.take(10))
If we want to stop after the first batch, we should implement our own StreamingListener and stop the streaming context in its onBatchCompleted method.
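Putting the two answers together, here is a rough sketch of how the sample could be returned to the caller instead of being appended to a StringBuilder (the method name sampleFirstBatch and the sampleSize parameter are made up for illustration). The key point is to collect on the driver with rdd.take rather than mutating a driver-side buffer inside foreachPartition, which runs on the executors:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch only: collect up to `sampleSize` messages from the first batch and return them.
def sampleFirstBatch(ssc: StreamingContext, stream: DStream[String], sampleSize: Int): Seq[String] = {
  val collected = scala.collection.mutable.ArrayBuffer.empty[String]

  stream.foreachRDD { rdd =>
    // rdd.take() brings the results back to the driver, so appending to `collected` here is safe,
    // unlike mutating a driver-side buffer inside foreachPartition (which runs on the executors).
    collected ++= rdd.take(sampleSize)
  }

  // Stop after the first completed batch, as in the listener answer above.
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit =
      ssc.stop(stopSparkContext = true)
  })

  ssc.start()
  ssc.awaitTermination()
  collected.toSeq
}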

Related

How to iterate over key values of a Kafka Streams Table

I'm new to Kafka Streams and I am trying to iterate over the items in a Kafka Streams table via its KeyValueStore.
The idea is to create a KTable (I've also tried with a GlobalKTable) backed by a KeyValueStore.
A separate thread is then in charge of reading the content of the KeyValueStore in order to iterate over the latest value of each key.
val streamProperties: Properties = {
  val p = new Properties()
  p.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-application")
  p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, config.getStringList("kafka.bootstrap.servers").toList.mkString(","))
  p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
  p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray.getClass.getName)
  p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  p
}

val builder: StreamsBuilder = new StreamsBuilder()

import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore

val globalTable = builder.table("test",
  Materialized
    .as[String, Array[Byte], KeyValueStore[org.apache.kafka.common.utils.Bytes, Array[Byte]]]("TestStore")
    .withCachingDisabled()
    .withKeySerde(Serdes.String())
    .withValueSerde(Serdes.ByteArray())
)

val streams: KafkaStreams = new KafkaStreams(builder.build(), streamProperties)
streams.start()

val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
  def run() = {
    println("\n\n\n tick \n\n\n")
    try {
      val keyValueStore = streams.store(globalTable.queryableStoreName(), QueryableStoreTypes.keyValueStore())
      keyValueStore.all().toIterator.map { k =>
        print(k.key)
      }
    } catch {
      case _ => println("error")
    }
  }
}
val f = ex.scheduleAtFixedRate(task, 1, 10, TimeUnit.SECONDS)
In that thread, the keyValueStore stays empty even when I produce items on the "test" topic.
Is there something I missed or didn't understand?
One thing missing is the state directory location config:
p.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp")
Without it, Kafka Streams does not throw an exception, but stateful things such as global KTables silently fail to work.
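Not part of the original answer, but a minimal sketch of how the fixed query loop might look, with a hypothetical state directory path and an extra, commonly recommended wait for the RUNNING state before querying the store:
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.state.QueryableStoreTypes

// Hypothetical state dir path; any writable directory works.
streamProperties.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams-test")

val streams = new KafkaStreams(builder.build(), streamProperties)
streams.start()

// Wait until the instance is RUNNING before querying the store,
// otherwise streams.store(...) can throw InvalidStateStoreException.
while (streams.state() != KafkaStreams.State.RUNNING) Thread.sleep(100)

val store = streams.store("TestStore", QueryableStoreTypes.keyValueStore[String, Array[Byte]]())
val it = store.all()
while (it.hasNext) println(it.next().key)  // eager loop; note that Iterator.map (as in the question) is lazy and prints nothing
it.close()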

Kafka producer hangs on send

The logic is that a streaming job, which gets data from a custom source, has to write both to Kafka and to HDFS.
I wrote a (very) basic Kafka producer to do this; however, the whole streaming job hangs on the send method.
class KafkaProducer(val kafkaBootstrapServers: String, val kafkaTopic: String, val sslCertificatePath: String, val sslCertificatePassword: String) {

  val kafkaProps: Properties = new Properties()
  kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBootstrapServers)
  kafkaProps.put("acks", "1")
  kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  kafkaProps.put("ssl.truststore.location", sslCertificatePath)
  kafkaProps.put("ssl.truststore.password", sslCertificatePassword)

  val kafkaProducer: KafkaProducer[Long, Array[String]] = new KafkaProducer(kafkaProps)

  def sendKafkaMessage(message: Message): Unit = {
    message.data.foreach(list => {
      val producerRecord: ProducerRecord[Long, Array[String]] = new ProducerRecord[Long, Array[String]](kafkaTopic, message.timeStamp.getTime, list.toArray)
      kafkaProducer.send(producerRecord)
    })
  }
}
And the code calling the producer:
receiverStream.foreachRDD(rdd => {
  val messageRowRDD: RDD[Row] = rdd.mapPartitions(partition => {
    val parser: Parser = new Parser
    val kafkaProducer: KafkaProducer = new KafkaProducer(kafkaBootstrapServers, kafkaTopic, kafkaSslCertificatePath, kafkaSslCertificatePass)
    val newPartition = partition.map(message => {
      Logger.getLogger("importer").error("Writing Message to Kafka...")
      kafkaProducer.sendKafkaMessage(message)
      Logger.getLogger("importer").error("Finished writing Message to Kafka")
      message.data.map(singleMessage => parser.parseMessage(message.timeStamp.getTime, singleMessage))
    })
    newPartition.flatten
  })

  val df = sqlContext.createDataFrame(messageRowRDD, Schema.messageSchema)
  Logger.getLogger("importer").info("Entries-count: " + df.count())
  val row = Try(df.first)
  row match {
    case Success(s) => Persister.writeDataframeToDisk(df, outputFolder)
    case Failure(e) => Logger.getLogger("importer").warn("Resulting DataFrame is empty. Nothing can be written")
  }
})
From the logs I can tell that each executor reaches the "sending to Kafka" point, but no further. All executors hang there and no exception is thrown.
The Message class is a very simple case class with 2 fields: a timestamp and an array of strings.
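For reference, a hypothetical definition consistent with how it is used above (not the asker's actual class):
// Hypothetical shape only, inferred from message.timeStamp.getTime and message.data.foreach(list => list.toArray):
case class Message(timeStamp: java.util.Date, data: List[List[String]])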
This turned out to be due to the acks setting on the producer.
Once acks was set to 1, the sends went through a lot faster.
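The answer does not include code, but a hedged sketch of a producer configuration that keeps such hangs from being silent might look like this (the broker address is a placeholder, the SSL settings are omitted, and max.block.ms plus the send callback are additions, not part of the original fix):
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")  // placeholder address
props.put(ProducerConfig.ACKS_CONFIG, "1")                           // wait only for the partition leader
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000")               // fail fast instead of blocking indefinitely on metadata/buffer space
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "key", "value"), new Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
    if (exception != null) exception.printStackTrace()  // surface broker/SSL problems instead of hanging silently
})
producer.flush()
producer.close()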

Reading and processing parallelism in Kafka Spark streaming

I am trying to parallelize the reading of Kafka messages and thereby process them in parallel. My Kafka topic has 10 partitions. I'm trying to create 5 DStreams and apply the union method to operate on a single DStream. Here is the code I have tried so far:
def main(args: scala.Array[String]): Unit = {
  val properties = readProperties()
  val streamConf = new SparkConf().setMaster("local[2]").setAppName("KafkaStream")
  val ssc = new StreamingContext(streamConf, Seconds(1))
  // println("defaultParallelism: " + ssc.sparkContext.defaultParallelism)
  ssc.sparkContext.setLogLevel("WARN")

  val numPartitionsOfInputTopic = 5
  val group_id = Random.alphanumeric.take(4).mkString("consumer_group")

  val kafkaStream = {
    val kafkaParams = Map("zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "zookeeper.connection.timeout.ms" -> "3000")
    val streams = (1 to numPartitionsOfInputTopic).map { _ =>
      KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
        ssc, kafkaParams, Map("kafka_topic" -> 1), StorageLevel.MEMORY_ONLY_SER).map(_._2)
    }
    val unifiedStream = ssc.union(streams)
    val sparkProcessingParallelism = 5
    unifiedStream.repartition(sparkProcessingParallelism)
  }

  kafkaStream.foreachRDD { x =>
    x.foreach { msg =>
      println("Message: " + msg)
      processMessage(msg)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
Upon execution, it is not even receiving a single message, let alone processing it further. Am I missing something here? Please suggest changes if required. Thanks.
I highly recommend switching to the direct stream. Why?
The direct stream sets the parallelism, by default, to the number of partitions you have in Kafka. Nothing more needs to be done - just create the direct stream and do your job :)
If you create 5 DStreams, you will by default read with 5 threads; one non-direct DStream = one thread.
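The answer does not show code, but a minimal sketch of the direct approach (spark-streaming-kafka 0.8 API, with String keys/values and placeholder broker addresses assumed) could look like this:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: Spark creates one RDD partition per Kafka partition,
// so a topic with 10 partitions is read by 10 parallel tasks without creating multiple DStreams.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",  // placeholder broker addresses
  "group.id" -> group_id,
  "auto.offset.reset" -> "smallest"
)
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("kafka_topic"))

directStream.map(_._2).foreachRDD { rdd =>
  rdd.foreach(msg => processMessage(msg))
}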

How to Test Kafka Consumer

I have a Kafka consumer (built in Scala) which extracts the latest records from Kafka. The consumer looks like this:
val consumerProperties = new Properties()
consumerProperties.put("bootstrap.servers", "localhost:9092")
consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProperties.put("group.id", "something")
consumerProperties.put("auto.offset.reset", "latest")
val consumer = new KafkaConsumer[String, String](consumerProperties)
consumer.subscribe(java.util.Collections.singletonList("topic"))
Now, I want to write an integration test for it. Is there any way, or any best practice, for testing Kafka consumers?
You need to start ZooKeeper and Kafka programmatically for integration tests.
1.1 Start ZooKeeper (ZooKeeperServer)
def startZooKeeper(zooKeeperPort: Int, zkLogsDir: Directory): ServerCnxnFactory = {
  val tickTime = 2000
  val zkServer = new ZooKeeperServer(zkLogsDir.toFile.jfile, zkLogsDir.toFile.jfile, tickTime)
  val factory = ServerCnxnFactory.createFactory
  factory.configure(new InetSocketAddress("0.0.0.0", zooKeeperPort), 1024)
  factory.startup(zkServer)
  factory
}
1.2 Start Kafka (KafkaServer)
case class StreamConfig(streamTcpPort: Int = 9092,
                        streamStateTcpPort: Int = 2181,
                        stream: String,
                        numOfPartition: Int = 1,
                        nodes: Map[String, String] = Map.empty)

def startKafkaBroker(config: StreamConfig, kafkaLogDir: Directory): KafkaServer = {
  val syncServiceAddress = s"localhost:${config.streamStateTcpPort}"
  val properties: Properties = new Properties
  properties.setProperty("zookeeper.connect", syncServiceAddress)
  properties.setProperty("broker.id", "0")
  properties.setProperty("host.name", "localhost")
  properties.setProperty("advertised.host.name", "localhost")
  properties.setProperty("port", config.streamTcpPort.toString)
  properties.setProperty("auto.create.topics.enable", "true")
  properties.setProperty("log.dir", kafkaLogDir.toAbsolute.path)
  properties.setProperty("log.flush.interval.messages", 1.toString)
  properties.setProperty("log.cleaner.dedupe.buffer.size", "1048577")
  config.nodes.foreach {
    case (key, value) => properties.setProperty(key, value)
  }

  val broker = new KafkaServer(new KafkaConfig(properties))
  broker.startup()
  println(s"KafkaStream Broker started at ${properties.get("host.name")}:${properties.get("port")} at ${kafkaLogDir.toFile}")
  broker
}
2. Emit some events to the stream using KafkaProducer (a sketch follows below).
3. Then consume them with your consumer to test and verify that it is working.
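A minimal sketch of step 2, emitting a test event with a plain KafkaProducer against the embedded broker started above (topic name and payload are arbitrary):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val producerProps = new Properties()
producerProps.put("bootstrap.servers", "localhost:9092")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](producerProps)
// .get() blocks until the record is acknowledged, so the consumer assertion can run right after.
producer.send(new ProducerRecord[String, String]("topic", "test-key", """{"hello":"world"}""")).get()
producer.close()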
You can use scalatest-eventstream, which has a startBroker method that will start ZooKeeper and Kafka for you.
It also has destroyBroker, which will clean up your Kafka after the tests.
For example:
class MyStreamConsumerSpecs extends FunSpec with BeforeAndAfterAll with Matchers {

  implicit val config =
    StreamConfig(streamTcpPort = 9092, streamStateTcpPort = 2181, stream = "test-topic", numOfPartition = 1)

  val kafkaStream = new KafkaEmbeddedStream

  override protected def beforeAll(): Unit = {
    kafkaStream.startBroker
  }

  override protected def afterAll(): Unit = {
    kafkaStream.destroyBroker
  }

  describe("Kafka Embedded stream") {
    it("does consume some events") {
      // uses application.properties
      // emitter.broker.endpoint=localhost:9092
      // emitter.event.key.serializer=org.apache.kafka.common.serialization.StringSerializer
      // emitter.event.value.serializer=org.apache.kafka.common.serialization.StringSerializer
      kafkaStream.appendEvent("test-topic", """{"MyEvent" : { "myKey" : "myValue"}}""")

      val consumerProperties = new Properties()
      consumerProperties.put("bootstrap.servers", "localhost:9092")
      consumerProperties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProperties.put("group.id", "something")
      consumerProperties.put("auto.offset.reset", "earliest")

      val myConsumer = new KafkaConsumer[String, String](consumerProperties)
      myConsumer.subscribe(java.util.Collections.singletonList("test-topic"))

      val events = myConsumer.poll(2000)
      events.count() shouldBe 1
      events.iterator().next().value() shouldBe """{"MyEvent" : { "myKey" : "myValue"}}"""
      println("=================" + events.count())
    }
  }
}

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2: Indirectly solved the problem by repartitioning the RDD into 8 partitions. I hit a roadblock with Avro objects not being "java serialisable"; I found a snippet here to delegate Avro serialisation to Kryo (roughly sketched below). The original problem still remains.
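The Kryo setup referred to above is roughly of the following shape; registering the generated Avro classes Topic and Post with Kryo is an assumption about that snippet, not a quote from it:
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of delegating serialisation to Kryo (see edit 2); the registered class list is an assumption.
val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("forumAddNlp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[Topic], classOf[Post]))
val sc = new SparkContext(conf)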
edit 1: Removed the local variable reference in the map function.
I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
  class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
    override def write(t: Topic) {
      synchronized {
        super.write(t)
      }
    }
  }

  def makeTopic(x: ForumTopic): Topic = {
    // Omitted to save space
  }

  val writer = new MyWriter

  val q =
    DBCrawler.db.withSession {
      Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
    }
  val sz = q.size
  val c = new AtomicInteger(0)

  q.par.foreach { x =>
    writer.write(makeTopic(x))
    val count = c.incrementAndGet()
    print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
  }
  writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
  val sc = new SparkContext("local[8]", "forumAddNlp")

  // io configuration
  val job = new Job()
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)

  // configure annotator
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  val an = DAnnotator(props)

  // annotator function
  def annotatePosts(ann: DAnnotator, top: Topic): Topic = {
    val new_p = top.getPosts.map { x =>
      val at = new Annotation(x.getPostText.toString)
      ann.annotator.annotate(at)
      val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
      val r = SpecificData.get().deepCopy[Post](x.getSchema, x)
      if (t.nonEmpty) r.setTrees(t)
      r
    }
    val new_t = SpecificData.get().deepCopy[Topic](top.getSchema, top)
    new_t.setPosts(new_p)
    new_t
  }

  // transformation
  val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  val new_ds = ds.map(x => (null, annotatePosts(an, x._2)))
  new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
    classOf[Void],
    classOf[Topic],
    classOf[ParquetOutputFormat[Topic]],
    job.getConfiguration
  )
}
Can you confirm that the data is indeed stored in multiple blocks in HDFS? Check the total block count of the forum_dataset.parq file.
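A quick way to confirm this from the Spark side (not part of the original answer) is to print the partition count of the input RDD, since Spark schedules one task per partition:
// Inside sparkNLPTransformation(), right after `ds` is created:
// a single partition means a single task, i.e. no parallelism.
println("Input partitions: " + ds.partitions.length)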