Kafka + spark streaming : Multi topic processing in single job

Kafka + spark streaming : Multi topic processing in single job - scala

There are 40 topic in Kafka and written spark streaming job to process 5 table each.
Only objective of spark streaming job is to read 5 kafka topic and write it into corresponding 5 hdfs path. Most of the time its working fine, but some time it writing the topic 1 data to other hdfs path.
Below is the code tried to archive one spark streaming job to process 5 topic and write it into corresponding hdfs, But this writing topic 1 data to HDFS 5 instead of HDFS 1.
Please provide your suggestion :
import java.text.SimpleDateFormat
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, TaskContext }
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer
object SparkKafkaMultiConsumer extends App {
override def main(args: Array[String]) {
if (args.length < 1) {
System.err.println(s"""
|Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc
|
""".stripMargin)
System.exit(1)
}
val date_today = new SimpleDateFormat("yyyy_MM_dd");
val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH");
val PATH_SEPERATOR = "/";
import com.typesafe.config.ConfigFactory
val conf = ConfigFactory.load("env.conf")
val topicconf = ConfigFactory.load("topics.conf")
// Create context with custom second batch interval
val sparkConf = new SparkConf().setAppName("pt_streams")
val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
var kafka_topics="kafka.topics"
// Create direct kafka stream with brokers and topics
var topicsSet = topicconf.getString(kafka_topics).split(",").toSet
if(args.length==2 ) {
print ("This stream job will process table(s) : "+ args(1))
topicsSet=args {1}.split(",").toSet
}
val topicList = topicsSet.toList
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString("kafka.brokers"),
"zookeeper.connect" -> conf.getString("kafka.zookeeper"),
"group.id" -> conf.getString("kafka.consumergroups"),
"auto.offset.reset" -> args { 0 },
"enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"security.protocol" -> "SASL_PLAINTEXT")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
for (i <- 0 until topicList.length) {
/**
* set timer to see how much time takes for the filter operation for each topics
*/
val topicStream = messages.filter(_.topic().equals(topicList(i)))
val data = topicStream.map(_.value())
data.foreachRDD((rdd, batchTime) => {
// val data = rdd.map(_.value())
if (!rdd.isEmpty()) {
rdd.coalesce(1).saveAsTextFile(conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topicList(i) + PATH_SEPERATOR + date_today.format(System.currentTimeMillis())
+ PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
}
})
}
try{
// After all successful processing, commit the offsets to kafka
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
} catch {
case e: Exception =>
e.printStackTrace()
print("error while commiting the offset")
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
}

You're better off using the HDFS connector for Kafka Connect. It is open source and available standalone or as part of Confluent Platform. Simple configuration file to stream from Kafka topics to HDFS, and as a bonus it will create the Hive table for you if you have a schema for your data.
You're re-inventing the wheel if you try to code this yourself; it's a solved problem :)

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in the text and save result to the Cassandra database.
Producer reads the data from the file and sends it to kafka. Consumer uses spark streaming to read and process the date,and then sends the result of the calculations to the table.
My producer looks like this:
object ProducerPlayground extends App {
val topicName = "test"
private def createProducer: Properties = {
val producerProperties = new Properties()
producerProperties.setProperty(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092"
)
producerProperties.setProperty(
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
classOf[IntegerSerializer].getName
)
producerProperties.setProperty(
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
classOf[StringSerializer].getName
)
producerProperties
}
val producer = new KafkaProducer[Int, String](createProducer)
val source = Source.fromFile("G:\\text.txt", "UTF-8")
val lines = source.getLines()
var key = 0
for (line <- lines) {
producer.send(new ProducerRecord[Int, String](topicName, key, line))
key += 1
}
source.close()
producer.flush()
}
Consumer looks like this:
object BatchLayer {
def main(args: Array[String]) {
val brokers = "localhost:9092"
val topics = "test"
val groupId = "groupId-1"
val sparkConf = new SparkConf()
.setAppName("BatchLayer")
.setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(3))
val sc = ssc.sparkContext
sc.setLogLevel("OFF")
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
)
val stream =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val cass = CassandraConnector(sparkConf)
cass.withSessionDo { session =>
session.execute(
s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
)
session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
session.execute(s"TRUNCATE batch_layer.test")
}
stream
.map(v => v.value())
.flatMap(x => x.split(" "))
.filter(x => !x.contains(Array('\n', '\t')))
.map(x => (x, 1))
.reduceByKey(_ + _)
.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
ssc.start()
ssc.awaitTermination()
}
}
After starting producer, the program stops working with this error. What did I do wrong ?

It makes very little sense to use legacy streaming in 2021st - it's very cumbersome to use, and you also need to track offsets for Kafka, etc. It's better to use Structured Streaming instead - it will track offsets for your through the checkpoints, you will work with high-level Dataset APIs, etc.
In your case code could look as following (didn't test, but it's adopted from this working example):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
.selectExpr("split(value, '\\w+', -1) as words")
.selectExpr("explode(words) as word")
.filter("word != ''")
.groupBy($"word")
.count()
.select($"word", $"count")
// create table ...
val query = wordsCountsDF.writeStream
.outputMode(OutputMode.Update)
.format("org.apache.spark.sql.cassandra")
.option("checkpointLocation", "path_to_checkpoint)
.option("keyspace", "test")
.option("table", "<table_name>")
.start()
query.awaitTermination()
P.S. In your example, most probable error is that you're trying to use .saveToCassandra directly on DStream - it doesn't work this way.

How to store data into HDFS using spark streaming

I want to store streaming data into hdfs. Its a spark streaming code capture data from kafka topic.
I tried this
lines.saveAsHadoopFiles("hdfs://192.168.10.31:9000/user/spark/mystream/", "abc")
this is my code let me know here to write code for save data into hdfs and how.in console i am receiving output need to store in hdfs
Thanks in advance
package com.spark.cons.conskafka
import java.util.HashMap
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import _root_.kafka.serializer.StringDecoder
object Consume {
def createContext(brokers: String, topics: String, checkpointDirectory: String): StreamingContext = {
println("Creating new context")
val conf = new SparkConf().setMaster("local[*]").setAppName("Spark Streaming - Kafka DirectReceiver - PopularHashTags").set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
// Set the Spark StreamingContext to create a DStream for every 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint("checkpoint")
// Define the Kafka parameters, broker list must be specified
val kafkaParams = Map[String, String](
"metadata.broker.list" -> brokers,
// start from the smallest available offset, ie the beginning of the kafka log
"auto.offset.reset" -> "largest")
// Define which topics to read from
val topicsSet = topics.split(",").toSet
// Map value from the kafka message (k, v) pair
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
// Filter hashtags
val hashTags = lines.map(_._2).flatMap(_.split(" ")).filter(_.startsWith("#"))
// Get the top hashtags over the previous 60/10 sec window
val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
lines.print()
// Print popular hashtags
topCounts60.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})
topCounts10.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 10 seconds (%s total):".format(rdd.count()))
topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})
lines.count().map(cnt => "Received " + cnt + " kafka messages.").print()
ssc
}
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println(s"""
|Usage: KafkaDirectPopularHashTags <brokers> <topics>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
| <checkpointDirectory> the directory where the metadata is stored
|
""".stripMargin)
System.exit(1)
}
// Create an array of arguments: brokers, topicname, checkpoint directory
val Array(brokers, topics, checkpointDirectory) = args
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => createContext(brokers, topics, checkpointDirectory))
ssc.start()
ssc.awaitTermination()
}
}

Inconsistent fetch by spark kafka consumer

I have written a code to fetch records from kafka into spark. I have come across some strange behaviour. It is consuming in inconsistent order.
val conf = new SparkConf()
.setAppName("Test Data")
.set("spark.cassandra.connection.host", "192.168.0.40")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "2g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "4")
.set("spark.executor.cores", "3")
.set("spark.cores.max", "12")
.set("spark.driver.cores", "4")
.set("spark.ui.port", "4040")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.kafka.maxRatePerPartition", "30")
.set("spark.local.dir", "//tmp//")
.set("spark.sql.warehouse.dir", "/tmp/hive/")
.set("hive.exec.scratchdir", "/tmp/hive2")
val spark = SparkSession
.builder
.appName("Test Data")
.config(conf)
.getOrCreate()
import spark.implicits._
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
"zookeeper.connect" -> "192.168.0.40:2181",
"group.id" -> "=groups",
"auto.offset.reset" -> "smallest")
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
}
kafkaStream.foreachRDD(rdd =>
{
if (!rdd.partitions.isEmpty) {
try {
println("Count of rows " + rdd.count())
} catch {
case e: Exception => e.printStackTrace
}
} else {
println("blank rdd")
}
})
So, Initially I produced 10 million records in kafka. Now producer is stopped and then started Spark Consumer Application. I checked Spark UI, initially I received 700,000-900,000 records per batch(every 10 seconds ) per stream, afterwards started getting 4-6K records per batch. So wanted to understand why the fetch count fell so badly despite the fact that data is present in Kafka so instead of giving 4k per batch , I'am open to consumer directly big size batch. What can be done and how ?
Thanks,

Spark w Kafka - can't get enough parallelization

I am running spark with the local[8] configuration. The input is a kafka stream with 8 brokers. But as seen in the system monitor, it isn't parallel enough, it seems that about only one node is running. The input to the kafka streamer is about 1.6GB big, so it should process much faster.
system monitor
Kafka Producer:
import java.io.{BufferedReader, FileReader}
import java.util
import java.util.{Collections, Properties}
import logparser.LogEvent
import org.apache.hadoop.conf.Configuration
import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
object sparkStreaming{
private val NUMBER_OF_LINES = 100000000
val brokers ="localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099"
val topicName = "log-1"
val fileName = "data/HDFS.log"
val producer = getProducer()
// no hdfs , read from text file.
def produce(): Unit = {
try { //1. Get the instance of Configuration
val configuration = new Configuration
val fr = new FileReader(fileName)
val br = new BufferedReader(fr)
var line = ""
line = br.readLine
var count = 1
//while (line != null){
while ( {
line != null && count < NUMBER_OF_LINES
}) {
System.out.println("Sending batch " + count + " " + line)
producer.send(new ProducerRecord[String, LogEvent](topicName, new LogEvent(count,line,System.currentTimeMillis())))
line = br.readLine
count = count + 1
}
producer.close()
System.out.println("Producer exited successfully for " + fileName)
} catch {
case e: Exception =>
System.out.println("Exception while producing for " + fileName)
System.out.println(e)
}
}
private def getProducer() : KafkaProducer[String,LogEvent] = { // create instance for properties to access producer configs
val props = new Properties
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Reduce the no of requests less than 0
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.LogEventSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, LogEvent](props)
producer
}
def sendBackToKafka(logEvent: LogEvent): Unit ={
producer.send(new ProducerRecord[String, LogEvent] ("times",logEvent))
}
def main (args: Array[String]): Unit = {
println("Starting to produce");
this.produce();
}
}
Consumer:
package logparser
import java.io._
import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
object consumer {
var tFromKafkaToSpark: Long = 0
var tParsing : Long = 0
val startTime = System.currentTimeMillis()
val CPUNumber = 8
val pw = new PrintWriter(new FileOutputStream("data/Streaming"+CPUNumber+"config2x.txt",false))
pw.write("Writing Started")
def printstarttime(): Unit ={
pw.print("StartTime : " + System.currentTimeMillis())
}
def printendtime(): Unit ={
pw.print("EndTime : " + System.currentTimeMillis());
}
val producer = getProducer()
private def getProducer() : KafkaProducer[String,TimeList] = { // create instance for properties to access producer configs
val props = new Properties
val brokers ="localhost:9090,"
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Reduce the no of requests less than 0
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.TimeListSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, TimeList](props)
producer
}
def sendBackToKafka(timeList: TimeList): Unit ={
producer.send(new ProducerRecord[String, TimeList]("times",timeList))
}
def main(args: Array[String]) {
val topics = "log-1"
//val Array(brokers, ) = Array("localhost:9092","log-1")
val brokers = "localhost:9092"
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[" + CPUNumber + "]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
var kafkaParams = Map[String, AnyRef]("metadata.broker.list" -> brokers)
kafkaParams = kafkaParams + ("bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099")
kafkaParams = kafkaParams + ("auto.offset.reset"-> "latest")
kafkaParams = kafkaParams + ("group.id" -> "test-consumer-group")
kafkaParams = kafkaParams + ("key.deserializer" -> classOf[StringDeserializer])
kafkaParams = kafkaParams + ("value.deserializer"-> "logparser.LogEventDeserializer")
//kafkaParams.put("zookeeper.connect", "192.168.101.165:2181");
kafkaParams = kafkaParams + ("enable.auto.commit"-> "true")
kafkaParams = kafkaParams + ("auto.commit.interval.ms"-> "1000")
kafkaParams = kafkaParams + ("session.timeout.ms"-> "20000")
kafkaParams = kafkaParams + ("metadata.max.age.ms"-> "1000")
val messages = KafkaUtils.createDirectStream[String, LogEvent](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, LogEvent](topicsSet, kafkaParams))
var started = false
val lines = messages.map(_.value)
val lineswTime = lines.map(event =>
{
event.addNextEventTime(System.currentTimeMillis())
event
}
)
lineswTime.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
val logLines = lineswTime.map(
(event) => {
//println(event.getLogline.stringMessages.toString)
event.setLogLine(event.getContent)
println("Got event with id = " + event.getId)
event.addNextEventTime(System.currentTimeMillis())
println(event.getLogline.stringMessages.toString)
event
}
)
//logLines.foreachRDD(a => a.foreach(e => println(e.getTimeList + e.getLogline.stringMessages.toString)))
val x = logLines.map(le => {
le.addNextEventTime(System.currentTimeMillis())
sendBackToKafka(new TimeList(le.getTimeList))
le
})
x.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
//logLines.map(ll => ll.addNextEventTime(System.currentTimeMillis()))
println("--------------***///*****-------------------")
//logLines.print(10)
/*
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
*/
// Start the computation
ssc.start()
ssc.awaitTermination()
ssc.stop(false)
pw.close()
}
}

There's a piece of information missing in your problem statement: how many partitions does your input topic log-1 have?
My guess is that such topic have less than 8 partitions.
The parallelism of Spark Streaming (in case of a Kafka source) is tied (modulo re-partitioning) to the number of total Kafka partitions it consumes (i.e. the RDDs' partitions are taken from the Kafka partitions).
If, as I suspect, your input topic only has a few partitions, for each micro-batch Spark Streaming will task only an equal amount of nodes with the computation. All the other nodes will sit idling.
The fact that you see all the node working in an almost round-robin fashion is due to the fact that Spark do not always choose the same node for processing data for the same partition, but it actually actively mix things up.
In order to have a better idea on what's happening I'd need to see some statistics from the Spark UI Streaming page.
Given the information you provided so far however, the insufficient Kafka partitioning would be my best bet for this behaviour.

Everything consuming from Kafka is limited by the number of partitions your topic(s) has. One consumer per partition. How much do you have ?
Although Spark can redistribute the work, it's not recommended as you might be spending more time exchanging information between executors than actually processing it.

No messages received when using foreachPartition spark streaming

I am pulling from Kafka using Spark Streaming. When I use foreachPartition on my RDD I never get any messages received. If I read the messages from the RDD using a foreach it works fine. However I need to use the partition function so I can have a socket connection on each executor.
This is code connecting to spark and creating stream
val kafkaParams = Map(
"zookeeper.connect" -> zooKeepers,
"group.id" -> ("metric-group"),
"zookeeper.connection.timeout.ms" -> "5000")
val inputTopic = "threatflow"
val conf = new SparkConf().setAppName(applicationTitle).set("spark.eventLog.overwrite", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val streams = (1 to numberOfStreams) map { _ =>
KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
}
val kafkaStream = ssc.union(streams)
kafkaStream.foreachRDD { (rdd, time) =>
calcVictimsProcess(process, rdd, time.milliseconds)
}
ssc.start()
ssc.awaitTermination()
Here is my code that attempts to process the messages using foreachPartition instead of foreach
val threats = rdd.map(message => gson.fromJson(message._2.substring(1, message._2.length()), classOf[ThreatflowMessage]))
threats.flatMap(mapSrcVictim).reduceByKey((a,b) => a + b).foreachPartition{ partition =>
val socket = new Socket(InetAddress.getByName("localhost"),4242)
val writer = new BufferedOutputStream(socket.getOutputStream)
partition.foreach{ value =>
val parts = value._1.split("-")
val put = "put %s %d %d type=%s address=%s unique=%s\n".format("metric", bucket, value._2, parts(0),parts(1),unique)
Thread.sleep(10000)
}
writer.flush()
socket.close()
}
simply switching this to foreach as I said will work, however this won't work as I need to have sockets created per executor

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Kafka + spark streaming : Multi topic processing in single job - scala

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

How to store data into HDFS using spark streaming

Inconsistent fetch by spark kafka consumer

Spark w Kafka - can't get enough parallelization

No messages received when using foreachPartition spark streaming

Categories

Resources