Add prefix/suffix on kafka topic at runtime - scala

So here suppose if environment is TEST1 and topic name is BENEFIT then it should return like TEST1_BENEFIT
object hello {
def Complete (jobId: BigInt, tableName: String, topic: String = configManager.getString("Kafka.Completion.Table.Topic")): Unit = {
val kafkaServer = configManager.getString("Kafka.Server")
val props = new Properties()
props.put("bootstrap.servers", kafkaServer)
props.put("key.serializer")
props.put("value.serializer")
props.put("batch.size", "1")
props.put("acks", "all")
val producer = new KafkaProducer[String, String](props)
val message = new ProducerRecord[String, String](topic, key, value)
producer.send(message)
}
}

As answered before, use string interpolation or concatenation
val message = new ProducerRecord[String, String](s"${environment}_${topic}", key, value)

Related

Presto is giving this error : Cannot invoke "com.fasterxml.jackson.databind.JsonNode.has(String)" because "currentNode" is null

I'm pushing a JSON file into a Kafka topic, connecting the topic in presto and structuring the JSON data into a queryable table.
The problem I am facing is that , presto is not to fetch data its shows error Cannot invoke "com.fasterxml.jackson.databind.JsonNode.has(String)" because "currentNode" is null.
Code for pushing data into kafka topic:
object Producer extends App{
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.connect.json.JsonSerializer")
val producer = new KafkaProducer[String,JsonNode](props)
println("inside prducer")
val mapper = (new ObjectMapper() with ScalaObjectMapper).
registerModule(DefaultScalaModule).
configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false).
findAndRegisterModules(). // register joda and java-time modules automatically
asInstanceOf[ObjectMapper with ScalaObjectMapper]
val filename = "/Users/rishunigam/Documents/devicd.json"
val jsonNode: JsonNode= mapper.readTree(new File(filename))
val s = jsonNode.size()
for(i <- 0 to jsonNode.size()) {
val js = jsonNode.get(i)
println(js)
val record = new ProducerRecord[String, JsonNode]( "tpch.devicelog", js)
println(record)
producer.send( record )
}
println("producer complete")
producer.close()
}

Get kafka record timestamp from kafka message

I want the timestamp at which the message was inserted in kafka topic by producer.
And at the kafka consumer side, i want to extract that timestamp.
class Producer {
def main(args: Array[String]): Unit = {
writeToKafka("quick-start")
}
def writeToKafka(topic: String): Unit = {
val props = new Properties()
props.put("bootstrap.servers", "localhost:9094")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val record = new ProducerRecord[String, String](topic, "key", "value")
producer.send(record)
producer.close()
}
}
class Consumer {
def main(args: Array[String]): Unit = {
consumeFromKafka("quick-start")
}
def consumeFromKafka(topic: String) = {
val props = new Properties()
props.put("bootstrap.servers", "localhost:9094")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "latest")
props.put("group.id", "consumer-group")
val consumer: KafkaConsumer[String, String] = new KafkaConsumer[String, String](props)
consumer.subscribe(util.Arrays.asList(topic))
while (true) {
val record = consumer.poll(1000).asScala
for (data <- record.iterator)
println(data.value())
}
}
}
Does kafka provides a way to do it? Else i will have to send an extra field from producer to topic.
Kafka provides a way since v0.10
From that version, all your messages have a timestamp information available in data.timestamp, and the kind of information inside is ruled by the config "message.timestamp.type" on your brokers. The value should be either CreateTime or LogAppendTime.
Before this version, you'll have to implement it by hand, usually through modifying your data structure.

Get last inserted message from kafka topic

I have a requirement where I need to find the recently inserted message from Kafka topic. How can I achieve this?
I tried to fetch offset first and trying to get messages from that offset?
Is it efficient solution?
val config = KafkaConfig()
val props = new Properties()
// ConsumerConfig
props.put("bootstrap.servers", config.bootstrapServers)
props.put("group.id", "stream-latest-consumer")
props.put(
"key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer"
)
props.put(
"value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer"
)
val kafkaConsumer = new KafkaConsumer[String, String](props)
val p = new TopicPartition(config.topic, 0)
val cl: util.Collection[TopicPartition] = List(p).asJava
val offsetsMap: java.util.Map[TopicPartition, java.lang.Long] =
kafkaConsumer.endOffsets(cl)
val offsetCount = offsetsMap.get(p)
You can also use
void seekToEnd(Collection<TopicPartition> partitions)
in order to get the latest offset for the given partitions.

Attempting to Write Tuple to Flink Kafka sink

I am trying to write a streaming application that both reads from and writes to Kafka. I currently have this but I have to toString my tuple class.
object StreamingJob {
def main(args: Array[String]) {
// set up the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")
val consumer = env.addSource(new FlinkKafkaConsumer08[String]("topic", new SimpleStringSchema(), properties))
val counts = consumer.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.map { (_, 1) }
.keyBy(0)
.timeWindow(Time.seconds(5))
.sum(1)
val producer = new FlinkKafkaProducer08[String](
"localhost:9092",
"my-topic",
new SimpleStringSchema())
counts.map(_.toString()).addSink(producer)
env.execute("Window Stream WordCount")
env.execute("Flink Streaming Scala API Skeleton")
}
}
The closest I could get to getting this working was the following but the FlinkKafkaProducer08 refuses to accept the type parameter as part of the constructor.
val producer = new FlinkKafkaProducer08[(String, Int)](
"localhost:9092",
"my-topic",
new TypeSerializerOutputFormat[(String, Int)])
counts.addSink(producer)
I am wondering if there is a way to write the tuples directly to my Kafka sink.
You need a class approximately like this that serializes your tuples:
private class SerSchema extends SerializationSchema[Tuple2[String, Int]] {
override def serialize(tuple2: Tuple2[String, Int]): Array[Byte] = ...
}

Spark w Kafka - can't get enough parallelization

I am running spark with the local[8] configuration. The input is a kafka stream with 8 brokers. But as seen in the system monitor, it isn't parallel enough, it seems that about only one node is running. The input to the kafka streamer is about 1.6GB big, so it should process much faster.
system monitor
Kafka Producer:
import java.io.{BufferedReader, FileReader}
import java.util
import java.util.{Collections, Properties}
import logparser.LogEvent
import org.apache.hadoop.conf.Configuration
import org.apache.kafka.clients.producer.{KafkaProducer, Producer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
object sparkStreaming{
private val NUMBER_OF_LINES = 100000000
val brokers ="localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099"
val topicName = "log-1"
val fileName = "data/HDFS.log"
val producer = getProducer()
// no hdfs , read from text file.
def produce(): Unit = {
try { //1. Get the instance of Configuration
val configuration = new Configuration
val fr = new FileReader(fileName)
val br = new BufferedReader(fr)
var line = ""
line = br.readLine
var count = 1
//while (line != null){
while ( {
line != null && count < NUMBER_OF_LINES
}) {
System.out.println("Sending batch " + count + " " + line)
producer.send(new ProducerRecord[String, LogEvent](topicName, new LogEvent(count,line,System.currentTimeMillis())))
line = br.readLine
count = count + 1
}
producer.close()
System.out.println("Producer exited successfully for " + fileName)
} catch {
case e: Exception =>
System.out.println("Exception while producing for " + fileName)
System.out.println(e)
}
}
private def getProducer() : KafkaProducer[String,LogEvent] = { // create instance for properties to access producer configs
val props = new Properties
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Reduce the no of requests less than 0
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.LogEventSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, LogEvent](props)
producer
}
def sendBackToKafka(logEvent: LogEvent): Unit ={
producer.send(new ProducerRecord[String, LogEvent] ("times",logEvent))
}
def main (args: Array[String]): Unit = {
println("Starting to produce");
this.produce();
}
}
Consumer:
package logparser
import java.io._
import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
object consumer {
var tFromKafkaToSpark: Long = 0
var tParsing : Long = 0
val startTime = System.currentTimeMillis()
val CPUNumber = 8
val pw = new PrintWriter(new FileOutputStream("data/Streaming"+CPUNumber+"config2x.txt",false))
pw.write("Writing Started")
def printstarttime(): Unit ={
pw.print("StartTime : " + System.currentTimeMillis())
}
def printendtime(): Unit ={
pw.print("EndTime : " + System.currentTimeMillis());
}
val producer = getProducer()
private def getProducer() : KafkaProducer[String,TimeList] = { // create instance for properties to access producer configs
val props = new Properties
val brokers ="localhost:9090,"
//Assign localhost id
props.put("bootstrap.servers", brokers)
props.put("auto.create.topics.enable", "true")
//Set acknowledgements for producer requests.
props.put("acks", "all")
//If the request fails, the producer can automatically retry,
props.put("retries", "100")
//Specify buffer size in config
props.put("batch.size", "16384")
//Reduce the no of requests less than 0
props.put("linger.ms", "1")
//The buffer.memory controls the total amount of memory available to the producer for buffering.
props.put("buffer.memory", "33554432")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "logparser.TimeListSerializer")
props.put("topic.metadata.refresh.interval.ms", "1")
val producer = new KafkaProducer[String, TimeList](props)
producer
}
def sendBackToKafka(timeList: TimeList): Unit ={
producer.send(new ProducerRecord[String, TimeList]("times",timeList))
}
def main(args: Array[String]) {
val topics = "log-1"
//val Array(brokers, ) = Array("localhost:9092","log-1")
val brokers = "localhost:9092"
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[" + CPUNumber + "]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
var kafkaParams = Map[String, AnyRef]("metadata.broker.list" -> brokers)
kafkaParams = kafkaParams + ("bootstrap.servers" -> "localhost:9092,localhost:9093,localhost:9094,localhost:9095,localhost:9096,localhost:9097,localhost:9098,localhost:9099")
kafkaParams = kafkaParams + ("auto.offset.reset"-> "latest")
kafkaParams = kafkaParams + ("group.id" -> "test-consumer-group")
kafkaParams = kafkaParams + ("key.deserializer" -> classOf[StringDeserializer])
kafkaParams = kafkaParams + ("value.deserializer"-> "logparser.LogEventDeserializer")
//kafkaParams.put("zookeeper.connect", "192.168.101.165:2181");
kafkaParams = kafkaParams + ("enable.auto.commit"-> "true")
kafkaParams = kafkaParams + ("auto.commit.interval.ms"-> "1000")
kafkaParams = kafkaParams + ("session.timeout.ms"-> "20000")
kafkaParams = kafkaParams + ("metadata.max.age.ms"-> "1000")
val messages = KafkaUtils.createDirectStream[String, LogEvent](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, LogEvent](topicsSet, kafkaParams))
var started = false
val lines = messages.map(_.value)
val lineswTime = lines.map(event =>
{
event.addNextEventTime(System.currentTimeMillis())
event
}
)
lineswTime.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
val logLines = lineswTime.map(
(event) => {
//println(event.getLogline.stringMessages.toString)
event.setLogLine(event.getContent)
println("Got event with id = " + event.getId)
event.addNextEventTime(System.currentTimeMillis())
println(event.getLogline.stringMessages.toString)
event
}
)
//logLines.foreachRDD(a => a.foreach(e => println(e.getTimeList + e.getLogline.stringMessages.toString)))
val x = logLines.map(le => {
le.addNextEventTime(System.currentTimeMillis())
sendBackToKafka(new TimeList(le.getTimeList))
le
})
x.foreachRDD(a => a.foreach(e => println(e.getTimeList)))
//logLines.map(ll => ll.addNextEventTime(System.currentTimeMillis()))
println("--------------***///*****-------------------")
//logLines.print(10)
/*
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
*/
// Start the computation
ssc.start()
ssc.awaitTermination()
ssc.stop(false)
pw.close()
}
}
There's a piece of information missing in your problem statement: how many partitions does your input topic log-1 have?
My guess is that such topic have less than 8 partitions.
The parallelism of Spark Streaming (in case of a Kafka source) is tied (modulo re-partitioning) to the number of total Kafka partitions it consumes (i.e. the RDDs' partitions are taken from the Kafka partitions).
If, as I suspect, your input topic only has a few partitions, for each micro-batch Spark Streaming will task only an equal amount of nodes with the computation. All the other nodes will sit idling.
The fact that you see all the node working in an almost round-robin fashion is due to the fact that Spark do not always choose the same node for processing data for the same partition, but it actually actively mix things up.
In order to have a better idea on what's happening I'd need to see some statistics from the Spark UI Streaming page.
Given the information you provided so far however, the insufficient Kafka partitioning would be my best bet for this behaviour.
Everything consuming from Kafka is limited by the number of partitions your topic(s) has. One consumer per partition. How much do you have ?
Although Spark can redistribute the work, it's not recommended as you might be spending more time exchanging information between executors than actually processing it.