How can I apply the same Pattern on two different Kafka Streams in Flink? - scala

I have this Flink program below:
object WindowedWordCount {
  val configFactory = ConfigFactory.load()

  def main(args: Array[String]) = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val kafkaStream1 = env.addSource(new FlinkKafkaConsumer010[String](topic1, new SimpleStringSchema(), props))
      .assignTimestampsAndWatermarks(new TimestampExtractor)
    val kafkaStream2 = env.addSource(new FlinkKafkaConsumer010[String](topic2, new SimpleStringSchema(), props))
      .assignTimestampsAndWatermarks(new TimestampExtractor)

    val partitionedStream1 = kafkaStream1.keyBy(jsonString => extractUserId(jsonString))
    val partitionedStream2 = kafkaStream2.keyBy(jsonString => extractUserId(jsonString))

    // Is there a way to match the userId from partitionedStream1 and partitionedStream2 in this same pattern?
    val patternForMatchingUserId = Pattern.begin[String]("start")
      .where(stream1.getUserId() == stream2.getUserId()) // I want to do something like this

    // Is there a way to pass in partitionedStream1 and partitionedStream2 to this CEP.pattern function?
    val patternStream = CEP.pattern(partitionedStream1, patternForMatchingUserId)

    env.execute()
  }
}
In the Flink program above, I have two streams named partitionedStream1 and partitionedStream2, both keyed by the userId.
I want to somehow compare the data from both streams inside the patternForMatchingUserId pattern (similar to what I showed above). Is there a way to pass two streams to the CEP.pattern function?
Something like this:
val patternStream = CEP.pattern(partitionedStream1, partitionedStream2, patternForMatchingUserId)

There is no way to pass two streams to CEP, but you can pass a combined stream.
If both streams have the same type/schema, you can simply union them. I believe this solution matches your case:
partitionedStream1.union(partitionedStream2).keyBy(...)
If they have different schemas, you can transform them into one stream using some custom logic, e.g. inside a coFlatMap.
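A minimal sketch of the union approach, reusing the streams and extractUserId from the question (the where-condition is only a placeholder assumption, since with a single keyed stream the userId correlation is handled by keyBy):

// Union the two streams of the same type, then key the combined stream by userId.
val combinedStream = kafkaStream1
  .union(kafkaStream2)
  .keyBy(jsonString => extractUserId(jsonString))

// Placeholder condition (assumes extractUserId returns a String); the real
// matching logic on the events goes into where(...).
val patternForMatchingUserId = Pattern.begin[String]("start")
  .where(jsonString => extractUserId(jsonString).nonEmpty)

// CEP then evaluates the pattern per key, i.e. per userId, over events from both topics.
val patternStream = CEP.pattern(combinedStream, patternForMatchingUserId)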

Related

Is there a way to name the inner mapValues() topic created as part of the count() operator in kafka-streams?

I'm trying to name all processors of a simple word count Kafka Streams application, but I can't figure out how to name the inner topic created due to the mapValues() call inside the count() method that results from the call to stream(). This is the application code, followed by the topology description (showing only the second sub-topology):
def createTopology(builder: StreamsBuilder, config: Config): Topology = {
  val consumed = Consumed
    .as(inputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)
  val produced = Produced
    .as(outputTopic)
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.longSerde)
  val flatMapValuesProc = Named.as("flatMapValues")
  val grouped = Grouped
    .as("groupBy")
    .withKeySerde(Serdes.stringSerde)
    .withValueSerde(Serdes.stringSerde)

  val textLines: KStream[String, String] = builder.stream[String, String](inputTopic)(consumed)
  val wordCounts: KTable[String, Long] = textLines
    .flatMapValues(textLine => textLine.toLowerCase.split("\\W+"), named = flatMapValuesProc)
    .groupBy((_, word) => word)(grouped)
    .count(Named.as("count"))(Materialized.as(storeName))
  wordCounts
    .toStream(Named.as("toStream"))
    .to(outputTopic)(produced)

  builder.build()
}
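For reference, the topology description mentioned above can be obtained with a small usage sketch like the one below (the description output itself is not reproduced here; the config value is assumed to be loaded elsewhere):

// Hypothetical usage sketch: build the topology and print its description,
// which is how the generated processor and internal topic names can be inspected.
val builder = new StreamsBuilder()
val topology = createTopology(builder, config)
println(topology.describe())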
Looking at the count() method, it seems it's not possible to name this operation from the code. Is there another way to name this inner topic?
def count(named: Named)(implicit materialized: Materialized[K, Long, ByteArrayKeyValueStore]): KTable[K, Long] = {
  ...
  new KTable(
    javaCountTable.mapValues[Long](
      ((l: java.lang.Long) => Long2long(l)).asValueMapper,
      Materialized.`with`[K, Long, ByteArrayKeyValueStore](tableImpl.keySerde(), Serdes.longSerde)
    )
  )
}

Scala spark kafka code - functional approach

I have the following code in Scala. I am using Spark SQL to pull data from Hadoop, perform a group by on the result, serialize it, and then write that message to Kafka.
I've written the code, but I want to write it in a functional way. Should I create a new class with a function 'getCategories' to get the categories from Hadoop? I am not sure how to approach this.
Here is the code:
class ExtractProcessor {
  def process(): Unit = {
    implicit val formats = DefaultFormats
    val spark = SparkSession.builder().appName("test app").getOrCreate()
    try {
      val df = spark.sql("SELECT DISTINCT SUBCAT_CODE, SUBCAT_NAME, CAT_CODE, CAT_NAME " +
        "FROM CATEGORY_HIERARCHY " +
        "ORDER BY CAT_CODE, SUBCAT_CODE ")
      val result = df.collect().groupBy(row => (row(2), row(3)))
      val categories = result.map(cat =>
        category(cat._1._1.toString(), cat._1._2.toString(),
          cat._2.map(subcat =>
            subcategory(subcat(0).toString(), subcat(1).toString())).toList))
      val jsonMessage = write(categories)
      val kafkaKey = java.security.MessageDigest.getInstance("SHA-1")
        .digest(jsonMessage.getBytes("UTF-8")).map("%02x".format(_)).mkString
      val key = write(kafkaKey)
      Logger.log.info(s"Json Message: ${jsonMessage}")
      Logger.log.info(s"Kafka Key: ${key}")
      KafkaUtil.apply.send(key, jsonMessage, "testTopic")
    }
And here is the Kafka code:
class KafkaUtil {
  def send(key: String, message: String, topicName: String): Unit = {
    val properties = new Properties()
    properties.put("bootstrap.servers", "localhost:9092")
    properties.put("client.id", "test publisher")
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](properties)
    try {
      val record = new ProducerRecord[String, String](topicName, key, message)
      producer.send(record)
    }
    finally {
      producer.close()
      Logger.log.info("Kafka producer closed...")
    }
  }
}

object KafkaUtil {
  def apply: KafkaUtil = {
    new KafkaUtil
  }
}
Also, for writing unit tests, what should I be testing in the functional approach? In OOP we unit test the business logic, but in my Scala code there is hardly any business logic.
Any help is appreciated.
Thanks in advance,
Suyog
Your code consists of:
1) Loading the data into a Spark DataFrame
2) Crunching the data
3) Creating a JSON message
4) Sending the JSON message to Kafka
Unit tests are good for testing pure functions.
You can extract step 2) into a method with a signature like
def getCategories(df: DataFrame): Seq[Category] and cover it with a test (see the sketch after this list).
In the test, the data frame can be generated from a plain hard-coded in-memory sequence.
Step 3) can also be covered by a unit test if you feel it is error-prone.
Steps 1) and 4) are to be covered by an end-to-end test.
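A minimal sketch of that extraction (the Category/Subcategory case classes stand in for the question's category/subcategory types, and the column order follows the SQL query above):

import org.apache.spark.sql.DataFrame

case class Subcategory(code: String, name: String)
case class Category(code: String, name: String, subcategories: List[Subcategory])

// A function that can be unit tested without touching Kafka or Hadoop.
def getCategories(df: DataFrame): Seq[Category] =
  df.collect()
    .groupBy(row => (row(2).toString, row(3).toString))
    .map { case ((catCode, catName), rows) =>
      Category(catCode, catName,
        rows.map(r => Subcategory(r(0).toString, r(1).toString)).toList)
    }
    .toSeq

// In a test, the DataFrame comes from a hard-coded in-memory sequence, e.g.:
// val df = spark.createDataFrame(Seq(("S1", "Sub 1", "C1", "Cat 1")))
//   .toDF("SUBCAT_CODE", "SUBCAT_NAME", "CAT_CODE", "CAT_NAME")
// assert(getCategories(df) == Seq(Category("C1", "Cat 1", List(Subcategory("S1", "Sub 1")))))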
By the way, val result = df.collect().groupBy(row => (row(2), row(3))) is inefficient: it pulls every row to the driver before grouping. Better to do the grouping on the Spark side (e.g. with groupByKey on the Dataset, or groupBy on the CAT_CODE/CAT_NAME columns plus collect_list) and collect only the grouped result.
Also there is no need to initialize a KafkaProducer individually for each single message.
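As a sketch of that last point (an assumed refactor, not the original KafkaUtil), the producer can be created once and reused for many messages:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// One long-lived producer instead of one per message; close it on shutdown.
class KafkaSender(properties: Properties) extends AutoCloseable {
  private val producer = new KafkaProducer[String, String](properties)

  def send(key: String, message: String, topicName: String): Unit =
    producer.send(new ProducerRecord[String, String](topicName, key, message))

  override def close(): Unit = producer.close()
}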

[Spark Streaming] How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model is used to predict something based on this new message. But as time goes by, the model can change for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this:
def loadingModel(@transient sc: SparkContext) = {
  val model = LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
  model
}

var error = 0.0
var size = 0.0
implicit def bool2int(b: Boolean) = if (b) 1 else 0

def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double, Double)] = {
  val model = loadingModel(sc)
  val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
  val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
  val prediction = model.predict(pairs.features)
  val wrong = prediction != pairs.label
  error = state.getOption().getOrElse(Array(0.0, 0.0))(0) + 1.0 * (wrong: Int)
  size = state.getOption().getOrElse(Array(0.0, 0.0))(1) + 1.0
  val output = (key, error, size)
  state.update(Array(error, size))
  Some(output)
}
val stateSpec = StateSpec.function(updateState _)
  .numPartitions(1)

setupLogging()

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, I get an exception like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within a DStream function, Spark seems to serialize the context object (because the model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert the DStream to an RDD, collect the result, and then run the model prediction/scoring in the driver.
I used the netcat utility to simulate streaming and tried the following code to convert the DStream to an RDD; it works. See if it helps.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)))
val logisModel = LogisticRegressionModel.load(sc, "/path/LR_Model")
linedstream.foreachRDD(rdd => {
  for (item <- rdd.collect().toArray) {
    val predictedVal = logisModel.predict(item)
    println(predictedVal + "|" + item)
  }
})
I understand collect is not scalable here, but if your streaming messages are few in number per interval, this is probably an option. This is what I see as possible in Spark 1.4.0; higher versions probably have a fix for this. See this if it's useful:
Save ML model for future usage

Spark Streaming - Issue with Passing parameters

Please take a look at the following Spark Streaming code written in Scala:
object HBase {
  var hbaseTable = ""
  val hConf = new HBaseConfiguration()
  hConf.set("hbase.zookeeper.quorum", "zookeeperhost")

  def init(input: (String)) {
    hbaseTable = input
  }

  def display() {
    print(hbaseTable)
  }

  def insertHbase(row: (String)) {
    val hTable = new HTable(hConf, hbaseTable)
  }
}
object mainHbase {
  def main(args: Array[String]) {
    if (args.length < 5) {
      System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
    HBase.init(hbaseTable)
    HBase.display()
    val sparkConf = new SparkConf().setAppName("mainHbase")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("checkpoint")
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
    lines.print()
    ssc.start()
  }
}
I am trying to initialize the parameter hbaseTable in the HBase object by calling the HBase.init method. It sets the parameter properly; I confirmed that by calling the HBase.display method on the next line.
However, when the HBase.insertHbase method is called inside foreachRDD, it throws an error saying that hbaseTable is not set.
Update with the exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:179)
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:149)
org.apache.hadoop.hbase.TableName.<init>(TableName.java:303)
org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339)
org.apache.hadoop.hbase.TableName.valueOf(TableName.java:426)
org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:156)
Can you please let me know how to make this code work?
"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized with 'only once' semantics in each JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing the object with the given value in the driver's JVM.
But when we do rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition of the given RDD. At that point, the HBase object is initialized again in each executor's JVM, and as we can see, no initialization has happened on this object there.
There are two options:
We can add an "isInitialized" check to the HBase object and make the (now conditional) call to initialize on each call to foreach (see the sketch after the snippet below).
Another option would be to use
rdd.foreachPartition { partition =>
  HBase.initialize(...)
  partition.foreach(elem => HBase.insert(elem))
}
This construction will amortize the initialization cost over the number of elements in each partition. It's also possible to combine it with an initialization check to prevent unnecessary bootstrap work.
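A minimal sketch of the first option, using the code from the question (the isInitialized flag is a hypothetical addition; hbaseTable here is the local String from args, which gets serialized into the closure):

// In the HBase object, expose a check so executors can initialize lazily:
// def isInitialized: Boolean = hbaseTable.nonEmpty

val storeStg = lines.foreachRDD { rdd =>
  rdd.foreach { row =>
    // Runs on the executors: initialize the per-JVM HBase object on first use.
    if (!HBase.isInitialized) HBase.init(hbaseTable)
    HBase.insertHbase(row)
  }
}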

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. Hit a roadblock with Avro objects not being "java serialisable"; found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
edit 1: Removed local variable reference in map function
I'm writing a driver to run a compute-heavy job on Spark using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows:
def genForum {
  class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
    override def write(t: Topic) {
      synchronized {
        super.write(t)
      }
    }
  }

  def makeTopic(x: ForumTopic): Topic = {
    // Omitted to save space
  }

  val writer = new MyWriter
  val q =
    DBCrawler.db.withSession {
      Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
    }
  val sz = q.size
  val c = new AtomicInteger(0)
  q.par.foreach { x =>
    writer.write(makeTopic(x))
    val count = c.incrementAndGet()
    print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
  }
  writer.close()
}
And my transformation looks as follows:
def sparkNLPTransformation() {
  val sc = new SparkContext("local[8]", "forumAddNlp")

  // io configuration
  val job = new Job()
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)

  // configure annotator
  val props = new Properties()
  props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
  val an = DAnnotator(props)

  // annotator function
  def annotatePosts(ann: DAnnotator, top: Topic): Topic = {
    val new_p = top.getPosts.map { x =>
      val at = new Annotation(x.getPostText.toString)
      ann.annotator.annotate(at)
      val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
      val r = SpecificData.get().deepCopy[Post](x.getSchema, x)
      if (t.nonEmpty) r.setTrees(t)
      r
    }
    val new_t = SpecificData.get().deepCopy[Topic](top.getSchema, top)
    new_t.setPosts(new_p)
    new_t
  }

  // transformation
  val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
  val new_ds = ds.map(x => (null, annotatePosts(x._2)))
  new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
    classOf[Void],
    classOf[Topic],
    classOf[ParquetOutputFormat[Topic]],
    job.getConfiguration
  )
}
Can you confirm that the data is indeed split into multiple blocks in HDFS? What is the total block count on the forum_dataset.parq file?
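From the Spark side, a quick sanity check (a sketch reusing the ds RDD from the question) is to look at the number of input partitions, since that is what bounds the parallelism; as noted in edit 2, repartitioning is one way to spread the work across cores:

// If this prints 1, the file is being read as a single split and only one core will be busy.
println(s"input partitions: ${ds.partitions.length}")

// As mentioned in edit 2, repartitioning (here into 8 partitions, matching local[8])
// lets the downstream map run as 8 parallel tasks.
val repartitioned = ds.repartition(8)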