I am trying to use KeyedProcessFunction, but ctx.timestamp inside processElement of my KeyedProcessFunction is returning null. Note that I'm using the default TimeCharacteristic, which is ProcessingTime (so I'm not even setting it).
I found this on Stack Overflow, but that one relates to EventTime rather than ProcessingTime.
Following the exact example of https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/process_function.html#example, I have created the following using Scala 2.11.12 and Flink 1.10, and I'm still getting the same error.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
object example {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// the source data stream
val stream = env.socketTextStream("localhost", 9999).map(x => {
var splitCsv = x.stripLineEnd.split(",")
(splitCsv(0), splitCsv(1))
}
)
// apply the process function onto a keyed stream
val result: DataStream[Tuple2[String, Long]] = stream
.keyBy(0)
.process(new CountWithTimeoutFunction())
result.print()
env.execute("Flink Streaming Demo STDOUT")
}
/**
* The data type stored in the state
*/
case class CountWithTimestamp(key: String, count: Long, lastModified: Long)
/**
* The implementation of the ProcessFunction that maintains the count and timeouts
*/
class CountWithTimeoutFunction extends KeyedProcessFunction[Tuple, (String, String), (String, Long)] {
/** The state that is maintained by this process function */
lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext
.getState(new ValueStateDescriptor[CountWithTimestamp]("myState", classOf[CountWithTimestamp]))
override def processElement(
value: (String, String),
ctx: KeyedProcessFunction[Tuple, (String, String), (String, Long)]#Context,
out: Collector[(String, Long)]): Unit = {
// initialize or retrieve/update the state
val current: CountWithTimestamp = state.value match {
case null =>
CountWithTimestamp(value._1, 1, ctx.timestamp)
case CountWithTimestamp(key, count, lastModified) =>
CountWithTimestamp(key, count + 1, ctx.timestamp)
}
// write the state back
state.update(current)
// schedule the next timer 60 seconds from the current event time
ctx.timerService.registerEventTimeTimer(current.lastModified + 60000)
}
override def onTimer(
timestamp: Long,
ctx: KeyedProcessFunction[Tuple, (String, String), (String, Long)]#OnTimerContext,
out: Collector[(String, Long)]): Unit = {
state.value match {
case CountWithTimestamp(key, count, lastModified) if (timestamp == lastModified + 60000) =>
out.collect((key, count))
case _ =>
}
}
}
}
Here is the error:
Caused by: java.lang.NullPointerException
    at scala.Predef$.Long2long(Predef.scala:363)
    at com.leidos.example$CountWithTimeoutFunction.processElement(example.scala:57)
    at com.leidos.example$CountWithTimeoutFunction.processElement(example.scala:42)
    at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:85)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:173)
    at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.processElement(StreamTaskNetworkInput.java:151)
    at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:128)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.lang.Thread.run(Thread.java:748)
Any idea what I am doing wrong?
Thank you in advance!
The problem is that in line 57 you are accessing the timestamp field of the Context. This field is null if you are using ProcessingTime, or if you are using EventTime without specifying a timestamp extractor.
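If you want to stay on ProcessingTime, a minimal sketch of how processElement could look (my untested adaptation of your example): take the current time from the timer service and register a processing-time timer so that onTimer actually fires.
override def processElement(
    value: (String, String),
    ctx: KeyedProcessFunction[Tuple, (String, String), (String, Long)]#Context,
    out: Collector[(String, Long)]): Unit = {
  // ctx.timestamp is null under ProcessingTime, so read the clock from the timer service
  val now = ctx.timerService.currentProcessingTime
  val current: CountWithTimestamp = state.value match {
    case null => CountWithTimestamp(value._1, 1, now)
    case CountWithTimestamp(key, count, _) => CountWithTimestamp(key, count + 1, now)
  }
  state.update(current)
  // register a processing-time timer (an event-time timer never fires without watermarks)
  ctx.timerService.registerProcessingTimeTimer(current.lastModified + 60000)
}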
Related
I am following https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/jdbc.html to use a MySQL database as a sink for Flink. The code compiles successfully, but executing the job in a Flink cluster fails with:
The program finished with the following exception:
The implementation of the AbstractJdbcOutputFormat is not serializable. The object probably contains or references non serializable fields.
org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:151)
org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:126)
org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:71)
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1899)
org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:189)
org.apache.flink.streaming.api.datastream.DataStream.addSink(DataStream.java:1296)
org.apache.flink.streaming.api.scala.DataStream.addSink(DataStream.scala:1131)
Aggregator.Aggregator$.main(Aggregator.scala:81)
Here is the relevant part of the code:
object Aggregator {
#throws[Exception]
def main(args: Array[String]): Unit = {
[...]
val counts = stream.map { x => (
x.get("value").get("id").asInt(),
x.get("value").get("kpi").asDouble()
)}
.keyBy(0)
.timeWindow(Time.seconds(60))
.sum(1)
counts.print()
val statementBuilder: JdbcStatementBuilder[(Int, Double)] = (ps: PreparedStatement, t: (Int, Double)) => {
ps.setInt(1, t._1);
ps.setDouble(2, t._2);
};
val connection = new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withDriverName("mysql.Driver")
.withPassword("XXX")
.withUrl("jdbc:mysql://<DB_HOST>:3306/<DB_NAME>")
.withUsername("<USERNAME>")
.build();
val jdbcSink = JdbcSink.sink(
"INSERT INTO table (id, kpi) VALUES (?, ?)",
statementBuilder,
connection);
counts.addSink(jdbcSink)
env.execute("Aggregator")
}
}
I am not sure which part of the code is the problem here or how to debug it. Unfortunately, I also cannot find a reference implementation of a JDBC sink in Scala. Any help is appreciated!
What worked for me is explicitly implementing the JdbcStatementBuilder as an anonymous class instead of the lambda. Something like:
val statementBuilder: JdbcStatementBuilder[(Int, Double)] =
new JdbcStatementBuilder[(Int, Double)] {
override def accept(ps: PreparedStatement, t: (Int, Double)): Unit = {
ps.setInt(1, t._1)
ps.setDouble(2, t._2)
}
}
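With the explicit builder in place, the rest of the setup from the question should work unchanged; a minimal sketch of the sink call, reusing counts and connection exactly as defined in the question:
counts.addSink(
  JdbcSink.sink(
    "INSERT INTO table (id, kpi) VALUES (?, ?)",
    statementBuilder,
    connection))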
I have a scenario in which we need to calculate 'charges' for a stream of events, with the following details:
1. Each event contains eventTime, facet, units
2. A free quantity per facet needs to offset the earliest events based on eventTime
3. Prices are also specified per facet
4. Events that arrive within a single minute can be considered equivalent (for reduced state maintenance), and all of them need to have free units proportionally distributed
I was hoping to make this work in the following manner using Spark Structured Streaming:
1. Aggregate events at the minute level per facet using the window function
2. Join with the price and free quantity
3. Group by facet
4. flatMapGroups per facet, then sort the aggregations by window start time and apply the free units
What I am noticing is that the output of #4 contains only the aggregations for which new events came in, not all the aggregations since the watermark.
Qn: How can I fix this code to get all aggregations since the watermark?
Could someone help?
Thanks
package test
import java.sql.Timestamp
import java.util.UUID
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{sum, udf, window}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import org.apache.spark.streaming.{Seconds, StreamingContext}
case class UsageEvent(Id: String, facetId: String, Units: Double, timeStamp:Timestamp)
case class FacetPricePoints(facetId: String, Price: Double, FreeUnits: Double)
case class UsageBlock(facetId: String, start:Timestamp, Units: Double)
case class UsageBlockWithPrice(facetId: String, start:Timestamp, Units: Double, Price: Double, FreeUnits: Double)
case class UsageBlockWithCharge(facetId: String, start:Timestamp, Units: Double, Price: Double, FreeUnits: Double, ChargedUnits: Double, Charge: Double)
object TestProcessing {
def getUsageEventStream(ts: Timestamp, units: String) : UsageEvent = {UsageEvent(UUID.randomUUID().toString, "F1", units.toInt % 20, ts)}
implicit def ordered: Ordering[Timestamp] = new Ordering[Timestamp] {def compare(x: Timestamp, y: Timestamp): Int = x compareTo y}
def ChargeUsageBlock(Key: String, Value: Iterator[UsageBlockWithPrice]) : Iterator[UsageBlockWithCharge] =
{
val usageBlocks = Value.toList.sortBy(ub => ub.start)
var freeUnits = 0.0
var freeUnitsSet = false
var newUe = for (ue <- usageBlocks)
yield {
freeUnits = if (!freeUnitsSet) ue.FreeUnits else freeUnits
freeUnitsSet = true
val freeUnitsInBlock = if (freeUnits > ue.Units) ue.Units else freeUnits
val chargedUnits = ue.Units - freeUnitsInBlock
freeUnits -= freeUnitsInBlock // todo: need to specify precision and rounding
UsageBlockWithCharge(ue.facetId, ue.start, ue.Units, ue.Price, freeUnitsInBlock, chargedUnits, chargedUnits * ue.Price)
}
newUe.iterator
}
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.appName("Test").getOrCreate()
val stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val prices = ssc.sparkContext.parallelize(List( FacetPricePoints("F1", 30.0, 100.0))).toDF()
val getUsageEventStreamUDF = udf((ts: Timestamp, units: String) => getUsageEventStream(ts, units)) // .where($"Units" < 2).
val usageEventsRaw = stream.withColumn("Usage", getUsageEventStreamUDF(stream("timestamp"), stream("value"))).select("Usage.*").as[UsageEvent].dropDuplicates("Id").withWatermark("timeStamp", "1 hour")
val aggUsage = usageEventsRaw.groupBy($"facetId", window($"timeStamp", "1 minute")).agg(sum($"Units") as "Units").selectExpr("facetId", "window.start", "Units").as[UsageBlock]
val fifoRate = (Key: String, Value: Iterator[UsageBlockWithPrice]) => { ChargeUsageBlock(Key, Value) }
val aggUsageCharge = aggUsage.joinWith(prices, prices.col("facetId") === usageEventsRaw.col("facetId")).select("_1.*", "_2.Price", "_2.FreeUnits").as[UsageBlockWithPrice].groupByKey(x => x.facetId).flatMapGroups(fifoRate).withWatermark("start", "1 hour")
val fin = aggUsageCharge.writeStream.trigger((Trigger.ProcessingTime("10 seconds"))).outputMode(OutputMode.Update).format("console").start()
// this applies freeUnits for every minute instead of just applying it once
fin.awaitTermination()
ssc.start()
ssc.awaitTermination()
}
}
I am following the quick start example of Flink: Monitoring the Wikipedia Edit Stream.
The example is in Java, and I am implementing it in Scala, as follows:
/**
* Wikipedia Edit Monitoring
*/
object WikipediaEditMonitoring {
def main(args: Array[String]) {
// set up the execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val edits: DataStream[WikipediaEditEvent] = env.addSource(new WikipediaEditsSource)
val result = edits.keyBy( _.getUser )
.timeWindow(Time.seconds(5))
.fold(("", 0L)) {
(acc: (String, Long), event: WikipediaEditEvent) => {
(event.getUser, acc._2 + event.getByteDiff)
}
}
result.print
// execute program
env.execute("Wikipedia Edit Monitoring")
}
}
However, the fold function in Flink is already deprecated, and the aggregate function is recommended.
But I could not find an example or tutorial on how to convert the deprecated fold to aggregate.
Any idea how to do this? Probably not just by applying aggregate.
UPDATE
I have another implementation as follows:
/**
* Wikipedia Edit Monitoring
*/
object WikipediaEditMonitoring {
def main(args: Array[String]) {
// set up the execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val edits: DataStream[WikipediaEditEvent] = env.addSource(new WikipediaEditsSource)
val result = edits
.map( e => UserWithEdits(e.getUser, e.getByteDiff) )
.keyBy( "user" )
.timeWindow(Time.seconds(5))
.sum("edits")
result.print
// execute program
env.execute("Wikipedia Edit Monitoring")
}
/** Data type for words with count */
case class UserWithEdits(user: String, edits: Long)
}
I would also like to know how to implement this using a self-defined AggregateFunction.
UPDATE
I followed this documentation: AggregateFunction, but have the following question:
In the source code of the AggregateFunction interface for release 1.3, you can see that add indeed returns void:
void add(IN value, ACC accumulator);
But in the version 1.4 AggregateFunction, it returns:
ACC add(IN value, ACC accumulator);
How should I handle this?
The Flink version I am using is 1.3.2, and the documentation for this version does not cover AggregateFunction; there is also no 1.4 release in the artifactory yet.
You will find some documentation for AggregateFunction in the Flink 1.4 docs, including an example.
The version included in 1.3.2 is limited to being used with mutable accumulator types, where the add operation modifies the accumulator. This has been fixed for Flink 1.4, but 1.4 hasn't been released yet.
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08
import org.apache.flink.streaming.connectors.wikiedits.{WikipediaEditEvent, WikipediaEditsSource}
class SumAggregate extends AggregateFunction[WikipediaEditEvent, (String, Int), (String, Int)] {
override def createAccumulator() = ("", 0)
override def add(value: WikipediaEditEvent, accumulator: (String, Int)) = (value.getUser, value.getByteDiff + accumulator._2)
override def getResult(accumulator: (String, Int)) = accumulator
override def merge(a: (String, Int), b: (String, Int)) = (a._1, a._2 + b._2)
}
object WikipediaAnalysis extends App {
val see: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val edits: DataStream[WikipediaEditEvent] = see.addSource(new WikipediaEditsSource())
val result: DataStream[(String, Int)] = edits
.keyBy(_.getUser)
.timeWindow(Time.seconds(5))
.aggregate(new SumAggregate)
// .fold(("", 0))((acc, event) => (event.getUser, acc._2 + event.getByteDiff))
result.print()
result.map(_.toString()).addSink(new FlinkKafkaProducer08[String]("localhost:9092", "wiki-result", new SimpleStringSchema()))
see.execute("Wikipedia User Edit Volume")
}
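For Flink 1.3.2, where add returns void, one option is to use a mutable accumulator class and modify it in place. A rough sketch under that assumption (names are mine, not tested against 1.3.2):
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent

// mutable accumulator, because the 1.3 interface mutates it instead of returning a new one
class EditAccumulator(var user: String, var byteDiff: Int)

class SumAggregate13 extends AggregateFunction[WikipediaEditEvent, EditAccumulator, (String, Int)] {
  override def createAccumulator(): EditAccumulator = new EditAccumulator("", 0)

  // 1.3 signature: mutate the accumulator in place, return nothing
  override def add(value: WikipediaEditEvent, accumulator: EditAccumulator): Unit = {
    accumulator.user = value.getUser
    accumulator.byteDiff += value.getByteDiff
  }

  override def getResult(accumulator: EditAccumulator): (String, Int) =
    (accumulator.user, accumulator.byteDiff)

  override def merge(a: EditAccumulator, b: EditAccumulator): EditAccumulator = {
    a.byteDiff += b.byteDiff
    a
  }
}

// used the same way: .keyBy(_.getUser).timeWindow(Time.seconds(5)).aggregate(new SumAggregate13)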
I am experiencing a reproducible error while producing Avro messages with reactive kafka and avro4s. Once the identityMapCapacity of the client (CachedSchemaRegistryClient) is reached, serialization fails with
java.lang.IllegalStateException: Too many schema objects created for <myTopic>-value
This is unexpected, since all messages should have the same schema - they are serializations of the same case class.
val avroProducerSettings: ProducerSettings[String, GenericRecord] =
ProducerSettings(system, Serdes.String().serializer(),
avroSerde.serializer())
.withBootstrapServers(settings.bootstrapServer)
val avroProdFlow: Flow[ProducerMessage.Message[String, GenericRecord, String],
ProducerMessage.Result[String, GenericRecord, String],
NotUsed] = Producer.flow(avroProducerSettings)
val avroQueue: SourceQueueWithComplete[Message[String, GenericRecord, String]] =
Source.queue(bufferSize, overflowStrategy)
.via(avroProdFlow)
.map(logResult)
.to(Sink.ignore)
.run()
...
queue.offer(msg)
The serializer is a KafkaAvroSerializer, instantiated with a new CachedSchemaRegistryClient(settings.schemaRegistry, 1000)
Generating the GenericRecord:
def toAvro[A](a: A)(implicit recordFormat: RecordFormat[A]): GenericRecord =
recordFormat.to(a)
val makeEdgeMessage: (Edge, String) => Message[String, GenericRecord, String] = { (edge, topic) =>
val edgeAvro: GenericRecord = toAvro(edge)
val record = new ProducerRecord[String, GenericRecord](topic, edge.id, edgeAvro)
ProducerMessage.Message(record, edge.id)
}
The schema is created deep in the code (io.confluent.kafka.serializers.AbstractKafkaAvroSerDe#getSchema, invoked by io.confluent.kafka.serializers.AbstractKafkaAvroSerializer#serializeImpl) where I have no influence on it, so I have no idea how to fix the leak. Looks to me like the two confluent projects do not work well together.
The issues I have found here, here and here do not seem to address my use case.
The two workarounds for me are currently:
not use schema registry - not a long-term solution obviously
create custom SchemaRegistryClient not relying on object identity - doable but I would like to avoid creating more issues than by reimplementing
Is there a way to generate or cache a consistent schema depending on message/record type and use it with my setup?
edit 2017.11.20
The issue in my case was that each GenericRecord carrying my message had been serialized by a different instance of RecordFormat, each containing a different instance of the Schema. The implicit resolution here generated a new instance every time:
def toAvro[A](a: A)(implicit recordFormat: RecordFormat[A]): GenericRecord = recordFormat.to(a)
The solution was to pin the RecordFormat instance to a val and reuse it explicitly. Many thanks to https://github.com/heliocentrist for explaining the details.
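In code, the fix amounts to deriving the RecordFormat once as a val and reusing it explicitly, instead of summoning it implicitly on every call. A minimal sketch, with a hypothetical stand-in Edge case class:
import com.sksamuel.avro4s.RecordFormat
import org.apache.avro.generic.GenericRecord

case class Edge(id: String, weight: Double) // stand-in for the real Edge case class

// derive the RecordFormat (and therefore the Schema) exactly once and reuse it
val edgeFormat: RecordFormat[Edge] = RecordFormat[Edge]

def toAvro(edge: Edge): GenericRecord = edgeFormat.to(edge)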
original response:
After waiting for a while (and getting no answer to the GitHub issue either), I had to implement my own SchemaRegistryClient. Over 90% is copied from the original CachedSchemaRegistryClient, just translated into Scala. Using a Scala mutable.Map fixed the memory leak. I have not performed any comprehensive tests, so use at your own risk.
import java.util
import io.confluent.kafka.schemaregistry.client.rest.entities.{ Config, SchemaString }
import io.confluent.kafka.schemaregistry.client.rest.entities.requests.ConfigUpdateRequest
import io.confluent.kafka.schemaregistry.client.rest.{ RestService, entities }
import io.confluent.kafka.schemaregistry.client.{ SchemaMetadata, SchemaRegistryClient }
import org.apache.avro.Schema
import scala.collection.mutable
class CachingSchemaRegistryClient(val restService: RestService, val identityMapCapacity: Int)
extends SchemaRegistryClient {
val schemaCache: mutable.Map[String, mutable.Map[Schema, Integer]] = mutable.Map()
val idCache: mutable.Map[String, mutable.Map[Integer, Schema]] =
mutable.Map(null.asInstanceOf[String] -> mutable.Map())
val versionCache: mutable.Map[String, mutable.Map[Schema, Integer]] = mutable.Map()
def this(baseUrl: String, identityMapCapacity: Int) {
this(new RestService(baseUrl), identityMapCapacity)
}
def this(baseUrls: util.List[String], identityMapCapacity: Int) {
this(new RestService(baseUrls), identityMapCapacity)
}
def registerAndGetId(subject: String, schema: Schema): Int =
restService.registerSchema(schema.toString, subject)
def getSchemaByIdFromRegistry(id: Int): Schema = {
val restSchema: SchemaString = restService.getId(id)
(new Schema.Parser).parse(restSchema.getSchemaString)
}
def getVersionFromRegistry(subject: String, schema: Schema): Int = {
val response: entities.Schema = restService.lookUpSubjectVersion(schema.toString, subject)
response.getVersion.intValue
}
override def getVersion(subject: String, schema: Schema): Int = synchronized {
val schemaVersionMap: mutable.Map[Schema, Integer] =
versionCache.getOrElseUpdate(subject, mutable.Map())
val version: Integer = schemaVersionMap.getOrElse(
schema, {
if (schemaVersionMap.size >= identityMapCapacity) {
throw new IllegalStateException(s"Too many schema objects created for $subject!")
}
val version = new Integer(getVersionFromRegistry(subject, schema))
schemaVersionMap.put(schema, version)
version
}
)
version.intValue()
}
override def getAllSubjects: util.List[String] = restService.getAllSubjects()
override def getByID(id: Int): Schema = synchronized { getBySubjectAndID(null, id) }
override def getBySubjectAndID(subject: String, id: Int): Schema = synchronized {
val idSchemaMap: mutable.Map[Integer, Schema] = idCache.getOrElseUpdate(subject, mutable.Map())
idSchemaMap.getOrElseUpdate(id, getSchemaByIdFromRegistry(id))
}
override def getSchemaMetadata(subject: String, version: Int): SchemaMetadata = {
val response = restService.getVersion(subject, version)
val id = response.getId.intValue
val schema = response.getSchema
new SchemaMetadata(id, version, schema)
}
override def getLatestSchemaMetadata(subject: String): SchemaMetadata = synchronized {
val response = restService.getLatestVersion(subject)
val id = response.getId.intValue
val version = response.getVersion.intValue
val schema = response.getSchema
new SchemaMetadata(id, version, schema)
}
override def updateCompatibility(subject: String, compatibility: String): String = {
val response: ConfigUpdateRequest = restService.updateCompatibility(compatibility, subject)
response.getCompatibilityLevel
}
override def getCompatibility(subject: String): String = {
val response: Config = restService.getConfig(subject)
response.getCompatibilityLevel
}
override def testCompatibility(subject: String, schema: Schema): Boolean =
restService.testCompatibility(schema.toString(), subject, "latest")
override def register(subject: String, schema: Schema): Int = synchronized {
val schemaIdMap: mutable.Map[Schema, Integer] =
schemaCache.getOrElseUpdate(subject, mutable.Map())
val id = schemaIdMap.getOrElse(
schema, {
if (schemaIdMap.size >= identityMapCapacity)
throw new IllegalStateException(s"Too many schema objects created for $subject!")
val id: Integer = new Integer(registerAndGetId(subject, schema))
schemaIdMap.put(schema, id)
idCache(null).put(id, schema)
id
}
)
id.intValue()
}
}
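Usage is then the same as with Confluent's client; a hedged sketch of plugging it into the serializer (the single-argument KafkaAvroSerializer constructor may differ between Confluent versions):
import io.confluent.kafka.serializers.KafkaAvroSerializer

// hypothetical registry URL; substitute your own
val registryClient = new CachingSchemaRegistryClient("http://localhost:8081", 1000)
val serializer = new KafkaAvroSerializer(registryClient)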
I want to join two streams (JSON) coming from a Kafka producer.
The code works if I filter the data, but it does not seem to work when I join them. I want to print the joined stream to the console, but nothing appears.
This is my code:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.json4s._
import org.json4s.native.JsonMethods
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
object App {
def main(args : Array[String]) {
case class Data(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Data, stt: Stt)
case class Datas(location: String, timestamp: Long, measurement: Int, unit: String, accuracy: Double)
case class Sensor2(sensor_name: String, start_date: String, end_date: String, data_schema: Array[String], data: Datas, stt: Stt)
val properties = new Properties();
properties.setProperty("bootstrap.servers", "0.0.0.0:9092");
properties.setProperty("group.id", "test");
val env = StreamExecutionEnvironment.getExecutionEnvironment
val consumer1 = new FlinkKafkaConsumer010[String]("topics1", new SimpleStringSchema(), properties)
val stream1 = env
.addSource(consumer1)
val consumer2 = new FlinkKafkaConsumer010[String]("topics2", new SimpleStringSchema(), properties)
val stream2 = env
.addSource(consumer2)
val s1 = stream1.map { x => {
implicit val formats = DefaultFormats
JsonMethods.parse(x).extract[Sensor]
}
}
val s2 = stream2.map { x => {
implicit val formats = DefaultFormats
JsonMethods.parse(x).extract[Sensor2]
}
}
val s1t = s1.assignAscendingTimestamps { x => x.data.timestamp }
val s2t = s2.assignAscendingTimestamps { x => x.data.timestamp }
val j1pre = s1t.join(s2t)
.where(_.data.unit)
.equalTo(_.data.unit)
.window(TumblingEventTimeWindows.of(Time.seconds(2L)))
.apply((g, s) => (s.sensor_name, g.sensor_name, s.data.measurement))
env.execute()
}
}
I think the problem is in the assignment of the timestamps; assignAscendingTimestamps on the two sources may not be the right function.
The JSON produced by the Kafka producer has a field data.timestamp that should be used as the event timestamp, but I don't know how to manage that.
I also thought that I might have to apply a time window (as in Spark) to the incoming tuples, but I'm not sure this is the right solution.
I think your code just needs some minor adjustments. First of all, since you want to work in EventTime, you should set the appropriate TimeCharacteristic:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Also, the code you pasted is missing a sink for the stream. If you want to print to the console, you should add:
j1pre.print
The rest of your code seems fine.
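Put together, the two adjustments could look like this (a sketch against the question's main, with the extra import for TimeCharacteristic):
import org.apache.flink.streaming.api.TimeCharacteristic

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// ... build the consumers, s1t, s2t and j1pre exactly as in the question ...

j1pre.print()

env.execute()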