Flink's broadcast state behavior - Scala

I am trying to play with Flink's broadcast state in a simple case.
I just want to multiply an integer stream by another integer coming from a broadcast stream.
The behavior of my broadcast is "weird": if I put too few elements in my input stream (like 10), nothing happens and my MapState is empty, but if I put more elements (like 100) I get the behavior I want (the integer stream is multiplied by 2 here).
Why is the broadcast stream not taken into account if I give too few elements?
How can I control when the broadcast stream is working?
Optional: I want to keep only the last element of my broadcast stream; is .clear() the right way to do it?
Thank you!
Here's my BroadcastProcessFunction:
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConversions._
class BroadcastProcess extends BroadcastProcessFunction[Int, Int, Int] {

  override def processElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#ReadOnlyContext, out: Collector[Int]) = {
    val currentBroadcastState = ctx.getBroadcastState(State.mapState).immutableEntries()
    if (currentBroadcastState.isEmpty) {
      out.collect(value)
    } else {
      out.collect(currentBroadcastState.last.getValue * value)
    }
  }

  override def processBroadcastElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#Context, out: Collector[Int]) = {
    // Keep only last state
    ctx.getBroadcastState(State.mapState).clear()
    // Add state
    ctx.getBroadcastState(State.mapState).put("key", value)
  }
}
And my MapState:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala._
object State {
  val mapState: MapStateDescriptor[String, Int] =
    new MapStateDescriptor(
      "State",
      createTypeInformation[String],
      createTypeInformation[Int]
    )
}
And my Main:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object Broadcast {
  def main(args: Array[String]): Unit = {
    val numberElements = 100

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val broadcastStream = env.fromElements(2).broadcast(State.mapState)

    val input = (1 to numberElements).toList
    val inputStream = env.fromCollection(input)

    val outputStream = inputStream
      .connect(broadcastStream)
      .process(new BroadcastProcess())

    outputStream.print()
    env.execute()
  }
}
Edit: I use Flink 1.5, and the Broadcast State documentation is here.

Flink does not synchronize the ingestion of streams, i.e., each stream produces data as soon as it can. This is true for regular and broadcast inputs: the BroadcastProcess will not wait for the first broadcast element to arrive before it starts ingesting the regular input.
When you put more numbers into the regular input, it simply takes more time to serialize, deserialize, and serve that input, so the broadcast element is already present when the first regular number arrives.
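If you need the regular elements to wait for the broadcast value, you have to buffer them yourself. Below is a minimal sketch of that idea, assuming the same State.mapState descriptor as above; note that the in-memory buffer is not part of Flink's checkpointed state, so it would be lost on a restart (a KeyedBroadcastProcessFunction with keyed state would be the fault-tolerant variant).
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

// Sketch only: buffer regular elements until the first broadcast element arrives.
class BufferingBroadcastProcess extends BroadcastProcessFunction[Int, Int, Int] {

  // NOT checkpointed: lost on failure/restart.
  private val buffer = ArrayBuffer.empty[Int]

  override def processElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#ReadOnlyContext, out: Collector[Int]) = {
    val state = ctx.getBroadcastState(State.mapState).immutableEntries()
    if (state.isEmpty) {
      buffer += value // broadcast side not ready yet: hold the element back
    } else {
      out.collect(state.last.getValue * value)
    }
  }

  override def processBroadcastElement(value: Int, ctx: BroadcastProcessFunction[Int, Int, Int]#Context, out: Collector[Int]) = {
    ctx.getBroadcastState(State.mapState).clear() // keep only the latest factor
    ctx.getBroadcastState(State.mapState).put("key", value)
    buffer.foreach(v => out.collect(v * value))   // flush everything buffered so far
    buffer.clear()
  }
}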

Related

How to combine data from two Kafka topic ZStreams into one ZStream?

import org.slf4j.LoggerFactory
import zio.blocking.Blocking
import zio.clock.Clock
import zio.console.{Console, putStrLn}
import zio.kafka.consumer.{CommittableRecord, Consumer, ConsumerSettings, Subscription}
import zio.kafka.consumer.Consumer.{AutoOffsetStrategy, OffsetRetrieval}
import zio.kafka.serde.Serde
import zio.stream.ZStream
import zio.{ExitCode, Has, URIO, ZIO, ZLayer}
object Test2Topics extends zio.App {

  val logger = LoggerFactory.getLogger(this.getClass)

  val consumerSettings: ConsumerSettings =
    ConsumerSettings(List("localhost:9092"))
      .withGroupId(s"consumer-${java.util.UUID.randomUUID().toString}")
      .withOffsetRetrieval(OffsetRetrieval.Auto(AutoOffsetStrategy.Earliest))

  val consumer: ZLayer[Clock with Blocking, Throwable, Has[Consumer]] =
    ZLayer.fromManaged(Consumer.make(consumerSettings))

  val streamString: ZStream[Any with Has[Consumer], Throwable, CommittableRecord[String, String]] =
    Consumer.subscribeAnd(Subscription.topics("test"))
      .plainStream(Serde.string, Serde.string)

  val streamInt: ZStream[Any with Has[Consumer], Throwable, CommittableRecord[String, String]] =
    Consumer.subscribeAnd(Subscription.topics("topic"))
      .plainStream(Serde.string, Serde.string)

  val combined = streamString.zipWithLatest(streamInt)((a, b) => (a, b))

  val program = for {
    fiber1 <- streamInt.tap(r => putStrLn(s"streamInt: ${r.toString}")).runDrain.forkDaemon
    fiber2 <- streamString.tap(r => putStrLn(s"streamString: ${r.toString}")).runDrain.forkDaemon
  } yield ZIO.raceAll(fiber1.join, List(fiber2.join))

  override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] = {
    //combined.tap(r => putStrLn(s"Combined: ${r.toString}")).runDrain.provideSomeLayer(consumer ++ Console.live).exitCode
    program.provideSomeLayer(consumer ++ Console.live).exitCode
  }
}
Somehow, when I try to combine the output from the two topics test and topic, nothing is printed out. Printing both streams in parallel doesn't work either, but printing just one stream at a time works.
Did anyone experience anything like this?
You are composing one shared layer that provides a single consumer instance, and then you initialize this one instance twice in a row to subscribe to the two topics one after the other.
A single consumer instance should only be initialized once, so the above code will never work.
I believe setting up two independent consumer-to-stream compositions like this will help (a further sketch for the combined stream follows the snippet):
val program = for {
  fiber1 <- streamInt.tap(r => putStrLn(s"streamInt: ${r.toString}")).runDrain.forkDaemon.provideSomeLayer(consumer)
  fiber2 <- streamString.tap(r => putStrLn(s"streamString: ${r.toString}")).runDrain.forkDaemon.provideSomeLayer(consumer)
} yield {...}
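For the original goal of combining the two topics into one ZStream, the same idea can be applied at the stream level before zipping. This is only a rough sketch under the assumption of the ZIO 1.x / zio-kafka APIs used in the question; the provideSomeLayer type parameters in particular may need adjusting:
// Hypothetical sketch: give each stream its own Consumer layer, then combine.
val intWithConsumer =
  streamInt.provideSomeLayer[Clock with Blocking](consumer)
val stringWithConsumer =
  streamString.provideSomeLayer[Clock with Blocking](consumer)

val combinedSafely =
  stringWithConsumer.zipWithLatest(intWithConsumer)((a, b) => (a, b))

override def run(args: List[String]): URIO[zio.ZEnv, ExitCode] =
  combinedSafely
    .tap(r => putStrLn(s"Combined: ${r.toString}"))
    .runDrain
    .exitCode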

Concurrent Streams Don't Print Anything to the Console

I'm trying to build an example of using the Stream.concurrently method in fs2. I'm implementing the producer/consumer pattern, using a Queue as the shared state:
import cats.effect.{ExitCode, IO, IOApp}
import cats.effect.std.{Queue, Random}
import fs2.Stream
import scala.concurrent.duration._

object Fs2Tutorial extends IOApp {

  val random: IO[Random[IO]] = Random.scalaUtilRandom[IO]
  val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10)

  val producer: IO[Nothing] = for {
    r <- random
    q <- queue
    p <- r.betweenInt(1, 11)
      .flatMap(q.offer)
      .flatTap(_ => IO.sleep(1.second))
      .foreverM
  } yield p

  val consumer: IO[Nothing] = for {
    q <- queue
    c <- q.take.flatMap { n =>
      IO.println(s"Consumed $n")
    }.foreverM
  } yield c

  val concurrently: Stream[IO, Nothing] = Stream.eval(producer).concurrently(Stream.eval(consumer))

  override def run(args: List[String]): IO[ExitCode] = {
    concurrently.compile.drain.as(ExitCode.Success)
  }
}
I expect the program to print some "Consumed n", for some n. However, the program prints nothing to the console.
What's wrong with the above code?
What's wrong with the above code?
You are not using the same Queue in the consumer and in the producer; rather, each of them creates its own new, independent Queue (the same happens with Random, BTW).
This is a common mistake made by newcomers who don't yet grasp the main principles behind a data type like IO.
When you write val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10), you are saying that queue is a program that, when evaluated, will produce a value of type Queue[IO, Int]; that is the whole point of this.
The program becomes a value, and like any value you can manipulate it in any way to produce new values, for example using flatMap. So when both consumer and producer create a new program by flatMapping queue, they each create new, independent programs / values.
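A minimal illustration of that point (a hypothetical snippet, not from the question):
import cats.effect.IO
import cats.effect.std.Queue

val queue: IO[Queue[IO, Int]] = Queue.bounded[IO, Int](10)

// Evaluating `queue` twice builds two distinct queues:
val twoDifferentQueues: IO[Boolean] =
  for {
    q1 <- queue
    q2 <- queue
  } yield q1 eq q2 // false: each flatMap of `queue` allocates a fresh Queue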
You can fix that code like this:
import cats.effect.{IO, IOApp}
import cats.effect.std.{Queue, Random}
import cats.syntax.all._
import fs2.Stream
import scala.concurrent.duration._

object Fs2Tutorial extends IOApp.Simple {
  override final val run: IO[Unit] = {
    val resources =
      (
        Random.scalaUtilRandom[IO],
        Queue.bounded[IO, Int](10)
      ).tupled

    val concurrently =
      Stream.eval(resources).flatMap {
        case (random, queue) =>
          val producer =
            Stream
              .fixedDelay[IO](1.second)
              .evalMap(_ => random.betweenInt(1, 11))
              .evalMap(queue.offer)

          val consumer =
            Stream.fromQueueUnterminated(queue).evalMap(n => IO.println(s"Consumed $n"))

          producer.concurrently(consumer)
      }

    concurrently.interruptAfter(10.seconds).compile.drain >> IO.println("Finished!")
  }
}
(You can see it running here).
PS: I would recommend looking into the "Programs as Values" series by Fabio Labella: https://systemfw.org/archive.html

Apache Beam stops processing PubSub messages after some time

I'm trying to write a simple Apache Beam pipeline (which will run on the Dataflow runner) to do the following:
Read PubSub messages containing file paths on GCS from a subscription.
For each message, read the data contained in the file associated with the message (the files can be of a variety of formats: csv, jsonl, json, xml, ...).
Do some processing on each record.
Write back the result on GCS.
I'm using a 10-second fixed window on the messages. Since incoming files are already chunked (max size of 10 MB), I decided not to use splittable DoFns to read the files, in order to avoid adding useless complexity (especially for files that are not trivially splittable into chunks).
Here is a simplified code sample that gives the exact same problem of the full one:
package skytv.ingester
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import kantan.csv.rfc
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{Compression, FileIO, FileSystems, TextIO, WriteFilesResult}
import org.apache.beam.sdk.io.gcp.pubsub.{PubsubIO, PubsubMessage}
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.windowing.{BoundedWindow, FixedWindows, PaneInfo, Window}
import org.apache.beam.sdk.transforms.{Contextful, DoFn, MapElements, PTransform, ParDo, SerializableFunction, SimpleFunction, WithTimestamps}
import org.apache.beam.sdk.values.{KV, PCollection}
import org.joda.time.{Duration, Instant}
import skytv.cloudstorage.CloudStorageClient
import skytv.common.Closeable
import kantan.csv.ops._
import org.apache.beam.sdk.io.FileIO.{Sink, Write}
class FileReader extends DoFn[String, List[String]] {

  private def getFileReader(filePath: String) = {
    val cloudStorageClient = new CloudStorageClient()
    val inputStream = cloudStorageClient.getInputStream(filePath)
    val isr = new InputStreamReader(inputStream, StandardCharsets.UTF_8)
    new BufferedReader(isr)
  }

  private def getRowsIterator(fileReader: BufferedReader) = {
    fileReader
      .asUnsafeCsvReader[Seq[String]](rfc
        .withCellSeparator(',')
        .withoutHeader
        .withQuote('"'))
      .toIterator
  }

  @ProcessElement
  def processElement(c: ProcessContext): Unit = {
    val filePath = c.element()
    Closeable.tryWithResources(
      getFileReader(filePath)
    ) {
      fileReader => {
        getRowsIterator(fileReader)
          .foreach(record => c.output(record.toList))
      }
    }
  }
}
class DataWriter(tempFolder: String) extends PTransform[PCollection[List[String]], WriteFilesResult[String]] {

  private val convertRecord = Contextful.fn[List[String], String]((dr: List[String]) => {
    dr.mkString(",")
  })

  private val getSink = Contextful.fn[String, Sink[String]]((destinationKey: String) => {
    TextIO.sink()
  })

  private val getPartitioningKey = new SerializableFunction[List[String], String] {
    override def apply(input: List[String]): String = {
      input.head
    }
  }

  private val getNaming = Contextful.fn[String, Write.FileNaming]((destinationKey: String) => {
    new Write.FileNaming {
      override def getFilename(
        window: BoundedWindow,
        pane: PaneInfo,
        numShards: Int,
        shardIndex: Int,
        compression: Compression
      ): String = {
        s"$destinationKey-${window.maxTimestamp()}-${pane.getIndex}.csv"
      }
    }
  })

  override def expand(input: PCollection[List[String]]): WriteFilesResult[String] = {
    val fileWritingTransform = FileIO
      .writeDynamic[String, List[String]]()
      .by(getPartitioningKey)
      .withDestinationCoder(input.getPipeline.getCoderRegistry.getCoder(classOf[String]))
      .withTempDirectory(tempFolder)
      .via(convertRecord, getSink)
      .withNaming(getNaming)
      .withNumShards(1)

    input
      .apply("WriteToAvro", fileWritingTransform)
  }
}
object EnhancedIngesterScalaSimplified {

  private val SUBSCRIPTION_NAME = "projects/<project>/subscriptions/<subscription>"
  private val TMP_LOCATION = "gs://<path>"

  private val WINDOW_SIZE = Duration.standardSeconds(10)

  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).withValidation().create()
    FileSystems.setDefaultPipelineOptions(options)

    val p = Pipeline.create(options)

    val messages = p
      .apply("ReadMessages", PubsubIO.readMessagesWithAttributes.fromSubscription(SUBSCRIPTION_NAME))
//      .apply("AddProcessingTimeTimestamp", WithTimestamps.of(new SerializableFunction[PubsubMessage, Instant] {
//        override def apply(input: PubsubMessage): Instant = Instant.now()
//      }))

    val parsedMessages = messages
      .apply("ApplyWindow", Window.into[PubsubMessage](FixedWindows.of(WINDOW_SIZE)))
      .apply("ParseMessages", MapElements.via(new SimpleFunction[PubsubMessage, String]() {
        override def apply(msg: PubsubMessage): String = new String(msg.getPayload, StandardCharsets.UTF_8)
      }))

    val dataReadResult = parsedMessages
      .apply("ReadData", ParDo.of(new FileReader))

    val writerResult = dataReadResult.apply(
      "WriteData",
      new DataWriter(TMP_LOCATION)
    )

    writerResult.getPerDestinationOutputFilenames.apply(
      "FilesWritten",
      MapElements.via(new SimpleFunction[KV[String, String], String]() {
        override def apply(input: KV[String, String]): String = {
          println(s"Written ${input.getKey}, ${input.getValue}")
          input.getValue
        }
      }))

    p.run.waitUntilFinish()
  }
}
The problem is that after processing some messages (on the order of 1000), the job stops processing new messages and they remain unacknowledged in the PubSub subscription forever.
I saw that in such a situation the watermark stops advancing and the data freshness increases linearly and indefinitely; this is visible both in the Dataflow monitoring UI and in the growing backlog of the PubSub subscription.
Is it possible that there are some messages that remain stuck in the Dataflow queues, filling them so that no new messages can be added?
I thought there was some problem with how timestamps are computed by PubsubIO, so I tried to force the timestamps to be equal to the processing time of each message, but I had no success.
If I leave the Dataflow job in this state, it seems to continuously reprocess the same messages without writing any data to storage.
Do you have any idea on how to solve this problem?
Thanks!
Most likely the pipeline has encountered an error while processing one (or more) elements (and it shouldn't have anything to do with how timestamps are computed by PubsubIO), which stops the watermark from advancing, since the failed work will be retried again and again on Dataflow.
You can check whether there is any failure in the logs, specifically from the worker or harness component. If there is an unhandled runtime exception, such as a parse error, it is very likely the root cause of a streaming pipeline getting stuck.
If there is no UserCodeException, then it is likely some other issue caused by the Dataflow backend, and you can reach out to Dataflow customer support so engineers can look into the backend issue for your pipeline.
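One common way to keep a single bad element from stalling the whole pipeline is to catch exceptions inside the DoFn and log the element (or route it to a dead-letter output) instead of letting the bundle fail and be retried forever. A rough sketch of how the question's FileReader.processElement could be hardened, assuming the same getFileReader and getRowsIterator helpers and a logger field (both assumptions here, not shown in the original class):
@ProcessElement
def processElement(c: ProcessContext): Unit = {
  val filePath = c.element()
  try {
    Closeable.tryWithResources(getFileReader(filePath)) { fileReader =>
      getRowsIterator(fileReader).foreach(record => c.output(record.toList))
    }
  } catch {
    case e: Exception =>
      // Log and drop (or emit to a dead-letter PCollection via an additional output)
      // so Dataflow does not retry the same failing bundle indefinitely.
      logger.error(s"Failed to process file $filePath", e)
  }
}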

Does Akka Stream Implement the Join Semantic as Kafka Streams Does?

I am quite new to Akka Streams, whereas I have some experience with Kafka Streams.
One thing that seems to be lacking in Akka Streams is the possibility to join two different streams together.
Kafka Streams allows joining information coming from two different streams (or tables) using the messages' keys.
Is there something similar in Akka Streams?
The short answer is unfortunately no. I would argue that Akka Streams is lower level than Kafka Streams, Spark Streaming, or Flink. However, you have more control over what you are doing. Basically, this means that you can build your own join operator. Check this discussion at Lightbend.
Basically, you have to get data from two Sources, merge them, send them to a window based on time or number of tuples, compute the join, and emit the data to the Sink. I have done this PoC (which is still unfinished), but I followed the operators that I mentioned, and it compiles and works. Basically, I still have to join the data inside the window; currently, I am just emitting it in mini-batches (see the join sketch after the code).
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{Attributes, ClosedShape, FlowShape, Inlet, Outlet}
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler, TimerGraphStageLogic}
import scala.collection.mutable
import scala.concurrent.duration._
object StreamOpenGraphJoin {

  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystem("StreamOpenGraphJoin")

    val incrementSource: Source[Int, NotUsed] = Source(1 to 10).throttle(1, 1 second)
    val decrementSource: Source[Int, NotUsed] = Source(10 to 20).throttle(1, 1 second)

    def tokenizerSource(key: Int) = {
      Flow[Int].map { value =>
        (key, value)
      }
    }

    // Step 1 - setting up the fundamental for a stream graph
    val switchJoinStrategies = RunnableGraph.fromGraph(
      GraphDSL.create() { implicit builder =>
        import GraphDSL.Implicits._

        // Step 2 - add partition and merge strategy
        val tokenizerShape00 = builder.add(tokenizerSource(0))
        val tokenizerShape01 = builder.add(tokenizerSource(1))
        val mergeTupleShape = builder.add(Merge[(Int, Int)](2))

        val batchFlow = Flow.fromGraph(new BatchTimerFlow[(Int, Int)](5 seconds))
        val sinkShape = builder.add(Sink.foreach[(Int, Int)](x => println(s" > sink: $x")))

        // Step 3 - tying up the components
        incrementSource ~> tokenizerShape00 ~> mergeTupleShape.in(0)
        decrementSource ~> tokenizerShape01 ~> mergeTupleShape.in(1)
        mergeTupleShape.out ~> batchFlow ~> sinkShape

        // Step 4 - return the shape
        ClosedShape
      }
    )

    // run the graph and materialize it
    val graph = switchJoinStrategies.run()
  }

  // step 0: define the shape
  class BatchTimerFlow[T](silencePeriod: FiniteDuration) extends GraphStage[FlowShape[T, T]] {
    // step 1: define the ports and the component-specific members
    val in = Inlet[T]("BatchTimerFlow.in")
    val out = Outlet[T]("BatchTimerFlow.out")

    // step 3: create the logic
    override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
      // mutable state
      val batch = new mutable.Queue[T]
      var open = false

      // step 4: define mutable state implement my logic here
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          try {
            val nextElement = grab(in)
            batch.enqueue(nextElement)
            Thread.sleep(50) // simulate an expensive computation

            if (open) pull(in) // send demand upstream signal, asking for another element
            else {
              // forward the element to the downstream operator
              emitMultiple(out, batch.dequeueAll(_ => true).to[collection.immutable.Iterable])
              open = true
              scheduleOnce(None, silencePeriod)
            }
          } catch {
            case e: Throwable => failStage(e)
          }
        }
      })

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          pull(in)
        }
      })

      override protected def onTimer(timerKey: Any): Unit = {
        open = false
      }
    }

    // step 2: construct a new shape
    override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
  }
}
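The missing piece, joining the elements inside one mini-batch, could look something like the sketch below (a hypothetical helper, not part of the PoC above): it splits the buffered tuples by their source tag (0 or 1) and pairs up the values that fell into the same window.
// Hypothetical sketch: a window-based inner join over one mini-batch of tagged tuples.
def joinBatch(batch: Seq[(Int, Int)]): Seq[(Int, Int)] = {
  val (left, right) = batch.partition { case (tag, _) => tag == 0 }
  for {
    (_, l) <- left
    (_, r) <- right
  } yield (l, r) // pair every left value with every right value from the same window
}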

Nothing is being printed out from a Flink Patterned Stream

I have this code below:
import java.util.Properties
import com.google.gson._
import com.typesafe.config.ConfigFactory
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
object WindowedWordCount {
  val configFactory = ConfigFactory.load()

  def main(args: Array[String]) = {
    val brokers = configFactory.getString("kafka.broker")
    val topicChannel1 = configFactory.getString("kafka.topic1")

    val props = new Properties()
    ...

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val dataStream = env.addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))

    val partitionedInput = dataStream.keyBy(jsonString => {
      val jsonParser = new JsonParser()
      val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
      jsonObject.get("account")
    })

    val priceCheck = Pattern.begin[String]("start").where({ jsonString =>
      val jsonParser = new JsonParser()
      val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
      jsonObject.get("account").toString == "iOS"
    })

    val pattern = CEP.pattern(partitionedInput, priceCheck)

    val newStream = pattern.select(x =>
      x.get("start").map({ str =>
        str
      })
    )

    newStream.print()

    env.execute()
  }
}
For some reason, nothing is printed out by newStream.print() in the code above. I am positive that there is data in Kafka that matches the pattern I defined, but nothing is printed.
Can anyone please help me spot an error in this code?
EDIT:
import java.time.Instant
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {

  override def extractTimestamp(e: String, prevElementTimestamp: Long) = {
    val jsonParser = new JsonParser()
    val context = jsonParser.parse(e).getAsJsonObject.getAsJsonObject("context")
    Instant.parse(context.get("serverTimestamp").toString.replaceAll("\"", "")).toEpochMilli
  }

  override def getCurrentWatermark(): Watermark = {
    new Watermark(System.currentTimeMillis())
  }
}
I saw in the Flink docs that you can just return prevElementTimestamp in the extractTimestamp method (if you are using Kafka 0.10) and new Watermark(System.currentTimeMillis) in the getCurrentWatermark method.
But I don't understand what prevElementTimestamp is, or why one would return new Watermark(System.currentTimeMillis) as the watermark and not something else. Can you please elaborate on why we do this and on how watermarks and event time work together?
You set up your job to work in event time, but you do not provide a timestamp and watermark assigner.
For more on working in event time, see these docs. If you want to use the Kafka-embedded timestamps, these docs may help.
In event time, the CEP library buffers events until a watermark arrives, so that it can properly handle out-of-order events. In your case no watermarks are generated, so the events are buffered indefinitely.
Edit:
For the prevElementTimestamp I think the docs are pretty clear:
There is no need to define a timestamp extractor when using the timestamps from Kafka. The previousElementTimestamp argument of the extractTimestamp() method contains the timestamp carried by the Kafka message.
Since Kafka 0.10.x, Kafka messages can carry an embedded timestamp.
Generating the watermark as new Watermark(System.currentTimeMillis) in this case is not a good idea. You should create the watermark based on your knowledge of the data. For information on how watermarks and event time work together, I could not be clearer than the docs.
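As a concrete illustration (a hedged sketch, not taken from the answer above), an assigner that reuses the Kafka-embedded timestamp and emits a watermark trailing the highest timestamp seen by a fixed bound could look like this:
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Sketch: use the timestamp embedded in the Kafka 0.10 record and emit a
// watermark that lags the highest timestamp seen by a fixed out-of-orderness bound.
class KafkaTimestampAssigner(maxOutOfOrdernessMs: Long) extends AssignerWithPeriodicWatermarks[String] {

  private var maxTimestampSeen: Long = Long.MinValue + maxOutOfOrdernessMs

  override def extractTimestamp(element: String, previousElementTimestamp: Long): Long = {
    // previousElementTimestamp carries the Kafka record timestamp (Kafka 0.10+)
    maxTimestampSeen = math.max(maxTimestampSeen, previousElementTimestamp)
    previousElementTimestamp
  }

  override def getCurrentWatermark(): Watermark =
    new Watermark(maxTimestampSeen - maxOutOfOrdernessMs)
}

// Attach it before keyBy / CEP, e.g.:
// val dataStream = env
//   .addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))
//   .assignTimestampsAndWatermarks(new KafkaTimestampAssigner(5000L))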