Apache Flink broadcast state gets flushed - Scala

I am using the broadcast pattern to connect two streams and read data from one to another. The code looks like this
class Broadcast extends BroadcastProcessFunction[MyObject, (String, Double), MyObject] {

  override def processBroadcastElement(in2: (String, Double),
                                       context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context,
                                       collector: Collector[MyObject]): Unit = {
    context.getBroadcastState(broadcastStateDescriptor).put(in2._1, in2._2)
  }

  override def processElement(obj: MyObject,
                              readOnlyContext: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#ReadOnlyContext,
                              collector: Collector[MyObject]): Unit = {
    val theValue = readOnlyContext.getBroadcastState(broadcastStateDescriptor).get(obj.prop)
    // If I print the content of the state here, sometimes it is empty.
    collector.collect(MyObject(/* new properties go here */))
  }
}
The state descriptor:
val broadcastStateDescriptor: MapStateDescriptor[String, Double] =
  new MapStateDescriptor[String, Double]("name_for_this", classOf[String], classOf[Double])
My execution code looks like this.
val streamA: DataStream[MyObject] = ...
val streamB: DataStream[(String, Double)] = ...
val broadcastedStream = streamB.broadcast(broadcastStateDescriptor)

streamA.connect(broadcastedStream).process(new Broadcast)
The problem is that in the processElement function the state is sometimes empty and sometimes not. The state should always contain data since I am constantly streaming from a file that I know has data. I do not understand why the state is being flushed so that I cannot get the data.
I tried adding some printing in processBroadcastElement before and after putting the data into the state, and the result is the following:
0 - 1
1 - 2
2 - 3
.. all the way to 48 where it resets back to 0
UPDATE:
Something I noticed is that when I decrease the buffer timeout of the stream execution environment, the results are a bit better; when I increase it, the map is almost always empty.
env.setBufferTimeout(1) //better results
env.setBufferTimeout(200) //worse result (default is 100)

Whenever two streams are connected in Flink, you have no control over the timing with which Flink will deliver events from the two streams to your user function. So, for example, if there is an event available to process from streamA, and an event available to process from streamB, either one might be processed next. You cannot expect the broadcastedStream to somehow take precedence over the other stream.
There are various strategies you might employ to cope with this race between the two streams, depending on your requirements. For example, you could use a KeyedBroadcastProcessFunction and use its applyToKeyedState method to iterate over all existing keyed state whenever a new broadcast event arrives.
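A rough sketch of that buffering idea follows (not from the answer itself; it assumes MyObject has a String field prop used as the key, and the BufferingEnrichment / "pending" names are made up). Elements whose key has no broadcast data yet are parked in keyed state and re-examined from processBroadcastElement via applyToKeyedState:

import org.apache.flink.api.common.state.{KeyedStateFunction, MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector

class BufferingEnrichment(broadcastStateDescriptor: MapStateDescriptor[String, Double])
    extends KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject] {

  // keyed state holding one pending element per key
  private val pendingDescriptor = new ValueStateDescriptor[MyObject]("pending", classOf[MyObject])

  override def processElement(
      obj: MyObject,
      ctx: KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject]#ReadOnlyContext,
      out: Collector[MyObject]): Unit = {
    val broadcast = ctx.getBroadcastState(broadcastStateDescriptor)
    if (broadcast.contains(obj.prop)) {
      out.collect(obj) // enrich with broadcast.get(obj.prop) as needed
    } else {
      // the broadcast side has not caught up yet: park the element in keyed state
      getRuntimeContext.getState(pendingDescriptor).update(obj)
    }
  }

  override def processBroadcastElement(
      in2: (String, Double),
      ctx: KeyedBroadcastProcessFunction[String, MyObject, (String, Double), MyObject]#Context,
      out: Collector[MyObject]): Unit = {
    val broadcast = ctx.getBroadcastState(broadcastStateDescriptor)
    broadcast.put(in2._1, in2._2)
    // revisit every key's buffered element now that new broadcast data has arrived
    ctx.applyToKeyedState(pendingDescriptor, new KeyedStateFunction[String, ValueState[MyObject]] {
      override def process(key: String, state: ValueState[MyObject]): Unit = {
        val buffered = state.value()
        if (buffered != null && broadcast.contains(buffered.prop)) {
          out.collect(buffered) // enrich with broadcast.get(buffered.prop) as needed
          state.clear()
        }
      }
    })
  }
}

Wiring would then use the keyed variant of connect, e.g. streamA.keyBy(_.prop).connect(broadcastedStream).process(new BufferingEnrichment(broadcastStateDescriptor)).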

As David mentioned, the job could be restarting. I disabled checkpointing so I could see any exception being thrown instead of Flink silently failing and restarting the job.
It turned out that there was an error while trying to parse the file, so the job kept restarting; the state was therefore empty, and Flink kept consuming the stream over and over again.
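A related trick, if you want to keep checkpointing on: explicitly disable automatic restarts so the underlying exception surfaces instead of the job silently looping. A minimal sketch using the standard restart-strategy API:

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// with no restart strategy, the first parsing failure fails the job and the exception shows up in the logs
env.setRestartStrategy(RestartStrategies.noRestart())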

Related

Akka streaming file lines to an actor router and writing with a single actor: how to handle the backpressure

I want to stream a file from S3 to an actor to be parsed and enriched, and to write the output to another file.
The number of parserActors should be limited, e.g.:
application.conf
akka {
  actor {
    deployment {
      HereClient/router1 {
        router = round-robin-pool
        nr-of-instances = 28
      }
    }
  }
}
code
val writerActor = actorSystem.actorOf(WriterActor.props())
val parser = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
However, the actor that writes to the file should be limited to 1 (a singleton).
I tried doing something like:
val reader: ParquetReader[GenericRecord] =
  AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source.map(record => parser ! record) // tell is fire-and-forget
but I am not sure that the backpressure is handled correctly. Any advice?
Indeed your solution is disregarding backpressure.
The correct way to have a stream interact with an actor while maintaining backpressure is to use the ask pattern support of akka-stream (reference).
From my understanding of your example you have 2 separate actor interaction points:
send records to the parsing actors (via a router)
send parsed records to the singleton write actor
What I would do is something similar to the following:
val writerActor = actorSystem.actorOf(WriterActor.props())
val parserActor = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")

val reader: ParquetReader[GenericRecord] =
  AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)

source
  .ask[ParsedRecord](28)(parserActor)
  .ask[WriteAck](writerActor)
  .runWith(Sink.ignore)
The idea is that you send all the GenericRecord elements to the parserActor, which will reply with a ParsedRecord. Here, as an example, we specify a parallelism of 28 since that's the number of instances you have configured; however, as long as you use a value at least as high as the actual number of actor instances, no actor should suffer from work starvation.
Once the parserActor replies with the parsing result (here represented by the ParsedRecord), we apply the same pattern to interact with the singleton writer actor. Note that here we don't specify the parallelism, as we have a single instance and it doesn't make sense to send more than 1 message at a time (in reality this happens anyway due to buffering at async boundaries, but that is just a built-in optimization). In this case we expect the writer actor to reply with a WriteAck to inform us that the write has been successful and we can send the next element.
Using this method you are maintaining backpressure throughout your whole stream.
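One practical note on the snippet above: the ask-based stages require an implicit akka.util.Timeout in scope. The 5-second value below is just an assumption; pick whatever fits your parsing and writing latency:

import akka.util.Timeout
import scala.concurrent.duration._

// how long each ask stage waits for an actor reply before failing the stream
implicit val askTimeout: Timeout = Timeout(5.seconds)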
I think you should be using one of the "async" operations.
Perhaps this other Q&A gives you some inspiration: Processing an akka stream asynchronously and writing to a file sink

Skip errors in the infinite Stream

I have an infinite fs2.Stream which may encounter errors. I'd like to skip those errors, doing nothing (or probably just logging them), and keep streaming further elements. Example:
import scala.concurrent.duration._
import cats.effect.IO

// An example
val stream = fs2.Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError(new RuntimeException))
In this specific case, I'd like to get an infinite fs2.Stream of Left(new RuntimeException) emitting every second.
There is a Stream.attempt method, but it produces a stream that terminates after the first error is encountered. Is there a way to just skip errors and keep pulling further elements?
IO.raiseError(new RuntimeException).attempt won't work in general, since it would require attempting all effects in all places where the stream pipeline is composed.
There's no way to handle errors in the way you described.
When a stream encounters its first error it is terminated. Please check this gitter question.
You can handle it in two ways:
1. Attempt the effect (but you already mentioned that is not possible in your case); a small sketch of this is shown after the snippet below.
2. Restart the stream after it is terminated:
val stream: Stream[IO, Either[Throwable, Unit]] = Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError(new RuntimeException))
  .handleErrorWith(t => Stream(Left(t)) ++ stream) // put a Left into the stream and restart it

// since the stream will restart infinitely, take only the first 3 values
println(stream.take(3).compile.toList.unsafeRunSync())
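For comparison, here is a minimal sketch of option 1 (assuming fs2 3.x with cats-effect 3, where the global IORuntime import provides the runtime): attempting each individual effect keeps the stream alive and emits a Left every second without ever terminating it.

import scala.concurrent.duration._
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import fs2.Stream

val attempted: Stream[IO, Either[Throwable, Unit]] = Stream
  .awakeEvery[IO](1.second)
  .evalMap(_ => IO.raiseError[Unit](new RuntimeException).attempt) // attempt each effect individually

println(attempted.take(3).compile.toList.unsafeRunSync()) // three Lefts, one per second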

Using streams vs actors for periodic tasks

I'm working with the Akka/Scala/Play stack.
Usually, I use streams to perform certain tasks. For example, I have a stream that wakes up every minute, picks up something from the DB, calls another service to enrich its data using an API, and saves the enrichment to the DB.
Something like this:
class FetcherAndSaveStream @Inject()(fetcherAndSaveGraph: FetcherAndSaveGraph,
                                     dbElementsSource: DbElementsSource)
                                    (implicit val mat: Materializer,
                                     implicit val exec: ExecutionContext) extends LazyLogging {

  def graph[M1, M2](source: Source[BDElement, M1],
                    sink: Sink[BDElement, M2],
                    switch: SharedKillSwitch): RunnableGraph[(M1, M2)] = {
    val fetchAndSaveDataFromExternalService: Flow[BDElement, BDElement, NotUsed] =
      fetcherAndSaveGraph.fetchEndSaveEnrichment
    source.viaMat(switch.flow)(Keep.left)
      .via(fetchAndSaveDataFromExternalService)
      .toMat(sink)(Keep.both)
      .withAttributes(supervisionStrategy(resumingDecider))
  }

  def runGraph(switchSharedKill: SharedKillSwitch): (NotUsed, Future[Done]) = {
    logger.info("FetcherAndSaveStream is now running")
    graph(dbElementsSource.dbElements(), Sink.ignore, switchSharedKill).run()
  }
}
I wonder, is this better than just using an actor that ticks every minute and does something like that? What is the comparison between using actors for this and streams?
I'm still trying to figure out when I should choose which method (streams/actors). Thanks!
You can use both, depending on the requirements for your solution, which are not listed here. The general concern to take into consideration: actors are lower-level than streams, so they require more code and more debugging.
Basically, streams are good for tasks where you have a relatively big amount of data to process with low memory consumption. With streams, you won't need to start a new stream every n seconds; you can set the stream to run along with the application. That can make your code more concise by omitting the scheduler logic.
I will omit your DI and architecture stuff and write the solution as pseudocode:
val yourConsumer: Sink[YourDBRecord, _] = ???

val recordsSource: Source[YourDBRecord, _] = Source
  .repeat(())
  .throttle(1, 1.minute) // poll once per minute, adjust to your needs
  .mapAsync(yourParallelism)(_ => fetchReasonableAmountOfRecordsFromDB) // Future[Seq[YourDBRecord]]
  .mapConcat(identity)

val runnableGraph = recordsSource.to(yourConsumer)
This stream will do your stuff. You can even enhance it with more sophisticated logic to adapt the polling rate to the workload, using a feedback loop in the graph API. Also, you can add whatever error-handling strategy you need to resume the stream in place when it crashes.
Moreover, there are Alpakka connectors for databases capable of doing this; you can see whether the solutions there fit your purpose, or check them for implementation details.
What you get by doing so: backpressure, the ability to work with streams, and clean, concise code with no timed automata managed directly by you.
https://doc.akka.io/docs/akka/current/stream/stream-rate.html
You can also create an actor, but then you have to do by hand all the things Akka Streams does for you, i.e. back-pressure (in case you want to interoperate with streams), scheduling, chunking and memory management (so as not to load 100000 or so entries into memory in one batch), etc.
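For reference, a minimal sketch of the actor-based variant (hypothetical names throughout; startTimerWithFixedDelay assumes Akka 2.6+). Everything beyond the tick, such as batching, retries, and not overwhelming the downstream API, is left for you to implement by hand:

import akka.actor.{Actor, Props, Timers}
import scala.concurrent.duration._

object PollingActor {
  case object Tick
  def props(): Props = Props(new PollingActor)
}

class PollingActor extends Actor with Timers {
  import PollingActor._

  // schedule a Tick every minute for the lifetime of the actor
  timers.startTimerWithFixedDelay("poll", Tick, 1.minute)

  override def receive: Receive = {
    case Tick =>
      // fetch a batch from the DB, enrich it via the external API, save it back;
      // chunking, error handling and back-pressure are all managed manually here
  }
}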

Broadcasting updates on spark jobs

I've seen this question asked here, but the answers essentially focus on Spark Streaming and I can't find a proper solution that works in batch. The idea is to loop through several days, and at each iteration/day update the information about the previous day (which is used for the current iteration). The code looks like the following:
var prevIterDataRdd = // some RDD

days.foreach(folder => {
  val previousData: Map[String, Double] = parseResult(prevIterDataRdd)
  val broadcastMap = sc.broadcast(previousData)

  val (result, previousStatus) = processFolder(folder, broadcastMap)

  // store result
  result.write.csv(outputPath)

  // updating the RDD that enables me to extract previousData to update broadcast
  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus)

  broadcastMap.unpersist(true)
  broadcastMap.destroy()
})
Using broadcastMap.destroy() does not work because it does not let me use broadcastMap again (which I actually don't understand, because it should be totally unrelated: the broadcast is immutable).
How should I run this loop and update the broadcast variable at each iteration?
When using the unpersist method I pass the true argument in order to make it blocking. Is sc.broadcast() also blocking?
Do I really need unpersist() if I'm broadcasting again anyway?
Why can't I use the broadcast again after calling destroy, given that I'm creating a new broadcast variable?
Broadcast variables are immutable but you can create a new broadcast variable.
This new broadcast variable can be used in the next iteration.
All you need to do is change the reference to the newly created broadcast, unpersist the old broadcast from the executors, and destroy it from the driver.
Define the variable at class level, which will allow you to change the reference of the broadcast variable in the driver and use the destroy method.
object Main extends App {
  // defined and initialized at class level to allow reference change
  var previousData: Map[String, Double] = null

  override def main(args: Array[String]): Unit = {
    // your code
  }
}
You were not allowed to use the destroy method on the variable because the reference no longer existed in the driver. Changing the reference to the new broadcast variable resolves the issue.
unpersist only removes data from the executors, so when the variable is re-accessed the driver resends it to the executors.
blocking = true lets the application completely remove the data from the executors before the next access.
sc.broadcast() - there is no official documentation saying that it is blocking. However, as soon as it is called the application starts broadcasting the data to the executors before running the next line of code, so if the data is very large it may slow down your application. Be careful how you use it.
It is good practice to call unpersist before destroy. This helps you get rid of the data completely from both executors and driver.
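Putting the pieces together, here is a rough sketch of how the loop could look. It is only a sketch: the helper signature and the shapes of parseResult and processFolder are assumptions based on the question's code, not a definitive implementation.

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def runIterations(sc: SparkContext,
                  days: Seq[String],
                  outputPath: String,
                  initialRdd: RDD[(String, Double)],
                  parseResult: RDD[(String, Double)] => Map[String, Double],
                  processFolder: (String, Broadcast[Map[String, Double]]) => (DataFrame, RDD[(String, Double)])): Unit = {
  var prevIterDataRdd = initialRdd
  var broadcastMap: Broadcast[Map[String, Double]] = null // driver-level reference that gets rebound

  days.foreach { folder =>
    val oldBroadcast = broadcastMap
    broadcastMap = sc.broadcast(parseResult(prevIterDataRdd)) // a fresh broadcast each iteration

    val (result, previousStatus) = processFolder(folder, broadcastMap)
    result.write.csv(s"$outputPath/$folder") // per-iteration output path (an assumption)

    val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
    prevIterDataRdd = previousStatus.union(passingPrevStatus)

    // release the broadcast from the previous iteration, as the answer suggests
    // (assuming nothing still depends on it through un-materialized lineage)
    if (oldBroadcast != null) {
      oldBroadcast.unpersist(blocking = true)
      oldBroadcast.destroy()
    }
  }
}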

Update a concurrent map inside a stream map on Flink

I have one stream that constantly streams the latest values of some keys:
Stream A: DataStream[(String, Double)]
I have another stream that wants to get the latest value on each process call.
My approach was to introduce a ConcurrentHashMap which would be updated by stream A and read by the second stream.
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

val rates = new ConcurrentHashMap[String, Double]().asScala

val streamA: DataStream[(String, Double)] = ???
streamA.map(keyWithValue => rates(keyWithValue._1) = keyWithValue._2) // rates never gets updated

rates("testKey") = 2 // this works

val streamB: DataStream[String] = ???
streamB.map(str => rates(str) // rates does not contain the values of streamA at this point
  // some other functionality
)
Is it possible to update a concurrent map from a stream? Any other solution for sharing data from one stream with another is also acceptable.
The behaviour you are trying to rely on will not work in a distributed setting; basically, if you have parallelism > 1 it will not work. In your code rates is actually updated, but in a different instance of the parallel operator.
Actually, what you want to do in this case is use broadcast state, which was designed to solve exactly the issue you are facing.
In your specific use case it would look something like this:
val streamA: DataStream[(String, Double)] = ???
val streamABroadcasted = streamA.broadcast(<Your Map State Definition>)
val streamB: DataStream[String] = ???
streamB.connect(streamABroadcasted)
Then you could easily use a BroadcastProcessFunction to implement your logic. More on the broadcast state pattern can be found here.
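To make that concrete, here is a rough sketch (the RatesLookup name and the "rates" descriptor are made up; they stand in for the <Your Map State Definition> placeholder above) of a BroadcastProcessFunction that stores streamA's pairs in broadcast state and looks up the latest value for every element of streamB:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

object RatesLookup {
  val ratesDescriptor =
    new MapStateDescriptor[String, Double]("rates", classOf[String], classOf[Double])
}

class RatesLookup extends BroadcastProcessFunction[String, (String, Double), (String, Double)] {
  import RatesLookup.ratesDescriptor

  override def processBroadcastElement(
      rate: (String, Double),
      ctx: BroadcastProcessFunction[String, (String, Double), (String, Double)]#Context,
      out: Collector[(String, Double)]): Unit =
    ctx.getBroadcastState(ratesDescriptor).put(rate._1, rate._2) // remember the latest value per key

  override def processElement(
      key: String,
      ctx: BroadcastProcessFunction[String, (String, Double), (String, Double)]#ReadOnlyContext,
      out: Collector[(String, Double)]): Unit = {
    val state = ctx.getBroadcastState(ratesDescriptor)
    if (state.contains(key)) out.collect((key, state.get(key))) // skip keys that were not broadcast yet
  }
}

// wiring:
// streamB.connect(streamA.broadcast(RatesLookup.ratesDescriptor)).process(new RatesLookup)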