Refresh cached values in spark streaming without reboot the batch - mongodb

Maybe the question is too simple, at least look like that, but I have the following problem:
A. Execute spark submit in spark streaming process.
ccc.foreachRDD(rdd => {
rdd.repartition(20).foreachPartition(p => {
val repo = getReposX
t foreach (g => {
.................
B. getReposX is a function which one make a query in mongoDB recovering a Map wit a key/value necessary in every executor of the process.
C. Into each g in foreach I manage this map "cached"
The problem or the question is when in the collection of mongo anything change, I don't watch or I don't detect the change, therefore I am managing a Map not updated. My question is: How can I get it? Yes, I know If I reboot the spark-submit and the driver is executed again is OK but otherwise I will never see the update in my Map.
Any ideas or suggestion?
Regards.

Finally I Developed a solution. First I explain the question more in detail because what I really wanted to know is how to implement an object or "cache", which was refreshed every so often, or by some kind of order, without the need to restart the spark streaming process, that is, it would refresh in alive.
In my case this "cache" or refreshed object is an object (Singleton) that connected to a collection of mongoDB to recover a HashMap that was used by each executor and was cached in memory as a good Singleton. The problem with this was that once the spark streaming submit is executed it was cached that object in memory but it was not refreshed unless the process was restarted. Think of a broadcast as a counter mode to refresh when the variable reaches 1000, but these are read only and can not be modified. Think of an counter, but these can only be read by the driver.
Finally my solution is, within the initialization block of the object that loads the mongo collection and the cache, I implement this:
//Initialization Block
{
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
def run() = {
logger.info("Refresh - Inicialization")
initCache
}
}
val f = ex.scheduleAtFixedRate(task, 0, TIME_REFRES, TimeUnit.SECONDS)
}
initCache is nothing more than a function that connects mongo and loads a collection:
var cache = mutable.HashMap[String,Description]()
def initCache():mutable.HashMap[String, Description]={
val serverAddresses = Functions.getMongoServers(SERVER, PORT)
val mongoConnectionFactory = new MongoCollectionFactory(serverAddresses,DATABASE,COLLECTION,USERNAME,PASSWORD)
val collection = mongoConnectionFactory.getMongoCollection()
val docs = collection.find.iterator()
cache.clear()
while (docs.hasNext) {
var doc = docs.next
cache.put(...............
}
cache
}
In this way, once the spark streaming submit has been started, each task will open one more task, which will make every X time (1 or 2 hours in my case) refresh the value of the singleton collection, and always recover that value instantiated:
def getCache():mutable.HashMap[String, Description]={
cache
}

Related

Akka stream hangs when starting more than 15 external processes using ProcessBuilder

I'm building an app that has the following flow:
There is a source of items to process
Each item should be processed by external command (it'll be ffmpeg in the end but for this simple reproducible use case it is just cat to have data be passed through it)
In the end, the output of such external command is saved somewhere (again, for the sake of this example it just saves it to a local text file)
So I'm doing the following operations:
Prepare a source with items
Make an Akka graph that uses Broadcast to fan-out the source items into individual flows
Individual flows uses ProcessBuilder in conjunction with Flow.fromSinkAndSource to build flow out of this external process execution
End the individual flows with a sink that saves the data to a file.
Complete code example:
import akka.actor.ActorSystem
import akka.stream.scaladsl.GraphDSL.Implicits._
import akka.stream.scaladsl._
import akka.stream.ClosedShape
import akka.util.ByteString
import java.io.{BufferedInputStream, BufferedOutputStream}
import java.nio.file.Paths
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
object MyApp extends App {
// When this is changed to something above 15, the graph just stops
val PROCESSES_COUNT = Integer.parseInt(args(0))
println(s"Running with ${PROCESSES_COUNT} processes...")
implicit val system = ActorSystem("MyApp")
implicit val globalContext: ExecutionContext = ExecutionContext.global
def executeCmdOnStream(cmd: String): Flow[ByteString, ByteString, _] = {
val convertProcess = new ProcessBuilder(cmd).start
val pipeIn = new BufferedOutputStream(convertProcess.getOutputStream)
val pipeOut = new BufferedInputStream(convertProcess.getInputStream)
Flow
.fromSinkAndSource(StreamConverters.fromOutputStream(() ⇒ pipeIn), StreamConverters.fromInputStream(() ⇒ pipeOut))
}
val source = Source(1 to 100)
.map(element => {
println(s"--emit: ${element}")
ByteString(element)
})
val sinksList = (1 to PROCESSES_COUNT).map(i => {
Flow[ByteString]
.via(executeCmdOnStream("cat"))
.toMat(FileIO.toPath(Paths.get(s"process-$i.txt")))(Keep.right)
})
val graph = GraphDSL.create(sinksList) { implicit builder => sinks =>
val broadcast = builder.add(Broadcast[ByteString](sinks.size))
source ~> broadcast.in
for (i <- broadcast.outlets.indices) {
broadcast.out(i) ~> sinks(i)
}
ClosedShape
}
Await.result(Future.sequence(RunnableGraph.fromGraph(graph).run()), Duration.Inf)
}
Run this using following command:
sbt "run PROCESSES_COUNT"
i.e.
sbt "run 15"
This all works quite well until I raise the amount of "external processes" (PROCESSES_COUNT in the code). When it's 15 or less, all goes well but when it's 16 or more then the following things happen:
Whole execution just hangs after emitting the first 16 items (this amount of 16 items is Akka's default buffer size AFAIK)
I can see that cat processes are started in the system (all 16 of them)
When I manually kill one of these cat processes in the system, something frees up and processing continues (of course in the result, one file is empty because I killed its processing command)
I checked that this is caused by the external execution for sure (not i.e. limit of Akka Broadcast itself).
I recorded a video showing these two situations (first, 15 items working fine and then 16 items hanging and freed up by killing one process) - link to the video
Both the code and video are in this repo
I'd appreciate any help or suggestions where to look solution for this one.
It is an interesting problem and it looks like that the stream is dead-locking. The increase of threads may be fixing the symptom but not the underlying problem.
The problem is following code
Flow
.fromSinkAndSource(
StreamConverters.fromOutputStream(() => pipeIn),
StreamConverters.fromInputStream(() => pipeOut)
)
Both fromInputStream and fromOutputStream will be using the same default-blocking-io-dispatcher as you correctly noticed. The reason for using a dedicated thread pool is that both perform Java API calls that are blocking the running thread.
Here is a part of a thread stack trace of fromInputStream that shows where blocking is happening.
at java.io.FileInputStream.readBytes(java.base#11.0.13/Native Method)
at java.io.FileInputStream.read(java.base#11.0.13/FileInputStream.java:279)
at java.io.BufferedInputStream.read1(java.base#11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base#11.0.13/BufferedInputStream.java:351)
- locked <merged>(a java.lang.ProcessImpl$ProcessPipeInputStream)
at java.io.BufferedInputStream.read1(java.base#11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base#11.0.13/BufferedInputStream.java:351)
- locked <merged>(a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(java.base#11.0.13/FilterInputStream.java:107)
at akka.stream.impl.io.InputStreamSource$$anon$1.onPull(InputStreamSource.scala:63)
Now, you're running 16 simultaneous Sinks that are connected to a single Source. To support back-pressure, a Source will only produce an element when all Sinks send a pull command.
What happens next is that you have 16 calls to method FileInputStream.readBytes at the same time and they immediately block all threads of default-blocking-io-dispatcher. And there are no threads left for fromOutputStream to write any data from the Source or perform any kind of work. Thus, you have a dead-lock.
The problem can be fixed if you increase the threads in the pool. But this just removes the symptom.
The correct solution is to run fromOutputStream and fromInputStream in two separate thread pools. Here is how you can do it.
Flow
.fromSinkAndSource(
StreamConverters.fromOutputStream(() => pipeIn).async("blocking-1"),
StreamConverters.fromInputStream(() => pipeOut).async("blocking-2")
)
with following config
blocking-1 {
type = "Dispatcher"
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 2
}
}
blocking-2 {
type = "Dispatcher"
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 2
}
}
Because they don't share the pools anymore, both fromOutputStream and fromInputStream can perform their tasks independently.
Also note that I just assigned 2 threads per pool to show that it's not about the thread count but about the pool separation.
I hope this helps to understand akka streams better.
Turns out this was limit on Akka configuration level of blocking IO dispatchers:
So changing that value to something bigger than the amount of streams fixed the issue:
akka.actor.default-blocking-io-dispatcher.thread-pool-executor.fixed-pool-size = 50

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this :
val source =
Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)
val runnableGraph =
source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)
runnableGraph.run().to(Sink.foreach { index =>
println(s"client A reading frame #$index")
}).run()
runnableGraph.run().to(Sink.foreach { index =>
println(s"client B reading frame #$index")
}).run()
I get :
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect my two client being able to see all the source stream's frames.
Did I miss something, or is there any other solution ?
The issue is the combination of Iterator with BroadcastHub. I assume you myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue here for you, is that it does not yet consume the data from the iterator.
The thing with iterator, is that once you consumed its data, you won't get back to the beginning again. Add to that the fact that both graphs run in parallel, it looks like it "divides" the elements between the two. But actually that is completely random. For example, if you add a sleep of 1 second between the Client A and Client B, so the only client that will print will be A.
In order to get that work, you need to create a source that is reversible. For example, Seq, or List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)

Can Flink Map be Called on and When Required (Not activated on input Stream)

I have a map in flink that gets activated once data comes through a stream.
I want to call that map even if no data come through.
I moved the map into a function (infinite function call) but then the flink job never runs. And if I add it within a map it will only get activated if and when data comes through.
The Idea is, have 1 map in an infinte loop, checking some shared variable and another flink stream monitoring a kafka queue, if data comes in it process it changes a shared variable that effects the infinite loop in some way and continues.
How Do I call an infinite loop map and run the flink maps together?
I tried creating a CollectionMap with random data to activate the stream and map to call the infinite loop, but exits almost immediately even though there is a while(true) condition within the map
In the IDE it works, when I push it to Flink.local it exits almost immediately not staying in loop
Stream 1
val data_stream = env.addSource(myConsumer)
.map(x => {process(x)})
Stream 2
val elements = List[String]("Start")
var read = env.fromElements(elements).map(x => ProcessData.infinteLoop())
How Do I call an infinite loop map and run the flink maps together?
You can create a window and a trigger and call the map every x seconds.
You can find the documentation heare: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
Example:
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{CountTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
val data_stream = env.addSource(myConsumer)
.map(x => {process(x)})
val window: DataStream[String] = data_stream
.windowAll(GlobalWindows.create())
.trigger(PurgingTrigger.of(CountTrigger.of[GlobalWindow](5)))
.apply((w: GlobalWindow, x: Iterable[(Integer, String)], y: Collector[String]) => {})

Akka Sink never closes

I am uploading a single file to an SFTP server using Alpakka but once the file is uploaded and I have gotten the success response the Sink stays open, how do I drain it?
I started off with this:
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
​
source
.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
.map(_.wasSuccessful)
But that ends up never leaving the map step.
I tried to add a killswitch, but that seems to have had no effect (neither with shutdown or abort):
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
​
val (killswitch, result) = source
.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
result.map {
killswitch.shutdown()
_.wasSuccessful
}
Am I doing something fundamentally wrong? I just want one result.
EDIT The settings sent in to toPath:
SftpSettings(InetAddress.getByName(host))
.withCredentials(FtpCredentials.create(amexUsername, amexPassword))
.withStrictHostKeyChecking(false)
By asking you to put Await.result(result, Duration.Inf) at the end I wanted to check the theory expressed by A. Gregoris. Thus it's either
your app exits before Future completes or
(if you app does't exit) function in which you do this discards result
If your app doesn't exit you can try using result.onComplete to do necessary work.
I cannot see your whole code but it seems to me that in the snippet you posted that result value is a Future that is not completing before the end of your program execution and that is because the code in the map is not being executed either.

Play 1.2.3 framework - Right way to commit transaction

We have a HTTP end-point that takes a long time to run and can also be called concurrently by users. As part of this request, we update the model inside a synchronized block so that other (possibly concurrent) requests pick up that change.
E.g.
MyModel m = null;
synchronized (lockObject) {
m = MyModel.findById(id);
if (m.status == PENDING) {
m.status = ACTIVE;
} else {
//render a response back to user that the operation is not allowed
}
m.save(); //Is not expected to be called unless we set m.status = ACTIVE
}
//Long running operation continues here. It can involve further changes to instance "m"
The reason for the synchronized block is to ensure that even concurrent requests get to pick up the latest status. However, the underlying JPA does not commit my changes (m.save()) until the request is complete. Since this is a long-running request, I do not want to wait until the request is complete and still want to ensure that other callers are notified of the change in status. I tried to call "m.em().flush(); JPA.em().getTransaction().commit();" after m.save(), but that makes the transaction unavailable for the subsequent action as part of the same request. Can I just given "JPA.em().getTransaction().begin();" and let Play handle the transaction from then on? If not, what is the best way to handle this use-case?
UPDATE:
Based on the response, I modified my code as follows:
MyModel m = null;
synchronized (lockObject) {
m = MyModel.findById(id);
if (m.status == PENDING) {
m.status = ACTIVE;
} else {
//render a response back to user that the operation is not allowed
}
m.save(); //Is not expected to be called unless we set m.status = ACTIVE
}
new MyModelUpdateJob(m.id).now();
And in my job, I have the following line:
doJob() {
MyModel m = MyModel.findById(id);
print m.status; //This still prints the old status as-if m.save() had no effect...
}
What am I missing?
Put your update code in a job an call
new MyModelUpdateJob(id).now().get();
thus the update will be done in another transaction that is commited at the end of the job
ouch, as soon as you add more play servers, you will be in trouble. You may want to play with optimistic locking in your example or and I advise against it pessimistic locking....ick.
HOWEVER, looking at your code, maybe read the article Building on Quicksand. I am not sure you need a synchronized block in that case at all...try to go after being idempotent.
In your case if
1. user 1 and user 2 both call that method and it is pending, then it goes to active(Idempotent)
If user 1 or user 2 wins, well that would be like you had the synchronization block anyways.
I am sure however you have a more complex scenario not shown here, BUT READ that article Building on Quicksand as it really changes the traditional way of thinking and is how google and amazon and very large scale systems operate.
Another option for distributed transactions across play servers is zookeeper which the big large nosql guys use BUT only as a last resort ;) ;)
later,
Dean