Scalaz-stream chunking UP to N - scala

Given a queue like so:
val queue: Queue[Int] = async.boundedQueue[Int](1000)
I want to pull off this queue and stream it into a downstream Sink, in chunks of UP to 100.
queue.dequeue.chunk(100).to(downstreamConsumer)
works sort of, but it will not empty the queue if I have say 101 messages. There will be 1 message left over, unless another 99 are pushed in. I want to take as many as I can off the queue up to 100, as fast as my downstream process can handle.
Is there an existing combinator available?

For this, you may need to monitor the size of the queue while dequeuing from it. Then, once the size reaches 0, you simply stop waiting for more elements and emit what you have. In fact, you can implement elastic batch sizing based on the size of the queue, e.g.:
val q = async.unboundedQueue[String]
val deq:Process[Task,(String,Int)] = q.dequeue zip q.size
val elasticChunk: Process1[(String,Int), Vector[String]] = ???
val downstreamConsumer : Sink[Task,Vector[String]] = ???
deq.pipe(elasticChunk) to downstreamConsumer
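The elasticChunk placeholder above is left open, so here is a rough sketch of the grouping rule it describes, written against a plain Scala Iterator rather than a scalaz-stream Process1. The object name ElasticChunk, the chunks helper, and the max parameter are made up for illustration, and it assumes the Int in each pair is the queue size observed once that element has been taken:

```scala
object ElasticChunk {
  // Groups (element, observedQueueSize) pairs into chunks. A chunk is closed
  // when it holds `max` elements or when the queue was observed to be empty.
  def chunks[A](in: Iterator[(A, Int)], max: Int): Iterator[Vector[A]] =
    new Iterator[Vector[A]] {
      def hasNext: Boolean = in.hasNext

      def next(): Vector[A] = {
        val buf = Vector.newBuilder[A]
        var count = 0
        var done = false
        while (!done && in.hasNext) {
          val (elem, remaining) = in.next()
          buf += elem
          count += 1
          // Stop on a full chunk, or as soon as nothing else is waiting.
          done = count >= max || remaining == 0
        }
        buf.result()
      }
    }
}
```

With max = 2 and observed sizes 2, 1, 0, 1, 0 the five elements come out as chunks of 2, 1 and 2. A real Process1 would carry the same buffer-and-flush state between steps.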

I actually solved this in a different way than I had intended.
The scalaz-stream queue now has a dequeueBatch method that dequeues all values currently in the queue, up to N, or blocks if the queue is empty.
https://github.com/scalaz/scalaz-stream/issues/338
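To make the "up to N, or block" behaviour concrete, here is a small sketch of the same semantics using java.util.concurrent.LinkedBlockingQueue. This is not the scalaz-stream API; UpToNBatch and takeUpTo are hypothetical names used only to illustrate the idea:

```scala
import java.util.concurrent.LinkedBlockingQueue

object UpToNBatch {
  // Blocks until at least one element is available, then takes whatever
  // else is already queued, up to `limit` elements in total.
  def takeUpTo[A](queue: LinkedBlockingQueue[A], limit: Int): Vector[A] = {
    val buf = new java.util.ArrayList[A](limit)
    buf.add(queue.take())          // blocking: waits for the first element
    queue.drainTo(buf, limit - 1)  // non-blocking: grabs what is ready now
    Vector.tabulate(buf.size)(i => buf.get(i))
  }
}
```

With 101 queued messages and a limit of 100, the first call returns 100 elements and the second returns the remaining one straight away, instead of waiting for 99 more.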

Related

With Akka Stream, how to dynamically duplicate a flow?

I'm running a live video streaming server. There's an Array[Byte] video source. Note that I can't get 2 connections to my video source. I want every client connecting to my server to receive this same stream, with a buffer discarding the old frames.
I tried using a BroadcastHub like this :
val source =
  Source.fromIterator(() => myVideoStreamingSource.zipWithIndex)
val runnableGraph =
  source.toMat(BroadcastHub.sink(bufferSize = 2))(Keep.right)
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client A reading frame #$index")
}).run()
runnableGraph.run().to(Sink.foreach { index =>
  println(s"client B reading frame #$index")
}).run()
I get :
client A reading frame #0
client B reading frame #1
client A reading frame #2
client B reading frame #3
We see that the main stream is partitioned between the two clients, whereas I'd expect both clients to see all of the source stream's frames.
Did I miss something, or is there another solution?
The issue is the combination of Iterator with BroadcastHub. I assume your myVideoStreamingSource is something like:
val myVideoStreamingSource = Iterator("A","B","C","D","E")
I'll now quote from BroadcastHub.Sink:
Every new materialization of the [[Sink]] results in a new, independent hub, which materializes to its own [[Source]] for consuming the [[Sink]] of that materialization.
The issue for you is that each call to runnableGraph.run() is a new materialization, so you end up with two independent hubs that both draw on the same iterator.
The thing with an iterator is that once you have consumed its data, you cannot go back to the beginning. Add to that the fact that both graphs run in parallel, and it looks like the elements are "divided" between the two. But the split is actually arbitrary. For example, if you add a sleep of 1 second between client A and client B, the only client that prints will be A.
To make this work, you need a source that can be traversed more than once, for example a Seq or a List. The following will do:
val myVideoStreamingSource = Seq("A","B","C","D","E")
val source = Source.fromIterator(() => myVideoStreamingSource.zipWithIndex.iterator)
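The difference between a shared iterator and a re-iterable collection can be seen without Akka at all. The following sketch uses only the Scala standard library; SharedIteratorDemo and its two helpers are hypothetical names, and the strict alternation is a simplification of what two parallel graphs would do:

```scala
object SharedIteratorDemo {
  // Two consumers taking turns on ONE iterator: each element is seen once,
  // by whichever consumer happens to pull it.
  def splitBetweenTwo[A](shared: Iterator[A]): (List[A], List[A]) = {
    val a = List.newBuilder[A]
    val b = List.newBuilder[A]
    var turnOfA = true
    while (shared.hasNext) {
      if (turnOfA) a += shared.next() else b += shared.next()
      turnOfA = !turnOfA
    }
    (a.result(), b.result())
  }

  // A Seq hands out a fresh iterator on every call, so each consumer
  // sees every element from the start.
  def eachGetsAll[A](frames: Seq[A]): (List[A], List[A]) =
    (frames.iterator.toList, frames.iterator.toList)
}
```

splitBetweenTwo(Iterator("A", "B", "C", "D")) yields (List(A, C), List(B, D)), which mirrors the alternating output in the question, while eachGetsAll gives both consumers the full sequence.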

Offer to queue with some initial delay

I want to offer a string, sent in a load request, to a queue after some initial delay, say 10 seconds.
If subsequent requests are made at a short interval (1 second apart), everything works fine, but if they are made continuously, e.g. from a script, there is no delay.
Here is the sample code.
def load(randomStr: String) = Action { implicit request =>
  Source.single(randomStr)
    .delay(10 seconds, DelayOverflowStrategy.backpressure)
    .map { x =>
      println(x)
      queue.offer(x)
    }
    .runWith(Sink.ignore)
  Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
A delayed source has a default internal buffer capacity of 16 elements. You can change this with addAttributes(inputBuffer).
In your case the buffer can never actually become full, because each stream you create carries just one element.
Who is the caller of the Action? You specify DelayOverflowStrategy.backpressure, but is the caller able to handle backpressure?
On every call of the action you create a stream consisting of one element, so how does backpressure help here? It applies to the stream processing, not to the offering to the queue.

Proper way to stop Akka Streams on condition

I have been successfully using FileIO to stream the contents of a file, compute some transformations for each line and aggregate/reduce the results.
Now I have a pretty specific use case, where I would like to stop the stream when a condition is reached, so that it is not necessary to read the whole file but the process finishes as soon as possible. What is the recommended way to achieve this?
If the stop condition is "on the outside of the stream"
There is an advanced building block called KillSwitch that you can use for this: http://doc.akka.io/japi/akka/2.4.7/akka/stream/KillSwitches.html The stream is shut down once the kill switch is triggered.
It has methods such as abort(reason) and shutdown(); see its API here: http://doc.akka.io/japi/akka/2.4.7/akka/stream/SharedKillSwitch.html
Reference documentation is here: http://doc.akka.io/docs/akka/2.4.8/scala/stream/stream-dynamic.html#kill-switch-scala
Example usage would be:
val countingSrc = Source(Stream.from(1))
  .delay(1.second, DelayOverflowStrategy.backpressure)
val lastSnk = Sink.last[Int]
val (killSwitch, last) = countingSrc
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(lastSnk)(Keep.both)
  .run()
doSomethingElse()
killSwitch.shutdown()
Await.result(last, 1.second) shouldBe 2
If the stop condition is inside the stream
You can use takeWhile to express practically any condition, though sometimes take or limit is enough (e.g. "take 10 lines").
If your logic is more advanced, you could build a special stage for it using statefulMapConcat, which lets you express practically anything, so you can complete the stream "from the inside" whenever you want.
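As a rough illustration of why takeWhile avoids reading the whole input, here is a sketch with a plain Scala Iterator standing in for the file source. It is not Akka Streams code. StopEarly and readUntil are hypothetical names, and the pulled counter exists only to show how much of the upstream was consumed:

```scala
object StopEarly {
  // Returns the lines kept before the stop condition fired, together with
  // the number of lines that were actually pulled from the upstream.
  def readUntil(all: Seq[String], stop: String => Boolean): (List[String], Int) = {
    var pulled = 0
    val upstream = all.iterator.map { line => pulled += 1; line }
    val kept = upstream.takeWhile(line => !stop(line)).toList
    (kept, pulled)
  }
}
```

For the input a, b, STOP, c, d with the stop condition `_ == "STOP"`, only three lines are pulled (the third is needed to evaluate the condition) and the rest of the input is never touched.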

Control flow of messages in Akka actor

I have an actor using Akka which performs an action that takes some time to complete, because it has to download a file from the network.
def receive = {
  case songId: String => {
    Future {
      val futureFile = downloadFile(songId)
      for (file <- futureFile) {
        val fileName = doSomenthingWith(file)
        otherActor ! fileName
      }
    }
  }
}
I would like to control the flow of messages to this actor. If I try to download too many files simultaneously, I hit a network bottleneck. The problem is that I am using a Future inside the actor's receive, so the method exits immediately and the actor is ready to process a new message. If I remove the Future, I will download only one file at a time.
What is the best way to limit the number of messages being processed per unit of time? Is there a better way to design this code?
There is a contrib project for Akka that provides a throttle implementation (http://letitcrash.com/post/28901663062/throttling-messages-in-akka-2). If you put this in front of the actual download actor, you can effectively throttle the rate of messages going into that actor. It's not 100% perfect, in that if downloads take longer than expected you could still end up with more simultaneous downloads than desired, but it's a pretty simple implementation and we use it quite a bit to great effect.
Another option is to use a pool of download actors, remove the Future, and let the actors perform the blocking download themselves, so that each one truly handles only one message at a time. Because you are going to let them block, I would suggest giving them their own Dispatcher (ExecutionContext) so that this blocking does not negatively affect the main Akka Dispatcher. If you do this, the pool size itself represents your maximum number of simultaneous downloads.
Both of these solutions are pretty much "out-of-the-box" solutions that don't require much custom logic to support your use case.
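The "pool size is the concurrency limit" idea can be sketched without actors, using a dedicated fixed-size thread pool as the ExecutionContext for the blocking work. Everything named here (BoundedDownloads, maxParallel, the fake downloadFile, the peak counter) is a hypothetical stand-in for illustration, not Akka API:

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedDownloads {
  val maxParallel = 3

  // Dedicated pool: blocking in here does not starve any other context,
  // and its size caps the number of downloads running at once.
  private val pool = Executors.newFixedThreadPool(maxParallel)
  implicit val downloadEc: ExecutionContext =
    ExecutionContext.fromExecutorService(pool)

  private val inFlight = new AtomicInteger(0)
  val peak = new AtomicInteger(0) // highest concurrency observed

  // Stand-in for a blocking download; sleeps instead of using the network.
  def downloadFile(songId: String): String = {
    val now = inFlight.incrementAndGet()
    peak.updateAndGet(p => math.max(p, now))
    Thread.sleep(50)
    inFlight.decrementAndGet()
    s"$songId.mp3"
  }

  // Submits ten downloads at once; at most maxParallel run concurrently.
  def run(): Seq[String] = {
    val ids = (1 to 10).map(i => s"song-$i")
    val all = Future.traverse(ids)(id => Future(downloadFile(id)))
    try Await.result(all, 10.seconds)
    finally pool.shutdown()
  }
}
```

After run() completes, peak never exceeds maxParallel even though all ten requests were submitted together. A pool of blocking actors on its own dispatcher gives the same bound.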
Edit
I also thought it would be good to mention the Work Pulling Pattern. With this approach you could still use a pool, with a single work distributor in front. Each worker (download actor) performs the download (still using a Future) and only requests new work (pulls) from the work distributor when that Future has fully completed, meaning the download is done.
If you have an upper bound on the number of simultaneous downloads you want to allow, you can 'ack' back to the actor to say that a download has completed, freeing up a slot to download another file:
case object AckFileRequest
class ActorExample(otherActor: ActorRef, maxFileRequests: Int = 1) extends Actor {
  var fileRequests = 0
  def receive = {
    case songId: String if fileRequests < maxFileRequests =>
      fileRequests += 1
      val thisActor = self
      Future {
        val futureFile = downloadFile(songId)
        // not sure if you're returning the downloaded file or a future here,
        // but you can move this to wherever the downloaded file is and ack
        thisActor ! AckFileRequest
        for (file <- futureFile) {
          val fileName = doSomenthingWith(file)
          otherActor ! fileName
        }
      }
    case songId: String =>
      // Do some throttling here
      val thisActor = self
      context.system.scheduler.scheduleOnce(1 second, thisActor, songId)
    case AckFileRequest => fileRequests -= 1
  }
}
In this example, if there are too many file requests in flight, we put this songId request on hold and queue it back up for processing 1 second later. You can obviously change this however you see fit: you could send the message straight back to the actor in a tight loop or apply some other throttling, depending on your use case.
There is a contrib implementation of message Throttling, as described here.
The code is very simple:
// A simple actor that prints whatever it receives
class Printer extends Actor {
  def receive = {
    case x => println(x)
  }
}
val printer = system.actorOf(Props[Printer], "printer")
// The throttler for this example, setting the rate
val throttler = system.actorOf(Props(classOf[TimerBasedThrottler], 3 msgsPer 1.second))
// Set the target
throttler ! SetTarget(Some(printer))
// These three messages will be sent to the printer immediately
throttler ! "1"
throttler ! "2"
throttler ! "3"
// These two will wait at least until 1 second has passed
throttler ! "4"
throttler ! "5"

Python-Multithreading Time Sensitive Task

from random import randrange
from time import sleep
#import thread
from threading import Thread
from Queue import Queue

'''The idea is that there is a Seeker method that would search a location
for task, I have no idea how many task there will be, could be 1 could be 100.
Each task needs to be put into a thread, does its thing and finishes. I have
stripped down a lot of what this is really suppose to do just to focus on the
correct queuing and threading aspect of the program. The locking was just
me experimenting with locking'''

class Runner(Thread):
    current_queue_size = 0

    def __init__(self, queue):
        self.queue = queue
        data = queue.get()
        self.ID = data[0]
        self.timer = data[1]
        #self.lock = data[2]
        Runner.current_queue_size += 1
        Thread.__init__(self)

    def run(self):
        #self.lock.acquire()
        print "running {ID}, will run for: {t} seconds.".format(ID = self.ID,
                                                                t = self.timer)
        print "Queue size: {s}".format(s = Runner.current_queue_size)
        sleep(self.timer)
        Runner.current_queue_size -= 1
        print "{ID} done, terminating, ran for {t}".format(ID = self.ID,
                                                           t = self.timer)
        print "Queue size: {s}".format(s = Runner.current_queue_size)
        #self.lock.release()
        sleep(1)
        self.queue.task_done()

def seeker():
    '''Gathers data that would need to enter its own thread.
    For now it just uses a count and random numbers to assign
    both a task ID and a time for each task'''
    queue = Queue()
    queue_item = {}
    count = 1
    #lock = thread.allocate_lock()
    while (count <= 40):
        random_number = randrange(1,350)
        queue_item[count] = random_number
        print "{count} dict ID {key}: value {val}".format(count = count, key = random_number,
                                                          val = random_number)
        count += 1
    for n in queue_item:
        #queue.put((n,queue_item[n],lock))
        queue.put((n,queue_item[n]))
        '''I assume it is OK to put a tuple in and pull it out later'''
        worker = Runner(queue)
        worker.setDaemon(True)
        worker.start()
    worker.join()
    '''Which one of these is necessary and why? The queue object
    joining or the thread object'''
    #queue.join()

if __name__ == '__main__':
    seeker()
I have put most of my questions in the code itself, but to go over the main points (Python2.7):
I want to make sure I am not creating some massive memory leak for myself later.
I have noticed that when I run it at a count of 40 in putty or VNC on
my linuxbox that I don't always get all of the output, but when
I use IDLE and Aptana on windows, I do.
Yes, I understand that the point of Queue is to stagger out your
threads so you are not flooding your system's memory, but the tasks at
hand are time sensitive, so they need to be processed as soon as they
are detected regardless of how many or how few there are; I have
found that when I have Queue I can clearly dictate when a task has
finished as opposed to letting the garbage collector guess.
I still don't know why I am able to get away with using either the
.join() on the thread or queue object.
Tips, tricks, general help.
Thanks for reading.
If I understand you correctly, you need a thread that monitors something to see whether there are tasks to be done. If a task is found, you want it to run in parallel with the seeker and any other currently running tasks.
If this is the case, then I think you might be going about this the wrong way. Take a look at how the GIL works in Python. I think what you might really want here is multiprocessing.
Take a look at this from the pydocs:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.