I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then come up with sort of a switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways to approach this problem. One is to use awakeEvery. For a concrete example, see here.
To describe the example briefly: suppose we would like to query Twitter every 5 seconds, get the tweets, and perform sentiment analysis. We can compose this pipeline as follows:
val source =
  awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
    Process emitAll _
  }
Note that the queryChannel can be defined as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
  twitterClient.search(query).getTweets.toList
}

val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
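To actually execute the composed pipeline, a rough sketch (assuming scalaz-stream 0.x, where a Process has runLog and a Task has run; the take(20) is just to bound the otherwise infinite polling stream for a quick test) would be:

// Sketch only, not part of the original example: collect a bounded prefix
// of the emitted statuses into memory.
val someTweets: IndexedSeq[Status] = source.take(20).runLog.run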
Let me know if you have any questions. As stated earlier, for the complete example, see this.
I hope it helps!
I'm working with the akka/scala/play stack.
Usually, I use streams to perform certain tasks. For example, I have a stream that wakes every minute, picks up something from the DB, calls another service to enrich its data using an API, and saves the enrichment back to the DB.
Something like this:
class FetcherAndSaveStream @Inject()(fetcherAndSaveGraph: FetcherAndSaveGraph, dbElementsSource: DbElementsSource)
                                    (implicit val mat: Materializer,
                                     implicit val exec: ExecutionContext) extends LazyLogging {

  def graph[M1, M2](source: Source[BDElement, M1],
                    sink: Sink[BDElement, M2],
                    switch: SharedKillSwitch): RunnableGraph[(M1, M2)] = {
    val fetchAndSaveDataFromExternalService: Flow[BDElement, BDElement, NotUsed] =
      fetcherAndSaveGraph.fetchEndSaveEnrichment
    source.viaMat(switch.flow)(Keep.left)
      .via(fetchAndSaveDataFromExternalService)
      .toMat(sink)(Keep.both).withAttributes(supervisionStrategy(resumingDecider))
  }

  def runGraph(switchSharedKill: SharedKillSwitch): (NotUsed, Future[Done]) = {
    logger.info("FetcherAndSaveStream is now running")
    graph(dbElementsSource.dbElements(), Sink.ignore, switchSharedKill).run()
  }
}
I wonder, is this better than just using an actor that ticks every minute and does something like that? How do actors and streams compare for this kind of task?
I'm still trying to figure out when I should choose which approach (streams vs. actors). Thanks!!
You can use both, depending on requirements for your solution that are not listed here. The general concern to take into consideration: actors are lower-level than streams, so they require more code and more debugging.
Basically, streams are good for tasks where you have a relatively large amount of data to process with low memory consumption. With streams, you don't need to start the stream every n seconds; you can have it run for the lifetime of the application. That can make your code more concise by omitting the scheduler logic.
I will omit your DI and architecture details and sketch the solution in pseudocode:
val yourConsumer: Sink[YourDBRecord, _] = ???

val recordsSource: Source[YourDBRecord, _] =
  Source.repeat(())
    .throttle(1, n.seconds)
    .mapAsync(yourParallelism) { _ =>
      fetchReasonableAmountOfRecordsFromDB // returns a Future of a batch of records
    }
    .mapConcat(identity)

val runnableGraph = recordsSource.to(yourConsumer)
This stream will do the job. You can even enhance it with more sophisticated logic that adapts the polling rate to the workload, using a feedback loop in the graph API. You can also add the error-handling strategy you need, so the stream resumes from the point where it crashed.
Moreover, there are Alpakka connectors for databases capable of doing this; you can see whether the solutions there fit your purpose, or check them for implementation details.
What you get by doing this: backpressure, the ability to work with streams, and clean, concise code with no timed automata managed directly by you.
https://doc.akka.io/docs/akka/current/stream/stream-rate.html
You can also create an actor, but then you would have to do by hand all the things Akka Streams does for you, i.e. back-pressure (in case you want to interop with streams), scheduling, chunking and memory management (so you don't load 100,000 or so entries into memory in one batch), etc.
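For comparison, a rough sketch (assuming Akka 2.6+ classic actors; PollDb, fetchBatch and enrichAndSave are hypothetical names, not taken from the code above) of what the hand-rolled actor alternative looks like:

import akka.actor.{Actor, Timers}
import scala.concurrent.duration._

// A timer-driven actor that polls every minute. Chunking, back-pressure and
// failure handling all have to be written manually here.
case object PollDb

class PollingActor extends Actor with Timers {
  timers.startTimerWithFixedDelay("poll", PollDb, 1.minute)

  def receive: Receive = {
    case PollDb =>
      // fetchBatch().foreach(enrichAndSave) -- hypothetical DB read + enrichment call
      ()
  }
}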
I am using Spark 1.6 and came across the function reduceByKeyAndWindow, which I am using to perform a word count over data transmitted over a Kafka topic.
Following is the list of alternatives reduceByKeyAndWindow provides. As we can see, all the alternatives have similar signatures with extra parameters.
But when I just use reduceByKeyAndWindow with my reduce function, or with my reduce function and a duration, it works and doesn't give me any errors, as shown below.
But when I use the alternative with the reduce function, duration and sliding window time, it starts giving me the following error; the same happens with the other alternatives, as shown below.
I am not really sure what is happening here and how I can fix the problem.
Any help is appreciated.
If you comment out the line words.map(x => (x, 1L)) you should be able to use the method reduceByWindow(_ + _, Seconds(2), Seconds(2)) from DStream.
If you transform the words into (word, count) pairs, then you should use the method below.
reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
Please see the documentation for more details on what the reduce function and inverse reduce function are: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
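For illustration, a minimal sketch of how the inverse-reduce variant is typically wired up (the source and names here are assumptions, not taken from the question; note that this overload requires checkpointing, and giving the reduce functions explicit parameter types helps the compiler resolve the overload):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedWordCount")
val ssc  = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("checkpoint-dir") // required by the inverse-reduce variant

// lines would come from your Kafka input DStream
val lines: org.apache.spark.streaming.dstream.DStream[String] = ???
val words = lines.flatMap(_.split(" "))

val counts = words.map(w => (w, 1L)).reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b, // add counts entering the window
  (a: Long, b: Long) => a - b, // subtract counts leaving the window
  Minutes(10),                 // window duration
  Seconds(2)                   // slide duration
)
counts.print()
ssc.start()
ssc.awaitTermination()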
I'm new to RxJava, and I've been trying to combine multiple observables in a round-robin way.
So, imagine you have three observables:
o1: --0---1------2--
o2: -4--56----------
o3: -------8---9----
Combining those in a round-robin way would give you something like:
r : --04---815-9-26-
What would be the best way to approach this?
Since RxJava, RxScala, etc. pretty much share the same API, an answer in any language should be fine. :)
Thanks,
Matija
RxJava doesn't have such an operator by default. The closest thing is merge with well-paced sources, because it uses round-robin to collect values, but this property can't be relied upon. Why do you need this round-robin behavior?
The best bet is to implement this behavior manually. Here is an example without backpressure support.
There is an approach that is very simple to implement and does almost exactly what you want: just zip the three source observables, and then emit the three values from the zipped observable each time a new triplet arrives.
Translated to RxScala:
val o1 = Observable.just(1, 2, 3)
val o2 = Observable.just(10, 20, 30)
val o3 = Observable.just(100, 200, 300)
val roundRobinSource = Observable
  .zip(Observable.just(o1, o2, o3))
  .flatMap(Observable.from[Int])
roundRobinSource.subscribe(println, println)
gives you
1
10
100
2
20
200
3
30
300
Which is precisely what you want.
The problem with this approach is that it will block until a value from each of the three sources arrives, but if you're cool with that, I think this is by far the simplest solution. I'm curious, what is your use case?
Update, Take #2
This is actually a fun question. Here is another take, that will trade one drawback for another.
import rx.lang.scala.{Subject, Observable}
val s1 = Subject[Int]()
val s2 = Subject[Int]()
val s3 = Subject[Int]()
val roundRobinSource3 = s1.publish(po1 ⇒ s2.publish(po2 ⇒ s3.publish(po3 ⇒ {
  def oneRound: Observable[Int] = po1.take(1) ++ po2.take(1) ++ po3.take(1)
  def all: Observable[Int] = oneRound ++ Observable.defer(all)
  all
})))
roundRobinSource3.subscribe(println, println, () ⇒ println("Completed"))
println("s1.onNext(1)")
s1.onNext(1)
println("s2.onNext(10)")
s2.onNext(10)
println("s3.onNext(100)")
s3.onNext(100)
println("s2.onNext(20)")
s2.onNext(20)
println("s1.onNext(2)")
s1.onNext(2)
println("s3.onNext(200)")
s3.onNext(200)
println("s1.onCompleted()")
s1.onCompleted()
println("s2.onCompleted()")
s2.onCompleted()
println("s3.onCompleted()")
s3.onCompleted()
println("Done...")
Gives you
s1.onNext(1)
1
s2.onNext(10)
10
s3.onNext(100)
100
s2.onNext(20)
s1.onNext(2)
2
20
s3.onNext(200)
200
s1.onCompleted()
s2.onCompleted()
s3.onCompleted()
Done...
It doesn't block, and it round-robins, but... it also doesn't complete :( You could make it complete in a stateful manner using takeUntil, a Subject and doOnComplete if you need that, though.
As for the mechanism, it relies on the (to me somewhat mysterious) behavior of publish, which keeps track of things already emitted. I was originally pointed to it by #lopar when he answered my own question Implementing a turnstile-like operator with RxJava.
The behavior of publish is actually such a mystery to me that I have posted a question about it here: https://github.com/ReactiveX/RxJava/issues/2775. If you are curious, you can follow it.
I have a RichPipe with 3 fields: name: String, time: Long and value: Int. I need to get the value for a specific name, time pair. How can I do it? I can't figure it out from scalding documentation, as it is very cryptic and can't find any examples that do this.
Well, a RichPipe is not a key-value store, which is why there is no documentation on using it as a key-value store :) A RichPipe should be thought of as a pipe: you can't get at data in the middle without first going in at one end and traversing the pipe until you find the element you're looking for. Furthermore, this is a little painful in Scalding because you have to write your results to disk (it's built on top of Hadoop) and then read the result from disk in order to use it in your application. So the code will be something like:
myPipe.filter[(String, Long)](('name, 'time))(_ == (specificName, specificTime))
  .write(Tsv("tmp/location"))
Then you'll need some higher-level code to run the job and read the data back into memory to get at the result. Rather than write out all the code to do this (it's pretty straightforward), why don't you give some more context about what your use case is and what you are trying to do? Maybe you can solve your problem within the Map-Reduce programming model.
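If it helps, a rough sketch (assuming the fields are 'name, 'time, 'value and Tsv sources; the job and argument names are made up) of what that higher-level job could look like:

import com.twitter.scalding._

// Filters the pipe down to the single (name, time) pair and writes the value out.
class LookupJob(args: Args) extends Job(args) {
  val specificName = args("name")
  val specificTime = args("time").toLong

  Tsv(args("input"), ('name, 'time, 'value))
    .filter(('name, 'time)) { nt: (String, Long) => nt == (specificName, specificTime) }
    .project('value)
    .write(Tsv(args("output")))
}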
Alternatively, use Spark. You'll have the same problem of having to traverse a distributed dataset, but you don't have the faff of writing to disk and reading back again. Furthermore, you can use a custom partitioner in Spark, which could give you near key-value-store-like behaviour. Anyway, naively, the code would be:
val theValueYouWant =
  myRDD.filter {
    case (`specificName`, `specificTime`, _) => true
    case _ => false
  }
  .toArray.head._3
Passing messages around with actors is great. But I would like to have even easier code.
Examples (Pseudo-code)
val splicedList:List[List[Int]]=biglist.partition(100)
val sum:Int=ActorPool.numberOfActors(5).getAllResults(splicedList,foldLeft(_+_))
where spliceIntoParts turns one big list into 100 small lists,
the numberOfActors part creates a pool which uses 5 actors and receives new jobs after a job is finished,
and getAllResults uses a method on a list, with all the message passing done in the background. There might also be a getFirstResult, which calculates the first result and stops all other threads (like cracking a password).
With the Scala parallel collections that will be included in 2.8.1, you will be able to do things like this:
val spliced = myList.par // obtain a parallel version of your collection (all operations are parallel)
spliced.map(process _) // maps each entry into a corresponding entry using `process`
spliced.find(check _) // searches the collection until it finds an element for which
// `check` returns true, at which point the search stops, and the element is returned
and the code will automatically be run in parallel. Other methods found in the regular collections library are being parallelized as well.
Currently, 2.8.RC2 is very close (this or next week), and 2.8 final will come a few weeks after, I guess. You will be able to try parallel collections if you use the 2.8.1 nightlies.
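For instance, a sketch of the asker's sum with parallel collections as they eventually shipped (the CollectionConverters import is only needed from Scala 2.13 on, where parallel collections live in the separate scala-parallel-collections module):

import scala.collection.parallel.CollectionConverters._

val bigList = (1 to 1000000).toList
// Splitting into chunks, distributing across threads and recombining is handled for you.
val sum = bigList.par.fold(0)(_ + _)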
You can use Scalaz's concurrency features to achieve what you want.
import scalaz._
import Scalaz._
import concurrent.strategy.Executor
import java.util.concurrent.Executors
implicit val s = Executor.strategy[Unit](Executors.newFixedThreadPool(5))
val splicedList = biglist.grouped(100).toList
val sum = splicedList.parMap(_.sum).map(_.sum).get
It would be pretty easy to make this prettier (i.e. write a function mapReduce that does the splitting and folding all in one). Also, parMap over a List is unnecessarily strict; you will want to start folding before the whole list is ready. More like:
val splicedList = biglist.grouped(100).toList
val sum = splicedList.map(promise(_.sum)).toStream.traverse(_.sum).get
You can do this with less overhead than creating actors by using futures:
import scala.actors.Futures._
val nums = (1 to 1000).grouped(100).toList
val parts = nums.map(n => future { n.reduceLeft(_ + _) })
val whole = (0 /: parts)(_ + _())
You have to handle decomposing the problem, writing the "future" block, and recomposing it into a final answer, but it does make executing a bunch of small code blocks in parallel easy to do.
(Note that the _() in the fold left is the apply function of the future, which means, "Give me the answer you were computing in parallel!", and it blocks until the answer is available.)
A parallel collections library would automatically decompose the problem and recompose the answer for you (as with pmap in Clojure); that's not part of the main API yet.
I'm not going to wait for Scala 2.8.1 or 2.9; it would be better to write my own library or use another one. So I did more googling and found this: Akka
http://doc.akkasource.org/actors
which has a Futures object with the methods
awaitAll(futures: List[Future]): Unit
awaitOne(futures: List[Future]): Future
but http://scalablesolutions.se/akka/api/akka-core-0.8.1/
has no documentation at all. That's bad.
But the good part is that Akka's actors are leaner than Scala's native ones.
With all of these libraries (including Scalaz) around, it would be really great if Scala itself could eventually merge them in officially.
At Scala Days 2010, there was a very interesting talk by Aleksandar Prokopec (who is working on Scala at EPFL) about parallel collections. This will probably be in 2.8.1, but you may have to wait a little longer. I'll see if I can get the presentation itself to link here.
The idea is to have a collections framework which parallelizes the processing of the collections by doing exactly what you suggest, but transparently to the user. All you theoretically have to do is change the import from scala.collections to scala.parallel.collections. You obviously still have to check whether what you're doing can actually be parallelized.