I have an fs2.Stream of some elements (probably infinite) and I want to schedule some computation for all elements of the stream concurrently with each other. Here is what I tried:
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import cats.effect.{ContextShift, IO, Timer}

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

val stream = for {
  id <- fs2.Stream.emits(List(1, 2)).covary[IO]
  _  <- fs2.Stream.awakeEvery[IO](1.second)
  _  <- fs2.Stream.eval(IO(println(id)))
} yield ()

stream.compile.drain.unsafeRunSync()
The program output looks like
1
1
1
etc...
which is not what's expected. I'd like to interleave the scheduled computation for all of the elements of the original stream, but not wait until the first stream terminates (which never happens due to the infinite scheduling).
val str = for {
  id <- Stream.emits(List(1, 5, 7)).covary[IO]
  res = timer.sleep(id.second) >> IO(println(id))
} yield res

val stream = str.parEvalMapUnordered(5)(identity)
stream.compile.drain.unsafeRunSync()
or
val stream = Stream.emits(List(1, 5, 7))
  .map { id =>
    Stream.eval(timer.sleep(id.second) >> IO(println(id)))
  }
  .parJoinUnbounded

stream.compile.drain.unsafeRunSync()
According to hints given by #KrzysztofAtłasik and #LuisMiguelMejíaSuárez, here is the solution I just came up with:
val originalStream = fs2.Stream.emits(List(1, 2))

val scheduledComputation = originalStream.covary[IO]
  .map { id =>
    fs2.Stream.awakeEvery[IO](1.second).evalMap(_ => IO.delay(println(id)))
  }
  .fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream))
  .flatten
The solution that #KrzysztofAtłasik proposed in the comments, interleaving id <- fs2.Stream.emits(List(1, 2)).covary[IO] with _ <- fs2.Stream.awakeEvery[IO](1.second), also works, but it does not allow each element to be scheduled with its own period.
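For reference, that interleaving presumably looks something like the sketch below, with the ticking stream as the outer one, so that every tick re-emits all of the ids:

// Reconstruction of the interleaving suggested in the comments:
// the timer is the outer stream, so each tick prints every id once.
val interleaved = for {
  _  <- fs2.Stream.awakeEvery[IO](1.second)
  id <- fs2.Stream.emits(List(1, 2)).covary[IO]
  _  <- fs2.Stream.eval(IO(println(id)))
} yield ()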
To schedule each element concurrently with its own period of elementValue seconds, it is possible to do the following:
val scheduleEachElementIndividually = originalStream.covary[IO]
  .map { id =>
    // each element gets its own tick interval of `id` seconds
    fs2.Stream.awakeEvery[IO](id.second).evalMap(_ => IO.delay(println(id)))
  }
  .fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream))
  .flatten
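As with the original snippet, the resulting stream can be run with:

scheduleEachElementIndividually.compile.drain.unsafeRunSync()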
Related
Problem:
I want to repeatedly take batches from an fs2.Stream provided by some third-party library, and therefore abstract clients away from the fs2.Stream itself and simply give them F[List[Int]] batches as soon as they are ready.
Attempts:
I tried to use fs2.Stream::take and ran some examples.
I.
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

val r = for {
  queue <- fs2.concurrent.Queue.unbounded[IO, Int]
  stream = queue.dequeue
  _ <- fs2.Stream.range(0, 1000).covaryAll[IO, Int].evalTap(queue.enqueue1).compile.drain
  _ <- stream.take(10).compile.toList.flatTap(lst => IO(println(lst))).iterateWhile(_.nonEmpty)
} yield ()

r.unsafeRunSync()
It prints the very first batch List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and then hangs. I expected all of the batches from 0 to 1000 to be printed.
Keeping things a bit simpler, here is
II.
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)

val r = for {
  queue <- fs2.concurrent.Queue.unbounded[IO, Int]
  stream = queue.dequeue
  _ <- fs2.Stream.range(0, 1000).covaryAll[IO, Int].evalTap(queue.enqueue1).compile.drain
  _ <- stream.take(10).compile.toList.flatTap(lst => IO(println(lst)))
  _ <- stream.take(20).compile.toList.flatTap(lst => IO(println(lst)))
} yield ()

r.unsafeRunSync()
The behavior is exactly the same as in I.: it prints List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and then hangs.
Question:
Given an fs2.Stream[IO, Int], how can I provide an effect IO[List[Int]] which, when evaluated, iterates through consecutive batches provided by the stream?
Well, you cannot have an IO[List[X]] that represents multiple batches; such an IO would be a single batch.
The best you can do is something like this:
def processByBatches(process: List[Int] => IO[Unit]): IO[Unit]
That is, your users will give you an operation to execute for each batch, and you will give them back an IO that blocks the current fiber until the whole stream has been consumed using that function.
And the simplest way to implement such a function would be:
def processByBatches(process: List[Int] => IO[Unit]): IO[Unit] =
  getStreamFromThirdParty
    .chunkN(n = ChunkSize)                    // group elements into batches of ChunkSize
    .evalMap(chunk => process(chunk.toList))  // run the user's action for each batch
    .compile
    .drain
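A hypothetical call site could then be as simple as this (the printing function is just for illustration):

// Hypothetical usage: print each batch as soon as it is ready.
val printAllBatches: IO[Unit] =
  processByBatches(batch => IO(println(s"batch: $batch")))

printAllBatches.unsafeRunSync()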
I have the following pseudo code.
Invoke the fetch, fetchRecordDetail, upload and notifyUploaded functions in sequence. Each function returns a Future, but the first one returns an Option[T]; for the subsequent calls (fetchRecordDetail, upload and notifyUploaded) I need to carry forward only the Some[T] values and ignore the Nones.
Unfortunately, I was only able to achieve the following output by using too many Await.ready calls.
Expected output
notified List(UploadResult(a_detail_uploaded), UploadResult(c_detail_uploaded))
Code
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

// data types implied by the usage below (field names assumed)
case class Record(id: String)
case class RecordDetail(id: String)
case class UploadResult(id: String)

def fetch(id: String): Future[Option[Record]] = Future {
  Thread sleep 100
  if (id != "b" && id != "d") {
    Some(Record(id))
  } else None
}

def fetchRecordDetail(record: Record): Future[RecordDetail] = Future {
  Thread sleep 100
  RecordDetail(record.id + "_detail")
}

def upload(recordDetail: RecordDetail): Future[UploadResult] = Future {
  Thread sleep 100
  UploadResult(recordDetail.id + "_uploaded")
}

def notifyUploaded(results: Seq[UploadResult]): Future[Unit] =
  Future { println("notified " + results) }
val result: Future[Unit] = //Final call to 'notifyUploaded' goes here
Await.ready(result, Duration.Inf)
Can someone help improve this code by avoiding the Await.ready calls?
val ids: Seq[String] = Seq("a", "b", "c", "d")
def filterSome(s:String) = fetch(s) map ((s, _)) collect { case (s, Some(v)) => v }
val validData = ids map filterSome
Await.ready(Future.sequence(validData), Duration.Inf)
val records = validData.map(_.value.get.toOption)
val recordDetails = records.flatten map fetchRecordDetail
Await.ready(Future.sequence(recordDetails), Duration.Inf)
val uploadResult = recordDetails.map(_.value.get.toOption).flatten map upload
Await.ready(Future.sequence(uploadResult), Duration.Inf)
val seqUploadResult = uploadResult.map(_.value.get.toOption)
val result: Future[Unit] = notifyUploaded(seqUploadResult.flatten)
Await.ready(result, Duration.Inf)
This appears to work.
Future.sequence(ids.map(fetch))                                // fetch Recs
  .map(_.flatten)                                              // remove None
  .flatMap(rs => Future.sequence(rs.map(fetchRecordDetail)))   // fetch Details
  .flatMap(ds => Future.sequence(ds.map(upload)))              // upload
  .flatMap(notifyUploaded)                                     // notify
It returns a Future[Unit] which you could Await on, but I don't know why you would want to.
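The same pipeline can also be written with Future.traverse instead of map plus Future.sequence; a sketch, assuming the same fetch/fetchRecordDetail/upload/notifyUploaded signatures and an implicit ExecutionContext in scope:

// Equivalent sketch using Future.traverse (same signatures assumed as above).
val notified: Future[Unit] =
  Future.traverse(ids)(fetch)                                        // Future[Seq[Option[Record]]]
    .map(_.flatten)                                                  // drop the Nones
    .flatMap(records => Future.traverse(records)(fetchRecordDetail))
    .flatMap(details => Future.traverse(details)(upload))
    .flatMap(notifyUploaded)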
Is something like this what you want?
for {
  f1 <- validData
  f2 <- recordDetails
  f3 <- seqUploadResult
} yield f3
onComplete(notifyUploaded(seqUploadResult.flatten))
I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome + "/ratings.csv").map { line =>
  val fields = line.split(",")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way? I am trying this, but it fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here. http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code
def split[T: ClassManifest](data: RDD[T], p: Double,
                            seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  val partitionSeeds = data.partitions.map(partition => rand.nextLong)
  val temp = data.mapPartitionsWithIndex((index, iter) => {
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble))
  })
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
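With the ratings RDD from the question, a call site would then look something like this:

// Hypothetical usage of the split helper above.
val (train, test) = split(ratings, p = 0.8, seed = 1L)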
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
  (rdd, prob) => rdd.filter { case (k, v) => prob(k) }.map { case (k, v) => v }

val ranRating = ratings.map(x => (Random.nextDouble(), x)).cache

val train = probabilisticSegment(ranRating, _ < 0.8)
val test  = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can be performed from that point on without re-running the complete lineage.
(*) Note the use of val to define a function instead of def. vals are serializer-friendly
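To make that remark concrete, here is a small sketch (the Pipeline class is hypothetical): a method used as a function closes over its enclosing instance, so Spark would have to serialize the whole object, while a val is already a self-contained function value:

// Hypothetical illustration of the val-vs-def remark above.
class Pipeline(sc: org.apache.spark.SparkContext) {   // SparkContext is not serializable
  def addOneDef(x: Int): Int = x + 1                  // map(addOneDef) captures `this`,
                                                      // which may fail with "Task not serializable"
  val addOneVal: Int => Int = _ + 1                   // map(addOneVal) ships only the function value

  def run(): Array[Int] =
    sc.parallelize(1 to 10).map(addOneVal).collect()  // safe: only the val function is serialized
}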
I have two external processes to be run sequentially:
val antProc = Process(Seq(antBatch","everythingNoJunit"), new File(scriptDir))
val bossProc = Process(Seq(bossBatch,"-DcreateConnectionPools=true"))
val f: Future[Process] = Future {
  println("Run ant...")
  antProc.run
}

f onSuccess {
  case proc =>
    println("Run boss...")
    bossProc.run
}
The result is:
Run ant...
Process finished with exit code 0
How do I run antProc until completion, then bossProc?
The following method seems to achieve the purpose. However, it's not a Future approach.
antProc.!<
bossProc.!<
You should be able to do something like this:
val antProc = Process(Seq(antBatch, "everythingNoJunit"), new File(scriptDir))
val bossProc = Process(Seq(bossBatch, "-DcreateConnectionPools=true"))

val antFut: Future[Process] = Future {
  antProc.run
}

val bossFut: Future[Process] = Future {
  bossProc.run
}

val aggFut = for {
  antRes <- antFut
  bossRes <- bossFut
} yield (antRes, bossRes)

aggFut onComplete {
  case tr => println(tr)
}
The result of the aggFut will be a tuple consisting of the ant result and the boss result.
Also, be sure the VM running this does not exit before the async callbacks can occur. If your execution context contains daemon threads then it might exit before completion.
Now if you want bossProc to run after antProc, the code would look like this
val antProc = Process(Seq(antBatch, "everythingNoJunit"), new File(scriptDir))
val bossProc = Process(Seq(bossBatch, "-DcreateConnectionPools=true"))

val antFut: Future[Process] = Future {
  antProc.run
}

val aggFut = for {
  antRes <- antFut
  bossRes <- Future { bossProc.run }
} yield (antRes, bossRes)

aggFut onComplete {
  case tr => println(tr)
}
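Note that run returns as soon as the process has been started. If each Future should complete only when its process has actually exited, a variation is to block on the exit code inside the Future, for example with !, which runs the process and waits for it; a sketch:

// `!` starts the process and blocks until it exits, so each Future completes
// only once its process is done, and bossProc is not started until antProc's
// exit code is available.
val sequencedFut: Future[(Int, Int)] = for {
  antExit  <- Future { antProc.! }
  bossExit <- Future { bossProc.! }
} yield (antExit, bossExit)

sequencedFut onComplete {
  case tr => println(tr)
}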
I am using Futures in Scala on the Play framework, but I have difficulty getting part of the final result when merging multiple futures and one of them times out. Here is what my code does: it has two futures that query two providers respectively, then a for/yield statement merges the results, and then it awaits the merged result with a timeout value. It works fine when both providers reply to the query on time. However, if only one provider replies on time, I know the Await will time out, but in that case I still need to retrieve the data returned from the provider that replied on time. How do I do that?
val pool = Executors.newCachedThreadPool()
implicit val ec = ExecutionContext.fromExecutorService(pool)

var future1 = Future(QueryProvider(provider1, request1))
var future2 = Future(QueryProvider(provider2, request2))

val future = for {
  result1 <- future1
  result2 <- future2
} yield Merge(result1, result2)

val duration = Duration(60000, MILLISECONDS)

try {
  result = Await.result(future, duration).asInstanceOf[String]
} catch {
  case _: Exception => println("time out...")
  // Here, how can I retrieve provider1's result if only provider2 times out???
}
You could use after from akka instead of blocking Await.result:
val timeout =
  akka.pattern.after(FiniteDuration(60000, MILLISECONDS), using = system.scheduler) {
    Future.successful(None)
  }

val resFuture = for {
  result1 <- Future firstCompletedOf Seq(future1.map(Some(_)), timeout)
  result2 <- Future firstCompletedOf Seq(future2.map(Some(_)), timeout)
} yield (result1, result2)

val result = resFuture map {
  case (Some(r1), Some(r2)) => Merge(r1, r2)
  case (Some(r1), None)     => PartialResult1(r1)
  case (None, Some(r2))     => PartialResult2(r2)
  case _                    => EmptyResult
}
In this case resFuture will be completed within 60 seconds and you can process a partial result. Also, you don't need Await in Play - you can use Async.
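In newer Play versions that is spelled Action.async; a rough sketch, assuming a standard controller and that the merged or partial result can be rendered as a String:

// Sketch only: return the (possibly partial) result without blocking on Await.
def merged = Action.async {
  result.map(r => Ok(r.toString))
}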
In case you have many equivalent futures of the same type, you could apply the same trick like this:
val futures: Seq[Future[Int]] = ???

val futureWithTimeout =
  futures.map { f => Future firstCompletedOf Seq(f.map(Some(_)), timeout) }

val result: Future[Seq[Option[Int]]] = Future.sequence(futureWithTimeout)

// In case you need to know the index of each completed (or not completed) future
val indexedResults = result.map(_.zipWithIndex)

// In case you just need only the completed results
val completedResults: Future[Seq[Int]] = result.map(_.flatten)
Types here are only for illustration.