Akka stream - List to mapAsync of individual elements - scala

My stream has a Flow whose outputs are List[Any] objects. I want to have a mapAsync followed by some other stages, each of which processes an individual element instead of the list. How can I do that?
Effectively, I want to connect the output of
Flow[Any].map { msg =>
  someListDerivedFrom(msg)
}
so that it is consumed by
Flow[Any].mapAsyncUnordered(4) { listElement =>
  actorRef ? listElement
}.someOtherStuff
How do I do this?

I think the combinator you are looking for is mapConcat. This combinator will take an input argument and return something that is an Iterable. A simple example would be as follows:
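import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
implicit val system = ActorSystem()
implicit val mater = ActorMaterializer()
val source = Source(List(List(1, 2, 3), List(4, 5, 6)))
val sink = Sink.foreach[Int](println)
val graph =
  source
    .mapConcat(identity) // flattens each List into its elements
    .to(sink)
graph.run()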
Here, my Source emits List elements, and my Sink accepts the element type of those Lists. I can't connect them directly because the types differ. If I apply mapConcat between them, they can be connected, because that combinator flattens each List and sends its individual elements (Int) downstream. Because the input element to mapConcat is already an Iterable, you only need to use the identity function in the body of mapConcat to make things work.
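To tie this back to the question, here is a minimal sketch of mapConcat feeding mapAsyncUnordered. It assumes Akka 2.6 or later, where an implicit ActorSystem is enough to materialize a stream (on older versions an ActorMaterializer would be needed, as above). The lookup function is a hypothetical stand-in for actorRef ? listElement, and the object name is made up for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object FlattenThenAsync {
  // Hypothetical stand-in for `actorRef ? listElement`.
  def lookup(n: Int)(implicit ec: ExecutionContext): Future[Int] = Future(n * 10)

  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("flatten-demo")
    import system.dispatcher
    try {
      val results = Source(List(List(1, 2, 3), List(4, 5, 6)))
        .mapConcat(identity)                      // List[Int] => individual Int elements
        .mapAsyncUnordered(4)(n => lookup(n))     // up to 4 lookups in flight at once
        .runWith(Sink.seq)
      Await.result(results, 5.seconds)            // blocking only for the sake of the demo
    } finally system.terminate()
  }
}
```

Because mapAsyncUnordered emits results as they complete, the order of the returned elements is not guaranteed; use mapAsync if upstream order matters.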

Related

How to generate the materialized value from the elements in Source or Flow?

Suppose there is a source of type Source[Int, NotUsed]. How can this be turned into a Source[Int, T] where the materialized value T is computed based on the elements of the source?
Example: I would like to sum the elements of a stream. How can dumbFlow be implemented so that result is 6 instead of 42?
val dumbFlow = ??? //Flow[Int].mapMaterializedValue(_ => 42)
//code below cannot be changed
val source = Source(List(1, 2, 3)).viaMat(dumbFlow)(Keep.right)
val result = source.toMat(Sink.ignore)(Keep.left).run()
//result: Int = 42
I know how to achieve the same result using Sink.fold or Sink.head, but I need the materialization logic in the Source; I cannot change .to(Sink.ignore).
Strictly speaking, the materialized value is always computed (including any mapMaterializedValue/toMat/viaMat etc.) before a single element goes through the stream and thus cannot depend on the elements of the stream.
If the materialized value happens to be a Future (in the Scala API), the future can be constructed (though not yet completed) and the stream can complete the Future based on the elements. In general, the Future materialized values are from sinks (e.g., as you note Sink.fold/Sink.head).
The alsoTo operator on a Source or Flow lets you embed a Sink to the side of a Source/Flow. It has an alsoToMat companion which lets you combine the Sink's materialized value with the Source/Flow's.
So one could have
val summingSink = Sink.fold[Int, Int](0)(_ + _)
val dumbFlow: Flow[Int, Int, Future[Int]] = Flow[Int].alsoToMat(summingSink)(Keep.right)
val result: Future[Int] = source.toMat(Sink.ignore)(Keep.left).run()
result.foreach(println _)
// will eventually print 6

Mapping RDD to function does not invoke the function

I am using Scala Spark API. In my code, I have an RDD of the following structure:
Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]
I need to process (perform validations and modify values) the second element of the RDD. I am using map function to do that:
myRDD.map(line => mappingFunction(line))
Unfortunately, the mappingFunction is not invoked. This is the code of the mapping function:
def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
  println("Inside mappingFunction")
  return line
}
When my program ends, there are no printed messages in the stdout.
In order to investigate the problem, I implemented a code snippet that worked:
val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))
And the following mapping function was invoked:
def callInt(i: Int) = {
  println("Inside callInt")
}
Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.
x is a List, so there is no laziness involved; that is why your function is invoked even though you never call an action.
myRDD is an RDD, and RDDs are lazy: transformations (map, flatMap, filter) are not actually executed until their result is needed.
That means your map function does not run until you perform an action. An action is an operation that triggers the preceding operations (called transformations) to be executed.
Some examples of actions are collect and count.
If you do this:
myRDD.map(line => mappingFunction(line)).count()
you'll see your prints. There is nothing wrong with your code; you just need to take the lazy nature of RDDs into account.
There is a good answer about this topic here.
Also you can find more info and a whole list of transformations and actions here
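The deferred evaluation described above can be observed without a Spark cluster. The sketch below uses a plain Scala Iterator as a stand-in for an RDD; this is only an analogy (an Iterator is not an RDD), but its map is also lazy and only runs once something consumes the result. The object and counter names are made up for illustration.

```scala
object LazyMapDemo {
  // Returns (calls before the terminal operation, calls after it).
  def run(): (Int, Int) = {
    var calls = 0
    def mappingFunction(i: Int): Int = { calls += 1; i }

    // Lazy, comparable to an RDD transformation: nothing is executed yet.
    val mapped = Iterator(1, 2, 3).map(i => mappingFunction(i))
    val before = calls

    // Consuming the iterator forces evaluation, comparable to calling count().
    mapped.foreach(_ => ())
    (before, calls)
  }
}
```

Calling LazyMapDemo.run() shows that the mapping function is not invoked at all until the iterator is consumed.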

Understanding NotUsed and Done

I am having a hard time understanding the purpose and significance of NotUsed and Done in Akka Streams.
Let us see the following 2 simple examples:
Using NotUsed:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()
val myStream: RunnableGraph[NotUsed] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .to(Sink.foreach(println))
val runResult: NotUsed = myStream.run()
Using Done:
implicit val system = ActorSystem("akka-streams")
implicit val materializer = ActorMaterializer()
val myStream: RunnableGraph[Future[Done]] =
  Source.single("stackoverflow")
    .map(s => s.toUpperCase())
    .toMat(Sink.foreach(println))(Keep.right)
val runResult: Future[Done] = myStream.run()
When I run these examples, I get the same output in both cases:
STACKOVERFLOW //output
So what exactly are NotUsed and Done? What are the differences, and when should I prefer one over the other?
First of all, the choice you are making is between NotUsed and Future[Done] (not just Done).
Now, you are essentially deciding the materialized value of your graph, by using the different combinators (to and toMat with Keep.right).
The materialized value is a way to interact with your stream while it's running. This choice does not affect the data processed by your stream, and for this reason you see the same output in both cases. The same element (the string "stackoverflow") goes through both streams.
The choice depends on what your main program is supposed to do after running the stream:
If you are not interested in interacting with it, NotUsed is the right choice. It is just a dummy object that conveys that no interaction with the stream is possible or needed.
If you need to listen for the completion of the stream to perform some other action, you need to expose the Future[Done]. You can then attach a callback to it using, for example, onComplete or map.
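As an illustration of the second case, here is a minimal sketch that uses the Future[Done] to shut the ActorSystem down once the stream has finished. It assumes Akka 2.6 or later (so the implicit ActorSystem is enough to run the stream); the object name is made up, and the blocking Await is there only so the example returns a value.

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object CompletionDemo {
  def run(): Done = {
    implicit val system: ActorSystem = ActorSystem("akka-streams")
    import system.dispatcher // ExecutionContext for the callback

    val completion: Future[Done] =
      Source.single("stackoverflow")
        .map(_.toUpperCase)
        .toMat(Sink.foreach(println))(Keep.right) // keep the sink's Future[Done]
        .run()

    // React to the end of the stream (success or failure) by terminating the system.
    completion.onComplete(_ => system.terminate())
    Await.result(completion, 5.seconds)
  }
}
```

With NotUsed there would be nothing to attach such a callback to, which is the practical difference between the two materialized values.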

Reading multiple Files Asynchronously using Akka Streams, Scala

I want to read many .CSV files inside a folder asynchronously and return an Iterable of a custom case class.
Can I achieve this with Akka Streams, and how?
I have tried to somehow balance the job according to the documentation, but it's a little hard to manage.
Or is it good practice to use Actors instead (a parent Actor with children, where every child reads a file and returns an Iterable to the parent, and the parent then combines all the Iterables)?
Mostly the same as @paul's answer, but with small improvements:
def files = new java.io.File("").listFiles().map(_.getAbsolutePath).to[scala.collection.immutable.Iterable]
Source(files).flatMapConcat(filename => // you could use flatMapMerge if you don't care about line ordering
  FileIO.fromPath(Paths.get(filename))
    .via(Framing.delimiter(ByteString("\n"), 256, allowTruncation = true).map(_.utf8String))
).map { csvLine =>
  // parse csv here
  println(csvLine)
}
First of all, you need to learn how Akka Streams works, with Source, Flow and Sink. Then you can start learning the operators.
To run multiple actions in parallel you can use the mapAsync operator, in which you specify the level of parallelism.
/**
 * Using the mapAsync operator, we pass a function that returns a Future; the number of futures
 * running in parallel is determined by the argument passed to the operator.
 */
@Test def readAsync(): Unit = {
  // Assumption: an ActorSystem named `system` is in scope.
  // One shared ExecutionContext; creating an ActorSystem per element would leak resources.
  implicit val ec: ExecutionContext = system.dispatcher
  Source(0 to 10) // --> your files
    .mapAsync(5) { value => // --> runs up to 5 reads in parallel
      Future {
        // Read your file here
        Thread.sleep(500)
        println(s"Process in Thread:${Thread.currentThread().getName}")
        value
      }
    }
    .runWith(Sink.foreach(value => println(s"Item emitted:$value in Thread:${Thread.currentThread().getName}")))
}
You can learn more about akka and akka stream here https://github.com/politrons/Akka

Losing types on sequencing Futures

I'm trying to do this:
case class ConversationData(members: Seq[ConversationMemberModel], messages: Seq[MessageModel])
val membersFuture: Future[Seq[ConversationMemberModel]] = ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] = MessagePersistence.searchByConversationId(conversationId)
Future.sequence(List(membersFuture, messagesFuture)).map { result =>
  // some magic here
  self ! ConversationData(members, messages)
}
But when I sequence the two futures, the compiler loses the types. The compiler says that the type of result is List[Seq[Product with Serializable]]. Initially I expected to be able to do something like
Future.sequence(List(membersFuture, messagesFuture)).map{ members, messages => ...
But it looks like sequencing futures doesn't work like this... I also tried using a collect inside the map, but I got similar errors.
Thanks for your help
When using Future.sequence, the assumption is that the underlying types produced by the multiple Futures are the same (or extend from the same parent type). With sequence, you basically invert a Seq of Futures for a particular type to a single Future for a Seq of that particular type. A concrete example is probably more illustrative of that point:
val f1:Future[Foo] = ...
val f2:Future[Foo] = ...
val f3:Future[Foo] = ...
val futures:List[Future[Foo]] = List(f1, f2, f3)
val aggregateFuture:Future[List[Foo]] = Future.sequence(futures)
So you can see that I went from a List of Future[Foo] to a single Future wrapping a List[Foo]. You use this when you already have a bunch of Futures for results of the same type (or base type) and you want to aggregate all of the results for the next processing step. The sequence method produces a new Future that won't be completed until all of the aggregated Futures are done, and it will then contain the aggregated results of all of those Futures. This works especially well when you have an indeterminate or variable number of Futures to process.
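A runnable version of the idea above, as a minimal sketch using only the standard library. Int and String stand in for Foo and for the two model types (these stand-ins, and the object name, are assumptions for illustration). The second method shows why the element type degrades when the Futures carry different types: the list can only be typed with their common supertype.

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object SequenceDemo {
  // Same element type: the result keeps the precise type List[Int].
  def sameType(): List[Int] = {
    val futures: List[Future[Int]] = List(Future(1), Future(2), Future(3))
    val aggregate: Future[List[Int]] = Future.sequence(futures)
    Await.result(aggregate, 5.seconds)
  }

  // Different element types: the only common type is Any, so precision is lost.
  def mixedTypes(): List[Any] = {
    val futures: List[Future[Any]] = List(Future(1), Future("a"))
    val aggregate: Future[List[Any]] = Future.sequence(futures)
    Await.result(aggregate, 5.seconds)
  }
}
```

Future.sequence preserves the order of the input list, regardless of which Future completes first.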
For your case, it seems that you have a fixed number of Futures to handle. As @Zoltan suggested, a simple for-comprehension is probably a better fit here because the number of Futures is known. So solving your problem like this:
for {
  members <- membersFuture
  messages <- messagesFuture
} {
  self ! ConversationData(members, messages)
}
is probably the best way to go for this specific example.
What are you trying to achieve with the sequence call? I'd just use a for-comprehension instead:
val membersFuture: Future[Seq[ConversationMemberModel]] = ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] = MessagePersistence.searchByConversationId(conversationId)
for {
  members <- membersFuture
  messages <- messagesFuture
} yield (self ! ConversationData(members, messages))
Note that it is important that you declare the two futures outside the for-comprehension, because otherwise your messagesFuture wouldn't be submitted until the membersFuture is completed.
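The point about declaring the futures outside the for-comprehension can be checked with a small sketch that uses only the standard library. A Promise stands in for a slow first query, and a flag records whether the second Future has started; all names here are made up for illustration.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future, Promise}

object ForComprehensionOrder {
  // Returns (had the second Future started before the first completed?, final sum).
  def run(): (Boolean, Int) = {
    val secondStarted = new AtomicBoolean(false)
    val first = Promise[Int]() // completed manually below, like a slow query

    // The second Future is created INSIDE the comprehension, so it is only
    // constructed in the flatMap callback, after `first` has completed.
    val sum = for {
      a <- first.future
      b <- Future { secondStarted.set(true); 2 }
    } yield a + b

    val startedEarly = secondStarted.get // false: the second Future does not exist yet
    first.success(1)
    (startedEarly, Await.result(sum, 5.seconds))
  }
}
```

Had the second Future been assigned to a val before the comprehension, it would have started running immediately, in parallel with the first.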
You could also use zip:
membersFuture.zip(messagesFuture).map {
  case (members, messages) => self ! ConversationData(members, messages)
}
but I'd prefer the for-comprehension.