I have two input streams, and I would like to merge elements from the two streams based on a matching ID. Here are the code details:
implicit val system = ActorSystem("sourceDemo")
implicit val materializer = ActorMaterializer()
case class Foo(id: Int, value: String)
case class Bar(id: Int, value: String)
case class MergeResult(id: Int, fooValue: String, barValue: String)
val sourceOne = Source(List.fill(100)(Foo(Random.nextInt(100), value = "foo")))
val sourceTwo = Source(List.fill(100)(Bar(Random.nextInt(100), value = "bar")))
What I would like to get as the result is a MergeResult, produced whenever a Foo and a Bar have the same id.
Also, for the Foo and Bar elements whose ids have not (yet) been matched, I would like to keep them in memory. I wonder if there is a clean way to do this, because it is stateful.
More importantly, the source elements are in order. If duplicate IDs are found, the strategy should be first matched, first served. That means that for Foo(1, "foo-1"), Foo(1, "foo-2") and Bar(1, "Bar-1"), the match should be MergeResult(1, "foo-1", "Bar-1").
I am looking at Akka Streams solutions at the moment. If there are good solutions in Spark, Flink, and so on, that would be helpful as well.
Thanks in advance.
You are precisely describing a join operation.
Akka Streams does not support join operations out of the box. You may find a way to do it using windowing on each stream and some actor/stateful transformation to do the lookup between them, but the last time I searched for this (not so long ago) I found nothing, so you are probably in uncharted waters.
You will only find joins on streams in more heavyweight frameworks: Flink, Spark Streaming, Kafka Streams. The reason is that a join is fundamentally a lookup of one stream against another, which requires more complex machinery (state management) than the designers of Akka Streams wanted to deal with.
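That said, a rough hand-rolled version of the stateful lookup is possible with statefulMapConcat. Below is a sketch, given the setup from your question (Foo, Bar, MergeResult, sourceOne, sourceTwo), under the assumption that keeping unmatched elements in memory indefinitely is acceptable; this is not a built-in join:

import scala.collection.mutable

// Merge the two sources into one stream of Either[Foo, Bar], then keep
// unmatched elements in per-id FIFO queues so duplicates are matched
// first-come, first-served. Unmatched elements stay in memory until matched.
val fooSide: Source[Either[Foo, Bar], _] = sourceOne.map(Left(_))
val barSide: Source[Either[Foo, Bar], _] = sourceTwo.map(Right(_))

val joined: Source[MergeResult, _] =
  fooSide.merge(barSide).statefulMapConcat { () =>
    // this state is created once per materialization and confined to the stage
    val pendingFoos = mutable.Map.empty[Int, mutable.Queue[Foo]]
    val pendingBars = mutable.Map.empty[Int, mutable.Queue[Bar]]

    (elem: Either[Foo, Bar]) => elem match {
      case Left(foo) =>
        pendingBars.get(foo.id).filter(_.nonEmpty) match {
          case Some(q) => List(MergeResult(foo.id, foo.value, q.dequeue().value))
          case None =>
            pendingFoos.getOrElseUpdate(foo.id, mutable.Queue.empty).enqueue(foo)
            Nil
        }
      case Right(bar) =>
        pendingFoos.get(bar.id).filter(_.nonEmpty) match {
          case Some(q) => List(MergeResult(bar.id, q.dequeue().value, bar.value))
          case None =>
            pendingBars.getOrElseUpdate(bar.id, mutable.Queue.empty).enqueue(bar)
            Nil
        }
    }
  }

This gives you the first-matched-first-served semantics from the question, but it has no eviction or failure recovery, which is exactly the state management the heavier frameworks handle for you.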
Related
(kinda related to How to create dynamic metric in Flink)
I have a stream of events (someid: String, name: String), and for monitoring reasons I need a counter per event ID.
In all the Flink documentation and examples, I can see that the counter is, for instance, initialised with a name in the open() of a map function.
But in my case I cannot initialise the counter, as I would need one per eventId and I do not know the values in advance. Also, I understand how expensive it would be to create a new counter every time an event passes through the map() method of the MapFunction.
Finally, I cannot keep a "cache" of counters, as it would be too big.
Ideally, I would like something like this:
class Event(id: String, name: String)

class ExampleMapFunction extends RichMapFunction[Event, Event] {

  @transient private var counter: Counter = _

  override def open(parameters: Configuration): Unit = {
    counter = new Counter()
  }

  override def map(event: Event): Event = {
    counter.inc(event.id) // wishful thinking: pass the event id as a dimension
    event
  }
}
Or, basically, could I implement my own counter that allows me to pass a dimension? If yes, how?
Any advice or best practices for this kind of use case?
If keeping a cache of the counters would be too big, then I don't think using metrics is going to scale in a way that will satisfy your requirements.
A few alternatives:
Use side outputs to collect meaningful events in some external, queryable/visualizable data store -- e.g., influxdb.
Hold the info in keyed state, and use broadcast messages to trigger output of relevant portions of it as desired (again using side outputs); a keyed-state sketch follows this list.
Hold the info in keyed state, and take periodic savepoints, which you then analyze via queries using the state processor API.
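As an illustration of the keyed-state route, here is a minimal sketch that keeps a running count per event ID in a ValueState. The class and field names here are mine, and Event is shown as a case class (mirroring the one in the question) so its fields are accessible; adapt it to your actual event type:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

case class Event(id: String, name: String)

// One count per key, held in Flink keyed state: it scales with the key space,
// is checkpointed, and never needs a metric counter object per id.
class CountPerId extends KeyedProcessFunction[String, Event, (String, Long)] {

  @transient private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("count-per-id", classOf[java.lang.Long]))
  }

  override def processElement(
      event: Event,
      ctx: KeyedProcessFunction[String, Event, (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val updated = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1L
    count.update(updated)
    out.collect((event.id, updated)) // or route this to a side output instead
  }
}

// usage (hypothetical): events.keyBy(_.id).process(new CountPerId)

The running counts can then be shipped out via a side output or inspected later from a savepoint with the State Processor API, as described above.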
We have developed multiple services, each using Akka actors, with communication between services over Akka gRPC. One service fills an in-memory database; another service, called Reader, applies some queries, shapes the data, and then transfers it to an Elasticsearch service for insertion/update. The volume of data in each reading phase is about 1M rows.
The problem arises when Reader transfers a large amount of data, so that Elasticsearch cannot process and insert/update it all.
I used Akka Streams for the communication between these two services. I also use the ScalikeJDBC library and code like the following to read and insert data in batches instead of all at once.
def applyQuery(query: String, mergeResult: Map[String, Any] => Unit) = {
  val publisher = DB readOnlyStream {
    SQL(query).map(_.toMap()).list().fetchSize(100000)
      .iterator()
  }
  Source.fromPublisher(publisher).runForeach(mergeResult)
}
////////////////////////////////////////////////////////
var batchRows: ListBuffer[Map[String, Any]] = new ListBuffer[Map[String, Any]]
val batchSize: Int = 100000

def mergeResult(row: Map[String, Any]): Unit = {
  batchRows :+= row
  if (batchRows.size == batchSize) {
    send2StorageServer(readyOutput(batchRows))
    batchRows.clear()
  }
}

def readyOutput(res: ListBuffer[Map[String, Any]]): ListBuffer[StorageServerRequest] = {
  // code to format res
}
Now, using the 'foreach' command makes operations much slower. I tried different batch sizes, but it made no difference. Am I wrong to use the foreach command, or is there a better way to resolve the speed problem using Akka Streams, Flow, etc.?
I found that the correct operation for appending to a ListBuffer is
batchRows += row
Using :+= does not produce a bug, but it is very inefficient because it copies the buffer on every append. By using the correct operator, foreach is no longer slow, although the speed problem still exists: this time, reading data is fast but writing to Elasticsearch is slow.
After some searching, I came up with these solutions:
1. Using a queue as a buffer between the database and Elasticsearch may help.
2. If blocking the read operation until the write is done is not too costly, that can be another solution (a backpressure-based sketch of this idea follows).
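For reference, here is one way the second idea could look, with Akka Streams doing the batching and backpressure instead of a mutable ListBuffer. send2StorageServerAsync is a hypothetical variant of send2StorageServer that returns a Future completing once Elasticsearch has acknowledged the batch, and an implicit materializer is assumed to be in scope as in the original code:

import akka.Done
import akka.stream.scaladsl.{Sink, Source}
import org.reactivestreams.Publisher
import scala.concurrent.Future

// Hypothetical: completes when Elasticsearch has acknowledged the whole batch.
def send2StorageServerAsync(batch: Seq[Map[String, Any]]): Future[Unit] = ???

def streamToStorage(publisher: Publisher[Map[String, Any]], batchSize: Int): Future[Done] =
  Source.fromPublisher(publisher)
    .grouped(batchSize)                                  // build batches inside the stream
    .mapAsync(parallelism = 1)(send2StorageServerAsync)  // at most one batch in flight
    .runWith(Sink.ignore)                                // upstream (the DB read) pauses while a write is pending

With mapAsync(1), demand only flows upstream once the previous write has completed, so the database read is effectively throttled to Elasticsearch's write speed.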
I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?
How can I reload the data in memory (a hashmap) immediately after a daily update?
How can I serve the streaming job continuously while this lookup data is being fetched?
One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:
val mainStream: DStream[T] = ???
Next you can create another stream which reads lookup data:
val lookupStream: DStream[(K, V)] = ???
and a simple function which can be used to update the state:
def update(
  current: Seq[V], // a sequence of values for a given key in the current batch
  prev: Option[V]  // value for a given key from the previous state
): Option[V] = {
  current
    .headOption   // if the current batch is not empty, take the first element
    .orElse(prev) // if it is empty (None), keep the previous state
}
These two pieces can be used to create the state:
val state = lookupStream.updateStateByKey(update)
All that's left is to key the mainStream and connect the data:
def toPair(t: T): (K, T) = ???
mainStream.map(toPair).leftOuterJoin(state)
While this is probably less than optimal from a performance point of view, it leverages an architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
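In case it helps, here is a minimal sketch of how lookupStream could be obtained, assuming the daily batch job drops a new file of "key,value" lines into a fixed HDFS directory (the directory path and file format are illustrative assumptions):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

val ssc: StreamingContext = ???             // your existing streaming context
ssc.checkpoint("hdfs:///checkpoints/app")   // updateStateByKey requires checkpointing

val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///lookup/daily/") // picks up new files as they appear
    .map(_.split(",", 2))
    .filter(_.length == 2)
    .map(parts => (parts(0), parts(1)))

// then, exactly as above:
// val state = lookupStream.updateStateByKey(update)
// mainStream.map(toPair).leftOuterJoin(state)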
Can anyone please explain the difference between map and mapAsync w.r.t. Akka Streams? The documentation says:
Stream transformations and side effects involving external non-stream
based services can be performed with mapAsync or mapAsyncUnordered
Why can't we simply use map here? I assume that Flow, Source, and Sink are all monadic in nature, and thus map should work fine w.r.t. the delayed nature of these?
Signature
The difference is best highlighted in the signatures: Flow.map takes in a function that returns a type T while Flow.mapAsync takes in a function that returns a type Future[T].
Practical Example
As an example, suppose that we have a function which queries a database for a user's full name based on a user id:
type UserID = String
type FullName = String
val databaseLookup : UserID => FullName = ??? //implementation unimportant
Given an akka stream Source of UserID values we could use Flow.map within a stream to query the database and print the full names to the console:
val userIDSource : Source[UserID, _] = ???
val stream =
userIDSource.via(Flow[UserID].map(databaseLookup))
.to(Sink.foreach[FullName](println))
.run()
One limitation of this approach is that this stream will only make 1 db query at a time. This serial querying will be a "bottleneck" and likely prevent maximum throughput in our stream.
We could try to improve performance through concurrent queries using a Future:
def concurrentDBLookup(userID : UserID) : Future[FullName] =
Future { databaseLookup(userID) }
val concurrentStream =
userIDSource.via(Flow[UserID].map(concurrentDBLookup))
.to(Sink.foreach[Future[FullName]](_ foreach println))
.run()
The problem with this simplistic addendum is that we have effectively eliminated backpressure.
The Sink is just pulling in the Future and adding a foreach println, which is relatively fast compared to database queries. The stream will continuously propagate demand to the Source and spawn off more Futures inside of the Flow.map. Therefore, there is no limit to the number of databaseLookup running concurrently. Unfettered parallel querying could eventually overload the database.
Flow.mapAsync to the rescue; we can have concurrent db access while at the same time capping the number of simultaneous lookups:
val maxLookupCount = 10
val maxLookupConcurrentStream =
userIDSource.via(Flow[UserID].mapAsync(maxLookupCount)(concurrentDBLookup))
.to(Sink.foreach[FullName](println))
.run()
Also notice that the Sink.foreach got simpler: it no longer takes in a Future[FullName] but just a FullName instead.
Unordered Async Map
If maintaining a sequential ordering of the UserIDs to FullNames is unnecessary, then you can use Flow.mapAsyncUnordered. For example: you just need to print all of the names to the console but don't care about the order in which they are printed.
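A brief sketch, reusing concurrentDBLookup and maxLookupCount from above (this is the only change from the previous stream):

val unorderedStream =
  userIDSource.via(Flow[UserID].mapAsyncUnordered(maxLookupCount)(concurrentDBLookup))
              .to(Sink.foreach[FullName](println))
              .run()

Completed lookups are emitted as soon as their Futures finish, so the names may print in a different order than the incoming UserIDs arrived in.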
I'm creating WebSocket actors in Play (Scala).
Actors are being created somewhere else in the system, and I just need to keep them in one place, grouped by some variables.
What is the best practice to store them, and which one takes up the smallest amount of memory:
Seq[Actor]
Seq[ActorRef]
Something else?
You should NEVER store Actor instances; the only way to access an actor should be through its ActorRef.
There are a few patterns/practices that you could use to find your actors.
The first is ActorSelection, which requires building the right actor hierarchy. For instance, if you have users split by geographical location, then you might want actor paths like
/user/..../US/PA/18900/user1
/user/..../US/PA/18900/user2
/user/..../US/NJ/07000/user3
This way you could find all matching actors using a selection with a wildcard, although you are stuck with just one property to filter them by.
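A small sketch of the wildcard lookup, where "regions" stands in for whatever intermediate supervisors your real hierarchy has (the "...." above):

import akka.actor.{ActorSelection, ActorSystem}

val system: ActorSystem = ???
val paUsers: ActorSelection = system.actorSelection("/user/regions/US/PA/18900/*")
paUsers ! "hello"   // delivered to every matching user actor under that zip code (user1 and user2 here)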
The other way is to have a data structure that stores all your flags/properties, for instance:
case class UserRef(ref: ActorRef, name: String, country: String, zip: Integer, active: Boolean)
Then your 'directory' will store them as users: List[UserRef], and you will be able to query this structure in one pass using users.filter(_.active == true) or users.find(_.name == "superuser").
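For completeness, here is a minimal sketch of such a directory as an actor (the message and class names are illustrative), reusing the UserRef case class from above and watching the refs so dead actors are dropped automatically:

import akka.actor.{Actor, ActorRef, Terminated}

case class Register(user: UserRef)
case class FindActiveIn(country: String)

class UserDirectory extends Actor {
  private var users: List[UserRef] = Nil

  def receive: Receive = {
    case Register(user) =>
      context.watch(user.ref)               // get a Terminated message when the actor dies
      users = user :: users
    case Terminated(ref) =>
      users = users.filterNot(_.ref == ref) // drop dead actors
    case FindActiveIn(country) =>
      sender() ! users.filter(u => u.active && u.country == country)
  }
}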