(kinda related to How to create dynamic metric in Flink)
I have a stream of events (someId: String, name: String), and for monitoring reasons I need a counter per event ID.
In all the Flink documentation and examples I can see that the counter is, for instance, initialised with a name in the open() method of a map function.
But in my case I cannot initialise the counter like that, as I need one per eventId and I do not know the values in advance. I also understand how expensive it would be to create a new counter every time an event passes through the map() method of the MapFunction.
Finally, I cannot keep a "cache" of counters, as it would be too big.
Ideally, I would like something like this:
case class Event(id: String, name: String)

class ExampleMapFunction extends RichMapFunction[Event, Event] {

  @transient private var counter: Counter = _

  override def open(parameters: Configuration): Unit = {
    // placeholder: real Flink counters are obtained via
    // getRuntimeContext.getMetricGroup.counter("name")
    counter = new Counter()
  }

  override def map(event: Event): Event = {
    counter.inc(event.id) // desired (non-existent) API: increment a counter per event id
    event
  }
}
Or, basically, could I implement my own counter that allows me to pass a dimension? If yes, how?
Any advice or best practice for this kind of use case?
If keeping a cache of the counters would be too big, then I don't think using metrics is going to scale in a way that will satisfy your requirements.
A few alternatives:
Use side outputs to collect meaningful events in some external, queryable/visualizable data store -- e.g., InfluxDB (see the sketch after this list).
Hold the info in keyed state, and use broadcast messages to trigger output of relevant portions of it as desired (again using side outputs).
Hold the info in keyed state, and take periodic savepoints, which you then analyze via queries using the state processor API.
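A minimal sketch of the first alternative (side outputs), assuming the Scala DataStream API; the tag name, the (eventId, 1) payload, and the MyInfluxDbSink sink are illustrative placeholders:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

case class Event(id: String, name: String)

// Side output carrying one (eventId, 1) record per event; the external store
// (e.g. InfluxDB) does the per-id aggregation and visualization.
val countsTag = OutputTag[(String, Long)]("per-id-counts")

class TagCounts extends ProcessFunction[Event, Event] {
  override def processElement(
      event: Event,
      ctx: ProcessFunction[Event, Event]#Context,
      out: Collector[Event]): Unit = {
    ctx.output(countsTag, (event.id, 1L)) // monitoring record to the side output
    out.collect(event)                    // main output, unchanged
  }
}

// val tagged = events.process(new TagCounts)
// tagged.getSideOutput(countsTag).addSink(new MyInfluxDbSink) // hypothetical sink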
In my Apache Beam job I call an external source, GCP Storage; for practical purposes this can be treated like an HTTP call. The important part is that it is an external call used to enrich the job's data.
For every piece of data I process, I call this API to obtain some information to enrich that data. There is a heavy amount of repeated calls to the API for the same data.
Is there a good way to cache or store the results for reuse across the elements being processed, to limit the amount of network traffic required? It is a massive bottleneck for processing.
You can consider persisting this value as instance state on your DoFn. For example
class MyDoFn(beam.DoFn):
  def __init__(self):
    # This will be called during construction and pickled to the workers.
    self.value1 = some_api_call()

  def setup(self):
    # This will be called once for each DoFn instance (generally
    # once per worker), good for non-pickleable stuff that won't change.
    self.value2 = some_api_call()
    # A per-instance cache for repeated lookups (could also be an LRU cache).
    self.some_lru_cache = {}

  def start_bundle(self):
    # This will be called per-bundle, possibly many times on a worker.
    self.value3 = some_api_call()

  def process(self, element):
    # This is called on each element.
    key = ...
    if key not in self.some_lru_cache:
      self.some_lru_cache[key] = some_api_call()
    value4 = self.some_lru_cache[key]
    # Use self.value1, self.value2, self.value3 and/or value4 here.
There is no internal persistence layer in Beam; you have to fetch the data you want to process, and this can potentially happen on a fleet of workers that all need access to the data.
However, you might want to consider accessing your data as a side input. You will have to preload all the data, but you won't need to query the external source for each element: https://beam.apache.org/documentation/programming-guide/#side-inputs
For GCS specifically you might want to try to use the existing IO, e.g. TextIO: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java
I implemented a rich sink function which performs some network calls per invoked object. I would like to count some metadata on these events, keyed by contextual information contained in the event (the event's batchId), and expose this metadata to an external system.
For example, an event looks like this:
case class MyEvent(batchId: String, eventId: String, moreInformation: ...)
class MySink(...) extends RichSinkFunction[MyEvent]
{
override def open(parameters: Configuration): Unit = {
...
}
override def close(): Unit = {
...
}
override def invoke(event: MyEvent) = {
// some processing is done here
....
//
...
    if (success) {
      // I want to save the metadata here, per event.batchId:
      // state.count.number.of.events.processed.for.event.batchId
    }
}
}
And somewhere else, I want to be able to query how many events were processed for a given batchId.
A few options:
Plan A: Use Metric objects and a MetricReporter to expose the data to the external system(s). This has the drawback that metrics aren't checkpointed, and if there are a lot of batchIds, you'll probably end up polluting the metrics system with lots of metrics that can't get GC'ed.
Plan B: Rewrite your RichSinkFunction as a RichFlatMap (or ProcessFunction) that emits a stream of Tuples holding (batchId, number.of.events.in.batchId). You can key this stream by the batchId, and then use keyed state in a KeyedProcessFunction (for example) to store and expose this state via queryable state (see the sketch after this list). This has the drawback that queryable state only allows for point queries (one key at a time).
Plan C: In this variant, the external systems could query the state created in Plan B by injecting queries into a stream that is broadcast into a KeyedBroadcastProcessFunction that holds keyed state.count.number.of.events.processed.for.event.batchId data. You can then use ctx.applyToKeyedState in the processBroadcastElement method of the KeyedBroadcastProcessFunction to respond to these queries. See one of the Flink training exercises for an example.
Plan D: Write the results from B (or C) into Redis, Elasticsearch, or some other queryable data store, and have the external systems get this info from there.
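A minimal sketch of Plan B, assuming the Scala API and a trimmed-down MyEvent; the state name and the BatchCounter class are illustrative:
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

case class MyEvent(batchId: String, eventId: String) // trimmed for the sketch

// Keyed by batchId: emits a running (batchId, count) pair per event and
// exposes the count via queryable state.
class BatchCounter extends KeyedProcessFunction[String, MyEvent, (String, Long)] {

  @transient private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    val descriptor = new ValueStateDescriptor("batch-count", classOf[java.lang.Long])
    descriptor.setQueryable("batch-count") // enables external point queries by batchId
    count = getRuntimeContext.getState(descriptor)
  }

  override def processElement(
      event: MyEvent,
      ctx: KeyedProcessFunction[String, MyEvent, (String, Long)]#Context,
      out: Collector[(String, Long)]): Unit = {
    val next = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1L
    count.update(next)
    out.collect((event.batchId, next))
  }
}

// Usage: events.keyBy(_.batchId).process(new BatchCounter)
The external system would then use a QueryableStateClient to look up the "batch-count" state for a given batchId.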
I need to look up some data in a Spark Streaming job from a file on HDFS.
This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?
How can I reload the data in memory (a hashmap) immediately after a daily update?
How can I serve the streaming job continuously while this lookup data is being fetched?
One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:
val mainStream: DStream[T] = ???
Next you can create another stream which reads lookup data:
val lookupStream: DStream[(K, V)] = ???
and a simple function which can be used to update the state:
def update(
  current: Seq[V], // A sequence of values for a given key in the current batch
  prev: Option[V]  // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption   // If the current batch is not empty, take the first element
    .orElse(prev) // If it is empty (None), take the previous state
}
These two pieces can be used to create the state:
val state = lookupStream.updateStateByKey(update)
All that's left is to key the mainStream and join the data:
def toPair(t: T): (K, T) = ???
mainStream.map(toPair).leftOuterJoin(state)
While this is probably less than optimal from a performance point of view, it leverages the architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.
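Putting the pieces together, a minimal assembled sketch might look like this; the Record/Lookup types, checkpoint path, and batch interval are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

case class Lookup(value: String)                // placeholder lookup record
case class Record(key: String, payload: String) // placeholder main record

val conf = new SparkConf().setAppName("lookup-enrichment")
val ssc  = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("hdfs:///tmp/checkpoints")       // updateStateByKey requires checkpointing

val mainStream: DStream[Record] = ???             // your main event stream
val lookupStream: DStream[(String, Lookup)] = ??? // re-reads the HDFS file every batch

def update(current: Seq[Lookup], prev: Option[Lookup]): Option[Lookup] =
  current.headOption.orElse(prev)

val state = lookupStream.updateStateByKey(update)

val enriched = mainStream
  .map(r => (r.key, r))
  .leftOuterJoin(state)                         // (key, (Record, Option[Lookup]))
  .map { case (_, (record, lookup)) => (record, lookup) } // enrich as needed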
Can anyone please explain the difference between map and mapAsync w.r.t. Akka Streams? In the documentation it is said that
Stream transformations and side effects involving external non-stream
based services can be performed with mapAsync or mapAsyncUnordered
Why can't we simply use map here? I assume that Flow, Source, and Sink would all be monadic in nature, and thus map should work fine with respect to the delays inherent in these?
Signature
The difference is best highlighted in the signatures: Flow.map takes in a function that returns a type T while Flow.mapAsync takes in a function that returns a type Future[T].
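Slightly simplified, the two signatures on a Flow[In, Out, Mat] look like this:
def map[T](f: Out => T): Flow[In, T, Mat]
def mapAsync[T](parallelism: Int)(f: Out => Future[T]): Flow[In, T, Mat]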
Practical Example
As an example, suppose that we have a function which queries a database for a user's full name based on a user id:
type UserID = String
type FullName = String
val databaseLookup : UserID => FullName = ??? //implementation unimportant
Given an akka stream Source of UserID values we could use Flow.map within a stream to query the database and print the full names to the console:
val userIDSource : Source[UserID, _] = ???
val stream =
userIDSource.via(Flow[UserID].map(databaseLookup))
.to(Sink.foreach[FullName](println))
.run()
One limitation of this approach is that this stream will only make 1 db query at a time. This serial querying will be a "bottleneck" and likely prevent maximum throughput in our stream.
We could try to improve performance through concurrent queries using a Future:
def concurrentDBLookup(userID : UserID) : Future[FullName] =
Future { databaseLookup(userID) }
val concurrentStream =
userIDSource.via(Flow[UserID].map(concurrentDBLookup))
.to(Sink.foreach[Future[FullName]](_ foreach println))
.run()
The problem with this simplistic addendum is that we have effectively eliminated backpressure.
The Sink is just pulling in the Future and adding a foreach println, which is relatively fast compared to database queries. The stream will continuously propagate demand to the Source and spawn off more Futures inside of the Flow.map. Therefore, there is no limit to the number of databaseLookup running concurrently. Unfettered parallel querying could eventually overload the database.
Flow.mapAsync to the rescue; we can have concurrent db access while at the same time capping the number of simultaneous lookups:
val maxLookupCount = 10
val maxLookupConcurrentStream =
userIDSource.via(Flow[UserID].mapAsync(maxLookupCount)(concurrentDBLookup))
.to(Sink.foreach[FullName](println))
.run()
Also notice that the Sink.foreach got simpler: it no longer takes in a Future[FullName], but just a FullName instead.
Unordered Async Map
If maintaining a sequential ordering of the UserIDs to FullNames is unnecessary, then you can use Flow.mapAsyncUnordered. For example: if you just need to print all of the names to the console and don't care about the order in which they are printed.
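A minimal sketch of the unordered variant, reusing maxLookupCount and concurrentDBLookup from above:
val unorderedLookupStream =
  userIDSource.via(Flow[UserID].mapAsyncUnordered(maxLookupCount)(concurrentDBLookup))
    .to(Sink.foreach[FullName](println))
    .run()
Results are emitted as soon as each Future completes, so names may print in a different order than the incoming UserIDs.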
I am designing a backend using CQRS + Event sourcing, using Akka + Scala. I am not sure about how to handle a growing state. For instance, I will have a growing list of users. To my understanding, each user will be created following a UserCreated event, such events will be replayed by the PersistentActor, and the users will be stored in a collection. Something like:
class UsersActor extends PersistentActor {
  override def persistenceId = ....

  private case class UsersState(users: List[User])
  private var state = UsersState(Nil)
  ....
}
Obviously such state would eventually grow too big to be held in memory by this actor, so I guess I'm doing something wrong.
I found this example project: the idea seems to be that each user should be held by a different actor, and loaded (from the event history) as needed.
What is the right way to do this? Thanks a lot.
The answer is: each aggregate/entity (in my example, each User) gets its own actor, which embeds the state for that particular entity and that one only.
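A minimal sketch of that idea, assuming classic Akka Persistence; the command/event names and the UserActor class are illustrative:
import akka.persistence.PersistentActor

case class User(id: String, name: String)
case class CreateUser(name: String)
case class UserCreated(name: String)

// One persistent actor per user: its state is only that user's data,
// and its persistenceId isolates that user's event stream.
class UserActor(userId: String) extends PersistentActor {
  override def persistenceId: String = s"user-$userId"

  private var state: Option[User] = None

  override def receiveCommand: Receive = {
    case CreateUser(name) =>
      persist(UserCreated(name)) { evt =>
        state = Some(User(userId, evt.name))
      }
  }

  override def receiveRecover: Receive = {
    case UserCreated(name) =>
      state = Some(User(userId, name))
  }
}
In practice, such per-entity actors are usually created on demand (for example via Akka Cluster Sharding) and passivated when idle, so only the active users are held in memory.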