Apache Flink - implementing a stream processor with potentially very large state - scala

I wish to project a potentially very large state from a stream of events. This is how I might implement this in an imperative fashion:
class ImperativeFooProcessor {

  val state: mutable.Map[UUID, BarState] = mutable.HashMap.empty[UUID, BarState]

  def handle(event: InputEvent) = {
    event match {
      case FooAdded(fooId, barId) => {
        // retrieve relevant state and do some work on it
        val barState = state(barId)
        // let the world know about what may have happened
        publish(BarOccured(fooId, barId))
        // or maybe rather
        publish(BazOccured(fooId, barId))
      }
      case FooRemoved(fooId, barId) => {
        // retrieve relevant state and do some work on it
        val barState = state(barId)
        // let the world know about what may have happened
        publish(BarOccured(fooId, barId))
        // or maybe rather
        publish(BazOccured(fooId, barId))
      }
    }
  }

  private def publish(event: OutputEvent): Unit = {
    // push event to downstream sink
  }
}
In the worst case, the size of a BarState will grow with the number of times it has been mentioned by FooAdded.
The number of unique barIds is very small relative to the total number of events for each barId.
How would I begin to represent this processing structure in Flink?
How do I work with the fact that each BarState can potentially get very large?

Flink maintains state in so-called state backends. There are state backends (MemoryStateBackend and FsStateBackend) that operate on the JVM heap of the worker processes. These backends are not suited to handling large state.
Flink also features a RocksDBStateBackend, which is based on RocksDB. RocksDB is used as a local database (no need to set it up as an external service) on each worker node and writes state data to disk. Hence, it can handle very large state that exceeds the available memory.
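For illustration, enabling the RocksDB backend programmatically might look roughly like this (a minimal sketch assuming a pre-1.13 Flink version with the flink-statebackend-rocksdb dependency on the classpath; the checkpoint URI is a placeholder):
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoints (and thus the RocksDB state snapshots) go to a durable file system;
// the second argument enables incremental checkpoints.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))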
Flink offers a KeyedStream, which is a stream that is partitioned on a certain attribute. In your case, you probably want all accesses for the same id to go to the same state instance, so you would use barId as the key. The state is then partitioned across all parallel worker threads based on the barId. This is basically a distributed key-value store or map, so you would not need to represent the state as a map yourself, because it is automatically distributed by Flink.
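A rough sketch of what that could look like with keyed state (InputEvent, OutputEvent and BarState come from the question; the assumption that InputEvent exposes a barId field and the exact wiring are mine):
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

class BarStateFunction extends RichFlatMapFunction[InputEvent, OutputEvent] {

  // Flink scopes this ValueState to the current key (the barId), so each barId gets
  // its own BarState, managed by whatever state backend is configured.
  private var barState: ValueState[BarState] = _

  override def open(parameters: Configuration): Unit = {
    barState = getRuntimeContext.getState(
      new ValueStateDescriptor[BarState]("bar-state", classOf[BarState]))
  }

  override def flatMap(event: InputEvent, out: Collector[OutputEvent]): Unit = {
    val current = barState.value() // null the first time a barId is seen
    // ... do some work on `current`, write it back with barState.update(...),
    // and emit output events via out.collect(...), e.g. BarOccured / BazOccured
  }
}

// Partition the stream by barId so all events for one bar hit the same state instance.
val events: DataStream[InputEvent] = ???
events
  .keyBy(_.barId)
  .flatMap(new BarStateFunction)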

Related

Differences in number of approximateEntries in Kafka Streams store if it is retrieved via streams.store() or in a processor via context.getStateStore

We are accessing a state store that is used/filled by an aggregate() call in our stream DSL and also accessed in two other areas of our Kafka streams app.
One of them is used just for scheduled monitoring of the entries; the store is created and accessed by:
val value: StoreQueryParameters[ReadOnlyKeyValueStore[String, Aggregated]] =
  StoreQueryParameters.fromNameAndType(config.stream.stateStoreName, QueryableStoreTypes.keyValueStore())
val store = streams.store(value)
// then later in a scheduled TimerThread
logger.info(store.approximateNumEntries())
the other place in the Kafka Streams app is in a Processor used in a transform() call of the DSL:
[...].transform(
  ourTransformerSupplier(config.stream.stateStoreName),
  Named.as("ourTransformer"),
  config.stream.stateStoreName
)
// Transformer code
def init(context: ProcessorContext): Unit = {
  this.context = context
  this.stateStore = context.getStateStore(stateStoreName).asInstanceOf[TimestampedKeyValueStore[String, Aggregated]]
  [...]
// inside the transformer's scheduled (WALL_CLOCK_TIME) punctuate:
logger.info(stateStore.approximateNumEntries())
The numbers differ by a factor of 10 to 20, even though both calls execute at roughly the same time on the same instance. Of course we use the same state store name in the configuration (i.e. it really is the same store), and the actions taken in the processor also affect the numbers seen by the monitoring code. The difference is of roughly the same magnitude on every instance in the cluster.
I wondered whether this has something to do with how the state store is returned to the processor by the API (TimestampedKeyValueStore instead of ReadOnlyKeyValueStore), but after looking at the library code this seems to be just an API wrapper.

Apache pulsar: Akka streams - consumer configuration

I have to write a simple service (Record ingester service) that consumes messages from Apache Pulsar and stores them in Elasticsearch, and for that I am using com.sksamuel.pulsar4s.akka.
Messages on Pulsar are produced by another service, the Record pump service.
Both of these services are to be deployed separately in production.
Here is my source:
private val source = committableSource(consumerFn)
The above code works fine and is able to consume messages from Pulsar and write them to ES.
However, I am not sure if we should be using MessageId.earliest when creating the source:
private val source = committableSource(consumerFn, Some(MessageId.earliest))
While testing, I found pros and cons of both approaches, i.e. with and without MessageId.earliest, but in my opinion neither of them is suitable for production.
1. Without using MessageId.earliest:
a. This adds the constraint that the Record ingester service has to be up before we start the Record pump service.
b. If my Record ingester service goes down (due to an issue or maintenance), the messages produced on Pulsar by the Record pump service will not be consumed after the ingester service is back up. This means that messages produced while the ingester service is down are never consumed.
So I think the logic is that only those messages are consumed which are put on Pulsar AFTER the consumer has subscribed to the topic.
But I don't think that is acceptable in production, for the reasons mentioned in points a and b.
2. With MessageId.earliest:
Points a and b mentioned above are solved with this, but:
Any time my Record ingester service comes back up (after downtime or maintenance), it starts consuming all messages from the very beginning.
I have logic so that records with the same id get overwritten on the ES side, so it doesn't really do any harm, but I still don't think this is the right way: there would be millions of messages on that topic, and every time it would consume messages that have already been consumed (which is a waste).
This, too, is unacceptable in production to me.
Can anyone please help me figure out what configuration to use that solves both cases?
I tried various configurations, such as subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest), but no luck.
Complete code:
//consumer
private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared)
    )
  )
//create source
private val source = committableSource(consumerFn)
//create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map {
  committableSourceMessage =>
    val message = committableSourceMessage.message
    val obj: MyObject = MyObject.parseFrom(message.value)
    WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
  ElasticsearchFlow.create(
    indexName = "myindex",
    typeName = "_doc",
    settings = ElasticsearchWriteSettings.Default,
    StringMessageWriter
  )
)
source.via(intermediateFlow).run()
What you would want is some form of compaction. See the Pulsar docs for details. You can make consumption compaction-aware with
ConsumerConfig(
  // other consumer config options as before
  readCompacted = Some(true)
)
There's a discussion in the Pulsar docs about the mechanics of compaction. Note that enabling compaction requires that writes to the topic be keyed, which may or may not have happened in the past.
Compaction can be approximated in Akka in a variety of ways, depending on how many distinct keys to compact on are in the topic, how often they're superseded by later messages, and so on. The basic idea would be to have a statefulMapConcat stage which keeps a Map[String, T] in its state, plus some means of flushing the buffer.
A simple implementation would be:
Flow[CommittableMessage[Array[Byte]]].map { csm =>
  Option(MyObject.parseFrom(csm.message.value))
}
  // inject a None after a minute of inactivity so the buffer still gets flushed
  .keepAlive(1.minute, () => None)
  .statefulMapConcat { () =>
    var map: Map[String, MyObject] = Map.empty
    var count: Int = 0

    { objOpt: Option[MyObject] =>
      objOpt.map { obj =>
        // keep only the latest message per id
        map = map.updated(obj.id, obj)
        count += 1
        if (count == 1000) {
          // flush after 1000 incoming messages
          val toEmit = map.values.toList
          count = 0
          map = Map.empty
          toEmit
        } else Nil
      }.getOrElse {
        // keepAlive tick: flush whatever has accumulated
        val toEmit = map.values.toList
        count = 0
        map = Map.empty
        toEmit
      }
    }
  }
A more involved answer would be to create an actor corresponding to each object (cluster sharding may be of use here, especially if there are likely to be a lot of objects) and to have the ingest from Pulsar send the incoming messages to the relevant actor, which then schedules a write of the latest message received to Elasticsearch.
One thing to be careful about with this is not committing offsets until you're sure the message (or a successor which supersedes it) has been written to Elasticsearch. If doing the actor-per-object approach, Akka Persistence may be of use: the basic strategy would be to commit the offset once the actor has acknowledged receipt (which occurs after persisting an event, e.g. to Cassandra).
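A very rough sketch of that actor-per-object idea with classic actors (assuming Akka 2.6; the writeToElasticsearch and commitOffset hooks are hypothetical placeholders for whatever ES client and Pulsar acknowledgement mechanism you actually use, and error handling is omitted):
import akka.actor.{Actor, Props, Timers}
import scala.concurrent.Future
import scala.concurrent.duration._

object ObjectWriter {
  // commitOffset stands in for whatever acknowledges the Pulsar message
  final case class Latest(obj: MyObject, commitOffset: () => Future[Unit])
  private case object Flush
  def props(writeToElasticsearch: MyObject => Future[Unit]): Props =
    Props(new ObjectWriter(writeToElasticsearch))
}

class ObjectWriter(writeToElasticsearch: MyObject => Future[Unit]) extends Actor with Timers {
  import ObjectWriter._
  import context.dispatcher

  // Only the most recent message for this object is kept; earlier ones are superseded.
  private var pending: Option[Latest] = None

  timers.startTimerWithFixedDelay("flush", Flush, 30.seconds)

  def receive: Receive = {
    case latest: Latest =>
      pending = Some(latest)
    case Flush =>
      pending.foreach { latest =>
        // acknowledge the Pulsar message only after the ES write has succeeded
        writeToElasticsearch(latest.obj).flatMap(_ => latest.commitOffset())
      }
      pending = None
  }
}
The ingest stream would then route each incoming message to the actor for its obj.id, e.g. via a parent actor or cluster sharding as mentioned above.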

Is it safe to use a concurrenthashmap that is shared with 100K actors?

I have about 100K products that I want to keep in memory.
These products can change at a high rate per hour, but at the same time reads will be the majority of the calls, and there can be a delay in getting the most recent version.
I want to create 1 actor per product, but have a shared in-memory store of the products.
Is it safe to use a ConcurrentHashMap and pass it as a prop to 100K actors?
val products = new ConcurrentHashMap[ProductId, Product](initialCapacity)
So in my actor I will have something like:
def receive = {
  case GetProduct(id: ProductId) =>
    // look up in the products ConcurrentHashMap; if not there, read from the datastore and return it
  case UpdateProduct(id: ProductId) => ???
}
When there are updates, there could be 100-500K per hour. There can be a delay in the update, that's not an issue. So to distribute the updates and prevent locks, 1 actor per product is what I am thinking. I just want to make the cache global across all actors.
Is this design sound?
ConcurrentHashMap is safe to share; the number of threads does not matter. But why do you create 100k actors when you keep the state elsewhere anyway? An Actor exists because it owns its state; if that is not the case, then the Actor should not exist.
So either you keep your data in a ConcurrentHashMap, in which case you only need as many threads (actors) as you have CPU cores to perform the accesses, or you ditch the CHM and store each of your products in its own designated Actor. The decision depends on the nature of the operations you want to perform: if you only ever interact with a single product at a time, then Actors are a great way of modelling that. If you want to interact with a collection of products at a time, then use something else.
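A minimal sketch of the second option, one actor owning one product (loadFromDatastore is a hypothetical placeholder, the UpdateProduct message is assumed to carry the new product, and a real implementation would load asynchronously rather than block inside the actor):
import akka.actor.{Actor, Props}

object ProductActor {
  final case class GetProduct(id: ProductId)
  final case class UpdateProduct(id: ProductId, product: Product)
  def props(id: ProductId, loadFromDatastore: ProductId => Product): Props =
    Props(new ProductActor(id, loadFromDatastore))
}

class ProductActor(id: ProductId, loadFromDatastore: ProductId => Product) extends Actor {
  import ProductActor._

  // The product lives inside the actor; no shared map and no locking,
  // because the actor processes one message at a time.
  private var product: Option[Product] = None

  def receive: Receive = {
    case GetProduct(_) =>
      if (product.isEmpty) product = Some(loadFromDatastore(id))
      sender() ! product.get
    case UpdateProduct(_, updated) =>
      product = Some(updated)
  }
}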

Find actor by persistence id

I have a system that has an actor per user. Users send messages rarely, but when they do, they usually send not just one but a few.
Currently, I have a map where I store persistenceId -> ActorRef. When I receive a new message for an actor, I look into the map: if there is an ActorRef, I use it; if it is missing, I create the actor and put it into the map. I definitely don't want to have two instances of the same persistent actor at the same time. Also, I don't want to create and destroy the actor for each message, as recovery could take some time.
I feel there should be some cleaner way of "locating or creating" an actor. Something like actorSystem.getOrCreate(persistenceId, props). I thought that sharding might help me with that, but I couldn't find an exact example of this. Also, I know there is actorSelection, which has downsides:
- using it in too many places, with hardcoded paths that are tricky to maintain
- using it to send too many messages, as it has a performance cost
So basically the question is: what is the best way of locating a persistent actor within one service if the actor's persistenceId is the userId? If I decide to use sharding, then it will be 1 shard per actor. Is this OK?
Actor sharding is pretty much what you need - you can think of it as a distributed map of actors, and there is no need for additional solutions. Sharding takes care of summoning the actor behind the scenes, and there is no need for you to manage actors yourself.
val sharding = ClusterSharding(system).start(
  typeName = CustomerActor.shardName,
  entityProps = CustomerActor.props,
  settings = ClusterShardingSettings(system),
  extractEntityId = CustomerActor.extractEntityId,
  extractShardId = CustomerActor.extractShardId)
where extractEntityId is a function which routes messages to appropriate actors
val extractEntityId: ShardRegion.ExtractEntityId = {
  case gc: GetCustomer => (gc.customerId, gc)
}
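The companion extractShardId decides which shard an entity lands on, typically by hashing the entity id into a fixed number of shards (the shard count below is an arbitrary example; the Akka docs recommend roughly ten times the maximum number of nodes):
val numberOfShards = 100

val extractShardId: ShardRegion.ExtractShardId = {
  case gc: GetCustomer => (math.abs(gc.customerId.hashCode) % numberOfShards).toString
}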
And final example:
case class GetCustomer(customerId: String)
sharding ! GetCustomer("customer-id")
More details here https://doc.akka.io/docs/akka/2.5/cluster-sharding.html

Spark Streaming states to be persisted to disk in addition to in memory

I have written a Spark Streaming program that uses the mapWithState function to detect and skip repetitive records. The function is similar to the one below:
val trackStateFunc1 = (batchTime: Time,
                       key: String,
                       value: Option[(String, String)],
                       state: State[Long]) => {
  if (state.isTimingOut()) {
    None
  }
  else if (state.exists()) None
  else {
    state.update(1L)
    Some(value.get)
  }
}

val stateSpec1 = StateSpec.function(trackStateFunc1)
  //.initialState(initialRDD)
  .numPartitions(100)
  .timeout(Minutes(30*24*60))
My number of records could be high and I have set the timeout to about one month. Therefore, the number of records and keys could be high. I wanted to know if I can save these states to disk in addition to memory, something like
"RDD.persist(StorageLevel.MEMORY_AND_DISK_SER)"
I wanted to know if I can save these states on Disk in addition to the Memory
Stateful streaming in Spark automatically gets serialized to persistent storage; this is called checkpointing. When you run your stateful DStream, you must provide a checkpoint directory, otherwise the graph won't be able to execute at runtime.
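Setting the checkpoint directory is a one-liner on the StreamingContext (the path below is just a placeholder; it should point to a fault-tolerant file system such as HDFS or S3):
// ssc is your existing org.apache.spark.streaming.StreamingContext
ssc.checkpoint("hdfs:///spark/checkpoints")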
You can set the checkpointing interval via DStream.checkpoint. For example, if you want to set it to every 30 seconds:
inputDStream
.mapWithState(trackStateFunc)
.checkpoint(Seconds(30))
According to the MapWithState sources, you can try:
mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)
This is still valid as of Spark 3.0.1.