Apache Pulsar: Akka Streams - consumer configuration - Scala

I have to write a simple service (Record ingester service) via which I need to consume messages present on Apache Pulsar and store them in an Elasticsearch store, and for that I am using com.sksamuel.pulsar4s.akka.
Messages on Pulsar are produced by another service, the Record pump service.
Both these services are to be deployed separately in production.
Here is my source:
private val source = committableSource(consumerFn)
The above code works fine and is able to consume messages from Pulsar and write them to ES.
However, I am not sure if we should be using MessageId.earliest when creating the source:
private val source = committableSource(consumerFn, Some(MessageId.earliest))
While testing, I found pros and cons of both approaches, i.e. without MessageId.earliest and with MessageId.earliest, but neither of them is suitable for production (in my opinion).
1. Without using MessageId.earliest:
a. This adds a constraint that the Record ingester service has to be up before we start the Record pump service.
b. If my Record ingester service goes down (due to an issue or maintenance), the messages produced on Pulsar by the Record pump service will not get consumed after the ingester service comes back up. This means that messages produced while the ingester service is down never get consumed.
So, I think the logic is that only those messages are consumed which are put on Pulsar AFTER the consumer has subscribed to that topic.
But I don't think this is acceptable in production, for the reasons mentioned in points a and b.
2. With MessageId.earliest:
Points a and b mentioned above are solved with this, but -
When we use this, any time my Record ingester service comes back up (after downtime or maintenance), it starts consuming all messages from the very beginning.
I have logic so that records with the same id get overwritten on the ES side, so it doesn't do any real harm, but I still don't think this is the right way - there would be millions of messages on that topic and it would every time re-consume messages that have already been consumed (which is a waste).
This, too, is unacceptable to me in production.
Can anyone please help me out with what configuration to use that solves both cases?
I tried various configurations, such as using subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest), but no luck.
Complete code:
// consumer
private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared)
    )
  )

// create source
private val source = committableSource(consumerFn)

// create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map {
  committableSourceMessage =>
    val message = committableSourceMessage.message
    val obj: MyObject = MyObject.parseFrom(message.value)
    WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
  ElasticsearchFlow.create(
    indexName = "myindex",
    typeName = "_doc",
    settings = ElasticsearchWriteSettings.Default,
    StringMessageWriter
  )
)

source.via(intermediateFlow).run()

What you would want is some form of compaction. See the Pulsar docs for details. You can make consumption compaction-aware with
ConsumerConfig(
  // other consumer config options as before
  readCompacted = Some(true)
)
There's a discussion in the Pulsar docs about the mechanics of compaction. Note that enabling compaction requires that writes to the topic be keyed, which may or may not have happened in the past.
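Since compaction retains only the latest message per key, keyed production is required. Purely as a hedged illustration (not the poster's code), keyed publishing could look roughly like this with the plain Pulsar Java client; statementTopic and MyObject are from the question, everything else (service URL, toByteArray) is an assumption:
import org.apache.pulsar.client.api.{PulsarClient, Schema}

val javaClient = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650") // assumption: point this at your broker
  .build()

val keyedProducer = javaClient.newProducer(Schema.BYTES)
  .topic(statementTopic)
  .create()

def publishKeyed(obj: MyObject): Unit =
  keyedProducer.newMessage()
    .key(obj.id)            // compaction keeps the latest message per key
    .value(obj.toByteArray) // assumption: MyObject is protobuf-generated (parseFrom is used above)
    .send()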
Compaction can be approximated in Akka in a variety of ways, depending on how many distinct keys to compact on are in the topic, how often they're superseded by later messages, etc. The basic idea would be to have a statefulMapConcat which keeps a Map[String, T] in its state and some means of flushing the buffer.
A simple implementation would be:
Flow[CommittableMessage[Array[Byte]]]
  .map { csm => Option(MyObject.parseFrom(csm.message.value)) }
  .keepAlive(1.minute, () => None)  // inject a None once a minute to force a flush
  .statefulMapConcat { () =>
    // mutable state is confined to this stage instance
    var map: Map[String, MyObject] = Map.empty
    var count: Int = 0

    { objOpt: Option[MyObject] =>
      objOpt.map { obj =>
        map = map.updated(obj.id, obj)  // later messages supersede earlier ones per id
        count += 1
        if (count == 1000) {            // flush after 1000 messages
          val toEmit = map.values.toList
          count = 0
          map = Map.empty
          toEmit
        } else Nil
      }.getOrElse {                     // None injected by keepAlive: flush whatever is buffered
        val toEmit = map.values.toList
        count = 0
        map = Map.empty
        toEmit
      }
    }
  }
A more involved answer would be to create an actor corresponding to each object (cluster sharding may be of use here, especially if there are likely to be a lot of objects) and to have the ingest from Pulsar send the incoming messages to the relevant actor, which then schedules a write of the latest message received to Elasticsearch.
One thing to be careful about with this is to not commit offsets until you're sure the message (or a successor which supersedes it) has been written to Elasticsearch. If doing the actor-per-object approach, Akka Persistence may be of use: the basic strategy would be to commit the offset once the actor has acknowledged receipt (which occurs after persisting an event, e.g. to Cassandra).
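As a rough sketch of the actor-per-object idea (not the poster's code; MyObject is from the question, everything else here is assumed), a typed actor could keep only the latest message per id, flush it to Elasticsearch on a timer, and acknowledge the Pulsar message only after the write:
import akka.Done
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

object ObjectWriter {
  sealed trait Command
  final case class Latest(obj: MyObject, ack: () => Future[Done]) extends Command
  private case object Flush extends Command

  // writeToEs and the ack thunk are placeholders for the real ES write and Pulsar ack
  def apply(writeToEs: MyObject => Future[Done])(implicit ec: ExecutionContext): Behavior[Command] =
    Behaviors.withTimers { timers =>
      timers.startTimerWithFixedDelay(Flush, 30.seconds)

      def running(pending: Option[Latest]): Behavior[Command] =
        Behaviors.receiveMessage {
          case latest: Latest => running(Some(latest)) // supersedes anything not yet written
          case Flush =>
            pending.foreach { latest =>
              writeToEs(latest.obj).flatMap(_ => latest.ack()) // ack only after the ES write
            }
            running(None)
        }

      running(None)
    }
}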

Related

Flink KPL doesn't seem to deliver to second shard

So I have the following code where I am trying to get the KPL to set the partition key, so I can start sharding my stream.
def createSinkFromStaticConfig(stream: Option[String], region: Option[String]): FlinkKinesisProducer[String] = {
  val outputProperties = new Properties
  outputProperties setProperty(AWSConfigConstants.AWS_REGION, region.get)
  outputProperties setProperty("Region", region.get)
  outputProperties.put("RecordTtl", s"${Int.MaxValue}")
  outputProperties.put("ThreadPoolSize", "5")
  outputProperties.put("MaxConnections", "5")

  val sink = new FlinkKinesisProducer[String](new SimpleStringSchema, outputProperties)
  sink setDefaultStream stream.get
  sink setDefaultPartition "0"
  sink setCustomPartitioner new KinesisPartitioner[String]() {
    override def getPartitionId(element: String): String = {
      val epoch = LocalDateTime.now.toEpochSecond(ZoneOffset.UTC)
      epoch.toString
    }
  }
  sink setQueueLimit 500
  sink
}
So the sink, when called, does work and sends data to the stream. I have manually sharded the stream and have two consumers on it. I can see each consumer is being assigned to different shards, but only one will get any work. Is there something I am doing wrong in setting the shard? Is there a way to validate which shard a record was sent to?
Thanks
So, found some fun.
So the above code does work and submits to different shards. My problem was that the second task of my Kinesis consumer was somehow reading the parent shard, instead of grabbing one of the available child shards. Once I spun up a 3rd task, I had one for each child and the parent. Still no good, but at least I know it's not the producer!
EDIT:
So, for more clarification, this was a problem of my own making (aren't they all?). In testing, I had sharded and unsharded the stream a bunch of times. This caused the parent shard to be considered still active and needing processing. Even without all those shenanigans, it could have been a problem. See the link:
https://github.com/awslabs/amazon-kinesis-client/issues/786
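For anyone hitting the same thing, the shard lineage (which parents are closed and which children are open) can be inspected with the AWS SDK; a rough sketch along those lines (SDK for Java v1, stream name and region setup are assumptions, pagination omitted):
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.ListShardsRequest
import scala.collection.JavaConverters._

val kinesis = AmazonKinesisClientBuilder.defaultClient()

val shards = kinesis
  .listShards(new ListShardsRequest().withStreamName("my-stream")) // assumption: your stream name
  .getShards
  .asScala

shards.foreach { shard =>
  // an open shard has no ending sequence number; closed parents should eventually drain
  val open = shard.getSequenceNumberRange.getEndingSequenceNumber == null
  println(s"${shard.getShardId} parent=${shard.getParentShardId} open=$open")
}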

Migrating from standard Akka actors to Akka Streams with back pressure and throttling

We have the following logic implemented to manage jobs targeting different backends:
A Manager Actor is started. This actor:
Loads the configuration required to target each backend (mutable map backend name -> backend connector configuration);
Loads a pool of Actors (RoundRobinPool) to handle the jobs for each backend (mutable map backend name -> RoundRobinPool Actor Ref)
When a request is received by the Manager actor, it retrieves the backend name from the message and forwards it to the corresponding pool of Actors to handle the job (assuming a configuration for this backend was registered). The result of the job request is then returned from the actor to the original sender (which is why we use forward).
This logic works very well, but with the backends being slow to handle jobs, we are in a typical case of fast publisher, slow consumer, and this raises issues when the load increases.
After doing some research, Akka Streams seems the way to go, as it allows implementing back pressure and throttling, which would be perfect for our usage (for example, limiting to 5 requests per second).
The idea is to keep the Manager Actor with the same routing logic but replace the pools of Actors with a Source.queue.
When registering the Source.queue, this would be performed like this:
val queue = Source
  .queue[RunBackendRequest](0, OverflowStrategy.backpressure)
  .throttle(5, 1.second)
  .map(r => runBackendRequest(r))
  .toMat(Sink.ignore)(Keep.left)
  .run()
Where the definition of RunBackendRequest is:
case class RunBackendRequest(originalSender: ActorRef, backendConnector: BackendConnector, request: BackendRequest)
And the function runBackendRequest is defined as such:
private def runBackendRequest(runRequest: RunBackendRequest): Unit = {
  val connector = BackendConnectorFactory.getBackendConnector(
    configuration.underlying, runRequest.backendConnector.toConfig(), materializer, environment.asJava)

  Future { connector.doSomeWork(runRequest.request) } map { result =>
    runRequest.originalSender ! Success(result)
  } recover {
    case e: Exception => runRequest.originalSender ! Failure(e)
  }
}
When the Manager Actor receives a message, it will 'offer' it to the correct queue based on the name of the target backend contained in the message.
Therefore, I have a few questions:
Is this the correct way to use Akka Streams in this particular use case, or could it be written differently and more efficiently?
Is it OK to provide the ActorRef of the original sender in the RunBackendRequest object so that the request can be answered from within the Flow?
Is there a way to retrieve the result of the Flow into a Future instead, so that the Manager actor could then return the result of the request itself?
Akka Streams seems to be very powerful, but there is clearly a learning curve!
It feels to me that having the Manager Actor creates a single point of failure. Maybe worth a try:
The original sender keeps hammering an Akka Streams graph instead of the Manager actor. Make sure you pass the ActorRef downstream so that the reply can be sent back.
Inside the graph, use either partition-then-merge or Substreams to process requests that target different backend connectors (a rough sketch follows below).
Either as the last step of the graph or after the backend connectors have finished, answer the original sender.
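A rough sketch of the partition-then-merge idea (RunBackendRequest and runBackendRequest are from the question; the backend list, the name field on BackendConnector and the per-backend throttle are assumptions):
import akka.NotUsed
import akka.stream.FlowShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, Partition}
import scala.concurrent.duration._

val backendNames: Vector[String] = Vector("backendA", "backendB") // assumption

def backendFlow(backend: String): Flow[RunBackendRequest, Unit, NotUsed] =
  Flow[RunBackendRequest]
    .throttle(5, 1.second)    // per-backend rate limit
    .map(runBackendRequest)   // replies to originalSender inside, as in the question

val routedFlow: Flow[RunBackendRequest, Unit, NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    // route each request to the lane of its backend (assumes the backend name is always known)
    val partition = b.add(Partition[RunBackendRequest](
      backendNames.size, r => backendNames.indexOf(r.backendConnector.name)))
    val merge = b.add(Merge[Unit](backendNames.size))

    backendNames.indices.foreach { i =>
      partition.out(i) ~> backendFlow(backendNames(i)) ~> merge.in(i)
    }

    FlowShape(partition.in, merge.out)
  })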
Overall, Colin's article is a great introduction on how to use Akka Streams with Partition and Merge to achieve your goal.
Let me know if you need more clarification and I can update my answer accordingly.

Kafka Streams: mix-and-match PAPI and DSL KTable not co-partitioning

I have a mix-and-match Scala topology where the main worker is a PAPI processor, and other parts are connected through DSL.
EventsProcessor:
INPUT: eventsTopic
OUTPUT: visitorsTopic (and others)
Data throughout the topics (incl. the original eventsTopic) is partitioned through a key, let's call it DoubleKey, that has two fields.
Visitors are sent to visitorsTopic through a Sink:
.addSink(VISITOR_SINK_NAME, visitorTopicName,
  DoubleKey.getSerializer(), Visitor.getSerializer(), visitorSinkPartitioner, EVENT_PROCESSOR_NAME)
In the DSL, I create a KV KTable over this topic:
val visitorTable = builder.table(
  visitorTopicName,
  Consumed.`with`(DoubleKey.getKafkaSerde(), Visitor.getKafkaSerde()),
  Materialized.as(visitorStoreName))
which I later connect to the EventProcessor:
topology.connectProcessorAndStateStores(EVENT_PROCESSOR_NAME, visitorStoreName)
Everything is co-partitioned (via DoubleKey). visitorSinkPartitioner performs a typical modulo operation:
Math.abs(partitionKey.hashCode % numPartitions)
In the PAPI processor EventsProcessor, I query this table to see if there are existent Visitors already.
However, in my tests (using EmbeddedKafka, but that should not make a difference), if I run them with one partition, all is fine (the EventsProcessor checks the KTable on two events with the same DoubleKey, and on the second event - with some delay - it can see the existing Visitor in the store), but if I run it with a higher number of partitions, the EventsProcessor never sees the value in the store.
However, if I check the store via the API (iterating store.all()), the record is there. So I understand it must be going to a different partition.
Since the KTable should work on the data of its own partition, and everything is sent to the same partition (using explicit partitioners calling the same code), the KTable should get that data on the same partition.
Are my assumptions correct? What could be happening?
KafkaStreams 1.0.0, Scala 2.12.4.
PS. Of course it would work to do the puts in the PAPI, creating the store through the PAPI instead of StreamsBuilder.table(), since that would definitely use the same partition where the code runs, but that's out of the question.
Yes, the assumptions were correct.
In case it helps anyone:
I had a problem when passing the Partitioner to the Scala EmbeddedKafka library. In one of the test suites it was not done right.
Now, following the ever-healthy practice of refactoring, I have this method used in all the suites of this topology.
def getEmbeddedKafkaTestConfig(zkPort: Int, kafkaPort: Int): EmbeddedKafkaConfig = {
  val producerProperties = Map(
    ProducerConfig.PARTITIONER_CLASS_CONFIG -> classOf[DoubleKeyPartitioner].getCanonicalName)
  EmbeddedKafkaConfig(kafkaPort = kafkaPort, zooKeeperPort = zkPort,
    customProducerProperties = producerProperties)
}
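For completeness, a hypothetical sketch of what such a producer-side partitioner could look like (the actual DoubleKeyPartitioner is not shown here; the important part is that it applies the same modulo logic as visitorSinkPartitioner, so that test producers and the Streams sink agree on partitions):
import java.util
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class DoubleKeyPartitioner extends Partitioner {

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions: Int = cluster.partitionCountForTopic(topic)
    Math.abs(key.hashCode % numPartitions) // same logic as visitorSinkPartitioner
  }

  override def configure(configs: util.Map[String, _]): Unit = ()

  override def close(): Unit = ()
}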

Calling a rest service from Spark

I'm trying to figure out the best approach to call a REST endpoint from Spark.
My current approach (solution [1]) looks something like this -
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)
lazy val restEndPoint = new restEndPointCaller() // lazy evaluation of the object which creates the connection to REST. lazy vals are also initialized once per JVM (executor)
val enrichedDf = repartitionedDf
  .map(rec => restEndPoint.getResponse(rec)) // calls the rest endpoint for every record
  .toDF
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it looks like Spark optimizes the repartition -> map into a mapPartitions anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
  val restEndPoint = new restEndPointCaller // creates a REST client per partition
  val newPartition = partition.map(record => {
    restEndPoint.getResponse(record)
  }).toList // consumes the iterator, thus calls getResponse for every record
  restEndPoint.close() // close the connection here
  newPartition.iterator // create a new iterator
})
In this third approach (solution [3]), a connection is created once per JVM (executor) and reused across all partitions processed by that executor.
lazy val connection = new DbConnection // intended to create one connection per JVM (executor)
val newDs = myDs.mapPartitions(partition => {
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // consumes the iterator, thus calls readMatchingFromDB
  newPartition.iterator // create a new iterator
})
connection.close() // close the db connection here
[a] With Solutions [1] and [3], which are very similar, is my understanding of how lazy val works correct? The intention is to restrict the number of connections to 1 per executor/JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the REST endpoint?
[c] Please let me know if there are better and more efficient ways to do this.
Thanks!
IMO the second solution with mapPartitions is better. First, you explicitly tell what you're expecting to achieve. The name of the transformation and the implemented logic say it pretty clearly. For the first option you need to be aware of how Apache Spark optimizes the processing. And it may be obvious to you just now, but you should also think about the people who will work on your code, or simply about yourself in 6 months, 1 year, 2 years and so forth. And they will understand mapPartitions better than repartition + map.
Moreover, maybe the optimization of repartition with map will change internally (I don't believe it will, but you can still consider it a valid point), and at that moment your job will perform worse.
Finally, with the 2nd solution you avoid a lot of problems that you can encounter with serialization. In the code you wrote, the driver will create one instance of the endpoint object, serialize it and send it to the executors. So yes, maybe it'll be a single instance, but only if it's serializable.
[edit]
Thanks for the clarification. You can achieve what you are looking for in different manners. To have exactly 1 connection per JVM you can use a design pattern called singleton. In Scala it's expressed pretty easily as an object (the first link I found on Google: https://alvinalexander.com/scala/how-to-implement-singleton-pattern-in-scala-with-object).
And that's pretty good because you don't need to serialize anything. The singletons are read directly from the classpath on the executor side. With it you're sure to have exactly one instance of a given object.
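A minimal sketch of that singleton idea (restEndPointCaller, getResponse and myDs are from the question; the holder object name is an assumption):
object RestClientHolder {
  // initialized lazily, once per executor JVM, the first time it is touched on the executor;
  // nothing is serialized from the driver
  lazy val restEndPoint = new restEndPointCaller()
}

val enrichedDs = myDs.mapPartitions { partition =>
  partition.map(record => RestClientHolder.restEndPoint.getResponse(record))
}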
[a] With Solutions [1] and [3] which are very similar, is my understanding of how lazy val work correct? The intention is to restrict the number of connections to 1 per executor/JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
It'll create 1 connection per partition. You can execute this small test to see that:
class SerializationProblemsTest extends FlatSpec {

  val conf = new SparkConf().setAppName("Spark serialization problems test").setMaster("local")
  val sparkContext = SparkContext.getOrCreate(conf)

  "lazy object" should "be created once per partition" in {
    lazy val restEndpoint = new NotSerializableRest()
    sparkContext.parallelize(0 to 120).repartition(12)
      .mapPartitions(numbers => {
        //val restEndpoint = new NotSerializableRest()
        numbers.map(nr => restEndpoint.enrich(nr))
      })
      .collect()
  }
}

class NotSerializableRest() {
  println("Creating REST instance")
  def enrich(id: Int): String = s"${id}"
}
It should print Creating REST instance 12 times (# of partitions)
[b] Are there ways by which I can control the number of requests (RPS) we make to the rest endpoint?
To control the number of requests you can use an approach similar to database connection pools: an HTTP connection pool (one quickly found link: HTTP connection pooling using HttpClient).
But maybe another valid approach would be to process smaller subsets of data? So instead of taking 30000 rows to process, you can split them into smaller micro-batches (if it's a streaming job). That should give your web service a little bit more "rest".
Otherwise you can also try to send bulk requests (Elasticsearch does this to index/delete multiple documents at once: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). But it's up to the web service to allow you to do so.
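If the service does accept bulk calls, batching inside each partition could look roughly like this (a sketch only: getBulkResponse is a hypothetical bulk method and the batch size of 100 is arbitrary):
val bulkEnriched = myDs.mapPartitions { partition =>
  val restEndPoint = new restEndPointCaller // per-partition client, as in solution [2]
  partition
    .grouped(100)                                          // batches of up to 100 records
    .flatMap(batch => restEndPoint.getBulkResponse(batch)) // hypothetical: one HTTP call per batch
}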

Apache Flink - implementing a stream processor with potentially very large state

I wish to project a potentially very large state from a stream of events. This is how I might implement this in an imperative fashion:
class ImperativeFooProcessor {

  val state: mutable.Map[UUID, BarState] = mutable.HashMap.empty[UUID, BarState]

  def handle(event: InputEvent) = {
    event match {
      case FooAdded(fooId, barId) => {
        // retrieve relevant state and do some work on it
        val barState = state(barId)
        // let the world know about what may have happened
        publish(BarOccured(fooId, barId))
        // or maybe rather
        publish(BazOccured(fooId, barId))
      }
      case FooRemoved(fooId, barId) => {
        // retrieve relevant state and do some work on it
        val barState = state(barId)
        // let the world know about what may have happened
        publish(BarOccured(fooId, barId))
        // or maybe rather
        publish(BazOccured(fooId, barId))
      }
    }
  }

  private def publish(event: OutputEvent): Unit = {
    // push event to downstream sink
  }
}
In the worst case the size of BarState will grow with the number of times it's been mentioned by FooAdded.
The number of unique barIds is very small relative to the total number of events for each barId.
How would I begin to represent this processing structure in Flink?
How do I work with the fact that each BarState can potentially get very large?
Flink maintains state in so-called state backends. There are state backends (MemoryStateBackend and FsStateBackend) that operate on the JVM heap of the worker processes. These backends are not suited to handle large state.
Flink also features a RocksDBStateBackend, which is based on RocksDB. RocksDB is used as a local database (no need to set it up as an external service) on each worker node and writes state data to disk. Hence, it can handle very large state exceeding the memory.
Flink offers a KeyedStream, which is a stream that is partitioned on a certain attribute. In your case, you probably want all accesses to the same id to go to the same state instance, so you would use barId as the key. Then the state is partitioned across all parallel worker threads based on the barId. This is basically a distributed key-value store or map. So you would not need to represent the state as a map, because it is automatically distributed by Flink.
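A hedged sketch of that keyed-state idea with Flink's Scala DataStream API (InputEvent, OutputEvent, BarState, FooAdded/FooRemoved, BarOccured/BazOccured are the question's types; the key extraction and the update logic are assumptions):
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

class FooProcessor extends RichFlatMapFunction[InputEvent, OutputEvent] {

  // one BarState per key (barId), kept in the configured state backend (e.g. RocksDB)
  private var barState: ValueState[BarState] = _

  override def open(parameters: Configuration): Unit =
    barState = getRuntimeContext.getState(
      new ValueStateDescriptor[BarState]("barState", classOf[BarState]))

  override def flatMap(event: InputEvent, out: Collector[OutputEvent]): Unit = {
    val current = Option(barState.value()) // state for this barId only
    event match {
      case FooAdded(fooId, barId) =>
        // ... derive the new state from `current`, then barState.update(...)
        out.collect(BarOccured(fooId, barId))
      case FooRemoved(fooId, barId) =>
        // ... derive the new state from `current`, then barState.update(...)
        out.collect(BazOccured(fooId, barId))
    }
  }
}

// usage (key extraction assumed):
// val output = events
//   .keyBy {
//     case FooAdded(_, barId)   => barId
//     case FooRemoved(_, barId) => barId
//   }
//   .flatMap(new FooProcessor)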