Monix looks like great framework but documentation is very sparse.
What is alsoTo analogue of akka-streams in monix ?
Basically I want stream to be consumed by two consumers.
Monix follows the Rx model in that subscriptions are dynamic. Any Observable supports an unlimited number of subscribers:
val obs = Observable.interval(1.second)
val s1 = obs.dump("O1").subscribe()
val s2 = obs.dump("O2").subscribe()
There is a catch however — Observable is by default what is called a "cold data source", meaning that each subscriber gets its own data source.
So for example, if you had an Observable that reads from a File, then each subscriber would get its own file handle.
In order to "share" such an Observable between multiple subscribers, you have to convert it into a hot data source, to share it. You do so with the multicast operator and its versions, publish being most commonly used. These give you back a ConnectableObservable, that needs a connect() call to start the streaming:
val shared = obs.publish
// Nothing happens here:
val s1 = shared.dump("O1").subscribe()
val s2 = shared.dump("O2").subscribe()
// Starts actual streaming
val cancelable = shared.connect()
// You can subscribe after connect(), but you might lose events:
val s3 = shared.dump("O3").subscribe()
// You can unsubscribe one of your subscribers, but the
// data source keeps the stream active for the others
s1.cancel()
// To cancel the connection for all subscribers:
cancelable.cancel()
PS: monix.io is a work in progress, PRs are welcome 😀
Related
I have to write a simple service (Record ingester service) via which I need to consume messages present on apache pulsar and store them to elastic store and for that I am using com.sksamuel.pulsar4s.akka.
Messages on pulsar is produced by another service which is Record pump service.
Both these services are to be deployed separately, in production.
Here is my source:
private val source = committableSource(consumerFn)
The above code works fine and its able to consume message from pulsar and write to ES.
However, I am not sure if we should be using MessageId.earliest when creating source
private val source = committableSource(consumerFn, Some(MessageId.earliest))
While testing, I found pros and cons of both that is without using MessageId.earliest
and with using MessageId.earliest, but none of them are suitable for production (as per my opinion).
1. Without using MessageId.earliest:
a. This adds a constraint that Record ingester service has to be up before we start Record pump service.
b. If my record ingester service goes down (due to an issue or due to maintenance), the messages produced on pulsar by record pump service will not get consumed after the ingester service is back up. This means that messages produced during the time, ingester service is down never gets consumed.
So, I think the logic is that only those messages will be consumed which will be put on pulsar
AFTER the consumer has subscribed to that topic.
But, I don't think its acceptable in production for the reason mentioned in point a and point b.
2. With MessageId.earliest:
Point a and b mentioned above are solved with this but -
When we use this, any time my record ingester service comes back up (after downtime or maintenance), it starts consuming all messages since the very beginning.
I have the logic that records with same id gets overwritten at ES side, so it really doesn't do any harm but still I don't think this is the right way - as there would be millions of messages on that topic and it will everytime consume messages that are already consumed (which is a waste).
This also to me is unacceptable in production.
Can anyone please help me out in what configuration to use which solves both cases.
I tried various configurations such as using subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest)
but no luck.
Complete code:
//consumer
private val consumerFn = () =>
pulsarClient.consumer(
ConsumerConfig(
subscriptionName = Subscription.generate,
topics = Seq(statementTopic),
subscriptionType = Some(SubscriptionType.Shared)
)
)
//create source
private val source = committableSource(consumerFn)
//create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map {
committableSourceMessage =>
val message = committableSourceMessage.message
val obj: MyObject = MyObject.parseFrom(message.value)
WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
ElasticsearchFlow.create(
indexName = "myindex",
typeName = "_doc",
settings = ElasticsearchWriteSettings.Default,
StringMessageWriter
)
)
source.via(intermediateFlow).run()
What you would want is some form of compaction. See the Pulsar docs for details. You can make consumption compaction-aware with
ConsumerConfig(
// other consumer config options as before
readCompacted = Some(true)
)
There's a discussion in the Pulsar docs about the mechanics of compaction. Note that enabling compaction requires that writes to the topic be keyed, which may or may not have happened in the past.
Compaction can be approximated in Akka in a variety of ways, depending on how many distinct keys to compact on are in the topic, how often they're superceded by later messages, etc. The basic idea would be to have a statefulMapConcat which keeps a Map[String, T] in its state and some means of flushing the buffer.
A simple implementation would be:
Flow[CommittableMessage[Array[Byte]].map { csm =>
Option(MyObject.parseFrom(csm.message.value))
}
.keepAlive(1.minute, () => None)
.statefulMapConcat { () =>
var map: Map[String, MyObject] = Map.empty
var count: Int = 0
{ objOpt: Option[MyObject] =>
objOpt.map { obj =>
map = map.updated(obj.id, obj)
count += 1
if (count == 1000) {
val toEmit = map.values.toList
count = 0
map = Map.empty
toEmit
} else Nil
}.getOrElse {
val toEmit = map.values.toList
count = 0
map = Map.empty
toEmit
}
}
A more involved answer would be to create an actor corresponding to each object (cluster sharding may be of use here, especially if there are likely to be a lot of objects) and having the ingest from Pulsar send the incoming messages to the relevant actor, which then schedules a write of the latest message received to Elasticsearch.
One thing to be careful around with this is not committing offsets until you're sure the message (or a successor which supercedes it) has been written to Elasticsearch. If doing the actor per object approach, Akka Persistence may be of use: the basic strategy would be to commit the offset once the actor has acknowledged receipt (which occurs after persisting an event e.g. to Cassandra).
I'm trying to figure out the best approach to call a Rest endpoint from Spark.
My current approach (solution [1]) looks something like this -
val df = ... // some dataframe
val repartitionedDf = df.repartition(numberPartitions)
lazy val restEndPoint = new restEndPointCaller() // lazy evaluation of the object which creates the connection to REST. lazy vals are also initialized once per JVM (executor)
val enrichedDf = repartitionedDf
.map(rec => restEndPoint.getResponse(rec)) // calls the rest endpoint for every record
.toDF
I know I could have used .mapPartitions() instead of .map(), but looking at the DAG, it looks like spark optimizes the repartition -> map to a mapPartition anyway.
In this second approach (solution [2]), a connection is created once for every partition and reused for all records within the partition.
val newDs = myDs.mapPartitions(partition => {
val restEndPoint = new restEndPointCaller /*creates a db connection per partition*/
val newPartition = partition.map(record => {
restEndPoint.getResponse(record, connection)
}).toList // consumes the iterator, thus calls readMatchingFromDB
restEndPoint.close() // close dbconnection here
newPartition.iterator // create a new iterator
})
In this third approach (solution [3]), a connection is created once per JVM (executor) reused across all partitions processed by the executor.
lazy val connection = new DbConnection /*creates a db connection per partition*/
val newDs = myDs.mapPartitions(partition => {
val newPartition = partition.map(record => {
readMatchingFromDB(record, connection)
}).toList // consumes the iterator, thus calls readMatchingFromDB
newPartition.iterator // create a new iterator
})
connection.close() // close dbconnection here
[a] With Solutions [1] and [3] which are very similar, is my understanding of how lazy val work correct? The intention is to restrict the number of connections to 1 per executor/ JVM and reuse the open connections for processing subsequent requests. Will I be creating 1 connection per JVM or 1 connection per partition?
[b] Are there any other ways by which I can control the number of requests (RPS) we make to the rest endpoint ?
[c] Please let me know if there are better and more efficient ways to do this.
Thanks!
IMO the second solution with mapPartitions is better. First, you explicitly tells what you're expecting to achieve. The name of the transformation and the implemented logic tell it pretty clearly. For the first option you need to be aware of the how Apache Spark optimizes the processing. And it's maybe obvious to you just now but you should also think about the people who will work on your code or simply about you in 6 months, 1 year, 2 years and so fort. And they should understand better the mapPartitions than repartition + map.
Moreover maybe the optimization for repartition with map will change internally (I don't believe in it but you can still consider is as a valid point) and at this moment your job will perform worse.
Finally, with the 2nd solution you avoid a lot of problems that you can encounter with the serialization. In the code you wrote the driver will create one instance of the endpoint object, serialize it and send to the executors. So yes, maybe it'll be a single instance but only if it's serializable.
[edit]
Thanks for clarification. You can achieve what are you looking for in different manners. To have exactly 1 connection per JVM you can use a design pattern called singleton. In Scala it's expressed pretty easily as an object (the first link I found on Google https://alvinalexander.com/scala/how-to-implement-singleton-pattern-in-scala-with-object)
And that it's pretty good because you don't need to serialize anything. The singletons are read directly from the classpath on the executor side. With it you're sure to have exactly one instance of given object.
[a] With Solutions [1] and [3] which are very similar, is my
understanding of how lazy val work correct? The intention is to
restrict the number of connections to 1 per executor/ JVM and reuse
the open connections for processing subsequent requests. Will I be
creating 1 connection per JVM or 1 connection per partition?
It'll create 1 connection per partition. You can execute this small test to see that:
class SerializationProblemsTest extends FlatSpec {
val conf = new SparkConf().setAppName("Spark serialization problems test").setMaster("local")
val sparkContext = SparkContext.getOrCreate(conf)
"lazy object" should "be created once per partition" in {
lazy val restEndpoint = new NotSerializableRest()
sparkContext.parallelize(0 to 120).repartition(12)
.mapPartitions(numbers => {
//val restEndpoint = new NotSerializableRest()
numbers.map(nr => restEndpoint.enrich(nr))
})
.collect()
}
}
class NotSerializableRest() {
println("Creating REST instance")
def enrich(id: Int): String = s"${id}"
}
It should print Creating REST instance 12 times (# of partitions)
[b] Are there ways by which I can control the number of requests (RPS)
we make to the rest endpoint ?
To control the number of requests you can use an approach similar to database connection pools: HTTP connection pool (one quickly found link: HTTP connection pooling using HttpClient).
But maybe another valid approach would be the processing of smaller subsets of data ? So instead of taking 30000 rows to process, you can split it into different smaller micro-batches (if it's a streaming job). It should give your web service a little bit more "rest".
Otherwise you can also try to send bulk requests (Elasticsearch does it to index/delete multiple documents at once https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). But it's up to the web service to allow you to do so.
An app that I am developing requires/gives the users the ability to create and define arbitrary streams at runtime. I understand that in Akka streams in particular
Materialisation = Execute or Run
My questions
1) Should materialisation for a stream be done only once? i.e if it is already materialised then can I use the value for subsequent runs?
2) As said above, maybe I misunderstood the term materialisation. If a stream has to run, it is materialised each time?
I am confused because in the docs, it says materialisation actually creates the resources needed for stream execution. So my immediate understanding is that it has to be done only once. Just like a JDBC connection to a database. Can someone please explain in a non-akka terminology.
Yes, a stream can be materialized multiple times. And yes, if a stream is run multiple times, it is materialized each time. From the documentation:
Since a stream can be materialized multiple times, the materialized value will also be calculated anew for each such materialization, usually leading to different values being returned each time. In the example below we create two running materialized instance of the stream that we described in the runnable variable, and both materializations give us a different Future from the map even though we used the same sink to refer to the future:
// connect the Source to the Sink, obtaining a RunnableGraph
val sink = Sink.fold[Int, Int](0)(_ + _)
val runnable: RunnableGraph[Future[Int]] =
Source(1 to 10).toMat(sink)(Keep.right)
// get the materialized value of the FoldSink
val sum1: Future[Int] = runnable.run()
val sum2: Future[Int] = runnable.run()
// sum1 and sum2 are different Futures!
Think of a stream as a reusable blueprint that can be run/materialized multiple times. A materializer is required to materialize a stream, and Akka Streams provides a materializer called ActorMaterializer. The materializer allocates the necessary resources (actors, etc.) and executes the stream. While it is common to use the same materializer for different streams and multiple materializations, each materialization of a stream triggers the resource allocation needed to run the stream. In the example above, sum1 and sum2 use the same blueprint (runnable) and the same materializer, but they are the results of distinct materializations that incurred distinct resource allocations.
I am trying to understand better Akka Streams concept on the following example. Consider a bank account. It has a past transaction history and there will be new transactions coming. Now we want to use it as a source for an Akka stream. But its data will be used in 3 different scenarios:
A consumer app collects all past transactions and prints a report.
A consumer app is a transaction monitor that prints all new transaction starting with a time the app started.
A consumer app combines functions of (1) and (2): it first prints all past transactions and then prints all arriving transactions.
What do we have here in terms of Akka streams? Is the difference in stream sources that feed otherwise same flows and sinks with different data? Or is the source the same (it's all transactions from the same bank account) but we need to apply different filtering operations to obtain different results?
Akka stream Sources can be combined like any other Iterable that exists within scala.
Based on your example, say we have historic transactions that are persisted in a database. We could use something like slick streaming to get those transactions from the db:
val historicSource : Source[Transaction, _] = ???
There would also be realtime transactions (possibly coming from a messaging system):
val realtimeSource : Source[Transaction, _] = ???
These two Sources can be combined:
val combinedSource = historicSource ++ realtimeSource
Those combined events can then be used by the same stream processing logic; for example you could println any transaction over $1,000.00:
val isLargeTransaction = (_ : Transaction).dollarAmount > 1000.0
val reportTransaction = (transaction : Transaction) =>
println s"Large Transaction: $transaction"
combinedSource.filter(isLargeTransaction)
.runWith(Sink foreach reportTransaction)
I am trying to create an observable that meets the following requirements:
1) When the first client subscribes then the observable needs to connect to some backend service, and push out an initial value
2) When successive clients subscribe then the observable should push out a new value
3) When the final client disposes then the observable should disconnect from the backend system.
4) The backend service also calls OnNext regularly with other messages
So far I have something like below. I can't work out how I can react to each subscribe call but only call the disposer on the final dispose.
var o = Observable.Create((IObserver<IModelBrokerEvent> observer) =>
{
observer.OnNext(newValue);
_backendThingy.Subscribe(observer.OnNext);
return Disposable.Create(() =>
{
_backendThingy.Unsubscribe(observer.OnNext);
});
}
_observable = Observable.Defer(() => o).Publish().RefCount();
There are several ways to do things similar to what you are talking about, but without exact semantics I am limited to proving a few generic solutions...
Replay Subject
The simplest is as such:
var source = ...;
var n = 1
var shared = source.Replay(n).RefCount();
The Replay operator ensures that each new subscription receives the latest n values from the source Observable. However, it does not re-invoke any subscription logic to the source Observable to achieve this. In effect, assuming the source stream has emitted values already, subsequent subscriptions will receive the previous n values synchronously upon subscription. RefCount does what you might think it should do: Connect the Replay upon the first subscription, and dispose of the connection upon the last unsubscribe.
Bidirection Communication via Proxy
Replay solves the most common use case, in that the source stream is capable of keeping itself up-to-date relatively well. However, if the source stream is updated only periodically, and new subscriptions should constitute an immediate update, then you may want to get more fancy:
var service = ...;
var source = service.Publish().RefCount();
var forceUpdate = Observable.Defer(() => Observable.Start(service.PingForUpdate));
var shared = Observable.Merge(source, forceUpdate);
Where the subscription to the server constitutes a new connection, and the PingForUpdate method indicates to the service that it's consumer would like to have a new value ASAP (which then forces the service to output said value.
Merging Periodic Updates with Initial Latest Value
Using the bidirectional communication method denotes that all consumers of this service will receive the latest value from the source upon any new subscription. However, it may be possible that we only want the latest value for the new subscriber, and all other consumers should receive values on the periodic basis.
For this, we can change the code a bit.
var service = ...;
var source = service.Publish().RefCount();
var latest = Observable.Defer(() => service.GetLatestAsObservable());
var shared = Observable.Merge(source, forceUpdate);
Where the subscription to the server constitutes a new connection, and the GetLatestAsObservable method simply retrieves the latest value from the service asynchronously via an Observable.
However, with this method you must also choose how to handle race conditions, such that you request the latest value, but a newer value is yielded before the request for the latest value is returned.