While processing a stream of Avro messages through Kafka and Spark, I am saving the processed data as documents in an Elasticsearch index.
Here's the code (simplified):
directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(avroRecord -> {
        byte[] encodedAvroData = avroRecord._2;
        MyType t = deserialize(encodedAvroData);

        // Creating the ElasticSearch Transport client
        Settings settings = Settings.builder()
            .put("client.transport.ping_timeout", 5, TimeUnit.SECONDS).build();
        TransportClient client = new PreBuiltTransportClient(settings)
            .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));

        // Document to index if none exists yet (id and name are not shown in this simplified version)
        IndexRequest indexRequest = new IndexRequest("index", "item", id)
            .source(jsonBuilder()
                .startObject()
                .field("name", name)
                .field("timestamp", new Timestamp(System.currentTimeMillis()))
                .endObject());

        // Update for an existing document, falling back to the index request above
        UpdateRequest updateRequest = new UpdateRequest("index", "item", id)
            .doc(jsonBuilder()
                .startObject()
                .field("name", name)
                .field("timestamp", new Timestamp(System.currentTimeMillis()))
                .endObject())
            .upsert(indexRequest);

        client.update(updateRequest).get();
        client.close();
    });
});
Everything works as expected; the only problem is performance. Saving to ES takes some time, and I suppose that this is because I open and close an ES Transport client for each RDD. The Spark documentation suggests that this approach is quite correct: as far as I understand, the only possible optimisation is using rdd.foreachPartition, but I only have one partition, so I am not sure that this would be beneficial.
Is there any other way to achieve better performance?
That is because you create a new connection for every record of the RDD.
So I think using foreachPartition will give better performance even with only one partition, because it lets you create the ES connection outside the loop and reuse it for every record.
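A minimal sketch of the difference, using only the JDK. FakeClient is a hypothetical stand-in for an expensive client such as the ES TransportClient (it is not the Elasticsearch API), and the nested lists stand in for the partitions of an RDD. It only counts how many clients each pattern opens.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ConnectionReuseSketch {

    // Hypothetical stand-in for a client that is expensive to open and close.
    static final class FakeClient implements AutoCloseable {
        static int opened = 0;

        FakeClient() {
            opened++;
        }

        void write(String document) {
            // A real client would send the document over the network here.
        }

        @Override
        public void close() {
            // A real client would release its connections here.
        }
    }

    // Pattern from the question: a new client for every record.
    static int perRecord(List<List<String>> partitions) {
        int before = FakeClient.opened;
        for (List<String> partition : partitions) {
            for (String record : partition) {
                try (FakeClient client = new FakeClient()) {
                    client.write(record);
                }
            }
        }
        return FakeClient.opened - before;
    }

    // Pattern from the answer: one client per partition, reused for all of its records.
    static int perPartition(List<List<String>> partitions) {
        int before = FakeClient.opened;
        for (List<String> partition : partitions) {
            try (FakeClient client = new FakeClient()) {
                Iterator<String> records = partition.iterator();
                while (records.hasNext()) {
                    client.write(records.next());
                }
            }
        }
        return FakeClient.opened - before;
    }

    public static void main(String[] args) {
        List<List<String>> onePartition = Arrays.asList(Arrays.asList("a", "b", "c", "d"));
        System.out.println("clients opened, per record:    " + perRecord(onePartition));
        System.out.println("clients opened, per partition: " + perPartition(onePartition));
    }
}
```

In Spark, the body of the outer loop in perPartition corresponds to the function passed to rdd.foreachPartition, which receives an Iterator over the records of one partition. Even with a single partition, the client is then opened once per batch instead of once per record.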
I would stream the processed messages back onto a separate Kafka topic, and then use Kafka Connect to land them in Elasticsearch. This decouples your Spark-specific processing from getting the data into Elasticsearch.
Example of it in action: https://www.confluent.io/blog/blogthe-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
So I have the following code where I am trying to get the KPL to set the partition key, so I can start sharding my stream.
def createSinkFromStaticConfig(stream: Option[String], region: Option[String]): FlinkKinesisProducer[String] = {
  val outputProperties = new Properties
  outputProperties setProperty(AWSConfigConstants.AWS_REGION, region.get)
  outputProperties setProperty("Region", region.get)
  outputProperties.put("RecordTtl", s"${Int.MaxValue}")
  outputProperties.put("ThreadPoolSize", "5")
  outputProperties.put("MaxConnections", "5")

  val sink = new FlinkKinesisProducer[String](new SimpleStringSchema, outputProperties)
  sink setDefaultStream stream.get
  sink setDefaultPartition "0"
  sink setCustomPartitioner new KinesisPartitioner[String]() {
    override def getPartitionId(element: String): String = {
      val epoch = LocalDateTime.now.toEpochSecond(ZoneOffset.UTC)
      epoch.toString
    }
  }
  sink setQueueLimit 500
  sink
}
So the sink, when called, does work and sends data to the stream. I have manually sharded the stream and have two consumers on it. I can see each consumer is being assigned to a different shard, but only one will get any work. Is there something I am doing wrong when setting the shard? Is there a way to validate which shard a record was sent to?
Thanks
So, found some fun.
So the above code does work, and submits to different shards. My problem was that the second task of my Kinesis consumer was somehow reading the parent shard instead of grabbing one of the available child shards. Once I spun up a third task, I had one for each child and one for the parent. Still no good, but at least I know it's not the producer!
EDIT:
For more clarification, this was a problem of my own making (aren't they all?). I had sharded and unsharded the stream a bunch of times while testing. This caused the parent to be considered still active and in need of processing. Even without all those shenanigans, it could have been a problem. See the link:
https://github.com/awslabs/amazon-kinesis-client/issues/786
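On the question of validating which shard a record lands on: Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The following is a minimal sketch of that lookup using only the JDK. The two ranges in twoEvenShards are an assumption for illustration (an even split of the key space); the real ranges are the StartingHashKey and EndingHashKey values reported for each shard when you describe the stream.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShardLookupSketch {

    // MD5 of the partition key, read as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Index of the first range {start, end} (both inclusive) containing the key's hash, or -1.
    static int shardIndex(String partitionKey, BigInteger[][] ranges) {
        BigInteger hash = hashKey(partitionKey);
        for (int i = 0; i < ranges.length; i++) {
            if (hash.compareTo(ranges[i][0]) >= 0 && hash.compareTo(ranges[i][1]) <= 0) {
                return i;
            }
        }
        return -1;
    }

    // Assumed layout for illustration: two shards that split [0, 2^128 - 1] in half.
    static BigInteger[][] twoEvenShards() {
        BigInteger half = BigInteger.ONE.shiftLeft(127);
        BigInteger max = BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE);
        return new BigInteger[][] {
            { BigInteger.ZERO, half.subtract(BigInteger.ONE) },
            { half, max }
        };
    }

    public static void main(String[] args) {
        // The partitioner in the question uses the current epoch second as the key.
        String partitionKey = Long.toString(System.currentTimeMillis() / 1000);
        System.out.println("partition key: " + partitionKey);
        System.out.println("hash key:      " + hashKey(partitionKey));
        System.out.println("shard index:   " + shardIndex(partitionKey, twoEvenShards()));
    }
}
```

Note that with an epoch-second partition key, every record produced within the same second maps to the same shard, so the distribution only evens out over time.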
Is there a way to do typical batch processing with Vert.x, such as providing a file or DB query as input and letting each record be processed by a verticle in a non-blocking way?
In the verticle examples, a server is defined at startup. Even though multiple verticles are deployed, the server is created only once, which means that the Vert.x engine has a built-in concept of a server and knows how to send incoming requests to each verticle for processing.
The same happens with the event bus.
But is there a way to define a verticle with a handler for processing data from a general stream (query, file, etc.)?
I am particularly interested in spreading data processing over cluster nodes.
One way I can think of is to execute a query the regular way and then publish the data to the event bus for processing. But that means that if I have to process a few million records, I will run out of memory. Of course I could do paging, etc., but there is no coordination between retrieving and processing the data.
Thanks
Andrius
If you are using the JDBC Client, you can stream the query result:
(using vertx-rx-java2)
JDBCClient client = ...;
// Query parameters are passed as a JsonArray
JsonArray params = new JsonArray().add(dataCategory);
client.rxQueryStreamWithParams("SELECT * FROM data WHERE data.category = ?", params)
    .flatMapObservable(SQLRowStream::toObservable)
    .subscribe(
        (JsonArray row) -> vertx.eventBus().send("data.process", row)
    );
This way each row is sent to the event bus. If you then have multiple verticle instances that each listen to this address, you spread the data processing across multiple threads.
If you are using another SQL client, have a look at its documentation; it may have a similar method.
I have to write a simple service (the Record ingester service) that consumes messages from Apache Pulsar and stores them in Elasticsearch. For that I am using com.sksamuel.pulsar4s.akka.
The messages on Pulsar are produced by another service, the Record pump service.
Both services are to be deployed separately in production.
Here is my source:
private val source = committableSource(consumerFn)
The above code works fine: it is able to consume messages from Pulsar and write them to ES.
However, I am not sure whether we should be using MessageId.earliest when creating the source:
private val source = committableSource(consumerFn, Some(MessageId.earliest))
While testing, I found pros and cons for both options (without and with MessageId.earliest), but in my opinion neither is suitable for production.
1. Without using MessageId.earliest:
a. This adds the constraint that the Record ingester service has to be up before we start the Record pump service.
b. If my Record ingester service goes down (due to an issue or maintenance), the messages produced on Pulsar by the Record pump service during the downtime are never consumed, even after the ingester service is back up.
So I think the logic is that only messages put on Pulsar AFTER the consumer has subscribed to the topic are consumed.
I don't think that is acceptable in production, for the reasons given in points a and b.
2. With MessageId.earliest:
Points a and b above are solved, but whenever my Record ingester service comes back up (after downtime or maintenance), it starts consuming all messages from the very beginning.
Records with the same id get overwritten on the ES side, so it doesn't really do any harm, but I still don't think this is the right way: there would be millions of messages on that topic, and every restart would re-consume messages that were already consumed, which is a waste.
To me this is also unacceptable in production.
Can anyone please help me find a configuration that solves both cases?
I tried various configurations, such as subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest), but no luck.
Complete code:
// consumer
private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared)
    )
  )

// create source
private val source = committableSource(consumerFn)

// create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map {
  committableSourceMessage =>
    val message = committableSourceMessage.message
    val obj: MyObject = MyObject.parseFrom(message.value)
    WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
  ElasticsearchFlow.create(
    indexName = "myindex",
    typeName = "_doc",
    settings = ElasticsearchWriteSettings.Default,
    StringMessageWriter
  )
)

source.via(intermediateFlow).run()
What you would want is some form of compaction. See the Pulsar docs for details. You can make consumption compaction-aware with
ConsumerConfig(
  // other consumer config options as before
  readCompacted = Some(true)
)
There's a discussion in the Pulsar docs about the mechanics of compaction. Note that enabling compaction requires that writes to the topic be keyed, which may or may not have happened in the past.
Compaction can be approximated in Akka in a variety of ways, depending on how many distinct keys to compact on are in the topic, how often they're superseded by later messages, etc. The basic idea would be to have a statefulMapConcat which keeps a Map[String, T] in its state, plus some means of flushing the buffer.
A simple implementation would be:
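import scala.concurrent.duration._ // for 1.minute

Flow[CommittableMessage[Array[Byte]]].map { csm =>
  Option(MyObject.parseFrom(csm.message.value))
}
  .keepAlive(1.minute, () => None) // inject None after a minute of silence to force a flush
  .statefulMapConcat { () =>
    // Latest object seen per id since the last flush
    var map: Map[String, MyObject] = Map.empty
    var count: Int = 0

    { objOpt: Option[MyObject] =>
      objOpt.map { obj =>
        map = map.updated(obj.id, obj)
        count += 1
        if (count == 1000) {
          // Flush after 1000 incoming messages
          val toEmit = map.values.toList
          count = 0
          map = Map.empty
          toEmit
        } else Nil
      }.getOrElse {
        // Keep-alive tick: flush whatever is buffered
        val toEmit = map.values.toList
        count = 0
        map = Map.empty
        toEmit
      }
    }
  }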
A more involved answer would be to create an actor corresponding to each object (cluster sharding may be of use here, especially if there are likely to be a lot of objects) and have the ingest from Pulsar send the incoming messages to the relevant actor, which then schedules a write of the latest message received to Elasticsearch.
One thing to be careful about here is not committing offsets until you're sure the message (or a successor which supersedes it) has been written to Elasticsearch. If you take the actor-per-object approach, Akka Persistence may be of use: the basic strategy would be to commit the offset once the actor has acknowledged receipt (which occurs after persisting an event, e.g. to Cassandra).
So I need to have a GlobalKTable containing the aggregation of several messages across many instances. Right now, my single instance KTable setup looks something like this:
final KTable<String, Double> aggregatedMetrics = eventStream
    .groupByKey(Serdes.String(), jsonSerde)
    .aggregate(
        () -> 0d,
        new MetricsAggregator(),
        Serdes.Double(),
        LOCAL_METRICS_STORE_NAME);
Obviously, this doesn't scale since each instance only has the updated metrics for the messages it has received, not for all of the messages received by all the other instances. I was thinking of using this:
final KStreamBuilder builder = new KStreamBuilder();
builder.globalTable(METRIC_CHANGES_TOPIC, METRICS_STORE_NAME);
and then just streaming updates to my aggregatedMetrics KTable to the METRIC_CHANGES_TOPIC, which would update the global table. However, each instance would just be overwriting the other instances' aggregations on each update to the global table.
Is there any way I can do a global aggregation?
The solution sounds correct to me.
This, however, does not sound correct:
"However, each instance would just be overwriting the other instances' aggregations on each update to the global table."
Note that aggregations are key-based. Different instances therefore aggregate different keys, so each instance only updates its own keys in the GlobalKTable.
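To illustrate why the instances do not overwrite each other, here is a minimal simulation using only the JDK. It is not the Kafka Streams API: ownerOf is a hypothetical partitioner, each local map stands in for one instance's KTable, and the global map stands in for the GlobalKTable fed from the changes topic. The only assumption that matters is that every key is routed to exactly one instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyedAggregationSketch {

    // Hypothetical partitioner: every key is owned by exactly one instance.
    static int ownerOf(String key, int instances) {
        return Math.floorMod(key.hashCode(), instances);
    }

    // Each event is {key, value}. Returns the simulated global table.
    static Map<String, Double> run(String[][] events, int instances) {
        // One local aggregate table per instance.
        List<Map<String, Double>> locals = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            locals.add(new HashMap<>());
        }

        Map<String, Double> global = new TreeMap<>();
        for (String[] event : events) {
            String key = event[0];
            double value = Double.parseDouble(event[1]);

            // Only the owning instance ever aggregates this key.
            Map<String, Double> local = locals.get(ownerOf(key, instances));
            double updated = local.merge(key, value, Double::sum);

            // The owner publishes its new aggregate; the global table keeps the latest value per key.
            global.put(key, updated);
        }
        return global;
    }

    public static void main(String[] args) {
        String[][] events = {
            { "cpu", "1.0" }, { "mem", "2.0" }, { "cpu", "3.0" }
        };
        System.out.println(run(events, 1));
        System.out.println(run(events, 3));
    }
}
```

Because a key has a single owner, the value written to the global table for that key is always the complete aggregate, regardless of how many instances are running.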
In the link it was suggested to create a connection pool that is available across multiple RDDs in the Spark Streaming job.
rdd.foreachPartition( iter => {
  val client = MongoClient(host, port)
  val col = client.getDatabase("testDataBase").getCollection("testCollection")
  // I am basically inserting the data in the iterator into the test collection
})
However, I was not able to figure out how to create a connection pool that returns a connection object to a MongoDB collection. I was able to use foreachPartition to create a single connection for the whole partition. Can someone please let me know how to create a connection object that is available across the executor for reuse?
The MongoDB Spark Connector internally uses broadcast variables to achieve this:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
So you should be able to share the MongoClient and connection pool across tasks.
The MongoDB Spark connector doesn't help with collecting exceptions. How do we do that? Also, if we insert in batches and one insert fails, it stops the remaining inserts. The Mongo Spark driver can insert multiple documents at once, and you can set ordered = false so that it still inserts the remaining documents even if some fail because of duplicates or timeouts.
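The difference between ordered and unordered batch inserts can be sketched with the JDK alone. This is a simulation, not the MongoDB driver: the Set is a hypothetical stand-in for a collection with a unique index on the id, and a rejected add stands in for a duplicate key error. The returned list plays the role of the collected write errors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BulkInsertSketch {

    // Inserts ids into the collection and returns the ids that failed.
    // ordered = true stops at the first failure; ordered = false keeps going.
    static List<String> insertMany(Set<String> collection, List<String> ids, boolean ordered) {
        List<String> failed = new ArrayList<>();
        for (String id : ids) {
            boolean inserted = collection.add(id); // false simulates a duplicate key error
            if (!inserted) {
                failed.add(id);
                if (ordered) {
                    break; // remaining documents are not attempted
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("a", "b", "c");

        Set<String> orderedTarget = new LinkedHashSet<>(Arrays.asList("b"));
        List<String> orderedFailures = insertMany(orderedTarget, batch, true);
        System.out.println("ordered:   stored=" + orderedTarget + " failed=" + orderedFailures);

        Set<String> unorderedTarget = new LinkedHashSet<>(Arrays.asList("b"));
        List<String> unorderedFailures = insertMany(unorderedTarget, batch, false);
        System.out.println("unordered: stored=" + unorderedTarget + " failed=" + unorderedFailures);
    }
}
```

With the real Java driver, the equivalent switch is the ordered flag on the insert options, and the per-document failures of an unordered batch are reported together in the bulk write exception, which is one place to collect them.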