I need to insert large bulk data to elasticsearch regulary via scala code. When googling, I found to use logstash for large insertion rate but logstash doesn't have any java libraries or Api to call so I tried to connect to it via http client. I don't know it is a good approach to send large data with http protocol or better to use other approaches for example using broker, queues, redis, etc.
I know last versions of logstash(6.X,7.x) enables uses of persistent queue so it can be another solution to use logstash queue but again through http or tcp protocol.
Also note that reliability is the first priority for me since data must not be lost and there should be a mechanism to return response in code so to handle success or failure.
I would appreciate any ideas.
Update
It seems using http is robust and has acknowledgement mechanism based on here but if taking this approach, what http client libs in scala is more appropriate as I need to send bulk data in sequence of key value format and handle response in none-blocking way?
It may sound overkill but introducing a buffering layer between scala code and logstash may prove helpful as you can get rid of heavy HTTP calls and rely on lightweight protocol transport.
Consider adding Kafka between your scala code and logstash for queuing of messages. Logstash can reliably process the messages from Kafka using TCP transport and bulk insert into ElasticSearch. On the other hand, you can put messages into Kafka from your scala code in build (batches) to make the whole pipeline work efficiently.
With that being said, if you don't have a volume in let say 10,000 msgs/sec then you can also consider fiddling around with logstash HTTP input plugin by tweaking threads and using multiple logstash processes. This is to reduce the complexity of adding another moving piece (Kafka) into your architecture.
Related
There are several questions regarding message enrichment using external data, and the recommendation is almost always the same: ingest external data using Kafka Connect and then join the records using state stores. Although it fits in most cases, there are several other use cases in which it does not, such as IP to location and user agent detection, to name a few.
Enriching a message with an IP-based location usually requires a lookup by a range of IPs, but currently, there is no built-in state store that provides such capability. For user agent analysis, if you rely on a third-party service, you have no choices other than performing external calls.
We spend some time thinking about it, and we came up with an idea of implementing a custom state store on top of a database that supports range queries, like Postgres. We could also abstract an external HTTP or GRPC service behind a state store, but we're not sure if it is the right way.
In that sense, what is the recommended approach when you cannot avoid querying an external service during the stream processing, but you still must guarantee fault tolerance? What happens when an error occurs while the state store is retrieving data (a request fails, for instance)? Do Kafka Streams retry processing the message?
Generally, KeyValueStore#range(fromKey, toKey) is supported by build-in stores. Thus, it would be good to understand how the range queries you try to do are done? Also note, that internally, everything is stored as byte[] arrasy and RocksDB (default storage engine) sorts data accordingly -- hence, you can actually implement quite sophisticated range queries if you start to reason about the byte layout, and pass in corresponding "prefix keys" into #range().
If you really need to call an external service, you have "two" options to not lose data: if an external calls fails, throw an exception and let the Kafka Streams die. This is obviously not a real option, however, if you swallow error from the external lookup you would "skip" the input message and it would be unprocessed. Kafka Streams cannot know that processing "failed" (it does not know what your code does) and will not "retry", but consider the message as completed (similar if you would filter it out).
Hence, to make it work, you would need to put all data you use to trigger the lookup into a state store if the external call fails, and retry later (ie, do a lookup into the store to find unprocessed data and retry). This retry can either be a "side task" when you process the next input message, of you schedule a punctuation, to implement the retry. Note, that this mechanism changes the order in which records are processed, what might or might not be ok for your use case.
The naive approach for implementing the use case of enriching an incoming stream of events stored in Kafka with reference data - is by calling in map() operator an external service REST API that provides this reference data, for each incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have second events stream with reference data and store it in KTable that will be a lightweight embedded "database" then join main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
KTable<String, Object> enrichedEventStream = eventStream
.leftJoin(referenceDataTable , (event, referenceData) -> /* return the enriched event */)
.map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent)
.to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. Service that is called from the map() operator should be capable of handling high load too and also highly-available. These are extra requirements for the service implementation. But if the service satisfies these criteria can the "naive" approach be used?
Yes, it is ok to do RPC inside Kafka Streams operations such as map() operation. You just need to be aware of the pros and cons of doing so, see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into details here why; if needed, I'd suggest to create a new question).
Pros of doing RPC calls from within Kafka Streams operations:
Your application will fit more easily into an existing architecture, e.g. one where the use of REST APIs and request/response paradigms is common place. This means that you can make more progress quickly for a first proof-of-concept or MVP.
The approach is, in my experience, easier to understand for many developers (particularly those who are just starting out with Kafka) because they are familiar with doing RPC calls in this manner from their past projects. Think: it helps to move gradually from request-response architectures to event-driven architectures (powered by Kafka).
Nothing prevents you from starting with RPC calls and request-response, and then later migrating to a more Kafka-idiomatic approach.
Cons:
You are coupling the availability, scalability, and latency/throughput of your Kafka Streams powered application to the availability, scalability, and latency/throughput of the RPC service(s) you are calling. This is relevant also for thinking about SLAs.
Related to the previous point, Kafka and Kafka Streams scale very well. If you are running at large scale, your Kafka Streams application might end up DDoS'ing your RPC service(s) because the latter probably can't scale as much as Kafka. You should be able to judge pretty easily whether or not this is a problem for you in practice.
An RPC call (like from within map()) is a side-effect and thus a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
Example: Kafka Streams (by default) processes data based on event-time (= based on when an event happened in the real world), so you can easily re-process old data and still get back the same results as when the old data was still new. But the RPC service you are calling during such reprocessing might return a different response than "back then". Ensuring the latter is your responsibility.
Example: In the case of failures, Kafka Streams will retry operations, and it will guarantee exactly-once processing (if enabled) even in such situations. But it can't guarantee, by itself, that an RPC call you are doing from within map() will be idempotent. Ensuring the latter is your responsibility.
Alternatives
In case you are wondering what other alternatives you have: If, for example, you are doing RPC calls for looking up data (e.g. for enriching an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can setup a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.
I suspect most of the advice you hear from the internet is along the lines of, "OMG, if this REST call takes 200ms, how wil I ever process 100,000 Kafka messages per second to keep up with my demand?"
Which is technically true: even if you scale your servers up for your REST service, if responses from this app routinely take 200ms - because it talks to a server 70ms away (speed of light is kinda slow, if that server is across the continent from you...) and the calling microservice takes 130ms even if you measure right at the source....
With kstreams the problem may be worse than it appears. Maybe you get 100,000 messages a second coming into your stream pipeline, but some kstream operator flatMaps and that operation in your app creates 2 messages for every one object... so now you really have 200,000 messages a second crashing through your REST server.
BUT maybe you're using Kstreams in an app that has 100 messages a second, or you can partition your data so that you get a message per partition maybe even just once a second. In that case, you might be fine.
Maybe your Kafka data just needs to go somewhere else: ie the end of the stream is back into a Good Ol' RDMS. In which case yes, there's some careful balancing there on the best way to deal with potentially "slow" systems, while making sure you don't DDOS yourself, while making sure you can work your way out of a backlog.
So is it an anti-pattern? Eh, probably, if your Kafka cluster is LinkedIn size. Does it matter for you? Depends on how many messages/second you need to drive, how fast your REST service really is, how efficiently it can scale (ie your new kstreams pipeline suddenly delivers 5x the normal traffic to it...)
I found Slick 3.0 introduced a new feature called streaming
http://slick.typesafe.com/doc/3.0.0-RC1/database.html#streaming
I'm not familiar with Akka. streaming seems a lazy or async value, but it is not very clear for me to understand why it is useful, and when will it be useful..
Does anyone have ideas about this?
So lets imagine the following use case:
A "slow" client wants to get a large dataset from the server. The client sends a request to the server which loads all the data from the database, stores it in memory and then passes it down to the client.
And here we're faced with problems: The client handles the data not so fast as we wanted => we can't release the memory => this may result in an out of memory error.
Reactive streams solve this problem by using backpressure. We can wrap Slick's publisher around the Akka source and then "feed" it to the client via Akka HTTP.
The thing is that this backpressure is propagated through TCP via Akka HTTP down to the publisher that represents the database query.
That means that we only read from the database as fast as the client can consume the data.
P.S This just a little aspect where reactive streams can be applied.
You can find more information here:
http://www.reactive-streams.org/
https://youtu.be/yyz7Keg1w9E
https://youtu.be/9S-4jMM1gqE
Not that it wouldn't be easy (or fun) enough to write one, it makes sense not to re-invent the wheel so to speak. I've had a look around at various attempts, but I don't seem to have yet come across an implementation that supports these criteria;
Simple queue OSS system with MongoDB persistence;
C# Driver (official) based (so full POCO serialization)
Tailable cursors rather than polling
handles message timeout (GC correctly)
handles consumer failure (ideally crash detecting re-insertion, but timeout with delayed re-insertion is fine) so findAndModify on complete
multiple writers, multiple consumers
threadsafe
Nice to have;
allows for (latest only) message (replace older messages in the Q)
If anyone has nice simple a library like that floating around on GitHub that I've not yet found, please speak up!
There's my little project - a .net message bus implementation that works with MS SQL queues or MongoDB (MongoDB support is a recent addition). Link: http://code.google.com/p/nginn-messagebus/ and http://nginn.org/blog for some examples.
I'm not sure if this is what you're looking for, it's also lacking in documentation and example departments and it doesn't exactly match your specs (polling instead of tailing) - but maybe it's worth giving a try. This is a publish-subscribe message bus, like NServiceBus or MassTransit - not a raw message queue.
PS I'm afraid there are mutually exclusive requirements in your specs: you can't use tailable cursor with concurrent consumers because you lose atomicity. If you want to tail a queue you should use only a single consumer.
I am a new starter in Flink, I have a requirement to read data from Kafka, enrich those data conditionally (if a record belongs to category X) by using some API and write to S3.
I made a hello world Flink application with the above logic which works like a charm.
But, the API which I am using to enrich doesn't have 100% uptime SLA, so I need to design something with retry logic.
Following are the options that I found,
Option 1) Make an exponential retry until I get a response from API, but this will block the queue, so I don't like this
Option 2) Use one more topic (called topic-failure) and publish it to topic-failure if the API is down. In this way it won't block the actual main queue. I will need one more worker to process the data from the queue topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time. For example, read a message from queue topic-failure try to enrich if it fails to push to the same queue called topic-failure and consume the next message from the queue topic-failure.
I prefer option 2, but it looks like not an easy task to accomplish this. Is there is any standard Flink approach available to implement option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka or some DB that could be integrated in the same Flink application as an additional source.
If you cannot do it (for example, API is external or data cannot be mapped easily to a data storage), both approaches are viable and they have different advantages.
1) Will allow you to retain the order of input events. If your downstream application expects orderness, then you need to retry.
2) The common term is dead letter queue (although more often used on invalid records). There are two easy ways to integrate that in Flink, either have a separate source or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source -\ Async IO /-> Filter good -> S3 sink
+-> Union -> with timeout -+
Kafka Source dead -/ (for API call!) \-> Filter bad -> Kafka sink dead