Kafka Streams vs Kafka Connect for a Kafka-to-HBase ETL pipeline

I have a straightforward scenario for an ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from the topic.
I'm considering two options:
Use Kafka Streams to read data from the topic and write each record via the native HBase client
Use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect that it will degrade performance.
The Kafka HBase connector is supported only by a third-party developer; I'm not sure about the code quality of this solution, nor about the option to add custom aggregation logic over data from the topic.

I myself have been trying to search for ETL options for Kafka to HBase, and so far my research tells me that it's not a good idea to have external system interaction within a Kafka Streams application (check the answer here and here). Kafka Streams is super powerful and great if you have a Kafka -> transform message -> Kafka kind of use case, and eventually you can have Kafka Connect take your data from a Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect plugin for HBase, one option is to write something yourself using the Connect API; the other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write to the sink, commit the batch and move on.
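For the consumer route, a minimal sketch of that poll-write-commit loop might look like the following. The topic name, HBase table name and column family are placeholders, and batching/error handling are omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaToHBase {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "hbase-writer");
        props.put("enable.auto.commit", "false"); // commit only after a successful write
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Configuration hbaseConf = HBaseConfiguration.create();
        try (Connection hbase = ConnectionFactory.createConnection(hbaseConf);
             Table table = hbase.getTable(TableName.valueOf("events"));          // placeholder table
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            consumer.subscribe(Collections.singletonList("events-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Put put = new Put(Bytes.toBytes(record.key()));               // row key = message key
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),   // "d" is a placeholder column family
                                  Bytes.toBytes(record.value()));
                    table.put(put);
                }
                consumer.commitSync(); // commit the batch once it is safely in HBase
            }
        }
    }
}

Committing offsets only after the HBase writes succeed gives at-least-once delivery; re-putting the same row key with the same value on a replay is effectively idempotent, so duplicates are harmless.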

Related

RabbitMQ and KStreams for Data Aggregation

I'm trying to solve the problem of data denormalization before indexing to Elasticsearch. Right now, my Postgres 11 database is configured with the pgoutput plugin, and Debezium with the PostgreSQL connector is streaming the log changes to RabbitMQ, which are then aggregated by doing a reverse lookup on the DB and fed to Elasticsearch.
Although this works okay, the lookup at the app layer to aggregate the data is expensive and takes a lot of execution time (the query is already refined, but it has about 10 joins, which makes it slow).
The other alternative I explored was to use KStreams for data aggregation. My knowledge of Apache Kafka is minimal, and thus I'm here. My question is: is it a requirement to have Apache Kafka as the broker to be able to utilize the Java KStreams API, or can it be leveraged with any broker such as RabbitMQ? I'm unsure about this because all the articles talk about Kafka topics and key-value pairs, which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Streams is only for Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?).
You can use Spark, Flink, or Beam for stream processing on other message queues, but Debezium requires Kafka, so start with tools around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.

How to get data from Kafka into a store without Kafka Connect sink?

When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I needed to get data from Kafka into a search index like Elasticsearch or into analytics systems like Hadoop or Spark, where a Kafka Connect sink is available.
But my question is what is the best way to handle a store that isn't as popular say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
Just have the API consume from Kafka and use the MyImaginaryDB driver to write
Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault-tolerance, transforms and post-processing needed before landing in MyImaginaryDB)
I have also been reading about Kafka KSQL and Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing sink connector doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale-out, parallelism, etc.) and leaves you to implement just the bit that gets the data into MyImaginaryDB.
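To make that concrete, a sink connector boils down to a SinkConnector class plus a SinkTask that does the writing. A rough sketch of the task side, where MyImaginaryDbClient stands in for whatever hypothetical client the store's API provides:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.Collection;
import java.util.Map;

public class MyImaginaryDbSinkTask extends SinkTask {
    private MyImaginaryDbClient client; // hypothetical client for the store's API

    @Override
    public String version() {
        return "0.1.0";
    }

    @Override
    public void start(Map<String, String> props) {
        // hypothetical: open an authenticated connection using the connector config
        client = new MyImaginaryDbClient(props.get("myimaginarydb.url"),
                                         props.get("myimaginarydb.api.key"));
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // transform the Connect record into whatever the store's API expects
            client.write(record.key(), record.value());
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        client.flush(); // make buffered writes durable before offsets are committed
    }

    @Override
    public void stop() {
        client.close();
    }
}

The framework takes care of consuming the topic, deserialising records, distributing and restarting tasks, and committing offsets after flush(); you only supply the write logic.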
As regards transformations, the standard pattern is either:
Use Single Message Transform for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing transformation + data sink, then you're munging together responsibilities and reinventing a chunk of the wheel that already exists (integration with an external system in a reliable, scalable way).
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect

What should I use: Kafka Streams, the Kafka consumer API, or Kafka Connect?

I would like to know what would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?
I want to read data from a topic, then do some processing and write to a database. So I have written consumers, but I feel I can write a Kafka Streams application and use its stateful processors to perform any changes and write to the database, which would eliminate my consumer code and leave only the DB code to write.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed JSON)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors (I don't want to write Avro), and creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect plugin to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or a Kafka Connect plugin.
And which will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried the HDFS sink for JSON but was not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not, but I would assume you forgot to set schemas.enable=false on the converter (i.e. value.converter.schemas.enable=false).
When I set org.apache.kafka.connect.storage.StringConverter it works, but it writes the JSON object in string-escaped format. For example, {"name":"hello"} is written into HDFS as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
The processing I want to do is basic validation and a few field-value transformations.
Kafka Streams or the consumer API is capable of validation. Connect is capable of Single Message Transforms (SMTs).
In some use cases, you need to "duplicate" the data onto Kafka: process your "raw" topic, read it using a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
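As a rough illustration of that raw-to-cleaned pattern with Kafka Streams (topic names and the validation/transformation steps here are placeholders):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class RawToCleaned {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "raw-to-cleaned");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw");

        raw.filter((key, value) -> value != null && !value.isEmpty()) // basic validation
           .mapValues(String::trim)                                   // placeholder field transformation
           .to("cleaned");                                            // Kafka Connect picks it up from here

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}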
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please make posts with a precise question rather than asking for opinions - this makes the site clearer, and opinions are not answers (and are subject to personal preferences). Asking "How to use Kafka Connect with JSON" or similar would fit this site.
Also, please show some research.
The least overhead would be a plain Kafka consumer - Kafka Streams and Kafka Connect use the Kafka consumer underneath, so you will always be able to get less overhead that way, but you will also lose all the benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? Then, you can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then, you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use another format for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

Real Time event processing

I really want to get an architectural solution for my scenario below.
I have a source of events (say sensors in oil wells, around 50,000) that produce events to a server. At the server side I want to process all these events in such a way that the information from the sensors about the latest humidity, temperature, pressure, etc. is stored/updated in a database.
I am confused between Flume and Kafka.
Can somebody please address my simple scenario in architectural terms?
I don't want to store the events anywhere, since I am already updating the database with the latest values.
Do I really need Spark, i.e. (Flume/Kafka) + Spark, to handle the processing side?
Can we do any kind of processing using Flume without a sink?
It sounds like you need to use the Kafka producer API to publish the events to a topic, then simply read those events either with the Kafka consumer API to write to your database, or with the Kafka JDBC sink connector.
Also, if you only need the latest data inside Kafka, take a look at log compaction.
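A minimal sketch of the producer side, assuming string-serialised JSON readings and a hypothetical sensor-readings topic; keying each record by sensor ID is what lets a compacted topic retain only the latest reading per sensor:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // key = sensor id, value = latest reading (placeholder payload)
            String sensorId = "well-00042";
            String reading = "{\"humidity\":41.2,\"temperature\":77.5,\"pressure\":1013.0}";
            producer.send(new ProducerRecord<>("sensor-readings", sensorId, reading));
        }
    }
}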
One way would be to push all the messages to a Kafka topic. Using Spark Streaming you can then ingest and process them directly from that Kafka topic.

Kafka -> Flink DataStream -> MongoDB

I want to set up Flink so it transforms and redirects the data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example (https://github.com/apache/flink).
Kafka streams are being properly read by Flink, and I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example I've found about MongoDB integration is flink-mongodb-test from GitHub. Unfortunately it uses a static data source (database), not a DataStream.
I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't.
What would be the best way to achieve this? Do I need to write a custom sink function, or am I missing something? Maybe it should be done in a different way?
I'm not tied to any solution, so any suggestion would be appreciated.
Below is an example of exactly what I'm getting as input and what I need to store as output.
Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>
Flink: DataStream<String>
    .map(value -> {
        // e.g. "AAABBBCCCDDD" -> "A: AAA; B: BBB; C: CCC; D: DDD"
        return convertTo(value);
    })
    .rebalance()
    .addSink(new MongoDBSinkFunction()); // store the row in a MongoDB collection
As you can see in this example, I'm using Flink mostly for buffering Kafka's message stream and doing some basic parsing.
As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the content of a topic into your MongoDB database.
Kafka -> Flink -> Kafka -> Mongo/Anything
With this approach you can maintain at-least-once semantics.
There is currently no Streaming MongoDB sink available in Flink.
However, there are two ways for writing data into MongoDB:
Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the batch API) with streaming. Using the HadoopOutputFormatWrapper of Flink, you can use the official MongoDB Hadoop connector.
Implement the sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library.
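For the second option, a rough sketch of such a sink using the MongoDB Java driver (the connection string, database and collection names are assumptions, and the exact SinkFunction signature varies between Flink versions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.bson.Document;

public class MongoDBSinkFunction extends RichSinkFunction<String> {
    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        client = MongoClients.create("mongodb://localhost:27017");      // connection string is an assumption
        collection = client.getDatabase("mydb").getCollection("rows");  // database/collection names are assumptions
    }

    @Override
    public void invoke(String value, Context context) {
        Document doc = Document.parse(value); // assumes the transformed message is a JSON string
        // upsert keyed on a stable id keeps replays idempotent under at-least-once delivery
        collection.replaceOne(Filters.eq("_id", doc.get("_id")), doc,
                              new ReplaceOptions().upsert(true));
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}

You would then attach it with .addSink(new MongoDBSinkFunction()), as in the pseudocode above; the upsert is what makes replays harmless given the at-least-once behaviour described below.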
Neither approach provides sophisticated processing guarantees on its own. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: in an error case, the data is streamed again to the MongoDB sink.
If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
If you really need exactly-once semantics for MongoDB, you should probably file a JIRA in Flink and discuss with the community how to implement this.