Reading data from Kafka and storing it in DynamoDB - apache-kafka

I need to read data from multiple topics on a Kafka broker and store it in DynamoDB.
Is there any reference code or a specific method I can follow?
I tried https://github.com/shikhar/kafka-connect-dynamodb, but I couldn't get far with it as I am new to this.

One option for reading from Kafka and writing to DynamoDB is NiFi.
Use the ConsumeKafka NiFi processor as the consumer:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-0-9-nar/1.5.0/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka/
and PutDynamoDB to write:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-aws-nar/1.5.0/org.apache.nifi.processors.aws.dynamodb.PutDynamoDB/
NiFi also makes it easy to add quick transformations, forking, etc.
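If you would rather write the copy step by hand instead of using NiFi, the core of it is small. The following is a minimal Python sketch, not a reference implementation: the `record_id` key layout, the topic names and the sample payloads are all made up for illustration, and the Kafka and DynamoDB clients are replaced by plain Python stand-ins so that it runs without a broker or an AWS account.

```python
import json
from decimal import Decimal


def to_dynamo_item(topic, partition, offset, value_bytes):
    """Turn one Kafka record into a dict shaped like a DynamoDB item."""
    # parse_float=Decimal because boto3 rejects Python floats in items.
    payload = json.loads(value_bytes.decode("utf-8"), parse_float=Decimal)
    return {
        # Hypothetical key schema: topic-partition-offset is unique per record.
        "record_id": f"{topic}-{partition}-{offset}",
        "topic": topic,
        "payload": payload,
    }


def copy_records(records, put_item):
    """Write every record through put_item and return how many were written."""
    written = 0
    for topic, partition, offset, value_bytes in records:
        put_item(to_dynamo_item(topic, partition, offset, value_bytes))
        written += 1
    return written


# Stand-in data from two topics (hypothetical names and payloads).
sample = [
    ("orders", 0, 41, b'{"id": 1, "total": 9.5}'),
    ("users", 2, 7, b'{"id": 99, "name": "ana"}'),
]
stored = []
count = copy_records(sample, stored.append)
print(count, stored[0]["record_id"])
```

In a real setup the tuples would come from a consumer subscribed to several topics, and `put_item` would wrap something like boto3's `table.put_item(Item=item)`. Both of those are assumptions about your tooling, not part of the sketch.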

Related

Kafka Connect vs Apache NiFi

Good afternoon. My question is pretty simple: I'm new to Apache Kafka, but I'm doing some work with it as part of my internship, which is why I came up with this question.
I will provide as much context as I can, and I hope someone can help clear up my doubts.
I was asked to develop a pipeline (or workflow), first using Apache NiFi.
This pipeline consisted of the following.
I fetched data from a local MySQL database using NiFi and sent it to a Kafka topic. The raw data was then cleaned using the Kafka client with Java (KStream, KTable and some regular expressions) and sent to another Kafka topic.
Once the processing was done, the new data was read again using Apache Nifi, and then sent to a new MySQL table.
I provide a picture for better understanding:
[Figure: General Pipeline]
After that, I was asked to do the same using Kafka Connect instead of Apache NiFi, which was even shorter: I only had to use the source connector to read the data from the MySQL database and send it to one Kafka topic, then process it with the Kafka client in Java and send it to a new Kafka topic. Finally, I used the sink connector to write the processed data from the new topic straight to a new table in the database.
Someone in charge then asked me when I should use Apache NiFi + Kafka instead of Kafka Connect + Kafka, and honestly I have no idea.
So let's assume the most important goal here is to apply data enrichment, and consider two scenarios:
one where I have data from different sources but none of it is streaming data, and one where some of the data is streaming and some is not.
In both cases, all of it needs to be processed, integrated, cleaned and finally unified to apply data enrichment.
Given that context, my questions are:
When should I use NiFi with Kafka, and when not? Why?
When should I use Kafka Connect with Kafka, and when not? Why?
I think I have a basic idea, and I have been reading in order to answer this myself, but honestly I haven't come up with an acceptable answer or a clear idea of when to use each one.
So, I would really appreciate your help.

Design questions considering Kafka Streams and Spring Cloud Stream

I need to maintain external systems records (KTables) and track any change on those records (KStreams).
The KTables will be requested by KSQL queries, while the KStreams will be handled by an event monitor.
Questions:
I need the KTables to work like mirrors of the external systems. Will I have any data storage problems with this design, such as data loss or expiration?
Using Spring, what is the best approach for the data type? Avro with a schema registry?
The source of everything is a Topic, right? So I will need to send messages to Topics, and my KTable and KStream would translate as needed. Is that right?
The KTable definitions are known, but I may have a group of KStreams being created dynamically; what is the best way to achieve this?
I appreciate any comments that could help me design this better.
Here are my suggestions/opinions on the questions. You might want to do further research into some of the core Kafka Streams related questions.
It is not entirely clear what use case/design you are proposing. The way I understood it, you have an external system (such as a database) and you want to extract that data as key/value pairs that can be translated into a KTable. In Kafka Streams, as you indicated in your question #3, the source of truth is the Kafka topic. Therefore, you need to bring the data from the external system into a Kafka topic first, and then materialize that as a KTable in Kafka Streams. There are established patterns such as Change Data Capture (CDC) for exporting data from external systems to a Kafka topic in near real time. A KTable can be materialized into a state store, which by default is backed by RocksDB. The same information is also replicated to Kafka changelog topics, so the guarantees that apply to data in a Kafka topic apply here as well. I hope that someone from the Kafka Streams team can chime in on this specific topic with more information if needed.
Spring Cloud Stream provides a binder for Kafka Streams with which you can establish bindings to Kafka topics through various Kafka Streams types such as KStream, KTable and GlobalKTable. See the reference docs for more details. The binder provides several convenient options for data types, with Serde inference for common data types. The question about Avro really depends on your use cases and how you want to manage the schema structure for the data. If centralized schema management is a concern, then Avro is a good choice. You can use Confluent's schema registry for Avro with Spring Cloud Stream. Spring provides a schema registry too, but for Kafka Streams workloads that require Avro, we recommend using the Confluent schema registry as it has more features. Either way, it should work and we provide a number of sample applications demonstrating schema evolution here.
As I mentioned in the answer for #1, yes, the source of truth is Kafka topics, and the Spring Cloud Stream binder provides binding mechanisms for connecting to Kafka topics and translating the data into a KStream or KTable.
Here again, I am not following the actual use case. However, Kafka Streams provides many different API methods which allow you to transform the incoming data so that other KStream types can be created dynamically. For instance, you can apply a map or flatMap operation on the incoming KStream and thus create a new KStream from it. Not sure if that is what you meant. If that is the case, then it really becomes a business logic concern. This is certainly possible.
Hope this helps. Once again, these are my thoughts on these questions, and for some of them there is no right or wrong answer. You need to consider the use case and design options carefully and choose the path that fits your needs.
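To make the "topic as source of truth" point from #1 concrete: conceptually, a KTable is what you get by replaying a topic of keyed change events and keeping only the latest value per key. The sketch below illustrates that idea in plain Python. It is not Kafka Streams code; the keys and events are invented, and `None` is used here to play the role of a delete marker (tombstone).

```python
def materialize(changelog):
    """Fold a changelog of (key, value) events into the latest value per key."""
    table = {}
    for key, value in changelog:
        if value is None:
            # A None value plays the role of a tombstone: the key is deleted.
            table.pop(key, None)
        else:
            table[key] = value
    return table


# Hypothetical change events captured from an external system.
events = [
    ("user-1", {"email": "a@example.com"}),
    ("user-2", {"email": "b@example.com"}),
    ("user-1", {"email": "a.new@example.com"}),  # update overwrites
    ("user-2", None),                            # delete removes
]

table = materialize(events)
print(table)
```

The `events` list corresponds to the stream view (every change is visible), while `table` corresponds to the table view (only the current state per key). Both are derived from the same underlying sequence.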

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have a straightforward scenario for an ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from a topic.
I am considering two options:
Use Kafka Streams to read data from a topic and then write each record via the native HBase driver
Use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect it will degrade performance.
The Kafka HBase connector is supported only by a third-party developer; I'm not sure about the code quality of this solution or about the option to add custom aggregation logic over data from a topic.
I have been searching for ETL options from Kafka to HBase myself. So far my research tells me that it's not a good idea to interact with an external system from within a Kafka Streams application (check the answers here and here). Kafka Streams is super powerful and a great fit if you have a Kafka -> transform message -> Kafka kind of use case, and you can then have Kafka Connect take your data from the Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect connector for HBase, one option is to write something yourself using the Connect API. The other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write to the sink, commit the batch and move on.
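The "poll, write to sink, commit, move on" loop can be sketched as follows. This is only an illustration in Python: `FakeConsumer` is a made-up in-memory stand-in rather than a real Kafka client, and the sink is a plain list. The point it shows is the ordering: commit only after the sink has accepted the batch, so a failed write gets retried instead of lost (at-least-once delivery).

```python
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer (hypothetical, for illustration)."""

    def __init__(self, messages, batch_size):
        self._messages = list(messages)
        self._batch_size = batch_size
        self._position = 0   # next offset to hand out
        self.committed = 0   # last committed offset

    def poll(self):
        start = self._position
        batch = self._messages[start:start + self._batch_size]
        self._position += len(batch)
        return batch

    def commit(self):
        self.committed = self._position


def run(consumer, write_batch):
    """Poll, write to the sink, then commit; stop when a poll returns nothing."""
    while True:
        batch = consumer.poll()
        if not batch:
            break
        write_batch(batch)  # if this raises, the offset is not committed
        consumer.commit()   # commit only after the sink accepted the batch


sink = []
consumer = FakeConsumer(["a", "b", "c", "d", "e"], batch_size=2)
run(consumer, sink.extend)
print(sink, consumer.committed)
```

With a real client you would also want to disable auto-commit so that offsets only advance where the code says so, and make the sink write idempotent, since a batch can be delivered twice after a failure.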

What should I use: Kafka Stream or Kafka consumer api or Kafka connect

I would like to know what would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?
I want to read data from a topic, do some processing, and write to a database. I have written consumers, but I feel I could write a Kafka Streams application and use its stateful processor to perform any changes and write to the database, which would eliminate my consumer code so that I only have to write the database code.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors. (I don't want to write Avro.) Creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka connector to do this?
So I need your opinion on whether I should write a Kafka consumer, write a Kafka Streams application, or use Kafka Connect.
And what will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried the HDFS sink for JSON but was not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enable=false on the converter (for example, value.converter.schemas.enable=false).
When I set org.apache.kafka.connect.storage.StringConverter it works, but it writes the JSON object in string-escaped format. For example, {"name":"hello"} is written into HDFS as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
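That escaping is what you get whenever an already-serialized JSON document is treated as a plain string and then serialized as JSON again. A small Python sketch of the effect (it illustrates the general behavior, not the connector's own code):

```python
import json

raw = '{"name":"hello"}'

# Treating the value as an opaque string and serializing it as JSON
# wraps it in quotes and escapes the inner quotes.
as_string = json.dumps(raw)

# Parsing it first and serializing the resulting object keeps the structure.
as_object = json.dumps(json.loads(raw), separators=(",", ":"))

print(as_string)
print(as_object)
```

So the data is not corrupted, only double-encoded: parsing the escaped form once gives back the original JSON text.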
The processing I want to do is basic validation and a few field value transformations
Kafka Streams or the Consumer API is capable of validation. Connect supports Single Message Transforms (SMT).
In some use cases, you need to "duplicate data" in Kafka: process your "raw" topic by reading it with a consumer, then produce the result into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
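As a rough idea of what the "raw" to "cleaned" step could look like for basic validation plus a field transformation, here is a Python sketch. The rule that `name` must be a non-empty string and the lower-casing are invented for illustration, and the two topics are represented by plain lists instead of real Kafka topics.

```python
import json


def clean(raw_value):
    """Return a cleaned JSON string, or None when the record is invalid."""
    try:
        record = json.loads(raw_value)
    except json.JSONDecodeError:
        return None
    # Basic validation: hypothetical rule that "name" is a non-empty string.
    name = record.get("name") if isinstance(record, dict) else None
    if not isinstance(name, str) or not name.strip():
        return None
    # Field transformation: normalize whitespace and case.
    record["name"] = name.strip().lower()
    return json.dumps(record, sort_keys=True)


raw_topic = [
    '{"name": "  Hello "}',
    'not json',
    '{"name": ""}',
    '{"name": "World", "n": 1}',
]
cleaned_topic = [c for c in map(clean, raw_topic) if c is not None]
print(cleaned_topic)
```

Invalid records are simply dropped here. In practice you might prefer to route them to a separate error topic so they can be inspected later.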
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please post a precise question rather than asking for opinions: this keeps the site clearer, and opinions are not answers (they depend on each person's preferences). Asking "How to use Kafka Connect with JSON" or similar would fit this site.
Also, please show some research.
The least overhead would be the Kafka consumer: Kafka Streams and Kafka Connect both use the Kafka consumer, so you can always get lower overhead with it, but you will also lose all the benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? Then, you can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then, you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use other formats for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

Build a solution for Kafka+Spark for RDBMS data

My current project is on mainframes with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect proposed a plan of using Kafka with Spark Streaming for processing data. How good is Kafka at reading RDBMS tables for data? Do we read the data directly from the tables using Kafka, or is there another way to get the data from the RDBMS into Kafka?
If there is a better solution, your suggestions would help a lot.
Do not read directly from the database; it will create additional load. I would suggest two approaches:
Send new data both to databases and to Kafka, or send it to Kafka and then consume for processing.
Read data from the database's write-ahead log (I know this is possible for MySQL with Maxwell, but I am not sure about DB2) and send it to Kafka for further processing.
You can use Spark Streaming or Kafka Streams depending on your needs.