I am trying to setup a data pipeline using Kafka.
Data go in (with producers), get processed, enriched and cleaned and move out to different databases or storage (with consumers or Kafka connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either producer or consumer.
Or you could setup an environment dedicated to something like Kafka Streams processes or a KSQL cluster
It is possible either ways.Consider all possible options , choose an option which suits you best. Lets assume you have a source, raw data in csv or some DB(Oracle) and you want to do your ETL stuff and load it back to some different datastores
1) Use kafka connect to produce your data to kafka topics.
Have a consumer which would consume off of these topics(could Kstreams, Ksql or Akka, Spark).
Produce back to a kafka topic for further use or some datastore, any sink basically
This has the benefit of ingesting your data with little or no code using kafka connect as it is easy to set up kafka connect source producers.
2) Write custom producers, do your transformations in producers before
writing to kafka topic or directly to a sink unless you want to reuse this produced data
for some further processing.
Read from kafka topic and do some further processing and write it back to persistent store.
It all boils down to your design choice, the thoughput you need from the system, how complicated your data structure is.
Related
I have used the Flink for sending data from source to sink.
My flink app consumes the data from Kafka and send to the destination.
The destination is also kafka topic which has a different topic name.
The Flink is only used for delivering purpose without having any business logic.
In this case, I think that changing the flink to Kafka Stream will increase the throughput. Because the flink has no contribution except for delivering data from source to sink. Also my source and sink uses the kafka so I think that kafka streams will be faster in case of the delivery data.
I would appreciate if you could give you any opinion for my question.
Thanks.
There's no guarantee one will be faster than the other. You still need to do JVM and network tuning.
Either will work, but the limitation of Kafka Streams is that the data must remain in the same Kafka cluster. Flink has no such limitation.
Or you can simply use MirrorMaker for moving data between Kafka topics of different clusters.
I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker ( mosquitto ) messages to my kafka.
Since every vehicle is sending the data in its own topic in MQTT broker within a single organisation, I would like to push all this data in kafka. Now I know it is not advisable to create so many topics in kafka ( more than a million ). Also I would like not like to save all the vehicles data in one kafka topic as I would like to later put all this data in S3, differentiated via vehicle id.
How can I achieve this without making so many topics in kafka. One way is the consumer of kafka segregate the events and put in s3 but I believe there will be a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, and Kafka Connect's Single Message Transform RegexRouter to modify the topic name to which messages are written, and other SMT to modify the message key. That way you get all the messages in one topic, partitioned based on the vehicle id. That's probably the best way to store it.
From there, you can use the data however you want. When it comes to stream it to S3, you can use Kafka Connect S3 sink and as cricket_007 mentioned partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket you could use a stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic to populate others.
See here for an example of the MQTT connector.
I have straightforward scenario for the ETL job: take data from Kafka topic and put it to HBase table. In the future i'm going to add the support for some logic after reading data from a topic.
I consider two scenario:
use Kafka Streams for reading data from a topic and further writing via native HBased driver each record
Use Kafka -> HBase connector
I have the next concerns about my options:
Is is a goo idea to write data each time it arrives in a Kafka Stream's window? - suggest that it'll downgrade performance
Kafka Hbase connector is supported only by third-party developer, i'm not sure about code quality of this solution and about the option to add custom aggregation logic over data from a topic.
I myself have been trying to search for ETL options for KAFKA to HBase, however, so far my research tells me that it's a not a good idea to have an external system interaction within a KAFKA streams application (check the answer here and here). KAFKA streams are super powerful and great if you have KAFKA->Transform_message->KAFKA kind of use case, and eventually you can have KAFKA connect that will take your data from KAFKA topic and write it to a sink.
Since you do not want to use the third party KAFKA connect for HBase, one option is to write something yourself using the connect API, the other option is to use the KAFKA consumer producer API and write the app using the traditional way, poll the messages, write to sink, commit the batch and move on.
I'm implementing a streaming pipeline that resembles the illustration below:
*K-topic1* ---> processor1 ---> *K-topic2* ---> processor2 -->
*K-topic3* ---> processor3 --> *K-topic4*
The K-topic components represent Kafka topics and the processor components code (Python/Java).
For the processor component, the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain as well as persistent store such as S3.
I have a question regarding the design approach.
The way I see it, each processor component should encapsulate both consumer and producer functionality.
Would the best approach be to have a Processor module/class that could contain KafkaConsumer and KafkaProducer classes ? To date, most examples I've seen have separate consumer and producer components which are run separately and would entail running double the number of components
as opposed to encapsulating producers & consumers within each Processor object.
Any suggestions/references are welcome.
This question is different from
Designing a component both producer and consumer in Kafka
as that question specifically mentions using Samza which is not the case here.
the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain
This is exactly the strength of Kafka Streams and/or KSQL. You could use the Processor API, but from what you describe, I think you'll only need the Streams DSL API
persist the results to the next topic in the chain as well as persistent store such as S3.
From the above topic, you can use a Kafka Connect Sink for getting the topic data into these other external systems. There is no need to write a consumer to do this for you.
I would like to know what would be best for me: Kafka stream or Kafka consumer api or Kafka connect?
I want to read data from topic then do some processing and write to database. So I have written consumers but I feel I can write Kafka stream application and use it's stateful processor to perform any changes and write it to database which can eliminate my consumer code and only have to write db code.
Databases I want to insert my records are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka connect but I have found there is no json support as of now for hdfs sink and jdbc sink connector.(I don't want to write in avro) and creating schema is also pain for complex nested messages.
Or should I write custom Kafka connect to do this.
So need you opinion on whether I should write Kafka consumer or Kafka stream or Kafka connect?
And what will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried HDFS sink for JSON but not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enabled=false.
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
Processing I want to do is basic validation and few field values transformation
Kafka Streams or Consumer API is capable of validation. Connect is capable of Simple Message Transforms (SMT)
Some use cases, you need to "duplicate data" onto Kafka; process your "raw" topic, read it using a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to stack overflow! Please take the tout https://stackoverflow.com/tour
Please make posts with precise question, not asking for opinions - this makes the site clearer, and opinions are not answers (and subject to every person preferences). Asking "How to use Kafka-connect with json" - or so would fit this site.
Also, please show some research.
Less overhead would be kafka consumer - kafka stream and kafka connect use kafka consumer, so you will always be able to make less overhead, but will also lose all benefits (tolerant to failures, easy of usage, support, etc)
First, it depends of what your processing is. Aggregation? Counting? Validation? Then, you can use kafka streams to do the processing and write the result to a new topic, on the format you want.
Then, you can use kafka connect to send the data to your database. You are not forced to use avro, you can use other format for key/value, see
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON