I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker (Mosquitto) messages into Kafka.
Every vehicle in the organisation sends its data to its own topic on the MQTT broker, and I would like to push all of this data into Kafka. I know it is not advisable to create that many topics in Kafka (more than a million). I would also prefer not to store all the vehicles' data in one Kafka topic, because I later want to put this data in S3, separated by vehicle id.
How can I achieve this without creating so many topics in Kafka? One way is to have the Kafka consumer segregate the events and write them to S3, but I believe that would produce a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, Kafka Connect's RegexRouter Single Message Transform (SMT) to modify the name of the topic to which messages are written, and another SMT to set the message key. That way you get all the messages in one topic, partitioned by vehicle id. That's probably the best way to store it.
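As a rough illustration of the single-topic layout described above, the sketch below derives a message key from the MQTT topic and maps that key to a partition. The MQTT topic layout (`vehicles/<vehicle_id>/telemetry`), the Kafka topic name `vehicle-telemetry` and the partition count are assumptions made for the example, and `zlib.crc32` is only a stand-in for the hash used by Kafka's default partitioner (murmur2). The point is simply that every message from one vehicle gets the same key and therefore lands on the same partition.

```python
import re
import zlib
from typing import Tuple

# Assumed MQTT topic layout: vehicles/<vehicle_id>/telemetry (hypothetical).
MQTT_TOPIC_PATTERN = re.compile(r"^vehicles/(?P<vehicle_id>[^/]+)/telemetry$")

KAFKA_TOPIC = "vehicle-telemetry"  # hypothetical single Kafka topic
NUM_PARTITIONS = 12                # hypothetical partition count


def vehicle_key(mqtt_topic: str) -> str:
    """Extract the vehicle id from the MQTT topic to use as the Kafka message key."""
    match = MQTT_TOPIC_PATTERN.match(mqtt_topic)
    if match is None:
        raise ValueError(f"unexpected MQTT topic: {mqtt_topic}")
    return match.group("vehicle_id")


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition.

    crc32 is a stand-in here; Kafka's default partitioner hashes keys with murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(mqtt_topic: str) -> Tuple[str, str, int]:
    """Return (kafka_topic, key, partition) for a message arriving on mqtt_topic."""
    key = vehicle_key(mqtt_topic)
    return KAFKA_TOPIC, key, partition_for(key)
```

Calling `route("vehicles/KA01AB1234/telemetry")` repeatedly returns the same partition each time, which is what preserves per-vehicle ordering inside one topic.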
From there, you can use the data however you want. When it comes to streaming it to S3, you can use the Kafka Connect S3 sink and, as cricket_007 mentioned, partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or to different areas of the same bucket, you could use a stream processor (e.g. Kafka Streams or ksqlDB) to pre-process the topic and populate others.
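A time-partitioned S3 sink could be configured roughly as follows. This is a sketch that builds the JSON payload for the Kafka Connect REST API; the connector name, bucket, region and all numeric values are placeholders, and the property names are the ones I recall from the Confluent S3 sink documentation, so check them against the version you install.

```python
import json


def s3_sink_config(topic: str, bucket: str, region: str) -> dict:
    """Build a Kafka Connect payload for an S3 sink partitioned by hour."""
    return {
        "name": f"s3-sink-{topic}",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # One S3 "directory" per hour instead of one per vehicle.
            "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
            "partition.duration.ms": "3600000",
            "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
            "locale": "en-US",
            "timezone": "UTC",
            "timestamp.extractor": "Record",
            # Larger flush sizes mean fewer, bigger objects (illustrative values).
            "flush.size": "10000",
            "rotate.interval.ms": "600000",
        },
    }


if __name__ == "__main__":
    payload = s3_sink_config("vehicle-telemetry", "my-telemetry-bucket", "eu-west-1")
    print(json.dumps(payload, indent=2))
```

Raising `flush.size` or the rotation interval is the usual lever against many small files, at the cost of data reaching S3 later.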
See here for an example of the MQTT connector.
I'm implementing streaming with Kafka Connect in one of my projects. I have created an S3 sink connector that consumes messages from several topics (selected by regex) and writes files to S3. The topics are selected with the property below.
"topics.regex": "my-topic-sample\\.(.+)",
I have the 3 topics below. With the above property, the S3 sink connector consumes messages from these 3 topics and writes a separate file (for each topic) to S3.
my-topic-sample.test1
my-topic-sample.test2
my-topic-sample.test3
For now I am ignoring all invalid messages, but I want to implement a dead letter queue (DLQ).
We can achieve that using the below properties
'errors.tolerance'='all',
'errors.deadletterqueue.topic.name' = 'error_topic'
With the above properties, all invalid messages are moved to the DLQ. The issue is that there is only 1 DLQ even though the S3 sink connector consumes from 3 different topics, so invalid messages from all 3 topics are pushed to the same DLQ.
Is there a way to have multiple DLQs and write each message to a different DLQ based on the topic it was consumed from?
Thanks
You can only configure a single DLQ topic per connector; it's not possible to use a different DLQ topic for each source topic.
Today, if you want the records your connector fails to process to go to different DLQ topics, you need to start multiple connector instances, each consuming from a different source topic.
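The one-connector-per-topic workaround can be sketched by generating one configuration per source topic, each with its own DLQ. The connector names and the DLQ naming scheme (`error_topic.<suffix>`) are assumptions for illustration, and the remaining S3 settings are left out.

```python
import json

# The three source topics from the question.
SOURCE_TOPICS = [
    "my-topic-sample.test1",
    "my-topic-sample.test2",
    "my-topic-sample.test3",
]


def connector_for(topic: str) -> dict:
    """Build a Kafka Connect payload for one source topic with a dedicated DLQ."""
    suffix = topic.rsplit(".", 1)[-1]
    return {
        "name": f"s3-sink-{suffix}",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            # A single topic instead of topics.regex, so the DLQ maps 1:1.
            "topics": topic,
            "errors.tolerance": "all",
            "errors.deadletterqueue.topic.name": f"error_topic.{suffix}",  # hypothetical DLQ name
            # Bucket, format and flush settings omitted in this sketch.
        },
    }


if __name__ == "__main__":
    for topic in SOURCE_TOPICS:
        print(json.dumps(connector_for(topic), indent=2))
```

The trade-off is operational: three connectors to deploy and monitor instead of one.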
Apache Kafka is an open source project, so if you are up for the challenge you can propose implementing such a feature. See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals for the overall process.
I am using the Schema Registry with the RecordNameStrategy naming strategy, so I have events with totally different Avro schemas in the same Kafka topic.
I am doing that because I want to group logically related events that may have different data structures under the same topic, to preserve ordering for this data.
For instance:
user_created and user_mail_confirmed events might have different schemas, but it's important to keep them in the same topic partition to guarantee ordering for consumers.
I am trying to sink this data, coming from a single topic, into GCS under multiple paths (one path for each schema).
Does anyone know whether the Confluent Kafka Connect GCS sink connector (or any other connector) provides that feature?
I haven't used the GCS connector, but I suspect this is not possible with Confluent connectors in general.
You should probably copy your source topic, which mixes different data structures, into a new set of topics where each topic holds a single data structure. This is possible with ksqlDB or a Kafka Streams application. Then you can create connectors for these topics.
Alternatively, you can use RegexRouter transformation with a set of predicates based on the message headers.
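A header-based variant might look like the sketch below, which generates the transform and predicate properties for a list of event types. It assumes that producers attach a header whose key is the event type (for example a header named `user_created`), which is not something the question states, and it relies on the `HasHeaderKey` predicate that ships with Kafka Connect 2.6 and later. Some sink connectors track offsets by the original topic name, so it is worth testing that the chosen sink copes with renamed topics.

```python
from typing import Dict, List


def header_routing_config(event_types: List[str]) -> Dict[str, str]:
    """Build SMT properties that append the event type to the topic name.

    Each RegexRouter only runs when its predicate finds a header whose key
    equals the event type (hypothetical producer convention).
    """
    config = {
        "transforms": ",".join(f"route_{e}" for e in event_types),
        "predicates": ",".join(f"is_{e}" for e in event_types),
    }
    for event in event_types:
        transform = f"transforms.route_{event}"
        predicate = f"predicates.is_{event}"
        config[f"{transform}.type"] = "org.apache.kafka.connect.transforms.RegexRouter"
        config[f"{transform}.regex"] = "(.*)"
        config[f"{transform}.replacement"] = f"$1.{event}"
        config[f"{transform}.predicate"] = f"is_{event}"
        config[f"{predicate}.type"] = (
            "org.apache.kafka.connect.transforms.predicates.HasHeaderKey"
        )
        config[f"{predicate}.name"] = event
    return config


if __name__ == "__main__":
    for key, value in header_routing_config(["user_created", "user_mail_confirmed"]).items():
        print(f"{key}={value}")
```

With this in place, a record from topic `users` carrying the `user_created` header would be written as if it came from `users.user_created`, and the sink would place it under a path derived from that name.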
Is there any way to obtain lineage data for Kafka jobs?
For example, we have a Job History URL for all MapReduce jobs.
Is there anything similar for Kafka where I can get the metadata of the producer writing to a particular topic (e.g. the IP address of the producer)?
Out of the box, no, not for Producers.
You can list consumer client addresses, but you cannot see which producers are writing to a topic.
One option would be to look into Apache Atlas for lineage information, which is one component of Hortonworks' Streams Messaging Manager for analyzing Kafka connection information.
Otherwise, you would be stuck trying to enforce that producers send this data along with their messages.
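If you go the route of having producers send this data themselves, Kafka record headers are a natural place for it. The sketch below only builds the headers; the header names (`producer.app`, `producer.host`, and so on) are made up for the example, and the list of `(str, bytes)` tuples is the shape that Python clients such as kafka-python accept for the `headers` argument, as far as I know.

```python
import socket
import time
from typing import List, Tuple


def lineage_headers(app_name: str) -> List[Tuple[str, bytes]]:
    """Build record headers describing the producing application (hypothetical names)."""
    host = socket.gethostname()
    try:
        ip_address = socket.gethostbyname(host)
    except OSError:
        # Hostname may not resolve in containers or sandboxes.
        ip_address = "unknown"
    sent_at_ms = int(time.time() * 1000)
    return [
        ("producer.app", app_name.encode("utf-8")),
        ("producer.host", host.encode("utf-8")),
        ("producer.ip", ip_address.encode("utf-8")),
        ("producer.sent_at_ms", str(sent_at_ms).encode("utf-8")),
    ]
```

Consumers, or a small audit job, can then read these headers to reconstruct who wrote what, but nothing in Kafka verifies them, so this only works if every producer cooperates.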
I am trying to set up a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched and cleaned, and moves out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
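The ClientA role described above can be sketched without a broker. Here two in-memory deques stand in for the raw and cleaned topics, and the cleaning rule itself is invented for the example; with a real client library the `popleft` would be a consumer poll and the `append` a producer send.

```python
from collections import deque
from typing import Deque, Optional


def clean(record: dict) -> Optional[dict]:
    """Hypothetical cleaning step: drop records without an id, normalise the name."""
    if record.get("id") is None:
        return None
    return {"id": record["id"], "name": str(record.get("name", "")).strip().lower()}


def run_client_a(raw_topic: Deque[dict], clean_topic: Deque[dict]) -> int:
    """Consume everything currently in raw_topic and publish cleaned records.

    Returns the number of records published to clean_topic.
    """
    produced = 0
    while raw_topic:
        record = raw_topic.popleft()     # consumer side: read raw data
        cleaned = clean(record)
        if cleaned is not None:
            clean_topic.append(cleaned)  # producer side: publish cleaned data
            produced += 1
    return produced
```

ClientB would follow the same pattern, reading from the cleaned topic and writing enriched records to a third one.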
It can be part of either producer or consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data, in CSV or some DB (Oracle), and you want to do your ETL and load the result into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer that consumes from these topics (it could be Kafka Streams, KSQL, Akka or Spark).
Produce the results back to a Kafka topic for further use, or to some datastore; any sink, basically.
This has the benefit of ingesting your data with little or no code, as Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for further processing.
Read from the Kafka topic, do some further processing and write it back to a persistent store.
It all boils down to your design choices, the throughput you need from the system, and how complicated your data structures are.
I am looking for an architectural solution for the scenario below.
I have a source of events (say, around 50,000 sensors in oil wells) that sends events to a server. On the server side I want to process all these events so that the latest humidity, temperature, pressure, etc. reported by each sensor is stored or updated in a database.
I am unsure whether to use Flume or Kafka.
Can somebody please address my simple scenario in architectural terms?
I don't want to store the events anywhere, since I am already updating the database with the latest values.
Do I really need Spark, or (Flume/Kafka) + Spark, for the processing side?
Can we do any kind of processing using Flume without a sink?
It sounds like you need to use the Kafka producer API to publish the events to a topic, then read those events either with the Kafka consumer API, writing them to your database yourself, or with the Kafka JDBC sink connector.
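The database side of the consumer approach amounts to an upsert keyed by sensor id. A minimal sketch using SQLite from the standard library follows; the table name, column names and event shape are assumptions, and it presumes events for a given sensor arrive in order (which holds within a Kafka partition if the sensor id is the message key).

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the latest-reading table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS latest_reading (
            sensor_id   TEXT PRIMARY KEY,
            humidity    REAL,
            temperature REAL,
            pressure    REAL,
            updated_at  INTEGER
        )
        """
    )
    return conn


def apply_event(conn: sqlite3.Connection, event: dict) -> None:
    """Insert or overwrite the row for this sensor with the event's values."""
    conn.execute(
        """
        INSERT OR REPLACE INTO latest_reading
            (sensor_id, humidity, temperature, pressure, updated_at)
        VALUES (:sensor_id, :humidity, :temperature, :pressure, :ts)
        """,
        event,
    )
    conn.commit()
```

In a real consumer, `apply_event` would be called for each record polled from the topic; the JDBC sink connector does the equivalent without custom code when run in upsert mode.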
Also, if you only need the latest data inside Kafka, take a look at log compaction.
One way would be to push all the messages to a Kafka topic. Spark Streaming can then ingest and process the messages directly from that topic.
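Conceptually, a compacted topic (topic config `cleanup.policy=compact`) keeps at least the most recent value for each key. The toy function below imitates the end result on an in-memory list of `(key, value)` pairs, treating a `None` value as a tombstone that deletes the key. It is a simplification: real compaction runs in the background on closed log segments, so consumers can still see older values for a while.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple

Record = Tuple[Hashable, Optional[Any]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the last record per key, in offset order, dropping tombstones."""
    latest: Dict[Hashable, Tuple[int, Optional[Any]]] = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]
```

For the sensor scenario, keying each message by sensor id would mean the compacted topic converges on one latest reading per sensor.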