Is there a way to limit the size of avro files when writing from kafka via hdfs connector? - apache-kafka

Currently we use Flink's FsStateBackend checkpointing and set fileStateSizeThreshold to limit the size of data written to Avro/JSON files on HDFS to 128 MB. We also close files after a certain delay in checkpoint actions.
Since we are not using advanced Flink features, in a new project we want to use Kafka streaming with the Kafka Connect HDFS Connector to write messages directly to HDFS (without spinning up Flink).
However, I cannot find any option to limit the file size of the HDFS files written by the Kafka connector, except maybe flush.size, which seems to limit the number of records.
If there are no such settings on the connector, how else do people manage file sizes for streaming data on HDFS?

There is no file size option, only time-based rotation and flush size. You can set a large flush size that you never expect to reach; time-based rotation will then do a best-effort partitioning of large files into date partitions (we've been able to get 4 GB output files per topic partition within an hour directory from Connect).
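As a rough illustration of the "large flush size plus time-based rotation" idea, here is a minimal sketch that builds a connector config as JSON. The connector name, topic, and HDFS URL are hypothetical, and the partitioner class name is an assumption that varies between connector versions, so check it against the version you run.

```python
import json

ONE_HOUR_MS = 60 * 60 * 1000

# Hypothetical connector name, topic, and HDFS URL; adjust to your cluster.
connector = {
    "name": "hdfs-sink-example",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "events",
        "hdfs.url": "hdfs://namenode:8020",
        "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
        # Deliberately large so the record count is rarely what closes a file.
        "flush.size": "10000000",
        # Rotate the open file after roughly one hour instead.
        "rotate.interval.ms": str(ONE_HOUR_MS),
        # Assumption: class name as used by recent connector versions.
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": str(ONE_HOUR_MS),
        "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
        "locale": "en-US",
        "timezone": "UTC",
    },
}

print(json.dumps(connector, indent=2))
```

The printed JSON is the kind of payload you would submit to the Connect REST API. The resulting file size still depends on throughput, which is why this is only a best-effort bound.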
Personally, I suggest additional tools such as Hive, Pig, DistCp, Flink/Spark, depending on what's available, and not all at once, running in an Oozie job to "compact" these streaming files into larger files.
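To make the "compact" step concrete, here is a small sketch of the planning part only: grouping small files into batches of roughly a target size. The function name and file names are hypothetical, and the actual merge would still be done by whichever tool (Hive, Spark, etc.) you pick.

```python
from typing import List, Tuple


def plan_compaction(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Group (path, size_in_bytes) pairs into batches of roughly target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in sorted(files):
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
small_files = [
    ("part-0001.avro", 40 * MB),
    ("part-0002.avro", 50 * MB),
    ("part-0003.avro", 60 * MB),
    ("part-0004.avro", 30 * MB),
]
plan = plan_compaction(small_files, target_bytes=128 * MB)
print(plan)
```

Each inner list would become one compacted output file. A single file larger than the target simply ends up in a batch of its own.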
Before Connect, there was Camus, which is now Apache Gobblin. That project offers compaction, late event processing, and Hive table creation.
The general answer here is that you have a designated "hot landing zone" for streaming data, then you periodically archive it or "freeze" it (which brings out technology names like Amazon Glacier/Snowball & Snowplow)

Related

How can I set the micro batch size in Spark Structured Streaming from Kafka topic?

I have a Spark Structured Streaming app that reads from Kafka and writes to Elasticsearch and S3. I have enabled checkpointing to an S3 bucket as well (the app runs on AWS EMR). I saw in the S3 bucket that over time the commits become less frequent and there is an ever-growing delay in the data.
So I want to make Spark always process batches with the same amount of data. I tried to set ".option("maxOffsetsPerTrigger", 100)", but the batch size didn't become smaller and there is still a huge amount of time between commits.
As I understand it, we just tell Spark how much data to consume from Kafka per poll, and Spark just polls multiple times and then writes, so there is no limit on the batch size.
I also tried to use continuous mode, but the submit failed, I guess because the output sink / foreachBatch doesn't support it.
Any ideas are welcome, I will try everything ^^
Actually, each offset contained so much data that I had to limit the max offsets per trigger to 50, and I had to delete the old checkpoint folder. I read somewhere that Spark first tries to finish the batch with the offsets stored in the checkpoint, and only then applies the max offsets per trigger.
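The arithmetic behind picking such a cap can be sketched as follows. The helper name, the average record size, and the target batch size are all assumptions for illustration; only the maxOffsetsPerTrigger option name comes from the question.

```python
def max_offsets_per_trigger(avg_record_bytes: int, target_batch_bytes: int) -> int:
    """Estimate how many offsets fit into one micro-batch of the target size."""
    if avg_record_bytes <= 0:
        raise ValueError("avg_record_bytes must be positive")
    # Always allow at least one record, otherwise the stream would never advance.
    return max(1, target_batch_bytes // avg_record_bytes)


MB = 1024 * 1024
# Assumed numbers: ~2 MB per record and a ~100 MB batch target.
cap = max_offsets_per_trigger(avg_record_bytes=2 * MB, target_batch_bytes=100 * MB)

# Options for the Kafka source; the topic name is hypothetical.
kafka_options = {
    "subscribe": "events",
    "maxOffsetsPerTrigger": str(cap),
}
print(kafka_options)
```

In PySpark these options would typically be passed along with kafka.bootstrap.servers via spark.readStream.format("kafka").options(**kafka_options). As noted above, a batch already recorded in an existing checkpoint may still be processed at its old size.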

How to do batch processing on kafka connect generated datasets?

Suppose we have batch jobs producing records into Kafka and a Kafka Connect cluster consuming those records and moving them to HDFS. We want the ability to run batch jobs later on the same data, but we want to ensure that the batch jobs see all the records generated by the producers. What is a good design for this?
You can run any MapReduce, Spark, Hive, etc. query on the data, and you will get all records that have been written to HDFS thus far. It will not see data that the sink has not yet consumed from the producers, but this has nothing to do with Connect or HDFS; that is a pure Kafka limitation.
Worth pointing out that Apache Pinot is a better place to combine Kafka streaming data and have batch query support.

Integrating a large XML file size with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I can not change the integration to use Debezium connector for example.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or is there an architecture to send the XML file as single messages with an XSD schema?
Isn't receiving its content as a single large message a bad thing for the architecture?
The default maximum message size in Kafka (message.max.bytes on the broker, max.message.bytes on the topic) is about 1 MB, and it is not advisable to increase it significantly, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a type-safe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can read the paths from the Kafka topic and process the files they point to.
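The first option (one message per database row) could look roughly like this. It is a minimal sketch: the element names are hypothetical, and JSON is used as a stand-in for Avro so that the example needs only the standard library.

```python
import io
import json
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def iter_rows(source, row_tag: str) -> Iterator[Dict[str, str]]:
    """Stream an XML export and yield one dict per <row_tag> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield {child.tag: (child.text or "") for child in elem}
            # Drop the row's children so memory stays flat on large files.
            elem.clear()


# Tiny stand-in for the real export; <customer> is a hypothetical row element.
sample = io.BytesIO(
    b"<export>"
    b"<customer><id>1</id><name>Ann</name></customer>"
    b"<customer><id>2</id><name>Bob</name></customer>"
    b"</export>"
)

# Each row becomes one small message payload instead of one ~100 MB message.
messages = [json.dumps(row).encode("utf-8") for row in iter_rows(sample, "customer")]
print(messages)
```

Each payload would then be handed to a Kafka producer of your choice. Because the file is parsed incrementally, it never has to be held in memory as a whole.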
Writing a Kafka producer that unmarshals the XML file into Java objects and sends the serialized objects to the cluster in Avro format was the solution for me.

Purpose of +tmp in Kafka hdfs connect

I am planning to use Kafka HDFS Connect to move messages from Kafka to HDFS. While looking into it, I see there are parameters like flush size and rotate interval ms, with which you can batch messages in heap and write the batch at once.
Is the batch written to the WAL first and then to the mentioned location? I also see it creates a +tmp directory. What is the purpose of the +tmp directory? We could directly write the whole batch as a file under the specified location with offset ranges.
When the Kafka consumer writes to HDFS, it writes to the WAL first. The +tmp directory holds the temporary files for records that are still being written. Once a file is complete, it is moved to the actual defined location.
In fact, you can refer to the actual implementation to understand this in depth.
https://github.com/confluentinc/kafka-connect-hdfs/blob/121a69133bc2c136b6aa9d08b23a0799a4cd8799/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java#L611
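The general pattern behind a +tmp directory, writing to a temporary file and then renaming it into place, can be sketched on a local filesystem as follows. This is an analogy rather than the connector's code: the function name is hypothetical, the WAL step is omitted, and the file naming is only loosely modeled on the offset-range names mentioned in the question.

```python
import os
import tempfile
from typing import Iterable


def commit_batch(records: Iterable[str], final_dir: str, topic: str,
                 partition: int, start_offset: int, end_offset: int) -> str:
    """Write records to a temp file, then rename it into its final location."""
    tmp_dir = os.path.join(final_dir, "+tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    # Readers of final_dir never see this file while it is still being written.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")

    final_name = f"{topic}+{partition}+{start_offset:010d}+{end_offset:010d}.txt"
    final_path = os.path.join(final_dir, final_name)
    # The rename is the commit point: the file appears complete or not at all.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as base:
    path = commit_batch(["a", "b", "c"], base, "events", 0, 100, 102)
    committed_name = os.path.basename(path)
    leftover_tmp_files = os.listdir(os.path.join(base, "+tmp"))
    print(committed_name, leftover_tmp_files)
```

Writing directly to the final location would expose half-written files to readers and leave partial files behind after a crash, which is what the temporary directory avoids.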

Druid.io: update/override existing data via streams from Kafka (Druid Kafka indexing service)

I'm loading streams from Kafka using the Druid Kafka indexing service.
But the data I upload keeps changing, so I need to reload it and avoid duplicates and collisions if the data was already loaded.
I researched the docs about Updating Existing Data in Druid.
But all the info there is about Hadoop batch ingestion and lookups.
Is it possible to update existing Druid data while streaming from Kafka?
In other words, I need to overwrite the old values with new ones using the Kafka indexing service (streams from Kafka).
Is there maybe some kind of setting to overwrite duplicates?
Druid is in a way a time-series database where the data gets "finalised" and written to a log every time-interval. It does aggregations and optimises columns for storage and easy queries when it "finalises" the data.
By "finalising", what I mean is that Druid assumes that the data for the specified interval is already present and that it can safely do its computations on top of it. In effect, this means that there is no support for updating the data (like you would in a database). Any data that you write is treated as new data and keeps being added to its computations.
But Druid is different in the sense it provides a way to upload historical data for the same time period the real-time indexing has already taken place. This batch upload will overwrite any segments with the new ones and further queries will reflect the latest uploaded batch data.
So I am afraid the only option would be to do batch ingestion. Maybe you could still send the data to Kafka, but have a Spark/Gobblin job that does the de-duplication and writes to Hadoop. Then have a simple cron job re-index these as a batch into Druid.
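The de-duplication step in such a job boils down to keeping the latest version of each record before re-indexing. Here is a minimal in-memory sketch of that logic; the field names id and updated_at are hypothetical, and a real job would do the same thing at scale in Spark or Gobblin.

```python
from typing import Any, Dict, Iterable, List


def latest_per_key(events: Iterable[Dict[str, Any]], key_field: str,
                   version_field: str) -> List[Dict[str, Any]]:
    """Keep only the most recent event for each key."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        key = event[key_field]
        # On ties, the event seen later wins.
        if key not in latest or event[version_field] >= latest[key][version_field]:
            latest[key] = event
    return list(latest.values())


events = [
    {"id": 1, "updated_at": 1, "value": "a"},
    {"id": 2, "updated_at": 2, "value": "b"},
    {"id": 1, "updated_at": 5, "value": "c"},
]
deduplicated = latest_per_key(events, key_field="id", version_field="updated_at")
print(deduplicated)
```

The de-duplicated output is what the batch ingestion would load, replacing the segments that the real-time indexing produced for the same interval.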