How to write a partitioned Parquet file in Apache Beam Java - apache-beam

I am new to Apache Beam and not sure how to accomplish this task.
I want to write a partitioned Parquet file using Apache Beam in Java.
Data is read from Kafka, and I want the file to have a new partition every hour. The timestamp column is present in the data.

Try using FixedWindows for that. There is an example of a windowed WordCount that writes every window into a separate text file, so I believe it can be adapted to your case.
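For reference, here is a minimal sketch of that approach with a Parquet sink. It assumes the Kafka records have already been converted to Avro GenericRecords, that the event timestamp (taken from the timestamp column) is attached to each element, and that ParquetIO from the beam-sdks-java-io-parquet module is on the classpath; the schema variable and output path are placeholders.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// records: PCollection<GenericRecord> parsed from Kafka, with each element's event
// timestamp taken from the timestamp column (e.g. via KafkaIO's timestamp policy
// or a WithTimestamps transform).
PCollection<GenericRecord> hourly = records.apply(
    Window.<GenericRecord>into(FixedWindows.of(Duration.standardHours(1))));

// avroSchema: the Avro Schema of the records (placeholder).
hourly.apply(
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(avroSchema))
        .to("hdfs:///data/output")       // placeholder base path
        .withSuffix(".parquet")
        .withNumShards(1));              // an explicit shard count is needed for unbounded input

With windowed input, FileIO's default file naming includes the window, so each hourly window ends up in its own set of files.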

Related

Spark streaming job writing to HDFS in JSON format

I have made a Spark streaming job that polls messages from Kafka and stores them in JSON format on HDFS. I took the example from here: https://github.com/sryza/simplesparkavroapp/blob/specifics/src/main/scala/com/cloudera/sparkavro/SparkSpecificAvroWriter.scala
There is another job that creates a Hive table based on Avro, with the following properties: AvroContainerInputFormat / AvroContainerOutputFormat.
Now I'm facing a problem: the produced JSON files are not visible when querying the Hive table.
It seems that the input/output formats are different.
Has anyone had a similar problem?
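If the Hive table is declared with AvroContainerInputFormat / AvroContainerOutputFormat, the files on HDFS have to be Avro container files rather than raw JSON for the table to see the data. As a rough Java sketch in the spirit of the linked SparkSpecificAvroWriter (MyRecord, the RDD, and the output path are placeholders):

import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Writes a batch of generated Avro specific records (MyRecord is a placeholder class)
// as Avro container files that Hive's AvroContainerInputFormat can read,
// e.g. called from foreachRDD in the streaming job.
static void writeAsAvro(JavaRDD<MyRecord> rdd, String outputPath) throws IOException {
  Job job = Job.getInstance(new Configuration());
  AvroJob.setOutputKeySchema(job, MyRecord.getClassSchema());

  JavaPairRDD<AvroKey<MyRecord>, NullWritable> avroRdd =
      rdd.mapToPair(r -> new Tuple2<>(new AvroKey<>(r), NullWritable.get()));

  avroRdd.saveAsNewAPIHadoopFile(
      outputPath,
      AvroKey.class, NullWritable.class,
      AvroKeyOutputFormat.class, job.getConfiguration());
}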

Configure Avro file size written to HDFS by Spark

I am writing a Spark DataFrame in Avro format to HDFS, and I would like to split large Avro files so that they fit the Hadoop block size while not being too small. Are there any DataFrame or Hadoop options for that? How can I split the files to be written into smaller ones?
Here is the way I write the data to HDFS:
dataDF.write
  .format("avro")
  .option("avroSchema", parseAvroSchemaFromFile("/avro-data-schema.json").toString)
  .save(dataDir)
I have researched a lot and found that it is not possible to set a limit on file size, only on the number of Avro records. So the only solution would be to create an application for mapping the number of records to file sizes.
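For example, with the Java API that record-based route could look like the following sketch, assuming Spark 2.2+ (where the maxRecordsPerFile write option exists) and the spark-avro module on the classpath; the record cap is a placeholder you would derive from your average serialized record size.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// dataDF: the Dataset<Row> being written; each output file is capped at roughly
// (desired file size / average record size) records.
dataDF.write()
    .format("avro")
    .option("maxRecordsPerFile", 1000000)   // placeholder record cap per file
    .save(dataDir);

The avroSchema option from the snippet above can be passed in the same way.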

How to process files using Spark Structured Streaming chunk by chunk?

I am processing a large number of files, and I want to process these files chunk by chunk; let's say that during each batch, I want to process 50 files at a time.
How can I do it using Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it was possible using Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot,
If using File Source:
maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
spark
  .readStream
  .format("json")
  .schema(jsonSchema)  // jsonSchema: the files' schema; streaming file sources require one
  .option("maxFilesPerTrigger", 50)
  .load("/path/to/files")
If using a Kafka source, it would be similar, but with the maxOffsetsPerTrigger option.
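For example, a minimal sketch of the Kafka variant (Java API; the broker address and topic name are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// spark: an existing SparkSession. maxOffsetsPerTrigger caps the total number of
// offsets (records) consumed per micro-batch, split across the topic's partitions.
Dataset<Row> stream = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
    .option("subscribe", "my-topic")                     // placeholder topic
    .option("maxOffsetsPerTrigger", 10000)               // records per trigger
    .load();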

Is there a way to limit the size of Avro files when writing from Kafka via the HDFS connector?

Currently we use Flink's FsStateBackend checkpointing and set fileStateSizeThreshold to limit the size of data written to Avro/JSON files on HDFS to 128 MB. We also close files after a certain delay in checkpoint actions.
Since we are not using advanced Flink features, in a new project we want to use Kafka Streaming with the Kafka Connect HDFS Connector to write messages directly to HDFS (without spinning up Flink).
However, I cannot find any options to limit the file size of the HDFS files from the Kafka connector, except maybe flush.size, which seems to limit the number of records.
If there are no settings on the connector, how do people manage the file sizes of streaming data on HDFS in another way?
There is no file size option, only time-based rotation and flush size. You can set a large flush size that you never expect to reach, and then time-based rotation will do a best-effort partitioning of large files into date partitions (we've been able to get 4 GB output files per topic partition within an hourly directory from Connect).
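For illustration, the size/rotation-related HDFS connector settings would look roughly like this (properties-style sketch, placeholder values):

# Placeholder values; flush.size counts records, rotation is time-based.
flush.size=1000000
# Rotate files on a time interval instead of relying on flush.size alone.
rotate.interval.ms=3600000
# Partition output directories by time, one directory per hour.
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC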
Personally, I suggest additional tools such as Hive, Pig, DistCp, Flink/Spark, depending on what's available, and not all at once, running in an Oozie job to "compact" these streaming files into larger files.
See my comment here
Before Connect, there was Camus, which is now Apache Gobblin. Within that project, it offers the ideas of compaction and late event processing, plus Hive table creation.
The general answer here is that you have a designated "hot landing zone" for streaming data, and then you periodically archive it or "freeze" it (which brings out technology names like Amazon Glacier/Snowball & Snowplow).

Read Kafka topic in a Spark batch job

I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however,
I need to set the offsets for all the partitions, and I also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job from.
What is the right approach to read from Kafka in a batch job?
I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest and saves the checkpoint
to HDFS, and then on the next run starts from that.
But in this case, how can I fetch just once and stop streaming after the first batch?
createRDD is the right approach for reading a batch from Kafka.
To query for information about the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That file was private, but it should be public in the latest versions of Spark.
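A rough Java sketch of that batch read (broker, topic, and offset values are placeholders; the untilOffset you persist after one run becomes the fromOffset of the next):

import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

// jsc: an existing JavaSparkContext (Spark 1.6 with the spark-streaming-kafka artifact).
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "broker1:9092");   // placeholder broker list

// One range per topic partition: records in [fromOffset, untilOffset) are read.
OffsetRange[] ranges = new OffsetRange[] {
    OffsetRange.create("my-topic", 0, 0L, 100000L),        // placeholder offsets
    OffsetRange.create("my-topic", 1, 0L, 100000L)
};

JavaPairRDD<String, String> batch = KafkaUtils.createRDD(
    jsc,
    String.class, String.class,
    StringDecoder.class, StringDecoder.class,
    kafkaParams, ranges);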