How to manually commit a Kafka offset after FileIO in Apache Beam? - apache-kafka

I have a FileIO that writes a PCollection<GenericRecord> to files and returns a WriteFilesResult<DestinationT>.
I would like to add a DoFn after the write step to commit the offsets of the written records to Kafka, but since my offsets are stored in my GenericRecords, I can no longer access them in the output of FileIO.
What is the best way to solve this?

For anyone interested, here is how I did it:
1) manually GroupByKey the records by DestinationT
2) for each group, collect the list of offsets, build a new key EnrichedDestinationT, and flatten the iterable back out
3) the PCollection entering FileIO is then a PCollection<KV<EnrichedDestinationT, GenericRecord>>
4) in FileIO, the .by() becomes .by(KV::getKey) and the .via() becomes .via(Contextful.fn(KV::getValue), Contextful.fn(this::getSink)), as sketched below
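A minimal sketch of that write step, assuming EnrichedDestinationT carries the Kafka offsets and getSink() returns an Avro sink (the output path, coder, and file naming below are placeholders, not taken from the original pipeline):

WriteFilesResult<EnrichedDestinationT> result = keyedRecords.apply(
    FileIO.<EnrichedDestinationT, KV<EnrichedDestinationT, GenericRecord>>writeDynamic()
        .by(KV::getKey)                                                  // one destination per EnrichedDestinationT
        .via(Contextful.fn(KV::getValue), Contextful.fn(this::getSink))  // write the GenericRecord payloads
        .to("gs://my-bucket/output")                                     // placeholder output location
        .withDestinationCoder(enrichedDestinationCoder)                  // placeholder coder for EnrichedDestinationT
        .withNaming(dest -> FileIO.Write.defaultNaming("part", ".avro")));
// Downstream, result.getPerDestinationOutputFilenames() yields KV<EnrichedDestinationT, String>,
// so a DoFn can read the offsets back out of the key and commit them to Kafka.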

Related

How to understand Kafka streams?

I am following the Kafka Streams documentation and I am getting confused about some concepts which I want to clarify here.
https://kafka.apache.org/23/documentation/streams/developer-guide/dsl-api.html
Reading about flatMap in the documentation: it takes one record and produces zero, one, or more records, you can modify the record keys and values, and it also marks the data for re-partitioning.
Questions:
1) What does re-partitioning mean: will it re-partition the data into a new topic, where I am going to write the transformed results, or will it re-partition the data in the same topic that I started streaming from?
2) If the old topic's data is getting re-partitioned, does that mean the transformed results are written to that topic too?
For example:
KStream<Long, String> stream = ...;
KStream<String, Integer> transformed = stream.flatMap(
    // Here, we generate two output records for each input record.
    // We also change the key and value types.
    // Example: (345L, "Hello") -> ("HELLO", 1000), ("hello", 9000)
    (key, value) -> {
        List<KeyValue<String, Integer>> result = new LinkedList<>();
        result.add(KeyValue.pair(value.toUpperCase(), 1000));
        result.add(KeyValue.pair(value.toLowerCase(), 9000));
        return result;
    }
);
In this example, one input record produces two output records. Does this mean that the topic I started streaming from will now have 3 records: one with key 345L and two with the new keys "HELLO" and "hello"? If I write the transformed result to a new topic or a store, what would the state of the old and the new topic be? Would both contain all 3 records? I am a novice.
This is a transformed result. When you read from a topic, you do not change the source topic. However, when you write to another topic, your output sink topic will have the two new records per input record.
When the documentation says flatMap marks the stream for repartitioning, it means the result is marked for repartitioning and the repartitioning happens when the data is written downstream; it does not repartition the source topic. Think about why:
if you are continuously reading from the source topic, would it continuously repartition the source topic? That would not be a practical option.
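For example, writing the transformed stream to a separate sink topic (the topic name here is just a placeholder) leaves the source topic untouched:

// Only "transformed-topic" receives the two new records per input record;
// the source topic is never modified.
transformed.to("transformed-topic", Produced.with(Serdes.String(), Serdes.Integer()));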
I hope this clarifies your question.
Re-partitioning in Kafka Streams means that the records are sent to an intermediate topic before a processor, and the processor then reads the records back from that intermediate topic. Sending the records through the intermediate topic is what re-partitions them.
This is needed, for example, with join processors. A join processor in Kafka Streams requires that all keys of one partition are processed by the same task to ensure correctness. This would not hold if an upstream processor modified the keys of the records, as flatMap() does in your example. Besides joins, aggregations also require that all keys of one partition are processed by the same task. Re-partitioning does not write anything to the input or output topic of your streams application, and you should usually not need to care about the intermediate topics.
However, you can avoid re-partitioning where possible by using the *Values() operators, such as flatMapValues(), when you do not change the key of the records. For example, if you use flatMap() and do not change the keys, the records will nevertheless be re-partitioned although it would not be needed: Kafka Streams cannot know that you did not touch the key unless you use flatMapValues().
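For instance, a sketch of the value-only variant of the example above: because flatMapValues() keeps the Long key, the stream is not marked for re-partitioning.

KStream<Long, String> stream = ...;
// Produce the upper- and lower-case variants of each value while keeping the Long key,
// so no re-partitioning is required downstream.
KStream<Long, String> transformed = stream.flatMapValues(
    value -> Arrays.asList(value.toUpperCase(), value.toLowerCase()));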

kafka connect hdfs flatten array records

We are experimenting with Kafka Connect HDFS and have source data containing an array of Avro records (referred to as the container record below) which is pushed to topics.
The plan was to use Kafka Connect HDFS to flatten the array in the container record into individual records, write them to HDFS as Avro, and then commit the actual container record.
However, it seems that kafka-connect-hdfs does not support this out of the box.
While navigating the source we found that there is a tight mapping between topic names and offset management, and attempting to transform one SinkRecord into multiple SinkRecords deviates from the expected behavior.
So I wanted to verify: is a 1-to-n transformation possible? If so, how?
Kafka Connect supports Single Message Transforms (SMT), which is the area of functionality in Kafka Connect, if any, that would do what you want. However, IIRC SMTs are only for 1:1 input/output record processing, not 1:n - that is, I don't think (but am willing to be corrected) that you can generate additional output records from a single input record.
You probably want to look at Kafka Streams, which gives you the full capability to do this. You would write a streams app that subscribes to the source topic, does the necessary array processing, and writes the result back to a new Kafka topic. The new Kafka topic would then be the source topic for your Kafka Connect HDFS sink.
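As a rough sketch of that streams app (the topic names, the "records" array field, and the use of Avro GenericRecords with default Avro serdes are assumptions, not taken from your actual schema):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, GenericRecord> containers = builder.stream("container-topic"); // assumed source topic
containers
    // Flatten the assumed "records" array field of the container into individual records.
    .flatMapValues(container -> (List<GenericRecord>) container.get("records"))
    // The HDFS sink connector then consumes this new topic instead of the container topic.
    .to("flattened-topic");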

Kafka Streams: pipe one topic into another

I'm new to Kafka Streams and I'm using it to make an exact copy of a topic into another with a different name. This topic has several partitions and my producers are using custom partitioners. The output topic is created beforehand with the same number of partitions of the input topic.
In my app, I did (I'm using Kotlin):
val builder = StreamsBuilder()
builder
    .stream<Any, Any>(inputTopic)
    .to(outputTopic)
This works, except for the partitions (because of course I'm using a custom partitioner). Is there a simple way to copy input records to the output topic using the same partition of the input record?
I checked the Processor API, which allows access to the partition of the input record through a ProcessorContext, but I was unable to manually set the partition of the output record.
Apparently, I could use a custom partitioner in the sink, but that would imply deserializing and serializing the records to recalculate the output partition with my custom partitioner.
Produced (which is one of the KStream::to arguments) has a StreamPartitioner as one of its members.
You could try the following code:
builder.stream("input", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
    .to("output", Produced.with(Serdes.ByteArray(), Serdes.ByteArray(),
        (topicName, key, value, numberOfPartitions) -> calculatePartition(topicName, key, value, numberOfPartitions)));
In the code above only ByteArray Serdes are used, so no extra serialization or deserialization happens.
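A sketch of what calculatePartition could look like, assuming the custom producer partitioner hashes the serialized key; this reuses the default murmur2-based logic, so substitute whatever your own partitioner actually does:

// Works on the raw key bytes, so nothing has to be deserialized.
// Utils is org.apache.kafka.common.utils.Utils.
static Integer calculatePartition(String topic, byte[] key, byte[] value, int numPartitions) {
    return Utils.toPositive(Utils.murmur2(key)) % numPartitions;
}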
Firstly, messages are distributed among partitions based on the key: a message with a given key always goes to the same partition.
So if your messages have keys, you don't need to worry about this at all. As long as the output topic has the same number of partitions as the original topic, it will be taken care of.
Secondly, if you are copying data to another topic as-is, you should consider using the original topic instead. Kafka has the notion of consumer groups.
For example, if you have a topic 'transactions', you can have consumer groups such as 'credit card processor', 'mortgage payment processor', 'apple pay processor', and so on. Each consumer group reads the same topic, filters out the events that are meaningful to it, and processes them.
You could also create 3 topics and achieve the same result, though that is not an optimal solution. You can find more information at https://kafka.apache.org/documentation/.
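A small sketch of that pattern, where each processor runs as its own consumer group on the same 'transactions' topic (the group name and the isCreditCardEvent/process helpers are hypothetical):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "credit-card-processor"); // distinct group per processor
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("transactions"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            if (isCreditCardEvent(record.value())) { // keep only the events this group cares about
                process(record);
            }
        }
    }
}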

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific topics from Kafka on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD has to be given all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this text from Cloudera.
The example shows how to read data from Kafka exactly once by persisting the offsets in Postgres and relying on its ACID guarantees.
I hope that solves your problem.
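For reference, a rough sketch of such a batch read with spark-streaming-kafka-0-10, where the topic name, partitions, and offsets are placeholders you would load from your own offset store (e.g. the Postgres table mentioned above):

// One OffsetRange per topic-partition, bounded by the offsets you persisted yourself.
OffsetRange[] ranges = {
    OffsetRange.create("my-topic", 0, fromOffset0, untilOffset0),
    OffsetRange.create("my-topic", 1, fromOffset1, untilOffset1)
};

JavaRDD<ConsumerRecord<String, String>> rdd = KafkaUtils.createRDD(
    jsc,                                  // JavaSparkContext
    kafkaParams,                          // Map<String, Object> with bootstrap servers and deserializers
    ranges,
    LocationStrategies.PreferConsistent());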

Apache spark + RDD + persist() doubts

I am new to Apache Spark and using the Scala API. I have 2 questions regarding RDDs.
How can I persist some partitions of an RDD, instead of the entire RDD, in Apache Spark? (The core RDD implementation provides the rdd.persist() and rdd.cache() methods, but I do not want to persist the entire RDD; I am interested in persisting only some partitions.)
How can I create one empty partition while creating each RDD? (I am using the repartition and textFile transformations. In these cases I get the expected number of partitions, but I also want one empty partition for each RDD.)
Any help is appreciated.
Thanks in advance