Kafka commit with Akka and LogRotator - apache-kafka

I am trying to use Consumer.committableSource to read data from Kafka with Akka, and I would then like to write the data to files in a shared folder.
When committing, we usually use something like .via(Committer.flow(committerSettings)).
However, this flow does not emit the values of the Kafka stream, so afterwards I cannot call something like .runWith(LogRotatorSink.withSinkFactory(rotator, sink)) to write the data.
Here's the code without the commit:
Consumer.committableSource(settings, Subscriptions.topics(kafkaTopics.toSet))
  .via(processor)
  .prepend(headerCSVSource)
  .via(CsvFormatting.format(delimiter = CsvFormatting.SemiColon))
  .runWith(LogRotatorSink.withSinkFactory(rotator, sink))
Here's what I think I need:
Consumer
  .committableSource(settings, Subscriptions.topics(kafkaTopics.toSet))
  .via(processor)
  .prepend(headerCSVSource)
  .via(CsvFormatting.format(delimiter = CsvFormatting.SemiColon))
  .via(Committer.flow(committerSettings))
  .runWith(LogRotatorSink.withSinkFactory(rotator, sink))
But that won't work, because Committer.flow is a Flow[Committable, Done, NotUsed]: it emits Done rather than the stream values.
What I need is to commit the offset only after the data has been written to the file.
If you feel that other options (like using plainSource / auto-commit) would be more appropriate, I am open to considering them.

It looks like you need to pass each element to one sink and, once that has succeeded, to another.
You can run a substream inside your stream, something along these lines:
.via(CsvFormatting.format(delimiter = CsvFormatting.SemiColon))
.mapAsync(1) { c =>
  Source.single(c).runWith(LogRotatorSink.withSinkFactory(rotator, sink)).map(_ => c)
}
.runWith(Committer.sink(committerSettings))
It should work; however, after some thought, I think it would be best not to use a sink to write the logs, but some other way that doesn't terminate the stream.
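The ordering asked for in the question (write first, commit second) can be illustrated without Akka. The following is a minimal plain-Java sketch, not Alpakka Kafka code: Rec, WriteThenCommit, and committedOffset are hypothetical names introduced here, and a java.io.Writer stands in for the rotating file sink. The offset advances only after the write and flush for that record have succeeded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical record type: a Kafka-like offset plus the formatted CSV line.
final class Rec {
    final long offset;
    final String line;

    Rec(long offset, String line) {
        this.offset = offset;
        this.line = line;
    }
}

// Illustrative stand-in for "write to the file, then commit the offset".
final class WriteThenCommit {
    // Last offset whose data is known to be written; -1 means nothing committed yet.
    private long committedOffset = -1L;

    long committedOffset() {
        return committedOffset;
    }

    // Writes records in order. If a write fails, the exception propagates and
    // committedOffset still points at the last record that was fully written.
    void process(List<Rec> records, Writer out) {
        for (Rec r : records) {
            try {
                out.write(r.line);
                out.write(System.lineSeparator());
                out.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            committedOffset = r.offset; // "commit" only after the write succeeded
        }
    }
}
```

If the process restarts after a failure, reading would resume after committedOffset, so a record whose write failed is read again rather than lost.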

Is it possible to create a batch flink job in streaming flink job?

I have a streaming job using Apache Flink (version 1.8.1), written in Scala. The job flow is as follows:
Kafka -> Write to HBase -> Send to Kafka again with a different topic
While writing to HBase, the job needs to retrieve data from another table. To ensure that this data is not empty (NULL), the job must check repeatedly (within a certain time) whether the data is still empty.
Is this possible with Flink? If yes, can you provide examples for conditions similar to my needs?
Edit:
To clarify: given the problem described above, I thought I would have to create some kind of batch job inside the streaming job, but I couldn't find a suitable example for my case. So, is it possible to create a batch Flink job inside a streaming Flink job? If yes, can you provide examples for conditions similar to my needs?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
To clarify my comment, here is a sketch of what I was suggesting, based on The Broadcast State Pattern. The link provides an example in Java, so I will follow it; a Scala version should not be much different. You will likely have to implement the code below as explained on the linked page:
DataStream<String> output = colorPartitionedStream
    .connect(ruleBroadcastStream)
    .process(
        // type arguments in our KeyedBroadcastProcessFunction represent:
        // 1. the key of the keyed stream
        // 2. the type of elements in the non-broadcast side
        // 3. the type of elements in the broadcast side
        // 4. the type of the result, here a string
        new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
            // my matching logic
        }
    );
I was suggesting that you build ruleBroadcastStream by fetching the data at fixed intervals from the database, or whatever your store is. Instead of:
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
    .broadcast(ruleStateDescriptor);
as the web page says, you will need to add a source that you can schedule to run every X minutes:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
BroadcastStream<Rule> ruleBroadcastStream = env
    .addSource(new YourStreamSource())
    .broadcast(ruleStateDescriptor);
public class YourStreamSource extends RichSourceFunction<YourType> {
    // volatile so that cancel(), called from another thread, is seen by run()
    private volatile boolean running = true;
    @Override
    public void run(SourceContext<YourType> ctx) throws Exception {
        while (running) {
            // TODO: yourData = FETCH DATA;
            ctx.collect(yourData);
            // Sleep for X minutes; intervalMillis is a placeholder for that interval in milliseconds.
            Thread.sleep(intervalMillis);
        }
    }
    @Override
    public void cancel() {
        this.running = false;
    }
}

How to send message by message to Kafka

I'm new to reactive programming and I am trying to implement a very basic scenario.
I want to send a message to Kafka each time a file is dropped into a specific folder.
I think I don't understand the basics well, so could you please help me?
I have a few questions:
What is the difference between smallrye-reactive-messaging and smallrye-reactive-streams-operators?
I have this simple code:
@Outgoing("my-topic")
public PublisherBuilder<Message<MessageWrapper>> generate() {
    if (Objects.isNull(currentMessage)) {
        // currentMessage is an instance variable which is null when I start the application
        return ReactiveStreams.of(new MessageWrapper()).map(Message::of);
    } else {
        // currentMessage has been correctly set with the file information
        LOGGER.info(currentMessage);
        return ReactiveStreams.of(currentMessage).map(Message::of);
    }
}
When the code goes into the if branch, everything is OK and I get a JSON serialization of my object with null values. However, I don't understand why nothing goes to the topic when my code goes into the else branch. It seems that the .of call in the if branch has broken the stream, or something like that...
How do I keep a continuous stream that 'reacts' to newly dropped files (or to other events such as an HTTP GET request)?
If I return not a PublisherBuilder but an Integer, for example, then my Kafka topic is populated by a huge stream of Integer values. This is why examples use intervals when sending messages...
Should I use CompletionStage or CompletableFuture? RxJava2? It's a bit confusing which library to use (Vert.x, SmallRye, RxJava2, MicroProfile, ...).
What are the differences between:
ReactiveStreams.fromCompletionStage
ReactiveStreams.fromProcessor
ReactiveStreams.fromPublisher
ReactiveStreams.fromSubscriber
Which one should be used in which scenario?
Thank you very much!
Let's start with the difference between smallrye-reactive-messaging and smallrye-reactive-streams-operators: smallrye-reactive-streams-operators is the same as smallrye-reactive-messaging, but in addition it supports MicroProfile Context Propagation. Since most reactive-messaging providers use Vert.x behind the scenes, your message is processed in an event-loop style, which means it runs on a separate thread. Sometimes you need to propagate some context from your base thread into the new thread (e.g. populating the CDI and transaction context to execute some JPA EntityManager logic). This is where context propagation helps.
As for method signatures, take a look at sections 3, 4 and 5 of the official SmallRye-reactive-streams documentation. Each one has a different use case, and it is up to you which flavor to use.
When to use what? If you are not running within a reactive context, you can use the following to send messages:
@Inject
@Channel("my-channel")
Emitter emitter;
For message consumption, you can use a method signature like this:
@Incoming("channel-2")
public CompletionStage doSomething(Message anEvent)
Or:
@Incoming("channel-2")
public void doSomething(String anEvent)
Hope that helps.

Aggregating Topics with apache beam Kafkaio (Dataflow)

I have slow-moving data in a compacted Kafka topic and fast-moving data in another topic.
1) The fast-moving data consists of unbounded events ingested in real time from Kafka.
2) The slow-moving data is metadata used to enrich the fast-moving data. It lives in a compacted topic and is updated infrequently (days/months).
3) Each fast-moving payload should have a metadata payload with the same customerId, with which it can be aggregated.
I would like to aggregate the fast- and slow-moving data by customerId (common to the data on both topics). How would you go about doing this? So far I have:
PTransform<PBegin, PCollection<KV<byte[], byte[]>>> kafka = KafkaIO.<byte[], byte[]>read()
    .withBootstrapServers("url:port")
    .withTopics(Arrays.asList("fast-moving-data", "slow-moving-data"))
    .withKeyDeserializer(ByteArrayDeserializer.class)
    .withValueDeserializer(ByteArrayDeserializer.class)
    .updateConsumerProperties((Map) props)
    .withoutMetadata();
I have noticed that I can use .withTopics to specify the different topics I would like to read, but beyond this point I have not been able to find any examples that help with the aggregation. Any help would be appreciated.
The following pattern, which is also discussed in this SO Q&A, might be a good one to explore for your use case. One item that could be an issue is the size of your compacted slow-moving stream. Hope it's useful.
1) Use the GenerateSequence source transform to emit a value periodically, for example once a day.
2) Pass this value into a global window via a data-driven trigger that activates on each element.
3) In a DoFn, use this process as a trigger to pull data from your bounded source.
4) Create your SideInput for use in downstream transforms.
It's important to note that because this pattern uses a global-window SideInput triggering on processing time, matching it to elements being processed in event time will be nondeterministic. For example, if the main pipeline is windowed on event time, the version of the SideInput view that those windows see will depend on the latest trigger that has fired in processing time rather than on the event time.
It's also important to note that, in general, the SideInput should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below, the side input is updated at very short intervals so that the effects can be easily seen. The expectation is that the side input updates slowly, for example every few hours or once a day.
In the example code below, we create a Map in a DoFn, and that Map becomes the View.asSingleton; this is the recommended approach for this pattern.
The sample below illustrates the pattern. Please note that the View.asSingleton is rebuilt on every counter update.
For your use case, you could replace the GenerateSequence transforms with PubSubIO transforms. Does that make sense?
public static void main(String[] args) {
    // Create pipeline
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
        .as(PipelineOptions.class);
    // Using View.asSingleton, this pipeline uses a dummy external service as illustration.
    // Run in debug mode to see the output
    Pipeline p = Pipeline.create(options);
    // Create slowly updating sideinput
    PCollectionView<Map<String, String>> map = p
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(Window.<Long>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
            @ProcessElement
            public void process(@Element Long input,
                OutputReceiver<Map<String, String>> o) {
                // Do any external reads needed here...
                // We will make use of our dummy external service.
                // Every time this triggers, the complete map will be replaced with that read from
                // the service.
                o.output(DummyExternalService.readDummyData());
            }
        })).apply(View.asSingleton());
    // ---- Consume slowly updating sideinput
    // GenerateSequence is only used here to generate dummy data for this illustration.
    // You would use your real source for example PubSubIO, KafkaIO etc...
    p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
        .apply(Sum.longsGlobally().withoutDefaults())
        .apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
            @ProcessElement
            public void process(ProcessContext c) {
                Map<String, String> keyMap = c.sideInput(map);
                c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
                LOG.debug("Value is {} key A is {} and key B is {}",
                    c.element(), keyMap.get("Key_A"), keyMap.get("Key_B"));
            }
        }).withSideInputs(map));
    p.run();
}
public static class DummyExternalService {
    public static Map<String, String> readDummyData() {
        Map<String, String> map = new HashMap<>();
        Instant now = Instant.now();
        // Joda-Time pattern: HH = hour of day, mm = minute, ss = second
        // (uppercase MM would be the month and SS the fraction of a second).
        DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
        map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
        map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
        return map;
    }
}
I would suggest reading those topics separately, creating two different inputs to the pipeline. You can cross/join them later. And the way to cross them is to provide slow-moving stream as a side-input into the hotpath (transforms of the fast moving PCollection).
See here: https://beam.apache.org/documentation/programming-guide/#side-inputs
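Stripped of the Beam machinery, the side-input approach suggested in both answers amounts to a keyed lookup: the slow-moving topic maintains the latest metadata per customerId, and each fast-moving event is matched against it. Below is a minimal plain-Java sketch of that idea, not Beam code; Enricher, updateMetadata, and enrich are hypothetical names, and in a real pipeline the map would be the side-input view.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: models the slow-moving compacted topic as a map keyed by
// customerId and the fast-moving events as lookups against that map.
final class Enricher {
    // Latest metadata per customerId, mirroring how a compacted topic retains
    // the most recent value for each key.
    private final Map<String, String> metadataByCustomer = new HashMap<>();

    // Called for each record read from the slow-moving topic.
    void updateMetadata(String customerId, String metadata) {
        metadataByCustomer.put(customerId, metadata);
    }

    // Called for each record read from the fast-moving topic. Returns empty
    // when no metadata has been seen yet for this customer.
    Optional<String> enrich(String customerId, String event) {
        String metadata = metadataByCustomer.get(customerId);
        if (metadata == null) {
            return Optional.empty();
        }
        return Optional.of(event + "|" + metadata);
    }
}
```

The empty result is the case to decide on explicitly: a fast-moving event may arrive before its metadata has been loaded, and the pipeline has to either drop it, buffer it, or emit it without enrichment.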

How to use delta trigger in flink?

I want to use the DeltaTrigger in Apache Flink (Flink 1.3), but I have some trouble with this code:
.trigger(DeltaTrigger.of(100, new DeltaFunction[uniqStruct] {
override def getDelta(oldFp: uniqStruct, newFp: uniqStruct): Double = newFp.time - oldFp.time
}, TypeInformation[uniqStruct]))
And I have this error:
error: object org.apache.flink.api.common.typeinfo.TypeInformation is not a value [ERROR] }, TypeInformation[uniqStruct]))
I don't understand why DeltaTrigger needs a TypeSerializer[T],
and I don't know what to do to remove this error.
Thanks a lot everyone.
I would read into this a bit: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html. It sounds like you can create a serializer by calling createSerializer(config) on your type info. Note that what you're currently passing in is the type itself and NOT the type info, which is why you're getting that error.
You would need to do something more like:
val uniqStructTypeInfo: TypeInformation[uniqStruct] = createTypeInformation[uniqStruct]
val uniqStructTypeSerializer = uniqStructTypeInfo.createSerializer(config)
To quote the page above regarding the config parameter you need to pass to createSerializer:
The config parameter is of type ExecutionConfig and holds the
information about the program’s registered custom serializers. Where
ever possibly, try to pass the programs proper ExecutionConfig. You
can usually obtain it from DataStream or DataSet via calling
getExecutionConfig(). Inside functions (like MapFunction), you can get
it by making the function a Rich Function and calling
getRuntimeContext().getExecutionConfig().
DeltaTrigger needs a TypeSerializer because it uses Flink's managed state mechanism to store each element for later comparison with the next one (it just keeps one element, the last one, which is updated as new elements arrive).
You will find an example (in Java) here.
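To make the role of that stored element concrete, here is a simplified plain-Java model of a delta-based trigger. It is not Flink's DeltaTrigger class: DeltaTriggerSketch and onElement are hypothetical names, the state is an ordinary field rather than managed state, and the sketch assumes that the stored element is replaced only when the trigger fires, which should be checked against the Flink version in use.

```java
import java.util.function.ToDoubleBiFunction;

// Illustrative model of a delta-based trigger; not Flink's implementation.
final class DeltaTriggerSketch<T> {
    private final double threshold;
    private final ToDoubleBiFunction<T, T> deltaFunction;
    // The single element kept as state. In Flink this would live in managed
    // state, which is why a TypeSerializer for T is needed there.
    private T reference;

    DeltaTriggerSketch(double threshold, ToDoubleBiFunction<T, T> deltaFunction) {
        this.threshold = threshold;
        this.deltaFunction = deltaFunction;
    }

    // Returns true when the trigger should fire for this element.
    boolean onElement(T element) {
        if (reference == null) {
            reference = element; // the first element only initialises the state
            return false;
        }
        if (deltaFunction.applyAsDouble(reference, element) > threshold) {
            reference = element; // assumption: the reference moves only on firing
            return true;
        }
        return false;
    }
}
```

With a threshold of 100 and a delta of newTime - oldTime, as in the question, this model fires only once an element arrives more than 100 time units after the stored one.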
But if all you need is a window that triggers every 100 msec, then it'll be easier to just use a TimeWindow, such as:
input
    .keyBy(<key selector>)
    .timeWindow(Time.milliseconds(100))
    .apply(<window function>)
Updated:
To have hour-long windows that trigger every 100msec, you could use sliding windows. However, you would have 10 * 60 * 60 windows, and every event would be placed into each of these 36000 windows. So that's not a great idea.
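The figure of 36000 is the window size divided by the slide. A small sketch of that arithmetic follows; SlidingWindowMath and windowsPerElement are hypothetical names used only for illustration, and the method assumes the size is an exact multiple of the slide.

```java
import java.time.Duration;

final class SlidingWindowMath {
    // Number of overlapping sliding windows that contain any single element,
    // assuming the window size is an exact multiple of the slide.
    static long windowsPerElement(Duration size, Duration slide) {
        return size.toMillis() / slide.toMillis();
    }
}
```

For one-hour windows sliding every 100 msec, windowsPerElement(Duration.ofHours(1), Duration.ofMillis(100)) gives 3,600,000 / 100 = 36000, so each incoming event is copied into 36000 windows.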
If you use a GlobalWindow with a DeltaTrigger, then the window will be triggered only when events are more than 100msec apart, which isn't what you've said you want.
I suggest you look at ProcessFunction. It should be straightforward to get what you want that way.

How to commit a file(entire file) in spring batch without using chunks - commit interval?

The commit interval commits the data at specified intervals. I want to commit the entire file in a single shot, since my requirement is to validate the file line by line and, if validation fails at any point, roll back without committing anything. Is there any way to achieve this in Spring Batch?
You can either set your commit-interval to Integer.MAX_VALUE (2^31 - 1) or create your own CompletionPolicy.
Here's how you configure a step to use a custom CompletionPolicy :
<chunk reader="reader" writer="writer" chunk-completion-policy="completionPolicy"/>
<bean id="completionPolicy" class="xx.xx.xx.CompletionPolicy"/>
Then you have to either choose an out-of-the-box CompletionPolicy provided by Spring Batch (a list of implementations is available on previous link) or create your own.
What do you mean by "commit"?
You are talking about validating and not about writing the read data to another file or into database.
As mentioned in the comment by Michael Prarlow, memory problems could arise if the size of the file changes.
To prevent this, I would suggest starting your job with a validation step. Simply read the data chunk-wise, check it line by line in your processor, and throw a non-skippable exception if a line is not valid. Use a pass-through writer so that nothing is persisted. If there is a problem, the whole job will fail.
If you really have to write the data into a DB or another file, you could do this in a second step. Since you have already validated your data, you shouldn't run into any problems.
Simple PassThroughItemWriter:
public class PassThroughItemWriter<T> implements ItemWriter<T> {
    public void write(List<? extends T> items) {
        // do nothing
    }
}
or, if you use the Java API to build your job and steps, you could simply use a lambda:
stepBuilders.get("step")
    .<..., ...>chunk(..)
    .reader(...)
    .processor(...) // your processor with the validation logic
    .writer(items -> {}) // empty lambda expression
    .build();
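The two-step idea (validate everything first, write only if validation passed) can be shown independently of Spring Batch. The following is a minimal plain-Java sketch under that assumption; TwoPhaseFileLoad, validateAll, and load are hypothetical names, and the target list stands in for the database or output file.

```java
import java.util.List;
import java.util.function.Predicate;

final class TwoPhaseFileLoad {
    // Step 1: check every line and throw on the first invalid one, so that
    // nothing has been written when validation fails.
    static void validateAll(List<String> lines, Predicate<String> isValid) {
        for (int i = 0; i < lines.size(); i++) {
            if (!isValid.test(lines.get(i))) {
                throw new IllegalArgumentException(
                    "Invalid line " + (i + 1) + ": " + lines.get(i));
            }
        }
    }

    // Step 2 runs only if step 1 passed; 'target' stands in for the real store.
    static void load(List<String> lines, Predicate<String> isValid, List<String> target) {
        validateAll(lines, isValid);
        target.addAll(lines);
    }
}
```

Because validation never touches the target, there is nothing to roll back when a line is rejected, which is the same effect the validation step with a pass-through writer achieves.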