Adding new rules at run time in Apache Kafka Flink

I am using Flink with Kafka to apply rules over a stream. The following is sample code:
ObjectMapper mapper = new ObjectMapper();
List<JsonNode> rulesList = null;
try {
    // Read rule file
    rulesList = mapper.readValue(new File("ruleFile"), new TypeReference<List<JsonNode>>(){});
} catch (IOException e1) {
    System.out.println("Error reading Rules file.");
    System.exit(-1);
}

for (JsonNode jsonObject : rulesList) {
    String id = jsonObject.get("Id1").textValue();
    // Form the pattern dynamically
    Pattern<JsonNode, ?> pattern =
            Pattern.<JsonNode>begin("start").where(new SimpleConditionImpl(jsonObject.get("rule1")));
    // Create the pattern stream
    PatternStream<JsonNode> patternStream = CEP.pattern(data, pattern);
}
But the problem is that the rules file is only read once, when the program starts; I want new rules to be added dynamically at runtime and applied to the stream.
Is there any way to achieve this in Flink?

Flink's CEP library doesn't (yet) support dynamic patterns. (See FLINK-7129.)
The standard approach for this is to use broadcast state to communicate and store the rules throughout the cluster, but you'll have to come up with some way to evaluate/execute the rules.
See https://training.da-platform.com/exercises/taxiQuery.html and https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/examples/datastream_java/broadcast/BroadcastState.java for examples.
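
To make this a bit more concrete, here is a minimal sketch of the broadcast-state approach (an assumption-laden illustration, not the CEP library: rulesStream is assumed to be a DataStream<JsonNode> of rule updates, e.g. read from a second Kafka topic, data is the event stream from the question, and the actual rule evaluation is left as a placeholder):

MapStateDescriptor<String, JsonNode> ruleStateDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, TypeInformation.of(JsonNode.class));

// Broadcast the rule updates to every parallel instance
BroadcastStream<JsonNode> ruleBroadcast = rulesStream.broadcast(ruleStateDescriptor);

DataStream<String> matches = data
        .connect(ruleBroadcast)
        .process(new BroadcastProcessFunction<JsonNode, JsonNode, String>() {

            @Override
            public void processElement(JsonNode event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                for (Map.Entry<String, JsonNode> entry :
                        ctx.getBroadcastState(ruleStateDescriptor).immutableEntries()) {
                    JsonNode rule = entry.getValue();
                    // TODO: evaluate 'rule' against 'event' here and emit matches,
                    // e.g. if (evaluate(rule, event)) out.collect(...)
                }
            }

            @Override
            public void processBroadcastElement(JsonNode rule, Context ctx, Collector<String> out) throws Exception {
                // Store/update the rule so that every parallel instance sees it
                ctx.getBroadcastState(ruleStateDescriptor).put(rule.get("Id1").textValue(), rule);
            }
        });

Rules published to the rule topic then show up in the broadcast state at runtime, without restarting the job.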

Related

Is there any way to compress the data while using mongo persistence with NEventStore?

I'm working with C#, .NET Core, and NEventStore (version 9.0.1), trying to evaluate the various persistence options that it supports out of the box.
More specifically, when using the Mongo persistence, the payload is stored without any compression being applied.
Note: payload compression works as expected with the SQL persistence of NEventStore, but not with the Mongo persistence.
I'm using the code below to create and initialize the event store:
private IStoreEvents CreateEventStore(string connectionString)
{
    var store = Wireup.Init()
        .UsingMongoPersistence(connectionString,
            new NEventStore.Serialization.DocumentObjectSerializer())
        .InitializeStorageEngine()
        .UsingBsonSerialization()
        .Compress()
        .HookIntoPipelineUsing()
        .Build();
    return store;
}
And I'm using the code below to store the events:
public async Task AddMessageTostore(Command command)
{
    using (var stream = _eventStore.CreateStream(command.Id))
    {
        stream.Add(new EventMessage { Body = command });
        stream.CommitChanges(Guid.NewGuid());
    }
}
The workaround: by implementing the PreCommit(CommitAttempt attempt) and Select methods of IPipelineHook and applying gzip compression logic there, compression of the events in MongoDB was achieved.
Attaching a data store image of both SQL and Mongo persistence:
So, the questions are:
Is there some other option or setting I'm missing so that the events get compressed while saving (i.e. the fluent way of calling the Compress method)?
Is the workaround mentioned above sensible, or is it a performance overhead?
I also faced the same issue while using NEventStore.Persistence.MongoDB.
Even when using the fluent Compress method, the payload was not compressed with the Mongo persistence the way it is with the SQL persistence.
In the end, I achieved compression/decompression by customizing the logic inside the PreCommit(CommitAttempt attempt) and Select(ICommit committed) methods.
Code used for compression:
using (var stream = new MemoryStream())
{
    using (var compressedStream = new GZipStream(stream, CompressionMode.Compress))
    {
        var serializer = new JsonSerializer
        {
            TypeNameHandling = TypeNameHandling.None,
            ReferenceLoopHandling = ReferenceLoopHandling.Ignore
        };
        var writer = new JsonTextWriter(new StreamWriter(compressedStream));
        serializer.Serialize(writer, this);
        writer.Flush();
    }
    return stream.ToArray();
}
Code used for decompression:
using (var stream = new MemoryStream(bytes))
{
    var decompressedStream = new GZipStream(stream, CompressionMode.Decompress);
    var serializer = new JsonSerializer
    {
        TypeNameHandling = TypeNameHandling.None,
        ReferenceLoopHandling = ReferenceLoopHandling.Ignore
    };
    var reader = new JsonTextReader(new StreamReader(decompressedStream));
    var body = serializer.Deserialize(reader, type);
    return body as Command;
}
I'm not sure if this is the right approach, or whether it will have any impact on the performance of EventStore operations like Insert and Select.

How to generate output files for each input in Apache Flink

I'm using Flink to process my streaming data.
The stream is coming from some other middleware, such as Kafka, Pravega, etc.
Say that Pravega is sending some word stream, for example hello world my name is....
What I need is a three-step process:
Map each word to my custom class object MyJson.
Map the object MyJson to String.
Write Strings to files: one String is written to one file.
For example, for the stream hello world my name is, I should get five files.
Here is my code:
// init Pravega connector
PravegaDeserializationSchema<String> adapter =
        new PravegaDeserializationSchema<>(String.class, new JavaSerializer<>());
FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
        .withPravegaConfig(pravegaConfig)
        .forStream(stream)
        .withDeserializationSchema(adapter)
        .build();

// map stream to MyJson
DataStream<MyJson> jsonStream = env.addSource(source).name("Pravega Stream")
        .map(new MapFunction<String, MyJson>() {
            @Override
            public MyJson map(String s) throws Exception {
                return JSON.parseObject(s, MyJson.class);
            }
        });

// map MyJson to String
DataStream<String> valueInJson = jsonStream
        .map(new MapFunction<MyJson, String>() {
            @Override
            public String map(MyJson myJson) throws Exception {
                return myJson.toString();
            }
        });
// output
valueInJson.print();
This code outputs all of the results to the Flink log files.
My question is: how do I write each word to its own output file?
I think the easiest way to do this would be with a custom sink.
stream.addSink(new WordFileSink());

public static class WordFileSink implements SinkFunction<String> {
    @Override
    public void invoke(String value, Context context) {
        // generate a unique name for the new file and open it
        // write the word to the file
        // close the file
    }
}
Note that this implementation won't necessarily provide exactly-once behavior. You might want to take care that the file naming scheme is both unique and deterministic (rather than depending on processing time), and be prepared for the case where the file may already exist.
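For illustration only, a possible invoke() body along those lines might look like this (the output directory and the word-derived file name are assumptions you would adapt; overwriting an existing file keeps re-processing harmless):

public static class WordFileSink implements SinkFunction<String> {

    @Override
    public void invoke(String value, Context context) throws Exception {
        // Deterministic file name derived from the word itself (illustrative only)
        java.nio.file.Path path = java.nio.file.Paths.get("/tmp/words", value + ".txt");
        java.nio.file.Files.createDirectories(path.getParent());
        // Overwrites any existing file, so replaying the same word is idempotent
        java.nio.file.Files.write(path, value.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }
}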

Use exchange message inside the .to() method in Apache Camel

I'm new to Camel and would like to change my route dynamically according to some logic performed beforehand.
camelContext.addRoutes(new RouteBuilder() {
    public void configure() {
        PropertiesComponent pc = getContext().getComponent("properties", PropertiesComponent.class);
        pc.setLocation("classpath:application.properties");
        log.info("About to start route: Kafka Server -> Log ");

        from("kafka:{{consumer.topic}}?brokers={{kafka.host}}:{{kafka.port}}"
                + "&maxPollRecords={{consumer.maxPollRecords}}"
                + "&consumersCount={{consumer.consumersCount}}"
                + "&seekTo={{consumer.seekTo}}"
                + "&groupId={{consumer.group}}"
                + "&valueDeserializer=" + BytesDeserializer.class.getName())
            .routeId("FromKafka")
            .process(new Processor() {
                @Override
                public void process(Exchange exchange) throws Exception {
                    System.out.println(" message: " + exchange.getIn().getBody());
                    Bytes body = exchange.getIn().getBody(Bytes.class);
                    HashMap data = (HashMap) SerializationUtils.deserialize(body.get());
                    // do some work on data;
                    Map messageBusDetails = new HashMap();
                    messageBusDetails.put("topicName", "someTopic");
                    messageBusDetails.put("producerOption", "bla");
                    exchange.getOut().setHeader("kafka", messageBusDetails);
                    exchange.getOut().setBody(SerializationUtils.serialize(data));
                }
            }).choice()
                .when(header("kafka"))
                    .to("kafka:" + getHeader("kafka").get("topicName")) // <-- pseudo-code
                    .log("${body}");
    }
});
getHeader("kafka").get("topicName")
this is what im trying to achieve.
But i dont know how to access the headers value ( which is a map - cause a kafka producer might have more configuration) inside the .to()
I understand i might be using it totally wrong... buts thats what i managed to understand until now...
The main goal is to have multiple message busses as .from()
and multiple message bus options in the .to() that will be decided via an external source (like config file) that way the same route will apply to many logic scenarios
and i thought the choice() method is the best answer
Thanks!
Instead of to(), you can use toD(), which is the "Dynamic To".
See the Camel documentation on toD for details.
And for the syntax to pull in various headers etc., see the Simple expression language page.
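
For example, assuming the "kafka" header holds a Map with a "topicName" entry as in the question, something along these lines should work (the Simple language supports map access on headers with the [key] notation; the broker placeholders are taken from the question's properties):

.choice()
    .when(header("kafka"))
        .toD("kafka:${header.kafka[topicName]}?brokers={{kafka.host}}:{{kafka.port}}")
        .log("${body}");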

Avro with Kafka - Deserializing with changing schema

Based on an Avro schema, I generated a class (Data) appropriate to the schema.
After that, I encode the data and send it to another application "A" using Kafka:
Data data; // <-- the object was initialized before; this is only the declaration, for example
EncoderFactory encoderFactory = EncoderFactory.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = encoderFactory.directBinaryEncoder(out, null);
DatumWriter<Data> writer = new SpecificDatumWriter<>(Data.class);
writer.write(data, encoder);
byte[] avroByteMessage = out.toByteArray();
On the other side (in application "A") I deserialize the data by implementing Deserializer:
class DataDeserializer implements Deserializer<Data> {

    private String encoding = "UTF8";

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // nothing to do
    }

    @Override
    public Data deserialize(String topic, byte[] data) {
        try {
            if (data == null) {
                return null;
            } else {
                DatumReader<Data> reader = new SpecificDatumReader<>(Data.class);
                DecoderFactory decoderFactory = DecoderFactory.get();
                BinaryDecoder decoder = decoderFactory.binaryDecoder(data, null);
                Data decoded = reader.read(null, decoder);
                return decoded;
            }
        } catch (Exception e) {
            throw new SerializationException("Error when deserializing byte[] to string due to unsupported encoding " + encoding);
        }
    }
}
The problem is that this approach requires the use of SpecificDatumReader, i.e. the Data class has to be integrated with the application code... This could be problematic: the schema could change, and then the Data class would have to be re-generated and integrated once more.
Two questions:
Should I use GenericDatumReader in the application? How do I do that correctly? (I can simply keep the schema in the application.)
Is there a simple way to work with SpecificDatumReader if Data changes? How could it be integrated without much trouble?
Thanks
I use GenericDatumReader -- well, actually I derive my reader class from it, but you get the point. To use it, I keep my schemas in a special Kafka topic -- Schema surprisingly enough. Consumers and producers both, on startup, read from this topic and configure their respective parsers.
If you do it like this, you can even have your consumers and producers update their schemas on the fly, without having to restart them. This was a design goal for me -- I didn't want to have to restart my applications in order to add or change schemas. Which is why SpecificDatumReader doesn't work for me, and honestly why I use Avro in the first place instead of something like Thrift.
Update
The normal way to do Avro is to store the schema in the file with the records. I don't do it that way, primarily because I can't. I use Kafka, so I can't store the schemas directly with the data -- I have to store the schemas in a separate topic.
The way I do it, first I load all of my schemas. You can read them from a text file; but like I said, I read them from a Kafka topic. After I read them from Kafka, I have an array like this:
val schemaArray: Array[String] = Array(
  """{"name":"MyObj","type":"record","fields":[...]}""",
  """{"name":"MyOtherObj","type":"record","fields":[...]}"""
)
Apologies for the Scala, BTW, but it's what I've got.
At any rate, you then need to create a parser and, for each schema, parse it, create readers and writers, and save them off to Maps:
val parser = new Schema.Parser()
val schemas = Map(schemaArray.map{s => parser.parse(s)}.map(s => (s.getName, s)):_*)
val readers = schemas.map(s => (s._1, new GenericDatumReader[GenericRecord](s._2)))
val writers = schemas.map(s => (s._1, new GenericDatumWriter[GenericRecord](s._2)))
var decoder: BinaryDecoder = null
I do all of that before I parse an actual record -- that's just to configure the parser. Then, to decode an individual record I would do:
val byteArray: Array[Byte] = ... // <-- Avro encoded record
val schemaName: String = ... // <-- name of the Avro schema
val reader = readers.get(schemaName).get
decoder = DecoderFactory.get.binaryDecoder(byteArray, decoder)
val record = reader.read(null, decoder)
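
If it helps, roughly the same decode step in Java would look like this (a sketch assuming you already have the schema JSON string and the Avro-encoded bytes in hand; the variable and field names are illustrative):

Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(schemaJson);              // the schema as a JSON string
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(encodedBytes, null);
GenericRecord record = reader.read(null, decoder);
Object field = record.get("someField");                // fields are accessed by name

Because the reader works with GenericRecord, nothing has to be re-generated when the schema changes; you only need the new schema string.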

Pax Exam: how to start multiple containers

For a project I'm working on, we need to write Pax Exam integration tests which run over multiple Karaf containers.
The idea would be to find a way to extend/configure Pax Exam to start up a Karaf container (or more), deploy a bunch of bundles there, and then start the test Karaf container which will test the functionality.
We need this for performance tests and other things.
Does anyone know anything about this? Is it actually possible with Pax Exam?
I'm answering my own question, after having found this interesting article:
http://planet.jboss.org/post/advanced_integration_testing_with_pax_exam_karaf
In particular, have a look at the sections Using the Karaf Shell and Distributed integration tests in Karaf.
This is basically what the article says:
First of all, you have to change the test probe header to allow dynamic package imports:
@ProbeBuilder
public TestProbeBuilder probeConfiguration(TestProbeBuilder probe) {
    probe.setHeader(Constants.DYNAMICIMPORT_PACKAGE, "*;status=provisional");
    return probe;
}
After that, the article suggests the following code, which is able to execute commands in the Karaf shell:
@Inject
CommandProcessor commandProcessor;

// 'executor' (an ExecutorService) and COMMAND_TIMEOUT are assumed to be fields of the test class
protected String executeCommands(final String... commands) {
    String response;
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final PrintStream printStream = new PrintStream(byteArrayOutputStream);
    final CommandSession commandSession = commandProcessor.createSession(System.in, printStream, System.err);
    FutureTask<String> commandFuture = new FutureTask<String>(
        new Callable<String>() {
            public String call() {
                try {
                    for (String command : commands) {
                        System.err.println(command);
                        commandSession.execute(command);
                    }
                } catch (Exception e) {
                    e.printStackTrace(System.err);
                }
                return byteArrayOutputStream.toString();
            }
        });
    try {
        executor.submit(commandFuture);
        response = commandFuture.get(COMMAND_TIMEOUT, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
        e.printStackTrace(System.err);
        response = "SHELL COMMAND TIMED OUT: ";
    }
    return response;
}
Then the rest is fairly straightforward: you will have to implement a layer able to start up child instances of Karaf:
public void createInstances() {
    // Install broker feature that is provided by FuseESB
    executeCommands("admin:create --feature broker brokerChildInstance");
    // Install producer feature provided by an imaginary feature repo
    executeCommands("admin:create --featureURL mvn:imaginary/repo/1.0/xml/features --feature producer producerChildInstance");
    // Install consumer feature provided by an imaginary feature repo
    executeCommands("admin:create --featureURL mvn:imaginary/repo/1.0/xml/features --feature consumer consumerChildInstance");
    // Start child instances
    executeCommands("admin:start brokerChildInstance");
    executeCommands("admin:start producerChildInstance");
    executeCommands("admin:start consumerChildInstance");
    // You will need to destroy the child instances once you are done;
    // using @After seems the right place to do that (see the sketch below).
}