I have a requirement to add a trace and span id to Flink jobs running in a cluster. The request flows something like below:
User --> Rest API -> Kafka-topic-1 --> FlinkJob-1 --> Kafka-topic-2 --> FlinkJob-2 --> Consumer --> DB
I'm using Spring Boot to create my REST APIs and Spring Sleuth to add the trace and span id to the generated logs. The trace and span id is added when the REST API is invoked and when the message is put on Kafka-topic-1, but I'm not able to figure out how to add the trace and span id while consuming messages in FlinkJob-1 and FlinkJob-2, since they are outside the Spring context.
One way is to put the trace and span id into the Kafka message headers and use Kafka consumer/producer interceptors to extract and log them. I tried this, but my interceptors are not invoked because the Flink APIs use Flink's own Kafka client.
I also couldn't get my custom KafkaDeserializationSchema invoked:
public class MyDeserializationSchema implements KafkaDeserializationSchema<String> {

    private static final Logger LOGGER = LoggerFactory.getLogger(MyDeserializationSchema.class);

    @Override
    public TypeInformation<String> getProducedType() {
        System.out.println("************** Invoked 1");
        LOGGER.debug("************** Invoked 1");
        return null;
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        System.out.println("************** Invoked 2");
        LOGGER.debug("************** Invoked 2");
        return true;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        System.out.println("************** Invoked 3");
        LOGGER.debug("************** Invoked 3");
        return record.toString();
    }
}
Can someone please suggest how to achieve this?
You can use KafkaDeserializationSchema in order to get the headers as well.
For accessing the key, value, and metadata of the Kafka message, KafkaDeserializationSchema has the following deserialize method: T deserialize(ConsumerRecord<byte[], byte[]> record).
public class Bla implements KafkaDeserializationSchema<DCEvents> {

    @Override
    public boolean isEndOfStream(DCEvents dcEvents) {
        return false;
    }

    @Override
    public DCEvents deserialize(ConsumerRecord<byte[], byte[]> consumerRecord) throws Exception {
        return null;
    }

    @Override
    public TypeInformation<DCEvents> getProducedType() {
        return null;
    }
}
You are using a simple String here, and deserializing the bytes to a String can be done something like the code below:
public class MyDeserializationSchema implements KafkaDeserializationSchema<String> {

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        return new String(record.value(), "UTF-8");
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}
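Since the original question is about propagating the trace and span id, the same deserialize method can also read them from the Kafka record headers and put them on the logging MDC before anything is logged. A minimal sketch, not part of the original answer; the header names "traceId" and "spanId" are assumptions and must match whatever the producer side actually writes:

import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;
import org.slf4j.MDC;

public class TracingDeserializationSchema implements KafkaDeserializationSchema<String> {

    @Override
    public boolean isEndOfStream(String nextElement) {
        return false;
    }

    @Override
    public String deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        copyHeaderToMdc(record, "traceId"); // assumed header name
        copyHeaderToMdc(record, "spanId");  // assumed header name
        return new String(record.value(), StandardCharsets.UTF_8);
    }

    // Copies a single Kafka header value onto the SLF4J MDC so it shows up in log patterns.
    private void copyHeaderToMdc(ConsumerRecord<byte[], byte[]> record, String name) {
        Header header = record.headers().lastHeader(name);
        if (header != null) {
            MDC.put(name, new String(header.value(), StandardCharsets.UTF_8));
        }
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

Note that the MDC is thread-local, so this only covers logging done on the same task thread that deserialized the record.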
The Flink consumer application I am developing reads from multiple Kafka topics. The messages published in the different topics adhere to the same schema (formatted as Avro). For schema management, I am using the Confluent Schema Registry.
I have been using the following snippet for the KafkaSource and it works just fine.
KafkaSource<MyObject> source = KafkaSource.<MyObject>builder()
        .setBootstrapServers(BOOTSTRAP_SERVERS)
        .setTopics(TOPIC_1, TOPIC_2)
        .setGroupId(GROUP_ID)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(ConfluentRegistryAvroDeserializationSchema.forSpecific(MyObject.class, SCHEMA_REGISTRY_URL))
        .build();
Now, I want to determine the topic name for each message that I process. Since the current deserializer is value-only, I started looking into the setDeserializer() method, which I felt would give me access to the whole ConsumerRecord object so that I can fetch the topic name from it.
However, I am unable to figure out how to use that implementation. Should I implement my own deserializer? If so, how does the Schema registry fit into that implementation?
You can use the setDeserializer method with a KafkaRecordDeserializationSchema that might look something like this:
public class KafkaUsageRecordDeserializationSchema
        implements KafkaRecordDeserializationSchema<UsageRecord> {

    private static final long serialVersionUID = 1L;

    private transient ObjectMapper objectMapper;

    @Override
    public void open(DeserializationSchema.InitializationContext context) throws Exception {
        KafkaRecordDeserializationSchema.super.open(context);
        objectMapper = JsonMapper.builder().build();
    }

    @Override
    public void deserialize(
            ConsumerRecord<byte[], byte[]> consumerRecord,
            Collector<UsageRecord> collector) throws IOException {
        collector.collect(objectMapper.readValue(consumerRecord.value(), UsageRecord.class));
    }

    @Override
    public TypeInformation<UsageRecord> getProducedType() {
        return TypeInformation.of(UsageRecord.class);
    }
}
Then you can use the ConsumerRecord to access the topic and other metadata.
I took inspiration from the above answer (by David) and added the following custom deserializer -
KafkaSource<Event> source = KafkaSource.<Event>builder()
        .setBootstrapServers(BOOTSTRAP_SERVERS)
        .setTopics(TOPIC_1, TOPIC_2)
        .setGroupId(GROUP_ID)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializer(KafkaRecordDeserializationSchema.of(new KafkaDeserializationSchema<Event>() {

            private final DeserializationSchema<MyObject> deserializationSchema =
                    ConfluentRegistryAvroDeserializationSchema.forSpecific(MyObject.class, SCHEMA_REGISTRY_URL);

            @Override
            public boolean isEndOfStream(Event nextElement) {
                return false;
            }

            @Override
            public Event deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
                Event event = new Event();
                event.setTopicName(record.topic());
                event.setMyObject(deserializationSchema.deserialize(record.value()));
                return event;
            }

            @Override
            public TypeInformation<Event> getProducedType() {
                return TypeInformation.of(Event.class);
            }
        }))
        .build();
The Event class is a wrapper over the MyObject class with an additional field for storing the topic name.
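For completeness, a sketch of what such a wrapper might look like; everything beyond the setTopicName and setMyObject accessors used above is an assumption:

public class Event {

    private String topicName;
    private MyObject myObject;

    public String getTopicName() { return topicName; }

    public void setTopicName(String topicName) { this.topicName = topicName; }

    public MyObject getMyObject() { return myObject; }

    public void setMyObject(MyObject myObject) { this.myObject = myObject; }
}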
I'm trying to read JSON events from Kafka, aggregate them by eventId and category, and write the results to a different Kafka topic through Flink. The program is able to read messages from Kafka, but the KafkaSink is not writing the data back to the other Kafka topic. I'm not sure what mistake I'm making. Can someone please check and let me know where I'm going wrong? Here is the code I'm using.
KafkaSource<EventMessage> source = KafkaSource.<EventMessage>builder()
        .setBootstrapServers(LOCAL_KAFKA_BROKER)
        .setTopics(INPUT_KAFKA_TOPIC)
        .setGroupId(LOCAL_GROUP)
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new InputDeserializationSchema())
        .build();

WindowAssigner<Object, TimeWindow> windowAssigner = TumblingEventTimeWindows.of(WINDOW_SIZE);

DataStream<EventMessage> eventStream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Event Source");

DataStream<EventSummary> events =
        eventStream
                .keyBy(eventMessage -> eventMessage.getCategory() + eventMessage.getEventId())
                .window(windowAssigner)
                .aggregate(new EventAggregator())
                .name("EventAggregator test >> ");

KafkaSink<EventSummary> sink = KafkaSink.<EventSummary>builder()
        .setBootstrapServers(LOCAL_KAFKA_BROKER)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic(OUTPUT_KAFKA_TOPIC)
                .setValueSerializationSchema(new OutputSummarySerializationSchema())
                .build())
        .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build();

events.sinkTo(sink);
These are the POJOs I've created for the input message and the output.
// EventMessage POJO
public class EventMessage implements Serializable {

    private Long timestamp;
    private int eventValue;
    private String eventId;
    private String category;

    public EventMessage() { }

    public EventMessage(Long timestamp, int eventValue, String eventId, String category) {
        this.timestamp = timestamp;
        this.eventValue = eventValue;
        this.eventId = eventId;
        this.category = category;
    }

    // ...
}

// EventSummary POJO
public class EventSummary {

    public EventMessage eventMessage;
    public int sum;
    public int count;

    public EventSummary() { }

    // ...
}
These are the deserialization and serialization schemas I'm using.
public class InputDeserializationSchema implements DeserializationSchema<EventMessage> {

    static ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public EventMessage deserialize(byte[] bytes) throws IOException {
        return objectMapper.readValue(bytes, EventMessage.class);
    }

    @Override
    public boolean isEndOfStream(EventMessage inputMessage) {
        return false;
    }

    @Override
    public TypeInformation<EventMessage> getProducedType() {
        return TypeInformation.of(EventMessage.class);
    }
}
public class OutputSummarySerializationSchema implements SerializationSchema<EventSummary> {

    static ObjectMapper objectMapper = new ObjectMapper();

    Logger logger = LoggerFactory.getLogger(OutputSummarySerializationSchema.class);

    @Override
    public byte[] serialize(EventSummary eventSummary) {
        if (objectMapper == null) {
            // create the mapper before configuring it
            objectMapper = new ObjectMapper();
            objectMapper.setVisibility(PropertyAccessor.FIELD, JsonAutoDetect.Visibility.ANY);
        }
        try {
            String json = objectMapper.writeValueAsString(eventSummary);
            return json.getBytes();
        } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
            logger.error("Failed to parse JSON", e);
        }
        return new byte[0];
    }
}
I'm using this aggregator for aggregating the JSON messages.
public class EventAggregator implements AggregateFunction<EventMessage, EventSummary, EventSummary> {

    private static final Logger log = LoggerFactory.getLogger(EventAggregator.class);

    @Override
    public EventSummary createAccumulator() {
        return new EventSummary();
    }

    @Override
    public EventSummary add(EventMessage eventMessage, EventSummary eventSummary) {
        eventSummary.eventMessage = eventMessage;
        eventSummary.count += 1;
        eventSummary.sum += eventMessage.getEventValue();
        return eventSummary;
    }

    @Override
    public EventSummary getResult(EventSummary eventSummary) {
        return eventSummary;
    }

    @Override
    public EventSummary merge(EventSummary summary1, EventSummary summary2) {
        return new EventSummary(null,
                summary1.sum + summary2.sum,
                summary1.count + summary2.count);
    }
}
Can someone help me on this?
Thanks in advance.
In order for event time windowing to work, you must specify a proper WatermarkStrategy. Otherwise, the windows will never close, and no results will be produced.
The role that watermarks play is to mark a place in a stream, and indicate that the stream is, at that point, complete through some specific timestamp. Until receiving this indicator of stream completeness, windows continue to wait for more events to be assigned to them.
To simplify debugging the watermarks, you might switch to a PrintSink until you get the watermarking working properly. Or, to simplify debugging the KafkaSink, you could switch to using processing-time windows until the sink is working.
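For example, a minimal watermark strategy sketch; the five-second out-of-orderness bound and the use of getTimestamp() as the event-time field are assumptions, not something prescribed by the question:

// Requires org.apache.flink.api.common.eventtime.WatermarkStrategy and java.time.Duration.
// Assumes EventMessage.getTimestamp() holds epoch milliseconds and events arrive
// at most ~5 seconds out of order.
WatermarkStrategy<EventMessage> watermarkStrategy = WatermarkStrategy
        .<EventMessage>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, recordTimestamp) -> event.getTimestamp());

DataStream<EventMessage> eventStream =
        env.fromSource(source, watermarkStrategy, "Event Source");

With this in place, a window can close once the watermark passes the end of the window, and the aggregated results can reach the KafkaSink.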
How would one detect, from within the FlatFileItemReader's BufferedReader (created by a configured custom BufferedReaderFactory bean), that the Job has been signaled to stop? This custom BufferedReader tails the file and waits for more input indefinitely, so the Job's STOPPING status isn't detected because the code is blocked outside the Spring Batch framework.
I can do this with a Step1 -> Tasklet -> Step1 flow loop that does a Thread.sleep in the Tasklet, but the nature of this constantly growing file means I'd hit EOF every couple of seconds and generate a huge number of StepExecution rows in the database.
public class TailingBufferedReaderFactory implements BufferedReaderFactory {

    @Override
    public BufferedReader create(Resource resource, String encoding) throws IOException {
        return new TailingBufferedReader(new InputStreamReader(resource.getInputStream(), encoding));
    }
}

public class TailingBufferedReader extends BufferedReader implements JobExecutionListener {

    private final long waitDurationMillis = 1000L; // illustrative poll interval before re-reading

    private JobExecution jobExecution;

    public TailingBufferedReader(Reader in) {
        super(in);
    }

    @Override
    public String readLine() throws IOException {
        while (!jobExecution.isStopping()) { // The elusive Job Execution status check
            var line = super.readLine();
            if (line == null) {
                try {
                    Thread.sleep(waitDurationMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
                continue;
            }
            return line;
        }
        return null;
    }

    // Ideally something like this configured on the Job
    @Override
    public void beforeJob(JobExecution jobExecution) {
        this.jobExecution = jobExecution;
    }

    @Override
    public void afterJob(JobExecution jobExecution) {}
}
I have exposed a REST service as below:
restConfiguration().component("servlet").bindingMode(RestBindingMode.json);
rest("/batchFile").consumes("application/json").post("/routeStart").type(BatchFileRouteConfig.class).to("startRouteProcessor");
Based upon the request from the REST service, I start the Camel route in a processor as below:
#Component("startRouteProcessor")
public class StartRouteProcessor implements Processor {
public void process(Exchange exchange) throws Exception {
BatchFileRouteConfig config = exchange.getIn().getBody(BatchFileRouteConfig.class);
String routeId = config.getRouteId();
String sourceLocation = config.getSourceLocation();
exchange.getContext().startRoute(routeId);
}
}
I need to pass the sourceLocation from the above bean to the route below:
@Component
public class FileReaderRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        from("file:sourceLocation")
                .log("File Reader Route route started");
    }
}
The above is sample code. I'd appreciate help with passing the sourceLocation from StartRouteProcessor to FileReaderRoute.
This is not possible, since in your example the FileReaderRoute is already started by the time the batchFile endpoint is called.
You can do it in a slightly different way.
Extract your FileReaderRoute to a direct: endpoint. Something like:
@Component
public class FileReaderRoute extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        from("direct:fileReaderCommon")
                .log("File Reader Route route started");
    }
}
And then you can create a new route at runtime:
#Component("startRouteProcessor")
public class StartRouteProcessor implements Processor {
public void process(Exchange exchange) throws Exception {
BatchFileRouteConfig config = exchange.getIn().getBody(BatchFileRouteConfig.class);
exchange.getContext().addRoutes(new RouteBuilder() {
#Override
public void configure() throws Exception {
from("file:"+config.getSourceLocation())
.routeId(config.getRouteId())
.to("direct:fileReaderCommon");
}
});
}
}
Do not forget to sanitize the input sufficiently, since you are allowing the user to create a file consumer based on user input. With this approach, there is a high risk of a path traversal attack.
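For example, a minimal validation sketch; the helper class name and the allowed base directory are assumptions, not part of the original answer:

import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that rejects sourceLocation values escaping an allowed base directory.
public final class SourceLocationValidator {

    // The base directory is an assumption; point it at wherever file consumers may read from.
    private static final Path ALLOWED_BASE = Paths.get("/data/batch-input").toAbsolutePath().normalize();

    private SourceLocationValidator() { }

    public static Path validate(String sourceLocation) {
        Path requested = ALLOWED_BASE.resolve(sourceLocation).normalize();
        if (!requested.startsWith(ALLOWED_BASE)) {
            throw new IllegalArgumentException("sourceLocation escapes the allowed directory: " + sourceLocation);
        }
        return requested;
    }
}

The processor could then build the endpoint as from("file:" + SourceLocationValidator.validate(config.getSourceLocation())) instead of interpolating the raw value.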
The question is: how to make an ItemReader in Spring Batch deliver a list instead of a single object?
I have searched around; some answers suggest modifying the item reader to return a list of objects and changing the item processor to accept a list as input.
How do I code such an item reader?
Take a look at the official Spring Batch documentation for ItemReader:
public interface ItemReader<T> {

    T read() throws Exception, UnexpectedInputException, ParseException;
}

// so it is as easy as
public class ReturnsListReader implements ItemReader<List<?>> {

    public List<?> read() throws Exception {
        // ... reader logic
    }
}
The processor works the same way:
public class FooProcessor implements ItemProcessor<List<?>, List<?>> {

    @Override
    public List<?> process(List<?> item) throws Exception {
        // ... logic
    }
}
Instead of returning a list, the processor can return anything, e.g. a String:
public class FooProcessor implements ItemProcessor<List<?>, String> {

    @Override
    public String process(List<?> item) throws Exception {
        // ... logic
    }
}