I want to make a stream of small data by calling it again and again - apache-kafka

I have a question. I've got a small CSV dataset that I'm able to run on Flink with the help of Kafka. My question is: can I replay the same data again and again, using a window and trigger, or will it only be read once?
1,35
2,45
3,55
4,65
5,555
This is the data that I want to replay again and again. Though I don't think it's possible myself, it's better to get a second opinion as I'm a beginner. Thanks for the help.

I'm not sure what you mean by calling the data again and again, but you can create a stream of that data in Flink using a SourceFunction. For example, the following source turns that CSV content into a stream and re-emits it every second.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> csvStream = env.addSource(new SourceFunction<String>() {

    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<String> sourceContext) throws Exception {
        String data = "1,35\n" +
                "2,45\n" +
                "3,55\n" +
                "4,65\n" +
                "5,555";
        // Re-emit the same CSV payload once per second until the source is cancelled.
        while (isRunning) {
            sourceContext.collect(data);
            TimeUnit.SECONDS.sleep(1);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
});
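If you then want each CSV row as its own element rather than one blob, a minimal follow-up sketch (the names below are illustrative, not from the question) could split on newlines before printing; env.execute() is what actually starts the job:

// Illustrative continuation of the snippet above: split each emitted blob into rows.
DataStream<String> rows = csvStream
        .flatMap((String blob, Collector<String> out) -> {
            for (String line : blob.split("\n")) {
                out.collect(line);
            }
        })
        .returns(Types.STRING); // type hint needed because the lambda's generics are erased

rows.print();
env.execute("replay-small-csv");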

Related

How to trigger window if one of multiple Kafka topics are idle

I'm consuming multiple Kafka topics, windowing them hourly, and writing them into separate Parquet files for each topic. However, if one of the topics is idle, the window does not get triggered and nothing is written to the FS. For this example, I'm consuming 2 topics with a single partition each, with taskmanager.numberOfTaskSlots: 2 and parallelism.default: 1. What is the proper way of solving this problem in Apache Beam with the Flink Runner?
pipeline
  .apply(
    "ReadKafka",
    KafkaIO
      .read[String, String]
      .withBootstrapServers(bootstrapServers)
      .withTopics(topics)
      .withCreateTime(Duration.standardSeconds(0))
      .withReadCommitted
      .withKeyDeserializer(classOf[StringDeserializer])
      .withValueDeserializer(classOf[StringDeserializer])
      .withoutMetadata()
  )
  .apply("ConvertToMyEvent", MapElements.via(new KVToMyEvent()))
  .apply(
    "WindowHourly",
    Window.into[MyEvent](FixedWindows.of(Duration.standardHours(1)))
  )
  .apply(
    "WriteParquet",
    FileIO
      .writeDynamic[String, MyEvent]()
      .by(new BucketByEventName())
      //...
  )
A TimeWindow needs data: if the topic is idle, there is no data to close the window, and the window stays open until data arrives. If you want to window data based on processing time instead of actual event time, try using a simple process function:
public class MyProcessFunction extends
        KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType> {

    // The data types can be primitives like String or your custom classes
    private transient ValueState<Long> windowTime;

    @Override
    public void open(final Configuration conf) {
        final ValueStateDescriptor<Long> windowDesc = new ValueStateDescriptor<>("windowTime", Long.class);
        // State must be initialized like this; a plain field assignment does not work.
        this.windowTime = this.getRuntimeContext().getState(windowDesc);
    }

    @Override
    public void processElement(InputDataType input, Context context, Collector<OutputDataType> collector)
            throws IOException {
        if (this.windowTime.value() != null) {
            // Delete the previously registered timer if you want to reset the "window"
            context.timerService().deleteProcessingTimeTimer(this.windowTime.value());
        }
        // Register a timer that fires <window interval> (milliseconds recommended) from now
        long timerTimestamp = context.timerService().currentProcessingTime() + <window interval>;
        this.windowTime.update(timerTimestamp);
        context.timerService().registerProcessingTimeTimer(timerTimestamp);
        // ...
    }

    @Override
    public void onTimer(long timestamp,
                        KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType>.OnTimerContext context,
                        Collector<OutputDataType> collector) throws IOException {
        // This method is executed when the timer fires
        collector.collect( <whatever you want to stream out> ); // this data continues downstream in the pipeline
    }
}
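For completeness, a minimal sketch of how such a function might be wired in (the stream name and key selector below are hypothetical, not from the question):

// Hypothetical wiring: key the stream, then attach the processing-time "window" function.
DataStream<OutputDataType> out = inputStream
        .keyBy(record -> record.getKey())   // assumed key selector
        .process(new MyProcessFunction());  // onTimer() emits output even if a topic goes idle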

Apache Flink Stream join using DataStream API not outputting anything

I have two streams created from Kafka topics and I'm joining them using the DataStream API. I want the results of the join (apply) to be published to another Kafka topic, but I don't see the results of the join in the output topic.
I have confirmed that I'm publishing proper data to both source topics, so I'm not sure where it is going wrong. Here is the code snippet.
The streams are created as shown below.
DataStream<String> ms1 = env.addSource(new FlinkKafkaConsumer<>("top1", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(new WatermarkStrategy<String>() {
            @Override
            public WatermarkGenerator<String> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new AscendingTimestampsWatermarks<>();
            }

            @Override
            public TimestampAssigner<String> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return (event, timestamp) -> System.currentTimeMillis();
            }
        });

DataStream<String> ms2 = env.addSource(new FlinkKafkaConsumer<>("top2", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(new WatermarkStrategy<String>() {
            @Override
            public WatermarkGenerator<String> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
                return new AscendingTimestampsWatermarks<>();
            }

            @Override
            public TimestampAssigner<String> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
                return (event, timestamp) -> System.currentTimeMillis();
            }
        });
The stream join is performed using join-where-equals, as below.
DataStream<CountryData> joinedStreams = ms1.join(ms2)
        .where(o -> { String[] tokens = o.split("::"); return tokens[0]; })
        .equalTo(o -> { String[] tokens = o.split("::"); return tokens[0]; })
        .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
        .apply(new JoinFunction<String, String, CountryData>() {
            @Override
            public CountryData join(String o, String o2) throws Exception {
                String[] tokens1 = o.split("::");
                String[] tokens2 = o2.split("::");
                return new CountryData(tokens1[0], tokens1[1], tokens1[2],
                        Long.parseLong(tokens1[3]) + Long.parseLong(tokens2[3]));
            }
        });
The sink is added as below:
DataStreamSink<CountryData> dataStreamSink = joinedStreams
        .addSink(new FlinkKafkaProducer<CountryData>("localhost:9095", "flink-output", new CustomSchema()));
dataStreamSink.setParallelism(1);
dataStreamSink.name("KAFKA-TOPIC");
Any clue where it is going wrong? I can see the messages arriving in the topology.
Thanks
I think the two FlinkKafkaConsumer instances are missing a time extractor and a watermark configuration.
Since the code is using an event-time window join, it needs some kind of time information associated with the data found in Kafka in order to know which time window each event corresponds to.
Without that, events from both streams are probably never close enough in event time to match the 60s gap defined by EventTimeSessionWindows.withGap(Time.seconds(60)).
You also need to set the watermark parameter to tell Flink when to stop waiting for new data and materialize the output, so that you can see the join result.
Have a look at the Kafka connector time and watermark configuration for the various time-extraction and watermarking possibilities you have.
Finally, make sure you send test data spread over a long enough time period to your application. With event-time processing, only "old enough" data makes it to the output; young data is always "stuck in transit". For example, with a 60s window and, say, a 30s watermark, you would need at least 90s of data before you see anything in the output.
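As a rough illustration only (not the asker's code, and the 30-second bound is an arbitrary assumption), a bounded-out-of-orderness watermark strategy applied to one of the Kafka streams could look like this; by default it reuses the timestamps already attached to the records:

// Sketch only: the 30s out-of-orderness bound is an assumption; Duration is java.time.Duration.
WatermarkStrategy<String> strategy = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(30));
        // .withTimestampAssigner((event, recordTs) -> recordTs) // add a custom extractor if needed

DataStream<String> ms1 = env
        .addSource(new FlinkKafkaConsumer<>("top1", new SimpleStringSchema(), prop))
        .assignTimestampsAndWatermarks(strategy);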

How to enrich event stream with big file in Apache Flink?

I have a Flink application for click-stream collection and processing. The application consists of Kafka as the event source, a map function, and a sink.
I want to enrich the incoming click-stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.
A simplified slice of the CSV file is shown below:
start_ip,end_ip,country
"1.1.1.1","100.100.100.100","United States of America"
"100.100.100.101","200.200.200.200","China"
I have done some research and found a couple of potential solutions:
1. Solution: Broadcast the enrichment data and connect it with the event stream using some IP-matching logic.
1. Result: It worked well for a couple of sample IP location rows, but not with the whole CSV data. The JVM heap reached 3.5 GB, and because it is broadcast state there is no way to spill it to disk (RocksDB cannot be used for broadcast state).
2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction before event processing starts, and enrich the event data in the flatMap method.
2. Result: Because the enrichment data is too big to keep in the JVM heap, it's impossible to load it into ValueState. Also, de/serializing through ValueState is bad practice for data that is key-value in nature.
3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, via MapState.
3. Result: Trying to load the CSV file into MapState in the open() method gave me an error saying that you cannot put into MapState in open(), because open() does not run in a keyed context, just like in this question: Flink keyed stream key is null
4. Solution: Because MapState (to use RocksDB) needs a keyed context, I tried to load the whole CSV file into the local RocksDB instance (disk) in a process function, after turning the DataStream into a KeyedStream:
class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {

  var ipMapState: MapState[String, String] = _
  var csvFinishedFlag: ValueState[Boolean] = _

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[Long, Event, Event]#Context,
                              out: Collector[Event]): Unit = {

    val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
    val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])

    ipMapState = getRuntimeContext.getMapState(ipDescriptor)
    csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)

    if (!csvFinishedFlag.value()) {
      // Blocking, one-off load of the whole CSV file into MapState (start IP -> country)
      val csv = new CSVParser(defaultCSVFormat)
      val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
      for (row <- fileSource.getLines()) {
        val Some(List(start, end, country)) = csv.parseLine(row)
        ipMapState.put(start, country)
      }
      fileSource.close()
      csvFinishedFlag.update(true)
    }

    out.collect {
      if (ipMapState.contains(event.userIp)) {
        val country = ipMapState.get(event.userIp)
        event.copy(data =
          event.data.copy(
            ipLocation = Some(country)
          ))
      } else {
        event
      }
    }
  }
}
4. Result: It's too hacky, and it stalls event processing because of the blocking file-read operation.
Could you tell me what I can do in this situation?
Thanks
What you can do is implement a custom partitioner and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:
The job is organized like this:
DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));

DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
        .partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
        .flatMap(new EnrichmentFunctionWithPartitionedPreloading());
The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:
private static class SensorIdPartitioner implements Partitioner<Long> {
    @Override
    public int partition(final Long sensorMeasurement, final int numPartitions) {
        return Math.toIntExact(sensorMeasurement % numPartitions);
    }
}
And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:
public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {

    private Map<Long, SensorReferenceData> referenceData;

    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
    }

    @Override
    public void flatMap(
            final SensorMeasurement sensorMeasurement,
            final Collector<EnrichedMeasurements> collector) throws Exception {
        SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
        collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
    }

    private Map<Long, SensorReferenceData> loadReferenceData(
            final int partition,
            final int numPartitions) {
        SensorReferenceDataClient client = new SensorReferenceDataClient();
        return client.getSensorReferenceDataForPartition(partition, numPartitions);
    }
}
Note that the enrichment is not being done on a keyed stream, so you can not use keyed state or timers in the enrichment function.

Use exchange message inside the .to() method in apache camel

I'm new to Camel and would like to change my route dynamically according to some logic performed beforehand.
camelContext.addRoutes(new RouteBuilder() {
    public void configure() {
        PropertiesComponent pc = getContext().getComponent("properties", PropertiesComponent.class);
        pc.setLocation("classpath:application.properties");

        log.info("About to start route: Kafka Server -> Log ");

        from("kafka:{{consumer.topic}}?brokers={{kafka.host}}:{{kafka.port}}"
                + "&maxPollRecords={{consumer.maxPollRecords}}"
                + "&consumersCount={{consumer.consumersCount}}"
                + "&seekTo={{consumer.seekTo}}"
                + "&groupId={{consumer.group}}"
                + "&valueDeserializer=" + BytesDeserializer.class.getName())
            .routeId("FromKafka")
            .process(new Processor() {
                @Override
                public void process(Exchange exchange) throws Exception {
                    System.out.println(" message: " + exchange.getIn().getBody());
                    Bytes body = exchange.getIn().getBody(Bytes.class);
                    HashMap data = (HashMap) SerializationUtils.deserialize(body.get());
                    // do some work on data;
                    Map<String, Object> messageBusDetails = new HashMap<>();
                    messageBusDetails.put("topicName", "someTopic");
                    messageBusDetails.put("producerOption", "bla");
                    exchange.getOut().setHeader("kafka", messageBusDetails);
                    exchange.getOut().setBody(SerializationUtils.serialize(data));
                }
            })
            .choice()
                .when(header("kafka"))
                    .to("kafka:" + getHeader("kafka").get("topicName")) // <-- this is what I'm trying to achieve
                    .log("${body}");
    }
});
getHeader("kafka").get("topicName")
this is what im trying to achieve.
But i dont know how to access the headers value ( which is a map - cause a kafka producer might have more configuration) inside the .to()
I understand i might be using it totally wrong... buts thats what i managed to understand until now...
The main goal is to have multiple message busses as .from()
and multiple message bus options in the .to() that will be decided via an external source (like config file) that way the same route will apply to many logic scenarios
and i thought the choice() method is the best answer
Thanks!
Instead of to(), you can use toD(), which is the "Dynamic To".
See this for details.
For the syntax to pull in various headers etc., see the Simple expression page.
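For example, a minimal sketch of what that might look like inside configure(), assuming the "kafka" header has already been set to the map built by the processor shown in the question (the endpoint options here are just placeholders):

// toD() evaluates its Simple expression per exchange, and Simple's OGNL-style syntax
// can index into a Map-typed header, e.g. ${header.kafka[topicName]}.
from("kafka:{{consumer.topic}}?brokers={{kafka.host}}:{{kafka.port}}")
    .routeId("FromKafka")
    // ... processor that sets the "kafka" header map, as in the question ...
    .choice()
        .when(header("kafka"))
            .toD("kafka:${header.kafka[topicName]}?brokers={{kafka.host}}:{{kafka.port}}")
            .log("${body}")
    .end();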

Apache Beam count HBase row block and not return

I have started to try out Apache Beam and am using it to read and count an HBase table. When I read the table without Count.globally, it can read the rows, but when I try to count the number of rows, the process hangs and never exits.
Here is the very simple code:
Pipeline p = Pipeline.create(options);

p.apply("read", HBaseIO.read().withConfiguration(configuration).withTableId(HBASE_TABLE))
        .apply(ParDo.of(new DoFn<Result, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                Result result = c.element();
                String rowkey = Bytes.toString(result.getRow());
                System.out.println("row key: " + rowkey);
                c.output(rowkey);
            }
        }))
        .apply(Count.<String>globally())
        .apply("FormatResults", MapElements.via(new SimpleFunction<Long, String>() {
            public String apply(Long element) {
                System.out.println("result: " + element.toString());
                return element.toString();
            }
        }));
When I use Count.globally, the process never finishes. When I comment it out, the process prints all the rows.
Any ideas?
Which version of beam are you using?
Thanks for bringing up this issue. I tried to reproduce your case, and indeed there seems to be an issue with colliding versions of Guava that breaks transforms with HBaseIO. I sent a pull request to fix the shading; I will keep you updated once it is merged so you can test whether it works.
Thanks again.