zipWithIndex on Apache Flink - scala

I'd like to assign each row of my input an ID, which should be a number from 0 to N - 1, where N is the number of rows in the input.
Roughly, I'd like to be able to do something like the following:
val data = sc.textFile(textFilePath, numPartitions)
val rdd = data.map(line => process(line))
val rddMatrixLike = rdd.zipWithIndex.map { case (v, idx) => someStuffWithIndex(idx, v) }
Is something like this possible in Apache Flink?

This is now part of the 0.10-SNAPSHOT release of Apache Flink. Examples for zipWithIndex(in) and zipWithUniqueId(in) are available in the official Flink documentation.
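For completeness, here is a minimal sketch of calling the built-in utility from the Java DataSet API (the DataSetUtils helper and its exact signature should be checked against the documentation of your Flink version):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithIndexExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> in = env.fromElements("first", "second", "third");
        // assigns consecutive ids 0 .. N-1 across all parallel partitions
        DataSet<Tuple2<Long, String>> indexed = DataSetUtils.zipWithIndex(in);
        indexed.print();
    }
}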

Here is a simple implementation of the function:
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.functions.RichMapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class ZipWithIndex {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> in = ee.readTextFile("/home/robert/flink-workdir/debug/input");

        // count elements in each partition
        DataSet<Tuple2<Integer, Long>> counts = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Integer, Long>>() {
            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Integer, Long>> out) throws Exception {
                long cnt = 0;
                for (String v : values) {
                    cnt++;
                }
                // emit (subtask index, number of elements in this partition)
                out.collect(new Tuple2<Integer, Long>(getRuntimeContext().getIndexOfThisSubtask(), cnt));
            }
        });

        DataSet<Tuple2<Long, String>> result = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Long, String>>() {
            long start = 0;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                List<Tuple2<Integer, Long>> offsets = getRuntimeContext().getBroadcastVariable("counts");
                // order the per-partition counts by subtask index
                Collections.sort(offsets, new Comparator<Tuple2<Integer, Long>>() {
                    @Override
                    public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
                        return ZipWithIndex.compare(o1.f0, o2.f0);
                    }
                });
                // the first id of this partition is the sum of the counts of all preceding partitions
                for (int i = 0; i < getRuntimeContext().getIndexOfThisSubtask(); i++) {
                    start += offsets.get(i).f1;
                }
            }

            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Long, String>> out) throws Exception {
                for (String v : values) {
                    out.collect(new Tuple2<Long, String>(start++, v));
                }
            }
        }).withBroadcastSet(counts, "counts");

        result.print();
    }

    public static int compare(int x, int y) {
        return (x < y) ? -1 : ((x == y) ? 0 : 1);
    }
}
This is how it works: I'm using the first mapPartition() operation to go over all the elements in each partition and count how many elements are there.
I need to know the number of elements in each partition to properly set the offsets when assigning the IDs to the elements.
The result of the first mapPartition is a DataSet containing mappings from the subtask (partition) index to the number of elements in that partition. I'm broadcasting this DataSet to all of the second mapPartition() operators, which assign the IDs to the elements from the input.
In the open() method of the second mapPartition() I compute the offset for each partition by summing the counts of all preceding partitions.
I'm probably going to contribute the code to Flink (after discussing it with the other committers).

Related

write to multiple Kafka topics in apache-beam?

I am running a simple word count program where I use one Kafka topic (producer) as the input source and then apply a ParDo to it to calculate the word count. Now I need help writing the words to different topics based on their frequency. Let's say all words with an even frequency go to topic 1 and the rest go to topic 2.
Can anyone help me with an example?
This can be done using the KafkaIO writeRecords method, which writes ProducerRecord<key, value> elements, each built with new ProducerRecord<>("topic_name", "key", "value").
Below is the code:
static class ExtractWordsFn extends DoFn<String, String> {
    private final Counter emptyLines = Metrics.counter(ExtractWordsFn.class, "emptyLines");
    private final Distribution lineLenDist =
        Metrics.distribution(ExtractWordsFn.class, "lineLenDistro");

    @ProcessElement
    public void processElement(@Element String element, OutputReceiver<String> receiver) {
        lineLenDist.update(element.length());
        if (element.trim().isEmpty()) {
            emptyLines.inc();
        }
        String[] words = element.split(ExampleUtils.TOKENIZER_PATTERN, -1);
        for (String word : words) {
            if (!word.isEmpty()) {
                receiver.output(word);
            }
        }
    }
}

public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, ProducerRecord<String, String>> {
    @Override
    public ProducerRecord<String, String> apply(KV<String, Long> input) {
        // route words with an even count to topic "test" and the rest to topic "copy"
        if (input.getValue() % 2 == 0) {
            return new ProducerRecord<>("test", input.getKey(), input.getKey() + " " + input.getValue().toString());
        } else {
            return new ProducerRecord<>("copy", input.getKey(), input.getKey() + " " + input.getValue().toString());
        }
    }
}

public static class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
        PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
        PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());
        return wordCounts;
    }
}

p.apply("ReadLines", KafkaIO.<Long, String>read()
        .withBootstrapServers("localhost:9092")
        .withTopic("copy") // use withTopics(List<String>) to read from multiple topics.
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))
        .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
        .withLogAppendTime()
        .withReadCommitted()
        .commitOffsetsInFinalize()
        .withProcessingTime()
        .withoutMetadata())
    .apply(Values.create())
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(new CountWords())
    .apply(MapElements.via(new FormatAsTextFn())) // PCollection<ProducerRecord<String, String>>
    .setCoder(ProducerRecordCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
    .apply("WriteCounts", KafkaIO.<String, String>writeRecords()
        .withBootstrapServers("localhost:9092")
        // .withTopic("test") // not needed: the topic comes from each ProducerRecord
        .withKeySerializer(StringSerializer.class)
        .withValueSerializer(StringSerializer.class));

How to merge two streams and perform stateful operations on merged stream using Apache Beam

I have 2 Kafka streams. I want to merge them by some key, and on top of the merged stream I want to perform a stateful operation so that I can sum up the counts from both streams.
This is what I tried, but it didn't work:
PCollection<String> stream1 = .. read from kafka
PCollection<String> stream2 = .. read from kafka
PCollection<KV<String, Long>> wordCount1 = stream1.apply(...)
PCollection<KV<String, Long>> wordCount2 = stream2.apply(...)
PCollection<KV<String, Long>> merged = merge wordCount1 and wordCount2 using CoGroupByKey
PCollection<KV<String, Long>> finalStream = merged.apply(...)
// for finalStream, apply state
public class KafkaWordCount implements Serializable {
private String kafkaBrokers =null;
private String topic =null;
public KafkaWordCount(String brokers, String topic){
this.kafkaBrokers =brokers;
this.topic =topic;
}
public PCollection<KV<String,Long>> build(Pipeline p){
final String myState="HELLO";
PCollection<KV<String,Long>> res =
p.apply(KafkaIO.<Long, String>read()
.withBootstrapServers(this.kafkaBrokers )
.withTopic(this.topic)
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class))
.apply(ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
KafkaRecord<Long, String> record = processContext.element();
processContext.output(record.getKV().getValue());
}
}))
.apply("ExtractWords",
ParDo.of(new DoFn<String, KV<String, Long>>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().split("[^\\p{L}]+")) {
if (!word.isEmpty()) {
c.output(KV.of(word,1L));
}
}
}
}));
return res;
}
}
public class DataPipe {
public static void main(String[] args) {
final String stateId = "myMapState";
final String myState = "myState";
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class).setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KV<String,Long>> stream1 =
new KafkaWordCount("localhost:9092","idm")
.build(p)
.apply(
Window
.<KV<String,Long>>into(
FixedWindows.of(Duration.millis(3600000)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
PCollection<KV<String,Long>> stream2 =
new KafkaWordCount("localhost:9092","assist")
.build(p)
.apply(
Window
.<KV<String,Long>>into(
FixedWindows.of(Duration.millis(3600000)))
.triggering(
Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes());
final TupleTag<Long> web = new TupleTag<Long>();
final TupleTag<Long> assist = new TupleTag<Long>();
PCollection<KV<String, CoGbkResult>> joinedStream =
KeyedPCollectionTuple.of(web, stream1)
.and(assist, stream2)
.apply(CoGroupByKey.<String>create());
PCollection<KV<String,Long>> finalCountStream =
joinedStream
.apply(ParDo.of(
new DoFn<KV<String, CoGbkResult>, KV<String,Long>>() {
@StateId(stateId)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext processContext,
@StateId(stateId) MapState<String, Long> state) {
KV<String,CoGbkResult> element = processContext.element();
Iterable<Long> count1 = element.getValue().getAll(web);
Iterable<Long> count2 = element.getValue().getAll(assist);
Long sumAmount =
StreamSupport
.stream(
Iterables
.concat(count1, count2)
.spliterator(),
false)
.collect(Collectors.summingLong(n -> n));
System.out.println(element.getKey()+"::"+sumAmount);
// processContext.output(element.getKey()+"::"+sumAmount);
Long currCount = state.get(element.getKey()).read()==null? 0L:state.get(element.getKey()).read();
Long newCount = currCount+sumAmount;
state.put(element.getKey(),sumAmount);
processContext.output(KV.of(element.getKey(),sumAmount));
}
}));
finalCountStream
.apply(ParDo.of(new DoFn<KV<String,Long>, KV<String,Long>>() {
@ProcessElement
public void processElement(ProcessContext processContext) {
processContext.output(processContext.element());
}
}))
.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext c,
@StateId(myState) MapState<String, Long> state){
KV<String,Long> e = c.element();
System.out.println("Thread ID :"
+ Thread.currentThread().getId());
Long currCount =
state.get(e.getKey()).read()==null
? 0L
: state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
}))
.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values());
/* finalCountStream.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values()
);*/
//finalCountStream.apply(TextIO.write().to("wordcounts"));
p.run().waitUntilFinish();
}
}
This Beam pipeline reads text from two Kafka topics, splits it into words, merges both streams by word, and finally emits the combined word counts to another Kafka topic.
For reference: if I understand the problem correctly, you can simplify the first part of your pipeline by using KafkaIO.withTopics(List<String>) to read from two (or more) topics in one step, so there is no need to join data from the different topics afterwards.
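A minimal sketch of that simplification, reusing the broker address and topic names from the question (not tested against a running cluster; the rest of the pipeline, i.e. word extraction, windowing, counting and state, can stay as it is):

PCollection<String> lines =
    p.apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopics(Arrays.asList("idm", "assist")) // read both input topics with one source
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
     .apply(Values.<String>create());
// ... then apply the existing ExtractWords / windowing / stateful count on `lines`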

How to throttle flink output to kafka?

I want to send 100 messages/second from my stream to a Kafka topic. I have more than enough data in the stream to do so.
So far, I have found the windowing concept, but I am unable to adapt it to my use case.
You could do this easily with a ProcessFunction. You would keep a counter in Flink state, and only emit elements when the counter is less than 100. Meanwhile, use a timer to reset the counter to zero once a second.
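A minimal sketch of that counter-plus-timer idea (class, state and constant names are made up for illustration; like the answer's code further down, this is a KeyedProcessFunction, so the limit applies per key, and elements above the per-second budget are simply dropped here, so buffering would have to be added if no element may be lost):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ThrottleFunction extends KeyedProcessFunction<String, String, String> {
    private static final long LIMIT_PER_SECOND = 100;
    private transient ValueState<Long> emittedThisSecond;

    @Override
    public void open(Configuration parameters) {
        emittedThisSecond = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("emitted-this-second", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long emitted = emittedThisSecond.value();
        if (emitted == null) {
            emitted = 0L;
            // first element of this second: schedule a counter reset one second from now
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 1000);
        }
        if (emitted < LIMIT_PER_SECOND) {
            emittedThisSecond.update(emitted + 1);
            out.collect(value);
        }
        // elements beyond the limit are dropped until the timer clears the counter
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        emittedThisSecond.clear();
    }
}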
For Flink v1.15, I created this function.
Refer to checkpointing_under_backpressure and process_function in the Flink documentation.
public class RateLimitFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> counter;
    private transient ValueState<Long> lastTimestamp;

    private final Long count;
    private final Long millisecond;

    public RateLimitFunction(Long count, Long millisecond) {
        this.count = count;
        this.millisecond = millisecond;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        counter = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("counter", TypeInformation.of(Long.class)));
        lastTimestamp = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("last-timestamp", TypeInformation.of(Long.class)));
    }

    @Override
    public void processElement(String value, KeyedProcessFunction<String, String, String>.Context ctx,
            Collector<String> out) throws Exception {
        ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime());
        long current = counter.value() == null ? 0L : counter.value();
        if (current < count) {
            counter.update(current + 1L);
            out.collect(value);
        } else {
            if (lastTimestamp.value() == null) {
                lastTimestamp.update(ctx.timerService().currentProcessingTime());
            }
            Thread.sleep(millisecond);
            out.collect(value);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        if (lastTimestamp.value() != null && lastTimestamp.value() + millisecond <= timestamp) {
            counter.update(0L);
            lastTimestamp.update(null);
        }
    }
}
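Applied to a stream it might look like this (assuming input is a DataStream<String>; the key selector and the 100-per-1000-ms figures are just examples, and because the function is keyed, the limit applies per key):

DataStream<String> throttled = input
    .keyBy(value -> value)                        // any key selector; state is kept per key
    .process(new RateLimitFunction(100L, 1000L)); // at most 100 elements per 1000 ms per key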

RxJava 2.x: serialize() doesn't work

I tried the code below to test serialize().
I called onNext 1,000,000 times from each of 2 different threads to count the emissions.
I therefore expected to get 2,000,000 at onComplete.
However, I couldn't get the expected value.
private static int count = 0;
private static void setCount(int value) {
count = value;
}
private static final int TEST_LOOP = 10;
private static final int NEXT_LOOP = 1_000_000;
@Test
public void test() throws Exception {
for (int test = 0; test < TEST_LOOP; test++) {
Flowable.create(emitter -> {
ExecutorService service = Executors.newCachedThreadPool();
emitter.setCancellable(() -> service.shutdown());
Future<Boolean> future1 = service.submit(() -> {
for (int i = 0; i < NEXT_LOOP; i++) {
emitter.onNext(i);
}
return true;
});
Future<Boolean> future2 = service.submit(() -> {
for (int i = 0; i < NEXT_LOOP; i++) {
emitter.onNext(i);
}
return true;
});
if (future1.get(1, TimeUnit.SECONDS)
&& future2.get(1, TimeUnit.SECONDS)) {
emitter.onComplete();
}
}, BackpressureStrategy.BUFFER)
.serialize()
.cast(Integer.class)
.subscribe(new Subscriber<Integer>() {
private int count = 0;
@Override
public void onSubscribe(Subscription s) {
s.request(Long.MAX_VALUE);
}
@Override
public void onNext(Integer t) {
count++;
}
@Override
public void onError(Throwable t) {
fail(t.getMessage());
}
@Override
public void onComplete() {
setCount(count);
}
});
assertThat(count, is(NEXT_LOOP * 2));
}
}
I wonder whether serialize() doesn't work or whether I have misunderstood how serialize() is supposed to be used.
I checked the source of SerializedSubscriber.
@Override
public void onNext(T t) {
...
synchronized(this){
...
}
actual.onNext(t);
emitLoop();
}
Since actual.onNext(t) is called outside the synchronized block, I guess that actual.onNext(t) could be called from different threads at the same time. It may also be possible for onComplete to be called before onNext has finished.
I used RxJava 2.0.4.
This is not a bug but a misuse of the FlowableEmitter:
The onNext, onError and onComplete methods should be called in a sequential manner, just like the Subscriber's methods. Use serialize() if you want to ensure this. The other methods are thread-safe.
FlowableEmitter.serialize()
Applying Flowable.serialize() is too late for the create operator.
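A minimal sketch of the fix, adapted from the test in the question: only the inside of the create lambda changes, and both worker threads use the serialized view of the emitter.

Flowable.<Integer>create(emitter -> {
    // serialize() returns a thread-safe view; use it from every producing thread
    FlowableEmitter<Integer> serialized = emitter.serialize();
    ExecutorService service = Executors.newCachedThreadPool();
    serialized.setCancellable(service::shutdown);
    Future<Boolean> future1 = service.submit(() -> {
        for (int i = 0; i < NEXT_LOOP; i++) {
            serialized.onNext(i);
        }
        return true;
    });
    Future<Boolean> future2 = service.submit(() -> {
        for (int i = 0; i < NEXT_LOOP; i++) {
            serialized.onNext(i);
        }
        return true;
    });
    if (future1.get(1, TimeUnit.SECONDS) && future2.get(1, TimeUnit.SECONDS)) {
        serialized.onComplete();
    }
}, BackpressureStrategy.BUFFER)

With the emitter typed as FlowableEmitter<Integer> up front, the cast(Integer.class) step from the original test is no longer needed.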

Implementing resource queue in rx

I have a hot observable Observable<Resource> resources that represents consumable resources and I want to queue up consumers Action1<Resource> for these resources. A Resource can be used by at most 1 consumer. It should not be used at all once a new value is pushed from resources. If my consumers were also wrapped in a hot observable then the marble-diagram of what I'm after would be
--A--B--C--D--E--
----1----2--34---
----A----C--D-E--
----1----2--3-4--
I've managed a naive implementation using a PublishSubject and zip but this only works if each resource is consumed before a new resource is published (i.e. instead of the required sequence [A1, C2, D3, E4] this implementation will actually produce [A1, B2, C3, D4]).
This is my first attempt at using rx and I've had a play around with both delay and join but can't quite seem to get what I'm after. I've also read that ideally Subjects should be avoided, but I can't see how else I would implement this.
public class ResourceQueue<Resource> {
private final PublishSubject<Action1<Resource>> consumers = PublishSubject.create();
public ResourceQueue(Observable<Resource> resources) {
resources.zipWith(this.consumers, new Func2<Resource, Action1<Resource>, Object>() {
@Override
public Object call(Resource resource, Action1<Resource> consumer) {
consumer.call(resource);
return null;
}
}).publish().connect();
}
public void queue(final Action1<Resource> consumer) {
consumers.onNext(consumer);
}
}
Is there a way to achieve what I'm after? Is there a more 'rx-y' approach to the solution?
EDIT: replaced the withLatestFrom suggestion with combineLatest.
The only solution I can think of is to use combineLatest to get all the possible combinations, and manually exclude the ones that you do not need:
final ExecutorService executorService = Executors.newCachedThreadPool();
final Observable<String> resources = Observable.create(s -> {
Runnable r = new Runnable() {
@Override
public void run() {
final List<Integer> sleepTimes = Arrays.asList(200, 200, 200, 200, 200);
for (int i = 0; i < sleepTimes.size(); i++) {
try {
Thread.sleep(sleepTimes.get(i));
} catch (Exception e) {
e.printStackTrace();
}
String valueOf = String.valueOf((char) (i + 97));
System.out.println("new resource " + valueOf);
s.onNext(valueOf);
}
s.onCompleted();
}
};
executorService.submit(r);
});
final Observable<Integer> consumers = Observable.create(s -> {
Runnable r = new Runnable() {
@Override
public void run() {
final List<Integer> sleepTimes = Arrays.asList(300, 400, 200, 0);
for (int i = 0; i < sleepTimes.size(); i++) {
try {
Thread.sleep(sleepTimes.get(i));
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("new consumer " + (i + 1));
s.onNext(i + 1);
}
s.onCompleted();
};
};
executorService.submit(r);
});
final LatestValues latestValues = new LatestValues();
final Observable<String> combineLatest = Observable.combineLatest(consumers, resources, (c, r) -> {
if (latestValues.alreadyProcessedAnyOf(c, r)) {
return "";
}
System.out.println("consumer " + c + " will consume resource " + r);
latestValues.updateWithValues(c, r);
return c + "_" + r;
});
combineLatest.subscribe();
executorService.shutdown();
executorService.awaitTermination(10, TimeUnit.SECONDS);
The class holding the latest consumer and resource:
static class LatestValues {
Integer latestConsumer = Integer.MAX_VALUE;
String latestResource = "";
public boolean alreadyProcessedAnyOf(Integer c, String r) {
return latestConsumer.equals(c) || latestResource.equals(r);
}
public void updateWithValues(Integer c, String r) {
latestConsumer = c;
latestResource = r;
}
}