How to throttle flink output to kafka? - apache-kafka

I want to send 100 messages/second from my stream to a Kafka topic. I have more than enough data in the stream to do so.
So far I have found the windowing concept, but I am unable to adapt it to my use case.

You could do this easily with a ProcessFunction. You would keep a counter in Flink state, and only emit elements when the counter is less than 100. Meanwhile, use a timer to reset the counter to zero once a second.

With Flink v1.15, I created the function below. Refer to checkpointing_under_backpressure
and process_function in the Flink documentation.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RateLimitFunction extends KeyedProcessFunction<String, String, String> {

    // Number of elements emitted in the current interval.
    private transient ValueState<Long> counter;
    // Processing time at which the current interval started.
    private transient ValueState<Long> lastTimestamp;

    private final Long count;
    private final Long millisecond;

    public RateLimitFunction(Long count, Long millisecond) {
        this.count = count;
        this.millisecond = millisecond;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        counter = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("counter", TypeInformation.of(Long.class)));
        lastTimestamp = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("last-timestamp", TypeInformation.of(Long.class)));
    }

    @Override
    public void processElement(String value, KeyedProcessFunction<String, String, String>.Context ctx,
                               Collector<String> out) throws Exception {
        ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime());
        long current = counter.value() == null ? 0L : counter.value();
        if (current < count) {
            counter.update(current + 1L);
            out.collect(value);
        } else {
            if (lastTimestamp.value() == null) {
                lastTimestamp.update(ctx.timerService().currentProcessingTime());
            }
            // Limit exceeded: back off before emitting. Note that blocking here also
            // stalls the operator and builds up backpressure upstream.
            Thread.sleep(millisecond);
            out.collect(value);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Reset the counter once the configured interval has elapsed.
        if (lastTimestamp.value() != null && lastTimestamp.value() + millisecond <= timestamp) {
            counter.update(0L);
            lastTimestamp.update(null);
        }
    }
}
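To wire the function into a job that writes to Kafka, something like the following should work. This is only a minimal sketch assuming the Flink 1.15 KafkaSink connector; the broker address, topic name, constant key and placeholder source are illustrative, not from the original question.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RateLimitJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder sink configuration: broker and topic are illustrative values.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("rate-limited-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        // A KeyedProcessFunction needs a keyed stream; a constant key routes every
        // element through a single instance, so the limit applies globally.
        DataStream<String> throttled = env
                .fromElements("message-1", "message-2", "message-3") // placeholder source
                .keyBy(value -> "all")
                .process(new RateLimitFunction(100L, 1000L)); // at most 100 elements per 1000 ms

        throttled.sinkTo(sink);
        env.execute("throttled-output-to-kafka");
    }
}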

Related

Beam Pipeline with CheckStopReadingFn throws IllegalStateException upon returning true from CheckStopReadingFn

I am using Apache Beam 2.29.0 to consume from Kafka. I added the function CheckStopReadingFn to stop reading from Kafka once the test returns true. My pipeline creation is like this:
return p.apply("Read from Kafka", KafkaIO.<byte[], GenericRecord>read()
.withBootstrapServers(options.getSourceBrokers().get())
.withTopics(srcTopics)
.withConsumerConfigUpdates(consumerExtraOptions)
.withConsumerFactoryFn(new ConsumerFactoryFn(offsetInfo))
.withConsumerConfigUpdates(ImmutableMap.of("group.id", "my_group"))
.withCheckStopReadingFn(new CheckStopReadingFn(options.getProjectId().get()))
.withKeyDeserializer(ByteArrayDeserializer.class)
.withStartReadTime(Instant.ofEpochMilli(options.getStartTimestamp().get()))
.withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(options.getSourceSchemaRegistryUrl().get(), options.getSourceSubject().get()))
.withoutMetadata()).apply("Drop keys", Values.<GenericRecord>create())
.apply("Windowing of " + windowDuration + " seconds", Window.<GenericRecord>into(FixedWindows.of(Duration.standardSeconds(windowDuration))));
The place in Beam that throws the exception is the OffsetTracker method below, because lastAttemptedOffset is null:
@Override
public void checkDone() throws IllegalStateException {
    if (range.getFrom() == range.getTo()) {
        return;
    }
    checkState(
        lastAttemptedOffset != null,
        "Last attempted offset should not be null. No work was claimed in non-empty range %s.",
        range);
    checkState(
        lastAttemptedOffset >= range.getTo() - 1,
        "Last attempted offset was %s in range %s, claiming work in [%s, %s) was not attempted",
        lastAttemptedOffset,
        range,
        lastAttemptedOffset + 1,
        range.getTo());
}
Any ideas what the problem might be?
Here is my CheckStopReadingFn implementation:
private static class CheckStopReadingFn implements SerializableFunction<TopicPartition, Boolean> {
    private final String projectId;
    private final String bucketName;
    private final String subdirectory;

    CheckStopReadingFn(String projectId, String bucketName, String subdirectory) {
        this.projectId = projectId;
        this.bucketName = bucketName;
        this.subdirectory = subdirectory;
    }

    @Override
    public Boolean apply(TopicPartition topicPartition) {
        return GCSUtility.filesExist(projectId, bucketName, subdirectory, Set.of(topicPartition.toString()));
    }
}

Two concurrent requests were able to lock the same row in PostgreSQL

When two concurrent requests were made to the code below, both requests were able to acquire the lock simultaneously and hence both executed the block of code.
Sample code (running in production) for reference:
// Starting point for the request
@Override
public void receiveTransferItems(String argument1, String referenceId, List<Item> items, long messageId)
        throws Exception {
    ParentDTO parent = DAO.lockByReferenceId(referenceId);
    if (parent == null) {
        throw new Exception(referenceId + " does not exist");
    }
    updateData(parent);
    for (Item item : items) {
        receiveItem(parent, parent.getWarehouseId(), item.getItemSKU(), item.getItemStatus(), item.getQtyReceived(), messageId);
    }
}

private void updateData(ParentDTO td) throws DropShipException {
    // perform some logical processing and then execute update
    DAO.update(td);
}

private void receiveItem(ParentDTO td, String warehouseId, String asin, String itemStatus, int quantity, long messageId)
        throws Exception {
    /**
     * perform some logical processing
     */
    // call is being made to another class to do the rest of the processing
    service.receive(td, asin, quantity, condition, container, messageId);
}
@Override
public void receive(
        ParentDTO parentDTO,
        String asin,
        int quantity,
        Condition condition,
        Container container,
        long messageId,
        DataAccessor accessor) throws Exception {
    List<ChildDTO> childDTOs =
        DAO.lockChildDTOItems(parentDTO.getReferenceId(), asin, condition,
            CostInfoSource.MANIFEST);
    List<ChildDTO> filterItems = DAO
        .loadChildDTOItems(parentDTO.getReferenceId(), asin, condition.name());
    long totalExpectedQuantity = getTotalExpectedQuantity(filterItems);
    long totalReceivedQuantity = getTotalReceivedQuantity(filterItems);
    int quantityNormalReceived = 0;
    for (ChildDTO tdi : childDTOs) {
        int quantityReceived = 0;
        if (asinDropShipMsgAction != null) {
            quantity -= asinDropShipMsgAction.getInitialQuantity();
            quantityNormalReceived += asinDropShipMsgAction.getInitialQuantity();
        } else {
            quantityReceived = new DBOperationRunner<Integer>(accessor.getSessionManager()) {
                @Override
                protected Integer doWorkAndReturn() throws Exception {
                    return normalReceive(tdi, quantityLeft, container, MessageActionType.TS_IN, messageId);
                }
            }.execute();
        }
    }
}
private int normalReceive(final ChildDTO childDTO,
        int quantity,
        final Container container,
        final MessageActionType type,
        long messageId)
        throws Exception {
    /**
     * perform some business logic
     */
    DAO.update(childDTO);
    return someQuantity;
}
Implementation of the lockByReferenceId function:
@Override
public ParentDTO lockByReferenceId(String referenceId) {
    Criteria criteria = getCurrentSession().createCriteria(ParentDTO.class)
        .add(Restrictions.eq("referenceId", referenceId))
        .setLockMode(LockMode.UPGRADE_NOWAIT);
    return (ParentDTO) criteria.uniqueResult();
}
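For context, with the PostgreSQL dialect LockMode.UPGRADE_NOWAIT should translate to a SELECT ... FOR UPDATE NOWAIT, so a second transaction trying to lock the same row while the first still holds it is expected to fail immediately rather than wait. Below is a minimal sketch of how one might check that expectation in isolation; the sessionFactory and referenceId value are placeholders, and the exact exception type Hibernate surfaces can vary by version.

import org.hibernate.LockMode;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.criterion.Restrictions;

public class NowaitLockCheck {

    // Attempts to lock the same ParentDTO row from two independent sessions.
    public static void check(SessionFactory sessionFactory, String referenceId) {
        Session s1 = sessionFactory.openSession();
        Session s2 = sessionFactory.openSession();
        Transaction tx1 = s1.beginTransaction();
        Transaction tx2 = s2.beginTransaction();
        try {
            // First transaction acquires the row lock (FOR UPDATE NOWAIT).
            ParentDTO first = (ParentDTO) s1.createCriteria(ParentDTO.class)
                    .add(Restrictions.eq("referenceId", referenceId))
                    .setLockMode(LockMode.UPGRADE_NOWAIT)
                    .uniqueResult();

            try {
                // Second transaction should fail here while tx1 still holds the lock;
                // the concrete exception type depends on the Hibernate/driver version.
                s2.createCriteria(ParentDTO.class)
                        .add(Restrictions.eq("referenceId", referenceId))
                        .setLockMode(LockMode.UPGRADE_NOWAIT)
                        .uniqueResult();
                System.out.println("Unexpected: second session also acquired the lock on " + first.getReferenceId());
            } catch (RuntimeException expected) {
                System.out.println("Second NOWAIT lock attempt failed as expected: " + expected);
            }
        } finally {
            tx1.rollback();
            tx2.rollback();
            s1.close();
            s2.close();
        }
    }
}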
Implementation of the DBOperationRunner class:
public T execute() throws Exception {
    T t = null;
    Session originalSession = (Session) ThreadLocalContext.get(ThreadLocalContext.CURRENT_SESSION);
    try {
        ThreadLocalContext.put(ThreadLocalContext.CURRENT_SESSION, sessionManager.getCurrentSession());
        sessionManager.beginTransaction();
        t = doWorkAndReturn();
        sessionManager.commit();
    } catch (Exception e) {
        try {
            sessionManager.rollback();
        } catch (Throwable t1) {
            logger.error("failed to rollback", t1);
        }
        throw e;
    } finally {
        ThreadLocalContext.put(ThreadLocalContext.CURRENT_SESSION, originalSession);
    }
    return t;
}
Recently I observed an issue in production where two or more simultaneous requests were able to acquire a lock on the same data at the same time.
I am using Hibernate with the Criteria API as the DB framework, c3p0 for connection pooling, and PostgreSQL as the database.
Note: this issue is intermittent and only observed for some random concurrent requests, which makes it hard to debug.
I am unable to understand how two concurrent requests were able to lock the same rows simultaneously. Can you please help me identify what is going wrong in this case?
Thanks in advance!

Select immediate predecessor by date

I am new to Drools and I'm using Drools 7.12.0 to try and validate a set of meter readings, which look like
public class MeterReading {
    private long id;
    private LocalDate readDate;
    private int value;
    private String meterId;
    private boolean valid;
    /* Getters & Setters omitted */
}
As part of the validation I need to compare the values of each MeterReading with its immediate predecessor by readDate.
I first tried using 'accumulate'
when
    $mr: MeterReading()
    $previousDate: LocalDate() from accumulate(MeterReading($pdate: readDate < $mr.readDate), max($pdate))
then
    System.out.println($mr.getId() + ":" + $previousDate);
end
but then discovered that this only returns the date of the previous meter read, not the object that contains it. I then tried a custom accumulate with
when
    $mr: MeterReading()
    $previous: MeterReading() from accumulate(
        $p: MeterReading(id != $mr.id),
        init( MeterReading prev = null; ),
        action( if (prev == null || $p.readDate < prev.readDate) {
            prev = $p;
        } ),
        result( prev ) )
then
    System.out.println($mr.getId() + ":" + $previous.getId() + ":" + $previous.getReadDate());
end
but this selects the earliest read in the set of meter readings, not the immediate predecessor. Can someone point me in the right direction as to what I should be doing or reading to be able to select the immediate predecessor of each individual meter read?
Regards
After further research I found this article http://planet.jboss.org/post/how_to_implement_accumulate_functions which I used to write my own accumulate function:
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

import org.kie.api.runtime.rule.AccumulateFunction;

public class PreviousReadFinder implements AccumulateFunction {

    @Override
    public Serializable createContext() {
        return new PreviousReadFinderContext();
    }

    @Override
    public void init(Serializable context) throws Exception {
        PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
        prfc.list.clear();
    }

    @Override
    public void accumulate(Serializable context, Object value) {
        PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
        prfc.list.add((MeterReading) value);
    }

    @Override
    public void reverse(Serializable context, Object value) throws Exception {
        PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
        prfc.list.remove((MeterReading) value);
    }

    @Override
    public Object getResult(Serializable context) throws Exception {
        PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
        return prfc.findLatestReadDate();
    }

    @Override
    public boolean supportsReverse() {
        return true;
    }

    @Override
    public Class<?> getResultType() {
        return MeterReading.class;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    }

    private static class PreviousReadFinderContext implements Serializable {
        List<MeterReading> list = new ArrayList<>();

        public Object findLatestReadDate() {
            Optional<MeterReading> optional = list.stream().max(Comparator.comparing(MeterReading::getReadDate));
            if (optional.isPresent()) {
                MeterReading to = optional.get();
                return to;
            }
            return null;
        }
    }
}
and my rule is now:
rule "Opening Read With Previous"
dialect "mvel"
when
    $mr: MeterReading()
    $pmr: MeterReading() from accumulate($p: MeterReading(readDate < $mr.readDate), previousReading($p))
then
    System.out.println($mr.getId() + ":" + $pmr.getReadDate());
end
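Note that for previousReading(...) to be resolvable by name in the DRL, the custom accumulate function has to be registered with the knowledge builder (or via the equivalent builder configuration property, as the linked article describes). A minimal programmatic sketch, assuming the PreviousReadFinder class above is on the classpath; the wrapper class name is just illustrative:

import org.kie.internal.builder.KnowledgeBuilder;
import org.kie.internal.builder.KnowledgeBuilderConfiguration;
import org.kie.internal.builder.KnowledgeBuilderFactory;
import org.kie.internal.builder.conf.AccumulateFunctionOption;

public class PreviousReadingSetup {

    // Builds a KnowledgeBuilder with PreviousReadFinder registered under the
    // name used in the rule ("previousReading").
    public static KnowledgeBuilder newKnowledgeBuilder() {
        KnowledgeBuilderConfiguration conf = KnowledgeBuilderFactory.newKnowledgeBuilderConfiguration();
        conf.setOption(AccumulateFunctionOption.get("previousReading", new PreviousReadFinder()));
        return KnowledgeBuilderFactory.newKnowledgeBuilder(conf);
    }
}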
How do I write a rule to select the earliest meter reading in the set, i.e. the one which does not have a previous read?

RxJava 2.x: serialize() doesn't work

I tried the code below to test serialize().
I called onNext 1,000,000 times from each of 2 different threads and counted the emissions.
I then expected to get 2,000,000 in onComplete.
However, I didn't get the expected value.
private static int count = 0;

private static void setCount(int value) {
    count = value;
}

private static final int TEST_LOOP = 10;
private static final int NEXT_LOOP = 1_000_000;

@Test
public void test() throws Exception {
    for (int test = 0; test < TEST_LOOP; test++) {
        Flowable.create(emitter -> {
            ExecutorService service = Executors.newCachedThreadPool();
            emitter.setCancellable(() -> service.shutdown());
            Future<Boolean> future1 = service.submit(() -> {
                for (int i = 0; i < NEXT_LOOP; i++) {
                    emitter.onNext(i);
                }
                return true;
            });
            Future<Boolean> future2 = service.submit(() -> {
                for (int i = 0; i < NEXT_LOOP; i++) {
                    emitter.onNext(i);
                }
                return true;
            });
            if (future1.get(1, TimeUnit.SECONDS)
                    && future2.get(1, TimeUnit.SECONDS)) {
                emitter.onComplete();
            }
        }, BackpressureStrategy.BUFFER)
        .serialize()
        .cast(Integer.class)
        .subscribe(new Subscriber<Integer>() {
            private int count = 0;

            @Override
            public void onSubscribe(Subscription s) {
                s.request(Long.MAX_VALUE);
            }

            @Override
            public void onNext(Integer t) {
                count++;
            }

            @Override
            public void onError(Throwable t) {
                fail(t.getMessage());
            }

            @Override
            public void onComplete() {
                setCount(count);
            }
        });
        assertThat(count, is(NEXT_LOOP * 2));
    }
}
I wonder whether serialize() doesn't work or whether I misunderstood its usage.
I checked the source of SerializedSubscriber.
@Override
public void onNext(T t) {
    ...
    synchronized (this) {
        ...
    }
    actual.onNext(t);
    emitLoop();
}
Since actual.onNext(t) is called outside the synchronized block, I guess actual.onNext(t) could be called from different threads at the same time. It may also be possible, I guess, for onComplete to be called before onNext has finished.
I used RxJava 2.0.4.
This is not a bug but a misuse of the FlowableEmitter:
The onNext, onError and onComplete methods should be called in a sequential manner, just like the Subscriber's methods. Use serialize() if you want to ensure this. The other methods are thread-safe.
FlowableEmitter.serialize()
Applying Flowable.serialize() is too late for the create operator.
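In other words, it is the emitter inside create() that has to be serialized before the two worker threads start calling onNext. A minimal sketch of the corrected create block, which would replace the create(...) call in the test above (the rest of the test stays unchanged):

Flowable.create((FlowableEmitter<Integer> emitter) -> {
    ExecutorService service = Executors.newCachedThreadPool();
    emitter.setCancellable(() -> service.shutdown());
    // Serialize the emitter itself so concurrent onNext calls from the two worker
    // threads are synchronized at the source; a downstream Flowable.serialize()
    // cannot repair interleaving that already happened inside create().
    FlowableEmitter<Integer> serial = emitter.serialize();
    Future<Boolean> future1 = service.submit(() -> {
        for (int i = 0; i < NEXT_LOOP; i++) {
            serial.onNext(i);
        }
        return true;
    });
    Future<Boolean> future2 = service.submit(() -> {
        for (int i = 0; i < NEXT_LOOP; i++) {
            serial.onNext(i);
        }
        return true;
    });
    if (future1.get(1, TimeUnit.SECONDS) && future2.get(1, TimeUnit.SECONDS)) {
        serial.onComplete();
    }
}, BackpressureStrategy.BUFFER)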

zipWithIndex on Apache Flink

I'd like to assign each row of my input an id - which should be a number from 0 to N - 1, where N is the number of rows in the input.
Roughly, I'd like to be able to do something like the following:
val data = sc.textFile(textFilePath, numPartitions)
val rdd = data.map(line => process(line))
val rddMatrixLike = rdd.zipWithIndex.map { case (v, idx) => someStuffWithIndex(idx, v) }
But I need to do this in Apache Flink. Is it possible?
This is now a part of the 0.10-SNAPSHOT release of Apache Flink. Examples for zipWithIndex(in) and zipWithUniqueId(in) are available in the official Flink documentation.
Here is a simple implementation of the function:
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.common.functions.RichMapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class ZipWithIndex {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> in = ee.readTextFile("/home/robert/flink-workdir/debug/input");

        // count elements in each partition
        DataSet<Tuple2<Integer, Long>> counts = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Integer, Long>>() {
            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Integer, Long>> out) throws Exception {
                long cnt = 0;
                for (String v : values) {
                    cnt++;
                }
                out.collect(new Tuple2<Integer, Long>(getRuntimeContext().getIndexOfThisSubtask(), cnt));
            }
        });

        DataSet<Tuple2<Long, String>> result = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Long, String>>() {
            long start = 0;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                List<Tuple2<Integer, Long>> offsets = getRuntimeContext().getBroadcastVariable("counts");
                Collections.sort(offsets, new Comparator<Tuple2<Integer, Long>>() {
                    @Override
                    public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
                        return ZipWithIndex.compare(o1.f0, o2.f0);
                    }
                });
                // sum the counts of all partitions with a lower subtask index
                for (int i = 0; i < getRuntimeContext().getIndexOfThisSubtask(); i++) {
                    start += offsets.get(i).f1;
                }
            }

            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Long, String>> out) throws Exception {
                for (String v : values) {
                    out.collect(new Tuple2<Long, String>(start++, v));
                }
            }
        }).withBroadcastSet(counts, "counts");

        result.print();
    }

    public static int compare(int x, int y) {
        return (x < y) ? -1 : ((x == y) ? 0 : 1);
    }
}
This is how it works: I'm using the first mapPartition() operation to go over all elements in the partitions to count how many elements are in there.
I need to know the number of elements in each partition to properly set the offsets when assigning the IDs to the elements.
The result of the first mapPartition is a DataSet containing one (subtask index, element count) pair per partition. I'm broadcasting this DataSet to all of the second mapPartition() operators, which assign the IDs to the elements of the input.
In the open() method of the second mapPartition() I'm computing the offset for each partition.
I'm probably going to contribute the code to Flink (after discussing it with the other committers).
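Since the utilities were merged into Flink, using them is a one-liner on any DataSet. A minimal usage sketch (the input path is a placeholder):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithIndexExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> in = env.readTextFile("/path/to/input"); // placeholder path

        // Assigns consecutive ids 0..N-1 across all partitions.
        DataSet<Tuple2<Long, String>> indexed = DataSetUtils.zipWithIndex(in);

        indexed.print();
    }
}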