Apache Beam - Aggregate data from the beginning up to each logged timestamp - real-time

I am trying to implement Apache Beam for a streaming process where I want to calculate the min() and max() value of an item at every registered timestamp.
Eg:
Timestamp                         item_count
2021-08-03 01:00:03.22333 UTC     5
2021-08-03 01:00:03.256427 UTC    4
2021-08-03 01:00:03.256497 UTC    7
2021-08-03 01:00:03.256499 UTC    2
Output:
Timestamp                         Min    Max
2021-08-03 01:00:03.22333 UTC     5      5
2021-08-03 01:00:03.256427 UTC    4      5
2021-08-03 01:00:03.256497 UTC    4      7
2021-08-03 01:00:03.256499 UTC    2      7
I am not able to figure out how to fit my use case into windowing, since for me the frame starts at row 1 and ends at every new element I read.
Any suggestions how should I approach this?
Thank you

This is not going to be 100% perfect, since there's always going to be some latency and you may get elements in the wrong order, but it should be good enough.
public interface RollingMinMaxOptions extends PipelineOptions {
    @Description("Topic to read from")
    @Default.String("projects/pubsub-public-data/topics/taxirides-realtime")
    String getTopic();

    void setTopic(String value);
}
public static class MinMax extends Combine.CombineFn<Float, KV<Float, Float>, KV<Float, Float>> { // Types: Input, Accum, Output
    @Override
    public KV<Float, Float> createAccumulator() {
        KV<Float, Float> start = KV.of(Float.POSITIVE_INFINITY, 0f);
        return start;
    }

    @Override
    public KV<Float, Float> addInput(KV<Float, Float> accumulator, Float input) {
        Float max = Math.max(accumulator.getValue(), input);
        Float min = Math.min(accumulator.getKey(), input);
        return KV.of(min, max);
    }

    @Override
    public KV<Float, Float> mergeAccumulators(Iterable<KV<Float, Float>> accumulators) {
        Float max = 0f;
        Float min = Float.POSITIVE_INFINITY;
        for (KV<Float, Float> kv : accumulators) {
            max = Math.max(kv.getValue(), max);
            min = Math.min(kv.getKey(), min);
        }
        return KV.of(min, max);
    }

    @Override
    public KV<Float, Float> extractOutput(KV<Float, Float> accumulator) {
        return accumulator;
    }
}
public static void main(String[] args) {
    RollingMinMaxOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(RollingMinMaxOptions.class);

    Pipeline p = Pipeline.create(options);

    p
        .apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getTopic()))
        .apply("Get meter reading", ParDo.of(new DoFn<String, Float>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws ParseException {
                JSONObject json = new JSONObject(c.element());

                String rideStatus = json.getString("ride_status");
                Float meterReading = json.getFloat("meter_reading");

                if (rideStatus.equals("dropoff") && meterReading > 0) {
                    c.output(meterReading);
                }
            }
        }))
        .apply(Window.<Float>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterPane.elementCountAtLeast(1)))
            .withTimestampCombiner(TimestampCombiner.LATEST)
            .accumulatingFiredPanes())
        .apply(Combine.globally(new MinMax()))
        .apply("Format", ParDo.of(new DoFn<KV<Float, Float>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws ParseException {
                TableRow row = new TableRow();

                row.set("min", c.element().getKey());
                row.set("max", c.element().getValue());
                row.set("timestamp", c.timestamp().toString());

                LOG.info(row.toString());
                c.output(row);
            }
        }));

    p.run();
}
If you want the min / max to reset every X amount of time, change the windowing to a FixedWindows of that size.
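For reference, a minimal sketch of that variant (my own illustration, not from the original answer; it assumes a one-hour reset interval and that FixedWindows and org.joda.time.Duration are imported):

// Sketch only: the same trigger configuration as above, but using FixedWindows so
// the rolling min/max resets each hour. The one-hour duration is an assumed value.
Window<Float> hourlyWindow =
    Window.<Float>into(FixedWindows.of(Duration.standardHours(1)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withTimestampCombiner(TimestampCombiner.LATEST)
        .accumulatingFiredPanes();
// Apply this in place of the Window.into(new GlobalWindows()) transform above.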


What causes "constraintMatchTotal could not add constraintMatch" when the issue is tied to a .drl 'or' clause?

While extending the OptaPlanner nurse rostering sample code, what causes the "constraintMatchTotal could not add constraintMatch" (IllegalStateException) error to be thrown when it is related to the parsing of a .drl rule with an 'or' clause? It occurs immediately at import of data into the .drl-based ruleset... but it does NOT error if either of the two 'or' clauses is commented out. I believe that since they are individually acceptable, the system should handle them in the 'or' setup.
The rule is below, followed by the error, and the domain object used in the 'or' clause. I confirmed that:
If I comment out the 'or' and the BoundaryDate clause above it, the program loads and runs.
If I comment out the 'or' and the BoundaryDate clause below it, the program loads and runs.
If I leave both in place, the error (below the rule) is thrown immediately.
Additionally, if I insert this clause into the 2nd BoundaryDate condition (after the 'or'), then the program loads and runs:
preferredSequenceStart == true,
.drl rule:
rule "Highlight irregular shifts"
when
    EmployeeWorkSameShiftTypeSequence(
        employee != null,
        $firstDayIndex : firstDayIndex,
        $lastDayIndex : lastDayIndex,
        $employee : employee,
        $dayLength : dayLength)
    (
        BoundaryDate(
            dayIndex == $firstDayIndex,
            preferredSequenceStart == false // does not start on a boundary start date
        )
        or
        BoundaryDate(
            dayIndex == $firstDayIndex,
            $dayLength != preferredCoveringLength // is incorrect length for exactly one block
        )
    )
    StaffRosterParametrization($lastDayIndex >= planningWindowStartDayIndex) // ignore if assignment is in (fixed) prior data
    // non-functional identification drives desired indictment display on ShiftAssignment planning objects
    ShiftAssignment(employee == $employee, shiftDateDayIndex >= $firstDayIndex, shiftDateDayIndex <= $lastDayIndex)
then
    scoreHolder.addSoftConstraintMatch(kcontext, -1);
end
Exception executing consequence for rule "Highlight irregular shifts" in westgranite.staffrostering.solver: java.lang.IllegalStateException: The constraintMatchTotal (westgranite.staffrostering.solver/Highlight irregular shifts=0hard/-274soft) could not add constraintMatch (westgranite.staffrostering.solver/Highlight irregular shifts/[2020-01-02/D/6, 2018-12-25 - 2020-01-06, 2020-01-02, ...
[continues on with a list of constraint matches]
The BoundaryData.java is below, so the methods being called from the rule are visible:
package westgranite.staffrostering.domain;
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import com.thoughtworks.xstream.annotations.XStreamAlias;
import westgranite.common.domain.AbstractPersistable;
@XStreamAlias("BoundaryDate")
public class BoundaryDate extends AbstractPersistable {
/**
*
*/
private static final long serialVersionUID = -7393276689810490427L;
private static final DateTimeFormatter LABEL_FORMATTER = DateTimeFormatter.ofPattern("E d MMM");
private int dayIndex;
private LocalDate date;
private boolean preferredSequenceStart; // true means "this date is a preferred start to assignment sequences"
private boolean preferredSequenceEnd; // true means "this date is a preferred end for assignment sequences"
private int nextPreferredStartDayIndex; // MAX_VALUE means "none"; if preferredSequenceStart is true, then this ref is still to the FUTURE next pref start date
private int prevPreferredStartDayIndex; // MIN_VALUE means "none"; if preferredSequenceStart is true, then this ref is still to the PREVIOUS next pref start date
// magic value that is beyond reasonable dayIndex range and still allows delta of indices to be an Integer
public static final int noNextPreferredDayIndex = Integer.MAX_VALUE/3;
public static final int noPrevPreferredDayIndex = Integer.MIN_VALUE/3;
public int getDayIndex() {
return dayIndex;
}
public void setDayIndex(int dayIndex) {
this.dayIndex = dayIndex;
}
public LocalDate getDate() {
return date;
}
public void setDate(LocalDate date) {
this.date = date;
}
public boolean isPreferredSequenceStart() {
return preferredSequenceStart;
}
public void setPreferredSequenceStart(boolean preferredSequenceStart) {
this.preferredSequenceStart = preferredSequenceStart;
}
public boolean isPreferredSequenceEnd() {
return preferredSequenceEnd;
}
public void setPreferredSequenceEnd(boolean preferredSequenceEnd) {
this.preferredSequenceEnd = preferredSequenceEnd;
}
public int getNextPreferredStartDayIndex() {
return nextPreferredStartDayIndex;
}
public void setNextPreferredStartDayIndex(int nextPreferredStartDayIndex) {
this.nextPreferredStartDayIndex = nextPreferredStartDayIndex;
}
public int getPrevPreferredStartDayIndex() {
return prevPreferredStartDayIndex;
}
public void setPrevPreferredStartDayIndex(int prevPreferredStartDayIndex) {
this.prevPreferredStartDayIndex = prevPreferredStartDayIndex;
}
// ===================== COMPLEX METHODS ===============================
public int getCurrOrPrevPreferredStartDayIndex() {
return (isPreferredSequenceStart() ? dayIndex : prevPreferredStartDayIndex);
}
public int getCurrOrNextPreferredStartDayIndex() {
return (isPreferredSequenceStart() ? dayIndex : nextPreferredStartDayIndex);
}
public int getCurrOrPrevPreferredEndDayIndex() {
return (isPreferredSequenceEnd() ? dayIndex : (isPreferredSequenceStart() ? dayIndex-1 : prevPreferredStartDayIndex-1));
}
public int getCurrOrNextPreferredEndDayIndex() {
return (isPreferredSequenceEnd() ? dayIndex : nextPreferredStartDayIndex-1);
}
public boolean isNoNextPreferred() {
return getNextPreferredStartDayIndex() == noNextPreferredDayIndex;
}
public boolean isNoPrevPreferred() {
return getPrevPreferredStartDayIndex() == noPrevPreferredDayIndex;
}
/**
* @return if this is a preferred start date, then the sequence length that will fill from this date through the next end date; otherwise the days filling the past preferred start date through the next end date
*/
public int getPreferredCoveringLength() {
if (isPreferredSequenceStart()) {
return nextPreferredStartDayIndex - dayIndex;
}
return nextPreferredStartDayIndex - prevPreferredStartDayIndex;
}
/**
* @return if this is a preferred start boundary, then "today", else the day of the most recent start boundary
*/
public DayOfWeek getPreferredStartDayOfWeek() {
if (isPreferredSequenceStart()) {
return getDayOfWeek();
}
if (isNoPrevPreferred()) {
throw new IllegalStateException("No prev preferred day of week available for " + toString());
}
return date.minusDays(dayIndex - getPrevPreferredStartDayIndex()).getDayOfWeek();
}
public DayOfWeek getPreferredEndDayOfWeek() {
if (isPreferredSequenceEnd()) {
return getDayOfWeek();
}
if (isNoNextPreferred()) {
throw new IllegalStateException("No next preferred day of week available for " + toString());
}
return date.plusDays((getNextPreferredStartDayIndex()-1) - dayIndex).getDayOfWeek();
}
public DayOfWeek getDayOfWeek() {
return date.getDayOfWeek();
}
public int getMostRecentDayIndexOf(DayOfWeek targetDayOfWeek) {
return dayIndex - getBackwardDaysToReach(targetDayOfWeek);
}
public int getUpcomingDayIndexOf(DayOfWeek targetDayOfWeek) {
return dayIndex + getForwardDaysToReach(targetDayOfWeek);
}
public LocalDate getMostRecentDateOf(DayOfWeek targetDayOfWeek) {
return date.minusDays(getBackwardDaysToReach(targetDayOfWeek));
}
public LocalDate getUpcomingDateOf(DayOfWeek targetDayOfWeek) {
return date.plusDays(getForwardDaysToReach(targetDayOfWeek));
}
public int getForwardDaysToReach(DayOfWeek targetDayOfWeek) {
return getForwardDaysToReach(this.getDayOfWeek(), targetDayOfWeek);
}
public static int getForwardDaysToReach(DayOfWeek startDayOfWeek, DayOfWeek targetDayOfWeek) {
if (startDayOfWeek == targetDayOfWeek) {
return 0;
}
int forwardDayCount = 1;
while (startDayOfWeek.plus(forwardDayCount) != targetDayOfWeek) {
forwardDayCount++;
if (forwardDayCount > 10) {
throw new IllegalStateException("counting forward in days from " + startDayOfWeek + " never found target day of week: " + targetDayOfWeek);
}
}
return forwardDayCount;
}
public int getBackwardDaysToReach(DayOfWeek targetDayOfWeek) {
return getBackwardDaysToReach(this.getDayOfWeek(), targetDayOfWeek);
}
public static int getBackwardDaysToReach(DayOfWeek startDayOfWeek, DayOfWeek targetDayOfWeek) {
if (startDayOfWeek == targetDayOfWeek) {
return 0;
}
int backwardDayCount = 1;
while (startDayOfWeek.minus(backwardDayCount) != targetDayOfWeek) {
backwardDayCount++;
if (backwardDayCount > 10) {
throw new IllegalStateException("counting backward in days from " + startDayOfWeek + " never found target day of week: " + targetDayOfWeek);
}
}
return backwardDayCount;
}
public String getLabel() {
return date.format(LABEL_FORMATTER);
}
@Override
public String toString() {
return date.format(DateTimeFormatter.ISO_DATE);
}
}
If the same object being tested in the rule can match multiple parts of an 'or' condition, then OptaPlanner throws this IllegalStateException, at least through 7.15.0. See the details explored in OptaPlanner JIRA 1433.
The workaround is to add terms to the later branches of 'or' expressions that ensure the matching object cannot be the same one that matched earlier parts of the 'or'. For the original posting above, adding 'preferredSequenceStart == true' achieved this exclusion.
Note that the use of the 'exists' keyword in the terms of the 'or' can cause trouble with this workaround; try to avoid using 'exists' in this situation.

How to throttle flink output to kafka?

I want to send 100 messages/second from my stream to a Kafka topic. I have more than enough data in the stream to do so.
So far, I have found the windowing concept, but I am unable to adapt it to my use case.
You could do this easily with a ProcessFunction. You would keep a counter in Flink state, and only emit elements when the counter is less than 100. Meanwhile, use a timer to reset the counter to zero once a second.
With Flink v1.15, I created the following function.
Refer to the Flink documentation on checkpointing under backpressure and on the process function.
public class RateLimitFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> counter;
    private transient ValueState<Long> lastTimestamp;

    private final Long count;
    private final Long millisecond;

    public RateLimitFunction(Long count, Long millisecond) {
        this.count = count;
        this.millisecond = millisecond;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        counter = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("counter", TypeInformation.of(Long.class)));
        lastTimestamp = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("last-timestamp", TypeInformation.of(Long.class)));
    }

    @Override
    public void processElement(String value, KeyedProcessFunction<String, String, String>.Context ctx,
            Collector<String> out) throws Exception {
        ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime());

        long current = counter.value() == null ? 0L : counter.value();
        if (current < count) {
            counter.update(current + 1L);
            out.collect(value);
        } else {
            if (lastTimestamp.value() == null) {
                lastTimestamp.update(ctx.timerService().currentProcessingTime());
            }
            Thread.sleep(millisecond);
            out.collect(value);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        if (lastTimestamp.value() != null && lastTimestamp.value() + millisecond <= timestamp) {
            counter.update(0L);
            lastTimestamp.update(null);
        }
    }
}
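A minimal usage sketch for wiring this into a job (my own illustration; "source" is assumed to be an existing DataStream<String> and "kafkaSink" an already-configured KafkaSink<String>, both placeholders):

// Sketch only: routes every element to a single key so one counter governs the
// whole stream, then throttles to at most 100 elements per 1000 ms.
DataStream<String> throttled = source
        .keyBy(value -> "all")                         // single key => one shared counter
        .process(new RateLimitFunction(100L, 1000L));  // 100 messages per second
throttled.sinkTo(kafkaSink);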

How to get the average of a generic array list in java?

I'm having trouble getting the average of a generic ArrayList of type T.
You should use the <T extends Number> generic signature to restrict the type parameter to Number subtypes, plus the instanceof keyword to branch on the concrete type. A simple dummy demo is below:
Test Code
import java.util.ArrayList;
import java.util.List;
public class Test {
public static void main(String[] args) {
List<Integer> integerList = new ArrayList<Integer>();
List<Double> doubleList = new ArrayList<Double>();
List<Float> floatList = new ArrayList<Float>();
for(int i = 0; i < 10; i++)
{
integerList.add(new Integer(i+1));
doubleList.add(new Double(i+1));
floatList.add(new Float(i+1));
}
Utility<Integer> utilityInteger = new Utility<Integer>(integerList);
Utility<Double> utilityDouble = new Utility<Double>(doubleList);
Utility<Float> utilityFloat = new Utility<Float>(floatList);
System.out.println("Integer average: " + utilityInteger.getAverage());
System.out.println("Double average : " + utilityDouble.getAverage());
System.out.println("Float average : " + utilityFloat.getAverage());
}
public static class Utility<T extends Number>
{
// Fields
private List<T> list;
private Object average;
// Constructor
@SuppressWarnings("unchecked")
public Utility(List<T> list)
{
this.list = list;
T sample = list.get(0);
if(sample instanceof Double)
{
doAverageDouble((List<Double>) list);
}
else if (sample instanceof Integer)
{
doAverageInteger((List<Integer>) list);
}
else if (sample instanceof Float)
{
doAverageFloat((List<Float>) list);
}
else
{
throw new IllegalStateException("Constructor must be initialized with a Double, Integer or Float list");
}
}
// Methods
private void doAverageDouble(List<Double> list) {
Double sum = new Double(0);
for(Double d : list)
{
sum += d;
}
average = sum/new Double(list.size());
}
private void doAverageInteger(List<Integer> list) {
Integer sum = new Integer(0);
for(Integer d : list)
{
sum += d;
}
average = sum/new Integer(list.size());
}
private void doAverageFloat(List<Float> list) {
Float sum = new Float(0);
for(Float d : list)
{
sum += d;
}
average = sum/new Float(list.size());
}
Object getAverage()
{
return average;
}
}
}
Console Output
Integer average: 5
Double average : 5.5
Float average : 5.5
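Since every Number exposes doubleValue(), a shorter alternative that avoids the instanceof branching is also possible. A minimal sketch (class and method names are my own, not from the original post):

import java.util.List;

public class AverageUtil {
    // Works for any Number subtype (Integer, Double, Float, Long, ...) by
    // converting each element to double before summing.
    public static <T extends Number> double average(List<T> list) {
        if (list.isEmpty()) {
            throw new IllegalArgumentException("Cannot average an empty list");
        }
        double sum = 0.0;
        for (T value : list) {
            sum += value.doubleValue();
        }
        return sum / list.size();
    }
}

Note that, unlike the demo above, this always performs floating-point division, so the Integer list would average to 5.5 rather than 5.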

zipWithIndex on Apache Flink

I'd like to assign each row of my input an id - which should be a number from 0 to N - 1, where N is the number of rows in the input.
Roughly, I'd like to be able to do something like the following:
val data = sc.textFile(textFilePath, numPartitions)
val rdd = data.map(line => process(line))
val rddMatrixLike = rdd.zipWithIndex.map { case (v, idx) => someStuffWithIndex(idx, v) }
But in Apache Flink. Is it possible?
This is now a part of the 0.10-SNAPSHOT release of Apache Flink. Examples for zipWithIndex(in) and zipWithUniqueId(in) are available in the official Flink documentation.
Here is a simple implementation of the function:
public class ZipWithIndex {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> in = ee.readTextFile("/home/robert/flink-workdir/debug/input");

        // count elements in each partition
        DataSet<Tuple2<Integer, Long>> counts = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Integer, Long>>() {
            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Integer, Long>> out) throws Exception {
                long cnt = 0;
                for (String v : values) {
                    cnt++;
                }
                out.collect(new Tuple2<Integer, Long>(getRuntimeContext().getIndexOfThisSubtask(), cnt));
            }
        });

        DataSet<Tuple2<Long, String>> result = in.mapPartition(new RichMapPartitionFunction<String, Tuple2<Long, String>>() {
            long start = 0;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                List<Tuple2<Integer, Long>> offsets = getRuntimeContext().getBroadcastVariable("counts");
                Collections.sort(offsets, new Comparator<Tuple2<Integer, Long>>() {
                    @Override
                    public int compare(Tuple2<Integer, Long> o1, Tuple2<Integer, Long> o2) {
                        return ZipWithIndex.compare(o1.f0, o2.f0);
                    }
                });
                for (int i = 0; i < getRuntimeContext().getIndexOfThisSubtask(); i++) {
                    start += offsets.get(i).f1;
                }
            }

            @Override
            public void mapPartition(Iterable<String> values, Collector<Tuple2<Long, String>> out) throws Exception {
                for (String v : values) {
                    out.collect(new Tuple2<Long, String>(start++, v));
                }
            }
        }).withBroadcastSet(counts, "counts");

        result.print();
    }

    public static int compare(int x, int y) {
        return (x < y) ? -1 : ((x == y) ? 0 : 1);
    }
}
This is how it works: I'm using the first mapPartition() operation to go over all elements in the partitions to count how many elements are in there.
I need to know the number of elements in each partition to properly set the offsets when assigning the IDs to the elements.
The result of the first mapPartition() is a DataSet containing one (subtask index, element count) mapping per partition. I'm broadcasting this DataSet to all of the second mapPartition() operators, which will assign the IDs to the elements from the input.
In the open() method of the second mapPartition() I'm computing the offset for each partition.
I'm probably going to contribute the code to Flink (after discussing it with the other committers).
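With the utility that was later added to Flink (mentioned at the top of this answer), the same result can be obtained directly from DataSetUtils. A minimal sketch (the input path is a placeholder):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithIndexExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> in = env.readTextFile("/path/to/input"); // placeholder path

        // Assigns consecutive ids 0..N-1 across all partitions.
        DataSet<Tuple2<Long, String>> indexed = DataSetUtils.zipWithIndex(in);

        indexed.print();
    }
}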

gwt celltable paging

I have celltable and a simplepager. I am making async calls to the server to return data as a list.
AsyncDataProvider<Entry> provider = new AsyncDataProvider<Entry>() {
    @Override
    protected void onRangeChanged(HasData<Entry> display) {
        final int start = display.getVisibleRange().getStart();
        int length = display.getVisibleRange().getLength();

        AsyncCallback<List<Entry>> callback = new AsyncCallback<List<Entry>>() {
            @Override
            public void onFailure(Throwable caught) {
                Window.alert(caught.getMessage());
            }

            @Override
            public void onSuccess(List<Entry> result) {
                updateRowData(start, result);
            }
        };
        // The remote service that should be implemented
        rpcService.fetchEntries(start, length, callback);
    }
};
On the server side ...
public List<Entry> fetchEntries(int start, int length) {
    if (start > ENTRIES.size())
        return new ArrayList<Entry>();

    int end = start + length > ENTRIES.size() ? ENTRIES.size() : start + length;

    ArrayList<Entry> sublist = new ArrayList<Entry>(
            (List<Entry>) ENTRIES.subList(start, end));
    return sublist;
}
The problem is that I don't know the size of the dataset returned by the async call, so I cannot call updateRowCount. As a result, the next button is always active even though the dataset has only 24 rows. Any ideas?
How about modifying your RPC service to return the total count as well:
class FetchEntriesResult {
    private int totalNumberOfEntries;
    private List<Entry> entries;

    // getters, setters, constructor etc...
}
And your service method becomes something like:
public FetchEntriesResult fetchEntries(int start, int length) {
    if (start > ENTRIES.size())
        // return an empty page but still report the total count
        return new FetchEntriesResult(new ArrayList<Entry>(), ENTRIES.size());

    int end = start + length > ENTRIES.size() ? ENTRIES.size() : start + length;

    ArrayList<Entry> sublist = new ArrayList<Entry>(
            (List<Entry>) ENTRIES.subList(start, end));
    return new FetchEntriesResult(sublist, ENTRIES.size());
}
Now you can use the FetchEntriesResult.getTotalNumberOfEntries() on the client.
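For example, the callback's onSuccess could then look something like this (a sketch, assuming the callback's type parameter is changed to FetchEntriesResult and that the reported row count is exact):

@Override
public void onSuccess(FetchEntriesResult result) {
    // Push the visible page of rows into the CellTable.
    updateRowData(start, result.getEntries());
    // Tell the pager the exact total row count so the "next" button
    // disables once the last page is reached.
    updateRowCount(result.getTotalNumberOfEntries(), true);
}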