Group by and then ungroup causes loss of all events on beam TestPipeline - apache-beam

We found this line (mainly the reshuffle) to be INTERMITTENTLY dropping events (causing flaky tests) on the Apache Beam org.apache.beam.sdk.testing.TestPipeline:
PCollection<OrderlyBeamDto<RosterRecord>> records = PCollectionList.of(csv.get(TupleTags.OUTPUT_RECORD)).and(excel.get(TupleTags.OUTPUT_RECORD))
.apply(Flatten.pCollections())
.apply("Reshuffle Records", Reshuffle.viaRandomKey());
So I modified it to this, and now we lose ALL events. In fact, UngroupFn is never called:
PCollection<OrderlyBeamDto<RosterRecord>> records = PCollectionList.of(csv.get(TupleTags.OUTPUT_RECORD)).and(excel.get(TupleTags.OUTPUT_RECORD))
.apply(Flatten.pCollections())
.apply(ParDo.of(new CreateKeyFn()))
.apply(GroupByKey.<String, OrderlyBeamDto<RosterRecord>>create())
.apply(ParDo.of(new UngroupFn()));
The code for UngroupFn is simply this:
@ProcessElement
public void processElement(final ProcessContext c) {
  log.info("running ungroup");
  KV<String, Iterable<OrderlyBeamDto<RosterRecord>>> keyValue = c.element();
  Iterable<OrderlyBeamDto<RosterRecord>> list = keyValue.getValue();
  for (OrderlyBeamDto<RosterRecord> record : list) {
    log.info("ungroup=" + record.getValue().getRowNumber());
    c.output(record);
  }
}
We never see the 'running ungroup' log :(

Related

Flink streaming table using kafka source and using flink sql to query

I'm trying to read data from a Kafka topic into a DataStream, register the DataStream as a table, and then query it with TableEnvironment.sqlQuery("SQL"). When TableEnvironment.execute() runs there is no error, but there is also no output.
public static void main(String[] args){
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.enableCheckpointing(5000);
StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env);
FlinkKafkaConsumer<Person> consumer = new FlinkKafkaConsumer(
"topic",
new JSONDeserializer(),
Job.getKafkaProperties
);
consumer.setStartFromEarliest();
DataStream<Person> stream = env.addSource(consumer).filter(x -> x.status != -1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Person>(){
long current = 0L;
final long expire = 1000L;
@Override
public Watermark getCurrentWatermark(){
return new Watermark(current - expire);
}
@Override
public long extractTimestamp(Person person, long previousElementTimestamp){
long timestamp = person.createTime;
current = Math.max(timestamp,current);
return timestamp;
}
});
//set createTime as rowtime
tableEnvironment.registerDataStream("Table_Person",stream,"name,age,sex,createTime.rowtime");
Table res = tableEnvironment.sqlQuery("select TUMBLE_END(createTime,INTERVAL '1' minute) as registTime,sex,count(1) as total from Table_Person group by sex,TUMBLE(createTime,INTERVAL '1' minute)");
tableEnvironment.toAppendStream(t,Types.Row(new TypeInformation[]{Types.SQL_TIMESTAMP,Types.STRING,Types.LONG})).print();
tableEnvironment.execute("person-query");
}
When I execute, nothing is printed to the console and no exception is thrown;
but if I use fromCollection() as the source, the program does print something to the console.
Can you please guide me to fix this?
dependencies:
flink-streaming-java_2.11 version:1.9.0-csa1.0.0.0;
flink-streaming-scala_2.11 version:1.9.0-csa1.0.0.0;
flink-connector-kafka_2.11 version:1.9.0-csa1.0.0.0;
flink-table-api-java-bridge_2.11 version:1.9.0-csa1.0.0.0;
flink-table-planner_2.11 version:1.9.0-csa1.0.0.0;
In the code where you convert the SQL query's result back to a DataStream, you need to pass res rather than t to toAppendStream. (I can't see how the code you've posted will even compile — where is t declared?) And I think you should be able to do this
Table res = tableEnvironment.sqlQuery("select TUMBLE_END(createTime,INTERVAL '1' minute) as registTime,sex,count(1) as total from Table_Person group by sex,TUMBLE(createTime,INTERVAL '1' minute)");
tableEnvironment.toAppendStream(res,Row.class).print();
rather than bothering with the TypeInformation.

kafka streams groupBy aggregate produces unexpected values

My question is about Kafka Streams KTable.groupBy.aggregate and the resulting aggregated values.
situation
I am trying to aggregate minute events per day.
I have a minute event generator (not shown here) that generates events for a few houses. Sometimes the event value is wrong and the minute event must be republished.
Minute events are published in the topic "minutes".
I am aggregating these events per day and house using Kafka Streams groupBy and aggregate.
problem
Normally, as there are 1440 minutes in a day, there should never be an aggregation with more than 1440 values.
There should also never be an aggregation with a negative number of events.
... But it happens anyway, and we do not understand what is wrong in our code.
sample code
Here is simplified sample code to illustrate the problem. The IllegalStateException is sometimes thrown.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, MinuteEvent> minuteEvents = builder.table(
"minutes",
Consumed.with(Serdes.String(), minuteEventSerdes),
Materialized.<String, MinuteEvent, KeyValueStore<Bytes, byte[]>>with(Serdes.String(), minuteEventSerdes)
.withCachingDisabled());
// perform daily aggregation
KStream<String, MinuteAggregate> dayEvents = minuteEvents
// group by house and day
.filter((key, minuteEvent) -> minuteEvent != null && StringUtils.isNotBlank(minuteEvent.house))
.groupBy((key, minuteEvent) -> KeyValue.pair(
minuteEvent.house + "##" + minuteEvent.instant.atZone(ZoneId.of("Europe/Paris")).truncatedTo(ChronoUnit.DAYS), minuteEvent),
Grouped.<String, MinuteEvent>as("minuteEventsPerHouse")
.withKeySerde(Serdes.String())
.withValueSerde(minuteEventSerdes))
.aggregate(
MinuteAggregate::new,
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.addLine(key, value),
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.removeLine(key, value),
Materialized
.<String, MinuteAggregate, KeyValueStore<Bytes, byte[]>>as(BILLLINEMINUTEAGG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(minuteAggSerdes)
.withLoggingEnabled(new HashMap<>())) // keep this aggregate state forever
.toStream();
// check daily aggregation
dayEvents.filter((key, value) -> {
if (value.nbValues < 0) {
throw new IllegalStateException("got an aggregate with a negative number of values " + value.nbValues);
}
if (value.nbValues > 1440) {
throw new IllegalStateException("got an aggregate with too many values " + value.nbValues);
}
return true;
}).to("days", minuteAggSerdes);
And here are the sample classes used in this code snippet:
public class MinuteEvent {
public final String house;
public final double sensorValue;
public final Instant instant;
public MinuteEvent(String house,double sensorValue, Instant instant) {
this.house = house;
this.sensorValue = sensorValue;
this.instant = instant;
}
}
public class MinuteAggregate {
public int nbValues = 0;
public double totalSensorValue = 0.;
public String house = "";
public MinuteAggregate addLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues + 1;
this.totalSensorValue = this.totalSensorValue + value.sensorValue;
this.house = value.house;
return this;
}
public MinuteAggregate removeLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues -1;
this.totalSensorValue = this.totalSensorValue - value.sensorValue;
return this;
}
public MinuteAggregate() {
}
}
If someone could tell us what we are doing wrong here and why we have these unexpected values that would be great.
additional notes
we configure our streams job to run with 4 threads: properties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
we are forced to use KTable.groupBy().aggregate() because minute values can be
republished with a different sensorValue for an already published Instant, and the daily aggregation must be modified accordingly.
KStream.groupBy().aggregate() does not have both an adder AND a subtractor.
I think it is actually possible for the count to become negative temporarily.
The reason is that each update to your first KTable sends two messages downstream -- the old value to be subtracted from the downstream aggregation and the new value to be added to it. Both messages are processed independently in the downstream aggregation.
If the current count is zero and a subtraction is processed before an addition, the count becomes negative temporarily.
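To see that ordering effect in isolation, here is a minimal plain-Java sketch (no Kafka involved; it just reuses the MinuteEvent and MinuteAggregate classes from the question, and the key value is made up) of what happens when the subtract message produced by an update reaches the downstream aggregation before the matching add message:
import java.time.Instant;

public class SubtractBeforeAddDemo {
    public static void main(String[] args) {
        // downstream aggregate for this key currently holds zero values
        MinuteAggregate aggregate = new MinuteAggregate();
        MinuteEvent oldEvent = new MinuteEvent("house-1", 2.0, Instant.now());
        MinuteEvent newEvent = new MinuteEvent("house-1", 5.0, Instant.now());
        String key = "house-1##2020-01-01T00:00+01:00[Europe/Paris]";

        // one upstream KTable update -> two downstream messages;
        // if the subtractor runs first, the count dips below zero
        aggregate.removeLine(key, oldEvent);
        System.out.println("after subtract: nbValues=" + aggregate.nbValues); // -1 (transient)
        aggregate.addLine(key, newEvent);
        System.out.println("after add:      nbValues=" + aggregate.nbValues); // 0 again
    }
}
The state converges once both messages have been applied; the IllegalStateException fires because the check in the filter happens to observe one of these intermediate values.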

applying keyed state on top of stream from co group stream

I have two Kafka sources.
I am trying to perform a word count and merge the counts from the two streams.
I have created a 1-minute window for both data streams and am applying CoGroupByKey; from a DoFn I am emitting <Key,Value> pairs of (word, count).
On top of this CoGroupByKey function, I am applying a stateful ParDo.
Say I get (Test,2) from stream 1 and (Test,3) from stream 2 in the same window; then in the CoGroupByKey function I'll merge them as (Test,5). But if they do not fall in the same window, I will emit (Test,2) and (Test,3).
Now I apply state to merge these elements.
So the final result should be (Test,5), but I am not getting the expected result. All elements from stream 1 are going to one partition and
elements from stream 2 to another partition, which is why I am getting the result
(Test,2)
(Test,3)
// word count stream from kafka topic 1
PCollection<KV<String,Long>> stream1 = ...
// word count stream from kafka topic 2
PCollection<KV<String,Long>> stream2 = ...
PCollection<KV<String,Long>> windowed1 =
stream1.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
PCollection<KV<String,Long>> windowed2 =
stream2.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
final TupleTag<Long> count1 = new TupleTag<Long>();
final TupleTag<Long> count2 = new TupleTag<Long>();
// Merge collection values into a CoGbkResult collection.
PCollection<KV<String, CoGbkResult>> joinedStream =
KeyedPCollectionTuple.of(count1, windowed1).and(count2, windowed2)
.apply(CoGroupByKey.<String>create());
// applying state operation after coGroupKey fun
PCollection<KV<String,Long>> finalCountStream =
joinedStream.apply(ParDo.of(
new DoFn<KV<String, CoGbkResult>, KV<String,Long>>() {
@StateId(stateId)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext processContext,
@StateId(stateId) MapState<String, Long> state) {
KV<String, CoGbkResult> element = processContext.element();
// use the TupleTags declared above (count1 / count2)
Iterable<Long> counts1 = element.getValue().getAll(count1);
Iterable<Long> counts2 = element.getValue().getAll(count2);
Long sumAmount =
StreamSupport
.stream(
Iterables.concat(counts1, counts2).spliterator(), false)
.collect(Collectors.summingLong(n -> n));
System.out.println(element.getKey()+"::"+sumAmount);
// processContext.output(element.getKey()+"::"+sumAmount);
Long currCount =
state.get(element.getKey()).read() == null
? 0L
: state.get(element.getKey()).read();
Long newCount = currCount+sumAmount;
state.put(element.getKey(),newCount);
processContext.output(KV.of(element.getKey(),newCount));
}
}));
finalCountStream
.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext c,
@StateId(myState) MapState<String, Long> state) {
KV<String,Long> e = c.element();
Long currCount = state.get(e.getKey()).read()==null
? 0L
: state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
}))
.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values());
Alternatively, you can use a Flatten + Combine approach, which should give you simpler code:
PCollection<KV<String, Long>> pc1 = ...;
PCollection<KV<String, Long>> pc2 = ...;
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(pc1).and(pc2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply(window...).apply(Combine.perKey(Sum.ofLongs()))
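For reference, here is a minimal sketch of how that elided window step could be filled in, assuming the one-minute fixed windows from the question and the default event-time trigger (neither is spelled out above):
PCollection<KV<String, Long>> pc1 = ...; // word counts from Kafka topic 1
PCollection<KV<String, Long>> pc2 = ...; // word counts from Kafka topic 2
PCollection<KV<String, Long>> summed =
    PCollectionList.of(pc1).and(pc2)
        .apply(Flatten.<KV<String, Long>>pCollections())
        // one-minute fixed windows; the default trigger fires once per window
        // when the watermark passes the end of the window
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Combine.perKey(Sum.ofLongs()));
With no early firing, each key's counts from both topics land in the same window pane and come out already summed.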
You have set up both streams with the trigger Repeatedly.forever(AfterPane.elementCountAtLeast(1)) and discardingFiredPanes(). This will cause the CoGroupByKey to output as soon as possible after each input element and then reset its state each time. So it is normal behavior that it basically passes each input straight through.
Let me explain more: CoGroupByKey is executed like this:
All elements from stream1 and stream2 are tagged as you specified. So every (key, value1) from stream1 effectively becomes (key, (count1, value1)), and every (key, value2) from stream2 becomes (key, (count2, value2)).
These tagged collections are flattened together. So now there is one collection with elements like (key, (count1, value1)) and (key, (count2, value2)).
The combined collection goes through a normal GroupByKey. This is where triggers happen. So with the default trigger, you get (key, [(count1, value1), (count2, value2), ...]) with all the values for a key getting grouped. But with your trigger, you will often get separate (key, [(count1, value1)]) and (key, [(count2, value2)]) because each grouping fires right away.
The output of the GroupByKey is just wrapped in an API, CoGbkResult. In many runners this is just a filtered view of the grouped iterable.
Of course, triggers are nondeterministic, and runners are also allowed to have different implementations of CoGroupByKey. But the behavior you are seeing is expected. You probably don't want to use a trigger like that or discarding mode, or else you need to do more grouping downstream.
Generally, doing a join with CoGBK is going to require some work downstream, until Beam supports retractions.
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class)
.setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KV<String,Long>> stream1 = new KafkaWordCount("localhost:9092","test1")
.build(p);
PCollection<KV<String,Long>> stream2 = new KafkaWordCount("localhost:9092","test2")
.build(p);
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(stream1).and(stream2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState = StateSpecs.map();
@ProcessElement
public void processElement(ProcessContext c, @StateId(myState) MapState<String, Long> state){
KV<String,Long> e = c.element();
System.out.println("Thread ID :"+ Thread.currentThread().getId());
Long currCount = state.get(e.getKey()).read()==null? 0L:state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
})).apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values()
);
p.run().waitUntilFinish();

please help me to convert trigger to batch apex

Please help me convert my after trigger to batch Apex.
This trigger fires when the opportunity stage changes to Closed Won.
It runs through the line items and checks whether a forecast (custom object) exists for that account. If yes, it links to it; if not, it creates a new forecast.
My trigger works fine for some records, but on mass updates I get a timed-out error, so I am opting for batch Apex, which I have never written. Please help me.
trigger Accountforecast on Opportunity (after insert,after update) {
List<Acc__c> AccproductList = new List<Acc__c>();
List<Opportunitylineitem> opplinitemlist =new List<Opportunitylineitem>();
list<opportunitylineitem > oppdate= new list<opportunitylineitem >();
List<Acc__c> accquery =new List<Acc__c>();
List<date> dt =new List<date>();
Set<Id> sProductIds = new Set<Id>();
Set<Id> sAccountIds = new Set<Id>();
Set<id> saccprodfcstids =new set<Id>();
Acc__c accpro =new Acc__c();
string aname;
Integer i;
Integer myIntMonth;
Integer myIntyear;
Integer myIntdate;
opplinitemlist=[select Id,PricebookEntry.Product2.Name,opp_account__c,Opp_account_name__c,PricebookEntry.Product2.id, quantity,ServiceDate,Acc_Product_Fcst__c from Opportunitylineitem WHERE Opportunityid IN :Trigger.newMap.keySet() AND Acc__c=''];
for(OpportunityLineItem oli:opplinitemlist) {
sProductIds.add(oli.PricebookEntry.Product2.id);
sAccountIds.add(oli.opp_account__c);
}
accquery=[select id,Total_Qty_Ordered__c,Last_Order_Qty__c,Last_Order_Date__c,Fcst_Days_Period__c from Acc__c where Acc__c.product__c In :sproductids and Acc__c.Account__c in :saccountids];
for(Acc__c apf1 :accquery){
saccprodfcstids.add(apf1.id);
}
if(saccprodfcstids!=null){
oppdate=[select servicedate from opportunitylineitem where Acc__c IN :saccprodfcstids ];
i =[select count() from Opportunitylineitem where acc_product_fcst__c in :saccprodfcstids];
}
for(Opportunity opp :trigger.new)
{
if(opp.Stagename=='Closed Won')
{
for(opportunitylineitem opplist:opplinitemlist)
{
if(!accquery.isempty())
{
for(opportunitylineitem opldt :oppdate)
{
string myDate = String.valueOf(opldt);
myDate = myDate.substring(myDate.indexof('ServiceDate=')+12);
myDate = myDate.substring(0,10);
String[] strDate = myDate.split('-');
myIntMonth = integer.valueOf(strDate[1]);
myIntYear = integer.valueOf(strDate[0]);
myIntDate = integer.valueOf(strDate[2]);
Date d = Date.newInstance(myIntYear, myIntMonth, myIntDate);
dt.add(d);
}
dt.add(opp.closedate);
dt.sort();
integer TDays=0;
system.debug('*************dt:'+dt.size());
for(integer c=0;c<dt.size()-1;c++)
{
TDays=TDays+dt[c].daysBetween(dt[c+1]);
}
for(Acc__c apf : accquery)
{
apf.Fcst_Days_Period__c = TDays/i;
apf.Total_Qty_Ordered__c =apf.Total_Qty_Ordered__c +opplist.quantity;
apf.Last_Order_Qty__c=opplist.quantity;
apf.Last_Order_Date__c=opp.CloseDate ;
apf.Fcst_Qty_Avg__c=apf.Total_Qty_Ordered__c/(i+1);
Opplist.Acc__c =apf.Id;
}
}
else{
accpro.Account__c=opplist.opp_account__c;
accpro.product__c=opplist.PricebookEntry.Product2.Id;
accpro.opplineitemid__c=opplist.id;
accpro.Total_Qty_Ordered__c =opplist.quantity;
accpro.Last_Order_Qty__c=opplist.quantity;
accpro.Last_Order_Date__c=opp.CloseDate;
accpro.Fcst_Qty_Avg__c=opplist.quantity;
accpro.Fcst_Days_Period__c=7;
accproductList.add(accpro);
}
}
}
}
if(!accproductlist.isempty()){
insert accproductlist;
}
update opplinitemlist;
update accquery;
}
First of all, you should take a look at this: Apex Batch Processing
Once you get a better idea on how batches work, we need to take into account the following points:
Identify the object that requires more processing. Account? Opportunity?
Should the data be maintained across batch calls? Stateful?
Use correct data structure in terms of performance. Map, List?
From your code, we can see you have three objects: OpportunityLineItems, Accounts, and Opportunities. It seems that your Account object requires the most processing here.
It seems you're just keeping track of dates and not doing any aggregations, so you don't need to maintain state across batch calls.
Your code has the potential to hit governor limits, especially heap memory limits. You have four nested loops. Our suggestion would be to maintain the opportunity line items related to each Opportunity in a Map rather than in a List. Plus, we can get rid of those unnecessary for loops by refactoring the code as follows:
Note: This is just a template for the batch you will need to construct.
global class AccountforecastBatch implements Database.Batchable<sObject>
{
global Database.QueryLocator start(Database.BatchableContext BC)
{
// 1. Do some initialization here: (i.e. for(OpportunityLineItem oli:opplinitemlist) {sProductIds.add(oli.PricebookEntry.Product2.id)..}
// 2. return Opportunity object here: return Database.getQueryLocator([select id,Total_Qty_Ordered__c,Last_Order_Qty ....]);
}
global void execute(Database.BatchableContext BC, List<sObject> scope)
{
// 1. Traverse your scope which at this point will be a list of Accounts
// 2. You're adding dates inside the process for Opportunity Line Items. See if you can isolate this process outside the for loops with a Map data structure.
// 3. You have three potential database transactions here (insert accproductlist; update opplinitemlist; update accquery;).
//    Ideally, you will only need one DB transaction per batch. If you can complete step 2 above, you might only need
//    to update your opportunity line items. Otherwise, you're trying to do more than one thing in a method and you
//    will need to redesign your solution.
}
global void finish(Database.BatchableContext BC)
{
// send email or do some other tasks here
}
}

How to suppress the very next event of stream A whenever stream B fires

I want to stop stream A for exactly one notification whenever stream B fires. Both streams will stay online and won't ever complete.
A: o--o--o--o--o--o--o--o--o
B: --o-----o--------o-------
R: o-----o-----o--o-----o--o
or
A: o--o--o--o--o--o--o--o--o
B: -oo----oo-------oo-------
R: o-----o-----o--o-----o--o
Here's a version of my SkipWhen operator I did for a similar question (the difference is that, in the original, multiple "B's" would skip multiple "A's"):
public static IObservable<TSource> SkipWhen<TSource, TOther>(this IObservable<TSource> source,
IObservable<TOther> other)
{
return Observable.Create<TSource>(observer =>
{
object lockObject = new object();
bool shouldSkip = false;
var otherSubscription = new MutableDisposable();
var sourceSubscription = new MutableDisposable();
otherSubscription.Disposable = other.Subscribe(
x => { lock(lockObject) { shouldSkip = true; } });
sourceSubscription.Disposable = source.Where(_ =>
{
lock(lockObject)
{
if (shouldSkip)
{
shouldSkip = false;
return false;
}
else
{
return true;
}
}
}).Subscribe(observer);
return new CompositeDisposable(
sourceSubscription, otherSubscription);
});
}
If the current implementation becomes a bottleneck, consider changing the lock implementation to use a ReaderWriterLockSlim.
This solution will work when the observable is hot (and without refCount):
streamA
.takeUntil(streamB)
.skip(1)
.repeat()
.merge(streamA.take(1))
.subscribe(console.log);
.takeUntil(streamB): make stream A complete upon stream B producing a value.
.skip(1): make stream A skip one value upon starting (or as a result of .repeat()).
.repeat(): make stream A repeat (reconnect) indefinitely.
.merge(streamA.take(1)): offset the effect of .skip(1) at the beginning of the stream.
Example of making stream A skip a value every 5 seconds:
var streamA,
streamB;
streamA = Rx.Observable
.interval(1000)
.map(function (x) {
return 'A:' + x;
}).publish();
streamB = Rx.Observable
.interval(5000);
streamA
.takeUntil(streamB)
.skip(1)
.repeat()
.merge(streamA.take(1))
.subscribe(console.log);
streamA.connect();
You can also use this sandbox http://jsbin.com/gijorid/4/edit?js,console and execute BACTION() in the console while the code is running to manually push a value to streamB (which is helpful for analysing the code).