How to debug this intermittently dropped record in Apache Beam?

I am seeing intermittently dropped records (only for error messages, not for success ones). We have a test case that intermittently fails or passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline" in the test case. This is the relevant setup code, to which I have tracked the dropped record:
PCollectionTuple processed = records
    .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
        .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE)));

errors = errors.and(processed.get(TupleTags.FAILURE));

PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this:
if (dto.hasValidationErrors()) {
    RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
    error.getValidationErrors().addAll(dto.getValidationErrors());
    error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
    log.info("Tagging record row number=" + record.getRowNumber());
    c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
    return;
}
I see the "Tagging record row number=" log line for both of the 2 rows that fail. After that, however, we log on the very first line of ErrorPublisherFn.java, immediately after receiving each message, and SOMETIMES we only receive 1 of the 2 rows. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam is really annoying in its naming of threads (they all have the same name), so I added the thread hashcode to our logback pattern to get more insight, but I don't see anything useful there, and ErrorPublisherFn could publish #4 on any thread anyway.
OK, so now the big question: how can I insert more instrumentation to figure out why this record is being dropped INTERMITTENTLY?
Do I have to debug Apache Beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not others?
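One way to localize the drop without debugging Beam itself is to splice a passthrough "tap" DoFn between the stages, so each element is logged at every hop. This is a minimal sketch, assuming slf4j logging is configured; TapFn and its label are hypothetical names, not part of the original pipeline:

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical debugging DoFn: logs every element with its timestamp and pane,
// then re-emits it unchanged, so it can be spliced between any two transforms.
public class TapFn<T> extends DoFn<T, T> {
    private static final Logger log = LoggerFactory.getLogger(TapFn.class);
    private final String label;

    public TapFn(String label) {
        this.label = label;
    }

    @ProcessElement
    public void process(ProcessContext c) {
        // If a record is logged at one tap but not at the next, the drop is
        // happening in the transform between those two taps.
        log.info("{} element={} ts={} pane={}", label, c.element(), c.timestamp(), c.pane());
        c.output(c.element());
    }
}

Spliced into the pipeline above, it would look like:

errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Tap after flatten", ParDo.of(new TapFn<OrderlyBeamDto<Integer>>("after-flatten")))
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));

If both rows are logged "after-flatten" but only one reaches ErrorPublisherFn, the drop is between those steps; if one is already missing after the flatten, the tagged output or the flatten itself is the suspect.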
EDIT: Thankfully, this set of tests does not exercise errors from upstream, so the line "errors = errors.and(processed.get(TupleTags.FAILURE));" can be removed, which then forces me to remove ".apply("Flatten Roster File Error Count", Flatten.pCollections())". With those 2 lines removed, the issue goes away for 10 test runs in a row (though with something this flaky I can't say for certain that it is gone). Are we doing something wrong in the join and flattening? I checked the error structure: rowNumber is part of equals and hashCode, so there should be no duplicates, and I am not sure why duplicate objects would cause an intermittent failure either.
What more can be done to debug here and figure out why this join is not working in the TestPipeline?
How can I get insight into the flatten and join so I can debug why we lose an event, and why we lose it only 'sometimes'?
Is this a windowing issue, even though our job starts from a file that we read in and process? We wanted a constantly available Dataflow stream, since Google kept running into limits, but perhaps that was the wrong decision.
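On the windowing question: you can at least print the strategy each branch carries and, as an experiment, force the failure branch back into the global window before the flatten. A minimal sketch, reusing names from the snippet above and assuming the test only needs batch semantics:

import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;

// Log the windowing strategy each input to the Flatten is carrying; if the
// branches differ, that alone can explain surprises in the merged stream.
System.out.println(processed.get(TupleTags.FAILURE).getWindowingStrategy());

// Experiment: put the failure branch back into the global window so all
// errors share one window and one pane before flattening and publishing.
PCollection<OrderlyBeamDto<Integer>> rewindowedFailures =
    processed.get(TupleTags.FAILURE)
        .apply("Rewindow errors", Window.<OrderlyBeamDto<Integer>>into(new GlobalWindows()));

If the flakiness disappears once every input to the flatten shares the global window, the loss was a window/pane mismatch rather than a bug in the DoFns.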

Related

Is there a way to disable a mega-detailed error message in Rundeck's failed executions?

The message that is posted every time an execution fails is too verbose and creates too much noise. Is there a way to disable or hide it somehow? We have our own error messages within the scripts and don't need a red chunk of extra text displayed in the logs. Adding an error handler that will exit with code 0 is not an option because we still need the job to fail if a step fails.
That message is Rundeck's standard output in case of failure (you can manipulate the loglevel, but that only affects the text from your commands/scripts, not the final NonZeroResultCode line); you can suggest that here. The rest are just workarounds, as you mentioned.
This occurs even when using the Quiet output.

Moving from file-based tracing session to real time session

I need to log trace events during boot so I configure an AutoLogger with all the required providers. But when my service/process starts I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15 s period instead of just 10 s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening, and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code all right (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting to process the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the question of the 'auto-closing and restarting' feature: these are really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' when you use Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but I would frankly say expecting unusual combinations to just work is generally NOT realistic.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all they are IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file-size growth, and is often a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).
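For the recommended overlap-then-filter route, the bookkeeping is small. Here is a language-agnostic sketch (written in Java here; EventRecord and its fields are hypothetical stand-ins, not the TraceEvent API): remember the last timestamp the file-based session wrote, then drop every real-time event at or before it.

import java.util.ArrayList;
import java.util.List;

public final class OverlapFilter {

    // Hypothetical stand-in for a decoded ETW event. ETW timestamps have
    // 100 ns resolution, which is why identical timestamps mark duplicates.
    public static final class EventRecord {
        final long timestamp100ns;
        final String payload;

        EventRecord(long timestamp100ns, String payload) {
            this.timestamp100ns = timestamp100ns;
            this.payload = payload;
        }
    }

    // Keep only real-time events newer than the last event that made it
    // into the .etl file; anything at or before that cutoff is a duplicate.
    public static List<EventRecord> dropOverlap(List<EventRecord> realTime,
                                                long lastFileTimestamp100ns) {
        List<EventRecord> kept = new ArrayList<>();
        for (EventRecord e : realTime) {
            if (e.timestamp100ns > lastFileTimestamp100ns) {
                kept.add(e);
            }
        }
        return kept;
    }
}

The cutoff assumes each session delivers events in timestamp order; if yours can arrive out of order, compare (timestamp, provider, event id) tuples instead of the timestamp alone.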

Quartz.net tracking and misfiring

I have a few questions regarding quartz.net.
What is it that keeps track of whether there has been a misfire situation in Quartz.net?
What happens in the following scenarios:
If a job runs but cannot finish due to some bug, does that count as a misfire or not?
What happens if I republish the solution; is the tracking reset?
Is there a way to receive information on what the scheduler has done and has not been able to do?
I have the following code in my Run method:
IJobDetail dailyUserMailJob = new JobDetailImpl("DailyUserMailJob", null, typeof(Jobs.TestJob));

ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("trigger1", "group1")
    .WithCronSchedule("0 0 4 1 * ?", x => x.WithMisfireHandlingInstructionFireAndProceed())
    .Build();

this.Scheduler.ScheduleJob(dailyUserMailJob, trigger);
this.Scheduler.Start();
The job is supposed to run on the first of every month at 4 am.
When testing, I have set the system clock so that the job is missed for one month. According to the documentation, when using WithMisfireHandlingInstructionFireAndProceed the job should be run as the first thing that happens, but it doesn't. Is there something wrong with the code, or could there be some other reason the job is not run when using WithMisfireHandlingInstructionFireAndProceed()?
If a job is missed, there is logic to bring it back. However, there is a "window" on how far back to go.
<add key="quartz.jobStore.misfireThreshold" value="60000"/>
You can increase this value.
If you have an AdoStore, misfires are persisted. Thus, if "the power goes out", you can recover from misfires when restarting.
If you have a RamStore and "the power goes out", you won't get misfire handling: everything was in memory to begin with, and that memory is lost.
If you use SQL Server (AdoStore) and put a Profiler/Trace on it, you'll see the engine "poll" for misfires, with a "go back this far in time" window based on the misfireThreshold.
See this link for more detailed info, including a note on withMisfireHandlingInstructionFireAndProceed:
http://nurkiewicz.blogspot.com/2012/04/quartz-scheduler-misfire-instructions.html
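For completeness, here is a minimal sketch of setting the threshold programmatically. It uses the Java Quartz API, which Quartz.NET mirrors (in .NET the keys drop the org. prefix, as in the XML snippet above); the values are illustrative only:

import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class MisfireThresholdSketch {
    public static void main(String[] args) throws SchedulerException {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "RecoveringScheduler");
        props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        props.put("org.quartz.threadPool.threadCount", "1");
        // RAMJobStore loses misfire state on restart, as noted above; use an
        // ADO/JDBC store if misfires must survive "the power going out".
        props.put("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");
        // Default is 60000 ms; raising it widens the window used when
        // deciding how far back to look for missed fires.
        props.put("org.quartz.jobStore.misfireThreshold", "3600000"); // 1 hour

        Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();
        scheduler.start();
    }
}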

Calling a method from ABL code not working

When I create a new quote in Epicor, I would like to add an item from the parts form automatically.
I am trying to do this using the following ABL code which runs when 'GetNewQuoteHed' is called:
run Update.
run GetNewQuoteDtl.
run ChangePartNumMaster("Rod Tube").
ttQuoteDtl.OrderQty = 5.
run Update.
I am getting the error:
Index -1 is either negative or above rows count.
This error occurs for each line in my ABL code.
What am I doing wrong?
That's not the proper format for a 4GL error message (nor is it at all familiar), so I'd say it is an Epicor application message. Epicor support is probably your best bet. However, just guessing, it sounds like you might need to somehow initialize the thing that you're updating.
I agree with @Tom, but I would also say: try to isolate the error and see where it is raised. As soon as you find the point where the error is actually raised, it is normally much easier to figure out exactly what is going wrong and how to solve it.
When working between a 0-based and a 1-based system, there can also be issues with the first or last entry depending on which way you are moving, since indexes in 0-based systems run from 0 to n-1 while 1-based systems run from 1 to n; see the sketch below.
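To make the off-by-one point concrete, here is a tiny hypothetical sketch (Java, not Epicor/ABL code): a 1-based row reference that was never set maps to exactly the reported index of -1 in a 0-based collection.

import java.util.List;

public class RowIndexSketch {

    // Map a 1-based row number (0 meaning "no row selected") to a 0-based index.
    static int toZeroBased(int oneBasedRow) {
        return oneBasedRow - 1; // an unset row (0) becomes -1
    }

    static String rowAt(List<String> rows, int oneBasedRow) {
        int i = toZeroBased(oneBasedRow);
        if (i < 0 || i >= rows.size()) {
            // Same shape as the Epicor message: the row was never
            // initialized or selected before being used.
            throw new IllegalStateException(
                "Index " + i + " is either negative or above rows count");
        }
        return rows.get(i);
    }
}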

The SVNPoller is not triggered (warning in the twistd.log)

I am not sure what is going on, but I get this weird issue with Buildbot.
The SVNPoller is configured as it should be (I checked various example config files), and when I run buildbot checkconfig it says that everything is fine... but it won't work at all.
If I trigger a build via the scheduler class it works fine: I can retrieve the source updates and build without problems (tried with a 1 h timeframe).
The problem, though, is that the poller is not working, so even if I build each hour the changes column stays empty. (I do get the changes for the various versions: if I click on the build detail I can see the sourcestamp carrying the right, most recent revision every time I modify the codebase.) So if the build fails, I have no way to know who made the last change.
Another peculiar thing is that in the twistd.log I see this line:
Warning: no ChangeSources specified in c['change_source']
And I am not sure why it doesn't work, since checkconfig does not raise any error.
The result of this is of course that the only thing built is the hourly one, leaving me without the poller, and without knowing who is putting code in each build.
This is the code for the poller:
c['change source'] = SVNPoller(
    svnurl="svn+ssh://user@svnserver.domain.com/svn/project/trunk",
    pollinterval=60*5,
    histmax=10,
    project=myproj,
    svnbin='/usr/bin/svn')
So far it looks good to me, so I am not really sure what is wrong here, or why the SVNPoller is not triggering any build.
Does anyone have suggestions about why this is happening? Is there any other way to get changes from an SVN server? I am a total newbie at Buildbot and I am not getting much out of the manual, which reads more like a textbook than a manual that shows you how to do stuff :)
Thanks!!!!!
OK, silly me :) The problem was the missing underscore in change_source... once I added it, the problem was solved:
c['change_source'] = SVNPoller(
    svnurl=source_svn_url,
    pollinterval=60,
    histmax=10,
    project='The_project',
    svnbin='/usr/bin/svn')
This will poll the SVN codebase at source_svn_url (just put in your svn:// path), check every minute to see if anyone has made changes, and keep 10 changes in the record list (any change after the 10th will not show up, so use histmax carefully if you do a lot of commits).
Hope this helps anyone who uses Buildbot!