Talend memory leak in tHashOutput/tHashInput - talend

I have a semi-slow memory leak in a Talend joblet. I am using a tHashOutput/tHashInput pair in the middle of a joblet because I need to find out how many rows are in the flow. Therefore, I push them into a tHashOutput and later on reference tHashOutput_1_NB_LINE from the globalMap.
I have what I think are the proper options:
allRows - "append" is FALSE
tHashInput_1 - "Clear after reading" is TRUE
Yet when I run this for a period of time and analyze it with the Eclipse Memory Analyzer, I see objects building up over time. This is what I get after 12 hours:
This usage (64MB/12 hours) increases steadily and is unrelated to what the job is doing (i.e. whether it is actively pumping data or just idling; this code is invoked while idling as well). If I look inside the memory references in MAT, I can see strings that point me to this place in the code, like
tHashFile_DAAgentProductAccountCDC_delete_BPpuaT_jsonToDataPump_1_tHashOutput_2
(jsonToDataPump being the name of the joblet). Am I doing something wrong in using these hash components?

I believe you should configure your garbage collector to run at the shortest possible interval so that it takes care of unused objects in the application.

Replacement subjects stop for no reason in AnyLogic

I am having trouble getting my Gilt Replacement Rate slider to keep adding new animals to my model. I have added a giltQuarantine delay of 8 weeks just after the source block, which helps to visualize how the gilt replacement rate is working.
Everything is working, initially; however, after several weeks, the giltQuarantine delay drops to 0, and no new gilts enter the herd. The Gilt Replacement Rate adds the desired amount to the model, each week, with no stop time listed.
At around 30 weeks, the number of agents in the giltQuarantine delay begins to decline and finally becomes 0, while the number of sows in the system is only 167. It should be steadily increasing to 1000 sows.
I cannot see why this is happening, as I should have a consistent supply of gilts entering the herd each week, which the variable giltReplacement says is happening (see Model running at 54 weeks (screenshot 4)).
I also tried increasing the Gilt Replacement Rate, which worked for several weeks, but then also declined as the number of sows in the system reached 1024. I want my herd size to remain stable at 1000.
Is there any reason that would be causing this decline in replacement animals?
Probably because you limit the total number of arrivals in enterHerd to breedingHerd. Remove the limit and test it.
Also, you can see current rates and other characteristics of flow chart blocks by clicking on them at runtime. Maybe that helps you pinpoint the issue further.
If nothing works, simplify your setup. It is already fairly complex. If you run into this issue now without knowing what causes it, it shows that you do not follow a good modelling approach (add 1 tiny feature, test everything, repeat) :)

Why is executionStartToCloseTimeoutSeconds required?

When using the Java client to start a workflow in Cadence, "executionStartToCloseTimeoutSeconds" is required on the workflow. If I have a workflow that can run for an indeterminate amount of time, how do I get around this restriction?
That was a mistake to require this value. The new version of the platform I'm working on (temporal.io) defaults this value to infinity.
"an indeterminate amount of time"
First of all, I believe an indeterminate amount of time is not an infinite amount of time.
Letting a workflow execution run and grow infinitely is an anti-pattern in Cadence. See Recommendation #5 in this article: https://longquanzheng.github.io/cadence-lab/book/learnings/what-should-be-in-a-workflow-or-an-activity-in-cadence.html
A good timeout value can protect your workflow from growing infinitely.
Because it's not recommended to let a workflow run forever, as that can cause performance issues in both the worker and the server, the original idea was to force the client to provide a timeout value. We didn't provide a default, as it's difficult to pick a reasonable one for all use cases.
A default value that is too small would be even worse, because no one likes a workflow timing out unexpectedly in production, even though you can use the "Reset" command to reopen it.
A default value that is too big, like Maxim suggests, is slightly better than one that is too small. But I personally disagree with it, because it lets the client forget to think about how long the workflow will run and how large the workflow history will grow. This will also turn into a production issue at some point later.
The biggest issue I see is that this required option is not user-friendly. It should be a compile-time error instead of a runtime error. I think this is something we can probably improve in Cadence: if this is a required field, make it required in the coding experience. At the same time, providing some hardcoded fake "infinite" value to help with edge cases may also make sense.
Back to your question, I would suggest using some fake infinite value if you think the duration is indeterminate. A good example is the Cadence system workflow: https://github.com/uber/cadence/blob/11547ee6db5dd306cb507b263381a6ea94c3faf1/service/worker/scanner/workflow.go#L48

Is there a way to know how much of the EEPROM memory is used?

I have looked through the "logbook" and "datalogger" APIs and there is no way of telling that the data logger is almost full. I found the API call with the following path "/Mem/Logbook/IsFull". If I have understood it correctly, this will notify me when the log is full and the datalogger has stopped logging.
So my question is: Is there a way to know how much of the memory is currently in use, so that I can clean up old data (I need to do some calculations on it before it is deleted) before the EEPROM is full and the Datalogger stops recording?
The data memory of Logbook/DataLogger is conceptually a ring buffer. That's why /Mem/DataLogger/IsFull always returns false on Movesense sensor (Suunto uses the same API in its watches where the situation is different). Therefore the sensor never stops recording, it just replaces oldest data with new.
Here are a couple of strategies that you could use:
Plan A:
Create a new log (POST /Mem/Logbook/Entries => returns the logId for it)
Start Logging (PUT /Mem/DataLogger/State: LOGGING)
Every once in a while create a new log (POST /Mem/Logbook/Entries). Note: This can be done while logging is ongoing!
When you want to know what is the status of the log, read /Mem/Logbook/Entries. When the oldest entry has completely been overwritten, it disappears from the list. Note: The GET /Entries is a heavy operation so you may not want to do it when the logger is running!
Plan B:
Every now and then start a new log and process the previous one. That way the log never overwrites something you have not processed.
Plan C:
(Note: This is low level and may break with some future Movesense sensor release)
GET the first 256 bytes of EEPROM chip #0 using the /Component/EEPROM API. This area contains a number of ExtflashChunkStorage::StorageHeader structs (see: ExtflashChunkStorage.h), rest is filled with 0xFF. The last StorageHeader before 0xFF is the current one. With that StorageHeader one can see where the ring buffer starts (firstChunk) and where next data is written (cursor). The difference of the two is the used memory. (Note: Since it is a ring buffer the difference can be negative. In that case add the "Size of Logbook area - 256" to it)
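To make the arithmetic in Plan C concrete, here is a minimal Python sketch of the used-memory calculation only. It assumes firstChunk and cursor have already been parsed out of the current StorageHeader and are expressed in the same units as the Logbook area size. The function and parameter names are hypothetical, and parsing the header itself is left out because its layout is defined in ExtflashChunkStorage.h.

```python
# Size of the header region at the start of the EEPROM, as described above.
HEADER_AREA_BYTES = 256


def used_bytes(first_chunk: int, cursor: int, logbook_area_size: int) -> int:
    """Estimate how much of the ring buffer is currently in use.

    first_chunk       -- where the ring buffer starts (StorageHeader.firstChunk)
    cursor            -- where the next data will be written (StorageHeader.cursor)
    logbook_area_size -- total size of the Logbook area (assumed known to the caller)
    """
    used = cursor - first_chunk
    if used < 0:
        # The write position has wrapped around the end of the ring buffer,
        # so add the usable size (Logbook area minus the header region).
        used += logbook_area_size - HEADER_AREA_BYTES
    return used
```

For example, with a hypothetical Logbook area of 65536, `used_bytes(1000, 5000, 65536)` gives 4000, while the wrapped case `used_bytes(5000, 1000, 65536)` gives 61280.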
Full disclosure: I work for the Movesense team

Profiler self and total time difference

I am working on some code in which speed and time are of high importance. I am using the profiler to find the bottleneck of my code, but I cannot understand some things in the profiler.
First, what do self and total time mean?
Second, it has something called workspacefunc>local_min and workspacefunc>local_max. What are they?
Self time is the total time spent in a function, not including any time spent in the child functions it calls. As an example, if you had a function which was calling a whole bunch of other functions, the self time only includes the time spent in the main function itself and not in any of the other functions it calls.
Total time is the total time spent on a function (makes sense, right?). This includes the time spent in all of the child functions called. Also, be aware that the profiler itself takes some time to execute, which is included in the results. One more small thing: the total time can be zero for functions whose running time is inconsequential.
Reference: http://www.mathworks.com/help/matlab/matlab_prog/profiling-for-improving-performance.html
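To illustrate how the two numbers relate, here is a small sketch in Python (not MATLAB, and not how the profiler is implemented). It assumes each function's total time and the total times of its direct children are known; the class and the timings are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfileNode:
    """One function in a hypothetical profile report."""
    name: str
    total_time: float  # time in this function, including everything it calls
    children: List["ProfileNode"] = field(default_factory=list)

    def self_time(self) -> float:
        # Self time is what remains after subtracting the total time
        # of every direct child call.
        return self.total_time - sum(c.total_time for c in self.children)


# Made-up timings: main takes 10 s overall, 9 s of which is in its helpers.
main = ProfileNode("main", 10.0, [
    ProfileNode("helper_a", 6.0),
    ProfileNode("helper_b", 3.0),
])
```

Here `main.self_time()` is 1.0 even though its total time is 10.0, so the bottleneck to look at is `helper_a`, not `main`.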
workspacefunc... there doesn't seem to be any documentation on it, but this is the help text that I get when checking what it does:
workspacefunc Support function for Workspace browser component.
The Workspace browser is a window that shows you all of the variables that are defined in your workspace. If I were to take an educated guess, profiler does some analysis on your workspace variables, which include the min and max of certain variables in your workspace. I can't really say much more as there is absolutely no documentation on this, but it's safe to ignore. Simply focus on the functions that you are calling from your own code.

Simultaneously incrementing the program counter and loading the Instruction register

In my Computer Architecture lectures, I was told that the IR assignment and PC increment are done in parallel. However, surely this has an effect on which instruction is loaded.
If PC = 0 and the IR is loaded before the PC is incremented, then the IR will hold the instruction that was at address 0.
However, if PC = 0 and the PC is incremented before the IR is loaded, then the IR will hold the instruction that was at address 1.
So surely they can't be done simultaneously and the order must be defined?
You're not taking into account the wonders of flip-flops. The exact implementation depends of course on your specific design, but it's perfectly possible to read the value currently latched in some register or latch, while at the same time preparing a different value to be stored there, as long as you know these values are independent (there's also a possibility of doing a "bypass" in more sophisticated designs, but that's beside the point here).
In this case, you'd be reading the current value of the PC (and using it to fetch the code from memory, or cache, or whatever), while preparing the next value (e.g. PC+4, or some branch target if you know it). This is how pipelines work.
Generally speaking, you either have enough time to do some work within the same cycle (incrementing the PC and using it for code fetch), in which case they'll fit in the same pipestage, or, if you can't make it in time, you just break these serial activities into two pipestages, so that they can be done in "parallel" because one of them belongs to the next operation flowing through the pipe, so there's no longer a dependency (aside from corner cases like branches or bubbles).
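As a rough software analogy, the sketch below models one clock edge in Python. It assumes a toy machine with word-addressed memory and a PC that increments by 1; the function name and memory contents are made up. The point is that both next values are computed from the PC as it was before the edge, and only then stored together, so the order question never arises.

```python
from typing import List, Tuple


def clock_edge(pc: int, memory: List[int]) -> Tuple[int, int]:
    """Return (next_pc, next_ir) for one clock edge of a toy fetch stage."""
    # Both next-state values are derived from the *current* PC,
    # which is what the flip-flops present on their outputs during the cycle.
    next_ir = memory[pc]
    next_pc = pc + 1
    # On the clock edge both registers latch their new values at once.
    return next_pc, next_ir


# Hypothetical instruction memory.
memory = [0xA0, 0xB1, 0xC2]

pc = 0
pc, ir = clock_edge(pc, memory)   # IR gets the instruction at address 0, PC becomes 1
first = (pc, ir)
pc, ir = clock_edge(pc, memory)   # IR gets the instruction at address 1, PC becomes 2
second = (pc, ir)
```

After the first edge the IR holds the instruction from address 0 while the PC already reads 1, which matches the first of the two orderings in the question without any sequencing inside the cycle.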