Windows Task Scheduler incorrectly spawns multiple instances per trigger (but need to be able to run multiple instances in parallel)

I have a task set to run an executable every 30 minutes via Windows Task Scheduler (OS: 64bit Windows Server Standard with SP2). This task needs to be able to run multiple instances of itself simultaneously, so this setting is selected: "If the task is already running, then the following rule applies: Run a new instance in parallel". (Reason: the task processes records in a queue table which may be empty or contain hundreds of thousands of records. Each task instance reserves a chunk of records to work on, so the instances won't collide)
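To illustrate the reservation pattern (this is not the actual executable's code; the table and column names below are hypothetical, and it assumes a SQL Server-style UPDATE TOP), each instance stamps a batch of unclaimed rows with its own ID in a single atomic UPDATE and then only processes the rows it stamped:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

// Hypothetical sketch of per-instance chunk reservation, not the real executable.
public class ChunkReservation {
    public static void main(String[] args) throws Exception {
        String instanceId = UUID.randomUUID().toString(); // unique per task instance
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=Queue;integratedSecurity=true")) {
            // Claim up to 500 unclaimed rows in one atomic statement, so two
            // instances can never grab the same row.
            String claim = "UPDATE TOP (500) QueueRecords SET ClaimedBy = ? WHERE ClaimedBy IS NULL";
            try (PreparedStatement ps = con.prepareStatement(claim)) {
                ps.setString(1, instanceId);
                int claimed = ps.executeUpdate();
                System.out.println("Claimed " + claimed + " records as " + instanceId);
            }
            // ...then SELECT ... WHERE ClaimedBy = ? and process only those rows...
        }
    }
}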
The problem is, the task spawns multiple NEW instances at each trigger interval. It should only fire ONE NEW instance every 30 minutes. Often it spawns 2, 3, 4 or more new instances. At this point, the executable can handle the duplicate new instances without significant errors, but the server is doing more work than it needs to, and it just bugs me to no end that the task scheduler is misbehaving in this way. Here is what I have tried so far to fix it:
Deleted and recreated the task (many times)
Rebooted the server
Installed this hotfix: http://support.microsoft.com/en-us/kb/2461249
Set to run every 30 minutes indefinitely
Set to run every 30 minutes daily, for duration of one day
Set "Synchronize across time zones" = true
Set "Run with highest privileges" = true
Set "Delay task for up to random delay of [X] seconds" = false (multiple new - instances are spawned all within the same second)
Set "Delay task for up to random delay of [30] seconds" = true (instead of firing during the same second, multiple new instances fire within a 30 second span)
Set "If the task fails, restart every 1 minute" = true
Set "If the task fails, restart every 1 minute" = false
Set "Run task as soon as possible after a scheduled start is missed" = false (if set to true, the problem is worse)
Even more puzzling: some other tasks on this server have the same or similar settings and do not have this problem. They had the problem before the hotfix, but after the hotfix it has been rare, except for this one task. What on earth could be the problem?
The exported task settings are below (with XXXX replacing sensitive info). I compared this point for point with another similar task that is not having the issue. The only differences: the working task has a different author, a different exe file, and runs every 5 minutes instead of every 30 minutes.
I'm about to chalk this up as a bug that Microsoft needs to fix some day, but thought I'd offer it up for review here before giving up.
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
<RegistrationInfo>
<Date>2014-08-27T10:09:33.7980839</Date>
<Author>XXXX\XXXX</Author>
<Description>Links newly downloaded images to products. Resizes and uploads different sizes to XXXX. Updates relevant tables. Logs errors.</Description>
</RegistrationInfo>
<Triggers>
<CalendarTrigger>
<Repetition>
<Interval>PT30M</Interval>
<Duration>P1D</Duration>
<StopAtDurationEnd>false</StopAtDurationEnd>
</Repetition>
<StartBoundary>2015-02-11T19:06:00Z</StartBoundary>
<Enabled>true</Enabled>
<RandomDelay>PT30S</RandomDelay>
<ScheduleByDay>
<DaysInterval>1</DaysInterval>
</ScheduleByDay>
</CalendarTrigger>
</Triggers>
<Principals>
<Principal id="Author">
<UserId>XXXX\XXXX</UserId>
<LogonType>Password</LogonType>
<RunLevel>HighestAvailable</RunLevel>
</Principal>
</Principals>
<Settings>
<IdleSettings>
<Duration>PT10M</Duration>
<WaitTimeout>PT1H</WaitTimeout>
<StopOnIdleEnd>true</StopOnIdleEnd>
<RestartOnIdle>false</RestartOnIdle>
</IdleSettings>
<MultipleInstancesPolicy>Parallel</MultipleInstancesPolicy>
<DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
<StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
<AllowHardTerminate>true</AllowHardTerminate>
<StartWhenAvailable>false</StartWhenAvailable>
<RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
<AllowStartOnDemand>true</AllowStartOnDemand>
<Enabled>true</Enabled>
<Hidden>false</Hidden>
<RunOnlyIfIdle>false</RunOnlyIfIdle>
<WakeToRun>true</WakeToRun>
<ExecutionTimeLimit>P1D</ExecutionTimeLimit>
<Priority>7</Priority>
</Settings>
<Actions Context="Author">
<Exec>
<Command>C:\XXXX\XXXX\XXXX.exe</Command>
</Exec>
</Actions>
</Task>

I was in the same boat: I had one task firing off 3 times and another firing off 9 times, but also a bunch that only fired once as expected. The problem persisted after installing the hotfix as well.
After all my research turned up no good leads, my next step was going to be opening a support case with Microsoft. Before doing so I figured I would try deleting and recreating the task now that I've installed the patch. I started by only deleting and recreating the trigger (which was set to run once a day) and setting it for a different time. Bingo, my problem was fixed!
So I don't know what the key was, whether it was deleting and recreating the trigger after the patch was installed or if it was changing the time, but doing both worked.
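If you prefer to script that instead of clicking through the GUI, one way (untested here; the task name and paths are placeholders) is to export the task, edit the trigger's <StartBoundary> in the XML to a different start time, delete the task, and re-register it from the edited XML:
schtasks /Query /TN "MyImageTask" /XML > MyImageTask.xml
REM edit <StartBoundary> inside MyImageTask.xml to a different time
schtasks /Delete /TN "MyImageTask" /F
schtasks /Create /TN "MyImageTask" /XML MyImageTask.xml /RU DOMAIN\user /RP password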
Hope this helps!

Related

JES2 SPOOL volume waiting for jobs, but $DJ shows no jobs

I’m trying to drain a JES2 SPOOL volume. It says it’s waiting for jobs:
$DSPL
$HASP893 VOLUME(SPLZ00) 852
$HASP893 VOLUME(SPLZ00) STATUS=DRAINING,AWAITING(JOBS),
$HASP893 PERCENT=2
$HASP893 VOLUME(SPLZ01) STATUS=ACTIVE,PERCENT=38
$HASP893 VOLUME(SPLZ02) STATUS=ACTIVE,PERCENT=36
$HASP646 37.5371 PERCENT SPOOL UTILIZATION
But when I look to see which jobs it’s waiting for, I don’t find any:
$DJ(*),SPL=(VOL=SPLZ00)
$HASP003 RC=(52),D 879
$HASP003 RC=(52),D J(*) - NO SELECTABLE ENTRIES FOUND MATCHING
$HASP003 SPECIFICATION
Any ideas about why this volume won’t finish draining?
Thanks to Dave Gibney on the IBM-MAIN mailing list (IBM-MAIN#LISTSERV.UA.EDU), I have the answer.
$DJ doesn't show started tasks or TSO users. $DJQ(*),SPL=(VOL=SPLZ00) displays everything. There's also $DS, which shows only STCs, and $DT, which shows only TSUs.
Although the $DJQ command shows batch jobs, TSO users and started tasks, it does not include job group logging jobs. You would need to use the command $DG(*),SPL=(VOL=SPLZ00) to show any job groups using a spool volume.

Spring Batch state when a step fails

I'm trying out Spring Batch. I have seen many examples of running jobs via ItemReader and ItemWriter. If a job runs without errors there is no problem.
But I haven't found out how to handle state when a job fails after processing a number of records.
My scenario is really simple: read records from an XML file (ItemReader) and call an external system for storage (ItemWriter). So what happens if the external system is not available in the middle of the process and after a while the job status is set to FAILED? If I restart the job manually the next day, when the external system is up and running again, I will get duplicates of the previously loaded records.
In some way I must have information for skipping the already loaded records.
I have tried to store a cursor via the ExecutionContext, but when I restart the job I get a new JOB_EXECUTION_ID and the cursor data is lost, because I get a new row in BATCH_STEP_EXECUTION_CONTEXT.SHORT_CONTEXT. BATCH_STEP_EXECUTION.COMMIT_COUNT and BATCH_STEP_EXECUTION.READ_COUNT are also reset on restart.
I restart the job by using the JobOperator:
jobOperator.restart(jobExecutionId);
Is there a way to restart a job without getting a new jobExecutionId, or an alternative way to get the state of failed jobs? If someone can provide an example covering state and error handling I would be happy.
One alternative solution is of course to create my own table that keeps track of processed records, but I really hope the framework has a mechanism for this. Otherwise I don't understand the idea behind spring-batch.
Regards
Mats
One of the primary features Spring Batch provides is the persistence of the state of a job in the job repository. When a job fails, upon restart, the default behavior is for the job to restart at the step that failed (skipping the steps that have already been successfully completed). Within a chunk based step, most of our readers (the StaxEventItemReader included) store what records have been processed in the job repository (specifically within the ExecutionContext). By default, when a chunk based step fails, it's restarted at the chunk that failed last time, skipping the successfully processed chunks.
An example of all of this would be if you had a three step job:
<job id="job1">
<step id="step1" next="step2">
<tasklet>
<chunk reader="reader1" writer="writer1" commit-interval="10"/>
</tasklet>
</step>
<step id="step2" next="step3">
<tasklet>
<chunk reader="reader2" writer="writer2" commit-interval="10"/>
</tasklet>
</step>
<step id="step3">
<tasklet>
<chunk reader="reader3" writer="writer3" commit-interval="10"/>
</tasklet>
</step>
</job>
And let's say this job completes step1, then step2 has 1000 records to process but fails at record 507. The chunk that consists of records 500-510 would roll back and the job would be marked as failed. The restart of that job would skip step1, skip records 1-499 in step2 and start back at record 500 of step2 (assuming you're using stateful item readers).
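If you write your own reader instead of using one of the provided ones, the way to get that same restart behaviour is to implement ItemStream and keep your cursor in the step's ExecutionContext; Spring Batch persists it at every chunk commit and hands it back to you on restart, even though the restart is a new JobExecution. A rough sketch (class and key names are made up for illustration):
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;

// Illustrative restartable reader (not the StaxEventItemReader itself).
public class CursorTrackingReader implements ItemReader<String>, ItemStream {

    private static final String CURSOR_KEY = "cursorTrackingReader.position";
    private int position = 0;

    public void open(ExecutionContext executionContext) throws ItemStreamException {
        if (executionContext.containsKey(CURSOR_KEY)) {
            position = executionContext.getInt(CURSOR_KEY); // restored on restart
        }
    }

    public void update(ExecutionContext executionContext) throws ItemStreamException {
        executionContext.putInt(CURSOR_KEY, position); // saved at every chunk commit
    }

    public void close() throws ItemStreamException {
    }

    public String read() {
        // In a real reader: fetch the record at 'position' from the source,
        // increment 'position' and return it; return null when the input is exhausted.
        return null;
    }
}
When a reader like this is the one configured on the step, it is normally registered as a stream automatically, which is why the provided stateful readers "just work" across restarts.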
With regards to the jobExecutionId on a restart, Spring Batch has the concept of a job instance (a logical run) and a job execution (a physical run). For a job that runs daily, the logical run would be the Monday run, the Tuesday run, etc. Each of these would be its own JobInstance. If the job is successful, the JobInstance would end up with only one JobExecution associated with it. If it failed and was re-run, a new JobExecution would be created for each of the times the job is restarted.
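To make the instance/execution distinction concrete, here is a hedged sketch (the jobLauncher and job are assumed to be wired up elsewhere, and the parameter name is illustrative): identifying job parameters define the JobInstance, and re-launching with the same parameters after a failure adds a second JobExecution to that same instance.
import java.util.Date;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

// Illustration of JobInstance vs JobExecution.
public class DailyRunExample {

    public static void run(JobLauncher jobLauncher, Job job) throws Exception {
        // "run.date" identifies the logical run (the JobInstance), e.g. the Monday run.
        JobParameters params = new JobParametersBuilder()
                .addDate("run.date", new Date())
                .toJobParameters();

        // First attempt: creates the JobInstance and its first JobExecution.
        JobExecution first = jobLauncher.run(job, params);

        // If that attempt FAILED, launching again with the same parameters does not
        // create a new JobInstance; it adds a second JobExecution to the same instance
        // and resumes at the failed step/chunk.
        JobExecution second = jobLauncher.run(job, params);
        System.out.println(first.getId() + " vs " + second.getId()); // different execution ids
    }
}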
You can read more about error handling in general and specific scenarios in the Spring Batch documentation found here: http://docs.spring.io/spring-batch/trunk/reference/html/index.html

Glassfish4 log rotation "Maximum History Files" issue

Environment:
Glassfish 4.0 (only one DAS), Windows Server 2012 R2, Java 1.7.0_51
Create the DAS instance service by using the create-service subcommand.
Issue:
The maximum history files attribute has been set; however, GlassFish Server couldn't remove the old log files due to the lock file server.log.lck.
Path --> C:\glassfish4\glassfish\domains\domain1\config\logging.properties
com.sun.enterprise.server.logging.GFFileHandler.maxHistoryFiles=10
Log Snippet:
[2014-12-10T18:00:39.372+0900] [glassfish 4.0] [SEVERE] [] [] [tid: _ThreadID=16 _ThreadName=Thread-5] [timeMillis: 1418202039372] [levelValue: 1000] [[
java.util.logging.ErrorManager: 0: FATAL ERROR: COULD NOT DELETE LOG FILE.]]
[2014-12-10T18:00:39.372+0900] [glassfish 4.0] [SEVERE] [] [] [tid: _ThreadID=16 _ThreadName=Thread-5] [timeMillis: 1418202039372] [levelValue: 1000] [[
java.io.IOException: Could not delete log file: C:\glassfish4\glassfish\domains\domain1\logs\server.log.lck
at com.sun.enterprise.server.logging.GFFileHandler.cleanUpHistoryLogFiles(GFFileHandler.java:725)
at com.sun.enterprise.server.logging.GFFileHandler$4.run(GFFileHandler.java:802)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.enterprise.server.logging.GFFileHandler.rotate(GFFileHandler.java:744)
at com.sun.enterprise.server.logging.GFFileHandler$1.run(GFFileHandler.java:301)
at com.sun.enterprise.server.logging.LogRotationTimerTask.run(LogRotationTimerTask.java:68)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)]]
Findings:
1. If the lock file “server.log.lck” exists in the log folder, the issue occurs, and the above errors can be found in the log every day when the GlassFish server tries to remove the old log files. If there is no “server.log.lck” in the log folder, there is no issue and everything works properly.
2. If the DAS instance is started by the command “asadmin start-domain domain1”, no lock file “server.log.lck” is generated in the log folder. But if the DAS instance is started as a Windows service, the lock file “server.log.lck” is generated automatically and stays at 0 KB; when the service is stopped, the file is removed automatically.
3. If the DAS instance is started by the command “asadmin start-domain -w domain1”, which adds the watchdog option, the lock file “server.log.lck” is generated automatically and exists until the service is stopped.
4. When the lock file “server.log.lck” appears, there is always one extra java.exe process. Therefore, when the DAS instance is started from the Windows service, there are two “java.exe” processes running and “server.log.lck” is held open by one of them.
Questions:
1. I’d like to start/stop the DAS instance as a Windows service, not by using the subcommand. Moreover, I don’t want to keep all GlassFish logs on my server, since that would cause a disk-full issue, so I would prefer to turn on the GlassFish “Maximum History Files” logging option. Is there any workaround or solution for that?
2. Is this a defect in GlassFish, or is it just a settings issue? I tried installing on other servers and they all had the same issue.
3. Why are there two java.exe processes running when started as a Windows service? Is the second one used for the “watchdog”?
Thank you so much for your help and please let me know if there is any further information you’d like to know or want me to do some other tests.
In case someone is still struggling, I found a solution.
When you create a GF service in a Windows environment via asadmin create-service, GF creates a file domain1Service.xml in glassfish\domains\domain1\bin which contains the parameters for the server to start.
It looks something like the following:
<service>
<id>domain1</id>
<name>domain1 GlassFish Server</name>
<description>GlassFish Server</description>
<executable>C:/Supertel-NMSv3/glassfish-4.1/glassfish/lib/nadmin.bat</executable>
<logpath>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains/domain1/bin</logpath>
<logmode>reset</logmode>
<depend>tcpip</depend>
<startargument>start-domain</startargument>
<startargument>--watchdog</startargument>
<startargument>--domaindir</startargument>
<startargument>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains</startargument>
<startargument>domain1</startargument>
<stopargument>stop-domain</stopargument>
<stopargument>--domaindir</stopargument>
<stopargument>C:\\Supertel-NMSv3\\glassfish-4.1\\glassfish\\domains</stopargument>
<stopargument>domain1</stopargument>
</service>
The line <startargument>--watchdog</startargument> is responsible for launching the watchdog process, which prevents the log file from being deleted.
You can't just delete this startargument element (the service won't start), but you can switch it off by setting the flag to false, like this:
<startargument>--watchdog=false</startargument>
After that the service will start just like via a manual start-domain command, without the watchdog process.
You would have to do this after every service creation, which could get pretty annoying, so I did some further research.
It turns out that asadmin creates the OS-specific domainService.xml from templates located in glassfish\lib\install\templates. Those templates are also OS-specific, and the template for Windows (named Domain-service-winsw.xml.template) looks like this:
<service>
<id>%%%NAME%%%</id>
<name>%%%DISPLAY_NAME%%%</name>
<description>GlassFish Server</description>
<executable>%%%AS_ADMIN_PATH%%%</executable>
<logpath>%%%LOCATION%%%/%%%ENTITY_NAME%%%/bin</logpath>
<logmode>reset</logmode>
<depend>tcpip</depend>
<startargument>%%%START_COMMAND%%%</startargument>
<startargument>--watchdog</startargument>
%%%CREDENTIALS_START%%%%%%LOCATION_ARGS_START%%%<startargument>%%%ENTITY_NAME%%%</startargument>
<stopargument>%%%STOP_COMMAND%%%</stopargument>
%%%CREDENTIALS_STOP%%%%%%LOCATION_ARGS_STOP%%%<stopargument>%%%ENTITY_NAME%%%</stopargument>
</service>
So you can edit the template directly, setting the parameter to --watchdog=false, and the change will be reflected in all domainService.xml files created in the future.
Hope it helps.
That’s not the right solution. The watchdog has an important function: it monitors whether the service is running or not. Without the watchdog, GlassFish starts correctly, but shortly afterwards the system no longer knows whether the service is still running or has crashed. In the Services GUI, only the “Start” button is active (always!); “Stop” and “Restart” cannot be used.
A proper solution would be the ability to change the path of the lock file.

Are multiple Task elements allowed in <Startup>, or do I have to merge these cmd files into one?

I am new to Azure development and writing PowerShell scripts.
I want to run two cmd files as Azure startup tasks. I added these files to the solution and set their properties to "Copy Always". Then I added a new node to ServiceDefinition.csdef. Here it is:
<Startup>
<Task commandLine="Startup\startupcmd.cmd > c:\logs\startuptasks.log" executionContext="elevated" taskType="background">
<Environment>
<Variable name="EMULATED">
<RoleInstanceValue xpath="/RoleEnvironment/Deployment/#emulated" />
</Variable>
</Environment>
</Task>
<Task commandLine="Startup\disableTimeout.cmd" executionContext="elevated" />
</Startup>
It's not deploying, and I'm getting this error: Instance 0 of role Web is busy
Now my question: are multiple Task elements allowed in <Startup>, or do I have to merge these cmd files into one?
As per definition:
The Startup element describes a collection of tasks that run when the
role is started.
So the answer to your concrete question is: yes, you can define multiple startup tasks.
The Busy state is almost fine, in the sense that it is a bit better than cycling! What I would suggest is to enable Remote Desktop and connect to see what is going on with the startup tasks. Busy is set until all simple tasks have completed and returned exit code 0. Your task may be failing or hanging for a while, and that's why you see Busy.
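To make that concrete, here is a minimal sketch of a well-behaved simple startup task (your actual cmd file contents aren't shown in the question, so this is purely illustrative); the important part is that it finishes and returns exit code 0 so the instance can leave the Busy state:
REM Illustrative startup cmd, e.g. Startup\disableTimeout.cmd (not your actual script)
ECHO Startup task starting >> "%TEMP%\StartupLog.txt" 2>&1
REM ...the real work of the task goes here...
IF ERRORLEVEL 1 EXIT /B 1
REM Returning 0 lets the simple task complete so the role can leave the Busy state
EXIT /B 0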

NServiceBus pipeline with Distributors

I'm building a processing pipeline with NServiceBus but I'm having trouble with the configuration of the distributors in order to make each step in the process scalable. Here's some info:
The pipeline will have a master process that says "OK, time to start" for a WorkItem, which will then start a process like a flowchart.
Each step in the flowchart may be computationally expensive, so I want the ability to scale out each step. This tells me that each step needs a Distributor.
I want to be able to hook additional activities onto events later. This tells me I need to Publish() messages when it is done, not Send() them.
A process may need to branch based on a condition. This tells me that a process must be able to publish more than one type of message.
A process may need to join forks. I imagine I should use Sagas for this.
Hopefully these assumptions are good otherwise I'm in more trouble than I thought.
For the sake of simplicity, let's forget about forking or joining and consider a simple pipeline, with Step A followed by Step B, and ending with Step C. Each step gets its own distributor and can have many nodes processing messages.
NodeA workers contain an IHandleMessages processor, and publish EventA
NodeB workers contain an IHandleMessages processor, and publish EventB
NodeC workers contain an IHandleMessages processor, and then the pipeline is complete.
Here are the relevant parts of the config files, where # denotes the number of the worker, (i.e. there are input queues NodeA.1 and NodeA.2):
NodeA:
<MsmqTransportConfig InputQueue="NodeA.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeA.Distrib.Control" DistributorDataAddress="NodeA.Distrib.Data" >
<MessageEndpointMappings>
</MessageEndpointMappings>
</UnicastBusConfig>
NodeB:
<MsmqTransportConfig InputQueue="NodeB.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeB.Distrib.Control" DistributorDataAddress="NodeB.Distrib.Data" >
<MessageEndpointMappings>
<add Messages="Messages.EventA, Messages" Endpoint="NodeA.Distrib.Data" />
</MessageEndpointMappings>
</UnicastBusConfig>
NodeC:
<MsmqTransportConfig InputQueue="NodeC.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeC.Distrib.Control" DistributorDataAddress="NodeC.Distrib.Data" >
<MessageEndpointMappings>
<add Messages="Messages.EventB, Messages" Endpoint="NodeB.Distrib.Data" />
</MessageEndpointMappings>
</UnicastBusConfig>
And here are the relevant parts of the distributor configs:
Distributor A:
<add key="DataInputQueue" value="NodeA.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeA.Distrib.Control"/>
<add key="StorageQueue" value="NodeA.Distrib.Storage"/>
Distributor B:
<add key="DataInputQueue" value="NodeB.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeB.Distrib.Control"/>
<add key="StorageQueue" value="NodeB.Distrib.Storage"/>
Distributor C:
<add key="DataInputQueue" value="NodeC.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeC.Distrib.Control"/>
<add key="StorageQueue" value="NodeC.Distrib.Storage"/>
I'm testing using 2 instances of each node, and the problem seems to come up in the middle at Node B. There are basically 2 things that might happen:
Both instances of Node B report that it is subscribing to EventA, and also that NodeC.Distrib.Data#MYCOMPUTER is subscribing to the EventB that Node B publishes. In this case, everything works great.
Both instances of Node B report that it is subscribing to EventA, however, one worker says NodeC.Distrib.Data#MYCOMPUTER is subscribing TWICE, while the other worker does not mention it.
In the second case, which seems to be controlled only by the way the distributor routes the subscription messages, if the "overachiever" node processes an EventA, all is well. If the "underachiever" processes EventA, then the publish of EventB has no subscribers and the workflow dies.
So, my questions:
Is this kind of setup possible?
Is the configuration correct? It's hard to find any examples of configuration with distributors beyond a simple one-level publisher/2-worker setup.
Would it make more sense to have one central broker process that does all the non-computationally-intensive traffic cop operations, and only sends messages to processes behind distributors when the task is long-running and must be load balanced?
Then the load-balanced nodes could simply reply back to the central broker, which seems easier.
On the other hand, that seems at odds with the decentralization that is NServiceBus's strength.
And if this is the answer, and the long running process's done event is a reply, how do you keep the Publish that enables later extensibility on published events?
The problem you have is that your nodes don't see each other's list of subscribers. The reason you're having that problem is that you're trying out a production scenario (scale-out) under the default NServiceBus profile (Lite), which doesn't support scale-out but makes single-machine development very productive.
To solve the problem, run the NServiceBus host using the production profile as described on this page:
http://docs.particular.net/nservicebus/hosting/nservicebus-host/profiles
That will let different nodes share the same list of subscribers.
Other than that, your configuration is right on.
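For what it's worth, with the generic host the profile is simply passed on the command line when you start each endpoint, along these lines (adjust to your own endpoint names and paths):
NServiceBus.Host.exe NServiceBus.Production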