Design and error handling in a Windows service

I have to design a Windows service and I have some questions:
Proper error handling - if an error occurs, what happens to the service? Does it stay up and keep logging? Should the error be recorded in the Event Viewer? Does the service crash?
What happens on a long run? How do you know for sure that everything is running as required and the service is not stuck?
How to handle high memory consumption, out-of-memory conditions, or other errors that never get written to the log?
Handling users - what happens if the log is created under user A and the service is then changed to run as user B? Is the log rewritten, or does it continue from the same point?
How to handle times - is the service brought up automatically?
Thank you.

For error handling, the best I can recommend is taking advantage of try/catch blocks. This way you ensure that you handle the cases where something unexpected happens, and you can either try to correct it or bring the service down cleanly. Keep in mind that exceptions are not propagated outside the scope of a thread, so you need to handle them in each thread.
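A minimal sketch of that pattern (the class, field, and method names here are illustrative, not from the question):

using System;
using System.Diagnostics;
using System.ServiceProcess;
using System.Threading;

public class MyService : ServiceBase
{
    private Thread _worker;
    private volatile bool _stopRequested;

    protected override void OnStart(string[] args)
    {
        _worker = new Thread(WorkerLoop) { IsBackground = true };
        _worker.Start();
    }

    protected override void OnStop()
    {
        _stopRequested = true;
        _worker?.Join(TimeSpan.FromSeconds(10));
    }

    private void WorkerLoop()
    {
        // Exceptions do not cross thread boundaries, so each thread needs its own handler.
        try
        {
            while (!_stopRequested)
            {
                // DoUnitOfWork();  // hypothetical: the service's real work goes here
                Thread.Sleep(1000);
            }
        }
        catch (Exception ex)
        {
            // Record the failure, then bring the service down cleanly instead of dying silently.
            // (The "MyService" event source must already be registered, e.g. by the installer.)
            EventLog.WriteEntry("MyService", ex.ToString(), EventLogEntryType.Error);
            Stop();
        }
    }
}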
To be able to tell whether the service is doing fine, you can periodically log what the service is doing to the Event Log. If you do proper try/catch for each thread, it should go smoothly. In C# you can use log4net with the EventLogAppender to log crucial/error info in the Event Log.
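For example, a hedged sketch of wiring up log4net's EventLogAppender in code rather than in App.config (names are illustrative; the event source must exist or the process must have rights to create it):

using log4net;
using log4net.Appender;
using log4net.Config;
using log4net.Layout;

public static class ServiceLogging
{
    public static readonly ILog Log = LogManager.GetLogger("MyService");

    public static void Configure()
    {
        var appender = new EventLogAppender
        {
            ApplicationName = "MyService",   // event source shown in the Event Viewer
            Layout = new PatternLayout("%date %-5level %message%newline")
        };
        appender.ActivateOptions();
        BasicConfigurator.Configure(appender);
    }
}

// Usage, e.g. inside a worker thread's catch block:
//   ServiceLogging.Log.Error("Worker thread failed", ex);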
If your service causes high memory usage for no apparent reason, it is likely a memory leak. Microsoft has a free tool called the CLR Profiler that allows you to properly profile your service and see what exactly is causing the leak.
Unless you are dealing with user-protected files (in which case you need to consult the Log On tab of your service to give it the appropriate credentials), your service shouldn't depend on any logged-in user. Services run independently of the users on the computer.
A service can be set to start automatically, to start only on-demand, or to simply be disabled completely.
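As a sketch, with the classic .NET Framework installer classes the start mode is chosen at install time (service name and account are illustrative); an already-installed service can also be switched with "sc config MyService start= auto" (or demand / disabled):

using System.ComponentModel;
using System.Configuration.Install;
using System.ServiceProcess;

[RunInstaller(true)]
public class ProjectInstaller : Installer
{
    public ProjectInstaller()
    {
        var process = new ServiceProcessInstaller
        {
            Account = ServiceAccount.LocalService
        };
        var service = new ServiceInstaller
        {
            ServiceName = "MyService",
            StartType = ServiceStartMode.Automatic   // or Manual / Disabled
        };
        Installers.Add(process);
        Installers.Add(service);
    }
}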

Related

Is there a way other than eval to prevent my perl scripts from terminating on errors

I am coding a web API that uses a MongoDB database, interacts with node.js, and starts all kinds of processes. Anything can go wrong, and if it does I want the API to return an "unknown error" message to the caller.
The problem is that sometimes the modules I'm using crash and the whole application dies without giving the API the opportunity to return that "unknown error" message. I want to control this without having to put an eval block around every database insert, process call, etc.
Is there something like an autoeval?
If your process is crashing, something is very wrong, and you should look into why that is and fix it.
But failing that, do all your work in a child process, and have the parent monitor it and return an error response.
Though even easier than that is running your service behind a proxy server (which you may very well be doing anyway) and ensuring that the proxy server returns an appropriate API response on proxy errors.

AppFabric Hosted Workflow does not always reload after delay/unload

I have a WCF Windows Workflow (4.5) Workflow Service hosted under IIS and using AppFabric 1.1. The workflow instances are long-running (up to about a week), but much of the time is spent in Delay activities.
This seemed to work fine at first, but when running multiple instances of the workflow at the same time (2+ instances causes this), some of them just never wake up once they've unloaded from memory during the Delay step. When I look at the logs, the errors I find all look like this:
System.OperationCanceledException: The execution of InstancePersistenceCommands has been canceled because the InstanceHandle was freed.
at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
at System.ServiceModel.Activities.Dispatcher.DurableInstanceManager.WaitAndHandleStoreEventsCallback(IAsyncResult result)
Unfortunately, I'm not finding any useful information on that error message.
The SuspensionExceptionName and SuspensionReason fields in the AppFabric Persisted Instances Table show System.NullReferenceException: Object reference not set to an instance of an object. But this doesn't happen inside my workflow, only outside.
Additional Info:
I'm running the activity as a Fire & Forget (receive activity, no send)
My workflow calls into other WCF services to fetch data.
I am running this on Server 2012 R2, IIS 8 (not Azure)
Workflow Persistence is working. I can reset IIS, reboot... it's just when I run 2 instances that it has problems.
I'm definitely not hitting any kind of throttling limits. While the workflow deals with a few MB of data, this issue happens at 2+ instances.
Any idea what might be happening here?
Edit:
I realized I found more information on how the issue operates and never added it to the question. When the delay issue happens, it operates a lot like a static variable getting written by 2 threads.
Here's a visualization:
WF1 Start ---->Do Stuff--->Sleep------------*1----->Cancelled Exception at some point
------WF 2 Start---->Do Stuff------->Sleep->Wake up---------*2------>More Stuff---->End Successfully
*1 - When WF Instance 1 Should Wake up (Same time as WF 2 wakes)
*2 - When WF Instance 2 Should have woken up (Seems to be ignored)
Before anyone asks... I got rid of every static variable, method, class in my code. Nothing is static anymore.
I've been struggling with similar issues for quite a while. I use WFW4 and I find similar errors when a workflow instance is in a long delay.
I don't know what the cause of the problem is, but I have a work around that you might find helpful.
In my case, the errors I get are from Workflow Management Service and say:
Failed to invoke service management endpoint at 'net.pipe://.svc' to activate service '/Alerts/Workflows/.xamlx'. Exception: 'Access is denied.'
These errors start happening sometime between 6 and 30 hours after the instance goes into a long delay.
I have found that if I create a new instance of the workflow when the first instance is in delay and the errors are happening, then Workflow Management Service is able to resume interacting with the first sleeping instance.
So, I made a new workflow whose sole purpose is to periodically launch and then kill instances of the workflow that contains the long delay.
It actually gets a bit more complicated to make this work. I wanted this new workflow to also go to sleep between the times when it creates and kills a new instance of the first workflow. But going to sleep causes the instance of the new workflow to suffer the same problem as the first workflow. So I modified the new workflow so that it does the following:
-- delay for some rather short period, such as 30 minutes
-- create an instance of the first workflow
-- wait a minute
-- kill the just-created instance of the first workflow
-- create a new instance of this new error-preventing workflow
-- terminate
Since having done this, I no longer get the Access is Denied error from Workflow Management Service!
Hope this helps
Turns out my first answer was not correct, but I believe this answer is right, and solves the issue ChrisG is having.
My workaround did not actually work. Took a while for the problem to resurface. 29 hours to be precise - the default time it takes for an app pool to recycle.
So for me, the solution was to make my app pool not recycle. When an app pool recycles while a workflow instance is in a delay activity, the Workflow Management Service is not able to wake up the instance and throws Access is Denied errors. If you create a new instance of the workflow after the app pool has recycled, the first instance will pick up where it left off, but it sometimes still has problems, which is what I believe is happening to ChrisG.
ChrisG, looking at your visualization, is it possible that an app pool is recycling during the time WF1 is sleeping? I believe that is the cause of the exception. If you then launch a new WF instance after *2 has passed (and if an app pool recycle happened prior to *1), that will wake up both WF1 and WF2, but WF1 won't work properly (at least in my experience).
Also, this happens after iisresets and server reboots. To handle those, you need to use IIS7 which allows the web application (as well as the web site) which is hosting the xamlx files to autostart after an iisreset or server reboot. This option is not available in IIS6. See http://www.postseek.com/meta/991815402b369e71ce925cde47ac907d for details
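For reference, the relevant applicationHost.config fragment looks roughly like this (the pool name is illustrative; startMode="AlwaysRunning" needs IIS 7.5 or later, a periodicRestart time of 00:00:00 disables the time-based recycle, and the xamlx auto-start itself is configured through AppFabric's auto-start feature):

<applicationPools>
  <add name="WorkflowPool" startMode="AlwaysRunning">
    <recycling>
      <periodicRestart time="00:00:00" />
    </recycling>
  </add>
</applicationPools>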
Hope this helps!

Silly WebSphere MQ questions

I have two very basic questions on WebSphere MQ. Given that I have been administering it for the past few months, I tend to think these are silly questions.
1. Is there a way to "deactivate" a queue (for example, through a runmqsc command or through the Explorer interface)? I think not. I think what I can do is just delete it.
2. What will happen if I create a remote queue definition if the real remote queue is not in place? Will it cause any issues on the queue manager? I think not. I think all I will have are error messages in the logs.
Please let me know your thoughts.
Thanks!
1. Is there a way to "deactivate" a queue?
Yes. You can change the queue attributes like so:
ALTER QLOCAL(QUEUE_NAME) PUT(DISABLED) GET(DISABLED)
Any connected applications will receive a return code on the next API call telling them that the queue is no longer available for PUT/GET. If these are well-behaved programs they will then report the error and either end or go into a retry loop.
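For reference, here is a hedged sketch of how that looks from a .NET application using the IBM MQ classes (object names are illustrative; a put against a PUT(DISABLED) queue raises reason code 2051, MQRC_PUT_INHIBITED, and a get against GET(DISABLED) raises 2016, MQRC_GET_INHIBITED):

using System;
using IBM.WMQ;  // IBM MQ classes for .NET (amqmdnet)

class PutInhibitedDemo
{
    static void Main()
    {
        var qmgr = new MQQueueManager("QM1");
        var queue = qmgr.AccessQueue("QUEUE_NAME",
                                     MQC.MQOO_OUTPUT | MQC.MQOO_FAIL_IF_QUIESCING);
        var msg = new MQMessage();
        msg.WriteString("hello");
        try
        {
            queue.Put(msg);
        }
        catch (MQException mqe) when (mqe.ReasonCode == MQC.MQRC_PUT_INHIBITED)
        {
            // A well-behaved application reports this and either ends or goes into a retry loop.
            Console.Error.WriteLine("Queue is put-inhibited: " + mqe.Message);
        }
    }
}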
2. What will happen if I create a remote queue definition if the real remote queue is not in place?
The QRemote definition will resolve to a transmit queue. If the message can successfully be placed there your application will receive a return code of zero. (Any unsuccessful PUT will be due to hitting MAXDEPTH or other local problem not connected to the fact that the remote definition does not exist.)
The problem will be visible when the channel tries to deliver the message. If the remote QMgr has a Dead Letter Queue, the message will go there. If not, it will be backed out onto the local XMitQ and the channel will stop.
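To make the resolution concrete, a hedged MQSC sketch (all object names are illustrative):

* The remote definition only needs the local transmit queue to exist;
* RNAME is not checked until the channel tries to deliver on the remote QMgr.
DEFINE QREMOTE(ORDERS.REMOTE) RNAME(ORDERS.LOCAL) RQMNAME(QM2) XMITQ(QM2.XMITQ)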

Windows Services troubleshooting, recovery setting?

Right now I have a service application on Windows Server 2003 for inputting data from devices into a database.
Sometimes the service fails due to a data error or anything else (database connection problem, internet connection down, etc.) and I have to restart it. Right now the solution I provide for this problem is a simple batch script using the NET START/STOP commands, scheduled to run every hour.
I then took a look at the Recovery tab in the service properties; there is an option to restart the service. What I want to know is how to test it. For example, how does Windows know the service has failed? And most importantly, how do I know the service was successfully restarted when a failure occurs (based on the recovery settings)?
PS: I don't have access to the code.
Thanks
The service console's auto restart kicks in when a service crashes from an unhandled exception. (Some part of your code throws an exception, but nothing catches it, and it bubbles all the way up and out of the main function.)
If you have control over the code, it might be better to put in some try/catch blocks around the areas that tend to cause problems and handle errors more gracefully. You could also put a try/catch around the main entry point of the application, to catch and try to handle any unhandled exceptions from the code.
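A hedged sketch of that second suggestion (MyService stands for your ServiceBase-derived class; this only catches what propagates out of the entry point, not exceptions swallowed on other threads):

using System;
using System.Diagnostics;
using System.ServiceProcess;

static class Program
{
    static void Main()
    {
        try
        {
            ServiceBase.Run(new MyService());
        }
        catch (Exception ex)
        {
            // Last-chance handler: record why the process died, so the Event Log
            // explains what triggered the Recovery action.
            EventLog.WriteEntry("MyService", ex.ToString(), EventLogEntryType.Error);
            throw;  // rethrow so Windows still sees the failure and applies recovery
        }
    }
}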
If you can't control the code, you can test the automatic service recovery by forcing one of these errors to occur. If your service crashes in the event of a connection problem, you can force this by unplugging the network cable on the computer.
The easiest way to test the recovery options is to kill your service's process from the task manager. Windows will detect it and run the First Failure recovery option. Subsequent kills will test the Second Failure and Subsequent Failure options. The Event Log will note the exit and the actions taken.
Depending on your environment and your service this may or may not be a viable option for you as you are killing the service.
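If you prefer to script that test and the recovery settings instead of using the GUI, something like the following should work (service and executable names are illustrative):

rem Simulate a crash so the Recovery actions fire (the SCM treats a killed process as a failure):
taskkill /F /IM MyService.exe

rem The same recovery actions can also be set from the command line:
sc failure MyService reset= 86400 actions= restart/60000/restart/60000/restart/60000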
You can restore the machine back to an earlier point in time. Restoring doesn't change your personal files, but it might remove recently installed apps and drivers.
1. Swipe in from the right edge of the screen, and then tap Search. (If you're using a mouse, point to the upper-right corner of the screen, move the mouse pointer down, then click Search.)
2. Enter Control Panel in the search box, and tap or click Control Panel.
3. Enter Recovery in the Control Panel search box, and then tap or click Recovery.
4. Tap or click Open System Restore, and then follow the instructions.

How can I prevent Windows from catching my Perl exceptions?

I have this Perl software that is supposed to run 24/7. It keeps open a connection to an IMAP server, checks for new mail and then classifies new messages.
Now I have a user that is hibernating his XP laptop every once in a while. When this happens, the connection to the server fails and an exception is triggered. The calling code usually catches that exception and tries to reconnect. But in this case, it seems that Windows (or Perl?) is catching the exception and delivering it to the user via a message box.
Anyone know how I can prevent that kind of wtf? Could my code catch a "system-is-about-to-hibernate" signal?
To clear up some points you already raised:
I have no problem with users hibernating their machines. I just need to find a way to deal with that.
The Perl module in question does throw an exception. It does something like die 'foo bar'. Although the application is completely browser based and doesn't use anything like Wx or Tk, the user gets a message box titled "poll_timer". The content of that message box is exactly the contents of $@ ('foo bar' in this example).
The application is compiled into an executable using perlapp. The documentation doesn't mention anything about exception handling, though.
I think that you're dealing with an OS-level exception, not something thrown from Perl. The relevant Perl module is making a call to something in a DLL (I presume), and the exception is getting thrown. Your best bet would be to boil this down to a simple, replicable test case that triggers the exception (you might have to do a lot of hibernating and waking the machines involved for this process). Then, send this information to the module developer and ask them if they can come up with a means of catching this exception in a way that is more useful for you.
If the module developer can't or won't help, then you'll probably wind up needing to use the Perl debugger to debug into the module's code and see exactly what is going on, and see if there is a way you can change the module yourself to catch and deal with the exception.
It's difficult to offer intelligent suggestions without seeing relevant bits of code. If you're getting a dialog box with an exception message, the program is most likely using either the Tk or wxPerl GUI library, which may complicate things a bit. With that said, my guess would be that it would be pretty easy to modify the exception handling in the program by wrapping the failure point in an eval block and testing $@ after the call. If $@ contains an error message indicating connection failure, then re-establish the connection and go on your way.
Your user is not the exception but rather the rule. My laptop is hibernated between work and home. At work, it is on one DHCP network; at home, it is on another altogether. Most programs continue to work despite a confusing multiplicity of IP addresses (VMware, VPN, plain old connection via NAT router). Those that don't (AT&T Net Client, for the VPN - unused in the office, necessary at home or on the road) recognize the disconnect at hibernate time (AT&T Net Client holds up the Standby/Hibernate process until it has disconnected), and I re-establish the connection if appropriate when the machine wakes up. At airports, I use the local WiFi (more DHCP) but turn off the wireless altogether (one physical switch) before boarding the plane.
So, you need to find out how to learn that the machine is going into StandBy or Hibernation mode for your software to be usable. What I don't have, I'm sorry to say, is a recipe for what you need to do.
Some work with Google suggests that ACPI (Advanced Configuration and Power Interface) is part of the solution (Microsoft). APM (Advanced Power Management) may also be relevant.
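For what it's worth, the suspend/resume notification Windows raises looks like this from .NET (a sketch only; a Perl process would have to catch the equivalent WM_POWERBROADCAST message, e.g. via a Win32 module):

using System;
using Microsoft.Win32;

class PowerWatcher
{
    static void Main()
    {
        // PowerModeChanged fires for standby/hibernate (Suspend) and for wake-up (Resume).
        SystemEvents.PowerModeChanged += (sender, e) =>
        {
            if (e.Mode == PowerModes.Suspend)
                Console.WriteLine("About to sleep: close the IMAP connection cleanly.");
            else if (e.Mode == PowerModes.Resume)
                Console.WriteLine("Woke up: reconnect to the IMAP server.");
        };

        Console.WriteLine("Waiting for power events; press Enter to exit.");
        Console.ReadLine();
    }
}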
I've found a hack to avoid modal system dialog boxes for hard errors (e.g. "encountered an exception and needs to close"). I don't know if the same trick will work for the kind of error you're describing, but you could give it a try.
See: Avoiding the “encountered a problem and needs to close” dialog on Windows
In short, set the
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Windows\ErrorMode
registry key to the value "2".
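If you want to script that change, the equivalent command is roughly the following (it affects error-reporting behaviour machine-wide, so apply with care):

reg add "HKLM\SYSTEM\CurrentControlSet\Control\Windows" /v ErrorMode /t REG_DWORD /d 2 /f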