503 Server Unavailable - Dynamics CRM Web Service down - how to diagnose?

I provide support for a large application across multiple servers. System has been running live for 6+ months.
8th December: total system failure. iisreset across each of the servers sorted it out. Everything back to normal.
Post-failure investigation showed various processes unable to get a response from a particular server which hosts an instance of Dynamics CRM (2011 R11). Specifically, it seems the SOAP service (Organization.svc) was not responding: 503 - Server Unavailable (really it was just the web service). I suspect it died.
Having the exact time of the error, I checked the event logs on the server, but these did not have anything of use. The last error prior to the failure was a report rendering error 9 minutes before the system actually went down. Surely if the web service crashed this would be reflected in the event log?
Fast forward to today, 8th January and the system fails again. The 8th of the month again! iisreset fixes it... again!
Again, completely useless event logs showing no errors prior to the failure.
I entertained the idea of Dynamics CRM trace logging, but this is out of the question due to the performance hit.
Apart from the event logs, where else should I look? Are there possible external factors or causes? I'm trying to find the root cause but have run out of ideas!
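One place worth checking besides the Windows event logs is the HTTP.sys error log, since 503s served while an application pool is stopped or failing are usually recorded there with a reason phrase (AppOffline, QueueFull, Disabled, etc.). A minimal sketch, assuming the default log location:

cd /d %SystemRoot%\System32\LogFiles\HTTPERR
rem look for 503s and their reason phrases around the time of the outage
findstr /i "503" httperr*.log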

While this may not address the source of your problem, maybe it can help minimize the symptoms: configure the IIS server to recycle the application pool on a schedule in your production environment, as sketched below.
http://technet.microsoft.com/en-us/library/cc753179%28v=ws.10%29.aspx
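A minimal sketch of adding a nightly recycle with appcmd, assuming the pool is called "CRMAppPool" (a placeholder; list the real pool name first and verify the syntax against your IIS version):

%windir%\system32\inetsrv\appcmd list apppool
%windir%\system32\inetsrv\appcmd set apppool /apppool.name:"CRMAppPool" /+recycling.periodicRestart.schedule.[value='03:00:00']
rem optionally disable the default 29-hour rolling recycle so only the schedule applies
%windir%\system32\inetsrv\appcmd set apppool /apppool.name:"CRMAppPool" /recycling.periodicRestart.time:00:00:00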

Related

REST API does not return answer back after more than 3600 seconds of processing

We have spent several weeks trying to fix an issue that occurs in the customer's production environment and does not occur in our test environment.
After several analyses, we have found that this error occurs only when one condition is met: processing times greater than 3600 seconds in the API.
The situation is the following:
SAP is connected to a server with Windows Server 2016 and IIS 10.0 where we have an API that is responsible for interacting with a DB used by an external system.
The process we execute sends data from SAP to the API, which then, using the data it receives from SAP and the data it obtains from the external system's DB, performs its processing and a subsequent update in the DB.
This process finishes without problems when the processing time in the API is less than 3600 seconds.
On the other hand, when the processing time is greater than 3600 seconds, the API generates the response correctly and the server tries to return the response to SAP, but it never succeeds.
Below I show an example of a server log entry when it tries to return a response after more than 3600 seconds of API processing. As you can see, a 995 error occurs: (I have censored some parts)
Any idea where the error could come from?
We have compared the IIS configurations in Production and Test. We have also reviewed the parameters of the SAP system in both environments and have not found any relevant differences either.
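For anyone else doing this comparison, the server-level limits can be dumped with appcmd on each box and diffed; a rough sketch, assuming a default IIS install path:

%windir%\system32\inetsrv\appcmd list config -section:system.applicationHost/webLimits
%windir%\system32\inetsrv\appcmd list config -section:system.applicationHost/sites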
I remain at your disposal to provide any type of additional information that may be useful for solving the problem.
UPDATE 1 - 02/09/2022
After enabling Failed Request Tracing (FRT) in IIS for 200 response codes and looking at the trace of the request that causes the error, we saw this event at the end:
ErrorCode="The I/O operation has been aborted because of either a thread exit or an application request. (0x800703e3)"
Any idea what could be causing this error?
UPDATE 2 - 02/09/2022
Comparing configurations between the customer's environment and our test environment:
There is a firewall between the SAP server and the IIS server with the default TCP idle timeout (3600 seconds). This does not happen in the Test environment because there is no firewall.
Establishing a firewall policy with a custom idle timeout for this service (7200 seconds) solves the problem.
sc-win32-status 995: the I/O operation has been aborted because of either a thread exit or an application request.
Please check the minBytesPerSecond configuration parameter in IIS. The default value is 240. From the IIS documentation:
Specifies the minimum throughput rate, in bytes, that HTTP.sys enforces when it sends a response to the client. The minBytesPerSecond attribute prevents malicious or malfunctioning software clients from using resources by holding a connection open with minimal data. If the throughput rate is lower than the minBytesPerSecond setting, the connection is terminated.
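If minBytesPerSecond does turn out to be the culprit, it can be inspected and changed with appcmd; a sketch (0 is commonly used to disable the minimum-throughput check, but verify the accepted values for your IIS version):

%windir%\system32\inetsrv\appcmd list config -section:system.applicationHost/webLimits
%windir%\system32\inetsrv\appcmd set config -section:system.applicationHost/webLimits /minBytesPerSecond:0
iisreset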

Tableau Vizql, Backgrounder and Data server down

Yesterday our extracts failed to refresh with the following message (image extract_error):
Failure: Failed 1 time. Sign in failed.
Resolution Details: Check the Data Connection page for necessary updates to an access token or embedded credentials.
I verified that all our passwords were unchanged and tested the connections, which were successful.
The tableau dashboards now give an error message saying:
HTTP 404:
Unable to connect to the server "localhost". Check that the server is running and that you have access privileges to the requested database. (image tableau_error)
Further, when I opened the Server Status page, I saw that one of our two VizQL, Backgrounder and Data Server processes was down. We have two of each, and only one of each is active for all three of them. (image server_status)
So, I decided to remote desktop into the server and run the tabadmin status -v command and strangely it is showing that all processes are running. (image tabadmin_status)
Finally, I opened a case through the Tableau Customer Portal to let them know about this issue (they asked me to send them the log.zip file), but in the meantime I am trying to troubleshoot it myself. Any help would be really appreciated.
After trying a lot of things, one approach seemed to work.
Stopped the tableau server
Configured it to run 1 Vizql server process instead of 2
Started the server again
Finally, it worked. The status page now shows all the processes are active.
Hopefully, this helps someone who is facing a similar problem.
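For reference, on a tabadmin-era (pre-TSM) install the same change can be scripted; a rough sketch, assuming the primary node's VizQL count is controlled by worker0.vizqlserver.procs (verify the exact key and commands against your Server version's documentation):

tabadmin stop
tabadmin set worker0.vizqlserver.procs 1
tabadmin configure
tabadmin start
tabadmin status -v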
This may be caused by a firewall issue. Since tabadmin status -v returned all processes as "running", the cluster is healthy and this is a false alert. The firewall rules could be allowing just the first port and not the entire range (see https://onlinehelp.tableau.com/current/server/en-us/ports.htm) to respond to requests from the application server to build that fancy table with the green and red boxes.
The firewall can be reverted/altered behind the scenes for a number of reasons, usually windows updates or regular group policy synchronization.
Try disabling the windows firewall (https://www.faqforge.com/windows-server-2016/turn-off-firewall-windows-server-2016/), or add an inbound rule allowing access to all ports if your org policy doesn't allow you to actually turn it off. (Follow the steps here, except use "All Local Ports" instead of "Specific Local Ports" https://www.parallels.com/blogs/ras/configuring-windows-server-firewall-for-parallels-ras/)
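If you go the inbound-rule route instead of disabling the firewall, a minimal sketch with netsh (the 8000-9000 range is only an example; confirm the real range for your version from the ports documentation linked above):

netsh advfirewall firewall add rule name="Tableau Server process ports" dir=in action=allow protocol=TCP localport=8000-9000
rem for a quick test only, the firewall can be switched off temporarily
netsh advfirewall set allprofiles state off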
I had a similar problem and followed steps similar to those Sravee mentioned above to bring all the processes back to active.
Stopped the server
Changed the VizQL Server configuration from 2 to 1
Started the server
Entered the license key (otherwise the server status page shows an unlicensed error)
Note: this does not bring the site back yet; this step is just for 'tricking' the VizQL Server
Stopped the server again
Changed the VizQL configuration from 1 back to 2
Started the server
Entered the license key
These steps did bring the server back to active for us. Posting in case it helps anyone who faces the same problem. Thank you so much.

VS 2013 Web Deployment Failing - Socket Error 10054

I posted on Friday regarding this issue and received no response; however, since then there have been some developments which might affect whether people have an answer to my question.
I'm trying to deploy an MVC website to Azure, and in this particular project the web deploy receives a Socket Error 10054 and gives up after 10 attempts saying it was Unable to write data to the transport connection. It makes varying progress in between the socket failures but never completes within the 10 attempts.
I have since had a play around with other projects which are deployed to different url locations within this same Azure Account and they deploy fine! I think this means the problem is not on my end, i.e. port 8172 is open and deployment can be achieved with my current local settings.
What are the problems that can cause this socket error 10054? I saw somewhere that I should enable the "Allow untrusted certificate" option when deploying but I can't find that option within VS 2013.
Any ideas are welcome please,
This issue is driving me mad, it seems there's a real mixed bag of solution ideas which have worked for others but not me.
JK
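For what it's worth, the "Allow untrusted certificate" switch is, as far as I'm aware, not a VS 2013 dialog option but an MSBuild/Web Deploy publish property; a hedged sketch of passing it from the command line (project, profile and credential names are placeholders, and the same property can also be added to the .pubxml file):

msbuild MyMvcSite.csproj /p:DeployOnBuild=true /p:PublishProfile=MyAzureProfile /p:AllowUntrustedCertificate=true /p:UserName=%DEPLOY_USER% /p:Password=%DEPLOY_PASS%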

SQL Server Reporting Services (MSSQLSERVER) is screwed up

I'm trying to investigate why my SQL Server Reporting Services (MSSQLSERVER) service sometimes gets into a bad state and stops responding, producing an error in my SSRS reports. When users run their reports they get an ambiguous and confusing error message like: Server Error in ‘/Reports’ application.
The problem could only be resolved by stopping and restarting the service. It has only happened a few times, but I don't want it to happen again.
The stability of reporting is the first priority for our team. I want to know the reason, or at least how we can prevent this error in the future.
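While the root cause is being investigated, two stopgaps may help: SSRS has its own recycle interval (the RecycleTime setting in rsreportserver.config, 720 minutes by default), and the service can be bounced off-hours with a scheduled task. A rough sketch, assuming a default native-mode instance whose Windows service name is ReportServer:

net stop ReportServer && net start ReportServer
schtasks /create /tn "Recycle SSRS" /sc daily /st 03:00 /ru SYSTEM /tr "cmd /c net stop ReportServer && net start ReportServer"

The SSRS trace logs under the Reporting Services\LogFiles folder are also worth checking around the time of each failure.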

Windows Service "Starting"

I have a critical windows service that I need for my web application.
Unfortunately, the windows service does not start properly, but remains in a status of "Starting" for about 7 minutes and 38 seconds, and then fails.
My web application works fine when the service is in the "Starting" mode.
I have a windows scheduled task that runs every minute to restart the service if necessary.
net start "my service"
Therefore there is a gap of about 22 seconds from when the service fails until it starts up again. On top of that, it takes an additional 30 seconds or so for my application (which is dependent on this service) to start working.
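For reference, a watchdog task like the one described can be created with schtasks; a minimal sketch (MyCriticalService is a placeholder, since the real service name is deliberately not given; net start is harmless if the service is already running):

schtasks /create /tn "Restart critical service" /sc minute /mo 1 /ru SYSTEM /tr "net start MyCriticalService"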
I have intentionally not named the errant service. I did open a separate question https://stackoverflow.com/questions/8470975/oracle-oc4j-service-keeps-stopping whose aim was to actually solve the problem.
In this question, I am not trying to solve the problem, but rather find a workaround to try and keep this service in a status of "Starting" the whole time.
What is infuriating is that, until I restarted the server today, my workaround of restarting the service every 3 minutes actually worked, with no application downtime whatsoever.
Does anybody have any suggestions? I did try changing the ServicesPipeTimeout registry value to 86400000 (24 hours!) in a bid to keep the service in the status of "Starting" for longer.
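For reference, that registry change is a DWORD value under the Control key and only takes effect after a reboot; a sketch:

reg add "HKLM\SYSTEM\CurrentControlSet\Control" /v ServicesPipeTimeout /t REG_DWORD /d 86400000 /f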
I have found a possible solution to my problem that I am very uneasy about...
I downloaded WinDbg from http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=8279
I opened WinDbg and did Attach to Process, and selected my service.
As long as WinDbg is open, it seems to "hold" the process and prevent it from stopping.
How long it will continue to do so remains to be seen, but it has held for over half an hour now (whereas before, the service stopped after 8 minutes).
If you have the timeout set to 24 hours and the service does not start or stay in 'Starting' mode, then it must be either crashing or closing itself down.
If you want your service to restart immediately after it crashes, then, in the properties of your service, select the 'Recovery' tab. You should be able to set the service to restart on the first, second and subsequent failures, and to restart after 0 minutes.
Note: this will not work if Windows thinks that the service is shutting down properly.
It should go without saying that this is a last resort only if you can't get whoever wrote the service to fix the problems.
Try specifying 'Restart the Service' for all three sections on the Recovery tab, but that will only work if the service is ending abnormally.
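The same recovery settings can be scripted; a sketch with sc, assuming a placeholder service name (the failureflag call corresponds to the "Enable actions for stops with errors" checkbox, so recovery also fires when the service exits with an error rather than crashing):

sc failure MyCriticalService reset= 86400 actions= restart/0/restart/0/restart/0
sc failureflag MyCriticalService 1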
Our company faced a similar problem and we developed Service Protector, a commercial application that can babysit a service and keep it running 24/7. It may work in your situation too.