Azure Service Fabric Cluster: 5 Windows Nodes, 29 Applications
Fabric Version: 6.1.472.9494
Applications report unhealthy due to lack of disk space on the data drive.
The \SvcFab\Log\Traces folder was consuming about 80GB of the available 99GB.
This folder contained a very large number of .etl files in the following naming format: fabric_traces_6.1.472.9494_131695673562955183_2.etl.
The Trace folders on all other nodes were less than 5GB.
This issue has happened twice over the past month. Both times it was only one node.
I found this Stack Overflow post from 2016 about the same problem.
The accepted answer mentions that this was fixed in 6.1.472.9494, however I'm still experiencing this problem.
The accepted answer also contained a link to the Service-Fabric-Issues Github. This comment mentions a workaround using SF-FolderCleaner. The last comment states that version 6.1.467 fixes this issue.
Is this still a known issue? Is the best path forward to implement the SF-FolderCleaner workaround?
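As a stopgap while the root cause is unclear, the kind of cleanup the workaround performs can be sketched in a few lines. This is not the SF-FolderCleaner tool itself, only a minimal illustration of the idea: delete the oldest .etl files until the folder fits a size budget. The function name, the size budget, and the dry-run flag are all hypothetical.

```python
from pathlib import Path


def prune_etl_files(trace_dir, max_total_bytes, dry_run=True):
    """Pick the oldest .etl files for deletion until the rest fit the budget.

    Returns the list of files selected (oldest first). With dry_run=True
    nothing is deleted, so the selection can be reviewed first.
    """
    # Oldest files first, based on last-modified time.
    files = sorted(Path(trace_dir).glob("*.etl"), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if total <= max_total_bytes:
            break
        size = path.stat().st_size
        if not dry_run:
            try:
                path.unlink()
            except OSError:
                # Assumption: a file that cannot be deleted is still open
                # by the trace writer, so it is skipped rather than retried.
                continue
        total -= size
        removed.append(path)
    return removed
```

Calling it with the node's Traces folder and a budget such as `5 * 1024**3` bytes returns the files it would remove; passing `dry_run=False` deletes them. Whether it is safe to delete trace files behind Service Fabric's back is something to confirm for your version before scheduling anything like this.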
We tried to harden the GKE-optimized image (gke-1.15.11) for our cluster. We SSHed into the node instance, made the changes proposed by the CIS benchmark in /home/kubernetes/kubelet-config.yaml, and ran kube-bench to check whether all the conditions passed. Around 8 conditions failed, and these were exactly the conditions we had changed in the file. We then made the same argument changes in /etc/default/kubernetes and ran kube-bench again, and this time the conditions passed. However, when we restarted the instance, all the changes we had made in /etc/default/kubernetes were gone. Can someone tell us where we are going wrong, or whether there is another path where we should make the entries suggested by the CIS benchmark?
GKE doesn't support user-provided node images as of April 2020. The recommended option is to create your own DaemonSet that writes to the host filesystem and/or restarts host services to propagate all the required changes.
There are two parts to this question. First, what falls under the purview of the Diagnostics---MaxDiskQuotaInMB configuration? Is it everything under SvcFab/Log? Just SvcFab/Log/AppInstanceData/? Having more info on this would be nice.
Second, what is the proper course of action if the FabricDCA.exe is running but the SvcFab/Log and SvcFab/Log/AppInstanceData/ folders exceed the limits we've set on their size? My team set them to 10,000 MB, but SvcFab/Log regularly takes up 12-16 GB.
The cluster configuration on Azure recognizes the change to the MaxDiskQuotaInMB configuration but there seems to be no impact on the node itself. I've tried resetting FabricDCA.exe as well and so far it has not helped either (after several hours).
One node in our cluster had so much space taken up by logs (over our limit) that remaining storage space was reduced to 1 MB.
Posting a more complete answer since it may be helpful to other people.
Most of the things under SvcFab/Log folder should fall under the quota set by MaxDiskQuotaInMB. There are a few things that may not, but the majority of things that usually take disk space are included. Keep in mind also that the task cleaning the disk usually runs every 5 minutes so you may see usage go over the quota within this timeframe.
If FabricDCA.exe is not properly cleaning files from this folder, you may be hitting a bug in the .NET runtime where all System.Threading.Timer instances stop firing. The disk is then not cleaned, because FabricDCA relies on these timers to do so.
This is the bug on the .NET core side tracking the issue: (https://github.com/dotnet/coreclr/issues/26771). It seems to happen when the machine is running out of memory intermittently.
There is an auto-mitigation added in FabricDCA in Service Fabric 7.0.
The manual mitigation is usually to kill FabricDCA.exe process.
The process should start again and after a few minutes it will start cleaning again.
You mentioned that you already tried killing FabricDCA.exe, so the mitigation above may not work for you. In that case, take a look at the Service Fabric cluster manifest directly. It may be that the ARM template deployment appears to accept your new configuration, but the change never reaches the cluster manifest, which is the source of truth in this case.
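To check what the manifest actually contains, you can export it (for example with the Get-ServiceFabricClusterManifest cmdlet) and look up the setting. A minimal sketch of that lookup follows. It assumes the usual manifest layout of FabricSettings, Section, and Parameter elements in the Service Fabric XML namespace; the helper name and the sample document are hypothetical, so compare against your own file.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for Service Fabric manifests; verify against your file.
FABRIC_NS = {"sf": "http://schemas.microsoft.com/2011/01/fabric"}

# Made-up sample, trimmed to the one setting of interest.
SAMPLE = """<ClusterManifest xmlns="http://schemas.microsoft.com/2011/01/fabric" Name="demo" Version="1">
  <FabricSettings>
    <Section Name="Diagnostics">
      <Parameter Name="MaxDiskQuotaInMB" Value="10000" />
    </Section>
  </FabricSettings>
</ClusterManifest>"""


def read_fabric_setting(manifest_xml, section, parameter):
    """Return the Value of a FabricSettings parameter, or None if absent."""
    root = ET.fromstring(manifest_xml)
    for sec in root.iterfind(".//sf:FabricSettings/sf:Section", FABRIC_NS):
        if sec.get("Name") != section:
            continue
        for param in sec.iterfind("sf:Parameter", FABRIC_NS):
            if param.get("Name") == parameter:
                return param.get("Value")
    return None


print(read_fabric_setting(SAMPLE, "Diagnostics", "MaxDiskQuotaInMB"))
```

If the value returned for your real manifest differs from what the ARM template declares, the configuration change did not propagate to the cluster.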
Update:
There was a regression introduced as part of the auto-mitigation above, which caused the AppInstanceFolder to fill up the disk. This is fixed in SF version 7.0.466.
I operate an on-premise Azure Service Fabric cluster for testing purposes. It consists of three nodes, which are running on a single virtual machine (Windows Server 2012) with a 50 GB disk attached to it.
Further I set up continuous deployment from TFS release pipeline to the cluster. However after approx. 80 deployments, service fabric consumed all available disk space and further deployments fail.
Most of the space is taken by C:\ProgramData\SF\Data, which took around 28GB, while each code package has a size of ~130 MB. After I have unprovisioned many of the old deployments (manually via SF portal), only around 5GB were released. Many of the old files are still around in C:\ProgramData\SF\Data.
What is the best approach to improve this?
Why are the files from the old deployments still on disk after unprovisioning?
Is it possible to delete these files manually?
Is it possible to automate the deprovisioning?
In a production environment this situation should be less severe anyhow (since there is only one node per machine, and the disks are bigger). Nevertheless, this would only put off the evil day. I would feel safer avoiding this situation altogether.
Edit
It seems that SF is deleting the deployment packages with some delay. I checked the test cluster after one day, and all unprovisioned packages vanished finally.
Further I found the Unregister-ServiceFabricApplicationType Cmdlet to automate the unprovisioning process (https://msdn.microsoft.com/en-us/library/mt125885.aspx).
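One way to automate this from a release pipeline is to keep only the newest few versions of an application type and unregister the rest. The sketch below only decides which versions to drop and builds the corresponding cmdlet calls; it does not talk to the cluster. The function names and the `keep` parameter are hypothetical, and it assumes purely numeric dotted version strings such as "1.0.12".

```python
def versions_to_unregister(versions, keep=3):
    """Return all but the newest `keep` versions, oldest first.

    Assumes purely numeric dotted versions such as "1.0.12".
    """
    def key(version):
        # Compare numerically so that 1.0.10 sorts after 1.0.9.
        return tuple(int(part) for part in version.split("."))

    ordered = sorted(versions, key=key)
    return ordered[:-keep] if keep > 0 else ordered


def unregister_commands(app_type_name, versions, keep=3):
    """Build one PowerShell command line per version to unprovision."""
    return [
        f"Unregister-ServiceFabricApplicationType -ApplicationTypeName {app_type_name} "
        f"-ApplicationTypeVersion {v} -Force"
        for v in versions_to_unregister(versions, keep)
    ]
```

For example, `unregister_commands("MyAppType", ["1.0.10", "1.0.2", "1.0.9", "1.1.0"], keep=2)` yields commands for versions 1.0.2 and 1.0.9. Note that unregistering a version fails while an application instance of that version still exists, so the running version has to be among those kept.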
We have a Windows Azure Web Role on two extra-small instances that has been running for weeks without problems. This morning, we unintentionally passed some spending limit, which resulted in Windows Azure shutting down our complete service, without any prior warning!
We removed the spending cap and began to re-deploy the Web Role, with the same codebase that has been running for weeks. To our astonishment, we got the deployment error
Validation Errors: Total requested resources are too large for the specified VM size.
We upgraded the deployment to two small instances instead of the extra-small instances, whereupon deployment worked again. The web role is now back on the web.
However, we still don't understand why our deployment was suddenly too big for an extra-small instance. We didn't change one bit since the last successful deployment to extra-small instances. We then tried to reduce the deployment size by moving some files to Azure Storage, but even after shrinking the package file by more than 1 MB, deployment still failed.
The cspkg file, the deployment package, is currently at 9'359 KB. If unzipped, the complete sitesroot folder's size is 14 MB, which is way below the 19'480 KB limit for the x-small instance.
Before we lose more time with trial-and-error, here's my question: How exactly are those 19'480 KB calculated? Is it just the sitesroot folder, or the zipped package, or is it the sitesroot and approot folder together, or the whole unzipped package?
Thank you!
EDIT:
Could you verify if your local resources exceed 20 GB:
I'm trying to understand why it can take from 20-60min to deploy a small application to Azure (using the configuration/package upload method, not from within VS).
I've read through this situation and this one but I'm still a little unclear - is there a weird non-technology ritual that occurs while the instances are distributing, like somebody over at Microsoft lighting a candle or doing a dance?
As a fellow Azure user, I share your pain - deploying isn't "quick"/"painless" - and this hurts especially when you're in a development cycle and want to test dev iterations on Azure. However, in general deployments should take much less than 60 minutes - and less than 20 minutes too.
Steve Marx provided a brief overview of the steps involved in deployment:
http://blog.smarx.com/posts/what-happens-when-you-deploy-on-windows-azure
And he references a deeper level explanation at: http://channel9.msdn.com/blogs/pdc2008/es19
There's a lot that goes on behind the scenes when you deploy an application to the Azure cloud. I don't have any special insight into what's going on behind the curtain, but having worked on the VS tools to upload projects to the Azure cloud, these are my impressions as an outsider looking in:
Among other things:
Hardware must be allocated from the available pool of servers
The VHD of the core OS must be uploaded to the machine
A VM instance must be initialized and booted off that VHD image
Your application package must be copied to the VM and installed
The VM monitor must wait for your service to start up, or fail
The data center load balancer and firewall must be made aware of your application's service endpoints
Once all of that has synchronized, your app is accessible from the web.
The VHD image is probably gigabytes in size, much larger than your app upload. Even on a superfast datacenter network, it takes time to move that much stuff into the VM, unpack it, and boot from it. Also, the load balancer and firewall are probably optimized to make routing requests the highest priority. Reconfiguring the firewall and load balancer is lower priority, and has to be done without interrupting traffic flow.
Also note that all this work only has to be done for a new deployment. Updating an existing deployment rolls out much faster - 2 to 3 minutes instead of 20 to 30 minutes.
Check out this PDC10 video by Mark Russinovich. He goes into great detail on what's going on inside Azure with some insights into the (admittedly slow) deployment process.
Original link is no longer working. Here's another link to a version of the same presentation: https://channel9.msdn.com/events/Build/BUILD2011/SAC-853T