Service Fabric cheap cluster options - azure-service-fabric

What is the cheapest possible vm sku you can run a service fabric cluster on in Azure?
I just tried to create a 1 node cluster using DS1_V2 and got a warning "Not enough disk space". Then I tried DS2_V2 Promo and the warning goes away.
It costs 142.85 USD per month, and you need 5 of them, so that will be a total cost of 714.25 USD a month plus usage.
Is the minimum cost for a Service Fabric cluster really around 1,000 USD a month?
What are the minimum requirements for running it on premises?
Is it possible to deploy a single virtual machine in Azure, install Service Fabric on it, and deploy to that? (I know that won't scale, be fault tolerant, etc.)

For a production environment, you are correct: you will need at least five D-class machines for a Service Fabric cluster.
For a QA environment you can set up a 3 node cluster with a Bronze durability level which should bring down the costs a bit.
For a development environment, you could use the Service Fabric Local Cluster Manager, which allows you to emulate a 1-node or a 5-node environment on your local machine, and more recently there is also an option in Azure to create and run a 1-node cluster - see below.
As for capacity planning, you can find some good guidelines in the official docs.
For production workloads
The recommended VM SKU is Standard D3 or Standard D3_V2 or equivalent with a minimum of 14 GB of local SSD.
The minimum supported use VM SKU is Standard D1 or Standard D1_V2 or equivalent with a minimum of 14 GB of local SSD.
Partial core VM SKUs like Standard A0 are not supported for production workloads.
Standard A1 SKU is specifically not supported for production workloads for performance reasons.
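
To put rough numbers on the options above, here is a minimal sketch. The 142.85 USD figure is the DS2_v2 Promo price quoted in the question; the 1-node price is a hypothetical placeholder, not actual Azure pricing.

```python
# Rough monthly-cost comparison of the cluster options discussed above.
# All per-VM prices are illustrative assumptions, not actual Azure pricing.
CLUSTER_OPTIONS = {
    # name: (node count, assumed USD per VM per month)
    "production, 5 nodes, Silver durability": (5, 142.85),  # DS2_v2 Promo figure from the question
    "QA, 3 nodes, Bronze durability": (3, 142.85),          # assumes the same SKU
    "dev, 1 node (not fault tolerant)": (1, 70.00),         # hypothetical smaller SKU price
}

for name, (nodes, price_per_vm) in CLUSTER_OPTIONS.items():
    total = nodes * price_per_vm
    print(f"{name}: {nodes} x {price_per_vm:.2f} = {total:.2f} USD/month (plus usage)")
```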

Related

Improved deployment strategies for high memory Azure App Service

I have a Flask API running in an Azure App Service. The API loads quite a lot of data on startup and uses about 60-70% of memory on an 8 GB plan (P1v3). I'm planning to scale the App Service plan to 3 - 5 instances depending on traffic.
Now I also want to release new versions of the API without downtime, but having a staging slot requires me to scale the plan to 16 GB in order to run two versions of the API simultaneously before swapping.
This is just a very inefficient use of resources as our API then runs at around 30% memory for double the cost, so I'm looking for solutions in order to optimize our approach.
I've tried to manually scale up from 8 to 16 GB on release, but this takes down the API even when we have multiple instances and "Always On" enabled.
Does App Service support deploying one instance at a time (rolling deployment), or other deployment strategies that don't require us to scale our App Service plan to 16 GB?
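
A quick back-of-the-envelope check of the memory math behind this, assuming 65% as the midpoint of the stated 60-70% usage:

```python
# Back-of-the-envelope memory math for the slot-swap scenario above.
# The 65% utilization is an assumed midpoint of the stated 60-70% range.
plan_gb = 8
api_gb = 0.65 * plan_gb        # ~5.2 GB per running version of the API

two_versions_gb = 2 * api_gb   # production slot + staging slot before the swap
print(f"Two versions need ~{two_versions_gb:.1f} GB, exceeding the {plan_gb} GB plan")

upgraded_plan_gb = 16
print(f"After the swap, one version uses ~{api_gb / upgraded_plan_gb:.0%} of the {upgraded_plan_gb} GB plan")
```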

Rightsizing Kubernetes Nodes | How much cost do we save when we switch from VMs to containers?

We are running 4 different micro-services on 4 different EC2 auto scaling groups:
service-1 - vcpu:4, RAM:32 GB, VM count:8
service-2 - vcpu:4, RAM:32 GB, VM count:8
service-3 - vcpu:4, RAM:32 GB, VM count:8
service-4 - vcpu:4, RAM:32 GB, VM count:16
We are planning to migrate this workload to EKS (in containers).
We need help in deciding the right node configuration (in EKS) to start with.
We can start with a small machine (vCPU: 4, RAM: 32 GB), but we will not get any cost savings, as each container will need a separate VM.
We can use a large machine (vCPU: 16, RAM: 128 GB), but when these machines scale out, each newly added machine will also be large and can therefore be underutilized.
Or we can go with a medium machine (vCPU: 8, RAM: 64 GB).
Besides this recommendation, we are also evaluating the cost savings of moving to containers.
As per our understanding, every VM comes with the following overhead:
Overhead of running hypervisor/virtualisation
Overhead of running separate Operating system
Note: One large VM vs many small VMs cost the same on public cloud as cost is based on number of vCPUs + RAM.
Hypervisor/virtualization cost is only valid if we are running on-prem, so no need to consider this.
On the second point, how many resources does a typical Linux machine need to run the OS? If we provision a small machine (vCPU: 2, RAM: 4 GB), the approximate CPU usage is 0.2% and memory consumption (outside user space) is about 500 MB.
So running large instances (5 instances compared to 40 small instances) saves this per-OS CPU and RAM overhead 35 times over, which does not seem significant.
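
As a worked version of that estimate, using the approximate per-OS figures stated above (these are assumptions, not measurements):

```python
# Rough estimate of the OS overhead saved by consolidating 40 small VMs into 5 large ones.
# Per-OS overhead figures are the approximations from the question, not measurements.
os_ram_gb = 0.5          # ~500 MB of RAM per OS outside user space
os_cpu_fraction = 0.002  # ~0.2% CPU per OS (assumed relative to one core)

small_vm_count = 40
large_vm_count = 5
saved_os_instances = small_vm_count - large_vm_count  # 35 fewer OS copies

ram_saved_gb = saved_os_instances * os_ram_gb
cpu_saved = saved_os_instances * os_cpu_fraction

total_fleet_ram_gb = small_vm_count * 32  # the 4 vCPU / 32 GB VMs from the question
print(f"RAM saved: {ram_saved_gb:.1f} GB of {total_fleet_ram_gb} GB total (~{ram_saved_gb / total_fleet_ram_gb:.1%})")
print(f"CPU saved: roughly {cpu_saved:.2f} of a core-equivalent")
```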
You are unlikely to see any cost savings in resources when you move to containers in EKS from applications running directly on VMs.
A Linux container is just an isolated Linux process with specified resource limits; it is no different from a normal process when it comes to resource consumption. EKS still uses virtual machines to provide compute to the cluster, so you will still be running processes on a VM regardless of containerization, and from a resource point of view it will be equal. (See this answer for a more detailed comparison of VMs and containers.)
When you add Kubernetes to the mix you are actually adding more overhead compared to running directly on VMs. The Kubernetes control plane runs on a set of dedicated VMs. In EKS those are fully managed as a PaaS, but Amazon charges a small hourly fee for each cluster.
In addition to the dedicated control plane nodes, each worker node in the cluster needs a set of programs (system pods) to function properly (kube-proxy, kubelet, etc.), and you may also define containers that must run on each node (daemon sets), such as log collectors and security agents.
When it comes to sizing the nodes you need to find a balance between scaling and cost optimization.
The larger the worker node is, the smaller the relative overhead of system pods and daemon sets becomes. In theory, a worker node large enough to accommodate all your containers would maximize the share of resources consumed by your applications relative to the supporting software on the node.
The smaller the worker nodes are the smaller the horizontal scaling steps can be, which is likely to reduce waste when scaling. It also provides better resilience as a node failure will impact fewer containers.
I tend to prefer nodes that are small so that scaling can be handled efficiently. They should be slightly larger than what is required by the largest containers, so that system pods and daemon sets can also fit.
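
To make that trade-off concrete, here is a small sketch comparing the relative per-node overhead for a few node sizes. The reservation figure for system pods and daemon sets is a hypothetical example, not an EKS default.

```python
# Sketch: relative overhead of per-node agents (kube-proxy, kubelet, daemon sets)
# for different worker-node sizes. The reservation figures are hypothetical examples.
PER_NODE_OVERHEAD = {"cpu": 0.3, "ram_gb": 1.0}  # assumed total for system pods + daemon sets

NODE_SIZES = {
    "small  (4 vCPU / 32 GB)": {"cpu": 4, "ram_gb": 32},
    "medium (8 vCPU / 64 GB)": {"cpu": 8, "ram_gb": 64},
    "large  (16 vCPU / 128 GB)": {"cpu": 16, "ram_gb": 128},
}

for name, size in NODE_SIZES.items():
    cpu_overhead = PER_NODE_OVERHEAD["cpu"] / size["cpu"]
    ram_overhead = PER_NODE_OVERHEAD["ram_gb"] / size["ram_gb"]
    print(f"{name}: {cpu_overhead:.1%} of CPU and {ram_overhead:.1%} of RAM go to per-node agents")
```

The smaller the node, the larger the share lost to per-node agents, but the finer-grained the scaling steps.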

Is the Service Fabric service running on the primary node a stateful service?

Is the Service Fabric service running on the primary node itself a stateful service?
The reason for asking is capacity planning. The recommended SKU is D3_v2, but is a D2_v2 more than enough for the primary node type with Windows OS?
We are using Standard_B2ms for our system nodes. You could probably go lower on the system nodes; they are not under heavy load anyway.
The Service Fabric system services that run on the primary node type are indeed stateful services. If you are doing capacity planning, I recommend you visit the following document which has our production readiness checklist. https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-production-readiness-checklist
For production workloads:
It's recommended to dedicate your cluster's primary NodeType to system services, and use placement constraints to deploy your application to secondary NodeTypes.
The recommended VM SKU is Standard D3 or Standard D3_V2 or equivalent with a minimum of 14 GB of local SSD.
The minimum supported use VM SKU is Standard D1 or Standard D1_V2 or equivalent with a minimum of 14 GB of local SSD.
The 14 GB local SSD is a minimum requirement. Our recommendation is a minimum of 50 GB. For your workloads, especially when running Windows containers, larger disks are required.
Partial core VM SKUs like Standard A0 are not supported for production workloads.
Standard A1 SKU is not supported for production workloads for performance reasons.

Adding Desired State Configuration extension to a service fabric VMSS

We recently needed to add the Microsoft.Powershell.DSC extension to our VMSS that contain our service fabric cluster. We redeployed the cluster using our ARM template, with the addition of the new extension for DSC. During the deployment we observed that as many as 4 out of 5 scale set instances were in the restarting stage at a given time. The services in our cluster were also unresponsive during that time. The outage was only a few minutes long, but this seems like something that should not happen.
Reliability Level: Silver
Durability Level: Bronze
This is caused by the selected durability level 'bronze'.
The durability tier is used to indicate to the system the privileges that your VMs have with the underlying Azure infrastructure. In the primary node type, this privilege allows Service Fabric to pause any VM level infrastructure request (such as a VM reboot, VM reimage, or VM migration) that impact the quorum requirements for the system services and your stateful services. In the non-primary node types, this privilege allows Service Fabric to pause any VM level infrastructure requests like VM reboot, VM reimage, VM migration etc., that impact the quorum requirements for your stateful services running in it.
Bronze - No privileges. This is the default and is recommended if you are only running stateless workloads in your cluster.
I suggest reading this article. It's an MS employee blog. I'll copy out the relevant part:
If you don’t mind all your VMs being rebooted at the same time, you can set upgradePolicy to “Automatic”. Otherwise set it to “Manual” and take care of applying changes to the scale set model to individual VMs yourself. It is fairly easy to script rolling out the update to VMs while maintaining application uptime. See https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set for more details.
If your scale set is in a Service Fabric cluster, certain updates like changing OS version are blocked (currently – that will change in future), and it is recommended that upgradePolicy be set to “Automatic”, as Service Fabric takes care of safely applying model changes (like updated extension settings) while maintaining availability.
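
For reference, the upgradePolicy setting sits on the Microsoft.Compute/virtualMachineScaleSets resource in the ARM template. A minimal sketch of that fragment, written as a Python dict purely for illustration (the surrounding properties are omitted):

```python
# Sketch of the relevant ARM template fragment for the scale set resource,
# expressed as a Python dict for illustration only.
vmss_fragment = {
    "type": "Microsoft.Compute/virtualMachineScaleSets",
    "properties": {
        # "Automatic": the platform (and Service Fabric, for SF clusters) rolls out
        # model changes such as updated extension settings.
        # "Manual": you apply the updated model to individual VMs yourself.
        "upgradePolicy": {"mode": "Automatic"},
    },
}

print(vmss_fragment["properties"]["upgradePolicy"]["mode"])
```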

Service Fabric with Microsoft Nano nodes

Is it possible to build a standalone Service Fabric cluster where Microsoft Nano is used as nodes?
I presume you mean an on-premises deployment. If so, then the answer is no, as the minimum requirements won't be met.
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-standalone-deployment-preparation
Prerequisites for each machine that you want to add to the cluster:
A minimum of 16 GB of RAM is recommended.
A minimum of 40 GB of available disk space is recommended.
A 4 core or greater CPU is recommended.
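
As a rough illustration, a small script along these lines could sanity-check a machine against those recommendations (thresholds taken from the list above; psutil is an assumed third-party dependency):

```python
# Rough pre-flight check of a machine against the standalone-cluster recommendations above.
# Thresholds come from the documentation excerpt; psutil is a third-party dependency.
import os
import shutil

import psutil

MIN_RAM_GB = 16
MIN_DISK_GB = 40
MIN_CORES = 4

ram_gb = psutil.virtual_memory().total / 1024**3
disk_gb = shutil.disk_usage("/").free / 1024**3  # check the drive you intend to use
cores = os.cpu_count() or 0

print(f"RAM:   {ram_gb:.1f} GB (recommended >= {MIN_RAM_GB} GB) -> {'ok' if ram_gb >= MIN_RAM_GB else 'below recommendation'}")
print(f"Disk:  {disk_gb:.1f} GB free (recommended >= {MIN_DISK_GB} GB) -> {'ok' if disk_gb >= MIN_DISK_GB else 'below recommendation'}")
print(f"Cores: {cores} (recommended >= {MIN_CORES}) -> {'ok' if cores >= MIN_CORES else 'below recommendation'}")
```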