Scale and fix an Azure Service Fabric cluster manually

After a rather hard time fixing our production server cluster manually, I want to write a few things down so that others, or my future self, may find this information useful.
Situation
Our production system is a 5-node “Silver” cluster. It currently runs 17 services with 85 replicas, and serves 5 million HTTP requests and about 25k users per day. The system is fairly busy all day, because we have users online during the night as well.
This system reported failed Service Fabric cluster upgrades because one node did not update (timeout) for some reason. Service Fabric cluster upgrades are initiated by Microsoft to update the cluster runtime on the machines.
Because we cannot RDP into these nodes for security reasons (we are working on this right now), we decided, together with Microsoft Support, to simply replace the “broken” node with a fresh one. That is where our manual Service Fabric scaling adventure started.
Learning: Get a healthy cluster after each operation
After each operation on your cluster, wait until the cluster is healthy again before starting the next one. Otherwise your cluster can get into serious trouble as problems start to pile up.
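A small gate like the following can make this rule hard to skip in scripts. It is only a sketch: it assumes a health document shaped like the one the Service Fabric REST API returns for cluster health (an `AggregatedHealthState` field plus a `NodeHealthStates` list). The function name and the sample data are made up for illustration.

```python
def blocking_health_issues(cluster_health: dict) -> list:
    """Return reasons why the next cluster operation should wait.

    An empty list means the cluster reports healthy.
    """
    issues = []
    state = cluster_health.get("AggregatedHealthState", "Unknown")
    if state != "Ok":
        issues.append(f"cluster is {state}")
    for node in cluster_health.get("NodeHealthStates", []):
        node_state = node.get("AggregatedHealthState", "Unknown")
        if node_state != "Ok":
            issues.append(f"node {node.get('Name', '?')} is {node_state}")
    return issues


# Hypothetical sample: one node is still in trouble after a scale-out.
sample = {
    "AggregatedHealthState": "Error",
    "NodeHealthStates": [
        {"Name": "_nt1_0", "AggregatedHealthState": "Ok"},
        {"Name": "_nt1_3", "AggregatedHealthState": "Error"},
    ],
}
for issue in blocking_health_issues(sample):
    print("wait:", issue)
```

Only start the next operation when the list comes back empty.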
Learning: How to scale-out/-in manually
Every Service Fabric cluster has an associated Azure VM scale set underneath. The scale set consists of a set of identical VMs. These VMs have the Service Fabric extension installed, and that is how the Service Fabric Cluster Manager connects to and controls the VMs. Each VM becomes a “node” in the cluster. It is also recommended to have multiple scale sets per cluster: at least one for the so-called “seed” nodes and another for the application services. We don’t do that (yet) for reasons of simplicity and budget.
If you want additional nodes in your Service Fabric cluster, you can simply change the number of VMs in your scale set and wait some time. It may take up to an hour before the VMs are provisioned and become full nodes with your software running on them. If you no longer need the additional power (nodes), you can simply scale the VM scale set back in accordingly. You find these scaling options in the Azure Portal on your VM scale set under “Scaling”.
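The same change can be scripted instead of clicked. A minimal sketch, assuming the Azure CLI is installed and logged in. The resource group and scale-set names are placeholders, and the helper only builds the `az vmss scale` command so you can review it before running it.

```python
import shlex


def vmss_scale_command(resource_group: str, scale_set: str, capacity: int) -> list:
    """Build the Azure CLI call that sets the VM count of a scale set."""
    if capacity < 0:
        raise ValueError("capacity must not be negative")
    return [
        "az", "vmss", "scale",
        "--resource-group", resource_group,
        "--name", scale_set,
        "--new-capacity", str(capacity),
    ]


# Placeholder names; replace them with your own resources.
cmd = vmss_scale_command("my-rg", "my-scale-set", 6)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps the arguments safely separated if you later pass it to `subprocess.run`.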
Here is more detailed documentation from Microsoft.
Learning: Never ever go below 3 nodes
No matter what the reliability tier is (Bronze, Silver, Gold, Platinum), if you go below 3 nodes your Service Fabric cluster can break completely and become inaccessible. We faced this issue several times on our dev/test environments. Luckily we have scripted our infrastructure, so we deleted the scale set and the Service Fabric cluster and re-deployed them from scratch. There was no other way out, because not even the Service Fabric Explorer (SFX) of those clusters was accessible anymore.
Learning: Seed nodes promotion
Seed nodes are nodes that have some Service Fabric system components running on them. Depending on the reliability tier, Service Fabric needs 1 (None), 3 (Bronze), 5 (Silver), 7 (Gold) or 9 (Platinum) seed nodes. If a seed node is unhealthy for about 2 hours, Service Fabric will automatically promote another node to a seed node. So if you run into this situation, you just have to wait long enough.
Let’s say you did it the hard way and deleted a VM that was a seed node. Service Fabric will list this node as a seed node with status “down”. You cannot remove this node yet because it is a seed node. If you have not done so already, spin up a fresh VM by scaling out the scale set. Then wait up to two hours or so. SF will promote the healthy non-seed node to a seed node. Once this has happened, you can use “Disable (remove data)” on the down node (see below how to get this option) and it will be removed from the node list.
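The numbers above can be turned into a quick pre-check before scaling in. This is a sketch only: the seed-node counts are the ones listed in this post, the floor of 3 nodes is the rule from the previous section, and treating the seed-node count as the minimum node count is an assumption of this example. The function names are hypothetical.

```python
# Seed nodes required per reliability tier, as listed above.
SEED_NODES_BY_TIER = {"None": 1, "Bronze": 3, "Silver": 5, "Gold": 7, "Platinum": 9}

# Never go below this many nodes, whatever the tier says.
MIN_NODES = 3


def min_safe_node_count(tier: str) -> int:
    """Smallest node count that covers the seed nodes and the 3-node floor."""
    return max(SEED_NODES_BY_TIER[tier], MIN_NODES)


def can_scale_in(tier: str, current_nodes: int, nodes_to_remove: int) -> bool:
    """True if the cluster still has enough nodes after the removal."""
    return current_nodes - nodes_to_remove >= min_safe_node_count(tier)


# A Silver cluster with one spare node can lose one node, without a spare it cannot.
print(can_scale_in("Silver", current_nodes=6, nodes_to_remove=1))
print(can_scale_in("Silver", current_nodes=5, nodes_to_remove=1))
```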
Learning: Node management in SFX with “Advanced Mode”
The Azure Portal UI of Service Fabric is fairly basic. If you want more insight or a UI for advanced operations, you need to use the Service Fabric Explorer (SFX). You’ll find a link to it in the Azure Portal of your Service Fabric cluster.
On the nodes you get a “…” menu which, by default, has only a few options in it. And here is the trick we were told by Microsoft Support: click on the gear icon (“Settings”) at the top right of the screen and activate the “Advanced mode” checkbox.

Now your “…” menu shows a bunch more options, especially for the nodes. For example, you can then choose “Disable (remove data)” and other disable options. This instructs SF to disable a node with the given ‘intent’. For example, “remove data” will shut down your services on that node and remove the replicas. Wait a moment and check that the node no longer runs your apps/services.
These options are helpful if you want to remove nodes for some reason. If you do this using PowerShell or the Azure CLI, you can use the –force option so you don’t have to wait for seed-node timeouts. But again: never ever go below 3 healthy seed nodes. If you need to remove one, first spin up new nodes, so that after the removal you still have at least 3 healthy seed nodes.
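Before removing a node, it is worth counting what is left. The sketch below assumes node records shaped like the Service Fabric REST node list (`Name`, `IsSeedNode`, `NodeStatus`, `HealthState`). The helper names and the sample nodes are invented for illustration.

```python
def healthy_seed_nodes(nodes: list) -> list:
    """Names of seed nodes that are up and report a healthy state."""
    return [
        n["Name"]
        for n in nodes
        if n["IsSeedNode"] and n["NodeStatus"] == "Up" and n["HealthState"] == "Ok"
    ]


def safe_to_remove(nodes: list, node_name: str, minimum: int = 3) -> bool:
    """True if at least `minimum` healthy seed nodes remain without node_name."""
    remaining = [n for n in nodes if n["Name"] != node_name]
    return len(healthy_seed_nodes(remaining)) >= minimum


# Hypothetical cluster: four seed nodes, one of them down.
nodes = [
    {"Name": "n0", "IsSeedNode": True, "NodeStatus": "Up", "HealthState": "Ok"},
    {"Name": "n1", "IsSeedNode": True, "NodeStatus": "Up", "HealthState": "Ok"},
    {"Name": "n2", "IsSeedNode": True, "NodeStatus": "Up", "HealthState": "Ok"},
    {"Name": "n3", "IsSeedNode": True, "NodeStatus": "Down", "HealthState": "Error"},
    {"Name": "n4", "IsSeedNode": False, "NodeStatus": "Up", "HealthState": "Ok"},
]
print(safe_to_remove(nodes, "n3"))  # the broken seed node
print(safe_to_remove(nodes, "n0"))  # a healthy seed node
```

In this sample, removing the broken node leaves three healthy seed nodes, while removing a healthy one would leave only two.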
Learning: SF will not re-balance fault domains if you scale manually
When you do things like the operations above, it can happen that your cluster map is “not balanced” anymore. This means that some of the nodes share upgrade or fault domains. You can see this in the “Cluster Map” tab of SFX.
A balanced cluster-map looks like this:

Service-Fabric will not fix this if you scaled manually or if you manually removed nodes. You have to take care of it.
A fault domain cannot be changed easily because it defines where your VM physically lives. To change it, you need to get a new VM in place, e.g. by scaling out again. Once you have VMs in the right spots of your cluster map, you can remove the obsolete VMs. To do so, first disable them in SFX with the remove-data option. When the replicas are down, you can delete the VMs in the VM scale set using the Azure Portal or CLI.
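If you prefer a script over eyeballing the Cluster Map, the check can be sketched as follows. It assumes node records with `FaultDomain` and `UpgradeDomain` fields, as in the Service Fabric REST node list, and it simply reports every domain that hosts more than one node. The function name and the sample nodes are hypothetical.

```python
from collections import Counter


def shared_domains(nodes: list, key: str) -> dict:
    """Map each domain that hosts more than one node to its node count."""
    counts = Counter(n[key] for n in nodes)
    return {domain: count for domain, count in counts.items() if count > 1}


# Hypothetical 5-node cluster after a manual replace: two nodes ended up in fd:/1.
nodes = [
    {"Name": "n0", "FaultDomain": "fd:/0", "UpgradeDomain": "0"},
    {"Name": "n1", "FaultDomain": "fd:/1", "UpgradeDomain": "1"},
    {"Name": "n2", "FaultDomain": "fd:/2", "UpgradeDomain": "2"},
    {"Name": "n3", "FaultDomain": "fd:/3", "UpgradeDomain": "3"},
    {"Name": "n5", "FaultDomain": "fd:/1", "UpgradeDomain": "4"},
]
print(shared_domains(nodes, "FaultDomain"))
print(shared_domains(nodes, "UpgradeDomain"))
```

An empty result for both keys corresponds to a cluster map where no two nodes share a domain. In larger clusters some sharing is unavoidable, so compare the counts rather than expecting empty results.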
Learning: Use supported VM-sizes
When you scale up or down by selecting a new VM size, make sure you choose one that is supported by Service Fabric. Otherwise your Service Fabric cluster can explode, so that you cannot access it anymore and have to start over from scratch (re-create SF and the scale set).
The bad thing: there is no safety belt when resizing VMs. You will not get any warning, but your SF will break afterwards.
Check the Microsoft Documentation here…
Conclusion
There are several things one needs to know (or learn) to operate a production Service Fabric cluster. If you don’t know them, one wrong operation can burn down your cluster.
Therefore make sure you have a multi-node test cluster with 3 nodes or so. It is still not the same as a production cluster with 5 or more nodes (see the fault-domain limitations), but at least you can try most things on the test cluster first.