Production readiness checklist
Are your application and cluster ready to take production traffic? Running and testing your application and cluster doesn't necessarily mean they're ready to go into production. Keep your application and cluster running smoothly by going through the following checklist. We strongly recommend checking off every item. You can choose alternative solutions for a particular line item (for example, your own diagnostics framework).
Prerequisites for production
- Azure Service Fabric best practices: Application Design, Security, Networking, Capacity planning and scaling, Infrastructure as Code, and Monitoring and Diagnostics.
- Implement the Reliable Actors security configuration if you use the Actors programming model.
- For clusters with more than 20 cores or 10 nodes, create a dedicated primary node type for system services. Add placement constraints to reserve the primary node type for system services.
- Use a D2v2 or higher SKU for the primary node type. Pick a SKU with at least 50 GB of hard disk capacity.
- Production clusters must be secure. For an example of setting up a secure cluster, see this cluster template. Use common names for certificates and avoid self-signed certificates.
- Add resource constraints on containers and services, so that they don't consume more than 75% of node resources.
- Understand and set the durability level. Silver or higher durability level is recommended for node types running stateful workloads. The primary node type should have a durability level set to Silver or higher.
- Understand and pick the reliability level of the node type. Silver or higher reliability is recommended.
- Load and scale test your workloads to identify capacity requirements for your cluster.
- Your services and applications are monitored and application logs are being generated and stored, with alerting. For example, see Add logging to your Service Fabric application and Monitor containers with Azure Monitor logs.
- The cluster is monitored with alerting (for example, with Azure Monitor logs).
- The underlying virtual machine scale set infrastructure is monitored with alerting (for example, with Azure Monitor logs).
- The cluster always has both a primary and a secondary certificate (so you don't get locked out).
- Maintain separate clusters for development, staging, and production.
- Application upgrades and cluster upgrades are tested in development and staging clusters first.
- Turn off automatic upgrades in production clusters, and turn them on for development and staging clusters (roll back as needed).
- Establish a Recovery Point Objective (RPO) for your service, and set up a disaster recovery process and test it out.
- Plan for scaling your cluster manually or programmatically.
- Plan for patching your cluster nodes.
- Establish a CI/CD pipeline so that your latest changes are continually tested, for example with Azure DevOps or Jenkins.
- Test your development and staging clusters under load with the Fault Analysis Service and induce controlled chaos.
- Plan for scaling your applications.
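Several items in the checklist above carry concrete numbers (20 cores or 10 nodes, 50 GB of disk, 75% of node resources). As a minimal sketch, these can be turned into a pre-deployment check. The `ClusterPlan` type, its field names, and `readiness_findings` are hypothetical names invented for this illustration; they are not part of any Service Fabric SDK, and CPU stands in here for "node resources" in general.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the checklist items above.
MAX_CORES_WITHOUT_DEDICATED_PRIMARY = 20
MAX_NODES_WITHOUT_DEDICATED_PRIMARY = 10
MIN_PRIMARY_DISK_GB = 50
MAX_NODE_RESOURCE_FRACTION = 0.75


@dataclass
class ClusterPlan:
    """Hypothetical description of a planned cluster (not a Service Fabric type)."""
    total_cores: int
    total_nodes: int
    has_dedicated_primary_node_type: bool
    primary_disk_gb: int
    node_cpu_cores: float      # CPU cores available on a single node
    reserved_cpu_cores: float  # sum of CPU limits of everything placed on that node


def readiness_findings(plan: ClusterPlan) -> List[str]:
    """Return one message per numeric checklist item the plan does not meet."""
    findings = []

    # More than 20 cores or more than 10 nodes calls for a dedicated primary node type.
    large_cluster = (
        plan.total_cores > MAX_CORES_WITHOUT_DEDICATED_PRIMARY
        or plan.total_nodes > MAX_NODES_WITHOUT_DEDICATED_PRIMARY
    )
    if large_cluster and not plan.has_dedicated_primary_node_type:
        findings.append("Create a dedicated primary node type for system services.")

    if plan.primary_disk_gb < MIN_PRIMARY_DISK_GB:
        findings.append("Primary node type SKU has less than 50 GB of disk.")

    # Services and containers should not reserve more than 75% of a node.
    if plan.reserved_cpu_cores > MAX_NODE_RESOURCE_FRACTION * plan.node_cpu_cores:
        findings.append("Resource limits exceed 75% of node CPU.")

    return findings
```

A plan with 32 cores, no dedicated primary node type, a 40 GB disk, and 3.5 of 4 cores reserved per node would produce all three findings. A check like this only covers the numeric items; the remaining items still need manual review.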
If you're using the Service Fabric Reliable Services or Reliable Actors programming model, the following items need to be checked off:
- Upgrade applications during local development to check that your service code is honoring the cancellation token in the RunAsync method and closing custom communication listeners.
- Avoid common pitfalls when using Reliable Collections.
- Monitor the .NET CLR memory performance counters when running load tests and check for high rates of garbage collection or runaway heap growth.
- Maintain offline backups of Reliable Services and Reliable Actors and test the restoration process.
- Your primary node type's virtual machine instance count should ideally equal the minimum for your cluster's reliability tier. Exceeding the tier minimum is appropriate in some cases, for example temporarily while vertically scaling the virtual machine scale set SKU of your primary node type.
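The first item in this list asks that long-running service code stop promptly when cancellation is requested and that it close its listeners. The sketch below illustrates that cooperative-cancellation pattern in Python's `asyncio` as a language-neutral analogy. It does not use the Service Fabric API: `Listener`, `run_async`, and `simulate_upgrade` are hypothetical names, and an `asyncio.Event` stands in for the cancellation token.

```python
import asyncio
from typing import Tuple


class Listener:
    """Hypothetical stand-in for a custom communication listener."""

    def __init__(self) -> None:
        self.is_open = True

    async def close(self) -> None:
        self.is_open = False


async def run_async(cancel: asyncio.Event, listener: Listener) -> int:
    """Do short units of work until cancellation is requested, then clean up."""
    iterations = 0
    try:
        while not cancel.is_set():
            iterations += 1
            # Keep each unit of work short so the signal is noticed quickly.
            await asyncio.sleep(0.01)
    finally:
        # Always release the listener, even if the loop exits with an error.
        await listener.close()
    return iterations


async def simulate_upgrade() -> Tuple[int, bool]:
    """Start the service loop, request cancellation, and wait for it to stop."""
    cancel = asyncio.Event()
    listener = Listener()
    task = asyncio.create_task(run_async(cancel, listener))
    await asyncio.sleep(0.05)  # let the service run for a moment
    cancel.set()               # simulated upgrade: ask the service to stop
    # A loop that ignored the signal would make this wait time out.
    iterations = await asyncio.wait_for(task, timeout=1.0)
    return iterations, listener.is_open


if __name__ == "__main__":
    print(asyncio.run(simulate_upgrade()))
```

The timeout in `simulate_upgrade` plays the role of the local upgrade test: if the loop never checks the signal, the wait fails instead of hanging silently, which is the kind of problem an upgrade during development is meant to surface.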
Optional best practices
While the lists above are prerequisites for going into production, also consider the following items:
- Plug into the Service Fabric health model for extending the built-in health evaluation and reporting.
- Deploy a custom watchdog that monitors your application and reports load for resource balancing.