In this blog we’ll discuss what happens when a Cluster, or a single Node in your Azure Stack HCI Cluster, reports a connection status of ‘OutOfPolicy’.
As you probably know by now, Azure Stack HCI is an Azure hybrid service that is connected to Azure for cloud-based monitoring, support, billing, and optional management and security features. Although it’s designed to be connected to Azure, it can handle periods of limited or zero connectivity without affecting the function of the cluster or running workloads. That being said, it’s advisable to have stable connectivity to Azure, and Azure Stack HCI needs to sync successfully with Azure at least once every 30 consecutive days.
Cluster OutOfPolicy
If Azure Stack HCI hasn’t synced with Azure in more than 30 consecutive days, the cluster’s connection status will show ‘OutOfPolicy’ in the Azure portal and other tools, and the cluster will enter a reduced functionality mode. In this mode, the host infrastructure stays up and all current VMs continue to run normally. However, new VMs can’t be created until Azure Stack HCI is able to sync again. The internal technical reason is that the cluster’s cloud-generated license has expired and needs to be renewed by syncing with Azure.
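To see how close a cluster is to the 30-day limit, something like the sketch below can be run on a node. `Get-HciSyncAge` is a hypothetical helper written for this post, not part of any Microsoft module, and the sketch assumes the `LastConnected` property returned by Get-AzureStackHCI is populated with a date/time value.

```powershell
# Hypothetical helper: works out how far into the 30-day sync window
# a cluster is, given the time of its last successful sync.
function Get-HciSyncAge {
    param(
        [Parameter(Mandatory)] [datetime] $LastConnected,
        [datetime] $Now = (Get-Date),
        [int] $LimitDays = 30            # sync window described above
    )
    $age = $Now - $LastConnected
    [pscustomobject]@{
        DaysSinceSync = [int][math]::Floor($age.TotalDays)
        Overdue       = $age.TotalDays -gt $LimitDays
    }
}

# Only query the cluster when the cmdlet is available (i.e. on an HCI node).
if (Get-Command Get-AzureStackHCI -ErrorAction SilentlyContinue) {
    $hci = Get-AzureStackHCI
    "ConnectionStatus: $($hci.ConnectionStatus)"
    "LastConnected:    $($hci.LastConnected)"
    # Assumption: LastConnected is populated and is a date/time value.
    if ($hci.LastConnected) {
        Get-HciSyncAge -LastConnected $hci.LastConnected
    }
}
```

An `Overdue` value of `True` would line up with the ‘OutOfPolicy’ status described above, although the ConnectionStatus reported by the cluster itself is always the value to trust.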
If the above situation occurs then it should be possible to unregister and re-register the cluster, providing the connectivity is in place, to bring it back into policy and restore full functionality. For example, if all nodes in a cluster are displaying error 594 then you need to unregister the Cluster, reboot each node in turn, wait for a repair of the Enclave to complete (15 minutes), and then re-register the Cluster.
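As a rough sketch, the unregister, reboot and re-register sequence could look like the following when run from a management machine that has the Az.StackHCI module and the Failover Clustering tools installed. The function name and its parameters are hypothetical, draining each node before its reboot is my own precaution rather than something the event text asks for, and the block only defines the function so nothing runs until you call it.

```powershell
# Sketch only: hypothetical wrapper around the cluster-wide recovery steps.
# Run from a management machine, not from a cluster node, because
# Restart-Computer -Wait cannot be used against the local computer.
function Repair-HciClusterRegistration {
    param(
        [Parameter(Mandatory)] [string] $SubscriptionId,
        [Parameter(Mandatory)] [string] $Cluster,     # cluster name
        [Parameter(Mandatory)] [string] $FirstNode,   # any node in the cluster
        [int] $EnclaveWaitMinutes = 15                # repair time quoted above
    )

    # 1. Unregister the cluster from Azure.
    Unregister-AzStackHCI -SubscriptionId $SubscriptionId -ComputerName $FirstNode

    # 2. Reboot each node in turn. Draining first is an added precaution.
    #    Check that storage repair jobs have finished before the next node.
    foreach ($node in Get-ClusterNode -Cluster $Cluster) {
        Suspend-ClusterNode -Cluster $Cluster -Name $node.Name -Drain -Wait
        Restart-Computer -ComputerName $node.Name -Wait -For PowerShell -Force
        Resume-ClusterNode -Cluster $Cluster -Name $node.Name
    }

    # 3. Give the Enclave repair time to complete.
    Start-Sleep -Seconds ($EnclaveWaitMinutes * 60)

    # 4. Re-register the cluster with Azure.
    #    Newer module versions may also need a -Region value.
    Register-AzStackHCI -SubscriptionId $SubscriptionId -ComputerName $FirstNode
}
```

A call would then look like `Repair-HciClusterRegistration -SubscriptionId '<subscription id>' -Cluster '<cluster name>' -FirstNode '<node name>'`, with the placeholders replaced by your own values.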
Single Node OutOfPolicy
Another example of ‘OutOfPolicy’ is when a single node is affected. If this happens then it’s likely that the Cluster itself is still connected to Azure and still has full functionality, but to avoid any issues with functionality the affected node needs to be brought back into Policy as soon as possible.
One of the reasons this can occur has been traced back to incorrect local time sync configuration on the Node, where it is not set to sync its time with the domain hierarchy. Another reason this can happen is if the Hostname of the node is changed after registering the cluster with Azure.
Note:
Ensuring your local infrastructure domain is configured to sync with a reliable external time source and that the Cluster nodes are syncing with the domain hierarchy is vital before registering Clusters with Azure!
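A quick way to check where a node gets its time from is the built-in w32tm tool. The sketch below is only an illustration: `Test-DomainTimeSync` is a hypothetical helper that looks for the `NT5DS` client type (sync from the domain hierarchy) in the output of `w32tm /query /configuration`, and it assumes the output format of current Windows Server releases.

```powershell
# Hypothetical helper: returns $true when the 'w32tm /query /configuration'
# output shows client type NT5DS, i.e. syncing from the domain hierarchy.
function Test-DomainTimeSync {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [AllowEmptyString()]
        [string[]] $ConfigurationOutput
    )
    # -match on an array returns the matching lines; none means $false.
    [bool]($ConfigurationOutput -match '^\s*Type:\s*NT5DS')
}

# Only run the live check where w32tm exists (i.e. on Windows).
if (Get-Command w32tm.exe -ErrorAction SilentlyContinue) {
    w32tm /query /source                      # current time source
    $config = @(w32tm /query /configuration)  # may need an elevated session
    if (-not (Test-DomainTimeSync -ConfigurationOutput $config)) {
        Write-Warning 'Node is not syncing from the domain hierarchy. To switch, run:'
        Write-Warning '  w32tm /config /syncfromflags:domhier /update'
        Write-Warning '  w32tm /resync'
    }
}
```

The sketch only prints the suggested fix rather than applying it, so that a time configuration change is always a deliberate step.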
If one of the above occurs then it can cause the Encryption Key to be lost for the node and the Node won’t be able to renew its cloud-generated license. In this case the Key will need to be regenerated within the node’s Secure Enclave.
To identify the issue further – apart from running Get-AzureStackHCI on the node and seeing the ConnectionStatus as ‘OutOfPolicy’ – look for event 592 in the AzureStack\HCI\Admin logs.
As per the below image, the Description of the event provides steps on how to resolve either the entire Cluster or just a single node being OutOfPolicy. The steps for the entire cluster have already been mentioned above, but the steps to resolve just a node are a little more involved.
Other related events that may be seen are: 501, 546, 547, 554
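The event IDs mentioned above can also be pulled with Get-WinEvent instead of browsing Event Viewer. In this sketch the channel name `Microsoft-AzureStack-HCI/Admin` is my assumption for the log referred to above, and `New-HciEventFilter` is a hypothetical helper, so adjust both to match what you see on your nodes.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for the event IDs
# listed above, limited to the last $Days days.
function New-HciEventFilter {
    param([int] $Days = 30)
    @{
        LogName   = 'Microsoft-AzureStack-HCI/Admin'   # assumed channel name
        Id        = 592, 501, 546, 547, 554
        StartTime = (Get-Date).AddDays(-$Days)
    }
}

# Only query the log where Get-WinEvent exists (i.e. on Windows).
if (Get-Command Get-WinEvent -ErrorAction SilentlyContinue) {
    try {
        Get-WinEvent -FilterHashtable (New-HciEventFilter) -ErrorAction Stop |
            Sort-Object TimeCreated |
            Format-List TimeCreated, Id, LevelDisplayName, Message
    }
    catch {
        # Raised when the log is missing or no matching events are found.
        Write-Warning $_.Exception.Message
    }
}
```

The Message field of event 592 is the Description that contains the suggested resolution steps.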
Resolve a Single Node being OutOfPolicy
As per the steps suggested in the event, the resolution is to evict the node from the cluster, reboot it and re-add it. This triggers a repair of the Key in the Secure Enclave. Once the node has been re-added and has synced with the cluster and with Azure, it will generate a valid certificate and come back into Policy.
*Please note that, at the time of writing, this resolution is only supported for version 21H2.
Steps to Evict and re-add Node:
- Ensure all Virtual Disks, Physical Disks and the Storage Pool are healthy and no Storage Jobs are running
- If using PowerShell, run the Remove-ClusterNode cmdlet (do NOT add the -CleanupDisk parameter)
- Reboot the Evicted Node
- Wait at least 15 minutes for the Enclave repair to complete
- Re-add the Node to the Cluster
- Resync Cluster to Azure
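The steps above can be sketched in PowerShell roughly as follows, run from one of the nodes that remains in the cluster. `Get-UnhealthyItem` and `Reset-HciNodeMembership` are hypothetical names made up for this post, the sketch assumes roles have already been moved off the affected node, and the block only defines the functions so nothing is evicted until you call them yourself.

```powershell
# Hypothetical helper: returns the items whose HealthStatus is not 'Healthy'.
function Get-UnhealthyItem {
    param([object[]] $Items)
    @($Items | Where-Object { $_ -and $_.HealthStatus -ne 'Healthy' })
}

# Sketch only: evict, reboot and re-add a single node.
# Run from a node that stays in the cluster, not from the affected node.
function Reset-HciNodeMembership {
    param(
        [Parameter(Mandatory)] [string] $Cluster,   # cluster name
        [Parameter(Mandatory)] [string] $Node,      # node that is OutOfPolicy
        [int] $EnclaveWaitMinutes = 15
    )

    # Storage must be healthy and no storage jobs may be running.
    $storage = @(Get-VirtualDisk) + @(Get-PhysicalDisk) +
               @(Get-StoragePool -IsPrimordial $false)
    $bad  = @(Get-UnhealthyItem -Items $storage)
    $jobs = @(Get-StorageJob | Where-Object JobState -eq 'Running')
    if ($bad.Count -or $jobs.Count) {
        throw 'Storage is not healthy or storage jobs are still running.'
    }

    # Evict the node. No disk cleanup switch is passed, as noted above.
    # Assumes roles have already been moved off $Node.
    Remove-ClusterNode -Cluster $Cluster -Name $Node -Force

    # Reboot the evicted node and wait for it to come back.
    Restart-Computer -ComputerName $Node -Wait -For PowerShell -Force

    # Wait for the Enclave repair to complete.
    Start-Sleep -Seconds ($EnclaveWaitMinutes * 60)

    # Re-add the node, then resync the cluster with Azure.
    Add-ClusterNode -Cluster $Cluster -Name $Node
    Sync-AzureStackHCI
}
```

Once the sync has completed, running Get-AzureStackHCI on the re-added node should show whether its ConnectionStatus has recovered.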
The below screen shots support the above steps and show the related logs to indicate resolution:
Conclusion
I hope you found the above information useful and that it helps resolve any OutOfPolicy issues.