Workarounds for known issues in AKS on Azure Stack HCI

This article includes workaround steps for resolving known issues that occur when using Azure Kubernetes Service on Azure Stack HCI.

Install-AksHci fails on a multi-node installation

When running Install-AksHci on a single-node setup, the installation succeeded, but when setting up the failover cluster, the installation failed with the error message Nodes have not reached active state. However, pinging the cloud agent showed the CloudAgent was reachable.

To ensure all nodes can resolve the CloudAgent's DNS, run the following command on each node:

Resolve-DnsName <FQDN of cloudagent>

If the command above succeeds on all nodes, make sure the nodes can reach the CloudAgent port so you can verify that a proxy is not blocking the connection and that the port is open. To do this, run the following command on each node:

Test-NetConnection <FQDN of cloudagent> -Port <CloudAgent port - default 65000>
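If the nodes are already joined to the failover cluster, you can run both checks on every node at once. The following is a minimal sketch, assuming PowerShell remoting is enabled between the nodes and that the FQDN and port values below are placeholders for your environment:

# Hypothetical values; replace with your CloudAgent FQDN and port
$cloudAgentFqdn = 'ca-00000000-0000-0000-0000-000000000000.contoso.local'
$cloudAgentPort = 65000

# Run the DNS and port checks on every node in the failover cluster
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    Resolve-DnsName $using:cloudAgentFqdn
    Test-NetConnection $using:cloudAgentFqdn -Port $using:cloudAgentPort
}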

Linux and Windows VMs were not configured as highly available VMs

When scaling out a workload cluster, the corresponding Linux and Windows VMs were added as worker nodes, but they were not configured as highly available VMs. When running the Get-ClusterGroup command, the newly created Linux VM was not configured as a Cluster Group.

This is a known issue. After a reboot, the ability to have VMs configured as highly available is sometimes lost. The current workaround is to restart wssdagent on each of the Azure Stack HCI nodes. Note that this works only for new VMs that are generated by creating node pools, performing a scale-up operation, or creating new Kubernetes clusters after restarting wssdagent on the nodes. Existing VMs must be added to the failover cluster manually.

When you scale down a cluster, the high availability cluster resources are left in a failed state while the VMs are removed. The workaround for this issue is to manually remove the failed resources, as shown in the sketch below.
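The following is a minimal cleanup sketch, assuming the FailoverClusters PowerShell module is available and that you have confirmed the failed resources belong to VMs that no longer exist:

# Review the failed cluster resources and groups before removing anything
Get-ClusterResource | Where-Object { $_.State -eq 'Failed' }
Get-ClusterGroup | Where-Object { $_.State -eq 'Failed' }

# Remove the failed resources once you have confirmed their VMs were deleted
Get-ClusterResource | Where-Object { $_.State -eq 'Failed' } | Remove-ClusterResource -Force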

Attempt to create new workload clusters failed

An AKS on Azure Stack HCI cluster deployed in an Azure VM was previously working, but after the AKS host was turned off for several days, the kubectl command no longer worked. After running either the kubectl get nodes or kubectl get services command, this error message appeared: Error from server (InternalError): an error on the server ("") has prevented the request from succeeding.

This issue occurred because the AKS host was turned off for longer than four days, which caused the certificates to expire. Certificates are rotated on a four-day cycle. Run Repair-AksHciClusterCerts to fix the certificate expiration issue.

After running Set-AksHciRegistration in an AKS on Azure Stack HCI installation, an error occurred

The error Unable to check registered Resource Providers occurred after running Set-AksHciRegistration in an AKS on Azure Stack HCI installation. This error indicates that the Kubernetes Resource Providers are not registered for the current logged-in tenant.

To resolve this issue, run either the Azure CLI or PowerShell steps below:

az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
Register-AzResourceProvider -ProviderNamespace Microsoft.Kubernetes
Register-AzResourceProvider -ProviderNamespace Microsoft.KubernetesConfiguration

The registration takes approximately 10 minutes to complete. To monitor the registration process, use the following commands.

az provider show -n Microsoft.Kubernetes -o table
az provider show -n Microsoft.KubernetesConfiguration -o table
Get-AzResourceProvider -ProviderNamespace Microsoft.Kubernetes
Get-AzResourceProvider -ProviderNamespace Microsoft.KubernetesConfiguration
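If you prefer to wait for the registration from PowerShell, the following is a minimal polling sketch, assuming the Az PowerShell module is installed and you are signed in to the correct subscription:

# Poll every 30 seconds until both resource providers report a Registered state
$namespaces = 'Microsoft.Kubernetes', 'Microsoft.KubernetesConfiguration'
do {
    Start-Sleep -Seconds 30
    $states = foreach ($ns in $namespaces) {
        (Get-AzResourceProvider -ProviderNamespace $ns).RegistrationState | Select-Object -First 1
    }
    Write-Host "Current states: $($states -join ', ')"
} while ($states -contains 'NotRegistered' -or $states -contains 'Registering')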

Creating a workload cluster fails with the error A parameter cannot be found that matches parameter name 'nodePoolName'

On an AKS on Azure Stack HCI installation with Windows Admin Center extension version 1.82.0, the management cluster was set up using PowerShell, and an attempt was made to deploy a workload cluster using Windows Admin Center. One machine had PowerShell module version 1.0.2 installed, while the other machines had version 1.1.3 installed. The attempt to deploy the workload cluster failed with the error A parameter cannot be found that matches parameter name 'nodePoolName'. This error likely occurred because of the version mismatch: starting with PowerShell module version 1.1.0, the -nodePoolName <String> parameter was added to the New-AksHciCluster cmdlet, and by design, this parameter is mandatory when using Windows Admin Center extension version 1.82.0.

To resolve this issue, do one of the following:

  • Use PowerShell to manually update the workload cluster to version 1.1.0 or later.
  • Use Windows Admin Center to update the cluster to version 1.1.0 or to the latest PowerShell version.

This issue does not occur if the management cluster is deployed using Windows Admin Center, as it already has the latest PowerShell modules installed.
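If you take the PowerShell route, the following is a minimal sketch for aligning the module version on a machine, assuming the module is named AksHci and was installed from the PowerShell Gallery:

# Check which AksHci PowerShell module versions are installed on this machine
Get-Module -ListAvailable -Name AksHci | Select-Object Name, Version

# Update to the latest published version, then restart the PowerShell session
Update-Module -Name AksHci -Force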

When using PowerShell to upgrade, an excess number of Kubernetes configuration secrets is created on a cluster

The June 1.0.1.10628 build of AKS on Azure Stack HCI creates an excess number of Kubernetes configuration secrets in the cluster. The upgrade path from the June 1.0.1.10628 release to the July 1.0.2.10723 release was improved to clean up the extra Kubernetes secrets. However, in some cases the secrets were still not cleaned up during the upgrade, which causes the upgrade process to fail.

If you experience this issue, run the following steps:

  1. Save the script below as a file named fix_leaked_secrets.ps1:

    param (
    [Parameter(Mandatory=$true)]
    [string] $ClusterName,
    [Parameter(Mandatory=$true)]
    [string] $ManagementKubeConfigPath
    )
    
    $ControlPlaneHostName = kubectl get nodes --kubeconfig $ManagementKubeConfigPath -o=jsonpath='{.items[0].metadata.name}'
    "Hostname is: $ControlPlaneHostName"
    
    $leakedSecretPath1 = "$ClusterName-template-secret-akshci-cc"
    $leakedSecretPath2 = "$ClusterName-moc-kms-plugin"
    $leakedSecretPath3 = "$ClusterName-kube-vip"
    $leakedSecretPath4 = "$ClusterName-template-secret-akshc"
    $leakedSecretPath5 = "$ClusterName-linux-template-secret-akshci-cc"
    $leakedSecretPath6 = "$ClusterName-windows-template-secret-akshci-cc"
    
    $leakedSecretNameList = New-Object -TypeName 'System.Collections.ArrayList';
    $leakedSecretNameList.Add($leakedSecretPath1) | Out-Null
    $leakedSecretNameList.Add($leakedSecretPath2) | Out-Null
    $leakedSecretNameList.Add($leakedSecretPath3) | Out-Null
    $leakedSecretNameList.Add($leakedSecretPath4) | Out-Null
    $leakedSecretNameList.Add($leakedSecretPath5) | Out-Null
    $leakedSecretNameList.Add($leakedSecretPath6) | Out-Null
    
    foreach ($leakedSecretName in $leakedSecretNameList)
    {
    "Deleting secrets with the prefix $leakedSecretName"
    $output = kubectl --kubeconfig $ManagementKubeConfigPath exec etcd-$ControlPlaneHostName -n kube-system -- sh -c "ETCDCTL_API=3 etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt del /registry/secrets/default/$leakedSecretName --prefix=true"
    "Deleted: $output"
    }
    
  2. Next, run the following command using the fix_leaked_secrets.ps1 file you saved:

       .\fix_leaked_secrets.ps1 -ClusterName (Get-AksHciConfig).Kva.KvaName -ManagementKubeConfigPath (Get-AksHciConfig).Kva.Kubeconfig
    
  3. Finally, use the following PowerShell command to repeat the upgrade process:

       Update-AksHci
    

Attempt to upgrade from the GA release to version 1.0.1.10628 is stuck at Update-KvaInternal

When attempting to upgrade AKS on Azure Stack HCI from the GA release to version 1.0.1.10628, if the ClusterStatus shows OutOfPolicy, the upgrade can get stuck at the Update-KvaInternal stage of the upgrade installation. Using the repair-akshcicerts PowerShell cmdlet as a workaround may also not help. Make sure the AKS on Azure Stack HCI billing status shows as connected before you upgrade. An AKS on Azure Stack HCI upgrade is forward only and does not support version rollback, so if the upgrade gets stuck, you cannot roll back to the earlier version.
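Before you upgrade, you can check the billing state from PowerShell. The following is a minimal sketch, assuming your AksHci module version includes the Get-AksHciBillingStatus cmdlet:

# Confirm billing shows as connected before running Update-AksHci
Get-AksHciBillingStatus

# If the status is not connected, trigger a sync and check again
Sync-AksHciBilling
Get-AksHciBillingStatus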

Install-AksHci timed out with an error

After running Install-AksHci, the installation stopped and displayed the following waiting for API server error message:

\kubectl.exe --kubeconfig=C:\AksHci\0.9.7.3\kubeconfig-clustergroup-management 
get akshciclusters -o json returned a non zero exit code 1 
[Unable to connect to the server: dial tcp 192.168.0.150:6443: 
connectex: A connection attempt failed because the connected party 
did not properly respond after a period of time, or established connection 
failed because connected host has failed to respond.]

There are multiple reasons why an installation might fail with the waiting for API server error. See the following sections for possible causes and solutions for this error.

Reason 1: Incorrect IP gateway configuration

If you're using static IP and you received the following error message, confirm that the configuration for the IP address and gateway is correct.

Install-AksHci 
C:\AksHci\kvactl.exe create --configfile C:\AksHci\yaml\appliance.yaml --outfile C:\AksHci\kubeconfig-clustergroup-management returned a non zero exit code 1 [ ]

To check whether you have the right configuration for your IP address and gateway, run the following:

ipconfig /all

In the displayed settings, confirm that the IP address and gateway are configured as expected. You can also try pinging the IP gateway and DNS server, as shown below.
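The following is a minimal sketch of those checks; the addresses are placeholders for the gateway and DNS server values reported by ipconfig /all:

# Replace with the gateway and DNS server addresses from ipconfig /all
Test-Connection 192.168.0.1 -Count 2
Test-Connection 192.168.0.10 -Count 2

# Confirm that name resolution works through the configured DNS server
Resolve-DnsName <FQDN of cloudagent> -Server 192.168.0.10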

If these methods don't work, use New-AksHciNetworkSetting to change the configuration.

Reason 2: Incorrect DNS server

If you’re using static IP, confirm that the DNS server is correctly configured. To check the host's DNS server address, use the following command:

((Get-NetIPConfiguration).DNSServer | ?{ $_.AddressFamily -ne 23 }).ServerAddresses

Confirm that the DNS server address is the same as the address used when running New-AksHciNetworkSetting by running the following command:

Get-MocConfig

If the DNS server has been incorrectly configured, reinstall AKS on Azure Stack HCI with the correct DNS server. For more information, see Restart, remove, or reinstall Azure Kubernetes Service on Azure Stack HCI.

In this case, the issue was resolved after deleting the configuration and restarting the VM with a new configuration.

Install-AksHci fails due to an Azure Arc onboarding failure

After running Install-AksHci, a Failed to wait for addon arc-onboarding error occurred.

To resolve this issue, use the following steps:

  1. Open PowerShell and run Uninstall-AksHci.
  2. Open the Azure portal and navigate to the resource group you used when running Install-AksHci.
  3. Check for any connected cluster resources that appear in a Disconnected state and include a name shown as a randomly generated GUID.
  4. Delete these cluster resources.
  5. Close the PowerShell session and open new session before running Install-AksHci again.
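The following is a minimal sketch for finding and deleting the leftover connected cluster resources with the Az PowerShell module, assuming you are signed in to the correct subscription; review the list before deleting anything:

# List connected cluster resources in the resource group used for Install-AksHci
$leftovers = Get-AzResource -ResourceGroupName <resource group name> -ResourceType 'Microsoft.Kubernetes/connectedClusters'
$leftovers | Select-Object Name, ResourceId

# After confirming the entries are disconnected and have GUID-like names, delete them
$leftovers | Remove-AzResource -Force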

Install-AksHci fails because the nodes did not reach an Active state

After running Uninstall-AksHci, Install-AksHci may fail with a Nodes have not reached Active state error message if it's run in the same PowerShell session that was used when running Uninstall-AksHci. You should close the PowerShell session after running Uninstall-AksHci and then open a new session before running Install-AksHci. This issue can also appear when deploying AKS on Azure Stack HCI using Windows Admin Center.

This error message indicates an infrastructure issue: the node agent is unable to connect to the CloudAgent. There should be connectivity between the nodes, and each node should be able to resolve the CloudAgent's DNS name (ca-<guid>). While the deployment is stuck, manually check each node to see if Resolve-DnsName succeeds.

When running Get-AksHciCluster, a release version not found error occurs

When running Get-AksHciCluster to verify the status of an AKS on Azure Stack HCI installation in Windows Admin Center, the output showed an error: A release with version 1.0.3.10818 was NOT FOUND. However, when running Get-AksHciVersion, it showed that the same version was installed. This error indicates that the build has expired.

To resolve this issue, run Uninstall-AksHci, and then install a new AKS on Azure Stack HCI build.

When multiple versions of PowerShell modules are installed, Windows Admin Center does not pick the latest version

If you have multiple versions of the PowerShell modules installed (for example, 0.2.26, 0.2.27, and 0.2.28), Windows Admin Center may not use the latest version (or the one it requires). Make sure only one version of the PowerShell modules is installed: uninstall all unused versions and leave just one, as shown below. For information on which Windows Admin Center version is compatible with which PowerShell version, see the release notes.
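The following is a minimal cleanup sketch, assuming the module in question is named AksHci, was installed from the PowerShell Gallery, and that the version numbers are the ones from the example above:

# Show every installed version of the module
Get-InstalledModule -Name AksHci -AllVersions

# Remove the older versions, then verify that only one version remains
Uninstall-Module -Name AksHci -RequiredVersion 0.2.26
Uninstall-Module -Name AksHci -RequiredVersion 0.2.27
Get-InstalledModule -Name AksHci -AllVersions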

After a failed installation, the Install-AksHci PowerShell command cannot be run

If your installation fails using Install-AksHci, you should run Uninstall-AksHci before running Install-AksHci again. This issue happens because a failed installation may result in leaked resources that have to be cleaned up before you can install again.

During deployment, the error Waiting for pod 'Cloud Operator' to be ready appears

When attempting to deploy an AKS on Azure Stack HCI cluster on an Azure VM, the installation was stuck at Waiting for pod 'Cloud Operator' to be ready..., and then failed and timed out after two hours. Troubleshooting showed that the gateway and DNS server were working correctly and that there were no IP or MAC address conflicts. However, the VIP pool had not appeared in the logs, and pulling the container image with sudo docker pull ecpacr.azurecr.io/kube-vip:0.3.4 returned a Transport Layer Security (TLS) timeout rather than an unauthorized error.

To resolve this issue, run the following steps:

  1. Start to deploy your cluster.

  2. When deployed, connect to the management cluster VM through SSH as shown below:

    ssh -i (Get-MocConfig)['sshPrivateKey'] clouduser@<IP Address>
    
  3. Change the maximum transmission unit (MTU) setting. Make this change promptly after connecting, because if you make it too late, the deployment fails. Modifying the MTU setting helps unblock the container image pull.

    sudo ifconfig eth0 mtu 1300
    
  4. To view the status of your containers, run the following command:

    sudo docker ps -a
    

After performing these steps, the container image pull should be unblocked.

When running Update-AksHci, the update process was stuck at Waiting for deployment 'AksHci Billing Operator' to be ready

When running the Update-AksHci PowerShell cmdlet, the update was stuck with a status message: Waiting for deployment 'AksHci Billing Operator' to be ready.

This issue could have the following root causes:

  • Reason one: During the update of the AksHci Billing Operator, it's possible that the Operator incorrectly marked itself as out of policy. To resolve this, open up a new PowerShell window and run Sync-AksHciBilling. You should see the billing operation continue within the next 20-30 minutes.

  • Reason two: The management cluster VM may be out of memory, which causes the API server to be unreachable and consequently makes all commands from Get-AksHciCluster, billing, and update run into a timeout. As a workaround, set the management cluster VM memory to 32 GB in Hyper-V and reboot it (see the sketch after these steps).

  • Reason three: The AKS on Azure Stack HCI Billing Operator may be out of storage space due to a bug in the Microsoft SQL configuration settings. The lack of storage space may cause the upgrade to stop responding. To work around this issue, manually resize the billing pod's persistent volume claim (PVC) using the following steps.

    1. Run the following command to edit the pod settings:

      kubectl edit pvc mssql-data-claim --kubeconfig (Get-AksHciConfig).Kva.kubeconfig -n azure-arc
      
    2. When Notepad or another editor opens with a YAML file, edit the line for storage from 100Mi to 5Gi:

      spec:
        resources:
          requests:
            storage: 5Gi
      
    3. Check the status of the billing deployment using the following command:

      kubectl get deployments/billing-manager-deployment --kubeconfig (Get-AksHciConfig).Kva.kubeconfig -n azure-arc
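For reason two, the following is a minimal sketch of the memory workaround, run on the node that hosts the management cluster VM; it assumes the Hyper-V PowerShell module is available and the VM name is a placeholder:

# Replace with the name of your management cluster VM as shown in Hyper-V Manager
$vmName = '<management cluster VM name>'

# Stop the VM, give it 32 GB of startup memory, and start it again
Stop-VM -Name $vmName
Set-VMMemory -VMName $vmName -StartupBytes 32GB
Start-VM -Name $vmName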
      

Using Remote Desktop to connect to the management cluster produces a connection error

When using Remote Desktop (RDP) to connect to one of the nodes in an Azure Stack HCI cluster and then running the Get-AksHciCluster command, an error appears and says the connection failed because the host failed to respond.

The connection fails because some PowerShell commands that use kubeconfig-mgmt fail with an error similar to the following one:

Unable to connect to the server: dial tcp 172.168.10.0:6443, where 172.168.10.0 is the IP address of the control plane.

The kube-vip pod can go down for two reasons:

  • The memory pressure in the system can slow down etcd, which ends up affecting kube-vip.
  • The kube-apiserver is not available.

To help resolve this issue, try rebooting the machine. However, if memory pressure is the root cause, the issue may return.

When running kubectl get pods, pods were stuck in a Terminating state

When deploying AKS on Azure Stack HCI and then running kubectl get pods, pods on the same node were stuck in the Terminating state. The machine rejected SSH connections because the node was likely experiencing high memory demand.

This issue occurs because the Windows nodes are over-provisioned, and there's no reserve for core components. To avoid this situation, add resource limits and resource requests for CPU and memory to the pod specification (see the sketch below) to ensure that the nodes aren't over-provisioned. Windows nodes don't support eviction based on resource limits, so you should estimate how much the containers will use and then set the CPU and memory amounts accordingly.
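The following is a minimal sketch of a pod specification with requests and limits, applied from PowerShell; the pod name, image, and resource amounts are placeholders you should size for your own workload, and it assumes kubectl is already pointed at the workload cluster (for example, through Get-AksHciCredential):

# Hypothetical pod spec; adjust the CPU and memory values for your containers
@"
apiVersion: v1
kind: Pod
metadata:
  name: sample-windows-pod
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: sample
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi
"@ | kubectl apply -f -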

Running the Remove-ClusterNode command evicts the node from the failover cluster, but the node still exists

When running the Remove-ClusterNode command, the node is evicted from the failover cluster, but if Remove-AksHciNode is not run afterwards, the node will still exist in CloudAgent.

Since the node was removed from the cluster but not from CloudAgent, if you use the VHD to create a new node, a File not found error appears. This issue occurs because the VHD is in shared storage, and the evicted node no longer has access to it.

To resolve this issue, remove a physical node from the cluster and then follow the steps below:

  1. Run Remove-AksHciNode to de-register the node from CloudAgent.
  2. Perform routine maintenance, such as re-imaging the machine.
  3. Add the node back to the cluster.
  4. Run Add-AksHciNode to register the node with CloudAgent.

An Arc connection on an AKS cluster cannot be enabled after disabling it

To enable an Arc connection after disabling it, run the following PowerShell commands as an administrator, where -Name is the name of your workload cluster:

Get-AksHciCredential -Name myworkloadcluster
kubectl --kubeconfig=kubeconfig delete secrets sh.helm.release.v1.azure-arc.v1
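After the secret is deleted, you can try to enable the Arc connection again. The following is a minimal sketch, assuming your AksHci module version provides the Enable-AksHciArcConnection cmdlet:

Enable-AksHciArcConnection -Name myworkloadcluster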

Container storage interface pod stuck in a ContainerCreating state

A new Kubernetes workload cluster was created with Kubernetes version 1.16.10 and then updated to version 1.16.15. After the update, the csi-msk8scsi-node-9x47m pod was stuck in the ContainerCreating state, and the kube-proxy-qqnkr pod was stuck in the Terminating state, as shown in the output below:

Error: kubectl.exe get nodes  
NAME              STATUS     ROLES    AGE     VERSION 
moc-lf22jcmu045   Ready      <none>   5h40m   v1.16.15 
moc-lqjzhhsuo42   Ready      <none>   5h38m   v1.16.15 
moc-lwan4ro72he   NotReady   master   5h44m   v1.16.15

\kubectl.exe get pods -A 

NAMESPACE     NAME                        READY   STATUS              RESTARTS   AGE 
kube-system   csi-msk8scsi-node-9x47m     0/3     ContainerCreating   0          5h44m 
kube-system   kube-proxy-qqnkr            1/1     Terminating         0          5h44m  

Since kubelet ended up in a bad state and can no longer talk to the API server, the only solution is to restart the kubelet service. After restarting, the cluster goes into a running state.
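The following is a minimal sketch of restarting kubelet on the affected node over SSH, assuming the node image uses systemd to manage the kubelet service:

# Connect to the node, then restart kubelet and confirm it is running again
ssh -i (Get-MocConfig)['sshPrivateKey'] clouduser@<IP Address>
sudo systemctl restart kubelet
sudo systemctl status kubelet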

All pods in a Windows node are stuck in a ContainerCreating state

In a workload cluster with the Calico network plug-in enabled, all of the pods in a Windows node are stuck in the ContainerCreating state except for the calico-node-windows daemonset pod.

To resolve this issue, find the name of the kube-proxy pod on that node and then run the following command:

kubectl delete pod <KUBE-PROXY-NAME> -n kube-system

All the pods should start on the node.

In a workload cluster with static IP, all pods in a node are stuck in a ContainerCreating state

In a workload cluster with static IP and Windows nodes, all of the pods in a node (including the daemonset pods) are stuck in a ContainerCreating state. When attempting to connect to that node using SSH, it fails with a Connection timed out error.

To resolve this issue, use Hyper-V Manager or Failover Cluster Manager to turn off the VM of that node. After five to ten minutes, the node should be recreated with all the pods running.
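If you prefer PowerShell over the graphical tools, the following is a minimal sketch using the Hyper-V cmdlets on the host that owns the node VM; the node name is a placeholder:

# Find the VM that backs the stuck node, then turn it off; the node should be recreated automatically
Get-VM | Where-Object Name -like '*<node name>*'
Stop-VM -Name <node VM name> -TurnOff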

Attempt to increase the number of worker nodes fails

When using PowerShell to create a cluster with static IP and then attempting to increase the number of worker nodes in the workload cluster, the installation got stuck at control plane count at 2, still waiting for desired state: 3. After a period of time, another error message appeared: Error: timed out waiting for the condition.

When Get-AksHciCluster was run, it showed that the control plane nodes were created and provisioned and were in a Ready state. However, when kubectl get nodes was run, it showed that the control plane nodes had been created but not provisioned and were not in a Ready state.

If you get this error, verify that the IP addresses have been assigned to the created nodes using either Hyper-V Manager or PowerShell:

(Get-VM | Get-VMNetworkAdapter).IPAddresses | fl

Then, verify the network settings to ensure there are enough IP addresses left in the pool to create more VMs.

When deploying AKS on Azure Stack HCI with a misconfigured network, deployment timed out at various points

When deploying AKS on Azure Stack HCI, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.

For example, in the following error, the point at which the misconfiguration occurred is in Get-DownloadSdkRelease -Name "mocstack-stable":

$vnet = New-AksHciNetworkSetting
Set-AksHciConfig -vnet $vnet
Install-AksHci
VERBOSE: Initializing environment
VERBOSE: [AksHci] Importing Configuration
VERBOSE: [AksHci] Importing Configuration Completed
powershell : GetRelease - error returned by API call:
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True":
dial tcp 52.184.220.11:443: connectex:
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.
At line:1 char:1
+ powershell -command { Get-DownloadSdkRelease -Name "mocstack-stable"}

This indicates that the physical Azure Stack HCI node can resolve the name of the download URL, msk8s.api.cdp.microsoft.com, but the node can't connect to the target server.

To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:

  1. Ping the destination DNS name: ping msk8s.api.cdp.microsoft.com.
  2. If you get a response back and no time-out, then the basic network path is working.
  3. If the connection times out, there could be a break in the data path; check your proxy settings. Or, there could be a break in the return path, so you should check the firewall rules. You can also test TCP connectivity to the endpoint, as shown below.
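The following is a minimal sketch of those connectivity checks from a cluster node; the endpoint name is taken from the error above, and port 443 is the standard HTTPS port:

# Basic reachability check for the download endpoint
ping msk8s.api.cdp.microsoft.com

# Verify that TCP port 443 is reachable and not blocked by a proxy or firewall
Test-NetConnection msk8s.api.cdp.microsoft.com -Port 443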

An Unable to acquire token error appears when running Set-AksHciRegistration

An Unable to acquire token error can occur when you have multiple tenants on your Azure account. Use $tenantId = (Get-AzContext).Tenant.Id to select the right tenant, and then include this tenant ID as a parameter when running Set-AksHciRegistration.
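The following is a minimal sketch, assuming Set-AksHciRegistration in your module version accepts a -TenantId parameter and that the subscription and resource group values are placeholders:

# Select the tenant from the current Azure context, then pass it to registration
$tenantId = (Get-AzContext).Tenant.Id
Set-AksHciRegistration -SubscriptionId <subscription ID> -ResourceGroupName <resource group name> -TenantId $tenantId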

When upgrading a deployment, some pods might be stuck at waiting for static pods to have a ready condition

To release the pods and resolve this issue, you should restart kubelet. To view the NotReady node with the static pods, run the following command:

kubectl get nodes -o wide

To get more information on the faulty node, run the following command:

kubectl describe node <IP of the node>

Use SSH to log into the NotReady node by running the following command:

ssh -i <path of the private key file> administrator@<IP of the node>

Then, to restart kubelet, run the following command:

/etc/.../kubelet restart

When creating a persistent volume, an attempt to mount the volume fails

After deleting a persistent volume or a persistent volume claim in an AKS on Azure Stack HCI environment, a new persistent volume is created to map to the same share. However, when attempting to mount the volume, the mount fails, and the pod times out with the error, NewSmbGlobalMapping failed.

To work around the failure to mount the new volume, SSH into the Windows node and run Remove-SMBGlobalMapping, providing the share that corresponds to the volume; see the sketch below. After running this command, attempts to mount the volume should succeed.
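The following is a minimal sketch of that cleanup on the Windows node, assuming the SMB global mapping cmdlets from the SmbShare module are available and that the share path is a placeholder:

# List the existing SMB global mappings to identify the stale one
Get-SmbGlobalMapping

# Remove the mapping for the share that backed the deleted volume
Remove-SmbGlobalMapping -RemotePath \\<file server>\<share name>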

Next steps

If you continue to run into problems when you're using Azure Kubernetes Service on Azure Stack HCI, you can file bugs through GitHub.