Fix known issues and errors when configuring a network

Use this topic to help you troubleshoot and resolve networking-related issues with Azure Kubernetes Service on Azure Stack HCI and Windows Server.

Error: 'Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group os in the 'failed' state.'

Cloud agent may fail to successfully start when using path names with spaces in them.

When you use Set-AksHciConfig to specify the -imageDir, -workingDir, -cloudConfigLocation, or -nodeConfigLocation parameters with a path name that contains a space character, such as D:\Cloud Share\AKS HCI, the cloud agent cluster service fails to start with the following (or similar) error message:

Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group os in the 'failed' state. Resources in 'failed' or 'pending' states: 'MOC Cloud Agent Service'

To work around this issue, use a path that does not include spaces, for example, C:\CloudShare\AKS-HCI.
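
For example, a Set-AksHciConfig call in which none of the path parameters contain spaces looks like the following sketch; the folder names are illustrative, and any other parameters your deployment requires are omitted for brevity:

Set-AksHciConfig -imageDir C:\CloudShare\Images `
    -workingDir C:\CloudShare\WorkingDir `
    -cloudConfigLocation C:\CloudShare\CloudConfig `
    -nodeConfigLocation C:\CloudShare\NodeConfig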

Load balancer in Azure Kubernetes Service requires DHCP reservation.

The load balancing solution in Azure Kubernetes Service on Azure Stack HCI uses DHCP to assign IP addresses to service endpoints. If a service restarts, or if the DHCP lease expires because of a short lease time, the service endpoint can receive a new IP address. The service then becomes inaccessible because the IP address in the Kubernetes configuration no longer matches the endpoint's actual address, which can lead to the Kubernetes cluster becoming unavailable.

To get around this issue, use a MAC address pool for the load balanced service endpoints and reserve specific IP addresses for each MAC address in the pool.
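
For example, if the DHCP server runs Windows Server, a reservation can be created with the DhcpServer PowerShell module; repeat for each MAC address in the pool. The scope, IP address, and MAC address below are illustrative:

# Reserve a fixed IP address for one MAC address in the load balancer's MAC pool
Add-DhcpServerv4Reservation -ScopeId 192.168.0.0 -IPAddress 192.168.0.50 `
    -ClientId "00-15-5D-00-00-01" -Description "AKS-HCI load-balanced service endpoint"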

The WSSDAgent service is stuck while starting and fails to connect to the cloud agent.

Symptoms:

  • A proxy is enabled in AKS-HCI, and the WSSDAgent service is stuck in the starting state.
  • Test-NetConnection -ComputerName <computer IP/Name> -Port <port> from the node where the node agent is failing towards the cloud agent works properly on the system (even while the wssdagent fails to start).
  • Curl.exe from the node on which the agent is failing towards the cloud agent reproduces the problem and gets stuck: curl.exe https://<computerIP>:65000
  • When you pass the --noproxy flag to curl.exe, the problem goes away. Curl returns an error from wssdcloudagent. This is expected because curl is not a GRPC client, and it no longer gets stuck waiting when you pass the --noproxy flag, so returning an error is considered a success here:
curl.exe --noproxy '*' https://<computerIP>:65000

It is likely that the proxy settings on the host were changed to a faulty proxy. The proxy settings for AKS on Azure Stack HCI are environment variables inherited from the parent process on the host. These settings are propagated only when a new service starts, or when an existing one updates or reboots. It is possible that faulty proxy settings were set on the host and propagated to the WSSDAgent after an update or reboot, causing the WSSDAgent to fail.

You need to fix the proxy settings by changing the environment variables on the machine. On the machine, change the variables with the following commands:

  [Environment]::SetEnvironmentVariable("https_proxy", <Value>, "Machine")
  [Environment]::SetEnvironmentVariable("http_proxy", <Value>, "Machine")
  [Environment]::SetEnvironmentVariable("no_proxy", <Value>, "Machine")
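
Before rebooting, you can confirm the machine-scope values that services will inherit by reading them back with the corresponding getter:

  [Environment]::GetEnvironmentVariable("https_proxy", "Machine")
  [Environment]::GetEnvironmentVariable("http_proxy", "Machine")
  [Environment]::GetEnvironmentVariable("no_proxy", "Machine")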

Reboot the machine so that the service manager and the WSSDAgent pick up the updated proxy settings.

CAPH pod fails to renew certificate

This error occurs because every time the CAPH pod starts, it attempts a login to cloudagent, and the resulting certificate is stored in the pod's temporary storage volume, which is cleaned out on pod restarts. So every time a pod is restarted, the certificate is destroyed and a new login attempt is made.

A login attempt starts a renewal routine, which renews the certificate when it nears expiry. The CAPH pod decides whether a login is needed based on whether the certificate is available. If the certificate is available, the login is not attempted, on the assumption that the renewal routine is already running.

However, on a container restart (as opposed to a pod restart), the temporary directory is not cleaned, so the certificate file persists and the login attempt is not made, which means the renewal routine never starts. This leads to certificate expiration.

To mitigate this issue, restart the CAPH pod using the following command:

kubectl delete pod <pod-name>
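
If you don't know the pod's exact name or namespace, you can locate it first; the 'caph' name filter below is an assumption and may need adjusting for your deployment:

kubectl get pods --all-namespaces | findstr caph
kubectl delete pod <pod-name> -n <namespace>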

Authentication handshake failed: x509: certificate signed by unknown authority

You may see this error when deploying a new AKS cluster or adding a node pool to an existing cluster.

  1. Check that the user who ran the command is the same user that installed AKS on Azure Stack HCI or Windows Server. For more information on granting access to multiple users, see Set up multiple administrators.
  2. If the user is the same and the error persists, follow the steps below to resolve the issue:
  • Delete the old management appliance certificate by removing $env:UserProfile\.wssd\kvactl\cloudconfig.
  • Run Repair-AksHciCerts.
  • Run Get-AksHciCluster to check that the issue is fixed.
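
As a minimal sketch, the steps above can be run as a single PowerShell sequence (assuming the AksHci PowerShell module is already installed and imported):

# Remove the old management appliance certificate, then repair and verify
Remove-Item "$env:UserProfile\.wssd\kvactl\cloudconfig"
Repair-AksHciCerts
Get-AksHciCluster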

Set-AksHciConfig fails with WinRM errors, but shows WinRM is configured correctly.

When running Set-AksHciConfig, you might encounter the following error:

WinRM service is already running on this machine.
WinRM is already set up for remote management on this computer.
Powershell remoting to TK5-3WP08R0733 was not successful.
At C:\Program Files\WindowsPowerShell\Modules\Moc\0.2.23\Moc.psm1:2957 char:17
+ ...             throw "Powershell remoting to "+$env:computername+" was n ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (Powershell remo...not successful.:String) [], RuntimeException
    + FullyQualifiedErrorId : Powershell remoting to TK5-3WP08R0733 was not successful.

Most of the time, this error occurs as a result of a change in the user's security token (due to a change in group membership), a password change, or an expired password. In most cases, the issue can be remediated by logging off from the computer and logging back in. If this still fails, you can create a support ticket through the Azure portal.

Using Remote Desktop to connect to the management cluster produces a connection error.

When using Remote Desktop (RDP) to connect to one of the nodes in an Azure Stack HCI cluster and then running Get-AksHciCluster, an error appears and says the connection failed because the host failed to respond.

The connection fails because some PowerShell commands that use kubeconfig-mgmt fail with an error similar to the following one:

Unable to connect to the server: dial tcp 172.168.10.0:6443, where 172.168.10.0 is the IP of the control plane.

The kube-vip pod can go down for two reasons:

  • The memory pressure in the system can slow down etcd, which ends up affecting kube-vip.
  • The kube-apiserver is not available.

To help resolve this issue, try rebooting the machine. However, if memory pressure is the underlying cause, the issue may return.
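
To check whether the kube-vip pod is the component that went down, a quick filter over the pod list can help; this assumes kubectl is pointed at the affected cluster's kubeconfig:

kubectl get pods --all-namespaces | findstr kube-vip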

The workload cluster is not found

The workload cluster may not be found if the IP address pools of two AKS on Azure Stack HCI deployments are the same or overlap. If you deploy two AKS hosts and use the same AksHciNetworkSetting configuration for both, PowerShell and Windows Admin Center will potentially fail to find the workload cluster, because the API server will be assigned the same IP address in both clusters, resulting in a conflict.

The error message you receive will look similar to the example shown below.

A workload cluster with the name 'clustergroup-management' was not found.
At C:\Program Files\WindowsPowerShell\Modules\Kva\0.2.23\Common.psm1:3083 char:9
+         throw $("A workload cluster with the name '$Name' was not fou ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (A workload clus... was not found.:String) [], RuntimeException
    + FullyQualifiedErrorId : A workload cluster with the name 'clustergroup-management' was not found.

Note

Your cluster name will be different.

What is the fix?

Ensure that each AKS on Azure Stack HCI deployment uses its own AksHciNetworkSetting configuration, with IP address pools that don't overlap with those of any other deployment, so that each API server is assigned a distinct IP address.
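
For example, the following sketch keeps two deployments from conflicting by giving each its own subnet and IP pools. It uses the documented static IP parameters of New-AksHciNetworkSetting; all names and address values are illustrative:

# Network settings for the first deployment
$vnet1 = New-AksHciNetworkSetting -name "aksnet1" -vSwitchName "extSwitch" `
    -gateway "192.168.0.1" -dnsServers "192.168.0.2" -ipAddressPrefix "192.168.0.0/24" `
    -k8sNodeIpPoolStart "192.168.0.10" -k8sNodeIpPoolEnd "192.168.0.99" `
    -vipPoolStart "192.168.0.150" -vipPoolEnd "192.168.0.199"

# Network settings for the second deployment: a different subnet, so no pool overlaps the first
$vnet2 = New-AksHciNetworkSetting -name "aksnet2" -vSwitchName "extSwitch" `
    -gateway "192.168.1.1" -dnsServers "192.168.1.2" -ipAddressPrefix "192.168.1.0/24" `
    -k8sNodeIpPoolStart "192.168.1.10" -k8sNodeIpPoolEnd "192.168.1.99" `
    -vipPoolStart "192.168.1.150" -vipPoolEnd "192.168.1.199"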

Creating virtual networks with a similar configuration causes overlap issues.

When you create overlapping network objects using the New-AksHciNetworkSetting and New-AksHciClusterNetwork PowerShell cmdlets, issues can occur. For example, issues may arise when two virtual network configurations are almost identical.

What is the fix?

As in the example above, define each virtual network with a unique name and a unique, non-overlapping address range and IP pools when calling New-AksHciNetworkSetting or New-AksHciClusterNetwork.

Get-AksHciClusterNetwork does not show the current allocation of IP addresses.

Running the Get-AksHciClusterNetwork command provides a list of all virtual network configurations. However, the command does not show the current allocation of the IP addresses.

To find out what IP addresses are currently in use in a virtual network, use the steps below:

  1. To get the group, run the following command:
Get-MocGroup -location MocLocation
  2. To get the list of IP addresses that are currently in use, and the list of available or used virtual IP addresses, run the following command:
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -depth 10
  3. To view the list of virtual IP addresses that are currently in use, run the following command:
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -depth 10
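
As a minimal sketch, the steps above can be combined; this assumes a single group at the default location and that Get-MocGroup returns objects with a Name property:

# Find the MOC group, then dump its network interfaces and load balancers
$group = (Get-MocGroup -location MocLocation | Select-Object -First 1).Name
Get-MocNetworkInterface -Group $group | ConvertTo-Json -Depth 10
Get-MocLoadBalancer -Group $group | ConvertTo-Json -Depth 10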

When you deploy AKS on Azure Stack HCI with a misconfigured network, deployment times out at various points.

When you deploy AKS on Azure Stack HCI, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.

For example, in the following error, the misconfiguration surfaced during Get-DownloadSdkRelease -Name "mocstack-stable":

$vnet = New-AksHciNetworkSetting
Set-AksHciConfig -vnet $vnet
Install-AksHci
VERBOSE: Initializing environment
VERBOSE: [AksHci] Importing Configuration
VERBOSE: [AksHci] Importing Configuration Completed
powershell : GetRelease - error returned by API call:
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True":
dial tcp 52.184.220.11:443: connectex:
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.
At line:1 char:1
+ powershell -command { Get-DownloadSdkRelease -Name "mocstack-stable"}

This indicates that the physical Azure Stack HCI node can resolve the name of the download URL, msk8s.api.cdp.microsoft.com, but the node can't connect to the target server.

To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:

  1. Ping the destination DNS name: ping msk8s.api.cdp.microsoft.com.
  2. If you get a response back and no time-out, then the basic network path is working.
  3. If the connection times out, then there could be a break in the data path. For more information, see check proxy settings. Or, there could be a break in the return path, so you should check the firewall rules.
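
Beyond ping, a TCP-level probe can separate name resolution from connectivity. The built-in Test-NetConnection cmdlet reports both the resolved address and whether the TCP handshake on port 443 succeeded:

Test-NetConnection -ComputerName msk8s.api.cdp.microsoft.com -Port 443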

Network proxy server blocks HTTP requests

When applying the platform configuration, the network proxy server blocked HTTP requests that used the user agent string Google Chrome 65, because the proxy considered this user agent string out of date.

The user agent will be updated to Google Chrome 91 in the next release.