This page walks through several common issues with Kubernetes setup, networking, and deployments.
Suggest an FAQ item by raising a PR to our documentation repository.
This page is subdivided into the following categories:
How do I know start.ps1 on Windows completed successfully?
You should see kubelet, kube-proxy, and (if you chose Flannel as your networking solution) flanneld host-agent processes running on your node, with running logs being displayed in separate PowerShell windows. In addition to this, your Windows node should be listed as “Ready” in your Kubernetes cluster.
Can I configure all of this to run in the background instead of in PowerShell windows?
Starting with Kubernetes version 1.11, kubelet and kube-proxy can be run as native Windows Services. You can also use alternative service managers such as nssm.exe to run these processes (flanneld, kubelet, and kube-proxy) in the background for you. See Windows Services on Kubernetes for example steps.
I have problems running Kubernetes processes as Windows services
For initial troubleshooting, you can use the following flags in nssm.exe to redirect stdout and stderr to an output file:
nssm set <Service Name> AppStdout C:\k\mysvc.log
nssm set <Service Name> AppStderr C:\k\mysvc.log
For additional details, see official nssm usage docs.
Common networking errors
Load balancers are plumbed inconsistently across the cluster nodes
On Windows, kube-proxy creates an HNS load balancer for every Kubernetes service in the cluster. In the (default) kube-proxy configuration, nodes in clusters containing many (usually 100+) load balancers may run out of available ephemeral TCP ports (a.k.a. the dynamic port range, which by default covers ports 49152 through 65535). This is due to the high number of ports reserved on each node for every (non-DSR) load balancer. This issue may manifest itself through errors in kube-proxy such as:
Policy creation failed: hcnCreateLoadBalancer failed in Win32: The specified port already exists.
Users can identify this issue by running CollectLogs.ps1 script and consulting the
CollectLogs.ps1 will also mimic HNS allocation logic to test port pool allocation availability in the ephemeral TCP port range, and report success/failure in reservedports.txt. The script reserves 10 ranges of 64 TCP ephemeral ports (to emulate HNS behavior), counts reservation successes and failures, then releases the allocated port ranges. A success count below 10 indicates that the ephemeral pool is running out of free space. A heuristic summary of approximately how many 64-port block reservations are available will also be generated in
To resolve this issue, a few steps can be taken:
- For a permanent solution, kube-proxy load balancing should be set to DSR mode. DSR mode is fully implemented and available only on Windows Server Insider build 18945 or higher.
- As a workaround, users can also increase the default Windows configuration of ephemeral ports available using a command such as netsh int ipv4 set dynamicportrange TCP <start_port> <port_count>. WARNING: Overriding the default dynamic port range can have consequences on other processes/services on the host that rely on available TCP ports from the non-ephemeral range, so this range should be selected carefully.
- There is a scalability enhancement to non-DSR mode load balancers using intelligent port pool sharing, which is scheduled to be released through a cumulative update in Q1 2020.
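To get a rough sense of the numbers involved, the Python sketch below counts how many whole 64-port blocks fit in a given dynamic port range. It is only illustrative arithmetic: the 64-port block size is taken from the CollectLogs.ps1 description above, `count_port_blocks` is a hypothetical helper name, and the number of blocks HNS can actually reserve will be lower because other processes also consume ephemeral ports.

```python
MAX_TCP_PORT = 65535
HNS_BLOCK_SIZE = 64  # block size as described for CollectLogs.ps1 (assumption for this sketch)


def count_port_blocks(start_port: int, port_count: int,
                      block_size: int = HNS_BLOCK_SIZE) -> int:
    """Return how many whole blocks of `block_size` ports fit in the range.

    `start_port` and `port_count` correspond to the two values passed to
    `netsh int ipv4 set dynamicportrange TCP <start_port> <port_count>`.
    """
    if start_port < 1 or port_count < 1:
        raise ValueError("start_port and port_count must be positive")
    end_port = start_port + port_count - 1
    if end_port > MAX_TCP_PORT:
        raise ValueError(f"range ends at {end_port}, beyond {MAX_TCP_PORT}")
    return port_count // block_size


if __name__ == "__main__":
    # Default Windows dynamic range: 49152 through 65535 (16384 ports).
    print(count_port_blocks(49152, 16384))  # 256
    # A hypothetical enlarged range starting at port 10000.
    print(count_port_blocks(10000, 55536))  # 867
```

The upper bound grows with the size of the range, which is why enlarging the range helps; it does not say anything about which ports other services on the host already depend on.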
HostPort publishing is not working
It is currently not possible to publish ports using the Kubernetes containers.ports.hostPort field, as this field is not honored by Windows CNI plugins. Please use NodePort publishing for the time being to publish ports on the node.
I am seeing errors such as "hnsCall failed in Win32: The wrong diskette is in the drive."
This error can occur when making custom modifications to HNS objects or when installing a new Windows Update that introduces changes to HNS without tearing down old HNS objects. It indicates that an HNS object created before an update is incompatible with the currently installed HNS version.
On Windows Server 2019 (and below), users can delete HNS objects by deleting the HNS.data file:
Stop-Service HNS
rm C:\ProgramData\Microsoft\Windows\HNS\HNS.data
Start-Service HNS
Users should be able to directly delete any incompatible HNS endpoints or networks:
hnsdiag list endpoints
hnsdiag delete endpoints <id>
hnsdiag list networks
hnsdiag delete networks <id>
Restart-Service HNS
Users on Windows Server, version 1903 can go to the following registry location and delete any NICs starting with the network name (e.g.
Containers on my Flannel host-gw deployment on Azure cannot reach the internet
When deploying Flannel in host-gw mode on Azure, packets have to go through the Azure physical host vSwitch. Users should program user-defined routes (UDRs) of type "virtual appliance" for each subnet assigned to a node. This can be done through the Azure portal (see an example here) or via the az Azure CLI. Here is one example UDR named "MyRoute", created with az commands for a node with IP 10.0.0.4 and its pod subnet 10.244.0.0/24:
az network route-table create --resource-group <my_resource_group> --name BridgeRoute
az network route-table route create --resource-group <my_resource_group> --address-prefix 10.244.0.0/24 --route-table-name BridgeRoute --name MyRoute --next-hop-type VirtualAppliance --next-hop-ip-address 10.0.0.4
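When several nodes need routes, the per-node commands can be generated rather than typed by hand. The following Python sketch builds the same two az commands as above for a mapping of node IPs to pod subnets. The helper name `build_udr_commands` and the generated route names are hypothetical; only flags that appear in the example above are used, and the sketch prints the commands without running them.

```python
import ipaddress
import shlex


def build_udr_commands(resource_group: str, route_table: str, nodes: dict) -> list:
    """Build az CLI argument lists: one route table, then one route per node.

    `nodes` maps a node IP (e.g. "10.0.0.4") to its pod subnet (e.g. "10.244.0.0/24").
    """
    commands = [[
        "az", "network", "route-table", "create",
        "--resource-group", resource_group,
        "--name", route_table,
    ]]
    for node_ip, pod_subnet in sorted(nodes.items()):
        # Fail early on malformed input; both calls raise ValueError.
        ipaddress.ip_address(node_ip)
        ipaddress.ip_network(pod_subnet)
        # Hypothetical naming scheme for the route, derived from the node IP.
        route_name = "pods-via-" + node_ip.replace(".", "-")
        commands.append([
            "az", "network", "route-table", "route", "create",
            "--resource-group", resource_group,
            "--address-prefix", pod_subnet,
            "--route-table-name", route_table,
            "--name", route_name,
            "--next-hop-type", "VirtualAppliance",
            "--next-hop-ip-address", node_ip,
        ])
    return commands


if __name__ == "__main__":
    example_nodes = {"10.0.0.4": "10.244.0.0/24", "10.0.0.5": "10.244.1.0/24"}
    for command in build_udr_commands("my_resource_group", "BridgeRoute", example_nodes):
        print(shlex.join(command))
```

Keeping each command as an argument list avoids quoting mistakes if the output is later passed to a process runner instead of being pasted into a shell.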
If you are deploying Kubernetes on Azure or IaaS VMs from other cloud providers yourself, you can also use overlay networking instead.
My Windows pods cannot ping external resources
Windows pods do not have outbound rules programmed for the ICMP protocol today. However, TCP/UDP is supported. When trying to demonstrate connectivity to resources outside of the cluster, please substitute ping <IP> with corresponding curl <IP> commands.
If you are still facing problems, most likely your network configuration in cni.conf deserves some extra attention. You can always edit this static file; the configuration will be applied to any newly created Kubernetes resources.
One of the Kubernetes networking requirements (see Kubernetes model) is for cluster communication to occur without NAT internally. To honor this requirement, we have an ExceptionList for all the communication where we do not want outbound NAT to occur. However, this also means that you need to exclude the external IP you are trying to query from the ExceptionList. Only then will the traffic originating from your Windows pods be SNAT’ed correctly to receive a response from the outside world. In this regard, your ExceptionList in cni.conf should look as follows:
"ExceptionList": [
    "10.244.0.0/16",  # Cluster subnet
    "10.96.0.0/12",   # Service subnet
    "10.127.130.0/24" # Management (host) subnet
]
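One way to reason about this is to check whether a destination address falls inside any ExceptionList prefix: if it does, outbound NAT is skipped for it. The following Python sketch uses the standard ipaddress module with the example prefixes above; `bypasses_outbound_nat` is a hypothetical helper for illustration, not part of any CNI plugin.

```python
import ipaddress

# Example prefixes copied from the ExceptionList above.
EXCEPTION_LIST = [
    "10.244.0.0/16",    # Cluster subnet
    "10.96.0.0/12",     # Service subnet
    "10.127.130.0/24",  # Management (host) subnet
]


def bypasses_outbound_nat(destination: str, exception_list=EXCEPTION_LIST) -> bool:
    """Return True if `destination` matches an ExceptionList prefix (no outbound NAT)."""
    address = ipaddress.ip_address(destination)
    return any(address in ipaddress.ip_network(prefix) for prefix in exception_list)


if __name__ == "__main__":
    for destination in ("10.244.3.7", "10.100.0.1", "8.8.8.8"):
        print(destination, bypasses_outbound_nat(destination))
```

An external address that should be reachable from pods is expected to return False here; if it returns True, one of the prefixes is too broad for that destination.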
My Windows node cannot access a NodePort service
Local NodePort access from the node itself will fail. This is a known limitation. NodePort access will work from other nodes or external clients.
After some time, vNICs and HNS endpoints of containers are being deleted
This issue can be caused when the hostname-override parameter is not passed to kube-proxy. To resolve it, users need to pass the hostname to kube-proxy as follows:
On Flannel (vxlan) mode, my pods are having connectivity issues after rejoining the node
Whenever a previously deleted node is rejoined to the cluster, flanneld will try to assign a new pod subnet to the node. Users should remove the old pod subnet configuration files in the following paths:
Remove-Item C:\k\SourceVip.json
Remove-Item C:\k\SourceVipRequest.json
After launching start.ps1, Flanneld is stuck in "Waiting for the Network to be created"
There are numerous reports of this issue which are being investigated; most likely it is a timing issue for when the management IP of the flannel network is set. A workaround is to simply relaunch start.ps1 or relaunch it manually as follows:
[Environment]::SetEnvironmentVariable("NODE_NAME", "<Windows_Worker_Hostname>")
C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface=<Windows_Worker_Node_IP> --ip-masq=1 --kube-subnet-mgr=1
A PR that addresses this issue is also currently under review.
My Windows pods cannot launch because of missing /run/flannel/subnet.env
This indicates that Flannel didn't launch correctly. You can either try to restart flanneld.exe, or you can copy the file over manually from /run/flannel/subnet.env on the Kubernetes master to C:\run\flannel\subnet.env on the Windows worker node and modify the FLANNEL_SUBNET row to the subnet that was assigned. For example, if node subnet 10.244.4.1/24 was assigned:
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.4.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true
It is safer to let flanneld.exe generate this file for you.
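If you do edit the file by hand, a small sanity check can catch a FLANNEL_SUBNET value that does not belong to FLANNEL_NETWORK. The Python sketch below parses the KEY=VALUE format shown above and compares the two values with the standard ipaddress module. The function names are hypothetical, and the check covers only this one consistency rule, not everything flanneld expects.

```python
import ipaddress


def parse_subnet_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and lines without '='."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def subnet_is_consistent(values: dict) -> bool:
    """Return True if FLANNEL_SUBNET lies inside FLANNEL_NETWORK."""
    network = ipaddress.ip_network(values["FLANNEL_NETWORK"])
    # FLANNEL_SUBNET carries a host address (e.g. 10.244.4.1/24), so it is
    # parsed as an interface and reduced to its containing network.
    subnet = ipaddress.ip_interface(values["FLANNEL_SUBNET"]).network
    return subnet.subnet_of(network)


if __name__ == "__main__":
    example = """\
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.4.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true
"""
    print(subnet_is_consistent(parse_subnet_env(example)))  # True
```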
Pod-to-pod connectivity between hosts is broken on my Kubernetes cluster running on vSphere
Since both vSphere and Flannel reserve port 4789 (the default VXLAN port) for overlay networking, packets can end up being intercepted. If vSphere is used for overlay networking, it should be configured to use a different port in order to free up 4789.
My endpoints/IPs are leaking
There are currently two known issues that can cause endpoints to leak.
- The first known issue is a problem in Kubernetes version 1.11. Please avoid using Kubernetes version 1.11.0 - 1.11.2.
- The second known issue that can cause endpoints to leak is a concurrency problem in the storage of endpoints. To receive the fix, you must use Docker EE 18.09 or above.
My pods cannot launch due to "network: failed to allocate for range" errors
This indicates that the IP address space on your node is used up. To clean up any leaked endpoints, please migrate any resources on impacted nodes and run the following commands:
c:\k\stop.ps1
Get-HNSEndpoint | Remove-HNSEndpoint
Remove-Item -Recurse c:\var
My Windows node cannot access my services using the service IP
This is a known limitation of the current networking stack on Windows. Windows pods, however, are able to access the service IP.
No network adapter is found when starting Kubelet
The Windows networking stack needs a virtual adapter for Kubernetes networking to work. If the following commands return no results (in an admin shell), virtual network creation — a necessary prerequisite for Kubelet to work — has failed:
Get-HnsNetwork | ? Name -ieq "cbr0"
Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*"
Often it is worthwhile to modify the InterfaceName parameter of the start.ps1 script, in cases where the host's network adapter isn't "Ethernet". Otherwise, consult the output of the start-kubelet.ps1 script to see if there are errors during virtual network creation.
Pods stop resolving DNS queries successfully after some time alive
There is a known DNS caching issue in the networking stack of Windows Server, version 1803 and below that may sometimes cause DNS requests to fail. To work around this issue, you can set the max TTL cache values to zero using the following registry keys:
FROM microsoft/windowsservercore:<your-build>
SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop';"]
RUN New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxCacheTtl -Value 0 -Type DWord
RUN New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxNegativeCacheTtl -Value 0 -Type DWord
I am still seeing problems. What should I do?
There may be additional restrictions in place on your network or on hosts preventing certain types of communication between nodes. Ensure that:
- you have properly configured your chosen network topology
- traffic that looks like it's coming from pods is allowed
- HTTP traffic is allowed, if you are deploying web services
- packets from different protocols (e.g., ICMP vs. TCP/UDP) are not being dropped
For additional self-help resources, there is also a Kubernetes troubleshooting guide for Windows available here.
Common Windows errors
My Kubernetes pods are stuck at "ContainerCreating"
This issue can have many causes, but one of the most common is that the pause image was misconfigured. This is a high-level symptom of the next issue.
When deploying, Docker containers keep restarting
Check that your pause image is compatible with your OS version. The instructions assume that both the OS and the containers are version 1803. If you have a later version of Windows, such as an Insider build, you will need to adjust the images accordingly. Please refer to Microsoft's Docker repository for images. Regardless, both the pause image Dockerfile and the sample service will expect the image to be tagged as
Common Kubernetes master errors
Debugging the Kubernetes master falls into three main categories (in order of likelihood):
- Something is wrong with the Kubernetes system containers.
- Something is wrong with the way
- Something is wrong with the system.
Run kubectl get pods -n kube-system to see the pods being created by Kubernetes; this may give some insight into which particular ones are crashing or not starting correctly. Then, run docker ps -a to see all of the raw containers that back these pods. Finally, run docker logs [ID] on the container(s) that are suspected to be causing the problem to see the raw output of the processes.
Cannot connect to the API server at
More often than not, this error indicates certificate problems. Ensure that you have generated the configuration file correctly, that the IP addresses in it match those of your host, and that you have copied it to the directory that is mounted by the API server.
If you are following our instructions, good places to find it are:
otherwise, refer to the API server's manifest file to check the mount points.