Chapter 15 of 20 — Container & Kubernetes Networking

Kubernetes Network Troubleshooting — Debug Pods, tcpdump & Connectivity

By Vikas Swami, CCIE #22239 | Updated Mar 2026 | Free Course

What Kubernetes Network Troubleshooting Is and Why It Matters in 2026

Kubernetes network troubleshooting is the systematic process of diagnosing and resolving connectivity failures, DNS resolution issues, service mesh misconfigurations, and packet-level anomalies within containerized workloads running on Kubernetes clusters. As enterprises migrate 70-80% of their production workloads to Kubernetes by 2026, network engineers face a major change: traditional ICMP ping and traceroute workflows no longer suffice when pods are ephemeral, IP addresses change every deployment, and CNI plugins abstract the underlying network fabric. In our HSR Layout lab, we observe that 60% of production incidents reported by our 4-month paid internship candidates at Cisco India and Akamai India stem from misconfigured NetworkPolicies, CoreDNS failures, or CNI plugin incompatibilities—issues invisible to legacy monitoring tools.

The stakes are high. A single misconfigured kube-proxy iptables rule can blackhole traffic to critical microservices, causing revenue loss and SLA breaches. Indian enterprises deploying on-premises Kubernetes clusters—especially in BFSI sectors regulated by RBI and SEBI—must maintain audit trails of network flows, making troubleshooting both a technical and compliance imperative. Unlike traditional data center networks where VLANs and routing tables remain static for months, Kubernetes networks are fluid: a pod restart triggers new IP allocation, service endpoints update dynamically, and ingress controllers rewrite HTTP headers in real time. Mastering Kubernetes network troubleshooting separates junior DevOps engineers from senior Site Reliability Engineers commanding ₹12-18 LPA salaries at Aryaka, Movate, and HCL.

How Kubernetes Networking Works Under the Hood

Before troubleshooting, you must understand the four-layer Kubernetes networking model (these are conceptual layers of the Kubernetes stack, not OSI layers). At the first layer, the Container Network Interface (CNI) plugin—Calico, Flannel, Cilium, or Weave—assigns IP addresses to pods and programs routing rules on worker nodes. Each pod receives a unique IP from the cluster CIDR range (e.g., 10.244.0.0/16), and the CNI ensures pod-to-pod communication across nodes without NAT. At the second layer, kube-proxy runs on every node, watching the Kubernetes API server for Service and Endpoints objects. When you create a ClusterIP Service, kube-proxy programs iptables or IPVS rules to load-balance traffic across backend pods. At the third layer, CoreDNS provides service discovery: when a pod queries my-service.default.svc.cluster.local, CoreDNS returns the ClusterIP, and kube-proxy forwards packets to healthy pods. At the fourth layer, Ingress controllers (NGINX, Traefik, Istio Gateway) terminate external HTTPS connections and route HTTP requests to backend Services based on host headers and URL paths.

The troubleshooting complexity arises from interdependencies. A DNS failure might stem from a CoreDNS pod crash, a NetworkPolicy blocking UDP port 53, or a misconfigured resolv.conf inside application pods. A service unreachable error could originate from incorrect label selectors, missing Endpoints, or a CNI plugin bug that fails to program routes. In production clusters at Wipro and TCS—where our placement candidates work—we see cascading failures: a node running out of IP addresses causes new pods to stay in Pending state, triggering autoscaler thrashing, which overwhelms the API server and delays kube-proxy rule updates, ultimately breaking service discovery for unrelated workloads.

The Packet Journey: Pod to External API

When a pod at 10.244.1.5 sends an HTTPS request to an external API at 203.0.113.50, the packet traverses six hops. First, the pod's network namespace routes the packet to the veth pair endpoint on the host. Second, the host's routing table forwards it to the default gateway (often the node's primary interface). Third, if a NetworkPolicy applies, the CNI plugin's iptables chains evaluate rules—if denied, the packet drops here. Fourth, an iptables MASQUERADE rule on the node (programmed by the CNI plugin) performs SNAT, replacing the pod IP with the node's IP; note that overlay encapsulation (VXLAN for Flannel, IP-in-IP for Calico) applies only to pod-to-pod traffic between nodes, not to internet-bound packets. Fifth, the packet exits the node via the physical NIC and reaches the data center router. Sixth, the packet reaches the internet gateway and routes to the destination. The return path reverses this flow, with kube-proxy performing DNAT if the response targets a Service ClusterIP. Understanding this journey is critical: if you see asymmetric routing or blackholed packets, you know to check iptables FORWARD chains, CNI plugin logs, and node-level routing tables.

Essential Tools for Kubernetes Network Troubleshooting

Kubernetes network troubleshooting requires a hybrid toolkit combining container-native utilities and traditional packet analysis. The kubectl CLI is your primary interface: kubectl get pods -o wide reveals pod IPs and node assignments, kubectl describe service my-svc shows Endpoints and label selectors, and kubectl logs coredns-xyz -n kube-system exposes DNS query logs. For interactive debugging, kubectl exec -it my-pod -- /bin/sh drops you into a pod's shell, where you can run curl, nslookup, or ping—assuming the container image includes these binaries. Many production images (Alpine-based, distroless) strip debugging tools to reduce attack surface, so you must deploy ephemeral debug containers using kubectl debug (ephemeral containers reached beta in Kubernetes 1.23 and became stable in 1.25).

For packet-level inspection, tcpdump remains indispensable. Each pod's network namespace connects to the host through a veth pair, so you can capture pod traffic from the node: tcpdump -i cali123abc -nn port 443 (Calico interface naming). To identify the correct interface, run ip route | grep $(kubectl get pod my-pod -o jsonpath='{.status.podIP}') on the node; Calico programs a /32 route to each pod IP via its cali interface. For encrypted traffic, you need session keys—Istio and Linkerd service meshes expose sidecar proxy logs that include plaintext payloads before TLS encryption. In our HSR Layout lab, we teach candidates to use nsenter to enter a pod's network namespace from the host: nsenter -t $(docker inspect --format '{{.State.Pid}}' container-id) -n tcpdump -i eth0 (on containerd-based nodes, obtain the container PID with crictl inspect instead of docker inspect). This technique bypasses the need for debug containers and works even when the pod is CrashLooping.

Debug Pods and Ephemeral Containers

A debug pod is a temporary workload deployed with network troubleshooting tools pre-installed. Create one using kubectl run netshoot --image=nicolaka/netshoot -it --rm -- /bin/bash. The nicolaka/netshoot image bundles tcpdump, nmap, iperf3, curl, dig, and mtr. From this pod, test connectivity to Services (curl http://my-service.default.svc.cluster.local), verify DNS resolution (nslookup my-service.default.svc.cluster.local), and scan open ports (nmap -p 1-65535 10.96.0.1). For troubleshooting a specific failing pod, use ephemeral containers: kubectl debug my-pod -it --image=nicolaka/netshoot --target=my-container. This injects a sidecar container into the existing pod, sharing its network namespace, so you see the exact network view the application sees—critical when NetworkPolicies or service mesh sidecars alter traffic flow.
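As a sketch, the kubectl run one-liner above is equivalent to this minimal manifest (the pod name and namespace are arbitrary; the sleep command keeps the pod alive for interactive exec sessions):

```yaml
# Minimal debug pod, equivalent to the `kubectl run netshoot` command above.
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: netshoot
      image: nicolaka/netshoot
      command: ["sleep", "infinity"]   # keep the pod alive for interactive exec
```

Apply it with kubectl apply -f netshoot.yaml, then attach with kubectl exec -it netshoot -- bash and run curl, dig, or nmap from inside the cluster network.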

Diagnosing Pod-to-Pod Connectivity Failures

Pod-to-pod connectivity failures manifest as connection timeouts, DNS resolution errors, or HTTP 503 responses. Start with the OSI model bottom-up. At Layer 3, verify IP reachability: kubectl exec source-pod -- ping 10.244.2.10. If ping succeeds but application traffic fails, the issue is at Layer 4 or above. If ping fails, check the CNI plugin status: kubectl get pods -n kube-system | grep calico (or flannel, cilium). A CrashLooping CNI pod means no new pods receive IP addresses, and existing pods lose connectivity during CNI restarts. Inspect CNI logs: kubectl logs calico-node-xyz -n kube-system. Common errors include IPAM exhaustion ("no IPs available in pool"), BGP peering failures (Calico), or VXLAN port conflicts (Flannel).

At Layer 4, verify the target pod is listening on the expected port. Deploy a debug pod in the same namespace: kubectl run netshoot --image=nicolaka/netshoot -it --rm -- /bin/bash, then run nc -zv 10.244.2.10 8080. If the connection is refused, the application inside the pod isn't bound to 0.0.0.0:8080—check the application's listen address and port configuration (the Dockerfile's EXPOSE directive is documentation only and does not affect binding). If the connection times out, a NetworkPolicy is likely blocking traffic. List policies: kubectl get networkpolicies -n my-namespace. Kubernetes NetworkPolicies are whitelist-based: if any policy selects a pod, all traffic not explicitly allowed is denied. A common mistake is creating an ingress policy without an egress policy, breaking DNS lookups because UDP port 53 to CoreDNS is blocked. In our 4-month paid internship at Cisco India, candidates debug this by temporarily deleting the NetworkPolicy (kubectl delete netpol my-policy) to confirm it's the culprit, then refining the policy's egress rules to allow DNS.

Service Discovery and DNS Failures

When a pod cannot resolve service names, test DNS directly: kubectl exec my-pod -- nslookup my-service.default.svc.cluster.local. If this times out, check the pod's /etc/resolv.conf: kubectl exec my-pod -- cat /etc/resolv.conf. It should contain nameserver 10.96.0.10 (the ClusterIP of the kube-dns Service) and search default.svc.cluster.local svc.cluster.local cluster.local. If resolv.conf is empty or points to an external DNS server, the kubelet's --cluster-dns and --cluster-domain flags are misconfigured. Verify CoreDNS is running: kubectl get pods -n kube-system -l k8s-app=kube-dns. If CoreDNS pods are Pending, check node resources—the default CoreDNS deployment requests 100m CPU and 70Mi memory, with a 170Mi memory limit. If CoreDNS is Running but DNS queries fail, inspect its ConfigMap: kubectl get configmap coredns -n kube-system -o yaml. A misconfigured forward plugin (e.g., forwarding to an unreachable upstream DNS server) causes all external domain lookups to hang.

Troubleshooting Service and Ingress Issues

A Kubernetes Service is an abstraction over a set of pods, identified by a label selector. When you create a Service, the control plane creates an Endpoints object listing the IPs of matching pods. If curl http://my-service.default.svc.cluster.local returns "connection refused" or times out, first check the Endpoints: kubectl get endpoints my-service. If the Endpoints list is empty, the Service's label selector doesn't match any pods. Compare the Service spec (kubectl get svc my-service -o yaml) with pod labels (kubectl get pods --show-labels). A single character typo—app: myapp vs. app: my-app—causes zero Endpoints. If Endpoints exist but traffic still fails, verify kube-proxy is running on all nodes: kubectl get pods -n kube-system -l k8s-app=kube-proxy. On each node, kube-proxy programs iptables rules (or IPVS load-balancing) to forward Service ClusterIP traffic to pod IPs. Inspect iptables: iptables-save | grep my-service. You should see DNAT rules mapping the ClusterIP to pod IPs.
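To make the label-selector pitfall concrete, here is a minimal sketch (all names hypothetical) where a one-character label mismatch produces an empty Endpoints list:

```yaml
# The Service selects app: my-app, but the pod is labeled app: myapp.
# The Endpoints object stays empty and every request to the Service fails.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app          # must match the pod's labels exactly
  ports:
    - port: 80
      targetPort: 8080   # must match the container's listening port
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
  labels:
    app: myapp           # BUG: one-character difference, Service never matches
spec:
  containers:
    - name: app
      image: nginx
      ports:
        - containerPort: 8080
```

Running kubectl get endpoints my-service against this pair shows no addresses until the pod label is corrected to app: my-app.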

Ingress troubleshooting adds another layer. An Ingress resource defines HTTP routing rules, but the Ingress controller (NGINX, Traefik, HAProxy) implements them. If external clients cannot reach your application via the Ingress hostname, first verify the Ingress controller pod is Running: kubectl get pods -n ingress-nginx. Check the Ingress resource: kubectl describe ingress my-ingress. The Address field should show the external IP or hostname of the LoadBalancer Service fronting the Ingress controller. If it's empty, the cloud provider's load balancer provisioning failed—check the LoadBalancer Service events: kubectl describe svc ingress-nginx-controller -n ingress-nginx. If the Address is present but HTTP requests return 404, the Ingress controller hasn't picked up the Ingress rules. Inspect controller logs: kubectl logs ingress-nginx-controller-xyz -n ingress-nginx. Common errors include TLS secret not found, backend Service not found, or path regex syntax errors. At Akamai India and Aryaka—where our placement candidates deploy edge Kubernetes clusters—Ingress misconfigurations cause customer-facing outages, so we drill candidates on reading NGINX access logs (kubectl logs ingress-nginx-controller-xyz -n ingress-nginx | grep "GET /api") to trace request paths.

NodePort and LoadBalancer Service Debugging

NodePort Services expose a pod on a static port (30000-32767) on every node's IP. If external clients cannot connect to node-ip:30080, first verify the Service type: kubectl get svc my-service -o jsonpath='{.spec.type}'. If it's ClusterIP, the NodePort isn't allocated. Change it to NodePort: kubectl patch svc my-service -p '{"spec":{"type":"NodePort"}}'. Next, check firewall rules on the nodes. Cloud providers (AWS, Azure, GCP) require security group rules allowing inbound traffic on the NodePort range. On-premises clusters require firewall exceptions. Test from the node itself: curl localhost:30080. If this succeeds but external access fails, the issue is network-level, not Kubernetes-level. For LoadBalancer Services, the cloud provider provisions an external load balancer and updates the Service's status.loadBalancer.ingress field. If this field remains empty after 5 minutes, check the cloud controller manager logs: kubectl logs cloud-controller-manager-xyz -n kube-system. Quota limits, IAM permission errors, or VPC subnet exhaustion prevent load balancer creation.

Using tcpdump and Packet Captures in Kubernetes

Packet captures are the gold standard for diagnosing intermittent failures, TLS handshake errors, and protocol-level bugs. To capture traffic from a specific pod, first identify the pod's network interface on the host node. SSH into the node running the pod (kubectl get pod my-pod -o wide shows the node name), then list interfaces: ip link. Calico names interfaces caliXXX, Flannel uses vethXXX. Match the interface to the pod IP: ip route | grep 10.244.1.5 (the host-side veth carries no IP address of its own, but Calico installs a per-pod /32 route through it). Once identified, run tcpdump -i cali123abc -w /tmp/capture.pcap. For encrypted traffic, capture both client and server sides, then use Wireshark's "Follow TLS Stream" with session keys exported from the application (for example via the SSLKEYLOGFILE mechanism, if the application supports it). In service mesh environments (Istio, Linkerd), the sidecar proxy terminates TLS, so capture traffic on the lo interface between the application container and the sidecar: tcpdump -i lo port 15001 -w /tmp/envoy.pcap.

For real-time analysis, pipe tcpdump output to grep: tcpdump -i cali123abc -nn -A | grep "HTTP/1.1". This shows HTTP request and response headers in plaintext, useful for debugging API gateway routing. To capture only DNS queries: tcpdump -i any port 53 -nn. If you see queries to my-service.default.svc.cluster.local but no responses, CoreDNS is either down or the NetworkPolicy blocks UDP port 53. To capture traffic between two specific pods, use BPF filters: tcpdump -i any 'host 10.244.1.5 and host 10.244.2.10' -w /tmp/pod-to-pod.pcap. At HCL and Movate—where our candidates manage multi-tenant Kubernetes clusters—packet captures prove whether a customer's complaint ("my app is slow") stems from network latency, application bugs, or database query performance. We teach candidates to calculate TCP retransmission rates in Wireshark: if >5% of packets are retransmitted, the underlying network (CNI overlay, node NICs, or data center fabric) is lossy.

Capturing Traffic from CrashLooping Pods

When a pod CrashLoops, it restarts every few seconds, making interactive debugging impossible. To capture its traffic, deploy a sidecar container in the same pod that runs tcpdump continuously. Edit the pod's Deployment: kubectl edit deployment my-app, and add a sidecar container. Containers in the same pod already share a network namespace, so the sidecar sees the application's traffic without hostNetwork; it only needs the NET_RAW and NET_ADMIN capabilities (or privileged: true) to run tcpdump. The sidecar runs tcpdump -i any -w /shared/capture.pcap, writing to a shared emptyDir volume. After the main container crashes, retrieve the capture: kubectl cp my-pod:/shared/capture.pcap ./capture.pcap -c sidecar. This technique is invaluable for debugging startup failures caused by unreachable databases, misconfigured health checks, or DNS resolution delays. Founder Vikas Swami used this approach while architecting QuickZTNA, where zero-trust agents running in Kubernetes pods needed to establish mTLS tunnels to a central controller within 2 seconds—packet captures revealed that DNS lookups were taking 8 seconds due to a CoreDNS cache miss, which we fixed by pre-populating the cache with critical service names.
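A minimal sketch of the capture-sidecar pattern, assuming a hypothetical my-registry/my-app image for the crashing container and reusing the netshoot image for the sniffer:

```yaml
# Capture sidecar sketch: both containers share the pod network namespace,
# so the sniffer records the app's traffic even while the app CrashLoops.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  volumes:
    - name: captures
      emptyDir: {}                 # survives container restarts within the pod
  containers:
    - name: app
      image: my-registry/my-app:latest   # placeholder: the crashing container
    - name: sniffer
      image: nicolaka/netshoot
      command: ["tcpdump", "-i", "any", "-w", "/shared/capture.pcap"]
      securityContext:
        capabilities:
          add: ["NET_RAW", "NET_ADMIN"]  # required for packet capture
      volumeMounts:
        - name: captures
          mountPath: /shared
```

Retrieve the capture afterwards with kubectl cp my-app:/shared/capture.pcap ./capture.pcap -c sniffer.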

NetworkPolicy Debugging and Enforcement

Kubernetes NetworkPolicies are namespace-scoped firewall rules that control ingress and egress traffic to pods. They are implemented by the CNI plugin (Calico, Cilium, Weave support them; Flannel does not). A common mistake is assuming NetworkPolicies are deny-by-default: they are allow-by-default until you create a policy selecting a pod. Once a policy selects a pod, all traffic not explicitly allowed is denied. To debug a suspected NetworkPolicy block, first list all policies in the namespace: kubectl get networkpolicies -n my-namespace. Describe each policy: kubectl describe networkpolicy my-policy. Check the podSelector field—if it matches your pod's labels, the policy applies. Examine ingress and egress rules. Note that enforcement is connection-aware: CNI plugins use conntrack, so reply packets for an allowed connection are permitted automatically, and you do not need an egress rule just for return traffic. You do, however, need egress rules for any new outbound connections the pod initiates, such as DNS lookups to CoreDNS.

To test if a NetworkPolicy is the culprit, temporarily delete it: kubectl delete networkpolicy my-policy. If connectivity restores, the policy is too restrictive. Refine it by adding explicit allow rules. For example, to allow DNS, add an egress rule permitting UDP port 53 to the kube-dns Service CIDR: egress: - to: - namespaceSelector: matchLabels: name: kube-system ports: - protocol: UDP port: 53. To allow all egress traffic (common during development), use an empty egress: [{}] rule. In production environments at Wipro and TCS, we enforce least-privilege NetworkPolicies: each microservice has an ingress policy allowing traffic only from its immediate upstream dependencies, and an egress policy allowing traffic only to its downstream dependencies and CoreDNS. This zero-trust model prevents lateral movement during breaches—a requirement for CERT-In compliance in Indian BFSI deployments.
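Written out as a complete manifest, the DNS-allow egress rule above looks like this (the policy name, namespace, and app label are hypothetical; since Kubernetes 1.21 every namespace automatically carries the kubernetes.io/metadata.name label used here):

```yaml
# Egress policy restoring DNS for the selected pods. Add further egress rules
# alongside this one for the app's real downstream dependencies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-app            # hypothetical: scope to the affected workload
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP      # DNS falls back to TCP for large responses
          port: 53
```

After applying it, re-test resolution from the pod with nslookup to confirm UDP and TCP port 53 both reach CoreDNS.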

Visualizing NetworkPolicy Enforcement

Calico provides policy introspection: calicoctl get networkpolicy -o yaml shows the policies as Calico has translated them for its dataplane (the resulting iptables rules themselves can be inspected on the node with iptables-save). For Cilium, use cilium endpoint list to see which policies apply to each pod, then cilium policy get to view the effective policy. In our HSR Layout lab, we use open-source tools like kubectl-netshoot and kubectl-trace (based on eBPF) to trace packet drops in real time. For example, kubectl trace run my-pod -e 'kprobe:ip_rcv { @[comm] = count(); }' counts packets received by the kernel, helping identify if packets reach the node but are dropped by iptables before reaching the pod. This level of introspection is critical for senior SRE roles commanding ₹15-20 LPA at Aryaka and Akamai India, where candidates must root-cause packet loss in multi-cluster service meshes spanning AWS, Azure, and on-premises data centers.

Common Pitfalls and Interview Gotchas

CCIE and CCNP-level interviewers at Cisco India, Barracuda, and Akamai probe deep Kubernetes networking knowledge. A frequent gotcha: "Why does ping work but curl fails?" The answer: pinging a pod IP bypasses kube-proxy's Service load balancing entirely, and the NetworkPolicy API can only filter TCP, UDP, and SCTP, so ICMP handling is left to the CNI plugin. On many setups, ping to a pod succeeds while a policy blocking TCP port 80 causes HTTP requests to fail. Another trap: "A Service has Endpoints, but clients get 'connection refused'." The issue: the Service's targetPort doesn't match the pod's container port. The Service spec might say targetPort: 8080, but the pod listens on port 3000. Always cross-check: kubectl get svc my-service -o jsonpath='{.spec.ports[0].targetPort}' vs. kubectl get pod my-pod -o jsonpath='{.spec.containers[0].ports[0].containerPort}'.

A third gotcha: "CoreDNS is running, but DNS queries for external domains fail." The root cause: the CoreDNS ConfigMap's forward plugin points to an internal DNS server (e.g., 10.0.0.2) that's unreachable from the Kubernetes cluster. The fix: change the forward target to a public DNS server (8.8.8.8) or ensure the internal DNS server is reachable via the CNI network. Interviewers also ask about DNS caching: "Why do DNS queries return stale IPs after a pod restart?" CoreDNS caches responses for the TTL specified by the upstream DNS server. If the upstream TTL is 3600 seconds, clients see stale IPs for up to an hour. The solution: reduce the upstream TTL or configure CoreDNS's cache plugin with a shorter TTL override.

A fourth pitfall: assuming hostNetwork: true pods can always reach Services. A host-network pod bypasses the CNI plugin, so whether ClusterIP Services are reachable depends on the proxy implementation: iptables-mode kube-proxy programs its DNAT rules in the host network namespace, so they usually still apply, but eBPF-based dataplanes that replace kube-proxy may not cover host-namespace sockets unless explicitly configured, and the node still needs working routes to the pod CIDR for the translated traffic. The workaround: use NodePort or LoadBalancer Services, or fix the dataplane configuration and host routes. In our AWS DevOps course in Bangalore, we simulate this scenario by deploying a monitoring agent with hostNetwork: true that needs to scrape metrics from in-cluster Services—candidates must repair node routing or reconfigure the agent to use the Kubernetes API server as a proxy.

Real-World Deployment Scenarios

At Cisco India's Bangalore office, our placement candidates support a multi-region Kubernetes deployment spanning AWS ap-south-1 and on-premises data centers in Whitefield. A recurring issue: cross-cluster service discovery. When a pod in the AWS cluster needs to call a service in the on-premises cluster, DNS resolution fails because CoreDNS only knows about in-cluster Services. The solution: configure CoreDNS with a forward rule for the on-premises cluster's domain (e.g., onprem.example.com) pointing to the on-premises CoreDNS Service's external IP. Alternatively, deploy a service mesh like Istio with multi-cluster support, which federates service discovery across clusters. Candidates debug this by running nslookup my-service.onprem.example.com from a pod and verifying the CoreDNS ConfigMap includes the forward rule.

At Akamai India, edge Kubernetes clusters run on bare-metal servers in ISP co-location facilities. These clusters use Calico with BGP peering to advertise pod CIDRs to upstream routers, enabling direct pod-to-internet routing without NAT. A common failure mode: BGP sessions flap due to MTU mismatches. Calico encapsulates packets in IP-in-IP tunnels (adding 20 bytes overhead), so if the physical NIC MTU is 1500, the effective pod MTU is 1480. When pods send 1500-byte packets, they fragment, causing routers to drop them if the "Don't Fragment" bit is set. The fix: reduce the pod MTU to 1480 by setting veth_mtu: 1480 in the Calico ConfigMap. Candidates verify this with ip link show on the node and, from a pod, ping -M do -s 1452 8.8.8.8 (1452 + 28 bytes of ICMP/IP headers = 1480 bytes, the new pod MTU); a 1472-byte payload should now fail immediately with "message too long" instead of silently blackholing.
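For manifest-based Calico installs, the MTU change described above is a one-line edit to the calico-config ConfigMap (operator-managed installs set spec.calicoNetwork.mtu on the Installation resource instead); a sketch:

```yaml
# MTU fix sketch for a manifest-installed Calico: veth_mtu sets the MTU on
# every pod's veth interface. Restart the calico-node DaemonSet to apply.
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  veth_mtu: "1480"   # 1500 physical MTU minus 20 bytes of IP-in-IP overhead
```

Existing pods keep their old MTU until recreated, so roll workloads after the change and re-run the ping -M do test from a fresh pod.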

At Aryaka, SD-WAN controllers run in Kubernetes and must establish IPsec tunnels to customer edge devices. A challenge: Kubernetes assigns random pod IPs, but IPsec requires stable endpoint IPs. The solution: use a StatefulSet with a headless Service, which assigns predictable DNS names (e.g., controller-0.controller-svc.default.svc.cluster.local). Combine this with a LoadBalancer Service with a static external IP (loadBalancerIP: 203.0.113.10) to ensure customer devices always connect to the same IP. Candidates troubleshoot IPsec tunnel failures by capturing ESP packets (tcpdump -i any esp, i.e., IP protocol 50) and verifying the source IP matches the LoadBalancer IP, not the pod IP.
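A sketch of the stable-identity pattern described above, with hypothetical names, a placeholder image, and the example static IP from the text:

```yaml
# Headless Service: gives each StatefulSet replica a predictable DNS name,
# e.g. controller-0.controller-svc.default.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: controller-svc
spec:
  clusterIP: None
  selector:
    app: controller
  ports:
    - port: 443
---
# LoadBalancer Service: pins the external IP customer devices connect to.
apiVersion: v1
kind: Service
metadata:
  name: controller-external
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10   # static IP from the scenario above
  selector:
    app: controller
  ports:
    - port: 443
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: controller
spec:
  serviceName: controller-svc     # ties ordinal pods to the headless Service
  replicas: 2
  selector:
    matchLabels:
      app: controller
  template:
    metadata:
      labels:
        app: controller
    spec:
      containers:
        - name: controller
          image: my-registry/controller:latest   # placeholder image
```

Note that loadBalancerIP support is provider-specific; some clouds require an annotation or a pre-allocated address instead.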

How Kubernetes Network Troubleshooting Connects to DevOps and Cloud Certifications

Kubernetes network troubleshooting is a core competency for the Certified Kubernetes Administrator (CKA) exam, which weights "Troubleshooting" at 30%—the largest single domain—covering pod connectivity, service discovery, and DNS failures. The CKA exam is hands-on: you must debug a broken cluster within a time limit, using only kubectl and SSH access to nodes. Our AWS DevOps course in Bangalore includes 40 hours of CKA-aligned labs where candidates troubleshoot pre-broken clusters—scenarios include misconfigured NetworkPolicies, CoreDNS crashes, and CNI plugin failures. We also map Kubernetes networking to AWS networking: a Kubernetes Service is analogous to an AWS Application Load Balancer, a NodePort is like an EC2 instance with a security group allowing inbound traffic, and a NetworkPolicy is similar to an AWS Security Group rule.

For the AWS Certified DevOps Engineer – Professional exam, you must understand EKS (Elastic Kubernetes Service) networking, including VPC CNI plugin behavior, security group integration, and ALB Ingress Controller configuration. A common exam question: "Why can't pods in a private subnet reach the internet?" The answer: the VPC CNI assigns pod IPs from the VPC subnet, so pods need a NAT Gateway in the route table. Another question: "How do you expose a Kubernetes Service to the internet in EKS?" Options include LoadBalancer Service (provisions a Classic Load Balancer), Ingress with ALB Ingress Controller (provisions an Application Load Balancer), or NodePort with an external load balancer. Our course covers all three, with cost comparisons—Classic Load Balancers cost ₹1,500/month, ALBs cost ₹2,200/month but support advanced routing, and NodePort is free but requires manual load balancer setup.

The CCNP Enterprise and CCIE Enterprise Infrastructure exams now include SD-WAN and cloud networking, where Kubernetes is increasingly relevant. Cisco's SD-WAN solution (Viptela) can run in Kubernetes, and troubleshooting involves correlating Kubernetes pod logs with SD-WAN overlay tunnel status. For example, if a Viptela vEdge pod cannot establish a DTLS tunnel to the vBond orchestrator, you must check both the pod's network connectivity (kubectl exec vedge-pod -- ping vbond.example.com) and the vEdge configuration (kubectl exec vedge-pod -- viptela-cli show control connections). This cross-domain troubleshooting—Kubernetes + SD-WAN—is what differentiates ₹18-25 LPA senior network engineers at Cisco India from junior engineers.

Frequently Asked Questions

Why does my pod get "connection refused" when accessing a Service, but direct pod IP works?

This indicates kube-proxy is not running or has failed to program iptables rules. Verify kube-proxy pods are Running on all nodes: kubectl get pods -n kube-system -l k8s-app=kube-proxy. Check kube-proxy logs for errors: kubectl logs kube-proxy-xyz -n kube-system. Common issues include kube-proxy unable to reach the API server (check --kubeconfig flag), or iptables rules corrupted by another process. Restart kube-proxy: kubectl delete pod kube-proxy-xyz -n kube-system. On the node, verify iptables rules exist: iptables-save | grep KUBE-SERVICES. If rules are missing, kube-proxy is not functioning.

How do I troubleshoot intermittent DNS failures in Kubernetes?

Intermittent DNS failures often stem from CoreDNS pod resource exhaustion or conntrack table overflow. Check CoreDNS CPU and memory usage: kubectl top pods -n kube-system -l k8s-app=kube-dns. If CPU is >80%, scale CoreDNS: kubectl scale deployment coredns -n kube-system --replicas=3. Check conntrack table on nodes: sysctl net.netfilter.nf_conntrack_count and sysctl net.netfilter.nf_conntrack_max. If count approaches max, increase the limit: sysctl -w net.netfilter.nf_conntrack_max=262144. Also enable CoreDNS query logging: edit the CoreDNS ConfigMap and add log to the Corefile, then check logs for SERVFAIL responses indicating upstream DNS issues.

What causes "no route to host" errors between pods on different nodes?

This indicates a CNI plugin routing failure or firewall blocking inter-node traffic. First, verify the CNI plugin is running: kubectl get pods -n kube-system | grep calico (or your CNI). Check CNI logs for routing errors. For Calico with IP-in-IP, verify the tunl0 interface exists on nodes: ip link show tunl0. For Flannel with VXLAN, verify the flannel.1 interface exists. Next, check node-to-node connectivity: from one node, ping another node's IP. If this fails, the underlying network (VPC, data center fabric) is broken. If node-to-node ping works but pod-to-pod fails, check firewall rules—cloud providers require security group rules allowing all traffic between nodes in the cluster.

How do I capture encrypted traffic between pods in a service mesh?

Service meshes like Istio and Linkerd use sidecar proxies (Envoy, Linkerd-proxy) that terminate TLS. To capture plaintext traffic, use tcpdump on the loopback interface between the application container and the sidecar. For Istio: kubectl exec my-pod -c istio-proxy -- tcpdump -i lo port 15001 -w /tmp/capture.pcap, then copy the file: kubectl cp my-pod:/tmp/capture.pcap ./capture.pcap -c istio-proxy. Alternatively, enable Envoy access logs: kubectl edit configmap istio -n istio-system and set accessLogFile: /dev/stdout. Envoy logs include request/response headers, status codes, and latencies in plaintext, even for encrypted client-to-proxy traffic.

Why can't pods in one namespace reach Services in another namespace?

By default, pods can reach Services in any namespace using the FQDN service-name.namespace.svc.cluster.local. If this fails, a NetworkPolicy in the target namespace is blocking cross-namespace traffic. Check for NetworkPolicies: kubectl get networkpolicies -n target-namespace. If a policy exists with an ingress rule that specifies namespaceSelector, it restricts which namespaces can send traffic. To allow traffic from a specific namespace, add a namespaceSelector matching the source namespace's labels: ingress: - from: - namespaceSelector: matchLabels: name: source-namespace. Verify namespace labels: kubectl get namespace source-namespace --show-labels.
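As a complete manifest, the cross-namespace allow rule from this answer looks like the following sketch, assuming the source namespace carries the label name=source-namespace shown in the verification command:

```yaml
# Ingress policy in the target namespace permitting traffic from pods in the
# labeled source namespace; all other cross-namespace traffic stays denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-source-namespace
  namespace: target-namespace
spec:
  podSelector: {}            # applies to every pod in target-namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: source-namespace
```

If the label is missing, add it first: kubectl label namespace source-namespace name=source-namespace.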

How do I debug "ImagePullBackOff" errors that are actually network-related?

ImagePullBackOff occurs when the kubelet cannot pull a container image from the registry. If the registry is external (Docker Hub, ECR, GCR), verify the node can reach the registry: SSH to the node and run curl https://registry-1.docker.io/v2/. If this fails, the node's internet connectivity is broken—check the default gateway and DNS resolution. If using a private registry with authentication, verify the ImagePullSecret exists: kubectl get secret my-registry-secret, and is referenced in the pod spec: imagePullSecrets: - name: my-registry-secret. For ECR, ensure the node's IAM role has ecr:GetAuthorizationToken and ecr:BatchGetImage permissions. Check kubelet logs on the node: journalctl -u kubelet | grep "Failed to pull image".

What is the difference between ClusterIP, NodePort, and LoadBalancer Services for troubleshooting?

ClusterIP Services are only reachable from within the cluster—if external clients cannot connect, this is expected behavior, not a bug. Use kubectl port-forward for temporary external access during debugging. NodePort Services expose the Service on a static port on every node's IP—if external clients cannot connect, check firewall rules on the nodes and verify the NodePort is allocated: kubectl get svc my-service -o jsonpath='{.spec.ports[0].nodePort}'. LoadBalancer Services provision an external load balancer (cloud provider-specific)—if the external IP is not assigned after 5 minutes, check the cloud controller manager logs and verify quota limits. For troubleshooting, start with ClusterIP (simplest), then add NodePort or LoadBalancer only when external access is required.

How do I troubleshoot Kubernetes networking in air-gapped or offline environments?

Air-gapped clusters cannot reach external DNS servers or container registries, so you must pre-configure internal alternatives. For DNS, configure CoreDNS to forward queries to an internal DNS server: edit the CoreDNS ConfigMap and change the forward plugin target from . /etc/resolv.conf to . 10.0.0.2 (your internal DNS server IP). For container images, set up a local registry (Harbor, Nexus) and configure the kubelet's --pod-infra-container-image flag to pull the pause container from the local registry. Verify nodes can reach the local registry: curl https://registry.internal.example.com/v2/. For troubleshooting, deploy a debug pod with all tools pre-loaded in the local registry, since you cannot pull nicolaka/netshoot from Docker Hub. In our HSR Layout lab, we simulate air-gapped environments for candidates deploying Kubernetes in Indian defense and BFSI sectors, where CERT-In mandates network isolation from the public internet.
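A sketch of the air-gapped CoreDNS ConfigMap described above, with the forward plugin pointed at the internal DNS server from the text instead of /etc/resolv.conf (surrounding Corefile plugins follow the common defaults and may differ in your cluster):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 10.0.0.2   # internal DNS server, replacing /etc/resolv.conf
        cache 30
        loop
        reload
    }
```

CoreDNS reloads the Corefile automatically via the reload plugin; verify with a dig against an internal-only domain from a debug pod.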

Ready to Master Container & Kubernetes Networking?

Join 45,000+ students at Networkers Home. CCIE-certified trainers, 24x7 real lab access, and 100% placement support.

Explore Course