k8s-troubleshooter

Systematic Kubernetes troubleshooting, pod diagnostics, cluster health monitoring, and incident response playbooks.

Introduction

This skill gives DevOps engineers and SREs a toolkit for systematic troubleshooting in Kubernetes environments. It streamlines incident response by automating the collection of diagnostic data and turning it into actionable findings for production issues. Whether you are dealing with common scheduling errors, resource constraints, or complex network failures, the skill guides you through the diagnostic process to reduce mean time to recovery (MTTR).

  • Automated triage: Run cluster-level health checks, identify non-running pods across all namespaces, and analyze node resource utilization using kubectl and Python-based diagnostic scripts.

  • Deep dive pod investigation: Retrieve logs, events, and configuration details to troubleshoot common failure modes like CrashLoopBackOff, ImagePullBackOff, OOMKilled, and Pending status.

  • Namespace health analysis: Execute automated scripts to assess deployment availability, service endpoints, PVC storage status, and resource quota usage within specific namespaces.

  • Structured incident response: Follow established playbooks for SEV-1 to SEV-4 incidents, including assessment, investigation, resolution, and post-incident review procedures.

  • Resource and network debugging: Gain visibility into node DiskPressure, NotReady states, network policies, and persistent volume connectivity issues.

  • When to use: Invoke this skill when you encounter errors with Kubernetes components, pods, services, or storage volumes.

  • Input requirements: Provide the namespace and object name (pod, service, node) when specific resource investigation is required.

  • Expected output: Clear diagnostic findings, root cause analysis, recommended remediation steps, and verification commands.

  • Operational context: Designed for production-grade environments where safety, documentation of changes, and monitoring of post-fix behavior are critical.

  • Limitations: Requires access to kubectl, sufficient RBAC permissions on the target cluster, and python3 installed for advanced diagnostic script execution.
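
The automated triage step described above (finding non-running pods across all namespaces) can be sketched roughly as follows. This is a minimal illustration, not the skill's actual diagnostic script: the function names `fetch_all_pods`, `container_reasons`, and `find_unhealthy_pods` are hypothetical, and the sketch assumes only that `kubectl get pods --all-namespaces -o json` is available and returns the standard pod list structure.

```python
import json
import subprocess

# Failure reasons named in the feature list above; extend as needed.
KNOWN_FAILURE_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "OOMKilled"}
# Pod phases treated as healthy for triage purposes (an assumption of this sketch).
HEALTHY_PHASES = {"Running", "Succeeded"}


def fetch_all_pods() -> dict:
    """Return the pod list for every namespace as parsed JSON (requires kubectl and RBAC access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits non-zero
    )
    return json.loads(result.stdout)


def container_reasons(pod: dict) -> list:
    """Collect waiting and last-terminated reasons from a pod's container statuses."""
    reasons = []
    for status in pod.get("status", {}).get("containerStatuses") or []:
        waiting = (status.get("state") or {}).get("waiting") or {}
        terminated = (status.get("lastState") or {}).get("terminated") or {}
        for reason in (waiting.get("reason"), terminated.get("reason")):
            if reason and reason not in reasons:
                reasons.append(reason)
    return reasons


def find_unhealthy_pods(pod_list: dict) -> list:
    """Flag pods in an unhealthy phase or with a known failure reason.

    A pod in CrashLoopBackOff usually still reports phase "Running",
    so container reasons are checked in addition to the phase.
    """
    findings = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        reasons = container_reasons(pod)
        has_known_failure = any(r in KNOWN_FAILURE_REASONS for r in reasons)
        if phase not in HEALTHY_PHASES or has_known_failure:
            findings.append({
                "namespace": pod.get("metadata", {}).get("namespace", ""),
                "name": pod.get("metadata", {}).get("name", ""),
                "phase": phase,
                "reasons": reasons,
            })
    return findings
```

Calling `find_unhealthy_pods(fetch_all_pods())` against a reachable cluster would yield one record per suspect pod, which can then feed the per-pod investigation (logs, events, configuration). Keeping the parsing separate from the kubectl call means the triage logic can be exercised against saved JSON without cluster access.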

Repository Stats

Stars: 139
Forks: 26
Open Issues: 1
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: May 3, 2026, 05:54 PM