Deploy Checkly Private Locations on Kubernetes for production-scale monitoring, taking advantage of Kubernetes' built-in orchestration, scaling, and reliability features. This guide covers Helm chart deployment, manual manifests, and production best practices.

Prerequisites

Before deploying on Kubernetes, ensure you have:
  • A running Kubernetes cluster (v1.19+)
  • kubectl configured and connected to the cluster (see the quick check below)
  • Sufficient cluster resources: each agent replica requests roughly 0.5-1 CPU core and 1-2 GB of memory, depending on configuration
  • A Checkly Private Location API key (starts with pl_)
  • Network policies configured (if required)
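
A quick way to confirm the cluster and kubectl are ready before installing anything (standard kubectl commands):

# Confirm kubectl can reach the cluster and report client/server versions
kubectl version

# List nodes and review allocatable capacity
kubectl get nodes
kubectl describe nodes | grep -A 5 "Allocatable"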

Quick Start with Helm

The fastest way to deploy Checkly Agents on Kubernetes is using our official Helm chart.

Get the Helm Chart

# Clone the official repository
git clone https://github.com/checkly/checkly-k8s.git
cd checkly-k8s

# Verify chart structure
ls -la helm-chart/

Deploy with Helm

# Install with your API key
helm install checkly-agent \
  --set apiKeySecret.apiKey="pl_your_api_key_here" \
  ./helm-chart

# Install with custom values
helm install checkly-agent \
  --set apiKeySecret.apiKey="pl_your_api_key_here" \
  --set replicaCount=3 \
  --set resources.requests.memory="2Gi" \
  --set resources.requests.cpu="1" \
  ./helm-chart

# Install in custom namespace
kubectl create namespace checkly
helm install checkly-agent \
  --namespace checkly \
  --set apiKeySecret.apiKey="pl_your_api_key_here" \
  ./helm-chart

Verify Deployment

# Check pod status
kubectl get pods -l app.kubernetes.io/name=checkly-agent

# View agent logs
kubectl logs -l app.kubernetes.io/name=checkly-agent --tail=50

# Check agent connectivity
kubectl exec -it deployment/checkly-agent -- curl -I https://agent.checklyhq.com
The Helm chart automatically creates the API key secret, configures resource limits, and applies hardened pod security contexts for production use.
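
You can also ask Helm for the release status and the values it was installed with (standard Helm commands):

# Show release status and deployed resources
helm status checkly-agent

# Show the user-supplied values for the release
helm get values checkly-agent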

Manual Kubernetes Manifests

For more control or custom environments, deploy using individual Kubernetes manifests.

1. Create Namespace (Optional)

# checkly-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkly
  labels:
    name: checkly
    purpose: monitoring
kubectl apply -f checkly-namespace.yaml

2. Create API Key Secret

# api-key-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: checkly-api-key
  namespace: checkly
type: Opaque
data:
  # Base64 encoded API key: echo -n "pl_your_api_key_here" | base64
  api-key: cGxfeW91cl9hcGlfa2V5X2hlcmU=
# Create secret directly (easier method)
kubectl create secret generic checkly-api-key \
  --namespace=checkly \
  --from-literal=api-key="pl_your_api_key_here"
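
To confirm the stored value decodes back to the key you expect:

# Decode the stored API key and compare it with the original value
kubectl get secret checkly-api-key -n checkly \
  -o jsonpath='{.data.api-key}' | base64 -d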

3. Create Deployment

# checkly-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkly-agent
  namespace: checkly
  labels:
    app: checkly-agent
    version: "6.0.3"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkly-agent
  template:
    metadata:
      labels:
        app: checkly-agent
      annotations:
        prometheus.io/scrape: "false"
    spec:
      containers:
      - name: checkly-agent
        image: checkly/agent:6.0.3
        imagePullPolicy: IfNotPresent
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: checkly-api-key
              key: api-key
        - name: JOB_CONCURRENCY
          value: "3"
        - name: LOG_LEVEL
          value: "INFO"
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ps aux | grep -v grep | grep node"
          initialDelaySeconds: 30
          periodSeconds: 30
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "ps aux | grep -v grep | grep node"
          initialDelaySeconds: 10
          periodSeconds: 10
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 1000
          readOnlyRootFilesystem: false
          capabilities:
            drop:
            - ALL
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      securityContext:
        fsGroup: 1000
kubectl apply -f checkly-deployment.yaml
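
After applying the manifest, wait for the rollout to finish and confirm both replicas are ready:

# Wait for the rollout to complete
kubectl rollout status deployment/checkly-agent -n checkly

# Confirm the pods are running and ready
kubectl get pods -n checkly -l app=checkly-agent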

Production Configuration

Resource Planning

Configure appropriate resource requests and limits based on your workload:
# High-capacity agent configuration
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "8Gi"
    cpu: "4"
env:
- name: JOB_CONCURRENCY
  value: "8"  # 8GB / 1GB per browser check
# API-only agent configuration  
resources:
  requests:
    memory: "500Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1"
env:
- name: JOB_CONCURRENCY
  value: "10"  # 2GB / 0.15GB per API check

Node Affinity and Scheduling

# Pin agents to specific nodes
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: checkly.io/agent-node
                operator: In
                values:
                - "true"
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - checkly-agent
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: "checkly.io/dedicated"
        operator: "Equal"
        value: "agent"
        effect: "NoSchedule"
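
The affinity rule and toleration above assume the target nodes carry a matching label and taint. You can set both with standard kubectl commands (replace <node-name> with your node):

# Label the node so the nodeAffinity rule matches
kubectl label nodes <node-name> checkly.io/agent-node=true

# Taint the node so only pods that tolerate it (the agents) are scheduled there
kubectl taint nodes <node-name> checkly.io/dedicated=agent:NoSchedule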

Pod Disruption Budget

# pod-disruption-budget.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkly-agent-pdb
  namespace: checkly
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: checkly-agent
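
Apply the budget and confirm it is tracking the agent pods:

kubectl apply -f pod-disruption-budget.yaml
kubectl get pdb checkly-agent-pdb -n checkly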

Horizontal Pod Autoscaler

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkly-agent-hpa
  namespace: checkly
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkly-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
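
The HPA relies on the Kubernetes metrics API (typically metrics-server) for CPU and memory readings. Apply it and watch utilization against the targets:

kubectl apply -f hpa.yaml

# TARGETS shows current vs. target utilization once metrics are available
kubectl get hpa checkly-agent-hpa -n checkly --watch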

Advanced Configuration

ConfigMap for Agent Settings

# agent-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkly-agent-config
  namespace: checkly
data:
  JOB_CONCURRENCY: "5"
  LOG_LEVEL: "INFO"
  USE_OS_DNS_RESOLVER: "true"
---
# Deployment excerpt: reference the ConfigMap from the agent container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkly-agent
  namespace: checkly
spec:
  template:
    spec:
      containers:
      - name: checkly-agent
        envFrom:
        - configMapRef:
            name: checkly-agent-config
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: checkly-api-key
              key: api-key
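
Note that changing a ConfigMap does not restart running pods. Trigger a rolling restart so the agents pick up new values:

# Apply the updated ConfigMap, then restart the deployment
kubectl apply -f agent-config.yaml
kubectl rollout restart deployment/checkly-agent -n checkly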

Network Policies

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: checkly-agent-netpol
  namespace: checkly
spec:
  podSelector:
    matchLabels:
      app: checkly-agent
  policyTypes:
  - Egress
  egress:
  # Allow DNS resolution
  - to: []
    ports:
    - protocol: UDP
      port: 53
  # Allow HTTPS to Checkly API
  - to: []
    ports:
    - protocol: TCP
      port: 443
  # Allow HTTP/HTTPS for checks
  - to: []
    ports:
    - protocol: TCP
      port: 80
    - protocol: TCP
      port: 443

Monitoring and Observability

# service-monitor.yaml (for Prometheus)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkly-agent
  namespace: checkly
spec:
  selector:
    matchLabels:
      app: checkly-agent
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
apiVersion: v1
kind: Service
metadata:
  name: checkly-agent-metrics
  namespace: checkly
  labels:
    app: checkly-agent
spec:
  ports:
  - name: metrics
    port: 9090
    targetPort: 9090
  selector:
    app: checkly-agent
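
To spot-check the endpoint locally, port-forward the metrics Service; this assumes the agent exposes Prometheus metrics on port 9090 as configured above:

kubectl port-forward -n checkly svc/checkly-agent-metrics 9090:9090

# In another terminal
curl -s http://localhost:9090/metrics | head -n 20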

Helm Chart Customization

Custom Values File

# values-production.yaml
replicaCount: 3

image:
  repository: checkly/agent
  tag: "6.0.3"
  pullPolicy: IfNotPresent

apiKeySecret:
  apiKey: "pl_your_production_key"
  
env:
  JOB_CONCURRENCY: "5"
  LOG_LEVEL: "INFO"
  USE_OS_DNS_RESOLVER: "true"

resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "6Gi"
    cpu: "3"

nodeSelector:
  checkly.io/agent-node: "true"

tolerations:
- key: "checkly.io/dedicated"
  operator: "Equal"
  value: "agent"
  effect: "NoSchedule"

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - checkly-agent
        topologyKey: kubernetes.io/hostname

podDisruptionBudget:
  enabled: true
  minAvailable: 1

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
Deploy with custom values:
helm install checkly-agent -f values-production.yaml ./helm-chart
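
For repeatable deploys, helm upgrade --install is idempotent and can create the namespace in the same step:

helm upgrade --install checkly-agent \
  --namespace checkly \
  --create-namespace \
  -f values-production.yaml \
  ./helm-chart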

Upgrade Strategy

# Upgrade with rolling deployment
helm upgrade checkly-agent \
  --set image.tag="6.1.0" \
  --reuse-values \
  ./helm-chart

# Rollback if needed
helm rollback checkly-agent 1
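
Check the release history first to identify the revision to roll back to:

# List release revisions and their status
helm history checkly-agent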

Security Best Practices

Pod Security Standards

# pod-security-context.yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkly-agent
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: checkly-agent
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      capabilities:
        drop:
        - ALL
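
If the cluster uses Pod Security Admission, you can enforce the restricted standard at the namespace level; the pod spec above satisfies it:

# Enforce the "restricted" Pod Security Standard for the checkly namespace
kubectl label namespace checkly \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest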

Secret Management with External Secrets

# external-secret.yaml (using External Secrets Operator)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: checkly-api-key-external
  namespace: checkly
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: SecretStore
  target:
    name: checkly-api-key
    creationPolicy: Owner
  data:
  - secretKey: api-key
    remoteRef:
      key: secret/checkly
      property: api-key
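
After the operator syncs, verify the ExternalSecret status and the generated Secret (assumes the External Secrets Operator is installed in the cluster):

kubectl get externalsecret checkly-api-key-external -n checkly
kubectl get secret checkly-api-key -n checkly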

Troubleshooting

Start with these diagnostics when pods fail to start or stay pending:
# Check pod status and events
kubectl describe pod -l app=checkly-agent -n checkly

# Check resource constraints
kubectl top pods -n checkly

# Verify secret exists
kubectl get secret checkly-api-key -n checkly -o yaml

# Check image pull issues
kubectl get events -n checkly --sort-by=.metadata.creationTimestamp
Common fixes for startup issues:
  • Verify API key is correctly base64 encoded in secret
  • Check resource requests vs available cluster capacity
  • Ensure image tag exists and is accessible
  • Verify namespace exists and has proper RBAC
Connectivity diagnostics:
# Test connectivity from pod
kubectl exec -it deployment/checkly-agent -n checkly -- \
  curl -I https://agent.checklyhq.com

# Check DNS resolution
kubectl exec -it deployment/checkly-agent -n checkly -- \
  nslookup agent.checklyhq.com

# View agent logs
kubectl logs deployment/checkly-agent -n checkly --tail=100 -f

# Check network policies
kubectl describe networkpolicy -n checkly
Common fixes for connectivity issues:
  • Verify egress network policies allow HTTPS to agent.checklyhq.com
  • Check if corporate firewall blocks outbound connections
  • Ensure DNS resolution works for external domains
  • Verify proxy configuration if required
Monitoring and optimization:
# Monitor resource usage
kubectl top pods -n checkly

# Check HPA status
kubectl describe hpa checkly-agent-hpa -n checkly

# View detailed metrics
kubectl exec -it deployment/checkly-agent -n checkly -- \
  cat /proc/meminfo

# Check for OOMKilled containers (reported in the pod's Last State)
kubectl describe pods -l app=checkly-agent -n checkly | grep -B 2 OOMKilled
Optimization strategies:
  • Adjust JOB_CONCURRENCY based on available memory
  • Increase resource limits if pods are being throttled
  • Scale horizontally with more replicas
  • Use node affinity to place agents on appropriate nodes

Monitoring and Alerting

Prometheus Metrics

Monitor Checkly Agents with Prometheus alerting rules:
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkly-agent-rules
  namespace: checkly
spec:
  groups:
  - name: checkly-agent
    rules:
    - alert: ChecklyAgentDown
      expr: up{job="checkly-agent"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Checkly Agent is down"
        description: "Checkly Agent {{ $labels.instance }} has been down for more than 5 minutes"
    
    - alert: ChecklyAgentHighMemory
      expr: container_memory_usage_bytes{pod=~"checkly-agent-.*"} / container_spec_memory_limit_bytes{pod=~"checkly-agent-.*"} > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Checkly Agent high memory usage"
        description: "Checkly Agent {{ $labels.pod }} memory usage is above 90%"

Logging Configuration

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: checkly
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/checkly-agent-*.log
        Parser docker
        Tag checkly.agent.*
        Refresh_Interval 5
        
    [OUTPUT]
        Name forward
        Match checkly.agent.*
        Host log-aggregator.company.com
        Port 24224

Next Steps

Repository: Find complete examples and the latest Helm chart at github.com/checkly/checkly-k8s