

Sarthak Varshney is a Docker Captain, 5x C# Corner MVP, and 2x Alibaba Cloud MVP, with over six years of hands-on experience in the IT industry, specializing in cloud computing, DevOps, and modern application infrastructure. He is an Author and Associate Consultant, known for working extensively with cloud platforms and container-based technologies in real-world environments.
So you've got your Docker container running. The terminal shows it's "Up." You feel good. You grab a chai, sit back, and think everything's fine.
But is it really?
Here's the thing — Docker saying a container is "Up" doesn't mean your app inside it is actually working. It just means the container process didn't crash. Your Node.js server could be stuck in an infinite loop, your database might have lost its connection, or your Flask app could be returning 500 errors on every single request — and Docker would still happily report: "Status: Up 3 hours."
That's exactly the gap that HEALTHCHECK fills. And once you pair it with restart policies, you've got a self-healing setup that can recover from failures without you having to wake up at 3am.
Let's dig in.
Imagine you've deployed a web app. The container starts, the process is running, Docker is happy. But behind the scenes, the app failed to connect to the database on startup. It's sitting there, alive but completely useless — like a TV that's plugged in but has no signal.
Without health checks, Docker has no idea. It won't restart the container. Your load balancer (if you have one) will keep sending traffic to a broken instance. Users get errors. Your phone starts ringing.
With a health check, Docker periodically knocks on the door — "Hey, are you actually functional?" — and if the answer is silence or an error, Docker marks that container as unhealthy and can act on it.
The HEALTHCHECK instruction goes inside your Dockerfile. It defines a command that Docker will run inside the container at regular intervals to check if the app is working.
Here's the basic syntax:
HEALTHCHECK [OPTIONS] CMD <command>
And here's a real example for a web app:
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
Let's break down what each option does:
--interval=30s
This is how often Docker runs the health check. Every 30 seconds, Docker will run the CMD. Think of it like a doctor taking your pulse every 30 seconds.
--timeout=10s
If the health check command doesn't respond within 10 seconds, Docker counts it as a failure. Your doctor isn't going to wait forever — if there's no response in 10 seconds, something's wrong.
--start-period=15s
This is a grace period when the container first starts. Maybe your app needs 10-12 seconds to warm up — initialize connections, load config, etc. During the start period, failed health checks don't count against the container. Docker won't panic-mark it as unhealthy just because it took a moment to boot.
--retries=3
How many consecutive failures before Docker marks the container as "unhealthy." One bad check could be a fluke — network blip, CPU spike. Three failures in a row? That's a real problem.
CMD curl -f http://localhost:8080/health || exit 1
This is the actual check command. It hits the /health endpoint on your app. The -f flag on curl makes it return a non-zero exit code if the HTTP response is an error (4xx, 5xx). If curl fails OR the endpoint returns an error, we exit with 1 (failure). If it works, curl exits with 0 (success).
The golden rule: Exit code 0 = healthy. Any other exit code = unhealthy. That's it.
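To see that rule in isolation, here's a toy Python sketch (not Docker's actual implementation) that mimics the decision Docker makes: run the check command, then map its exit code to a health status.

```python
import subprocess

def health_status(check_cmd):
    """Mimic Docker's rule: exit code 0 means healthy,
    any other exit code means unhealthy."""
    result = subprocess.run(check_cmd, shell=True, capture_output=True)
    return "healthy" if result.returncode == 0 else "unhealthy"

# 'true' exits 0 and 'false' exits 1 -- the same contract
# your health check command has to follow.
print(health_status("true"))   # healthy
print(health_status("false"))  # unhealthy
```

That's the entire contract: Docker doesn't care what the command does internally, only what code it exits with.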
Let's say you're building a simple Node.js Express app. Here's how you'd write the Dockerfile with a proper health check:
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
EXPOSE 3000
# Health check: ping the /health route every 30 seconds
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "server.js"]
Notice I used wget here instead of curl — both work, and wget is often already installed in Alpine-based images. If your base image doesn't have either, you can use a Node.js one-liner:
HEALTHCHECK CMD node -e "require('http').get('http://localhost:3000/health', (r) => { process.exit(r.statusCode === 200 ? 0 : 1) }).on('error', () => process.exit(1))"
And in your server.js, make sure you actually have that /health endpoint:
app.get('/health', (req, res) => {
  // You can add real checks here — DB connection, memory, etc.
  res.status(200).json({ status: 'ok' });
});
Once your container is running with a health check configured, you can inspect it:
docker ps
The STATUS column now shows something like:
STATUS
Up 2 minutes (healthy)
Up 5 minutes (unhealthy)
Up 10 seconds (health: starting)
Those three states are the health lifecycle:
health: starting — The start period is still running. Docker is being patient.
healthy — The check passed. Everything's good.
unhealthy — The check failed (retries exhausted). Something's broken.
For detailed information, use docker inspect:
docker inspect --format='{{json .State.Health}}' <container_name>
This gives you a JSON output with the last few health check results, the time they ran, and the exit codes. Super useful for debugging.
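If you want to script against that output, here's a sketch of summarizing it in Python. The field names (Status, FailingStreak, Log, ExitCode, Output) come from Docker's inspect schema; the sample values below are invented for illustration.

```python
import json

# Sample shaped like `docker inspect --format='{{json .State.Health}}'`
# output (field names per Docker's inspect schema; values made up here).
sample = '''{
  "Status": "unhealthy",
  "FailingStreak": 3,
  "Log": [
    {"Start": "2026-04-25T10:44:22+05:30", "End": "2026-04-25T10:44:32+05:30",
     "ExitCode": 1, "Output": "curl: (22) The requested URL returned error: 503"},
    {"Start": "2026-04-25T10:44:52+05:30", "End": "2026-04-25T10:45:02+05:30",
     "ExitCode": 1, "Output": "curl: (22) The requested URL returned error: 503"}
  ]
}'''

def summarize_health(raw):
    """Condense the health JSON into a one-line summary."""
    h = json.loads(raw)
    failed = sum(1 for e in h["Log"] if e["ExitCode"] != 0)
    return (f"{h['Status']}: {failed}/{len(h['Log'])} recent checks failed, "
            f"failing streak {h['FailingStreak']}")

print(summarize_health(sample))
```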
Or for something more human-readable:
docker inspect <container_name> | grep -A 20 '"Health"'
If you don't control the Dockerfile (using a third-party image, for example), you can add a health check when running the container:
docker run -d \
  --name my-nginx \
  --health-cmd="curl -f http://localhost/ || exit 1" \
  --health-interval=30s \
  --health-timeout=5s \
  --health-retries=3 \
  --health-start-period=10s \
  nginx
This overrides or adds a health check without touching any image. Handy for quick setups.
Okay, so now Docker knows your container is unhealthy. But what does it do about it?
By default — nothing. Docker just marks it unhealthy and moves on. If you want automatic recovery, you need restart policies.
A restart policy tells Docker what to do when a container exits or becomes unhealthy. You set it with the --restart flag:
docker run -d --restart <policy> my-image
There are four restart policies:
no (default)
docker run -d --restart no my-image
Docker does nothing when the container exits. The default behavior — you have to restart it manually. Fine for development, bad for production.
always
docker run -d --restart always my-image
Docker always restarts the container if it stops, regardless of why. Even if you manually docker stop it, it'll restart when the Docker daemon restarts. This is useful for critical services but can be annoying if you deliberately stop a container and it keeps coming back.
on-failure
docker run -d --restart on-failure:5 my-image
Docker restarts the container only if it exits with a non-zero exit code — meaning something actually crashed. The :5 limits restarts to 5 attempts. Great for apps where you want recovery from crashes but don't want Docker endlessly restarting something that's fundamentally broken.
unless-stopped
docker run -d --restart unless-stopped my-image
This is probably the most practical policy for production. It's like always, except if you manually stop the container with docker stop, it stays stopped — even after a Docker daemon restart. You have explicit control. This is the one most production teams reach for.
Here's something that trips up a lot of beginners: Docker does NOT automatically restart containers just because they're marked "unhealthy."
Read that again.
Health checks update the container's health status — but the restart policy only kicks in when the container exits (the main process stops). If your app is stuck and hanging (returning errors but the process is still running), Docker will mark it unhealthy but the restart policy won't trigger because the container never exited.
So how do you handle a truly stuck container? A few approaches:
Option 1: Make your app exit on critical failures
The cleanest solution. If your app detects it's in a broken state — lost DB connection, corrupted state, whatever — just process.exit(1). Docker sees the non-zero exit, the restart policy kicks in, container restarts. Clean and simple.
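Here's a minimal Python sketch of that pattern (the check names and the way you'd call it are hypothetical): run your critical dependency checks, and exit non-zero the moment one fails so the restart policy can take over.

```python
import sys

def ensure_critical(checks):
    """Run critical dependency checks (name -> callable).
    Exit with a non-zero code on the first failure so Docker
    sees the container die and the restart policy kicks in."""
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            print(f"FATAL: {name} failed: {exc}", file=sys.stderr)
            sys.exit(1)

# Hypothetical usage: call this from a periodic background task,
# or after catching an unrecoverable error in a request handler.
ensure_critical({"config loaded": lambda: None})  # passes, app keeps running
```

The Node.js equivalent is the process.exit(1) mentioned above; the principle is identical in any language.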
Option 2: Use Docker Swarm or Kubernetes
Docker Swarm actually does respond to unhealthy containers by rescheduling them. If you're running in Swarm mode, unhealthy containers get replaced. This is one reason orchestrators exist.
Option 3: External monitoring + scripted intervention
A monitoring script watches for unhealthy containers and calls docker restart on them. Not elegant, but it works for simple setups.
You've got health checks running. Now how do you actually monitor this stuff?
# See health status for all running containers
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Image}}"
# Live watch — runs every 2 seconds
watch -n 2 docker ps
Docker emits events you can subscribe to. This is great for lightweight monitoring:
docker events --filter event=health_status
This streams events like:
2026-04-25T10:32:01.123456789+05:30 container health_status my-app (status=healthy)
2026-04-25T10:45:22.987654321+05:30 container health_status my-app (status=unhealthy)
You can pipe this into a log file or hook it up to alerting tools.
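For instance, a small Python filter could turn that stream into structured alerts. The regex below matches the event format shown above; note that the exact line format of docker events can vary between Docker versions, so treat this as a sketch to adapt.

```python
import re

# Matches the health_status event lines shown above.
EVENT_RE = re.compile(r"container health_status (\S+) \(status=(\w+)\)")

def parse_health_event(line):
    """Return (container_name, status) for a health_status event line,
    or None for lines that don't match."""
    m = EVENT_RE.search(line)
    return (m.group(1), m.group(2)) if m else None

line = "2026-04-25T10:45:22.987654321+05:30 container health_status my-app (status=unhealthy)"
print(parse_health_event(line))  # ('my-app', 'unhealthy')
```

You could feed it by piping docker events output into a script that reads stdin line by line and fires a webhook whenever the status is unhealthy.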
Here's a simple shell script to check for unhealthy containers and restart them:
#!/bin/bash
# healthcheck-monitor.sh
UNHEALTHY=$(docker ps --filter health=unhealthy --format "{{.Names}}")
if [ -n "$UNHEALTHY" ]; then
  echo "Found unhealthy containers: $UNHEALTHY"
  for container in $UNHEALTHY; do
    echo "Restarting $container..."
    docker restart "$container"
  done
else
  echo "All containers healthy."
fi
Run this via cron every minute:
# Add to crontab (crontab -e)
* * * * * /path/to/healthcheck-monitor.sh >> /var/log/docker-health.log 2>&1
Not a full-blown monitoring solution, but for a small VPS setup? It gets the job done.
Mistake 1: Health check endpoint doesn't actually check anything
A lot of devs add a /health route that just returns { status: "ok" } unconditionally. That's better than nothing, but you're missing the point. Your health check should verify that the app is actually functional — at minimum, check that database connections are alive:
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1'); // real check
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'error', message: err.message });
  }
});
Mistake 2: Forgetting curl or wget isn't in the base image
Especially with minimal images like alpine or distroless, curl and wget might not be installed. Either install them in your Dockerfile or use a language-native check:
RUN apk add --no-cache curl # for Alpine
Mistake 3: Setting --interval too short
If your interval is 5 seconds and the check itself takes 3-4 seconds, you're hammering your own app with health checks. Set the interval reasonably — 30 seconds is a solid default for most apps.
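As a rough back-of-the-envelope (ignoring Docker's exact scheduling details, which vary), the longest an outage can go unflagged is about the number of retries times the interval, plus the time each probe is allowed to hang:

```python
def worst_case_detection_s(interval, timeout, retries):
    """Rough upper bound on seconds before Docker flags a container:
    `retries` consecutive probes, spaced `interval` apart, each
    allowed up to `timeout` seconds to fail."""
    return retries * (interval + timeout)

# With the values from the earlier web app example:
print(worst_case_detection_s(interval=30, timeout=10, retries=3))  # 120
```

Two minutes of undetected downtime is fine for most apps; if it isn't for yours, tighten the interval and timeout together rather than just one of them.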
Mistake 4: No --start-period on slow-starting apps
Java apps, apps with heavy initialization, anything that takes >10 seconds to start — if you don't set a start period, Docker will immediately start failing health checks and might never get the container into a healthy state. Always add a --start-period that's longer than your expected startup time.
Mistake 5: Using always restart policy in development
You make a code change, the container crashes (expected), and Docker immediately restarts it running the old image. Confusing. Use --restart no in development.
Everything we've covered works in Docker Compose too. Here's how you'd write it in docker-compose.yml:
version: '3.8'
services:
  webapp:
    build: .
    ports:
      - "3000:3000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s
  db:
    image: postgres:15
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
Notice two things here:
The db service uses pg_isready — a PostgreSQL-native command that checks if the server is ready to accept connections. Much better than a generic TCP check.
You can make your webapp wait for db to be healthy before starting:
webapp:
  depends_on:
    db:
      condition: service_healthy
This is one of the most underused features of Docker Compose. The condition: service_healthy means Compose will wait until the database's health check passes before even starting the web app. No more race conditions where your app crashes because the DB wasn't ready.
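Keep in mind that service_healthy only gates startup ordering; if the database drops out later, your app still has to cope on its own. A common complement (a sketch here, with a hypothetical connect callable) is retrying flaky dependencies with a delay:

```python
import time

def connect_with_retry(connect, attempts=5, delay=2.0):
    """Call `connect` until it succeeds, sleeping `delay` seconds
    between attempts; re-raise the last error when out of attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Hypothetical usage at startup:
# db = connect_with_retry(lambda: psycopg2.connect(DATABASE_URL))
```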
Here's your hands-on challenge. Build this and see it in action.
Step 1: Create a simple Python Flask app with a /health endpoint that checks a counter. After 3 requests, make the health endpoint return 500 (simulating a failure).
# app.py
from flask import Flask, jsonify

app = Flask(__name__)
request_count = 0

@app.route('/')
def index():
    return "Hello from Docker!"

@app.route('/health')
def health():
    global request_count
    request_count += 1
    if request_count > 3:
        return jsonify({"status": "error"}), 500
    return jsonify({"status": "ok"}), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Step 2: Write a Dockerfile with a HEALTHCHECK that checks this endpoint every 10 seconds with a 3-second timeout and 2 retries.
Step 3: Build and run the container with --restart on-failure:3.
Step 4: Watch docker ps every few seconds. See when the status changes from healthy to unhealthy. Does Docker restart the container? Why or why not?
Bonus: Modify the Flask app so that when /health fails, it also calls sys.exit(1). What changes now?
This exercise will make the behavior of health checks and restart policies click in a way that no amount of reading can.
Let's pull it all together. Here's what you now know:
Docker's "Up" status only means the process is running, not that the app inside works. Closing that gap is what HEALTHCHECK is for.
The HEALTHCHECK instruction runs a command periodically and marks the container healthy or unhealthy based on the exit code.
Use --interval, --timeout, --start-period, and --retries to tune the check for your app's behavior.
Restart policies (no, always, on-failure, unless-stopped) tell Docker what to do when a container exits.
Docker never restarts a container just for being unhealthy; the app must exit on failure, or an orchestrator or external script must step in.
In Docker Compose, use condition: service_healthy to control startup order properly.
Getting health checks right is one of those things that separates hobbyist Docker usage from production-grade deployments. It's not complicated — but most people skip it until something breaks at 2am. Don't be that person.
Add a health check. Set a restart policy. Sleep better.