One Day My Batch Job Stopped - Implementing K8S Command Line Health Check

Introduction
Recently, I migrated equipment that had reached OS EOS (End of Support).
There were no issues after the migration, but that was just the calm before the storm.
About a week after the migration was complete, I received a call from the monitoring center.
"XXX data has not been applied. Please check."
Thinking something got tangled during processing, I checked Sentry and error logs but found nothing unusual.
Looking at the full logs, the last entry was at "03:00", and since it was now around "09:00" my first thought was, "Did something go wrong with the UTC conversion?" But the timezone was innocent: this was a batch job that was supposed to run every 10 minutes.

Sweat started running down my back as I realized what had happened. The batch job had hung while processing.
It was strange. The container was alive, resource usage was normal, and no more logs were being output. It looked fine on the outside, but the loop had stopped inside.
The root cause itself was resolved separately, but I also wanted to close the monitoring gap that let the failure go unnoticed for six hours.
In this post, I'll show how to detect that an infinite-loop job has silently stopped, and how to recover automatically, using Kubernetes' livenessProbe.
Problem Definition
- A Python worker running batch jobs in an infinite loop
- However, even if the job stops or the loop halts due to an exception, the container is considered healthy
- Kubernetes' default health checks make it difficult to detect such issues
Solution Idea
- If the loop is running normally, periodically record the current timestamp to a specific file
- Kubernetes periodically calculates the difference between that timestamp and the current time; if it's been stopped for too long, restart the container
Implementation
1. Recording Heartbeat in Python
    import time

    def write_heartbeat(path="/tmp/heartbeat"):
        with open(path, "w") as f:
            f.write(str(int(time.time())))

    def process_job():
        # Write job logic here
        print("Running job...")

    def main():
        while True:
            # Actual job logic
            process_job()
            # Record heartbeat for health check
            write_heartbeat()
            # Cycle interval (e.g., 30 seconds)
            time.sleep(30)

    if __name__ == "__main__":
        main()
- Records the current time to /tmp/heartbeat.
- The recording interval, plus the longest time a single job can take, must stay comfortably under the staleness threshold the probe checks against (90 seconds in the configuration below), or a healthy loop will look stale.
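One small refinement worth considering: if the probe happens to read the file mid-write, it could see a partial timestamp. A sketch of an atomic variant, using a temp-file-plus-rename pattern (the `.tmp` suffix is my own convention, not from the original post):

```python
import os
import time

def write_heartbeat(path="/tmp/heartbeat"):
    """Write the current Unix timestamp atomically.

    Writing to a temp file and renaming it avoids the (unlikely) case
    where the liveness probe reads a half-written file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(int(time.time())))
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
```

This is a drop-in replacement for the `write_heartbeat()` above; the rest of the loop stays unchanged.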
2. Kubernetes livenessProbe Configuration
    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - test $(($(date +%s) - $(cat /tmp/heartbeat))) -lt 90
      initialDelaySeconds: 10
      periodSeconds: 30
- If the difference between the current time and the heartbeat file time is less than 90 seconds, it's healthy
- If the heartbeat hasn't been updated for more than 90 seconds, it's unhealthy -> container restarts
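You can dry-run the probe logic on any machine before wiring it into the manifest. This sketch writes a fresh heartbeat and then runs the exact check the probe executes, so it should print "healthy":

```shell
# Simulate the probe locally: write a fresh heartbeat, then run the
# same check the livenessProbe executes.
date +%s > /tmp/heartbeat
if test $(($(date +%s) - $(cat /tmp/heartbeat))) -lt 90; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

Overwriting /tmp/heartbeat with an old timestamp (e.g. `echo 0 > /tmp/heartbeat`) flips the output to "unhealthy", which is exactly the condition that triggers a restart in the cluster.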
Cautions
- If write_heartbeat() stops being called, whether because the job hangs or an exception breaks the loop, the heartbeat goes stale and Kubernetes restarts the container.
- If /tmp/heartbeat does not exist yet when the probe first fires, the shell command fails and counts as a probe failure; write an initial heartbeat at startup, or set initialDelaySeconds generously, to avoid a spurious restart.
- /tmp/heartbeat is a path inside the container's own filesystem, so it is not shared with other containers in the Pod unless you mount it on a shared volume.
- Keep the heartbeat interval (plus the longest expected job duration) well under the probe's staleness threshold.
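The cautions above also raise a policy question: should a single failed iteration restart the container? One option is to retry transient failures in-loop while still letting persistent failures trip the probe. A sketch of that policy (the try/except structure and log message are my own additions, not from the original post):

```python
import logging
import time

def write_heartbeat(path="/tmp/heartbeat"):
    with open(path, "w") as f:
        f.write(str(int(time.time())))

def process_job():
    print("Running job...")

def main():
    while True:
        try:
            process_job()
        except Exception:
            # Transient failure: log it and retry on the next cycle
            # instead of letting the exception kill the loop.
            logging.exception("process_job failed; retrying next cycle")
        else:
            # Record the heartbeat only after a successful run, so a job
            # that fails every cycle stops updating it and the probe
            # eventually restarts the container.
            write_heartbeat()
        time.sleep(30)

if __name__ == "__main__":
    main()
```

The key design choice is writing the heartbeat in the `else` branch: one bad cycle is tolerated, but a job that never succeeds still goes stale and gets restarted.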
Conclusion

This incident made me reflect on how I had only focused on pod health checks for API services while neglecting batch jobs.
Fortunately, the approach that enabled quick detection and automatic recovery turned out to be simple: no complex tooling, just a timestamp file and a liveness probe.
If you're operating in a similar setup (K8S + infinite loop batch), I strongly recommend applying this method.
Additionally, please take a moment to consider whether you might have been neglecting health checks for your batch jobs as well.