One Day My Batch Job Stopped - Implementing K8S Command Line Health Check

Introduction
Recently, I migrated equipment that had reached OS EOS (End of Support).
There were no issues after the migration, but that was just the calm before the storm.
About a week after the migration was complete, I received a call from the monitoring center.
"XXX data has not been applied. Please check."
Thinking something got tangled during processing, I checked Sentry and error logs but found nothing unusual.
Looking at the full logs, the last entry was at "03:00", and since it was now around "09:00" my first thought was, "Did something go wrong with the UTC conversion?" But the timezone was innocent: this was a batch job that was supposed to run every 10 minutes.

Sweat started running down my back as I realized what had happened. The batch job had hung while processing.
It was strange. The container was alive, resource usage was normal, and no more logs were being output. It looked fine on the outside, but the loop had stopped inside.
The root cause itself was resolved separately, but I also wanted to close the monitoring gap that let the failure go unnoticed for six hours.
In this post, I'll show how to detect that an infinite-loop job has silently stopped, and how to recover automatically, using Kubernetes' livenessProbe.
Problem Definition
- A Python worker running batch jobs in an infinite loop
- However, even if the job stops or the loop halts due to an exception, the container is considered healthy
- Kubernetes' default health checks make it difficult to detect such issues
Solution Idea
- If the loop is running normally, periodically record the current timestamp to a specific file
- Kubernetes periodically calculates the difference between that timestamp and the current time; if it's been stopped for too long, restart the container
Implementation
1. Recording Heartbeat in Python
    import time

    def write_heartbeat(path="/tmp/heartbeat"):
        with open(path, "w") as f:
            f.write(str(int(time.time())))

    def process_job():
        # Write job logic here
        print("Running job...")

    def main():
        while True:
            # Actual job logic
            process_job()
            # Record heartbeat for health check
            write_heartbeat()
            # Cycle interval (e.g., 30 seconds)
            time.sleep(30)

    if __name__ == "__main__":
        main()
- Records the current time to /tmp/heartbeat.
- The recording interval, plus the longest time a single job can take, must stay comfortably under the staleness threshold the probe checks against (90 seconds in the configuration below), or a healthy loop will look stale.
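One small refinement worth considering: if the probe happens to read the file mid-write, it could see a partial timestamp. A sketch of an atomic variant, using a temp-file-plus-rename pattern (the `.tmp` suffix is my own convention, not from the original post):

```python
import os
import time

def write_heartbeat(path="/tmp/heartbeat"):
    """Write the current Unix timestamp atomically.

    Writing to a temp file and renaming it avoids the (unlikely) case
    where the liveness probe reads a half-written file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(int(time.time())))
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
```

This is a drop-in replacement for the `write_heartbeat()` above; the rest of the loop stays unchanged.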
2. Kubernetes livenessProbe Configuration
    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - test $(($(date +%s) - $(cat /tmp/heartbeat))) -lt 90
      initialDelaySeconds: 10
      periodSeconds: 30
- If the difference between the current time and the heartbeat file time is less than 90 seconds, it's healthy
- If the heartbeat hasn't been updated for more than 90 seconds, it's unhealthy -> container restarts
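You can dry-run the probe logic on any machine before wiring it into the manifest. This sketch writes a fresh heartbeat and then runs the exact check the probe executes, so it should print "healthy":

```shell
# Simulate the probe locally: write a fresh heartbeat, then run the
# same check the livenessProbe executes.
date +%s > /tmp/heartbeat
if test $(($(date +%s) - $(cat /tmp/heartbeat))) -lt 90; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

Overwriting /tmp/heartbeat with an old timestamp (e.g. `echo 0 > /tmp/heartbeat`) flips the output to "unhealthy", which is exactly the condition that triggers a restart in the cluster.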
Cautions
- If write_heartbeat() stops being called, whether because the job hangs or an exception breaks the loop, the heartbeat goes stale and Kubernetes restarts the container.
- If /tmp/heartbeat does not exist yet when the probe first fires, the shell command fails and counts as a probe failure; write an initial heartbeat at startup, or set initialDelaySeconds generously, to avoid a spurious restart.
- /tmp/heartbeat is a path inside the container's own filesystem, so it is not shared with other containers in the Pod unless you mount it on a shared volume.
- Keep the heartbeat interval (plus the longest expected job duration) well under the probe's staleness threshold.
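The cautions above also raise a policy question: should a single failed iteration restart the container? One option is to retry transient failures in-loop while still letting persistent failures trip the probe. A sketch of that policy (the try/except structure and log message are my own additions, not from the original post):

```python
import logging
import time

def write_heartbeat(path="/tmp/heartbeat"):
    with open(path, "w") as f:
        f.write(str(int(time.time())))

def process_job():
    print("Running job...")

def main():
    while True:
        try:
            process_job()
        except Exception:
            # Transient failure: log it and retry on the next cycle
            # instead of letting the exception kill the loop.
            logging.exception("process_job failed; retrying next cycle")
        else:
            # Record the heartbeat only after a successful run, so a job
            # that fails every cycle stops updating it and the probe
            # eventually restarts the container.
            write_heartbeat()
        time.sleep(30)

if __name__ == "__main__":
    main()
```

The key design choice is writing the heartbeat in the `else` branch: one bad cycle is tolerated, but a job that never succeeds still goes stale and gets restarted.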
Conclusion

This incident made me reflect on how I had only focused on pod health checks for API services while neglecting batch jobs.
Fortunately, the approach that enabled quick detection and automatic recovery turned out to be simple: no complex tooling, just a timestamp file and a liveness probe.
If you're operating in a similar setup (K8S + infinite loop batch), I strongly recommend applying this method.
Additionally, please take a moment to consider whether you might have been neglecting health checks for your batch jobs as well.