Exploring Apache Flink Application Lifecycle Management with Kubernetes Operator

Introduction

While the official Apache Flink Kubernetes Operator documentation offers insights into managing the Flink application lifecycle, it lacks specific details about the operator's behavior during transitions between different application states.

In this blog post, we'll explore how to manage Apache Flink application lifecycles using the Kubernetes operator, focusing on the operator's behavior under the last-state and savepoint upgradeMode settings. The discussion covers stateful applications only.

Drawing from my own testing and observations, we'll explore the effects of lifecycle management actions, including deletion and suspension of Flink deployments, on job state, checkpoints and recovery. The aim is to provide insights for effectively managing Flink applications in Kubernetes environments.

Application lifecycle

Delete

When you delete a Flink deployment through the Kubernetes API, all checkpoint and status information tracked by the operator is removed, although the external state stored in remote storage remains unaffected. The Kubernetes resources related to the application, such as the TaskManager pods, the JobManager pods, and the ConfigMaps containing cluster and checkpoint information, are deleted.

Deleting an application does not trigger a final checkpoint or savepoint before shutdown. This is because "Delete" is not a job state managed by the Flink operator; it is simply a Kubernetes API operation that deletes all resources associated with the custom resource.

As a result, future deployments will start from an empty state unless the user manually sets the initialSavepointPath JobSpec parameter to an existing checkpoint or savepoint location in remote storage. This is the default behavior for any new Flink application.
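As a concrete sketch, here is what a redeployment that should not start empty might look like. The resource name, image tag, jar path, and savepoint location below are all illustrative placeholders, not values from the official docs:

```yaml
# Illustrative FlinkDeployment: redeploying after the custom resource was
# deleted (e.g. kubectl delete flinkdeployment example-job). All names,
# paths, and the image tag are placeholders.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job
spec:
  image: flink:1.17
  flinkVersion: v1_17
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/example-job.jar
    parallelism: 2
    upgradeMode: savepoint
    # Without this line, the redeployed job starts from an empty state;
    # point it at a checkpoint or savepoint that survived in remote storage.
    initialSavepointPath: s3://my-bucket/savepoints/savepoint-abc123
```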

Suspend

Suspending an application notifies the Flink Kubernetes operator that the user intends to upgrade the job and would like to pause it gracefully. The shutdown behavior depends on the upgradeMode setting in the JobSpec.

If set to last-state, the application is suspended immediately without taking a savepoint; recovery relies on the last successful checkpoint.

If set to savepoint, the operator triggers a savepoint and ensures it completes before shutting the application down. It is worth noting that all previous checkpoints are deleted once the savepoint is created during a suspend. This behavior remains unchanged even when the retention policy is configured to RETAIN_ON_CANCELLATION (execution.checkpointing.externalized-checkpoint-retention).

In both cases, the job status and state are maintained, and the next deployment recovers from the last successful checkpoint or savepoint.

From a Kubernetes resource standpoint, only the TaskManager pods are deleted. The deployment remains active, with the JobManager pod and all other ConfigMaps intact.
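In practice, a suspend is requested by changing the desired job state on the existing resource. A minimal sketch, again with an illustrative resource name (the retention key shown is the standard Flink option mentioned above):

```yaml
# Suspend the running job by patching its desired state, e.g.:
#   kubectl patch flinkdeployment example-job --type merge \
#     -p '{"spec": {"job": {"state": "suspended"}}}'
spec:
  job:
    # savepoint: take a savepoint, then shut down.
    # last-state: shut down immediately, relying on the last checkpoint.
    upgradeMode: savepoint
    state: suspended          # running -> suspended triggers the shutdown
  flinkConfiguration:
    # Retention policy for externalized checkpoints; note that during a
    # suspend, prior checkpoints were observed to be deleted after the
    # savepoint regardless of this setting.
    execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```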

Deploy

The state restore behavior of a Flink deployment depends on whether the job is being deployed as a new application or transitioning into a running state from a suspended state.

In the case of a new deployment, the initialSavepointPath JobSpec parameter dictates whether the job starts with an empty state or restores from remote state data. The application starts in NO_CLAIM mode (execution.savepoint-restore-mode: NO_CLAIM), meaning the job restores quickly from the specified checkpoint or savepoint but takes longer to create a full initial checkpoint in its configured remote location.

For a job resuming from a suspended state, the initialSavepointPath parameter is ignored, as the job already has access to its state information. It restores from the last successful checkpoint in CLAIM mode. In this mode, the application goes through all the available retained checkpoints (state.checkpoints.num-retained) before starting the tasks, and only creates an incremental initial checkpoint.
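The relevant knobs live in the deployment's flinkConfiguration. A hedged sketch with placeholder bucket paths (the keys are standard Flink options; whether the operator honors an explicitly pinned restore mode in every transition is an assumption worth verifying in your environment):

```yaml
# Illustrative flinkConfiguration excerpt; bucket paths are placeholders.
spec:
  flinkConfiguration:
    state.checkpoints.dir: s3://my-bucket/checkpoints
    state.savepoints.dir: s3://my-bucket/savepoints
    # Number of retained checkpoints; these are what a job resuming from
    # suspension walks through before starting its tasks.
    state.checkpoints.num-retained: "3"
    # Restore mode applied when starting from initialSavepointPath on a
    # new deployment; NO_CLAIM forces a full first checkpoint rather than
    # referencing the restored snapshot's files.
    execution.savepoint-restore-mode: NO_CLAIM
```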

Conclusion

Understanding the nuances of the upgradeMode settings and of deletion and suspension actions is key to managing Flink jobs in Kubernetes effectively. These actions affect job state, checkpointing, and recovery, all of which matter for maintaining data integrity and optimizing performance. With these observations in hand, users can manage their Flink deployments in Kubernetes environments with more confidence.