Automatic IKS worker replacement as a response to a certain metric (classic or VPC infrastructure)

While operating IBM Kubernetes Service (IKS) clusters on classic infrastructure, we have several times run into situations where IKS worker nodes start under-performing. The performance issues usually show up as:

Read timeouts during intensive network read operations from other applications running in our IKS cluster.
High IOWait metrics at the worker pool level, which we notice because we capture OS-level metrics on all our worker nodes.

Such events happen at least once a month in our IKS clusters.

Most of the time we work around these problems by:

Cordoning the worker node
Draining the worker node
Waiting for the auto-scaler to trigger a scale-out event
Allowing IKS to add the new worker node to the relevant worker pool (it takes around 10 minutes to spin up a new VSI)
Removing the cordoned node from the IKS cluster
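
The manual steps above could be scripted roughly as follows. This is only a minimal sketch, not something we run in production: the function names (`replacement_commands`, `replace_worker`), the `wait_for_capacity` callback, and the example node, cluster, and worker values are all hypothetical. It assumes `kubectl` and the `ibmcloud ks` CLI plugin are installed and already logged in, and that the drain flags match your `kubectl` version.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_cli(cmd: Sequence[str]) -> None:
    """Run a CLI command and raise if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


def replacement_commands(node_name: str, cluster: str, worker_id: str) -> List[List[str]]:
    """Build the cordon, drain, and remove commands for one worker node."""
    return [
        ["kubectl", "cordon", node_name],
        # Assumption: a recent kubectl; older releases use --delete-local-data.
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--delete-emptydir-data"],
        ["ibmcloud", "ks", "worker", "rm", "--cluster", cluster, "--worker", worker_id, "-f"],
    ]


def replace_worker(
    node_name: str,
    cluster: str,
    worker_id: str,
    wait_for_capacity: Callable[[], None],
    run: Runner = run_cli,
) -> None:
    """Cordon and drain the node, wait for a replacement, then remove the old worker."""
    cordon, drain, remove = replacement_commands(node_name, cluster, worker_id)
    run(cordon)
    run(drain)
    # Hypothetical hook: block until the auto-scaler has added the new worker
    # (around 10 minutes for a new VSI in our experience).
    wait_for_capacity()
    run(remove)
```

Passing the runner and the wait hook in as parameters keeps the sketch easy to dry-run, for example with `run=print`, before pointing it at a real cluster.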

It would be really useful if IKS could handle this sort of event with an auto-healing feature. An example of such a feature is how Auto Scaling groups at AWS react when an instance fails its configured health check. That feature is described at https://docs.aws.amazon.com/autoscaling/ec2/userguide/ts-as-healthchecks.html#ts-failed-status-checks
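
To make the request more concrete, the kind of metric-based health check we have in mind could look something like the sketch below. It is purely illustrative: the function name, the 30% IOWait threshold, and the five-sample window are hypothetical values, not anything IKS or AWS defines.

```python
from typing import Sequence


def is_worker_unhealthy(
    iowait_samples: Sequence[float],
    threshold_pct: float = 30.0,  # hypothetical threshold
    min_consecutive: int = 5,     # hypothetical window size
) -> bool:
    """Return True if the most recent samples all exceed the IOWait threshold.

    Requiring several consecutive samples avoids replacing a worker
    because of a single short spike.
    """
    if len(iowait_samples) < min_consecutive:
        return False
    recent = iowait_samples[-min_consecutive:]
    return all(sample > threshold_pct for sample in recent)
```

A worker flagged this way would then be a candidate for automatic replacement, in the same way an instance that fails its health check is replaced in an Auto Scaling group.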

Idea priority	High
Needed By	Quarter
