
What is the Key to Keeping Containers Healthy in the Cloud? ☁️

24/01/23 10 min. read

Have you ever wondered how the big companies provide continuous service in the cloud when they never stop releasing new versions of their software 🤔 (Banco Santander is one of them, with our Hybrid Cloud).

How do we ensure that our customers have uninterrupted access while we continuously improve?

How do you continue to provide service if a container fails?

Who makes sure everything is up and running?

If you are wondering about this, apart from learning how to do complex deployments without loss of service, read on and you will find out 👇👇 :

  • What the secret is.
  • How to apply it on your OpenShift or your Kubernetes (even on your local Kubernetes).
  • And how to do it for services with a simple HTTP server or a more complex service.

What’s the key: check that it works!

To keep something working, there is nothing as simple as periodically checking that it is working 😊

It seems trivial, but you can only provide service if the software is available to be used, i.e. the container where it runs is working correctly.

But how do you verify that all your containers are running?

Very simple: with the liveness probe and the readiness probe.

What is the Liveness probe?

Container managers are able to check periodically whether containers are “alive”, i.e. a test can be run to check that they are responding correctly, and even that their dependencies are also responding properly.

This can be automated so that the container manager takes action when a container does not respond, or responds that something is wrong; in that case it can, for example, restart the failing container. That is the liveness probe.

What is the Readiness probe?

As for the readiness probe, it can be the same test, but it is used to know whether the container is up yet, that is to say, whether it can already provide service and be used.

When is it used?

Well, for example, when a new version is deployed.

Until the new version is ready, it is not allowed to start providing service, and of course, previous versions are not destroyed.
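
This is exactly what, for example, a Kubernetes Deployment relies on during a rolling update: old pods are only replaced once the new ones pass their readiness probe. A minimal sketch, with a hypothetical my-app name and image, and the /actuator/health endpoint used later in this post:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # never remove an old pod before its replacement is Ready
      maxSurge: 1 # create at most one extra pod during the rollout
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:2.0 # hypothetical image
          readinessProbe:
            httpGet:
              path: /actuator/health
              port: 8080

With maxUnavailable: 0, if the new version never becomes ready, the rollout simply stalls instead of degrading the service.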

At Santander Digital Services it is widely used so that a version of the software that does not start correctly does not go into production, and once it is running, in the event of a crash, the service can be recovered quickly.

And what does a liveness/readiness probe return?

It can give more or less detail… it can be as simple as a console command returning exit code 0, or a JSON with complete information about the state of the container and its dependencies, with a simple “UP” or with much more information, depending on what is needed.

Usually, with the “UP” we can determine if it works correctly or not and act accordingly.

Here are some examples of an HTTP readiness test:

  • Only whether it is up… e.g. for unauthenticated users:

{"status": "UP"}

  • With details of the components:

{
  "status": "UP",
  "components": {
    "diskSpace": {
      "status": "UP"
    },
    "livenessState": {
      "status": "UP"
    },
    "mongo": {
      "status": "UP"
    },
    "ping": {
      "status": "UP"
    },
    "readinessState": {
      "status": "UP"
    },
    "refreshScope": {
      "status": "UP"
    }
  },
  "groups": ["liveness", "readiness"]
}

How is the container manager configured?

Well, it depends on the container manager and on how we want the check to be performed; it is defined in the deployment configuration of that container.

One of the simplest ways is usually verification through an HTTP call but, as we have said, it can also be the output of a command or any other check.

As you will see in the examples, it is possible to configure:

  • How long to wait before the first test (initialDelaySeconds)
  • How long to wait for a response (timeoutSeconds)
  • How often to run the test (periodSeconds)
  • How many consecutive failures count as a real failure (failureThreshold); a single check can fail occasionally because of a CPU spike, for example
  • How many consecutive successes restore the “works properly” state (successThreshold)

Below are examples of the container configuration for both the liveness and the readiness check in the most widely used container managers, OpenShift and Kubernetes.

At Santander Digital Services we have both infrastructures and this is how we use them:

OpenShift

containers:
  - readinessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 150
      timeoutSeconds: 10
      periodSeconds: 30
      successThreshold: 1
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 200
      timeoutSeconds: 10
      periodSeconds: 30
      successThreshold: 1
      failureThreshold: 3

Many other examples can be found in the OpenShift documentation.
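
Besides these two, newer versions of Kubernetes (and therefore OpenShift) also offer a third probe, the startup probe, for slow-starting containers: instead of a long initialDelaySeconds, the container gets a budget of attempts before the liveness and readiness probes take over. A minimal sketch, reusing the same endpoint as above:

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
    scheme: HTTP
  periodSeconds: 10
  failureThreshold: 30 # up to 30 × 10 s = 300 s for the container to start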

Kubernetes

The configuration is practically identical; in this case, the example we show checks for the existence of a file (the probe succeeds if the command runs and the file can be read).

containers:
  - name: liveness
[…]
    livenessProbe:
      exec:
        command:
          - cat
          - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

More examples and configurations can be found in the Kubernetes documentation.

How can I prepare my container for it? 🧐

The container manager will query something from our container (a command, a TCP port, the result of an HTTP request), so we must prepare our software or our container to answer when it is asked…

It can be done in the container itself when creating it: run a command at startup that generates, for example, a file (as in the previous example), so that the file is available while the container is active.

For example, to match the previous Kubernetes example, we can create the file in the container definition:

spec:
  containers:
    - name: liveness
      image: registry.k8s.io/busybox
      args:
        - /bin/sh
        - -c
        - touch /tmp/healthy && sleep 3600 # create the file, then keep the container running

Or you can create an .html file with the “UP” content that the HTTP request expects to receive (and then probe /healthy.html).

spec:
  containers:
    - name: liveness
      image: nginx:latest
      args:
        - /bin/sh
        - -c
        - |
          # write the health file, then start nginx in the foreground
          echo '{"status": "UP"}' > /usr/share/nginx/html/healthy.html
          exec nginx -g 'daemon off;'
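
And the corresponding probe pointing at that file; a sketch, assuming nginx listens on its default port 80:

readinessProbe:
  httpGet:
    path: /healthy.html
    port: 80 # nginx's default port (assumption)
    scheme: HTTP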

What if I only have pre-created images? 🤔

If we cannot create our own container (due to company restrictions, for example), we can always create our own health check in our software.

At Santander Digital Services we use this option to maximise the security of the containers.

You can always configure the container manager to verify that the port through which we provide service is responding but, if we want something more, we can return a more elaborate health check.
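
For that port-only case, both Kubernetes and OpenShift support a TCP probe natively; a minimal sketch, assuming the service listens on port 8080:

livenessProbe:
  tcpSocket:
    port: 8080 # the port through which we provide service (assumption)
  initialDelaySeconds: 15
  periodSeconds: 20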

Below we show some examples for various cases; they are the ones we use at Santander Digital Services to ensure the proper functioning of our containers…

Web server (Netty/Apache serving HTML, Angular, React, Vue…)

It’s as simple as creating an .html file with this simple content:

health.html:

{"status": "UP"}

So if you ask for /health.html you can verify that the service is up and serving the files.
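
The container manager’s check is then equivalent to this manual request (host and port are illustrative):

$ curl http://127.0.0.1:8080/health.html
{"status": "UP"}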

Java Spring Boot

If we have a Spring Boot application, we are in luck: just by adding the Actuator dependency, we already have a health endpoint.

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

This dependency allows much more than just health but here we will focus on that functionality.

The default endpoint for verification is /actuator/health. Example container manager configuration:

livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
    scheme: HTTP

This dependency can be configured to rename the endpoint, among other things. For example, this configuration changes the path to /vrfy/health, exposes only the health endpoint, and keeps it restrictive, with as little information as possible:

management:
  context-path: /vrfy # change the endpoint to /vrfy/health instead of the default
  security:
    enabled: true
  endpoints:
    web:
      base-path: /vrfy # change the endpoint to /vrfy/health instead of the default
      exposure:
        include: 'health' # expose only the health endpoint
    health:
      sensitive: false # do not treat the endpoint as sensitive
  endpoint:
    health:
      show-details: never # do not show the details
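
With that configuration, a manual check would look something like this (host and port are illustrative):

$ curl http://127.0.0.1:8080/vrfy/health
{"status": "UP"}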

Any other software

In any other type of software, we can create an HTTP endpoint that responds (in the same way as the Java example in the previous point), open a TCP port, register a heartbeat in a database that can be consulted, or use any other mechanism that allows the container manager to know that the microservice is still “alive”.

When is my process killed? 👇👇

Depending on how we have configured the container manager (number of attempts, timeout…), when the conditions are met, the container manager will destroy the container and create a new one.

If this happens, you have to check the events, which usually identify that the health check has failed and how many times.
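
In Kubernetes, for example, they can be inspected with kubectl (the pod name is hypothetical and the output is abbreviated):

$ kubectl describe pod my-app-7d9f8
[…]
Events:
  Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    Container my-app failed liveness probe, will be restarted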

It should be borne in mind that, depending on how it has been configured, what is actually down may be a dependency of the pod (for example, the database, or the pod has run out of disk space), and this will produce a DOWN status.

Example of a pod DOWN due to a dependency:

$ curl http://127.0.0.1:8080/actuator/health

{
  "status": "DOWN",
  "components": {
    "diskSpace": {
      "status": "UP"
    },
    "livenessState": {
      "status": "UP"
    },
    "mongo": {
      "status": "DOWN"
    },
    "ping": {
      "status": "UP"
    },
    "readinessState": {
      "status": "UP"
    },
    "refreshScope": {
      "status": "UP"
    }
  },
  "groups": ["liveness", "readiness"]
}

What else can the liveness probe be used for?

It can be used for many more things! 🙌

For example, if you have several clusters (OpenShift or Kubernetes) where your containers are running, you usually have a load balancer in front of them to distribute the load.

The load balancer can be configured to check whether the services of each cluster are responding and, if for any reason they do not respond, i.e. they do not provide service, to stop sending traffic to that cluster.

At Santander Digital Services we use this capability of the balancers to avoid traffic to a cluster that is not going to return results and avoid unnecessary errors.

It can also be used, when the health endpoint offers detailed metrics, to check or analyse container load in real time (disk, memory, CPU…).

To conclude 👐

Whether we use them so that our service does not lose availability while we deploy new versions, or to restart containers when they stop providing service, both the liveness probe and the readiness probe allow us to keep our cloud healthy through the health of our containers.

At Santander Digital Services, this is one of the ways we have to ensure that our cloud services remain available.

Santander Digital Services is a company belonging to Grupo Santander, which is based in Madrid and has almost 5,000 employees. We are working to move Santander towards a Digital Bank with branches.

Take a look at the current vacancies here and join this amazing team and Be Tech! with Santander 🚀

Follow us on LinkedIn and Instagram.

Rubén Rodríguez Martín

Santander

Computer Engineer, programming since I was a child (10 years old!), internet user and administrator since ’96, specialist in software development, systems, networks, security, home automation, complex deployments, PaaS… and a big fan of new technologies, process optimisation, innovation, science fiction and much more!

👉 My LinkedIn profile