While writing another blog post, I realized that using Filebeat modules with Docker or Kubernetes is less evident than it should be. Let me shed some light here. If you’re only interested in the final solution, jump to Plan D.

Filebeat and Filebeat Modules #

What is Filebeat?

Filebeat is a lightweight shipper for forwarding and centralizing log data. Installed as an agent on your servers, Filebeat monitors the log files or locations that you specify, collects log events, and forwards them […]

What are Filebeat modules?

Filebeat modules simplify the collection, parsing, and visualization of common log formats.

Currently, there are 70 modules for web servers, databases, cloud services,… and the list grows with every release. For example, the Elasticsearch module adds the following features:

  • Set the default paths to the Elasticsearch log files, based on the operating system.
  • Collect multiline logs as a single event.
  • Add an ingest pipeline to parse the various log files.
  • It doesn’t (yet) have visualizations, dashboards, or Machine Learning jobs, but many other modules provide them out of the box.

All you need to do is enable the module with filebeat modules enable elasticsearch; a short command-line sketch follows the list below. I’m sticking to the Elasticsearch module here since it can demo the scenario with just three components:

  • Elasticsearch to generate the logs, but also to store them.
  • Filebeat to collect the logs and forward them to Elasticsearch.
  • Kibana to visualize the logs from Elasticsearch.
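
For a binary installation, enabling the module and checking the result looks roughly like this. This is only a sketch: it assumes the commands are run from the extracted Filebeat directory, and the setup step is optional since Filebeat also loads the pipelines when it starts with the Elasticsearch output:

$ # assumes the extracted Filebeat directory; the setup step is optional
$ ./filebeat modules enable elasticsearch
$ ./filebeat modules list
$ ./filebeat setup --pipelines --modules elasticsearch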

Minimal architecture to get Elasticsearch logs with Filebeat into Elasticsearch again and visualize them with Kibana

A minimal Filebeat configuration for this use-case would be:

filebeat.modules:
  - module: elasticsearch

output.elasticsearch:
  hosts: ["http://localhost:9200"]

setup.kibana:
  host: "http://localhost:5601"

This configuration would automatically collect the different log files from /var/log/elasticsearch/ (on Linux). Since 7.0, JSON log files are the new default, and the filesets map to them as follows:

  • server: *_server.json
  • gc: gc.log and gc.log.[0-9]*
  • audit: *_audit.json
  • slowlog: *_index_search_slowlog.json and *_index_indexing_slowlog.json
  • deprecation: *_deprecation.json

Here is a quick demo with the .tar.gz binaries, which have already been downloaded and extracted. First, start Elasticsearch and Kibana:

$ ls
elasticsearch/  filebeat/  kibana/
$ ./elasticsearch/bin/elasticsearch
...
$ ./kibana/bin/kibana
...

Some sample log lines from elasticsearch/logs/elasticsearch_server.json (@timestamp, cluster.name, and node.name fields removed for brevity):

{"type": "server", "level": "INFO", "component": "o.e.c.m.MetaDataMappingService", "message": "[.kibana_1/GQJ7d9pGQ_qZs36hSydGAA] update_mapping [_doc]", "cluster.uuid": "Out1VK3sSJeY-KmxhG5uvg", "node.id": "I-rG31vYQR-Ov07XB6mW-g"  }
{"type": "server", "level": "INFO", "component": "o.e.c.m.MetaDataCreateIndexService", "message": "[test] creating index, cause [api], templates [], shards [1]/[1], mappings []", "cluster.uuid": "Out1VK3sSJeY-KmxhG5uvg", "node.id": "I-rG31vYQR-Ov07XB6mW-g"  }
{"type": "server", "level": "INFO", "component": "o.e.x.i.IndexLifecycleTransition", "message": "moving index [filebeat-7.16.2-2020.04.26-000001] from [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"wait-for-follow-shard-tasks\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"pause-follower-index\"}] in policy [filebeat]", "cluster.uuid": "Out1VK3sSJeY-KmxhG5uvg", "node.id": "I-rG31vYQR-Ov07XB6mW-g"  }

Before starting Filebeat, you need to edit filebeat/filebeat.yml to enable the Elasticsearch module and point it to the custom paths of the log files. Also, connect it to Elasticsearch to forward the data and to Kibana for the setup:

filebeat.modules:
  - module: elasticsearch
    server:
      var.paths:
        - "../elasticsearch/logs/*_server.json"
    gc:
      var.paths:
        - "../elasticsearch/logs/gc.log.[0-9]*"
        - "../elasticsearch/logs/gc.log"
    slowlog:
      var.paths:
        - "../elasticsearch/logs/*_index_search_slowlog.json"
        - "../elasticsearch/logs/*_index_indexing_slowlog.json"
    deprecation:
      var.paths:
        - "../elasticsearch/logs/*_deprecation.json"

output.elasticsearch:
  hosts: ["http://localhost:9200"]

setup.kibana:
  host: "http://localhost:5601"

Now you can start Filebeat, and the output shows three sample log entries (there will be a lot more). Those are good indicators that the setup is working — harvesting the intended log files, adding the ingest pipelines, and connecting to Elasticsearch:

$ cd ./filebeat/
$ ./filebeat -e
...
INFO   Harvester started for file: /Users/philipp/Downloads/demo/elasticsearch/logs/elasticsearch_server.json
INFO   Elasticsearch pipeline with ID 'filebeat-7.16.2-elasticsearch-server-pipeline' loaded
INFO   Connection to backoff(elasticsearch(http://localhost:9200)) established
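
Besides Filebeat’s own log output, you can also verify from the command line that events are arriving in Elasticsearch. A small sketch with two standard APIs (the exact index name can differ depending on the version and ILM settings):

$ # index names may differ depending on version and ILM settings
$ curl -s 'http://localhost:9200/_cat/indices/filebeat-*?v'
$ curl -s 'http://localhost:9200/filebeat-*/_count?pretty'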

If everything works correctly, Discover in Kibana can show the collected events: Discover with the Elasticsearch module logs

And the data is correctly structured so that you can easily filter on a specific node, index, or log level. The screenshot shows one of the three sample log lines from above: An example log event from the Elasticsearch module

Adding Docker and Kubernetes to the Mix #

What changes when you move this demo to containers? The location of log files. Quoting from the Elasticsearch Docker documentation:

Log messages go to the console and are handled by the configured Docker logging driver. By default you can access logs with docker logs.

STDOUT and STDERR are the right approach for containers, but what does that mean for Filebeat modules? To keep it simple, I’m using Docker Compose for the demo.
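
For example, peeking at the console output of the Elasticsearch container is a single command (assuming the container is named elasticsearch, as in the Compose file further down):

$ # container name taken from the docker-compose.yml shown in plan D
$ docker logs --tail 3 elasticsearch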

Plan A: Mount Folders #

The default path of /var/log/elasticsearch/ won’t contain any logs. If your initial plan was to mount that folder from the Elasticsearch container to the host and then into the Filebeat container, that is not going to work.
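
To make plan A concrete, the naive mount would have looked roughly like the following Compose fragment. This is a hypothetical sketch: /usr/share/elasticsearch/logs is the log directory inside the official image, but since the image writes the server logs to the console, the shared volume never receives the JSON files the module expects:

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.2
    volumes:
      - eslogs:/usr/share/elasticsearch/logs        # hypothetical shared log volume
  filebeat:
    image: docker.elastic.co/beats/filebeat:7.16.2
    volumes:
      - eslogs:/var/log/elasticsearch:ro            # where the module looks by default
volumes:
  eslogs: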

Here is an example of docker logs from an Elasticsearch container for inspiration (@timestamp, cluster.name, and node.name fields removed for brevity):

{ "type": "server", "level": "INFO", "component": "o.e.c.m.MetaDataCreateIndexService", "message": "[filebeat-7.16.2-2020.04.26-000001] creating index, cause [api], templates [filebeat-7.16.2], shards [1]/[1], mappings [_doc]", "cluster.uuid": "F5afZGv0SW25LctEgkgwXA", "node.id": "U_Hv1fUgToCVZ2_OS964yA" }
{ "type": "server", "level": "INFO", "component": "o.e.c.m.MetaDataIndexTemplateService", "message": "adding template [.management-beats] for index patterns [.management-beats]", "cluster.uuid": "F5afZGv0SW25LctEgkgwXA", "node.id": "U_Hv1fUgToCVZ2_OS964yA" }
{ "type": "server", "level": "INFO", "component": "o.e.c.m.MetaDataMappingService", "message": "[.kibana_1/BE7x8ad1SxiToIKvC5rHOQ] update_mapping [_doc]", "cluster.uuid": "F5afZGv0SW25LctEgkgwXA", "node.id": "U_Hv1fUgToCVZ2_OS964yA" }

The Docker logs are stored in /var/lib/docker/containers/. On macOS, though, there is a VM layer in between, since Docker needs a Linux kernel to run. There you need to run the following commands to access the logs of the container <ID> (replace with the actual ID):

$ screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty
$ tail -n 3 /var/lib/docker/containers/<ID>/<ID>-json.log
{"log":"{\"type\": \"server\", \"timestamp\": \"2020-04-26T10:01:09,384Z\", \"level\": \"INFO\", \"component\": \"o.e.x.i.IndexLifecycleTransition\", \"cluster.name\": \"docker-cluster\", \"node.name\": \"elasticsearch\", \"message\": \"moving index [ilm-history-1-000001] from [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"unfollow\\\",\\\"name\\\":\\\"open-follower-index\\\"}] to [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"unfollow\\\",\\\"name\\\":\\\"wait-for-yellow-step\\\"}] in policy [ilm-history-ilm-policy]\", \"cluster.uuid\": \"F5afZGv0SW25LctEgkgwXA\", \"node.id\": \"U_Hv1fUgToCVZ2_OS964yA\"}\n","stream":"stdout","time":"2020-04-26T10:01:09.385191Z"}
{"log":"{\"type\": \"server\", \"timestamp\": \"2020-04-26T10:01:09,531Z\", \"level\": \"INFO\", \"component\": \"o.e.x.i.IndexLifecycleTransition\", \"cluster.name\": \"docker-cluster\", \"node.name\": \"elasticsearch\", \"message\": \"moving index [filebeat-7.16.2-2020.04.26-000001] from [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"unfollow\\\",\\\"name\\\":\\\"wait-for-yellow-step\\\"}] to [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"rollover\\\",\\\"name\\\":\\\"check-rollover-ready\\\"}] in policy [filebeat]\", \"cluster.uuid\": \"F5afZGv0SW25LctEgkgwXA\", \"node.id\": \"U_Hv1fUgToCVZ2_OS964yA\"}\n","stream":"stdout","time":"2020-04-26T10:01:09.5325047Z"}
{"log":"{\"type\": \"server\", \"timestamp\": \"2020-04-26T10:01:09,697Z\", \"level\": \"INFO\", \"component\": \"o.e.x.i.IndexLifecycleTransition\", \"cluster.name\": \"docker-cluster\", \"node.name\": \"elasticsearch\", \"message\": \"moving index [ilm-history-1-000001] from [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"unfollow\\\",\\\"name\\\":\\\"wait-for-yellow-step\\\"}] to [{\\\"phase\\\":\\\"hot\\\",\\\"action\\\":\\\"rollover\\\",\\\"name\\\":\\\"check-rollover-ready\\\"}] in policy [ilm-history-ilm-policy]\", \"cluster.uuid\": \"F5afZGv0SW25LctEgkgwXA\", \"node.id\": \"U_Hv1fUgToCVZ2_OS964yA\"}\n","stream":"stdout","time":"2020-04-26T10:01:09.6978308Z"}

To exit screen, press ctrl + a then k and y when asked for confirmation.

The log file looks promising. How could you use this data to get back to the result of the Filebeat module?

Plan B: Reuse the Ingest Pipelines #

Combine the Docker logs with some Filebeat features and tie the module’s ingest pipelines into them. The idea is:

  1. Collect the logs with the container input.
  2. Add the container metadata: with the add_docker_metadata processor, each log event includes the container ID, name, image, and labels from the Docker API. If you are using Kubernetes, you could enrich each log event with the add_kubernetes_metadata processor instead to get the pod, namespace,… from the Kubernetes API.
  3. On the Elasticsearch output, add ingest pipelines with conditions, so that based on the metadata you can apply the different ingest pipelines from the Filebeat module.

Putting this into practice, the first step is to fetch the names of the ingest pipelines with GET _ingest/pipeline, for example from the earlier demo before adding Docker. The relevant ones are:

{
  "filebeat-7.16.2-elasticsearch-deprecation-pipeline" : { ... },
  "filebeat-7.16.2-elasticsearch-slowlog-pipeline" : { ... },
  "filebeat-7.16.2-elasticsearch-gc-pipeline" : { ... },
  "filebeat-7.16.2-elasticsearch-server-pipeline" : { ... },
  "filebeat-7.16.2-elasticsearch-audit-pipeline" : { ... }
}
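
If you just want the pipeline IDs, a quick command-line sketch (assuming curl and jq are installed):

$ # jq is only used to print the pipeline IDs; it is not required otherwise
$ curl -s 'http://localhost:9200/_ingest/pipeline' | jq 'keys'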

Then the filebeat.yml configuration could look like the following:

filebeat.inputs:
- type: container
  paths:
    - "/var/lib/docker/containers/*/*.log"

filebeat.modules:
  - module: elasticsearch  1️⃣

processors:
  - add_docker_metadata: ~

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  pipelines:
    - pipeline: "filebeat-%{[agent.version]}-elasticsearch-server-pipeline"
      when:
        and:
          - equals:  2️⃣
              container.labels.org_label-schema_url: "https://www.elastic.co/products/elasticsearch"
          - contains:  3️⃣
              message: '"type": "server"'
    - pipeline: "filebeat-%{[agent.version]}-elasticsearch-deprecation-pipeline"
      when:
        and:
          - equals:
              container.labels.org_label-schema_url: "https://www.elastic.co/products/elasticsearch"
          - contains:
              message: '"type": "deprecation"'
    - pipeline: "filebeat-%{[agent.version]}-elasticsearch-audit-pipeline"
      when:
        and:
          - equals:
              container.labels.org_label-schema_url: "https://www.elastic.co/products/elasticsearch"
          - contains:
              message: '"type": "audit"'
    - pipeline: "filebeat-%{[agent.version]}-elasticsearch-slowlog-pipeline"
      when:
        and:
          - equals:
              container.labels.org_label-schema_url: "https://www.elastic.co/products/elasticsearch"
          - contains:  4️⃣
              message: '_slowlog'
    - pipeline: "filebeat-%{[agent.version]}-elasticsearch-gc-pipeline"
      when.equals:  5️⃣
        container.labels.org_label-schema_url: "https://www.elastic.co/products/elasticsearch"

setup.kibana:
  host: "http://kibana:5601"

1️⃣ You still need to add the Filebeat module for Elasticsearch. Otherwise, it wouldn’t create the ingest pipelines.
2️⃣ The ingest pipeline should only apply to Elasticsearch images. There are quite a few fields from add_docker_metadata to choose from, but one that should be unique and stable is container.labels.org_label-schema_url.
3️⃣ The different log files come in different formats, which makes this example one of the more complicated ones. The type field differentiates server, deprecation, and audit logs.
4️⃣ Slowlogs have multiple type values. You could have nested an additional or condition into this, but to keep it simple, the condition only does a partial match on the message.
5️⃣ The GC log isn’t JSON and doesn’t have a type field either. This condition is a fallback rule, since only the first matching pipeline is executed for every log event.

This approach makes sense for connecting custom pipelines and log formats. And while this is an excellent example for learning about multiple container-related features of Filebeat and how to combine them, it is also horrible to write and maintain. Could there be a better approach for Filebeat modules?

Plan C: Docker Templates #

Instead of collecting logs manually from a specific folder, Filebeat supports autodiscover.providers for both docker and kubernetes. It uses the default location of logs automatically — like /var/lib/docker/containers/ from the previous example.

With Docker the following metadata fields are added to every log event:

  • host
  • port
  • docker.container.id
  • docker.container.image
  • docker.container.name
  • docker.container.labels

The idea for this approach is to add a conditional template based on the metadata that applies the Filebeat modules. The complete filebeat.yml then contains:

filebeat.autodiscover.providers:
  - type: docker  1️⃣
    templates:
      - condition:
          contains:
            docker.container.image: elasticsearch  2️⃣
        config:
          - module: elasticsearch  3️⃣
            audit:
              input:
                type: container
                paths:
                  - "/var/lib/docker/containers/${data.docker.container.id}/*.log"  4️⃣
            deprecation:
              input:
                type: container
                paths:
                  - "/var/lib/docker/containers/${data.docker.container.id}/*.log"
            gc:
              input:
                type: container
                paths:
                  - "/var/lib/docker/containers/${data.docker.container.id}/*.log"
            server:
              input:
                type: container
                paths:
                  - "/var/lib/docker/containers/${data.docker.container.id}/*.log"
            slowlog:
              input:
                type: container
                paths:
                  - "/var/lib/docker/containers/${data.docker.container.id}/*.log"

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]

setup.kibana:
  host: "http://kibana:5601"

1️⃣ When using the docker provider, Filebeat must be able to access the Docker logs in their default location. For example, with Docker Compose you need to mount volumes: [ /var/lib/docker/containers:/var/lib/docker/containers:ro ] into the Filebeat container. You can use the docker-compose.yml from the final plan D here as well.
2️⃣ Add a condition for this template based on the docker.container.image metadata.
3️⃣ Enable the Elasticsearch module and apply its five different log types to the Elasticsearch container. You don’t have to enable the module with filebeat.modules any more, since this line already does that.
4️⃣ Make sure you scope the template correctly. ${data.docker.container.id} applies the module only to the logs of the container matching the condition. /var/lib/docker/containers/*/*.log would apply the module to the logs of all containers, once for every matching Elasticsearch instance.

This approach is better than the previous one, but it contains a lot of boilerplate configuration. Could this get simpler?

Plan D: Back to Filebeat Modules #

Hints are the key. Quoting from the documentation:

The hints system looks for hints in Kubernetes Pod annotations or Docker labels that have the prefix co.elastic.logs. As soon as the container starts, Filebeat checks if it contains any hints and launches the proper config for it. Hints tell Filebeat how to get logs for the given container.

The complete filebeat.yml configuration is now much shorter and cleaner:

filebeat.autodiscover.providers:
  - type: docker
    hints.enabled: true

filebeat.modules:  1️⃣
  - module: elasticsearch

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]

setup.kibana:
  host: "http://kibana:5601"

1️⃣ Don’t forget to enable the Filebeat module with this approach. Otherwise, it wouldn’t add the necessary ingest pipelines.

Now, how does this configuration apply the Filebeat module? Through Docker labels, for example in a docker-compose.yml file. It’s mostly a standard Elasticsearch and Kibana setup plus Filebeat, running as a sidecar on Docker or a DaemonSet on Kubernetes:

---
version: '3.3'
services:

  elasticsearch:
    container_name: elasticsearch
    hostname: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    environment:
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    labels:
      co.elastic.logs/module: elasticsearch  1️⃣
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - stack

  kibana:
    container_name: kibana
    hostname: kibana
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    labels:
      co.elastic.logs/module: kibana
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601
    networks:
      - stack

  filebeat:
    container_name: filebeat
    hostname: filebeat
    image: docker.elastic.co/beats/filebeat:${ELASTIC_VERSION}
    user: root
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro  2️⃣
      - /var/run/docker.sock:/var/run/docker.sock:ro  3️⃣
    command: filebeat -e
    depends_on:
      - elasticsearch
      - kibana
    restart: on-failure
    networks:
      - stack

volumes:
  esdata1:
    driver: local
networks:
  stack:

1️⃣ The co.elastic.logs/module label tells Filebeat’s autodiscovery which Filebeat module to apply to this container. It is like an inversion of control: rather than configuring the rules during collection, the container itself declares how its logs should be processed. There are additional hints you can apply, like co.elastic.logs/multiline.*, co.elastic.logs/processors, or co.elastic.logs/enabled, but only the module one is required for this scenario.
2️⃣ Mounting the container logs into the sidecar so that Filebeat can access them.
3️⃣ The Docker socket needs to be mounted so Filebeat can read the co.elastic.logs annotations. Accessing the Docker socket also requires the container to run as root.
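
Since the hints documentation also covers Kubernetes Pod annotations: on Kubernetes, with Filebeat running as a DaemonSet and the kubernetes autodiscover provider set to hints.enabled: true, the same hint would look roughly like this hypothetical Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch
  annotations:
    co.elastic.logs/module: elasticsearch   # same hint, as a Pod annotation instead of a Docker label
spec:
  containers:
    - name: elasticsearch
      image: docker.elastic.co/elasticsearch/elasticsearch:7.16.2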

Finally, a .env file manages the Elastic Stack version centrally:

ELASTIC_VERSION=7.16.2
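
With docker-compose.yml, filebeat.yml, and the .env file in place, starting the whole stack is the usual Compose workflow:

$ docker-compose up -d
$ docker-compose logs -f filebeat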

This is it. The Filebeat module applies its rules automatically to the Elasticsearch container, and the result shows up in Kibana: Filebeat module on Docker

The full example of the final plan D approach is also on GitHub.

Conclusion #

Once you know what you are looking for, this is a straightforward problem to solve. But at first, moving from log files to STDOUT looks like a challenge, as shown in my failed plan A. Also, explicit processing rules only complicate log processing as you add log formats, which plan B covered. Plan C was relatively straightforward but required a lot of boilerplate configuration — I’m not sure I would want to do this for 20 different Filebeat modules. Finally, with the change in paradigm that every container should define its own rules in plan D, the setup with Docker is not so different from log files again. After all, when creating the Docker image, you know the format, and you can roll out changes in a single place.

One final point for discussion: Adding a Docker label for logging on the container sounds like additional coupling. If many systems use a similar approach, you’ll end up with a soup of labels. That’s why they should be namespaced like co.elastic.logs/*. For me, this is an acceptable tradeoff, and I would rather have it than more boilerplate code. If you disagree, you can use the plan C approach instead.