A note to future me, since I keep forgetting — maybe it’s also useful for others — on how to write:

  • a regular expression (or regex)
  • in Elasticsearch’s Painless scripting language
  • in a query through runtime_mappings

Example task: extract the second-level and top-level domain from a hostname with subdomains. So how to get cnn.com from weather.cnn.com, for example.

Regular Expression

A tool like regex101.com will make this easier, but plenty of alternatives exist.

A working regular expression would be ^([\w\-]+\.)*([\w\-]+\.\w+){1}$ (without going into further tweaks):

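  • ^ the match must start at the beginning of the string.
  • (...) a capture group: everything enclosed is captured and can be referenced by its number (starting at 1).
  • [...] a character class: any one of the enclosed characters.
  • \w any word character (letter, digit, or underscore).
  • \- a dash (escaped so that it is never read as a range inside the character class).
  • + one or more of the preceding element.
  • \. a literal dot (which needs to be escaped, since an unescaped dot matches any character).
  • * zero or more of the preceding element.
  • {1} exactly one occurrence (redundant, since that is the default, but it makes the intent explicit).
  • $ the match must end at the end of the string.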

Example from regex101.com
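
If you prefer checking the pattern in code rather than in a browser, here is a minimal sketch in Python. It is not part of the Elasticsearch setup: the function name extract_domain is made up for illustration, and re.ASCII is an assumption to keep \w close to the default ASCII behavior of Java's regex engine, which Painless builds on. fullmatch() plays the role that matches() has in Painless, since both require the whole string to match.

```python
import re

# Same pattern as in the post. re.ASCII limits \w to [A-Za-z0-9_]
# (an assumption to stay close to Java's default behavior).
DOMAIN_PATTERN = re.compile(r"^([\w\-]+\.)*([\w\-]+\.\w+){1}$", re.ASCII)


def extract_domain(hostname):
    """Return the second- plus top-level domain, or None without a match."""
    match = DOMAIN_PATTERN.fullmatch(hostname)
    # Group 2 is the last "label.tld" pair; group 1 only holds a subdomain.
    return match.group(2) if match else None


if __name__ == "__main__":
    for hostname in ["weather.cnn.com", "cnn.com", "wrong"]:
        print(hostname, "->", extract_domain(hostname))
```

Because the first group is repeated, it only keeps its last repetition, which is why the script later reads group 2 and ignores group 1.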

Runtime Mapping

Before running the query, a quick test dataset:

PUT test/_doc/1
{
  "name": "nba.espn.com"
}
PUT test/_doc/2
{
  "name": "weather.CNN.com"
}
PUT test/_doc/3
{
  "name": "cnn.com"
}
PUT test/_doc/4
{
  "name": "wrong"
}
PUT test/_doc/5
{
  "no-name": "cnn.com"
}
PUT test/_doc/6
{
  "name": "some.more.sub-d0mains.com"
}

And then the query:

GET test/_search
{
  "runtime_mappings": {
    "domain": {
      "type": "keyword",
      "script": """
        if(doc["name.keyword"].size()>0){
          def domainLevel = /^([\w\-]+\.)*([\w\-]+\.\w+){1}$/.matcher(doc["name.keyword"].value);
          if(domainLevel.matches()) {
            emit(domainLevel.group(2));
          }
        }
      """
    }
  },
  "query": {
    "match_all": { }
  },
  "fields": ["domain"]
}

The most important parts:

  • "runtime_mappings" to add the Painless script at query time.
  • "domain" is the name of the newly created runtime field.
  • "type": "keyword" sets its data type.
  • if(doc["name.keyword"].size()>0) skips documents that have no value for the field, since reading .value on them would fail.
  • /^([\w\-]+\.)*([\w\-]+\.\w+){1}$/ the unquoted regular expression between forward slashes.
  • .matcher(doc["name.keyword"].value) runs the pattern against the value of the keyword sub-field, which dynamic mapping creates for strings by default (rather than the analyzed text field).
  • if(domainLevel.matches()) continues only if the entire value matches the pattern.
  • emit(domainLevel.group(2)) emits the second capture group of the match.
  • "match_all" searching across all documents.
  • "fields": ["domain"] explicitly includes the extracted runtime field in the search results, which wouldn’t be the case otherwise.
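
To make the control flow of the script easier to follow, here is a rough Python sketch that mimics it on the six test documents. This is only an illustration, not how Elasticsearch evaluates runtime fields: DOCS and runtime_domain are hypothetical names, a plain dictionary lookup stands in for doc["name.keyword"], and re.ASCII is an assumption to approximate Java's default \w.

```python
import re

DOMAIN_PATTERN = re.compile(r"^([\w\-]+\.)*([\w\-]+\.\w+){1}$", re.ASCII)

# The six test documents, as plain dictionaries (illustrative stand-in).
DOCS = [
    {"name": "nba.espn.com"},
    {"name": "weather.CNN.com"},
    {"name": "cnn.com"},
    {"name": "wrong"},
    {"no-name": "cnn.com"},
    {"name": "some.more.sub-d0mains.com"},
]


def runtime_domain(doc):
    """Return the values the script would emit for one document."""
    emitted = []
    if "name" in doc:  # like doc["name.keyword"].size() > 0
        match = DOMAIN_PATTERN.fullmatch(doc["name"])
        if match:  # like domainLevel.matches()
            emitted.append(match.group(2))  # like emit(domainLevel.group(2))
    return emitted


if __name__ == "__main__":
    for doc in DOCS:
        print(doc, "->", runtime_domain(doc))
```

Documents without a name, or with a name that does not match, end up with an empty list, which corresponds to a hit without a "fields" entry.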

The result (shortened for easier readability) is then:

{
  "_source" : {
    "name" : "nba.espn.com"
  },
  "fields" : {
    "domain" : [
      "espn.com"
    ]
  }
},
{
  "_source" : {
    "name" : "weather.CNN.com"
  },
  "fields" : {
    "domain" : [
      "CNN.com"
    ]
  }
},
{
  "_source" : {
    "name" : "cnn.com"
  },
  "fields" : {
    "domain" : [
      "cnn.com"
    ]
  }
},
{
  "_source" : {
    "name" : "wrong"
  }
},
{
  "_source" : {
    "no-name" : "cnn.com"
  }
},
{
  "_source" : {
    "name" : "some.more.sub-d0mains.com"
  },
  "fields" : {
    "domain" : [
      "sub-d0mains.com"
    ]
  }
}

Output in Kibana’s Console

PS: Remember, the plural of regex is regret. Use with caution 🫠