How to Write a Query with a Regular Expression in Painless

A note to future me, since I keep forgetting — maybe it’s also useful for others — on how to write:

a regular expression (or regex)
in Elasticsearch’s Painless scripting language
in a query through runtime_mappings

Example dataset: Extract the top- and second-level domain from a string with subdomains. So how to get cnn.com from weather.cnn.com, for example.

Regular Expression #

A tool like regex101.com will make this easier, but plenty of alternatives exist.

A working regular expression would be ^([\w\-]+\.)*([\w\-]+\.\w+){1}$ (without going into further tweaks):

^ must match from the start of the string.
(...) capture everything enclosed and be able to reference it by ID (starting at 1).
[...] for a group of characters.
\w for any word character.
\- for a dash (which needs to be escaped).
+ one or more of it.
\. a dot (which needs to be escaped).
* zero or more of it.
{1} with exactly one occurrence.
$ the string must end here.

Example from regex101.com

Runtime Mapping #

Before running the query, a quick test dataset:

PUT test/_doc/1
{
  "name": "nba.espn.com"
}
PUT test/_doc/2
{
  "name": "weather.CNN.com"
}
PUT test/_doc/3
{
  "name": "cnn.com"
}
PUT test/_doc/4
{
  "name": "wrong"
}
PUT test/_doc/5
{
  "no-name": "cnn.com"
}
PUT test/_doc/6
{
  "name": "some.more.sub-d0mains.com"
}

And then the query:

GET test/_search
{
  "runtime_mappings": {
    "domain": {
      "type": "keyword",
      "script": """
        if(doc["name.keyword"].size()>0){
          def domainLevel = /^([\w\-]+\.)*([\w\-]+\.\w+){1}$/.matcher(doc["name.keyword"].value);
          if(domainLevel.matches()) {
            emit(domainLevel.group(2));
          }
        }
      """
    }
  },
  "query": {
    "match_all": { }
  },
  "fields": ["domain"]
}

The most important parts:

"runtime_mappings" to add the Painless script at query time.
"domain" is the name of the newly created runtime field.
"type": "keyword" and its data type.
if(doc["name.keyword"].size()>0) being on the safe side that the field exists.
/^([\w\-]+\.)*([\w\-]+\.\w+){1}$/ the unquoted regular expression between forward slashes.
.matcher(doc["name.keyword"].value matching on the value of the indexed keyword (rather than a text field).
if(domainLevel.matches()) if there is a match.
emit(domainLevel.group(2)) emitting the second group of the matched regular expression.
"match_all" searching across all documents.
"fields": ["domain"] explicitly includes the extracted runtime field in the search results, which wouldn’t be the case otherwise.

The result (shortened for easier readability) is then:

{
  "_source" : {
    "name" : "nba.espn.com"
  },
  "fields" : {
    "domain" : [
      "espn.com"
    ]
  }
},
{
  "_source" : {
    "name" : "weather.CNN.com"
  },
  "fields" : {
    "domain" : [
      "CNN.com"
    ]
  }
},
{
  "_source" : {
    "name" : "cnn.com"
  },
  "fields" : {
    "domain" : [
      "cnn.com"
    ]
  }
},
{
  "_source" : {
    "name" : "wrong"
  }
},
{
  "_source" : {
    "no-name" : "cnn.com"
  }
},
{
  "_source" : {
    "name" : "some.more.sub-d0mains.com"
  },
  "fields" : {
    "domain" : [
      "sub-d0mains.com"
    ]
  }
}

Output in Kibana’s Console

PS: Remember, the plural of regex is regret. Use with caution 🫠