How to Write a Query with a Regular Expression in Painless
A note to future me, since I keep forgetting — maybe it’s also useful for others — on how to write:
- a regular expression (or regex)
- in Elasticsearch’s Painless scripting language
- in a query through
runtime_mappings
Example dataset: Extract the top- and second-level domain from a string with subdomains. So how to get cnn.com
from weather.cnn.com
, for example.
Regular Expression #
A tool like regex101.com will make this easier, but plenty of alternatives exist.
A working regular expression would be ^([\w\-]+\.)*([\w\-]+\.\w+){1}$
(without going into further tweaks):
^
must match from the start of the string.(...)
capture everything enclosed and be able to reference it by ID (starting at 1).[...]
for a group of characters.\w
for any word character.\-
for a dash (which needs to be escaped).+
one or more of it.\.
a dot (which needs to be escaped).*
zero or more of it.{1}
with exactly one occurrence.$
the string must end here.
Runtime Mapping #
Before running the query, a quick test dataset:
PUT test/_doc/1
{
"name": "nba.espn.com"
}
PUT test/_doc/2
{
"name": "weather.CNN.com"
}
PUT test/_doc/3
{
"name": "cnn.com"
}
PUT test/_doc/4
{
"name": "wrong"
}
PUT test/_doc/5
{
"no-name": "cnn.com"
}
PUT test/_doc/6
{
"name": "some.more.sub-d0mains.com"
}
And then the query:
GET test/_search
{
"runtime_mappings": {
"domain": {
"type": "keyword",
"script": """
if(doc["name.keyword"].size()>0){
def domainLevel = /^([\w\-]+\.)*([\w\-]+\.\w+){1}$/.matcher(doc["name.keyword"].value);
if(domainLevel.matches()) {
emit(domainLevel.group(2));
}
}
"""
}
},
"query": {
"match_all": { }
},
"fields": ["domain"]
}
The most important parts:
"runtime_mappings"
to add the Painless script at query time."domain"
is the name of the newly created runtime field."type": "keyword"
and its data type.if(doc["name.keyword"].size()>0)
being on the safe side that the field exists./^([\w\-]+\.)*([\w\-]+\.\w+){1}$/
the unquoted regular expression between forward slashes..matcher(doc["name.keyword"].value
matching on the value of the indexedkeyword
(rather than atext
field).if(domainLevel.matches())
if there is a match.emit(domainLevel.group(2))
emitting the second group of the matched regular expression."match_all"
searching across all documents."fields": ["domain"]
explicitly includes the extracted runtime field in the search results, which wouldn’t be the case otherwise.
The result (shortened for easier readability) is then:
{
"_source" : {
"name" : "nba.espn.com"
},
"fields" : {
"domain" : [
"espn.com"
]
}
},
{
"_source" : {
"name" : "weather.CNN.com"
},
"fields" : {
"domain" : [
"CNN.com"
]
}
},
{
"_source" : {
"name" : "cnn.com"
},
"fields" : {
"domain" : [
"cnn.com"
]
}
},
{
"_source" : {
"name" : "wrong"
}
},
{
"_source" : {
"no-name" : "cnn.com"
}
},
{
"_source" : {
"name" : "some.more.sub-d0mains.com"
},
"fields" : {
"domain" : [
"sub-d0mains.com"
]
}
}
PS: Remember, the plural of regex is regret. Use with caution 🫠