How Does Elasticsearch Store a float Value into an integer Field?
Today there was a Discuss post on “Elasticsearch data type” that demonstrates one of the more confusing features in Elasticsearch. If you are familiar with Elasticsearch, it is an excellent puzzle, so follow along and test your knowledge.
First, add a document:
PUT users/_doc/1
{
  "user_id": 1
}
This index uses a dynamic mapping, so which data type does it default to for the user_id field?
Default Numeric Type #
GET users/_mapping shows the answer:
{
  "users" : {
    "mappings" : {
      "properties" : {
        "user_id" : {
          "type" : "long"
        }
      }
    }
  }
}
So your user_id field is a long. Next, you try to add four more documents:
PUT users/_doc/2
{
  "user_id" : 2
}

PUT users/_doc/3
{
  "user_id" : "3"
}

PUT users/_doc/4
{
  "user_id" : 4.5
}

PUT users/_doc/5
{
  "user_id" : "5.1"
}
The document with ID 2 is the same as our first one, so that will work. But what happens if you try to store a string, a float, or even a stringified float value into a long field?
Handling Dirty Data #
They all still work. But why?
By default, Elasticsearch will coerce data to clean it up. Quoting from its documentation:
Coercion attempts to clean up dirty values to fit the datatype of a field. For instance:
- Strings will be coerced to numbers.
- Floating points will be truncated for integer values.
Especially for quoted numbers, this makes sense, since some systems err on the side of quoting too much rather than too little. Perl used to be one of the well-known offenders there, and coercing would helpfully clean this up.
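If you would rather have Elasticsearch reject such dirty values instead of cleaning them up, you can disable coercion per field. As a minimal sketch (the index name users_strict is just an example for illustration): with the mapping parameter coerce set to false, indexing 4.5 into the long field should fail with a mapping exception instead of being silently truncated.

PUT users_strict
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "long",
        "coerce": false
      }
    }
  }
}

PUT users_strict/_doc/1
{
  "user_id": 4.5
}

There is also an index-level setting, index.mapping.coerce, if you want to disable coercion for all fields of an index at once.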
Sounds reasonable. You want to verify this by retrieving the documents with GET users/_search and expect the user_id values 1, 2, 3, 4, and 5, right? But this is what you actually get (focus on the array in hits.hits):
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 1
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 2
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "user_id" : "3"
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "user_id" : 4.5
        }
      },
      {
        "_index" : "users",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "user_id" : "5.1"
        }
      }
    ]
  }
}
Is this a bug? How could you store 4.5 in a long? If you recheck the mapping with GET users/_mapping, it is still returning "type": "long".
_source Is Only an Illusion #
The final piece in this puzzle is that Elasticsearch never changes the _source. But the indexed value of user_id is a long, as you would expect. You can verify this by running an aggregation on the field:
GET users/_search
{
  "size": 0,
  "aggs": {
    "my_sum": {
      "sum": {
        "field": "user_id"
      }
    }
  }
}
Which gives the correct result for 1 + 2 + 3 + 4 + 5 = 15:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_sum" : {
      "value" : 15.0
    }
  }
}
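You can also see the truncation directly in a search. As a sketch, and assuming coercion stored the truncated long at index time, a term query for 4 should match the document we indexed with 4.5, while the _source in the hit still shows 4.5:

GET users/_search
{
  "query": {
    "term": {
      "user_id": 4
    }
  }
}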
The aggregation, by the way, defaults to a floating-point representation for value. You could change that with the extra parameter "format": "0", which would add "value_as_string" : "15" to the result; a quick sketch follows.
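For completeness, this is what that could look like (the format parameter of the sum aggregation takes a numeric pattern; "0" is just one example):

GET users/_search
{
  "size": 0,
  "aggs": {
    "my_sum": {
      "sum": {
        "field": "user_id",
        "format": "0"
      }
    }
  }
}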
Conclusion #
I hope you are less confused than before, or at least enjoyed the puzzle. As a parting note, be aware that coerce might be removed in the future since it is a trappy feature, especially around truncating floating-point numbers 😄.