Elasticsearch Aggregation (Part 5)

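Metrics aggregations

The aggregations in this family compute metrics based on values extracted in one way or another from the documents being aggregated. The values are typically extracted from document fields (using field data), but they can also be generated by scripts.

Numeric metrics aggregations are a special type of metrics aggregation that output numeric values. Some output a single numeric metric (for example avg) and are called single-value numeric metrics aggregations; others generate multiple metrics (for example stats) and are called multi-value numeric metrics aggregations. The distinction matters when these aggregations serve as direct sub-aggregations of certain bucket aggregations, because some bucket aggregations let you sort the returned buckets by a numeric metric in each bucket.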

avg

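A single-value metrics aggregation that computes the average of the numeric values extracted from the aggregated documents. These values can be extracted from specific fields in the documents.

Missing value

The missing parameter defines a default value for documents that have no value in the field. By default, when missing is not specified, documents without a value are ignored.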

curl -X POST "localhost:9200/exams/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "grade_avg": {
      "avg": {
        "field": "grade",
        "missing": 10     
      }
    }
  }
}
'
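
To make the effect of missing concrete, here is a minimal Python sketch of the arithmetic only; it does not talk to Elasticsearch. The sample documents and the helper avg_with_missing are hypothetical and exist just for illustration.

```python
def avg_with_missing(docs, field, missing=None):
    """Average `field` over docs. Docs without the field are skipped
    unless a `missing` default is supplied, in which case they count
    as if they held that default."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)
    return sum(values) / len(values) if values else None

# Hypothetical exam documents; the third has no grade.
exams = [{"grade": 90}, {"grade": 70}, {}]

ignored = avg_with_missing(exams, "grade")        # (90 + 70) / 2
defaulted = avg_with_missing(exams, "grade", 10)  # (90 + 70 + 10) / 3
```

With the default behavior the third document does not affect the result; with missing set to 10 it pulls the average down.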

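Histogram fields

When avg is computed on a histogram field, the result is the weighted average of all elements in the values array, where each element is weighted by the number at the same position in the counts array.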

curl -X PUT "localhost:9200/metrics_index/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "network.name" : "net-1",
  "latency_histo" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}
'
curl -X PUT "localhost:9200/metrics_index/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "network.name" : "net-2",
  "latency_histo" : {
      "values" :  [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [8, 17, 8, 7, 6] 
   }
}
'
curl -X POST "localhost:9200/metrics_index/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "avg_latency":
      { "avg": { "field": "latency_histo" }
    }
  }
}
'

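For each histogram field, the avg aggregation multiplies each number in the values array by the count at the same index in the counts array and adds up the products. It then divides the total over all histograms by the total of all counts, and returns the following result: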

{
  ...
  "aggregations": {
    "avg_latency": {
      "value": 0.29690721649
    }
  }
}
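
The result can be reproduced by hand. The following Python sketch repeats the weighted-average arithmetic for the two documents indexed above; it is only a check of the numbers, not how Elasticsearch computes them internally.

```python
# The two latency_histo fields indexed in the example above.
histograms = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [8, 17, 8, 7, 6]},
]

# Sum of value * count over every bucket of every histogram.
weighted_sum = sum(
    v * c for h in histograms for v, c in zip(h["values"], h["counts"])
)
# Total number of observations represented by the histograms.
total_count = sum(c for h in histograms for c in h["counts"])

avg_latency = weighted_sum / total_count
```

Here weighted_sum is 16.4 + 12.4 = 28.8 and total_count is 51 + 46 = 97, which gives the 0.2969... value shown in the response.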

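Cardinality aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values, also known as the cardinality of a field. For example, if a gender field only ever holds the values male and female, its cardinality is 2.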

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "cardinality_test": {
      "nested": {
        "path": "address"
      }, 
      "aggs": {
        "cardinality_test": {
          "cardinality": {
            "field": "address.city"
          }
        }
      }
    }
  }
}'

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cardinality_test" : {
      "doc_count" : 3,
      "cardinality_test" : {
        "value" : 2
      }
    }
  }
}
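
As a point of comparison, the sketch below counts distinct values exactly with a Python set. The documents and city names are hypothetical, chosen only so that three documents yield two distinct cities as in the response above. Elasticsearch itself does not keep such a set; it uses the HyperLogLog++ algorithm, which is why the result is approximate for high-cardinality fields.

```python
# Hypothetical documents with a nested "address" field.
docs = [
    {"address": [{"city": "Beijing"}]},
    {"address": [{"city": "Shanghai"}]},
    {"address": [{"city": "Beijing"}]},
]

# Exact distinct count: collect every city into a set.
distinct_cities = {addr["city"] for doc in docs for addr in doc["address"]}
exact_cardinality = len(distinct_cities)
```

An exact set grows with the number of distinct values, which is the memory cost that the approximate approach avoids.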

Precision control

This aggregation also supports the precision_threshold option:

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "type_count": {
      "cardinality": {
        "field": "type",
        "precision_threshold": 100 
      }
    }
  }
}
'

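Note: the precision_threshold option lets you trade memory for accuracy. It defines a distinct count below which the results are expected to be close to accurate; above it, the counts may become fuzzier. The maximum supported value is 40000, and thresholds above that number have the same effect as a threshold of 40000. The default value is 3000.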

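Extended stats aggregation

A multi-value metrics aggregation that computes statistics over the numeric values extracted from the aggregated documents.

extended_stats is an extension of the stats aggregation that adds further metrics such as sum_of_squares, variance, std_deviation, and std_deviation_bounds.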

curl -XGET "http://localhost:9200/my-index-000001/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
   "aggs": {
     "extended_stats_test": {
       "extended_stats": {
         "field": "age"
       }
     }
   }
}'

Response:

{
  "took" : 130,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "extended_stats_test" : {
      "count" : 3,
      "min" : 18.0,
      "max" : 30.0,
      "avg" : 22.666666666666668,
      "sum" : 68.0,
      "sum_of_squares" : 1624.0,
      "variance" : 27.555555555555582,
      "variance_population" : 27.555555555555582,
      "variance_sampling" : 41.33333333333337,
      "std_deviation" : 5.249338582674543,
      "std_deviation_population" : 5.249338582674543,
      "std_deviation_sampling" : 6.42910050732864,
      "std_deviation_bounds" : {
        "upper" : 33.16534383201575,
        "lower" : 12.167989501317582,
        "upper_population" : 33.16534383201575,
        "lower_population" : 12.167989501317582,
        "upper_sampling" : 35.52486768132395,
        "lower_sampling" : 9.808465652009389
      }
    }
  }
}
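
The figures in this response can be checked by hand. The sketch below assumes the three indexed ages were 18, 20, and 30; the source does not list them, but this is the set consistent with count 3, min 18, max 30, sum 68, and sum_of_squares 1624. It also assumes the default sigma of 2 for std_deviation_bounds.

```python
import math

ages = [18, 20, 30]  # assumed values, inferred from the response
n = len(ages)

total = sum(ages)
avg = total / n
sum_of_squares = sum(a * a for a in ages)

# Population variance divides by n; sampling variance by n - 1.
variance_population = sum_of_squares / n - avg ** 2
variance_sampling = variance_population * n / (n - 1)

std_deviation_population = math.sqrt(variance_population)
std_deviation_sampling = math.sqrt(variance_sampling)

sigma = 2  # assumed default number of standard deviations
upper_population = avg + sigma * std_deviation_population
lower_population = avg - sigma * std_deviation_population
```

Note that in the response the unqualified variance, std_deviation, upper, and lower fields carry the same values as their _population counterparts.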

Geo-bounds aggregation

Computes the bounding box that contains all geo values of a field. For example:

curl -X PUT "localhost:9200/museums?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
'
curl -X POST "localhost:9200/museums/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d\u0027Orsay"}
'
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "name": "musée" }
  },
  "aggs": {
    "viewport": {
      "geo_bounds": {
        "field": "location",    
        "wrap_longitude": true  
      }
    }
  }
}
'

The example above computes the bounding box that contains the location of every document whose name matches musée. The response is:

{
  ...
  "aggregations": {
    "viewport": {
      "bounds": {
        "top_left": {
          "lat": 48.86111099738628,
          "lon": 2.3269999679178
        },
        "bottom_right": {
          "lat": 48.85999997612089,
          "lon": 2.3363889567553997
        }
      }
    }
  }
}
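
A bounding box is just the extreme latitudes and longitudes of the matched points. The Python sketch below reproduces that for the two documents matching musée; it is a simplification that ignores the wrap_longitude handling around the international date line, and the variable names are illustrative.

```python
# (lat, lon) of the two documents whose name matches "musée".
points = [
    (48.861111, 2.336389),  # Musée du Louvre
    (48.860000, 2.327000),  # Musée d'Orsay
]

lats = [lat for lat, _ in points]
lons = [lon for _, lon in points]

# Top-left is the northernmost latitude and westernmost longitude;
# bottom-right is the southernmost latitude and easternmost longitude.
top_left = {"lat": max(lats), "lon": min(lons)}
bottom_right = {"lat": min(lats), "lon": max(lons)}
```

The values in the actual response differ in the last decimal places because geo_point coordinates are stored with limited precision.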

Geo-bounds aggregation on `geo_shape` fields

This aggregation can also run on `geo_shape` fields.

Geo-centroid aggregation

Computes the weighted centroid of all coordinate values of a geo field.

Example:

curl -X PUT "localhost:9200/museums?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}
'
curl -X POST "localhost:9200/museums/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d\u0027Orsay"}
'
curl -X POST "localhost:9200/museums/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "centroid": {
      "geo_centroid": {
        "field": "location" 
      }
    }
  }
}
'

The aggregation above shows how to compute the centroid of the location field over all museum documents.

Response:

{
  ...
  "aggregations": {
    "centroid": {
      "location": {
        "lat": 51.00982965203002,
        "lon": 3.9662131341174245
      },
      "count": 6
    }
  }
}
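
For this data set the centroid can be approximated by averaging the latitudes and longitudes of the six indexed points. The Python sketch below does exactly that as a sanity check; it is not a description of the internal implementation.

```python
# (lat, lon) pairs of the six museums indexed above.
museums = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]

count = len(museums)
centroid_lat = sum(lat for lat, _ in museums) / count
centroid_lon = sum(lon for _, lon in museums) / count
```

The averages agree with the response to about seven decimal places; the remaining difference comes from the limited precision with which geo_point values are stored.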

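Geo-centroid aggregation on `geo_shape` fields

This aggregation can also run on `geo_shape` fields. For shapes, the centroid is taken from the highest-dimensionality shape type present, so in the example below the polygon determines the centroid and the point does not shift it.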

curl -X PUT "localhost:9200/places?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "geometry": {
        "type": "geo_shape"
      }
    }
  }
}
'
curl -X POST "localhost:9200/places/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"name": "NEMO Science Museum", "geometry": "POINT(4.912350 52.374081)" }
{"index":{"_id":2}}
{"name": "Sportpark De Weeren", "geometry": { "type": "Polygon", "coordinates": [ [ [ 4.965305328369141, 52.39347642069457 ], [ 4.966979026794433, 52.391721758934835 ], [ 4.969425201416015, 52.39238958618537 ], [ 4.967944622039794, 52.39420969150824 ], [ 4.965305328369141, 52.39347642069457 ] ] ] } }
'
curl -X POST "localhost:9200/places/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "centroid": {
      "geo_centroid": {
        "field": "geometry"
      }
    }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "centroid": {
      "location": {
        "lat": 52.39296147599816,
        "lon": 4.967404240742326
      },
      "count": 2
    }
  }
}

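Matrix stats aggregation

The matrix_stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:

`count`	Number of per field samples included in the calculation.
`mean`	The average value for each field.
`variance`	Per field measurement for how spread out the samples are from the mean.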
`skewness`	Per field measurement quantifying the asymmetric distribution around the mean.
`kurtosis`	Per field measurement quantifying the shape of the distribution.
`covariance`	A matrix that quantitatively describes how changes in one field are associated with another.
`correlation`	The covariance matrix scaled to a range of -1 to 1, inclusive. Describes the relationship between field distributions.

Note: unlike other metrics aggregations, matrix_stats does not support scripting.
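
To illustrate what the covariance and correlation entries mean, here is a small Python sketch over two hypothetical fields, income and poverty. The sample values are invented, and the sketch assumes sample (n - 1) normalization for the covariance, which may not match how Elasticsearch normalizes its output. The correlation does not depend on that choice.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1; an assumption here)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance scaled into the range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical field values: poverty falls linearly as income rises.
income = [10.0, 20.0, 30.0, 40.0]
poverty = [8.0, 6.0, 4.0, 2.0]

cov = covariance(income, poverty)
corr = correlation(income, poverty)
```

Because the two fields are in a perfect inverse linear relationship, the correlation comes out at -1, the lower end of the range.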

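Max aggregation

A single-value metrics aggregation that keeps track of and returns the maximum of the numeric values extracted from the aggregated documents.

The min and max aggregations operate on the double representation of the data. As a consequence, the result may be inexact when the aggregation runs on long values whose absolute value is greater than 2^53.

Computing the maximum price: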
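
The 2^53 limit comes from the 53-bit significand of an IEEE 754 double. Python floats are doubles too, so a short sketch can show the effect:

```python
limit = 2 ** 53

# 2^53 + 1 has no exact double representation; converting it to a
# float rounds it to the nearest representable value, which is 2^53.
exact = limit + 1
as_double = float(exact)

lost_precision = as_double == float(limit)
```

Here lost_precision is True: two different long values collapse into the same double, so a max or min computed on doubles cannot tell them apart.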

curl -X POST "localhost:9200/sales/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "max_price": { "max": { "field": "price" } }
  }
}
'

Response:

{
  ...
  "aggregations": {
      "max_price": {
          "value": 200.0
      }
  }
}

Script

If you need the maximum of something more complex than a single field, run the aggregation on a runtime field.

curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "runtime_mappings": {
    "price.adjusted": {
      "type": "double",
      "script": "double price = doc[\u0027price\u0027].value;\nif (doc[\u0027promoted\u0027].value) {\n  price *= 0.8;\n}\nemit(price);"
    }
  },
  "aggs": {
    "max_price": {
      "max": { "field": "price.adjusted" }
    }
  }
}
'
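
The runtime field script applies a 20% discount to promoted items before the maximum is taken. The Python sketch below mirrors that logic on a few hypothetical sales documents; the data and the helper adjusted_price are illustrative only.

```python
# Hypothetical sales documents.
sales = [
    {"price": 200.0, "promoted": True},
    {"price": 175.0, "promoted": False},
    {"price": 80.0, "promoted": True},
]

def adjusted_price(doc):
    """Same logic as the runtime field script: promoted items cost 80%."""
    price = doc["price"]
    if doc["promoted"]:
        price *= 0.8
    return price

max_price = max(adjusted_price(doc) for doc in sales)
```

The raw maximum would be 200.0, but after the adjustment the promoted item drops to 160.0 and the unpromoted 175.0 becomes the maximum.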

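Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value.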

curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs" : {
      "grade_max" : {
          "max" : {
              "field" : "grade",
              "missing": 10       
          }
      }
  }
}
'

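Histogram fields

When max is computed on a histogram field, the result is the maximum of all elements in the values array. Note that the counts array of the histogram is ignored.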

curl -X PUT "localhost:9200/metrics_index/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "network.name" : "net-1",
  "latency_histo" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}
'
curl -X PUT "localhost:9200/metrics_index/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "network.name" : "net-2",
  "latency_histo" : {
      "values" :  [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [8, 17, 8, 7, 6] 
   }
}
'
curl -X POST "localhost:9200/metrics_index/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs" : {
    "max_latency" : { "max" : { "field" : "latency_histo" } }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "max_latency": {
      "value": 0.5
    }
  }
}
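
The same result follows from taking the largest entry of every values array and ignoring counts entirely, as this short Python check shows:

```python
# The two latency_histo fields indexed in the example above.
histograms = [
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [3, 7, 23, 12, 6]},
    {"values": [0.1, 0.2, 0.3, 0.4, 0.5], "counts": [8, 17, 8, 7, 6]},
]

# Only the values arrays matter; counts play no role for max.
max_latency = max(v for h in histograms for v in h["values"])
```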

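Min aggregation

This works the same way as the max aggregation but returns the minimum, so it is not covered in detail here.

Percentile ranks aggregation

A multi-value metrics aggregation that computes one or more percentile ranks over the numeric values extracted from the aggregated documents. These values can be extracted from specific numeric or histogram fields in the documents.

A percentile rank is the percentage of observed values that fall below a certain value. For example, if a value is greater than or equal to 95% of the observed values, it is said to be at the 95th percentile rank.

Assume your data consists of website load times. You may have a service agreement stating that 95% of page loads complete within 500 ms and 99% of page loads complete within 600 ms.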

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",   
        "values": [ 500, 600 ]
      }
    }
  }
}
'

Response:

{
  ...

 "aggregations": {
    "load_time_ranks": {
      "values": {
        "500.0": 90.01,
        "600.0": 100.0
      }
    }
  }
}

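This response shows that 100% of the documents have a load time of at most 600 ms, but only 90.01% have a load time of at most 500 ms. In other words, the 99% target at 600 ms is met, while the 95% target at 500 ms is not.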
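
The idea behind a percentile rank can be sketched in a few lines of Python. The load times below are hypothetical, and the helper percentile_rank computes an exact rank by counting, whereas Elasticsearch computes percentile ranks with an approximation algorithm, so real results on large data sets may differ slightly.

```python
def percentile_rank(observations, threshold):
    """Percentage of observations that are at or below `threshold`."""
    at_or_below = sum(1 for x in observations if x <= threshold)
    return 100.0 * at_or_below / len(observations)

# Hypothetical page load times in milliseconds.
load_times = [120, 250, 310, 480, 495, 500, 520, 560, 590, 600]

ranks = {float(v): percentile_rank(load_times, v) for v in (500, 600)}
```

With this sample, six of the ten loads finish within 500 ms and all ten within 600 ms, so a 95% target at 500 ms would be missed by a wide margin.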

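Keyed response

By default the keyed flag is set to true, which associates a unique string key with each bucket and returns the ranges as a hash rather than an array. Setting the keyed flag to false disables this behavior: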

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "keyed": false
      }
    }
  }
}
'

Response:

{
  ...

  "aggregations": {
    "load_time_ranks": {
      "values": [
        {
          "key": 500.0,
          "value": 90.01
        },
        {
          "key": 600.0,
          "value": 100.0
        }
      ]
    }
  }
}
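
The two response shapes carry the same information. As a small illustration, this Python sketch turns the array form shown above back into the hash form returned when keyed is true; the variable names are illustrative.

```python
# "values" as returned with "keyed": false.
unkeyed = [
    {"key": 500.0, "value": 90.01},
    {"key": 600.0, "value": 100.0},
]

# Rebuild the hash form, where each key is the threshold as a string.
keyed = {str(item["key"]): item["value"] for item in unkeyed}
```

The array form preserves the order of the requested values explicitly, which can be easier to iterate over in client code, while the hash form allows direct lookup by threshold.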

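Script

If you need to run the aggregation against values that are not indexed, use a runtime field. For example, if the load times are stored in milliseconds but you want the percentile ranks calculated in seconds: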

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "runtime_mappings": {
    "load_time.seconds": {
      "type": "long",
      "script": {
        "source": "emit(doc[\u0027load_time\u0027].value / params.timeUnit)",
        "params": {
          "timeUnit": 1000
        }
      }
    }
  },
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "values": [ 500, 600 ],
        "field": "load_time.seconds"
      }
    }
  }
}
'

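Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but they can also be treated as if they had a value.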

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "load_time_ranks": {
      "percentile_ranks": {
        "field": "load_time",
        "values": [ 500, 600 ],
        "missing": 10           
      }
    }
  }
}
'
