A Deep Dive into How Aggregations Work and Their Accuracy Issues
1. Metric Aggregation
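- Single-value analysis: outputs a single result
  - min, max, avg, sum
  - cardinality (similar to distinct count)
- Multi-value analysis: outputs multiple results
  - stats, extended stats
  - percentiles, percentile ranks
  - top hits (sample documents ranked at the top)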
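To make the single-value versus multi-value distinction concrete, here is a minimal plain-Python sketch that mimics the shape of the results. It does not call Elasticsearch; the function names and the sample prices are made up for illustration, and the distinct count here is exact, whereas this sketch makes no claim about how the real `cardinality` aggregation computes its value.

```python
# Hypothetical sample values, for illustration only.
PRICES = [10.0, 20.0, 20.0, 50.0]

def single_value_metrics(values):
    """Each entry mimics one single-value aggregation: one number per aggregation."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
        # Exact distinct count in this sketch.
        "cardinality": len(set(values)),
    }

def stats(values):
    """Mimics a multi-value aggregation: one aggregation returns several numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

print(single_value_metrics(PRICES))
print(stats(PRICES))
```

A single `stats` request therefore replaces several separate single-value requests when all of those numbers are needed.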
# Aggregate on the type field and count its unique values
POST kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"type": {
"cardinality": {
"field": "type"
}
}
}
}
# Compute the median ticket price
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"AvgTicketPrice": {
"percentiles": {
"field": "AvgTicketPrice",
"percents": [
50
]
}
}
}
}
# Aggregate on price inside the nested field detail_info
GET poi/_search
{
"size": 0,
"aggs": {
"detail_info": {
"nested": {
"path": "detail_info"
},
"aggs": {
"price": {
"stats": {
"field": "detail_info.price"
}
}
}
}
}
}
2. Bucket Aggregation
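- Documents are assigned to different buckets according to a rule. ES provides the following common bucket aggregations:
  - Terms
  - Numeric types
    - Range / Date Range
    - Histogram / Date Histogram
- Nesting is supported: a bucket can be split further into sub-buckets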
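The bucketing rules listed above can be sketched in plain Python. This is only an illustration of the idea, not how Elasticsearch implements it; the documents, field names, and helper functions are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical documents, for illustration only.
DOCS = [
    {"city": "A", "price": 12},
    {"city": "A", "price": 27},
    {"city": "B", "price": 25},
    {"city": "B", "price": 8},
    {"city": "A", "price": 15},
]

def terms_buckets(docs, field):
    """Terms-style bucketing: one bucket per distinct field value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)

def histogram_buckets(docs, field, interval):
    """Histogram-style bucketing: the bucket key is the value rounded down to the interval."""
    buckets = defaultdict(list)
    for doc in docs:
        key = (doc[field] // interval) * interval
        buckets[key].append(doc)
    return dict(buckets)

def nested_counts(docs):
    """Nesting: split by city first, then split each city bucket by price range."""
    return {
        city: {key: len(sub) for key, sub in histogram_buckets(city_docs, "price", 10).items()}
        for city, city_docs in terms_buckets(docs, "city").items()
    }

print(nested_counts(DOCS))
```

Each inner dictionary plays the role of a sub-aggregation: it only sees the documents that landed in its parent bucket.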
A more complex aggregation example: get the moving median of avg_price, month by month, for a city over a time range
def get_city_median(city, start_time, end_time):
    # es.elastic_client is assumed to be an already-configured Elasticsearch client
    return es.elastic_client.search(body={
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {
                            "city": city
                        }
                    }
                ],
                "filter": [
                    {
                        "range": {
                            "publish_date": {
                                "gt": start_time,
                                "lte": end_time
                            }
                        }
                    }
                ]
            }
        },
        "size": 0,
        "aggs": {
            "group_by_city": {
                "terms": {
                    "field": "city"
                },
                "aggs": {
                    "group_by_date": {
                        "date_histogram": {
                            "field": "publish_date",
                            "calendar_interval": "month",
                            "format": "yyyy-MM"
                        },
                        "aggs": {
                            # median (50th percentile) of avg_price within each month
                            "avg_price_percentile": {
                                "percentiles": {
                                    "field": "avg_price",
                                    "percents": [50]
                                }
                            },
                            # uses the pipeline concept described in section 3 below
                            "the_movperc": {
                                "moving_percentiles": {
                                    "buckets_path": "avg_price_percentile",
                                    "window": 3
                                }
                            }
                        }
                    }
                }
            }
        }
    }, index='xxxx')
3. Pipeline Aggregation
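- The pipeline concept: the results of an aggregation can themselves be analyzed and aggregated again
- Pipeline results are written into the original response. Depending on where they appear, there are two types:
  - Sibling: the result sits at the same level as the existing aggregation result
    - Max, Min, Avg, Sum Bucket
    - Stats, Extended Stats Bucket
    - Percentiles Bucket
  - Parent: the result is embedded inside the existing aggregation result
    - Derivative
    - Cumulative Sum
    - Moving Function (sliding window)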
# Find the destination with the lowest average ticket price (among the top 10 destinations by document count)
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"dest": {
"terms": {
"field": "Dest",
"size": 10
},
"aggs": {
"price": {
"avg": {
"field": "AvgTicketPrice"
}
}
}
},
"min_dest_price": {
"min_bucket": {
"buckets_path": "dest>price"
}
}
}
}
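The difference between sibling and parent pipelines can be sketched in plain Python. This is a simplified model of where the results end up, not Elasticsearch code; the bucket data and function names are hypothetical.

```python
# Hypothetical output of a bucket aggregation with one metric per bucket.
BUCKETS = [
    {"key": "2024-01", "sales": 100},
    {"key": "2024-02", "sales": 70},
    {"key": "2024-03", "sales": 120},
]

def min_bucket(buckets, metric):
    """Sibling pipeline: produces one result that sits next to the bucket list."""
    lowest = min(b[metric] for b in buckets)
    return {"value": lowest, "keys": [b["key"] for b in buckets if b[metric] == lowest]}

def derivative(buckets, metric):
    """Parent pipeline: adds a value inside each bucket (none for the first one)."""
    out, previous = [], None
    for bucket in buckets:
        enriched = dict(bucket)
        if previous is not None:
            enriched["derivative"] = bucket[metric] - previous
        previous = bucket[metric]
        out.append(enriched)
    return out

def cumulative_sum(buckets, metric):
    """Parent pipeline: adds the running total inside each bucket."""
    out, total = [], 0
    for bucket in buckets:
        total += bucket[metric]
        out.append({**bucket, "cumulative_sum": total})
    return out

print(min_bucket(BUCKETS, "sales"))
print(derivative(BUCKETS, "sales"))
print(cumulative_sum(BUCKETS, "sales"))
```

In the request above, `min_dest_price` is declared beside `dest` rather than inside it, which is exactly what makes it a sibling pipeline.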
4. Aggregation Scope
#Filter
POST employees/_search
{
"size": 0,
"aggs": {
"older_person": {
"filter":{ // this filter applies only inside the older_person aggregation
"range":{
"age":{
"gte":35
}
}
},
"aggs":{
"jobs":{
"terms": {
"field":"job.keyword"
}
}
}},
"all_jobs": {
"terms": {
"field":"job.keyword"
}
}
}
}
POST employees/_search
{
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword"
}
}
},
"post_filter": { // filters the returned hits after the aggregations are computed; the aggregation results are not affected
"match": {
"job.keyword": "Dev Manager"
}
}
}
#global
POST employees/_search
{
"size": 0,
"query": {
"range": {
"age": {
"gte": 40
}
}
},
"aggs": {
"jobs": {
"terms": {
"field":"job.keyword"
}
},
"all":{
"global":{}, // ignores the scope set by the query, so this covers all ages
"aggs":{
"salary_avg":{
"avg":{
"field":"salary"
}
}
}
}
}
}
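The four scopes shown above (query, filter aggregation, post_filter, global) can be compared side by side in a small plain-Python model. It is a sketch under simplifying assumptions, not Elasticsearch code; the employee records, the `search` helper, and its parameters are hypothetical.

```python
from collections import Counter

# Hypothetical employee documents, for illustration only.
EMPLOYEES = [
    {"job": "Dev Manager", "age": 45, "salary": 50000},
    {"job": "Java Programmer", "age": 28, "salary": 20000},
    {"job": "Java Programmer", "age": 41, "salary": 30000},
    {"job": "DBA", "age": 36, "salary": 25000},
]

def count_jobs(docs):
    return dict(Counter(d["job"] for d in docs))

def search(docs, query, agg_filter, post_filter):
    # The query defines the default scope for hits and for ordinary aggregations.
    in_scope = [d for d in docs if query(d)]
    return {
        # Ordinary aggregation: sees only documents matched by the query.
        "jobs": count_jobs(in_scope),
        # Filter aggregation: narrows the scope further, for this aggregation only.
        "filtered_jobs": count_jobs([d for d in in_scope if agg_filter(d)]),
        # Global aggregation: ignores the query and sees every document.
        "all_salary_avg": sum(d["salary"] for d in docs) / len(docs),
        # post_filter: trims the hits only, after the aggregations were computed.
        "hits": [d for d in in_scope if post_filter(d)],
    }

result = search(
    EMPLOYEES,
    query=lambda d: d["age"] >= 40,
    agg_filter=lambda d: d["age"] >= 45,
    post_filter=lambda d: d["job"] == "Dev Manager",
)
print(result)
```

Note that `jobs` still lists both job titles even though `hits` contains only the Dev Manager, which mirrors how post_filter leaves aggregation results untouched.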
5. How Aggregations Work and Accuracy Issues
Approximate statistical algorithms in distributed systems
GET kibana_sample_data_flights/_search
{
"size": 0,
"aggs": {
"dest": {
"terms": {
"field": "Dest",
"size": 10
}
}
}
}
// Response
"aggregations" : {
"dest" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8898
}
}
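The Terms Aggregation response contains two special values:
- doc_count_error_upper_bound: the maximum possible document count of a term bucket that was left out of the result
- sum_other_doc_count: the total document count of all terms that are not in the returned buckets (total documents minus the documents in the returned buckets)
When doc_count_error_upper_bound is greater than 0, the result may be inaccurate.
- Cause: the data is spread across multiple shards, so the Coordinating Node cannot see the full picture
- Solution 1: when the data volume is small, set the number of primary shards to 1 to get exact results
- Solution 2: on distributed data, set the shard_size parameter to improve accuracy
  - How it works: extra terms are fetched from each shard, which improves accuracy
  - Increasing shard_size lowers doc_count_error_upper_bound and so improves accuracy
  - The overall amount of computation grows: accuracy improves, but responses become slower
  - The default shard_size is size * 1.5 + 10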
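The effect of shard_size can be reproduced with a small plain-Python simulation. This is a simplified model and not Elasticsearch code: the per-shard term counts are made up, the function names are hypothetical, and the error bound is modeled by assuming that a shard which returned a full shard_size list may be hiding a term with at most the count of its last returned term.

```python
from collections import Counter

# Hypothetical per-shard term counts, for illustration only.
# True totals: y=18, x=10, z=10, w=2, so y is the real top term.
SHARDS = [
    {"x": 10, "y": 9, "w": 1},
    {"z": 10, "y": 9, "w": 1},
]

def default_shard_size(size):
    """Default stated above: shard_size = size * 1.5 + 10."""
    return int(size * 1.5 + 10)

def by_count_desc(item):
    term, count = item
    return (-count, term)

def terms_agg(shards, size, shard_size):
    """Simplified terms aggregation: each shard sends only its top shard_size terms."""
    merged = Counter()
    error_upper_bound = 0
    for shard in shards:
        top = sorted(shard.items(), key=by_count_desc)[:shard_size]
        merged.update(dict(top))
        # Assumption of this model: a shard that filled its quota may hide a term
        # with at most the count of the last term it returned.
        if top and len(top) == shard_size:
            error_upper_bound += top[-1][1]
    buckets = sorted(merged.items(), key=by_count_desc)[:size]
    return buckets, error_upper_bound

for shard_size in (1, 2, 4):
    print(shard_size, terms_agg(SHARDS, size=1, shard_size=shard_size))
```

With shard_size 1 the coordinating step never sees y at all and reports x as the top term. With shard_size 2 the answer is already correct but the bound stays above 0, because the bound is conservative. With shard_size 4 every shard has sent all of its terms and the bound drops to 0.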
DELETE my_flights
PUT my_flights
{
"settings": {
"number_of_shards": 20
},
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
"type" : "keyword"
},
"DestCountry" : {
"type" : "keyword"
},
"DestLocation" : {
"type" : "geo_point"
},
"DestRegion" : {
"type" : "keyword"
},
"DestWeather" : {
"type" : "keyword"
},
"DistanceKilometers" : {
"type" : "float"
},
"DistanceMiles" : {
"type" : "float"
},
"FlightDelay" : {
"type" : "boolean"
},
"FlightDelayMin" : {
"type" : "integer"
},
"FlightDelayType" : {
"type" : "keyword"
},
"FlightNum" : {
"type" : "keyword"
},
"FlightTimeHour" : {
"type" : "keyword"
},
"FlightTimeMin" : {
"type" : "float"
},
"Origin" : {
"type" : "keyword"
},
"OriginAirportID" : {
"type" : "keyword"
},
"OriginCityName" : {
"type" : "keyword"
},
"OriginCountry" : {
"type" : "keyword"
},
"OriginLocation" : {
"type" : "geo_point"
},
"OriginRegion" : {
"type" : "keyword"
},
"OriginWeather" : {
"type" : "keyword"
},
"dayOfWeek" : {
"type" : "integer"
},
"timestamp" : {
"type" : "date"
}
}
}
}
POST _reindex
{
"source": {
"index": "kibana_sample_data_flights"
},
"dest": {
"index": "my_flights"
}
}
GET my_flights/_search
{
"size": 0,
"aggs": {
"weather": {
"terms": {
"field":"OriginWeather",
"size":1,
"shard_size":10 // with shard_size 5 the returned doc_count_error_upper_bound is greater than 0; with 10 it is 0
}
}
}
}
From: https://www.cnblogs.com/shenjian-online/p/16811341.html