Recap: the previous post covered Elasticsearch basic search; this post walks through advanced search.
1. Sample test data
A sample of fictitious JSON documents describing customer bank accounts has been prepared; every document follows the schema below:
"account_number" : 2,
"balance" : 28838,
"firstname" : "Roberta",
"lastname" : "Bender",
"age" : 22,
"gender" : "F",
"address" : "560 Kingsway Place",
"employer" : "Chillium",
"email" : "robertabender@chillium.com",
"city" : "Bennett",
"state" : "LA"
Data file: https://github.com/elastic/elasticsearch/blob/v7.4.2/docs/src/test/resources/accounts.json — import it as the test data.
2. Search API
ES supports two basic ways to run a search:
- Via the REST request URI, sending the search parameters in the URI itself (uri + search parameters), e.g. GET bank/_search?q=*&sort=account_number:asc. The response holds only 10 hits because results are paginated; from and size control the page (a URI version with explicit paging is sketched after the request-body example below).
- Via the REST request body (uri + request body):
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" },
{"balance":"desc"}
],
"from": 20,
"size": 10
}
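The URI style from the first bullet supports the same paging. A minimal sketch (q, sort, from, and size are all standard URI search parameters):
GET bank/_search?q=*&sort=account_number:asc&from=20&size=10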
For more details, see the official search API documentation.
3. Query DSL
(1) Basic queries
Elasticsearch provides a JSON-style DSL for executing queries, known as the Query DSL. The query language is very comprehensive, and individual clauses can also target a specific field. The general shape is:
{
  QUERY_NAME: {
    ARGUMENT: VALUE,
    ARGUMENT: VALUE,
    FIELD_NAME: {
      ARGUMENT: VALUE,
      ARGUMENT: VALUE, ...
    }
  }
}
For example:
GET bank/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"account_number": {
"order": "desc"
}
}
],
"from": 20,
"size": 10
}
(2) Returning only selected fields (_source filtering)
GET bank/_search
{
"query": {
"match_all": {}
},
"from": 0,
"size": 5,
"sort": [
{
"account_number": {
"order": "desc"
}
}
],
"_source": ["balance","firstname"]
}
Response body:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "999",
"_score" : null,
"_source" : {
"firstname" : "Dorothy",
"balance" : 6087
},
"sort" : [
999
]
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "998",
"_score" : null,
"_source" : {
"firstname" : "Letha",
"balance" : 16869
},
"sort" : [
998
]
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "997",
"_score" : null,
"_source" : {
"firstname" : "Combs",
"balance" : 25311
},
"sort" : [
997
]
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "996",
"_score" : null,
"_source" : {
"firstname" : "Andrews",
"balance" : 17541
},
"sort" : [
996
]
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "995",
"_score" : null,
"_source" : {
"firstname" : "Phelps",
"balance" : 21153
},
"sort" : [
995
]
}
]
}
}
(3) match queries
- Basic (non-string) types: exact matching
GET bank/_search
{
"query": {
"match": {
"account_number": "20"
}
}
}
match returns the document whose account_number equals 20.
- Strings: full-text search
GET bank/_search
{
"query": {
"match": {
"address": "kings"
}
}
}
Full-text search tokenizes the query string and ranks the results by relevance score. Response body:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 5.9908285,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "20",
"_score" : 5.9908285,
"_source" : {
"account_number" : 20,
"balance" : 16418,
"firstname" : "Elinor",
"lastname" : "Ratliff",
"age" : 36,
"gender" : "M",
"address" : "282 Kings Place",
"employer" : "Scentric",
"email" : "elinorratliff@scentric.com",
"city" : "Ribera",
"state" : "WA"
}
},
{
"_index" : "bank",
"_type" : "account",
"_id" : "722",
"_score" : 5.9908285,
"_source" : {
"account_number" : 722,
"balance" : 27256,
"firstname" : "Roberts",
"lastname" : "Beasley",
"age" : 34,
"gender" : "F",
"address" : "305 Kings Hwy",
"employer" : "Quintity",
"email" : "robertsbeasley@quintity.com",
"city" : "Hayden",
"state" : "PA"
}
}
]
}
}
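Because match tokenizes the query string, a query like "mill lane" matches addresses containing either term. To require every term to be present, match also accepts an object form with an operator parameter (it defaults to or); a minimal sketch:
GET bank/_search
{
  "query": {
    "match": {
      "address": {
        "query": "mill lane",
        "operator": "and"
      }
    }
  }
}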
(4) match_phrase: phrase matching
The value is searched as a whole phrase: all of its terms must appear in the field, adjacent and in the original order, instead of being matched independently.
GET bank/_search
{
"query": {
"match_phrase": {
"address": "mill road"
}
}
}
Response body:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 8.926605,
"hits" : [
{
"_index" : "bank",
"_type" : "account",
"_id" : "970",
"_score" : 8.926605,
"_source" : {
"account_number" : 970,
"balance" : 19648,
"firstname" : "Forbes",
"lastname" : "Wallace",
"age" : 28,
"gender" : "M",
"address" : "990 Mill Road",
"employer" : "Pheast",
"email" : "forbeswallace@pheast.com",
"city" : "Lopezo",
"state" : "AK"
}
}
]
}
}
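For comparison: to match the entire field value exactly rather than a phrase within it, query the keyword subfield that dynamic mapping creates for strings. A sketch, assuming this index was mapped dynamically so that address.keyword exists:
GET bank/_search
{
  "query": {
    "match": {
      "address.keyword": "990 Mill Road"
    }
  }
}
This returns a hit only when the stored value equals "990 Mill Road" exactly; "990 Mill" would return nothing.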
(5) multi_match: multi-field matching
GET bank/_search
{
"query": {
"multi_match": {
"query": "mill IL",
"fields": [
"state",
"address"
]
}
}
}
Returns documents whose state or address contains "mill" or "IL"; the query string "mill IL" is itself tokenized during the search, so each field only has to match one of the resulting terms.
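With the default best_fields type, multi_match behaves essentially like a dis_max query wrapping one match clause per field; a rough equivalent, sketched for illustration:
GET bank/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "state": "mill IL" } },
        { "match": { "address": "mill IL" } }
      ]
    }
  }
}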
(6) bool: compound queries
Compound clauses can wrap any other query clauses, including other compound clauses, which means they can be nested to express very complex logic.
- must: every listed condition must match.
- must_not: none of the listed conditions may match.
- should: the listed conditions should preferably match; documents that don't match them are still returned, but matching ones score higher.
GET bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"gender": "M"
}
},
{
"match": {
"address": "mill"
}
}
],
"must_not": [
{
"match": {
"age": "38"
}
}
],
"should": [
{
"match": {
"lastname": "Wallace"
}
}
]
}
}
}
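When a bool query contains must or filter clauses, its should clauses become optional score boosters; minimum_should_match makes a number of them mandatory again. A minimal sketch (the lastname values are only illustrative):
GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } }
      ],
      "should": [
        { "match": { "lastname": "Wallace" } },
        { "match": { "lastname": "Holland" } }
      ],
      "minimum_should_match": 1
    }
  }
}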
(7) filter: filtering results
filter narrows down the result set without computing a relevance score for the clauses it contains.
GET bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"address": "mill"
}
}
],
"filter": {
"range": {
"balance": {
"gte": "10000",
"lte": "20000"
}
}
}
}
}
}
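filter can also stand on its own, with no scoring query at all; every matching document then comes back with _score 0. A minimal sketch:
GET bank/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "balance": {
            "gte": 10000,
            "lte": 20000
          }
        }
      }
    }
  }
}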
(8) term: exact matching
Like match, term matches a value against a field, but it matches exactly, with no analysis. Recommendation: use match for full-text (text) fields and term for exact values on non-text fields (keywords, numbers, dates).
GET bank/_search
{
"query": {
"term": {
"address": "mill Road"
}
}
}
Response body:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
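The empty result is expected: address is a text field, so its value was analyzed (split and lowercased) at index time, and term compares the unanalyzed string "mill Road" against those stored terms, which never matches. Against a non-text field, term behaves as intended; a minimal sketch using the numeric age field from the sample data:
GET bank/_search
{
  "query": {
    "term": {
      "age": 28
    }
  }
}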
(9) Aggregations
Aggregations provide the ability to group data and extract statistics from it. The simplest aggregations roughly correspond to SQL GROUP BY and SQL aggregate functions. In Elasticsearch, a search can return hits and aggregation results at the same time, kept separate from the hits in the response. This is powerful and efficient: you can run a query together with multiple aggregations and get all of their results back in a single round trip, through one concise API. The general shape:
GET bank/_search
{
"query": {
"match": {
"address": "Mill"
}
},
"aggs":{
"aggs_name":{
"AGG_TYPE聚合的类型(avg,term,terms)":{}
}
},
"size": 0
}
- Find the age distribution, average age, and average balance of everyone whose address contains "mill", without returning the matching documents themselves:
GET bank/_search
{
"query": {
"match": {
"address": "Mill"
}
},
"aggs": {
"ageAgg": {
"terms": {
"field": "age",
"size": 10
}
},
"ageAvg": {
"avg": {
"field": "age"
}
},
"balanceAvg": {
"avg": {
"field": "balance"
}
}
},
"size": 0
}
- Find the full age distribution and, within each age bucket, the average balance for gender M, the average balance for gender F, and the bucket's overall average balance (note how sub-aggregations nest inside their parent bucket):
GET bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAgg": {
"terms": {
"field": "age",
"size": 100
},
"aggs": {
"genderAgg": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"balanceAvg": {
"avg": {
"field": "balance"
}
}
}
},
"ageBalanceAvg": {
"avg": {
"field": "balance"
}
}
}
}
},
"size": 0
}
4. Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
- which string fields should be treated as full text fields.
- which fields contain numbers, dates, or geolocations.
- the format of date values.
- custom rules to control the mapping for dynamically added fields.
(1) Field data types: see the official documentation for the full list of supported types.
(2) Viewing mapping information
GET bank/_mapping
(3) Creating an index and specifying its mapping
PUT /my_index
{
"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"name": { "type": "text" }
}
}
}
Response body:
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "my_index"
}
(4) Adding a new field mapping
PUT /my_index/_mapping
{
"properties": {
"employee-id": {
"type": "keyword",
"index": false
}
}
}
Response body:
{
"acknowledged" : true
}
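Because employee-id was mapped with "index": false, its values are kept in _source but no searchable index structures are built for them, so a query against the field is rejected by Elasticsearch. A sketch of such a query (the value is hypothetical):
GET my_index/_search
{
  "query": {
    "term": {
      "employee-id": "E-1001"
    }
  }
}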
(5) Updating an existing field mapping
An existing field mapping cannot be updated. Changing it requires creating a new index and migrating the data. First create the new index with the correct mapping, then migrate the data as follows:
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
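For the bank index used throughout this post, whose documents were bulk-loaded under the legacy account type, the migration might look like the sketch below (newbank is an assumed target index created beforehand with the corrected mapping; a type in the reindex source is deprecated but still accepted in 7.x):
POST _reindex
{
  "source": {
    "index": "bank",
    "type": "account"
  },
  "dest": {
    "index": "newbank"
  }
}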
For details, see the official reindex documentation.
5. Analysis (tokenization)
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it sees whitespace: it would turn the text "Quick brown fox!" into the tokens [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term (used for phrase and word-proximity queries) and the start and end character offsets of the original word each term represents (used for highlighting search snippets).
Elasticsearch ships with many built-in tokenizers, which can also be used to build custom analyzers. Testing the standard analyzer:
POST _analyze
{
"analyzer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Response body:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "jumped",
"start_offset" : 24,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 31,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "the",
"start_offset" : 36,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "lazy",
"start_offset" : 40,
"end_offset" : 44,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "dog's",
"start_offset" : 45,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "bone",
"start_offset" : 51,
"end_offset" : 55,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
(1) Installing the IK analyzer
Elasticsearch's built-in tokenizers do not handle Chinese well, so a Chinese analyzer needs to be installed. Download: https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.4.2
In the previous post, the elasticsearch container's /usr/share/elasticsearch/plugins directory was mapped to /mydata/elasticsearch/plugins on the host, so the easiest approach is to download elasticsearch-analysis-ik-7.4.2.zip and unzip it into that directory; the elasticsearch container must be restarted afterwards. The archive can be fetched with wget:
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
If wget is not installed on the Linux host, install it via yum (yum install wget), then extract the archive with unzip:
unzip elasticsearch-analysis-ik-7.4.2.zip
Next create an ik directory, move all the extracted files into it, and fix the permissions:
mkdir ik
mv * ik
chmod -R 777 ik/
Restart elasticsearch:
docker restart elasticsearch
Test whether the installation succeeded:
POST _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}
Response body:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
}
]
}
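The plugin ships two analyzers: ik_max_word (exhaustive splitting, as above) and ik_smart (coarsest-grained splitting). Running the same text through ik_smart returns fewer, coarser tokens:
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}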
(2) Adding a custom dictionary to the IK analyzer
- Install nginx (see the nginx installation guide).
- Under /mydata/nginx/html, create an es directory and a file es/fenci.txt inside it; its contents are your custom dictionary, one entry per line, e.g.:
就这样吧
别傻了
乔碧罗
- Edit /mydata/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml so that the remote extension dictionary points at the file served by nginx:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- configure your own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- configure your own extension stopword dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- configure a remote extension dictionary here -->
<entry key="remote_ext_dict">http://wutingze.cn:88/es/fenci.txt</entry>
<!-- configure a remote extension stopword dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
- Then restart elasticsearch.
- Re-running the analysis shows the new dictionary words recognized as single tokens:
POST _analyze
{
"analyzer": "ik_max_word",
"text": "乔碧罗殿下"
}
Response body:
{
"tokens" : [
{
"token" : "乔碧罗",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "殿下",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
}
]
}
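To use IK when indexing your own documents, specify it in the field mapping. A minimal sketch (the my_article index and its content field are hypothetical; indexing with ik_max_word while searching with ik_smart is the pairing suggested in the plugin's README):
PUT my_article
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}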