Getting Started Quickly with Elasticsearch in a Container

Contents

1. Startup
2. Indexing
3. Search
4. Chinese

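Elasticsearch is a near-real-time distributed search and analytics engine written in Java and built on Apache Lucene. Wikipedia, Stack Overflow, GitHub and others use it as their full-text search engine. This article uses Docker to get started quickly and try out the basic features Elasticsearch provides.

If you are interested in getting started with Apache Solr, see "Getting Started Quickly with Apache Solr in a Container".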

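Startup

For background on Elasticsearch, there is a well-translated GitBook, 《Elasticsearch 权威指南》 (Elasticsearch: The Definitive Guide). This article focuses on hands-on operation, so let's get started. With Docker, a single command starts Elasticsearch:

docker run -d --net=host --name=es elasticsearch:2.3.5

I am on a Mac. The docker-machine env default command shows that the default docker-machine has the IP address 192.168.99.100, so Elasticsearch returns JSON at http://192.168.99.100:9200/: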

{
  "name" : "Lorvex",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.5",
    "build_hash" : "90f439ff60a3c0f497f91663701e64ccd01edbb4",
    "build_timestamp" : "2016-07-27T10:36:52Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}
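
The host address can also be derived in a script instead of being copied by hand. The sketch below assumes that the DOCKER_HOST value exported by docker-machine env default has the form tcp://IP:PORT. The names docker_host_ip and es_base_url are hypothetical helpers, not part of Docker or Elasticsearch.

```shell
# Hypothetical helper: extract the host IP from a DOCKER_HOST-style value
# such as tcp://192.168.99.100:2376.
docker_host_ip() {
    value=${1#*://}                 # drop the scheme prefix, e.g. "tcp://"
    printf '%s\n' "${value%%:*}"    # drop ":port" and anything after it
}

# Hypothetical helper: build the Elasticsearch base URL.
# Port 9200 is the one used throughout this article.
es_base_url() {
    printf 'http://%s:9200\n' "$(docker_host_ip "$1")"
}

# Example usage (requires a running container, so it is left commented out):
#   curl "$(es_base_url "$DOCKER_HOST")/"
```

With docker-machine, docker-machine ip default should print the same address directly.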

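You can also use _cat on the command line to get the health and node status of Elasticsearch:

curl '192.168.99.100:9200/_cat/health?v'
curl '192.168.99.100:9200/_cat/nodes?v'

Indexing

Unlike Solr, Elasticsearch only accepts JSON. Indexing a document is just a matter of POSTing data to the server: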

curl -XPOST 'http://192.168.99.100:9200/megacorp/employee' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}'

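Elasticsearch immediately returns a result: {"_index":"megacorp","_type":"employee","_id":"AVbK2ssDm7bdYo65QJ6k","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}. In the URL, megacorp is the name of the index (think of it as a database) and employee is the name of the type (think of it as a table). The returned _id is an ID that Elasticsearch generated automatically to identify the document.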
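
Because the generated _id is needed for the later PUT and DELETE requests, it can be handy to capture it from the response. This is a minimal sketch using sed. extract_id is a hypothetical helper, and it assumes the response arrives on a single line, as in the output shown above.

```shell
# Hypothetical helper: read an index response on stdin and print its _id.
# Assumes the whole response is on one line and contains exactly one "_id" field.
extract_id() {
    sed -n 's/.*"_id" *: *"\([^"]*\)".*/\1/p'
}

# Example usage (requires a running container, so it is left commented out):
#   id=$(curl -s -XPOST 'http://192.168.99.100:9200/megacorp/employee' \
#            -d '{"first_name":"John"}' | extract_id)
#   curl "http://192.168.99.100:9200/megacorp/employee/$id"
```

A JSON-aware tool would be more robust than a regular expression. sed is used here only because it needs no extra installation.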

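The following URL returns information about the indices in Elasticsearch:

curl 'http://192.168.99.100:9200/_cat/indices?v'

To modify the document, send a PUT request (remember to replace the ID with the _id generated for you):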

curl -XPUT 'http://192.168.99.100:9200/megacorp/employee/AVbK2ssDm7bdYo65QJ6k' -d '
{
    "first_name" : "Johnie",
    "last_name" :  "Smithreen",
    "age" :        38,
    "about" :      "I do not love to go rock climbing",
    "interests": []
}'

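If the specified ID does not exist, PUT creates a new document instead. To delete the document, send a DELETE request:

curl -XDELETE 'http://192.168.99.100:9200/megacorp/employee/AVbK2ssDm7bdYo65QJ6k'

Or delete the entire index:

curl -XDELETE 'http://192.168.99.100:9200/megacorp'

Fetch the index information again and you will see that no index is left. Let's add three documents for the search section that follows: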

curl -XPOST 'http://192.168.99.100:9200/megacorp/employee/1' -d '
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}'
curl -XPOST 'http://192.168.99.100:9200/megacorp/employee/2' -d '
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}'
curl -XPOST 'http://192.168.99.100:9200/megacorp/employee/3' -d '
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}'

Search

Retrieving documents

To view the documents just created, a plain GET is enough:

curl 'http://192.168.99.100:9200/megacorp/employee/1'
curl 'http://192.168.99.100:9200/megacorp/employee/2'
curl 'http://192.168.99.100:9200/megacorp/employee/3'

To retrieve everything, just change the URL:

curl 'http://192.168.99.100:9200/megacorp/employee/_search'
curl 'http://192.168.99.100:9200/megacorp/_search'
curl 'http://192.168.99.100:9200/_search'

Simple search

The commands above only retrieve documents; they do not really count as search. Next, let's try Elasticsearch's powerful search features:

curl 'http://192.168.99.100:9200/megacorp/employee/_search?q=last_name:Smith'
curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}'

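Query DSL

Both commands above fetch the documents whose last_name is Smith, but the second one uses the query DSL. This is the domain-specific query language that Elasticsearch provides, and it makes more complex searches possible. Next, in addition to Smith, we add the condition that the age must be over 30: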

curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" 
                }
            }
        }
    }
}'

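A brief word on filter and query. Think of it this way: a filter is an exact match (like =, <, > in a SQL WHERE clause); it is fast and its results can be cached. A query is a fuzzy match (like LIKE in a SQL WHERE clause, except that it scores each result by how well it matches); it is not as fast as a filter, but its results can be more relevant. Real-world searches usually combine the two: a filter quickly narrows down the candidate documents, and a query then matches against them.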
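
In Elasticsearch 2.x the filtered query is deprecated in favour of a bool query with a filter clause, which expresses the same split between scoring and non-scoring conditions. As a sketch of the equivalent request body: bool_filter_query is a hypothetical helper, and it does no JSON escaping, so it assumes plain alphanumeric input.

```shell
# Hypothetical helper: print a bool query body.
#   $1: last_name to match (scored, goes into "must")
#   $2: minimum age, exclusive (not scored, goes into "filter")
# No JSON escaping is done, so pass only simple values.
bool_filter_query() {
    printf '{"query":{"bool":{"must":{"match":{"last_name":"%s"}},"filter":{"range":{"age":{"gt":%s}}}}}}\n' \
        "$1" "$2"
}

# Example usage (requires a running container, so it is left commented out):
#   curl 'http://192.168.99.100:9200/megacorp/employee/_search' \
#        -d "$(bool_filter_query smith 30)"
```

The match clause under must contributes to _score, while the range clause under filter only includes or excludes documents.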

Full-text search

Next, try full-text search, looking for the documents we want in the sentences of the about field:

curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}'

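This returns two results. Each hit carries a _score: the document that contains both rock and climbing scores noticeably higher than the one that contains only rock, and, naturally, higher-scoring documents are listed first. To match the exact phrase rock climbing, change match to match_phrase: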

curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}'

Statistics

Find the most popular interests:

curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}'

Find the average age for each interest:

curl 'http://192.168.99.100:9200/megacorp/employee/_search' -d '
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}'

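All of these figures are computed in real time. Just as SQL is used to query a database, Elasticsearch provides its own DSL for searching on complex conditions. The official documentation describes the statistics (aggregation) features.

Chinese

As with Solr, Elasticsearch needs a Chinese word-segmentation component in order to tokenize Chinese text. Here we again use the mmseg plugin. First download and unpack it: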

wget -c https://github.com/medcl/elasticsearch-analysis-mmseg/releases/download/v1.9.4/elasticsearch-analysis-mmseg-1.9.4.zip
unzip -d elasticsearch-analysis-mmseg-1.9.4 elasticsearch-analysis-mmseg-1.9.4.zip
sed -i 's/2.3.4/2.3.5/' elasticsearch-analysis-mmseg-1.9.4/plugin-descriptor.properties

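Version 1.9.4 of the plugin only supports Elasticsearch 2.3.4; to make it work with 2.3.5, the elasticsearch.version setting must be changed to 2.3.5. In addition, a few settings need to be added to the Elasticsearch configuration file: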

cat << EOF > es.yml
network.host: 0.0.0.0
index:
  analysis:
    analyzer:
      mmseg_maxword:
        type: custom
        filter:
        - lowercase
        tokenizer: mmseg_maxword
      mmseg_maxword_with_cut_letter_digi:
        type: custom
        filter:
        - lowercase
        - cut_letter_digit
        tokenizer: mmseg_maxword
EOF
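
A note on the version change made to plugin-descriptor.properties earlier: the sed -i form used there follows GNU sed, and BSD sed on macOS expects a backup suffix after -i. The sketch below is a portable alternative that writes to a temporary file and touches only the elasticsearch.version line. set_es_version is a hypothetical helper name.

```shell
# Hypothetical helper: set elasticsearch.version in a plugin descriptor.
#   $1: path to plugin-descriptor.properties
#   $2: Elasticsearch version to declare, e.g. 2.3.5
# Writes to a temporary file first, so it avoids the GNU/BSD differences of "sed -i".
set_es_version() {
    descriptor=$1
    version=$2
    tmp=$descriptor.tmp
    sed "s/^elasticsearch\.version=.*/elasticsearch.version=$version/" "$descriptor" > "$tmp" &&
        mv "$tmp" "$descriptor"
}

# Example usage (commented out because it needs the unpacked plugin directory):
#   set_es_version elasticsearch-analysis-mmseg-1.9.4/plugin-descriptor.properties 2.3.5
```

Anchoring the pattern to the elasticsearch.version key leaves other lines that happen to contain 2.3.4 untouched.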

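For the exact syntax of the configuration above, see the reference documentation. Next, start the container, mounting the configuration file and the plugin separately: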

docker rm -f es
docker run -d --net=host --name=es \
    -v `pwd`/es.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
    -v `pwd`/elasticsearch-analysis-mmseg-1.9.4/:/usr/share/elasticsearch/plugins/elasticsearch-analysis-mmseg-1.9.4/ \
    elasticsearch:2.3.5

Once the Elasticsearch service is up, create the index and the mapping (think of a mapping as the data types, somewhat like Solr's schema) and insert some documents:

curl -XPUT http://192.168.99.100:9200/index
curl -XPOST http://192.168.99.100:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "analyzer": "mmseg_maxword",
            "search_analyzer": "mmseg_maxword",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "mmseg_maxword",
                "search_analyzer": "mmseg_maxword",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'
curl -XPOST http://192.168.99.100:9200/index/fulltext/1 -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://192.168.99.100:9200/index/fulltext/2 -d'{"content":"公安部：各地校车将享最高路权"}'
curl -XPOST http://192.168.99.100:9200/index/fulltext/3 -d'{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}'
curl -XPOST http://192.168.99.100:9200/index/fulltext/4 -d'{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'

Finally, search for 中国 (China):

curl -XPOST http://192.168.99.100:9200/index/fulltext/_search  -d'
{
    "query" : { "term" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

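The _analyze API shows how the text is analyzed:

curl "http://192.168.99.100:9200/index/_analyze?analyzer=mmseg_maxword&pretty=true" -d "美国留给伊拉克的是个烂摊子吗"

As you can see, the sentence is split into 美国, 留给, 伊, 拉克, 的, 是个, 烂, 摊子, 吗, which is not quite perfect: neither 伊拉克 (Iraq) nor 烂摊子 (mess) is recognized as a single word.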
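
To read the segmentation at a glance, the token values can be pulled out of the _analyze response. This is a rough sketch. analyze_tokens is a hypothetical helper, and it assumes the pretty=true layout, in which each "token" field sits on its own line.

```shell
# Hypothetical helper: read a pretty-printed _analyze response on stdin
# and print one token per line.
# Relies on pretty=true output, where every "token" field is on a separate line.
analyze_tokens() {
    sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# Example usage (requires a running container, so it is left commented out):
#   curl -s "http://192.168.99.100:9200/index/_analyze?analyzer=mmseg_maxword&pretty=true" \
#        -d "美国留给伊拉克的是个烂摊子吗" | analyze_tokens
```

The pattern requires the closing quote right after token, so the enclosing "tokens" array line is not matched.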

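If you are interested, you can rebuild the container and compare the results with those obtained without Chinese segmentation. Besides mmseg, medcl has also put together elasticsearch-rtf, a distribution that bundles many Chinese segmentation tools and is ready to use out of the box.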