Setting up Elasticsearch synonyms

Like Solr, Elasticsearch is an open-source search engine built on Lucene, so its filter, tokenizer, and analyzer concepts are the same as Solr's.

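Elasticsearch is now at 7.4, but the setup below targets 5.3.0. An index with several mapping types, as used below, is no longer accepted from 6.0 onward.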

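  • Overview of filter, tokenizer, and analyzer
    Term: description
    tokenizer: splits the input into tokens and emits a token stream
    filter: receives the token stream and processes it (change case, replace, drop…)
    analyzer: defines how a given kind of string is handled when indexing and searching, e.g. capitalizing the first letter, removing prepositions, handling synonyms…; effectively tokenizer + filter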
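
The relationship between the three terms can be sketched in a few lines of Python. This is only a toy model and not how Lucene implements analysis; the names whitespace_tokenizer, lowercase_filter, make_stop_filter, and make_analyzer are made up for illustration.

```python
from typing import Callable, Iterable, List

Token = str
TokenFilter = Callable[[Iterable[Token]], Iterable[Token]]


def whitespace_tokenizer(text: str) -> List[Token]:
    """Tokenizer: split the raw input into a token stream."""
    return text.split()


def lowercase_filter(tokens: Iterable[Token]) -> Iterable[Token]:
    """Filter: change the case of every token."""
    return (t.lower() for t in tokens)


def make_stop_filter(stopwords: set) -> TokenFilter:
    """Filter factory: drop every token found in the stopword set."""
    def stop_filter(tokens: Iterable[Token]) -> Iterable[Token]:
        return (t for t in tokens if t not in stopwords)
    return stop_filter


def make_analyzer(tokenizer: Callable[[str], List[Token]],
                  filters: List[TokenFilter]) -> Callable[[str], List[Token]]:
    """Analyzer: one tokenizer followed by filters, applied in list order."""
    def analyze(text: str) -> List[Token]:
        stream: Iterable[Token] = tokenizer(text)
        for token_filter in filters:
            stream = token_filter(stream)
        return list(stream)
    return analyze


# Hypothetical analyzer: lowercase first, then remove two stopwords.
toy_analyzer = make_analyzer(
    whitespace_tokenizer,
    [lowercase_filter, make_stop_filter({"of", "the"})],
)
print(toy_analyzer("The Capital of Taiwan"))  # ['capital', 'taiwan']
```

Filter order matters here: if the stop filter ran before the lowercase filter, "The" would not match the stopword "the" and would survive.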

Setup

  • Dockerfile (includes the ik plugin installation)
FROM elasticsearch:5.3.0

RUN ./bin/elasticsearch-plugin install -b https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.3.0/elasticsearch-analysis-ik-5.3.0.zip

# Remember to put these two files next to the Dockerfile before building
COPY ./synonym.txt /usr/share/elasticsearch/config/analysis/synonym.txt
COPY ./stopwords.txt /usr/share/elasticsearch/config/analysis/stopwords.txt

USER elasticsearch
  • synonym.txt; "," and "=>" mean the same as in Solr: comma-separated terms are equivalent, and "a => b" replaces a with b
攝氏 => 華氏
台灣,臺灣,臺灣黑熊
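
To make the two rule types concrete, here is a rough sketch that reads lines in this format and records what each term is rewritten to. It is a simplified model, assuming every term in a comma group expands to the whole group and that a later rule for the same term overwrites an earlier one; parse_synonym_rules is a hypothetical helper, not part of Solr or Elasticsearch.

```python
from typing import Dict, List


def parse_synonym_rules(lines: List[str]) -> Dict[str, List[str]]:
    """Map each input term to the terms it is rewritten to (toy model)."""
    mapping: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by those on the right.
            left, right = line.split("=>", 1)
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for term in (t.strip() for t in left.split(",")):
                if term:
                    mapping[term] = targets
        else:
            # Equivalent synonyms: every term expands to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                mapping[term] = group
    return mapping


rules = parse_synonym_rules(["攝氏 => 華氏", "台灣,臺灣,臺灣黑熊"])
print(rules["攝氏"])  # ['華氏']
print(rules["臺灣"])  # ['台灣', '臺灣', '臺灣黑熊']
```

Note that "=>" is one-way: in this model 華氏 has no entry of its own, so it is left untouched.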
  • docker-compose.yml
version: '3'
services:
  elasticsearch:
    build: .
    environment:
      - 'http.host=0.0.0.0'
      - 'transport.host=127.0.0.1'
      - 'ES_JAVA_OPTS=-Xms512m -Xmx512m'
      - 'xpack.security.enabled=false'
      - 'ACCESS_TOKEN=e-D9WyQzxfRbpdFvFdhQ'
    ports:
      - '9200:9200'
      - '9300:9300'
    volumes:
      - '/path/to/local/index:/usr/share/elasticsearch/data'
    container_name: local_es

Run

Remember to keep the Dockerfile in the same directory as docker-compose.yml.

docker-compose up -d

Test

If everything started correctly, opening localhost:9200 returns JSON like the following:

{
  "name": "imYjhHG",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "Wq9XBSrlRN6371m4jggxUQ",
  "version": {
    "number": "5.3.0",
    "build_hash": "3adb13b",
    "build_date": "2017-03-23T03:31:50.652Z",
    "build_snapshot": false,
    "lucene_version": "6.4.1"
  },
  "tagline": "You Know, for Search"
}
  • [PUT] Create an index

Send the request straight to localhost:9200/{indexname}?pretty

  • [PUT] Configure mappings (the fields and types of the tokens) & settings (how tokens are processed)
{
  "index": "{indexname}",
  "body": {
    "mappings": {
      "metafield1": {
        "properties": {
          "field1": {
            "type": "text"
          },
          "field2": {
            "analyzer": "ik_syno_max",
            "search_analyzer": "ik_syno_smart",
            "type": "text"
          },
          "field3": {
            "analyzer": "ik_syno_max",
            "search_analyzer": "ik_syno_smart",
            "type": "text"
          }
        }
      },
      "metafield2": {
        "properties": {
          "field1": {
            "type": "text"
          },
          "field2": {
            "type": "long"
          },
          "location": {
            "type": "geo_point"
          }
        }
      }
    },
    "settings": {
      "analysis": {
        "analyzer": {
          "ik_syno_smart": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": [
              "filter_stop",
              "filter_syno"
            ]
          },
          "ik_syno_max": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": [
              "filter_stop",
              "filter_syno"
            ]
          }
        },
        "filter": {
          "filter_syno": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          },
          "filter_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          }
        }
      }
    }
  }
}
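
The "index" and "body" keys above look like the parameter shape of a client library. As a rough sketch, assuming the REST API is called directly and the index does not exist yet, only the content of "body" would be sent in a PUT to the index URL. The helper names, the index name my_index, and the trimmed-down body are all made up for illustration. The filter keys are chosen to match what the analyzer lists, because an analyzer can only refer to filters that are defined in the same settings or built in.

```python
import json
import urllib.request


def build_create_index_request(base_url: str, index_name: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def undefined_filters(body: dict) -> set:
    """Return filter names that analyzers list but settings.analysis.filter lacks.

    Built-in filters (e.g. "lowercase") would also be reported; this toy
    check only knows about filters defined in the body itself.
    """
    analysis = body.get("settings", {}).get("analysis", {})
    defined = set(analysis.get("filter", {}))
    referenced = {
        name
        for analyzer in analysis.get("analyzer", {}).values()
        for name in analyzer.get("filter", [])
    }
    return referenced - defined


# Trimmed-down body: one analyzer plus the two filters it refers to.
body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ik_syno_smart": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["filter_stop", "filter_syno"],
                }
            },
            "filter": {
                "filter_syno": {"type": "synonym", "synonyms_path": "analysis/synonym.txt"},
                "filter_stop": {"type": "stop", "stopwords_path": "analysis/stopwords.txt"},
            },
        }
    }
}

print(undefined_filters(body))  # set()
request = build_create_index_request("http://localhost:9200", "my_index", body)
# To actually send it (needs the container above to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Running the name check before sending is cheap and catches a mismatch between the analyzer's filter list and the filter definitions before Elasticsearch rejects the request.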
  • [POST] Check that the synonyms work, via localhost:9200/{indexname}/_analyze
// request
{
    "text": "流行性感冒",
    "analyzer": "ik_syno_smart",
    "filter": ["filter_syno", "filter_stop"]
}
// response
{
    "tokens": [
        {
            "token": "流感",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM", // success
            "position": 0
        }
    ]
}
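
The same check can be scripted. The sketch below builds the _analyze request and picks the synonym tokens out of a response; it sends only "analyzer" and "text", assuming the named analyzer already carries its filters. The helper names and the index name my_index are hypothetical, and the sample response is the one shown above.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(base_url: str, index_name: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the index-level _analyze API."""
    payload = {"analyzer": analyzer, "text": text}
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synonym_tokens(analyze_response: dict) -> List[str]:
    """Return the tokens that the synonym filter produced."""
    return [
        token["token"]
        for token in analyze_response.get("tokens", [])
        if token.get("type") == "SYNONYM"
    ]


request = build_analyze_request("http://localhost:9200", "my_index", "ik_syno_smart", "流行性感冒")
# To actually send it (needs the container above to be running):
# with urllib.request.urlopen(request) as response:
#     print(synonym_tokens(json.load(response)))

sample_response = {
    "tokens": [
        {"token": "流感", "start_offset": 0, "end_offset": 5, "type": "SYNONYM", "position": 0}
    ]
}
print(synonym_tokens(sample_response))  # ['流感']
```

An empty list from synonym_tokens means no synonym rule fired for the input text.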
