Building an Interactive Dictionary Using ElasticSearch

Originally written for my then employer, Meedan.


At Meedan, we design creative tools for globally minded journalists and educators. Part of our mission is to facilitate the sharing of ideas across languages by creating tools to help users translate content. One of those tools is an interactive dictionary: when given a piece of text, it highlights the terms existing in the dictionary and provides their definitions and translations.

ElasticSearch is a search server based on Lucene that provides full-text search, so we initially had the feeling that it might help us implement our dictionary. But wait a minute: with a full-text search engine, we store a number of text documents, and when given a keyword or two, the engine returns the documents matching those keywords.

In our case, we want to do things the other way around. We want our engine to store keywords, and when given a text document, it should give us back the keywords which happen to exist in that document, along with their location in the document. How can we make ElasticSearch do that for us?

ElasticSearch Percolator to the rescue.

It does exactly what we want: you store queries in an index, then send documents to the Percolator API, which tells you which of the stored queries match each document.
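To make the "reverse search" idea concrete, here is a toy, in-memory sketch in Python. It is not how the Percolator works internally; it only illustrates the input and output we are after: terms are stored first, then a text comes in, and we get back the stored terms that occur in it together with their positions. The function name `find_terms` and the sample terms are made up for this illustration.

```python
import re


def find_terms(terms, text):
    """Return (term, start, end) for every stored term that occurs in text."""
    hits = []
    for term in terms:
        # Whole-word, case-insensitive match; re.escape protects special characters.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((term, match.start(), match.end()))
    # Order the hits by where they appear in the text.
    return sorted(hits, key=lambda hit: hit[1])


# The "index" here is just a list of stored terms.
stored_terms = ["ejemplo", "coche"]
print(find_terms(stored_terms, "Por ejemplo, tenemos coche."))
```

A plain loop like this only finds exact word forms, which is one reason to hand the matching over to a search engine with language-aware analysis.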

We first need to create an index to store our terms. In database lingo, an index is like a database, a type is roughly the equivalent of a table, a document is a record, and a field is, well, a field.

Actually, we are creating a multilingual dictionary, and we know that verbs, for example, take different forms based on their conjugation and tense. Similarly, nouns vary based on whether they are singular or plural. So we needed to add filters to our index, so that language-specific stemming lets us match terms regardless of their morphological variations.
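The effect of such an analysis chain can be sketched in a few lines of Python. This is a deliberately naive stand-in: it lowercases, splits on word characters, and strips a trailing "s", which is nowhere near what a real stemmer does. The names `toy_stem` and `analyze` are invented for this sketch. The point is only that the stored term and the incoming text go through the same steps, so different forms of a word end up as the same token.

```python
import re


def toy_stem(token):
    """Strip a trailing 's' from longer tokens; a crude stand-in for a real stemmer."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, tokenize on word characters, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(token) for token in tokens]


# The plural stored term and the singular form in the text meet at the same token.
print(analyze("Ejemplos"))
print(analyze("Por ejemplo, tenemos coche."))
```

Note that the toy stemmer also mangles words it should leave alone ("tenemos" loses its final "s"); this is harmless as long as both sides are analyzed the same way, but it shows why a proper per-language stemmer is worth configuring.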

From now on, we can talk to ElasticSearch through its REST interface, which listens on port 9200 by default. You can change the defaults in the configuration file, “config/elasticsearch.yml”.

curl -XPUT 'http://localhost:9200/dict/' -d '{
 "settings" : {
  "analysis" : {
   "filter" : {
    "stemmer_en" : {
     "type" : "stemmer",
     "name" : "english"
    },
    "stemmer_es" : {
     "type" : "stemmer",
     "name" : "spanish"
    }
   },
   "analyzer" : {
    "analyzer_en" : {
     "type" : "custom",
     "tokenizer" : "standard",
     "filter" : ["lowercase", "stemmer_en"]
    },
    "analyzer_es" : {
     "type" : "custom",
     "tokenizer" : "standard",
     "filter" : ["lowercase", "stemmer_es"]
    }
   }
  }
 }
}'
        

As stated, a document contains a set of fields, and in order to use the appropriate analyzer with each language, we are going to use an ElasticSearch mapping to map specific field names to specific languages. In our case, the field “en” will be mapped to the English analyzer, while “es” will be mapped to the Spanish analyzer. Here is the mapping command for the Spanish language.

curl -XPUT 'http://localhost:9200/dict/type_es/_mapping' -d '{
 "type_es" : {
  "properties" : {
   "es" : {
    "type" : "string",
    "analyzer" : "analyzer_es"
   }
  }
 }
}'
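Since every language needs a mapping of the same shape, the request body can be generated instead of written by hand. The following Python sketch assumes the naming convention used in this post carries over to other languages ("type_" plus the language code for the type, "analyzer_" plus the language code for the analyzer); `build_mapping` is a hypothetical helper, and "type_en" in particular is an assumption made by analogy with "type_es".

```python
import json


def build_mapping(lang):
    """Build the mapping body for one language, following the post's naming convention."""
    type_name = "type_" + lang          # e.g. "type_es"; "type_en" is assumed by analogy
    analyzer_name = "analyzer_" + lang  # must match an analyzer defined in the index settings
    return {
        type_name: {
            "properties": {
                lang: {
                    "type": "string",
                    "analyzer": analyzer_name,
                }
            }
        }
    }


# Print the body that would be sent to /dict/type_en/_mapping.
print(json.dumps(build_mapping("en"), indent=1))
```

The printed JSON can be passed to curl with -d in the same way as the Spanish mapping.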
        

Now, let's add a new Spanish term, along with its English and Arabic translations, to our index.

curl -XPUT 'localhost:9200/dict/.percolator/1?pretty' -d '{
 "query" : {
  "filtered" : {
   "query" : {
    "match" : {
     "es" : {
      "query" : "ejemplos",
      "type" : "phrase",
      "analyzer" : "analyzer_es"
     }
    }
   },
   "filter" : {
    "match_all" : { }
   }
  }
 },
 "definition" : "Una palabra",
 "dictionary" : {
  "translations" : [
   {
    "term" : "examples",
    "lang" : "en",
    "definition" : "Noun"
   },
   {
    "term" : "namazeg",
    "lang" : "ar",
    "definition" : "kelmah"
   }
  ]
 },
 "data-source" : "dictionary",
 "type" : "type_es"
}'
        

Notice that in the above code we chose to set the analyzer for the “es” field by hand. The “match_all” filter does nothing here, but in more sophisticated cases you can use filters to get finer control over which terms match which documents. All other fields, such as “definition” and “dictionary”, are arbitrary: you can add whatever fields you want and store whatever data you like in them.
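When a dictionary has thousands of terms, you would normally generate these bodies rather than type them. Here is a rough Python sketch that builds the same structure as the command above for any term and language. `build_term_query` and its `extra` parameter are hypothetical names chosen for this example, and the "type_" and "analyzer_" prefixes simply follow the convention used in this post.

```python
import json


def build_term_query(lang, term, extra=None):
    """Build a percolator document that matches `term` as a phrase in field `lang`."""
    body = {
        "query": {
            "filtered": {
                "query": {
                    "match": {
                        lang: {
                            "query": term,
                            "type": "phrase",
                            "analyzer": "analyzer_" + lang,
                        }
                    }
                },
                # Placeholder filter, as in the post; it does not restrict anything.
                "filter": {"match_all": {}},
            }
        },
        "type": "type_" + lang,
    }
    # Arbitrary extra fields (definition, translations, ...) are stored alongside the query.
    body.update(extra or {})
    return body


body = build_term_query("es", "ejemplos", {"definition": "Una palabra"})
print(json.dumps(body, indent=1, ensure_ascii=False))
```

Each generated body would then be stored under its own id in the ".percolator" type, as in the curl command above.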

Now you can send ElasticSearch some text and ask it to highlight the words that match the terms you stored, and to give you the document ids of those stored terms.

curl -XGET 'localhost:9200/dict/type_es/_percolate?pretty' -d '{
 "doc" : {
  "es" : "Por ejemplo, tenemos coche."
 },
 "highlight" : {
  "order" : "score",
  "pre_tags" : [""],
  "post_tags" : [""],
  "fields" : {
   "es" : { "number_of_fragments" : 0 }
  }
 },
 "size" : 100
}'
        

Notice that we are now using the “es” field for our text, which implies two important things. First, the text will only be matched against terms entered as “es” queries. Second, the Spanish analyzer configured in our mapping will be applied to the “es” field. Matching terms are highlighted using the “pre_tags” and “post_tags”; you can use whatever tag format you want for that. The “size” parameter sets the maximum number of terms to be matched. Let’s set the “number_of_fragments” to 0 for now, so that the whole field is returned rather than fragments of it. You can read more about the highlights and highlight fragments.
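Once the highlighted text comes back, the tags can be turned into the term locations we wanted in the first place. The Python sketch below works on a single highlighted string and does not depend on the exact shape of the percolate response. The function name `highlight_offsets` and the "[[" and "]]" tags are hypothetical, chosen only for this example; any non-empty pair of tags that cannot occur in the text itself would do.

```python
def highlight_offsets(highlighted, pre_tag, post_tag):
    """Strip highlight tags; return (plain_text, [(start, end, term), ...])."""
    if not pre_tag or not post_tag:
        raise ValueError("tags must be non-empty to locate highlighted terms")
    pieces = []   # chunks of the plain text, in order
    spans = []    # offsets of each highlighted term within the plain text
    pos = 0       # read position in the highlighted string
    out_len = 0   # length of the plain text emitted so far
    while True:
        start = highlighted.find(pre_tag, pos)
        if start == -1:
            break
        end = highlighted.find(post_tag, start + len(pre_tag))
        if end == -1:
            break  # unbalanced tag: keep the rest of the string as-is
        before = highlighted[pos:start]
        term = highlighted[start + len(pre_tag):end]
        pieces.append(before)
        out_len += len(before)
        spans.append((out_len, out_len + len(term), term))
        pieces.append(term)
        out_len += len(term)
        pos = end + len(post_tag)
    pieces.append(highlighted[pos:])
    return "".join(pieces), spans


text, spans = highlight_offsets("Por [[ejemplo]], tenemos coche.", "[[", "]]")
print(text)
print(spans)
```

The offsets refer to the text with the tags removed, so they can be used directly to attach definitions and translations to the right words in the original document.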

That’s all for today, folks.



