Cloudant Analizers

The other day I rewrote CloudanDB (Which is in Japanese 日本語) and realized that some things were not working because of the analizer I was using. So this post is about them.

What are Cloudant Analizers?

From the official documentation:

Analyzers are settings which define how to recognize terms within text. This can be helpful if you need to index multiple languages.

I decided to try some of the analyzers and put my results here. Test text has two completely different languages (English and Japanese) and some other symbols which I think it is enough to find out how different analysers behave.

How to test Analizers

The _search_analyze API is for this purpose

FILE=analyze_classic.json
echo '{"analyzer": "classic","text": "Evening of the seventh"}' > $FILE

# Use _search_analyze
CLOUDANT='https://<username>:<password>@<username>.cloudant.com'
curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' --data @${FILE}

{"tokens":["evening","seventh"]}

Experiment

Today is a Japanese festival Tanabata so I will use a piece of this romantic song I like very much Suspicious Mind - Elvis Presley and a part of its japanese translation I found here

We're caught in a trap\n
I can't walk out\n
Because I love you too much baby♬.\n
Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n
はまった罠から\n
出られないんだ\n
ほんとに君に首ったけなんだ♬.\n
２０１６年７月７日１８時０分２９秒にnacho@email.jpより ★

I added new lines \n, ♬★ marks, an email and a timestamp to make it more interesting ;-]

Results

Analyzer	Result tokens	Length
classic	["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016-07-07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nacho@email.jp", "よ", "り"]	64
email	["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nacho@email.jp", "よ", "り"]	66
keyword	["We're caught in a trap\nI can't walk out\nBecause I love you too much baby♬.\nSent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\nはまった罠から\n出られないんだ\nほんとに君に首ったけなんだ♬.\n２０１６年７月７日１８時０分２９秒にnacho@email.jpより ★"]	1
simple	["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ", "年", "月", "日", "時", "分", "秒にnacho", "email", "jpより"]	35
standard	["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nacho", "email.jp", "よ", "り"]	68
whitespace	["We're", "caught", "in", "a", "trap", "I", "can't", "walk", "out", "Because", "I", "love", "you", "too", "much", "baby♬.", "Sent", "by", "nacho@email.com", "at", "2016-07-07", "18:00:29", "+0900", "★", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ♬.", "２０１６年７月７日１８時０分２９秒にnacho@email.jpより", "★"]	29
english	["we'r", "caught", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "babi", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nacho", "email.jp", "よ", "り"]	68
spanish	["we'r", "caught", "in", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "baby", "sent", "by", "nach", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nach", "email.jp", "よ", "り"]	71
japanese	["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "2016", "07", "07", "18", "00", "29", "0900", "はまる", "罠", "出る", "ほんとに", "君", "首ったけ", "2", "0", "1", "6", "年", "7月", "7", "日", "1", "8", "時", "0", "分", "2", "9", "秒", "nacho", "email", "jp"]	56
cjk	["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "はま", "まっ", "った", "た罠", "罠か", "から", "出ら", "られ", "れな", "ない", "いん", "んだ", "ほん", "んと", "とに", "に君", "君に", "に首", "首っ", "った", "たけ", "けな", "なん", "んだ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒に", "nacho", "email.jp", "より"]	63
arabic	["we're", "caught", "in", "a", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "２０１６", "年", "７", "月", "７", "日", "１８", "時", "０", "分", "２９", "秒", "に", "nacho", "email.jp", "よ", "り"]	72

Commentaries

Some interesting things to note:

The only notorious difference between standard and classic is the treat of "2016-07-07" and emails. standard got the emails wrong.
email looks like standard but with the emails right
whitespace looks like a good option when text has symbols (ie.: ♬, ★). There are still there as tokens!
keyword just one token. I think it would be useful for exact match searches
None of the analyzers, not even keyword preserved the new line character \n.
Several occurrences of the same string are valid. (ie.: "i" and "★")
I find funny "we're" became "we'r" when english was used. Also "baby" became "babi".
Japanese words were tokenized as characters in english. Makes really hard (if possible) to do a useful search using Japanese
I am depressed that spanish didn't get my name "nacho" right. For some reason it became "nach". I guess it is trying to get the root of it, since there is also nacha, nachito, nachita, nachos, nachas, etc.
cjk (chinese/japanese/korean) complete broke the japanese words. I know it is hard to parse Japanese but cjk is not helping at all here.
I have no idea why I tried arabic I literally have 0 knowledge of the language to comment something about it.

More on "nacho"

curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nacho"}'

{"tokens":["nach"]}

curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nachos"}'

{"tokens":["nach"]}

curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"Nachos"}'

{"tokens":["nach"]}

🇵🇪 Creo que voy a reclamar a los de Cloudant o Apache para que mi nombre salga bien! Ahorita mismo!!🇵🇪 (O sera que esta sacando la raiz de la palabra?)

Programming notes

Nacho4d

Programming Notes

More posts