2016/07/07

Cloudant Analizers

The other day I rewrote CloudanDB (Which is in Japanese 日本語) and realized that some things were not working because of the analizer I was using. So this post is about them.

What are Cloudant Analizers?

From the official documentation:
Analyzers are settings which define how to recognize terms within text. This can be helpful if you need to index multiple languages.
I decided to try some of the analyzers and put my results here. Test text has two completely different languages (English and Japanese) and some other symbols which I think it is enough to find out how different analysers behave.

How to test Analizers

The _search_analyze API is for this purpose
FILE=analyze_classic.json
echo '{"analyzer": "classic","text": "Evening of the seventh"}' > $FILE

# Use _search_analyze
CLOUDANT='https://<username>:<password>@<username>.cloudant.com'
curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' --data @${FILE}
{"tokens":["evening","seventh"]}

Experiment

Today is a Japanese festival Tanabata so I will use a piece of this romantic song I like very much Suspicious Mind - Elvis Presley and a part of its japanese translation I found here
We're caught in a trap\n
I can't walk out\n
Because I love you too much baby♬.\n
Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n
はまった罠から\n
出られないんだ\n
ほんとに君に首ったけなんだ♬.\n
2016年7月7日18時0分29秒にnacho@email.jpより ★
I added new lines \n, ♬★ marks, an email and a timestamp to make it more interesting ;-]

Results

AnalyzerResult tokensLength
classic["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016-07-07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho@email.jp", "よ", "り"]64
email["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho@email.jp", "よ", "り"]66
keyword["We're caught in a trap\nI can't walk out\nBecause I love you too much baby♬.\nSent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\nはまった罠から\n出られないんだ\nほんとに君に首ったけなんだ♬.\n2016年7月7日18時0分29秒にnacho@email.jpより ★"]1
simple["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ", "年", "月", "日", "時", "分", "秒にnacho", "email", "jpより"]35
standard["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"]68
whitespace["We're", "caught", "in", "a", "trap", "I", "can't", "walk", "out", "Because", "I", "love", "you", "too", "much", "baby♬.", "Sent", "by", "nacho@email.com", "at", "2016-07-07", "18:00:29", "+0900", "★", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ♬.", "2016年7月7日18時0分29秒にnacho@email.jpより", "★"]29
english["we'r", "caught", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "babi", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"]68
spanish["we'r", "caught", "in", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "baby", "sent", "by", "nach", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nach", "email.jp", "よ", "り"]71
japanese["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "2016", "07", "07", "18", "00", "29", "0900", "はまる", "罠", "出る", "ほんとに", "君", "首ったけ", "2", "0", "1", "6", "年", "7月", "7", "日", "1", "8", "時", "0", "分", "2", "9", "秒", "nacho", "email", "jp"]56
cjk["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "はま", "まっ", "った", "た罠", "罠か", "から", "出ら", "られ", "れな", "ない", "いん", "んだ", "ほん", "んと", "とに", "に君", "君に", "に首", "首っ", "った", "たけ", "けな", "なん", "んだ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒に", "nacho", "email.jp", "より"]63
arabic["we're", "caught", "in", "a", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"]72

Commentaries

Some interesting things to note:
  • The only notorious difference between standard and classic is the treat of "2016-07-07" and emails. standard got the emails wrong.
  • email looks like standard but with the emails right
  • whitespace looks like a good option when text has symbols (ie.: ♬, ★). There are still there as tokens!
  • keyword just one token. I think it would be useful for exact match searches
  • None of the analyzers, not even keyword preserved the new line character \n.
  • Several occurrences of the same string are valid. (ie.: "i" and "★")
  • I find funny "we're" became "we'r" when english was used. Also "baby" became "babi".
  • Japanese words were tokenized as characters in english. Makes really hard (if possible) to do a useful search using Japanese
  • I am depressed that spanish didn't get my name "nacho" right. For some reason it became "nach". I guess it is trying to get the root of it, since there is also nacha, nachito, nachita, nachos, nachas, etc.
  • cjk (chinese/japanese/korean) complete broke the japanese words. I know it is hard to parse Japanese but cjk is not helping at all here.
  • I have no idea why I tried arabic I literally have 0 knowledge of the language to comment something about it.


More on "nacho"
curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nacho"}'
{"tokens":["nach"]}
curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nachos"}'
{"tokens":["nach"]}
curl -X POST ${CLOUDANT}/_search_analyze  -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"Nachos"}'
{"tokens":["nach"]}
🇵🇪 Creo que voy a reclamar a los de Cloudant o Apache para que mi nombre salga bien! Ahorita mismo!!🇵🇪 (O sera que esta sacando la raiz de la palabra?)

0 comments :