Tokenization vs Guillemet

While doing some testing, I noticed that the tokenizer treats guillemets «, » differently from the more common ", '.

Look at this string: «a sentence between guillemet». Your tokenizer produces this: «a, sentence, between, guillemet».

The tokens «a and guillemet» keep their guillemets, so it seems they are not identified for further processing, like the lower-case transformation. If the tokenizer also encounters the upper-case versions GUILLEMET» and «A in the text, it will create two more entries.

This results in the creation of spurious entries in the inverted index dictionary, and the loss of information associated with the loss of the term guillemet. All of this worsens the user experience in terms of accuracy and recall. Not least, it unnecessarily increases RAM usage with useless tokens.
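The behavior can be reproduced with a minimal sketch (hypothetical code; the project's actual tokenizer is certainly more sophisticated, but a plain whitespace split shows the same symptom):

```python
import string

text = "«a sentence between guillemet». And GUILLEMET» vs «A."

# A naive whitespace split leaves the guillemets attached to the tokens,
# so «a, guillemet»., GUILLEMET» and «A all become distinct terms.
raw = text.split()
print(raw)

# Stripping guillemets together with ASCII punctuation before lowercasing
# collapses all the variants back into a single dictionary term.
strip_chars = string.punctuation + "«»"
clean = [t.strip(strip_chars).lower() for t in raw]
print(clean)
```

With the stripping step, every variant maps to the plain term guillemet, and the inverted index gets one entry instead of several.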

Hey @Luca,

Thanks for pointing this out — would you mind opening an issue on GitHub for it?