RedisSearch OOM and proper indexing technique

Hi all
New to Redis.

Using python as my CLI.

I am trying to load a JSON DOC … and then make it searchable via a particular value in the DOC.

So for example.

RAWDATA = {

  • ‘f1’:‘Peter’,
  • ‘ai’:600368481,
  • ‘h1’:’’,
  • ‘r2’:‘tx’,
  • ‘u2’:’’,
  • ‘d0’:‘CUSTACCP’,
  • ‘s3’:1478309277588,
  • ‘s2’:False,
  • ‘s1’:’’,
  • ‘u1’:2,
  • ‘u0’:‘INTERNET’,
  • ‘s5’:‘0’,
  • ‘s4’:’’,
  • ‘c13’:‘USA’,
  • ‘c12’:‘us’,
  • ‘c11’:‘US’,
  • ‘c10’:‘The Colony’,
  • ‘type’:‘addressbook’
    }

I have a million of these … some with extra attributes also.

I want to load these and then be able to search quickly (aggregate and search) via just one or two attributes. For sake of argument - lets say attribute d0 and attribute r2.

Initially I loaded each doc through a loop from file with index on BODY which is the raw JSON data

client.create_index([TextField(‘body’, sortable=True)])

client.add_document(docid, body=str(doc))


But this essentially splits EVERY element UP and indexes it - and I end up with 10x the keys and run out of memory.

I *think* I have to parse the DOC and assign an FIELD to it for each and every element - which is really cumbersome as it could be anything in the data -- so to know to treat it as a number of text is almost impossible .......

Thoughts - not sure if I am on right track .... or not .....

Greetings…

I’m not an expert, but I’ve accomplished what I’ve wanted with RediSearch… Maybe I can point you in the right direction…

  1. The first thing to do (if you haven’t already) is to get a database GUI working so that you can see what’s going on in the database as you learn Redis and RediSearch… I use Redis Desktop, but Redis has FREE one which I haven’t tried, but sounds fine: https://redislabs.com/redisinsight/

  2. I would also recommend working with a small data set until you get everything tweaked…

  3. A CLI adds and extra layer of complexity to the whole process because many of the CLIs don’t support all of the nuances of the commands… Thus your control over things is not so granular… I used the PHPRedis which does NOT support the RediSearch commands, but has a “rawcommand” method that lets you execute a raw command via bash shell (or something like that)… So, basically, I would suggest getting everything running with just a few records using pure command line and THEN figure out how to replicate that in Python and then in Python in bulk for your million records…

  4. Your index looks too simple… My data set had several fields that I addressed one by one and tried to specify for every field whether it should be indexed, the type of data, and its weight. My raw command was a mile long. I don’t have a copy of the one I wrote for the command line, but via my CLI, it looked like this:

$result = $redis->rawCommand(“FT.CREATE”, “socialPostsIDX”, “NOHL”, “SCHEMA”, “modified”, “NUMERIC”, “SORTABLE”, “NOINDEX”, “title”, “TEXT”, “WEIGHT”, “5.0”, “content”, “TEXT”, “WEIGHT”, “3.0”, “location”, “TEXT”, “WEIGHT”, “2.0”, “author”, “TEXT”, “NOSTEM”, “WEIGHT”, “1.0”, “authid”, “NUMERIC”, “NOINDEX”, “authimg”, “TEXT”, “NOINDEX”, “authreg”, “NUMERIC”, “NOINDEX”, “authurl”, “TEXT”, “NOINDEX”, “imgbase”, “TEXT”, “NOINDEX”, “created”, “NUMERIC”, “NOINDEX”, “permalink”, “TEXT”, “NOINDEX”, “size”, “TEXT”, “NOINDEX”, “vidurl”, “TEXT”, “NOINDEX”, “comments”, “NUMERIC”, “NOINDEX”, “views”, “NUMERIC”, “NOINDEX”, “revive”, “NUMERIC”, “NOINDEX”, “dataextra”, “TEXT”, “NOINDEX”);

If you just want to lump your JSON together into one field maybe specifying the data type “TEXT” would help when you create the index?? Or maybe splitting your JSON up into fields is the way to go, especially if not all will need to be searched. I had a lot of “noindex” fields in my data set, because I decided to store the entire doc in RediSearch. That’s sort of a strategy decision… You can index a few searchable fields including a doc uuid and pull the doc from elsewhere based on a search return OR you can just put the whole doc in RediSearch if you’ve got the resources to do it

  1. Once you get your index setup correctly and you can add a doc against that index and search for it, then you can try to figure out how to replicate things in your CLI and in bulk… Make sure you read up on pipelining… It’s the only way to go when adding a million records… https://redis.io/topics/pipelining I was able to do this via PHP and it saved the day…

Good luck,

Alec

Thanks Alex. Great tips.

I guess my main thought is exactly as you say …

if you have a JSON/DOC/HASH … should you specify ALL attributes to be indexed …

My simplified code version is

for eachrow in f:
    try:
        #Get uniq DocID
        docid=doc['METID']
        #Test for R1 or Type to index appropriately
        if 'r1' in eachrow:
            r1val=doc['r1']
            if 'METID' in eachrow:
                del doc['METID']
            clientr1idx.add_document(docid, r1=r1val, body=str(doc))
        elif 'type' in eachrow:
            typeval = doc['type']
            if 'METID' in eachrow:
                del doc['METID']
            clienttypeidx.add_document(docid, type=typeval, body=str(doc))

The idea being here we want to aggregate/scan documents either by (R1) or by (type).

Essentially here I have loaded all 40 million docs into two indexes.

  1. One Index for ‘r1’

  2. One Index for ‘type’

The if statement above just works out the DOC ID from the payload, assigns it to a variable - and then removes from the payload before insert to redis. It then determines which index to load against depending on payload value.

Raw data size was 13GB. Total data size in redis after load with 2 indexes was.

This way I have the tightest/smallest index for each usecase.

I then run an aggregation query … and this is result.

redis-16467.redispoc.qvcdev.qvc.net:16467> FT.AGGREGATE typeINDEX * GROUPBY 1 @type REDUCE COUNT 0 as type_count

  1. (integer) 10
    1. “type”
    2. “addressbooklookup”
    3. “type_count”
    4. “580598”
    1. “type”
    2. “customer”
    3. “type_count”
    4. “784515”
    1. “type”
    2. “products”
    3. “type_count”
    4. “2149264”
    1. “type”
    2. “email”
    3. “type_count”
    4. “2559136”
    1. “type”
    2. “cn”
    3. “type_count”
    4. “149267”
    1. “type”
    2. “ldapagent”
    3. “type_count”
    4. “3240”
    1. “type”
    2. “addressbooklist”
    3. “type_count”
    4. “515699”
    1. “type”
    2. “orderrange”
    3. “type_count”
    4. “2”
    1. “type”
    2. “phone”
    3. “type_count”
    4. “3101300”
    1. “type”
    2. “custguid”
    3. “type_count”
    4. “2856190”
      (1.50s)

1.50 seconds for an aggregation across 13 million docs.

=== INDEX STRUCTURE - 12.6 million docs

redis-16467.redispoc.qvcdev.qvc.net:16467> FT.INFO typeINDEX

  1. index_name
  2. “typeINDEX”
  3. fields
      1. type
      2. type
      3. TEXT
      4. WEIGHT
      5. “1”
      6. SORTABLE
  4. index_options
  5. (empty list or set)
  6. gc_stats
    1. current_hz
    2. “1”
    3. bytes_collected
    4. (integer) 0
    5. effectiv_cycles_rate
    6. “0”
  7. cursor_stats
    1. global_idle
    2. (integer) 0
    3. global_total
    4. (integer) 0
    5. index_capacity
    6. (integer) 384
    7. index_total
    8. (integer) 0
  8. num_docs
  9. (integer) 12699211