What is the best approach for checking for a certain id in a huge list of ids(from csv)

Hi,
I had a question about a usecase we had in my project.
We had to read a list of ids from 2 csvs - opted-in_ids.csv , available_ids.csv. These are pushed into s3 by another team.
In our site, at a time, we need to check if a particular id is in opted in list or/and available list. ie, check if these ids are present in either of those 2 csvs. For this we just used 2 keys, OPTED_IN, AVAILABLE and just pushed the array of the entries from csvs as a Stringified Json in our lambda code.
So, in client code, we will read this and parse it and search for the id in the array list. The array list is like millions of records.
Is there a better way for this? Are these better stored as sets? SADD with the arraylist and use SISMEMBER on client side?
We did start seeing performance issues when the list got very big.

If we were to move this to SETS, what would be the best way to load the data into sets from the csv(as it contains millions of records). We might need to go for a batch behavior to not bring down the db.
This list is updated once in every 2 weeks or so. SO loading of SET might have to be run multiple times.

Any inputs will be appreciated

Hi @geethu.unnikrishnan

There are multiple ways to address this use case (depends on number of records, operations/sec at peak to check the item exists either of list). Below are suggested options:

  1. create separate keys for each record example
  • opted:1425
  • avail:5324
  • opted:5324
  • avail:5444
    then just execute MGET
  1. Use SETS for each type (opted, available ids) and then execute SISMEMBER, BTW there are UNION or INTERSECT operations on SETS as well that you can you create separate keys once all insertions are done everytime. example
    SADD opted 453 543 655 666
    SADD avail 451 543 155 616

SUNIONSTORE idsunion opted avail

  • this can give union ids from both type and then you do SISMEMBER
  1. Use BITMAPS - can store upto 4 billion bit values. what you actually storing that ids is exist or not. if that suffice your need. below is example:
    SETBIT opted 345 1
    SETBIT opted 145 1
    SETBIT avail 345 1
    GETBIT opted 345
    GETBIT avail 345

Regarding your question how you can improve/load records from .csv to Redis:

  1. when you upload the .csv file to S3 make sure it has only required field - remove unwanted fields
  2. instead of uploading all records in single file, break into multiple files so that you can load them faster and process in parallel with multiple clients
  3. to load records in Redis from .csv use tested solutions like https://www.initpals.com/redis/how-to-bulk-insert-csv-data-into-redis-with-node-js/

Hope this helps, all the best.

Let me know if you have more questions.

1 Like

Thanks @suyog. Thanks for your inputs. The uploading of the list is controlled by business, we do not have control over it. The 3rd approach calls the set command on every record in csv, wont this create a million network requests and load up the service?
If we wait till the end of the csv read, and then call SADD with the entire array list as members(1 command), is it going to be a performance hit?( as the time complexity is O(n) for adding n elements). Or we would need to chunk the array into batches and call the SADD for each batch.

1 Like

@geethu.unnikrishnan you can always use pipeline for sending multiple commands (batches) refer https://redis.io/topics/pipelining

1 Like