What is the best approach for checking for a certain id in a huge list of ids(from csv)

geethu.unnikrishnan · July 9, 2020, 5:01am

Hi,
I had a question about a usecase we had in my project.
We had to read a list of ids from 2 csvs - opted-in_ids.csv , available_ids.csv. These are pushed into s3 by another team.
In our site, at a time, we need to check if a particular id is in opted in list or/and available list. ie, check if these ids are present in either of those 2 csvs. For this we just used 2 keys, OPTED_IN, AVAILABLE and just pushed the array of the entries from csvs as a Stringified Json in our lambda code.
So, in client code, we will read this and parse it and search for the id in the array list. The array list is like millions of records.
Is there a better way for this? Are these better stored as sets? SADD with the arraylist and use SISMEMBER on client side?
We did start seeing performance issues when the list got very big.

If we were to move this to SETS, what would be the best way to load the data into sets from the csv(as it contains millions of records). We might need to go for a batch behavior to not bring down the db.
This list is updated once in every 2 weeks or so. SO loading of SET might have to be run multiple times.

Any inputs will be appreciated

suyog · July 9, 2020, 6:13am

Hi @geethu.unnikrishnan

There are multiple ways to address this use case (depends on number of records, operations/sec at peak to check the item exists either of list). Below are suggested options:

create separate keys for each record example

opted:1425
avail:5324
opted:5324
avail:5444
then just execute MGET

Use SETS for each type (opted, available ids) and then execute SISMEMBER, BTW there are UNION or INTERSECT operations on SETS as well that you can you create separate keys once all insertions are done everytime. example
SADD opted 453 543 655 666
SADD avail 451 543 155 616

SUNIONSTORE idsunion opted avail

this can give union ids from both type and then you do SISMEMBER

Use BITMAPS - can store upto 4 billion bit values. what you actually storing that ids is exist or not. if that suffice your need. below is example:
SETBIT opted 345 1
SETBIT opted 145 1
SETBIT avail 345 1
GETBIT opted 345
GETBIT avail 345

Regarding your question how you can improve/load records from .csv to Redis:

when you upload the .csv file to S3 make sure it has only required field - remove unwanted fields
instead of uploading all records in single file, break into multiple files so that you can load them faster and process in parallel with multiple clients
to load records in Redis from .csv use tested solutions like https://www.initpals.com/redis/how-to-bulk-insert-csv-data-into-redis-with-node-js/

Hope this helps, all the best.

Let me know if you have more questions.

geethu.unnikrishnan · July 9, 2020, 6:33am

Thanks @suyog. Thanks for your inputs. The uploading of the list is controlled by business, we do not have control over it. The 3rd approach calls the set command on every record in csv, wont this create a million network requests and load up the service?
If we wait till the end of the csv read, and then call SADD with the entire array list as members(1 command), is it going to be a performance hit?( as the time complexity is O(n) for adding n elements). Or we would need to chunk the array into batches and call the SADD for each batch.

suyog · July 9, 2020, 7:28am

@geethu.unnikrishnan you can always use pipeline for sending multiple commands (batches) refer https://redis.io/topics/pipelining

Topic		Replies	Views
Whats the best one for table like data storage? Redis commands & data structures	3	748	July 7, 2020
How to solve the problem of storing empty set? Redis commands & data structures sets	4	5758	September 2, 2020
wait for certain key Redis commands & data structures	1	638	December 7, 2022
About Redis Commands & Data Structures Redis commands & data structures	15	3377	July 18, 2020
Doubt regarding Redis code Redis commands & data structures python	0	336	February 28, 2023

What is the best approach for checking for a certain id in a huge list of ids(from csv)

Related Topics