I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'clicks' per second that I want to store. If possible, I would like the data to be available in Redshift as soon as possible.
From what I understand, because of the columnar storage, single-row insert performance is bad, so you have to insert in batches. My workflow is to store the clicks in Redis and, every minute, insert the ~600 buffered clicks into Redshift as one batch.
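For context, the buffering part of my workflow looks roughly like this (a simplified sketch using redis-py; the key name and click fields are placeholders, not my real schema):

```python
import json
import time

import redis  # redis-py

r = redis.Redis(host="localhost", port=6379)

def record_click(click):
    # Called ~10 times per second by the application: just buffer the click in Redis.
    r.rpush("clicks:buffer", json.dumps(click))

def drain_buffer():
    # Atomically read and clear the buffer so a click is neither lost nor loaded twice.
    pipe = r.pipeline()
    pipe.lrange("clicks:buffer", 0, -1)
    pipe.delete("clicks:buffer")
    raw, _ = pipe.execute()
    return [json.loads(c) for c in raw]

if __name__ == "__main__":
    while True:
        time.sleep(60)            # flush once a minute
        batch = drain_buffer()    # ~600 clicks
        if batch:
            print("flushing %d clicks to Redshift" % len(batch))
            # hand the batch to one of the two load strategies described below
```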
I have two ways of inserting a batch of clicks into Redshift (rough sketches of both follow the list):

- Multi-row insert strategy: I use a regular `insert` query to insert multiple rows at once. Multi-row insert documentation here.
- S3 Copy strategy: I copy the rows to S3 as `clicks_1408736038.csv`, then run a `COPY` to load them into the `clicks` table. COPY documentation here.
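Concretely, here is roughly what the two strategies look like in my loader code (a simplified sketch: the cluster endpoint, credentials, bucket name, IAM role, and the `clicks` columns are placeholders):

```python
import csv
import io
import time

import boto3                     # S3 upload for the COPY strategy
import psycopg2                  # Redshift speaks the Postgres protocol
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="loader", password="...",
)

BUCKET = "my-clicks-bucket"                                     # placeholder bucket
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"       # placeholder role

def load_batch_insert(batch):
    # Strategy 1: one multi-row INSERT (a single statement with many VALUES tuples).
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO clicks (user_id, url, clicked_at) VALUES %s",
            [(c["user_id"], c["url"], c["clicked_at"]) for c in batch],
            page_size=1000,
        )
    conn.commit()

def load_batch_copy(batch):
    # Strategy 2: write a uniquely named CSV to S3, then COPY it into the table.
    key = "clicks_%d.csv" % int(time.time())
    buf = io.StringIO()
    csv.writer(buf).writerows(
        [(c["user_id"], c["url"], c["clicked_at"]) for c in batch]
    )
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8")
    )
    with conn.cursor() as cur:
        cur.execute(
            "COPY clicks FROM 's3://%s/%s' IAM_ROLE '%s' CSV"
            % (BUCKET, key, IAM_ROLE)
        )
    conn.commit()
```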
I've done some tests (on a clicks table that already had 2 million rows):
| batch size   | Multi-row insert: insert query | S3 Copy: upload to S3 | S3 Copy: COPY query |
|--------------|--------------------------------|-----------------------|---------------------|
| 1 record     | 0.25 s                         | 0.20 s                | 0.50 s              |
| 1k records   | 0.30 s                         | 0.20 s                | 0.50 s              |
| 10k records  | 1.90 s                         | 1.29 s                | 0.70 s              |
| 100k records | 9.10 s                         | 7.70 s                | 1.50 s              |
As you can see, in terms of performance it looks like I gain nothing by first copying the data to S3: the upload + COPY time is roughly equal to the insert time.
Questions:
What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?
And a side question: is it possible for Redshift to COPY the data automatically from S3 via a manifest? I mean, COPYing the data as soon as new .csv files are added to S3? Doc here and here. Or do I have to create a background worker myself to trigger the COPY commands?
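To make the side question concrete, this is roughly what I have in mind by "COPY via a manifest", i.e. something a background worker of mine would have to run (a sketch; bucket and role names are placeholders, and `conn` is the psycopg2 connection from the sketch above):

```python
import json
import time

import boto3

BUCKET = "my-clicks-bucket"                                # placeholder bucket
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy"  # placeholder role

def copy_from_manifest(conn, csv_keys):
    # Build a manifest listing the new .csv objects and upload it to S3 ...
    manifest = {
        "entries": [
            {"url": "s3://%s/%s" % (BUCKET, key), "mandatory": True}
            for key in csv_keys
        ]
    }
    manifest_key = "manifests/clicks_%d.json" % int(time.time())
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=manifest_key,
        Body=json.dumps(manifest).encode("utf-8"),
    )
    # ... then a single COPY with the MANIFEST option loads every file in the list.
    with conn.cursor() as cur:
        cur.execute(
            "COPY clicks FROM 's3://%s/%s' IAM_ROLE '%s' CSV MANIFEST"
            % (BUCKET, manifest_key, IAM_ROLE)
        )
    conn.commit()
```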
My quick analysis:
In the documentation about consistency, there is no mention of loading data via multi-row inserts. It looks like the preferred way is COPYing from S3 with unique object keys (each .csv on S3 has its own unique name)...
S3 Copy strategy:
- PROS: looks like the recommended practice in the docs.
- CONS: more work (I have to manage buckets, manifests, and a cron job that triggers the `COPY` commands...)

Multi-row insert strategy:
- PROS: less work. I can call an `insert` query from my application code.
- CONS: doesn't look like a standard way of importing data. Am I missing something?