How to store a column value in a Redshift VARCHAR column with length more than 65535

I tried to load the Redshift table but failed on one column: The length of the data column 'column_description' is longer than the length defined in the table. Table: 65535, Data: 86555.
I tried to increase the length of the column in the Redshift table, but it looks like 65535 is the maximum length Redshift supports.
Do we have any alternative to store this value in Redshift?

The answer is that Redshift doesn't support anything larger and that one shouldn't store large artifacts in an analytic database. If you are using Redshift for its analytic powers to find specific artifacts (images, files, etc.), then these should be stored in S3 and the object key (pointer) should be stored in Redshift.
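For example, a minimal sketch of that pattern (the table and column names here are hypothetical) keeps the oversized text in S3 and stores only the pointer, plus any searchable metadata, in Redshift:

-- large text lives in S3; Redshift keeps a pointer and an optional excerpt
CREATE TABLE artifact_catalog (
    artifact_id         BIGINT         NOT NULL,
    s3_key              VARCHAR(1024)  NOT NULL,  -- e.g. 's3://my-bucket/descriptions/123.txt'
    description_excerpt VARCHAR(65535)            -- truncated copy kept in-database for searching, if useful
);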

Related

oid and bytea are creating system tables

oid -> creates a system table pg_largeobject and stores the data in there
bytea -> if the compressed data would still exceed 2000 bytes, PostgreSQL splits variable-length data types into chunks and stores them out of line in a special “TOAST table”, according to https://www.cybertec-postgresql.com/en/binary-data-performance-in-postgresql/
I don't want any other table for large data; I want to store it in a column in my defined table. Is that possible?
It is best to avoid Large Objects.
With bytea you can prevent PostgreSQL from storing data out of line in a TOAST table by changing the column definition like this:
ALTER TABLE tab ALTER col SET STORAGE MAIN;
Then PostgreSQL will compress that column but keep it in the main table.
Since the block size in PostgreSQL is 8kB, and one row is always stored in a single block, that will limit the size of your table rows to somewhat under 8kB (there is a block header and other overhead).
I think that you are trying to solve a non-problem, and your request to not store large data out of line is unreasonable.
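If you do use SET STORAGE MAIN, a quick way to confirm the per-column storage strategy (a small sketch, assuming the table is named tab as above) is to query pg_attribute:

-- attstorage: 'p' = PLAIN, 'm' = MAIN, 'x' = EXTENDED, 'e' = EXTERNAL
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'tab'::regclass
  AND attnum > 0
  AND NOT attisdropped;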

Shall I SET STORAGE PLAIN on fixed-length bytea PK column?

My Postgres table's primary key is a SHA1 checksum (always 20 bytes) stored in a bytea column (because Postgres doesn't have fixed-length binary types).
Shall I ALTER TABLE t ALTER COLUMN c SET STORAGE PLAIN so that Postgres doesn't compress and/or outsource (TOAST) my PK/FK, for the sake of lookup and join performance? And why (not)?
I would say that that is a micro-optimization that will probably not have a measurable effect.
First, PostgreSQL only considers compressing and slicing values if the row exceeds 2000 bytes, so there will only be an effect at all if your rows routinely exceed that size.
Then, even if the primary key column gets toasted, you will probably only be able to measure a difference if you select a large number of rows in a single table scan. Fetching only a few rows by index won't make a big difference.
I'd benchmark both approaches, but I'd assume that it will be hard to measure a difference. I/O and other costs will probably hide the small extra CPU time required for decompression (remember that the row has to be large for TOAST to kick in in the first place).
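If you do want to try it anyway, a minimal sketch (the table and column names are hypothetical) would look like this:

-- SHA-1 digests are always 20 bytes, enforced by the CHECK constraint
CREATE TABLE files (
    sha1    bytea PRIMARY KEY CHECK (octet_length(sha1) = 20),
    payload bytea
);
-- keep the key uncompressed and inline; benchmark before assuming this helps
ALTER TABLE files ALTER COLUMN sha1 SET STORAGE PLAIN;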

Encoding a Postgres UUID in Amazon Redshift

We have a couple of entities which are being persisted into Amazon Redshift for reporting purposes, and these entities have a relationship between them. The source tables in Postgres are related via a foreign key with a UUID datatype, which is not supported in Redshift.
One option is to encode the UUID as a 128-bit signed integer. The Redshift documentation refers to the ability to create DECIMAL(38,0) columns, and to the ability to store 128-bit numbers.
But 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456, which is 39 digits (thanks, Wikipedia). So despite what the docs say, you cannot store the full 128 bits / 39 digits of precision in Redshift. How do you actually make a full 128-bit number column in Redshift?
In short, the real question behind this is - what is Redshift best practice for storing & joining tables which have UUID primary keys?
Redshift joins will perform well even with a VARCHAR key, so that's where I would start.
The main factor for join performance will be co-locating the rows onto the same compute node. To achieve this you should declare the UUID column as the distribution key on both tables.
Alternatively, if one of the tables is fairly small (<= ~1 million rows), then you can declare that table as DISTSTYLE ALL and choose some other dist key for the larger table.
If you have co-located the join and wish to optimize further then you could try splitting the UUID value into 2 BIGINT columns, one for the top 64 bits and another for the bottom 64. Even half of the UUID is likely to be unique and then you can use the second column as a "tie breaker".
cf. "Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization"
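As a sketch of the first suggestion (the table and column names are hypothetical), declare the UUID column as the distribution key on both sides of the join so matching rows are co-located:

-- both tables distributed (and sorted) on the UUID key
CREATE TABLE parent (
    id   VARCHAR(36) NOT NULL,  -- UUID in canonical text form
    name VARCHAR(256)
)
DISTKEY (id)
SORTKEY (id);

CREATE TABLE child (
    id        VARCHAR(36) NOT NULL,
    parent_id VARCHAR(36) NOT NULL
)
DISTKEY (parent_id)
SORTKEY (parent_id);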

What does the column skew_sorkey1 in Amazon Redshift's svv_table_info imply?

Redshift's documentation (http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_TABLE_INFO.html) states that the definition of the column skew_sortkey1 is - Ratio of the size of the largest non-sort key column to the size of the first column of the sort key, if a sort key is defined. Use this value to evaluate the effectiveness of the sort key.
What does this imply? What does it mean if this value is large, or alternatively small?
Thanks!
A large skew_sortkey1 value means that the largest non-sort-key column is much bigger than the first sort key column, which means that the row offsets in one disk block of the sort key correspond to several disk blocks in the data columns.
For example, let's say the skew_sortkey1 value is 5 for a table. Now the row offsets in one disk block for the sort key correspond to 5 disk blocks for the other data columns. The zone map stores the min and max values for each sort key disk block, so when you query this table with a WHERE clause on the sort key, Redshift identifies the sort key block which contains this data (block min < where clause value < block max) and fetches the row offsets for that column. Now since skew_sortkey1 is 5, it has to fetch 5 blocks for the data columns before filtering the records down to the desired ones.
So to conclude having a high skew_sortkey1 value is not desirable.
Sort keys define the order in which table rows are stored in Redshift's disk blocks. This means that column data belonging to a sort key region gets stored together in a single disk block (1 MB in size). Since Redshift applies compression to the different columns, sort key columns have a potential advantage: similar data is stored within the same disk block, which leads to higher compression / more efficient storage of data. The same thing cannot be said about the other, non-sort-key columns.
The column skew_sortkey1 in SVV_TABLE_INFO quantifies the effectiveness of the first sort key within the table. The returned value allows a user to determine whether the selected sort key has improved the compression/efficiency of data storage.
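To see this metric for your own tables, you can query SVV_TABLE_INFO directly, for example:

-- tables with the worst sort key skew first
SELECT "table", sortkey1, skew_sortkey1
FROM svv_table_info
ORDER BY skew_sortkey1 DESC NULLS LAST;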

Do Postgres JSONB key lengths matter for disk space?

With the Postgres JSON column it just stores the JSON, so the same JSON values but with longer key names will take up more disk space. Is this also the case for JSONB columns, or does the binary abstraction sidestep this?
It looks like there is no size advantage to JSONB columns with duplicate keys across rows.
I created two tables:
CREATE TABLE temp_a (a_column json);
CREATE TABLE temp_b (b_column jsonb);
And kept inserting {"abcdefghijklmnopqrstuvwxyz": 1} into each of them. Both tables grew in size at the same rate, from 8192 bytes to 16384 and then to 49152.
SELECT pg_table_size('temp_a'); SELECT pg_table_size('temp_b');
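For reference, the repeated inserts were along these lines (a reconstruction, since the exact statements aren't shown):

INSERT INTO temp_a (a_column) VALUES ('{"abcdefghijklmnopqrstuvwxyz": 1}');
INSERT INTO temp_b (b_column) VALUES ('{"abcdefghijklmnopqrstuvwxyz": 1}');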