Hash or masking in Presto

How do I create a column with hashed or masked data in Presto? I'm trying to transform potentially PII data before exporting it for analysis.
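A minimal sketch of one approach, using Presto's built-in sha256, md5, to_utf8, to_hex, and regexp_replace functions; the users table and email column are hypothetical stand-ins for the PII data:

SELECT
  to_hex(sha256(to_utf8(email))) AS email_sha256,   -- one-way hash, still joinable across exports
  to_hex(md5(to_utf8(email)))    AS email_md5,      -- shorter hash, weaker guarantees
  regexp_replace(email, '.', '*') AS email_masked   -- crude masking: every character becomes '*'
FROM users;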

Related

How to store a column value in Redshift varchar column with length more than 65535

I tried to load the Redshift table but it failed on one column: "The length of the data column 'column_description' is longer than the length defined in the table. Table: 65535, Data: 86555."
I tried to increase the length of the column in the Redshift table, but it looks like 65535 is the maximum length Redshift supports.
Do we have any alternatives to store the value in Redshift?
The answer is that Redshift doesn't support anything larger and that one shouldn't store large artifacts in an analytic database. If you are using Redshift for its analytic power to find specific artifacts (images, files, etc.), then these should be stored in S3 and the object key (pointer) should be stored in Redshift.
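A hedged sketch of that pattern (all names here are made up): keep only the S3 object key in Redshift and leave the large artifact itself in S3.

CREATE TABLE artifacts (
  artifact_id BIGINT,
  s3_key      VARCHAR(1024),   -- pointer to the object, e.g. 'artifacts/123.bin' in some bucket (hypothetical)
  summary     VARCHAR(65535)   -- stays within Redshift's VARCHAR limit
);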

oid and bytea are creating system tables

oid -> creates a table pg_largeobject and stores the data in there
bytea -> if the compressed data would still exceed 2000 bytes, PostgreSQL splits variable length data types into chunks and stores them out of line in a special “TOAST table”, according to https://www.cybertec-postgresql.com/en/binary-data-performance-in-postgresql/
I don't want any other table for large data; I want to store it in a column of my own table. Is that possible?
It is best to avoid Large Objects.
With bytea you can prevent PostgreSQL from storing data out of line in a TOAST table by changing the column definition like
ALTER TABLE tab ALTER col SET STORAGE MAIN;
Then PostgreSQL will compress that column but keep it in the main table.
Since the block size in PostgreSQL is 8kB, and one row is always stored in a single block, that will limit the size of your table rows to somewhat under 8kB (there is a block header and other overhead).
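To confirm that the setting took effect, one option (a sketch reusing the tab and col names from the ALTER above) is to check pg_attribute, where attstorage = 'm' means MAIN:

SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'tab'::regclass
  AND attname = 'col';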
I think that you are trying to solve a non-problem, and your request to not store large data out of line is unreasonable.

Checksum of UUIDs in PostgreSQL

Is there a way to calculate a checksum of UUIDs in PostgreSQL?
I have a table in PostgreSQL, and in another PostgreSQL database I have a similar table, with (perhaps) the same data. The key is a UUID.
What I want to do is calculate a checksum of all the UUIDs in each table, so I can compare the checksums for both tables. I could read all those keys to my client program and perform the calculations there, but I would prefer to do it on the server. Ideally with a single, simple query. Is there a way to do this?
SELECT md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) FROM my_table
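Run that same statement against both databases and compare the resulting strings. A variant that also returns the row count (same hypothetical table and column names) helps narrow down the difference when the checksums don't match:

SELECT count(*) AS row_count,
       md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) AS uuid_checksum
FROM my_table;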

Encoding in temp tables in Redshift

I am using a temp staging table, TempStaging, for doing some merges. The data in some columns of the main table, MainTable, is encoded in lzo; call one such column C1. The merge output goes back into MainTable.
In order to ensure the same dist key for TempStaging, I am creating it with a plain CREATE TABLE. For some reasons I cannot use CREATE TABLE AS.
So should I encode column C1 as lzo, or leave it without encoding? Would Redshift short-circuit the chain of decoding while selecting from MainTable, encoding while writing into TempStaging, decoding while selecting from TempStaging for the merge, and encoding again while writing back into MainTable?
Because I am thinking that if that short-circuiting is not happening, I am better off leaving off the encoding, trading away some storage for CPU gains.
-Amit
Data in Redshift is always decoded when it's read from the table AFAIK. There are a few DBs that can operate directly on compressed data but Redshift does not.
There is no absolute rule on whether you should use encoding in a temp table. It depends on how much data is being written. I've found it's faster with encoding 90+% of the time so that's my default approach.
As you note, ensuring that the temp table uses the same dist key is the No. 1 priority. You can specify the dist key in a CREATE TABLE AS though:
CREATE TABLE my_new_table
DISTKEY(my_dist_key_col)
AS
SELECT *
FROM my_old_table
;
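If CREATE TABLE AS really is off the table, as in your case, the same dist key plus an explicit lzo encoding can be spelled out in a plain CREATE TABLE. A sketch where only the TempStaging and C1 names come from the question and the rest (types, dist key column) is made up:

CREATE TEMP TABLE TempStaging (
  my_dist_key_col BIGINT,
  C1              VARCHAR(256) ENCODE lzo   -- match MainTable's encoding for C1
)
DISTKEY (my_dist_key_col);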

Redshift COPY csv array field to separate rows

I have a relatively large MongoDB collection that I'm migrating into Redshift. It's ~600 million documents, so I want the copy to be as efficient as possible.
The problem is, I have an array field in my Mongo collection, but I'd like to insert each value from the array into separate rows in Redshift.
Mongo:
{
id: 123,
names: ["market", "fashion", "food"]
}
In Redshift, I want columns for "id" and "name", where the primary key is (id, name). So I should get 3 new Redshift rows from that one Mongo document.
Is it possible to do that with a Redshift COPY command? I can export my data as either a csv or json into s3, but I don't want to have to do any additional processing on the data due to how long it takes to do that many documents.
You can probably do it on COPY with triggers, but it'd be quite awkward and the performance would be miserable (since you can't just transform the row and would need to do INSERTs from the trigger function).
It's a trivial transform, though, why not just pass it through any scripting language on export?
You can also import as-is, and transform afterwards (should be pretty fast on Redshift):
CREATE TABLE mydata_load (
id int4,
names text[]
);
-- run the COPY into mydata_load here, then transform:
CREATE TABLE mydata AS SELECT id, unnest(names) as name FROM mydata_load;
Redshift does not have support for Arrays as PostgreSQL does, so you cannot just insert the data as is.
However, MongoDB has a simple aggregation stage that lets you unwind arrays exactly as you want, keeping the other fields alongside each array element. So I'd export the result of that as JSON, and then load it into Redshift using JSONPaths.
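A rough sketch of that pipeline, assuming the unwound documents are exported to S3 as JSON lines and loaded into a target table with id and name columns (like the mydata table above); the collection, bucket, file, and IAM role names are all placeholders:

-- MongoDB side: db.mycollection.aggregate([ { "$unwind": "$names" } ])
-- jsonpaths file at s3://my-bucket/names_jsonpaths.json: { "jsonpaths": ["$.id", "$.names"] }
COPY mydata (id, name)
FROM 's3://my-bucket/unwound/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS JSON 's3://my-bucket/names_jsonpaths.json';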