Checksum of UUIDs in PostgreSQL

Is there a way to calculate a checksum of UUIDs in PostgreSQL?
I have a table in PostgreSQL, and in another PostgreSQL database I have a similar table, with (perhaps) the same data. The key is a UUID.
What I want to do is calculate a checksum of all the UUIDs in each table, so I can compare the checksums for both tables. I could read all those keys to my client program and perform the calculations there, but I would prefer to do it on the server. Ideally with a single, simple query. Is there a way to do this?

SELECT md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) FROM my_table
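The ORDER BY inside string_agg is what makes the checksum deterministic; without it the aggregation order, and therefore the hash, can vary between runs. A slightly extended sketch using the same (hypothetical) table and column names, which also returns the row count as a cheap extra sanity check:

SELECT count(*) AS row_count,
       md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) AS uuid_checksum
FROM my_table;

Run the same query against both databases and compare the two result rows.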

Related

Data pseudonymization in PostgreSQL

I am trying to get PII data from a PostgreSQL table, but I can't display the raw data.
How do I pseudonymize the data while fetching it (SELECT) from the PostgreSQL database?
You can always create (pseudo)anonymized views over the tables and select from those. As for anonymization techniques, that depends on the data; regex replaces and md5 are very easy to use in Postgres. A sketch of such a view follows.
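A minimal sketch, assuming a hypothetical customers table with customer_id, email, and created_at columns: the view hashes the identifier and masks the local part of the email address, and reporting users are granted access to the view only.

CREATE VIEW customers_pseudo AS
SELECT md5(customer_id::text)                 AS customer_key,   -- stable pseudonym for the id
       regexp_replace(email, '^[^@]+', '***') AS email_masked,   -- hide the local part of the address
       created_at
FROM customers;

-- Expose only the view, not the underlying table.
GRANT SELECT ON customers_pseudo TO reporting_role;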

Best performance method for getting records by large collection of IDs

I am writing a query to select all records from a table where a column value is contained in a CSV list. I found a suggestion that the best way to do this is using ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from csv.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am using the method with the best performance.
Is this the right way in PostgreSQL to retrieve a large collection of records by a predefined column value?
Thanks
Generally this is done with the SQL-standard IN operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this form was often faster:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
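If your driver supports binding array parameters, one common way to keep the SQL text small is to send the whole list as a single array value (a sketch, not part of the original answer; the $1 placeholder syntax is driver-dependent):

SELECT *
FROM price_mapping
WHERE customer_id = ANY ($1::bigint[]);  -- $1 is bound to the full list of IDs as a bigint array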
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
If your data is big enough, try this:
Read your CSV using a FDW (foreign data wrapper)
If you need this connection often, you might build a materialized view from it, holding only needed columns. Refresh this when new CSV is created.
Join your table against this foreign table or materialized viev.
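A rough sketch of that approach using the built-in file_fdw, assuming the CSV file lives on the database server; the server name, file path, and table names are hypothetical.

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE customer_ids_csv (customer_id bigint)
    SERVER csv_files
    OPTIONS (filename '/var/lib/postgresql/import/customer_ids.csv', format 'csv');

-- Optional: materialize only the needed column; refresh it whenever a new CSV arrives.
CREATE MATERIALIZED VIEW customer_ids AS
SELECT DISTINCT customer_id FROM customer_ids_csv;

SELECT p.*
FROM price_mapping p
JOIN customer_ids c USING (customer_id);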

Encoding a Postgres UUID in Amazon Redshift

We have a couple of entities which are being persisted into Amazon Redshift for reporting purposes, and these entities have a relationship between them. The source tables in Postgres are related via a foreign key with a UUID datatype, which is not supported in Redshift.
One option is to encode the UUID as a 128-bit signed integer. The Redshift documentation refers to the ability to create NUMBER(38,0) columns, and to the ability to create 128-bit numbers.
But 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 which is 39 digits. (thanks Wikipedia). So despite what the docs say, you cannot store the full 128 bits / 39 digits of precision in Redshift. How do you actually make a full 128 bit number column in Redshift?
In short, the real question behind this is - what is Redshift best practice for storing & joining tables which have UUID primary keys?
Redshift joins will perform well even with a VARCHAR key, so that's where I would start.
The main factor for join performance will be co-locating the rows onto the same compute node. To achieve this you should declare the UUID column as the distribution key on both tables.
Alternatively, if one of the tables is fairly small (<= ~1 million rows), then you can declare that table as DISTSTYLE ALL and choose some other dist key for the larger table.
If you have co-located the join and wish to optimize further then you could try splitting the UUID value into 2 BIGINT columns, one for the top 64 bits and another for the bottom 64. Even half of the UUID is likely to be unique and then you can use the second column as a "tie breaker".
c.f. "Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization"

Encoding in temp tables in Redshift

I am using a temp staging table, TempStaging, for doing some merges. In the main table, MainTable, the data in some columns, say C1, is encoded with LZO. The merge output goes back into MainTable.
In order to ensure the same dist key for TempStaging, I am creating it with CREATE TABLE. For various reasons I cannot use CREATE TABLE AS.
So should I encode column C1 with LZO, or leave it without encoding? Would Redshift short-circuit the cycle of decoding on select from MainTable, encoding on write into TempStaging, decoding on select from TempStaging for the merge, and encoding again on write back into MainTable?
Because I am thinking that if that short-circuiting is not happening, I am better off leaving out the encoding, trading away some memory for CPU gains.
Data in Redshift is always decoded when it's read from the table AFAIK. There are a few DBs that can operate directly on compressed data but Redshift does not.
There is no absolute rule on whether you should use encoding in a temp table. It depends on how much data is being written. I've found it's faster with encoding 90+% of the time so that's my default approach.
As you note, ensuring that the temp table uses the same dist key is the No. 1 priority. You can specify the dist key (and sort key) in a CREATE TABLE AS, though; column encodings for CTAS tables are assigned automatically:
CREATE TABLE my_new_table
DISTKEY (my_dist_key_col)  -- keep the same distribution key as the source table
AS
SELECT *
FROM my_old_table;
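Since the question rules out CTAS, here is a minimal sketch of the explicit CREATE TABLE route, with hypothetical column definitions: the staging table declares the same dist key as MainTable and keeps the LZO encoding on C1, and is then loaded with INSERT ... SELECT.

CREATE TEMP TABLE TempStaging (
    id BIGINT,
    c1 VARCHAR(256) ENCODE lzo   -- same encoding as the corresponding column in MainTable
)
DISTKEY (id);

INSERT INTO TempStaging
SELECT id, c1
FROM MainTable;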

T-SQL: SELECT INTO sparse table?

I am migrating a large quantity of mostly empty tables into SQL Server 2008.
The tables are vertical partitions of one big logical table.
Problem is this logical table has more than 1024 columns.
Given that most of the fields are null, I plan to use a sparse table.
For all of my tables so far I have been using SELECT...INTO, which has been working really well.
However, now I get the error "CREATE TABLE failed because column 'xyz' in table 'MyBigTable' exceeds the maximum of 1024 columns."
Is there any way I can do SELECT...INTO so that it creates the new table with sparse support?
What you probably want to do is create the table manually and populate it with an INSERT ... SELECT statement.
To create the table, I would recommend scripting the different component tables and merging their definitions, making them all SPARSE as necessary. Then just run your single CREATE TABLE statement.
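A minimal T-SQL sketch of that approach, with hypothetical table and column names. Note that a table can only exceed 1024 total columns if it is a wide table, which requires a column set over the sparse columns.

CREATE TABLE dbo.MyBigTable
(
    Id    INT NOT NULL PRIMARY KEY,
    Col1  INT           SPARSE NULL,
    Col2  NVARCHAR(100) SPARSE NULL,
    -- ... remaining mostly-empty columns, each declared SPARSE NULL
    AllSparseCols XML COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- required to go beyond the 1024-column limit (wide table)
);

INSERT INTO dbo.MyBigTable (Id, Col1, Col2)
SELECT Id, Col1, Col2
FROM dbo.SourceTable;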
You cannot (and probably don't want to anyway). See INTO Clause (TSQL) for the MSDN documentation.
The problem is that sparse storage is a physical storage characteristic and not a logical one, so there is no way the DBMS engine would know to copy that characteristic over. Moreover, it is a property of the table definition, and the SELECT can have multiple underlying source tables. See the Remarks section of the page I linked, which discusses how only the default organization details are used.