Encoding in temp tables in Redshift - amazon-redshift

I am using a temp staging table, TempStaging, for doing some merges. The data in some columns of the main table, MainTable, is lzo-encoded; call one such column C1. The merge output goes back into MainTable.
In order to ensure the same dist key for TempStaging, I am creating it with CREATE TABLE. For some reasons I cannot use CREATE TABLE AS.
So should I encode column C1 as lzo in TempStaging, or leave it without encoding? Would Redshift short-circuit the cycle of [decode while selecting from MainTable, encode while writing into TempStaging, decode while selecting from TempStaging for the merge, encode again while writing back into MainTable]?
Because I am thinking that if that short-circuiting is not happening, I am better off leaving out the encoding, trading away some memory for CPU gains.
-Amit

Data in Redshift is always decoded when it's read from the table AFAIK. There are a few DBs that can operate directly on compressed data but Redshift does not.
There is no absolute rule on whether you should use encoding in a temp table. It depends on how much data is being written. I've found it's faster with encoding 90+% of the time so that's my default approach.
As you note, ensuring that the temp table uses the same dist key is the No. 1 priority. You can specify the dist key (and column encoding) in a CREATE TABLE AS though:
CREATE TABLE my_new_table
DISTKEY(my_dist_key_col)
AS
SELECT *
FROM my_old_table
;
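Since CREATE TABLE AS is ruled out in your case, a plain CREATE TABLE lets you spell out both the dist key and the per-column encoding explicitly. A minimal sketch, with illustrative column names and types (match them to MainTable's actual DDL):
CREATE TEMP TABLE temp_staging (
    dist_key_col BIGINT,
    c1           VARCHAR(1000) ENCODE lzo
)
DISTKEY (dist_key_col)
SORTKEY (dist_key_col);  -- optional, only if MainTable has a matching sort key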

Related

oid and bytea are creating system tables

oid -> creates a pg_largeobject table and stores the data in there
bytea -> if the compressed data would still exceed 2000 bytes, PostgreSQL splits variable-length data types into chunks and stores them out of line in a special "TOAST table", according to https://www.cybertec-postgresql.com/en/binary-data-performance-in-postgresql/
I don't want any other table for my large data; I want to store it in a column of my own table. Is that possible?
It is best to avoid Large Objects.
With bytea you can prevent PostgreSQL from storing data out of line in a TOAST table by changing the column definition like this:
ALTER TABLE tab ALTER col SET STORAGE MAIN;
Then PostgreSQL will compress that column but keep it in the main table.
Since the block size in PostgreSQL is 8kB, and one row is always stored in a single block, that will limit the size of your table rows to somewhat under 8kB (there is a block header and other overhead).
I think that you are trying to solve a non-problem, and your request to not store large data out of line is unreasonable.
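If you want to verify that the change took effect, the column's storage strategy is visible in pg_attribute (using the illustrative names "tab" and "col" from the ALTER above):
-- 'm' = MAIN (compressed but kept in the main table), 'x' = EXTENDED (the default for bytea)
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'tab'::regclass
  AND attname = 'col';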

What is the best way to backfill a partition table using data from a non partitioned table? (postgres 12)

I'm converting a non-partitioned table to a partitioned table in Postgres 12. Assuming I have set up the new partitioned table and have created appropriate triggers for automatic creation of partitions, what is the best way to backfill the currently empty partitioned table?
Is a naive
insert into partitioned_table(a,d,b,c) select a, d, b, c from non_partitioned_table;
delete from non_partitioned_table;
appropriate? We have ~250M rows in the table, so I am a bit concerned about doubling the storage requirements.
Or perhaps
WITH moved_rows AS (
    DELETE FROM non_partitioned_table
    RETURNING *
)
INSERT INTO partitioned_table
SELECT * FROM moved_rows;
or
COPY non_partitioned_table TO '/tmp/non_partitioned_table.csv' DELIMITER ',';
COPY partitioned_table FROM '/tmp/non_partitioned_table.csv' DELIMITER ',';
Either way seems like it could take a while to transfer the data. I'm also concerned that it will kill INSERT performance while data is being migrated, so I assume we'd need to block inserts until it's done. Is there any way to estimate how long it will take to copy the data over?
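For what it's worth, one hedged variant of the DELETE ... RETURNING approach is to move the rows in batches, which keeps each transaction (and the time concurrent INSERTs are affected) short. This sketch assumes an integer primary key named id and identical column order in both tables (both assumptions are illustrative, not taken from the question):
WITH batch AS (
    DELETE FROM non_partitioned_table
    WHERE id IN (
        SELECT id
        FROM non_partitioned_table
        ORDER BY id
        LIMIT 10000
    )
    RETURNING *
)
INSERT INTO partitioned_table
SELECT * FROM batch;
-- repeat from a script or DO block until no rows are moved;
-- each batch is a short transaction, so concurrent INSERTs are only blocked briefly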

Checksum of UUIDs in PostgreSQL

Is there a way to calculate a checksum of UUIDs in PostgreSQL?
I have a table in PostgreSQL, and in another PostgreSQL database I have a similar table, with (perhaps) the same data. The key is a UUID.
What I want to do is calculate a checksum of all the UUIDs in each table, so I can compare the checksums for both tables. I could read all those keys to my client program and perform the calculations there, but I would prefer to do it on the server. Ideally with a single, simple query. Is there a way to do this?
SELECT md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) FROM my_table
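A small extension of the same idea: return the row count alongside the digest, which makes a mismatch easier to diagnose, and run the identical statement on both databases to compare the result rows:
SELECT count(*) AS row_count,
       md5(string_agg(my_uuid_column::text, '' ORDER BY my_uuid_column)) AS uuid_checksum
FROM my_table;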

What are good ways to upload large amount of spatial data in Postgis?

I have a large amount of spatial data that I need to analyze and put to use in an application. The original data is in WKT format and I'm wrapping it in INSERT SQL statements to upload it:
INSERT INTO sp_table (ID_Info, "shape") VALUES ('California', ST_GeomFromText('POLYGON((49153 4168, 49154 4168, 49155 4168, 49155 4167, 49153 4168))'));
However, this approach is taking too much time and the data set is large (10 million rows).
So, is there any other way to upload a large amount of spatial data?
Any speedup hacks and tricks are appreciated.
Insert your text file into a table (with proper columns) using COPY
Add a SERIAL PRIMARY KEY to this table if it doesn't have one
VACUUM
Spawn one process per CPU which does this:
INSERT INTO sp_table (ID_Info, "shape")
SELECT state_name, ST_GeomFromText(geom_as_text)
FROM temp_table
WHERE id % number_of_cpus = x;
Use a different value of "x" for each process so the entire table is processed. This lets each core work on the slow ST_GeomFromText function in parallel.
Create GIST index after insertion.
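A minimal sketch of the COPY/staging steps and the final index, assuming a tab-separated input file with the state name and the WKT geometry (the file path and column names are illustrative):
CREATE TABLE temp_table (
    state_name   text,
    geom_as_text text
);

COPY temp_table (state_name, geom_as_text)
FROM '/tmp/spatial_data.tsv';  -- text format, tab-delimited by default

-- the id used for the "id % number_of_cpus" split between worker processes
ALTER TABLE temp_table ADD COLUMN id serial PRIMARY KEY;

-- after all worker INSERT ... SELECT statements have finished:
CREATE INDEX sp_table_shape_gist ON sp_table USING gist ("shape");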
Here you can find some general performance tips. You probably have the fsync setting enabled, so every INSERT command is forced to be physically written to the hard disk; that's why it takes so much time.
It's not recommended to turn off fsync (especially in production environments), because fsync is what allows you to safely recover data after an unexpected OS crash. According to the docs:
Thus it is only advisable to turn off fsync if you can easily recreate your entire database from external data.

PostgreSQL: Loading data into Star Schema efficiently

Imagine a table with the following structure on PostgreSQL 9.0:
create table raw_fact_table (text varchar(1000));
For the sake of simplicity I only mention one text column; in reality it has a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (CSV) using COPY FROM.
To increase performance I want to convert to the following star schema structure:
create table dimension_table (id int, text varchar(1000));
The fact table would then be replaced with a fact table like the following:
create table fact_table (dimension_table_id int);
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
then, to fill the dimension table, I use:
insert into dimension_table (select null, text from raw_fact_table group by text);
Afterwards I need to run the following query:
select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.text = raw_fact_table.text);
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
On MySQL I could run a stored procedure during the COPY FROM. It could create a hash of each string, and all subsequent string comparisons would be done on the hash instead of the long raw string. This does not seem to be possible in PostgreSQL, so what do I do then?
Sample data would be a CSV file containing something like this (I use quotes also around integers and doubles):
"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
When you've been doing this a while, you stop imagining performance, and you start measuring it. "Premature optimization is the root of all evil."
What does "billion" mean to you? To me, in the USA, it means 1,000,000,000 (or 1e9). If that's also true for you, you're probably looking at between 1 and 7 terabytes of data.
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
How are you gonna fit 10 billion rows into a table that uses an integer for a primary key? Let's even say that half the rows are duplicates. How does that arithmetic work when you do it?
Don't imagine. Read first. Then test.
Read Data Warehousing with PostgreSQL. I suspect these presentation slides will give you some ideas.
Also read Populating a Database, and consider which suggestions to implement.
Test with a million (1e6) rows, following a "divide and conquer" process. That is, don't try to load a million at a time; write a procedure that breaks it up into smaller chunks. Run
EXPLAIN <sql statement>
You've said you estimate at least 99% duplicate rows. Broadly speaking, there are two ways to get rid of the dupes:
Inside a database, not necessarily the same platform you use for production.
Outside a database, in the filesystem, not necessarily the same filesystem you use for production.
If you still have the text files that you loaded, I'd consider first trying outside the database. This awk one-liner will output unique lines from each file. It's relatively economical, in that it makes only one pass over the data.
awk '!arr[$0]++' file_with_dupes > file_without_dupes
If you really have 99% dupes, by the end of this process you should have reduced your 1 to 7 terabytes down to about 50 gigs. And, having done that, you can also number each unique line and create a tab-delimited file before copying it into the data warehouse. That's another one-liner:
awk '{printf("%d\t%s\n", NR, $0);}' file_without_dupes > tab_delimited_file
If you have to do this under Windows, I'd use Cygwin.
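Once you have the numbered, tab-delimited file, loading it into the dimension table is a single COPY (text format is tab-delimited by default; the path is illustrative, and use \copy from psql if the file lives on the client):
COPY dimension_table (id, text)
FROM '/path/to/tab_delimited_file';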
If you have to do this in a database, I'd try to avoid using your production database or your production server. But maybe I'm being too cautious. Moving several terabytes around is an expensive thing to do.
But I'd test
SELECT DISTINCT ...
before using GROUP BY. I might be able to do some tests on a large data set for you, but probably not this week. (I don't usually work with terabyte-sized files. It's kind of interesting. If you can wait.)
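For that test, the two in-database forms to put through EXPLAIN (or EXPLAIN ANALYZE on a sampled subset) would be:
EXPLAIN SELECT DISTINCT text FROM raw_fact_table;
EXPLAIN SELECT text FROM raw_fact_table GROUP BY text;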
Just two questions:
- Is it necessary to convert your data in 1 or 2 steps?
- May we modify the table while converting?
Running simpler queries may improve your performance (and reduce the server load while doing it).
One approach would be:
generate dimension_table (if I understand it correctly, you don't have performance problems with this) (maybe with an additional temporary boolean field...)
repeat: choose one previously unprocessed entry from dimension_table, select every row from raw_fact_table containing it and insert them into fact_table. Mark the dimension_table record as done, and move on to the next... You can write this as a stored procedure, and it can convert your data in the background, eating minimal resources...
Or another approach (probably better):
create fact_table with EVERY record from raw_fact_table AND a dimension_id column (so including both dimension_text and dimension_id columns)
create dimension_table
create a BEFORE INSERT trigger for fact_table which:
looks the new row's dimension_text up in dimension_table
if not found, creates a new record in dimension_table
sets the row's dimension_id to that id
in a simple loop, insert every record from raw_fact_table into fact_table (a sketch of such a trigger follows below)
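A hedged sketch of that trigger, with illustrative names, assuming dimension_table.id is a serial column so new ids are generated automatically. A BEFORE INSERT trigger is used so the dimension_id can be set on the row before it is stored (an AFTER trigger cannot change the inserted row):
CREATE OR REPLACE FUNCTION set_dimension_id() RETURNS trigger AS $$
DECLARE
    dim_id int;
BEGIN
    -- look the incoming text up in the dimension table
    SELECT id INTO dim_id
    FROM dimension_table
    WHERE text = NEW.dimension_text;

    -- if it is not there yet, create the dimension record
    IF dim_id IS NULL THEN
        INSERT INTO dimension_table (text)
        VALUES (NEW.dimension_text)
        RETURNING id INTO dim_id;
    END IF;

    NEW.dimension_id := dim_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_table_set_dimension_id
BEFORE INSERT ON fact_table
FOR EACH ROW EXECUTE PROCEDURE set_dimension_id();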
You are omitting some details there at the end, but I don't see that there necessarily is a problem. It is not in evidence that all strings are actually compared to all other strings. If you do a join, PostgreSQL could very well pick a smarter join algorithm, such as a hash join, which might give you the same hashing that you are implementing yourself in your MySQL solution. (Again, your details are hazy on that.)
-- add a unique index (UNIQUE indexes must be btree; hash indexes don't support UNIQUE)
CREATE UNIQUE INDEX uidx ON dimension_table USING btree (text);
-- for non-case-sensitive matching, index upper(text) instead
Try a plain hash (text) index versus btree (text) to see which one is faster for lookups.
I can see several ways of solving your problem.
There is an md5 function in PostgreSQL:
md5(string) calculates the MD5 hash of string, returning the result in hexadecimal.
insert into dimension_table (select null, md5(text), text from raw_fact_table group by text)
Add an md5 field to raw_fact_table as well:
select id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.md5 = raw_fact_table.md5);
Indexes on the MD5 field might help as well.
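A minimal sketch of those indexes, assuming the added column is simply named md5 on both tables:
CREATE INDEX dimension_table_md5_idx ON dimension_table (md5);
CREATE INDEX raw_fact_table_md5_idx ON raw_fact_table (md5);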
Or you can calculate the MD5 on the fly while loading the data.
For example, our ETL tool Advanced ETL Processor can do it for you.
Plus it can load data into multiple tables at the same time.
There are a number of online tutorials available on our web site.
For example, this one demonstrates loading a slowly changing dimension:
http://www.dbsoftlab.com/online-tutorials/advanced-etl-processor/advanced-etl-processor-working-with-slow-changing-dimension-part-2.html