Do Redshift column encodings affect query execution speed?

When creating tables in Amazon Redshift, you can specify various encodings such as MOSTLY32, BYTEDICT, or LZO. These determine the compression used when storing the columnar values on disk.
I am wondering if my choice of encoding is supposed to make a difference in query execution times. For example, if I make a column BYTEDICT would that make a difference over LZO when it comes to SELECTs, GROUP BYs or FILTERs?

Yes. The compression encoding determines the amount of disk storage used, and generally, the lower the storage, the better the query performance, because less data has to be read from disk.
However, which encoding is most beneficial to you depends on your data type and its distribution. There is no guarantee that LZO will always be better than BYTEDICT, or vice versa. In my experience, I usually load some sample data into the intended table, then run ANALYZE COMPRESSION and go with whatever Redshift suggests. That has worked for me.
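A minimal sketch of that workflow (the table name, S3 path, and IAM role are placeholders):

    -- Load a representative sample, then ask Redshift to recommend encodings.
    COPY events
    FROM 's3://my-bucket/sample/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV;

    -- Returns one row per column with the suggested encoding and the
    -- estimated storage reduction versus the current encoding.
    ANALYZE COMPRESSION events;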

Amazon has actually released a Python script that can apply this automatically to your database. You can find the script here: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/ColumnEncodingUtility/analyze-schema-compression.py

Bit late but likely useful to anyone taking a look here:
Amazon can now decide on the best compression to use ("Loading Tables with Automatic Compression") if you are using a COPY command to load your table and there is no existing compression defined on the table.
You just have to add COMPUPDATE ON to your COPY command.
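A minimal sketch of such a COPY (the table name, S3 path, and IAM role are placeholders):

    COPY events
    FROM 's3://my-bucket/data/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
    COMPUPDATE ON;  -- let COPY sample the data and set column encodings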

Related

How does Amazon Redshift store nulls

In Amazon Redshift, how are nulls stored? Will they take up physical space?
I'm looking to see how best to design a table. The data behind it may mean it will have many columns that are sparsely populated, so I would like to know whether this has a negative impact (even after compression) or whether nulls don't actually take up any space at all (as in MySQL v5.0.3 or later, for example).
Thank you
Yes, columns with lots of nulls will compress excellently, and consequently give great performance on Amazon Redshift.
Amazon Redshift is a columnar database engine. Columnar databases are heavily optimized for data with repeating values, and nulls are exactly that.
So if you have a table where some of the columns have lots of nulls, it will more than likely compress extremely well, providing savings in storage as well as processing speed.
In order to achieve the proper compression you have two options (a sketch of the first follows below):
DDL design - manually choose your encoding settings.
Automatic - have the COPY command automatically choose the optimal encoding settings for your database.
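For the DDL option, a minimal sketch with explicit encodings (the table, columns, and chosen encodings are illustrative; run ANALYZE COMPRESSION on your own data rather than copying these):

    CREATE TABLE sparse_events (
        event_id   BIGINT       ENCODE mostly32,
        category   VARCHAR(32)  ENCODE bytedict,  -- few distinct values
        note       VARCHAR(256) ENCODE lzo,       -- sparsely populated; nulls compress well
        created_at TIMESTAMP    ENCODE runlength
    );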

Redshift COPY automatic compression

I am unclear on how the automatic compression works when using the COPY command with Redshift.
The documentation says:
By default, the COPY command applies automatic compression whenever you run the COPY command with an empty target table and all of the table columns either have RAW encoding or no encoding.
Does this mean that for my main table, where the raw data is copied on an ongoing basis, the data will be compressed only the first time a COPY occurs to this table and never again for subsequent ones? It seems like I misunderstand something, because it doesn't make sense that it would work this way.
Thanks
Basically, an encoding (compression) type needs to be set for each column when creating a table. However, there is an exception: as you quoted from the AWS docs, when data is copied into an empty table, Redshift automatically analyzes the copied data and sets the best encoding on all columns. Subsequent data is then compressed with that encoding.
Therefore, the answer to your question is "No". Once the encoding (compression) is set through either route, subsequent items will be compressed with it.
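You can check which encodings that first COPY applied by querying the PG_TABLE_DEF system view (the table name is illustrative):

    -- Note: pg_table_def only shows tables in your search_path.
    SELECT "column", type, encoding
    FROM pg_table_def
    WHERE tablename = 'events';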
I confirm Masashi's answer. Note however that:
Automatic compression analysis requires enough rows in the load data (at least 100,000 rows per slice) to allow sampling to take place.
If you run COPY on a small batch, your table will be left with no encoding (RAW), and all the subsequent COPY calls won't change that. You can fix that later by running a deep copy of your table, as sketched below.
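A deep-copy sketch: create a new table with the encodings you want (for example, those suggested by ANALYZE COMPRESSION), reload, and swap. Names and encodings are illustrative:

    CREATE TABLE events_new (
        event_id BIGINT        ENCODE mostly32,
        payload  VARCHAR(1024) ENCODE lzo
    );
    INSERT INTO events_new SELECT * FROM events;  -- the deep copy re-applies compression
    DROP TABLE events;
    ALTER TABLE events_new RENAME TO events;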

How to verify that large PostgreSQL databases running different versions have the same data, without dumping

How would I verify that the data in a PostgreSQL 8.3 database is the same as the data in a 9.0 database?
When I did a SQL dump of an example table, there were many differences, but these were due to 9.0 truncating zeros at the beginning and end of date fields. The order of the dump was also not fixed; although this can be handled with sort (no pun intended), sorting does not allow validation, because it would lose track of which table each row belongs to: the sorted SQL dump would be a meaningless jumble of SQL commands with dump settings thrown in for good measure.
count(*) is also not adequate.
I would like to be 100% sure that the data in one is equal to the data in the other, despite the version differences and the way that, at the very least, dates are held in 9.0.
I should add that I have several hundred tables and many hundreds of GB of data, so I need an automated process like diff DUMPa.sql DUMP2.sql. A SHA of the data (not the format) would be ideal, but one cannot diff binary dumps of PostgreSQL for well-known reasons. I am aware MySQL has a checksum feature, but I'm using PostgreSQL.
First the bad news: there is really no way to address the full set of concerns you raise without loading all the data into an intermediary program and comparing it directly. This will take time and it will drag your system down load-wise, so my recommendation is to set up some sort of replication and compare the replicas.
One thing you might be able to do is use something like Slony or Bucardo to replicate, then use triggers to move data into secondary child partitions and replicate those onto a consolidated server for comparison. You could then compare within PostgreSQL. This would reduce the load, and it would mean your reporting data would be relatively easy to manage compared to other approaches. But all the data is going to have to be loaded and compared row by row.
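If an approximate in-database check is acceptable, here is a hedged sketch of an order-independent per-table checksum you could run on both servers and compare. It hashes each row's text form, so formatting differences between versions (such as the date handling mentioned above) can still cause mismatches, and the table name is illustrative:

    -- Sum of per-row hashes; independent of row order.
    -- Assumes the server supports casting a row to text (t::text).
    SELECT sum(('x' || substr(md5(t::text), 1, 8))::bit(32)::int) AS checksum
    FROM my_table t;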

Storing large numbers of images in a database? A good experience?

I'm writing an app which will store a large number of image (and possibly video) files. After they're uploaded they will be immediately pushed out to some cloud serving CDN for actual serving to the public. The idea is to have the images stored in a reliable, back-uppable store. I would anticipate of the order of 200,000 objects of up to 10KB each and possibly fewer video files of a few MB.
By default I would go with Postgres, which the documentation suggests would be OK.
Is this a sensible idea?
Will it make backing up the database a complete nightmare? Experiences?
Any reliability issues?
Will this affect the performance of other parts of the db? Bear in mind that the db will only be hit once or twice for each image.
I've got experience with storing images in a database this way, in Oracle and MySQL. Performance and reliability are not an issue. Backing up is: your backup will get very large. Since backing up is time-consuming and expensive, it might be a good idea to save space. If removing the images from the database means you can shrink it by 80%, it might be a good idea to store them elsewhere. Backing up separate files is more efficient, because you can easily create incremental backups containing only new and modified images.
I have experience with PostgreSQL, both storing images as BYTEA (a BLOB-like datatype), which was a good experience, and storing images in a "dual solution" (images on the filesystem, metadata in databases like MySQL and PostgreSQL), which I do not recommend.
There are three aspects, or architecture considerations, that can help with the decision:
Unified solution or not? Today, as image volumes (sizes and numbers of images) grow and grow in all applications, "unified solutions" are the goal. Example: Wikimedia is a unified and specialized solution for Wikipedia.
Direct or indirect storage? Unlike the old "dual solutions", which do not store the image in the SQL table, some solutions use an external database or an external data pointer. In PostgreSQL, the BLOB (large object) datatypes use indirect storage (they generate a separate backup), while the BYTEA datatype is direct (backed up with the tables). The choice needs technical and performance consideration.
Original or processed images? We need to distinguish between the "original image" and "processed images" such as thumbnails, which need database storage (for caching!) but do not need backup.
I recommend (a sketch of the cache table follows below):
storing as a BLOB (Binary Large OBject, with indirect storage) in your table: for original image storage, with a separate backup. See Ivan's answer, PostgreSQL additional supplied modules, how-tos, etc.
storing as BYTEA (or BLOB) in a separate database (with dblink): for original image storage in another (unified) database. In this case I prefer BYTEA, but BLOB is nearly the same. Separating the database is the best way for a "unified image webservice".
storing as BYTEA (a byte array, with direct storage) in your table: for caching processed images (typically thumbnails). Cache the small images so they can be sent quickly to the web browser (avoiding rendering problems) and to reduce server processing. Cache the essential metadata too, like width and height. Database caching is the easiest way, but check your needs and server configuration (e.g. Apache modules): storing thumbnails on the filesystem may be better, so compare performance. Remember that this is a (unified) web service, so it can be stored in a separate database without backups, serving many tables. See also the PostgreSQL binary data types manual, tests with a BYTEA column, etc.
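A minimal sketch of such a BYTEA cache table (names are illustrative):

    CREATE TABLE thumbnail_cache (
        image_id BIGINT  PRIMARY KEY,
        width    INTEGER NOT NULL,  -- essential metadata, cached with the image
        height   INTEGER NOT NULL,
        thumb    BYTEA   NOT NULL   -- processed image; rebuildable, so no backup needed
    );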
My experience is limited to SQL Server, but I have several million PDF files larger than 10 KB in a database, which is still performing quite nicely. Of course, indexes are required. A full database backup takes no longer than expected with such an amount of data. Again, this is for MS SQL Server!

Does PostgreSQL support transparent compressing of tables (fragments)?

I'm going to store a large amount of data (logs) in fragmented PostgreSQL tables (one table per day). I would like to compress some of them to save space on my disks, but I don't want to lose the ability to query them in the usual manner.
Does PostgreSQL support such a transparent compression and where can I read about it in more detail? I think there should be some well-known magic name for such a feature.
Yes, PostgreSQL will do this automatically for you when values go above a certain size (the mechanism is called TOAST). Compression is applied to each individual data value, though, not at the full-table level. That means that if you have a billion rows that are very narrow, they won't get compressed; likewise, if you have very many columns, each with only a small value in it, they won't get compressed. Details about this scheme are in the manual.
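A hedged sketch of how you can influence that per-value behaviour via the column storage mode (table and column names are illustrative):

    -- EXTENDED (the default for text) allows compression and out-of-line storage;
    -- MAIN prefers compression but keeps the value in the main table.
    ALTER TABLE logs ALTER COLUMN message SET STORAGE EXTENDED;
    ALTER TABLE logs ALTER COLUMN message SET STORAGE MAIN;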
If you need it at the full-table level, a solution is to create a TABLESPACE for the tables you want compressed and point it at a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Details about this are in the manual.
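For example, a sketch under the assumption that a compressed filesystem is mounted at /mnt/compressed_fs (the path and table names are illustrative):

    CREATE TABLESPACE compressed_logs LOCATION '/mnt/compressed_fs/pgdata';
    CREATE TABLE logs_2013_01_15 (LIKE logs) TABLESPACE compressed_logs;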
Probably not what you have in mind, but still useful info: Chapter 53, "Database Physical Storage", of the fine manual. The TOAST section warrants further attention.