Redshift: How to see columns that are analyzed in a table? - amazon-redshift

Is there a way to see the "Analyzed" values in a Redshift table? In Teradata, the help stats command gives you this. I'm wondering if there is an equivalent in Redshift. Can't find anything on that.

Yes you can, with some work. First off, remember that Redshift stores data in 1MB blocks and these blocks are distributed across the cluster (slices). So the metadata that is created by ANALYZE gives information about these blocks and what they contain.
The table stv_blocklist contains this information about block contents. It contains a number of pieces of information about the blocks but the pieces that will likely be of most interest are minvalue and maxvalue (along with table id, slice, and column so you know what part of the database any block is a part of).
Minvalue and maxvalue are BIGINTs, so these values make perfect sense when the column they describe is also a BIGINT. However, if your column is text or a timestamp, you will need to do some reverse engineering to understand the hash Redshift uses to store values of these column types as BIGINTs. I've done it for just about all data types in the past and it isn't too hard to work out. If memory serves there is some endian trick with the text values, but again it's not too hard to figure out since you can see both the input values and the output values. With this decoder ring you can evaluate how effective your sort keys are at speeding up queries.
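To illustrate the kind of mapping involved (this is NOT Redshift's actual internal encoding, just a sketch of the reverse-engineering idea, including the endian aspect), here is how a fixed-width text prefix can be packed into a signed 64-bit integer while preserving lexicographic order:

```python
import struct

def text_prefix_to_bigint(s: str) -> int:
    """Pack the first 8 bytes of a string into a signed 64-bit integer.

    Illustrative only -- Redshift's real minvalue/maxvalue encoding for
    text columns would have to be inferred from observed block metadata.
    """
    b = s.encode("utf-8")[:8].ljust(8, b"\x00")  # pad short strings with NULs
    return struct.unpack(">q", b)[0]             # big-endian signed 64-bit

# Big-endian packing preserves the lexicographic order of ASCII strings,
# so min/max comparisons on the packed BIGINTs remain meaningful:
vals = ["apple", "banana", "cherry"]
packed = [text_prefix_to_bigint(v) for v in vals]
assert packed == sorted(packed)
```

A little-endian pack (`"<q"`) would scramble that ordering, which is why the byte order matters when decoding these values.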

Related

Why can't keyed table be splayed in kdb?

Keyed tables are nothing but a dictionary mapping between two tables, like:
q)kts:([] sym:`GOOG`AMZN`FB)!([] px:3?10.; size:3?100000)
q).Q.dpft[`:/path/db;.z.d;`id;`kts]
'nyi
[0] .Q.dpft[`:/path/db;.z.d;`id;`kts]
Why is there a limitation that keyed tables cannot be splayed or partitioned?
I think the simplest answer comes from both the technical and the logical.
Technical: there is no way in the on-disk format to indicate this currently. The .d file indicates the order of columns on disk but not any further metadata. This could be changed at a later date in theory.
The logical answer comes from the size of the data in question. Splayed tables are typically used when you want to hold only a few columns in memory at a time. A decade ago this meant that splayed tables were useful for holding up to 100M rows, but with 3.x and modern memory that upper limit can be well north of 250M. At that scale I don't think there's a good way to make the keyed lookup (effectively a join against the key table) performant in ad-hoc calculation. The grouped attribute index required to make that work is around the same size as the column on disk and would need to be constantly re-written as data is appended.
I think the use of 'nyi in this case, to mean, "we probably need to think about this one for a bit", is appropriate.
The obvious solution is to look at explicit row relationships via linking columns, where the lookup calculation is done ahead of time.

Redshift doesn't recognize newly loaded data as pre-sorted

I am trying to load very large volumes of data into Redshift, into a single table that would be cost-prohibitive to vacuum once loaded. To avoid having to vacuum this table, I am loading data using the COPY command from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However, after loading the first two files, I find that Redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would Redshift not recognize the new incoming data as already sorted?
Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
Official Answer by Redshift Support:
Here is what we say officially:
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into two regions: sorted and unsorted.
As long as you load data in sort key order, even though the data is in the unsorted region, it is still in sort key order, so there is no need for VACUUM to ensure the data is sorted. A VACUUM is still needed to move the data from the unsorted region to the sorted region, but this is less critical because the data in the unsorted region is already sorted.
Storing sorted data in Amazon Redshift tables has an impact in several areas:
How data is displayed when ORDER BY is not used
The speed with which data is sorted when using ORDER BY (e.g. sorting is faster if the data is mostly pre-sorted)
The speed of reading data from disk, based upon Zone Maps
While you might want to choose a SORTKEY that improves sort speed (e.g. change the order of the sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.
I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.
Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.
Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.
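The block-skipping behaviour described above can be sketched as a toy model in Python (the block contents and sizes here are made up; real Redshift blocks are 1MB of columnar data):

```python
# Toy model of zone-map pruning: each "block" carries only its min/max,
# and a range predicate skips any block whose range cannot match.
blocks = [
    {"min": "2023-01-01", "max": "2023-01-03",
     "rows": ["2023-01-01", "2023-01-02", "2023-01-03"]},
    {"min": "2023-01-04", "max": "2023-01-06",
     "rows": ["2023-01-04", "2023-01-05", "2023-01-06"]},
    {"min": "2023-01-07", "max": "2023-01-09",
     "rows": ["2023-01-07", "2023-01-08", "2023-01-09"]},
]

def scan(blocks, day):
    """Return matching rows and the number of blocks actually read."""
    blocks_read, hits = 0, []
    for blk in blocks:
        if blk["min"] <= day <= blk["max"]:   # zone-map check
            blocks_read += 1
            hits.extend(r for r in blk["rows"] if r == day)
    return hits, blocks_read

hits, blocks_read = scan(blocks, "2023-01-05")
assert blocks_read == 1   # two of three blocks skipped via the zone maps
```

Note that this pruning only pays off when the data is sorted: if the dates were shuffled across blocks, every block's min/max range would overlap the predicate and nothing could be skipped.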
Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.
Queries run fastest in Redshift when they have to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as a timestamp where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY, even though they serve different purposes.
Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:
Table ID
Column ID
Block Number
Minimum Value of the field stored in this block
Maximum Value of the field stored in this block
Sorted/Unsorted flag
I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.
You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.

Which column compression type should i choose in amazon redshift?

I have a table with over 120 million rows.
The command analyze compression tbl; shows LZO encoding for almost every VARCHAR field, but I think that runlength encoding may be better for fields with a finite number of options (traffic source, category, etc.).
So should I move certain fields to another encoding, or stay with LZO?
Thoughts on runlength
The point about runlength is not so much a finite number of options as that field values repeat over many consecutive rows. This is usually the case when the table is sorted by that column. You are right, though, that the fewer distinct values you have, the more likely it is for any particular value to occur in a sequence.
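A small Python sketch makes the point concrete: run-length encoding shrinks only when equal values sit in consecutive runs, which sorting produces.

```python
from itertools import groupby

def runlength(values):
    """Encode a sequence as (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# A low-cardinality column: two distinct values across six rows.
col = ["web", "email", "web", "email", "web", "email"]

assert len(runlength(col)) == 6          # interleaved: one run per row, no gain
assert len(runlength(sorted(col))) == 2  # sorted: just two long runs
```

This is why the answer stresses sort order rather than cardinality: the same column compresses dramatically better once the table is sorted by it.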
Documentation
Redshift states in their documentation:
We do not recommend applying runlength encoding on any column that is designated as a sort key. Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns are compressed much more highly than other columns in the same query, range-restricted scans might perform poorly.
And also:
LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free form text, such as product descriptions, user comments, or JSON strings.
Benchmark
So, ultimately, you'll have to take a close look at your data, the way it is sorted, the way you are going to join other tables on it and, if in doubt, benchmark the encodings. Create the same table twice and apply runlength encoding to the column in one table, and lzo in the other. Ideally, you already have a query that you know will be used often. Run it several times for each table and compare the results.
My recommendation
Do your queries perform ok? Then don't worry about encoding and take Redshift's suggestion. If you want to take it as a learning project, then make sure that you are aware of how performance improves or degrades when you double (quadruple, ...) the rows in the table. 120 million rows are not many and it might well be that one encoding looks great now but will cause queries to perform poorly when a certain threshold is passed.

historic data storage and retrieval

I am using a standard splayed format for my trade data, where I have directories for each date and each column as a separate file in there.
I am reading from CSV files and storing using the code below. I am using the 32-bit trial version on 64-bit Windows 7.
readDat: {[x]
  tmp: ("SPJFJ"; enlist ",") 0: x;        / load CSV; the type string here is a placeholder for the real schema
  tmp: `sym`time`trdId xasc tmp;          / sort by sym, time, trade id
  / tmp: update `g#sym from tmp;          / optionally apply the grouped attribute
  trade:: tmp;
  .Q.dpft[`:/kdb/ndb; dt; `sym; `trade];  / write the partition for date dt (dt assumed defined elsewhere)
  .Q.gc[];                                / reclaim memory
  };
\t readDat each 50#dtlist
I have tried it both with and without `g#sym. The data typically has 1.5MM rows per date; a select for a single day takes 0.5 to 1 second.
Is there a way to improve times for either of the below queries.
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning etc. but not sure if anything would help here.
On second thoughts, will creating a table for each sym speed things up? I am trying that out, but wanted to know if there are memory/space tradeoffs I should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem has to do with disk read speed (using something like iostat) you can either get a faster disk (SSD), more memory (for bigger disk cache), or use par.txt to shard your database across multiple disks such that the query happens on multiple disks and cores in parallel.
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvements. You could possibly segment by symbol range (see here), although this is never something I've tried.
One basic way to improve performance would be to select a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact as it now can ignore some files completely.
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query as the look up on the sym file will be faster. Although this really depends on the size of your universe. The benefit of this would be marginal in comparison to reducing the number of columns requested I would imagine.
As user1895961 mentioned, selecting only certain columns will be faster.
kdb splayed/partitioned tables are almost exactly just files on the filesystem; the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5MM rows per partition is OK, but is on the large side. Perhaps you might want to partition by something else.
You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back again on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk when filtering is added in.
Also try converting your data to char instead of sym; I found big performance increases from doing so.

HBase row key design for monotonically increasing keys

I've an HBase table where I'm writing the row keys like:
<prefix>~1
<prefix>~2
<prefix>~3
...
<prefix>~9
<prefix>~10
The scan on the HBase shell gives an output:
<prefix>~1
<prefix>~10
<prefix>~2
<prefix>~3
...
<prefix>~9
How should a row key be designed so that the row with key <prefix>~10 comes last? I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
How should a row key be designed so that the row with key ~10 comes last?
You see the scan output this way because rowkeys in HBase are kept sorted lexicographically, irrespective of insertion order. This means they are sorted by their string representations: rowkeys in HBase are treated as arrays of bytes with a string representation, and the lowest-ordered rowkey appears first in a table. That's why 10 appears before 2, and so on. See the Rows section on this page to learn more.
When you left-pad the integers with zeros, their natural ordering is preserved under lexicographic sorting, which is why the scan order then matches the order in which you inserted the data. To do that you can design your rowkeys as suggested by #shutty.
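The effect is easy to demonstrate in Python, since `sorted()` on strings is also a byte-wise (lexicographic) sort:

```python
# Unpadded counters: lexicographic sort puts "prefix~10" before "prefix~2".
keys = [f"prefix~{i}" for i in range(1, 11)]
assert sorted(keys)[1] == "prefix~10"

# Zero-padding the counter makes lexicographic order match numeric order,
# so the scan returns rows in insertion order.
padded = [f"prefix~{i:04d}" for i in range(1, 11)]
assert sorted(padded) == padded
```

The same reasoning applies to HBase's byte-array comparison of rowkeys.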
I'm looking for some recommended ways or the ways that are more popular for designing HBase row keys.
There are some general guidelines to follow in order to devise a good rowkey design:
Keep the rowkey as small as possible.
Avoid using monotonically increasing rowkeys, such as timestamps. This is a poor schema design and leads to RegionServer hotspotting. If you can't avoid them, use a technique such as hashing or salting to mitigate hotspotting.
Avoid using Strings as rowkeys if possible. The string representation of a number takes more bytes than its integer or long representation. For example: a long is 8 bytes, and can store an unsigned number up to 18,446,744,073,709,551,615. If you stored this number as a String, presuming one byte per character, you would need nearly 3x the bytes.
Use some mechanism, such as hashing, to get a uniform distribution of rows in case your regions are not evenly loaded. You can also create pre-split tables to achieve this.
See this link for more on rowkey design.
HTH
HBase stores rowkeys in lexicographical order, so you can try this schema with fixed-length rowkeys:
<prefix>~0001
<prefix>~0002
<prefix>~0003
...
<prefix>~0009
<prefix>~0010
Keep in mind that you also should use random prefixes to avoid region hot-spotting (when a single region accepts most of the writes, while the other regions are idle).
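That salting idea can be sketched in Python; the bucket count and the use of MD5 here are arbitrary choices for illustration, not an HBase requirement:

```python
import hashlib

def salted_key(key: str, buckets: int = 4) -> str:
    """Prefix the key with a stable, hash-derived bucket number so that
    monotonically increasing keys spread across several regions."""
    salt = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}~{key}"

# Sequential keys all land at the "end" of an unsalted table; salted,
# they fan out across several leading buckets instead of one hot region.
keys = [f"prefix~{i:04d}" for i in range(100)]
salted = [salted_key(k) for k in keys]
assert len({k.split("~")[0] for k in salted}) > 1
```

The trade-off is that a scan over a logical key range now has to issue one scan per salt bucket and merge the results.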
Monotonically increasing keys aren't a good schema for HBase.
You can read more here:
http://hbase.apache.org/book/rowkey.design.html
There is also a link there to OpenTSDB, which solves this problem.
Fixed-length keys are really recommended if possible. Bytes.toBytes(long value) can be used to get a byte array from a counter. It will sort correctly for non-negative longs up to Long.MAX_VALUE.
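A Python analogue of that fixed-length encoding (using the standard `struct` module in place of HBase's Bytes class) shows why big-endian byte order is what makes the byte-wise sort agree with numeric order:

```python
import struct

def to_bytes_long(value: int) -> bytes:
    """Big-endian 8-byte encoding, analogous to HBase's Bytes.toBytes(long)."""
    return struct.pack(">q", value)

counters = [3000, 1, 200, 10, 2]
encoded = sorted(to_bytes_long(c) for c in counters)

# Byte-wise order of the encodings matches numeric order for non-negative longs:
assert [struct.unpack(">q", b)[0] for b in encoded] == [1, 2, 10, 200, 3000]
```

Negative values would break this (their high bit set makes them compare as larger under unsigned byte comparison), which is why the recommendation is limited to non-negative counters.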