How can I get the range of values, min & max for each of the columns in the micro-partition in Snowflake? - metadata

The following thread says that Snowflake stores metadata about all rows stored in a micro-partition, including the range of values for each of the columns in the micro-partition: https://community.snowflake.com/s/question/0D53r00009kz6HpCAI/are-min-max-values-stored-in-a-micro-partitions-metadata-. What function can I use to retrieve this information? I tried running SYSTEM$CLUSTERING_INFORMATION and it returns total_partition_count, depth, and overlap-related information, but no information about the column values in the micro-partitions. Thanks!
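For reference, this is roughly the call I ran (table and column names are placeholders); it returns only the clustering statistics mentioned above, not per-column value ranges:
-- returns a JSON summary with total_partition_count, depth, overlaps, etc.
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(my_column)');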

Snowflake stores this metadata about each micro-partition internally to optimize queries, but it does not publish it.
Part of the reason is security: partition-level metadata can reveal data that should be masked for some users, or hidden through row-level security.
But if there's an interesting business use case for which you'd want this data, Snowflake is listening.

Related

Redshift: loading & storing JSON (IoT) data

My source is JSON with nested arrays & structures. (examples at end of post)
Large volume of new data streaming real-time (20m/day)
I have to decide how to store this data, considering:
-- End users want to use 'traditional' SQL
-- Performance (ingestion & query)
-- Load on Cluster
As far as I can see, my options are to make use of the SUPER data type, or just convert everything to traditional relational tables & types.
(Even if I store the full JSON as a super, I still have to serialize critical attributes into regular columns for the purposes of Distribution/Sort.)
Regardless, I've been trying to weigh up the pros & cons of SUPER vs. 'traditional'.
(1) Store full JSON as SUPER type
-- Very easy to ingest data with low load on cluster
-- Maybe an additional load on cluster & performance impact to execute end user queries?
-- End users would have to learn PartiQL and deal with unnesting & serialization etc
(2) 'Traditional' Relational Tables & Types
(a) Load as super, but then use PartiQL to unnest, serialize and store in relational tables
-- Additional continuous load on cluster
-- Easy to implement (insert into)
-- Would result in some massive tables for the 'tag' nodes
(b) Use lambda to pre-unnest & serialize json, insert/copy directly into relational tables
-- Lambda would be invoked continuously
(3) Redshift Spectrum
If I am converting to a relational structure (2b), I could simply store the data in S3 and utilize Redshift Spectrum to query it
-- No load on cluster to ingest/process incoming data
-- Cost/maintenance of lambda/other process to transform JSON
Questions:
Are the above understandings correct?
Any other considerations not listed?
Is there a 'standard' for this scenario?
Any other guidance/wisdom welcome!
Background info
Schema:
events: array[ struct{
    channels: array[
        struct{
            tags: array[ struct{} ]
        }
        tags: array[ struct{} ]
    ]
}
]
Schema Exploded view:
You will not want to leave things as a monolithic JSON. Any data that will be repeatedly queried in analytics should be its own column. The database work to expand the JSON at ingestion will be dwarfed by the work to repeatedly expand it for every query.
Any data that will be commonly used in a WHERE clause, GROUP BY, partition, join condition, etc. will likely need to be its own column. I'd expect any data that is common to 90% of the JSON elements to deserve its own column. JSON elements that are rare, unique, or of little analytic interest can be kept in SUPER columns that hold just those subset parts of the JSON.
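As a rough sketch of that extract-at-load step (option 2a in the question), assuming the documents land in a staging table with a SUPER column, and with purely hypothetical element and column names:
-- staging table holding the original documents (hypothetical)
CREATE TABLE stg_events_raw (payload SUPER);
-- flatten events -> channels -> tags into a pre-created relational fact table;
-- all element and column names below are illustrative only
INSERT INTO fact_channel_tags (event_id, channel_id, tag_name, tag_value)
SELECT e.event_id::BIGINT,
       c.channel_id::BIGINT,
       t.name::VARCHAR(64),
       t.value::VARCHAR(256)
FROM stg_events_raw s,
     s.payload.events e,
     e.channels c,
     c.tags t;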
The data size increase will be smaller than you think; Redshift is good at compressing columns. The ingestion load is unlikely to be a major concern, but if it is, the Lambda approach is a reasonable way to add compute resources to address it. I really don't expect this, but if needed it can be folded easily into the existing ETL processes.
A hazard you will face is that users will reference only the JSON and not the extracted columns, and re-extracting the same data repeatedly costs. I'd consider NOT keeping the entire JSON in the main fact tables, only the JSON pieces that represent data not otherwise in columns. Keeping the original JSON documents in a separate table keyed with an identity column will allow joining if some need arises, but the goal will be to not need to do this.
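A minimal sketch of that side table (names are made up; the identity column is just a surrogate key for the occasional join back):
-- original JSON kept out of the main fact tables
CREATE TABLE raw_event_documents (
    event_doc_id BIGINT IDENTITY(1,1),
    payload      SUPER
);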
Spectrum does not look like a good fit for this use case. Spectrum does well when the compute elements in S3 can apply the first-level WHERE clauses and simple aggregations. I just don't see Spectrum working on this data, so it would send the entire JSON payload to Redshift repeatedly, which will make things slow and tie up a ton of network bandwidth. Storing the full original JSON (with its identity column) in S3, and keeping all the expanded columns plus the left-over JSON elements in native Redshift tables, does make sense though. That way, if some need to reference the full JSON arises, it is just an external table reference away.

Redshift: How to see columns that are analyzed in a table?

Is there a way to see the "Analyzed" values in a Redshift table? In Teradata, the help stats command gives you this. I'm wondering if there is an equivalent in Redshift. Can't find anything on that.
Yes you can, with some work. First off, remember that Redshift stores data in 1MB blocks and these blocks are distributed across the cluster (slices). So the metadata that is created by ANALYZE gives information about these blocks and what they contain.
The table stv_blocklist contains this information about block contents. It contains a number of pieces of information about the blocks but the pieces that will likely be of most interest are minvalue and maxvalue (along with table id, slice, and column so you know what part of the database any block is a part of).
minvalue and maxvalue are BIGINTs, so these values make perfect sense if the column they describe is also a BIGINT. However, if your column is text or a timestamp, you will need to do some reverse engineering to understand the hash used to store values of these column types as BIGINTs. I've done it for just about all data types in the past and it isn't too hard to work out. If memory serves there is some endian trick with the text values, but again it's not too hard to figure out since you can see both the input values and the output values. With this decoder ring you can evaluate how effective your sort keys are at speeding up queries.
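For example, a query along these lines ('my_table' is a placeholder) lists the per-block min/max metadata for every column of a table:
-- raw BIGINT-encoded min/max per block, per column, per slice
SELECT b.col, b.slice, b.blocknum, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p ON p.id = b.tbl AND p.slice = b.slice
WHERE TRIM(p.name) = 'my_table'
ORDER BY b.col, b.slice, b.blocknum;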

Redshift doesn't recognize newly loaded data as pre-sorted

I am trying to load very large volumes of data into Redshift, into a single table that would be cost-prohibitive to VACUUM once loaded. To avoid having to vacuum this table, I am loading data using the COPY command from a large number of pre-sorted CSV files. The files I am loading are pre-sorted based on the sort keys defined in the table.
However, after loading the first two files, I find that Redshift reports the table as ~50% unsorted. I have verified that the files have the data in the correct sort order. Why would Redshift not recognize the new incoming data as already sorted?
Do I have to do anything special to let the copy command know that this new data is already in the correct sort order?
I am using the SVV_TABLE_INFO table to determine the sort percentage (using the unsorted field). The sort key is a composite key of three different fields (plane, x, y).
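For reference, this is roughly the check I'm running ('my_table' is a placeholder):
SELECT "table", unsorted, tbl_rows
FROM svv_table_info
WHERE "table" = 'my_table';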
Official Answer by Redshift Support:
Here is what we say officially:
http://docs.aws.amazon.com/redshift/latest/dg/vacuum-load-in-sort-key-order.html
When your table has a sort key defined, the table is divided into 2 regions: sorted, and unsorted. As long as you load data in sort key order, even though the data is in the unsorted region, it is still in sort key order, so there is no need for VACUUM to ensure the data is sorted. A VACUUM is still needed to move the data from the unsorted region to the sorted region; however, this is less critical as the data in the unsorted region is already sorted.
Storing sorted data in Amazon Redshift tables has an impact in several areas:
How data is displayed when ORDER BY is not used
The speed with which data is sorted when using ORDER BY (eg sorting is faster if it is mostly pre-sorted)
The speed of reading data from disk, based upon Zone Maps
While you might want to choose a SORTKEY that improves sort speed (eg Change order of sortkey to descending), the primary benefit of a SORTKEY is to make queries run faster by minimizing disk access through the use of Zone Maps.
I admit there doesn't seem to be a lot of documentation available about how Zone Maps work, so I'll try and explain it here.
Amazon Redshift stores data on disk in 1MB blocks. Each block contains data relating to one column of one table, and data from that column can occupy multiple blocks. Blocks can be compressed, so they will typically contain more than 1MB of data.
Each block on disk has an associated Zone Map that identifies the minimum and maximum value in that block for the column being stored. This enables Redshift to skip over blocks that do not contain relevant data. For example, if the SORTKEY is a timestamp and a query has a WHERE clause that limits data to a specific day, then Redshift can skip over any blocks where the desired date is not within that block.
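For example, with a hypothetical table that has a timestamp column event_ts as its SORTKEY, a query like the one below only needs to read blocks whose zone-map range overlaps the requested day:
-- blocks whose [min, max] range for event_ts falls outside this day are skipped
SELECT COUNT(*)
FROM events
WHERE event_ts >= '2017-01-01' AND event_ts < '2017-01-02';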
Once Redshift locates the blocks with desired data, it will load those blocks into memory and will then perform the query across the loaded data.
Queries will run the fastest in Redshift when it only has to load the fewest possible blocks from disk. Therefore, it is best to use a SORTKEY that commonly matches WHERE clauses, such as timestamps where data is often restricted by date ranges. Sometimes it is worth setting the SORTKEY to the same column as the DISTKEY even though they are used for different purposes.
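A minimal DDL sketch of that kind of choice (the sort key columns are taken from the question above; the distribution key and the value column are purely illustrative):
CREATE TABLE points (
    plane INT,
    x     INT,
    y     INT,
    value DOUBLE PRECISION  -- illustrative measure column
)
DISTKEY (plane)
COMPOUND SORTKEY (plane, x, y);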
Zone maps can be viewed via the STV_BLOCKLIST virtual system table. Each row in this table includes:
Table ID
Column ID
Block Number
Minimum Value of the field stored in this block
Maximum Value of the field stored in this block
Sorted/Unsorted flag
I suspect that the Sorted flag is set after the table is vacuumed. However, tables do not necessarily need to be vacuumed. For example, if data is always appended in timestamp order, then the data is already sorted on disk, allowing Zone Maps to work most efficiently.
You mention that your SORTKEY is "a composite key using 3 fields". This might not be the best SORTKEY to use. It could be worth running some timing tests against tables with different SORTKEYs to determine whether the composite SORTKEY is better than using a single SORTKEY. The composite key would probably perform best if all 3 fields are often used in WHERE clauses.

Schema for analytics table in Postgres

We use Postgres for analytics (star schema).
Every few seconds we get reports on ~500 metrics types.
The simplest schema would be:
timestamp   metric_type   value
78930890    FOO           80.9
78930890    ZOO           20
Our DBA has come up with a suggestion to flatten all reports of the same 5 seconds to:
timestamp   metric1   metric2   ...   metric500
78930890    90.9      20        ...
Some developers push back on this, saying it adds huge complexity to development (batching data so it is written in one shot) and to maintainability (just looking at the table or adding fields is more complex).
Is the DBA model the standard practice in such systems or only a last resort once the original model is clearly not scalable enough?
EDIT: the end goal is to draw a line chart for the users. So queries will mostly be selecting a few metrics, folding them by hours, and selecting min/max/avg per hour (or any other time period).
EDIT: the DBA arguments are:
This is relevant from day 1 (see below), but even if it was not, it is something the system will eventually need to do, and migrating from another schema will be a pain
Reducing the number of rows x500 times will allow more efficient indexes and memory (the table will contain hundreds of millions of rows before this optimization)
When selecting multiple metrics the suggested schema will allow one pass over the data instead of separate query for each metric (or some complex combinations of OR and GroupBY)
EDIT: 500 metrics is an "upper bound" but in practice most of the time only ~40 metrics are reported per 5 seconds (not the same 40 though)
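For context, a typical chart query against the simple (narrow) schema would look roughly like this, using the epoch-seconds timestamp and the metric names from the example above:
-- hourly min/max/avg for a handful of selected metrics
SELECT date_trunc('hour', to_timestamp("timestamp")) AS hour,
       metric_type,
       MIN(value) AS min_value,
       MAX(value) AS max_value,
       AVG(value) AS avg_value
FROM metrics
WHERE metric_type IN ('FOO', 'ZOO')
GROUP BY 1, 2
ORDER BY 1, 2;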
The DBA's suggestion isn't totally unreasonable if the metrics are fairly fixed, and make sense to group together. A couple of problems you'll likely face, though:
Postgres has a limit of between 250 and 1,600 columns (depending on data type)
The table will be hard for developers to work with, especially if you often want to query for only a subset of the attributes
Adding new columns will be slow
Instead, you might want to consider using an HSTORE column:
CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  HSTORE  -- "values" is a reserved word in Postgres, so it must be quoted
);
This will give you some flexibility in storing attributes, and allows for indices. For example, to index just one of the metrics:
CREATE INDEX metrics_metric3 ON metrics (("values"->'metric3'));
One drawback of this is that values can only be text strings... so if you need to do numeric comparisons, a JSON column might also be worth considering:
CREATE TABLE metrics (
    timestamp INTEGER,
    "values"  JSON
);
-- ->> returns text, so cast it for numeric comparisons (json itself has no
-- btree operator class); this assumes the stored metric values are numeric
CREATE INDEX metrics_metric3 ON metrics ((("values"->>'metric3')::numeric));
The drawback here is that you'll need to use Postgres 9.3, which is still reasonably new.
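A query against the JSON variant would then look something like this (the metric name is a placeholder; the cast matches the expression index above):
-- hourly average of one metric pulled out of the JSON document
SELECT date_trunc('hour', to_timestamp("timestamp")) AS hour,
       AVG(("values"->>'metric3')::numeric) AS avg_metric3
FROM metrics
WHERE ("values"->>'metric3') IS NOT NULL
GROUP BY 1
ORDER BY 1;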

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the combined dataset A∪B as
stddev(A∪B) = sqrt(E[(A∪B)^2] - (E[A∪B])^2), where
E[(A∪B)^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and
E[A∪B] = (sum(A) + sum(B)) / (|A| + |B|)
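In SQL terms, if you keep a hypothetical table of per-dataset running sums, combining any selection of datasets becomes a single aggregation:
-- hypothetical precomputed table: one row per (dataset_id, key, col_name) with
-- n = row count, sum_x = sum of values, sum_x2 = sum of squared values (double precision)
SELECT key,
       SUM(sum_x) / SUM(n) AS combined_avg,
       SQRT(SUM(sum_x2) / SUM(n) - POWER(SUM(sum_x) / SUM(n), 2)) AS combined_stddev
FROM dataset_column_sums
WHERE dataset_id IN (1, 2, 3)  -- datasets chosen by the user
  AND col_name = 'col7'        -- which numeric column to aggregate
GROUP BY key;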
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, and another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each table in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, from which you can compute averages etc. quite easily.
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since your data set doesn't sound like it needs joins, you should be fine. You can have indexes set up to find the data sets and then do the math either in SQL or in the app.
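A rough sketch of that layout (all names hypothetical): one wide table, an index to narrow the scan to the chosen datasets, and the math done in SQL:
-- one row per (dataset, key); 10-20 numeric measure columns as described above
CREATE TABLE measurements (
    dataset_id INT,
    key        BIGINT,
    type_id    INT,
    col1       DOUBLE PRECISION,
    col2       DOUBLE PRECISION
    -- ... remaining measure columns
);
CREATE INDEX measurements_dataset_idx ON measurements (dataset_id, type_id);

-- aggregate the user's selection of datasets, per key
SELECT key,
       AVG(col1) AS avg_col1, STDDEV(col1) AS stddev_col1,
       AVG(col2) AS avg_col2, STDDEV(col2) AS stddev_col2
FROM measurements
WHERE dataset_id IN (1, 2, 3)
  AND type_id = 7
GROUP BY key;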