Cost control in Redshift Spectrum when scanning external tables (S3 data) - amazon-redshift

Athena has some default service limits that can help cap the cost of accidental "runaway" queries on a large data lake in S3. They are not great (based on time, not volume of data scanned), but they're still helpful.
What about Redshift Spectrum?
What mechanisms does it provide that can easily be used to cap cost or mitigate the risk of accidentally scanning too much data in a single runaway query against S3? What's a good way of tackling this problem?

Amazon Redshift allows you to apply granular controls over Spectrum query execution using WLM Query Monitoring Rules.
There are two Spectrum-specific metrics available: Spectrum scan size (the number of megabytes scanned by the query) and Spectrum scan row count (the number of rows scanned by the query).
You can also use Query execution time to enforce a maximum duration, but this applies to all query types, not just Spectrum.
Please note that these are sampled metrics. Queries are not aborted at precisely the point when they exceed the rule, they are aborted at the next sample interval.
If you have already been running Spectrum queries on your cluster, you can get started with QMR by using our script wlm_qmr_rule_candidates to generate candidate rules. The generated rules are based on the 99th percentile of each metric.
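For illustration, here is a rough sketch of what such a rule could look like inside the cluster's wlm_json_configuration parameter. The rule name, the 10240 MB threshold and the queue settings are placeholders, and the exact metric names and JSON layout should be verified against the current QMR documentation:
[
  {
    "query_group": [],
    "user_group": [],
    "query_concurrency": 5,
    "rules": [
      {
        "rule_name": "spectrum_scan_cap",
        "predicate": [
          {
            "metric_name": "spectrum_scan_size_mb",
            "operator": ">",
            "value": 10240
          }
        ],
        "action": "abort"
      }
    ]
  }
]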

Related

Measure Redshift Performance

I am trying to create a single metric that tells me whether Redshift performance today is better than yesterday.
I know there are several metrics available to measure performance individually:
https://docs.amazonaws.cn/en_us/redshift/latest/mgmt/performance-metrics-perf.html
CPU utilization
Percentage disk space used
Database connections
Health status
Query duration
Query throughput
Concurrency scaling activity
Is it possible to have a single metric that calculates performance based on all the above factors?

Prometheus query quantile of pod memory usage performance

I'd like to get the 95th percentile memory usage of my pods over the last x time. However, this query starts to take too long if I use a 'big' (7/10d) range.
The query that I'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
It takes around 100s to complete.
(I removed extra namespace filters for brevity.)
What steps could I take to make this query more performant (other than making the machine bigger)?
I thought about calculating the 95th percentile every x time (say, every 30 min), recording it as p95_memory_usage, and using p95_memory_usage in the query instead of container_memory_usage_bytes, so that I can reduce the number of points the query has to go through.
However, would this not distort the values?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules that looks like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics), although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
  labels:
    le: "100000.0"
  expr: |
    # Count 1 when the current value is at or below the bucket bound (le
    # semantics), then add the previous bucket total (or 0 if the bucket
    # series does not exist yet).
    (container_memory_usage_bytes <= bool 100000.0)
    + ignoring(le)
    (
      container_memory_usage_bytes_bucket{le="100000.0"}
      or ignoring(le)
      container_memory_usage_bytes * 0
    )
Repeat for all bucket sizes you're interested in, add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
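As an illustration of those last two steps, a hedged sketch: assume a second-stage rule group, evaluated hourly, records the hourly increase of the hand-built bucket counters under a made-up name:
groups:
  - name: memory_histogram_lowres
    interval: 1h
    rules:
      - record: container_memory_usage_bytes_bucket:increase1h
        expr: increase(container_memory_usage_bytes_bucket[1h])
The final quantile can then be computed over that low-resolution series (the pod grouping label is an assumption about your metric's labels):
histogram_quantile(
  0.95,
  sum by (pod, le) (
    sum_over_time(container_memory_usage_bytes_bucket:increase1h[10d])
  )
)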
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
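A minimal sketch of that simpler alternative, assuming an hourly rule group (the group name and the recorded metric name are made up):
groups:
  - name: memory_lowres
    interval: 1h                 # record one sample per hour
    rules:
      - record: container_memory_usage_bytes:hourly
        expr: container_memory_usage_bytes
The original query then becomes:
quantile_over_time(0.95, container_memory_usage_bytes:hourly[10d])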
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series over the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (i.e. a range query instead of an instant query), then the number of raw samples returned by sum(count_over_time(...)) must be multiplied by the number of points on the Grafana graph, since Prometheus executes quantile_over_time(...) individually for each point on the displayed graph. Grafana usually requests around 1000 points to build a smooth graph, so the number returned by sum(count_over_time(...)) must be multiplied by 1000 in order to estimate the number of raw samples Prometheus needs to process for building the quantile_over_time(...) graph. See more details in this article.
There are the following solutions for reducing the query duration:
Add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process (as in the sketch after this list).
Reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
Use recording rules for calculating coarser-grained results.
Try other Prometheus-compatible systems, which may process heavy queries faster, for example VictoriaMetrics.
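For instance, combining the first two options; the namespace value below is only a placeholder for whatever label filters apply in your setup:
quantile_over_time(0.95, container_memory_usage_bytes{namespace="my-namespace"}[1d])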

How to perform large computations on Spark

I have two tables in Hive, user and item, and I am trying to calculate the cosine similarity between two features of each table for the Cartesian product of the two tables, i.e. a cross join.
There are around 20,000 users and 5,000 items, resulting in 100 million rows of calculation. I am running the computation using Scala Spark on a Hive cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job always fails due to memory issues (GC allocation failure) on the Hadoop cluster. If I reduce the computation to around 10 million rows, it definitely works, finishing in under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark uses different strategies for data management (storage levels). These policies are enabled by the user via settings in the Spark configuration files, or directly in the code or script.
According to the documentation on data management policies:
The MEMORY_AND_DISK policy would be good for you, because if the data (RDD) does not fit in RAM, the remaining partitions are spilled to disk. This strategy can be slow, however, if you have to access the disk often.
There are a few steps to doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200. If that comes to more than about 1 GB of raw data per partition, the default is too low.
2. Calculate the size of each row and multiply it by the other table's row count to estimate the rough volume. The process works much better with Parquet than with CSV files.
3. spark.sql.shuffle.partitions needs to be set based on total data volume / 500 MB (see the sketch after this list).
4. spark.shuffle.minNumPartitionsToHighlyCompress needs to be set a little lower than the shuffle partition count.
5. Bucketize the source Parquet data on the joining column for both of the files/tables.
6. Provide generous Spark executor memory, and manage the Java heap memory too, keeping the available heap space in mind.
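For example, a sketch of how those settings might be applied when building the Spark session. The partition count and memory values are placeholders to be derived from your own volume estimate, and spark.shuffle.minNumPartitionsToHighlyCompress is an internal setting whose effect depends on the Spark version:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("user-item-cosine-similarity")
  // roughly total cross-join volume / 500 MB, per step 3 above
  .config("spark.sql.shuffle.partitions", "400")
  // a little lower than the shuffle partition count, per step 4
  .config("spark.shuffle.minNumPartitionsToHighlyCompress", "380")
  .config("spark.executor.memory", "8g")
  .getOrCreate()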

Estimating Redshift Table Size

I am trying to create an estimate of how much space a table in Redshift is going to use; however, the only resources I found were about calculating the minimum table size:
https://aws.amazon.com/premiumsupport/knowledge-center/redshift-cluster-storage-space/
The purpose of this estimate is that I need to calculate how much space a table with the following dimensions is going to occupy without running out of space on Redshift (i.e. it will define how many nodes we end up using).
Rows : ~500 Billion (The exact number of rows is known)
Columns: 15 (The data types are known)
Any help in estimating this size would be greatly appreciated.
Thanks!
The article you reference (Why does a table in my Amazon Redshift cluster consume more disk storage space than expected?) does an excellent job of explaining how storage is consumed.
The main difficulty in predicting storage is predicting the efficiency of compression. Depending upon your data, Amazon Redshift will select an appropriate Compression Encoding that will reduce the storage space required by your data.
Compression also greatly improves the speed of Amazon Redshift queries by using Zone Maps, which identify the minimum and maximum value stored in each 1 MB block. Highly compressed data will be stored on fewer blocks, thereby requiring fewer blocks to be read from disk during query execution.
The best way to estimate your storage space would be to load a subset of the data (e.g. 1 billion rows), allow Redshift to automatically select the compression types, and then extrapolate to your full data size.
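As a rough sketch of the extrapolation step: the schema and table names below are placeholders, and SVV_TABLE_INFO reports the table size in 1 MB blocks.
-- After loading the ~1 billion row sample, check its size and row count,
-- then scale the per-row footprint up to the full ~500 billion rows.
SELECT "table",
       tbl_rows,
       size AS size_mb,
       size::float / tbl_rows AS mb_per_row,
       size::float / tbl_rows * 500000000000 / 1024 / 1024 AS est_full_size_tb
FROM svv_table_info
WHERE "schema" = 'my_schema'
  AND "table" = 'my_table';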

How do I get an accurate measurement of a query's efficiency?

I am comparing queries on PostgreSQL 8.3.14 which return the same result set.
I have used EXPLAIN on my queries to track the estimated total cost. I have also run the queries a few times and recorded the total time it took to run. I understand consecutive runs will cause more data to be cached and skew the actual no-cache runtime.
Still, I would expect the EXPLAIN cost to be somewhat proportional to the total runtime (with cache skew).
My data contradicts this. I compared 4 queries.
Query A
Total Cost: 119 500
Average Runtime: 28.101 seconds
Query B
Total Cost: 115 700
Average Runtime: 28.291 seconds
Query C
Total Cost: 116 200
Average Runtime: 32.409 seconds
Query D
Total Cost: 93 200
Average Runtime: 37.503 seconds
I ran Query D last, and if anything it should be the fastest because of the caching problem. Running the queries without the cache seems to be difficult, based on this Q&A:
[SO]:See and clear Postgres caches/buffers?
How can I measure which query is the most efficient?
The query cost shown by the planner is a function of the structure of your indexes and also the relative frequencies of certain values in the relevant tables. PostgreSQL keeps track of the most common values seen in all of the columns of all of your tables so that it can get an idea of how many rows each stage of each plan is likely to operate on.
This information can become out of date. If you are really trying to get an accurate idea of how costly a query is, make sure the statistics Postgres is using are up to date by executing a VACUUM ANALYZE statement.
Beyond that, the planner is forced to do some apples-to-oranges comparisons: somehow weighing the time it takes to seek on disk against the time it takes to run a tight loop over an in-memory relation. Since different hardware can do these things at different relative speeds, sometimes, especially for near ties, Postgres may guess wrong. These relative costs can be tuned in your server's configuration file.
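For example, a quick sketch of refreshing the statistics, comparing the planner's estimate against the actual runtime, and inspecting the relative cost settings; the table and query are placeholders:
-- Refresh planner statistics for the tables used by the queries.
VACUUM ANALYZE my_table;

-- Compare the estimated cost against the actual execution time.
EXPLAIN ANALYZE
SELECT * FROM my_table WHERE some_column < 1000;

-- Relative cost parameters the planner uses (tunable in postgresql.conf).
SHOW seq_page_cost;
SHOW random_page_cost;
SHOW cpu_tuple_cost;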
Edit:
The statistics collected by PostgreSQL do not relate to "query performance" and are not updated by successive queries. They only describe the frequency and distribution of values in each column of each table (except where disabled). Having accurate statistics is important for accurate query planning, but it's on you, the operator, to tell PostgreSQL how often and at what level of detail those statistics should be gathered. The discrepancy you are observing is a sign that the statistics are out of date, or that you could benefit from tuning other planner parameters.
Try running them through EXPLAIN ANALYZE and posting the output to http://explain.depesz.com/