I have a star schema in Redshift, and for BI purposes I am trying to create a flat table by joining the fact table with a dimension table. Both tables are huge: the fact table is around 1 TB and the dimension table is around 10 GB.
When I run the join query it fails, even though I can confirm that there is free space in the Redshift cluster. As a temporary workaround, I am completing the join by adding one column at a time.
My understanding is that while the join query is running the space requirement is quite high; once the join completes, the space usage comes back down.
Can anyone suggest an efficient way to complete such join?
You can allocate more memory for a query by setting wlm_query_slot_count (http://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html) to a higher value.
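For example, a minimal sketch (the slot count of 3 is only illustrative; the right value depends on how your WLM queue is configured):

set wlm_query_slot_count to 3;  -- claim more slots (and memory) for this session
-- run the heavy join here
set wlm_query_slot_count to 1;  -- hand the slots back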
Also check whether it makes sense for you to have your dimension tables replicated across all your nodes with DISTSTYLE ALL (http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html). It will take more disk space, but it will speed up the join queries.
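A sketch of replicating a dimension that way (table names are placeholders):

create table dim_customer_all diststyle all as
select * from dim_customer;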
Another option is to flatten the large dimension into the fact table, as you might do in other DWH schemas.
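If you go the flattening route, a minimal sketch could look like this (fact_sales, dim_customer and the key columns are hypothetical names; pick DISTKEY/SORTKEY to match your workload):

create table flat_sales
distkey (customer_id)
sortkey (sale_date)
as
select f.*,
       d.customer_name,
       d.customer_segment
from fact_sales f
join dim_customer d on d.customer_id = f.customer_id;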
I have a Redshift table lineitem with 303 million rows. The sortkey is on l_receiptdate.
l_receiptdate | l_shipmode
--------------+-----------
1992-01-03    | TRUCK
1992-01-03    | TRUCK
1992-03-03    | SHIP
1993-02-03    | AIR
1993-05-03    | SHIP
1993-07-03    | AIR
1993-09-05    | AIR
Ultimate goal: find what shipmode was used the most for each year. Return year, shipmode, and count for that most popular ship mode.
Expected output:
receiptyear | shipmode | ship_mode_count
------------+----------+----------------
1992        | TRUCK    | 2
1993        | AIR      | 3
I'm new to Redshift and its nuances. I know 303 million rows isn't considered big data, but I'd like to start learning Redshift best query practices from the beginning. Below is what I have so far; I'm not sure how to move forward:
select date_trunc('year', l_receiptdate) as receiptyear,
       l_shipmode as shipmode,
       count(*) as ship_mode_count
from lineitem
group by 1, 2
Your query is fine, in a general sense. The missing piece of information is: what is the distribution key of the table? You see, Redshift is a clustered (distributed) database, and this distribution is controlled by the DISTSTYLE and DISTKEY of the table.
Here's a simple way to think about the performance of a Redshift query. Given the nature of Redshift, a few aspects tend to dominate poorly performing queries:
Too much network redistribution of data
Scanning too much data from disk
Spilling to disk, making more data than needed through cross or looped joins, and a whole bunch of other baddies.
Your query has no joins, so #3 isn't an issue. Your query needs to scan the entire table from disk, so there is nothing that can be improved for #2. However, #1 is where you could get into trouble, especially as your data grows.
Your query needs to group by the ship mode and the year. This means that all the data for each unique combination of these needs to be brought together. So if your table were distributed by ship mode (don't do this), then all the data for each value would reside on a single "slice" of the database and no network data transmission would be needed to perform the count. However, you don't need to do this in this case, since you are just dealing with a COUNT() function and Redshift is smart enough to count locally and then ship the partial results, which are much smaller than the original data, to one place for the final count.
If more complicated actions were being performed that can't be done in parts, then the distribution of the table could make a big difference to the query. Having the data all in one place when rows need to be combined (join, group by, partition, etc.) can prevent a lot of data needing to be shipped around the cluster via the network.
Your query will work fine but hopefully walking through this mental exercise helps you understand Redshift better.
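If you also want to narrow the result down to just the most-used ship mode per year (your expected output), one possible sketch built on top of the grouped counts is a window function; extract(year from ...) is used here so the year renders as 1992 rather than a full timestamp:

select receiptyear, shipmode, ship_mode_count
from (
    select extract(year from l_receiptdate) as receiptyear,
           l_shipmode as shipmode,
           count(*) as ship_mode_count,
           row_number() over (
               partition by extract(year from l_receiptdate)
               order by count(*) desc
           ) as rn
    from lineitem
    group by 1, 2
) t
where rn = 1;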
I have a PostgreSQL table that is "frozen", i.e. no new data is coming into it. The table is strictly used for reading purposes. The table contains about 17M records. The table has 130 columns and can be queried in multiple different ways. To make the queries faster, I created indexes for all combinations of filters that can be used. So I have a total of about 265 indexes on the table. Each index is about 1.1 GB. This brings the total table size to around 265 GB. I have vacuumed the table as well.
Question
Is there a way to further bring down the disk usage of this table?
Is there a better way to handle queries for "frozen" tables that never get any data entered into them?
If your table or indexes are bloated, then VACUUM FULL tablename could shrink them. But if they aren't bloated, then this won't do any good. This is not a benign operation: it will lock the table for a period of time (rebuilding hundreds of indexes will probably take a long time) and generate large amounts of IO and of WAL, the last of which will be especially troublesome for replicas. So I would test it on a non-production clone to see if it actually shrinks things, and to get an idea of how long a maintenance window you will need to declare.
Other than that, be more judicious in your choice of indexes. How did you get the list of "all combinations of filters that can be used"? Was it by inspecting your source code, or just by tackling slow queries one by one until you ran out of slow queries? Maybe you can look at snapshots of pg_stat_user_indexes taken a few days apart to see if all of them are actually being used.
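For example, something along these lines (the table name is a placeholder) shows how often each index has been scanned since statistics were last reset, along with its size:

select schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where relname = 'your_table'   -- hypothetical table name
order by idx_scan;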
Are these mostly two-column indexes?
My current job is to create ETL processes with SparkSQL/Scala using Spark 2.2 with Hive support (all tables are on Hive warehouse/HDFS).
One specific process requires joining a table with 1b unique records with another one of 5b unique records.
The join key is skewed, in the sense that some keys are repeated far more often than others, but our Hive is not configured to skew by that field, nor is it possible to implement that in the current scenario.
Currently I read each table into two separate dataframes and perform a join between them. I tried an inner join and a right outer join on the 5b table to see if there was any performance gain (I'd drop the rows with null records afterwards). I could not notice one, but it could be caused by cluster instability (I am not sure whether a right join requires less shuffling than an inner one).
I have tried filtering the keys from the 1b table on the 5b one by creating a temp view and adding a where clause to the select statement of the 5b table, but still couldn't notice any performance gains (note: it's not possible to collect the unique keys from the 1b table, since that triggers a memory exception). I have also tried doing the entire thing in one SQL query, but again, no luck.
I've read some people talking about creating a PairRDD and performing partitionBy with a hashPartitioner, but this seems outdated with the release of dataframes. Right now, I'm in search of some solid guidance for dealing with this join of two very large datasets.
Edit: there's an answer here that deals with pretty much the same problem that I have, but it's 2 years old and simply tells me to first join a broadcasted set of records corresponding to the keys that repeat a lot, and then perform another join with the rest of the records, unioning the results. Is this still the best approach for my problem? I have skewed keys on both tables.
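To make the idea concrete, here is my understanding of that split-and-union approach written as Spark SQL (the table names big5b and small1b, the join key k, the extra column, and the hot-key threshold are all placeholders; I'm not sure the hot slice of the 1b table would really be small enough to broadcast, given that the skew exists on both sides):

-- 1. isolate the keys that dominate the skew (threshold is a guess; tune it)
create or replace temporary view hot_keys as
select k from big5b group by k having count(*) > 1000000;

-- 2. broadcast-join the hot slice of the smaller table
create or replace temporary view small_hot as
select s.* from small1b s join hot_keys h on s.k = h.k;

create or replace temporary view joined_hot as
select /*+ broadcast(s) */ b.*, s.extra_col
from big5b b join small_hot s on b.k = s.k;

-- 3. ordinary shuffle join for everything else, then union the two halves
create or replace temporary view joined_cold as
select b.*, s.extra_col
from (select bb.* from big5b bb left anti join hot_keys h on bb.k = h.k) b
join small1b s on b.k = s.k;

select * from joined_hot
union all
select * from joined_cold;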
We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins which usually takes about 10-15 minutes to complete. We then show this aggregated data on Tableau to generate our reports.
Lately, we have been getting many change requests, such as adding a new dimension (which usually requires a join with a new table) or getting data on some more specific filter. To handle these requests we have to change our queries every time, for each of our subprocesses.
I went through OLAP a little bit. I just want to know whether it would be better for our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work; rather, it should work. Efficiency is the key here. There are a few things you need to monitor strictly to make sure your system (Redshift + Tableau) remains up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection queries the system every time someone changes a filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This will make sure data is available upfront whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query large datasets. Make sure you write efficient queries. It's always better to first get the small datasets and then join them, rather than bringing everything into memory and then joining / using a where clause to filter the result.
A query like (select foo from table1 where ...) a left join (select bar from table2 where ...) b can be the key at times: you only pull out the small, relevant data and then join, as in the sketch below.
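For instance, a sketch of that shape on Redshift (tables, columns and filters are hypothetical):

select a.foo, b.bar
from (select id, foo from table1 where event_date >= dateadd(month, -3, current_date)) a
left join (select id, bar from table2 where is_active = true) b
  on b.id = a.id;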
Do not query infinite data.
Since this is analytical rather than transactional data, put an upper bound on the data that Tableau will refresh. Historical data has its importance, but not all the way back to the inception of your product. Analysing the data for the past 3, 6 or 9 months can be the key, rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table which has just one (or two) entries per user per day and introduce a count column that tells you the number of times the event was triggered. By doing this you'll be querying a much smaller dataset, but it will be logically equivalent to what you were doing earlier.
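A sketch of such a roll-up (the raw_user_events table and its columns are hypothetical):

create table agg_user_events_daily as
select user_id,
       trunc(event_time) as event_date,
       event_name,
       count(*) as event_count   -- the "count" column mentioned above
from raw_user_events
group by 1, 2, 3;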
As mentioned by Mr Prashant Momaya,
"While dealing with extracts, your storage requires (size)^2 of space if your dashboard refers to data of size **size**."
Be very cautious with whatever design you implement, and do not forget to consider the most important factor: scalability.
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
  metric: "unique pageviews"
  ,definition: "count(distinct cookie_id)"
  ,source: "public.pageviews"
  ,tscol: "timestamp"
  ,dimensions: [
    ['day']
    ,['day','country']
  ]
}
can be relatively easily translated into 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of the generator required to produce such SQL from such a config is quite low, but it increases exponentially as you need to add new joins etc. It's much better to keep your dimensions in encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
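For example, one such view could pre-join a dimension once, so the generator keeps pointing at a single wide source (the public.users table and signup_channel column are hypothetical):

create view public.pageviews_wide as
select p.*, u.signup_channel
from public.pageviews p
left join public.users u on u.cookie_id = p.cookie_id;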
I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
    OIDS = FALSE,
    appendonly = true,
    orientation = column,
    compresstype = quicklz,
    compresslevel = 1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query runs very slowly.
I need to run queries on all of the fields (not only one).
What can I do to speed up my queries?
Should I use PARTITION? Indexes?
Maybe a different DB like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the selectivity gained from columnar orientation is probably hurting you more than helping, as you need to scan all the data anyway. I would remove the columnar orientation.
Generally speaking, indexes don't help in a Greenplum system. Usually the amount of hardware involved tends to make scanning the data directly faster than doing index lookups.
Partitioning could be a great help, but it requires a better understanding of the data. You are probably accessing specific time intervals, so creating a partitioning scheme around cdr_time could eliminate scanning data that is not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could also have an impact on query speed. The system hashes the data based on from_number, so if you are querying selectively on from_number the data will only be returned by the node that has it, and you won't be leveraging the parallel nature of the system by spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to DISTRIBUTED RANDOMLY.
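Putting those suggestions together, a sketch of a rebuilt table (row orientation, random distribution, monthly range partitions on cdr_time; the date range is only an example) could look like this:

CREATE TABLE data."CDR_new"
(
    mcc text,
    mnc text,
    lac text,
    cell text,
    from_number text,
    to_number text,
    cdr_time timestamp without time zone
)
WITH (appendonly = true, orientation = row, compresstype = quicklz, compresslevel = 1)
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (cdr_time)
(
    START (date '2015-01-01') INCLUSIVE
    END   (date '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month'),
    DEFAULT PARTITION other
);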
On top of all of that there is the question of what the hardware is, and whether you have a proper number of segments set up with the resources to feed them. Essentially, every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on light hardware you need to find the sweet spot where the number of segments matches what the underlying system can provide.
@Dor,
I have the same type of data, where CDR info is stored for a telecom company: 10-12 million rows are inserted daily, and heavy queries run on the CDR-related tables. I was facing the same issue last year, and I created partitions on those tables on the CDR timestamp column.
As per my understanding, GP creates physical tables for each partition, whereas other RDBMSs create logical partitions. After this I got better performance with all SELECTs on these tables. Also, I think you should convert the text datatype to character varying for all columns (if text is really not required); I found DB operations on text fields to be very slow (especially order by and group by).
Whether an index will help depends on your queries; in my case I have huge inserts, so I haven't tried them yet.
If you are selecting all the columns, there is no need for a column-oriented table.
Regards