Spark SQL (Scala) - What's the most efficient way to join two very large tables that have skewed keys?

My current job is to create ETL processes with SparkSQL/Scala using Spark 2.2 with Hive support (all tables live in the Hive warehouse on HDFS).
One specific process requires joining a table with 1 billion unique records to another with 5 billion unique records.
The join key is skewed, in the sense that some keys are repeated far more than others, but our Hive is not configured to skew by that field, nor is it possible to implement that in the current scenario.
Currently I read each table into two separate DataFrames and perform a join between them. I tried an inner join and a right outer join on the 5b table to see if there was any performance gain (I'd drop the rows with null records afterwards). I could not notice one, but it could have been caused by cluster instability (I am not sure whether a right join requires less shuffling than an inner one).
I have also tried filtering the keys of the 1b table on the 5b one by creating a temp view and adding a where clause to the select statement on the 5b table, and still couldn't notice any performance gain (note: it's not possible to collect the unique keys of the 1b table, since that triggers an out-of-memory exception). I have also tried doing the entire thing in one SQL query, but again, no luck.
I've read about people creating a PairRDD and calling partitionBy with a HashPartitioner, but this seems outdated with the release of DataFrames. Right now I'm looking for some solid guidance on dealing with this join of two very large datasets.
Edit: there's an answer here that deals with pretty much the same problem I have, but it's 2 years old and simply says to first join a broadcasted set of records corresponding to the keys that repeat a lot, then perform another join with the rest of the records, and union the results. Is this still the best approach for my problem? I have skewed keys on both tables.
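For reference, that hot-key split looks roughly like this in Spark SQL (the BROADCAST SQL hint was added in Spark 2.2; the same split can be expressed with broadcast() in the DataFrame API). This is only a sketch: the table names (big_tbl, small_tbl), the join column (id), the extra column pulled from the small table (val) and the 1,000,000 frequency threshold are all assumptions to replace with your own, and it assumes the hot-key slice of the 1b table is small enough to broadcast.

-- 1. Identify the heavily repeated keys (tune the threshold from an actual frequency count).
CREATE OR REPLACE TEMPORARY VIEW hot_keys AS
SELECT id
FROM big_tbl
GROUP BY id
HAVING COUNT(*) > 1000000;

-- 2. Hot keys: broadcast the matching slice of the smaller table so the skewed rows
--    of the big table are joined map-side instead of piling up on a few shuffle partitions.
CREATE OR REPLACE TEMPORARY VIEW joined_hot AS
SELECT /*+ BROADCAST(s) */ b.*, s.val
FROM big_tbl b
JOIN (SELECT * FROM small_tbl WHERE id IN (SELECT id FROM hot_keys)) s
    ON b.id = s.id;

-- 3. Remaining keys: an ordinary shuffle join, which is now far less skewed.
--    (Assumes id is never NULL; otherwise prefer an anti join over NOT IN.)
CREATE OR REPLACE TEMPORARY VIEW joined_rest AS
SELECT b.*, s.val
FROM (SELECT * FROM big_tbl WHERE id NOT IN (SELECT id FROM hot_keys)) b
JOIN (SELECT * FROM small_tbl WHERE id NOT IN (SELECT id FROM hot_keys)) s
    ON b.id = s.id;

-- 4. Stitch the two halves back together.
SELECT * FROM joined_hot
UNION ALL
SELECT * FROM joined_rest;

Since the keys are skewed on both sides, you may want to union the heavy hitters detected on the 1b table into hot_keys as well. If the hot-key slice of the 1b table is still too large to broadcast, the usual fallback is salting: append a random suffix (0..N-1) to the key on the heavily skewed side and explode the other side over the same N suffixes, so each hot key is spread across N shuffle partitions.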

Related

Performant Redshift query to return year and shipping mode with max count?

I have a Redshift table lineitem with 303 million rows. The sortkey is on l_receiptdate.
l_receiptdate  l_shipmode
1992-01-03     TRUCK
1992-01-03     TRUCK
1992-03-03     SHIP
1993-02-03     AIR
1993-05-03     SHIP
1993-07-03     AIR
1993-09-05     AIR
Ultimate goal: find what shipmode was used the most for each year. Return year, shipmode, and count for that most popular ship mode.
Expected output:
receiptyear  shipmode  ship_mode_count
1992         TRUCK     2
1993         AIR       3
I'm new to Redshift and its nuances. I know 303 million rows isn't considered big data, but I'd like to start learning Redshift best query practices from the beginning. Below is what I have so far; I'm not sure how to move forward:
SELECT DATE_TRUNC('year', l_receiptdate) AS receiptyear,
       l_shipmode AS shipmode,
       COUNT(*) AS ship_mode_count
FROM lineitem
GROUP BY 1, 2;
Your query is fine, in a general sense. The missing piece of information is the table's distribution key. Redshift is a clustered (distributed) database, and this distribution is controlled by the DISTSTYLE and DISTKEY of the table.
Here's a simple way to think about the performance of a Redshift query. Given the nature of Redshift, a few aspects tend to dominate poorly performing queries:
Too much network redistribution of data
Scanning too much data from disk
Spilling to disk, making more data than needed through cross or looped joins, and a whole bunch of other baddies.
Your query has no joins, so #3 isn't an issue. Your query needs to scan the entire table from disk, so there is nothing to be gained on #2. However, #1 is where you could get into trouble, especially as your data grows.
Your query needs to group by ship mode and year, which means all the data for each unique combination of these needs to be brought together. So if your table were distributed by ship mode (don't do this), then all the data for each value would reside on a single "slice" of the database and no network data transmission would be needed to perform the count. However, you don't need to do this in this case, since you are just dealing with a COUNT() function and Redshift is smart enough to count locally and then ship the partial results, which are much smaller than the original data, to one place for the final count.
If more complicated actions were being performed that can't be done in parts, then the distribution of the table could make a big difference to the query. Having the data all in one place when rows need to be combined (join, group by, partition, etc.) can prevent a lot of data from needing to be shipped around the cluster via the network.
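For reference, the distribution and sort keys are declared when the table is (re)created. A minimal sketch with an abridged column list; the DISTKEY choice of l_orderkey is purely illustrative and not taken from the question:

CREATE TABLE lineitem (
    l_orderkey    BIGINT,
    l_receiptdate DATE,
    l_shipmode    VARCHAR(10)
    -- ...remaining columns...
)
DISTSTYLE KEY
DISTKEY (l_orderkey)
SORTKEY (l_receiptdate);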
Your query will work fine but hopefully walking through this mental exercise helps you understand Redshift better.
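For completeness, one way (a sketch, not the only one) to turn those grouped counts into the single most-used ship mode per year is a window function over the same aggregation; EXTRACT is used here so the output year is an integer, and ties are broken arbitrarily:

WITH yearly AS (
    SELECT EXTRACT(year FROM l_receiptdate) AS receiptyear,
           l_shipmode AS shipmode,
           COUNT(*) AS ship_mode_count
    FROM lineitem
    GROUP BY 1, 2
)
SELECT receiptyear, shipmode, ship_mode_count
FROM (
    SELECT yearly.*,
           ROW_NUMBER() OVER (PARTITION BY receiptyear
                              ORDER BY ship_mode_count DESC) AS rn
    FROM yearly
) ranked
WHERE rn = 1;

Like the original query, this still scans the whole table once; the window function only runs over the small set of (year, ship mode) counts.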

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. The alternative to the intermediate tables is to have a very complicated-looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per the above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you to debug later and for any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res: select from trades;
/ as-of join the USD rates, renaming usdRates columns to match the trade schema
res: aj[`ccy`arrivalTime;
        res;
        select ccy:currency, arrivalTime:dateTime, fxRate from usdRates];
res: update someFunc fxRate from res;
Sean beat me to it, but an aj for the time after / reverse aj is relatively straightforward by switching bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

postgres: Partitioned tables: split on unique values

Background information is in this query.
Postgres version 10.10
I need to partition a table on unique values of a varchar column that contains a file name, but I cannot figure out how to do this. LIST clearly does not work, and I can't see how to specify a RANGE that will produce a partition for each unique value of the column.
Undoubtedly blindingly obvious when you see it!
I thought I might as well answer this, since I have been searching for an answer myself.
I ended up using LIST partitioning with a single element per list. It seems to give all the promised functionality of partitioned (and, in my case, sub-partitioned) tables.
However, Postgres doesn't seem to have built-in logic to use the indexes on the partition columns. Even when selecting distinct values of the partition column from the master table, Postgres does a seq scan on most of the level 1 and level 2 tables. That, to me, is very silly, since the partition LIST is known upfront, so all it had to do was take the values from that LIST and seq scan only the default partition (if it existed)...
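For what it's worth, a minimal sketch of that single-value LIST layout on Postgres 10; the table and file names here are hypothetical:

CREATE TABLE file_data (
    file_name varchar NOT NULL,
    payload   text
) PARTITION BY LIST (file_name);

CREATE TABLE file_data_alpha PARTITION OF file_data
    FOR VALUES IN ('alpha.csv');

CREATE TABLE file_data_beta PARTITION OF file_data
    FOR VALUES IN ('beta.csv');

Each new file name needs its own PARTITION OF table, and Postgres 10 has no default partition (that arrived in version 11), so rows with unlisted values are rejected.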
List partitioning would be correct. If that doesn't seem feasible, you probably shouldn't partition the table by that column at all.
You shouldn't end up with thousands of partitions if you want good performance. If you cannot enumerate the file names, it seems like you have too many of them.
The first question you'll have to answer is whether your queries or (most importantly) your DELETE statements will become faster with partitioning. If not, don't do it at all.

OLAP Approach for Backend redshift connection

We have a system where we do some aggregations in Redshift based on some conditions. We aggregate this data with complex joins, which usually take about 10-15 minutes to complete. We then show this aggregated data in Tableau to generate our reports.
Lately, we are getting many change requests: adding a new dimension (which usually requires a join with a new table) or getting data with some more specific filter. To accommodate these requests we have to change our queries every time, for each of our subprocesses.
I went through OLAP a little bit. I just want to know whether it would be better in our use case, or whether there is a better way to design our system to handle such ad hoc requests without requiring a developer to change things every time.
Thanks for the suggestions in advance.
It would work; rather, it should work. Efficiency is the key here. There are a few things you need to monitor closely to make sure your system (Redshift + Tableau) stays up and running.
Prefer Extract over Live Connection (in Tableau)
A live connection would query the system every time someone changes a filter or refreshes the report. Since you said the dataset is large and the queries are complex, prefer creating an extract. This will make sure the data is available upfront whenever someone accesses your dashboard. Do not forget to schedule the extract refresh, otherwise the data will be stale forever.
Write efficient queries
OLAP systems are expected to query large datasets. Make sure you write efficient queries. It's always better to first get small datasets and join them, rather than bringing everything into memory and then joining / filtering the result with a where clause.
A pattern like (select foo from table1 where ...) a left join (select bar from table2 where ...) b can be the key at times, since you only pull out small, relevant data and then join.
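A hypothetical, concrete version of that shape (table, column and filter names are made up):

SELECT a.foo, b.bar
FROM (SELECT id, foo FROM table1 WHERE event_date >= '2021-01-01') a
LEFT JOIN (SELECT id, bar FROM table2 WHERE is_active = true) b
    ON b.id = a.id;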
Do not query infinite data.
Since this is analytical rather than transactional data, put an upper bound on the data that Tableau will refresh. Historical data is important, but not all the way back to the inception of your product. Analysing the data for the past 3, 6 or 9 months can be enough, rather than querying the universal dataset.
Create aggregates and let Tableau query that table, not the raw tables
Suppose you're analysing user traits. Rather than querying a raw table that captures 100 records per user per day, design a table that has just one (or two) entries per user per day and introduce a count column that tells you how many times the event was triggered. By doing this you'll be querying a much smaller dataset that is logically equivalent to what you were doing earlier.
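A sketch of such a daily rollup, with assumed table and column names (user_events, event_ts, event_name):

-- one row per user, day and event instead of ~100 raw rows per user per day
CREATE TABLE user_events_daily AS
SELECT user_id,
       DATE_TRUNC('day', event_ts) AS event_day,
       event_name,
       COUNT(*) AS event_count
FROM user_events
GROUP BY 1, 2, 3;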
As mentioned by Mr Prashant Momaya,
"While dealing with extracts, your storage requires (size)^2 of space if your dashboard refers to data of size size."
Be very cautious with whatever design you implement, and do not forget to consider the most important factor: scalability.
This is a typical problem and we tackled it by writing SQL generators in Python. If the definition of the metric is the same (like count(*)) but you have varying dimensions and filters you can declare it as JSON and write a generator that will produce the SQL. Example with pageviews:
{
  "metric": "unique pageviews",
  "definition": "count(distinct cookie_id)",
  "source": "public.pageviews",
  "tscol": "timestamp",
  "dimensions": [
    ["day"],
    ["day", "country"]
  ]
}
can be relatively easily translated to 2 scripts - this:
drop table metrics_daily.pageviews;
create table metrics_daily.pageviews as
select
date_trunc('day',"timestamp") as date
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1;
and this:
drop table metrics_daily.pageviews_by_country;
create table metrics_daily.pageviews_by_country as
select
date_trunc('day',"timestamp") as date
,country
,count(distinct cookie_id) as "unique_pageviews"
from public.pageviews
group by 1,2;
The complexity of the generator required to produce such SQL from such a config is quite low, but it grows quickly as you need to add new joins etc. It's much better to keep your dimensions in encoded form and just use a single wide table as the aggregation source, or produce views for every join you might need and use them as sources.
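A hedged example of the pre-joined-view idea (the users table and its columns are assumptions, not from the original config):

-- one view per join you expect to need; the generator then reads only from views
CREATE VIEW analytics.pageviews_enriched AS
SELECT p.*, u.country
FROM public.pageviews p
LEFT JOIN public.users u
    ON u.cookie_id = p.cookie_id;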

Running out of space in join queries

I have a star schema in Redshift, and for some BI purpose I am trying to create a flat table by joining the fact table with a dimension table. Both tables are huge; the fact table is around 1TB and the dimension table around 10GB.
When I run the join query it fails, even though I can confirm there is space in the Redshift cluster. As a temporary workaround, to complete the process I am running the join by adding one column at a time.
My understanding is that while the join query is running the space requirement is quite high, and once the join completes the space usage comes down.
Can anyone suggest an efficient way to complete such join?
You can allocate more memory for a query by setting wlm_query_slot_count (http://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html) to a higher value.
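A minimal illustration (the slot count of 3 is arbitrary; each extra slot borrows memory from the same WLM queue for the rest of the session):

SET wlm_query_slot_count TO 3;
-- run the large join here, then reset to the default
SET wlm_query_slot_count TO 1;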
Also check whether it makes sense for you to have your dimension tables replicated across all your nodes with DISTSTYLE ALL (http://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html). It will take more disk space, but it will speed up the join queries.
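A sketch of replicating a dimension with CTAS (the table names here are assumed, not from the question):

-- copy the dimension to every node so the join needs no redistribution
CREATE TABLE dim_product_all
DISTSTYLE ALL
AS SELECT * FROM dim_product;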
Another option is to flatten the large dimension into the fact table, as you might do in other DWH schemas.