I have a load of partitioned tables which I would like to consume into Tableau. This worked really well with Qlik Sense, because it would consume each table into its own memory, then process it.
In Tableau I can't see a way to UNION tables (though you can UNION files). If I try to union them as custom SQL, it just loads for hours, so I'm assuming it's pulling all the data at once, which is 7GB of data and won't perform well on the database or in Tableau. The database is PostgreSQL.
The partitions are pre-aggregated, so when I do the custom query union it looks like this:
SELECT user_id, grapes, day FROM steps.steps_2016_04_02 UNION
SELECT user_id, grapes, day FROM steps.steps_2016_04_03 UNION
SELECT user_id, grapes, day FROM steps.steps_2016_04_04 UNION
...
If you can guarantee that the data in each table is unique, then don't use UNION, because it has to do extra work to make the rows distinct.
Use UNION ALL instead, which is basically an append of rows. UNION, or UNION DISTINCT (the same thing), as you wrote it, is roughly equivalent to:
SELECT DISTINCT * FROM (
SELECT user_id, grapes, day FROM steps.steps_2016_04_02 UNION ALL
SELECT user_id, grapes, day FROM steps.steps_2016_04_03 UNION ALL
SELECT user_id, grapes, day FROM steps.steps_2016_04_04
) t;
And the DISTINCT can be a very slow operation.
Another, simpler option is to use PostgreSQL's partitioning with table inheritance and let Tableau work with it as a single table; a sketch follows.
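A minimal sketch of the inheritance setup, assuming the partitions keep the columns from the question (the column types here are guesses):

-- Parent table with the shared columns (types assumed).
CREATE TABLE steps.steps (
    user_id integer,
    grapes  integer,
    day     date
);

-- Attach the existing partitions as children of the parent.
ALTER TABLE steps.steps_2016_04_02 INHERIT steps.steps;
ALTER TABLE steps.steps_2016_04_03 INHERIT steps.steps;
ALTER TABLE steps.steps_2016_04_04 INHERIT steps.steps;

-- Optional CHECK constraints let the planner skip children that can't match a filter:
ALTER TABLE steps.steps_2016_04_02 ADD CHECK (day = DATE '2016-04-02');

-- Tableau then reads everything through the parent as a single table:
-- SELECT user_id, grapes, day FROM steps.steps;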
In PostgreSQL I have tried: count(distinct (col1, col2, col3, col4, col5))
In BigQuery: count(distinct concat(col1, col2, col3, col4, col5))
My scenario is that I need to get the same result in BigQuery as in PostgreSQL.
Though this works for 3 columns, I am not getting the same value as PostgreSQL for 5 columns.
Sample query:
select col1,
       count(distinct concat(col1, col2, col3, col4, col5))
from A
group by col1
When I remove distinct and concat, a simple count(col1, col2, col3, col4, col5) gives exactly the value populated in PostgreSQL. But I need the distinct of these columns. Is there any way to achieve this? And does BigQuery's concat work differently?
Below are a few options for BigQuery Standard SQL.
#standardSQL
SELECT col1,
COUNT(DISTINCT TO_JSON_STRING((col1,col2,col3,col4,col5)))
FROM A
GROUP BY col1
OR
#standardSQL
SELECT col1,
COUNT(DISTINCT FORMAT('%T', [col1,col2,col3,col4,col5]))
FROM A
GROUP BY col1
An alternative suitable for the many databases that don't support that form of COUNT DISTINCT:
SELECT COUNT(*)
FROM (
SELECT DISTINCT Origin, Dest, Reporting_Airline
FROM `fh-bigquery.flights.ontime_201908`
WHERE FlightDate_year = "2018-01-01"
)
My guess on why CONCAT didn't work in your sample: Do you have any null columns?
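To illustrate that guess with a minimal, hypothetical two-row example (not the asker's data): CONCAT returns NULL as soon as any argument is NULL, and COUNT(DISTINCT ...) ignores NULLs, so those rows silently drop out of the count, while TO_JSON_STRING keeps them:

#standardSQL
SELECT
  COUNT(DISTINCT CONCAT(a, b))           AS with_concat, -- 1: the NULL row is dropped
  COUNT(DISTINCT TO_JSON_STRING((a, b))) AS with_json    -- 2: both rows are counted
FROM UNNEST([STRUCT('x' AS a, 'y' AS b), ('x', NULL)])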
I'm running out of spool space and wondering if the query can be optimized.
I've tried running a DISTINCT and UNION ALL; GROUP BY doesn't make sense here.
SELECT DISTINCT T1.EMAIL, T2.BILLG_STATE_CD, T2.BILLG_ZIP_CD
FROM
(SELECT EMAIL
FROM CAT
UNION ALL
SELECT EMAIL
FROM DOG
UNION ALL
SELECT email As EMAIL
FROM MOUSE) As T1
LEFT JOIN HAMSTER As T2 ON T1.EMAIL = T2.EMAIL_ADDR;
I will need to do this same type of data pull often, so I'm looking for a viable solution other than doing three separate joins.
I need to union multiple tables (T1) and join columns from another table (T2) on (T1), filtered by:
WHERE T2.ord_creatd_dt > DATE '2019-01-01' AND T2.ord_creatd_dt < DATE '2019-11-08'
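One possible rewrite, as a hedged sketch (my suggestion, not from the original post): use UNION instead of UNION ALL so the emails are de-duplicated before the join, and move the date filter into the ON clause so it prunes HAMSTER rows without turning the LEFT JOIN into an inner join. Table and column names are the asker's:

SELECT DISTINCT T1.EMAIL, T2.BILLG_STATE_CD, T2.BILLG_ZIP_CD
FROM
    (SELECT EMAIL FROM CAT
     UNION                       -- UNION (not UNION ALL) de-dupes the email list
     SELECT EMAIL FROM DOG
     UNION
     SELECT email AS EMAIL FROM MOUSE) AS T1
LEFT JOIN HAMSTER AS T2
    ON T1.EMAIL = T2.EMAIL_ADDR
   AND T2.ord_creatd_dt > DATE '2019-01-01'   -- filtering in ON keeps unmatched emails
   AND T2.ord_creatd_dt < DATE '2019-11-08';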
Since Redshift does not natively support date partitioning, other than through Redshift Spectrum, all our tables are date-partitioned by name:
my_table_name_YYYY_MM_DD
So our queries usually look like this:
select columns, i, want
from (
    select * from tbl1_date UNION ALL
    select * from tbl2_date UNION ALL
    select * from tbl3_date UNION ALL
    select * from tbl4_date
) t;
Where there's one UNION ALL per day.
Can stored procedures generate a date range, so our business analysts stop losing their hair when I send them a Python or bash script to generate the date range?
Yes, you could create a stored procedure that generates dynamic SQL using only the needed tables. See my answer here for a template to start from: Issue with passing column name as a parameter to "PREPARE" in Redshift
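As a rough sketch of what such a procedure could look like (the procedure name, the view name, and the my_table_name_YYYY_MM_DD pattern are assumptions based on the question), it can loop over the dates and EXECUTE the assembled view definition:

CREATE OR REPLACE PROCEDURE build_date_range_view(start_dt DATE, end_dt DATE)
AS $$
DECLARE
    d   DATE := start_dt;
    sql TEXT := 'CREATE OR REPLACE VIEW my_table_name_range AS ';
    sep TEXT := '';
BEGIN
    -- Append one SELECT per day in the range, glued together with UNION ALL.
    WHILE d <= end_dt LOOP
        sql := sql || sep || 'SELECT * FROM my_table_name_'
                   || REPLACE(TO_CHAR(d, 'YYYY-MM-DD'), '-', '_');
        sep := ' UNION ALL ';
        d := d + 1;
    END LOOP;
    EXECUTE sql;
END;
$$ LANGUAGE plpgsql;

-- The analysts then only need:
-- CALL build_date_range_view(DATE '2019-01-01', DATE '2019-01-07');
-- SELECT columns, i, want FROM my_table_name_range;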
However, you should be aware that Redshift can achieve most of what you want automatically using a "Time Series Table" view. This is documented here:
Using Time Series Tables
Use Time-Series Tables
You define a view that is composed of a UNION ALL over a sequence of identical tables with a sort key defined on a commonly filtered date or timestamp column. When you query that view Redshift is able to eliminate the scans on any UNION'ed tables that would not contain relevant data.
For example:
CREATE OR REPLACE VIEW store_sales_vw
AS SELECT * FROM store_sales_1998
UNION ALL SELECT * FROM store_sales_1999
UNION ALL SELECT * FROM store_sales_2000
UNION ALL SELECT * FROM store_sales_2001
UNION ALL SELECT * FROM store_sales_2002
UNION ALL SELECT * FROM store_sales_2003
;
SELECT cd.cd_education_status
,COUNT(*) sales_count
,AVG(ss_quantity) avg_quantity
FROM store_sales_vw vw
JOIN customer_demographics cd
ON vw.ss_cdemo_sk = cd.cd_demo_sk
WHERE ss_sold_ts BETWEEN '1999-09-01' AND '2000-08-31'
GROUP BY cd.cd_education_status
In this example Redshift will only use the store_sales_1999 and store_sales_2000 tables, skipping the other tables in the view. Note that the table skipping is not based on the name of the table. Redshift knows the MIN and MAX values of the sort key timestamp in each table.
If you pursue this approach, please be sure to keep the total size of the UNION fairly low. I recommend (at most) daily tables for the last week [7], weekly tables for the last month [5], quarterly tables for the last year [4], and then yearly tables for older data.
You can use ALTER TABLE … APPEND to merge the daily tables into weekly tables and so on; a sketch follows.
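A minimal sketch of that roll-up (the weekly and daily table names here are hypothetical); ALTER TABLE … APPEND moves the rows rather than copying them, leaving the source table empty:

-- Move a finished day's rows into the weekly table, then drop the emptied daily table.
ALTER TABLE sales_week_01 APPEND FROM sales_2020_01_06;
DROP TABLE sales_2020_01_06;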
My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that we have many rows for each given calendar day, and there is no fixed number of rows per day.
Based on this dataset I would need to calculate how many distinct users there are over a set window of time, say 30d.
Using Postgres 9.3, I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue using DENSE_RANK() OVER (... RANGE BETWEEN ...), because RANGE only accepts UNBOUNDED.
So I went the old fashioned way and tried with a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 mins ago and it is still running. Herein lies the problem: the dataset is still relatively small (25000 rows) but will grow over time, so I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on speed, but it should be a lot faster than your current one. Hopefully you have indexes on both these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
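For reference, a minimal sketch of the indexes mentioned above (table and column names taken from the question):

CREATE INDEX idx_data_table_calday ON data_table (calday);
CREATE INDEX idx_data_table_userid ON data_table (userid);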
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
SELECT calday, COUNT(DISTINCT userid) AS daily
FROM data_table
GROUP BY calday
) t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. This keeps the join much smaller, and it returns quickly (just under 1 second for 45000 rows in the source table on my system).
My current method of de-duping is really dumb.
select col1, col2 ... col500
from (
    select col1, col2 ... col500,
           ROW_NUMBER() OVER (PARTITION BY uid) as row_num
    from the_table
) t
where row_num = 1;
Is there a way to do this without a subquery? SELECT DISTINCT is not an option, as there can be small variations in the columns which are not significant for this output.
In Postgres, distinct on () is typically faster than the equivalent solution using a window function, and it also doesn't require a sub-query:
select distinct on (uid) *
from the_table
order by uid, something
You have to supply an order by (which is something you should have done with row_number() as well) to get stable results - otherwise the chosen row is "random". Note that the order by list must start with the distinct on () expression, here uid.
The above is true for Postgres. You also tagged your question with amazon-redshift - I have no idea if Redshift (which is in fact a very different DBMS) supports the same thing nor if it is as efficient.