How to reference a date range from another CTE in a where clause without joining to it?

I'm trying to write a query for Hive that uses the system date to determine both yesterday's date as well as the date 30 days ago. This will provide me with a rolling 30 days without the need to manually feed the date range to the query every time I run it.
I have that code working fine in a CTE. The problem I'm having is in referencing those dates in another CTE without joining the CTEs together, which I can't do since there's not a common field to join on.
I've tried various approaches but I get a "ParseException" every time.
WITH date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT * FROM myTable
WHERE date_id BETWEEN (SELECT start_date FROM date_range)
                  AND (SELECT end_date FROM date_range)
The intended result is the set of records from myTable that have a date_id between the start_date and end_date as found in the CTE date_range. Perhaps I'm going about this all wrong?

You can do a CROSS JOIN; it does not require an ON condition. Your date_range dataset is one row only, so you can CROSS JOIN it with your_table if necessary, and it will be transformed into a map-join (your small dataset will be broadcast to all the mappers and loaded into each mapper's memory, so it will work very fast). Check the EXPLAIN command output and make sure it is a map-join:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=250000000;
WITH date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT t.*
FROM myTable t
CROSS JOIN date_range d
WHERE t.date_id BETWEEN d.start_date AND d.end_date
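For instance, you can prefix the same statement with EXPLAIN and inspect the plan; if the conversion happened, you should see a Map Join Operator rather than a common (shuffle) join. A minimal sketch:
EXPLAIN
WITH date_range AS (
    SELECT
        CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT) AS start_date,
        CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT) AS end_date
)
SELECT t.*
FROM myTable t
CROSS JOIN date_range d
WHERE t.date_id BETWEEN d.start_date AND d.end_date;
-- Look for "Map Join Operator" in the printed plan.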
Also, instead of this, you can calculate the dates directly in the WHERE clause, without any CTE or join at all:
SELECT t.*
FROM myTable t
WHERE t.date_id BETWEEN CAST(from_unixtime(unix_timestamp()-30*60*60*24,'yyyyMMdd') AS INT)
                    AND CAST(from_unixtime(unix_timestamp()-1*60*60*24,'yyyyMMdd') AS INT)

Related

How to update counts by date in table A with the counts by date returned from join of table B and table C

I can do this using a temporary table. Is it possible to do these two steps in a single update query?
All possible dates already exist in the TargetTable (no inserts are necessary).
I'm hoping to make this more efficient since it is run often as batches of data periodically pour into table T2.
Table T1: the list of individual dates inserted or updated in this batch.
Table T2: a datetime2(3) field followed by several data fields; there may be thousands of rows for any particular date.
Goal: update TargetTable, which holds a date field followed by an int field containing the total record count by date (the records may have just come into T2, or may be additional records appended to ones already in T2).
select T1.date as TargetDate, count(*) as CountF1
into #Temp
from T1 inner join T2
    on T1.date = cast(T2.DateTime as date)
group by T1.date

update TargetTable
set TargetField1 = CountF1
from #Temp inner join TargetTable
    on TargetDate = TargetTable.Date
I agree with the recommendation of Zohar Peled: use a Common Table Expression, often abbreviated as "CTE", which can replace the temporary table in your scenario. You write a CTE using the WITH keyword; remember that in many cases you will need a semicolon before the WITH keyword (or at the end of the previous statement, if you prefer). The solution then looks like this:
;WITH CTE AS
(
    SELECT T1.date AS TargetDate, Count(*) AS CountF1
    FROM T1 INNER JOIN T2
        ON T1.date = Cast(T2.DateTime AS DATE)
    GROUP BY T1.date
)
UPDATE TargetTable
SET TargetField1 = CTE.CountF1
FROM CTE INNER JOIN TargetTable
    ON CTE.TargetDate = TargetTable.Date;
Here is more information on Common Table Expressions:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql
After having done this, another thing you might benefit from is adding a new column to table T2 with the datatype DATE, holding the value of Cast(T2.DateTime AS DATE); it might even be a (persisted) computed column. Then add an index on that new column. If you then join on the new column (instead of joining on the expression Cast(...)), it might run faster depending on the distribution of the data. The only way to tell whether it runs faster is to try it out.
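A minimal sketch of that idea, assuming T2's datetime2 column is named DateTime as described above (the column name DateOnly and the index name are illustrative, not from the question):
ALTER TABLE T2 ADD DateOnly AS Cast([DateTime] AS DATE) PERSISTED;  -- DateOnly is a hypothetical name
CREATE INDEX IX_T2_DateOnly ON T2 (DateOnly);
The join in the CTE can then read ON T1.date = T2.DateOnly and use the index.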

Count distinct users over n-days

My table consists of two fields: CalDay, a timestamp field with the time set to 00:00:00, and UserID.
Together they form a compound key, but it is important to keep in mind that there are many rows for each given calendar day, and no fixed number of rows per day.
Based on this dataset I need to calculate how many distinct users there are over a set window of time, say 30 days.
Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue with DENSE_RANK() OVER (... RANGE BETWEEN), because RANGE only accepts UNBOUNDED.
So I went the old-fashioned way and tried a scalar subquery:
SELECT
    xx.*,
    (
        SELECT COUNT(DISTINCT UserID)
        FROM data_table AS yy
        WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
    ) AS rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay
In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time, so I need something that scales and performs better.
I was thinking that maybe - just maybe - using the unix epoch instead of the timestamp could help, but it is only a wild guess. Any suggestion would be welcome.
This should work. I can't comment on exact speed, but it should take a lot less time than your current query. Hopefully you have indexes on both these fields.
SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
FROM data_table t1
JOIN data_table t2
ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY t1.calday
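If those indexes don't exist yet, a minimal sketch (the index names are illustrative):
CREATE INDEX idx_data_table_calday ON data_table (calday);
CREATE INDEX idx_data_table_userid ON data_table (userid);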
UPDATE
Tested it with a lot of data. The above works but is slow. Much faster to do it like this:
SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
FROM (
    SELECT calday, COUNT(DISTINCT userid) AS daily
    FROM data_table
    GROUP BY calday
) t1
JOIN data_table t2
    ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
GROUP BY 1, 2
So instead of building up a massive table of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. This keeps the join much smaller and returns quickly (just under 1 second for 45,000 rows in the source table on my system).

PostgreSQL row diff timestamp, and calculate stddev for group

I have a table with an ID column called mmsi and another column of timestamp, with multiple timestamps per mmsi.
For each mmsi I want to calculate the standard deviation of the difference between consecutive timestamps.
I'm not very experienced with SQL but have tried to construct a function as follows:
SELECT
    mmsi, stddev(time_diff)
FROM
    (SELECT mmsi,
            EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY mmsi ASC, timestamp ASC)))
     FROM ais_messages.ais_static
     ORDER BY mmsi ASC, timestamp ASC) AS time_diff
WHERE time_diff IS NOT NULL
GROUP BY mmsi;
Your query is on the right track, but it has several problems. You labelled your subquery, which looks almost right, with the alias time_diff, and then selected that alias as if it were a column; since the subquery returns multiple rows and columns, this doesn't make sense. The EXTRACT expression also needs its own column alias, and the LAG window should be partitioned by mmsi (not just ordered by it), so that differences are never computed across two different mmsi values. Here is a corrected version:
SELECT
    t.mmsi,
    STDDEV(t.time_diff) AS std
FROM
(
    SELECT
        mmsi,
        EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
            (PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
    FROM ais_messages.ais_static
    ORDER BY mmsi, timestamp
) t
WHERE t.time_diff IS NOT NULL
GROUP BY t.mmsi
This approach should be fine, but there is one edge case where it might not behave as expected: if a given mmsi group has only one record, it will not even appear in the result set of standard deviations, because the LAG calculation returns NULL for that single record and it is filtered out.
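If you would rather have such single-record groups appear (with a NULL standard deviation), one option is simply to drop the outer WHERE filter, since aggregates like STDDEV ignore NULL inputs anyway. A sketch:
SELECT t.mmsi, STDDEV(t.time_diff) AS std
FROM
(
    SELECT
        mmsi,
        EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
            (PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
    FROM ais_messages.ais_static
) t
GROUP BY t.mmsi;
-- Groups whose only time_diff is NULL now show std = NULL instead of vanishing.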

Specific number of quarters from date using HiveQL

I am trying to bring back a specific number (8) of quarters using a transaction date from a table. The date format is YYYYMMDD.
I could write a select using CASE to display a specific quarter depending on the current month.
I could find the beginning of the month using the trunc function, but could not find the logic to bring back the last 8 quarters of data.
Convert the date to Hive format first.
Then use DENSE_RANK() to number the rows by quarter (ordered by year descending and quarter descending), and filter by rnk<=8:
select * from
(
    --calculate DENSE_RANK
    select s.*, DENSE_RANK() over(order by year(your_date) desc, quarter(your_date) desc) as rnk
    from
    (
        --convert the YYYYMMDD string to Hive's yyyy-MM-dd date format
        --(date_col stands for your YYYYMMDD column)
        select t.*, from_unixtime(unix_timestamp(t.date_col,'yyyyMMdd'),'yyyy-MM-dd') your_date
        from table_name t
        --Also restrict your dataset here to only the last few years,
        --because you do not need to scan all the data:
        --add a WHERE clause here
    )s
)s where rnk<=8;
See manual on functions here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
You can optimize this query by using what you know about your data and how your table is partitioned to restrict the dataset. Also add a PARTITION BY clause to the OVER() if you need the last 8 quarters for each key.
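For example, if each key should get its own last 8 quarters, the ranking line could look like this (key_col is a placeholder for your key column):
DENSE_RANK() over(partition by key_col
                  order by year(your_date) desc, quarter(your_date) desc) as rnk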

Use generate_series to create a table

In Amazon Redshift, generate_series() seems to be supported on the leader node, but not on the compute nodes. Is there a way to use generate_series to create a table on the leader node, and then push it to the compute nodes?
This query runs fine, running on the leader node:
with
date_table as (select now()::date - generate_series(0, 7 * 10) as date),
hour_table as (select generate_series(0, 24) as hour),
time_table as (
select
date_table.date::date as date,
extract(year from date_table.date) as year,
extract(month from date_table.date) as month,
extract(day from date_table.date) as day,
hour_table.hour
from date_table CROSS JOIN hour_table
)
SELECT *
from time_table
However, this query fails:
create table test
diststyle all
as (
with
date_table as (select now()::date - generate_series(0, 7 * 10) as date),
hour_table as (select generate_series(0, 24) as hour),
time_table as (
select
date_table.date::date as date,
extract(year from date_table.date) as year,
extract(month from date_table.date) as month,
extract(day from date_table.date) as day,
hour_table.hour
from date_table CROSS JOIN hour_table
)
SELECT *
from time_table
);
The only solution I can think of right now is to pull the query results into another program (e.g. python) and then insert the result into the database, but that seems hackish.
For those of you who've never used Redshift: it's a heavily modified variant of PostgreSQL, and it has lots of its own idiosyncrasies. The below query is completely valid and runs fine:
create table test diststyle all as (select 1 as a, 2 as b);
select * from test
yields:
a b
1 2
The problem stems from the difference between leader-node-only functions and compute-node functions on Redshift. I'm pretty sure it's not due to a bug in my query.
I have not found a way to use leader-node only functions to create tables. There is not (AFAICT) any magic syntax that you can use to make them load their output back to a table.
I ended up using number tables to achieve a similar outcome. Even a huge number table will take up very little space on your Redshift cluster with runlength compression.
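A rough sketch of that workaround, matching the question's schema (numbers and some_big_table are placeholder names; any existing table with enough rows works as a row source, since generate_series cannot populate a compute-node table):
-- One-off: build a numbers table 0..N-1 on the compute nodes
CREATE TABLE numbers DISTSTYLE ALL AS
SELECT ROW_NUMBER() OVER () - 1 AS n
FROM some_big_table
LIMIT 1704;  -- 71 days * 24 hours, as in the question

-- Derive the date/hour dimension with plain arithmetic
CREATE TABLE time_table DISTSTYLE ALL AS
SELECT
    GETDATE()::DATE - (n / 24)::INT AS date,
    (n % 24)::INT AS hour
FROM numbers;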