Take the following data and queries:
create table if not exists my_example(a_group varchar(1)
,the_date date
,metric numeric(4,3)
);
INSERT INTO my_example
VALUES ('1','2018-12-14',0.514)
,('1','2018-12-15',0.532)
,('2','2018-12-15',0.252)
,('3','2018-12-14',0.562)
,('3','2018-12-15',0.361);
select
t1.the_date
,t1.a_group
,t1.metric AS current_metric
,lag(t1.metric, 1) OVER (ORDER BY t1.a_group, t1.the_date) AS previous_metric
from
my_example t1;
Which yields the following results:
+------------+---------+----------------+-----------------+
| the_date | a_group | current_metric | previous_metric |
+------------+---------+----------------+-----------------+
| 2018-12-14 | 1 | 0.514 | NULL |
| 2018-12-15 | 1 | 0.532 | 0.514 |
| 2018-12-15 | 2 | 0.252 | 0.532 |
| 2018-12-14 | 3 | 0.562 | 0.252 |
| 2018-12-15 | 3 | 0.361 | 0.562 |
+------------+---------+----------------+-----------------+
I expected the value of previous_metric for the lone a_group==2 row to be NULL. However, as you can see, the value is showing as 0.532, which is being picked up from the previous row. How can I modify this query to yield a value of NULL as I expected?
You need to use LAG with a partition on a_group, since you want the lag values from a specific frame:
SELECT
t1.the_date,
t1.a_group,
t1.metric AS current_metric,
LAG(t1.metric, 1) OVER (PARTITION BY t1.a_group ORDER BY t1.the_date)
AS previous_metric
FROM my_example t1;
Related
When I create a Phoenix table there are two extra rows in SYSTEM.CATALOG. These are the first and the second rows in the output of SELECT * FROM SYSTEM.CATALOG ...... below. Can someone please help me understand what these two rows signify?
The third and fourth rows in the output of SELECT * FROM SYSTEM.CATALOG ...... below are easily relatable to the CREATE TABLE statement. Therefore, they look fine.
0: jdbc:phoenix:t40aw2.gaq> CREATE TABLE C5 (company_id INTEGER PRIMARY KEY, name VARCHAR(225));
No rows affected (4.618 seconds)
0: jdbc:phoenix:t40aw2.gaq> select * from C5;
+-------------+-------+
| COMPANY_ID | NAME |
+-------------+-------+
+-------------+-------+
No rows selected (0.085 seconds)
0: jdbc:phoenix:t40aw2.gaq> SELECT * FROM SYSTEM.CATALOG WHERE TABLE_NAME='C5';
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
| TENANT_ID | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | COLUMN_FAMILY | TABLE_SEQ_NUM | TABLE_TYPE | PK_NAME | COLUMN_COUNT | SALT_BUCKETS | DATA_TABLE_NAME | INDEX_STATE | IM |
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
| | | C5 | | | 0 | u | | 2 | null | | | fa |
| | | C5 | | 0 | null | | | null | null | | | |
| | | C5 | COMPANY_ID | | null | | | null | null | | | |
| | | C5 | NAME | 0 | null | | | null | null | | | |
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
4 rows selected (0.557 seconds)
0: jdbc:phoenix:t40aw2.gaq>
The Phoenix version I am using is: 4.1.8.29
Kindly note that no other operations where done on the table other than the 3 listed above, namely, create table, select * from the table, and select * from system.catalog where TABLE_NAME=the concerned table name.
I have a query that I'm attempting to optimize and running into some unexpected behavior.
For example, take a table like this:
CREATE TABLE objects
(id int, external_id int, timestamp timestamp);
INSERT INTO objects
(id, external_id, timestamp)
VALUES
(1, 1, '2019-08-16 12:00:00'),
(2, 1, NULL),
(3, 2, '2019-08-16 12:00:00'),
(4, 2, NULL);
I use a query as so to partition the objects by their external_id and then select the maximum value for timestamp across the partition:
SELECT *,
max(timestamp) OVER (PARTITION BY external_id) as max_timestamp,
row_number() OVER (PARTITION BY external_id ORDER BY timestamp ASC NULLS FIRST, id ASC)
FROM objects;
This produces the following (which is what I'm looking for):
[Results]:
| id | external_id | timestamp | max_timestamp | row_number |
|----|-------------|----------------------|----------------------|------------|
| 2 | 1 | (null) | 2019-08-16T12:00:00Z | 1 |
| 1 | 1 | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2 |
| 4 | 2 | (null) | 2019-08-16T12:00:00Z | 1 |
| 3 | 2 | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2 |
I want to remove the multiple window functions in favor of a single one. For example:
SELECT *,
max(timestamp) OVER w as max_timestamp,
row_number() OVER w
FROM objects
WINDOW w AS (PARTITION BY external_id ORDER BY timestamp ASC NULLS FIRST, id ASC);
However, doing this produces a different result with the max_timestamp set to NULL for half the results:
[Results]:
| id | external_id | timestamp | max_timestamp | row_number |
|----|-------------|----------------------|----------------------|------------|
| 2 | 1 | (null) | (null) | 1 |
| 1 | 1 | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2 |
| 4 | 2 | (null) | (null) | 1 |
| 3 | 2 | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2 |
Why would introducing a sort order to the window function have any effect on the return of max?
http://sqlfiddle.com/#!17/1fcc4/4
I have a table which contains Null values. I need to replace them with a previous non-Null value.
This is an example of data which I have:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | Null |
2018-01-03 | A | 0 | Null |
2018-01-04 | A | 0 | Null |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | Null |
2018-01-07 | B | 0 | Null |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | Null |
2018-01-10 | A | 0 | Null |
The result should look like this:
date | category | start_period | period_number |
------------------------------------------------------
2018-01-01 | A | 1 | 1 |
2018-01-02 | A | 0 | 1 |
2018-01-03 | A | 0 | 1 |
2018-01-04 | A | 0 | 1 |
2018-01-05 | B | 1 | 2 |
2018-01-06 | B | 0 | 2 |
2018-01-07 | B | 0 | 2 |
2018-01-08 | A | 1 | 3 |
2018-01-09 | A | 0 | 3 |
2018-01-10 | A | 0 | 3 |
I tried the following query, but in this case, only the first Null value will be replaced.
select
date,
category,
start_period,
case
when period_number isnull then lag(period_number) over()
else period_number
end as period_number
from period_table;
Also, I tried to use first_value() window function, but I don't know how to set up the correct window.
Any help is highly appreciated.
You can join table with itself and get desired value. Assuming your date column is the primary key or unique.
update your_table upd set period_number = tbl.period_number
from
(
select b.date, max(b2.date) as d2 from your_table b
inner join d_batch_tab b2 on b2.date< b.date and b2.period_number is not null
group by b.date
)t
inner join your_table tbl on tbl.date = t.d2
where t.date= upd.date
If you don't need to update the table but only a select statement then
select yt.date, yt.category, yt.start_period, tbl.period_number
from your_table yt
inner join
(
select b.date, max(b2.date) as d2 from your_table b
inner join d_batch_tab b2 on b2.date< b.date and b2.period_number is not null
group by b.date
)t on yt.date = t.date
inner join your_table tbl on tbl.date = t.d2
If you replace your case statement with:
(
select
_.period_number
from
period_table as _
where
_.period_number is not null
and _.category = period_table.category
and _.date <= period_table.date
order by
_.date desc
limit 1
) as period_number
Then it should have the intended effect. It's nowhere near as elegant as a window function but I don't think window functions are quite flexible enough for your specific use case here (Or at least, if they are, I don't know how to flex them that much)
Examples of windows function and frame clause:
select
date,category,score
,FIRST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW
) as last_score
from testing.rec_test
order by date, category
select
date,category,score
,LAST_VALUE(score) OVER (
PARTITION BY category
ORDER BY date RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) as last_score
from testing.rec_test
order by date, category
I have a table with hundreds of millions of records in 'prices' table with only four columns: uid, price, unit, dt. dt is a datetime in standard format like '2017-05-01 00:00:00.585'.
I can quite easily to select a period using
SELECT uid, price, unit from prices
WHERE dt > '2017-05-01 00:00:00.000'
AND dt < '2017-05-01 02:59:59.999'
What I can't understand how to select price for every last record in each second. (I also need a very first one of each second too, but I guess it will be a similar separate query). There are some similar example (here), but they did not work for me when I try to adapt them to my needs generating errors.
Could some please help me to crack this nut?
Let say that there is a table which has been generated with a help of this command:
CREATE TABLE test AS
SELECT timestamp '2017-09-16 20:00:00' + x * interval '0.1' second As my_timestamp
from generate_series(0,100) x
This table contains an increasing series of timestamps, each timestamp differs by 100 milliseconds (0.1 second) from neighbors, so that there are 10 records within each second.
| my_timestamp |
|------------------------|
| 2017-09-16T20:00:00Z |
| 2017-09-16T20:00:00.1Z |
| 2017-09-16T20:00:00.2Z |
| 2017-09-16T20:00:00.3Z |
| 2017-09-16T20:00:00.4Z |
| 2017-09-16T20:00:00.5Z |
| 2017-09-16T20:00:00.6Z |
| 2017-09-16T20:00:00.7Z |
| 2017-09-16T20:00:00.8Z |
| 2017-09-16T20:00:00.9Z |
| 2017-09-16T20:00:01Z |
| 2017-09-16T20:00:01.1Z |
| 2017-09-16T20:00:01.2Z |
| 2017-09-16T20:00:01.3Z |
.......
The below query determines and prints the first and the last timestamp within each second:
SELECT my_timestamp,
CASE
WHEN rn1 = 1 THEN 'First'
WHEN rn2 = 1 THEN 'Last'
ELSE 'Somwhere in the middle'
END as Which_row_within_a_second
FROM (
select *,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp
) rn1,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp DESC
) rn2
from test
) xx
WHERE 1 IN (rn1, rn2 )
ORDER BY my_timestamp
;
| my_timestamp | which_row_within_a_second |
|------------------------|---------------------------|
| 2017-09-16T20:00:00Z | First |
| 2017-09-16T20:00:00.9Z | Last |
| 2017-09-16T20:00:01Z | First |
| 2017-09-16T20:00:01.9Z | Last |
| 2017-09-16T20:00:02Z | First |
| 2017-09-16T20:00:02.9Z | Last |
| 2017-09-16T20:00:03Z | First |
| 2017-09-16T20:00:03.9Z | Last |
| 2017-09-16T20:00:04Z | First |
| 2017-09-16T20:00:04.9Z | Last |
| 2017-09-16T20:00:05Z | First |
| 2017-09-16T20:00:05.9Z | Last |
A working demo you can find here
I had to create a cross tab table from a Query where dates will be changed into column names. These order dates can be increase or decrease as per the dates passed in the query. The order date is in Unix format which is changed into normal format.
Query is following:
Select cd.cust_id
, od.order_id
, od.order_size
, (TIMESTAMP 'epoch' + od.order_date * INTERVAL '1 second')::Date As order_date
From consumer_details cd,
consumer_order od,
Where cd.cust_id = od.cust_id
And od.order_date Between 1469212200 And 1469212600
Order By od.order_id, od.order_date
Table as follows:
cust_id | order_id | order_size | order_date
-----------|----------------|---------------|--------------
210721008 | 0437756 | 4323 | 2016-07-22
210721008 | 0437756 | 4586 | 2016-09-24
210721019 | 10749881 | 0 | 2016-07-28
210721019 | 10749881 | 0 | 2016-07-28
210721033 | 13639 | 2286145 | 2016-09-06
210721033 | 13639 | 2300040 | 2016-10-03
Result will be:
cust_id | order_id | 2016-07-22 | 2016-09-24 | 2016-07-28 | 2016-09-06 | 2016-10-03
-----------|----------------|---------------|---------------|---------------|---------------|---------------
210721008 | 0437756 | 4323 | 4586 | | |
210721019 | 10749881 | | | 0 | |
210721033 | 13639 | | | | 2286145 | 2300040