Wrong results with analytic functions in Hive - hiveql

I am trying to use analytic functions with a partitioning clause in Hive but am getting wrong results.
For example, the data is as follows:
col1 col2
a 1
a 2
a 3
d 1
d 2
e 1
e 2
dense_rank() over(partition by col1,col2)
is giving 1 as the result for all rows.
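For reference, the full query as described would be something like this (the table name t is assumed here):
select col1, col2,
       dense_rank() over (partition by col1, col2) as rnk
from t;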
Do we have to enable analytic functions with some set options?
Does the underlying table need to be partitioned?
I am using Hive on CDH 5.

Related

How to efficiently join two huge tables by nearest timestamp?

I have two huge tables, A and B. Table A has around 500 million rows of time-series data. Table B has around 10 million rows of time-series data. To simplify, we can assume they consist of the following columns:
Table A
factory  machine  timestamp_1          part  suplement
1        1        2022-01-01 23:54:01  1     1
1        1        2022-01-01 23:54:05  1     2
1        1        2022-01-01 23:54:10  1     3
...      ...      ...                  ...   ...
Table B
machine  timestamp_2          measure
1        2022-01-01 23:54:00  0
1        2022-01-01 23:54:07  10
1        2022-01-01 23:54:08  0
...      ...                  ...
I want to create a table C that results from "joining" both tables by matching each value of timestamp_1 in table A to the nearest value of timestamp_2 in table B whose measure is 0, for the same factory and machine. I also only need this for the part = 1 rows of table A. For the small example above, the resulting table C would have the same number of rows as A and would look like:
Table C
machine  timestamp_1          time_since_measure_0
1        2022-01-01 23:54:01  1
1        2022-01-01 23:54:05  5
1        2022-01-01 23:54:10  2
...      ...                  ...
Some things that are also important to consider are:
Table A has an index on columns (factory, machine, timestamp_1, part, suplement). That index is essential and works great for other queries not related to this one. Table B has indexes on columns (machine, timestamp_2, measure).
Table A is a compressed TimescaleDB table partitioned by (factory, timestamp_1), again because of other queries. Table B is a vanilla PostgreSQL table.
I used the following statement to create table C:
create table C (
    machine int4 not null,
    timestamp_1 timestamptz,
    time_since_measure_0 interval,
    constraint C primary key (machine, timestamp_1)
);
I then tried this code to select and insert data into table C:
insert into C (
    select
        machine,
        timestamp_1,
        timestamp_1 - (
            select timestamp_2
            from B
            where
                A.machine = B.machine
                and B.measure = 0
                and B.timestamp_2 <= A.timestamp_1
            order by B.timestamp_2 desc
            limit 1
        ) as "time_since_measure_0"
    from A
    where A.part = 1
);
However, this seems to take a very long time. I know I am dealing with very big tables, but is there something I am missing, or how could I optimize this?
Because we don't have access to your tables and you haven't posted a query plan, it's difficult to do more than make some general observations. The indexes you describe as being in place do not appear to be useful to this query. Looking at your query, it appears to me that you need to add the following indexes:
Table A
Index on (machine, timestamp_1)
Table B
Index on (machine, measure, timestamp_2)
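In DDL form, those suggestions would look something like this (the index names here are just placeholders):
create index a_machine_ts_idx on A (machine, timestamp_1);
create index b_machine_measure_ts_idx on B (machine, measure, timestamp_2);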
Give that a shot and see what happens.
What you want is called an "as-of join". That joins each timestamp to the nearest value in the other table.
Some time-series databases, like ClickHouse, support this directly. This is the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order, and join to the nearest-value row instead of the equal-value row.
I've looked into it briefly and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have similar performance to your query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
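For reference, a rough sketch of that lateral-join workaround applied to the tables above (this is an illustration rather than the linked post's code, and it assumes the covering index on B (machine, measure, timestamp_2) suggested in the other answer):
select
    a.machine,
    a.timestamp_1,
    a.timestamp_1 - b.timestamp_2 as time_since_measure_0
from A a
left join lateral (
    -- nearest measure-0 reading at or before this row's timestamp
    select timestamp_2
    from B
    where B.machine = a.machine
      and B.measure = 0
      and B.timestamp_2 <= a.timestamp_1
    order by B.timestamp_2 desc
    limit 1
) b on true
where a.part = 1;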

Issue with using percentile_cont function in PostgreSQL

This is my table
ID Total
1 2019.21
3 87918.32
2 562900.3
3 982688.98
1 56788.34
2 56792.32
3 909728.23
Now I would like to find the 25th, 50th, 75th, 90th, and 100th percentiles of the values (Total) in the above table. Assume my table consists of a whole lot of data (some 2 million records of the same format). I've used the following code:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY Total) as disc_func
FROM my_table
The error I've come across:
ERROR: syntax error at or near "("
LINE 3: percentile_disc(0.5) WITHIN GROUP (ORDER BY total...
You are using PostgreSQL < 9.4, which does not support WITHIN GROUP:
https://www.postgresql.org/docs/9.4/static/functions-aggregate.html
https://www.postgresql.org/docs/9.3/static/functions-aggregate.html
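Once you are on 9.4 or later, the query from the question works as written. For the full list of percentiles asked about, note that percentile_cont (and percentile_disc) also accepts an array of fractions, so all of them can be computed in one pass; a sketch using the question's table and column:
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75, 0.9, 1.0])
       WITHIN GROUP (ORDER BY Total) AS percentiles
FROM my_table;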

SQL: How to prevent double summing

I'm not exactly sure what the term is for this, but when you join two tables that have a many-to-many relationship and you want to sum up one of the variables, the same values can end up being summed over and over again.
What I want to accomplish is to prevent this from happening. How do I make sure that my sum function returns the correct number?
I'm using PostgreSQL
Example:
Table 1              Table 2
SampleID  DummyName  SampleID  DummyItem
1         John       1         5
1         John       1         4
2         Doe        1         5
3         Jake       2         3
3         Jake       2         3
3                    2
If I join these two tables ON SampleID, and I want to sum the DummyItem for each DummyName, how can I do this without double summing?
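To illustrate what goes wrong, a naive join-then-sum over the example data (lowercase names table_1 and table_2 assumed, as in the answer below) counts John's items twice:
select t1.dummyname, sum(t2.dummyitem) as total_items
from table_1 t1
join table_2 t2 on t2.sampleid = t1.sampleid
-- John appears twice in table_1, so each of his matching DummyItem rows is summed twice
group by t1.dummyname;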
The solution is to first aggregate and then do the join:
select t1.sampleid, t1.dummyname, t.total_items
from table_1 t1
join (
    select t2.sampleid, sum(t2.dummyitem) as total_items
    from table_2 t2
    group by t2.sampleid
) t ON t.sampleid = t1.sampleid;
The real question, however, is: why are there duplicates in table_1?
I would take a step back and reassess the database design. Specifically, what rules allow such duplicate data?
To address your specific issue given your data, here's one option: create a temp table that contains unique rows from Table 1, then join the temp table with Table 2 to get the sums I think you are expecting.
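That suggestion might look roughly like this (a sketch: "unique rows" is taken to mean SELECT DISTINCT, and the lowercase table_1/table_2 names from the other answer are reused):
-- deduplicate Table 1 into a temp table
create temp table table_1_dedup as
select distinct sampleid, dummyname
from table_1;

-- then join the deduplicated rows to Table 2 and sum
select d.sampleid, d.dummyname, sum(t2.dummyitem) as total_items
from table_1_dedup d
join table_2 t2 on t2.sampleid = d.sampleid
group by d.sampleid, d.dummyname;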

How to fuse values from 2 separate columns into a common column in PostgreSQL?

Is there a simple way to fuse values from two separate (albeit similar) columns in PostgreSQL?
For example, the following statement:
SELECT a, b FROM stuff;
would currently result in:
a b
-----------
1 2
1 3
1 4
However, I'd like to have the two columns fused in the following way:
ab
---
1
1
1
2
3
4
If you need to get both results from the same complex query without losing performance, try something like:
WITH source AS
(SELECT A,B
FROM your_complex_query)
SELECT A as AB
FROM source
UNION ALL
SELECT B as AB
FROM source
select a as ab from stuff
union all
select b from stuff
order by 1

T-SQL - CROSS APPLY to a PIVOT? (using pivot with a table-valued function)?

I have a table-valued function, basically a split-type function, that returns up to 4 rows per string of data.
So I run:
select * from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
I get:
Seq Data
1 15
2 25
However, my end data needs to look like
15 25
so I do a pivot.
select [1],[2],[3],[4]
from dbo.split('a','1,a15,b20,c40;2,a25,d30;3,e50')
pivot (max(data) for seq in ([1],[2],[3],[4]))
as pivottable
which works as expected:
1 2
--- ---
15 25
HOWEVER, that's great for one row. I now need to do it for several hundred records at once. My thought is to do a CROSS APPLY, but not sure how to combine a CROSS APPLY and a PIVOT.
(yes, obviously the easy answer is to write a modified version that returns 4 columns, but that's not a great option for other reasons)
Any help greatly appreciated.
And the reason I'm doing this: the current query uses a scalar-valued version of SPLIT, called 12 times within the same SELECT against the same million rows (where the data string is 500+ bytes).
So far as I know, that would require it to scan the same 500 bytes × 1,000,000 rows, 12 times.
This is how you use CROSS APPLY. Assume table1 is your table and Line is the field in your table that you want to split:
SELECT * FROM table1 AS a
CROSS APPLY dbo.split(a.Line) AS b
PIVOT (MAX(data) FOR seq IN ([1],[2],[3],[4])) AS p