How to use Hive variable substitution correctly - hiveql

When I use variable substitution in Hive, I get some errors, and I need your help.
My code:
set hievar:b='on t1.id=t2.id where t2.id is null';
select * from t_old as t1 full outer join t_new as t2 ${b};
When I run this code in the Hive shell, it gives me an error about ${b}.
I also tried this:
set hivevar:c='select * from t_old as t1 full outer join t_new as t2 on t1.id=t2.id where t2.id is null';
${c};
It gives me the same error.

Fix the hivevar namespace name (in your code it is hievar) and remove the quotes, because quotes are also passed through as-is in Hive.
Example:
set hivevar:b=where 1=1; -- without quotes
select 1 ${hivevar:b}; -- you can also omit hivevar: as in your example
Result:
OK
1
Time taken: 0.129 seconds, Fetched: 1 row(s)
Second example:
hive> set hivevar:c=select 1 where 1=1;
hive> ${c};
OK
1
Time taken: 0.491 seconds, Fetched: 1 row(s)
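For reference, this is why the quoted version in the question fails: the quotes become part of the substituted text. A sketch of the expansion, using the variable from the question:

```sql
set hivevar:b='on t1.id=t2.id where t2.id is null';
select * from t_old as t1 full outer join t_new as t2 ${b};
-- the variable is substituted verbatim, quotes included, so Hive sees:
--   select * from t_old as t1 full outer join t_new as t2
--     'on t1.id=t2.id where t2.id is null';
-- i.e. a quoted string literal where a join condition should be,
-- which is why the parser complains about ${b}'s expansion
```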

Related

PostgreSQL query is fast with one filter value, very slow with another value

I have two tables each with 100 million+ rows, table_1 and table_2.
We insert 60,000+ rows with the same date of "today" in one column of each table. This "date" field is indexed in each of the two tables.
We're doing this insert every day.
Following the insert, this query runs in 1 second:
select count(*)
from ((select field1 from table_1 where date_field = 'yyyy-mm-dd' -- yesterday's date
      ) a
      INNER JOIN
      (select field1 from table_2 where date_field = 'yyyy-mm-dd' -- yesterday's date
      ) b
      ON a.field1 = b.field1) c
while the same query runs for 6 hours:
select count(*)
from ((select field1 from table_1 where date_field = 'yyyy-mm-dd' -- today's date
      ) a
      INNER JOIN
      (select field1 from table_2 where date_field = 'yyyy-mm-dd' -- today's date
      ) b
      ON a.field1 = b.field1) c
Tomorrow, this query will run in 1 second, and the new day's date query will run for 6 hours.
I'm totally puzzled. Why does the same query run in 1 second against older data, but for 6 hours against the recently inserted data? And why, the next day, does the 6-hour query run in 1 second while that day's query runs for 6 hours?
I would check the explain plans for the two queries; I suspect that statistics are being gathered automatically some time after your slow-performing run, which changes the execution plan by the next day.
Edit: So the fix would be to invoke ANALYZE after loading the new data.
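A minimal sketch of that fix, assuming the table names from the question; run it right after the daily bulk insert so the planner has fresh statistics for today's date before the count query runs:

```sql
-- after the daily bulk insert:
ANALYZE table_1;
ANALYZE table_2;

-- optionally verify which plan the planner now chooses for today's date:
EXPLAIN
select count(*)
from (select field1 from table_1 where date_field = current_date) a
     INNER JOIN
     (select field1 from table_2 where date_field = current_date) b
     ON a.field1 = b.field1;
```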

SQL Group By that works in SQLite does not work in Postgres

This statement works in SQLite, but not in Postgres:
SELECT A.*, B.*
FROM Readings A
LEFT JOIN Offsets B ON A.MeterNum = B.MeterNo AND A.DateTime > B.TimeDate
WHERE A.MeterNum = 1
GROUP BY A.DateTime
ORDER BY A.DateTime DESC
The Readings table contains electric submeter readings, each with a date stamp. The Offsets table holds an adjustment that the user enters after a failed meter is replaced with a new one that starts again at zero. Without the GROUP BY, the query returns a row for each meter reading paired with every prior adjustment made before the reading date, while I only want the latest adjustment.
All the docs I've seen on GROUP BY in Postgres indicate I should be including an aggregate function, which I don't need and can't use (the Reading column contains the Modbus string returned from the meter).
Just pick the latest offset in a derived table. In Postgres this can be done quite efficiently using distinct on ():
SELECT A.*, B.*
FROM readings A
left join (
  select distinct on (meterno) o.*
  from offsets o
  order by o.meterno, o.timedate desc
) B ON A.MeterNum = B.MeterNo AND A.DateTime > B.TimeDate
WHERE A.meternum = 1
ORDER BY A.DateTime DESC
distinct on () will return only one row per meterno, and that is the "latest" row thanks to the order by ..., timedate desc.
The query might even be faster by pushing the condition on datetime > timedate into the derived table using a lateral join:
SELECT A.*, B.*
FROM readings A
left join lateral (
  select distinct on (meterno) o.*
  from offsets o
  where a.datetime > o.timedate
  order by o.meterno, o.timedate desc
) B ON A.MeterNum = B.MeterNo
WHERE A.meternum = 1
ORDER BY A.DateTime DESC

How to implement time tolerance in SQL for general data within a column?

SUMMARY: I have two data tables, table1 and table2, that I want to join in the following way:
- on a unique id value that is the same in both tables
- ALSO, there is a time value in both tables that needs to be within a certain vicinity of the other (e.g. 30 seconds)
- However, the data in both tables is in YYYY-MM-DD HH:MM:SS form
I have tried to do the following:
SELECT * FROM table1 AS t1
LEFT OUTER JOIN table2 t2
ON t1.id = t2.id
AND t2.message_time BETWEEN t1.time + INTERVAL '20 seconds' AND t1.time - INTERVAL '20 seconds'
but it keeps printing out the same row over and over as I loop through the result in psycopg2. Is there a better way to do this? Is the query referring to the respective t1 columns properly?
You need to switch the BETWEEN parameters:
BETWEEN smaller_value AND greater_value
So your join condition needs to be:
ON t1.id = t2.id
AND t2.message_time BETWEEN t1.message_time - interval '20 seconds' AND t1.message_time + interval '20 seconds'
demo:db<>fiddle
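Putting the corrected condition back into the full query is a small step, but for completeness, a sketch using the column names from the question (t1.time and t2.message_time as written there):

```sql
SELECT *
FROM table1 AS t1
LEFT OUTER JOIN table2 AS t2
  ON  t1.id = t2.id
  -- smaller bound first, greater bound second:
  AND t2.message_time BETWEEN t1.time - INTERVAL '20 seconds'
                          AND t1.time + INTERVAL '20 seconds';
```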

Dividing 2 count statements in Postgresql

I have a question about the division of the two count statements below, which gives me the error shown underneath.
(SELECT COUNT(transactions.transactionNumber)
 FROM transactions
 INNER JOIN account ON account.sfid = transactions.accountsfid
 INNER JOIN transactionLineItems
         ON transactions.transactionNumber = transactionLineItems.transactionNumber
 INNER JOIN products ON transactionLineItems.USIM = products.USIM
 WHERE products.gender = 'male' AND products.agegroup = 'adult'
   AND transactions.transactionDate >= current_date - interval '730' day)
/
(SELECT COUNT(transactions.transactionNumber)
 FROM transactions
 WHERE transactions.transactionDate >= current_date - interval '730' day)
ERROR: syntax error at or near "/"
LINE 6: ...tions.transactionDate >= current_date - interval '730' day)/
I think the problem is that my count statements each produce a result set, and dividing those result sets directly is what fails. How can I make this division work?
Afterwards I want to check the result against a percentage, e.g. < 0.2.
Can anyone help me with this?
Is that your complete query? Something like this works in Postgres 10:
SELECT
(SELECT COUNT(id) FROM test WHERE state = false) / (SELECT COUNT(id) FROM test WHERE state = true) as y
The extra SELECT in front of the two subqueries and the division is what's important; without it I also get the error you mentioned.
See also my DB Fiddle version of this query.
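One more pitfall before comparing the result against 0.2: in Postgres, count() returns bigint, so dividing one count by another performs integer division and yields 0 whenever the numerator is smaller than the denominator. A sketch of the cast, using the same hypothetical test table as above, with a guard against an empty denominator:

```sql
SELECT
  (SELECT COUNT(id) FROM test WHERE state = false)::numeric
  /
  NULLIF((SELECT COUNT(id) FROM test WHERE state = true), 0) AS y;
-- ::numeric forces decimal division instead of integer division;
-- NULLIF turns a zero denominator into NULL rather than raising
-- a division-by-zero error
```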

Postgresql running sum of previous groups?

Given the following data:
sequence | amount
---------+--------
       1 | 100000
       1 |  20000
       2 |  10000
       2 |  10000
I'd like to write a sql query that gives me the sum of the current sequence, plus the sum of the previous sequence. Like so:
sequence | current | previous
---------+---------+---------
       1 |  120000 |        0
       2 |   20000 |   120000
I know the solution likely involves window functions, but I'm not too sure how to implement it without subqueries.
SQL Fiddle
select
  seq,
  amount,
  lag(amount::int, 1, 0) over (order by seq) as previous
from (
  select seq, sum(amount) as amount
  from sa
  group by seq
) s
order by seq
If your sequence is sequential, without holes, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;
You can see both approaches in this fiddle