Hive Calculation of Percentage - command

I am trying to write a simple query to calculate the percentage of occurrence of distinct instances in a table.
Can I do this in a single query?
Below is my code, which gives me an error:
select 100 * total_sum/sum(total_sum) from jav_test;

In the past when I have had to do similar things this is the approach I've taken:
SELECT
jav_test.total_sum AS total_sum,
withsum.total_sum AS sum_of_all_total_sum,
100 * (jav_test.total_sum / withsum.total_sum) AS percentage
FROM
jav_test,
(SELECT sum(total_sum) AS total_sum FROM jav_test) withsum -- This computes sum(total_sum) here as a single-row single-column table aliased as "withsum"
;
The presence of the total_sum and sum_of_all_total_sum columns in the output is just to convince myself that the correct math took place - the one you are interested in is percentage, based on the query you posted in the question.
After populating a small dummy table, this was the result:
hive> describe jav_test;
OK
total_sum int
Time taken: 1.777 seconds, Fetched: 1 row(s)
hive> select * from jav_test;
OK
28
28
90113
90113
323694
323694
Time taken: 0.797 seconds, Fetched: 6 row(s)
hive> SELECT
> jav_test.total_sum AS total_sum,
> withsum.total_sum AS sum_of_all_total_sum,
> 100 * (jav_test.total_sum / withsum.total_sum) AS percentage
> FROM jav_test, (SELECT sum(total_sum) AS total_sum FROM jav_test) withsum;
...
... lots of mapreduce-related spam here
...
Total MapReduce CPU Time Spent: 3 seconds 370 msec
OK
28 827670 0.003382990805514275
28 827670 0.003382990805514275
90113 827670 10.887551802046708
90113 827670 10.887551802046708
323694 827670 39.10906520714777
323694 827670 39.10906520714777
Time taken: 41.257 seconds, Fetched: 6 row(s)
hive>
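For anyone who wants to check the arithmetic without a Hive cluster, here is a minimal sketch of the same query shape run against an in-memory SQLite database from Python. The SQLite table is only a stand-in for jav_test, and the aliases t and w are my own. One difference to be aware of: SQLite does integer division on integers, so the sketch multiplies by 100.0, whereas Hive's / already returns a double.

```python
import sqlite3

# Hypothetical in-memory stand-in for the Hive table jav_test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jav_test (total_sum INTEGER)")
conn.executemany(
    "INSERT INTO jav_test VALUES (?)",
    [(28,), (28,), (90113,), (90113,), (323694,), (323694,)],
)

# Cross join every row with the one-row subquery that holds the grand total.
# 100.0 forces floating-point division in SQLite.
rows = conn.execute(
    """
    SELECT t.total_sum,
           w.total_sum AS sum_of_all_total_sum,
           100.0 * t.total_sum / w.total_sum AS percentage
    FROM jav_test AS t,
         (SELECT SUM(total_sum) AS total_sum FROM jav_test) AS w
    ORDER BY t.total_sum
    """
).fetchall()

for row in rows:
    print(row)
```

The percentages over all rows should add up to 100, which is a quick sanity check on the join.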

Related

Calculating average time of events with same ID

I'm trying to calculate the average of the elapsed time of events with the same id.
Event 1 started at 123 and ended at 129 -> it lasted 6 seconds.
Event 2 started at 134 and ended at 138 -> it lasted 4 seconds.
time   id
----------
123    1
125    1
129    1
134    2
138    2
The average would be 5 seconds.
Like this I just get all elapsed times, not grouped by ids.
SELECT elapsed(id)
FROM measurement1
GROUP BY id
You can use the query that gets the elapsed time per id as a subquery, then average those elapsed times across ids.
SELECT AVG(elapsed)
FROM (
SELECT MAX(time) - MIN(time) as elapsed
FROM measurement1
GROUP BY id
)
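As a rough check of the subquery approach, the sketch below loads the five sample rows from the question into an in-memory SQLite table (a stand-in for measurement1, not the asker's actual database) and runs the same two-level query.

```python
import sqlite3

# Hypothetical in-memory stand-in for measurement1, using the question's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement1 (time INTEGER, id INTEGER)")
conn.executemany(
    "INSERT INTO measurement1 VALUES (?, ?)",
    [(123, 1), (125, 1), (129, 1), (134, 2), (138, 2)],
)

# Inner query: one elapsed value (last time minus first time) per id.
# Outer query: the average of those per-id values.
(average_elapsed,) = conn.execute(
    """
    SELECT AVG(elapsed)
    FROM (
        SELECT MAX(time) - MIN(time) AS elapsed
        FROM measurement1
        GROUP BY id
    )
    """
).fetchone()

print(average_elapsed)  # 5.0
```

Event 1 contributes 6 and event 2 contributes 4, so the average is 5, matching the expected result in the question.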

How to fetch data in a given interval in postgresql

SELECT time,CEIL(AVG(value)) from table
where col1 = 1
and col2='matchThis'
and col3>='2022-04-10T18:30:00.00Z'
and col3<='2022-04-25T12:58:23.00Z'
and mod(to_char(col3, 'MI')::int, 15)=0
GROUP BY time
Sample response of the query to get 15-minute interval data
25-04-2022 01:00
25-04-2022 01:15
25-04-2022 01:30
25-04-2022 01:45
The above query works fine for 15, 30, and 60 minute intervals, but I need a query that returns interval data for any of the options shown below.
15 minutes
30 minutes
1 hour
2 hours
6 hours
12 hours
1 day
SELECT
ceil(avg(column_name)),
to_timestamp(floor((extract('epoch' from column_name) / 600 )) *600)
AT TIME ZONE 'UTC' as interval
FROM table_name
WHERE id=1
and column='value'
and col >='2022-04-21'
and col <= '2022-04-30'
GROUP BY interval ORDER BY interval ASC
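The floor-divide-then-multiply trick in that query is what assigns each row to a bucket; the 600 there means 10-minute buckets, and swapping in another number of seconds gives the other interval sizes. Here is a small Python sketch of the same arithmetic, assuming UTC-aligned buckets. The names INTERVAL_SECONDS and bucket_start are made up for illustration.

```python
from datetime import datetime, timezone

# Interval options from the question, expressed in seconds (hypothetical mapping).
INTERVAL_SECONDS = {
    "15 minutes": 900,
    "30 minutes": 1800,
    "1 hour": 3600,
    "2 hours": 7200,
    "6 hours": 21600,
    "12 hours": 43200,
    "1 day": 86400,
}


def bucket_start(ts: datetime, bucket_seconds: int) -> datetime:
    """Floor a timezone-aware timestamp to the start of its UTC-aligned bucket."""
    epoch = int(ts.timestamp())
    floored = epoch - (epoch % bucket_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc)


sample = datetime(2022, 4, 25, 1, 22, 30, tzinfo=timezone.utc)
print(bucket_start(sample, INTERVAL_SECONDS["15 minutes"]))  # 2022-04-25 01:15:00+00:00
print(bucket_start(sample, INTERVAL_SECONDS["6 hours"]))     # 2022-04-25 00:00:00+00:00
```

Because the buckets are counted from the Unix epoch, the multi-hour and daily buckets start at UTC midnight; a different time zone would need an offset applied before flooring.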

Postgres range select between timestamps

I have two timestamps and the difference between them in minutes, say 320 minutes. I need to split that into full hours plus a remainder: here that is 5 hours and 20 minutes, so I need to insert 6 rows with a minutes column (5 rows with 60 in the minutes column and a last one with 20).
What is the best way to do this in Postgres: a loop, or selecting numbers with a CTE?
WITH timestamps AS (
SELECT '2019-01-07 03:30:00'::timestamp as ts1, '2019-01-07 08:50:00'::timestamp as ts2
)
SELECT 60 as minutes
FROM timestamps, generate_series(1, date_part('hour', ts2 - ts1)::int)
UNION ALL
SELECT date_part('minute', ts2 - ts1)::int
FROM timestamps
date_part extracts the hour (or minute) value from the timestamp difference.
The generate_series function generates n rows with the value 60 (n = number of full hours).
UNION ALL adds the remaining minutes as a final row.
Edit: For more than 1 day:
Instead of date_part use EXTRACT(EPOCH FROM ...) which gives you the difference in seconds.
WITH timestamps AS (
SELECT '2019-01-06 03:30:00'::timestamp as ts1, '2019-01-07 08:50:00'::timestamp as ts2
)
SELECT 60 as minutes
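FROM timestamps, generate_series(1, floor(EXTRACT(EPOCH FROM ts2 - ts1) / 60 / 60)::int)
UNION ALL
SELECT floor(EXTRACT(EPOCH FROM ts2 - ts1) / 60)::int % 60
FROM timestamps
Convert the seconds into hours with / 60 / 60. The floor matters because casting a fractional value to int in PostgreSQL rounds to the nearest integer, so a remainder of 30 minutes or more would otherwise produce an extra 60-minute row.
Calculate the remaining minutes with / 60 % 60 (the division gives the total minutes, and the modulo operator % gives the minutes left over after the full hours).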
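To see the split on its own, here is a short Python sketch of the same idea, using integer (floor) division for the hour count. The function name split_into_hour_rows is hypothetical, and like the SQL it always emits the remainder row, even when the remainder is 0.

```python
from datetime import datetime
from typing import List


def split_into_hour_rows(ts1: datetime, ts2: datetime) -> List[int]:
    """Return one 60 per full hour between ts1 and ts2, then the leftover minutes."""
    total_minutes = int((ts2 - ts1).total_seconds()) // 60
    full_hours, remaining_minutes = divmod(total_minutes, 60)
    return [60] * full_hours + [remaining_minutes]


# Same-day example from the answer: 5 hours 20 minutes.
same_day = split_into_hour_rows(
    datetime(2019, 1, 7, 3, 30), datetime(2019, 1, 7, 8, 50)
)
print(same_day)  # [60, 60, 60, 60, 60, 20]

# Example spanning more than one day: 29 hours 20 minutes.
multi_day = split_into_hour_rows(
    datetime(2019, 1, 6, 3, 30), datetime(2019, 1, 7, 8, 50)
)
print(len(multi_day), multi_day[-1])  # 30 20
```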

How to subtract seconds from postgres datetime without having to add it in group by clause?

Say I have a column of type dateTime with the value "2014-04-14 12:17:55.772" and I need to subtract 2 seconds from it to get output like "12:17:53".
userid EndDate seconds
--------------------------------------------------------
1 "2014-04-14 12:17:14.295" 512
1 "2014-04-14 12:31:14.295" 12
2 "2014-04-14 12:48:14.295" 2
2 "2014-04-14 13:22:14.295" 12
and the query is
select (enddate::timestamp - (seconds* interval '1 second')) seconds, userid
from user
group by userid
I need to group by userid only, but because enddate and seconds appear in the select list, Postgres asks me to add them to the group by clause as well, which will not give me the correct output.
I am expecting data in this format, where I calculate start_time from end_time and the total seconds spent.
user : 1
start_time end_time total (seconds)
"12:17" "12:17" 1
"12:22" "12:31" 512
total: 513
user : 2
"12:43" "12:48" 288
"13:22" "13:22" 1
total 289
Is there some way I could avoid the group by clause here?
Like @IMSoP says, you can use a window function to include a total for each user in your query output:
SELECT userid
, (enddate - (seconds * interval '1 second')) as start_time
, enddate as end_time
, seconds
, sum(seconds) OVER (PARTITION BY userid) as total
FROM so23063314.user;
Then you would only display the parts of the row you're interested in for each subtotal line, and display the total at the end of each block.
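A quick way to try the window-function idea locally is the sketch below, which uses an in-memory SQLite table from Python (window functions need SQLite 3.25 or newer). The table name user_sessions is a made-up stand-in for the user table, and the start time is computed in Python rather than with Postgres interval arithmetic.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical stand-in for the question's table, loaded with its sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_sessions (userid INTEGER, enddate TEXT, seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO user_sessions VALUES (?, ?, ?)",
    [
        (1, "2014-04-14 12:17:14.295", 512),
        (1, "2014-04-14 12:31:14.295", 12),
        (2, "2014-04-14 12:48:14.295", 2),
        (2, "2014-04-14 13:22:14.295", 12),
    ],
)

# The window function attaches the per-user total to every row, no GROUP BY needed.
rows = conn.execute(
    """
    SELECT userid, enddate, seconds,
           SUM(seconds) OVER (PARTITION BY userid) AS total
    FROM user_sessions
    ORDER BY userid, enddate
    """
).fetchall()

result = []
for userid, enddate, seconds, total in rows:
    end_time = datetime.strptime(enddate, "%Y-%m-%d %H:%M:%S.%f")
    start_time = end_time - timedelta(seconds=seconds)  # end minus duration
    result.append((userid, start_time, end_time, seconds, total))
    print(userid, start_time.time(), end_time.time(), seconds, total)
```

Every row keeps its own start and end time while also carrying the total for its user, so the per-user subtotal line can be printed once at the end of each block.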

Retrieving the start and end hour queries correctly in PostgreSQL Query

I have a CTE-based query that retrieves hourly totals between two given timestamps. It works as follows:
Get the start and end datetimes (let's say 07-13-2011 00:21:09 and 07-31-2011 21:11:21).
Get the total query count for each hourly interval of each day (here from 00 to 21, a total of 21 hours, but this is parametric and depends on the hours I give as input).
This query works well, but there is a problem. For the start time it counts all the queries between 00:00:00 and 00:59:59 for each day instead of 00:21:09 - 00:59:59. The same applies to the end time: it counts all the queries between 21:00:00 and 22:00:00 for each day instead of 21:00:00 - 21:11:21. The other hour intervals (e.g. 03:00 - 04:00) are retrieved normally as flat one-hour intervals, with no minutes or seconds involved. How can I fix that? The query is below, thanks.
WITH cal AS (
SELECT generate_series('2011-02-02 00:00:00'::timestamp , '2012-04-01 05:00:00'::timestamp , '1 hour'::interval) AS stamp
)
, qqq AS (
SELECT date_trunc('hour', calltime) AS stamp
, count(*) AS zcount
FROM mytable
WHERE calltime >= '07-13-2011 00:21:09' AND calltime <='07-31-2011 21:11:21' AND date_part('hour', calltime) >= 0 AND date_part('hour', calltime) <= 21
GROUP BY date_trunc('hour', calltime)
)
SELECT cal.stamp
, COALESCE (qqq.zcount, 0) AS zcount
FROM cal
LEFT JOIN qqq ON cal.stamp = qqq.stamp
WHERE cal.stamp >= '07-13-2011 00:00:00' AND cal.stamp<='07-31-2011 21:11:21' AND date_part('hour', cal.stamp) >= 0 AND date_part('hour', cal.stamp) <= 21
ORDER BY stamp ASC;
EDIT:
What I mean is this: although I give 00:21:09 as the starting time, only the first day respects it. On the days after that, the first hour interval counts all queries between 00:00:00 and 01:00:00 instead of 00:21:09 and 01:00:00. The start time should apply to the first hour interval of every day (I could just as well give 04:30:21 as the starting time, and each day should then start counting hourly from there). The same goes for the ending interval 21:00:00 - 21:11:21: only the LAST day in the results uses it, while the days before it count all queries between 21:00:00 and 22:00:00.
For example, if there are 200 queries between 00:00:00 and 01:00:00 on July 14 2011 (the day after the start date, July 13) but only 159 of them fall between 00:21:09 and 01:00:00, I should get 159 instead of 200. Likewise, if there are 300 queries between 21:00:00 and 22:00:00 on any given day and 123 of them fall between 21:00:00 and 21:11:21, I should get 123 instead of 300. This applies to every single day; the other hourly intervals, such as 01:00 - 02:00 or 20:00 - 21:00, should be counted as usual. Everything is parametric: the hourly intervals and the start and end times depend on user input.
Adding AND calltime::time >= '00:21:09' AND calltime::time <= '21:11:21' to the WHERE calltime >= '07-13-2011 00:21:09' AND calltime <='07-31-2011 21:11:21' block solved the issue.
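To make the effect of the extra time-of-day condition concrete, here is a small Python sketch of the combined filter. The function name in_daily_window is hypothetical; it mirrors the two parts of the WHERE clause (the overall datetime range plus the calltime::time range) using the start and end values from the question.

```python
from datetime import datetime


def in_daily_window(calltime: datetime, range_start: datetime, range_end: datetime) -> bool:
    """True if calltime is inside the overall range AND inside the daily time window."""
    in_range = range_start <= calltime <= range_end
    in_time_of_day = range_start.time() <= calltime.time() <= range_end.time()
    return in_range and in_time_of_day


range_start = datetime(2011, 7, 13, 0, 21, 9)
range_end = datetime(2011, 7, 31, 21, 11, 21)

print(in_daily_window(datetime(2011, 7, 14, 0, 10, 0), range_start, range_end))   # False
print(in_daily_window(datetime(2011, 7, 14, 0, 30, 0), range_start, range_end))   # True
print(in_daily_window(datetime(2011, 7, 20, 21, 5, 0), range_start, range_end))   # True
print(in_daily_window(datetime(2011, 7, 20, 21, 30, 0), range_start, range_end))  # False
```

Calls before 00:21:09 or after 21:11:21 are now excluded on every day in the range, not just on the first and last day.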