how does one implement max(count(field_1)) in Hive? - hiveql

I ran a query in Hive whose result has two columns (year and count):
1900 2
1901 5
1902 7
1903 3
1904 5
I need to find the maximum count and return both the year and the count; the expected answer is:
1902 7
I ran a nested query as I would in SQL, but it gives a parse error saying "..cannot recognize input 'select' in expression specification.."
Can anyone let me know? Thanks.
regards,
Rahul

Use the collect_max UDF from Brickhouse ( http://github.com/klout/brickhouse ), which returns the keys and values with the maximum values:
select collect_max( year, count , 1 )
from mytable;
Or, if you want separate columns:
select array_index( map_keys( map_max ), 0 ) as max_year,
array_index( map_values( map_max ), 0 ) as max_value
from
( select collect_max( year, count, 1 ) as map_max from mytable ) t;
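If installing Brickhouse is not an option, sorting by the count and keeping the first row returns the same year/count pair. The sketch below is only an illustration of that idea: it runs an `ORDER BY ... LIMIT 1` query against an in-memory SQLite table from Python rather than against Hive, and the table name `mytable` and column name `cnt` are placeholders for the result of the original aggregation.

```python
import sqlite3

# Hypothetical stand-in for the (year, count) result set from the question.
rows = [(1900, 2), (1901, 5), (1902, 7), (1903, 3), (1904, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (year INTEGER, cnt INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

# Sort by the count, largest first, and keep only the top row.
top = conn.execute(
    "SELECT year, cnt FROM mytable ORDER BY cnt DESC LIMIT 1"
).fetchone()
print(top)
```

Note that this form returns a single row even when several years share the maximum count.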

Related

powerBI Dax - two unrelated tables/dates comparison/if overlap, sum variable in one table

I would like to create a measure total_count from two unrelated tables (Table 1 & Table 2)-
Table 1
start date    end date      count
22/05/2020    31/07/2020    3
25/06/2021    08/12/2021    4
17/01/2022    10/08/2022    6
15/05/2020    11/10/2022    10
Table 2
program date
01/06/2020
31/07/2021
27/03/2022
RESULT -
program date    total_count
01/06/2020      13
31/07/2021      14
27/03/2022      16
Criteria: if the program date in Table 2 falls within a period in Table 1, add that period's "count" from Table 1 to the total.
I have attempted to create a measure (total_count) in Table 2:
total_count =
CALCULATE(
SUM(Table 1[count]),
DATESBETWEEN('Table 2'[program date], (Table 1[start date]), (Table 1[end date]))
)
Can I get help creating a correct measure? Much appreciated.
total_count =
CALCULATE(
SUMX(
FILTER(Table1,
Table1[end date] >= MAX(Table2[program date]) && Table1[start date] <= MAX(Table2[program date])
),
Table1[count])
)
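The filter condition in the measure above can be checked outside Power BI. The following is a minimal Python sketch of the same overlap rule, using the sample rows from the question; the function name `total_count` and the tuple layout are illustrative choices, not part of any DAX API.

```python
from datetime import date

# Sample data copied from the question (dates are day/month/year).
table1 = [
    (date(2020, 5, 22), date(2020, 7, 31), 3),
    (date(2021, 6, 25), date(2021, 12, 8), 4),
    (date(2022, 1, 17), date(2022, 8, 10), 6),
    (date(2020, 5, 15), date(2022, 10, 11), 10),
]
table2 = [date(2020, 6, 1), date(2021, 7, 31), date(2022, 3, 27)]


def total_count(program_date, periods):
    """Sum the counts of every period that contains program_date."""
    return sum(c for start, end, c in periods if start <= program_date <= end)


result = {d.isoformat(): total_count(d, table1) for d in table2}
print(result)
```

Both period boundaries are treated as inclusive here, matching the `>=` and `<=` in the measure.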

KDB Select rows from a table based on one of its column while comparing it to another table

I have table1 as below.
num value
1 10
2 15
3 20
table2
ver value
1.0 5
2.0 15
3.0 18
Output should be as below. I need to select all rows from table1 such that table1.value <= table2.value.
num value
1 10
2 15
I tried this, it's not working.
select from table1 where value <= (exec value from table2)
From a logical point of view what you're asking kdb to compare is:
10 15 20<=5 15 18
Because these are equal lengths, kdb assumes you mean pairwise comparison, aka
10<=5
15<=15
20<=18
to which it would return
q)10 15 20<=5 15 18
010b
What you actually seem to mean (based on your expected output) is 10 15 20<=max(5 15 18). So in that case you would want:
q)t1:([]num:1 2 3;val:10 15 20)
q)t2:([]ver:1 2 3.;val:5 15 18)
q)select from t1 where val<=exec max val from t2
num val
-------
1 10
2 15
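The difference between the two comparisons can also be seen outside q. This is a small Python sketch, with the tables represented as lists of dicts purely for illustration, that contrasts the pairwise comparison with the comparison against the maximum.

```python
t1 = [{"num": 1, "val": 10}, {"num": 2, "val": 15}, {"num": 3, "val": 20}]
t2 = [{"ver": 1.0, "val": 5}, {"ver": 2.0, "val": 15}, {"ver": 3.0, "val": 18}]

# Pairwise comparison, which is what the original query asked for.
pairwise = [a["val"] <= b["val"] for a, b in zip(t1, t2)]

# Comparison against the single largest value in t2.
limit = max(row["val"] for row in t2)
selected = [row for row in t1 if row["val"] <= limit]

print(pairwise)
print(selected)
```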
As an aside, you can't/shouldn't have a column called value, as it clashes with a keyword.
value is a keyword so don't assign to it.
Assuming you want all values from table1 with value less than the max value in table2 you could do:
q)table1:([]num:til 3;val:10 15 20)
q)table2:([]ver:`float$til 3;val:5 15 18)
q)select from table1 where val<=max table2`val
num val
-------
0 10
1 15

Postgresql : Average over a limit of Date with group by

I have a table like this
item_id date number
1 2000-01-01 100
1 2003-03-08 50
1 2004-04-21 10
1 2004-12-11 10
1 2010-03-03 10
2 2000-06-29 1
2 2002-05-22 2
2 2002-07-06 3
2 2008-10-20 4
I'm trying to get the average for each unique item_id over its last 3 dates.
It's difficult because there are missing dates in between, so a range of hardcoded dates doesn't always work.
I expect a result like :
item_id MyAverage
1 10
2 3
I don't really know how to do this. Currently I can do it for one item, but I have trouble extending it to multiple items:
SELECT AVG(MyAverage.number) FROM (
SELECT date,number
FROM item_list
where item_id = 1
ORDER BY date DESC limit 3
) as MyAverage;
My main problem is generalising the "DESC limit 3" over a group by id.
Attempt:
SELECT item_id,AVG(MyAverage.number)
FROM (
SELECT item_id,date,number
FROM item_list
ORDER BY date DESC limit 3) as MyAverage
GROUP BY item_id;
The limit is messing things up there.
I have made it "work" using BETWEEN date AND date, but that is not what I want because I need a limit and not a hardcoded date.
Can anybody help?
You can use row_number() to assign 1 to 3 to the records with the latest dates for each ID and then filter on that.
SELECT x.item_id,
avg(x.number)
FROM (SELECT il.item_id,
il.number,
row_number() OVER (PARTITION BY il.item_id
ORDER BY il.date DESC) rn
FROM item_list il) x
WHERE x.rn BETWEEN 1 AND 3
GROUP BY x.item_id;
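To try the row_number() approach without a PostgreSQL server, the sketch below loads the sample rows into an in-memory SQLite database from Python and runs essentially the same query. This assumes a SQLite build with window-function support (3.25 or later); the `ORDER BY x.item_id` is added only to make the printed order deterministic.

```python
import sqlite3

# Sample rows from the question: (item_id, date, number).
rows = [
    (1, "2000-01-01", 100), (1, "2003-03-08", 50), (1, "2004-04-21", 10),
    (1, "2004-12-11", 10), (1, "2010-03-03", 10),
    (2, "2000-06-29", 1), (2, "2002-05-22", 2), (2, "2002-07-06", 3),
    (2, "2008-10-20", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_list (item_id INTEGER, date TEXT, number INTEGER)")
conn.executemany("INSERT INTO item_list VALUES (?, ?, ?)", rows)

# Number each item's rows from newest to oldest, keep the first three,
# then average per item.
query = """
SELECT x.item_id, avg(x.number)
FROM (SELECT il.item_id,
             il.number,
             row_number() OVER (PARTITION BY il.item_id
                                ORDER BY il.date DESC) AS rn
      FROM item_list il) x
WHERE x.rn <= 3
GROUP BY x.item_id
ORDER BY x.item_id
"""
averages = conn.execute(query).fetchall()
print(averages)
```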

Multi-Column, Multi-Row PIVOT

Consider that I have a table which contains data in the following form:
Foo_FK MonthCode_FK Activity_FK SumResultsX SumResultsY
-----------------------------------------------------------
1 201312 0 10 2
1 201312 1 5 1
1 201401 0 15 3
1 201401 1 7 2
2 201312 0 9 3
2 201312 1 1 2
2 201401 0 6 2
2 201401 1 17 4
For my purposes, it is safe to assume that this table is an aggregation which would have been created by a GROUP BY on Foo_FK, MonthCode_FK, Activity_FK with SUM( ResultsA ), SUM( ResultsB ) to obtain the data, making Foo_FK, MonthCode_FK, Activity_FK unique per record.
If for some reason I found it preferable to PIVOT this table in a stored procedure to ease the amount of screwing around with SSRS I'd have to do ( and undoubtedly later maintain ), wishing to get the following format for consumption via a matrix tablix thingy:
Foo_FK 1312_0_X 1312_0_Y 1312_1_X 1312_1_Y 1401_0_X 1401_0_Y 1401_1_X 1401_1_Y
--------------------------------------------------------------------------------------
1 10 2 5 1 15 3 7 2
2 9 3 1 2 6 2 17 4
How would I go about doing this in a not-mental way? Please refer to this SQL Fiddle as proof that I am likely trying to use a hammer to build a device that pushes in nails. Don't worry about a dynamic version, as I'm sure I can figure that out once I'm guided through the static solution for this test case.
Right now, I've tried to create a Foo_FK, MonthCode_FK set via the following, which I then attempt to PIVOT ( see the Fiddle for the full mess ):
SELECT Foo_FK = ISNULL( a0.Foo_FK, a1.Foo_FK ),
MonthCode_FK = ISNULL( a0.MonthCode_FK, a1.MonthCode_FK ),
[0_X] = ISNULL( a0.SumResultX, 0 ),
[0_Y] = ISNULL( a0.SumResultY, 0 ),
[1_X] = ISNULL( a1.SumResultX, 0 ),
[1_Y] = ISNULL( a1.SumResultY, 0 )
FROM ( SELECT Foo_FK, MonthCode_FK, Activity_FK,
SumResultX, SumResultY
FROM dbo.t_FooActivityByMonth
WHERE Activity_FK = 0 ) a0
FULL OUTER JOIN (
SELECT Foo_FK, MonthCode_FK, Activity_FK,
SumResultX, SumResultY
FROM dbo.t_FooActivityByMonth
WHERE Activity_FK = 1 ) a1
ON a0.Foo_FK = a1.Foo_FK;
I have come across some excellent advice on this SO question, so I'm in the process of performing some form of UNPIVOT before I twist everything back out using PIVOT and MAX, but if there's a better way to do this, I'm all ears.
It seems that you should be able to do this by applying unpivot to your SumResultX and SumResultY columns first, then pivoting the data:
;with cte as
(
select Foo_FK,
col = cast(MonthCode_FK as varchar(6))+'_'
+cast(activity_fk as varchar(1))+'_'+sumresult,
value
from dbo.t_FooActivityByMonth
cross apply
(
values
('X', SumResultX),
('Y', SumResultY)
) c (sumresult, value)
)
select Foo_FK,
[201312_0_X], [201312_0_Y], [201312_1_X], [201312_1_Y],
[201401_0_X], [201401_0_Y], [201401_1_X], [201401_1_Y]
from cte
pivot
(
max(value)
for col in ([201312_0_X], [201312_0_Y], [201312_1_X], [201312_1_Y],
[201401_0_X], [201401_0_Y], [201401_1_X], [201401_1_Y])
) piv;
See SQL Fiddle with Demo
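The unpivot-then-pivot idea is independent of SQL Server syntax. As a rough illustration, this Python sketch applies the same two steps to the sample rows from the question: each row is first split into one (label, value) pair per measure, and each pair is then stored under a composite column name. The dict-of-dicts layout is an assumption made for the example.

```python
# Sample rows: (Foo_FK, MonthCode_FK, Activity_FK, SumResultX, SumResultY).
rows = [
    (1, 201312, 0, 10, 2), (1, 201312, 1, 5, 1),
    (1, 201401, 0, 15, 3), (1, 201401, 1, 7, 2),
    (2, 201312, 0, 9, 3), (2, 201312, 1, 1, 2),
    (2, 201401, 0, 6, 2), (2, 201401, 1, 17, 4),
]

pivoted = {}
for foo, month, activity, x, y in rows:
    wide = pivoted.setdefault(foo, {})
    # Unpivot: one (label, value) pair per measure column.
    for label, value in (("X", x), ("Y", y)):
        # Pivot: the composite name becomes the output column.
        wide[f"{month}_{activity}_{label}"] = value

for foo, wide in sorted(pivoted.items()):
    print(foo, wide)
```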

How to find the average of certain records T-SQL

I have a table variable that I am dumping data into:
DECLARE @TmpTbl_SKUs AS TABLE
(
Vendor VARCHAR (255),
Number VARCHAR(4),
SKU VARCHAR(20),
PurchaseOrderDate DATETIME,
LastReceivedDate DATETIME,
DaysDifference INT
)
Some records don't have a purchase order date or a last received date, so the days difference is null as well. I have tried a lot of inner joins of the table on itself, but the query takes too long or returns incorrect data most of the time.
Is it possible to get the average days difference per SKU? How would I check whether there is only 1 record for that SKU? If there is only 1 record, I have to find the average at the champvendor level.
Here is the structure:
Vendor has many Numbers and Numbers has many SKUs
Any help would be great, I can't seem to crack this one, nor can I find anything related to this online. Thanks in advance.
Here is some sample data:
Vendor Number SKU PurchaseOrderDate LastReceivedDate DaysDifference
OTHER PMDD 1111 OP1111 2009-08-21 00:00:00.000 2009-09-02 00:00:00.000 12
OTHER PMDD 1111 OP1112 2009-12-09 00:00:00.000 2009-12-17 00:00:00.000 8
MANTOR 3333 MA1111 2006-02-15 00:00:00.000 2006-02-23 00:00:00.000 8
MANTOR 3333 MA1112 2006-02-15 00:00:00.000 2006-02-23 00:00:00.000 8
I'm sorry, I may have written this wrong. If there is only 1 record for a SKU, then I want to return its DaysDifference (if it's not null). If there is more than 1 record and they are not null, then return the average days difference. If they are all null, then check at the vendor level for the average of the SKUs that are not null; otherwise it should just return 7. This is what I have tried:
SELECT t1.SKU, ISNULL
(
AVG(t1.DaysDifference),
(
SELECT ISNULL(AVG(t2.DaysDifference), 7)
FROM @TmpTbl_SKUs t2
WHERE t2.SKU=t1.SKU
GROUP BY t2.ChampVendor, t2.VendorNumber, t2.SKU
)
)
FROM @TmpTbl_SKUs t1
GROUP BY t1.SKU
I keep playing with this. I have partly got what I want, but I don't understand how to check whether a SKU has multiple records, or how to check at the vendor level.
Try this:
EDITED: added NULLIF(..., 0) to treat 0s as NULLs.
SELECT
t1.SKU,
COALESCE(
NULLIF(AVG(t1.DaysDifference), 0),
NULLIF(t2.AvgDifferenceVendor, 0),
7
) AS AvgDiff
FROM @TmpTbl_SKUs t1
INNER JOIN (
SELECT Vendor, AVG(DaysDifference) AS AvgDifferenceVendor
FROM @TmpTbl_SKUs
GROUP BY Vendor
) t2 ON t1.Vendor = t2.Vendor
GROUP BY t1.SKU, t2.AvgDifferenceVendor
EDIT 2: how I tested the script.
For testing I'm using the sample data posted with the question.
DECLARE @TmpTbl_SKUs AS TABLE
(
Vendor VARCHAR (255),
Number VARCHAR(4),
SKU VARCHAR(20),
PurchaseOrderDate DATETIME,
LastReceivedDate DATETIME,
DaysDifference INT
)
INSERT INTO @TmpTbl_SKUs
(Vendor, Number, SKU, PurchaseOrderDate, LastReceivedDate, DaysDifference)
SELECT 'OTHER PMDD', '1111', 'OP1111', '2009-08-21 00:00:00.000', '2009-09-02 00:00:00.000', 12
UNION ALL
SELECT 'OTHER PMDD', '1111', 'OP1112', '2009-12-09 00:00:00.000', '2009-12-17 00:00:00.000', 8
UNION ALL
SELECT 'MANTOR', '3333', 'MA1111', '2006-02-15 00:00:00.000', '2006-02-23 00:00:00.000', 8
UNION ALL
SELECT 'MANTOR', '3333', 'MA1112', '2006-02-15 00:00:00.000', '2006-02-23 00:00:00.000', 8;
First I'm running the script on the unmodified data. Here's the result:
SKU AvgDiff
-------------------- -----------
MA1111 8
MA1112 8
OP1111 12
OP1112 8
AvgDiff for every SKU is identical to the original DaysDifference for every SKU, because there's only one row per each one.
Now I'm changing DaysDifference for SKU='MA1111' to 0 and running the script again. The result is:
SKU AvgDiff
-------------------- -----------
MA1111 4
MA1112 8
OP1111 12
OP1112 8
Now AvgDiff for MA1111 is 4. Why? Because the average for the SKU is 0, and so the average by Vendor is taken, which has been calculated as (0 + 8) / 2 = 4.
Next step is to set DaysDifference to 0 for all the SKUs of the same Vendor. In this case I'm setting it for SKUs MA1111 and MA1112. Here's the result of the script for this change:
SKU AvgDiff
-------------------- -----------
MA1111 7
MA1112 7
OP1111 12
OP1112 8
So now AvgDiff is 7 for both MA1111 and MA1112. How has it become so? Both have DaysDifference = 0. That means that the average by Vendor should be taken for each one. But Vendor average is 0 too in this case. According to the requirement, the average here should default to 7, which is what the script has returned.
So the script seems to be working correctly. I understand that it's either me having missed something or you having forgotten to mention some details. In any case, I would be glad to see where this script fails to solve your problem.
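The fallback chain tested above (SKU average, then vendor average, then 7, with 0 treated as missing) can be mirrored in a few lines of Python. This is only a sketch of the logic under stated assumptions: the helper names are made up for the example, rows are reduced to (vendor, sku, days_difference) tuples, and `//` stands in for the integer AVG of an INT column, which matches only for non-negative values.

```python
def avg_or_none(values):
    """Integer average of the non-None values, or None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) // len(present) if present else None


def nullif_zero(value):
    """Mimic NULLIF(value, 0): turn 0 into None, pass everything else through."""
    return None if value == 0 else value


def avg_diff_by_sku(rows, default=7):
    """rows: (vendor, sku, days_difference) tuples; returns {sku: AvgDiff}."""
    result = {}
    for vendor, sku, _ in rows:
        sku_avg = nullif_zero(avg_or_none([d for v, s, d in rows if s == sku]))
        vendor_avg = nullif_zero(avg_or_none([d for v, s, d in rows if v == vendor]))
        # COALESCE: first non-None of SKU average, vendor average, default.
        result[sku] = next(x for x in (sku_avg, vendor_avg, default) if x is not None)
    return result


base = [("OTHER PMDD", "OP1111", 12), ("OTHER PMDD", "OP1112", 8),
        ("MANTOR", "MA1111", 8), ("MANTOR", "MA1112", 8)]
one_zero = base[:2] + [("MANTOR", "MA1111", 0), ("MANTOR", "MA1112", 8)]
both_zero = base[:2] + [("MANTOR", "MA1111", 0), ("MANTOR", "MA1112", 0)]

print(avg_diff_by_sku(base))
print(avg_diff_by_sku(one_zero))
print(avg_diff_by_sku(both_zero))
```

The three calls correspond to the three test runs described in the answer.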