sub query in select clause with hive - hiveql

I am unable to figure out a way to achieve below functionality through a valid query in Hive. Intention is get the top rated movies in a released in a year based on weighted average.
To be more clear this is what I should be able to do in hive in a single query.
var allMoviesRated = select count(movieid) where year(from_unixtime(unixtime)) = 1997;
select movieid, avg(rating), count(movieid), avg(rating)/allMoviesRated as weighted from
(select movieid, rating, year(from_unixtime(unixtime)) as year from u_data where u_data_new.year = 1997) u_data_new group by movieid order by weighted desc limit 10;

sadly .. I dnt think there is a way to this in a single query using a subquery to get a count of all movies rated.
You may to write a script which execute 2 queries
First query one fetch the allMoviesRated and that is stored in a script variable.
Second query is your ranking query to which this value is passed using hiveconf
Thus your script can look like
your script.bash or python------------start--------
var allMoviesRated = os.cmd (hive -S "use db; select count(distinct movieid);")
ranking = os.cmd ( hive -S -hiveconf NUM_MOVIES = allMoviesRated -f ranking_query.hql)
your script.bash or python------------end--------
ranking_query.hql:
select movieid, avg(rating), count(movieid), avg(rating)/${hiveconf:NUM_MOVIES }as weighted
from (
select movieid, rating, year(from_unixtime(unixtime)) as year
from u_data where u_data_new.year = 1997) u_data_new
group by movieid order by weighted desc limit 10;

Related

Use postgresql query results to form another query

I am trying to select from one table using the select result from another table. I can run this in two queries but would like to optimize it into just one.
First query.. Select ids where matching other id
select id from lookuptable where paid = '547'
This results in something like this
6316352
6316353
6318409
6318410
6320468
6320469
6320470
6322526
6322527
6324586
6324587
6326648
I would like to then use this result to make another selection. I can do it manually like below. Note, there could be many rows with these values so I've been using a IN statement
select * from "othertable" where id in (6316352,6316353,6318409,6318410,6320468,6320469,6320470,6322526,6322527,6324586,6324587,6326648);
select
ot.*
from
"othertable" as ot
join
lookuptable as lt
on
ot.id = lt.id
where
lt.paid = '547'
The IN operator supports not just value lists but also subqueries, so you can literally write
select * from "othertable" where id in (select id from lookuptable where paid = '547');

How to keep one record in specific column and make other record value 0 in group by clause in PostgreSQL?

I have a set of data like this
The Result should look Like this
My Query
SELECT max(pi.pi_serial) AS proforma_invoice_id,
max(mo.manufacturing_order_master_id) AS manufacturing_order_master_id,
max(pi.amount_in_local_currency) AS sales_value,
FROM proforma_invoice pi
JOIN schema_order_map som ON pi.pi_serial = som.pi_id
LEFT JOIN manufacturing_order_master mo ON som.mo_id = mo.manufacturing_order_master_id
WHERE to_date(pi.proforma_invoice_date, 'DD/MM/YYYY') BETWEEN to_date('01/03/2021', 'DD/MM/YYYY') AND to_date('19/04/2021', 'DD/MM/YYYY')
AND pi.pi_serial in (9221,
9299)
GROUP BY mo.manufacturing_order_master_id,
pi.pi_serial
ORDER BY pi.pi_serial
Option 1: Create a "Running Total" field in Crystal Reports to sum up only one "sales_value" per "proforma_invoice_id".
Option 2: Add a helper column to your Postgresql query like so:
case
when row_number()
over (partition by proforma_invoice_id
order by manufacturing_order_master_id)
= 1
then sales_value
else 0
end
as sales_value
I prepared this SQLFiddle with an example for you (and would of course like to encourage you to do the same for your next db query related question on SO, too :-)

how to filter data array_agg from postgresql

I have view at postgres db from this query
SELECT order_product.order_id,
array_agg(order_product.product_id) AS itemset
FROM order_product
GROUP BY order_product.order_id
ORDER BY order_product.order_id;
and this is the structure look like:
And the question is, how can U filter data at (itemset) just show where the value is more than 1 (example: don't show = {8}, just show the value when containing 2 data or more like this = {8,10})
Use the having() clause:
SELECT op.order_id,
array_agg(op.product_id) AS itemset
FROM order_product op
GROUP BY op.order_id
HAVING count(*) > 1 --<< here
ORDER BY op.order_id;

EXISTS in filter returning too many values

I need to write a query that uses EXISTS, rather than IN, so that it will run fast. The filter is being fed so many parameter values that EXISTS seems like the only option. The difference is between a 20+ minute query and a 5 second query.
This is the query I have:
SELECT DISTINCT d.GROUP_NAME
FROM [EMPLOYEE] e JOIN [DATA_FACT] d ON (e.KEY = d.KEY)
WHERE d.DATE BETWEEN #Start and #End
AND EXISTS
(
select '1234567' -- #ID
)
AND e.Location IN (#Location)
ORDER BY d.GROUP_NAME ASC
The problem is that it is returning too many records. Based on the values I'm passing to filter on, I should get 1 row back but instead I am getting 28.
If I remove the EXISTS and add the following then I get the 1 record I need:
AND e.ID IN ('1234567')
Is there a way to fix the query to work with EXISTS so that I get the correct results?
This is essentially what you want if you are going to try to use exists to filter your data_fact table by parameters in your employee table. Not sure how much it's going to improve your performance though when you throw a massive number of employee IDs at it.
SELECT
d.GROUP_NAME
FROM [DATA_FACT] AS d
WHERE d.DATE BETWEEN #Start and #End
AND EXISTS
(
select 1
from EMPLOYEE AS e
WHERE d.[KEY] = e.[KEY]
AND e.[Location] IN (#Location)
AND e.ID IN ('1234567')
)
ORDER BY d.GROUP_NAME ASC

Tableau - Calculating average where date is less than value from another data source

I am trying to calculate the average of a column in Tableau, except the problem is I am trying to use a single date value (based on filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: Parameters will not work for me here, since new date values are being added constantly to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully can pull the filtered date, but the formula does not work as expected. 2 versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
Below a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT([AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] = P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY[ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated],
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order of taking the average ie
average(IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated]) then [Raw Score]) END)
as written, I believe you'll be averaging the data before returning it from the if statement, whereas you want to return the data, then average it.