Implement ROW_NUMBER() in beamSQL - apache-beam

I have the below query :
SELECT DISTINCT Summed, ROW_NUMBER () OVER (order by Summed desc) as Rank from table1
I have to write it in Apache Beam(beamSql). Below is my code :
PCollection<BeamRecord> rec_2_part2 = rec_2.apply(BeamSql.query("SELECT DISTINCT Summed, ROW_NUMBER(Summed) OVER (ORDER BY Summed) Rank1 from PCOLLECTION "));
But I'm getting the below error :
Caused by: java.lang.UnsupportedOperationException: Operator: ROW_NUMBER is not supported yet!
Any idea how to implement ROW_NUMBER() in beamSql ?

Here is one way you can approximate your current query without using ROW_NUMBER:
SELECT
t1.Summed,
(SELECT COUNT(*) FROM (SELECT DISTINCT Summed FROM table1) t2
WHERE t2.Summed >= t1.Summed) AS Rank
FROM
(
SELECT DISTINCT Summed
FROM table1
) t1
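The basic idea is to first use a subquery to get a table with only the distinct Summed values. Then, use a correlated subquery to simulate the row number: for each value, count how many distinct values are greater than or equal to it, which reproduces the descending order of the original query. This isn't a very efficient method, but if ROW_NUMBER is not available, then you're stuck with some alternative.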
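To check the counting logic outside Beam, here is a minimal sketch that runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. The function name, the sample values, and the alias Rnk are assumptions for illustration; this says nothing about which of these constructs BeamSql itself accepts.

```python
import sqlite3

def rank_distinct_without_row_number(values):
    """Rank distinct Summed values in descending order using only COUNT(*)."""
    conn = sqlite3.connect(":memory:")
    # table1 / Summed mirror the names in the question; the data is made up.
    conn.execute("CREATE TABLE table1 (Summed INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        """
        SELECT t1.Summed,
               (SELECT COUNT(*)
                  FROM (SELECT DISTINCT Summed FROM table1) t2
                 WHERE t2.Summed >= t1.Summed) AS Rnk
          FROM (SELECT DISTINCT Summed FROM table1) t1
         ORDER BY Rnk
        """
    ).fetchall()
    conn.close()
    return rows

# Duplicates collapse first, then each value is ranked by how many
# distinct values are at least as large.
print(rank_distinct_without_row_number([10, 30, 20, 30, 10]))
# [(30, 1), (20, 2), (10, 3)]
```

Because the ranking runs over distinct values only, there are no ties, so the count matches what ROW_NUMBER() OVER (ORDER BY Summed DESC) would assign.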

The solution which worked for the above query:
PCollection<BeamRecord> rec_2 = rec_1.apply(BeamSql.query("SELECT max(Summed) as maxed, max(Summed)-10 as least, 'a' as Dummy from PCOLLECTION"));

Related

How to translate SQL to DAX, Need to add FILTER

I want to create calculated table that will summarize In_Force Premium from existing table fact_Premium.
How can I filter the result by saying:
TODAY() has to be between `fact_Premium[EffectiveDate]` and (SELECT TOP 1 fact_Premium[ExpirationDate] ORDER BY QuoteID DESC)
In SQL I'd do that like this:
WHERE CONVERT(date, getdate()) between CONVERT(date, tblQuotes.EffectiveDate)
and (
select top 1 q2.ExpirationDate
from Table2 Q2
where q2.ControlNo = Table1.controlno
order by quoteid desc
)
Here is my DAX statement so far:
In_Force Premium =
FILTER(
ADDCOLUMNS(
SUMMARIZE(
//Grouping necessary columns
fact_Premium,
fact_Premium[QuoteID],
fact_Premium[Division],
fact_Premium[Office],
dim_Company[CompanyGUID],
fact_Premium[LineGUID],
fact_Premium[ProducerGUID],
fact_Premium[StateID],
fact_Premium[ExpirationDate]
),
"Premium", CALCULATE(
SUM(fact_Premium[Premium])
),
"ControlNo", CALCULATE(
DISTINCTCOUNT(fact_Premium[ControlNo])
)
), // Here I need to make sure TODAY() falls between fact_Premium[EffectiveDate] and (SELECT TOP 1 fact_Premium[ExpirationDate] ORDER BY QuoteID DESC)
)
Also, which would be more efficient: creating the calculated table from fact_Premium, or building the same table with a SQL statement (Get Data --> SQL Server)?
There are 2 potential ways in T-SQL to get the next effective date. One is to use LEAD() and the other is to use an APPLY operator. As there are few facts to work with, here are samples:
select *
from (
select *
, lead(EffectiveDate) over(partition by CompanyGUID order by quoteid desc) as NextEffectiveDate
from Table1
join Table2 on ...
) d
or
select table1.*, oa.NextEffectiveDate
from Table1
outer apply (
select top(1) q2.ExpirationDate AS NextEffectiveDate
from Table2 Q2
where q2.ControlNo = Table1.controlno
order by quoteid desc
) oa
N.B. An OUTER APPLY is somewhat like a LEFT JOIN in that it still returns rows for which the applied subquery finds nothing (with NULL in its columns); if that is not needed, then use CROSS APPLY instead.
In both these approaches you may refer to NextEffectiveDate in a final where clause, but I would prefer to avoid using the convert function if that is feasible (this depends on the data).

multiple extract() with WHERE clause possible?

So far I have come up with the below:
WHERE (extract(month FROM orderdate)) =
(SELECT min(extract(month from orderdate))
FROM orders)
However, that will consequently return zero to many rows, and in my case, many, because many orders exist within that same earliest (minimum) month, i.e. 4th February, 9th February, 15th Feb, ...
I know that a WHERE clause can contain multiple columns, so why wouldn't the below work?
WHERE (extract(day FROM orderdate)), (extract(month FROM orderdate)) =
(SELECT min(extract(day from orderdate)), min(extract(month FROM orderdate))
FROM orders)
I simply get: SQL Error: ORA-00920: invalid relational operator
Any help would be great, thank you!
Sample data:
02-Feb-2012
14-Feb-2012
22-Dec-2012
09-Feb-2013
18-Jul-2013
01-Jan-2014
Output:
02-Feb-2012
14-Feb-2012
Desired output:
02-Feb-2012
I recreated your table and found out you just messed up the brackets a bit. The following works for me:
where
(extract(day from OrderDate),extract(month from OrderDate))
=
(select
min(extract(day from OrderDate)),
min(extract(month from OrderDate))
from orders
)
Use something like this:
with cte1 as (
select
extract(month from OrderDate) date_month,
extract(day from OrderDate) date_day,
OrderNo
from tablename
), cte2 as (
select min(date_month) min_date_month, min(date_day) min_date_day
from cte1
)
select cte1.*
from cte1
where (date_month, date_day) = (select min_date_month, min_date_day from cte2)
A common table expression lets you restructure your data and then select from the result. The first CTE block (cte1) selects the month and the day for each of your table rows. cte2 then selects min(month) and min(day). The last select then combines both CTEs to select all rows from cte1 that have the desired month and day.
There is probably a shorter solution, but I like common table expressions because they are almost always easier to understand than the "optimal, shortest" query.
If that is really what you want, as bizarre as it seems, then as a different approach you could forget the extracts and the subquery against the table to get the minimums, and use an analytic approach instead:
select orderdate
from (
select o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
from orders o
)
where rn = 1;
ORDERDATE
---------
01-JAN-14
The row_number() effectively adds a pseudo-column to every row in your original table, based on the month and day in the order date. The rn values are unique, so there will be one row marked as 1, which will be from the earliest day in the earliest month. If you have multiple orders with the same day/month, say 01-Jan-2013 and 01-Jan-2014, then you'll still only get exactly one with rn = 1, but which is picked is indeterminate. You'd need to add further order by conditions to make it deterministic, but I have no idea what you might want.
That is done in the inner query; the outer query then filters so that only the record marked with rn = 1 is returned, so you get exactly one row back from the overall query.
This also avoids the situation where the earliest day number is not in the earliest month number - say if you only had 02-Jan-2014 and 01-Feb-2014; comparing the day and month separately would look for 01-Jan-2014, which doesn't exist.
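The difference between the two approaches shows up on a two-row data set where the smallest day number and the smallest month number sit on different rows. This is a sketch in Python's sqlite3 rather than Oracle: strftime replaces extract and to_char, dates are ISO strings, and the helper names and rows are made up for the demonstration. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

def _load(dates):
    """Create an in-memory orders table holding the given ISO date strings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (orderdate TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(d,) for d in dates])
    return conn

def match_separate_minimums(dates):
    """Rows whose day equals the minimum day AND whose month equals the minimum month."""
    conn = _load(dates)
    rows = conn.execute(
        """
        SELECT orderdate
          FROM orders
         WHERE strftime('%d', orderdate) = (SELECT MIN(strftime('%d', orderdate)) FROM orders)
           AND strftime('%m', orderdate) = (SELECT MIN(strftime('%m', orderdate)) FROM orders)
        """
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

def earliest_by_month_day(dates):
    """The single row that sorts first on month-then-day, via ROW_NUMBER()."""
    conn = _load(dates)
    rows = conn.execute(
        """
        SELECT orderdate
          FROM (SELECT o.*,
                       ROW_NUMBER() OVER (ORDER BY strftime('%m%d', orderdate)) AS rn
                  FROM orders o)
         WHERE rn = 1
        """
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

dates = ["2014-01-02", "2014-02-01"]
print(match_separate_minimums(dates))  # [] because no row is the 1st of January
print(earliest_by_month_day(dates))    # ['2014-01-02']
```

The two independent minimums describe a date that may not exist in the table, while ordering on the combined month and day always picks an actual row.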
SQL Fiddle (with Thomas Tschernich's answer thrown in too, giving the same result for this data).
To join the result against your invoice table, you don't need to join to the orders table again - especially not with a cross join, which is skewing your results. You can do the join (at least) two ways:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
) o, invoices i
WHERE i.invno = o.invno
AND rn = 1;
Or:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT orderno, orderdate, invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
)
WHERE rn = 1
) o, invoices i
WHERE i.invno = o.invno;
The first looks like it does more work but the execution plans are the same.
SQL Fiddle with your pastebin-supplied query that gets two rows back, and these two that get one.

In Firebird, how to aggregate the first N rows?

I would like to do something like this:
CNT=2;
//[edit]
select avg(price) from (
select first :CNT p.Price
from Price p
order by p.Date desc
);
This does not work, Firebird does not allow :cnt as a parameter to FIRST. I need to average the first CNT newest prices. The number 2 changes so it can not be hard-coded.
This can be broken out into a FOR SELECT loop and break when a count is reached. Is that the best way though? Can this be done in a single SQL statement?
Creating the SQL as a string and running it is not the best fit either. It is important that the database compile my SQL statement.
You don't have to use CTE, you can do it directly:
select avg(price) from (
select first :cnt p.Price
from Price p
order by p.Date desc
);
You can use a CTE (Common Table Expression) (see http://www.firebirdsql.org/refdocs/langrefupd21-select.html#langrefupd21-select-cte) to select the data before calculating the average.
See example below:
with query1 as (
select first 2 p.Price
from Price p
order by p.Date desc
)
select avg(price) from query1
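For comparison, the overall shape (select the newest N rows in a derived table, then average them, with N bound at run time rather than spliced into the SQL text) can be exercised in SQLite through Python's sqlite3 module. This is only a sketch of the idea: SQLite uses LIMIT where Firebird uses FIRST, so it does not show what Firebird accepts, and the average_newest helper and sample rows are hypothetical.

```python
import sqlite3

def average_newest(prices, cnt):
    """Average the `cnt` most recent prices; `cnt` is passed as a bound parameter."""
    conn = sqlite3.connect(":memory:")
    # Table and column names follow the question; the rows are made up.
    conn.execute("CREATE TABLE Price (Price REAL, Date TEXT)")
    conn.executemany("INSERT INTO Price VALUES (?, ?)", prices)
    (avg,) = conn.execute(
        """
        SELECT AVG(Price)
          FROM (SELECT p.Price
                  FROM Price p
                 ORDER BY p.Date DESC
                 LIMIT ?)
        """,
        (cnt,),
    ).fetchone()
    conn.close()
    return avg

sample = [(10.0, "2024-01-01"), (20.0, "2024-01-02"), (40.0, "2024-01-03")]
print(average_newest(sample, 2))  # 30.0, the mean of the two newest prices
```

The statement text never changes between calls, so the database can prepare it once and only the row count varies.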

Help with RANK() Over Scalar Function - SQL Server 2008

I have the following Inline Table-Valued Function:
SELECT Locations.LocationId,
dbo.Search_GetSuitability(@SearchPreferences,
Score.FieldA, Score.FieldB, Score.FieldC) AS OverallSuitabilityScore,
RANK() OVER (ORDER BY OverallSuitabilityScore) AS OverallSuitabilityRank
FROM dbo.LocationsView Locations
INNER JOIN dbo.LocationScores Score ON Locations.LocationId = Score.LocationId
WHERE Locations.CityId = @LocationId
That RANK() line is giving me an error:
Invalid column name 'OverallSuitabilityScore'.
The function dbo.Search_GetSuitability is a scalar-function which returns a DECIMAL(8,5). I need to assign a rank to each row based on that value.
The only way I can get the above to work is to repeat the scalar function call in the ORDER BY part, which is silly. I have about 5 of these scalar function calls and I need separate RANK() values for each.
What can I do? Can I use a Common Table Expression (CTE)?
Yep, you can't reference a column alias elsewhere in the same SELECT list. The CTE sounds good though. Here's an example:
WITH Score as
(
select ls.LocationId, ls.FieldA, ls.FieldB, ls.FieldC,
dbo.Search_GetSuitability(@SearchPreferences,
ls.FieldA, ls.FieldB, ls.FieldC) AS OverallSuitabilityScore
from dbo.LocationScores ls
)
SELECT TOP(10)
Locations.LocationId,
Score.OverallSuitabilityScore,
RANK() OVER (ORDER BY OverallSuitabilityScore) AS OverallSuitabilityRank
FROM dbo.LocationsView Locations
INNER JOIN Score ON Locations.LocationId = Score.LocationId
WHERE Locations.CityId = @LocationId
An old school way of doing this is just to SUBQUERY the expression.
The CTE here only moves the subquery to the top
SELECT TOP(10) LocationId,
OverallSuitabilityScore,
RANK() OVER (ORDER BY OverallSuitabilityScore) AS OverallSuitabilityRank
FROM
(
SELECT
Locations.LocationId,
dbo.Search_GetSuitability(@SearchPreferences,
Score.FieldA, Score.FieldB, Score.FieldC) AS OverallSuitabilityScore
FROM dbo.LocationsView Locations
INNER JOIN dbo.LocationScores Score ON Locations.LocationId = Score.LocationId
WHERE Locations.CityId = @LocationId
) X
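The "compute the score once in an inner query, rank it in the outer query" pattern can be tried without SQL Server. The sketch below uses Python's sqlite3 module and registers a Python function as a scalar SQL function; the suitability function is a hypothetical stand-in (a plain sum) for dbo.Search_GetSuitability, and the table contents are made up. RANK() needs SQLite 3.25 or later.

```python
import sqlite3

def suitability(a, b, c):
    # Hypothetical stand-in for the real scalar function: just adds the fields.
    return a + b + c

def rank_locations(scores):
    """Return (LocationId, score, rank), calling the scalar function once per row."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("suitability", 3, suitability)
    conn.execute(
        "CREATE TABLE LocationScores "
        "(LocationId INTEGER, FieldA REAL, FieldB REAL, FieldC REAL)"
    )
    conn.executemany("INSERT INTO LocationScores VALUES (?, ?, ?, ?)", scores)
    rows = conn.execute(
        """
        SELECT LocationId,
               OverallSuitabilityScore,
               RANK() OVER (ORDER BY OverallSuitabilityScore) AS OverallSuitabilityRank
          FROM (SELECT LocationId,
                       suitability(FieldA, FieldB, FieldC) AS OverallSuitabilityScore
                  FROM LocationScores) X
         ORDER BY OverallSuitabilityRank, LocationId
        """
    ).fetchall()
    conn.close()
    return rows

sample = [(1, 1.0, 2.0, 3.0), (2, 0.5, 0.5, 0.5), (3, 3.0, 2.0, 1.0)]
print(rank_locations(sample))
# [(2, 1.5, 1), (1, 6.0, 2), (3, 6.0, 2)]
```

Locations 1 and 3 tie on score and share rank 2, which is the behaviour that separates RANK() from ROW_NUMBER(). With several scalar functions, each gets its own alias in the inner query and its own RANK() in the outer one.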

Perl prepare DB2 statement not returning what I need

Since I am using DB2, in order to select a portion of a database in the middle (like a limit/offset pairing), I need to do a different kind of prepare statement. The example I was given was this:
SELECT *
FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN FROM table) AS cols
WHERE RN BETWEEN 1 AND 10000;
Which I adapted to this:
SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY 2,3,4,6,7 ASC) AS rownum FROM TRANSACTIONS) AS foo WHERE rownum >= 500 AND rownum <1000
And when I call fetchall_arrayref(), I do come out with 500 results like I want to, but it only returns an array with references to the row number, and not all of the data I want to pull. I know for a fact that that is what the code is SUPPOSED to do as it's written, and I have tried a bunch of permutations to get my desired result with no luck.
All I want is to grab all of the columns like my previous prepare statement into an array of arrays:
SELECT * FROM TU_TRANSACTIONS ORDER BY 2, 3, 4, 6, 7
but just on a designated section. There is just a fundamental thing I am missing, and I just can't see it.
Any help is appreciated, even if it's paired with some constructive criticism.
Your table expression:
(SELECT ROW_NUMBER() OVER (ORDER BY 2,3,4,6,7 ASC) AS rownum FROM TRANSACTIONS) as foo
Has only one column - rownum - so when you select "*" from "foo" you get only the one column.
Your table expression needs to include all of the columns you want, just like the example you posted.
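The effect is easy to reproduce on a toy table. The sketch below uses Python's sqlite3 module rather than Perl DBI and DB2, with a hypothetical two-column TRANSACTIONS table and made-up helper names: the first query selects only the row number inside the derived table, and the second also selects t.* so the outer SELECT * has real columns to return. It needs SQLite 3.25 or later for ROW_NUMBER().

```python
import sqlite3

def _connect(n):
    """In-memory TRANSACTIONS table with n made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TRANSACTIONS (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO TRANSACTIONS VALUES (?, ?)",
        [(i, i * 10.0) for i in range(1, n + 1)],
    )
    return conn

def page_row_numbers_only(n, start, end):
    """Derived table exposes only rn, so SELECT * yields one column."""
    conn = _connect(n)
    rows = conn.execute(
        """
        SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn
                         FROM TRANSACTIONS) AS foo
         WHERE rn >= ? AND rn < ?
         ORDER BY rn
        """,
        (start, end),
    ).fetchall()
    conn.close()
    return rows

def page_all_columns(n, start, end):
    """Derived table exposes every column plus rn, so SELECT * yields full rows."""
    conn = _connect(n)
    rows = conn.execute(
        """
        SELECT * FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.id) AS rn
                         FROM TRANSACTIONS t) AS foo
         WHERE rn >= ? AND rn < ?
         ORDER BY rn
        """,
        (start, end),
    ).fetchall()
    conn.close()
    return rows

print(page_row_numbers_only(5, 2, 4))  # [(2,), (3,)]
print(page_all_columns(5, 2, 4))       # [(2, 20.0, 2), (3, 30.0, 3)]
```

The paging predicate is identical in both queries; only the select list of the derived table decides what the outer query can see.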
I don't use DB2 so I could be off-base but it seems that:
SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY 2,3,4,6,7 ASC) AS rownum FROM TRANSACTIONS) AS foo WHERE rownum >= 500 AND rownum <1000
Would only return the row numbers because while the sub-query references the table the main query does not. All it seems it would see is the set of numbers (which would return a single column with the number filled in)
Perhaps this would work:
SELECT * FROM TRANSACTIONS, (SELECT ROW_NUMBER() OVER (ORDER BY 2,3,4,6,7 ASC) AS rownum FROM TRANSACTIONS) AS foo WHERE rownum >= 500 AND rownum <1000