pentaho distinct count over date - postgresql

I am currently working on Pentaho and I have the following problem:
I want to get a "rooling distinct count on a value, which ignores the "group by" performed by Business Analytics. For instance:
Date Field
2013-01-01 A
2013-02-05 B
2013-02-06 A
2013-02-07 A
2013-03-02 C
2013-04-03 B
When I use a classical "distinct count" aggregator in my schema, sum it, and then add "month" to column, I get:
Month Count Sum
2013-01 1 1
2013-02 2 3
2013-03 1 4
2013-04 1 5
What I would like to get would be:
Month Sum
2013-01 1
2013-02 2
2013-03 3
2013-04 3
which is the distinct count of all Fields so far. Does anyone has any idea on this topic?
my database is in Postgre, and I'm looking for any solution under PDI, PSW, PBA or PME.
Thank you!

A naive approach in PDI is the following:
Sort the rows by the Field column
Add a sequence for changing values in the Field column
Map all sequence values > 1 to zero
These first 3 effectively flag the first time a value was seen (no matter the date).
Sort the rows by year/month
Sum the mapped sequence values by year+month
Get a Cumulative Sum of all the previous sums
These 3 aggregate the distinct values per month, then keep a cumulative sum. In PDI this might look something like:
I posted a Gist of this transformation here.
A more efficient solution is to parallelize the two sorts, then join at the latest point possible. I posted this one as it is easier to explain, but it shouldn't be too difficult to take this transformation and make it more parallel.

Related

Changing a functional qSQL query to involve multiple columns in calculation KDB+/Q

I have a ? exec query like so:
t:([]Quantity: 1 2 3;Price 4 5 6;date:2020.01.01 2020.01.02 2020.01.03);
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Quantity))]
to get me the sum of the Quantity in the given date range. I want to adjust this to get me the sum of the Notional in the date range; Quantity*Price. So the result should be (1x4)+(2x5)=14.
I tried things like the following
?[t;enlist(within;`date;(2020.01.01,2020.01.02));0b;(enlist `Quantity)!enlist (sum;(`Price*`Quantity))]
but couldn't get it to work. Any advice would be greatly appreciated!
I would advise in such a scenario to think about the qSql style query that you are looking for and then work from there.
So in this case you are looking, I believe, to do something like:
select sum Quantity*Price from t where date within 2020.01.01 2020.01.02
You can then run parse on this to break it into its function form i.e the ? exec query you refer to.
q)parse"select sum Quantity*Price from t where date within 2020.01.01 2020.01.02"
?
`t
,,(within;`date;2020.01.01 2020.01.02)
0b
(,`Quantity)!,(sum;(*;`Quantity;`Price))
This is your functional form that you need; table, where clause, by and aggregation.
You can see your quantity here is just the sum of the multiplication of the two columns.
q)?[t;enlist(within;`date;(2020.01.01;2020.01.02));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]
Quantity
--------
14
You could also extend this to change the column as necessary and create a function for it too, if you so wish:
q)calcNtnl:{[sd;ed] ?[t;enlist(within;`date;(sd;ed));0b;enlist[`Quantity]!enlist(sum;(*;`Quantity;`Price))]}
q)calcNtnl[2020.01.01;2020.01.02]
Quantity
--------
14

How to dynamically pivot based on rows data and parameter value?

I am trying to pivot using crosstab function and unable to achieve for the requirement. Is there is a way to perform crosstab dynamically and also dynamic result set?
I have tried using crosstab built-in function and unable to meet my requirement.
select * from crosstab ('select item,cd, type, parts, part, cnt
from item
order by 1,2')
AS results (item text,cd text, SUM NUMERIC, AVG NUMERIC);
Sample Data:
ITEM CD TYPE PARTS PART CNT
Item 1 A AVG 4 1 10
Item 1 B AVG 4 2 20
Item 1 C AVG 4 3 30
Item 1 D AVG 4 4 40
Item 1 A SUM 4 1 10
Item 1 B SUM 4 2 20
Item 1 C SUM 4 3 30
Item 1 D SUM 4 4 40
Expected Results:
ITEM CD PARTS TYPE_1 CNT_1 TYPE_1 CNT_1 TYPE_2 CNT_2 TYPE_2 CNT_2 TYPE_3 CNT_3 TYPE_3 CNT_3 TYPE_4 CNT_4 TYPE_4 CNT_4
Item 1 A 4 AVG 10 SUM 10 AVG 20 SUM 20 AVG 30 SUM 30 AVG 40 SUM 40
The PARTS value is based on a parameter passed by the user. If the user passes 2 for example, there will be 4 rows in the result set (2 parts for AVG and 2 parts of SUM).
Can I achieve this requirement using CROSSTAB function or is there a custom SQL statement that need to be developed?
I'm not following your data, so I can't offer examples based on it. But I have been looking at pivot/cross-tab features over the past few days. I was just looking at dynamic cross tabs just before seeing your post. I'm hoping that your question gets some good answers, I'll start off with a bit of background.
You can use the crosstab extension for standard cross tabs, what when wrong when you tried it? Here's an example I wrote for myself the other day with a bunch of comments and aliases for clarity. The pivot is looking at item scans to see where the scans were "to", like the warehouse or the floor.
/* Basic cross-tab example for crosstab (text) format of pivot command.
Notice that the embedded query has to return three columns, see the aliases.
#1 is the row label, it shows up in the output.
#2 is the category, what determines how many columns there are. *You have to work this out in advance to declare them in the return.*
#3 is the cell data, what goes in the cross tabs. Note that this form of the crosstab command may return NULL, and coalesce does not work.
To get rid of the null count/sums/whatever, you need crosstab (text, text).
*/
select *
from crosstab ('select
specialty_name as row_label,
scanned_to as column_splitter,
count(num_inst)::numeric as cell_data
from scan_table
group by 1,2
order by 1,2')
as scan_pivot (
row_label citext,
"Assembly" numeric,
"Warehouse" numeric,
"Floor" numeric,
"QA" numeric);
As a manual alternative, you can use a series of FILTER statements. Here's an example that summaries errors_log records by day of the week. The "down" is the error name, the "across" (columns) are the days of the week.
select "error_name",
count(*) as "Overall",
count(*) filter (where extract(dow from "updated_dts") = 0) as "Sun",
count(*) filter (where extract(dow from "updated_dts") = 1) as "Mon",
count(*) filter (where extract(dow from "updated_dts") = 2) as "Tue",
count(*) filter (where extract(dow from "updated_dts") = 3) as "Wed",
count(*) filter (where extract(dow from "updated_dts") = 4) as "Thu",
count(*) filter (where extract(dow from "updated_dts") = 5) as "Fri",
count(*) filter (where extract(dow from "updated_dts") = 6) as "Sat"
from error_log
where "error_name" is not null
group by "error_name"
order by 1;
You can do the same thing with CASE, but FILTER is easier to write.
It looks like you want something basic, maybe the FILTER solution appeals? It's easier to read than calls to crosstab(), since that was giving you trouble.
FILTER may be slower than crosstab. Probably. (The crosstab extension is written in C, and I'm not sure how smart FILTER is about reading off indexes.) But I'm not sure as I haven't tested it out yet. (It's on my to do list, but I haven't had time yet.) I'd be super interested if anyone can offer results. We're on 11.4.
I wrote a client-side tool to build FILTER-based pivots over the past few days. You have to supply the down and across fields, an aggregate formula and the tool spits out the SQL. With support for coalesce for folks who don't want NULL, ROLLUP, TABLESAMPLE, view creation, and some other stuff. It was a fun project. Why go to that effort? (Apart from the fun part.) Because I haven't found a way to do dynamic pivots that I actually understand. I love this quote:
"Dynamic crosstab queries in Postgres has been asked many times on SO all involving advanced level functions/types. Consider building your needed query in application layer (Java, Python, PHP, etc.) and pass it in a Postgres connected query call. Recall SQL is a special-purpose, declarative type while app layers are general-purpose, imperative types." – Parfait
So, I wrote a tool to pre-calculate and declare the output columns. But I'm still curious about dynamic options in SQL. If that's of interest to you, have a look at these two items:
https://postgresql.verite.pro/blog/2018/06/19/crosstab-pivot.html
Flatten aggregated key/value pairs from a JSONB field?
Deep magic in both.

How to sum a calculated a field on top of another calculated field?

The current issue I have maybe a bit difficult to describe but I will do my best.
Currently, in my workbook, I am experiencing duplicates at the product level so I created a calculated field to work around that. I wanted to know which individual products have not been quoted the past year so the answer should be 1 for not quoted and 0 for quoted, I worked around that by doing an if statement with % products not quoted the formula looks as such:
IF [% Not Quoted] > 0
then 1
else 0
end
which I named Prod Not Quoted. That worked great for me,
however now I want do the count or sum(?) of products not quoted at the vendor level which would mean my products not quoted needs to be grouped by the vendor name. To be specific my objective is to provide a table of all products not quoted by vendor name without duplicates at the product level. What I tried to do is create a new calculated field using the previously which I calculated as the following:
IF [Prod Not Quoted] = 1
then sum(1)
else 0
end
The latter calculation, however, gave me the sum of all products quoted and not quoted along with the duplicates I am trying to avoid. Why would IF [prod Not Quoted] = 1 then sum(1) else 0 end not work?
Someone on tableau forum with the following suggestion
SUM (IF [Prod Not Quoted] = 1 THEN 1 ELSE 0 END)
however I got the error cannot sum something that is not already aggregated.
Is there any kind of work around to the issue I am having?
If you have duplicates at the dimension level you can use the FIXED calculation to select just a single row at the dimension level. For example:
{FIXED [Product]: MAX(quoted)}
What this is doing is saying "for each Product, what is the max quoted value?" So if you have duplicates, it'll return only one value
Now we are solving the duplicates problem we can wrap it in SUM:
SUM({FIXED [Product]: MAX(quoted)})
To answer your question about why the sum isn't working, it's because you're trying to do a sum of a sum. If you still wanted to take the approach that was offered to you then you could change it to:
IF [Prod Not Quoted] = 1
then {FIXED [Product]: MAX(quoted)}
end
(I called the aggregation field quoted, but it'll be the measure you're working with)

How to pick items from warehouse to minimise travel in TSQL?

I am looking at this problem from a TSQL point of view, however any advice would be appreciated.
Scenario
I have 2 sets of criteria which identify items in a warehouse to be selected.
Query 1 returns 100 items
Query 2 returns 100 items
I need to pick any 25 of the 100 items returned in query 1.
I need to pick any 25 of the 100 items returned in query 2.
- The items in query 1/2 will not be the same, ever.
Each item is stored in a segment of the warehouse.
A segment of the warehouse may contain numerous items.
I wish to select the 50 items (25 from each query) in a way as to reduce the number of segments I must visit to select the items.
Suggested Approach
My initial idea has been to combined the 2 result sets and produce a list of
Segment ID, NumberOfItemsRequiredInSegment
I would then select 25 items from each query, giving preference to those in a segments with the most NumberOfItemsRequiredInSegment. I know this would not be optimal but would be an easy to implement heuristic.
Questions
1) I suspect this is a standard combinational problem, but I don't recognise it.. perhaps multiple knapsack, does anyone recognise it?
2) Is there a better (easy-ish to impliment) heuristic or solution - ideally in TSQL?
Many thanks.
This might also not be optimal but i think would at least perform fairly well.
Calculate this set for query 1.
Segment ID, NumberOfItemsRequiredInSegment
take the top 25, Just by sorting by NumberOfItemsRequiredInSegment. call this subset A.
take the top 25 from query 2, by joining to A and sorting by "case when A.segmentID is not null then 1 else 0, NumberOfItemsRequiredInSegmentFromQuery2".
repeat this but take the top 25 from query 2 first. return the better performing of the 2 sets.
The one scenario where i think this fails would be if you got something like this.
Segment Count Query 1 Count Query 2
A 10 1
B 5 1
C 5 1
D 5 4
E 5 4
F 4 4
G 4 5
H 1 5
J 1 5
K 1 10
You need to make sure you choose A, D, E, from when choosing the best segments from query 1. To deal with this you'd almost still need to join to query two, so you can get the count from there to use as a tie breaker.

SQL Sum and Group By for a running Tally?

I'm completely rewriting my question to simplify it. Sorry if you read the prior version. (The previous version of this question included a very complex query example that created a distraction from what I really need.) I'm using SQL Express.
I have a table of lessons.
LessonID StudentID StudentName LengthInMinutes
1 1 Chuck 120
2 2 George 60
3 2 George 30
4 1 Chuck 60
5 1 Chuck 10
These would be ordered by date. (Of course the actual table is thousands of records with dates and other lesson-related data but this is a simplification.)
I need to query this table such that I get all rows (or a subset of rows by a date range or by student), but I need my query to add a new column we might call PriorLessonMinutes. That is, the sum of all minutes of all lessons for the same student in lessons of PRIOR dates only.
So the query would return:
LessonID StudentID StudentName LengthInMinutes PriorLessonMinutes
1 1 Chuck 120 0
2 2 George 60 0
3 2 George 30 60 (The sum Length from row 2 only)
4 1 Chuck 60 120 (The sum Length from row 1 only)
5 1 Chuck 10 180 (The sum of Length from rows 1 and 4)
In essence, I need a running tally of the sum of prior lesson minutes for each student. Ideally the tally shouldn't include the current row, but if it does, no big deal as I can do subtraction in the code that receives the query.
Further, (and this is important) if I retrieve only a subset of records, (for example by a date range) PriorLessonMinutes must be a sum that considers rows that are NOT returned.
My first idea was to use SUM() and to GROUP BY Student, but that isn't right because unless I'm mistaken it would include a sum of minutes for all rows for each student, including rows that come after the row which aren't relevant to the sum I need.
OPTIONS I'M REJECTING: I could scan through all rows in my code that receives it, (although this would force me to retrieve all rows unnecessarily) but that's obviously inefficient. I could also put a real data field in there and populate it, but this too presents problems when other records are deleted or altered.
I have no idea how to write such a query together. Any guidance?
This is a great opportunity to use Windowed Aggregates. The trick is that you need SQL Server 2012 Express. If you can get it, then this is the query you are looking for:
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
Note that it returns NULLs instead of 0s (zeroes). If you insist on zeroes, use COALESCE function to turn NULLs into zeroes.
I suggest using a nested query to limit the number of rows returned:
select * from
(
select *,
sum(LengthInMinutes)
over (partition by StudentId order by LessonId
rows between unbounded preceding and 1 preceding)
as PriorLessonMinutes
from Lessons
) as NestedLessons
where LessonId > 3 -- this is an example of a filter
This way the filter is applied after the aggregation is complete.
Now, if you want to apply a filter that doesn't affect the aggregation (like only querying data for a certain student), you should apply the filter to the inner query, as pruning the rows that don't affect the computation early (like data for other students) will improve the performance.
I feel the following code will serve your purpose.Check it:-
select Students.StudentID ,Students.First, Students.Last,sum(Lessons.LengthInMinutes)
as TotalPriorMinutes from lessons,students
where Lessons.StartDateTime < getdate()
and Lessons.StudentID = Students.StudentID
and StartDateTime >= '20090130 00:00:00' and StartDateTime < '20790101 00:00:00'
group by Students.StudentID ,Students.First, Students.Last